OpenAI API rate limits exist at two levels: requests per minute (RPM) and tokens per minute (TPM). New API accounts start at Tier 1 with very low limits. The 429 error tells you exactly which limit you hit and by how much; reading the error message carefully is the fastest path to the right fix. Verified April 2026.
HTTP 429 Too Many Requests response from api.openai.com
Error: 'RateLimitError: Rate limit reached for gpt-4o in organization ORG-XXXXX on tokens per min. Limit: 30,000, Used: 30,000, Requested: 1,500. Please try again in 3s.'
Error: 'RateLimitError: Rate limit reached for requests. Limit: 500/1min. Current: 501. Please try again in 120ms.'
Burst of API calls succeeds then fails with 429 mid-batch
Streaming requests cut off with a 429 mid-generation
429 errors only on certain models (gpt-4o) while gpt-3.5-turbo succeeds
The most common rate limit in production. TPM counts input + output tokens across all concurrent requests within a rolling 1-minute window. New accounts (Tier 1) have TPM limits as low as 30,000 for GPT-4o; a single large prompt can use half the minute's budget. The error message will say 'tokens per min' and show your exact usage vs. limit.
RPM limits how many API calls you can make per minute, regardless of token count. Tier 1 starts at 500 RPM for most models. Parallel processing loops, batch jobs, and retry storms are the most common RPM triggers. The error message will say 'requests' and show RPM values.
GPT-4o, o1, o3, and GPT-4 Turbo have much stricter limits than GPT-3.5 Turbo or GPT-4o mini, often 5-10x lower TPM at the same usage tier. This is why switching models (or mixing models for different task types) is the fastest workaround when hitting limits.
Using asyncio.gather() or Promise.all() to fire hundreds of API calls at once will hit both RPM and TPM limits instantly. Without a semaphore or rate limiter in your code, any batch job that scales past a few dozen items will generate 429 errors regardless of your usage tier.
OpenAI's 429 responses include a Retry-After header with the exact number of seconds to wait. Applications that don't implement exponential backoff or check Retry-After fail immediately and noisily instead of recovering gracefully. Automatic retries only arrived in openai>=1.0.0 for the Python SDK (version 4.x for the Node.js SDK); many older codebases lack retry logic entirely.
OpenAI's API usage tiers start at Tier 1 (requires $5 lifetime spend) with deliberately low limits. Tier 2 starts at $50 lifetime spend. If you've just created an API account or are using a fresh key, your limits are very low by design. Check platform.openai.com/settings/organization/limits to see your current tier.
When to try: First; it takes 10 seconds and tells you exactly which limit you hit
The 429 response body contains the limit type (tokens per min vs. requests per min), your current usage, your limit, and how long to wait. Example: 'Rate limit reached for gpt-4o on tokens per min. Limit: 30000, Used: 30000, Requested: 1500. Please try again in 3s.' The Retry-After response header also gives the exact seconds to wait. Don't guess: parse the error and act on the specific limit you hit.
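A minimal sketch of acting on the error programmatically, assuming the openai Python SDK >= 1.0.0, where RateLimitError carries the underlying HTTP response:

    import openai
    from openai import OpenAI

    client = OpenAI()

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Hello"}],
        )
    except openai.RateLimitError as e:
        # The message states whether you hit 'tokens per min' or 'requests per min'
        limit_type = "TPM" if "tokens" in str(e) else "RPM"
        # Retry-After gives the exact seconds to wait before retrying
        retry_after = e.response.headers.get("retry-after")
        print(f"Hit {limit_type} limit; safe to retry after {retry_after}s")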
When to try: If you don't already have retry logic; this is the single highest-ROI fix
Add retry logic with exponential backoff to your API calls. Python example using the official SDK: 'from openai import OpenAI; client = OpenAI(max_retries=5)'; the SDK's built-in retry handles 429s automatically in openai>=1.0.0. For manual implementation, wait 1s, 2s, 4s, 8s, 16s between retries, plus random jitter (0-1s) to prevent thundering herd. Always cap at 5 retries and check the Retry-After header for minimum wait time.
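For codebases on older SDKs or custom HTTP clients where the built-in retry isn't available, a decorator-based sketch using the tenacity library covers the same ground (the complete function name is illustrative):

    import openai
    from openai import OpenAI
    from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

    client = OpenAI()

    # Retry only on 429s: exponential backoff with jitter, 1-60s waits, 5 attempts max
    @retry(
        retry=retry_if_exception_type(openai.RateLimitError),
        wait=wait_random_exponential(min=1, max=60),
        stop=stop_after_attempt(5),
    )
    def complete(prompt: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content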
When to try: For batch processing or parallel API call patterns
Wrap your async API calls with a concurrency cap. Python asyncio example: 'semaphore = asyncio.Semaphore(10); async with semaphore: response = await client.chat.completions.create(...)'. Setting max concurrent requests to 10-20 for Tier 1/2 and 50-100 for higher tiers prevents instantaneous RPM spikes. For Node.js, use p-limit: 'const limit = pLimit(10)'. This is the fastest fix for batch-processing jobs that hit rate limits.
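A runnable sketch of that pattern using the AsyncOpenAI client from openai>=1.0.0 (the 100-prompt batch is illustrative):

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI()
    semaphore = asyncio.Semaphore(10)  # cap for Tier 1/2; raise as your tier allows

    async def complete(prompt: str) -> str:
        async with semaphore:  # at most 10 requests in flight at any moment
            response = await client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content

    async def main():
        prompts = [f"Summarize item {i}" for i in range(100)]
        results = await asyncio.gather(*(complete(p) for p in prompts))
        print(f"{len(results)} requests completed")

    asyncio.run(main())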
When to try: When TPM limits are the bottleneck and task quality allows a smaller model
GPT-4o mini and GPT-3.5 Turbo have much higher TPM limits than GPT-4o at the same tier. For tasks where GPT-4o is overkill (classification, summarization, simple Q&A), use GPT-4o mini: same API call, replace the model name string, get 5-10x more capacity. Check platform.openai.com/settings/organization/limits for per-model limit breakdowns at your tier.
When to try: When large documents or long conversations are triggering TPM limits
Use the tiktoken library (Python: pip install tiktoken) to count tokens before building API requests. For GPT-4o: 'import tiktoken; enc = tiktoken.encoding_for_model("gpt-4o"); token_count = len(enc.encode(your_text))'. If a request exceeds 20% of your TPM limit, split the input into chunks. This prevents single large requests from consuming your entire minute budget.
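A sketch of that chunking step; the naive token-slice boundaries are for illustration, since real documents split better on paragraph or sentence breaks:

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4o")

    TPM_LIMIT = 30_000          # Tier 1 limit for GPT-4o
    MAX_CHUNK = TPM_LIMIT // 5  # stay under 20% of the minute budget per request

    def chunk_text(text: str) -> list[str]:
        tokens = enc.encode(text)
        # Slice the token list into budget-sized pieces, then decode back to text
        return [enc.decode(tokens[i:i + MAX_CHUNK])
                for i in range(0, len(tokens), MAX_CHUNK)]

    chunks = chunk_text(open("large_document.txt").read())
    print(f"{len(chunks)} chunks, each under {MAX_CHUNK} tokens")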
When to try: If you've implemented backoff and semaphores but still hit limits regularly
Log into platform.openai.com → Settings → Organization → Limits. Your tier is shown here, along with exact limits per model. Tier progression: Tier 1 ($5 spend, 30K TPM on GPT-4o), Tier 2 ($50, 450K TPM), Tier 3 ($100, 800K TPM), Tier 4 ($250, 2M TPM), Tier 5 ($1000, 30M TPM). Tier upgrades happen automatically after reaching each spend threshold; you can also contact OpenAI sales for higher limits if you're building at scale.
When to try: For production services handling multiple users
For production systems, add a proper rate limiter library. Python: use 'ratelimit' (pip install ratelimit) or 'tenacity'. Node.js: use 'bottleneck' (npm install bottleneck). Example with bottleneck: 'const limiter = new Bottleneck({ minTime: 100, maxConcurrent: 10 })'. Configure the limiter to stay under 80% of your limits to avoid hitting the ceiling. Monitor your usage with the x-ratelimit-remaining-tokens header in each API response.
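If you'd rather not add a dependency, a token bucket is small enough to write yourself; a minimal single-threaded sketch, with rate=6 approximating 80% of a 500 RPM Tier 1 limit:

    import time

    class TokenBucket:
        """Allow up to `rate` calls per second, smoothing out bursts."""

        def __init__(self, rate: float, capacity: float):
            self.rate = rate          # tokens replenished per second
            self.capacity = capacity  # maximum burst size
            self.tokens = capacity
            self.last = time.monotonic()

        def acquire(self):
            now = time.monotonic()
            # Refill in proportion to elapsed time, capped at capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens < 1:
                time.sleep((1 - self.tokens) / self.rate)  # wait for one token to accrue
                self.last = time.monotonic()
                self.tokens = 0  # the token that accrued during the sleep is consumed
            else:
                self.tokens -= 1

    bucket = TokenBucket(rate=6, capacity=6)  # ~80% of 500 RPM
    bucket.acquire()  # call before each API request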
When to try: For user-facing applications where long waits are a problem
Streaming responses (stream=True in the Python SDK) don't reduce TPM or RPM consumption, but they deliver partial output immediately, which improves UX during high-load periods by showing results as they arrive instead of timing out. Switch to: 'client.chat.completions.create(model="gpt-4o", messages=[...], stream=True)' and iterate the streamed chunks.
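A minimal streaming loop under openai>=1.0.0 (some chunks carry no content delta, hence the guard):

    from openai import OpenAI

    client = OpenAI()

    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Explain rate limits in one paragraph"}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks (e.g., the final one) have no content in the delta
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)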
Always instrument your code to record the x-ratelimit-remaining-tokens and x-ratelimit-remaining-requests headers from every response, and build a dashboard to monitor headroom (a sketch follows this list)
Set usage limits in platform.openai.com → Settings → Billing to cap monthly spend and prevent runaway costs from retry storms
Design with usage tiers in mind: Tier 1 limits are deliberately low; plan your architecture for what Tier 2/3 allows
Use model routing: GPT-4o for complex tasks, GPT-4o mini or GPT-3.5 Turbo for high-volume simple tasks
Add alerting when your error rate for 429s exceeds 1%; it's a leading indicator of approaching your tier ceiling
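For the header instrumentation above, the openai Python SDK (>=1.0.0) exposes raw response headers through with_raw_response; a minimal sketch:

    from openai import OpenAI

    client = OpenAI()

    raw = client.chat.completions.with_raw_response.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "ping"}],
    )
    # Remaining headroom in the current window, straight from the response headers
    print("tokens left:  ", raw.headers.get("x-ratelimit-remaining-tokens"))
    print("requests left:", raw.headers.get("x-ratelimit-remaining-requests"))

    completion = raw.parse()  # the usual ChatCompletion object
    print(completion.choices[0].message.content)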
Contact OpenAI support at platform.openai.com/support if: (1) You've hit Tier 5 limits and need enterprise-level capacity above the self-serve ceiling, (2) Your account tier doesn't match your spend history (e.g., you've spent $500 but still see Tier 2 limits), (3) You need higher limits before reaching the spend threshold for your tier (OpenAI sometimes grants early tier increases to startups and enterprise customers). For general rate limit questions, the self-serve tier upgrade system handles 99% of cases.
A 429 response from OpenAI includes a JSON body with the error type, message, code, and param. Example: {"error": {"message": "Rate limit reached for gpt-4o in organization org-XXXXXX on tokens per min (TPM): Limit 30000, Used 29800, Requested 1500. Please try again in 3s. Contact us if you need to raise your limit.", "type": "tokens", "param": null, "code": "rate_limit_exceeded"}}. The response headers also include x-ratelimit-limit-tokens, x-ratelimit-remaining-tokens, x-ratelimit-reset-tokens, and Retry-After; use these for precise backoff timing.
TPM (tokens per minute) is a budget of total tokens consumed per minute across all API calls β input tokens + output tokens combined. RPM (requests per minute) is simply the count of API calls, regardless of size. You can hit TPM with a few large requests or RPM with many small requests. The error message tells you which one you hit. At Tier 1 with GPT-4o: 30,000 TPM and 500 RPM. Both reset on a rolling 1-minute window.
Go to platform.openai.com → Settings → Organization → Limits. This page shows: your current usage tier (1-5), your lifetime API spend, the spend threshold for the next tier, and per-model TPM/RPM limits. The tier upgrades automatically when you cross the spend threshold; there's no manual request needed for Tiers 1-5. For limits above Tier 5, contact OpenAI sales.
In openai>=1.0.0 (released late 2023), the SDK retries 429 and 5xx errors automatically by default with exponential backoff. The default max_retries is 2. To increase it: 'client = OpenAI(max_retries=5)'. For older SDK versions (<1.0.0) or the legacy openai.ChatCompletion interface, there is no automatic retry β you need to add your own with tenacity or a similar library.
The most common reason: your requests use many tokens each, and you're hitting the TPM limit (not RPM). A single GPT-4o call with a 10,000-token prompt plus 5,000 tokens of output uses 15,000 tokens, half of a Tier 1 TPM budget in one request. To debug: log the usage field in each response (response.usage.total_tokens) and sum it over a minute. That number vs. your TPM limit explains the 429.
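A sketch of that debugging step: keep a rolling one-minute window of each response's usage field and compare it to your TPM limit (30,000 assumes Tier 1 GPT-4o; record_usage is a hypothetical helper):

    import time
    from collections import deque

    TPM_LIMIT = 30_000  # Tier 1 GPT-4o
    window = deque()    # (timestamp, total_tokens) pairs from recent responses

    def record_usage(response):
        now = time.monotonic()
        window.append((now, response.usage.total_tokens))
        while window and now - window[0][0] > 60:  # drop entries older than 60s
            window.popleft()
        used = sum(tokens for _, tokens in window)
        print(f"last-minute usage: {used}/{TPM_LIMIT} tokens")

    # Call record_usage(response) after every client.chat.completions.create(...)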
Add a semaphore to cap concurrency. In Python: 'asyncio.Semaphore(10)' wrapping each API call. Start with 10 concurrent requests for Tier 1, scale up as you verify you're not hitting limits. This single change typically reduces 429 errors by 90% in batch jobs because it prevents the instantaneous burst that consumes your entire RPM budget in the first second. Combine with 'max_retries=5' in the OpenAI client for full coverage.
Rate limits are enforced at the organization level, not per API key. If you have five API keys in one organization, all five share the same TPM and RPM limits combined. Creating additional API keys does not give you additional rate limit capacity. To get more capacity, either upgrade your usage tier through the organization's billing or create a separate OpenAI organization with its own billing.
GPT-4o has significantly lower TPM limits than GPT-3.5 Turbo at the same tier. At Tier 1: GPT-4o is limited to 30,000 TPM, while GPT-3.5 Turbo allows 200,000 TPM. If you need to process high volumes, use model routing: GPT-4o for tasks requiring advanced reasoning, GPT-4o mini or GPT-3.5 Turbo for high-volume simpler tasks. This alone often solves rate limiting for mixed workloads.
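A model-routing sketch; the task-type labels and routing table are illustrative application code, not an OpenAI API feature:

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical routing table: cheap, high-TPM models for high-volume simple work
    ROUTES = {
        "classify": "gpt-4o-mini",
        "summarize": "gpt-4o-mini",
        "reason": "gpt-4o",
    }

    def complete(task_type: str, prompt: str) -> str:
        model = ROUTES.get(task_type, "gpt-4o-mini")  # default to the high-capacity model
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content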
Use tiktoken to count tokens: sum up (input_tokens + expected_output_tokens) for all items in your batch, then divide by the number of minutes you want the batch to take to get the required TPM. Compare to your limit at platform.openai.com/settings/organization/limits. For RPM: total items divided by the same number of minutes. If either number exceeds your limit, your batch will 429. Solution: add a per-second rate limiter so throughput stays under 80% of your limits; the 20% buffer absorbs measurement imprecision.
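That estimate as a back-of-the-envelope sketch (the 500-token output guess and 10-minute target are assumptions to adjust for your workload):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4o")

    items = ["...document 1...", "...document 2..."]  # your batch inputs
    EXPECTED_OUTPUT_TOKENS = 500                      # rough guess per item
    TARGET_MINUTES = 10                               # how fast the batch should finish

    total_tokens = sum(len(enc.encode(item)) + EXPECTED_OUTPUT_TOKENS for item in items)
    required_tpm = total_tokens / TARGET_MINUTES
    required_rpm = len(items) / TARGET_MINUTES

    # Compare against your tier's limits, leaving the 20% safety buffer
    print(f"need ~{required_tpm:.0f} TPM and ~{required_rpm:.1f} RPM")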
Recommended pattern: wait = min(base * (2 ** attempt) + random.uniform(0, 1), max_wait) where base=1s, max_wait=60s, and attempt starts at 0. Always also check the Retry-After header in the 429 response and use max(calculated_wait, retry_after_seconds) so you don't retry before OpenAI says it's safe. Cap at 5 retries total. Log each retry with the wait time for observability. The openai SDK's built-in retry implements a similar pattern when you set max_retries.
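That pattern as a runnable sketch, assuming openai>=1.0.0 (call_api stands in for your actual request function):

    import random
    import time
    import openai

    def with_backoff(call_api, max_retries=5, base=1.0, max_wait=60.0):
        for attempt in range(max_retries):
            try:
                return call_api()
            except openai.RateLimitError as e:
                calculated = min(base * (2 ** attempt) + random.uniform(0, 1), max_wait)
                # Never retry sooner than the server's Retry-After header allows
                retry_after = float(e.response.headers.get("retry-after", 0))
                wait = max(calculated, retry_after)
                print(f"429 on attempt {attempt + 1}, retrying in {wait:.1f}s")
                time.sleep(wait)
        raise RuntimeError("still rate limited after max retries")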
Not instantly, but through spend thresholds. OpenAI's tier system upgrades your limits automatically as you accumulate API spend: Tier 2 at $50, Tier 3 at $100, Tier 4 at $250, Tier 5 at $1,000. You can accelerate tier upgrades by adding API credits (prepaying). For limits beyond Tier 5 or for custom limits at any tier, contact OpenAI sales through platform.openai.com. There's no immediate manual upgrade path that bypasses the spend requirements for standard accounts.
LangChain, LlamaIndex, and similar frameworks wrap the OpenAI SDK and surface the same RateLimitError. LangChain has built-in retry: 'from langchain_openai import ChatOpenAI; llm = ChatOpenAI(max_retries=3)' (older versions import ChatOpenAI from langchain.chat_models). Check the framework's documentation for its specific retry configuration. If the framework doesn't retry, catch openai.RateLimitError exceptions in your code and implement backoff at the application level. Verified April 2026.