OpenAI API rate limits exist at two levels: requests per minute (RPM) and tokens per minute (TPM). New API accounts start at Tier 1 with very low limits. The 429 error tells you exactly which limit you hit and by how much; reading the error message carefully is the fastest path to the right fix. Verified April 2026.
HTTP 429 Too Many Requests response from api.openai.com
Error: 'RateLimitError: Rate limit reached for gpt-4o in organization ORG-XXXXX on tokens per min. Limit: 30,000, Used: 30,000, Requested: 1,500. Please try again in 3s.'
Error: 'RateLimitError: Rate limit reached for requests. Limit: 500/1min. Current: 501. Please try again in 120ms.'
Burst of API calls succeeds then fails with 429 mid-batch
Streaming requests cut off with a 429 mid-generation
429 errors only on certain models (gpt-4o) while gpt-3.5-turbo succeeds
The most common rate limit in production. TPM counts input + output tokens across all concurrent requests within a rolling 1-minute window. New accounts (Tier 1) have TPM limits as low as 30,000 for GPT-4o; a single large prompt can use half the minute's budget. The error message will say 'tokens per min' and show your exact usage vs. limit.
RPM limits how many API calls you can make per minute, regardless of token count. Tier 1 starts at 500 RPM for most models. Parallel processing loops, batch jobs, and retry storms are the most common RPM triggers. The error message will say 'requests' and show RPM values.
GPT-4o, o1, o3, and GPT-4 Turbo have much stricter limits than GPT-3.5 Turbo or GPT-4o mini, often 5-10x lower TPM at the same usage tier. This is why switching models (or mixing models for different task types) is the fastest workaround when hitting limits.
Using asyncio.gather() or Promise.all() to fire hundreds of API calls at once will hit both RPM and TPM limits instantly. Without a semaphore or rate limiter in your code, any batch job that scales past a few dozen items will generate 429 errors regardless of your usage tier.
OpenAI's 429 responses include a Retry-After header with the exact number of seconds to wait. Applications that don't implement exponential backoff or check Retry-After fail immediately and noisily instead of recovering gracefully. Automatic retries only arrived in openai>=1.0.0 for the Python SDK (version 4.x for the Node.js SDK); many older codebases lack retry logic entirely.
OpenAI's API usage tiers start at Tier 1 (requires $5 lifetime spend) with deliberately low limits. Tier 2 starts at $50 lifetime spend. If you've just created an API account or are using a fresh key, your limits are very low by design. Check platform.openai.com/settings/organization/limits to see your current tier.
When to try: First; it takes 10 seconds and tells you exactly which limit you hit
The 429 response body contains the limit type (tokens per min vs. requests per min), your current usage, your limit, and how long to wait. Example: 'Rate limit reached for gpt-4o on tokens per min. Limit: 30000, Used: 30000, Requested: 1500. Please try again in 3s.' The Retry-After response header also gives the exact seconds to wait. Don't guess: parse the error and act on the specific limit you hit.
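A minimal sketch of acting on the error programmatically, assuming the openai Python SDK >= 1.0.0, where RateLimitError carries the underlying HTTP response:

    import openai
    from openai import OpenAI

    client = OpenAI()

    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": "Hello"}],
        )
    except openai.RateLimitError as e:
        # The message states whether you hit 'tokens per min' or 'requests per min'
        limit_type = "TPM" if "tokens" in str(e) else "RPM"
        # Retry-After gives the exact seconds to wait before retrying
        retry_after = e.response.headers.get("retry-after")
        print(f"Hit {limit_type} limit; safe to retry after {retry_after}s")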
When to try: If you don't already have retry logic; this is the single highest-ROI fix
Add retry logic with exponential backoff to your API calls. Python example using the official SDK: 'from openai import OpenAI; client = OpenAI(max_retries=5)'; the SDK's built-in retry handles 429s automatically in openai>=1.0.0. For manual implementation, wait 1s, 2s, 4s, 8s, 16s between retries, plus random jitter (0-1s) to prevent thundering herd. Always cap at 5 retries and check the Retry-After header for minimum wait time.
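For codebases on older SDKs or custom HTTP clients where the built-in retry isn't available, a decorator-based sketch using the tenacity library covers the same ground (the complete function name is illustrative):

    import openai
    from openai import OpenAI
    from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_random_exponential

    client = OpenAI()

    # Retry only on 429s: exponential backoff with jitter, 1-60s waits, 5 attempts max
    @retry(
        retry=retry_if_exception_type(openai.RateLimitError),
        wait=wait_random_exponential(min=1, max=60),
        stop=stop_after_attempt(5),
    )
    def complete(prompt: str) -> str:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content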
When to try: For batch processing or parallel API call patterns
Wrap your async API calls with a concurrency cap. Python asyncio example: 'semaphore = asyncio.Semaphore(10); async with semaphore: response = await client.chat.completions.create(...)'. Setting max concurrent requests to 10-20 for Tier 1/2 and 50-100 for higher tiers prevents instantaneous RPM spikes. For Node.js, use p-limit: 'const limit = pLimit(10)'. This is the fastest fix for batch-processing jobs that hit rate limits.
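A runnable sketch of that pattern using the AsyncOpenAI client from openai>=1.0.0 (the 100-prompt batch is illustrative):

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI()
    semaphore = asyncio.Semaphore(10)  # cap for Tier 1/2; raise as your tier allows

    async def complete(prompt: str) -> str:
        async with semaphore:  # at most 10 requests in flight at any moment
            response = await client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
            )
            return response.choices[0].message.content

    async def main():
        prompts = [f"Summarize item {i}" for i in range(100)]
        results = await asyncio.gather(*(complete(p) for p in prompts))
        print(f"{len(results)} requests completed")

    asyncio.run(main())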
When to try: When TPM limits are the bottleneck and task quality allows a smaller model
GPT-4o mini and GPT-3.5 Turbo have much higher TPM limits than GPT-4o at the same tier. For tasks where GPT-4o is overkill (classification, summarization, simple Q&A), use GPT-4o mini: same API call, replace the model name string, get 5-10x more capacity. Check platform.openai.com/settings/organization/limits for per-model limit breakdowns at your tier.
When to try: When large documents or long conversations are triggering TPM limits
Use the tiktoken library (Python: pip install tiktoken) to count tokens before building API requests. For GPT-4o: 'import tiktoken; enc = tiktoken.encoding_for_model("gpt-4o"); token_count = len(enc.encode(your_text))'. If a request exceeds 20% of your TPM limit, split the input into chunks. This prevents single large requests from consuming your entire minute budget.
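A sketch of that chunking step; the naive token-slice boundaries are for illustration, since real documents split better on paragraph or sentence breaks:

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4o")

    TPM_LIMIT = 30_000          # Tier 1 limit for GPT-4o
    MAX_CHUNK = TPM_LIMIT // 5  # stay under 20% of the minute budget per request

    def chunk_text(text: str) -> list[str]:
        tokens = enc.encode(text)
        # Slice the token list into budget-sized pieces, then decode back to text
        return [enc.decode(tokens[i:i + MAX_CHUNK])
                for i in range(0, len(tokens), MAX_CHUNK)]

    chunks = chunk_text(open("large_document.txt").read())
    print(f"{len(chunks)} chunks, each under {MAX_CHUNK} tokens")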
When to try: If you've implemented backoff and semaphores but still hit limits regularly
Log into platform.openai.com → Settings → Organization → Limits. Your tier is shown here, along with exact limits per model. Tier progression: Tier 1 ($5 spend, 30K TPM on GPT-4o), Tier 2 ($50, 450K TPM), Tier 3 ($100, 800K TPM), Tier 4 ($250, 2M TPM), Tier 5 ($1000, 30M TPM). Tier upgrades happen automatically after reaching each spend threshold; you can also contact OpenAI sales for higher limits if you're building at scale.
When to try: For production services handling multiple users
For production systems, add a proper rate limiter library. Python: use 'ratelimit' (pip install ratelimit) or 'tenacity'. Node.js: use 'bottleneck' (npm install bottleneck). Example with bottleneck: 'const limiter = new Bottleneck({ minTime: 100, maxConcurrent: 10 })'. Configure the limiter to stay under 80% of your limits to avoid hitting the ceiling. Monitor your usage with the x-ratelimit-remaining-tokens header in each API response.
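If you'd rather not add a dependency, a token bucket is small enough to write yourself; a minimal single-threaded sketch, with rate=6 approximating 80% of a 500 RPM Tier 1 limit:

    import time

    class TokenBucket:
        """Allow up to `rate` calls per second, smoothing out bursts."""

        def __init__(self, rate: float, capacity: float):
            self.rate = rate          # tokens replenished per second
            self.capacity = capacity  # maximum burst size
            self.tokens = capacity
            self.last = time.monotonic()

        def acquire(self):
            now = time.monotonic()
            # Refill in proportion to elapsed time, capped at capacity
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens < 1:
                time.sleep((1 - self.tokens) / self.rate)  # wait for one token to accrue
                self.last = time.monotonic()
                self.tokens = 0  # the token that accrued during the sleep is consumed
            else:
                self.tokens -= 1

    bucket = TokenBucket(rate=6, capacity=6)  # ~80% of 500 RPM
    bucket.acquire()  # call before each API request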
When to try: For user-facing applications where long waits are a problem
Streaming responses (stream=True in the Python SDK) don't reduce TPM or RPM consumption, but they deliver partial output immediately, which improves UX during high-load periods by showing results as they arrive instead of timing out. Switch to: 'client.chat.completions.create(model="gpt-4o", messages=[...], stream=True)' and iterate the streamed chunks.
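A minimal streaming loop under openai>=1.0.0 (some chunks carry no content delta, hence the guard):

    from openai import OpenAI

    client = OpenAI()

    stream = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Explain rate limits in one paragraph"}],
        stream=True,
    )
    for chunk in stream:
        # Some chunks (e.g., the final one) have no content in the delta
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)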
Always instrument your code to record the x-ratelimit-remaining-tokens and x-ratelimit-remaining-requests headers from every response, and build a dashboard to monitor headroom (a sketch follows this list)
Set usage limits in platform.openai.com → Settings → Billing to cap monthly spend and prevent runaway costs from retry storms
Design with usage tiers in mind: Tier 1 limits are deliberately low; plan your architecture for what Tier 2/3 allows
Use model routing: GPT-4o for complex tasks, GPT-4o mini or GPT-3.5 Turbo for high-volume simple tasks
Add alerting when your error rate for 429s exceeds 1%; it's a leading indicator of approaching your tier ceiling
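For the header instrumentation above, the openai Python SDK (>=1.0.0) exposes raw response headers through with_raw_response; a minimal sketch:

    from openai import OpenAI

    client = OpenAI()

    raw = client.chat.completions.with_raw_response.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "ping"}],
    )
    # Remaining headroom in the current window, straight from the response headers
    print("tokens left:  ", raw.headers.get("x-ratelimit-remaining-tokens"))
    print("requests left:", raw.headers.get("x-ratelimit-remaining-requests"))

    completion = raw.parse()  # the usual ChatCompletion object
    print(completion.choices[0].message.content)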
Contact OpenAI support at platform.openai.com/support if: (1) You've hit Tier 5 limits and need enterprise-level capacity above the self-serve ceiling, (2) Your account tier doesn't match your spend history (e.g., you've spent $500 but still see Tier 2 limits), (3) You need higher limits before reaching the spend threshold for your tier (OpenAI sometimes grants early tier increases to startups and enterprise customers). For general rate limit questions, the self-serve tier upgrade system handles 99% of cases.
A 429 response from OpenAI includes a JSON body with the error type, message, code, and param. Example: {"error": {"message": "Rate limit reached for gpt-4o in organization org-XXXXXX on tokens per min (TPM): Limit 30000, Used 29800, Requested 1500. Please try again in 3s. Contact us if you need to raise your limit.", "type": "tokens", "param": null, "code": "rate_limit_exceeded"}}. The response headers also include x-ratelimit-limit-tokens, x-ratelimit-remaining-tokens, x-ratelimit-reset-tokens, and Retry-After; use these for precise backoff timing.
TPM (tokens per minute) is a budget of total tokens consumed per minute across all API calls β input tokens + output tokens combined. RPM (requests per minute) is simply the count of API calls, regardless of size. You can hit TPM with a few large requests or RPM with many small requests. The error message tells you which one you hit. At Tier 1 with GPT-4o: 30,000 TPM and 500 RPM. Both reset on a rolling 1-minute window.
Go to platform.openai.com → Settings → Organization → Limits. This page shows: your current usage tier (1-5), your lifetime API spend, the spend threshold for the next tier, and per-model TPM/RPM limits. The tier upgrades automatically when you cross the spend threshold; there's no manual request needed for Tiers 1-5. For limits above Tier 5, contact OpenAI sales.
In openai>=1.0.0 (released late 2023), the SDK retries 429 and 5xx errors automatically by default with exponential backoff. The default max_retries is 2. To increase it: 'client = OpenAI(max_retries=5)'. For older SDK versions (<1.0.0) or the legacy openai.ChatCompletion interface, there is no automatic retry β you need to add your own with tenacity or a similar library.
The most common reason: your requests use many tokens each, and you're hitting the TPM limit (not RPM). A single GPT-4o call with a 10,000-token prompt plus 5,000 tokens of output uses 15,000 tokens, half of a Tier 1 TPM budget in one request. To debug: log the usage field in each response (response.usage.total_tokens) and sum it over a minute. That number vs. your TPM limit explains the 429.
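A sketch of that debugging step: keep a rolling one-minute window of each response's usage field and compare it to your TPM limit (30,000 assumes Tier 1 GPT-4o; record_usage is a hypothetical helper):

    import time
    from collections import deque

    TPM_LIMIT = 30_000  # Tier 1 GPT-4o
    window = deque()    # (timestamp, total_tokens) pairs from recent responses

    def record_usage(response):
        now = time.monotonic()
        window.append((now, response.usage.total_tokens))
        while window and now - window[0][0] > 60:  # drop entries older than 60s
            window.popleft()
        used = sum(tokens for _, tokens in window)
        print(f"last-minute usage: {used}/{TPM_LIMIT} tokens")

    # Call record_usage(response) after every client.chat.completions.create(...)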
Add a semaphore to cap concurrency. In Python: 'asyncio.Semaphore(10)' wrapping each API call. Start with 10 concurrent requests for Tier 1, scale up as you verify you're not hitting limits. This single change typically reduces 429 errors by 90% in batch jobs because it prevents the instantaneous burst that consumes your entire RPM budget in the first second. Combine with 'max_retries=5' in the OpenAI client for full coverage.
Rate limits are enforced at the organization level, not per API key. If you have five API keys in one organization, all five share the same TPM and RPM limits combined. Creating additional API keys does not give you additional rate limit capacity. To get more capacity, either upgrade your usage tier through the organization's billing or create a separate OpenAI organization with its own billing.
GPT-4o has significantly lower TPM limits than GPT-3.5 Turbo at the same tier. At Tier 1: GPT-4o is limited to 30,000 TPM, while GPT-3.5 Turbo allows 200,000 TPM. If you need to process high volumes, use model routing: GPT-4o for tasks requiring advanced reasoning, GPT-4o mini or GPT-3.5 Turbo for high-volume simpler tasks. This alone often solves rate limiting for mixed workloads.
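A model-routing sketch; the task-type labels and routing table are illustrative application code, not an OpenAI API feature:

    from openai import OpenAI

    client = OpenAI()

    # Hypothetical routing table: cheap, high-TPM models for high-volume simple work
    ROUTES = {
        "classify": "gpt-4o-mini",
        "summarize": "gpt-4o-mini",
        "reason": "gpt-4o",
    }

    def complete(task_type: str, prompt: str) -> str:
        model = ROUTES.get(task_type, "gpt-4o-mini")  # default to the high-capacity model
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content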
Use tiktoken to count tokens: sum up (input_tokens + expected_output_tokens) for all items in your batch, then divide by the number of minutes you want the batch to take to get the required TPM. Compare to your limit at platform.openai.com/settings/organization/limits. For RPM: total items divided by the same number of minutes. If either number exceeds your limit, your batch will 429. Solution: add a per-second rate limiter so throughput stays under 80% of your limits; the 20% buffer absorbs measurement imprecision.
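That estimate as a back-of-the-envelope sketch (the 500-token output guess and 10-minute target are assumptions to adjust for your workload):

    import tiktoken

    enc = tiktoken.encoding_for_model("gpt-4o")

    items = ["...document 1...", "...document 2..."]  # your batch inputs
    EXPECTED_OUTPUT_TOKENS = 500                      # rough guess per item
    TARGET_MINUTES = 10                               # how fast the batch should finish

    total_tokens = sum(len(enc.encode(item)) + EXPECTED_OUTPUT_TOKENS for item in items)
    required_tpm = total_tokens / TARGET_MINUTES
    required_rpm = len(items) / TARGET_MINUTES

    # Compare against your tier's limits, leaving the 20% safety buffer
    print(f"need ~{required_tpm:.0f} TPM and ~{required_rpm:.1f} RPM")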
Recommended pattern: wait = min(base * (2 ** attempt) + random.uniform(0, 1), max_wait) where base=1s, max_wait=60s, and attempt starts at 0. Always also check the Retry-After header in the 429 response and use max(calculated_wait, retry_after_seconds) so you don't retry before OpenAI says it's safe. Cap at 5 retries total. Log each retry with the wait time for observability. The openai SDK's built-in retry implements a similar pattern when you set max_retries.
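That pattern as a runnable sketch, assuming openai>=1.0.0 (call_api stands in for your actual request function):

    import random
    import time
    import openai

    def with_backoff(call_api, max_retries=5, base=1.0, max_wait=60.0):
        for attempt in range(max_retries):
            try:
                return call_api()
            except openai.RateLimitError as e:
                calculated = min(base * (2 ** attempt) + random.uniform(0, 1), max_wait)
                # Never retry sooner than the server's Retry-After header allows
                retry_after = float(e.response.headers.get("retry-after", 0))
                wait = max(calculated, retry_after)
                print(f"429 on attempt {attempt + 1}, retrying in {wait:.1f}s")
                time.sleep(wait)
        raise RuntimeError("still rate limited after max retries")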
Not instantly, but through spend thresholds. OpenAI's tier system upgrades your limits automatically as you accumulate API spend: Tier 2 at $50, Tier 3 at $100, Tier 4 at $250, Tier 5 at $1,000. You can accelerate tier upgrades by adding API credits (prepaying). For limits beyond Tier 5 or for custom limits at any tier, contact OpenAI sales through platform.openai.com. There's no immediate manual upgrade path that bypasses the spend requirements for standard accounts.
LangChain, LlamaIndex, and similar frameworks wrap the OpenAI SDK and surface the same RateLimitError. LangChain has built-in retry: 'from langchain_openai import ChatOpenAI; llm = ChatOpenAI(max_retries=3)' (older versions import ChatOpenAI from langchain.chat_models). Check the framework's documentation for its specific retry configuration. If the framework doesn't retry, catch openai.RateLimitError exceptions in your code and implement backoff at the application level. Verified April 2026.