OpenOpen8 enforces rate limits at the user and token level to prevent any single user or application from saturating the gateway and degrading service for others. Rate limits are configured by your administrator and apply per user, per model, per minute.

How rate limits work

When your administrator configures a model, they can set a maximum number of requests per minute that any single user may send to that model. If you exceed that limit, OpenOpen8 returns an HTTP 429 Too Many Requests response and rejects the request — it is not queued or retried automatically. Rate limits are tracked independently per model. Sending many requests to gpt-4o does not count against your limit for claude-3-5-sonnet.
Rate limits protect the shared infrastructure. If your workload regularly hits the limit, contact your administrator to request a higher limit for your account or group.
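Conceptually, the gateway keeps an independent per-minute counter for each user-and-model pair. The sketch below illustrates that bookkeeping with a fixed-window counter; the windowing algorithm and class name are assumptions for illustration, not OpenOpen8 internals:

```python
from collections import defaultdict

class FixedWindowLimiter:
    """Per-(user, model) fixed-window request counter.

    Illustrative only: OpenOpen8's actual limiter implementation
    is not documented and may use a different algorithm.
    """

    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        self._counts = defaultdict(int)

    def allow(self, user: str, model: str, now: float) -> bool:
        # Each (user, model, minute-window) triple has its own counter,
        # so traffic to one model never counts against another.
        window = int(now // 60)
        key = (user, model, window)
        if self._counts[key] >= self.limit:
            return False  # the gateway would answer HTTP 429 here
        self._counts[key] += 1
        return True
```

Note how exhausting the budget for one model leaves the counter for every other model untouched, matching the per-model tracking described above.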

Detecting rate limit errors

A rate-limited request returns:
  • HTTP status: 429 Too Many Requests
  • Response body: a JSON error object describing the rate limit
{
  "error": {
    "message": "Rate limit exceeded. Please try again later.",
    "type": "rate_limit_error",
    "code": 429
  }
}
Check for status 429 in your error-handling logic to distinguish rate limit errors from authentication errors (401) or upstream provider errors (5xx).
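A minimal classifier for that branching logic might look like the following; the function name and category strings are illustrative, not part of any OpenOpen8 SDK:

```python
def classify_status(status_code: int) -> str:
    """Map a gateway HTTP status code to a coarse error category.

    Hypothetical helper for illustration -- adapt the categories
    to your own error-handling needs.
    """
    if status_code == 429:
        return "rate_limited"    # back off and retry later
    if status_code == 401:
        return "auth_error"      # bad or expired token; retrying won't help
    if status_code >= 500:
        return "upstream_error"  # provider-side failure; a retry may succeed
    if 200 <= status_code < 300:
        return "ok"
    return "client_error"        # other 4xx: fix the request before retrying
```

Treating 429 and 5xx as retryable while failing fast on 401 and other 4xx responses keeps retry loops from wasting requests on errors that will never resolve on their own.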

Handling rate limits in your code

The most common strategies for dealing with rate limits are exponential backoff and request queuing.
With exponential backoff, you retry the request after a delay that doubles with each failed attempt. This avoids hammering the gateway while giving it time to recover.
import time
import random
import httpx

def call_with_backoff(client: httpx.Client, payload: dict, max_retries: int = 5) -> httpx.Response:
    """POST a chat completion, retrying with exponential backoff on HTTP 429."""
    for attempt in range(max_retries):
        response = client.post("/v1/chat/completions", json=payload)
        if response.status_code != 429:
            return response  # success, or a non-rate-limit error for the caller to handle
        # Exponential backoff with jitter: roughly 1s, 2s, 4s, 8s, ...
        wait = (2 ** attempt) + random.uniform(0, 1)
        print(f"Rate limited. Retrying in {wait:.1f}s...")
        time.sleep(wait)
    raise RuntimeError("Max retries exceeded")
Add a small random jitter (as shown above) to avoid thundering-herd problems when multiple processes retry at the same time.
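The other strategy, request queuing, proactively spaces requests so you never hit the limit in the first place. A minimal client-side throttle is sketched below; the class name and the per-minute budget you pass in are your own choices, not OpenOpen8 defaults:

```python
import threading
import time

class RequestThrottle:
    """Client-side throttle that spaces outgoing requests evenly so the
    total stays under a per-minute budget. Sketch only; set the budget
    to whatever limit your administrator has configured.
    """

    def __init__(self, requests_per_minute: int):
        self.min_interval = 60.0 / requests_per_minute
        self._lock = threading.Lock()
        self._next_slot = 0.0  # monotonic time when the next request may go out

    def acquire(self) -> None:
        """Block until the next request slot is available (thread-safe)."""
        with self._lock:
            now = time.monotonic()
            wait = self._next_slot - now
            # Reserve the following slot before releasing the lock, so
            # concurrent callers each get a distinct slot.
            self._next_slot = max(now, self._next_slot) + self.min_interval
        if wait > 0:
            time.sleep(wait)
```

Calling `throttle.acquire()` before each request turns bursty traffic into a steady stream, which is gentler on the gateway than retrying after 429s.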

Tips for staying within limits

  • Batch requests where possible — some models support sending multiple prompts in one request, which counts as a single request against the rate limit.
  • Cache responses — if your application asks the same question repeatedly, cache the response instead of re-requesting it.
  • Spread load across tokens — if you control multiple tokens, distributing requests across them does not bypass per-user limits (limits apply per user, not per token), but it can help with per-token quota management.
  • Use less expensive models for high-volume tasks — switching from a large model to a smaller one for tasks that don’t require the larger model reduces both rate limit pressure and credit consumption.
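The caching tip above can be sketched with a small in-memory cache keyed on the full request payload. This is a generic client-side pattern, not an OpenOpen8 feature, and the class name is illustrative:

```python
import hashlib
import json

class ResponseCache:
    """Tiny in-memory cache keyed on the serialized request payload.

    Illustrative only; production code would likely add a TTL and a
    size bound so stale or unbounded entries don't accumulate.
    """

    def __init__(self):
        self._store = {}

    def _key(self, payload: dict) -> str:
        # Serialize with sorted keys so equivalent payloads map to the same key.
        blob = json.dumps(payload, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def get_or_call(self, payload: dict, fetch):
        key = self._key(payload)
        if key not in self._store:
            # Only a cache miss spends a request against the rate limit.
            self._store[key] = fetch(payload)
        return self._store[key]
```

Repeated identical prompts then cost one gateway request instead of many.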

Requesting higher limits

Rate limits are set by your administrator at the model level. If your use case requires a higher limit, reach out to your admin and describe:
  • Which model(s) you need higher limits for
  • Your expected request volume (requests per minute)
  • The nature of your workload (batch processing, interactive application, etc.)
Admins can configure per-group limits, so your admin may create a higher-limit group and assign your account to it.