OpenOpen8 enforces rate limits at the user and token level to prevent any single user or application from saturating the gateway and degrading service for others. Rate limits are configured by your administrator and apply per user, per model, per minute.

How rate limits work

When your administrator configures a model, they can set a maximum number of requests per minute that any single user may send to that model. If you exceed that limit, OpenOpen8 returns an HTTP 429 Too Many Requests response and rejects the request — it is not queued or retried automatically. Rate limits are tracked independently per model. Sending many requests to gpt-4o does not count against your limit for claude-3-5-sonnet.
Rate limits protect the shared infrastructure. If your workload regularly hits the limit, contact your administrator to request a higher limit for your account or group.
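Conceptually, the gateway keeps an independent per-minute counter for each user-and-model pair. The sketch below illustrates that bookkeeping with a fixed-window counter; the windowing algorithm and class name are assumptions for illustration, not OpenOpen8 internals:

```python
from collections import defaultdict

class FixedWindowLimiter:
    """Per-(user, model) fixed-window request counter.

    Illustrative only: OpenOpen8's actual limiter implementation
    is not documented and may use a different algorithm.
    """

    def __init__(self, limit_per_minute: int):
        self.limit = limit_per_minute
        self._counts = defaultdict(int)

    def allow(self, user: str, model: str, now: float) -> bool:
        # Each (user, model, minute-window) triple has its own counter,
        # so traffic to one model never counts against another.
        window = int(now // 60)
        key = (user, model, window)
        if self._counts[key] >= self.limit:
            return False  # the gateway would answer HTTP 429 here
        self._counts[key] += 1
        return True
```

Note how exhausting the budget for one model leaves the counter for every other model untouched, matching the per-model tracking described above.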

Detecting rate limit errors

A rate-limited request returns:
  • HTTP status: 429 Too Many Requests
  • Response body: a JSON error object describing the rate limit
{
  "error": {
    "message": "Rate limit exceeded. Please try again later.",
    "type": "rate_limit_error",
    "code": 429
  }
}
Check for status 429 in your error-handling logic to distinguish rate limit errors from authentication errors (401) or upstream provider errors (5xx).
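A minimal classifier for that branching logic might look like the following; the function name and category strings are illustrative, not part of any OpenOpen8 SDK:

```python
def classify_status(status_code: int) -> str:
    """Map a gateway HTTP status code to a coarse error category.

    Hypothetical helper for illustration -- adapt the categories
    to your own error-handling needs.
    """
    if status_code == 429:
        return "rate_limited"    # back off and retry later
    if status_code == 401:
        return "auth_error"      # bad or expired token; retrying won't help
    if status_code >= 500:
        return "upstream_error"  # provider-side failure; a retry may succeed
    if 200 <= status_code < 300:
        return "ok"
    return "client_error"        # other 4xx: fix the request before retrying
```

Treating 429 and 5xx as retryable while failing fast on 401 and other 4xx responses keeps retry loops from wasting requests on errors that will never resolve on their own.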

Handling rate limits in your code

The most common strategies for dealing with rate limits are exponential backoff and request queuing.
With exponential backoff, you retry the request after a delay that doubles with each failed attempt. This avoids hammering the gateway while giving it time to recover.
import time
import random
import httpx

def call_with_backoff(client: httpx.Client, payload: dict, max_retries: int = 5) -> httpx.Response:
    """POST a chat completion, retrying with exponential backoff on HTTP 429."""
    for attempt in range(max_retries):
        response = client.post("/v1/chat/completions", json=payload)
        if response.status_code != 429:
            return response  # success, or a non-rate-limit error for the caller to handle
        # Exponential backoff with jitter: roughly 1s, 2s, 4s, 8s, ...
        wait = (2 ** attempt) + random.uniform(0, 1)
        print(f"Rate limited. Retrying in {wait:.1f}s...")
        time.sleep(wait)
    raise RuntimeError("Max retries exceeded")
Add a small random jitter (as shown above) to avoid thundering-herd problems when multiple processes retry at the same time.
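The other strategy, request queuing, proactively spaces requests so you never hit the limit in the first place. A minimal client-side throttle is sketched below; the class name and the per-minute budget you pass in are your own choices, not OpenOpen8 defaults:

```python
import threading
import time

class RequestThrottle:
    """Client-side throttle that spaces outgoing requests evenly so the
    total stays under a per-minute budget. Sketch only; set the budget
    to whatever limit your administrator has configured.
    """

    def __init__(self, requests_per_minute: int):
        self.min_interval = 60.0 / requests_per_minute
        self._lock = threading.Lock()
        self._next_slot = 0.0  # monotonic time when the next request may go out

    def acquire(self) -> None:
        """Block until the next request slot is available (thread-safe)."""
        with self._lock:
            now = time.monotonic()
            wait = self._next_slot - now
            # Reserve the following slot before releasing the lock, so
            # concurrent callers each get a distinct slot.
            self._next_slot = max(now, self._next_slot) + self.min_interval
        if wait > 0:
            time.sleep(wait)
```

Calling `throttle.acquire()` before each request turns bursty traffic into a steady stream, which is gentler on the gateway than retrying after 429s.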

Tips for staying within limits

  • Batch requests where possible — some models support sending multiple prompts in one request, which counts as a single request against the rate limit.
  • Cache responses — if your application asks the same question repeatedly, cache the response instead of re-requesting it.
  • Spread load across tokens — if you control multiple tokens, distributing requests across them does not bypass per-user limits (limits apply per user, not per token), but it can help with per-token quota management.
  • Use less expensive models for high-volume tasks — switching from a large model to a smaller one for tasks that don’t require the larger model reduces both rate limit pressure and credit consumption.
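The caching tip above can be sketched with a small in-memory cache keyed on the full request payload. This is a generic client-side pattern, not an OpenOpen8 feature, and the class name is illustrative:

```python
import hashlib
import json

class ResponseCache:
    """Tiny in-memory cache keyed on the serialized request payload.

    Illustrative only; production code would likely add a TTL and a
    size bound so stale or unbounded entries don't accumulate.
    """

    def __init__(self):
        self._store = {}

    def _key(self, payload: dict) -> str:
        # Serialize with sorted keys so equivalent payloads map to the same key.
        blob = json.dumps(payload, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def get_or_call(self, payload: dict, fetch):
        key = self._key(payload)
        if key not in self._store:
            # Only a cache miss spends a request against the rate limit.
            self._store[key] = fetch(payload)
        return self._store[key]
```

Repeated identical prompts then cost one gateway request instead of many.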

Requesting higher limits

Rate limits are set by your administrator at the model level. If your use case requires a higher limit, reach out to your admin and describe:
  • Which model(s) you need higher limits for
  • Your expected request volume (requests per minute)
  • The nature of your workload (batch processing, interactive application, etc.)
Admins can configure per-group limits, so your admin may create a higher-limit group and assign your account to it.