OpenAI o3 Latency & TTFT — Complete Performance Guide

Real-time o3 performance data: TTFT, latency, tokens per second, and how it compares to o4-mini, GPT-4o, and DeepSeek R1.

OpenAI's o3 is one of the most capable reasoning models available. But how fast is it? We've been monitoring o3 every 5 minutes since launch, and here's what the data shows.

o3 Performance at a Glance

o3 is a reasoning model — it thinks before responding. This means its TTFT includes internal reasoning time, making it fundamentally different from standard models like GPT-4o.

Here's what we typically see:

  • TTFT: 800-1400ms (includes reasoning overhead)
  • Total Latency: 900-1500ms
  • Throughput: ~23 tokens/second
  • Error Rate: <1%

TTFT: What to Expect

o3's Time to First Token is higher than GPT-4o because it reasons before generating output. A typical request:

  1. 0-800ms: o3 is thinking (you see nothing)
  2. 800ms+: tokens start streaming

This is normal behavior for reasoning models. If your application has a 1-second timeout, you'll see intermittent failures. We recommend setting timeouts to at least 3 seconds for o3.
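To see where that timeout budget goes, here is a minimal sketch of measuring TTFT from a streaming response. The helper works with any iterator of token chunks; the simulated stream and its timings below are illustrative stand-ins, not real o3 output.

```python
import time

def measure_ttft(stream):
    """Return (ttft_seconds, tokens) for any iterator of token chunks.

    Pass it a streaming SDK response or, as here, a simulated stream.
    """
    start = time.monotonic()
    ttft = None
    tokens = []
    for chunk in stream:
        if ttft is None:
            # First chunk arrived: everything before this was "thinking".
            ttft = time.monotonic() - start
        tokens.append(chunk)
    return ttft, tokens

def simulated_o3_stream(thinking_s=0.05, n_tokens=5):
    # Hypothetical stand-in: reasoning models pause before the first token.
    time.sleep(thinking_s)
    for i in range(n_tokens):
        yield f"tok{i}"

ttft, tokens = measure_ttft(simulated_o3_stream())
print(f"TTFT: {ttft * 1000:.0f}ms, {len(tokens)} tokens")
```

With a real reasoning model, the measured TTFT is what your timeout has to cover, which is why a 1-second budget fails intermittently.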

o3 vs Other Reasoning Models

How does o3 compare to other reasoning models?

| Model | Avg TTFT | Throughput | Provider |
|-------|----------|------------|----------|
| o4-mini | ~1100ms | ~62 tok/s | OpenAI |
| o3 | ~1350ms | ~23 tok/s | OpenAI |
| DeepSeek R1 | ~1150ms | varies | DeepSeek |
| Gemini 2.5 Pro | ~2000ms+ | varies | Google |

o4-mini is faster and cheaper than o3 for most reasoning tasks. o3's advantage is on the hardest problems, where more reasoning depth helps.

o3 vs Standard Models

If you don't need reasoning, standard models are significantly faster:

| Model | Avg TTFT | Type |
|-------|----------|------|
| Gemini 2.5 Flash Lite | ~450ms | Standard |
| Claude Haiku 4.7 | ~500ms | Standard |
| GPT-4o Mini | ~600ms | Standard |
| GPT-4o | ~700ms | Standard |
| o3 | ~1350ms | Reasoning |

o3 is roughly 2-3x slower on TTFT than standard models. The tradeoff is accuracy on complex tasks.

Output Tokens Per Second

o3 generates output at ~23 tokens/second once it starts streaming. This is slower than standard models (GPT-4o does ~28 tok/s, Gemini Flash does 100+ tok/s) because reasoning models often output in larger batches rather than smooth token-by-token streaming.
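One subtlety when computing tokens/second yourself: measure over the streaming window only, after the first token, or TTFT drags the number down. A sketch of that calculation, with example numbers chosen to mirror the figures above (illustrative, not a measurement):

```python
def streaming_throughput(ttft_s, total_s, output_tokens):
    """Tokens/second over the streaming window (after the first token).

    ttft_s: seconds until the first token; total_s: seconds until the
    last token; output_tokens: number of streamed tokens.
    """
    stream_window = total_s - ttft_s
    if stream_window <= 0 or output_tokens <= 1:
        return 0.0
    # The first token marks the window start, so count the rest.
    return (output_tokens - 1) / stream_window

# Hypothetical run: 500 output tokens, first at 1.35s, last at 23.0s
print(round(streaming_throughput(1.35, 23.0, 500), 1))  # ~23 tok/s
```

Dividing `output_tokens` by `total_s` instead would report ~21.7 tok/s for the same run, understating the model's actual streaming rate.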

Reliability

o3 has been highly reliable in our monitoring:

  • Uptime: 99.5%+
  • Error rate: <1% (excluding rate limits)
  • Timeouts: rare, but P99 latency can hit 3-5 seconds
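P99 is the number to watch when sizing timeouts, since averages hide tail latency. A minimal nearest-rank percentile sketch; the latency samples below are hypothetical, not our monitoring data:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the ceil(p/100 * n)-th smallest sample."""
    s = sorted(samples)
    k = math.ceil(p / 100 * len(s)) - 1
    return s[max(0, k)]

# Hypothetical: 98 fast requests plus two slow outliers (seconds).
latencies = [1.3] * 98 + [4.2, 5.0]
print(percentile(latencies, 50))   # median: 1.3
print(percentile(latencies, 99))   # tail: 4.2
```

The median looks healthy while P99 sits in the 3-5 second range, which is exactly the pattern that makes a 1-second timeout fail intermittently.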

When to Use o3

Use o3 when:

  • You need step-by-step reasoning
  • Accuracy matters more than speed
  • The task is complex (math, logic, code generation)

Use GPT-4o or Claude Sonnet instead when:

  • You need fast responses (<1 second)
  • The task is straightforward
  • You're optimizing for throughput

Monitor o3 in Real-Time

All the data in this post comes from modelstats.ai, where we ping o3 and 15 other models every 5 minutes. Switch to the "Reasoning Models" tab to see o3's live performance alongside o4-mini, DeepSeek R1, and Gemini 2.5 Pro.

See the live data

All metrics updated every 5 minutes on the ModelStats dashboard.
