LLM API Latency Spikes: What Average TTFT Doesn't Tell You
Every LLM API has latency spikes. We analyzed 24 hours of real TTFT measurements to show the gap between typical and worst-case latency for Claude, GPT-4, Gemini, and DeepSeek.
If you're choosing an LLM API based on average latency, you're missing half the picture.
We monitor every major LLM API every 5 minutes from a single US datacenter. Over the past 24 hours, we found that every provider has significant latency spikes — and the gap between a typical request and a worst-case request can be 5-10x.
The Data
Here's what 24 hours of real TTFT (Time to First Token) data looks like across 10 standard models:
| Model | P50 | P95 | P99 | Max | Spike Factor |
|-------|-----|-----|-----|-----|-------------|
| Gemini 2.5 Flash Lite | 311ms | 1,142ms | 3,080ms | 3,334ms | 10.7x |
| Gemini 2.5 Flash | 477ms | 776ms | 1,309ms | 1,879ms | 3.9x |
| GPT-4o Mini | 536ms | 844ms | 4,461ms | 5,946ms | 11.1x |
| Claude Haiku 4.5 | 541ms | 1,591ms | 4,117ms | 11,611ms | 21.5x |
| GPT-4o | 639ms | 1,144ms | 4,517ms | 7,010ms | 11.0x |
| GPT-4.1 Mini | 661ms | 1,087ms | 2,169ms | 4,279ms | 6.5x |
| GPT-4.1 | 726ms | 1,220ms | 1,946ms | 4,813ms | 6.6x |
| Claude Sonnet 4.6 | 941ms | 2,939ms | 4,787ms | 5,453ms | 5.8x |
| DeepSeek V3 | 1,479ms | 1,926ms | 2,492ms | 3,105ms | 2.1x |
| Claude Opus 4.6 | 1,626ms | 2,420ms | 3,797ms | 5,291ms | 3.3x |
The "Spike Factor" is the ratio of the worst single request (Max) to P50: how much worse the worst request of the day was compared to a typical one.
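If you log raw TTFT samples yourself, the table's summary columns are straightforward to reproduce with the standard library. A minimal sketch (the sample list below is invented for illustration, not our monitoring data):

```python
import statistics

def ttft_summary(samples_ms: list[float]) -> dict:
    """Summarize TTFT samples into the table's percentile columns."""
    # quantiles(n=100) returns 99 cut points; index k is the (k+1)th percentile
    q = statistics.quantiles(samples_ms, n=100)
    p50 = q[49]
    worst = max(samples_ms)
    return {
        "p50": p50,
        "p95": q[94],
        "p99": q[98],
        "max": worst,
        "p99_over_p50": q[98] / p50,   # tail gap
        "max_over_p50": worst / p50,   # worst-case gap
    }

# One probe every 5 minutes for 24 hours = 288 samples (values made up)
samples = [500.0] * 285 + [2000.0, 4000.0, 11000.0]
summary = ttft_summary(samples)
```

A day of mostly-fast samples with a handful of spikes is enough to pull the max far away from the median while barely moving the mean.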
What This Means
Claude Haiku 4.5 has the highest spike factor at 21.5x. Its P50 TTFT is a fast 541ms, but 1% of requests take over 4 seconds, and the worst single request in 24 hours took 11.6 seconds. If your application has a 2-second timeout, you'll see intermittent failures even though the model is technically operational.
Gemini 2.5 Flash is the most consistent of the fast models: the gap between its P50 (477ms) and P99 (1,309ms) is well under a second. If you need predictable latency without giving up much speed, it's the best choice right now.
DeepSeek V3 is also surprisingly consistent (2.1x factor), though its baseline P50 of 1,479ms is higher than most. The tradeoff is predictability vs absolute speed.
OpenAI models cluster around 6-11x spike factors. GPT-4o Mini has the fastest P50 of the OpenAI models (536ms) but also one of the highest spike factors: its P99 is over 4 seconds.
Why Spikes Happen
Every LLM provider runs distributed GPU clusters. Latency spikes come from:
- Cold starts — when your request hits a GPU that needs to load the model weights
- Load balancing — getting routed to a busier or more distant cluster
- Queuing — during high-demand periods, requests wait for GPU availability
- Network variance — routing changes between your server and the provider
These aren't bugs — they're the reality of running inference at scale. But they matter for your application's user experience.
What To Do About It
Set timeouts higher than you think. If a model's P95 is 2 seconds, a 2-second timeout will fail 5% of requests. Set it to the P99 value or add a retry with exponential backoff.
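A generic sketch of that retry pattern. The callable and its failure mode are placeholders; in practice the real SDK client (openai, anthropic, etc.) would be configured with a per-request timeout near the model's P99:

```python
import random
import time

def call_with_retry(fn, max_retries: int = 3, base_delay_s: float = 0.5):
    """Call fn(); on TimeoutError, retry with exponential backoff plus jitter.

    fn is any zero-argument callable assumed to raise TimeoutError
    when its underlying client-side timeout fires.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_retries:
                raise  # out of retries; surface the timeout to the caller
            # waits 0.5s, 1s, 2s, ... with up to 100% jitter
            time.sleep(base_delay_s * (2 ** attempt) * (1 + random.random()))
```

With the timeout set near P99, only about 1% of first attempts ever reach the retry path, and the jitter keeps retries from all landing on the provider at once.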
Use P95, not averages, for capacity planning. Your average response time might be 500ms, but your users experience the P95. That's the number that determines whether your app feels fast or broken.
Consider fallback models. If Claude Sonnet times out after 3 seconds, automatically retry with Claude Haiku or GPT-4o Mini. The speed/quality tradeoff is better than a timeout error.
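One way to sketch that fallback chain. The model names and call wrappers are hypothetical; each wrapper is assumed to apply its own client-side timeout and raise TimeoutError when it fires:

```python
def complete_with_fallback(prompt, providers):
    """Try each (model_name, call) pair in order, best model first.

    Each call(prompt) is assumed to raise TimeoutError when its own
    timeout fires; the first model that answers in time wins.
    """
    last_err = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except TimeoutError as err:
            last_err = err  # too slow: fall through to the next model
    raise last_err  # every model in the chain timed out

# Hypothetical wiring: Sonnet with a ~3s timeout, then faster fallbacks
# providers = [("claude-sonnet", sonnet_call), ("claude-haiku", haiku_call),
#              ("gpt-4o-mini", mini_call)]
```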
Monitor the tail, not the mean. A dashboard showing "average TTFT: 600ms" hides the fact that 1 in 100 users waits 4+ seconds. Track P95 and P99 alongside averages.
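A tiny numeric illustration of how the mean hides the tail (numbers invented for the example):

```python
import statistics

# 100 requests: 99 take 500ms, one spikes to 4500ms
samples_ms = [500.0] * 99 + [4500.0]

mean = statistics.fmean(samples_ms)                # 540.0ms: dashboard looks fine
p99 = statistics.quantiles(samples_ms, n=100)[98]  # ~4460ms: what 1 in 100 users saw
```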
Key Takeaways
- Every LLM API has latency spikes. No exceptions.
- Spike factors range from 2.1x (DeepSeek V3) to 21.5x (Claude Haiku 4.5).
- Fastest average doesn't mean most reliable. Gemini Flash Lite is fast but spiky. Gemini Flash is slightly slower but much more consistent.
- Always design for the P99, not the P50.
All data on this page is from real API monitoring at modelstats.ai, updated every 5 minutes.