LLM API Latency Spikes: What Average TTFT Doesn't Tell You
Every LLM API has latency spikes. We analyzed 24 hours of real TTFT measurements to show the gap between typical and worst-case latency for Claude, GPT-4, Gemini, and DeepSeek.
If you're choosing an LLM API based on average latency, you're missing half the picture.
We monitor every major LLM API every 5 minutes from a single US datacenter. Over the past 24 hours, we found that every provider has significant latency spikes — and the gap between a typical request and a worst-case request can be 5-10x.
The Data
Here's what 24 hours of real TTFT (Time to First Token) data looks like across 10 standard models:
| Model | P50 | P95 | P99 | Max | Spike Factor |
|-------|-----|-----|-----|-----|-------------|
| Gemini 2.5 Flash Lite | 311ms | 1,142ms | 3,080ms | 3,334ms | 10.7x |
| Gemini 2.5 Flash | 477ms | 776ms | 1,309ms | 1,879ms | 3.9x |
| GPT-4o Mini | 536ms | 844ms | 4,461ms | 5,946ms | 11.1x |
| Claude Haiku 4.5 | 541ms | 1,591ms | 4,117ms | 11,611ms | 21.5x |
| GPT-4o | 639ms | 1,144ms | 4,517ms | 7,010ms | 11.0x |
| GPT-4.1 Mini | 661ms | 1,087ms | 2,169ms | 4,279ms | 6.5x |
| GPT-4.1 | 726ms | 1,220ms | 1,946ms | 4,813ms | 6.6x |
| Claude Sonnet 4.6 | 941ms | 2,939ms | 4,787ms | 5,453ms | 5.8x |
| DeepSeek V3 | 1,479ms | 1,926ms | 2,492ms | 3,105ms | 2.1x |
| Claude Opus 4.6 | 1,626ms | 2,420ms | 3,797ms | 5,291ms | 3.3x |
The "Spike Factor" is the ratio of the worst single request (Max) to P50: how much worse the worst request of the day was compared to a typical one.
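If you log raw TTFT samples yourself, the table's summary columns are straightforward to reproduce with the standard library. A minimal sketch (the sample list below is invented for illustration, not our monitoring data):

```python
import statistics

def ttft_summary(samples_ms: list[float]) -> dict:
    """Summarize TTFT samples into the table's percentile columns."""
    # quantiles(n=100) returns 99 cut points; index k is the (k+1)th percentile
    q = statistics.quantiles(samples_ms, n=100)
    p50 = q[49]
    worst = max(samples_ms)
    return {
        "p50": p50,
        "p95": q[94],
        "p99": q[98],
        "max": worst,
        "p99_over_p50": q[98] / p50,   # tail gap
        "max_over_p50": worst / p50,   # worst-case gap
    }

# One probe every 5 minutes for 24 hours = 288 samples (values made up)
samples = [500.0] * 285 + [2000.0, 4000.0, 11000.0]
summary = ttft_summary(samples)
```

A day of mostly-fast samples with a handful of spikes is enough to pull the max far away from the median while barely moving the mean.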
What This Means
Claude Haiku 4.5 has the highest spike factor at 21.5x. Its P50 TTFT is a fast 541ms, but 1% of requests take over 4 seconds, and the worst single request in 24 hours took 11.6 seconds. If your application has a 2-second timeout, you'll see intermittent failures even though the model is technically operational.
Gemini 2.5 Flash is the most consistent of the fast models: the gap between its P50 (477ms) and P99 (1,309ms) is well under a second. If you need predictable latency without giving up much speed, it's the best choice right now.
DeepSeek V3 is also surprisingly consistent (2.1x factor), though its baseline P50 of 1,479ms is higher than most. The tradeoff is predictability vs absolute speed.
OpenAI models cluster around 6-11x spike factors. GPT-4o Mini has the fastest P50 of the OpenAI models (536ms) but also one of the highest spike factors: its P99 is over 4 seconds.
Why Spikes Happen
Every LLM provider runs distributed GPU clusters. Latency spikes come from:
- Cold starts — when your request hits a GPU that needs to load the model weights
- Load balancing — getting routed to a busier or more distant cluster
- Queuing — during high-demand periods, requests wait for GPU availability
- Network variance — routing changes between your server and the provider
These aren't bugs — they're the reality of running inference at scale. But they matter for your application's user experience.
What To Do About It
Set timeouts higher than you think. If a model's P95 is 2 seconds, a 2-second timeout will fail 5% of requests. Set it to the P99 value or add a retry with exponential backoff.
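A generic sketch of that retry pattern. The callable and its failure mode are placeholders; in practice the real SDK client (openai, anthropic, etc.) would be configured with a per-request timeout near the model's P99:

```python
import random
import time

def call_with_retry(fn, max_retries: int = 3, base_delay_s: float = 0.5):
    """Call fn(); on TimeoutError, retry with exponential backoff plus jitter.

    fn is any zero-argument callable assumed to raise TimeoutError
    when its underlying client-side timeout fires.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_retries:
                raise  # out of retries; surface the timeout to the caller
            # waits 0.5s, 1s, 2s, ... with up to 100% jitter
            time.sleep(base_delay_s * (2 ** attempt) * (1 + random.random()))
```

With the timeout set near P99, only about 1% of first attempts ever reach the retry path, and the jitter keeps retries from all landing on the provider at once.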
Use P95, not averages, for capacity planning. Your average response time might be 500ms, but your users experience the P95. That's the number that determines whether your app feels fast or broken.
Consider fallback models. If Claude Sonnet times out after 3 seconds, automatically retry with Claude Haiku or GPT-4o Mini. The speed/quality tradeoff is better than a timeout error.
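One way to sketch that fallback chain. The model names and call wrappers are hypothetical; each wrapper is assumed to apply its own client-side timeout and raise TimeoutError when it fires:

```python
def complete_with_fallback(prompt, providers):
    """Try each (model_name, call) pair in order, best model first.

    Each call(prompt) is assumed to raise TimeoutError when its own
    timeout fires; the first model that answers in time wins.
    """
    last_err = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except TimeoutError as err:
            last_err = err  # too slow: fall through to the next model
    raise last_err  # every model in the chain timed out

# Hypothetical wiring: Sonnet with a ~3s timeout, then faster fallbacks
# providers = [("claude-sonnet", sonnet_call), ("claude-haiku", haiku_call),
#              ("gpt-4o-mini", mini_call)]
```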
Monitor the tail, not the mean. A dashboard showing "average TTFT: 600ms" hides the fact that 1 in 100 users waits 4+ seconds. Track P95 and P99 alongside averages.
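A tiny numeric illustration of how the mean hides the tail (numbers invented for the example):

```python
import statistics

# 100 requests: 99 take 500ms, one spikes to 4500ms
samples_ms = [500.0] * 99 + [4500.0]

mean = statistics.fmean(samples_ms)                # 540.0ms: dashboard looks fine
p99 = statistics.quantiles(samples_ms, n=100)[98]  # ~4460ms: what 1 in 100 users saw
```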
Key Takeaways
- Every LLM API has latency spikes. No exceptions.
- Spike factors range from 2.1x (DeepSeek V3) to 21.5x (Claude Haiku 4.5).
- Fastest average doesn't mean most reliable. Gemini Flash Lite is fast but spiky. Gemini Flash is slightly slower but much more consistent.
- Always design for the P99, not the P50.
All data on this page is from real API monitoring at modelstats.ai, updated every 5 minutes.