Which LLM API is the fastest?

ModelStats monitors all major LLM APIs every 10 minutes. Gemini 2.5 Flash Lite currently has the fastest TTFT, while throughput varies by model. Check modelstats.ai for real-time data.

What is TTFT (Time to First Token)?

TTFT measures how long it takes from sending a request to receiving the first token of the response. Lower TTFT means the model starts responding faster, which is critical for real-time applications.

What is Inter-Token Latency (ITL)?

Inter-Token Latency is the average time between consecutive streamed tokens. Lower ITL means faster, smoother streaming output once generation has started. Because it is an average, it describes typical generation speed rather than how much individual gaps vary.
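To make the two definitions concrete, here is a minimal sketch that derives TTFT and mean ITL from token arrival timestamps. The names `StreamMetrics` and `stream_metrics` are our own for illustration, and the throughput formula (tokens after the first, divided by generation time) is one reasonable assumption rather than the exact calculation the dashboard uses.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float        # request sent -> first token received
    mean_itl_s: float    # average gap between consecutive tokens
    tokens_per_s: float  # tokens per second after the first token (assumed definition)


def stream_metrics(request_sent: float, token_times: Sequence[float]) -> StreamMetrics:
    """Compute TTFT, mean ITL, and throughput from timestamps given in seconds."""
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_sent
    # Gaps between each token and the one before it.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    generation_time = token_times[-1] - token_times[0]
    tokens_per_s = len(gaps) / generation_time if generation_time > 0 else 0.0
    return StreamMetrics(ttft, mean_itl, tokens_per_s)


# Request sent at t=0; five tokens arrive at 0.5s, then every 0.25s.
m = stream_metrics(0.0, [0.5, 0.75, 1.0, 1.25, 1.5])
print(m)
```

In this made-up example the user waits 0.5s for the first token, then sees a new token every 0.25s, which works out to 4 tokens per second.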

How does ModelStats collect performance data?

ModelStats pings every major LLM API (Claude, GPT-4, Gemini, DeepSeek) every 10 minutes with a real streaming request. We measure TTFT, total latency, inter-token latency, throughput, and error rates from actual API responses.
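As a rough illustration of what timing a streaming request involves, the sketch below wraps any iterable stream and records when each chunk arrives. It is not ModelStats' actual collector: `time_stream` and `fake_stream` are hypothetical names, the fake stream stands in for a real provider SDK call, and a chunk from a real API may carry more or less than one token.

```python
import time
from typing import Callable, Iterable, List, Tuple


def time_stream(
    start_stream: Callable[[], Iterable[str]],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, List[float]]:
    """Return (ttft, total_latency, chunk_arrival_times) for one streamed request."""
    sent = clock()
    arrivals: List[float] = []
    for _chunk in start_stream():
        arrivals.append(clock())
    if not arrivals:
        raise RuntimeError("stream ended without producing a chunk")
    return arrivals[0] - sent, arrivals[-1] - sent, arrivals


def fake_stream() -> Iterable[str]:
    # Stand-in for a real streaming API call; the sleeps simulate
    # network and generation time.
    time.sleep(0.05)
    yield "Hello"
    for word in (" from", " a", " fake", " model"):
        time.sleep(0.01)
        yield word


ttft, total, arrivals = time_stream(fake_stream)
print(f"TTFT {ttft * 1000:.0f} ms, total {total * 1000:.0f} ms, {len(arrivals)} chunks")
```

The `clock` parameter is injectable so the timing logic can be checked with a deterministic clock instead of real sleeps.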

What's a Good TTFT for a Production LLM App?

"What's a good TTFT?" is one of the most common questions developers ask when they start measuring LLM API speed. The honest answer is: it depends on what you're building. A number that feels instant in an agent loop can feel broken in an autocomplete box.

This post gives you concrete target ranges by use case, grounds them in what users actually perceive, and explains why the average TTFT you see on a status page is the wrong number to design around.

Note on data: Our live monitoring API was unreachable from our build environment this week, so the specific model numbers below are drawn from our most recent committed snapshot (April 2026) and are clearly labeled as such. For current figures across all 15 models we track, check the live dashboard.

First, What TTFT Actually Controls

TTFT — Time to First Token — is the gap between sending a request and receiving the first token back. In a streaming UI, it's the delay before anything appears on screen. It says nothing about how fast the rest of the response generates (that's throughput and inter-token latency); it only governs how long the user stares at an empty box.

That makes TTFT primarily a perceived responsiveness metric. And perceived responsiveness has well-studied thresholds.

The UX Baseline

Decades of HCI research, summarized in Nielsen Norman Group's classic three response-time limits, give us the anchors:

~0.1s (100ms): feels instantaneous; the user perceives direct manipulation.
~1s: the user notices the delay but their flow of thought stays uninterrupted. No feedback needed beyond the result appearing.
~10s: the upper limit for keeping attention. Beyond this, people switch tasks.

These aren't LLM-specific, but they're the bar your users subconsciously hold your app to. A streaming chat response that starts within a second feels responsive even if the full answer takes ten seconds to finish — which is exactly why streaming exists.

Target TTFT by Use Case

Mapping those thresholds onto real product surfaces:

Use case	Good TTFT	Why
Autocomplete / inline suggestions	< 200ms	Competes with the user's own typing; anything slower gets ignored
Interactive chat (streaming)	< 1,000ms	A first token within ~1s keeps the user's flow of thought intact, so the reply reads as responsive once streaming begins
Chat (non-streaming)	< 2,000ms	The whole answer must land before the user sees anything
Agent / tool-calling steps	< 1,500ms per call	Latency compounds across a chain — a 5-step agent at 1.5s each is already 7.5s
Batch / async jobs	TTFT largely irrelevant	Optimize throughput instead; nobody is watching the first token
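If you want these targets in an alerting rule or a test, a small lookup is enough. This is a sketch only: the thresholds are copied from the table above, while the dictionary keys and the `meets_target` helper are names we made up for the example.

```python
from typing import Optional

# Target TTFT in milliseconds per use case, taken from the table above.
# None means TTFT is not the metric to optimize (batch / async jobs).
TTFT_TARGET_MS = {
    "autocomplete": 200,
    "chat_streaming": 1000,
    "chat_non_streaming": 2000,
    "agent_step": 1500,
    "batch": None,
}


def meets_target(use_case: str, measured_ttft_ms: float) -> Optional[bool]:
    """True or False against the target, or None when TTFT is irrelevant."""
    target = TTFT_TARGET_MS[use_case]
    if target is None:
        return None
    return measured_ttft_ms < target


print(meets_target("chat_streaming", 600))  # True
print(meets_target("autocomplete", 600))    # False
print(meets_target("batch", 600))           # None
```

The same 600ms measurement passes for streaming chat and fails for autocomplete, which is the point of keeping targets per use case.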

The single biggest mistake is applying a chat target to an agent system. If your agent makes eight sequential model calls, a "fine" 1.5s TTFT per call becomes 12 seconds of pure first-token latency before any generation time is counted. For chained workloads, what the user feels is total end-to-end latency, and because TTFT is paid again on every hop, you should be ruthless about it on each call.
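The compounding is plain multiplication, but it helps to run it in both directions: forward to see what a chain costs, and backward to see what per-call TTFT a latency budget allows. The function names and the 4-second budget below are hypothetical, chosen only for illustration.

```python
def chain_first_token_latency(ttft_per_call_s: float, sequential_calls: int) -> float:
    """Total time spent waiting for first tokens across a sequential chain."""
    return ttft_per_call_s * sequential_calls


def max_ttft_per_call(budget_s: float, sequential_calls: int) -> float:
    """Largest per-call TTFT that keeps the chain's first-token wait within budget."""
    return budget_s / sequential_calls


print(chain_first_token_latency(1.5, 5))  # 7.5  (the 5-step agent from the table)
print(chain_first_token_latency(1.5, 8))  # 12.0 (the 8-call agent above)
print(max_ttft_per_call(4.0, 8))          # 0.5  (hypothetical 4s budget)
```

Under that assumed 4-second budget, an eight-call agent needs roughly 500ms TTFT per call, which is well below the target you would accept for a single chat turn.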

What Real Models Hit

To calibrate those targets, here's the TTFT spread from our April 2026 snapshot (labeled as historical — see the live dashboard for today's numbers):

Fast-tier models landed first tokens in the 400–600ms range.
Mid and flagship models sat around 900–1,600ms.
Reasoning models stretched well past 6,000ms, because they think before they stream.

The takeaway that holds regardless of the exact figures: there was roughly a 14x spread between the fastest and slowest models we track. "Good" is achievable on the fast tier today — sub-second first tokens are normal for the lightweight models — but you have to choose for it. If first-token speed is your priority, compare the fast-tier options directly, like Claude Haiku 4.5 vs GPT-4o Mini, rather than defaulting to a flagship.

Budget for the Tail, Not the Average

Here's the part most teams get wrong: a "good average TTFT" can still produce a bad app. Our monitoring consistently shows that the gap between a typical request (P50) and a worst-case request (P99) can be 5–20x for the same model, driven by cold starts, load balancing, and queuing on the provider's side.
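A quick way to see how an average hides the tail is to compute the mean and the percentiles side by side. The sketch below uses a simple nearest-rank percentile and synthetic samples that we invented for the example (they are not ModelStats measurements); `percentile` and `timeout_failure_rate` are illustrative helper names.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def timeout_failure_rate(samples: Sequence[float], timeout_ms: float) -> float:
    """Fraction of requests whose TTFT exceeded the timeout."""
    return sum(1 for s in samples if s > timeout_ms) / len(samples)


# Synthetic TTFT samples in ms (made up for illustration):
# 90 typical requests plus 10 slow ones in the tail.
ttfts = [500.0] * 90 + [1500.0, 2000.0, 2500.0, 2500.0, 3000.0,
                        3000.0, 3500.0, 3500.0, 4000.0, 4000.0]

mean = sum(ttfts) / len(ttfts)
print(f"mean={mean} p50={percentile(ttfts, 50)} "
      f"p95={percentile(ttfts, 95)} p99={percentile(ttfts, 99)}")
print(f"requests over a 2000 ms timeout: {timeout_failure_rate(ttfts, 2000):.0%}")
```

With these invented samples the mean is 745ms and the P50 is 500ms, yet the P95 is 3,000ms and the P99 is 4,000ms, and 8% of requests would exceed a 2-second timeout.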

If your model averages 600ms but its P99 is 4 seconds, a 2-second timeout will fail a meaningful slice of real requests even though the model is "operational." So when you set a TTFT target, set it against a percentile:

Design for P95, not the mean. 95% of requests are at least that fast, so it describes what nearly all of your users actually get.
Set timeouts at or above P99, and pair them with a retry or a faster fallback model.
Re-measure under your own load and region — TTFT shifts with time of day and how far your servers sit from the provider.
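The timeout-plus-fallback advice can be sketched with `asyncio.wait_for`. Everything model-related here is a placeholder: `slow_model` and `fast_model` simulate calls that return a first token, and `first_token_with_fallback` is a hypothetical helper, not part of any provider SDK. In a real app the timeout would come from your own measured P99.

```python
import asyncio
from typing import Awaitable, Callable


async def first_token_with_fallback(
    primary: Callable[[], Awaitable[str]],
    fallback: Callable[[], Awaitable[str]],
    timeout_s: float,
) -> str:
    """Wait up to timeout_s for the primary model, then switch to the fallback."""
    try:
        return await asyncio.wait_for(primary(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # wait_for cancels the primary call before raising.
        return await fallback()


async def slow_model() -> str:
    await asyncio.sleep(0.5)  # simulates a tail-latency request
    return "slow"


async def fast_model() -> str:
    await asyncio.sleep(0.01)
    return "fast"


result = asyncio.run(first_token_with_fallback(slow_model, fast_model, timeout_s=0.1))
print(result)  # fast
```

This sketch leaves the fallback call unbounded for brevity. A production version would likely give the fallback its own timeout and log which path served the request.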

A Practical Rule of Thumb

If you want a single number to start from: aim for a P95 TTFT under 1 second for anything interactive, and under 200ms for anything that competes with typing. Treat everything above 2 seconds as a signal to switch to a faster model, add streaming, or restructure the request — not as something users will quietly tolerate.

Key Takeaways

"Good" TTFT is use-case dependent: < 200ms for autocomplete, < 1s for interactive chat, < 1.5s per hop for agents.
Anchor targets to perceived-responsiveness thresholds: ~1s feels responsive, ~10s loses attention.
Streaming buys you a lot — a fast first token matters more than a fast full response for chat.
Design against P95/P99, not the average; the tail is where apps feel broken.

All current performance data comes from real API monitoring at modelstats.ai, updated every 10 minutes across the 15 models we track.