What's a Good TTFT for a Production LLM App?
What counts as a good TTFT for an LLM API? Practical target ranges by use case — chat, autocomplete, agents — and why you should budget for the P95 tail.
"What's a good TTFT?" is one of the most common questions developers ask when they start measuring LLM API speed. The honest answer is: it depends on what you're building. A number that feels instant in an agent loop can feel broken in an autocomplete box.
This post gives you concrete target ranges by use case, grounds them in what users actually perceive, and explains why the average TTFT you see on a status page is the wrong number to design around.
Note on data: Our live monitoring API was unreachable from our build environment this week, so the specific model numbers below are drawn from our most recent committed snapshot (April 2026) and are clearly labeled as such. For current figures across all 15 models we track, check the live dashboard.
First, What TTFT Actually Controls
TTFT — Time to First Token — is the gap between sending a request and receiving the first token back. In a streaming UI, it's the delay before anything appears on screen. It says nothing about how fast the rest of the response generates (that's throughput and inter-token latency); it only governs how long the user stares at an empty box.
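As a rough illustration of that definition, here is a minimal Python sketch that times the first chunk of a streamed response separately from the full response. `measure_ttft` and `fake_stream` are hypothetical names, and `fake_stream` is only a stand-in for a real provider's streaming iterator; swap in your own client's stream to measure a real model.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[Optional[float], float]:
    """Return (ttft_seconds, total_seconds) for one streamed response.

    The clock starts when this function is called, so pass a stream that
    sends its request lazily on first iteration (as a generator does).
    """
    start = time.perf_counter()
    ttft = None
    for chunk in stream:
        # The first non-empty chunk is treated as the first token.
        if ttft is None and chunk:
            ttft = time.perf_counter() - start
    total = time.perf_counter() - start
    return ttft, total


def fake_stream(first_token_delay: float, n_tokens: int, gap: float) -> Iterator[str]:
    """Hypothetical stand-in for a provider's streaming response."""
    time.sleep(first_token_delay)  # simulated queueing and prefill before the first token
    for i in range(n_tokens):
        if i:
            time.sleep(gap)  # simulated inter-token latency
        yield f"tok{i} "


ttft, total = measure_ttft(fake_stream(0.05, 5, 0.01))
print(f"TTFT: {ttft * 1000:.0f} ms, full response: {total * 1000:.0f} ms")
```

Measured from the client like this, TTFT includes network round-trip time as well as the provider's own processing, which is what your users experience.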
That makes TTFT primarily a perceived responsiveness metric. And perceived responsiveness has well-studied thresholds.
The UX Baseline
Decades of HCI research, summarized in Nielsen Norman Group's classic three response-time limits, give us the anchors:
- ~0.1s (100ms): feels instantaneous; the user perceives direct manipulation.
- ~1s: the user notices the delay but their flow of thought stays uninterrupted. No feedback needed beyond the result appearing.
- ~10s: the upper limit for keeping attention. Beyond this, people switch tasks.
These aren't LLM-specific, but they're the bar your users subconsciously hold your app to. A streaming chat response that starts within a second feels responsive even if the full answer takes ten seconds to finish — which is exactly why streaming exists.
Target TTFT by Use Case
Mapping those thresholds onto real product surfaces:
| Use case | Good TTFT | Why |
|---|---|---|
| Autocomplete / inline suggestions | < 200ms | Competes with the user's own typing; anything slower gets ignored |
| Interactive chat (streaming) | < 1,000ms | A first token within ~1s keeps the user's flow of thought intact; streaming then carries the rest of the wait |
| Chat (non-streaming) | < 2,000ms | Nothing appears until the whole answer lands, so TTFT is only the start of the wait |
| Agent / tool-calling steps | < 1,500ms per call | Latency compounds across a chain — a 5-step agent at 1.5s each is already 7.5s |
| Batch / async jobs | TTFT largely irrelevant | Optimize throughput instead; nobody is watching the first token |
The single biggest mistake is applying a chat target to an agent system. If your agent makes eight sequential model calls, a "fine" 1.5s TTFT per call becomes 12 seconds of pure first-token latency before any generation time is counted. For chained workloads, end-to-end latency is the number users feel, and per-hop TTFT sets a floor on it, so be ruthless about TTFT on every hop.
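The compounding arithmetic is simple enough to sketch in a few lines of Python. The function names are illustrative, and the 4,000 ms total budget in the last call is an assumed example value, not a figure from our data.

```python
def first_token_floor_ms(per_hop_ttft_ms):
    """Total time spent waiting for first tokens across sequential calls.

    Generation time and tool execution are excluded, so real end-to-end
    latency will be higher than this floor.
    """
    return sum(per_hop_ttft_ms)


def per_hop_budget_ms(total_budget_ms, hops):
    """Per-call TTFT allowance if the floor must fit inside total_budget_ms."""
    if hops < 1:
        raise ValueError("hops must be at least 1")
    return total_budget_ms / hops


print(first_token_floor_ms([1500] * 5))  # five hops at 1.5s each
print(first_token_floor_ms([1500] * 8))  # eight hops at 1.5s each
print(per_hop_budget_ms(4000, 8))        # assumed 4,000 ms budget, for illustration only
```

The first two calls reproduce the 7.5-second and 12-second figures above. The third shows that an eight-hop agent needs roughly 500ms per hop to keep first-token waits inside an assumed 4-second budget.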
What Real Models Hit
To calibrate those targets, here's the TTFT spread from our April 2026 snapshot (labeled as historical — see the live dashboard for today's numbers):
- Fast-tier models landed first tokens in the 400–600ms range.
- Mid and flagship models sat around 900–1,600ms.
- Reasoning models stretched well past 6,000ms, because they think before they stream.
The takeaway that holds regardless of the exact figures: there was roughly a 14x spread between the fastest and slowest models we track. "Good" is achievable on the fast tier today (sub-second first tokens are normal for the lightweight models), but you have to pick a model for it. If first-token speed is your priority, compare the fast-tier options directly, like Claude Haiku 4.5 vs GPT-4o Mini, rather than defaulting to a flagship.
Budget for the Tail, Not the Average
Here's the part most teams get wrong: a "good average TTFT" can still produce a bad app. Our monitoring consistently shows that the gap between a typical request (P50) and a tail request (P99, the slowest 1 in 100) can be 5–20x for the same model, driven by cold starts, load balancing, and queuing on the provider's side.
If your model averages 600ms but its P99 is 4 seconds, a 2-second timeout will fail a meaningful slice of real requests even though the model is "operational." So when you set a TTFT target, set it against a percentile:
- Design for P95, not the mean. 95% of requests come in at or under it, so it reflects what nearly all of your users experience, slow requests included.
- Set timeouts at or above P99, and pair them with a retry or a faster fallback model.
- Re-measure under your own load and region — TTFT shifts with time of day and how far your servers sit from the provider.
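To make the percentile advice concrete, here is a small Python sketch that computes P50, P95, and P99 from a list of TTFT samples and estimates how many requests a given timeout would cut off. The sample values are synthetic and invented purely for illustration; `percentile` uses the nearest-rank method, which is one of several common definitions.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def timeout_failure_share(samples, timeout_ms):
    """Fraction of requests whose first token would arrive after the timeout."""
    return sum(1 for s in samples if s > timeout_ms) / len(samples)


# Synthetic TTFT samples in milliseconds, invented for illustration only.
ttft_ms = [600] * 94 + [2500] * 3 + [4000] * 3

print("mean:", statistics.mean(ttft_ms))
for p in (50, 95, 99):
    print(f"P{p}:", percentile(ttft_ms, p))
print("share failing a 2,000 ms timeout:", timeout_failure_share(ttft_ms, 2000))
```

With this synthetic sample the mean is 759ms, which looks comfortable, while P95 is 2,500ms and P99 is 4,000ms. A 2-second timeout would cut off 6% of requests.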
A Practical Rule of Thumb
If you want a single number to start from: aim for a P95 TTFT under 1 second for anything interactive, and under 200ms for anything that competes with typing. Treat everything above 2 seconds as a signal to switch to a faster model, add streaming, or restructure the request — not as something users will quietly tolerate.
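If it helps, the rule of thumb can be written as a check to run against your own measured P95. This is only a sketch: the function name and the three labels are hypothetical, and the thresholds are the ones stated in the paragraph above.

```python
def assess_p95_ttft(p95_ms, competes_with_typing=False):
    """Classify a measured P95 TTFT against the rule-of-thumb targets."""
    target_ms = 200 if competes_with_typing else 1000
    if p95_ms < target_ms:
        return "on target"
    if p95_ms > 2000:
        # Above 2s: switch to a faster model, add streaming, or restructure the request.
        return "rework"
    return "over target"


print(assess_p95_ttft(800))                             # interactive chat
print(assess_p95_ttft(300, competes_with_typing=True))  # autocomplete
print(assess_p95_ttft(2500))                            # any interactive surface
```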
Key Takeaways
- "Good" TTFT is use-case dependent: < 200ms for autocomplete, < 1s for interactive chat, < 1.5s per hop for agents.
- Anchor targets to perceived-responsiveness thresholds: ~1s feels responsive, ~10s loses attention.
- Streaming buys you a lot — a fast first token matters more than a fast full response for chat.
- Design against P95/P99, not the average; the tail is where apps feel broken.
All current performance data comes from real API monitoring at modelstats.ai, updated every 10 minutes across the 15 models we track.