Understanding LLM Performance Metrics: TTFT, ITL, and Why They Matter
If you're building with LLM APIs, you've probably encountered slow responses, inconsistent throughput, or unexplained latency spikes. Understanding the metrics behind these issues is the first step to fixing them.
Here's a breakdown of the four core performance metrics we track at ModelStats.ai, what they measure, and when each one matters.
Time to First Token (TTFT)
What it is: The time in milliseconds between sending your API request and receiving the first token of the response.
Why it matters: TTFT directly controls perceived responsiveness. In a streaming chat interface, this is the delay before the user sees anything start to appear. Research on user experience consistently shows that delays over 1 second feel sluggish, and delays over 3 seconds cause users to disengage.
Real numbers: As of April 2026, TTFT ranges widely across models. Gemini 2.5 Flash Lite responds in 420ms. Claude Opus 4.6 takes 1,604ms. Gemini 2.5 Pro takes over 6,000ms. That's a 14x spread across the models we track.
When to optimize for it: Chat interfaces, autocomplete, any interactive application where the user is waiting for a response to start.
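Measuring TTFT yourself is straightforward with any streaming client: start a timer when you send the request and stop it when the first token arrives. Here is a minimal sketch; `fake_stream` is a simulated stand-in for your provider's streaming response, not a real API.

```python
import time

def fake_stream():
    """Stand-in for a provider's streaming response (simulated delays)."""
    time.sleep(0.05)          # simulated wait for the first token
    for tok in ["Hello", ",", " world", "!"]:
        yield tok
        time.sleep(0.01)      # simulated inter-token gap

def measure_ttft(stream):
    """Return (ttft_ms, tokens) for any iterable of streamed tokens."""
    start = time.perf_counter()
    tokens, ttft_ms = [], None
    for tok in stream:
        if ttft_ms is None:
            # First token just arrived: record elapsed time in milliseconds.
            ttft_ms = (time.perf_counter() - start) * 1000
        tokens.append(tok)
    return ttft_ms, tokens

ttft_ms, tokens = measure_ttft(fake_stream())
print(f"TTFT: {ttft_ms:.0f} ms over {len(tokens)} tokens")
```

The same harness works unchanged against a real streaming iterator, since it only needs something it can loop over.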
Inter-Token Latency (ITL)
What it is: The average time in milliseconds between consecutive tokens during streaming. Once the first token arrives, ITL determines how smoothly the rest of the response flows.
Why it matters: High ITL makes streamed responses feel choppy. Low ITL creates a smooth, readable text stream. This is especially noticeable for longer outputs where users are reading as the text generates.
Real numbers: ITL varies dramatically. OpenAI's o3 model has an ITL of just 0.44ms — tokens arrive almost instantaneously after each other. GPT-4o sits at 8.54ms. Claude Opus 4.6 is at 419.56ms, meaning there's a noticeable delay between each token. Claude Haiku 4.5 is at 130.06ms. Gemini 2.5 Flash comes in at 100.96ms.
When to optimize for it: Long-form content generation, any use case where users read streamed output in real time.
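If you log an arrival timestamp for each streamed token, ITL is just the mean of the gaps between consecutive timestamps. A small sketch, using made-up timestamps for illustration:

```python
def mean_itl_ms(arrival_times_ms):
    """Average gap between consecutive token arrivals, in milliseconds.
    arrival_times_ms: one timestamp per streamed token."""
    gaps = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0

# Example: first token at 400ms, then tokens roughly every 50ms
times = [400, 450, 502, 551, 600]
print(mean_itl_ms(times))  # -> 50.0
```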
Tokens Per Second (Throughput)
What it is: The number of output tokens generated per second. In principle this is the inverse of ITL (tokens/sec ≈ 1000 / ITL in milliseconds), but it's measured end to end as a rate rather than per gap, so the two can diverge in practice.
Why it matters: Throughput determines how long it takes to generate a complete response. For batch processing, agent workflows, or any workload where you're waiting for the full output, this is the metric that drives total wall-clock time.
Real numbers: Gemini 2.5 Flash leads at 74.73 tokens/sec. OpenAI's o4-mini delivers 41.88 tokens/sec. GPT-4.1 comes in at 15.60 tokens/sec. Claude Haiku 4.5 produces 13.44 tokens/sec. At the low end, Gemini 2.5 Flash Lite generates 4.76 tokens/sec.
When to optimize for it: Batch processing, code generation, document summarization, agent loops — anywhere total generation time matters more than initial responsiveness.
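Computing throughput from your own logs is one division, with one subtlety: subtracting TTFT from the total time isolates decode speed from queueing and prefill. A sketch with illustrative numbers:

```python
def tokens_per_second(n_output_tokens, total_s, ttft_s=0.0):
    """Decode throughput: output tokens divided by generation time.
    Subtracting TTFT separates decode speed from queueing/prefill."""
    gen_time = total_s - ttft_s
    return n_output_tokens / gen_time if gen_time > 0 else 0.0

# 500 tokens in 8.0s total, of which 0.5s was waiting for the first token
print(round(tokens_per_second(500, 8.0, 0.5), 2))  # -> 66.67
```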
Total Latency
What it is: The time from request to the final token. Roughly speaking, this is TTFT + (number of output tokens × ITL).
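That back-of-envelope formula is handy for budgeting latency before you ship. A sketch, using illustrative numbers rather than measured ones:

```python
def estimate_total_latency_ms(ttft_ms, n_output_tokens, itl_ms):
    """Rough total latency: time to first token plus one ITL per output token.
    (Whether the first token is double-counted barely matters at scale.)"""
    return ttft_ms + n_output_tokens * itl_ms

# e.g. 400ms TTFT, 200 output tokens at 10ms per token
print(estimate_total_latency_ms(400, 200, 10))  # -> 2400
```

For an agent chaining five such calls sequentially, multiply the result by five: small per-call differences compound quickly.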
Why it matters: For non-streaming use cases (or when you need the complete response before acting on it), total latency is the metric that matters. Agent systems that chain multiple LLM calls are especially sensitive to this — each call's latency compounds.
Real numbers: Gemini 2.5 Flash Lite completes in 420ms. GPT-4.1 in 1,026ms. Claude Sonnet 4.6 in 1,559ms. DeepSeek R1, a reasoning model, takes 4,352ms. Gemini 2.5 Pro takes 13,154ms.
Uptime
What it is: The percentage of successful API responses over a rolling time window.
Why it matters: The fastest API in the world is useless if it's down. Uptime tracking helps you build fallback strategies — if your primary model drops below a threshold, route to a backup. As of April 2026, all 14 models we track report 100% uptime, but this hasn't always been the case and can change without warning.
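A fallback strategy built on uptime can be as simple as walking an ordered preference list and taking the first model whose recent success rate clears a threshold. A minimal sketch; the model names here are hypothetical placeholders:

```python
def pick_model(candidates, uptime_pct, min_uptime=99.0):
    """Return the first candidate whose rolling uptime clears the threshold.
    uptime_pct maps model name -> success rate (%) over a rolling window."""
    for model in candidates:
        if uptime_pct.get(model, 0.0) >= min_uptime:
            return model
    raise RuntimeError("No model meets the uptime threshold")

# Hypothetical readings: the primary is degraded, so route to the backup.
uptime = {"primary-model": 97.2, "backup-model": 100.0}
print(pick_model(["primary-model", "backup-model"], uptime))  # -> backup-model
```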
How to Use These Metrics
- Pick the right metric for your use case. Chat apps should optimize for TTFT. Batch workloads should optimize for throughput. Agent systems should optimize for total latency.
- Monitor continuously. Performance fluctuates throughout the day. A model that's fast at 2am may be slow at 2pm under peak load.
- Set alerts. If your primary model's TTFT crosses a threshold, you want to know before your users do.
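The alerting advice above can be sketched as a rolling-mean check, which avoids paging on a single outlier sample. The threshold and window size here are illustrative, not recommendations:

```python
def ttft_alerts(samples_ms, threshold_ms=1000, window=5):
    """Flag points where the rolling mean of the last `window` TTFT
    samples exceeds the threshold. Returns (sample index, rolling mean)."""
    alerts = []
    for i in range(window, len(samples_ms) + 1):
        mean = sum(samples_ms[i - window:i]) / window
        if mean > threshold_ms:
            alerts.append((i - 1, mean))
    return alerts

# Healthy TTFTs, then a sustained spike
samples = [420, 430, 410, 440, 425, 1800, 2100, 1950, 2200, 2050]
print(ttft_alerts(samples))  # first alert fires at index 6
```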
ModelStats tracks all of these metrics across 14 models, every 5 minutes, 24/7. See the live dashboard at modelstats.ai.