Understanding LLM Performance Metrics: TTFT, ITL, and Why They Matter
If you're building with LLM APIs, you've probably encountered slow responses, inconsistent throughput, or unexplained latency spikes. Understanding the metrics behind these issues is the first step to fixing them.
Here's a breakdown of the four core performance metrics we track at ModelStats.ai, what they measure, and when each one matters.
Time to First Token (TTFT)
What it is: The time in milliseconds between sending your API request and receiving the first token of the response.
Why it matters: TTFT directly controls perceived responsiveness. In a streaming chat interface, this is the delay before the user sees anything start to appear. Research on user experience consistently shows that delays over 1 second feel sluggish, and delays over 3 seconds cause users to disengage.
Real numbers: As of April 2026, TTFT ranges widely across models. Gemini 2.5 Flash Lite responds in 420ms. Claude Opus 4.6 takes 1,604ms. Gemini 2.5 Pro takes over 6,000ms. That's a 14x spread across the models we track.
When to optimize for it: Chat interfaces, autocomplete, any interactive application where the user is waiting for a response to start.
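Measuring TTFT yourself is straightforward with any streaming client: start a timer when you send the request and stop it when the first token arrives. Here is a minimal sketch; `fake_stream` is a simulated stand-in for your provider's streaming response, not a real API.

```python
import time

def fake_stream():
    """Stand-in for a provider's streaming response (simulated delays)."""
    time.sleep(0.05)          # simulated wait for the first token
    for tok in ["Hello", ",", " world", "!"]:
        yield tok
        time.sleep(0.01)      # simulated inter-token gap

def measure_ttft(stream):
    """Return (ttft_ms, tokens) for any iterable of streamed tokens."""
    start = time.perf_counter()
    tokens, ttft_ms = [], None
    for tok in stream:
        if ttft_ms is None:
            # First token just arrived: record elapsed time in milliseconds.
            ttft_ms = (time.perf_counter() - start) * 1000
        tokens.append(tok)
    return ttft_ms, tokens

ttft_ms, tokens = measure_ttft(fake_stream())
print(f"TTFT: {ttft_ms:.0f} ms over {len(tokens)} tokens")
```

The same harness works unchanged against a real streaming iterator, since it only needs something it can loop over.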
Inter-Token Latency (ITL)
What it is: The average time in milliseconds between consecutive tokens during streaming. Once the first token arrives, ITL determines how smoothly the rest of the response flows.
Why it matters: High ITL makes streamed responses feel choppy. Low ITL creates a smooth, readable text stream. This is especially noticeable for longer outputs where users are reading as the text generates.
Real numbers: ITL varies dramatically. OpenAI's o3 model has an ITL of just 0.44ms — tokens arrive almost instantaneously after each other. GPT-4o sits at 8.54ms. Claude Opus 4.6 is at 419.56ms, meaning there's a noticeable delay between each token. Claude Haiku 4.5 is at 130.06ms. Gemini 2.5 Flash comes in at 100.96ms.
When to optimize for it: Long-form content generation, any use case where users read streamed output in real time.
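If you log an arrival timestamp for each streamed token, ITL is just the mean of the gaps between consecutive timestamps. A small sketch, using made-up timestamps for illustration:

```python
def mean_itl_ms(arrival_times_ms):
    """Average gap between consecutive token arrivals, in milliseconds.
    arrival_times_ms: one timestamp per streamed token."""
    gaps = [b - a for a, b in zip(arrival_times_ms, arrival_times_ms[1:])]
    return sum(gaps) / len(gaps) if gaps else 0.0

# Example: first token at 400ms, then tokens roughly every 50ms
times = [400, 450, 502, 551, 600]
print(mean_itl_ms(times))  # -> 50.0
```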
Tokens Per Second (Throughput)
What it is: The number of output tokens generated per second. In principle this is the inverse of ITL (tokens/sec ≈ 1000 / ITL in milliseconds), but it's measured end to end as a rate rather than per gap, so the two can diverge in practice.
Why it matters: Throughput determines how long it takes to generate a complete response. For batch processing, agent workflows, or any workload where you're waiting for the full output, this is the metric that drives total wall-clock time.
Real numbers: Gemini 2.5 Flash leads at 74.73 tokens/sec. OpenAI's o4-mini delivers 41.88 tokens/sec. GPT-4.1 comes in at 15.60 tokens/sec. Claude Haiku 4.5 produces 13.44 tokens/sec. At the low end, Gemini 2.5 Flash Lite generates 4.76 tokens/sec.
When to optimize for it: Batch processing, code generation, document summarization, agent loops — anywhere total generation time matters more than initial responsiveness.
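Computing throughput from your own logs is one division, with one subtlety: subtracting TTFT from the total time isolates decode speed from queueing and prefill. A sketch with illustrative numbers:

```python
def tokens_per_second(n_output_tokens, total_s, ttft_s=0.0):
    """Decode throughput: output tokens divided by generation time.
    Subtracting TTFT separates decode speed from queueing/prefill."""
    gen_time = total_s - ttft_s
    return n_output_tokens / gen_time if gen_time > 0 else 0.0

# 500 tokens in 8.0s total, of which 0.5s was waiting for the first token
print(round(tokens_per_second(500, 8.0, 0.5), 2))  # -> 66.67
```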
Total Latency
What it is: The time from request to the final token. Roughly speaking, this is TTFT + (number of output tokens × ITL).
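That back-of-envelope formula is handy for budgeting latency before you ship. A sketch, using illustrative numbers rather than measured ones:

```python
def estimate_total_latency_ms(ttft_ms, n_output_tokens, itl_ms):
    """Rough total latency: time to first token plus one ITL per output token.
    (Whether the first token is double-counted barely matters at scale.)"""
    return ttft_ms + n_output_tokens * itl_ms

# e.g. 400ms TTFT, 200 output tokens at 10ms per token
print(estimate_total_latency_ms(400, 200, 10))  # -> 2400
```

For an agent chaining five such calls sequentially, multiply the result by five: small per-call differences compound quickly.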
Why it matters: For non-streaming use cases (or when you need the complete response before acting on it), total latency is the metric that matters. Agent systems that chain multiple LLM calls are especially sensitive to this — each call's latency compounds.
Real numbers: Gemini 2.5 Flash Lite completes in 420ms. GPT-4.1 in 1,026ms. Claude Sonnet 4.6 in 1,559ms. DeepSeek R1, a reasoning model, takes 4,352ms. Gemini 2.5 Pro takes 13,154ms.
Uptime
What it is: The percentage of successful API responses over a rolling time window.
Why it matters: The fastest API in the world is useless if it's down. Uptime tracking helps you build fallback strategies — if your primary model drops below a threshold, route to a backup. As of April 2026, all 14 models we track report 100% uptime, but this hasn't always been the case and can change without warning.
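A fallback strategy built on uptime can be as simple as walking an ordered preference list and taking the first model whose recent success rate clears a threshold. A minimal sketch; the model names here are hypothetical placeholders:

```python
def pick_model(candidates, uptime_pct, min_uptime=99.0):
    """Return the first candidate whose rolling uptime clears the threshold.
    uptime_pct maps model name -> success rate (%) over a rolling window."""
    for model in candidates:
        if uptime_pct.get(model, 0.0) >= min_uptime:
            return model
    raise RuntimeError("No model meets the uptime threshold")

# Hypothetical readings: the primary is degraded, so route to the backup.
uptime = {"primary-model": 97.2, "backup-model": 100.0}
print(pick_model(["primary-model", "backup-model"], uptime))  # -> backup-model
```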
How to Use These Metrics
- Pick the right metric for your use case. Chat apps should optimize for TTFT. Batch workloads should optimize for throughput. Agent systems should optimize for total latency.
- Monitor continuously. Performance fluctuates throughout the day. A model that's fast at 2am may be slow at 2pm under peak load.
- Set alerts. If your primary model's TTFT crosses a threshold, you want to know before your users do.
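The alerting advice above can be sketched as a rolling-mean check, which avoids paging on a single outlier sample. The threshold and window size here are illustrative, not recommendations:

```python
def ttft_alerts(samples_ms, threshold_ms=1000, window=5):
    """Flag points where the rolling mean of the last `window` TTFT
    samples exceeds the threshold. Returns (sample index, rolling mean)."""
    alerts = []
    for i in range(window, len(samples_ms) + 1):
        mean = sum(samples_ms[i - window:i]) / window
        if mean > threshold_ms:
            alerts.append((i - 1, mean))
    return alerts

# Healthy TTFTs, then a sustained spike
samples = [420, 430, 410, 440, 425, 1800, 2100, 1950, 2200, 2050]
print(ttft_alerts(samples))  # first alert fires at index 6
```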
ModelStats tracks all of these metrics across 14 models, every 5 minutes, 24/7. See the live dashboard at modelstats.ai.