Which LLM API is the fastest?

ModelStats monitors all major LLM APIs every 10 minutes. Gemini 2.5 Flash Lite currently has the fastest TTFT, while throughput varies by model. Check modelstats.ai for real-time data.

What is TTFT (Time to First Token)?

TTFT measures how long it takes from sending a request to receiving the first token of the response. Lower TTFT means the model starts responding faster, which is critical for real-time applications.

What is Inter-Token Latency (ITL)?

Inter-Token Latency is the average gap between consecutive streamed tokens. Lower ITL means smoother streaming output. It reflects how steadily a model generates text once the response has started.

How does ModelStats collect performance data?

ModelStats pings every major LLM API (Claude, GPT-4, Gemini, DeepSeek) every 10 minutes with a real streaming request. We measure TTFT, total latency, inter-token latency, throughput, and error rates from actual API responses.
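To make those definitions concrete, here is a minimal sketch of how the four timing metrics could be derived from one streamed response. It is an illustration, not ModelStats' actual probe code: the function name measure_stream, the injectable clock, and the choice to compute throughput over the tokens that follow the first one are all assumptions made for the example.

```python
import time
from typing import Callable, Iterable, Optional


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> dict[str, Optional[float]]:
    """Consume a token stream and derive per-request metrics, in seconds."""
    start = clock()  # taken just before the stream is consumed
    arrivals = [clock() for _ in tokens]  # one timestamp per received token
    if not arrivals:
        raise ValueError("stream produced no tokens")

    ttft = arrivals[0] - start
    total = arrivals[-1] - start
    gaps = [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]
    generation = total - ttft  # time spent after the first token arrived

    return {
        "ttft_s": ttft,
        "total_latency_s": total,
        # Both values are undefined for a one-token reply, so report None.
        "itl_s": sum(gaps) / len(gaps) if gaps else None,
        "throughput_tok_s": len(gaps) / generation if generation > 0 else None,
    }


def fake_stream():
    """Hypothetical stand-in for a provider's streaming response."""
    for piece in ["Hello", ",", " world"]:
        time.sleep(0.01)
        yield piece


if __name__ == "__main__":
    print(measure_stream(fake_stream()))
```

Passing the clock in as a parameter keeps the function testable without real network calls. A very short reply leaves ITL and throughput as None, because there are too few tokens to compute a rate.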

GPT-4o vs Gemini Flash: Latency and Throughput Compared

If you're building a chat or agent feature and have narrowed the field to OpenAI's GPT-4o and Google's Gemini Flash tier, the deciding factor often isn't capability — both are more than good enough for most production work. It's speed. And "speed" splits into two numbers that pull in different directions: how fast the model starts answering (TTFT) and how fast it finishes (throughput). GPT-4o and Gemini Flash sit on opposite ends of that tradeoff, which is exactly what makes the matchup worth a close look.

A note on data this week: ModelStats' live monitoring feed was unreachable while this post was written, so the numbers below come from our last published snapshot (April 3, 2026), clearly labeled as such. For current, real-time TTFT, latency, and tokens/sec across the models we track, check the live dashboard at modelstats.ai.

Round 1: Time to first token

TTFT is the most user-visible metric — the gap between your API call and the first streamed token landing. It's what makes a chatbot feel instant or sluggish.

In our April 3, 2026 snapshot, the two models lined up like this:

Model	TTFT (ms)	Total latency (ms)	Throughput (tok/s)
Gemini 2.5 Flash	492	696	74.73
GPT-4o	758	968	—

Gemini Flash starts faster. At 492ms it crossed into sub-500ms territory, the rough threshold where responses begin to feel instantaneous. GPT-4o, at 758ms, is still well under a second and perfectly usable for chat, but it's 266ms slower off the line. For a single request that gap is easy to overlook; for a high-frequency agent loop firing thousands of sequential calls, those milliseconds add up.
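As a rough illustration of how the TTFT gap accumulates, the sketch below multiplies the snapshot difference by a number of calls. It assumes the calls run strictly one after another and that the gap stays constant, neither of which is guaranteed in practice. The function name extra_wait_s and the call counts are made up for the example.

```python
GEMINI_25_FLASH_TTFT_MS = 492  # April 3, 2026 snapshot
GPT_4O_TTFT_MS = 758           # April 3, 2026 snapshot


def extra_wait_s(
    sequential_calls: int,
    slow_ttft_ms: float = GPT_4O_TTFT_MS,
    fast_ttft_ms: float = GEMINI_25_FLASH_TTFT_MS,
) -> float:
    """Added wait, in seconds, when every call waits on the previous one."""
    return sequential_calls * (slow_ttft_ms - fast_ttft_ms) / 1000


# Hypothetical loop sizes, chosen only to show the scaling.
for calls in (1, 100, 5000):
    print(f"{calls:>5} calls: +{extra_wait_s(calls):.1f} s")
```

Under those assumptions, 5,000 chained calls add about 1,330 seconds, a little over 22 minutes. Calls issued in parallel would not accumulate the gap this way.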

It's worth saying plainly: these are point-in-time figures from a single snapshot, and TTFT drifts with region, time of day, and load. Treat the ordering as more durable than the exact millisecond counts.

Round 2: Throughput

Once streaming begins, throughput (tokens per second) decides how quickly a long answer completes. This is where the two models diverge hardest.

In the same April snapshot, Gemini 2.5 Flash streamed 74.73 tokens/sec, by a wide margin the highest of any model we tracked at the time. GPT-4o didn't return a throughput figure in that probe (a known artifact of short test outputs, where there aren't enough generated tokens to measure a stable rate), so we won't quote a number we didn't measure. Across prior snapshots, though, GPT-4o's standard-endpoint throughput has consistently sat in the low-to-mid double digits of tokens/sec, well below Gemini Flash's 74.73.

The practical read: for short, conversational replies, throughput barely matters — the answer is done before the difference shows up. But for long-form generation, summarization of big documents, or anything emitting thousands of output tokens, Gemini Flash's throughput advantage translates directly into lower wall-clock time and less concurrency needed to hit a deadline.
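A back-of-the-envelope model shows why output length changes which metric matters. The sketch assumes wall-clock time is simply TTFT plus output tokens divided by a constant throughput. That is a simplification, since real streams are not perfectly steady. The function name and the output lengths are hypothetical; the TTFT and throughput values are the Gemini 2.5 Flash figures from the April snapshot.

```python
def estimated_wall_clock_s(
    output_tokens: int, ttft_s: float, tokens_per_s: float
) -> float:
    """Simplified model: time to first token plus steady-rate generation."""
    if tokens_per_s <= 0:
        raise ValueError("tokens_per_s must be positive")
    return ttft_s + output_tokens / tokens_per_s


TTFT_S = 0.492        # Gemini 2.5 Flash, April 3, 2026 snapshot
TOKENS_PER_S = 74.73  # same snapshot

# Hypothetical output lengths: a short reply, a paragraph, a long document.
for n in (20, 200, 2000):
    total = estimated_wall_clock_s(n, TTFT_S, TOKENS_PER_S)
    print(f"{n:>5} tokens: {total:6.2f} s total, TTFT is {TTFT_S / total:.0%} of it")
```

In this model, TTFT accounts for roughly two thirds of the wait on a 20-token reply and under 2% on a 2,000-token one, which is why throughput takes over for long outputs.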

Round 3: Total latency

Total latency (time to the last token) folds TTFT and generation speed into one end-to-end number. In April, Gemini 2.5 Flash finished a full round-trip in 696ms versus GPT-4o's 968ms — Gemini ahead on both ends of the request.
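One way to read these totals is to split each into the time before and after the first token. The sketch below does only that subtraction on the snapshot figures. The helper name after_first_token_ms is made up for the example, and a single probe with a short output is too little data to generalize from.

```python
# April 3, 2026 snapshot figures, in milliseconds.
SNAPSHOT_MS = {
    "Gemini 2.5 Flash": {"ttft": 492, "total": 696},
    "GPT-4o": {"ttft": 758, "total": 968},
}


def after_first_token_ms(ttft_ms: int, total_ms: int) -> int:
    """Time between the first streamed token and the last one."""
    return total_ms - ttft_ms


for model, m in SNAPSHOT_MS.items():
    rest = after_first_token_ms(m["ttft"], m["total"])
    print(f"{model}: {m['ttft']} ms to first token + {rest} ms after it = {m['total']} ms")
```

On these figures the post-first-token portion is 204ms for Gemini 2.5 Flash and 210ms for GPT-4o, so 266ms of the 272ms gap in total latency comes from TTFT. With outputs this short, total latency mostly restates the TTFT result. Longer generations would give throughput more weight.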

The generational caveat you can't skip

Here's the honesty check. The snapshot above measured Gemini 2.5 Flash, but Google has since moved its fast tier forward: Gemini 3.5 Flash shipped on May 19, 2026 and is now the current-generation model in that slot (and the one we track as gemini-3.5-flash). We don't yet have a published ModelStats snapshot for it, and provider-reported versus independent throughput figures for brand-new models routinely diverge — so we're not going to quote a measured tokens/sec for 3.5 Flash that we haven't captured ourselves.

What that means for you: the shape of this comparison (Gemini Flash optimized for fast starts and very high throughput, GPT-4o optimized for steady, predictable mid-range latency) is likely to hold across the generational bump, but the exact numbers will move. Verify on the live data before you commit. You can line the current pairing up directly on the GPT-4o vs Gemini 3.5 Flash comparison page, which plots both on real monitoring data as it comes in.

How to choose

Optimizing for perceived speed in a chat UI? TTFT is your metric. Gemini Flash's faster start is the edge, but GPT-4o's sub-second start is genuinely fine for most chat experiences — don't over-index on a few hundred milliseconds if other factors favor it.
Generating long outputs or running batch jobs? Throughput dominates wall-clock time. Gemini Flash's tokens/sec lead is the more decisive advantage here.
Want predictability across a model family? OpenAI's standard models historically cluster in a tight latency band, which makes capacity planning simpler if you mix several of them.
Either way, measure on your own prompts. Throughput and TTFT shift with prompt shape, output length, region, and time of day. Vendor benchmarks use idealized conditions; your traffic won't.
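When you do measure on your own prompts, a single reading says little, so it helps to collect repeated samples and summarize them. The sketch below is one possible way to do that with Python's standard library. The function name summarize, the choice of median and 95th percentile, and the sample values are all illustrative assumptions, not measured data.

```python
import statistics


def summarize(samples_ms: list[float]) -> dict[str, float]:
    """Summarize repeated latency readings taken under the same conditions."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "median_ms": statistics.median(samples_ms),
        "p95_ms": cuts[94],
        "max_ms": max(samples_ms),
    }


# Made-up TTFT readings (ms) for one prompt, for illustration only.
ttft_samples = [480, 495, 510, 502, 489, 930, 498, 505, 491, 500]
print(summarize(ttft_samples))
```

The median describes the typical experience, while the 95th percentile and maximum expose the slow outliers that averages hide. Keeping separate sample sets per prompt shape, region, and time of day makes the comparison fairer.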

Takeaways

Gemini Flash starts faster and finishes faster. In our April 3, 2026 snapshot it beat GPT-4o on TTFT (492ms vs 758ms), total latency (696ms vs 968ms), and led all tracked models on throughput at 74.73 tok/s.
GPT-4o stays comfortably sub-second on TTFT and is a safe default when consistency across a model family matters more than topping the speed charts.
The numbers above are a labeled April snapshot, and Google's fast tier has since advanced to Gemini 3.5 Flash (May 19, 2026) — so confirm the current figures before you decide.
Track the live matchup at modelstats.ai, updated every 10 minutes.