Cheapest LLM API for High Throughput in 2026

The cheapest LLM APIs for high-throughput work in 2026: DeepSeek V4-Pro's 75% price cut, Gemini's fast Flash tier, and how to weigh cost vs tokens/sec.

If you're running a high-throughput LLM workload — bulk summarization, classification, agent loops, anything that generates millions of output tokens a day — the headline price-per-token is only half the story. The other half is throughput: how many tokens per second the model actually streams back. A model that's cheaper per token but half as fast can cost you more in wall-clock time, infrastructure, and concurrency. Here's how the cheap, fast tier looks heading into June 2026, and how to reason about it.

A note on data this week: ModelStats' live monitoring feed was unavailable while this post was written, so the numbers below come from cited third-party sources and from our last published snapshot (April 3, 2026), each labeled with its date. For current, real-time tokens/sec and TTFT across the models we track, check the live dashboard at modelstats.ai.

The price floor dropped again this week

The cheap tier got cheaper. On May 25, 2026, DeepSeek made its V4-Pro promotional discount permanent — a 75% cut that lands input at $0.435 and output at $0.87 per million tokens, with cache hits near $0.0036/M (Engadget, InfoWorld). DeepSeek tied the cut to cheaper inference hardware coming online in the second half of the year, and InfoWorld framed it plainly as an escalation of the ongoing API price war.

That matters for throughput economics specifically: output tokens are where high-volume jobs spend their money, and a $0.87/M output rate is roughly an order of magnitude below frontier flagship pricing. For a job that emits 50M output tokens a month, the output line item is about $43.50 at that rate (50 × $0.87).
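The arithmetic behind that line item is simple enough to script. Here is a minimal sketch; the function name `monthly_output_cost` is ours, and the inputs are the 50M-token volume and $0.87/M rate quoted above. It ignores input tokens, caching, and any volume discounts.

```python
def monthly_output_cost(output_tokens: int, usd_per_million: float) -> float:
    """Return the monthly output-token cost in USD for a flat per-million rate."""
    return output_tokens / 1_000_000 * usd_per_million


# Figures from the paragraph above: 50M output tokens at $0.87 per million.
cost = monthly_output_cost(50_000_000, 0.87)
print(f"${cost:.2f}")  # $43.50
```

Swap in your own monthly volume and the rate you actually pay; published rates change, so treat the default here as a snapshot.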

A low price only pays off if the tokens actually arrive quickly, though, which is the other half of the equation.

Fast is getting cheaper, too

Google shipped Gemini 3.5 Flash on May 19, 2026, positioned squarely at throughput-heavy, agentic workloads. Output-speed measurements vary by source — Google's own API figures land around 175 tokens/sec while independent benchmarks like Artificial Analysis measured north of 280 tokens/sec (Artificial Analysis, Google). Either way, it's one of the faster general-purpose models on the market right now, and it's a reminder that the "Flash" tier is where providers compete hardest on tokens/sec.

Treat throughput figures for brand-new models as provisional until you've measured them on your own traffic. Provider-reported and independent numbers for Gemini 3.5 Flash already differ by a factor of about 1.6 (roughly 175 vs. 280 tokens/sec), which tells you how much methodology matters.

What ModelStats data has shown about throughput

ModelStats tracks tokens/sec, TTFT, total latency, inter-token latency, and error rate for ~15 production LLM APIs, pinging each every ten minutes. In our last published snapshot (April 3, 2026), Gemini 2.5 Flash was the clear throughput leader among tracked models at 74.73 tokens/sec, with OpenAI's reasoning models (o3, o4-mini) following near 41 tokens/sec. Gemini 2.5 Flash Lite, by contrast, won on time-to-first-token (420ms) but pushed far fewer tokens/sec — a sharp illustration that "fastest to start" and "fastest to finish" are different metrics.
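To make the difference between "fastest to start" and "fastest to finish" concrete, here is a rough sketch of how these metrics can be derived from token arrival timestamps. This is not ModelStats' actual measurement code. The names `StreamMetrics` and `stream_metrics` are hypothetical, the timestamps are synthetic, and computing tokens/sec over the whole request (including the wait for the first token) is one possible definition among several.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamMetrics:
    ttft_s: float              # time to first token
    total_latency_s: float     # request sent -> last token received
    tokens_per_sec: float      # tokens / total latency (assumed definition)
    mean_inter_token_s: float  # average gap between consecutive tokens


def stream_metrics(request_sent_s: float, token_times_s: Sequence[float]) -> StreamMetrics:
    """Compute streaming metrics from the arrival time of each token."""
    if not token_times_s:
        raise ValueError("need at least one token timestamp")
    n = len(token_times_s)
    ttft = token_times_s[0] - request_sent_s
    total = token_times_s[-1] - request_sent_s
    tps = n / total if total > 0 else float("inf")
    gaps = n - 1
    mean_gap = (token_times_s[-1] - token_times_s[0]) / gaps if gaps else 0.0
    return StreamMetrics(ttft, total, tps, mean_gap)


# Two synthetic 100-token streams (made-up numbers, not measurements):
# one starts sooner but streams slowly, the other starts later but streams fast.
fast_start = stream_metrics(0.0, [0.4 + i * 0.05 for i in range(100)])
fast_finish = stream_metrics(0.0, [1.0 + i * 0.0125 for i in range(100)])

for name, m in (("fast_start", fast_start), ("fast_finish", fast_finish)):
    print(f"{name}: TTFT={m.ttft_s:.2f}s total={m.total_latency_s:.2f}s "
          f"tokens/sec={m.tokens_per_sec:.1f}")
```

In this toy data the stream with the better TTFT finishes later and has the lower tokens/sec, which is the same pattern as the Flash Lite vs. Flash comparison above.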

The newest models discussed above (DeepSeek V4-Pro, Gemini 3.5 Flash) aren't in our tracked set yet, so we won't quote a tokens/sec number for them that we haven't measured ourselves. When they're added, you'll be able to compare them head-to-head on the live dashboard. In the meantime, you can already line up the current cheap-tier contenders on real monitoring data, for example Claude Haiku 4.5 vs Gemini 2.5 Flash Lite or GPT-4o vs Gemini 2.5 Flash Lite.

How to actually pick a cheap high-throughput API

Price-per-token rankings are a starting point, not an answer. A practical checklist:

  • Cost the job, not the token. Estimate monthly output-token volume and multiply by the output rate. Output is almost always the dominant cost for generation-heavy work; for input-heavy retrieval workloads, the input rate carries more of the weight.
  • Check tokens/sec, not just TTFT. For long outputs and batch jobs, throughput determines total wall-clock time and how much concurrency you need to hit a deadline. TTFT barely matters once an answer is 2,000 tokens long.
  • Stack the discounts. Prompt caching (up to about 90% off cached input tokens at Anthropic and OpenAI) and batch APIs (about 50% off) can compound where a provider lets you combine them. A workload that's cacheable and tolerant of batch latency can run at a fraction of the on-demand sticker price.
  • Measure on your own prompts. Throughput varies with prompt shape, output length, region, and time of day. Vendor benchmarks use idealized conditions; your traffic won't.
  • Watch error rate. A cheap, fast model that fails 2% of calls forces retries that add cost and latency and quietly eat into the savings. Reliability is part of the price.
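The checklist above can be turned into back-of-the-envelope math. The sketch below is a simplification under stated assumptions: all function names are ours, the workload numbers are illustrative rather than measured, concurrency ignores rate limits and assumes requests pack perfectly, discounts are assumed to apply to the same tokens, and failed calls are assumed to be billed in full and retried until they succeed.

```python
import math


def wall_clock_seconds(ttft_s: float, output_tokens: int, tokens_per_sec: float) -> float:
    """Approximate time for one request: wait for first token, then stream the rest."""
    return ttft_s + output_tokens / tokens_per_sec


def required_concurrency(num_requests: int, seconds_per_request: float, deadline_s: float) -> int:
    """Parallel requests needed to finish before the deadline (ideal packing assumed)."""
    return math.ceil(num_requests * seconds_per_request / deadline_s)


def effective_rate(list_rate: float, cache_discount: float = 0.0, batch_discount: float = 0.0) -> float:
    """Per-token rate after stacking discounts multiplicatively (an assumption)."""
    return list_rate * (1 - cache_discount) * (1 - batch_discount)


def cost_with_retries(base_cost: float, error_rate: float) -> float:
    """Expected cost if every failed call is billed and retried until success."""
    return base_cost / (1 - error_rate)


# Illustrative workload: 2,000-token answers, 0.5 s TTFT, 75 tokens/sec.
per_request = wall_clock_seconds(0.5, 2000, 75.0)
ttft_share = 0.5 / per_request
print(f"one request: {per_request:.1f}s, TTFT is {ttft_share:.1%} of that")

# 100,000 such requests inside an 8-hour batch window.
workers = required_concurrency(100_000, per_request, 8 * 3600)
print(f"concurrent requests needed: {workers}")

# A hypothetical $1.00/M list rate with 90% caching and 50% batch discounts stacked.
print(f"effective rate: ${effective_rate(1.00, 0.90, 0.50):.3f}/M")

# The $43.50 output bill from the 50M-token example, with a 2% error rate.
print(f"cost with retries: ${cost_with_retries(43.50, 0.02):.2f}")
```

The point of the exercise is the shape of the numbers rather than the exact values: TTFT is a small slice of a long answer, halving tokens/sec roughly doubles the concurrency you need, and retry overhead scales with both the error rate and the size of the bill.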

Takeaways

  • Cheapest per token in this cycle: DeepSeek's V4-Pro permanent cut (May 25, 2026) set a new low at roughly $0.435 in / $0.87 out per million tokens — verify current rates before you budget.
  • Throughput is the hidden cost. For high-volume generation, tokens/sec drives wall-clock time and concurrency as much as the per-token rate does.
  • The Flash tier is the battleground. Gemini 3.5 Flash (May 19, 2026) is the latest entrant optimized for speed; treat its early throughput numbers as provisional.
  • Measure, don't assume. Provider and independent throughput figures already diverge by more than 1.5x for the newest models — track real numbers for the models you actually run at modelstats.ai.

See the live data

Every model tested every 10 minutes — compare them on the live dashboard.
