The need for (token) speed

February 28th, 2026

If you’ve used a few different AI models, you might have noticed something. One responds instantly, and the text flows steadily. Another makes you wait, then blurts everything out at once.

The two numbers that capture this are latency and throughput. Latency, sometimes called Time to First Token (TTFT), is how long you wait before any output appears. Throughput, measured in tokens per second (TPS), is how fast the text flows once generation begins. On OpenRouter, both are listed for every model and provider, which makes it a useful place to compare. As a rough baseline, GPT 5.2 Instant, the model behind ChatGPT’s free tier, currently runs at around 3 seconds of latency and 36 TPS. Low latency and high throughput are the dream, but achieving both at scale is hard.
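To make the two metrics concrete, here’s a minimal sketch of how you’d measure them. The `fake_stream` generator and its timings are invented stand-ins for a real streaming API; with an actual provider you’d time the arriving chunks the same way:

```python
import time

def fake_stream(n_tokens=50, ttft=0.5, per_token=0.02):
    """Hypothetical token stream: stands in for a real streaming API."""
    time.sleep(ttft)           # model "thinking" before the first token
    for _ in range(n_tokens):
        time.sleep(per_token)  # steady generation after that
        yield "tok"

def measure(stream):
    """Return (latency_s, throughput_tps) for any token iterator."""
    start = time.perf_counter()
    first = None
    count = 0
    for _ in stream:
        now = time.perf_counter()
        if first is None:
            first = now        # first token arrived: that's the latency
        count += 1
    latency = first - start
    # Throughput is measured over the intervals *between* tokens.
    tps = (count - 1) / (now - first) if count > 1 else 0.0
    return latency, tps

latency, tps = measure(fake_stream())
print(f"TTFT ≈ {latency:.2f}s, throughput ≈ {tps:.0f} TPS")
```

Note that throughput is computed from the gaps between tokens, not from the total wall time, so a long TTFT doesn’t drag the TPS number down.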

A token is roughly a chunk of a word. “Unbelievable” might be three tokens. “The” is one. As a rule of thumb, 100 tokens is about 75 words. So at 50 TPS, a model is writing around 37 words per second, faster than most people read. But even if you can’t keep up in real time, you’ll feel the difference. To show you, I built this small simulation!

[Interactive simulation: click a speed to watch text stream at that rate.]
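The rule-of-thumb conversion above fits in a few lines:

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb: 100 tokens ≈ 75 words

def words_per_second(tps: float) -> float:
    """Convert model throughput (tokens/s) into prose speed (words/s)."""
    return tps * WORDS_PER_TOKEN

print(words_per_second(50))   # → 37.5, faster than most people read
```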

Throughput is becoming a product

Something is fundamentally shifting in how the AI industry thinks about speed. For years, the implicit rule was that intelligence and speed traded off against each other. Smarter models were slow, and faster models were comparatively dumb.

That assumption is starting to crack. Over the last few weeks, both OpenAI and Anthropic seem to have arrived at the same conclusion: speed at the frontier is a market in and of itself, and people may be willing to pay handsomely for it.

OpenAI recently released GPT-5.3-Codex-Spark, a coding model designed to feel instantaneous. It runs on hardware from Cerebras, a chipmaker whose Wafer-Scale Engine 3 is essentially a single piece of silicon the size of a dinner plate, built specifically for low-latency inference. Codex-Spark delivers over 1,000 tokens per second. It’s currently locked behind the $200 USD/month ChatGPT Pro tier, so no, I haven’t tried it yet!

Anthropic, meanwhile, has been even more explicit about treating speed as a luxury good. They recently shipped Fast Mode for Claude Opus 4.6. It’s the exact same model with the same intelligence, just about 2.5x faster in output throughput. The catch? It costs 6x the standard rate. Standard Opus 4.6 is $5/$25 per million tokens (input/output). Fast Mode jumps to $30/$150. If you stack that with the 1M token context window premium, you can end up paying $60/$225 per million tokens, an eye-watering markup.
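To see what that markup means in dollars, here’s a quick back-of-the-envelope using the rates above. The 200k-input / 50k-output session is a made-up example, not a measured workload:

```python
def cost_usd(input_toks, output_toks, in_rate, out_rate):
    """Cost in USD, given per-million-token rates."""
    return (input_toks * in_rate + output_toks * out_rate) / 1_000_000

# Rates from the post, in USD per million tokens.
standard = dict(in_rate=5,  out_rate=25)    # Opus 4.6 standard
fast     = dict(in_rate=30, out_rate=150)   # Fast Mode, 6x the rate

# Hypothetical coding session: 200k tokens in, 50k tokens out.
print(cost_usd(200_000, 50_000, **standard))  # → 2.25
print(cost_usd(200_000, 50_000, **fast))      # → 13.5
```

Same session, same output, roughly eleven dollars extra for the speed.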

Both labs are betting that power users will pay a massive premium for the faster frontier.

And then, slightly further from the mainstream spotlight, a 25 person startup called Taalas unveiled the HC1. This is a custom chip running Llama 3.1 8B at around 15,000 to 17,000 tokens per second. Those zeros are not typos. They achieved this by literally etching the model’s weights directly into the silicon. Instead of shuttling data back and forth from memory, inference becomes a matter of electrical signals flowing through physical, hardwired logic gates. Designing an entirely custom chip just to run a single model is an extreme, expensive trade-off, but it proves what is physically possible when you build for speed from the ground up.

Where speed actually matters

For someone opening a chat window to rewrite their LinkedIn bio, there’s a speed ceiling where extra throughput stops mattering. We read at around 250 words per minute, roughly 5 to 6 tokens per second. At 50 TPS, a model is already outrunning us by 10x. Past a certain point, tokens don’t stream in, they just apparate.

So when OpenAI and Anthropic market to “power users”, they’re really talking to developers building with agentic harnesses.

At 1,000 TPS or higher, the feedback loop between agent and developer inside Codex and Claude Code gets tight enough that the dead time between prompts, when we pick up our phones and doomscroll, vanishes. It’s a step towards collaboration rather than delegation.

Speed also changes what’s practical for chain-of-thought reasoning, the technique where a model works through a problem in a long internal scratchpad before returning an answer. Right now, a 10,000 word internal monologue takes minutes. At 15,000 TPS, it becomes unnoticeable.
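The arithmetic, reusing the 100-tokens-per-75-words rule of thumb from earlier:

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb: 100 tokens ≈ 75 words

def generation_seconds(words: int, tps: float) -> float:
    """Time to generate a `words`-long passage at a given throughput."""
    return (words / WORDS_PER_TOKEN) / tps

monologue = 10_000  # words of hidden chain-of-thought
print(f"{generation_seconds(monologue, 36):.0f} s at 36 TPS")         # ~370 s, over 6 minutes
print(f"{generation_seconds(monologue, 15_000):.2f} s at 15,000 TPS")  # under a second
```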

The area I find most interesting, and have been working on recently, is generated interfaces. Most UI design is done upfront. Someone decides what the dashboard looks like, what data it shows, how it’s laid out. But for messy, dynamic data, the kind that changes often and doesn’t always fit neatly into a pre-built template, there’s a compelling case for a model generating a custom interface on the fly, shaped around the specific dataset itself. The capability exists today. What’s missing is the speed to make it feel native.

For now, all of this comes at a price. Speed at the frontier is being sold at a premium, and the markups are steep. But the underlying direction is clear. As models continue to improve and inference gets cheaper, the question may change from whether models are intelligent enough to whether they’re fast enough to get out of the way.

Thank you!