When you're dictating a design doc at 11pm, no word limit matters more than speed

You're designing a payment settlement system. Three weeks of architecture meetings, ticket discussions, code review comments. Now it's 11pm and the logic needs to live somewhere other than your head before you lose it. You open Notion, hit record on your voice tool, and start dictating. The spec flows. Idempotent retries, reconciliation edge cases, what happens if a webhook retries after the payment already cleared. You're talking faster than you'd ever type. The thinking is clear.

Then, at the third major section, you hit the word limit.

Cloud dictation tools cap the free tier

The pricing tells you something nobody wants to say out loud. Wispr charges $14 a month; its free plan caps you at 600 words a month. Willow costs $12 a month, with a free limit of 500 words. Superwhisper is $8.49 a month. They're not expensive, but they're all structured the same way: they meter transcription like it's cloud infrastructure, because transcription happens in their cloud. Every word costs them, or they're worried it might. So they cap you.
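To put those caps in perspective, here is a rough back-of-the-envelope sketch. The 150 words-per-minute speaking pace is an assumption (a typical conversational rate, not a figure from any of these vendors), and the function name is made up for illustration.

```python
def minutes_until_cap(word_cap: int, words_per_minute: float = 150.0) -> float:
    """Minutes of continuous dictation before a word cap is used up."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return word_cap / words_per_minute


# Free-tier caps as quoted above; 150 wpm is an assumed speaking pace.
for tool, cap in {"Wispr": 600, "Willow": 500}.items():
    print(f"{tool}: {minutes_until_cap(cap):.1f} minutes of dictation per month")
```

At that assumed pace, a month's free allowance lasts about four minutes of talking, which is well short of a three-section design doc.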

The cap arrives exactly when you're in flow. Mid-thought, mid-design-doc, mid-explanation of why the retry logic has to be idempotent when the webhook fires twice. You stop. You have to wrap up early or pay.

The psychology of the cap

The word limit isn't arbitrary. It's a business constraint disguised as a feature. The tools are building a model where speech recognition is a metered resource, like API calls or storage. You get a free allocation, and if you want more, you pay. It makes sense as a business model if you're running everything in the cloud.

But it misses something about how voice works in a developer's workflow. When you're dictating a design doc, you're in flow. The thinking is uninterrupted. A word cap at 600 or 500 words is like an IDE with an arbitrary line limit. You don't stop to think about the limit, but the limit stops you.

The structural problem runs deeper

They built the business model backward. They made transcription a cloud service, then had to figure out how to price it. But transcription doesn't have to be a cloud service. Whisper-large-v3, the speech-to-text model that actually works, runs on your device. No cloud call. No word meter. No variable cost per transcription. The model lives on your GPU or CPU. The math happens on hardware you own.
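For a sense of what "runs on your device" can look like in practice, here is a minimal sketch using the open-source openai-whisper Python package. It is not any vendor's implementation. It assumes the package and ffmpeg are installed, and the helper names (load_local_whisper, dictate) and the file name are hypothetical.

```python
from typing import Any, Protocol


class SpeechModel(Protocol):
    """Anything with a Whisper-style transcribe() method."""

    def transcribe(self, audio: str, **kwargs: Any) -> dict: ...


def load_local_whisper(name: str = "large-v3") -> SpeechModel:
    """Load Whisper weights onto this machine.

    Assumes `pip install openai-whisper` and ffmpeg on PATH. The weights are
    downloaded once and cached; inference then runs on local hardware.
    """
    import whisper  # lazy import: only needed when a real model is loaded

    return whisper.load_model(name)


def dictate(audio_path: str, model: SpeechModel) -> str:
    """Transcribe one recording and return the plain text."""
    result = model.transcribe(audio_path)
    return result["text"].strip()


# Usage (hypothetical file name; needs the real model, so left commented out):
# model = load_local_whisper()
# print(dictate("design-doc-take-1.wav", model))
```

Passing the model in, rather than loading it inside dictate, means the weights are loaded once per session. Nothing in the loop counts words, because there is no per-word cost to count.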

If transcription is local, you don't have a reason to cap the free tier. The marginal cost to the company of each extra word is effectively zero. The cost to you is zero. Unlimited transcription is the natural outcome.

This is why the industry split. Some tools (Wispr, Willow) chose cloud transcription and metering. Others chose local. Recitey chose local. That choice cascades into everything: the pricing model, the feature set, the privacy story, the performance characteristics.

Recitey runs Whisper locally with no limit

Recitey runs Whisper-large-v3 locally on Windows. No word limit on the free tier. No meter. You dictate as long as you need.

The trade is real: local transcription is slower than cloud. On a modern GPU, Whisper-large-v3 adds a 2 to 3 second delay before the text appears on screen: noticeable, but short enough that it doesn't interrupt the flow. On a CPU (any Windows machine without a dedicated GPU) it's slower, maybe 5 to 8 seconds per phrase, but still usable if you're not dictating rapid-fire.
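These delays depend heavily on hardware, so it is worth measuring on your own machine. A small timing wrapper is enough. In this sketch, fake_transcribe is a stand-in for a real local transcription call, and its sleep is simulated work, not a Whisper measurement.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed(fn: Callable[..., T], *args, **kwargs) -> Tuple[T, float]:
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_transcribe(path: str) -> str:
    """Placeholder for a real local transcription call."""
    time.sleep(0.05)  # simulated work, not a real Whisper latency
    return f"transcript of {path}"


text, seconds = timed(fake_transcribe, "phrase-01.wav")
print(f"{seconds:.2f}s -> {text}")
```

Swapping fake_transcribe for a real transcription function gives a per-phrase number to compare against the 2 to 3 second and 5 to 8 second ranges above.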

Cloud services like Google Cloud Speech-to-Text or Azure Speech will finish faster, under a second, but they cost money and your voice leaves your device. If you're a backend engineer at a fintech company, you probably care that payment settlement logic stays on your machine. Not because you're paranoid, but because sending recordings of yourself explaining transaction reconciliation to a third party is a risk that doesn't need to exist.

The math is simple. Local transcription costs you latency. Cloud transcription costs you privacy and money. Pick your poison.

The model is open source and accurate

The model matters. Whisper is open (you can inspect it, run it yourself, fine-tune it), was originally trained on 680,000 hours of multilingual audio, and hits 96.3% word accuracy on LibriSpeech. It's not as fast as a proprietary cloud model, but it's accurate. Accurate enough that the 2 to 3 second delay is a better trade than the word cap.

Proprietary models are faster and sometimes more accurate, but you don't know what they do with your data, and you can't run them yourself. Whisper is open source. You can read the code. You can run it on your machine. You know what's happening.

The workflow change is deliberate, not invisible

You end up dictating longer because there's no limit. You still need to be deliberate about what you say rather than ramble. Design docs written by voice sound different from design docs written by typing. More exploratory, less polished. More complete, because you don't have to rush. You accept that the first draft is rough and you'll refine it, same as you would if you'd typed it.

The rewrite part, the cleaning-up part, is optional. Recitey's pro tier uses a cloud service to polish the rough transcription into structured prose. But you don't have to use it. You can use the raw Whisper output, edit it yourself, or pass it to Claude and ask for a rewrite. The point is: local transcription is free, unlimited, and stays on your device. Everything else is optional.

Marcus, a backend engineer at a Series B fintech in Stockholm, uses Cursor because its autocomplete means he rarely has to dictate the same phrase twice. He refuses to use cloud transcription for payment system design docs. The uncapped free tier means he can dictate a full design doc at 11pm without losing the thinking. The thinking doesn't get fragmented by a word limit. He shares the Notion doc and goes to bed. The design is complete, even if it needs editing in the morning.

That's the moment where local transcription matters. Not speed. Completion.