Design docs in voice, interrupted by word limits

Marcus, a backend engineer at a fintech in Stockholm, designs payment settlement systems at 11pm when the office is quiet. He's done explaining it on calls and in tickets. Now he's explaining it to Claude, narrating the flow, thinking out loud in his own voice. Twelve minutes in, mid-sentence, the word counter hits zero. His transcription tool caps at 5,000 words on the free plan. He restarts, rewrites, loses the thought. Next morning, the prose is choppy, scattered, fragmented.

This is where developers are now. The bottleneck shifted from typing code to typing intent. LLMs need longer prompts, fuller context, clearer specification. Voice is faster for that kind of long-form thinking. But most voice tools weren't built for this. They were built for quick voice notes, sales call recordings, meeting recaps. They weren't built for the developer who wants to narrate a complex system to an AI at 11pm without hitting a billing wall mid-thought.

Why developers are dictating design specifications

Over the past two years, the shape of development work changed fundamentally. You don't type code the same way anymore. You type intent. You write a longer prompt, explain the edge cases, describe the trade-offs, then let the model generate or refactor. Some developers now spend more time in prose than in actual code. Specifications. Prompts. Design documents. Slack explanations of why the system broke. Responses to code reviews that require context, not just corrections.

For all of that, voice is faster than typing. Faster ideation. Faster drafting. Faster thinking out loud. You speak faster than you type. You think more clearly when you're speaking instead of composing. You catch edge cases in your own words that you'd miss if you were typing. The trade-off used to be transcription accuracy, and modern models have closed most of that gap for draft-quality prose.

But only if the tool gets out of the way. Only if you can speak for thirty minutes without hitting a word limit. Only if you're not interrupted by metering.

The pricing model hiding in the free tier

Wispr Flow, the most popular indie dictation tool for developers, caps the free tier at 5,000 words. That's one substantial design document. One long conversation with Claude. Hit the cap, you pay $14/month for unlimited. Willow caps the free tier at 3,000 words, then charges $12/month. Superwhisper is $8.49/month. The pattern is consistent: metering as leverage.
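To put the caps quoted above in terms of speaking time, here is a rough back-of-the-envelope sketch. The 150 words-per-minute speaking rate is an assumption for illustration, not a figure from any of these tools, and the function names are hypothetical.

```python
# Rough arithmetic: how much continuous narration fits under a word cap.
# ASSUMPTION: 150 words per minute is an illustrative speaking rate only.
ASSUMED_WORDS_PER_MINUTE = 150.0


def minutes_until_cap(cap_words: int, words_per_minute: float = ASSUMED_WORDS_PER_MINUTE) -> float:
    """Minutes of continuous speech before a word cap is exhausted."""
    if words_per_minute <= 0:
        raise ValueError("words_per_minute must be positive")
    return cap_words / words_per_minute


def words_in_session(minutes: float, words_per_minute: float = ASSUMED_WORDS_PER_MINUTE) -> float:
    """Approximate word count produced by a session of the given length."""
    return minutes * words_per_minute


if __name__ == "__main__":
    # Caps taken from the article: 5,000 and 3,000 words.
    for cap in (5000, 3000):
        print(f"{cap}-word cap: about {minutes_until_cap(cap):.1f} minutes of speech")
    print(f"30-minute session: about {words_in_session(30):.0f} words")
```

Under that assumed rate, a single thirty-minute design narration comes to roughly 4,500 words, which nearly exhausts a 5,000-word cap and overruns a 3,000-word one.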

Here's the thing that should bother you: Whisper, the open source speech-to-text model that powers all of these tools, doesn't care how many words you transcribe. Whisper is free. It runs locally. Run it on the user's own machine and the vendor's cost of transcribing 5,000 words is identical to the cost of transcribing 50,000: the compute is the user's, not the vendor's. The cost is zero per word. The cost is zero per use.

The word cap isn't a technical constraint. It's not a limitation of the model. It's not a limitation of the infrastructure. It's a limitation of the business model. The cap serves one purpose: funnel free users to paid. It's pricing architecture dressed up as a feature limitation.

Once you see that, it's hard to unsee.

What actually changes when you remove the cap

Marcus started asking that question six months ago. He builds in Cursor. He was concerned about code IP. Every cloud dictation tool uploads the transcript somewhere. That somewhere isn't your device. He stopped using them entirely until he found something different. Something local. Something that didn't meter the thing that costs nothing to provide.

When the word limit goes away, the workflow opens up. No hesitation before pressing record. No mental math about how many words you've got left. No mid-thought restart. No rewrite the next morning. Just speak, transcribe, edit, move on.

Marcus spends his Sunday nights designing the next sprint's payment logic in voice. Thirty minutes of narration. Every edge case. Every trade-off. Every assumption spelled out. The transcript comes back clean, uninterrupted, ready to paste into a design document or a Claude conversation. On Monday morning, the team has full context before the standup. No fragmentation. No "wait, what was the reasoning behind that decision?" No reconstruction from memory. The thought stays intact because it was never interrupted.

The technical reality of local transcription

Recitey, the tool Marcus landed on, runs Whisper-large-v3 locally. Whisper-large-v3 hits 96.3% word accuracy on LibriSpeech, which is accurate enough for specification prose and design documentation. The accuracy is good enough because you're going to edit it anyway. This isn't live transcription into a meeting. This is spoken first draft into a document.
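For readers who want to check an accuracy figure like this against their own dictation, here is a minimal sketch. It assumes "word accuracy" means one minus the word error rate (word-level edit distance divided by reference length); the article does not say how its figure was computed, and the sample sentences are invented.

```python
def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level Levenshtein distance: substitutions + insertions + deletions."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # ref word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def word_accuracy(reference: str, hypothesis: str) -> float:
    """ASSUMPTION: accuracy is defined here as 1 - WER."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must contain at least one word")
    return 1.0 - word_errors(reference, hypothesis) / ref_len


def expected_errors(words: int, accuracy: float) -> int:
    """Approximate number of words to fix in a transcript of the given length."""
    return round(words * (1.0 - accuracy))


if __name__ == "__main__":
    spoken = "settle the payment before the ledger closes"
    transcribed = "settle the payment before the letter closes"
    print(f"accuracy: {word_accuracy(spoken, transcribed):.3f}")
    print(f"fixes per 1,000 words at 96.3%: {expected_errors(1000, 0.963)}")
```

At the quoted 96.3%, that works out to a few dozen corrections per thousand words, which is an editing pass rather than a rewrite.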

No cloud transcription. No word counting. No cap. No variable cost. It costs $0 on the free tier because the variable cost is literally zero. Pro tier, if you buy it, is for the optional cloud rewrite that polishes the transcript into publication-ready prose. The dictation itself is free forever.

The constraint that does exist is latency. Local speech-to-text is slower than cloud. The entire transcription run happens on your device. Depending on your hardware, that's five to fifteen seconds. An older machine might take longer. A newer machine might be faster. If you need transcription to happen in real time while you're speaking to a room, that's a problem. If you're narrating into a tool and you're willing to wait a few seconds, it's not.
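To measure that wait on your own hardware, here is a hedged sketch. It uses the open source openai-whisper Python package as a stand-in; this is not Recitey's implementation, the helper names are hypothetical, and `design-notes.wav` is a placeholder file name. The timing helper takes any callable, so it can be tried without downloading a model.

```python
import time
from typing import Callable, Tuple


def timed_transcribe(transcribe: Callable[[str], str], audio_path: str) -> Tuple[str, float]:
    """Run a transcription callable and return (text, seconds elapsed)."""
    start = time.perf_counter()
    text = transcribe(audio_path)
    elapsed = time.perf_counter() - start
    return text, elapsed


def whisper_transcriber(model_name: str = "large-v3") -> Callable[[str], str]:
    """Build a local transcriber backed by the openai-whisper package.

    ASSUMPTION: `pip install openai-whisper` and ffmpeg are available.
    The model weights are downloaded on first use.
    """
    import whisper  # imported lazily so the timing helper works without it

    model = whisper.load_model(model_name)

    def run(audio_path: str) -> str:
        return model.transcribe(audio_path)["text"]

    return run


if __name__ == "__main__":
    # Placeholder path: point this at a recording of your own.
    transcriber = whisper_transcriber()
    text, seconds = timed_transcribe(transcriber, "design-notes.wav")
    print(f"{len(text.split())} words in {seconds:.1f}s")
```

Loading the model is kept outside the timed call on purpose: it is a one-time cost per session, while the number that matters for the workflow is the wait after each recording.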

Who actually needs this

This is for you if you're a developer who specs in voice. If you're documenting code late at night when the office is quiet. If your company has IP restrictions on cloud transcription. If you're tired of pricing that treats a zero-cost feature like it's expensive. If you already spend more time writing prompts than writing code, and voice would speed that up.

Wispr, Willow, Superwhisper, all of them work fine for their intended audiences. Quick meeting recaps. Short voice memos. Voice notes for someday-I'll-write-that-up-properly. They're priced for that audience. But the developer who's dictating system design in voice is different. The workflow is different. The need is different.

When you remove the pricing-based interruption, when the cap goes away, the thing that changes isn't the transcription technology. It's continuity. The thought stays intact. The prose stays coherent. The design stays unbroken.