When voice dictation breaks your thinking

The new bottleneck in developer workflows isn't typing speed. It's writing clear intent for AI models. And that intent is long.

Marcus is a backend engineer at a Series B fintech in Stockholm. For the past six months, he's been writing more English than code: design docs explaining payment-settlement logic for Cursor, PR descriptions for peer review, Slack threads breaking down why a bug happened at 3am. He switched to Cursor specifically because its tab-complete reduces how many times he has to rewrite his ideas: the AI understands intent faster than he can type it.

Last week he tried voice dictation for a design doc. He was 800 words in when his tool's free tier capped out. Mid-sentence. Mid-thought. He switched to typing, and the prose after that point felt fragmented. The thinking rhythm, once broken, doesn't come back the same.

This is the edge case that generic productivity tools miss.

The new shape of developer work

Code-first workflows meant fast typing. Seventeen keystrokes, you've written a function. The bottleneck was keeping up with your hands.

LLM-first workflows are different. Cursor and Claude Code don't need tight code; they need clear intent. A function specification is now 400 words of "here's what I want to build and why." A PR description is now a narrative, not a code diff summary. A design doc explaining payment settlement is 2000 words of context, decisions, and trade-offs.

This work is speaking work. You're thinking out loud. You need to hold a thought across multiple sentences and follow the implications to their end. Typing interrupts that. Voice lets you keep the rhythm.

But only if the tool doesn't interrupt first.

Why word caps kill momentum

Wispr costs $14/month and caps its free tier at 1500 words. Superwhisper is $8.49/month and Willow is $12/month, and both cap their free tiers too. Each tool draws a line somewhere: you hit it, and you either pay or you stop.

The disruption happens mid-document. You're explaining the architecture, the thought is flowing, and then you're gone. Not because you're tired, but because the meter hit zero and the app stopped listening.
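As a rough illustration of that cutoff, here is a toy model of a hard word cap in Python. The class name, the 8-word cap, and the sample sentence are all invented for the example; it does not reproduce any vendor's actual metering logic.

```python
from dataclasses import dataclass


@dataclass
class CappedDictation:
    """Toy model of a dictation session with a hard word cap (hypothetical)."""
    cap: int       # words allowed before the tool stops listening
    used: int = 0  # words already transcribed in this session

    def dictate(self, spoken: str) -> str:
        """Return only the part of `spoken` that still fits under the cap."""
        words = spoken.split()
        remaining = max(self.cap - self.used, 0)
        kept = words[:remaining]
        self.used += len(kept)
        return " ".join(kept)


session = CappedDictation(cap=8)
print(session.dictate(
    "The queue needs a dead letter topic because retries can reorder settlements"
))
# Only the first 8 words come back; the reason after "because" is lost.
```

The point of the sketch is where the cut lands: the cap knows how many words you have used, not whether you have finished the thought.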

You finish by typing. The prose after the cap is different. It's tighter, more deliberate, less exploratory. Less like thinking out loud, more like writing for an audience. The informal tone breaks. The natural discoveries you'd make in the second half of your thought, the implications you didn't see until you were already speaking, never happen.

For Marcus, this mattered on a specific design doc. He was explaining why the payment settlement system needed a particular queue topology. He had 400 words of reasoning written, voice-dictated. Then the cap. He finished the last 600 words typing. The typed section lost the conversational quality. The logic was there, but it didn't feel like Marcus thinking; it felt like Marcus writing. His team noticed. They asked clarifying questions about the typed section that they didn't ask about the dictated section.

The infrastructure difference nobody talks about

Cloud-based voice tools (Wispr, Superwhisper, Willow) run inference on their servers. Your audio travels to their cloud, the model transcribes it, and the text comes back. Every word they transcribe costs them GPU time. So they meter it. Cap the free tier, monetize the pro tier. It's the business model.

Recitey runs Whisper-large-v3 locally on your device. No audio goes to a cloud server. No per-word cost to them. So there's no cost model that requires a cap.

This isn't generosity. It's infrastructure. One business model forces metering. The other doesn't.
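The arithmetic behind that difference can be sketched in a few lines. Everything below is a placeholder: the function name, the user count, the words per user, and the per-word cost are invented to show the shape of the two cost models, not to describe any vendor's real numbers.

```python
def monthly_free_tier_cost(users: int, words_per_user: int, cost_per_word: float) -> float:
    """Vendor's monthly inference bill for free users (all inputs are placeholders)."""
    return users * words_per_user * cost_per_word


# Placeholder values, chosen only for illustration.
cloud = monthly_free_tier_cost(users=1_000, words_per_user=20_000, cost_per_word=0.0001)
local = monthly_free_tier_cost(users=1_000, words_per_user=20_000, cost_per_word=0.0)

print(f"cloud vendor pays: {cloud:.2f}")
print(f"local vendor pays: {local:.2f}")
```

With server-side inference the bill grows with every word a free user speaks, so a cap is the obvious control. When inference runs on the user's device, the vendor's per-word term is zero and there is nothing to cap.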

The trade-off is transparent: if you want cloud-based rewrite, the feature that takes your rough dictation and polishes it into a clean sentence in under 2 seconds, that's Pro tier. But the basic dictation, the thing that matters for long-form intent writing, runs locally and costs nothing per word.

For Marcus, there's another advantage he cares about: no code IP leaves his device. At a fintech company, that's non-negotiable. The payment-settlement logic he's dictating stays on his laptop, and Recitey's local inference keeps that constraint intact. Cloud dictation is off the table for him because the code patterns, the business logic, and the architectural decisions all leave his device and land on someone else's server, even if the audio is encrypted in transit.

Local Whisper solves that without requiring him to sacrifice clarity or momentum.
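For readers who want to see what local transcription looks like in practice, here is a minimal sketch using the open-source openai-whisper Python package. This is an assumption for illustration, not a description of how Recitey is built. The helper names and the file name in the usage note are hypothetical, and the package also needs ffmpeg installed to decode audio.

```python
from pathlib import Path


def transcribe_locally(audio_path: str, model_name: str = "large-v3") -> str:
    """Transcribe an audio file on this machine (hypothetical helper)."""
    if not Path(audio_path).is_file():
        raise FileNotFoundError(audio_path)

    # Lazy import so the module loads even where the package is missing.
    # Install with: pip install openai-whisper
    import whisper

    # Weights are downloaded once and cached; inference then runs on-device.
    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return result["text"].strip()


def word_count(text: str) -> int:
    """Count whitespace-separated words; nothing here enforces a limit."""
    return len(text.split())
```

Calling transcribe_locally("design-doc.wav") would return the transcript as a string. After the one-time weight download, the audio file is read from disk and processed on the same machine, which is the property that matters for the IP constraint described above.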

How Marcus actually uses it

Marcus writes in Cursor because Cursor's tab-complete understands what he's building faster than generic IDE autocomplete. The same principle applies to voice. He needs a tool that knows he's in the middle of a design doc (long-form, exploratory, flowing), not a tool that's metering his every sentence.

His workflow now: open a new Notion doc or a Linear design-doc ticket. Talk through the architecture. Let the tool listen without a word counter in the back of his head. Whisper-large-v3 hits 96.3% accuracy on out-of-domain audio, so rough speech (thinking out loud, technical jargon, occasional Swedish words when he's tired) transcribes cleanly. He can afford to be sloppy in voice because the model is good enough.

When he's done, he reads it back. Cloud rewrite (Pro tier) cleans up the roughest bits if he wants it. But the thinking is already captured. The momentum is there.

Compare that to his previous tool, which forced him to count words as he spoke, to rush before the cap, to make every sentence concise in case it was near the limit. That's not thinking. That's self-editing in real time.

The trade-off that's actually honest

Here's what you don't get with Recitey: you don't get real-time cloud transcription with the speed and model flexibility of a cloud vendor. Local models are fast (a design doc transcribes in the background), but if you need specialized models or live transcription across 10 languages, cloud vendors have invested more.

You don't get the secondary features that cloud vendors bundle: auto-summaries, speaker identification if you're in a meeting, accent-specific training. Those features live in the cloud.

But you get something more important for intent writing: silence. No word counter. No cap. No "please upgrade" message breaking your thinking flow. Just your device, your model, your thoughts.

Why this matters now

The moment you shifted from typing code to typing intent for AI models, voice became a competitive advantage. But only if the tool gets out of your way.

Tools that meter you mid-thought, that interrupt your momentum with caps and upgrade prompts, aren't helping you think. They're helping themselves monetize. There's a difference.

Local-first voice doesn't solve the hard problem (making sure what you dictate is actually what you want to say). But it does remove one barrier: the tool's business model breaking your thinking flow in the middle of a design doc at 11pm.

For Marcus, that's the difference between using voice dictation for serious work and abandoning it after the cap hits twice. It's the difference between thinking out loud and writing for an audience.

The infrastructure choice makes all the difference.