Using Voice Input
Activating Voice
Hold the microphone button in the input composer to talk. Audio is captured for the duration of the hold. Release the button to finalize the transcription. The transcription appears inline in the input composer, where it can be reviewed and edited before sending.
Editing Before Sending
After releasing the microphone button, the transcribed text appears in the standard input composer as editable text. You can tap into the transcription to correct any word, add punctuation, or rewrite portions. The text behaves identically to manually typed input — select, delete, and retype as needed. When satisfied, send the message normally.
Switching Engines
Navigate to Settings → Voice Input → Engine to choose between "Evern Voice" (Moonshine, the default) and "Platform Native" (Android SpeechRecognizer or iOS SFSpeechRecognizer). The preprocessing pipeline and personal dictionary apply regardless of which engine is selected.
Personal Corrections
Personal corrections are triggered automatically. When you edit a transcription before sending, the system computes a word-level diff between the original transcription and your edited version. Each changed word pair is recorded as a correction candidate. No explicit action is required beyond editing and sending — the diff is recorded in the background. See the Personal Dictionary section for details on how corrections are promoted.
Engine Architecture
Evern defaults to Moonshine v2 Tiny (~30MB), a fully on-device speech recognition model bundled with the app via sherpa-onnx. No audio or transcription data leaves the device. The model runs inference locally on the device CPU; no network connection is required for voice input.
Platform-native STT engines serve as fallback: Android SpeechRecognizer and iOS SFSpeechRecognizer. Users can switch between "Evern Voice" and "Platform Native" in Settings → Voice Input → Engine. Both engines feed into the same preprocessing pipeline, so filler removal, syntax expansion, dictionary normalization, and personal corrections apply regardless of the selected engine.
Streaming Integration
Audio streams through sherpa-onnx's sliding-window pipeline. Partial results stabilize with ~200ms debounce before being displayed. The user sees real-time transcription updating in the input composer as they speak. When the microphone button is released, a final recognition pass runs, and the complete transcription is placed in the composer for inline editing before send.
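The ~200ms stabilization step can be sketched as a simple debouncer: a partial result is only surfaced to the UI once no newer partial has arrived within the window. This is a minimal illustration, not Evern's actual implementation; all names here are hypothetical.

```rust
use std::time::{Duration, Instant};

/// Debounces streaming partial transcriptions: a partial is only
/// surfaced once no newer partial has arrived for `window`.
struct PartialDebouncer {
    window: Duration,
    pending: Option<String>,
    last_update: Instant,
}

impl PartialDebouncer {
    fn new(window: Duration) -> Self {
        Self { window, pending: None, last_update: Instant::now() }
    }

    /// Called whenever the recognizer emits a new partial result.
    fn on_partial(&mut self, text: &str) {
        self.pending = Some(text.to_string());
        self.last_update = Instant::now();
    }

    /// Polled by the UI loop: returns the stabilized partial, if any.
    fn poll(&mut self, now: Instant) -> Option<String> {
        if now.duration_since(self.last_update) >= self.window {
            self.pending.take()
        } else {
            None
        }
    }
}

fn main() {
    let mut d = PartialDebouncer::new(Duration::from_millis(200));
    d.on_partial("hel");
    d.on_partial("hello wor");
    // Not yet stable: only 50ms have elapsed since the last update.
    assert_eq!(d.poll(d.last_update + Duration::from_millis(50)), None);
    // Stable after 200ms without a newer partial.
    assert_eq!(
        d.poll(d.last_update + Duration::from_millis(250)),
        Some("hello wor".to_string())
    );
    println!("debounce ok");
}
```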
Preprocessing Pipeline
Raw transcription passes through three stages before the user sees it: filler word removal, spoken syntax expansion, and developer dictionary normalization. These stages run sequentially in the shared Rust core, which is compiled for both Android and iOS.
Filler Word Removal
Filler words are removed at utterance boundaries and between pauses. The following fillers are recognized: um, uh, like, actually, basically, right, well, hmm, ah, er, okay so, let me think. Removal is position-aware — "like" is only removed when it appears as a discourse filler (e.g., sentence-initial or between pauses), not when it functions as a verb or preposition.
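A minimal sketch of position-aware removal follows. The real pipeline uses pause timing from the recognizer; here "position" is approximated by token index, and only single-token fillers are handled (multi-word fillers like "okay so" are omitted for brevity). Function and list names are illustrative, not Evern's actual API.

```rust
/// Fillers that are unambiguous and removable anywhere.
const ALWAYS_FILLERS: &[&str] = &["um", "uh", "hmm", "ah", "er"];
/// Fillers that are only removable at an utterance boundary,
/// where they cannot be a verb or preposition.
const BOUNDARY_FILLERS: &[&str] = &["like", "actually", "basically", "well", "right"];

fn remove_fillers(utterance: &str) -> String {
    let tokens: Vec<&str> = utterance.split_whitespace().collect();
    let mut kept = Vec::new();
    for (i, tok) in tokens.iter().enumerate() {
        let lower = tok.to_lowercase();
        if ALWAYS_FILLERS.contains(&lower.as_str()) {
            continue; // unambiguous fillers are dropped anywhere
        }
        // Position-aware rule: drop boundary fillers only at the start.
        if i == 0 && BOUNDARY_FILLERS.contains(&lower.as_str()) {
            continue;
        }
        kept.push(*tok);
    }
    kept.join(" ")
}

fn main() {
    // Sentence-initial "like" is a discourse filler: removed.
    assert_eq!(remove_fillers("like run the tests"), "run the tests");
    // Mid-sentence "like" may be a preposition: preserved.
    assert_eq!(remove_fillers("files like main.rs"), "files like main.rs");
    // Unambiguous fillers are removed anywhere.
    assert_eq!(remove_fillers("uh git um status"), "git status");
    println!("filler removal ok");
}
```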
Spoken Syntax Expansion
Twenty-seven spoken-to-syntax mappings transform natural speech into terminal-ready input. When a spoken phrase matches a mapping, it is replaced with the corresponding symbol. Mappings are applied as whole-word matches to avoid false replacements within longer words. See the complete mapping table below.
Spoken Syntax Mappings
The following table lists all 27 spoken-to-syntax mappings supported by the preprocessing pipeline. When any of these phrases appear as whole words in the transcription, they are replaced with the corresponding symbol. Mappings are case-insensitive.
| Spoken Phrase | Symbol | Category |
|---|---|---|
| open paren | ( | Grouping |
| close paren | ) | Grouping |
| open bracket | [ | Grouping |
| close bracket | ] | Grouping |
| open brace | { | Grouping |
| close brace | } | Grouping |
| pipe | \| | Operators |
| tilde | ~ | Operators |
| backtick | ` | Operators |
| at sign | @ | Operators |
| hash | # | Operators |
| dollar sign | $ | Operators |
| ampersand | & | Operators |
| asterisk | * | Operators |
| caret | ^ | Operators |
| percent | % | Operators |
| exclamation | ! | Operators |
| semicolon | ; | Punctuation |
| colon | : | Punctuation |
| single quote | ' | Punctuation |
| double quote | " | Punctuation |
| forward slash | / | Punctuation |
| backslash | \ | Punctuation |
| equals | = | Comparison |
| dash / hyphen | - | Comparison |
| greater than | > | Comparison |
| less than | < | Comparison |
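Under the whole-word rule, expansion can be sketched as a token-level scan that tries multi-word phrases before single words. Only a few of the 27 mappings are shown, the output joins tokens with spaces (real symbol attachment may differ), and all names are illustrative, not Evern's actual implementation.

```rust
/// Whole-word spoken-to-syntax expansion over a token stream.
/// Tokenizing on whitespace guarantees whole-word matching, so
/// "pipeline" is never corrupted by the "pipe" mapping.
fn expand_spoken_syntax(text: &str) -> String {
    // A few of the 27 mappings; phrases compared case-insensitively.
    let mappings: &[(&[&str], &str)] = &[
        (&["open", "paren"][..], "("),
        (&["close", "paren"][..], ")"),
        (&["greater", "than"][..], ">"),
        (&["pipe"][..], "|"),
    ];
    let words: Vec<String> = text.split_whitespace().map(|w| w.to_string()).collect();
    let mut out: Vec<String> = Vec::new();
    let mut i = 0;
    while i < words.len() {
        let mut matched = false;
        for (phrase, symbol) in mappings {
            if i + phrase.len() <= words.len()
                && phrase.iter().zip(&words[i..]).all(|(p, w)| w.eq_ignore_ascii_case(*p))
            {
                out.push(symbol.to_string());
                i += phrase.len();
                matched = true;
                break;
            }
        }
        if !matched {
            out.push(words[i].clone());
            i += 1;
        }
    }
    out.join(" ")
}

fn main() {
    assert_eq!(
        expand_spoken_syntax("ls pipe grep open paren x close paren"),
        "ls | grep ( x )"
    );
    // Whole-word matching: "pipeline" is left intact.
    assert_eq!(expand_spoken_syntax("the pipeline works"), "the pipeline works");
    println!("syntax expansion ok");
}
```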
Developer Dictionary
A ~500-term dictionary normalizes developer terminology that general-purpose STT engines consistently misrecognize. Terms are organized by category and applied as post-processing corrections after the STT engine produces raw text. The dictionary is compiled into the shared Rust core and runs on both platforms identically.
Each entry maps a misrecognized form (or set of forms) to the correct developer term. For example, "pie torch" and "py torch" both map to "PyTorch". Matching is case-insensitive and applies to whole words only to avoid false positives within longer terms.
| Category | Examples | Term Count |
|---|---|---|
| CLI Tools | kubectl, terraform, docker, npm, yarn, pip, cargo, brew | ~60 |
| Git | rebase, cherry-pick, stash, bisect, reflog | ~30 |
| Languages | TypeScript, PostgreSQL, Kotlin, Swift, PyTorch | ~80 |
| Frameworks | React, NextJS, FastAPI, SwiftUI, Django | ~70 |
| Patterns | async/await, useState, println, console.log | ~50 |
| Infrastructure | nginx, Redis, Kafka, Kubernetes | ~50 |
| Corrections | "pie torch" → PyTorch, "post gres" → Postgres, "cube cuttle" → kubectl | ~160 |
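A dictionary lookup pass can be sketched as below: misrecognized forms map to the canonical developer term, with two-word forms tried before single words so "pie torch" resolves before "pie". Only a handful of the ~500 entries are shown, and the function and entries are illustrative, not Evern's actual data.

```rust
use std::collections::HashMap;

/// Case-insensitive, whole-word dictionary normalization.
fn normalize_terms(text: &str) -> String {
    let dict: HashMap<&str, &str> = [
        ("pie torch", "PyTorch"),
        ("py torch", "PyTorch"),
        ("post gres", "Postgres"),
        ("cube cuttle", "kubectl"),
    ]
    .into_iter()
    .collect();

    let words: Vec<&str> = text.split_whitespace().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < words.len() {
        // Try a two-word match first so "pie torch" wins over "pie".
        if i + 1 < words.len() {
            let bigram = format!("{} {}", words[i], words[i + 1]).to_lowercase();
            if let Some(term) = dict.get(bigram.as_str()) {
                out.push(term.to_string());
                i += 2;
                continue;
            }
        }
        let unigram = words[i].to_lowercase();
        match dict.get(unigram.as_str()) {
            Some(term) => out.push(term.to_string()),
            None => out.push(words[i].to_string()),
        }
        i += 1;
    }
    out.join(" ")
}

fn main() {
    assert_eq!(
        normalize_terms("install pie torch and post gres"),
        "install PyTorch and Postgres"
    );
    assert_eq!(normalize_terms("cube cuttle get pods"), "kubectl get pods");
    println!("dictionary ok");
}
```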
Context-Aware Biasing
The preprocessing pipeline receives context from the active session: recent commands, server names, and the detected AI agent. This context biases correction toward terms the user is likely saying.
If the user recently typed kubectl commands, the voice engine is more likely to recognize Kubernetes-related terms. If connected to a server named "staging-api", that name is added to the recognition vocabulary. Context biasing operates as a re-ranking step in the post-processing pipeline — it does not modify the STT engine itself, but increases the priority of contextually relevant dictionary entries when resolving ambiguous transcriptions.
The context window includes the last 20 commands from the active session and all server names from the current connection list. Context is refreshed on each voice activation, so it always reflects the most recent session state.
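Because biasing is a re-ranking step rather than an engine change, it can be sketched as a score adjustment over candidate resolutions. The boost weight and all names here are hypothetical; the real scoring is internal to the pipeline.

```rust
/// Re-ranks candidate resolutions for an ambiguous transcription:
/// terms present in the session context (recent commands, server
/// names) receive a score boost before sorting.
fn rank_candidates<'a>(
    candidates: &[(&'a str, f64)], // (term, base score from the dictionary)
    context_terms: &[&str],        // last 20 commands + server names
) -> Vec<&'a str> {
    let mut scored: Vec<(&str, f64)> = candidates
        .iter()
        .map(|&(term, base)| {
            let boost = if context_terms.iter().any(|c| c.eq_ignore_ascii_case(term)) {
                0.3 // hypothetical bonus for contextually relevant terms
            } else {
                0.0
            };
            (term, base + boost)
        })
        .collect();
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().map(|(t, _)| t).collect()
}

fn main() {
    // An ambiguous utterance could resolve to several terms;
    // recent kubectl usage tips the ranking.
    let candidates = [("cubital", 0.55), ("kubectl", 0.45)];
    let context = ["kubectl", "staging-api"];
    let ranked = rank_candidates(&candidates, &context);
    assert_eq!(ranked[0], "kubectl"); // 0.45 + 0.3 outranks 0.55
    println!("biasing ok");
}
```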
Personal Dictionary
Evern learns from your corrections. When you edit a transcription before sending, the system computes a word-level diff between the raw transcription and your edited version. Each changed word pair is recorded as a correction entry. No explicit action is required — edit the transcription, send the message, and the diff is recorded automatically in the background.
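The correction capture can be sketched as a word-level diff producing (heard, corrected) pairs. For brevity this sketch aligns tokens positionally, which only handles same-length substitutions; a real word diff would also handle insertions and deletions. Names are illustrative.

```rust
/// Diffs the raw transcription against the edited text, yielding
/// (heard, corrected) pairs as correction candidates.
fn correction_candidates(raw: &str, edited: &str) -> Vec<(String, String)> {
    raw.split_whitespace()
        .zip(edited.split_whitespace())
        .filter(|(a, b)| a != b)
        .map(|(a, b)| (a.to_string(), b.to_string()))
        .collect()
}

fn main() {
    // Hypothetical edit: the user fixed one misrecognized word.
    let pairs = correction_candidates(
        "open the reactor component",
        "open the React component",
    );
    assert_eq!(pairs, vec![("reactor".to_string(), "React".to_string())]);
    println!("diff capture ok");
}
```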
Confidence Tiers
Corrections follow a confidence tier system before becoming active:
- Candidate — first correction recorded, stored but not applied
- Learned — second identical correction promotes the entry; auto-applied going forward
- Locked — 5+ identical corrections; high confidence, always applied
On promotion to "learned", a toast notification appears: "Learned: {word}" with an undo action. Tapping undo reverts the entry to "candidate" status. Corrections are stored in SQLite and persist across sessions. The personal dictionary is shared between both engine options (Moonshine and platform-native), since corrections apply in the post-processing pipeline that runs after either engine.
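The tier rules above amount to a small counter-driven state machine: promotion to "learned" at the second identical correction, and to "locked" at the fifth. This is a sketch of that logic with illustrative names, not the actual SQLite-backed implementation.

```rust
/// Confidence tiers for a personal-dictionary correction entry.
#[derive(Debug, PartialEq)]
enum Tier {
    Candidate, // stored but not applied
    Learned,   // auto-applied going forward
    Locked,    // high confidence, always applied
}

struct CorrectionEntry {
    count: u32, // identical corrections recorded so far
}

impl CorrectionEntry {
    fn new() -> Self {
        Self { count: 0 }
    }

    /// Record one identical correction and return the resulting tier.
    fn record(&mut self) -> Tier {
        self.count += 1;
        match self.count {
            0..=1 => Tier::Candidate,
            2..=4 => Tier::Learned, // promotion here triggers the "Learned" toast
            _ => Tier::Locked,      // 5+ identical corrections
        }
    }
}

fn main() {
    let mut entry = CorrectionEntry::new();
    assert_eq!(entry.record(), Tier::Candidate); // 1st: stored, not applied
    assert_eq!(entry.record(), Tier::Learned);   // 2nd: promoted, auto-applied
    entry.record(); // 3rd
    entry.record(); // 4th
    assert_eq!(entry.record(), Tier::Locked);    // 5th: locked
    println!("tiers ok");
}
```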
Platform Integration
Both platforms share the same Rust core for preprocessing, dictionary lookup, and personal corrections. The primary difference is in custom language model support for the platform-native fallback engine.
| Feature | Android | iOS |
|---|---|---|
| Default Engine | Moonshine v2 Tiny (sherpa-onnx) | Moonshine v2 Tiny (sherpa-onnx) |
| Fallback Engine | SpeechRecognizer | SFSpeechRecognizer |
| Audio Capture | Platform-native | Platform-native |
| Custom Language Model | N/A | Programming vocabulary model (planned) |
| Preprocessing | Shared Rust core | Shared Rust core |
| Personal Dictionary | Shared SQLite | Shared SQLite |
| Privacy | Fully on-device, no cloud | Fully on-device, no cloud |
Custom Language Model: iOS vs. Android
The "Programming vocabulary model" entry for iOS refers to Apple's SFCustomLanguageModelData API, which allows apps to supply a custom vocabulary list that biases the platform-native SFSpeechRecognizer toward specific terms. Evern plans to use this API to feed developer terminology into the iOS fallback engine, improving recognition accuracy when the user selects "Platform Native" as their engine. This feature is planned but not yet implemented.
Android's SpeechRecognizer does not expose an equivalent API for custom vocabulary biasing, so the Android fallback engine relies entirely on the post-processing pipeline (developer dictionary, context-aware biasing, and personal corrections) for developer term accuracy. This is why the Android row shows "N/A" for custom language model support.
This discrepancy only affects the platform-native fallback engines. When using the default Moonshine engine (the recommended setting), both platforms behave identically: the shared Rust developer dictionary handles all terminology normalization in post-processing, and no platform-specific language model is involved.