03

Voice Input

Evern provides push-to-talk voice input using on-device speech recognition. Transcription runs through a preprocessing pipeline that removes filler words, applies 27 spoken-to-syntax mappings, and normalizes terms against a ~500-term developer dictionary. A personal correction system records word-level edits and learns user-specific vocabulary over time.

Using Voice Input

Activating Voice

Hold the microphone button in the input composer to talk. Audio is captured for the duration of the hold. Release the button to finalize the transcription. The transcription appears inline in the input composer, where it can be reviewed and edited before sending.

Editing Before Sending

After releasing the microphone button, the transcribed text appears in the standard input composer as editable text. You can tap into the transcription to correct any word, add punctuation, or rewrite portions. The text behaves identically to manually typed input — select, delete, and retype as needed. When satisfied, send the message normally.

Switching Engines

Navigate to Settings → Voice Input → Engine to choose between "Evern Voice" (Moonshine, the default) and "Platform Native" (Android SpeechRecognizer or iOS SFSpeechRecognizer). The preprocessing pipeline and personal dictionary apply regardless of which engine is selected.

Personal Corrections

Personal corrections are triggered automatically. When you edit a transcription before sending, the system computes a word-level diff between the original transcription and your edited version. Each changed word pair is recorded as a correction candidate. No explicit action is required beyond editing and sending — the diff is recorded in the background. See the Personal Dictionary section for details on how corrections are promoted.

voice input workflow
# 1. Hold microphone button
[recording...]
 
# 2. Release — transcription appears
"run cube cuttle get pods"
 
# 3. Preprocessing auto-corrects
"run kubectl get pods"
 
# 4. Edit inline if needed, then send

Engine Architecture

Evern defaults to Moonshine v2 Tiny (~30MB), a fully on-device speech recognition model bundled with the app via sherpa-onnx. No audio or transcription data leaves the device. The model runs inference locally on the device CPU; no network connection is required for voice input.

Platform-native STT engines serve as fallback: Android SpeechRecognizer and iOS SFSpeechRecognizer. Users can switch between "Evern Voice" and "Platform Native" in Settings → Voice Input → Engine. Both engines feed into the same preprocessing pipeline, so filler removal, syntax expansion, dictionary normalization, and personal corrections apply regardless of the selected engine.
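
A minimal sketch of this dispatch, using hypothetical names throughout; only the shape is meaningful: raw text from either engine flows into the same shared pipeline.

engine dispatch (sketch)
// Hypothetical names; not Evern's actual API.
enum SttEngine {
    EvernVoice,     // Moonshine v2 Tiny via sherpa-onnx (default)
    PlatformNative, // Android SpeechRecognizer / iOS SFSpeechRecognizer
}

fn transcribe_and_clean(engine: &SttEngine, audio: &[f32]) -> String {
    let raw = match engine {
        SttEngine::EvernVoice => run_moonshine(audio),
        SttEngine::PlatformNative => run_platform_stt(audio),
    };
    preprocess(&raw) // shared pipeline: fillers, syntax, dictionary
}

// Stubs standing in for the real recognizers and pipeline.
fn run_moonshine(_audio: &[f32]) -> String { String::new() }
fn run_platform_stt(_audio: &[f32]) -> String { String::new() }
fn preprocess(raw: &str) -> String { raw.to_string() }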

Streaming Integration

Audio streams through sherpa-onnx's sliding-window pipeline. Partial results stabilize with ~200ms debounce before being displayed. The user sees real-time transcription updating in the input composer as they speak. When the microphone button is released, a final recognition pass runs, and the complete transcription is placed in the composer for inline editing before send.
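
A minimal sketch of the ~200ms debounce on partial results, assuming the UI polls the recognizer for partials; the type name and polling model are assumptions, not sherpa-onnx's API.

partial-result debounce (sketch)
use std::time::{Duration, Instant};

// Hold a partial result until it has been stable for ~200ms
// before showing it in the composer.
struct PartialDebouncer {
    pending: Option<(String, Instant)>,
    window: Duration,
}

impl PartialDebouncer {
    fn new() -> Self {
        Self { pending: None, window: Duration::from_millis(200) }
    }

    // Feed each partial; returns Some(text) once the text has stayed
    // unchanged for the full debounce window.
    fn update(&mut self, partial: String) -> Option<String> {
        match &self.pending {
            Some((text, since)) if *text == partial => {
                if since.elapsed() >= self.window {
                    return Some(partial);
                }
                None
            }
            _ => {
                self.pending = Some((partial, Instant::now()));
                None
            }
        }
    }
}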

voice input pipeline
Push-to-Talk Activation
  → Audio Capture (platform-native)
  → STT Engine (Moonshine v2 Tiny)
  → Preprocessing (filler removal, syntax expansion)
  → Context-Aware Correction
  → Inline Editing & Send

Preprocessing Pipeline

Raw transcription passes through three stages before the user sees it: filler word removal, spoken syntax expansion, and developer dictionary normalization. These stages run sequentially in the shared Rust core, which is compiled for both Android and iOS.
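
A sketch of how the three stages might compose in the Rust core. The function names are assumptions; each stage is sketched in its own subsection below, so identity stubs stand in for them here.

preprocessing stages (sketch)
// Illustrative composition of the three stages; names are
// assumptions, not Evern's actual crate API.
fn preprocess(raw: &str) -> String {
    let text = remove_fillers(raw);         // stage 1: filler removal
    let text = expand_spoken_syntax(&text); // stage 2: "pipe" -> "|"
    normalize_developer_terms(&text)        // stage 3: ~500-term dictionary
}

// Identity stubs so the sketch compiles; real stages are below.
fn remove_fillers(s: &str) -> String { s.to_string() }
fn expand_spoken_syntax(s: &str) -> String { s.to_string() }
fn normalize_developer_terms(s: &str) -> String { s.to_string() }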

Filler Word Removal

Filler words are removed based on position: at utterance boundaries and between pauses. The removal list is um, uh, like, actually, basically, right, well, hmm, ah, er, okay so, and let me think. Position awareness matters for ambiguous words: "like" is only removed when it appears as a discourse filler (e.g., sentence-initial or between pauses), not when it functions as a verb or preposition.
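
A simplified sketch of position-aware removal, assuming pauses show up as commas or periods in the raw transcription; context-dependent fillers like "like" need checks beyond what is shown here.

filler removal (sketch)
// Drop a filler only at an utterance boundary or next to a pause
// (comma/period). Context-dependent fillers such as "like" would
// need extra checks that are omitted from this sketch.
const FILLERS: &[&str] = &["um", "uh", "hmm", "ah", "er"];

fn remove_fillers(text: &str) -> String {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut kept: Vec<&str> = Vec::new();
    for (i, &word) in words.iter().enumerate() {
        let bare = word
            .trim_matches(|c: char| !c.is_alphanumeric())
            .to_lowercase();
        let at_boundary = i == 0
            || i + 1 == words.len()
            || words[i - 1].ends_with(|c: char| c == ',' || c == '.');
        if at_boundary && FILLERS.contains(&bare.as_str()) {
            continue; // treated as a discourse filler: drop it
        }
        kept.push(word);
    }
    kept.join(" ")
}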

Spoken Syntax Expansion

27 spoken-to-syntax mappings transform natural speech into terminal-ready input. When a spoken phrase matches a mapping, it is replaced with the corresponding symbol. These mappings are applied as whole-word matches to avoid false replacements within longer words. See the complete mapping table below.

before / after preprocessing
# Before
"um add input validation to the
sign up form using like zod"
 
# After
"Add input validation to the
signup form using Zod"
 
# Before
"run kubectl get pods pipe
grep error"
 
# After
"run kubectl get pods | grep error"

Spoken Syntax Mappings

The following table lists all 27 spoken-to-syntax mappings supported by the preprocessing pipeline. When any of these phrases appear as whole words in the transcription, they are replaced with the corresponding symbol. Mappings are case-insensitive.

Spoken Phrase     Symbol   Category
open paren        (        Grouping
close paren       )        Grouping
open bracket      [        Grouping
close bracket     ]        Grouping
open brace        {        Grouping
close brace       }        Grouping
pipe              |        Operators
tilde             ~        Operators
backtick          `        Operators
at sign           @        Operators
hash              #        Operators
dollar sign       $        Operators
ampersand         &        Operators
asterisk          *        Operators
caret             ^        Operators
percent           %        Operators
exclamation       !        Operators
semicolon         ;        Punctuation
colon             :        Punctuation
single quote      '        Punctuation
double quote      "        Punctuation
forward slash     /        Punctuation
backslash         \        Punctuation
equals            =        Comparison
dash / hyphen     -        Comparison
greater than      >        Comparison
less than         <        Comparison
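
A sketch of how whole-word, case-insensitive matching might be applied, using a small subset of the table above. A tokenized scan like this never rewrites substrings, so "pipeline" is left alone; handling of punctuation attached to words is omitted.

spoken syntax expansion (sketch)
// Small subset of the table above; matching is whole-word and
// case-insensitive, so "pipeline" is never rewritten.
const MAPPINGS: &[(&str, &str)] = &[
    ("open paren", "("),
    ("close paren", ")"),
    ("pipe", "|"),
    ("greater than", ">"),
];

fn expand_spoken_syntax(text: &str) -> String {
    let words: Vec<&str> = text.split_whitespace().collect();
    let mut out: Vec<String> = Vec::new();
    let mut i = 0;
    while i < words.len() {
        let hit = MAPPINGS.iter().find(|(phrase, _)| {
            let parts: Vec<&str> = phrase.split(' ').collect();
            i + parts.len() <= words.len()
                && words[i..i + parts.len()]
                    .iter()
                    .zip(&parts)
                    .all(|(w, p)| w.eq_ignore_ascii_case(p))
        });
        match hit {
            Some((phrase, symbol)) => {
                out.push((*symbol).to_string());
                i += phrase.split(' ').count();
            }
            None => {
                out.push(words[i].to_string());
                i += 1;
            }
        }
    }
    // "kubectl get pods pipe grep error" -> "kubectl get pods | grep error"
    out.join(" ")
}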

Developer Dictionary

A ~500-term dictionary normalizes developer terminology that general-purpose STT engines consistently misrecognize. Terms are organized by category and applied as post-processing corrections after the STT engine produces raw text. The dictionary is compiled into the shared Rust core and runs on both platforms identically.

Each entry maps a misrecognized form (or set of forms) to the correct developer term. For example, "pie torch" and "py torch" both map to "PyTorch". Matching is case-insensitive and applies to whole words only to avoid false positives within longer terms.

Category        Examples                                                                Terms
CLI Tools       kubectl, terraform, docker, npm, yarn, pip, cargo, brew                 ~60
Git             rebase, cherry-pick, stash, bisect, reflog                              ~30
Languages       TypeScript, PostgreSQL, Kotlin, Swift, PyTorch                          ~80
Frameworks      React, NextJS, FastAPI, SwiftUI, Django                                 ~70
Patterns        async/await, useState, println, console.log                             ~50
Infrastructure  nginx, Redis, Kafka, Kubernetes                                         ~50
Corrections     "pie torch" → PyTorch, "post gres" → Postgres, "cube cuttle" → kubectl  ~160
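
A sketch of what an entry mapping several misrecognized forms to one canonical term might look like. The entry type is an assumption; the example forms are drawn from the Corrections row above, and lookup would use the same whole-word, case-insensitive scan shown for syntax expansion.

developer dictionary entry (sketch)
// Illustrative entry shape: several misheard forms map to one term.
struct DictEntry {
    misheard: &'static [&'static str],
    canonical: &'static str,
}

const DICTIONARY: &[DictEntry] = &[
    DictEntry { misheard: &["pie torch", "py torch"], canonical: "PyTorch" },
    DictEntry { misheard: &["post gres"], canonical: "Postgres" },
    DictEntry { misheard: &["cube cuttle"], canonical: "kubectl" },
];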

Context-Aware Biasing

The preprocessing pipeline receives context from the active session: recent commands, server names, and the detected AI agent. This context biases correction toward terms the user is likely saying.

If the user recently typed kubectl commands, ambiguous transcriptions are more likely to resolve to Kubernetes-related terms. If connected to a server named "staging-api", that name is added to the correction vocabulary. Context biasing operates as a re-ranking step in the post-processing pipeline: it does not modify the STT engine itself, but increases the priority of contextually relevant dictionary entries when resolving ambiguous transcriptions.

The context window includes the last 20 commands from the active session and all server names from the current connection list. Context is refreshed on each voice activation, so it always reflects the most recent session state.

context-aware API
fn preprocess_voice_input_with_context(
    input: String,
    recent_commands: Vec<String>,
    server_names: Vec<String>,
    personal_dictionary: Vec<VoiceCorrectionPair>,
) -> String
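
A hedged sketch of the re-ranking idea behind that signature: when a transcription is ambiguous, dictionary candidates that match recent commands or server names get a score boost. The function name and scoring weights here are assumptions.

context re-ranking (sketch)
// Illustrative scoring only; weights and matching rules are
// assumptions, not Evern's implementation.
fn rank_candidate(
    candidate: &str,
    recent_commands: &[String], // last 20 commands from the session
    server_names: &[String],    // names from the connection list
) -> u32 {
    let mut score = 1;
    if recent_commands.iter().any(|c| c.contains(candidate)) {
        score += 2; // term appeared in a recent command
    }
    if server_names.iter().any(|s| s.eq_ignore_ascii_case(candidate)) {
        score += 3; // exact server-name match
    }
    score
}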

Personal Dictionary

Evern learns from your corrections. When you edit a transcription before sending, the system computes a word-level diff between the raw transcription and your edited version. Each changed word pair is recorded as a correction entry. No explicit action is required — edit the transcription, send the message, and the diff is recorded automatically in the background.
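
A simplified sketch of that diff, pairing words positionally; a real implementation would use sequence alignment so that insertions and deletions (e.g., the two words "cube cuttle" becoming the single word "kubectl") are paired correctly.

word-level diff (sketch)
// Simplified positional diff: pairs changed words between the raw
// transcription and the user's edit. Real code would use sequence
// alignment to handle inserted or deleted words.
fn word_diff(raw: &str, edited: &str) -> Vec<(String, String)> {
    raw.split_whitespace()
        .zip(edited.split_whitespace())
        .filter(|(a, b)| !a.eq_ignore_ascii_case(b))
        .map(|(a, b)| (a.to_string(), b.to_string()))
        .collect()
}

// word_diff("get podz", "get pods") yields [("podz", "pods")]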

Confidence Tiers

Corrections follow a confidence tier system before becoming active:

  • Candidate — first correction recorded, stored but not applied
  • Learned — second identical correction promotes the entry; auto-applied going forward
  • Locked — 5+ identical corrections; high confidence, always applied

On promotion to "learned", a toast notification appears: "Learned: {word}" with an undo action. Tapping undo reverts the entry to "candidate" status. Corrections are stored in SQLite and persist across sessions. The personal dictionary is shared between both engine options (Moonshine and platform-native), since corrections apply in the post-processing pipeline that runs after either engine.

personal dictionary schema
# voice_corrections table
 
raw_phrase        TEXT
corrected_phrase  TEXT
occurrences       INTEGER
status            TEXT
 
# Status progression:
# candidate (1) → learned (2) → locked (5+)
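
A sketch of the promotion rule implied by this schema; the thresholds mirror the candidate (1) → learned (2) → locked (5+) progression, though the actual storage-layer code is not shown here.

tier promotion (sketch)
// Status from occurrence count, per the progression above.
fn status_for(occurrences: u32) -> &'static str {
    match occurrences {
        0 | 1 => "candidate", // recorded, not yet applied
        2..=4 => "learned",   // auto-applied; "Learned: {word}" toast
        _ => "locked",        // 5+ corrections, always applied
    }
}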

Platform Integration

Both platforms share the same Rust core for preprocessing, dictionary lookup, and personal corrections. The primary difference is in custom language model support for the platform-native fallback engine.

Feature                Android                          iOS
Default Engine         Moonshine v2 Tiny (sherpa-onnx)  Moonshine v2 Tiny (sherpa-onnx)
Fallback Engine        SpeechRecognizer                 SFSpeechRecognizer
Audio Capture          Platform-native                  Platform-native
Custom Language Model  N/A                              Programming vocabulary model (planned)
Preprocessing          Shared Rust core                 Shared Rust core
Personal Dictionary    Shared SQLite                    Shared SQLite
Privacy                Fully on-device, no cloud        Fully on-device, no cloud

Custom Language Model: iOS vs. Android

The "Programming vocabulary model" entry for iOS refers to Apple's SFCustomLanguageModelData API, which allows apps to supply a custom vocabulary list that biases the platform-native SFSpeechRecognizer toward specific terms. Evern plans to use this API to feed developer terminology into the iOS fallback engine, improving recognition accuracy when the user selects "Platform Native" as their engine. This feature is planned but not yet implemented.

Android's SpeechRecognizer does not expose an equivalent API for custom vocabulary biasing, so the Android fallback engine relies entirely on the post-processing pipeline (developer dictionary, context-aware biasing, and personal corrections) for developer term accuracy. This is why the Android row shows "N/A" for custom language model support.

This discrepancy only affects the platform-native fallback engines. When using the default Moonshine engine (the recommended setting), both platforms behave identically: the shared Rust developer dictionary handles all terminology normalization in post-processing, and no platform-specific language model is involved.