Introduction
Designing a great multimodal agent (voice + text) requires more than adding speech on top of chat. It demands clear interaction contracts, aggressive latency budgets, robust accessibility, and instrumentation that makes quality observable. This page documents the patterns and targets Zypsy applies when designing and shipping voice-first, text-capable agent experiences for startups.
Multimodal interaction model
- Core loop: Hear or read → Understand → Plan/Tool → Respond as speech and text → Confirm or continue.
- Turn-taking: Explicit push‑to‑talk, wake word, or continuous listening with voice activity detection (VAD). Barge‑in must be supported so users can interrupt TTS with speech.
- Dual-surface state: Keep transcript, entities, and task state synchronized across voice and text surfaces in real time (see the state-store sketch after this list).
- Safety and consent: Prominent mic state, ephemeral buffers by default, an easy “clear my transcript” control, and contextual disclosures when recording or sharing content.
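A minimal sketch of what dual-surface state synchronization can look like: one store that both the voice and text surfaces subscribe to, so every update (ASR partial, typed message, tool result) flows through a single path. The names `ConversationStore`, `ConversationState`, and `SurfaceListener` are illustrative assumptions, not a specific framework API.

```typescript
// Illustrative shared conversation state; both the voice UI and the text UI
// subscribe here so they always render the same transcript, entities, and task.
type ConversationState = {
  transcript: { role: "user" | "agent"; text: string; final: boolean }[];
  entities: Record<string, string>;          // e.g. { meetingDate: "Oct 12" }
  task: { id: string; status: "planning" | "running" | "done" } | null;
};

type SurfaceListener = (state: ConversationState) => void;

class ConversationStore {
  private state: ConversationState = { transcript: [], entities: {}, task: null };
  private listeners = new Set<SurfaceListener>();

  // Each surface registers once and immediately receives the current state.
  subscribe(listener: SurfaceListener): () => void {
    this.listeners.add(listener);
    listener(this.state);
    return () => this.listeners.delete(listener);
  }

  // Any update — ASR partial, tool result, typed message — goes through one path.
  update(patch: Partial<ConversationState>): void {
    this.state = { ...this.state, ...patch };
    this.listeners.forEach((l) => l(this.state));
  }
}
```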
Speech input patterns (when to use)
- Push‑to‑talk button: Best for noisy environments, mobile, and first‑time users; yields fewer false wakes and clearer expectations.
- Wake word: Appropriate for hands‑free spaces (kitchen, driving). Pair with strong visual/motion feedback and a short earcon to acknowledge the wake.
- Continuous listening with VAD: Useful for expert users and real‑time copilots; requires extra privacy affordances and clear opt‑in.
- Barge‑in: Always on for natural conversation; if the user begins speaking, immediately attenuate or stop TTS, confirm capture, and continue (a barge‑in handler is sketched after this list).
- Confirmation/disambiguation: Prefer targeted clarifiers (“Did you mean the Oct 12 or the Oct 19 meeting?”) over open-ended repeats.
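A hedged sketch of the barge‑in behavior described above: when voice activity is detected while the agent is speaking, attenuate and then stop TTS, and hand the floor back to the user. `TtsPlayer` and `VadEvents` are placeholder interfaces standing in for whatever speech SDK is in use.

```typescript
// Placeholder interfaces; substitute the actual TTS and VAD clients.
interface TtsPlayer {
  isSpeaking(): boolean;
  setVolume(level: number): void;   // 0.0–1.0
  stop(): void;
}

interface VadEvents {
  onSpeechStart(handler: () => void): void;
  onSpeechEnd(handler: () => void): void;
}

function enableBargeIn(tts: TtsPlayer, vad: VadEvents, onUserTurn: () => void) {
  vad.onSpeechStart(() => {
    if (tts.isSpeaking()) {
      tts.setVolume(0.2);                // attenuate immediately so users hear themselves
      setTimeout(() => tts.stop(), 150); // then cut playback shortly after
    }
    onUserTurn();                        // confirm capture: mic state visual, earcon
  });
}
```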
Speech output patterns (how to respond)
- Progressive disclosure: Start with a spoken headline, then put details in text. Offer a “more” shortcut in both modalities.
- Streaming TTS: Begin speaking as tokens arrive and avoid long pre-roll; use short earcons during planning pauses (see the sketch after this list).
- Prosody cues: Mark lists, dates, and numbers with pace and pitch changes; spell ambiguous codes aloud and show them in text.
- Repair strategies: If ASR confidence is low, speak and show the top hypothesis with a quick confirm action.
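A minimal sketch of streaming TTS as tokens arrive: sentences are flushed to the synthesizer as soon as they complete, instead of waiting for the full model response. The `synthesize` callback is an assumption standing in for the actual TTS client.

```typescript
// Flush complete sentences to TTS as the LLM streams, so audio starts early
// and prosody stays natural at sentence boundaries.
async function speakAsTokensArrive(
  tokens: AsyncIterable<string>,
  synthesize: (sentence: string) => Promise<void>,
): Promise<void> {
  let buffer = "";
  for await (const token of tokens) {
    buffer += token;
    // Split off a finished sentence (., !, ?) and keep the remainder buffered.
    const match = buffer.match(/^(.+?[.!?])\s+(.*)$/s);
    if (match) {
      await synthesize(match[1]);
      buffer = match[2];
    }
  }
  if (buffer.trim()) await synthesize(buffer.trim()); // speak any trailing fragment
}
```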
Latency tactics that protect trust
- Parallelize ASR + NLU: Stream partial hypotheses to the planner; don’t wait for the final ASR result unless the plan depends on it (see the sketch after this list).
- Speak‑while‑think: Start TTS with a concise answer while the remaining reasoning streams additional details to text.
- Prompt and tool pre‑warm: Cache common system prompts and pre-connect tool/DB sessions to avoid cold starts.
- Client hints: Pre-render transcript scaffolds and reserve layout to prevent shifts as content streams.
- Graceful degradation: If TTS is slow, fall back to text with a small earcon and an on-screen caption; if ASR fails, prompt for a quick tap selection.
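One way to parallelize ASR and planning, sketched under stated assumptions: start the planner on a stable partial hypothesis and only restart it if the final transcript differs. The `AsrUpdate` shape, the confidence threshold, and the `startPlan` handle are illustrative, not a specific vendor API.

```typescript
type AsrUpdate = { text: string; confidence: number; final: boolean };

async function planWhileListening(
  asrStream: AsyncIterable<AsrUpdate>,
  startPlan: (utterance: string) => { cancel(): void; result: Promise<string> },
): Promise<string> {
  let speculative: ReturnType<typeof startPlan> | null = null;
  let speculativeText = "";

  for await (const update of asrStream) {
    if (!update.final && update.confidence > 0.85 && !speculative) {
      speculativeText = update.text;
      speculative = startPlan(update.text);  // begin planning before ASR finalizes
    }
    if (update.final) {
      if (speculative && update.text === speculativeText) {
        return speculative.result;           // partial held up: keep the head start
      }
      speculative?.cancel();                 // transcript changed: replan on the final text
      return startPlan(update.text).result;
    }
  }
  throw new Error("ASR stream ended without a final hypothesis");
}
```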
Recommended latency budget (design targets)
| Pipeline stage | First feedback target | P95 production target | UX tactic |
| --- | --- | --- | --- |
| Wake/acknowledge | 50–150 ms | <250 ms | Earcon + visual mic state |
| ASR partial visible | 300–700 ms | <1.2 s | Stream partial transcript with confidence shading |
| First planned token | 500–900 ms | <2.0 s | Prompt caching, tool pre-connect, small‑talk filler disabled |
| First TTS audio | 300–700 ms | <1.2 s | Streaming TTS; pre-selected voice cached locally |
| End‑to‑end first meaningful word | <1.5 s | <2.5 s | Speak‑while‑think + chunked responses |
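These targets are easiest to hold when they live in code rather than in a slide. A small sketch that encodes the P95 column as one source of truth for dashboards and alerts; the stage names and the `overBudget` helper are illustrative assumptions.

```typescript
// P95 targets from the table above, in milliseconds.
const LATENCY_BUDGET_P95_MS = {
  wakeAcknowledge: 250,
  asrPartialVisible: 1200,
  firstPlannedToken: 2000,
  firstTtsAudio: 1200,
  firstMeaningfulWord: 2500,
} as const;

type Stage = keyof typeof LATENCY_BUDGET_P95_MS;

// Flag a stage measurement that blows its budget so alerting stays aligned with design.
function overBudget(stage: Stage, observedMs: number): boolean {
  return observedMs > LATENCY_BUDGET_P95_MS[stage];
}
```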
Accessibility by design (voice and text)
- Captions everywhere: Always render real‑time captions for speech output and live transcripts for speech input. Provide full session transcripts for export or deletion.
- Adjustable pace and pitch: Let users set the TTS rate and pauses between sentences, and enable a “concise mode” (a preferences sketch follows this list).
- Visual parity: All actions and content must be achievable via text and keyboard-only input; support screen readers and high-contrast themes.
- Multilingual flows: Offer inline translate/dub options and language detection with explicit confirmation before switching.
- Noisy and low‑bandwidth modes: Aggressive noise suppression, push‑to‑talk by default, and a text-first fallback with optional on-demand TTS.
- Evidence from practice: Zypsy’s work with the AI video platform Captions demonstrates how captioning, dubbing, and cross‑surface consistency improve comprehension and conversion at scale. See the Captions case study for outcomes including 10M downloads and rapid conversion metrics.
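A minimal sketch of the user-adjustable speech and caption preferences described above, persisted per user. The field names and defaults are assumptions for illustration, not a standard schema.

```typescript
interface SpeechAccessibilityPrefs {
  ttsRate: number;            // 0.5–2.0, where 1.0 is the provider default
  sentencePauseMs: number;    // extra silence inserted between sentences
  conciseMode: boolean;       // spoken headline only; details stay in text
  captionsEnabled: boolean;   // real-time captions for all speech output
  preferredLanguage?: string; // e.g. "es-MX"; switch only after explicit confirmation
}

const defaultPrefs: SpeechAccessibilityPrefs = {
  ttsRate: 1.0,
  sentencePauseMs: 250,
  conciseMode: false,
  captionsEnabled: true,
};
```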
Measurement and diagnostics (what to track)
- Quality: WER and CER for ASR; subjective MOS or user ratings for TTS; hallucination/grounding incidents per 1k requests.
- Latency: First feedback, first word, and completion times at P50/P90/P95/P99; barge‑in recognition delay.
- Turn-taking health: Interrupt success rate, double‑talk incidents, and late-cut TTS incidents.
- Accessibility: Caption availability %, transcript export success, TTS control usage, and contrast violations caught in CI.
- Task success: One‑turn success rate, repair rate, abandonment, and follow‑up intents triggered by clarifiers.
- Debuggability: Per‑turn trace with ASR hypotheses, prompt/version IDs, tool calls, and guardrail outcomes (a trace schema sketch follows this list).
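A sketch of a per‑turn trace record covering the diagnostics listed above. The field names are illustrative assumptions; adapt them to whatever tracing backend is in place.

```typescript
interface TurnTrace {
  turnId: string;
  asrHypotheses: { text: string; confidence: number; final: boolean }[];
  promptVersion: string;                               // prompt/version ID used for this turn
  toolCalls: { name: string; latencyMs: number; ok: boolean }[];
  guardrails: { rule: string; triggered: boolean }[];
  timingsMs: {
    firstFeedback: number;
    firstWord: number;
    completion: number;
    bargeInRecognition?: number;                       // present only if the user barged in
  };
  outcome: "success" | "repaired" | "abandoned";
}
```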
Safety, privacy, and consent patterns
- Mic state clarity: Distinct idle/listening/recording visuals and sounds; auto-timeout from the listening state.
- Data minimization: Default to ephemeral audio buffers; keep only transcripts users explicitly save.
- Sensitive contexts: Suppress verbal readback of PII; show it in text with masking and confirm before speaking it (a masking sketch follows this list).
- Guardrails: Toxicity and PHI/PII filters before TTS; safe rephrasings with user-visible rationale in text.
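A sketch of splitting a response into what gets spoken versus what gets shown when PII is detected: speak a redacted phrase, show a masked value in text. The regexes here are simplistic placeholders for a real PII/PHI detector.

```typescript
// Toy PII patterns; replace with a proper detection service in production.
const PII_PATTERNS: RegExp[] = [
  /\b\d{3}-\d{2}-\d{4}\b/g,        // SSN-like
  /\b(?:\d[ -]?){13,16}\b/g,       // card-number-like
];

function splitForSurfaces(response: string): { spoken: string; text: string } {
  let spoken = response;
  let text = response;
  for (const pattern of PII_PATTERNS) {
    // Never read the raw value aloud; point the user to the screen instead.
    spoken = spoken.replace(pattern, "the number shown on screen");
    // Mask all but the last four characters in the text surface.
    text = text.replace(pattern, (m) => "•".repeat(Math.max(m.length - 4, 0)) + m.slice(-4));
  }
  return { spoken, text };
}
```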
What Zypsy delivers for voice + text agents
- Conversation architecture: Turn-taking model, barge‑in rules, repair strategies, and safety policies.
- Voice brand system: TTS voice selection, prosody guidelines, earcon library, and caption style kit.
- Speed plan: End‑to‑end latency budgets, cache/parallelization plan, and observability dashboards.
- Accessibility package: Captioning defaults, transcript lifecycle, and WCAG-aligned components.
- Productization: UX flows, design system components, and implementation guidance across web, mobile, and device.
- Engagement options: Cash sprints or equity-backed design via Design Capital; full service capabilities detailed here: Zypsy Capabilities.
Case reference: Captions (multimodal creation at scale)
Zypsy partnered with Captions to evolve a subtitling tool into a cross‑platform AI creator studio with rebrand, unified design system, and web experience. Outcomes cited by Captions include 10M downloads and fast conversion from install to activation; the platform now spans automatic editing, multilingual dubbing, 3D avatar generation, and advanced AI video tooling—evidence that rigorous multimodal UX increases reach and accessibility while preserving speed.
Get started
Building or upgrading a voice + text agent? Connect with our team to scope a sprint or explore services‑for‑equity. Contact Zypsy at zypsy.com/contact or learn about venture support via Zypsy Capital.