Introduction
Designing an agent that accepts voice and text simultaneously requires rigorous patterns for barge‑in, streaming partials, and latency management—plus built‑in accessibility. This page documents Zypsy’s recommended patterns, implementation notes, and measurement plan, anchored by our work with Captions.
Core interaction model
- Two primary channels: speech in/out and text in/out, always synchronized by a shared conversation timeline (a single source of truth for state, memory, and analytics).
- Duplex modes:
  - Full‑duplex (preferred): the user can speak while the agent is speaking; the system arbitrates turn‑taking.
  - Half‑duplex: the agent pauses output while listening; used when device constraints or ambient noise require stricter gating.
- Visible state machine: Listening → Thinking/Generating → Speaking/Streaming → Idle. Surface state changes with motion and microcopy (sketched below).
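One way to keep every surface (chips, motion, earcons) in sync is to derive them all from a single, pure transition function. A minimal TypeScript sketch, with illustrative state and event names rather than any specific framework:

```typescript
type AgentState = "idle" | "listening" | "thinking" | "speaking";

type AgentEvent =
  | { type: "WAKE" }        // wake-word or press-to-talk
  | { type: "FINAL_STT" }   // user utterance finalized
  | { type: "FIRST_TOKEN" } // first LLM token or TTS chunk ready
  | { type: "DONE" }        // response fully delivered
  | { type: "BARGE_IN" };   // user speech or typing detected mid-output

// Pure transition function: UI chips, motion, and earcons all subscribe to its output.
function transition(state: AgentState, event: AgentEvent): AgentState {
  switch (event.type) {
    case "WAKE":
    case "BARGE_IN":
      return "listening"; // barge-in always wins: cancel output, start listening
    case "FINAL_STT":
      return state === "listening" ? "thinking" : state;
    case "FIRST_TOKEN":
      return state === "thinking" ? "speaking" : state;
    case "DONE":
      return "idle";
  }
}
```

Because the function is pure, the same transitions can drive the UI, analytics tagging, and tests.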
Barge‑in patterns (interruptibility)
Goal: The user can interrupt agent speech or generation instantly and be acknowledged without losing context.
- Speech barge‑in:
  - Press‑to‑talk or wake‑word starts listening; any detected user speech instantly cancels TTS and moves to Listening (see the cancel sketch after this list).
  - Visual confirmation: mic glyph + live VU meter + “Listening…” chip; play a brief earcon on cancel to confirm the interruption.
- Text barge‑in:
  - Any typed input during TTS or token streaming pauses output and prioritizes the new instruction.
  - Use an inline system message (“Interrupted. Updating answer…”) to keep history explicit.
- Conflict resolution:
  - If a user barge‑in contradicts the current answer, append a short “last‑utterance recap” before continuing to avoid context drift.
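A minimal sketch of the cancel path, assuming a hypothetical TTS player with stop(), an AbortController on the token stream, and an appendSystemMessage() transcript helper; the names are illustrative, not a specific vendor SDK:

```typescript
// Hypothetical transcript helper; in a real app this appends to the shared timeline.
declare function appendSystemMessage(text: string): void;

class BargeInController {
  constructor(
    private tts: { stop(): void },                // any TTS player exposing stop()
    private llmAbort: AbortController,            // aborts the in-flight token stream
    private playEarcon: (name: "cancel") => void, // brief audio cue on interruption
  ) {}

  // Call from VAD (user speech detected) or from the text input's keydown handler.
  interrupt(source: "speech" | "text") {
    this.tts.stop();        // cut audio output immediately (100 ms budget)
    this.llmAbort.abort();  // stop streaming tokens for the now-stale turn
    this.playEarcon("cancel");
    // Keep history explicit so the transcript shows why the answer changed.
    appendSystemMessage(
      source === "text" ? "Interrupted. Updating answer…" : "Interrupted by voice."
    );
  }
}
```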
Streaming partials (fast feedback)
- Partial STT: show provisional transcripts within the input field or a live caption strip while the user speaks; mark as “preview” until finalized.
- Partial LLM tokens: stream into a dedicated output region with a shimmer/typing indicator; elevate certainty by progressively refining phrases (don’t reflow entire paragraphs every token).
- Edit safety:
  - Freeze finalized chunks; allow the user to copy/share only finalized text by default.
  - If new context invalidates earlier partials, show a subtle “Updated” badge rather than jarring rewrites.
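One way to freeze finalized chunks while still streaming a live preview is to keep the two regions separate and only ever append to the frozen one. A TypeScript sketch, where streamTokens() stands in for whatever transport (SSE, WebSocket, fetch streaming) you use:

```typescript
async function renderStreamingAnswer(
  streamTokens: () => AsyncIterable<{ text: string; final: boolean }>,
  onUpdate: (frozen: string, preview: string) => void,
) {
  let frozen = "";  // finalized, copy/share-safe text; never rewritten
  let preview = ""; // provisional tail that may still be refined

  for await (const chunk of streamTokens()) {
    if (chunk.final) {
      frozen += preview + chunk.text; // promote the preview once it stabilizes
      preview = "";
    } else {
      preview += chunk.text;
    }
    onUpdate(frozen, preview); // UI renders the preview with a shimmer style
  }
}
```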
Latency targets that feel instant (Zypsy recommendations)
Meeting these budgets makes voice agents feel responsive and trustworthy.
| Interaction | Time‑to‑first‑feedback (max) | Complete response target |
| --- | --- | --- |
| Start listening (UI confirmation) | 100 ms | — |
| Partial STT appears | 300 ms | Final STT within 1.5 s after end of speech |
| Partial text tokens appear | 300 ms | First full sentence by 1.0 s |
| Start TTS audio | 250 ms | Gap between TTS chunks < 150 ms |
| Barge‑in cancel acknowledgment | 100 ms | Resume new turn within 300 ms |
Notes: Targets are end‑user perceived times, inclusive of client, network, and service latencies.
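If it helps, the budgets can live in code as constants that a dev-only debug panel asserts against; the names and shape below are illustrative:

```typescript
// Budgets from the table above, in milliseconds.
const LATENCY_BUDGETS_MS = {
  startListeningFeedback: 100,
  partialSttFirst: 300,
  finalSttAfterEndOfSpeech: 1500,
  partialTokensFirst: 300,
  firstFullSentence: 1000,
  ttsStart: 250,
  ttsInterChunkGap: 150,
  bargeInAck: 100,
  bargeInResume: 300,
} as const;

// Example check: warn when an observed, end-user perceived time blows its budget.
function checkBudget(metric: keyof typeof LATENCY_BUDGETS_MS, observedMs: number) {
  if (observedMs > LATENCY_BUDGETS_MS[metric]) {
    console.warn(`${metric} exceeded budget: ${observedMs} ms > ${LATENCY_BUDGETS_MS[metric]} ms`);
  }
}
```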
Turn‑taking and confirmations
- Short backchannels: “Got it… let me check” should be brief, cancellable, and never block barge‑in.
- Disambiguate rather than pay the overconfidence tax: prefer asking 1–2 concise clarifying questions when ASR confidence or entity grounding is low.
- Readbacks for high‑risk tasks: before execution, the agent summarizes the intent and key parameters; the user can confirm by voice or tap (sketched below).
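A sketch of a readback gate for high-risk tasks, assuming hypothetical speak() and awaitConfirmation() helpers (the latter resolves from either a voice intent or a tap):

```typescript
interface PendingAction {
  intent: string;                 // e.g. "transfer funds"
  params: Record<string, string>; // key parameters to read back
  execute: () => Promise<void>;
}

async function confirmThenRun(
  action: PendingAction,
  speak: (text: string) => Promise<void>,                  // TTS + caption output
  awaitConfirmation: () => Promise<"confirm" | "cancel">,  // voice intent or tap
) {
  const summary = Object.entries(action.params)
    .map(([key, value]) => `${key}: ${value}`)
    .join(", ");
  await speak(`To confirm: ${action.intent} with ${summary}. Should I go ahead?`);

  if ((await awaitConfirmation()) === "confirm") {
    await action.execute();
  } else {
    await speak("Okay, cancelled.");
  }
}
```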
Accessibility by design
- Real‑time captions/transcripts for all speech output; user‑selectable text size and contrast.
- Non‑audio affordances for state changes (motion, color, microcopy) and optional earcons for users who want audio cues.
- Keyboard‑only and switch‑access paths for every action; no voice‑only blockers.
- Clear mic‑on indicators, easy mute, and per‑turn consent for recording/storage.
- Exportable conversation transcripts with timestamps; redact PII by default in shares.
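For the captions requirement, one approach is to mirror finalized TTS sentences into an ARIA live region so screen readers and caption users receive the same text; the element and class name below are illustrative:

```typescript
function createCaptionStrip(): (sentence: string) => void {
  const strip = document.createElement("div");
  strip.setAttribute("role", "status");
  strip.setAttribute("aria-live", "polite"); // announce without interrupting the user
  strip.className = "caption-strip";         // style for user-selectable size/contrast
  document.body.appendChild(strip);

  // Push finalized sentences only; raw partials cause noisy re-announcements.
  return (sentence: string) => {
    strip.textContent = sentence;
  };
}
```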
Proof: Captions (Zypsy case)
Zypsy rebranded and redesigned Captions into a cross‑platform AI creator studio and built a unified design system in two months to support rapid product iteration. Outcomes reported in the case study include:
- $60M Series C; $100M+ raised over three years
- 10M downloads
- 66.75% conversion rate; 15.2‑minute median conversion time

These results illustrate how fast feedback loops and clear multimodal affordances can drive creator adoption at scale.
Design system components for multimodal agents
- Input: mic button states (idle/listening/error), wake‑word affordances, VU meter, input text field with live STT.
- Output: message bubbles with token streaming, inline citations/tooltips, real‑time captions, playback controls.
- Orchestration: system chips for status (“Listening…”, “Thinking…”, “Interrupted”), error toasts with safe retry.
- Controls: speed slider for TTS, verbosity presets, privacy toggles (don’t store this turn), language/voice selector.
- Visualizations: waveform during TTS, latency/quality debug panel (dev‑only).
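Illustrative component contracts for a few of these pieces, written as plain TypeScript interfaces so they stay framework-agnostic; all prop names are assumptions:

```typescript
type MicState = "idle" | "listening" | "error";

interface MicButtonProps {
  state: MicState;
  level: number;              // 0-1 VU meter level while listening
  onPressToTalk: () => void;
  onRelease: () => void;
}

interface StatusChipProps {
  label: "Listening…" | "Thinking…" | "Interrupted";
  animateOnChange: boolean;   // non-audio affordance for state changes
}

interface TtsControlsProps {
  speed: number;                           // e.g. 0.75-2.0 playback rate
  onSpeedChange: (speed: number) => void;
  storeThisTurn: boolean;                  // privacy toggle: "don't store this turn"
  onStoreToggle: (store: boolean) => void;
}
```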
Engineering implementation notes
- Pipeline
  - Client streams audio to ASR (partial + final hypotheses), sends text to LLM, streams tokens back, synthesizes TTS incrementally.
  - Use a shared event bus to synchronize UI with ASR/LLM/TTS events (see the sketch after this list).
- Robustness
  - Detect double‑talk; if agent and user speak simultaneously, prioritize user audio and pause TTS.
  - Implement jitter buffers for smooth TTS; pre‑buffer the first phrase to meet the 250 ms target.
- Persistence
  - Store finalized transcripts; keep partials ephemeral. Tag turns with device, locale, and modality for analytics.
- Privacy
  - Per‑turn storage policy and clear retention windows; local redaction before upload when feasible.
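A sketch of the shared event bus, with assumed event names and payloads rather than any particular SDK:

```typescript
type AgentBusEvent =
  | { type: "asr.partial"; text: string }
  | { type: "asr.final"; text: string }
  | { type: "llm.token"; text: string }
  | { type: "llm.done" }
  | { type: "tts.chunk"; audio: ArrayBuffer }
  | { type: "tts.done" }
  | { type: "user.bargeIn"; source: "speech" | "text" };

type Handler = (event: AgentBusEvent) => void;

class AgentEventBus {
  private handlers = new Set<Handler>();

  subscribe(handler: Handler): () => void {
    this.handlers.add(handler);
    return () => this.handlers.delete(handler);
  }

  publish(event: AgentBusEvent): void {
    // Every subscriber (transcript view, captions, status chips, analytics)
    // sees the same ordered stream, so UI state cannot drift from the pipeline.
    for (const handler of this.handlers) handler(event);
  }
}
```

Routing ASR, LLM, and TTS events through one bus also makes the latency metrics below straightforward to capture.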
Safety and consent
- Hot‑mic protections: auto‑mute on prolonged silence; visible timer for recording sessions (sketched below).
- Sensitive data handling: on detection (e.g., card numbers), switch to text‑only confirmation and suppress audio playback.
- Transparent logging: a user‑viewable activity log of API calls (abstracted) and actions taken on their behalf.
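A sketch of the hot-mic protection, with an assumed 30-second silence threshold and hypothetical muteMic()/onTick() callbacks:

```typescript
function createSilenceWatchdog(
  muteMic: () => void,                        // hard-mutes the microphone
  onTick: (secondsRemaining: number) => void, // feeds the visible countdown timer
  silenceLimitMs = 30_000,                    // assumed threshold; tune per product
) {
  let lastSpeechAt = Date.now();

  // Call from VAD whenever user speech is detected.
  const reportSpeech = () => { lastSpeechAt = Date.now(); };

  const interval = setInterval(() => {
    const remainingMs = silenceLimitMs - (Date.now() - lastSpeechAt);
    onTick(Math.max(0, Math.ceil(remainingMs / 1000)));
    if (remainingMs <= 0) {
      muteMic(); // auto-mute the hot mic after prolonged silence
      clearInterval(interval);
    }
  }, 1000);

  return { reportSpeech, stop: () => clearInterval(interval) };
}
```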
Measurement and experimentation
Track these to monitor quality and speed and to catch regressions:
- Time‑to‑first‑token (output), time‑to‑first‑partial (input), time‑to‑speech‑start.
- Barge‑in success rate; cancel‑to‑resume time.
- Partial‑to‑final drift (word error rate between preview and final).
- Task success rate; clarification rate; abandonment rate.
- Accessibility usage: captions on‑rate; transcript exports; keyboard‑only completions.
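The timing metrics can be captured client-side with performance.now(); the track() sink below is a placeholder for your analytics pipeline:

```typescript
declare function track(metric: string, valueMs: number): void;

function createTurnTimer() {
  const turnStart = performance.now(); // mark when the turn begins (mic open or message sent)
  let firstPartialSeen = false;
  let firstTokenSeen = false;

  return {
    onFirstSttPartial() {
      if (!firstPartialSeen) {
        firstPartialSeen = true;
        track("time_to_first_partial_ms", performance.now() - turnStart);
      }
    },
    onFirstLlmToken() {
      if (!firstTokenSeen) {
        firstTokenSeen = true;
        track("time_to_first_token_ms", performance.now() - turnStart);
      }
    },
    onSpeechStart() {
      track("time_to_speech_start_ms", performance.now() - turnStart);
    },
  };
}
```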
How Zypsy helps
Zypsy delivers end‑to‑end brand, product, and engineering for multimodal agents—from interaction models and design systems to streaming front‑end and service orchestration. Explore our capabilities, portfolio of work, and our services‑for‑equity program Design Capital. Zypsy Capital also invests $50K–$250K with optional hands‑on design support; see Zypsy Capital.
Get started
- Build or upgrade your multimodal agent with Zypsy: Contact us.
- See a high‑velocity proof in market: Captions case study.