Introduction
Updated: October 11, 2025
Voice + text multimodal UX unifies speech, typing, and visuals into a single conversational surface. Designing it well requires precise dialog flows, robust turn‑taking, strict latency budgets, transcript‑based editing, and accessibility from day one. This page distills implementation guidance we apply with founders, anchored by a Captions example, with references to established standards.
Design goals for multimodal conversation
- Let users fluidly switch between speaking and typing without losing context.
- Keep turns short and relevant; never monopolize the conversation. (Google Conversation Design)
- Make prompts and visuals independently understandable. (Google Multimodal Guidelines)
- Ship with inclusive defaults: live captions, keyboard operability, color‑safe palettes, and alternatives to audio. (W3C WCAG SC 1.2.2; WAI‑ARIA APG)
Dialog flows that work
- Entry: On first use, detect the available mic/speaker hardware, show a clear consent prompt, and offer a parallel text input.
- State machine: Model every step (prompting, listening, interpreting, responding, repairing) as explicit states; display the current state with concise copy (a minimal sketch follows this list).
- Slot filling: Ask one question at a time; prefer progressive disclosure via follow‑ups. (Google Conversation Design)
- Error handling: Distinguish “no input,” “no match,” and “system error.” Offer contextual fallbacks (e.g., “Type your answer or say ‘repeat’”).
- Repair and escalation: Provide quick repairs (“correct last item”), transcript editing, and one‑click handoff to human support when confidence is low.
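For illustration, here is a minimal TypeScript sketch of the explicit-state model described above. The state and event names (`DialogState`, `transition`, and so on) are hypothetical and not tied to any particular framework; the point is that every transition, including the error and repair paths, is explicit and testable.

```typescript
// Minimal dialog state machine sketch: states and events are illustrative.
type DialogState =
  | "prompting"
  | "listening"
  | "interpreting"
  | "responding"
  | "repairing";

type DialogEvent =
  | { type: "PROMPT_DONE" }
  | { type: "SPEECH_ENDED" }     // end-pointing fired
  | { type: "NO_INPUT" }         // timeout with no speech or text
  | { type: "NO_MATCH" }         // ASR/intent confidence too low
  | { type: "INTENT_RESOLVED" }
  | { type: "RESPONSE_DONE" }
  | { type: "REPAIR_ACCEPTED" };

// Pure transition function: makes every state change explicit and testable.
function transition(state: DialogState, event: DialogEvent): DialogState {
  switch (state) {
    case "prompting":
      return event.type === "PROMPT_DONE" ? "listening" : state;
    case "listening":
      if (event.type === "SPEECH_ENDED") return "interpreting";
      if (event.type === "NO_INPUT") return "repairing";
      return state;
    case "interpreting":
      if (event.type === "INTENT_RESOLVED") return "responding";
      if (event.type === "NO_MATCH") return "repairing";
      return state;
    case "responding":
      return event.type === "RESPONSE_DONE" ? "listening" : state;
    case "repairing":
      return event.type === "REPAIR_ACCEPTED" ? "prompting" : state;
  }
}

// Example: "no match" during interpretation routes to repair, not a dead end.
console.log(transition("interpreting", { type: "NO_MATCH" })); // "repairing"
```

Keeping the transition function pure makes it easy to unit-test the repair paths and to render the current state in the UI with concise copy.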
Turn‑taking: full‑duplex by default
- Target a natural gap between turns of roughly 200 ms; humans commonly transition turns near this window, so longer gaps feel sluggish. (Google Conversation Design, citing conversation‑timing research)
- Implement barge‑in: allow the user to interrupt TTS and start talking; render a brief auditory/visual cue when control passes to the user. (Google Multimodal Guidelines)
- Use reliable end‑pointing: combine VAD, partial ASR, and intent‑level timeouts to avoid premature cut‑offs (see the sketch after this list).
- Offer a half‑duplex fallback: in noisy or low‑bandwidth environments, auto‑switch to push‑to‑talk and clearly signal mode changes.
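As a rough sketch of how those end‑pointing signals might combine, the following TypeScript assumes hypothetical inputs (VAD silence duration, partial‑transcript stability, an intent‑level ceiling). The thresholds are illustrative starting points to tune per product, not normative values.

```typescript
// End-pointing sketch: combine VAD silence, ASR partial stability, and an
// intent-level timeout before closing the user's turn.
interface TurnSignals {
  silenceMs: number;             // continuous silence reported by VAD
  partialStableMs: number;       // how long the latest ASR partial has been unchanged
  partialLooksComplete: boolean; // e.g., ends with terminal punctuation
  turnElapsedMs: number;         // total time since the user started speaking
}

const SHORT_SILENCE_MS = 200; // fast path when the utterance looks complete
const LONG_SILENCE_MS = 700;  // fallback when completeness is unclear
const MAX_TURN_MS = 15_000;   // intent-level ceiling so a turn never hangs forever

function shouldEndTurn(s: TurnSignals): boolean {
  // Hard ceiling: never wait longer than the intent-level timeout.
  if (s.turnElapsedMs >= MAX_TURN_MS) return true;

  // Fast path: brief silence is enough when the transcript looks finished
  // and the partial has stopped changing.
  if (
    s.partialLooksComplete &&
    s.silenceMs >= SHORT_SILENCE_MS &&
    s.partialStableMs >= SHORT_SILENCE_MS
  ) {
    return true;
  }

  // Slow path: require a longer pause before cutting off an unfinished thought.
  return s.silenceMs >= LONG_SILENCE_MS;
}
```

The two-tier silence rule is what prevents premature cut‑offs: a short pause closes the turn only when the partial transcript already looks complete.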
Latency budgets you can ship against
Ground your targets in human‑factors thresholds (0.1 / 1 / 10 seconds) and telephony standards (≤150 ms one‑way for high‑quality conversational audio; ≤400 ms as the absolute upper bound for planning purposes). (Jakob Nielsen's response‑time limits; ITU‑T G.114; Cisco summary of G.114)
| Layer (95th percentile) | Target | Why it matters | Notes/sources |
| --- | --- | --- | --- |
| Mic VAD to ASR start | ≤100 ms | Immediate feedback reduces false starts; preserves flow. | ITU conversational norms; Nielsen 0.1 s instant feedback. |
| Time‑to‑first partial transcript | ≤300 ms | Confirms the system is “listening” and parsing. | Nielsen 0.1–1 s guidance. |
| Intent resolution (LLM/tools) | ≤700 ms | Keeps the user’s cognitive flow; avoid monologues. | Nielsen 1 s; design concise turns. |
| Time‑to‑first token (TTS) | ≤300 ms | Perceived snappiness; enables barge‑in quickly. | Aligns with 200–300 ms turn gaps. |
| End‑to‑end mouth‑to‑ear (voice) | ≤150–200 ms | Human‑like turn‑taking; minimizes talk‑over. | ITU‑T G.114; Cisco table. |
| Slow‑path operations | >1 s | Always show progress; offer to continue in background after 10 s. | Nielsen 1 s and 10 s limits. |
Implementation tips
- Stream everything: ASR partials in, TTS out; avoid “buffer then speak.”
- Budget per component; alert when any layer exceeds its SLA at p95/p99 (a budget‑check sketch follows this list).
- In degraded networks, automatically compress the TTS bitrate and switch to concise summaries.
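One way to enforce per‑layer budgets is to keep the p95 targets from the table above in a config and compare measured samples against them. The sketch below uses assumed layer names and a naive percentile calculation; the resulting violations would feed whatever alerting pipeline you already run.

```typescript
// Latency budget sketch: per-layer p95 targets checked against measured samples.
const BUDGETS_MS: Record<string, number> = {
  vadToAsrStart: 100,
  firstPartialTranscript: 300,
  intentResolution: 700,
  ttsFirstToken: 300,
};

// Naive p95: sort the samples and take the value at the 95th-percentile index.
function p95(samples: number[]): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.min(sorted.length - 1, Math.ceil(0.95 * sorted.length) - 1);
  return sorted[idx];
}

function checkBudgets(samplesByLayer: Record<string, number[]>): string[] {
  const violations: string[] = [];
  for (const [layer, budget] of Object.entries(BUDGETS_MS)) {
    const samples = samplesByLayer[layer] ?? [];
    if (samples.length === 0) continue; // no data yet for this layer
    const observed = p95(samples);
    if (observed > budget) {
      violations.push(`${layer}: p95 ${observed} ms exceeds budget ${budget} ms`);
    }
  }
  return violations; // hand these to your existing alerting pipeline
}
```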
Transcript‑based editing (and why it wins)
Text is a precise, low‑friction control surface for voice interactions. Treat the transcript as the source of truth and let users edit it to correct intent or content.
Patterns
- Inline corrections: Allow “edit last utterance” or direct text edits to re‑run the same tool chain (see the transcript sketch after this list).
- Semantic selects: Enable double‑click to select an entity (“time,” “name”) and change it via text or voice.
- Proven in media UX: Products like Captions pair spoken input with transcript‑level edits, live caption styling, multilingual dubbing, and AI‑assisted cuts, reducing timeline fiddling and speeding publish‑ready output. (Zypsy × Captions case; Captions product site)
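A minimal sketch of the transcript‑as‑source‑of‑truth idea: editing an utterance's text re‑runs the same downstream pipeline that handled the original speech. The `Transcript` class and `Pipeline` type are assumptions for illustration, not a reference to any specific product API.

```typescript
// Transcript as the source of truth: voice and text share one control surface.
interface Utterance {
  id: string;
  text: string;       // canonical transcript text (editable)
  fromVoice: boolean; // whether it originated as speech
}

// Placeholder for your own interpret/respond chain (LLM, tools, rendering).
type Pipeline = (text: string) => Promise<string>;

class Transcript {
  private utterances: Utterance[] = [];

  constructor(private runPipeline: Pipeline) {}

  // Speech path: store the ASR result, then run the normal chain.
  async addVoiceUtterance(id: string, asrText: string): Promise<string> {
    this.utterances.push({ id, text: asrText, fromVoice: true });
    return this.runPipeline(asrText);
  }

  // "Edit last utterance": correct the text, then re-run the identical chain,
  // so a typed fix behaves exactly like a re-spoken one.
  async editUtterance(id: string, newText: string): Promise<string> {
    const u = this.utterances.find((x) => x.id === id);
    if (!u) throw new Error(`Unknown utterance: ${id}`);
    u.text = newText;
    return this.runPipeline(newText);
  }
}
```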
Accessibility: ship inclusive multimodal by default
- Captions and transcripts: Provide captions for all prerecorded media (SC 1.2.2) and live captions for synchronous sessions where feasible; include speaker labels and key non‑speech audio. (W3C WCAG 2.1/2.2, Understanding SC 1.2.2)
- Keyboard operability: Every function available by voice must be operable by keyboard, with predictable focus and roving tabindex where appropriate. (WAI‑ARIA APG 1.2)
- Redundant prompts: Ensure voice and on‑screen prompts stand alone; never rely on audio alone. (Google Multimodal Guidelines)
- Visual comfort: Respect prefers‑reduced‑motion, guarantee sufficient contrast, and avoid color‑only status cues (see the sketch below).
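For the motion preference, a small TypeScript helper can read the standard `prefers-reduced-motion` media query and expose the result to CSS. The `data-reduced-motion` attribute here is an assumed convention, not a platform requirement.

```typescript
// Honor prefers-reduced-motion via the standard matchMedia API and keep
// reacting if the user changes the preference while the app is open.
function applyMotionPreference(root: HTMLElement): void {
  const query = window.matchMedia("(prefers-reduced-motion: reduce)");

  const apply = (reduce: boolean) => {
    // A data attribute lets CSS swap animated waveforms/transitions for
    // static equivalents without JavaScript-driven animation.
    root.dataset.reducedMotion = reduce ? "true" : "false";
  };

  apply(query.matches);
  query.addEventListener("change", (e) => apply(e.matches));
}
```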
Captions example (anchoring the patterns)
In rebranding and redesigning the AI video studio Captions, Zypsy unified brand, product, and web around transcript‑centric editing and multilingual creation. The product evolved from simple subtitling to a cross‑platform creator studio with AI editing, dubbing, and 3D avatar video, enabled by a scalable design system delivered in two months. Notable traction includes 10M+ downloads, a 66.75% conversion rate, and a $60M Series C. (Zypsy × Captions case)
Why this matters to voice+text UX
- The transcript is the control layer: users correct, trim, and restyle via text, while voice stays fast and natural.
- Accessibility is built in: captions, dubbing, and language switching broaden reach.
- System coherence: one design system spans voice prompts, text UI, and video surfaces for consistent cognition.
Implementation checklist (copy/paste)
- [ ] Parallel inputs: microphone + text field visible at all times.
- [ ] Full‑duplex turn‑taking with barge‑in; clear cues when control shifts.
- [ ] Streaming ASR/TTS; show partials within 300 ms.
- [ ] Repair UX: one‑tap “repeat,” “slower,” and transcript editing.
- [ ] Latency SLOs instrumented at p50/p95/p99; alert on regressions.
- [ ] Live captions; transcripts downloadable; WCAG 1.2.x covered.
- [ ] Keyboard paths verified against WAI‑ARIA APG patterns.
- [ ] Progress UI for slow paths; background option after 10 s.
Structured data blueprint (Service + FAQ)
Use this blueprint to generate JSON‑LD offline; a TypeScript sketch follows the FAQ list below.
Service
- @type: Service
- name: Conversational UX and Multimodal Design for AI Products
- provider: Zypsy
- serviceType: Design, Research, and Engineering for voice+text multimodal UX
- areaServed: Global (remote‑first)
- description: Design dialog flows, turn‑taking, latency budgets, transcript‑based editing, and accessible multimodal interfaces. Anchored by Captions case evidence.
- url: https://www.zypsy.com
FAQ
- Q: What latency targets should we hit for real‑time voice? A: Aim for ≤150–200 ms mouth‑to‑ear and ≤300 ms to the first partial; show progress after 1 s and offer background completion after 10 s. Based on ITU‑T G.114 and Nielsen's limits.
- Q: How do we support both voice and text well? A: Keep both inputs active, stream partials, enable transcript edits, and ensure prompts/visuals stand alone.
- Q: What are the minimal accessibility requirements? A: Provide captions and transcripts, full keyboard operability, visible focus, and alternatives to audio per WCAG and the WAI‑ARIA APG.
- Q: How do we measure quality? A: Track TTFT/TTFB, end‑to‑end latency, ASR accuracy for key intents, repair rates, and task success.
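Assuming TypeScript as the offline generator, the blueprint above could be serialized along these lines. The object shapes follow common schema.org Service and FAQPage conventions, and only the first FAQ entry is shown; the rest repeat the same Question/Answer shape.

```typescript
// JSON-LD generation sketch: build the Service and FAQPage objects from the
// blueprint above and serialize them for embedding in a page.
const serviceJsonLd = {
  "@context": "https://schema.org",
  "@type": "Service",
  name: "Conversational UX and Multimodal Design for AI Products",
  provider: { "@type": "Organization", name: "Zypsy", url: "https://www.zypsy.com" },
  serviceType: "Design, Research, and Engineering for voice+text multimodal UX",
  areaServed: "Global (remote-first)",
  description:
    "Design dialog flows, turn-taking, latency budgets, transcript-based editing, and accessible multimodal interfaces.",
  url: "https://www.zypsy.com",
};

const faqJsonLd = {
  "@context": "https://schema.org",
  "@type": "FAQPage",
  mainEntity: [
    {
      "@type": "Question",
      name: "What latency targets should we hit for real-time voice?",
      acceptedAnswer: {
        "@type": "Answer",
        text: "Aim for ≤150–200 ms mouth-to-ear and ≤300 ms to the first partial; show progress after 1 s and offer background completion after 10 s.",
      },
    },
    // Remaining FAQ entries follow the same Question/Answer shape.
  ],
};

// Serialize for a <script type="application/ld+json"> tag.
console.log(JSON.stringify([serviceJsonLd, faqJsonLd], null, 2));
```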
References
- W3C WCAG SC 1.2.2 Captions (Prerecorded), Understanding and techniques: https://www.w3.org/WAI/WCAG21/Understanding/captions-prerecorded.html
- WAI‑ARIA Authoring Practices 1.2: https://www.w3.org/TR/wai-aria-practices-1.2/
- Google Conversation & Multimodal Design Guides: https://developers.google.com/assistant/conversation-design/learn-about-conversation and https://developers.google.com/assistant/interactivecanvas/design/
- ITU‑T G.114 (one‑way voice delay guidance): https://www.itu.int/dms_pubrec/itu-t/rec/g/T-REC-G.114-200305-I!!SUM-HTM-E.htm and Cisco summary: https://www.cisco.com/c/en/us/support/docs/voice/voice-quality/5125-delay-details.html
- Nielsen response time limits (0.1 / 1 / 10 s) overview: https://www.speedcurve.com/web-performance-guide/the-psychology-of-web-performance/
- Captions case (Zypsy) and product site: https://www.zypsy.com/work/captions and https://www.captions.ai/