Introduction

Voice user interfaces succeed when they feel natural, fast, and correct. This page consolidates proven conversation design patterns for turn‑taking and barge‑in, defines practical latency budgets, details confirmation/repair strategies, and provides SSML patterns and prototyping snippets (Voiceflow and Rasa). Zypsy applies these patterns within sprint-based engagements that integrate brand, product, and engineering, with optional services‑for‑equity via Design Capital and cash investment via Zypsy Capital. See our capabilities and investment pages for how we engage.

Turn‑taking and barge‑in

Core loop
Listen → Recognize (ASR) → Understand (NLU) → Plan (Policy) → Speak (TTS) → Listen.
Turn boundaries
Detect end‑of‑utterance (EOU) by combining a silence timeout with prosodic features; keep the timeout dynamic, adapting it to user tempo and domain risk.
Provide subtle audio cues at start/end of system turns; keep cues short (<150 ms) to avoid masking first phonemes.
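A minimal sketch of the dynamic silence timeout described above, assuming speech tempo is available as syllables per second and domain risk as a coarse label. All constants and names here are hypothetical and need calibration on real traffic; prosodic features are left out for brevity.

```python
# Hypothetical tuning constants; calibrate against real traffic.
BASE_TIMEOUT_MS = 350      # assumed default silence timeout
MIN_TIMEOUT_MS = 250
MAX_TIMEOUT_MS = 900
REFERENCE_RATE = 4.0       # assumed "typical" syllables per second
RISK_MULTIPLIER = {"low": 1.0, "medium": 1.2, "high": 1.5}


def eou_timeout_ms(syllables_per_sec: float, risk: str = "low") -> int:
    """Return the silence timeout to wait before declaring end-of-utterance.

    Slower speakers get a longer timeout, and higher-risk domains wait longer
    so that slots such as amounts or dates are not cut off.
    """
    rate = max(syllables_per_sec, 0.5)       # guard against division by zero
    tempo_factor = REFERENCE_RATE / rate     # > 1 for slow speakers
    timeout = BASE_TIMEOUT_MS * tempo_factor * RISK_MULTIPLIER[risk]
    return int(min(max(timeout, MIN_TIMEOUT_MS), MAX_TIMEOUT_MS))
```

Clamping to a minimum and maximum keeps one unusually fast or slow utterance from producing a timeout that either clips the user or feels unresponsive.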
Barge‑in policy
Allow interruption while TTS is speaking for: confirmations, lists, help, and any repeatable prompt; disable only during critical compliance statements.
On barge‑in, immediately: stop TTS, checkpoint dialog state, re‑score intent with higher weight on barge‑in tokens (first 500 ms), and route to repair or execute short‑path intent.
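The barge‑in steps above can be sketched as follows. This is an illustration under stated assumptions, not a production recognizer: `Token`, `Session`, the keyword-based scoring, and the 2x early-token weight are all hypothetical stand-ins for a real ASR/NLU stack.

```python
from dataclasses import dataclass, field

# Prompt types that may be interrupted; anything else (e.g. "compliance") may not.
INTERRUPTIBLE = {"confirmation", "list", "help", "repeatable"}
EARLY_WINDOW_MS = 500   # tokens heard this soon after barge-in onset get extra weight
EARLY_WEIGHT = 2.0      # assumed boost; tune on real data


@dataclass
class Token:
    text: str
    start_ms: int        # offset from the moment the user started speaking
    confidence: float


@dataclass
class Session:
    state: dict
    tts_playing: bool = True
    checkpoints: list = field(default_factory=list)


def score_intents(tokens, intent_keywords):
    """Sum keyword confidences per intent, boosting tokens in the early window."""
    scores = {intent: 0.0 for intent in intent_keywords}
    for tok in tokens:
        weight = EARLY_WEIGHT if tok.start_ms < EARLY_WINDOW_MS else 1.0
        for intent, keywords in intent_keywords.items():
            if tok.text.lower() in keywords:
                scores[intent] += weight * tok.confidence
    return scores


def handle_barge_in(session, prompt_type, tokens, intent_keywords):
    """Return the intent to execute, "repair", or None if the prompt must finish."""
    if prompt_type not in INTERRUPTIBLE:
        return None
    session.tts_playing = False                       # stop TTS
    session.checkpoints.append(dict(session.state))   # checkpoint dialog state
    scores = score_intents(tokens, intent_keywords)
    if not scores:
        return "repair"
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "repair"
```

Checkpointing before re‑scoring means a misfire can be rolled back to the interrupted prompt instead of restarting the task.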
Over‑talk handling
If the user starts speaking 0–300 ms after TTS start, assume impatience; prioritize user audio and cancel TTS.
If collision occurs mid‑sentence, truncate to next clause boundary before listening; never half‑speak a sensitive numeral (e.g., one‑time codes).
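One possible way to express the two over‑talk rules in code, assuming the TTS engine can report how many characters of the prompt have been spoken. The function names and the punctuation-based clause detector are illustrative assumptions, not part of any specific engine.

```python
import re

EARLY_OVERLAP_MS = 300
# A clause ends at punctuation followed by whitespace or end of text, so the
# comma inside "1,200" is not treated as a boundary.
CLAUSE_END = re.compile(r"[,;:.!?](?=\s|$)")


def overtalk_decision(user_onset_ms: int) -> str:
    """Pick a strategy from how long after TTS start the user began speaking."""
    if 0 <= user_onset_ms <= EARLY_OVERLAP_MS:
        return "cancel_tts"
    return "truncate_at_clause"


def truncate_at_clause(prompt: str, spoken_chars: int) -> str:
    """Return the text to finish speaking: everything up to the next clause end."""
    match = CLAUSE_END.search(prompt, spoken_chars)
    return prompt if match is None else prompt[:match.end()]
```

Because truncation runs forward to the next boundary rather than cutting at the collision point, a digit sequence that has already started is always completed before the system yields the turn.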

Latency budgets (end‑to‑end)

Aim for near‑instant feedback while preserving accuracy. Use the single‑turn budgets below (95th percentile) as default targets; tighten them for time‑critical domains (e.g., IVR deflection) and relax them for long‑form tasks (dictation). Optimize for stable p95s, not just p50s.

Stage	On‑device (ms)	Hybrid edge (ms)	Cloud (ms)	Notes
Wake/press to ready tone	50–120	60–150	80–180	Audible cue within 150 ms improves perceived responsiveness.
Streaming ASR first token	120–220	150–280	180–350	Partial hypotheses unblock NLU.
Intent ready (NLU)	200–350	230–420	260–500	Use incremental NLU on partial ASR.
Policy/action selection	5–30	10–40	15–60	Cache rules; pre‑compute slot prompts.
TTS first audio	120–220	150–280	180–350	Prefer neural streaming TTS.
Perceived turn response	≤ 700	≤ 900	≤ 1,200	Keep p95 sub‑second where possible.
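As a rough illustration of checking measured latencies against the "Perceived turn response" row, the sketch below computes a nearest‑rank p95 and compares it with the budget. The deployment keys and function names are assumptions chosen for this example.

```python
# p95 "perceived turn response" budgets from the table above, in ms.
TURN_BUDGET_MS = {"on_device": 700, "hybrid_edge": 900, "cloud": 1200}


def percentile(samples, pct: int):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100   # integer ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]


def within_budget(samples_ms, deployment: str):
    """Return (passes, p95) for one deployment's turn-latency samples."""
    p95 = percentile(samples_ms, 95)
    return p95 <= TURN_BUDGET_MS[deployment], p95
```

For example, twenty samples spread evenly from 100 ms to 1,050 ms have a nearest‑rank p95 of 1,000 ms, which passes the cloud budget but fails the on‑device one.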

Confirmation and repair strategies

When to confirm
Explicit confirmation for high‑risk intents (payments, bookings, PII). Use concise yes/no with a short summary.
Implicit confirmation for low‑risk intents by embedding the understood slot: “Playing Lofi Focus on Spotify.”
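A small sketch of how the explicit/implicit choice might be encoded. The intent names and the 0.85 confidence floor are hypothetical; routing low‑confidence turns to repair first is an added assumption rather than something the rules above require.

```python
# Hypothetical intent names; replace with your own risk classification.
HIGH_RISK_INTENTS = {"make_payment", "book_flight", "update_address"}
CONFIDENCE_FLOOR = 0.85   # assumed threshold below which we repair instead


def confirmation_mode(intent: str, confidence: float) -> str:
    """Return "repair", "explicit", or "implicit" for an understood turn."""
    if confidence < CONFIDENCE_FLOOR:
        return "repair"
    return "explicit" if intent in HIGH_RISK_INTENTS else "implicit"


def confirmation_prompt(mode: str, summary: str) -> str:
    """Build the spoken confirmation from a short summary of the slots."""
    if mode == "explicit":
        return f"{summary}. Is that right?"
    if mode == "implicit":
        return f"{summary}."
    raise ValueError(f"no confirmation prompt for mode {mode!r}")
```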
Repair taxonomy and prompts
ASR uncertainty (acoustic): “I may have misheard. Did you say ‘Paris’ or ‘Perris’?”
NLU ambiguity (semantic): “Got it. Are you asking to ‘transfer funds’ or ‘check balance’?”
Missing slot: “What date should I use?”
Business rule failure: “That flight is sold out. Would you like the 7:10 PM instead?”
Multi‑turn disambiguation
Ask one question per turn; confirm final composite before execution.
Offer escape hatches: “Say ‘start over’, ‘help’, or ‘agent’ at any time.”
Error limits
After two failed repairs, pivot modality (send a link, show choices) or escalate.
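The repair taxonomy and the two‑attempt limit could be combined in a small tracker like the one below. The prompt templates reuse the wording above; the class name, error keys, and escalation message are assumptions for illustration.

```python
from dataclasses import dataclass

MAX_REPAIRS = 2

REPAIR_PROMPTS = {
    "asr_uncertainty": "I may have misheard. Did you say '{a}' or '{b}'?",
    "nlu_ambiguity": "Got it. Are you asking to '{a}' or '{b}'?",
    "missing_slot": "What {slot} should I use?",
}


@dataclass
class RepairTracker:
    attempts: int = 0

    def next_step(self, error_type: str, **values):
        """Return ("repair", prompt) until the limit is hit, then ("escalate", msg)."""
        if self.attempts >= MAX_REPAIRS:
            # Hypothetical escalation copy; pivot modality or hand off here.
            return ("escalate", "Let me send you a link or connect you to an agent.")
        self.attempts += 1
        return ("repair", REPAIR_PROMPTS[error_type].format(**values))

    def reset(self):
        """Call after a successful turn so later errors get fresh attempts."""
        self.attempts = 0
```

Calling `next_step` on each understanding failure yields two repair prompts; the third failure, which arrives after both repairs have failed, triggers escalation.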

SSML patterns that scale

Use consistent SSML “tokens” so copywriters and engineers work from the same library.

Emphasis and pacing

<speak>
  <p><s>Heads up.</s> <break time="120ms"/> <emphasis level="moderate">Your transfer is almost done.</emphasis></p>
</speak>

Numbers, currencies, and spelling

<speak>
  <say-as interpret-as="digits">4 2 9 7</say-as>
  <break time="80ms"/>
  <say-as interpret-as="currency">USD39.99</say-as>
</speak>

Lists and turn‑shortening

<speak>
  <p>I found three options.</p>
  <break time="80ms"/>
  <p>Say <emphasis level="reduced">one</emphasis> for Downtown, <emphasis level="reduced">two</emphasis> for Airport, or <emphasis level="reduced">three</emphasis> for Riverside.</p>
</speak>

Readability under noise

<speak>
  <prosody rate="90%" pitch="-1st">Security code is</prosody>
  <say-as interpret-as="digits">7 0 2 1</say-as>
</speak>
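One way to keep these patterns consistent is to wrap them in small helper functions that writers and engineers both call. The sketch below is an assumed design, not an existing library; the function names are hypothetical, and only standard SSML elements already shown above are emitted.

```python
from xml.sax.saxutils import escape

EMPHASIS_LEVELS = {"strong", "moderate", "none", "reduced"}


def pause(ms: int) -> str:
    return f'<break time="{int(ms)}ms"/>'


def emphasis(text: str, level: str = "moderate") -> str:
    if level not in EMPHASIS_LEVELS:
        raise ValueError(f"unknown emphasis level: {level}")
    return f'<emphasis level="{level}">{escape(text)}</emphasis>'


def digits(value) -> str:
    """Speak a code digit by digit; non-digit characters are dropped."""
    spaced = " ".join(ch for ch in str(value) if ch.isdigit())
    return f'<say-as interpret-as="digits">{spaced}</say-as>'


def slow(text: str, rate: str = "90%", pitch: str = "-1st") -> str:
    """Slightly slower, lower delivery for noisy environments."""
    return f'<prosody rate="{rate}" pitch="{pitch}">{escape(text)}</prosody>'


def speak(*parts: str) -> str:
    """Wrap already-built SSML fragments in a single <speak> root."""
    return "<speak>" + "".join(parts) + "</speak>"
```

For instance, `speak(slow("Security code is"), digits("7021"))` reproduces the "Readability under noise" pattern, and escaping in `emphasis` and `slow` keeps characters such as `&` from breaking the markup.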

Prompt + audio timing micro‑gallery

Concise timing charts help teams reason about perceived speed. Example timelines (ms; target p95):

Fast confirm (barge‑in‑friendly)

0  — user ends
120— ASR partial → NLU
220— policy
380— TTS starts
380–850— user can barge‑in

Disambiguation with list

0   — user ends
180 — NLU detects 2 intents
260 — TTS prompt (3 choices)
260–1200 — barge‑in window
1200 — capture selection

Sensitive action (explicit confirm)

0   — user intent recognized
240 — summarize slots
420 — ask yes/no
≤900— await user; repeat once at 6 s

Prototyping snippets (Voiceflow and Rasa)

Voiceflow‑style flow (pseudo‑JSON)

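{
  "nodes": [
    {"id":"start", "type":"start", "next":"capture"},
    {"id":"capture", "type":"capture", "slot":"destination", "prompt":"Where would you like to go?", "next":"check"},
    {"id":"check", "type":"if", "cond":"conf>0.85", "then":"confirm", "else":"disambiguate"},
    {"id":"disambiguate", "type":"speak", "ssml":"<speak>I may have misheard. Where would you like to go?</speak>", "bargeIn": true, "next":"capture"},
    {"id":"confirm", "type":"speak", "ssml":"<speak>Going to {destination}. Is that right?</speak>", "bargeIn": true, "next":"choice"},
    {"id":"choice", "type":"choice", "choices":{"yes":"book", "no":"capture"}},
    {"id":"book", "type":"api", "name":"bookRide"}
  ],
  "settings": {"eouMs": 350, "tts":"neural-streaming"}
}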

Rasa (domain, NLU, stories)

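# domain.yml (excerpt)

version: "3.1"
intents:
  - book_ride
  - inform
  - affirm
entities:
  - destination
slots:
  destination:
    type: text
    mappings:              # Rasa 3.x requires explicit slot mappings
      - type: from_entity
        entity: destination
responses:
  utter_ask_destination:
    - text: "Where would you like to go?"
  utter_confirm:
    - text: "Going to {destination}. Is that right?"
actions:
  - action_book_ride       # custom action, implemented separately

# nlu.yml (excerpt)

version: "3.1"
nlu:
  - intent: book_ride
    examples: |
      - book a ride to [airport](destination)
      - get me to [downtown](destination)

# stories.yml (excerpt)

version: "3.1"
stories:
  - story: happy path
    steps:
      - intent: book_ride
      - action: utter_ask_destination
      - intent: inform       # user supplies the destination
        entities:
          - destination: "airport"
      - slot_was_set:
          - destination: "airport"
      - action: utter_confirm
      - intent: affirm
      - action: action_book_ride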

Evaluation and instrumentation

Core metrics: task success, turns per task, correction/repair rate, barge‑in rate, abandonment, WER/CER, and end‑to‑end latency p50/p95/p99.
Logging schema (min): user_id (pseudonymous), session_id, locale, device, ASR hypothesis + confidence, NLU intent/slots + confidence, policy action, TTS voice, durations, barge‑in events, outcome.
Guardrails: cap prompt length (<14 words avg), track prompt reuse ratio, and maintain copy library diffs with A/B IDs.
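To make the metrics concrete, here is a minimal sketch of a per‑turn log record plus two of the computations named above: word error rate (word‑level edit distance divided by reference length) and barge‑in/repair rates. The `TurnLog` fields are a reduced, hypothetical subset of the logging schema.

```python
from dataclasses import dataclass


@dataclass
class TurnLog:
    session_id: str
    asr_hypothesis: str
    intent: str
    latency_ms: int
    barge_in: bool = False
    repair: bool = False


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))          # distances for an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def summarize(turns):
    """Return barge-in and repair rates over a list of TurnLog records."""
    n = len(turns)
    if n == 0:
        return {"barge_in_rate": 0.0, "repair_rate": 0.0}
    return {
        "barge_in_rate": sum(t.barge_in for t in turns) / n,
        "repair_rate": sum(t.repair for t in turns) / n,
    }
```

For example, recognizing "book a ride to perris" against the reference "book a ride to paris" is one substitution in five words, a WER of 0.2.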

Accessibility and inclusion

Support accents and disfluencies via diverse acoustic models; provide a slower speech mode and numeric repetition on request.
Alternate modalities: on‑screen captions, visual lists, and touch selection for repairs.
Safety words: always‑on commands like “stop,” “repeat,” “agent.”

Security and privacy

Redact PII in logs; store audio ephemerally unless user opts in.
Encrypt in transit/at rest; prefer on‑device wake word; document third‑party processors in privacy notices.
For regulated flows (health/finance), require explicit confirmation and record immutable audit events.
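As a starting point for log redaction, the sketch below masks email addresses and digit runs in transcripts before they are stored. The patterns are deliberately simple assumptions: they will miss names and addresses and will also mask harmless numbers such as years, so treat this as a floor rather than a complete PII filter.

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Four or more digits, optionally separated by single spaces or hyphens,
# which covers spoken codes ("7 0 2 1") and card-style groupings.
DIGIT_RUN = re.compile(r"\d(?:[ -]?\d){3,}")


def redact(text: str) -> str:
    """Replace emails and long digit sequences with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)       # emails first, so their digits survive intact
    return DIGIT_RUN.sub("[NUMBER]", text)
```

Short numbers such as the time in "7:10 PM" are left untouched because no run reaches four digits.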

What Zypsy delivers in a VUI sprint

Conversation architecture: intents, entities/slots, dialog policy, error taxonomy, and state machine.
Prompt library: tone, style guide, and SSML token set mapped to use cases.
Latency plan: target budgets (p50/p95), measurement hooks, and remediation backlog.
Prototype assets: Voiceflow and Rasa projects with sample journeys and analytics events.
Brand coherence: voice persona aligned with brand identity across product and web.
Handoff package: test cases, success metrics, and escalation paths.

Engage via an upfront sprint or, for eligible founders, through Design Capital where Zypsy exchanges 8–10 weeks of senior design (up to ~$100k value) for ~1% equity via SAFE. To discuss a VUI sprint or investment paired with design support, contact us via Zypsy Capabilities or Contact.