## Introduction
Designing a high‑performing copilot (LLM‑powered assistant embedded in your product) requires more than prompts. This guide details a practical UX framework across five pillars—intent modeling, tool use, guardrails, memory UX, and evaluation—grounded in Zypsy’s work designing AI products for founders. See related examples in Captions, Copilot Travel, Robust Intelligence, and Crystal DBA. For broader capabilities, visit Zypsy Capabilities.
## System framing for copilots
A copilot is an intent‑driven orchestration layer over tools and data, with safety and memory shaped by user consent.
- Objectives: accelerate time‑to‑value, reduce operational toil, and unlock tasks that are hard to script.
- Surfaces: inline suggestions, side panels, chat, command palettes, and background automations.
- Constraints: latency budgets, safety policies, data residency, access control, and observability.
Use this simple map when you scope a copilot:
- User intents → capability graph → tool adapters → policy/guardrails → memory → outputs → feedback/telemetry.
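The scoping map above can be sketched as a minimal data model. This is an illustrative sketch, not a prescribed schema — all class and field names are assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Intent:
    goal: str                                        # e.g. "fix build"
    slots: dict = field(default_factory=dict)        # repo, branch, env
    constraints: dict = field(default_factory=dict)  # SLA, cost caps

@dataclass
class ToolCall:
    tool: str    # resolved from the capability graph to a tool adapter
    params: dict

@dataclass
class RunRecord:
    run_id: str
    intent: Intent
    plan: list                       # ordered ToolCalls
    policy_decisions: list           # guardrail outcomes
    outputs: list                    # human-readable summaries plus raw artifacts
    feedback: Optional[str] = None   # telemetry / user rating
```

Keeping the run record as one artifact makes the feedback/telemetry stage trivial to wire up later.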
## Pillar 1: Intent modeling
Translate messy user asks into precise actions with explicit disambiguation.
- Build an intent taxonomy: goals (e.g., “fix build”), sub‑goals (e.g., “rerun flaky tests”), slots (repo, branch, env), constraints (SLA, cost caps), and success criteria.
- Design disambiguation patterns:
  - Lightweight clarifiers: single‑turn questions when confidence is low.
  - Candidate plans: show the top 1–3 plans with editable parameters before execution.
  - Safe defaults: pre‑fill from context (selection, page, entity) and show how each value was inferred.
- Instrument training data: curate gold tasks, counter‑examples, and edge cases; label success/failure states and side effects.
- UX artifacts to ship: intent map, prompt and tool schemas, example pairs, fallback tree, and error taxonomy.
Signals to watch: clarification rate, plan acceptance rate, and abandon vs. correct‑and‑retry.
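The clarifier and candidate-plan patterns above reduce to a simple confidence gate. A minimal sketch, assuming plans arrive ranked with confidence scores (the threshold value is illustrative):

```python
# Threshold below which we ask a single-turn clarifier instead of proposing plans.
CLARIFY_THRESHOLD = 0.7

def next_step(candidate_plans):
    """candidate_plans: list of (plan, confidence) pairs, highest confidence first."""
    best_plan, confidence = candidate_plans[0]
    if confidence < CLARIFY_THRESHOLD:
        # Lightweight clarifier: one question, then re-rank.
        return ("clarify", f"Did you mean: {best_plan}?")
    # Candidate plans: surface the top 1-3 with editable parameters before execution.
    return ("propose", [plan for plan, _ in candidate_plans[:3]])

# Low confidence triggers a clarifier rather than a plan proposal.
action, payload = next_step([("rerun flaky tests", 0.55), ("fix build", 0.30)])
```

Logging which branch fired gives you the clarification rate and plan acceptance rate directly.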
## Pillar 2: Tool use and orchestration
Make tool execution reliable, legible, and reversible where possible.
- Tool catalog: define each tool’s contract (name, purpose, required/optional params, auth scope, latency budget, idempotency).
- Orchestration patterns:
  - Plan‑then‑execute: present a plan; allow users to edit it; run with an audit trail.
  - Stepwise with checkpoints: pause before risky steps; require confirmation or a policy pass.
  - Parallelizable steps: batch safe reads; serialize writes.
- Error UX: structured errors with remediation suggestions; automatic retries for transient classes; graceful degradation to read‑only insights.
- Output UX: return both a human‑readable summary and the raw artifacts (logs, diffs, links) so users can verify.
- Observability: assign a Run ID; log tool calls, inputs (redacted), outputs, policy decisions, and user overrides for post‑hoc review.
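A tool contract plus an audit entry can be sketched in a few lines. The catalog entry, field names, and the stubbed adapter call are all illustrative assumptions:

```python
import time
import uuid

# Illustrative contract for one tool; in practice these are versioned schemas.
TOOL_CATALOG = {
    "list_failing_tests": {
        "purpose": "Read recent CI failures",
        "required": ["repo", "branch"],
        "optional": ["limit"],
        "auth_scope": "ci:read",
        "latency_budget_ms": 2000,
        "idempotent": True,
    },
}

def call_tool(name, params, audit_log):
    contract = TOOL_CATALOG[name]
    missing = [p for p in contract["required"] if p not in params]
    if missing:
        raise ValueError(f"{name}: missing required params {missing}")
    run_id = str(uuid.uuid4())            # Run ID for post-hoc review
    started = time.monotonic()
    result = {"status": "ok"}             # placeholder for the real adapter call
    audit_log.append({
        "run_id": run_id,
        "tool": name,
        "params_redacted": sorted(params),  # log param keys only; redact values
        "latency_ms": (time.monotonic() - started) * 1000,
        "result_status": result["status"],
    })
    return result

log = []
call_tool("list_failing_tests", {"repo": "app", "branch": "main"}, log)
```

Validating required params before execution turns a whole class of tool failures into a structured, recoverable error.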
## Pillar 3: Guardrails and safety
Codify what the copilot may do, when, and on whose behalf—then show that policy to users.
- Policy layers:
  - Pre‑filters: input validation, PII detection, permission checks.
  - Execution gates: role/entitlement checks, approval workflows, blast‑radius limits (e.g., max rows affected, budget caps).
  - Post‑filters: toxicity/unsafe‑content checks, data‑loss prevention, provenance and citation requirements.
- Risk UX:
  - Explain why an action is blocked and the path to proceed (change scope, request approval, or simulate).
  - Offer “simulate first” for destructive ops; present diffs before apply.
- Enterprise readiness: immutable audit logs, redaction, regional data controls, and incident review flows.
Example: Robust Intelligence treats AI safety and governance as first‑class concerns. Enterprise copilots benefit from similar pre‑deployment checks, automated stress tests, and clear operator controls.
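The three policy layers compose naturally as a chain of checks around execution. A minimal sketch, with deliberately crude checks and illustrative limits — the point is the shape, not the specific rules:

```python
def pre_filter(request):
    # Crude stand-in for PII/secret detection and input validation.
    if "password" in request["input"].lower():
        return "blocked: input appears to contain a secret"
    return None

def execution_gate(request):
    # Blast-radius limit; the default cap of 100 rows is illustrative.
    if request["rows_affected"] > request.get("max_rows", 100):
        return "blocked: exceeds row limit; narrow scope or request approval"
    return None

def post_filter(result):
    # Provenance requirement: outputs must carry citations.
    if not result.get("citations"):
        return "blocked: output lacks citations"
    return None

def run_with_guardrails(request, execute):
    for gate in (pre_filter, execution_gate):
        reason = gate(request)
        if reason:
            # Risk UX: explain why it was blocked and how to proceed.
            return {"ok": False, "reason": reason}
    result = execute(request)
    reason = post_filter(result)
    if reason:
        return {"ok": False, "reason": reason}
    return {"ok": True, "result": result}
```

Returning a reason string (rather than silently failing) is what makes the “why blocked, how to proceed” risk UX possible.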
## Pillar 4: Memory UX
Remember with consent. Forget on request. Make it editable.
- Memory types:
  - Session memory: short‑lived context such as the last entity, filter, or file.
  - Long‑term memory: explicit user preferences, saved entities, named workflows.
  - Derived memory: learned heuristics (e.g., preferred tone), only if the user opts in.
- UX patterns:
  - Memory panel: show what’s remembered and why; allow edit/delete; show TTL.
  - Scoped recall: “Use past settings from Project X only.”
  - Sensitive defaults: do not memorize secrets; store references, not raw values.
- Governance: per‑workspace policies, exportability, and cross‑device sync with the principle of least privilege.
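The consent, TTL, and scoped-recall rules above can be enforced at the storage layer itself. A minimal in-memory sketch (class and method names are illustrative):

```python
import time

class MemoryStore:
    """Consent-aware memory with TTL and scoped recall; schema is illustrative."""

    def __init__(self):
        self._items = {}  # key -> (value, scope, expires_at)

    def remember(self, key, value, scope, ttl_s, consented=False):
        # Consent is required at write time, not checked after the fact.
        if not consented:
            raise PermissionError("memory writes require explicit consent")
        self._items[key] = (value, scope, time.time() + ttl_s)

    def recall(self, key, scope):
        item = self._items.get(key)
        if item is None:
            return None
        value, item_scope, expires_at = item
        # Scoped recall plus TTL: expired or out-of-scope entries read as absent.
        if time.time() > expires_at or item_scope != scope:
            return None
        return value

    def forget(self, key):
        # User-initiated delete from the memory panel.
        self._items.pop(key, None)
```

Making consent a write-time precondition means the memory panel only ever shows entries the user actually approved.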
## Pillar 5: Evaluation
Measure user value, safety, and system reliability—continuously.
- Offline eval: task suites with gold answers, adversarial sets, safety red‑team prompts, and tool‑call accuracy checks.
- Online eval: success rate, time‑to‑task, correction rate, fallback rate, plan acceptance, incident rate, and user‑rated helpfulness.
- Funnel metrics: activation → successful first task → repeat usage → retained power users. For consumer contexts, track conversion and time‑to‑value. For example, Captions reports 10M downloads and a 66.75% conversion rate on its platform; design and UX systems that remove friction can materially influence such outcomes.
- Experimentation: A/B test guardrail thresholds, UI disambiguation depth, tool‑selection strategies, and memory defaults with guard‑banded cohorts.
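An offline tool-call accuracy check is small enough to start on day one. A sketch with an illustrative two-case gold suite and a stub predictor standing in for the copilot’s tool-selection step:

```python
# Tiny gold suite: each case pairs a user task with the expected tool choice.
GOLD_SUITE = [
    {"task": "rerun flaky tests", "gold_tool": "rerun_tests"},
    {"task": "show failing builds", "gold_tool": "list_failing_tests"},
]

def tool_call_accuracy(predict, suite):
    """Fraction of cases where the predicted tool matches the gold tool."""
    correct = sum(1 for case in suite
                  if predict(case["task"]) == case["gold_tool"])
    return correct / len(suite)

# Stub predictor; in practice this wraps the copilot's planning step.
stub = {"rerun flaky tests": "rerun_tests",
        "show failing builds": "list_failing_tests"}.get
accuracy = tool_call_accuracy(stub, GOLD_SUITE)
```

The same harness extends to adversarial and safety cases by adding rows, which is why small expert-labeled suites are a workable starting point.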
## Patterns by surface
Use the right UI surface for the job.
| Surface | Best for | Design goals | Key risks |
|---|---|---|---|
| Inline suggestions | Micro‑edits, contextual help | Low friction, speed | Unwanted auto‑changes, distraction |
| Side panel | Multi‑step tasks with references | Visibility plus workspace context | Context switching |
| Chat | Exploratory asks, debugging | Natural‑language breadth | Ambiguity, long‑turn drift |
| Command palette | Power‑user actions | Speed and recall | Discoverability |
| Background automations | Routine upkeep | Reliability and alerts | Silent failures |
## What Zypsy ships in a copilot engagement
Our sprint‑based delivery is tuned for founders. See Zypsy Capabilities.
- Product framing: capability graph, user/jobs map, success criteria.
- UX system: surfaces, states, empty/error/confirm patterns, and content strategy.
- Prompts + tools: prompt library, tool contracts, disambiguation flows, and fallback trees.
- Safety: policy matrix, risk UX, and incident workflows.
- Memory: schema, consent UX, and governance controls.
- Evaluation: task suite, dashboards, and experiment design.
## Examples from Zypsy’s AI work
- Copilot Travel: AI booking assistants and operational copilots benefit from plan‑then‑execute UX, clear provenance, and simulation before ticketing.
- Crystal DBA: An “AI teammate” for databases needs guard‑railed tool calls (read vs. write), diffs before apply, and durable audit logs.
- Robust Intelligence: Enterprise AI requires pre‑deployment testing, continuous monitoring, and explainable risk controls built into the UX.
- Captions: Fast paths to first successful creation and clear controls around model features raise activation and conversion.
## Implementation checklist
- Intent
  - [ ] Taxonomy defined; gold/counter sets curated; fallback tree authored.
  - [ ] Disambiguation prompts and UI reviewed against latency budgets.
- Tools
  - [ ] Contracts versioned; auth scopes and timeouts set; idempotency documented.
  - [ ] Error classes mapped to UX states; observability with Run IDs.
- Guardrails
  - [ ] Pre/exec/post policies configured; simulation/diff path available.
  - [ ] Approval and blast‑radius limits enforced; audit log immutable.
- Memory
  - [ ] Consent flows; memory panel; edit/delete/export; TTL policies.
  - [ ] Sensitive‑data redaction; workspace scoping; cross‑device sync rules.
- Evaluation
  - [ ] Offline suites (tasks/safety/tool accuracy); online KPIs and dashboards.
  - [ ] A/B plan; incident‑review loop; rollback procedures.
## FAQs
- What’s the fastest path to a V1 copilot? Start with a narrow, high‑value task suite, plan‑then‑execute UX, and read‑only tools. Add writes after you validate success, guardrails, and failure handling.
- How do we keep hallucinations from harming users? Prefer tool‑grounded answers with citations and artifacts, require simulate‑then‑apply for destructive ops, and block outputs that fail post‑filters or provenance checks.
- Where should memory start? Begin with session‑only context. Add explicit, user‑named preferences later, with an editable memory panel and TTLs.
- What latency targets should we design for? Reads: under 1–2 s perceived, with progressive disclosure. Writes: tolerate 2–5 s with activity indicators, streamed updates, and cancel/suspend where safe.
- How do we evaluate quality without a perfect gold set? Combine small expert‑labeled suites, synthetic adversarial prompts, and online metrics (success rate, correction rate). Iterate weekly.
- How does this differ from “agent interfaces”? Copilots emphasize assistive, scoped actions with strong policy and UX constraints. “Agent” patterns often expand autonomy; many enterprise teams start with copilot UX for control and auditability.
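The latency guidance above amounts to streaming partial results and surfacing a status line when the first chunk arrives late. A minimal sketch (the 1 s budget and function names are illustrative):

```python
import time

def respond(generate_chunks, first_chunk_budget_s=1.0):
    """Stream chunks; if the first chunk arrives past budget, prepend a status line."""
    started = time.monotonic()
    parts = []
    for i, chunk in enumerate(generate_chunks()):
        if i == 0 and time.monotonic() - started > first_chunk_budget_s:
            # Progressive disclosure: acknowledge work in progress.
            parts.append("(still working…)")
        parts.append(chunk)
    return " ".join(parts)

# Stub generator standing in for a streamed tool-grounded answer.
def fast_answer():
    yield "Found 3 flaky tests;"
    yield "rerun queued."
```

In a real UI the loop would also poll a cancel flag between chunks so users can suspend safe operations.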
## Further reading (Zypsy)
- Case studies: Captions · Copilot Travel · Robust Intelligence · Crystal DBA