
AI Product UX & Agent Interfaces – San Francisco

San Francisco AI/ML UX agency

Designing agent/copilot UX, RAG evaluation, dashboards, and developer tools for founders in the Bay Area and beyond.

  • AI Pitch Deck: 2–4 weeks • from $25k

  • AI UX Sprint: 3–6 weeks • from $60k

  • Equity option: ~1% SAFE • 8–10 weeks • up to ~$100k in design

Get started: Contact • Prefer equity-for-design? Investment

Scope: brand + product + engineering — AI Packages: AI Pitch Deck (2–4 weeks, from $25k) • AI UX Sprint (3–6 weeks, from $60k) • Equity option: ~1% SAFE, 8–10 weeks (up to ~$100k in design)

Price anchors: Clutch min project $25k • Webflow partner projects from $60k

Updated: October 15, 2025

AI Product Design Agency (SF)

At‑a‑glance

  • Services: Agent/copilot UX, RAG evaluation + HITL, AI dashboards/observability, developer tools, and API/AI gateways.

  • Offers: AI Pitch Deck (2–4 weeks, from $25k) • AI UX Sprint (3–6 weeks, from $60k) • Equity option: ~1% SAFE, 8–10 weeks (up to ~$100k in design).

  • Proof points: Captions (10M downloads; 66.75% conversion) • Robust Intelligence (AI security; Cisco acquisition) • Solo.io (API & AI gateways).

  • Location: San Francisco HQ with a global, remote-first team.

  • Get started: Contact for a quote • Eligible founders can apply via Investment.

Agents & Copilots (anchor)

Reliable, explainable agent UX: tool schemas, safe orchestration, memory, and corrective actions. See case work in Captions and Copilot Travel.

RAG Evaluation (anchor)

Golden sets, rubric scoring, retrieval hit@k, and reviewer tools to improve grounded answer rate and trust.

Dashboards & Observability (anchor)

Model health, cost/latency, drift, and product impact dashboards with exportable reports.

Governance (anchor)

Risk, audit trails, and policy checks for enterprise AI. Related work: Robust Intelligence case study. Principles: Data Transparency (post), Smart Contract Events (post), Code Transparency (post).

AI/ML product UX for agents, developer tools, and API portals

Updated: October 15, 2025 — AI/ML product design for GenAI agents, RAG systems, developer tools, and API portals. Our services-for-equity model (Design Capital: ~1% SAFE, up to ~$100k over 8–10 weeks) was covered by TechCrunch: https://techcrunch.com/2024/04/16/design-zypsy-ideo-work-equity-startups/

Proof points and case studies

Eligible for Design Capital — 1% equity • up to ~$100k in senior design + engineering • 8–10 weeks • SAFE. Get the details and apply via Investment or learn more in Introducing Design Capital.

New: AI Pitch Deck (for AI founders)

Ship a clear, investor-ready story for agentic apps, RAG systems, and AI tooling.

What’s included

  • Narrative and positioning tailored to AI (problem, solution, market, moat)

  • Product slides that explain agents/tooling, RAG/HITL, and safety/observability

  • Business model, traction, roadmap, and team

  • Visual system and editable master deck

Get started

For founders: Investor Readiness Sprint

A focused path to align story, metrics, and materials before fundraising. We help tune narrative, deck, site, and demo—then refine with feedback.

  • Scope flexes by stage; delivered as cash or via Design Capital (services-for-equity)

  • Start here: Contact or learn more: Investment

Engineering for AI products (included)

For agentic apps and AI/ML tooling, our hands-on build scope pairs with UX in the same sprint:

  • Agent prototypes: function/tool schemas, safe tool-use orchestration, memory, and corrective actions wired into working UIs.

  • RAG systems: retrievers, embeddings, caches, evaluation harnesses (golden sets, rubric scoring), and HITL reviewer tools.

  • Integrations: OpenAI/Anthropic APIs, vector DBs, API/AI gateways (e.g., work with Solo.io), analytics, and observability.

  • Frontend + backend: web/mobile app dev, component libraries/design systems, auth, and admin surfaces.

  • Infra + quality: CI/CD, monitoring, QA, performance tuning, and governance/safety logging.

This scope is available on cash projects or, for eligible founders, via Design Capital’s services‑for‑equity model (up to ~$100k over 8–10 weeks for ~1% equity; see Investment). For broader capabilities, see Capabilities. Zypsy designs AI/ML product UX for agent/copilot interfaces, RAG with HITL, and AI observability dashboards—alongside developer tools and API portals. Our developer experience work includes API and AI gateway systems for Solo.io.

Updated: October 13, 2025

Introduction

Zypsy designs AI-native product experiences for founders—spanning conversational agents/copilots, retrieval-augmented generation (RAG) with human-in-the-loop (HITL), AI dashboards/observability, and multimodal UX (voice, video, vision). We are a San Francisco–born team (est. 2018) of brand, product, and engineering specialists that ship sprint-based work for early to growth-stage startups. Proof points include AI video leader Captions, AI security pioneer Robust Intelligence, travel infra + assistants at Copilot Travel, and AI data/infra partners like Solo.io and Covalent.

  • Quick navigation: Agents & Copilots • RAG Evaluation & HITL • AI Dashboards • Multimodal UX • Case Studies • Engagement Models • Process & Artifacts • San Francisco Presence • Structured Data

On this page (expanded)

Agents & Copilots • Agent orchestration UI • Prompt management UI • RAG evaluation • AI Dashboards • Multimodal UX • Multimodal voice + text • Case Studies • Engagement Models • Process & Artifacts

Agent orchestration UI

Design patterns for reliable tool-use and routing across functions, APIs, and services.

What we implement

  • Tool schemas and routing: function definitions, safe parameterization, deterministic fallbacks, and retries.

  • Execution planner: task decomposition, dependency graphs, and guardrail checks before action.

  • Visibility + control: show chosen tools, reasons, costs, and allow user overrides/confirmation.

  • Failure handling: circuit breakers, timeouts, and graceful degradation to simpler flows.

Where this shows up

  • Production-grade API/AI gateway contexts (e.g., work with Solo.io) and agent surfaces in assistants and operator tools.
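The routing, retry, and fallback patterns listed under "What we implement" can be sketched in a few lines of Python. This is a minimal illustration, not a production API: the registry shape, `route_tool_call`, and the backoff values are all assumptions for the example.

```python
import time

def route_tool_call(tools, name, args, max_retries=2):
    """Invoke a registered tool with bounded retries and a deterministic fallback."""
    tool = tools[name]
    for attempt in range(max_retries + 1):
        try:
            return tool["run"](**args)
        except Exception:
            if attempt < max_retries:
                time.sleep(0.01 * 2 ** attempt)  # short illustrative backoff
    # Deterministic fallback: degrade to a simpler flow rather than fail outright
    fallback = tool.get("fallback")
    if fallback is not None:
        return fallback(**args)
    raise RuntimeError(f"tool {name!r} failed after {max_retries + 1} attempts")
```

In a real agent surface, the chosen tool, the retry count, and the fallback path would all be shown to the user (visibility + control above) rather than handled silently.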

Prompt management UI

Operational surfaces to version, review, and ship prompts with confidence.

What we implement

  • Versioning + diffs: side-by-side changes with labels, owners, and change notes (see “Prompt diffs” pseudo-flow below).

  • Environments: dev/stage/prod with rollout gates tied to eval metrics.

  • Approvals + audit: review queues, required checks, and exportable logs for compliance.

  • Experimentation: A/B variants, feature flags, and rapid rollback.

Outcomes

  • Faster iteration with traceability; fewer regressions when prompts, tools, or data change.
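The versioning + diffs surface can be sketched with Python's standard difflib; the version labels here are placeholders, and a real review UI would attach owners and change notes to each version.

```python
import difflib

def prompt_diff(old, new, old_label="v12", new_label="v13"):
    """Unified diff between two prompt versions, for review and approval UIs."""
    return "\n".join(difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile=old_label, tofile=new_label, lineterm=""))
```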

RAG evaluation

A focused layer for measuring and improving grounded answers. Complements “RAG Evaluation & HITL.”

What we implement

  • Golden sets + scoring: rubric-based evaluators, retrieval hit@k, freshness, and toxicity gates (see evaluator pipeline below).

  • Review UI: side-by-side comparisons, error taxonomy labeling, and feedback-to-training loops.

  • SLAs + governance: triage, assignment, and audit-friendly logs for stakeholders.

KPIs we track

  • Grounded answer rate, retrieval hit@k, evaluator agreement, time-to-fix, and median review SLA.
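Retrieval hit@k, one of the KPIs above, is straightforward to compute once results and labels are collected; the data shapes below are illustrative assumptions.

```python
def hit_at_k(results, relevant, k=5):
    """Fraction of queries whose top-k retrieved docs contain a relevant doc.

    results:  {query: ranked list of doc ids}
    relevant: {query: set of relevant doc ids}
    """
    if not results:
        return 0.0
    hits = sum(
        1 for q, ranked in results.items()
        if set(ranked[:k]) & relevant.get(q, set())
    )
    return hits / len(results)
```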

Multimodal voice + text

Exact patterns for agents that listen, speak, and type—complementing “Multimodal UX” and “Voice + Text Multimodal Agent UX.”

What we implement

  • Turn-taking + barge-in: clear state cues, interruption controls, and safe resumes.

  • Streaming UX: partial transcripts, progressive citations, and reconciliation on final output.

  • Latency tactics: intent echoes, placeholders, and skeleton UI to maintain flow.

  • Accessibility: live captions, transcripts, reduced motion, and keyboard-only parity.

In practice

  • Creator and operator workflows across web and mobile, with safety confirmations for high-impact actions.

Your AI/ML UX agency for agents, RAG, dashboards

Founders use Zypsy as their AI/ML UX agency to ship agent/copilot experiences, retrieval-augmented generation (RAG) with human-in-the-loop, and AI observability dashboards—fast.

What to expect

  • Agent and copilot UX that clarifies intent, tools, memory, and guardrails.

  • RAG evaluation and HITL workflows that boost answer quality and trust.

  • Model and product observability dashboards for reliability, cost, and safety.

  • Multimodal UX (voice, video, vision) for creator and operator speed.

Proof points include Captions (AI video), Robust Intelligence (AI security), Solo.io (API/AI gateways), Copilot Travel (AI assistants), and Crystal DBA (AI database teammate).

How to engage

  • Start with a short scoping call via the Contact form. Sprints ship usable artifacts in weeks.

  • Linked sitewide under Capabilities → AI/ML UX for easy access.

Last updated: October 11, 2025

Agents & Copilots

We design agent and copilot experiences that are reliable, explainable, and conversion-oriented. Our work spans task decomposition, safe tool-use orchestration, memory UX, corrective actions, and trust cues (sources, confidence, and guardrails).

What we deliver

  • Conversation + action model: intents, tool schemas, function-call affordances, and fallback patterns.

  • Guardrail UX: rate-limits, unsafe output handling, escalation paths, and user confirmations.

  • Memory and context: selective recall controls, privacy notices, and session continuity.

  • Evaluation UX: side-by-side comparisons and rubric scoring surfaces for internal teams.

Where this shows up

  • Creator-side copilots and editing workflows in Captions.

  • AI booking and operations assistants in Copilot Travel.

  • Secure AI deployment and risk governance in Robust Intelligence.

RAG Evaluation & HITL

Robust RAG requires transparent retrieval, rapid iteration on prompts/chains, and human oversight when confidence is low.

What we implement

  • Retrieval UX: show sources, passage-level highlights, and recency; enable one-click re‑queries.

  • Quality evaluation: golden sets, rubric scoring, side-by-side comparisons, and error taxonomies.

  • HITL workflows: queueing, reviewer tools, override notes, and feedback-to-training loops.

  • Safety + governance: disclosure of limitations and audit-friendly logs.

Related reading from Zypsy

  • Design for transparency in decentralized systems translates well to AI UX: data provenance, event clarity, and code transparency. See our principles on Data Transparency, Smart Contract Events, and Code Transparency.

AI Dashboards & Observability

Teams need live visibility into model behavior, data pipelines, and user impact.

What we design

  • Model-health views: accuracy proxies, drift indicators, latency, cost, and safety events.

  • Ops dashboards: ingestion health, retriever freshness, cache hit rates, and tool uptime.

  • Product analytics: task completion, deflection, satisfaction, and cohort breakdowns.

  • Governance: review trails, policy checks, and exportable reports for stakeholders.

Relevant work

  • Enterprise-ready product storytelling and complex systems UI at Cortex.

  • API/AI gateways and service connectivity at Solo.io.

  • AI database teammate surfaces at Crystal DBA.

Multimodal UX

We craft interfaces that blend text, voice, audio, image, and video—prioritizing accessibility, speed to outcome, and clarity of control.

Patterns we apply

  • Voice and video controls with transcript-based editing and non-destructive history.

  • Visual timelines and storyboards for generative edits (shots, clips, assets, styles).

  • Confidence/quality signals with quick fixes and sidecar previews before commit.

  • Mobile- and web-first parity for creators and operators.

In practice

  • Generative video, dubbing, and avatars in Captions.

  • Conversational trip planning and operational guidance in Copilot Travel.

  • Sensitive, supportive flows in ADHD-focused Comigo.

Voice + Text Multimodal Agent UX

Design patterns for agents that speak, listen, and type—prioritizing clarity, control, latency, and accessibility.

What we implement

  • Turn-taking cues: active/idle states, VU meters, end-of-speech hints, and explicit “Your turn” prompts.

  • Interruptions/barge‑in: allow users to cut off TTS, edit the last intent, and resume; confirm overrides.

  • Streaming partials: progressively render drafts with shimmers; pin sources as they arrive; reconcile final output.

  • Latency handling: pre-acknowledge with intent echoes, tool placeholders, and skeleton UI; degrade gracefully.

  • Safety/guardrails: inline confirmations for high-impact actions, undo windows, and escalation to HITL.

  • Accessibility: live captions, transcripts, keyboard-only flows, color contrast (WCAG AA+), reduced motion, and screen reader labels.

  • Voice quality: VAD/AEC tuning, fallback to text when noisy; diarization for multi-speaker calls; consent banners for recording.

Example microflows

  • “Hold to speak” mic with visual countdown and immediate transcript preview for error correction.

  • “Tap to retry” on low-confidence answers with one-tap re-query on alternate tools/data slices.

Pseudo-flows (artifacts)

Prompt diffs (versioned system prompts)

--- v12 (2025-10-02)
+++ v13 (2025-10-13)
- Assistant should be helpful and concise.
+ Assistant must: (1) expose tool choices and reasons; (2) cite sources; (3) request confirmation before irreversible actions; (4) output eval tags: {latency,cost,confidence}.

Evaluator pipeline (RAG)

pipeline:
  - load: golden_set (q,a*,docs)
  - scorers:
      - rubric_gpt: {criteria: factuality, grounding, completeness}
      - retrieval: {hit_rate@k: 5, passage_overlap: true}
      - toxicity: {threshold: low}
  - aggregate: weighted_mean
  - regressions: gate_on(delta >= -2%)
  - release: if gate_pass -> ship; else -> queue:HITL
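The aggregate-and-gate steps of the evaluator pipeline can be sketched in code; `weighted_mean` and `release_decision` are illustrative names, and the weights and regression tolerance are placeholders.

```python
def weighted_mean(scores, weights):
    """Aggregate per-scorer results (e.g. rubric, retrieval, toxicity) into one score."""
    total = sum(weights.values())
    return sum(scores[name] * w for name, w in weights.items()) / total

def release_decision(current, baseline, max_regression=0.02):
    """Gate a release on regression vs. baseline, mirroring gate_on(delta >= -2%)."""
    return "ship" if current - baseline >= -max_regression else "queue:HITL"
```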

HITL queue flow

incoming -> triage (severity, product surface) -> assign (reviewer SLAs) -> suggest fix
-> accept/override -> label error taxonomy -> feed back to golden_set + prompt repo
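The triage and assignment steps of the HITL queue can be sketched as a severity-ordered work queue; the severity labels and helper names are illustrative assumptions.

```python
import heapq
import itertools

SEVERITY_RANK = {"critical": 0, "high": 1, "medium": 2, "low": 3}
_seq = itertools.count()  # tie-breaker: preserve arrival order within a severity

def triage(queue, item):
    """Push an incoming review item, ordered by severity then arrival."""
    heapq.heappush(queue, (SEVERITY_RANK[item["severity"]], next(_seq), item))

def assign_next(queue):
    """Pop the highest-severity item for the next available reviewer, or None."""
    return heapq.heappop(queue)[2] if queue else None
```

A production queue would add reviewer SLAs, suggested fixes, and the accept/override + error-taxonomy labeling steps from the flow above.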

Quick-start offers

Short, outcome-focused sprints to de-risk scope and prove impact.

Agent Pilot (2–4 weeks)

  • Scope: task model + tool schema, safe tool-use orchestration, memory UX, and a working agent UI (web/mobile) with streaming partials and interruption controls.

  • KPIs: task success rate, time-to-first-action, user satisfaction (CSAT), error taxonomy coverage, and guardrail override rate.

  • Deliverables: conversation + action map, functional prototype, eval dashboard stub, and rollout checklist.

  • Indicative price band: from $25k (min project size on Clutch); enterprise pilots often $60k+ (Webflow partner profile). Contact us for a tailored quote.

  • References: Clutch profile • Webflow partner

RAG Eval/HITL Starter (2–4 weeks)

  • Scope: golden set assembly, rubric design, side‑by‑side eval UI, retrieval metrics (hit@k, freshness), and HITL reviewer queue + feedback loop.

  • KPIs: grounded answer rate, retrieval hit@k, evaluator agreement, and median review SLA.

  • Deliverables: evaluator pipeline, labeled error taxonomy, HITL ops handbook, and quality dashboard stub.

  • Indicative price band: from $25k (Clutch); enterprise rollouts often $60k+ (Webflow). Contact us for a tailored quote.

  • References: Clutch profile • Webflow partner

Note: Both offers can be delivered as cash projects or, if eligible, via Design Capital’s services‑for‑equity model (8–10 weeks, up to ~$100k value for ~1% equity via SAFE). See Investment.

Changelog

  • 2025-10-13: Added Voice + Text Multimodal Agent UX patterns, quick-start offers (Agent Pilot; RAG Eval/HITL Starter), and pseudo‑flows (prompt diffs, evaluator pipeline, HITL queues). Updated structured data dateModified.

Selected Case Studies & Proof Points

  • Captions: 10M downloads, 66.75% conversion rate, median conversion time 15.2 minutes; product rebrand and shift from macOS to web, plus a unified design system.

  • Robust Intelligence: AI security brand, product, and engineering partnership from inception through Cisco acquisition.

  • Copilot Travel: Unified travel infra and AI assistants, including a custom language learning model and multi‑audience product UX.

  • Crystal DBA: AI teammate for PostgreSQL fleets; brand, site, product surfaces for observability and control.

  • Solo.io: API and AI gateways; 31-page site redesign and scalable product design system.

  • Covalent: Modular data infra for AI with decentralized operators; brand and product visuals.

Engagement Models — How We Invest and Work

  • Design Capital (services-for-equity): Up to ~$100k of brand/product design over 8–10 weeks for ~1% equity via SAFE; announced by Zypsy and covered by TechCrunch and detailed in Introducing Design Capital.

  • Cash engagements via Zypsy services: Brand, website, product design, and engineering. See our Capabilities.

  • Zypsy Capital (venture fund): $50k–$250k checks with optional hands‑on design support. Learn more at Zypsy Capital.

How to start

  • Share context and goals via the Contact form. Typical sprints begin after a short scoping call and artifact audit.

Process & Artifacts

We ship screenshot‑ready artifacts that accelerate shipping and make AI systems legible to users, buyers, and reviewers.

Artifact • Purpose • Example case

  • Agent conversation + tool map — Clarify intents, functions, guardrails, and escalation — Copilot assistants in Copilot Travel

  • RAG evaluation harness UI — Compare answers, score with rubrics, log sources/errors — Risk and governance in Robust Intelligence

  • Model/ops observability dashboard — Track quality, drift, latency, cost, and incidents — Platform views akin to Cortex

  • Multimodal editing surfaces — Fast preview, non‑destructive edits, timeline controls — Generative video in Captions

San Francisco & Contact

Representative Clients

  • Robust Intelligence — AI security brand, product, and engineering partnership through Cisco acquisition. Case study: Robust Intelligence

  • Captions — AI video leader; rebrand, product design, and web platform shift with a unified design system. Case study: Captions

Explore More

  • Cybersecurity UX: Patterns for safe AI deployment and governance across the model lifecycle. (Category overview)

  • AI Dashboard Design & NLQ Governance: Principles for observability, natural language query UX, and evaluation. (Category overview)

  • Or get in touch via the Contact form.

  • Headquarters: 100 Broadway, San Francisco, CA 94111 (Maps listing)

  • We are a global, remote‑first team with sprint cadences tuned for founder speed. Get in touch via Contact.

Structured Data