
Human‑in‑the‑Loop (HITL) for AI Apps: Schemas, ERD, UI Patterns, and Ops Playbooks

Introduction

Human‑in‑the‑Loop (HITL) adds targeted human judgment to AI systems where automated confidence, safety, or compliance is insufficient. This page provides a ship‑ready blueprint: reviewer queue schemas, rubric UI patterns, override trails, SLA timers, and feedback‑to‑training loops. For implementation support across brand, product, and engineering, see Zypsy’s capabilities and AI‑heavy client work such as Captions, Robust Intelligence, and Copilot Travel.

Reference architecture (agent orchestration + human review)

  • Orchestrator: a conversational agent or workflow engine with an interrupt/resume primitive (as in modern agent frameworks) that yields control to a review queue when policy triggers fire.

  • Policy router: evaluates model output against guardrails (confidence thresholds, content rules, cost caps) to choose auto‑approve vs. human review vs. escalate.

  • Reviewer queue: prioritized assignment with SLA timers, skills routing, and conflict‑of‑interest checks.

  • Rubric UI: structured scoring with required rationales; supports binary, Likert, and checklist dimensions.

  • Decision + override trail: immutable audit of approvals, edits, and escalations with diffs.

  • Feedback‑to‑training: converts reviewer signals into datasets for fine‑tuning/RLHF, with PII scrubbing per Privacy Policy and IP/ownership terms per Terms for Customer.

  • Annotation systems: patterns align with established tools (e.g., task queues and labeling workflows as seen in Label Studio/Argilla) without requiring vendor lock‑in.

Queue ERD (entities and relationships)

Entity | Purpose | Key relationships
submission | Raw model request/response and artifacts | 1→N review_task; 1→N artifact
review_task | Unit of human review with SLA and status | N→1 submission; 1→N score; 1→N event; 0→1 assignment
assignment | Reviewer-to-task linkage | 1→1 review_task; N→1 reviewer
reviewer | Human operator profile/skills | 1→N assignment; 1→N score
rubric | Versioned scoring template | 1→N score; 1→N rubric_dimension
rubric_dimension | A measurable criterion | N→1 rubric; 1→N score_item
score | Aggregate task score | N→1 review_task; N→1 reviewer; N→1 rubric
score_item | Per-dimension rating + rationale | N→1 score; N→1 rubric_dimension
decision | Approve/return/edit outcome | N→1 review_task; 0→1 override
override | Manual change to model output | 1→1 decision; 1→N diff_chunk
event | Audit log (state, SLA, edits) | N→1 review_task
sla | Policy defining due times | 1→N review_task
training_example | Labeled datum for model training | N→1 submission; N→1 score
artifact | Files (images, audio, docs) | N→1 submission

Core relational schemas (SQL DDL)

CREATE TABLE reviewer (
  reviewer_id BIGSERIAL PRIMARY KEY,
  email TEXT UNIQUE NOT NULL,
  display_name TEXT NOT NULL,
  skills TEXT[] NOT NULL DEFAULT '{}',
  locale TEXT,
  tz TEXT,
  active BOOL NOT NULL DEFAULT TRUE
);

CREATE TABLE rubric (
  rubric_id BIGSERIAL PRIMARY KEY,
  name TEXT NOT NULL,
  version TEXT NOT NULL,
  is_active BOOL NOT NULL DEFAULT TRUE,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE(name, version)
);

CREATE TABLE rubric_dimension (
  dimension_id BIGSERIAL PRIMARY KEY,
  rubric_id BIGINT REFERENCES rubric(rubric_id) ON DELETE CASCADE,
  key TEXT NOT NULL,
  label TEXT NOT NULL,
  scale_min INT NOT NULL,
  scale_max INT NOT NULL,
  weight NUMERIC(5,4) NOT NULL CHECK (weight >= 0),
  required_rationale BOOL NOT NULL DEFAULT TRUE,
  UNIQUE(rubric_id, key)
);

CREATE TABLE submission (
  submission_id BIGSERIAL PRIMARY KEY,
  external_ref TEXT,
  input_json JSONB NOT NULL,
  output_json JSONB,
  model_name TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  tenant_id TEXT NOT NULL,
  pii_hash TEXT
);

CREATE TABLE artifact (
  artifact_id BIGSERIAL PRIMARY KEY,
  submission_id BIGINT REFERENCES submission(submission_id) ON DELETE CASCADE,
  uri TEXT NOT NULL,
  mime TEXT,
  bytes INT,
  sha256 TEXT
);

CREATE TABLE sla (
  sla_id BIGSERIAL PRIMARY KEY,
  name TEXT NOT NULL,
  priority INT NOT NULL,
  target_minutes INT NOT NULL,
  breach_policy TEXT NOT NULL -- e.g., "escalate_level=2,notify=on-call"
);

CREATE TABLE review_task (
  task_id BIGSERIAL PRIMARY KEY,
  submission_id BIGINT REFERENCES submission(submission_id) ON DELETE CASCADE,
  status TEXT NOT NULL CHECK (status IN ('queued','assigned','in_review','returned','approved','breached','canceled')),
  rubric_id BIGINT REFERENCES rubric(rubric_id),
  sla_id BIGINT REFERENCES sla(sla_id),
  priority INT NOT NULL DEFAULT 3,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  due_at TIMESTAMPTZ,
  breached_at TIMESTAMPTZ,
  closed_at TIMESTAMPTZ
);

CREATE TABLE assignment (
  assignment_id BIGSERIAL PRIMARY KEY,
  task_id BIGINT REFERENCES review_task(task_id) ON DELETE CASCADE,
  reviewer_id BIGINT REFERENCES reviewer(reviewer_id) ON DELETE RESTRICT,
  assigned_at TIMESTAMPTZ NOT NULL DEFAULT now(),
  UNIQUE(task_id)
);

CREATE TABLE score (
  score_id BIGSERIAL PRIMARY KEY,
  task_id BIGINT REFERENCES review_task(task_id) ON DELETE CASCADE,
  reviewer_id BIGINT REFERENCES reviewer(reviewer_id) ON DELETE RESTRICT,
  rubric_id BIGINT REFERENCES rubric(rubric_id),
  overall NUMERIC(5,2),
  rationale TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE score_item (
  score_item_id BIGSERIAL PRIMARY KEY,
  score_id BIGINT REFERENCES score(score_id) ON DELETE CASCADE,
  dimension_id BIGINT REFERENCES rubric_dimension(dimension_id) ON DELETE RESTRICT,
  rating INT NOT NULL,
  rationale TEXT
);

CREATE TABLE decision (
  decision_id BIGSERIAL PRIMARY KEY,
  task_id BIGINT REFERENCES review_task(task_id) ON DELETE CASCADE,
  outcome TEXT NOT NULL CHECK (outcome IN ('approve','return_for_edit','escalate','cancel')),
  notes TEXT,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE override (
  override_id BIGSERIAL PRIMARY KEY,
  decision_id BIGINT UNIQUE REFERENCES decision(decision_id) ON DELETE CASCADE,
  original_output JSONB NOT NULL,
  revised_output JSONB NOT NULL,
  reason TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
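
The ERD above references a diff_chunk entity that the DDL omits. A minimal definition, storing one operation per row in the same op/path/from/to shape used in the appendix diffs, might look like:

CREATE TABLE diff_chunk (
  diff_chunk_id BIGSERIAL PRIMARY KEY,
  override_id BIGINT REFERENCES override(override_id) ON DELETE CASCADE,
  op TEXT NOT NULL CHECK (op IN ('add','remove','replace')), -- RFC 6902 subset
  path TEXT NOT NULL,  -- JSON Pointer into the output document
  from_value JSONB,    -- NULL for 'add'
  to_value JSONB       -- NULL for 'remove'
);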

CREATE TABLE event (
  event_id BIGSERIAL PRIMARY KEY,
  task_id BIGINT REFERENCES review_task(task_id) ON DELETE CASCADE,
  type TEXT NOT NULL,
  payload JSONB,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE training_example (
  example_id BIGSERIAL PRIMARY KEY,
  submission_id BIGINT REFERENCES submission(submission_id) ON DELETE CASCADE,
  score_id BIGINT REFERENCES score(score_id) ON DELETE SET NULL,
  label_json JSONB NOT NULL,
  split TEXT NOT NULL CHECK (split IN ('train','valid','test')),
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX ix_task_status_due ON review_task(status, due_at);
CREATE INDEX ix_assignment_reviewer ON assignment(reviewer_id);
CREATE INDEX ix_score_task ON score(task_id);

Rubric UI patterns (calibration‑friendly)

  • Dimensions: correctness, safety, policy compliance, source attribution, UX/formatting.

  • Scales: small ordinal ranges (0–3 or 0–5) with labeled anchors; enforce rationales on low/high extremes.

  • Evidence: require spans/quotes, linked artifacts, or pointer paths inside output_json (JSONPath) for quick verification.

  • Calibration: double‑blind overlap on 10% of tasks; weekly adjudication; drift alerts if inter‑rater reliability (Cohen’s κ) falls below 0.6.
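
A minimal sketch of the overlap sampler, assuming a nightly job re‑queues ~10% of recently approved tasks for a second blind review (UNIQUE(task_id) on assignment means double review needs a second task row, not a second assignment):

-- Clone ~10% of tasks closed in the last day as calibration tasks.
-- Routing must exclude the original reviewer to keep the overlap blind.
INSERT INTO review_task (submission_id, status, rubric_id, sla_id, priority)
SELECT submission_id, 'queued', rubric_id, sla_id, priority
FROM review_task
WHERE status = 'approved'
  AND closed_at > now() - INTERVAL '1 day'
  AND random() < 0.10;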

Sample rubric YAML (versioned)

rubric:
  name: "prod-answer-quality"
  version: "1.3.0"
  dimensions:

    - key: correctness
      label: "Factual correctness"
      scale: {min: 0, max: 5}
      anchors:
        0: "Fabricated or wrong"
        3: "Mostly correct; minor issues"
        5: "Fully correct; sources aligned"
      weight: 0.40
      require_rationale: true

    - key: safety
      label: "Safety & policy"
      scale: {min: 0, max: 3}
      anchors:
        0: "Violates safety"
        1: "Borderline; needs edits"
        3: "Compliant"
      weight: 0.25
      require_rationale: true

    - key: grounding
      label: "Grounding & citations"
      scale: {min: 0, max: 3}
      anchors:
        0: "Ungrounded"
        2: "Some citations"
        3: "Fully grounded"
      weight: 0.20
      require_rationale: true

    - key: structure
      label: "Structure & formatting"
      scale: {min: 0, max: 3}
      anchors:
        0: "Disorganized"
        2: "Readable"
        3: "Production‑ready"
      weight: 0.10
      require_rationale: false

    - key: tone
      label: "Tone & brand fit"
      scale: {min: 0, max: 3}
      anchors:
        0: "Off‑brand"
        2: "Acceptable"
        3: "On‑brand"
      weight: 0.05
      require_rationale: false
  pass_conditions:
    min_overall: 3.5
    hard_blocks: {safety: ">=2"}
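
Because the weights above sum to 1.0, the overall score can be computed by normalizing each rating to its dimension’s scale and rescaling to 0–5 for comparison against min_overall. A sketch against the schema above:

-- Weighted overall per score row, rescaled to 0-5.
SELECT s.score_id,
       5 * SUM(d.weight * (si.rating - d.scale_min)::NUMERIC
               / NULLIF(d.scale_max - d.scale_min, 0)) AS overall
FROM score s
JOIN score_item si ON si.score_id = s.score_id
JOIN rubric_dimension d ON d.dimension_id = si.dimension_id
GROUP BY s.score_id;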

SLA timers and escalation

  • SLA policy: target_minutes per priority; breach triggers event(type='sla_breach').

  • Due date set at task creation; pauses allowed during reviewer interrupt/resume (e.g., waiting on the requester). Events must capture pause windows so they can be excluded from breach math (see the sketch after this list).

  • Escalation: level 1 (peer), level 2 (specialist), level 3 (program owner). Auto‑paging on breach.

  • Metrics: MTTReview (assign→first action), AHT (handle time), SLA‑hit %, reopen rate.
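
A sketch of pause‑aware breach math, assuming paired event rows of type 'pause' and 'resume' (event types are free‑form TEXT in the DDL above):

-- Effective open minutes per task, excluding paused windows.
WITH pauses AS (
  SELECT task_id,
         SUM(EXTRACT(EPOCH FROM (resumed_at - created_at)) / 60) AS paused_min
  FROM (
    SELECT task_id, type, created_at,
           LEAD(created_at) OVER (PARTITION BY task_id ORDER BY created_at) AS resumed_at
    FROM event
    WHERE type IN ('pause','resume')
  ) w
  WHERE type = 'pause' AND resumed_at IS NOT NULL
  GROUP BY task_id
)
SELECT t.task_id,
       EXTRACT(EPOCH FROM (COALESCE(t.closed_at, now()) - t.created_at)) / 60
         - COALESCE(p.paused_min, 0) AS effective_minutes
FROM review_task t
LEFT JOIN pauses p USING (task_id);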

Override trails and immutable audit

  • Every approval/return/edit is a decision row; any edit to the output is an override with original_output and revised_output diffed into diff_chunk rows (see the recording sketch after this list).

  • Chain of custody: event log includes who, what, when, why; reviewer conflict checks enforced at assignment.

  • Portfolio visibility: export read‑only audit packages for compliance (JSON + signed checksums). See ownership/portfolio rights in Terms for Customer and contractor IP in Designer Terms.
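
A sketch of recording an edit atomically with data‑modifying CTEs; $1–$4 stand in for the task id, original JSON, revised JSON, and reason:

WITH d AS (
  INSERT INTO decision (task_id, outcome, notes)
  VALUES ($1, 'approve', 'approved with edits')
  RETURNING decision_id, task_id
),
o AS (
  INSERT INTO override (decision_id, original_output, revised_output, reason)
  SELECT decision_id, $2::jsonb, $3::jsonb, $4 FROM d
  RETURNING override_id
)
-- Log the override into the immutable event trail.
INSERT INTO event (task_id, type, payload)
SELECT d.task_id, 'override', jsonb_build_object('override_id', o.override_id)
FROM d, o;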

Feedback‑to‑training loops (data flywheel)

  • Harvest signals: score_items, rationales, and overrides feed training_example.label_json (see the SQL sketch after this list).

  • Curation: deduplicate near‑identical records; remove PII per Privacy Policy; filter low‑agreement items unless adjudicated.

  • Slicing: tag by rubric version, domain, prompt type, risk tier.

  • Model update cadence: weekly small‑batch fine‑tunes or preference optimization; rollback with canary evals; keep rubric version pinned in dataset metadata.

  • Attribution: preserve reviewer credit internally; public outputs adhere to portfolio use per Terms for Customer.
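
A harvest sketch that converts overridden‑then‑approved outputs into preference‑style pairs; the chosen/rejected label shape is an assumption, not a fixed contract:

-- Assumes one score per task; deduplicate first if overlap reviews exist.
INSERT INTO training_example (submission_id, score_id, label_json, split)
SELECT t.submission_id,
       sc.score_id,
       jsonb_build_object('rejected', o.original_output,
                          'chosen',   o.revised_output,
                          'reason',   o.reason),
       'train'
FROM decision d
JOIN override o    ON o.decision_id = d.decision_id
JOIN review_task t ON t.task_id = d.task_id
LEFT JOIN score sc ON sc.task_id = t.task_id
WHERE d.outcome = 'approve';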

Reviewer queue behavior (ops logic)

  • Intake routing: skills match (e.g., language/domain); FIFO within priority band; fairness throttles to avoid reviewer fatigue.

  • Concurrency guards: one active assignment per task (see the claim sketch after this list); auto‑release on inactivity.

  • Quality gates: shadow evaluations (gold tasks), disagreement triggers adjudication task generation.
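
A claim sketch that uses SKIP LOCKED to enforce the one‑active‑assignment guard; skills routing would additionally match reviewer.skills against a required‑skills column on the task, which the DDL above does not define:

-- Claim the oldest queued task in the highest-priority band for reviewer $1.
WITH next_task AS (
  SELECT task_id FROM review_task
  WHERE status = 'queued'
  ORDER BY priority, created_at
  LIMIT 1
  FOR UPDATE SKIP LOCKED
),
a AS (
  INSERT INTO assignment (task_id, reviewer_id)
  SELECT task_id, $1 FROM next_task
  RETURNING task_id
)
UPDATE review_task SET status = 'assigned'
WHERE task_id IN (SELECT task_id FROM a);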

Example queue filter JSON

{
  "priority_gte": 2,
  "status_in": ["queued","assigned"],
  "skills_any": ["medical","en-US"],
  "due_before": "2025-10-20T00:00:00Z"
}
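
The same filter translated to SQL against the schema above (skills_any is omitted; it would require a skills column on the task or a join through a routing table):

SELECT task_id
FROM review_task
WHERE priority >= 2
  AND status IN ('queued','assigned')
  AND due_at < '2025-10-20T00:00:00Z'
ORDER BY priority, due_at;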

Reviewer UI wireframe (described)

  • Left pane: task list with SLA badges, priority, and due countdown.

  • Center: input/output diff viewer; artifact panel; JSONPath picker to cite spans.

  • Right: rubric cards (accordion). Each dimension: scale selector, rationale textarea, evidence picker. Sticky footer: Approve / Return / Escalate.

  • Calibration mode: shows peer scores after submit (never before) to reduce anchoring.

Typical policies and triggers

  • Send to review if: model confidence is below threshold; the safety policy flags a high‑risk category; a claim is detected without a source; cost exceeds the cap; or the customer tier requires human certification.

  • Auto‑approve if: safety clean, confidence high, citations validated, prior similar tasks auto‑approved N times (progressive automation).
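
A routing sketch as a single SQL expression, assuming confidence and safety fields live in output_json (the field names and thresholds here are illustrative):

-- Route each scored submission; thresholds are policy-configurable.
SELECT s.submission_id,
       CASE
         WHEN s.output_json->>'safety_category' = 'high_risk' THEN 'escalate'
         WHEN (s.output_json->>'confidence')::NUMERIC < 0.80 THEN 'human_review'
         WHEN (s.output_json->>'citations_valid')::BOOL IS NOT TRUE THEN 'human_review'
         ELSE 'auto_approve'
       END AS route
FROM submission s
WHERE s.output_json IS NOT NULL;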

KPIs and queries

  • SLA attainment by priority:
SELECT p.priority,
  AVG(CASE WHEN t.breached_at IS NULL THEN 1 ELSE 0 END)::NUMERIC AS sla_hit
FROM review_task t JOIN sla p ON t.sla_id=p.sla_id
WHERE t.closed_at IS NOT NULL
GROUP BY p.priority;
  • Inter‑rater reliability (overlap tasks): compute κ offline from score_item pairs grouped by dimension.
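
As a first‑pass proxy before the offline κ job, raw agreement can be computed in‑database by pairing scores on tasks that share a submission (matching the overlap‑cloning sketch earlier):

-- Per-dimension raw agreement across reviewer pairs on double-reviewed work.
SELECT si1.dimension_id,
       AVG((si1.rating = si2.rating)::INT)::NUMERIC AS pct_agree
FROM score s1
JOIN review_task t1 ON t1.task_id = s1.task_id
JOIN review_task t2 ON t2.submission_id = t1.submission_id
                   AND t2.task_id <> t1.task_id
JOIN score s2 ON s2.task_id = t2.task_id
            AND s2.reviewer_id > s1.reviewer_id
JOIN score_item si1 ON si1.score_id = s1.score_id
JOIN score_item si2 ON si2.score_id = s2.score_id
                   AND si2.dimension_id = si1.dimension_id
GROUP BY si1.dimension_id;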

Security, privacy, and IP

  • Data minimization: store only necessary input/output; hash PII fields; restrict artifact retention windows.

  • Access control: reviewer access is scoped by tenant and skills; audit every read via event(type='access') (see the sketch after this list).

  • Legal: align with Privacy Policy, Terms for Customer, and Terms of Service.
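
A sketch of the read‑audit write; the payload fields are illustrative:

-- Log a reviewer read as an audit event (who, what, when).
INSERT INTO event (task_id, type, payload)
VALUES ($1, 'access',
        jsonb_build_object('reviewer_id', $2, 'action', 'read', 'at', now()));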

Delivery with Zypsy

Zypsy provides integrated brand → product → web → code execution. We design rubric UIs, implement the queue and ERD, and wire feedback loops into model training pipelines—delivered in sprints and production‑ready per capabilities. For equity‑aligned collaborations, see Design Capital and Zypsy Capital.

Appendix: sample events and diffs

{
  "event": {
    "type": "status_change",
    "payload": {"from": "assigned", "to": "in_review"}
  }
}
{
  "override": {
    "reason": "Fixed unsupported medical claim; added citation.",
    "diff": [
      {"op": "replace", "path": "/claims/0/text", "from": "X cures Y", "to": "X may alleviate Y per meta‑analysis (2019)."}
    ]
  }
}