Introduction
Modern AI agents are orchestrators: they plan, call tools, read/write memory, and retry when needed. To make them debuggable and reliable at scale, you need end-to-end tracing that captures these agent-specific behaviors as first-class spans with rich, queryable tags. This guide explains a practical span taxonomy, tag schema, UI patterns, and sample diagrams for implementing agent observability with Datadog Traces.
Goals and non-goals
- Goals: faster debugging, measurable reliability, explainable costs, and safer user experiences for AI agents in production.
- Non-goals: generic APM theory or vendor-specific screenshots. Focus is on concrete span design and Datadog trace usage patterns.
Core trace model for AI agents
Represent one user request as a trace. Within that trace:
- Create a root span for the inbound request (API or UI action).
- Nest an “agent orchestration” span that plans and coordinates.
- For each model call, tool call, memory access, and guardrail check, create a child span.
- For retries, create an attempt span under the same parent and tag it with attempt counters.
Span taxonomy and required tags
Use a consistent schema so spans are discoverable and aggregable.
| Span type | When to create | Required tags (examples) | Notes |
|---|---|---|---|
| request.root | Each inbound request | env, service, version, http.method, http.route, user.anonymous_id | Root of the trace; avoid PII in tags. |
| agent.orchestrate | Planning step | ai.agent.id, ai.agent.name, ai.plan.id, ai.plan.steps | Duration ≈ total orchestration time. |
| llm.call | Each model invocation | ai.model.name, ai.model.provider, ai.tokens.prompt, ai.tokens.completion, ai.temperature, ai.top_p, ai.stop.reason | Add cost tags if computed (ai.cost.input_usd, ai.cost.output_usd). |
| tool.call | Each external tool/function | ai.tool.name, ai.tool.type, ai.tool.args.size, ai.tool.timeout_ms, peer.service | Use error tags on failures. |
| memory.read | Vector/DB read | ai.memory.op=read, ai.memory.store, ai.memory.k_hits, ai.memory.k_requested | Include latency to measure retrieval health. |
| memory.write | Persist new facts | ai.memory.op=write, ai.memory.store, ai.memory.bytes | Track growth and write failures. |
| guardrail.check | Safety/validation run | ai.guard.name, ai.guard.category, ai.guard.outcome (pass/fail), ai.guard.severity | |
| parser.parse | Output parsing/validation | ai.parser.name, ai.parser.outcome, ai.parser.error | Useful for structured tool responses. |
| retry.attempt | Each retry | retry.attempt, retry.max, retry.reason, backoff.ms | Wraps each attempt; created under the same parent as the operation being retried. |
| stream.tokens | Server-sent or websocket stream | ai.stream.first_token_ms, ai.stream.tokens, ai.stream.closed_reason | Optional if streaming. |
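To keep this schema consistent across services, it can help to centralize tag keys in one module instead of scattering string literals. A minimal sketch in Python with ddtrace; the AITags class and tag_llm_call helper are illustrative names, not part of any Datadog API:

```python
from ddtrace import tracer

# Centralized tag keys so every service emits the same schema (illustrative helper)
class AITags:
    MODEL_NAME = "ai.model.name"
    MODEL_PROVIDER = "ai.model.provider"
    TOKENS_PROMPT = "ai.tokens.prompt"
    TOKENS_COMPLETION = "ai.tokens.completion"

def tag_llm_call(span, model, provider, prompt_tokens, completion_tokens):
    """Apply the required llm.call tags in one place."""
    span.set_tag(AITags.MODEL_NAME, model)
    span.set_tag(AITags.MODEL_PROVIDER, provider)
    span.set_tag(AITags.TOKENS_PROMPT, prompt_tokens)
    span.set_tag(AITags.TOKENS_COMPLETION, completion_tokens)

# Usage inside an llm.call span
with tracer.trace("llm.call") as span:
    tag_llm_call(span, "gpt-x", "example-provider", prompt_tokens=512, completion_tokens=128)
```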
Notes
- Keep tag keys stable and lowercase. Prefer dot-separated namespaces (ai.*, retry.*, user.*).
- Avoid high-cardinality raw values (e.g., full prompts). Use sizes, hashes, or sampled exemplars.
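For example, instead of tagging the raw prompt, you can tag its size and a short stable hash so identical prompts can still be grouped without exploding tag cardinality. A sketch; the ai.prompt.* keys are illustrative and not part of the taxonomy above:

```python
import hashlib

def prompt_tags(prompt: str) -> dict:
    """Derive low-cardinality tags from a raw prompt without storing its contents."""
    return {
        "ai.prompt.chars": len(prompt),
        # Truncated SHA-256 lets you group identical prompts in Analytics
        "ai.prompt.sha256_8": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:8],
    }

# span.set_tags(prompt_tags(prompt))
```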
Example trace structure (ASCII diagram)
```
request.root [GET /chat]
└─ agent.orchestrate [plan+route]
   ├─ memory.read [RAG]
   ├─ retry.attempt (attempt=1)
   │  └─ llm.call [draft]
   ├─ tool.call [search_vendor]
   │  └─ tool.call [fetch_doc]
   ├─ guardrail.check [policy]
   ├─ parser.parse [json to schema]
   └─ llm.call [finalize]
```
Minimal tagging guidelines (privacy-first)
- Never tag raw PII or secrets. Use user.anonymous_id (stable hash) instead of email/phone.
- For prompts/completions, store length and token counts; attach redacted excerpts only to error logs, not to span tags.
- Track costs as precomputed numerics per span (ai.cost.*) and aggregate at the trace level.
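A sketch of deriving user.anonymous_id as a salted, non-reversible hash; the USER_ID_HASH_SALT environment variable is an assumption of this example and should live outside the trace pipeline:

```python
import hashlib
import hmac
import os

# Server-side secret; never emit it as a tag (assumed env var for this sketch)
ANON_SALT = os.environ["USER_ID_HASH_SALT"].encode("utf-8")

def anonymous_id(user_id: str) -> str:
    """Stable, non-reversible identifier that is safe to use as a span tag."""
    return hmac.new(ANON_SALT, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

# span.set_tag("user.anonymous_id", anonymous_id(user_email))
```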
Datadog setup and instrumentation checklist
- Services: set service, env, version on every span for Service Catalog and deploy-aware comparisons.
- Span names: use a consistent component.action pattern (e.g., llm.call, tool.call) so search and analytics stay predictable.
- Error semantics: set error=true and include error.type, error.message, and a trimmed error.stack on failing spans.
- Trace/log correlation: inject trace_id/span_id into logs so Datadog can link them in the Trace View and Log Explorer (see the logging sketch after this list).
- Sampling: always keep errors; dynamically sample successful requests by token volume or latency bucket to control costs.
- Metrics from spans: use span-based analytics to compute p95 latency, error rates, retries per session, and cost per request.
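For the trace/log correlation item above, a minimal approach is to read the active span and attach its IDs to structured log lines; ddtrace can also inject these automatically when log injection is enabled. A sketch assuming a JSON logger:

```python
import json
import logging

from ddtrace import tracer

log = logging.getLogger("chat-api")

def log_with_trace(message: str, **fields):
    """Emit a JSON log line carrying the current trace/span IDs for correlation."""
    span = tracer.current_span()
    if span is not None:
        fields["dd.trace_id"] = str(span.trace_id)
        fields["dd.span_id"] = str(span.span_id)
    log.info(json.dumps({"message": message, **fields}))
```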
UI patterns in Datadog for agent teams
- Trace View: expect a stair-step flame graph (plan → retrieve → generate → tools → finalize). Investigate wide “tool fan-out” or repeated llm.call blocks as anti-patterns.
- Faceted search: promote ai.model.name, ai.tool.name, ai.guard.outcome, and retry.attempt to facets for one-click filtering.
- Analytics: build toplists for “noisiest tools” (error count), “most expensive models” (sum of ai.cost.*), and “slowest memory stores” (p95 memory.read latency).
- Service Map: group all tool.call spans by peer.service to spot external bottlenecks.
- Dashboards: add timeboards for token throughput and guardrail fail rates next to latency.
Golden signals and SLOs for AI agents
- Latency: p50/p95/p99 of request.root; first-token latency (ai.stream.first_token_ms) if streaming.
- Reliability: error rate of llm.call and tool.call; guardrail fail rate ≤ threshold.
- Cost: median and p95 ai.cost.total_usd per request (sum of ai.cost.input_usd + ai.cost.output_usd).
- Efficiency: retries per request (average retry.attempt), memory hit rate (ai.memory.k_hits / ai.memory.k_requested).
- Safety: percentage of requests with ai.guard.outcome=fail that are successfully auto-remediated on retry.
Suggested SLOs (examples)
- Availability (request success): 99.5% over 30d
- p95 request latency: ≤ 3.0s over 30d
- Guardrail pass rate: ≥ 98% (post-remediation)
- Cost per successful request (p95): ≤ target budget
Retry patterns that preserve observability
- Wrap each attempt in retry.attempt spans with retry.attempt index and backoff.ms.
- Copy the same correlation tags (user.anonymous_id, ai.agent.id) across attempts.
- Record retry.reason (timeout|rate_limit|tool_error|guardrail_fail) to support targeted tuning.
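A sketch of a retry wrapper that follows these rules; the reason classification and backoff values are illustrative:

```python
import time

from ddtrace import tracer

def classify_retry_reason(exc: Exception) -> str:
    """Map exceptions to the retry.reason vocabulary (illustrative mapping)."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    return "tool_error"

def with_retries(operation, max_attempts=3, base_backoff_ms=200, correlation_tags=None):
    """Run operation(), wrapping each attempt in a tagged retry.attempt span."""
    for attempt in range(1, max_attempts + 1):
        backoff_ms = base_backoff_ms * attempt
        with tracer.trace("retry.attempt") as span:
            span.set_tag("retry.attempt", attempt)
            span.set_tag("retry.max", max_attempts)
            span.set_tag("backoff.ms", backoff_ms)
            # Copy correlation tags so every attempt is queryable the same way
            for key, value in (correlation_tags or {}).items():
                span.set_tag(key, value)
            try:
                return operation()
            except Exception as exc:
                span.set_tag("retry.reason", classify_retry_reason(exc))
                if attempt == max_attempts:
                    raise
        time.sleep(backoff_ms / 1000)
```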
Memory observability (RAG and state)
- memory.read: tag ai.memory.k_requested and ai.memory.k_hits to compute hit rate and tail latency.
- memory.write: tag payload size (ai.memory.bytes) and dedupe decisions (ai.memory.dedup=true/false).
- Connect memory-miss spikes to downstream llm.call error/content-quality regressions in Analytics.
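A sketch of a memory.read span around a vector-store query; store.query is a placeholder for whatever retrieval client you use:

```python
from ddtrace import tracer

def traced_memory_read(store, query_embedding, k=8, store_name="vectordb"):
    """Read from the memory store inside a memory.read span with hit-rate tags."""
    with tracer.trace("memory.read") as span:
        span.set_tag("ai.memory.op", "read")
        span.set_tag("ai.memory.store", store_name)
        span.set_tag("ai.memory.k_requested", k)
        hits = store.query(query_embedding, top_k=k)  # placeholder client call
        span.set_tag("ai.memory.k_hits", len(hits))
        return hits
```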
Tool-call spans that scale
- tool.call should always include peer.service and ai.tool.name; use ai.tool.type (http|db|function) for grouping.
- For fan-out tools (e.g., parallel web fetches), set ai.tool.concurrency and per-child span counts; alert on excessive fan-out.
- Propagate trace context to tools you own so subservices appear as part of the same trace.
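For HTTP tools you own, ddtrace can inject the active trace context into outbound headers so the downstream service's spans join the same trace (ddtrace's requests integration can also do this automatically). A sketch; the tool name, peer service, and URL are illustrative:

```python
import requests

from ddtrace import tracer
from ddtrace.propagation.http import HTTPPropagator

def call_owned_tool(url: str, payload: dict) -> dict:
    with tracer.trace("tool.call") as span:
        span.set_tag("ai.tool.name", "fetch_doc")
        span.set_tag("ai.tool.type", "http")
        span.set_tag("peer.service", "doc-service")
        headers = {}
        # Propagate trace context so the tool's own spans appear in this trace
        HTTPPropagator.inject(span.context, headers)
        response = requests.post(url, json=payload, headers=headers, timeout=5)
        response.raise_for_status()
        return response.json()
```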
Cost tracking from spans
- Compute ai.cost.input_usd and ai.cost.output_usd per llm.call using your price sheet at ingestion time; store ai.tokens.* too.
- Sum costs at the trace level into ai.cost.total_usd; build monthly budgets and per-feature allocation with facet breakdowns (feature, model, market).
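A sketch of computing cost tags from token counts at ingestion time; the price sheet values are placeholders for your provider's actual rates:

```python
# Placeholder price sheet: USD per 1K tokens (substitute your provider's rates)
PRICE_SHEET = {
    "gpt-x": {"input_per_1k": 0.005, "output_per_1k": 0.015},
}

def cost_tags(model: str, prompt_tokens: int, completion_tokens: int) -> dict:
    """Return per-span cost and token tags for an llm.call."""
    prices = PRICE_SHEET[model]
    input_usd = prompt_tokens / 1000 * prices["input_per_1k"]
    output_usd = completion_tokens / 1000 * prices["output_per_1k"]
    return {
        "ai.tokens.prompt": prompt_tokens,
        "ai.tokens.completion": completion_tokens,
        "ai.cost.input_usd": round(input_usd, 6),
        "ai.cost.output_usd": round(output_usd, 6),
        "ai.cost.total_usd": round(input_usd + output_usd, 6),
    }

# span.set_tags(cost_tags("gpt-x", prompt_tokens=512, completion_tokens=128))
```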
Dashboards: starter widget list
- Timeseries: request.root p95 latency; llm.call error rate; tool.call error rate; retry.attempt count.
- Toplist: most expensive models (sum ai.cost.total_usd by ai.model.name).
- Query value: guardrail.fail rate with budget status.
- Heatmap: memory.read latency by ai.memory.store.
- Distribution: request cost per trace.
Alerting patterns
- Latency burn alerts: p95 request.root > SLO for 5 min in env:prod.
- Error budget burn: llm.call error rate > 2x baseline for 10 min.
- Safety regression: guardrail.fail > 1% over 10 min or sudden spikes per feature.
- Cost anomaly: ai.cost.total_usd p95 > weekly baseline + 30%.
Implementation snippets
Python (ddtrace)
```python
from ddtrace import tracer

@tracer.wrap(service="chat-api", resource="POST /chat", span_type="web")
def handle_chat(req):
    with tracer.trace("agent.orchestrate") as s:
        s.set_tag("ai.agent.id", "helpdesk-v2")
        s.set_tag("user.anonymous_id", req.user_hash)

        # Memory read
        with tracer.trace("memory.read") as m:
            m.set_tag("ai.memory.store", "vectordb")
            m.set_tag("ai.memory.k_requested", 8)
            # ... do read, set ai.memory.k_hits, error tags

        # LLM call with retry
        for attempt in range(1, 3):
            with tracer.trace("retry.attempt") as r:
                r.set_tag("retry.attempt", attempt)
                r.set_tag("backoff.ms", 200 * attempt)
                with tracer.trace("llm.call") as l:
                    l.set_tag("ai.model.name", "gpt-x")
                    l.set_tag("ai.tokens.prompt", 512)
                    l.set_tag("ai.temperature", 0.2)
                    # ... invoke model, then:
                    # l.set_tag("ai.tokens.completion", tokens)
                    # l.set_tag("ai.cost.total_usd", cost)
            # break on success
```
Node.js (dd-trace)
```javascript
const tracer = require('dd-trace').init();

async function handleChat(req) {
  const span = tracer.startSpan('agent.orchestrate', {
    tags: { 'service.name': 'chat-api' }, // service override for this span
  });
  span.setTag('user.anonymous_id', req.userHash);
  try {
    const m = tracer.startSpan('memory.read', { childOf: span });
    m.setTag('ai.memory.store', 'vectordb');
    // ...
    m.finish();

    const l = tracer.startSpan('llm.call', { childOf: span });
    l.setTag('ai.model.name', 'gpt-x');
    // ...
    l.finish();
  } catch (e) {
    span.setTag('error', true);
    span.setTag('error.type', e.name);
    span.setTag('error.message', e.message);
  } finally {
    span.finish();
  }
}
```
Common anti-patterns and fixes
- Missing child spans: only tracing the root loses root-cause detail. Add spans for memory/tool/model.
- High-cardinality tags: raw prompt strings explode costs. Replace with lengths/hashes.
- Unlabeled retries: without retry.attempt, errors look like separate incidents. Wrap attempts explicitly.
- Cost as logs only: move cost into llm.call tags so you can graph cost against latency and errors.
Governance and compliance
- Redaction: implement pre-tag redaction for PII; keep raw payloads in restricted logs with short retention.
- Data minimization: collect only what’s necessary to diagnose and optimize.
- Access controls: use role-based dashboards; segregate production and staging via env tags.
How Zypsy helps
Zypsy designs and ships production systems for founders—brand, product, and engineering—using sprint-based delivery and integrated execution. If you want a ready-to-run agent tracing implementation (instrumentation, dashboards, and runbooks) alongside UX that explains model behavior to end users, we can help. Explore our capabilities and get in touch via contact. For investment plus hands-on design support, see Zypsy Capital.
Summary checklist
- [ ] Root, agent, llm, tool, memory, guardrail, retry spans in one trace
- [ ] Stable tag schema (ai.*, retry.*, user.*) with privacy by design
- [ ] Cost, tokens, and first-token latency captured on llm.call
- [ ] Facets and dashboards for model, tool, guardrail, memory
- [ ] SLOs and alerts for latency, reliability, safety, and cost