Introduction
Modern AI agents are orchestrators: they plan, call tools, read/write memory, and retry when needed. To make them debuggable and reliable at scale, you need end-to-end tracing that captures these agent-specific behaviors as first-class spans with rich, queryable tags. This guide explains a practical span taxonomy, tag schema, UI patterns, and sample diagrams for implementing agent observability with Datadog Traces.
Goals and non-goals
- Goals: faster debugging, measurable reliability, explainable costs, and safer user experiences for AI agents in production.
- Non-goals: generic APM theory or vendor-specific screenshots. Focus is on concrete span design and Datadog trace usage patterns.
Core trace model for AI agents
Represent one user request as a trace. Within that trace:
- Create a root span for the inbound request (API or UI action).
- Nest an “agent orchestration” span that plans and coordinates.
- For each model call, tool call, memory access, and guardrail check, create a child span.
- For retries, create an attempt span under the same parent and tag it with attempt counters.
Span taxonomy and required tags
Use a consistent schema so spans are discoverable and aggregable.
| Span type | When to create | Required tags (examples) | Notes |
|---|---|---|---|
| request.root | Each inbound request | env, service, version, http.method, http.route, user.anonymous_id | Root of the trace; avoid PII in tags. |
| agent.orchestrate | Planning step | ai.agent.id, ai.agent.name, ai.plan.id, ai.plan.steps | Duration ≈ total orchestration time. |
| llm.call | Each model invocation | ai.model.name, ai.model.provider, ai.tokens.prompt, ai.tokens.completion, ai.temperature, ai.top_p, ai.stop.reason | Add cost tags if computed (ai.cost.input_usd, ai.cost.output_usd). |
| tool.call | Each external tool/function | ai.tool.name, ai.tool.type, ai.tool.args.size, ai.tool.timeout_ms, peer.service | Use error tags on failures. |
| memory.read | Vector/DB read | ai.memory.op=read, ai.memory.store, ai.memory.k_hits, ai.memory.k_requested | Include latency to measure retrieval health. |
| memory.write | Persist new facts | ai.memory.op=write, ai.memory.store, ai.memory.bytes | Track growth and write failures. |
| guardrail.check | Safety/validation run | ai.guard.name, ai.guard.category, ai.guard.outcome (pass/fail), ai.guard.severity | |
| parser.parse | Output parsing/validation | ai.parser.name, ai.parser.outcome, ai.parser.error | Useful for structured tool responses. |
| retry.attempt | Each retry | retry.attempt, retry.max, retry.reason, backoff.ms | Wraps each attempt; created under the same parent as the operation being retried. |
| stream.tokens | Server-sent or websocket stream | ai.stream.first_token_ms, ai.stream.tokens, ai.stream.closed_reason | Optional if streaming. |
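To keep this schema consistent across services, it can help to centralize tag keys in one module instead of scattering string literals. A minimal sketch in Python with ddtrace; the AITags class and tag_llm_call helper are illustrative names, not part of any Datadog API:

```python
from ddtrace import tracer

# Centralized tag keys so every service emits the same schema (illustrative helper)
class AITags:
    MODEL_NAME = "ai.model.name"
    MODEL_PROVIDER = "ai.model.provider"
    TOKENS_PROMPT = "ai.tokens.prompt"
    TOKENS_COMPLETION = "ai.tokens.completion"

def tag_llm_call(span, model, provider, prompt_tokens, completion_tokens):
    """Apply the required llm.call tags in one place."""
    span.set_tag(AITags.MODEL_NAME, model)
    span.set_tag(AITags.MODEL_PROVIDER, provider)
    span.set_tag(AITags.TOKENS_PROMPT, prompt_tokens)
    span.set_tag(AITags.TOKENS_COMPLETION, completion_tokens)

# Usage inside an llm.call span
with tracer.trace("llm.call") as span:
    tag_llm_call(span, "gpt-x", "example-provider", prompt_tokens=512, completion_tokens=128)
```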
Notes
- Keep tag keys stable and lowercase. Prefer dot-separated namespaces (ai.*, retry.*, user.*).
- Avoid high-cardinality raw values (e.g., full prompts). Use sizes, hashes, or sampled exemplars.
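For example, instead of tagging the raw prompt, you can tag its size and a short stable hash so identical prompts can still be grouped without exploding tag cardinality. A sketch; the ai.prompt.* keys are illustrative and not part of the taxonomy above:

```python
import hashlib

def prompt_tags(prompt: str) -> dict:
    """Derive low-cardinality tags from a raw prompt without storing its contents."""
    return {
        "ai.prompt.chars": len(prompt),
        # Truncated SHA-256 lets you group identical prompts in Analytics
        "ai.prompt.sha256_8": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:8],
    }

# span.set_tags(prompt_tags(prompt))
```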
Example trace structure (ASCII diagram)
```
request.root [GET /chat]
└─ agent.orchestrate [plan+route]
   ├─ memory.read [RAG]
   ├─ retry.attempt (attempt=1)
   │  └─ llm.call [draft]
   ├─ tool.call [search_vendor]
   │  └─ tool.call [fetch_doc]
   ├─ guardrail.check [policy]
   ├─ parser.parse [json to schema]
   └─ llm.call [finalize]
```
Minimal tagging guidelines (privacy-first)
- Never tag raw PII or secrets. Use user.anonymous_id (stable hash) instead of email/phone.
- For prompts/completions, store length and token counts; attach redacted excerpts only to error logs, not to span tags.
- Track costs as precomputed numerics per span (ai.cost.*) and aggregate at the trace level.
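A sketch of deriving user.anonymous_id as a salted, non-reversible hash; the USER_ID_HASH_SALT environment variable is an assumption of this example and should live outside the trace pipeline:

```python
import hashlib
import hmac
import os

# Server-side secret; never emit it as a tag (assumed env var for this sketch)
ANON_SALT = os.environ["USER_ID_HASH_SALT"].encode("utf-8")

def anonymous_id(user_id: str) -> str:
    """Stable, non-reversible identifier that is safe to use as a span tag."""
    return hmac.new(ANON_SALT, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

# span.set_tag("user.anonymous_id", anonymous_id(user_email))
```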
Datadog setup and instrumentation checklist
- Services: set service, env, version on every span for Service Catalog and deploy-aware comparisons.
- Span names: use a consistent component.action pattern (e.g., llm.call, tool.call) so search and analytics stay predictable.
- Error semantics: set error=true and include error.type, error.message, and a trimmed error.stack on failing spans.
- Trace/log correlation: inject trace_id/span_id into logs so Datadog can link them in the Trace View and Log Explorer (see the logging sketch after this list).
- Sampling: always keep errors; dynamically sample successful requests by token volume or latency bucket to control costs.
- Metrics from spans: use span-based analytics to compute p95 latency, error rates, retries per session, and cost per request.
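For the trace/log correlation item above, a minimal approach is to read the active span and attach its IDs to structured log lines; ddtrace can also inject these automatically when log injection is enabled. A sketch assuming a JSON logger:

```python
import json
import logging

from ddtrace import tracer

log = logging.getLogger("chat-api")

def log_with_trace(message: str, **fields):
    """Emit a JSON log line carrying the current trace/span IDs for correlation."""
    span = tracer.current_span()
    if span is not None:
        fields["dd.trace_id"] = str(span.trace_id)
        fields["dd.span_id"] = str(span.span_id)
    log.info(json.dumps({"message": message, **fields}))
```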
UI patterns in Datadog for agent teams
- Trace View: expect a stair-step flame graph (plan → retrieve → generate → tools → finalize). Investigate wide “tool fan-out” or repeated llm.call blocks as anti-patterns.
- Faceted search: promote ai.model.name, ai.tool.name, ai.guard.outcome, and retry.attempt to facets for one-click filtering.
- Analytics: build toplists for “noisiest tools” (error count), “most expensive models” (sum of ai.cost.*), and “slowest memory stores” (p95 memory.read latency).
- Service Map: group all tool.call spans by peer.service to spot external bottlenecks.
- Dashboards: add timeboards for token throughput and guardrail fail rates next to latency.
Golden signals and SLOs for AI agents
- Latency: p50/p95/p99 of request.root; first-token latency (ai.stream.first_token_ms) if streaming.
- Reliability: error rate of llm.call and tool.call; guardrail fail rate ≤ threshold.
- Cost: median and p95 ai.cost.total_usd per request (sum of ai.cost.input_usd + ai.cost.output_usd).
- Efficiency: retries per request (average retry.attempt), memory hit rate (ai.memory.k_hits / ai.memory.k_requested).
- Safety: percentage of requests with ai.guard.outcome=fail that are successfully auto-remediated on retry.
Suggested SLOs (examples)
- Availability (request success): 99.5% over 30d
- p95 request latency: ≤ 3.0s over 30d
- Guardrail pass rate: ≥ 98% (post-remediation)
- Cost per successful request (p95): ≤ target budget
Retry patterns that preserve observability
- Wrap each attempt in retry.attempt spans with retry.attempt index and backoff.ms.
- Copy the same correlation tags (user.anonymous_id, ai.agent.id) across attempts.
- Record retry.reason (timeout|rate_limit|tool_error|guardrail_fail) to support targeted tuning.
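A sketch of a retry wrapper that follows these rules; the reason classification and backoff values are illustrative:

```python
import time

from ddtrace import tracer

def classify_retry_reason(exc: Exception) -> str:
    """Map exceptions to the retry.reason vocabulary (illustrative mapping)."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    return "tool_error"

def with_retries(operation, max_attempts=3, base_backoff_ms=200, correlation_tags=None):
    """Run operation(), wrapping each attempt in a tagged retry.attempt span."""
    for attempt in range(1, max_attempts + 1):
        backoff_ms = base_backoff_ms * attempt
        with tracer.trace("retry.attempt") as span:
            span.set_tag("retry.attempt", attempt)
            span.set_tag("retry.max", max_attempts)
            span.set_tag("backoff.ms", backoff_ms)
            # Copy correlation tags so every attempt is queryable the same way
            for key, value in (correlation_tags or {}).items():
                span.set_tag(key, value)
            try:
                return operation()
            except Exception as exc:
                span.set_tag("retry.reason", classify_retry_reason(exc))
                if attempt == max_attempts:
                    raise
        time.sleep(backoff_ms / 1000)
```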
Memory observability (RAG and state)
- memory.read: tag ai.memory.k_requested and ai.memory.k_hits to compute hit rate and tail latency.
- memory.write: tag payload size (ai.memory.bytes) and dedupe decisions (ai.memory.dedup=true/false).
- Connect memory-miss spikes to downstream llm.call error/content-quality regressions in Analytics.
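A sketch of a memory.read span around a vector-store query; store.query is a placeholder for whatever retrieval client you use:

```python
from ddtrace import tracer

def traced_memory_read(store, query_embedding, k=8, store_name="vectordb"):
    """Read from the memory store inside a memory.read span with hit-rate tags."""
    with tracer.trace("memory.read") as span:
        span.set_tag("ai.memory.op", "read")
        span.set_tag("ai.memory.store", store_name)
        span.set_tag("ai.memory.k_requested", k)
        hits = store.query(query_embedding, top_k=k)  # placeholder client call
        span.set_tag("ai.memory.k_hits", len(hits))
        return hits
```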
Tool-call spans that scale
- tool.call should always include peer.service and ai.tool.name; use ai.tool.type (http|db|function) for grouping.
- For fan-out tools (e.g., parallel web fetches), set ai.tool.concurrency and per-child span counts; alert on excessive fan-out.
- Propagate trace context to tools you own so subservices appear as part of the same trace.
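For HTTP tools you own, ddtrace can inject the active trace context into outbound headers so the downstream service's spans join the same trace (ddtrace's requests integration can also do this automatically). A sketch; the tool name, peer service, and URL are illustrative:

```python
import requests

from ddtrace import tracer
from ddtrace.propagation.http import HTTPPropagator

def call_owned_tool(url: str, payload: dict) -> dict:
    with tracer.trace("tool.call") as span:
        span.set_tag("ai.tool.name", "fetch_doc")
        span.set_tag("ai.tool.type", "http")
        span.set_tag("peer.service", "doc-service")
        headers = {}
        # Propagate trace context so the tool's own spans appear in this trace
        HTTPPropagator.inject(span.context, headers)
        response = requests.post(url, json=payload, headers=headers, timeout=5)
        response.raise_for_status()
        return response.json()
```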
Cost tracking from spans
- Compute ai.cost.input_usd and ai.cost.output_usd per llm.call using your price sheet at ingestion time; store ai.tokens.* too.
- Sum costs at the trace level into ai.cost.total_usd; build monthly budgets and per-feature allocation with facet breakdowns (feature, model, market).
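A sketch of computing cost tags from token counts at ingestion time; the price sheet values are placeholders for your provider's actual rates:

```python
# Placeholder price sheet: USD per 1K tokens (substitute your provider's rates)
PRICE_SHEET = {
    "gpt-x": {"input_per_1k": 0.005, "output_per_1k": 0.015},
}

def cost_tags(model: str, prompt_tokens: int, completion_tokens: int) -> dict:
    """Return per-span cost and token tags for an llm.call."""
    prices = PRICE_SHEET[model]
    input_usd = prompt_tokens / 1000 * prices["input_per_1k"]
    output_usd = completion_tokens / 1000 * prices["output_per_1k"]
    return {
        "ai.tokens.prompt": prompt_tokens,
        "ai.tokens.completion": completion_tokens,
        "ai.cost.input_usd": round(input_usd, 6),
        "ai.cost.output_usd": round(output_usd, 6),
        "ai.cost.total_usd": round(input_usd + output_usd, 6),
    }

# span.set_tags(cost_tags("gpt-x", prompt_tokens=512, completion_tokens=128))
```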
Dashboards: starter widget list
- Timeseries: request.root p95 latency; llm.call error rate; tool.call error rate; retry.attempt count.
- Toplist: most expensive models (sum ai.cost.total_usd by ai.model.name).
- Query value: guardrail.fail rate with budget status.
- Heatmap: memory.read latency by ai.memory.store.
- Distribution: request cost per trace.
Alerting patterns
- Latency burn alerts: p95 request.root > SLO for 5 min in env:prod.
- Error budget burn: llm.call error rate > 2x baseline for 10 min.
- Safety regression: guardrail.fail > 1% over 10 min or sudden spikes per feature.
- Cost anomaly: ai.cost.total_usd p95 > weekly baseline + 30%.
Implementation snippets
Python (ddtrace)
```python
from ddtrace import tracer

@tracer.wrap(service="chat-api", resource="POST /chat", span_type="web")
def handle_chat(req):
    with tracer.trace("agent.orchestrate") as s:
        s.set_tag("ai.agent.id", "helpdesk-v2")
        s.set_tag("user.anonymous_id", req.user_hash)

        # Memory read
        with tracer.trace("memory.read") as m:
            m.set_tag("ai.memory.store", "vectordb")
            m.set_tag("ai.memory.k_requested", 8)
            # ... do read, set ai.memory.k_hits, error tags

        # LLM call with retry
        for attempt in range(1, 3):
            with tracer.trace("retry.attempt") as r:
                r.set_tag("retry.attempt", attempt)
                r.set_tag("backoff.ms", 200 * attempt)
                with tracer.trace("llm.call") as l:
                    l.set_tag("ai.model.name", "gpt-x")
                    l.set_tag("ai.tokens.prompt", 512)
                    l.set_tag("ai.temperature", 0.2)
                    # ... invoke model, then:
                    # l.set_tag("ai.tokens.completion", tokens)
                    # l.set_tag("ai.cost.total_usd", cost)
            # break on success
```
Node.js (dd-trace)
```javascript
const tracer = require('dd-trace').init();

async function handleChat(req) {
  const span = tracer.startSpan('agent.orchestrate', {
    tags: { 'service.name': 'chat-api' }, // service override for this span
  });
  span.setTag('user.anonymous_id', req.userHash);
  try {
    const m = tracer.startSpan('memory.read', { childOf: span });
    m.setTag('ai.memory.store', 'vectordb');
    // ...
    m.finish();

    const l = tracer.startSpan('llm.call', { childOf: span });
    l.setTag('ai.model.name', 'gpt-x');
    // ...
    l.finish();
  } catch (e) {
    span.setTag('error', true);
    span.setTag('error.type', e.name);
    span.setTag('error.message', e.message);
  } finally {
    span.finish();
  }
}
```
Common anti-patterns and fixes
- Missing child spans: only tracing the root loses root-cause detail. Add spans for memory/tool/model.
- High-cardinality tags: raw prompt strings explode costs. Replace with lengths/hashes.
- Unlabeled retries: without retry.attempt, errors look like separate incidents. Wrap attempts explicitly.
- Cost as logs only: move cost into llm.call tags so you can graph cost against latency and errors.
Governance and compliance
- Redaction: implement pre-tag redaction for PII; keep raw payloads in restricted logs with short retention.
- Data minimization: collect only what’s necessary to diagnose and optimize.
- Access controls: use role-based dashboards; segregate production and staging via env tags.
How Zypsy helps
Zypsy designs and ships production systems for founders—brand, product, and engineering—using sprint-based delivery and integrated execution. If you want a ready-to-run agent tracing implementation (instrumentation, dashboards, and runbooks) alongside UX that explains model behavior to end users, we can help. Explore our capabilities and get in touch via contact. For investment plus hands-on design support, see Zypsy Capital.
Summary checklist
- [ ] Root, agent, llm, tool, memory, guardrail, retry spans in one trace
- [ ] Stable tag schema (ai.*, retry.*, user.*) with privacy by design
- [ ] Cost, tokens, and first-token latency captured on llm.call
- [ ] Facets and dashboards for model, tool, guardrail, memory
- [ ] SLOs and alerts for latency, reliability, safety, and cost