Generated by Codex with GPT-5
What happened
Databricks’ official blog published “Observability for any agent, anywhere: Production-ready tracing with OpenTelemetry & Unity Catalog on Databricks,” a May 22, 2026 post about treating production AI-agent traces as governed lakehouse data rather than as short-lived telemetry locked inside a separate observability tool.
The post is interesting because it frames agent observability as a data architecture problem. Traditional observability systems are good at operational questions such as whether latency or error rates are rising, but AI agents produce unusually rich traces: prompts, responses, tool calls, retrieval steps, model selections, token counts, intermediate decisions, user feedback, and sometimes sensitive business context. Those traces are too valuable to discard quickly, too sensitive to scatter across unmanaged pipelines, and too analytically useful to leave in systems that were designed mainly for logs, metrics, and dashboards.
Databricks’ answer is to ingest OpenTelemetry traces, logs, and metrics directly into Unity Catalog tables. The agent can run inside or outside Databricks. As long as it emits standard OTel signals through an OpenTelemetry-compatible client or collector, those signals can flow into Databricks’ managed ingestion endpoint and land in Delta tables. Once there, they inherit the same governance, access control, masking, row filtering, SQL queryability, ETL support, and retention economics as other lakehouse data.
The architecture is deliberately single-sink. Instead of sending traces through a chain of collectors, brokers, custom processors, and warehouse-loading jobs, Databricks routes OTel traffic through a serverless ingestion path powered by Zerobus Ingest. Zerobus accepts OTLP over gRPC for standard collectors and REST-style integrations for frameworks such as MLflow, then writes spans, logs, and metrics into Unity Catalog-backed Delta tables. The implementation choice matters because agent traces can be high-volume and text-heavy. Every extra hop adds operational surface area, cost, schema drift risk, and another place where sensitive prompts might be copied.
The data model is another important piece. The setup creates raw OpenTelemetry tables for spans, logs, and metrics, plus MLflow-oriented tables and views for annotations, trace metadata, and unified trace records. That gives teams two levels of access. They can inspect spans at the execution-path level when debugging a specific agent run, or query consolidated records when they need broader analysis across requests, users, tools, models, and outcomes. In practice, the trace becomes both an operational artifact and an evaluation dataset.
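The two levels of access can be illustrated with a small Python sketch that rolls span-level rows up into one record per trace. The field names (`trace_id`, `parent_id`, `kind`, `start_ms`, `end_ms`) and the span names are hypothetical stand-ins chosen for this illustration; they are not the actual Unity Catalog table schema, which the post does not spell out here.

```python
from collections import defaultdict

# Hypothetical span rows. Field names are illustrative, not the real table schema.
spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "name": "agent.run", "kind": "agent", "start_ms": 0, "end_ms": 900},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "name": "llm.chat", "kind": "llm", "start_ms": 10, "end_ms": 200},
    {"trace_id": "t1", "span_id": "c", "parent_id": "a", "name": "genie.query", "kind": "tool", "start_ms": 210, "end_ms": 500},
    {"trace_id": "t1", "span_id": "d", "parent_id": "a", "name": "genie.query", "kind": "tool", "start_ms": 510, "end_ms": 800},
    {"trace_id": "t1", "span_id": "e", "parent_id": "a", "name": "llm.chat", "kind": "llm", "start_ms": 810, "end_ms": 890},
    {"trace_id": "t2", "span_id": "f", "parent_id": None, "name": "agent.run", "kind": "agent", "start_ms": 0, "end_ms": 300},
    {"trace_id": "t2", "span_id": "g", "parent_id": "f", "name": "llm.chat", "kind": "llm", "start_ms": 5, "end_ms": 290},
]


def summarize_traces(span_rows):
    """Collapse span-level rows into one consolidated record per trace."""
    by_trace = defaultdict(list)
    for row in span_rows:
        by_trace[row["trace_id"]].append(row)

    summaries = {}
    for trace_id, rows in by_trace.items():
        # The root span is the one without a parent; skip traces that lack one.
        root = next((r for r in rows if r["parent_id"] is None), None)
        if root is None:
            continue
        summaries[trace_id] = {
            "root_name": root["name"],
            "duration_ms": root["end_ms"] - root["start_ms"],
            "llm_calls": sum(r["kind"] == "llm" for r in rows),
            "tool_calls": sum(r["kind"] == "tool" for r in rows),
        }
    return summaries


trace_summaries = summarize_traces(spans)
```

The span rows stay available for debugging a single run, while the rolled-up records are the shape that is convenient for analysis across many requests.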
Databricks demonstrates the flow with a support-manager assistant built with LangGraph, a Databricks-hosted Claude model, and a Genie Space exposed as a tool through MCP. The agent answers a business question by calling the Genie tool several times, and the trace shows the model calls, tool calls, intermediate inputs and outputs, and final answer as one end-to-end execution. The example is simple, but it clarifies the systems point: the most useful debugging view is not just the final response or the average latency. It is the path the agent took through tools and data before producing that response.
Why it matters
The strongest idea in the post is that production agent traces should be part of the organization’s analytical data plane. Once traces are Delta tables, teams can write SQL over them, build dashboards, feed downstream pipelines, expose natural-language analysis through Genie, and join agent behavior to business outcomes. An operations team can ask which tool is driving P99 latency. A product team can ask which request classes lead to escalation. A finance team can apply contract-specific model pricing instead of relying on generic list-price estimates. A quality team can mine real user traces to build evaluation sets.
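As a rough sketch of two of these questions, the Python below computes a nearest-rank P99 latency per tool and prices token usage with a negotiated rate table. The tool names, model name, latencies, and prices are all invented for illustration; in practice the same logic would more likely be written as SQL over the trace tables.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def p99_by_tool(tool_rows):
    """Group (tool_name, latency_ms) rows and return the P99 latency per tool."""
    latencies = defaultdict(list)
    for tool_name, latency_ms in tool_rows:
        latencies[tool_name].append(latency_ms)
    return {tool: percentile(vals, 99) for tool, vals in latencies.items()}


# Hypothetical contract prices in dollars per million tokens.
CONTRACT_PRICE_PER_MTOK = {
    "hypothetical-model": {"input": 2.00, "output": 8.00},
}


def llm_cost(model, input_tokens, output_tokens):
    """Price a call with the contract rate table instead of list prices."""
    price = CONTRACT_PRICE_PER_MTOK[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


# Hypothetical tool-span rows: (tool name, latency in milliseconds).
tool_rows = [
    ("genie.query", 420),
    ("genie.query", 510),
    ("genie.query", 4800),
    ("vector.search", 80),
    ("vector.search", 95),
]

p99 = p99_by_tool(tool_rows)
slowest_tool = max(p99, key=p99.get)
```

Keeping the price table separate from the trace data is the design choice that makes contract-specific pricing possible: token counts come from the traces, while rates come from a table the finance team controls.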
That last point is especially important for AI systems. Agent evaluations often start with synthetic cases, hand-curated prompts, or narrow benchmark tasks. Those are useful, but production failures usually come from the messy distribution of real work: ambiguous requests, repeated tool retries, prompt patterns that trigger expensive loops, edge-case customer data, and long execution paths that were not represented in test sets. If traces are retained and queryable, teams can continuously sample real interactions, label or score them, and turn them into evaluation data. The post’s “continuous improvement flywheel” is not just a slogan; it is a concrete loop from production trace to dataset to MLflow evaluation to monitoring policy or application change.
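One way to picture the first step of that loop is a stratified sample of production traces, so that rare request classes are not drowned out by common ones. This is a minimal sketch under assumed field names (`request_class`, `request`, `response`); it is not the MLflow evaluation API, and the labeling or scoring step is left as an empty `label` field.

```python
import random
from collections import defaultdict


def sample_eval_cases(traces, per_class, seed=0):
    """Pick up to `per_class` traces from each request class as unlabeled eval cases."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_class = defaultdict(list)
    for trace in traces:
        by_class[trace["request_class"]].append(trace)

    cases = []
    for request_class in sorted(by_class):
        rows = by_class[request_class]
        for trace in rng.sample(rows, min(per_class, len(rows))):
            cases.append({
                "trace_id": trace["trace_id"],
                "request_class": request_class,
                "input": trace["request"],
                "observed_output": trace["response"],
                "label": None,  # filled in later by a reviewer or a scoring job
            })
    return cases


# Hypothetical trace records: five common "billing" requests and one rare "escalation".
production_traces = [
    {"trace_id": f"b{i}", "request_class": "billing", "request": f"q{i}", "response": f"a{i}"}
    for i in range(5)
] + [
    {"trace_id": "e0", "request_class": "escalation", "request": "q", "response": "a"},
]

eval_cases = sample_eval_cases(production_traces, per_class=2)
```

Sampling per class rather than uniformly is a deliberate choice here: the edge cases that cause production failures tend to be the rare ones.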
The governance angle is just as practical. Prompts and responses can contain personal data, contractual terms, confidential customer records, source code, or internal system names. Sending that raw material to a third-party observability service can create a security review bottleneck, while stripping the content too aggressively can make traces useless for debugging and evaluation. Keeping traces in Unity Catalog does not solve privacy automatically, and Databricks explicitly says the feature does not apply special PII handling on its own. But it does put trace data under the same permissioning, masking, and filtering model that data teams already use for regulated datasets.
The post also shows why AI observability needs component-level structure. A normal service trace might identify which microservice is slow. An agent trace has to explain whether the bottleneck was retrieval, an LLM call, a tool invocation, a retry loop, a prompt-expansion step, or a model handoff. Databricks’ example dashboards separate cost and latency by model and tool, which is the level where engineering action usually happens. A high overall latency number is not enough; teams need to see that a specific retrieval tool is slow, a particular model is producing too many tokens, or one request class repeatedly triggers unnecessary tool calls.
There is also a cost and retention lesson. Agent traces contain large text payloads, so retaining them in a conventional observability platform can become expensive quickly. Storing them as Delta tables on object storage is typically cheaper per gigabyte, which makes it more plausible to keep traces long enough for offline analysis, audit, regression detection, and quality improvement. That shifts observability from a short-term debugging tool into a long-term learning system.
Takeaway
Databricks’ post is a useful example of the next layer of agent infrastructure: not another orchestration framework, but the data plane needed to understand and improve agents after they are deployed. The important move is to make OpenTelemetry traces durable, governed, queryable, and connected to evaluation workflows.
For teams building production agents, the broader engineering takeaway is to design observability as part of the agent architecture from the beginning. Instrument model calls and tool calls with standard traces. Store enough context to reconstruct execution paths. Govern prompts and responses as sensitive data. Join traces with business and quality outcomes. Use real production behavior to refresh evaluation sets and monitoring rules.
Agents are difficult to improve when their behavior disappears into application logs or isolated dashboards. They become easier to operate when every run leaves behind a structured record that engineers, data scientists, security teams, and product teams can analyze with the same tools they already use for critical data. Databricks’ architecture points toward that operating model: agent traces as first-class production data.