Your Observability Tool Just Got a Second User
“It’s almost like I have a full-time person whose job is making sure my stack is always green”
💌 Hey there, it’s Elizabeth from SigNoz!
This newsletter is an honest attempt to talk about all things observability, OpenTelemetry, open source, and the engineering in between! We at SigNoz are a bunch of observability fanatics obsessed with OpenTelemetry and open source, and we reckon it’s important to share what we know. If this passes your vibe-check, we’d be pleased if you’d subscribe. We’ll make it worth your while.
This post took 3 days and 12 hours to put together, so make sure to show some love!
A few weeks ago, Leo Marechal stopped opening his SigNoz dashboards.
Not because they were bad; they weren’t. He’d set them up himself, spent a weekend tuning them, and he says he liked them. He just stopped needing to look. “I have dashboards because I set them up at the beginning,” he told us. “I haven’t opened them except for screenshots for my ISO certification.”
This is not a story about dashboards being obsolete; plenty of teams will still want them. It’s a story about what happens when something else starts reading your telemetry patiently, in parallel, while you’re trying to get your fourteen-month-old back to sleep, as Leo puts it.
For a long time, observability tools have been built for one user: a human at a screen, scrolling logs, hovering over a flame graph, deciding what to do next. But there is now a second user in the building, and they are strange. They don’t read dashboards; they read APIs and JSON. They open a dozen context windows at once and, most importantly, they never get tired.
They are, in the way that matters for an SRE on call, infinitely patient.
What follows is a look at what one engineer built for that second user, why it worked, and the parts that still belong to the first.
What Leo built
Leo runs infrastructure at a small startup with three engineers, including him. Their product is the kind of microservice-heavy data plumbing that demands real observability and absolutely cannot justify a dedicated on-call rotation. He started, like a lot of people do, by trying Datadog. The two-week trial ended with a quote north of two thousand dollars.
Datadog’s AI SRE feature (Bits AI SRE), which lets you click a button on an error to have an agent investigate, was quoted at roughly $40 per investigation.
“You wouldn’t do it, right?” he said. “So instead, what that inspired me to do was: wait, I can do this.”
He moved to SigNoz: open source, OpenTelemetry-native, with a hosted MCP server that he could plug into Claude. Setup took, by his account, an afternoon. Then, instead of paying per click for someone else’s investigation agent, he built his own and let it run.
The architecture is unromantic and worth describing in detail, because the details are where the difference lives:
Alerts originate in SigNoz. Rules are tuned by humans; when a rule fires, a webhook hits the agent.
The agent is Claude Opus, running on a small VM, kept alive by PM2.
It reads telemetry through the SigNoz MCP server: logs, traces, and metrics, the same data a human would query, surfaced through natural-language tool calls instead of a query bar.
It has read-only access to Kubernetes through an RBAC-scoped service account. It can describe a pod, look at events, and inspect an Argo CD application, but it cannot change anything.
It has read access to GitLab. When something breaks, the first question it asks is whether anyone merged anything in the last hour.
It talks to humans through Slack in socket mode, not just a one-way notification firehose but an actual conversation.
It writes to a Postgres database as its working memory and diligently records every alert it has seen, every investigation it has done and every conclusion it has reached.
It writes incident reports to Notion when something is resolved, so the post-mortem isn’t a task someone has to do later. It already exists.
It restarts itself every day at 4 a.m. so its context window doesn’t fill up, reading a summary of its own memory on the way back up.
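The memory and daily-restart items in the list above can be sketched roughly like this. It is a minimal illustration, not Leo’s code: it uses Python’s built-in sqlite3 in place of Postgres so it runs anywhere, and the table layout and the `open_memory`, `record_alert`, and `restart_summary` helpers are all hypothetical names invented for the example.

```python
import sqlite3

# Hypothetical schema: one row per alert the agent has investigated.
# fired_at is an ISO 8601 string; verdict is e.g. "transient" or "escalated".
SCHEMA = """
CREATE TABLE IF NOT EXISTS alerts (
    id INTEGER PRIMARY KEY,
    rule TEXT NOT NULL,
    fired_at TEXT NOT NULL,
    verdict TEXT NOT NULL,
    conclusion TEXT NOT NULL
)
"""


def open_memory(path=":memory:"):
    """Open the working-memory store (sqlite3 here, standing in for Postgres)."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def record_alert(conn, rule, fired_at, verdict, conclusion):
    """Persist one investigated alert so it survives a context reset."""
    conn.execute(
        "INSERT INTO alerts (rule, fired_at, verdict, conclusion) VALUES (?, ?, ?, ?)",
        (rule, fired_at, verdict, conclusion),
    )
    conn.commit()


def restart_summary(conn):
    """Build the short digest a freshly restarted agent would read first."""
    rows = conn.execute(
        "SELECT rule, COUNT(*), MAX(fired_at) FROM alerts "
        "GROUP BY rule ORDER BY COUNT(*) DESC, rule"
    ).fetchall()
    return "\n".join(
        f"{rule}: seen {count}x, last at {last}" for rule, count, last in rows
    )
```

The point of the design is that the digest stays a few lines long no matter how many alerts have been recorded, so the agent starts each day with history but without a full context window.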
One of the first things in Leo’s playbook for the agent is, in his words, a warning to itself: “Your context window is your lifeline. If it fills up, you die and must restart.” The agent is instructed to never investigate directly, but to spawn sub-agents with fresh context windows, smaller and cheaper models that do the dirty work and report a single paragraph back. The main agent stays clean, and the sub-agents remain disposable.
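One way to picture that delegation rule: the main agent never holds raw telemetry, only a short summary handed back by a disposable worker. The sketch below is an illustration under assumptions, not Leo’s implementation. `MainAgent`, `max_summary_chars`, and `fake_subagent` are hypothetical, and the sub-agent is a plain function rather than a model call so the example runs without any API.

```python
from dataclasses import dataclass, field


@dataclass
class MainAgent:
    """Keeps its own context small by storing only sub-agent summaries."""

    context: list = field(default_factory=list)
    max_summary_chars: int = 600  # hypothetical cap on what a sub-agent may report

    def investigate(self, alert, raw_telemetry, run_subagent):
        # The sub-agent starts fresh: it sees only the alert and the raw data.
        summary = run_subagent(alert, raw_telemetry)
        # Only a bounded paragraph flows back; the raw data is never stored here.
        summary = summary[: self.max_summary_chars]
        self.context.append(f"{alert}: {summary}")
        return summary


def fake_subagent(alert, raw_telemetry):
    """Stand-in for a smaller, cheaper model: reads everything, reports little."""
    errors = [line for line in raw_telemetry if "ERROR" in line]
    return f"{len(errors)} error lines out of {len(raw_telemetry)} scanned."
```

A thousand log lines go into the sub-agent and one sentence comes back, which is the property that keeps the main agent’s context clean.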
What it actually does
When a noisy alert fires, say, a node briefly reporting memory pressure that will be fine in ten minutes, the agent receives the webhook, queries SigNoz, decides it’s transient, and logs the event to its memory.
When the same alert fires for the eleventh time that week, the agent’s memory tells it this is no longer noise but a pattern. So it pings Leo in Slack and explains itself: here is the alert, here is how often I’ve seen it, here is what I think is going on, here is the merge request from yesterday that I suspect is the cause. Leo can look into it himself and reply whenever he is available. At the end of the day, the agent files what amounts to an end-of-shift report covering what it saw.
“It’s almost like I have a full-time person whose job is making sure my stack is always green,” Leo said.
It has been live for about three weeks, and in this time, by his count, it has produced zero false positives. It has had one false negative: one real problem it categorised as noise. He has been quietly extending it. Last week, he added Falco for security events, wrote a new rule into the agent’s playbook, and watched it pick up the alert and start reasoning about it.
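The noise-versus-pattern call can be approximated with a simple count over a time window. This is a hedged sketch rather than the agent’s actual logic, which is driven by a model and a playbook: `should_escalate`, the seven-day `window`, and the `threshold` of ten are all assumptions chosen to match the “eleventh time that week” example.

```python
from datetime import datetime, timedelta


def should_escalate(past_firings, now, window=timedelta(days=7), threshold=10):
    """Return True when an alert has fired often enough to count as a pattern.

    past_firings: datetimes of earlier firings of the same rule, read from memory.
    window and threshold are hypothetical tuning knobs, not values from Leo's setup.
    """
    recent = [t for t in past_firings if now - window <= t <= now]
    # The current firing is not in memory yet, so count it as one more.
    return len(recent) + 1 > threshold
```

With ten earlier firings inside the window, the current one is the eleventh and the function returns True; with three, it stays quiet and the event is only logged.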
Context isn’t care
The temptation, reading this, is to assume the agent is doing the human’s job; it isn’t. It’s doing a job that didn’t really exist before, or rather, that existed as the worst part of a different job, the part most folks hated.
There is a useful distinction here, one that SigNoz’s own framing of agent-native observability gets right: agents bring context, humans bring care.
An agent connected to your telemetry has more context than any human ever will. It has read every trace in the last hour, scanned every log line, and can correlate ten metrics across five services in the time it takes you to find the right tab. The classic observability complaint, “I don’t have time to look at all of this,” is, for the first time, no longer a problem.
But we have to realise that context isn’t the same as care. The agent does not know which 500 actually matters. It does not know that one customer is on a renewal call this afternoon, or that the checkout service degrading by 80ms is a Series A milestone, or that the noisy memory alert on the staging cluster genuinely doesn’t matter because that cluster is being torn down on Friday. You can tell it some of these things, but you cannot tell it all of them, and the list keeps changing.
This is why every team we’ve watched try to hand alert judgment entirely to an agent has ended up in the same place. The agent produces noisy pages, the team learns to ignore them as a waste of time and effort, and within two weeks the experiment is uninstalled.
Leo’s setup doesn’t fall into that trap, since he kept the rules, including the actual alert thresholds, under human control and written deterministically in SigNoz. The agent’s job is downstream of that: triage what fires, investigate what looks real, escalate what matters. Human up top, deterministic system in the middle, agent at the bottom where the grunt work lives. None of those layers is doing the job of the one above it.
Or, the way he put it: “I don’t want the AI to replace something. I want the AI to augment.”
What this asks of an observability platform
If you take the second-user idea seriously, that AI agents are going to be reading your telemetry the way humans read dashboards, a few things start being product decisions.
The data model has to be unified. When a human is investigating, they can paper over schema mismatches between logs, traces, and metrics by squinting, but an agent can’t. If your logs live in one schema and your traces in another and the correlation has to be reconstructed by hand each time, your agent will spend most of its tokens figuring out where it is instead of figuring out what’s wrong. This is one of the reasons SigNoz built on OpenTelemetry from day one and kept everything in a single store. It wasn’t made as an agent-era decision; it just turned out to be one.
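To make the correlation point concrete: when logs and spans carry the same trace identifier, joining them is a lookup rather than a reconstruction. The sketch below is illustrative only. The `trace_id` and `body` keys echo the TraceId and Body fields of the OpenTelemetry log data model, but the dictionary shapes and the `logs_by_trace` helper are simplified assumptions, not SigNoz’s storage schema.

```python
from collections import defaultdict


def logs_by_trace(spans, logs):
    """Group log bodies under the trace they belong to, via the shared trace_id."""
    known_traces = {span["trace_id"] for span in spans}
    grouped = defaultdict(list)
    for record in logs:
        trace_id = record.get("trace_id")
        # Logs without a trace_id, or from traces we are not looking at, are skipped.
        if trace_id in known_traces:
            grouped[trace_id].append(record["body"])
    return dict(grouped)
```

Without a shared identifier, the same join would need timestamp and hostname heuristics, which is exactly the token-burning guesswork described above.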
The interface has to speak to agents directly. Dashboards are a human-facing rendering of the underlying data, and agents don’t need that. They need the underlying data, exposed through something they can actually call, which, today, is synonymous with MCP. SigNoz’s hosted MCP server is live for Cloud users, and the self-hosted version is on GitHub. Leo’s whole setup runs through it.
The platform has to be legible to a model. Models have been trained on years of OpenTelemetry tutorials, SigNoz documentation, GitHub issues, and Stack Overflow answers. They already know how to talk to an open, well-documented stack. They have a much harder time with proprietary query languages and gated docs, where the model has to reason from forum posts and Reddit threads.
The agent has to be powered with skills. A raw MCP server is a bag of capabilities. The next layer up is teaching agents the conventions of your team: which metrics matter, which tracing patterns hold up at scale, and what your debugging tribal knowledge actually looks like (more and more context). Eventually, your agent works the way your runbook says it should.
Two users, one platform
The shape of agent-native observability, at least for the foreseeable future, is two users sharing the same underlying data and asking different things of it.
The human user wants a clear view of what’s happening, the ability to ask hard questions, and confidence that the system will tell them when something they care about is wrong. The agent user wants raw data, fast tool calls, clean schemas, and enough structure to reason without getting lost. The platform that serves both well is the one that stops treating dashboards as the product and starts treating the underlying telemetry as the product with multiple ways to consume it.
Leo’s setup is a small early example of what this looks like in practice. The rules live in SigNoz, configured by him. The agent picks up the work that scales poorly with human attention (reading everything, correlating everything, remembering everything) and gives back the work that scales well: deciding what matters, choosing what to monitor, knowing when something is actually wrong.
Both users belong on the platform now. We’re building for both.
If you want to plug your own agent into your telemetry, the SigNoz MCP server is live for Cloud users and open source on GitHub for everyone else. The longer picture, including the AI Assistant beta and the skills work coming next, lives at Agent Native Observability in SigNoz.

