How the Sharks Do Observability
An account of how Netflix and Uber observe their massive systems every day.
💌 Hey there, it’s Elizabeth from SigNoz!
This newsletter is an honest attempt to talk about all things observability, OpenTelemetry, open source, and the engineering in between! We at SigNoz are a bunch of observability fanatics obsessed with OpenTelemetry and open source, and we reckon it’s important to share what we know. If this passes your vibe check, we’d be pleased if you’d subscribe. We’ll make it worth your while.
This blog took 6 days and 7 hours to curate, so make sure to show some love!
As an observability enthusiast working at an observability startup and running an observability newsletter, I find this topic wildly fascinating. I know a fair amount of lore about how companies thought about and invented (or, more precisely, reinvented) their observability systems to support their growing scale. But two of these stories have stuck with me, because each company broke its observability system at a critical moment of growth and rebuilt it in a completely different, particularly breathtaking way. There’s a lot to learn from both!
1. Netflix
Netflix’s observability origin story starts in a place that will make most engineers wince. In May 2011, Netflix was using a home-grown solution called Epic to manage time-series data: a combination of Perl CGI scripts, RRDTool logging, and MySQL. Telemetry was split between this home-grown tool and an IT-provisioned commercial product. Epic’s flexibility (engineers could send in arbitrary time-series data and query it) made it popular, and it became the primary system of record.
They were tracking around 2 million distinct time series, the monitoring system was regularly failing to keep up with the volume of data, and several changes were about to make things dramatically worse: Netflix was shifting from rolling pushes to red/black deployments, starting to actually leverage auto-scaling rather than just using fixed-size groups, and expanding internationally into Latin America and Europe.
All these changes required scaling by at least an order of magnitude, from 2 million metrics to 20 million or more. Perl CGI scripts and MySQL were never going to handle what Netflix was becoming; it was simply beyond what Epic was capable of.
So in early 2012, they started building Atlas, and by late 2012, it was being phased into production, with full deployment completed in early 2013.
The design philosophy behind Atlas is a chapter filled with learnings. Atlas features in-memory data storage, allowing it to gather and report very large numbers of metrics very quickly. It captures operational intelligence: whereas business intelligence analyses trends over time, operational intelligence provides a picture of what is currently happening within a system.
Since their focus was primarily on operational insight, the top priority was determining what’s going on right now. This led to the following rules of thumb:
1/ data becomes exponentially less important as it gets older
2/ restoring service is more important than preventing data loss
This is a fundamentally different philosophy from “store everything forever.” Netflix decided that recent data matters enormously and old data barely matters at all.
The internal Atlas deployment breaks data into multiple time windows. The last 6 hours of data is kept fully in memory, so they can show recent data as long as clients can successfully publish. Everything is sharded across machines in these in-memory clusters. For older data, they compute rollups via Hadoop processing, drastically reducing data volume for historical queries.
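The rollup idea behind that tiered retention can be sketched in a few lines. This is a hypothetical illustration, not Netflix’s actual Hadoop job: older raw datapoints are collapsed into coarse windows that keep just enough (avg, min, max) to answer historical queries at a fraction of the volume.

```python
from collections import defaultdict

def rollup(points, window_s):
    """Downsample (timestamp, value) points into fixed windows, keeping
    the average, min, and max per window so coarse queries stay useful."""
    buckets = defaultdict(list)
    for ts, val in points:
        buckets[ts - ts % window_s].append(val)
    return {
        start: {"avg": sum(vs) / len(vs), "min": min(vs), "max": max(vs)}
        for start, vs in sorted(buckets.items())
    }

# One metric sampled every 60s for 10 minutes, rolled up into 5-minute windows.
raw = [(t, float(t // 60)) for t in range(0, 600, 60)]
coarse = rollup(raw, 300)  # 10 raw points collapse into 2 rollup windows
```

Ten raw points become two summary rows; at billions of datapoints, that kind of reduction is what makes multi-year retention affordable.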
One of the things they really wanted to fix from Epic was how dimensions worked. In the old system, everything was mangled into a metric name with different conventions per team, and users had to resort to complex regular expressions to slice and dice data. In Atlas, a metric’s identity is an arbitrary, unique set of key-value pairs. Some keys are set automatically by the client library (server name, AWS zone, ASG, cluster, application, region), with significant flexibility for users to specify whatever keys make sense for their use case.
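The difference between name-mangled metrics and tag-based identity is easy to show. The sketch below is a toy model (not the Atlas client library): a series is identified by a set of key-value pairs, and queries match on tag equality instead of regexes over a concatenated name.

```python
# Hypothetical sketch: a metric's identity as a frozen set of key-value
# tags, so "slice and dice" becomes subset matching, not regex surgery.
def metric_id(**tags):
    return frozenset(tags.items())

def query(store, **want):
    """Return values of every series whose tags include all of `want`."""
    want = set(want.items())
    return [v for tags, v in store.items() if want <= tags]

store = {
    metric_id(name="requests", app="api", region="us-east-1"): 120,
    metric_id(name="requests", app="api", region="eu-west-1"): 80,
    metric_id(name="latency", app="api", region="us-east-1"): 35,
}
by_name = query(store, name="requests")          # both regions
one_region = query(store, name="latency", region="us-east-1")
```

Keys like server name or AWS zone would be added automatically by the client library; the user only supplies what’s meaningful for their use case.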
The growth numbers tell the story of why all this mattered. In 2011, they were monitoring 2 million metrics. By 2014, they were at 1.2 billion metrics, and the numbers continued to rise. They routinely see Atlas fetch and graph many billions of datapoints per second. Today, Atlas processes 17 billion metrics and 700 billion distributed traces per day on 1.5 petabytes of log data, and the system’s architecture has kept observability data processing to less than 5% of Netflix’s infrastructure costs!
But even Atlas hit its limits. A few years ago, Netflix’s SRE team was paged because the alerting system was falling behind: critical application health alerts were reaching engineers 45 minutes late. One platform team had programmatically created tens of thousands of new alerts, which overwhelmed Atlas’s query capacity. Netflix was looking at an order-of-magnitude increase in alert queries over the next 6 months, and scaling up Atlas’s storage layer to serve that volume would have been prohibitively expensive, since Atlas was already one of Netflix’s largest services in both size and cost.
Their answer was Atlas Streaming Eval, moving alerting from a cron-based query model to a streaming model. Today, they run 20x as many alert queries as a few years ago, at a fraction of the cost. Multiple platform teams at Netflix programmatically generate and maintain alerts on behalf of their users without affecting others, and streaming evaluation enabled them to relax cardinality restrictions and to alert on queries that were previously rejected.
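The shift from cron-based queries to streaming evaluation can be illustrated with a toy alert (my own sketch, not the Atlas Streaming Eval implementation). Instead of re-querying storage on every tick, each incoming datapoint updates the alert’s state once, so cost scales with ingest volume rather than with the number of alert queries.

```python
# Hypothetical streaming alert: state is updated per datapoint as it
# arrives, so evaluating 20x more alerts adds no extra storage reads.
class StreamingAlert:
    def __init__(self, threshold, consecutive=3):
        self.threshold = threshold
        self.consecutive = consecutive  # breaches needed before firing
        self.breaches = 0
        self.firing = False

    def observe(self, value):
        """Fold one incoming datapoint into the alert's state."""
        self.breaches = self.breaches + 1 if value > self.threshold else 0
        self.firing = self.breaches >= self.consecutive
        return self.firing

alert = StreamingAlert(threshold=0.9, consecutive=3)
states = [alert.observe(v) for v in [0.5, 0.95, 0.97, 0.99, 0.4]]
# fires only after 3 consecutive breaches, then clears on recovery
```

A cron-based evaluator would instead run `query(last_5m)` for every alert on every tick, which is exactly the load that was crushing Atlas.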
What’s special here is that instead of throwing more hardware at the problem, they changed the model entirely, and in my opinion, that’s what separates great observability teams from the rest.
Some interesting references!
Introducing Atlas: Netflix’s Primary Telemetry Platform (Netflix Tech Blog)
Improved Alerting with Atlas Streaming Eval (Netflix Tech Blog)
Lessons from Building Observability Tools at Netflix (Netflix Tech Blog)
Solving Mysteries Faster with Observability (InfoQ / QCon)
Atlas Documentation (Netflix OSS)
2. Uber
Uber’s observability story starts in 2014 with a Graphite, Carbon, and Whisper stack that was held together very loosely. By late 2014, all services, infrastructure, and servers at Uber emitted metrics to a Graphite stack, which stored them in the Whisper file format in a sharded Carbon cluster. They used Grafana for dashboarding and Nagios for alerting, issuing Graphite threshold checks via source-controlled scripts.
The problems were fundamental: the stack wasn’t horizontally scalable, so you couldn’t add capacity just by adding machines. There were no replicas, so a single node dying meant losing an eighth of all of Uber’s data, and adding capacity required taking the system offline for a week or more. Martin Mao’s first on-call week was spent deleting data from the backend just to keep the observability stack alive.
First, they did a quick fix by swapping in Cassandra for time-series storage and Elasticsearch for the metrics index, all stitched together with Go. They stood up this new system in time for Halloween 2015, Uber’s second-largest peak load event. That year was the first time Uber’s observability system didn’t have an outage during the Halloween peak.
But Cassandra was the wrong tool for the job because they were using it as a time-series database even though it was built as a key-value store. As they entered their hyper-growth phase, the firefighting that had plagued the Graphite years resurfaced in a new form.
The team decided to build M3DB from scratch: a custom time-series database with an embedded inverted index. The architecture they landed on is worth understanding in detail.
Applications on hosts emit metrics to a local daemon called Collector, which aggregates them at 1-second intervals and then forwards them to the aggregation tier using a shard-aware topology retrieved from etcd. The aggregation tier further aggregates into 10-second and 1-minute tiles, and the M3DB ingestor writes them to the storage tier. M3 Coordinator acts as a Prometheus sidecar, providing a global query and storage interface on top of M3DB clusters. It handles downsampling and ad hoc retention using rollup rules stored in etcd, which runs embedded in the binary of an M3DB seed node.
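The write path above can be sketched in miniature. This is an illustrative model, not M3’s code: a stable hash routes each series to a shard (the shard count here is an assumption), and the aggregation tier folds 1-second samples into 10-second tiles before they reach storage.

```python
import zlib

NUM_SHARDS = 4096  # assumed shard-space size, for illustration only

def shard_for(series_id: str) -> int:
    # Stable hash so every collector routes a series to the same shard,
    # mirroring the shard-aware topology idea.
    return zlib.crc32(series_id.encode()) % NUM_SHARDS

def tile(samples, tile_s=10):
    """Aggregate (second, value) samples into tile_s-second sums."""
    tiles = {}
    for sec, val in samples:
        start = sec - sec % tile_s
        tiles[start] = tiles.get(start, 0.0) + val
    return tiles

one_second = [(s, 1.0) for s in range(0, 30)]  # 1 sample/sec for 30s
tiles = tile(one_second)  # 30 raw samples become 3 ten-second tiles
```

The payoff is in the ratio: the next section’s numbers (500 million ingested vs 20 million persisted per second) are exactly this kind of pre-storage aggregation at scale.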
Let’s look at the results (quite phenomenal). In any given second, M3 processes 500 million metrics and persists another 20 million aggregated metrics. Extrapolated over a 24-hour cycle, that’s roughly 43 trillion metrics per day, and the platform also houses over 6.6 billion time series!
The really interesting engineering is in the high-dimensionality problem. High-dimensionality metrics (data tracked over time along many aspects, like route, region, and status code) are critical to the business but costly at Uber’s scale. A single emission could lead to 100 million unique time series, and because code changes roll out to specific groups of cities over a few hours, they need city-level monitoring granularity. Different cities have different configurations; for example, rider pickups might be blocked on a street due to a parade, or local events can cause traffic changes.
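Why one emission can explode into ~100 million series comes down to multiplication: cardinality is the product of each tag’s distinct values. The dimensions and counts below are invented for illustration, not Uber’s actual tags.

```python
from math import prod

# Illustrative (assumed) tag dimensions on a single metric emission.
dimensions = {
    "city": 700,         # city-level granularity for staged rollouts
    "endpoint": 400,
    "status_code": 60,
    "app_version": 6,
}
# Worst case: every combination of tag values is a distinct time series.
unique_series = prod(dimensions.values())  # ~100 million
```

Adding one more 10-value tag would multiply that again, which is why cardinality limits and rollup rules matter so much.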
Their alerting ecosystem is equally bespoke; it includes two in-datacenter alerting systems: uMonitor for time-series metrics-based alerting against M3, and Neris for host-level checks. Both feed into a common notification and deduplication pipeline called Origami. uMonitor uses static thresholds for steady-state metrics and anomaly thresholds via Argos, Uber’s anomaly detection platform, which generates dynamic thresholds from historical data.
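The two threshold styles can be sketched side by side. This is my own toy model: the static check mirrors uMonitor’s steady-state thresholds, while the dynamic one stands in for an Argos-style threshold derived from history (mean plus three standard deviations here, an assumed rule, not Argos’s actual algorithm).

```python
import statistics

def static_breach(value, limit):
    # Steady-state metrics: a fixed, hand-set limit.
    return value > limit

def dynamic_threshold(history, sigmas=3.0):
    # Anomaly-style: derive the limit from recent history.
    return statistics.mean(history) + sigmas * statistics.pstdev(history)

history = [100, 102, 98, 101, 99, 100]   # a calm baseline
limit = dynamic_threshold(history)        # ~103.9 for this data
```

The dynamic variant adapts as the baseline drifts, which is what makes it usable for metrics with no sensible fixed threshold.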
They also added Jaeger, their open-source distributed tracing system. Jaeger’s distributed tracing follows requests from one service to another, composing a narrative of what happened and what went wrong, making it much easier to pinpoint causation.
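What “composing a narrative” means mechanically is that every hop shares a trace ID while each operation gets its own span linked to its parent. The sketch below is a bare-bones illustration of that data model, not Jaeger’s actual API; the service names are made up.

```python
import uuid

def new_span(trace_id, service, parent=None):
    # One unit of work in one service, linked into the request's trace.
    return {"trace_id": trace_id, "span_id": uuid.uuid4().hex,
            "service": service, "parent": parent}

trace_id = uuid.uuid4().hex
root = new_span(trace_id, "api-gateway")                       # entry point
child = new_span(trace_id, "rides-service", parent=root["span_id"])
same_request = root["trace_id"] == child["trace_id"]
```

A tracing backend groups spans by `trace_id` and walks the parent links to rebuild the request’s path across services, which is what makes pinpointing causation fast.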
The operational improvement after M3 was dramatic. Setting up monitoring in new data centres became 4x faster, and the operational maintenance burden dropped by over 16x, while combined high/low-urgency notifications per week went from 25 with Cassandra to 1.5 with M3DB. 👏🏻
Over a million unique queries hit these systems every day, and more than half of Uber’s engineering team uses these observability tools daily.
Some resources that were my references and really good reads!
M3: Uber’s Open Source, Large-Scale Metrics Platform for Prometheus (Uber Engineering Blog)
How Uber Built its Observability Platform (The Pragmatic Engineer)
Observability at Scale: Building Uber’s Alerting Ecosystem (Uber Engineering Blog)
Optimizing M3: How Uber Halved Metrics Ingestion Latency by Forking the Go Compiler (Uber Engineering Blog)
Optimizing Observability with Jaeger, M3, and XYS (Uber Engineering Blog)
But here’s an interesting dilemma. What happens when the product you’re monitoring is the monitoring tool itself? When the observability system that’s supposed to tell you everything is broken... is the same system you need to diagnose the problem?
At SigNoz, we solved this exact problem by building a system called Nightswatch, a Game of Thrones-themed architecture featuring builders, rangers, and stewards that runs SigNoz to monitor SigNoz.
That story drops in the next edition. Stay tuned.
Feel free to check out our blogs and docs here. Our GitHub is over here, and while you are at it, we’d appreciate it if you sent a star ⭐ our way. You’re also welcome to join the conversation in our growing Slack community for the latest news!

