There is a special kind of meeting that I have attended too many times in my career. Something is wrong in production. Users are complaining. And the room is full of smart engineers guessing. Maybe it is the database. Maybe the cache. Maybe that deploy from Tuesday. Everyone has a theory and nobody has evidence, because the system is a black box and the only data we have is the complaint itself.
Every hour of that meeting is expensive twice. Once in the salaries sitting in the room, and once in the users hitting the problem while we guess.
Observability is the discipline of never being in that meeting. Or at least, of making that meeting ten minutes long.
Monitoring tells you that, observability tells you why
The classic setup is monitoring: dashboards and alerts for problems you predicted in advance. CPU is high, disk is full, error rate crossed a line. Monitoring answers questions you wrote down before the incident.
The problem is that real incidents are creative. They are combinations nobody predicted. Only users on this app version, with this payment type, in this region, after this feature flag. Your dashboard has no panel for that, because you did not know to build one.
Charity Majors has been loud about this distinction for years, and the Observability Engineering book she co-wrote in 2022 is the best single source I can point to. The core idea: instrument your systems with rich, high-cardinality events and traces, so you can ask new questions about production without shipping new code first. Not “is the thing I predicted happening,” but “what is actually happening, slice it by anything.” The Google SRE book had laid the foundation earlier with its focus on SLOs: deciding explicitly how much unreliability is acceptable, and measuring against that instead of against vibes.
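To make "slice it by anything" concrete, here is a minimal sketch in plain Python. The event fields (`app_version`, `payment`, `region`, `flag_new_cart`) and the `slice_by` helper are hypothetical names I made up for illustration; a real setup would send these events to whatever backend you use and query them there.

```python
from collections import Counter

# Hypothetical "wide" events: one record per request, with every field
# that might matter later attached at the time the request is handled.
events = [
    {"route": "/checkout", "status": 200, "app_version": "4.2.0", "payment": "card",   "region": "eu", "flag_new_cart": False},
    {"route": "/checkout", "status": 500, "app_version": "4.2.1", "payment": "wallet", "region": "ca", "flag_new_cart": True},
    {"route": "/checkout", "status": 500, "app_version": "4.2.1", "payment": "wallet", "region": "ca", "flag_new_cart": True},
    {"route": "/checkout", "status": 200, "app_version": "4.2.1", "payment": "card",   "region": "ca", "flag_new_cart": True},
    {"route": "/checkout", "status": 200, "app_version": "4.2.1", "payment": "wallet", "region": "eu", "flag_new_cart": False},
    {"route": "/checkout", "status": 500, "app_version": "4.2.0", "payment": "card",   "region": "eu", "flag_new_cart": False},
]

def slice_by(events, *fields, where=lambda e: True):
    """Count matching events, grouped by any combination of fields."""
    return Counter(tuple(e[f] for f in fields) for e in events if where(e))

# A question nobody wrote a dashboard panel for:
# which version / payment type / region combination is failing most?
failures = slice_by(
    events, "app_version", "payment", "region",
    where=lambda e: e["status"] >= 500,
)
worst_combo, worst_count = failures.most_common(1)[0]
print(worst_combo, worst_count)
```

The point of the sketch is that the grouping fields are chosen at query time, not at instrumentation time. That only works if the fields were recorded on every event in the first place.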
Tracing is the part that changed my daily work the most. A trace follows one request across every service it touches, with timing for each hop. The guessing meeting dies instantly. You do not debate whether the database is slow. You open the trace and look. The bottleneck has a name, a duration, and a line of code attached. After years of debugging by archaeology, grepping logs across machines and correlating timestamps by hand, this still feels a little like cheating.
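If you have never seen a trace from the inside, the idea fits in a few lines. This is a hand-rolled sketch using only the standard library, not OpenTelemetry or any vendor SDK; the `span` helper, the span names, and the `time.sleep` stand-ins are all assumptions for illustration. Real tracing also records parent and child relationships and carries the trace id across service boundaries, which this leaves out.

```python
import time
from contextlib import contextmanager

spans = []  # finished spans, appended in completion order

@contextmanager
def span(name, trace_id):
    """Time the wrapped block and record it under the request's trace id."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

def handle_checkout(trace_id):
    # One request, several hops, each timed separately.
    with span("checkout", trace_id):
        with span("load_cart", trace_id):
            time.sleep(0.01)  # stand-in for a fast cache lookup
        with span("charge_card", trace_id):
            time.sleep(0.05)  # stand-in for a slow payment call

handle_checkout("req-123")

# No debate needed: look at the hops and pick the slowest one.
hops = [s for s in spans if s["name"] != "checkout"]
slowest = max(hops, key=lambda s: s["duration_ms"])
print(slowest["name"], round(slowest["duration_ms"]), "ms")
```

In practice you would reach for an existing tracing library rather than maintain something like this, but the shape of the data is the same: a name, a duration, and an id that ties the hops of one request together.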
And it honestly does not matter how you get there. A vendor platform like Datadog or New Relic, an error tracker like Sentry, an open stack like Grafana and Prometheus, OpenTelemetry wiring it all together, or even a few structured logs and counters you emit straight from your own code. The tool is a detail, and the right one depends on your budget and your team. What is not optional is the data. A cheap setup that gives you real, queryable data about your system beats an expensive one that nobody looks at. The goal is not to own a famous tool. The goal is that the system can tell you what it is doing.
Downtime is revenue with a minus sign
Now the money part, because that is how this investment gets approved.
For any business that transacts online, downtime is not an engineering metric. It is revenue with a minus sign. Take your annual online revenue, divide by the minutes in a year, and you have a price per minute of outage. For many companies that number makes people quiet. And full outages are the cheap case, because at least you notice them. The expensive case is the gray failure: checkout works but takes 9 seconds, search returns results but bad ones, five percent of requests fail and retry. Users do not file a ticket. They leave. That loss never appears in an incident report; it appears months later as a soft conversion number that gets blamed on marketing.
Observability shortens the two intervals that decide the bill: time to detect and time to resolve. If good instrumentation cuts an incident from three hours to thirty minutes, you can multiply the saved minutes by the revenue per minute and get the return on the tooling investment in one line. Few engineering purchases can show their value that directly.
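The arithmetic really is that short. Here is a sketch with made-up numbers: the annual revenue figure is hypothetical and chosen so that one minute comes out to a round 100, and the function names are mine. It assumes revenue is spread evenly over the year and that an outage minute loses all of it, which is a simplification. Peak hours cost more, and gray failures lose only a fraction.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def revenue_per_minute(annual_online_revenue):
    """Price of one minute of full outage, assuming evenly spread revenue."""
    return annual_online_revenue / MINUTES_PER_YEAR

def incident_savings(annual_online_revenue, minutes_before, minutes_after):
    """Revenue kept when an incident shrinks from minutes_before to minutes_after."""
    saved_minutes = minutes_before - minutes_after
    return saved_minutes * revenue_per_minute(annual_online_revenue)

annual_revenue = 52_560_000  # hypothetical figure, any currency

per_minute = revenue_per_minute(annual_revenue)
# The three-hours-to-thirty-minutes incident from the text.
saved = incident_savings(annual_revenue, minutes_before=180, minutes_after=30)

print(f"one minute of outage: {per_minute:,.0f}")
print(f"saved on one incident: {saved:,.0f}")
```

Multiply the per-incident saving by the number of comparable incidents you have in a year and set it against the tooling bill. That is the whole business case.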
There is a team health line in this budget too. On call without observability is a punishment: every page is a haunted house you enter with a flashlight. On call with good traces and dashboards is almost civilized. Engineers do not quit because of incidents. They quit because of helplessness during incidents. I have watched both kinds of on-call rotation from inside, and they produce very different resignation rates, the same way good and bad developer experience does. Here in Calgary the on-call alert buzzing my phone at 3 a.m. in January feels even colder, trust me, so the least we can do is make the answer findable fast.
The part nobody sells: observability is product data
Here is what surprised me most as I got more senior. The same instrumentation that debugs incidents answers product questions, often better than the official analytics do.
Which features are actually used, and by whom? The traces know. Where in the flow do users suffer, retry, abandon? The latency data knows, sliced by endpoint and user segment. Is the slow page slow for everyone or only for the big customers we care about most? One query. I have seen feature roadmap debates, long ones, full of opinions, get settled in five minutes by someone pulling up usage data from the observability stack. We had been arguing about a feature that two percent of users touched.
This reframes the whole investment. Observability is not an engineering cost center that buys insurance against bad nights. It is a shared measurement layer that engineering, product, and support all drink from. When the discussion is framed that way, the budget conversation changes completely. Martin Fowler has long argued that good architecture is what keeps the cost of future change low, and I see observability as the same idea applied to the running system instead of the source code: it keeps the cost of answering new questions low.
If you are starting from zero, my honest advice is small: pick your one most revenue-critical user flow, instrument it end to end with traces and a couple of SLOs, and run with that for a quarter. Do not boil the ocean with a company-wide observability program on day one. One flow, fully visible, will produce enough “how did we live without this” moments to fund the rest.
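For the "couple of SLOs" part, the first one can be as simple as a success rate target on that one flow, tracked as an error budget. A minimal sketch, where the 99.9 percent target, the request counts, and the `error_budget` function are all assumptions for illustration:

```python
def error_budget(slo_target, total_requests, failed_requests):
    """How much of the allowed unreliability has been used so far.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining": allowed_failures - failed_requests,
        "consumed_fraction": failed_requests / allowed_failures,
    }

# Hypothetical quarter for the checkout flow: one million requests, 400 failed.
budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)

print(f"allowed failures: {budget['allowed_failures']:.0f}")
print(f"budget consumed:  {budget['consumed_fraction']:.0%}")
```

A negative `remaining` value means the flow was less reliable than you agreed it should be, which is the signal to slow down feature work on it and fix reliability first.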
You cannot fix what you cannot see. But the deeper truth is worse: you cannot even prioritize what you cannot see. Without data, the loudest opinion wins, and loud is not a strategy.
Make the system tell you the truth. Everything else gets easier after that.
Pax et bonum.