The stakes of a production incident in business-critical software are different from the stakes in consumer software. A slow-loading social media feed is annoying. A point-of-sale (POS) system that cannot process transactions during the lunch rush is a direct revenue loss for the business running it.
This reality shapes how Holixora thinks about observability in its products.
The Three Pillars in Practice
Observability in modern software is commonly described through three pillars: logs, metrics, and traces. Each answers a different question.
Logs tell you what happened. Metrics tell you the state of the system over time. Traces tell you where time was spent during a specific request. Together, they give you the ability to understand a production incident without needing to reproduce it in a development environment.
For Mercora and Hanoman, these distinctions are not theoretical. A Mercora transaction that fails needs a log that captures exactly which step failed, with what inputs, and what the error state was. A Hanoman booking sync that took 10 seconds instead of 1 second needs a trace that shows where the time went.
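To make the log and trace requirements concrete, here is a minimal sketch in Python using only the standard library. It is not Holixora's actual implementation: the logger name, the `failure_record`, `run_step`, and `span` helpers, and the field names are all hypothetical, chosen only to show a log record that captures the failed step, its inputs, and the error state, alongside a simple per-stage timer of the kind a trace is built from.

```python
import json
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name, used only for illustration.
logger = logging.getLogger("mercora.transactions")


def failure_record(step, inputs, error):
    """Build a structured record: which step failed, with what inputs, and the error state."""
    return {
        "event": "transaction_step_failed",
        "step": step,
        "inputs": inputs,  # assumption: sensitive fields were redacted by the caller
        "error_type": type(error).__name__,
        "error_message": str(error),
    }


def run_step(step, func, inputs):
    """Run one transaction step; emit a structured log line if it raises, then re-raise."""
    try:
        return func(**inputs)
    except Exception as exc:
        logger.error(json.dumps(failure_record(step, inputs, exc)))
        raise


@contextmanager
def span(name, timings):
    """Store the elapsed seconds of a named stage in `timings`, like a single trace span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start
```

Wrapping each stage of a slow sync in `with span("stage_name", timings):` leaves a dictionary of per-stage durations, which is enough to see where a 10-second request spent its time. A production system would more likely use a dedicated tracing library than a hand-rolled timer.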
Alerting That Matters
The failure mode of naive monitoring is alert fatigue. If every minor anomaly triggers a notification, the notifications lose meaning and real problems get ignored because engineers have learned to tune out the noise.
The alerting we have built focuses on business-impact signals. Transaction failure rates above a defined threshold. Sync delays beyond the SLA. Authentication errors that suggest a credential problem. These are the things that affect the businesses using the software, so they are the things that need immediate attention.
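The three signals named above can be expressed as a small set of rules. The sketch below is illustrative only: the `Signals` type, the `business_alerts` function, and all three threshold values are assumptions made for the example, not Holixora's real configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signals:
    transaction_failure_rate: float  # fraction of failed transactions, 0.0 to 1.0
    sync_delay_seconds: float        # current delay of the booking sync
    auth_error_count: int            # authentication errors in the current window


# Hypothetical thresholds; real values would come from the SLA and historical baselines.
MAX_FAILURE_RATE = 0.02
SYNC_SLA_SECONDS = 5.0
MAX_AUTH_ERRORS = 20


def business_alerts(signals):
    """Return the names of the business-impact signals that are over their limits."""
    alerts = []
    if signals.transaction_failure_rate > MAX_FAILURE_RATE:
        alerts.append("transaction_failure_rate")
    if signals.sync_delay_seconds > SYNC_SLA_SECONDS:
        alerts.append("sync_delay")
    if signals.auth_error_count > MAX_AUTH_ERRORS:
        alerts.append("auth_errors")
    return alerts
```

Keeping the rule list this short is the point: anything that does not map to an impact on the businesses using the software stays on a dashboard rather than in a page.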
On-Call With Context
When an on-call engineer gets paged, the alert includes context: what triggered it, what the current state of the affected system is, and links to the relevant dashboards and recent logs. The goal is to reduce time-to-understanding so that time-to-resolution is also shorter.
An alert that says "error rate elevated" and nothing else is nearly useless. An alert that says "Mercora transaction failure rate 3.2x normal for the past 8 minutes, concentrated in payment processing module, affecting 12 active tenants" gives the on-call engineer a head start.
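As a rough sketch of how such a message might be assembled, the Python below renders an alert from structured fields. The `AlertContext` type, its field names, and the `render_alert` function are hypothetical, and the link in the usage example is a placeholder rather than a real dashboard URL.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AlertContext:
    product: str
    signal: str
    ratio_to_normal: float   # current value divided by the normal baseline
    duration_minutes: int
    module: str
    affected_tenants: int
    links: List[str] = field(default_factory=list)  # dashboards and recent logs


def render_alert(alert):
    """Render a one-line summary, followed by one line per dashboard or log link."""
    summary = (
        f"{alert.product} {alert.signal} {alert.ratio_to_normal:.1f}x normal "
        f"for the past {alert.duration_minutes} minutes, "
        f"concentrated in {alert.module}, "
        f"affecting {alert.affected_tenants} active tenants"
    )
    return "\n".join([summary] + [f"- {url}" for url in alert.links])


if __name__ == "__main__":
    print(render_alert(AlertContext(
        product="Mercora",
        signal="transaction failure rate",
        ratio_to_normal=3.2,
        duration_minutes=8,
        module="payment processing module",
        affected_tenants=12,
        links=["https://example.com/dashboards/placeholder"],  # placeholder link
    )))
```

Building the message from fields rather than free text means every page carries the same pieces of context in the same order, so the on-call engineer knows where to look.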