The Three Pillars of Observability: Logs, Metrics, Traces
When a user reports that "the app is slow," how do you find the root cause? In a monolith, you might check a single log file and a few database queries. In a distributed system with dozens of microservices, the request might touch 10 services, 3 databases, and 2 message queues before returning a response.
Observability is the ability to understand what's happening inside your system by examining its external outputs. The three pillars — logs, metrics, and traces — each provide a different lens into system behavior. Together, they give you the complete picture.
Pillar 1: Logs — What Happened?
Logs are timestamped, immutable records of discrete events. They're the most familiar observability signal — developers have been writing console.log and logger.info since the beginning of software.
What Makes a Good Log
- Structured format — use JSON instead of plain text. Structured logs are searchable and parseable.
- Correlation IDs — include a request ID that follows the request across all services
- Context — include user ID, service name, endpoint, and relevant business data
- Appropriate levels — use ERROR for failures, WARN for degradation, INFO for business events, DEBUG for development
```jsonc
// Bad: unstructured, no context
"Payment failed"

// Good: structured, contextual
{
  "timestamp": "2026-02-08T10:23:45Z",
  "level": "ERROR",
  "service": "payment-service",
  "traceId": "abc-123-def",
  "userId": "user-456",
  "message": "Payment processing failed",
  "error": "Gateway timeout after 3000ms",
  "orderId": "order-789",
  "amount": 99.99
}
```
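A logger that emits lines like the one above can be sketched in a few lines of TypeScript. This is a minimal illustration, not a production logger (in practice you'd reach for a library such as pino or winston); the function and field names are hypothetical.

```typescript
// Minimal structured-logging sketch. The traceId is the correlation ID:
// the same value is logged by every service the request touches.
type Level = "DEBUG" | "INFO" | "WARN" | "ERROR";

function logEvent(
  level: Level,
  service: string,
  traceId: string,
  message: string,
  context: Record<string, unknown> = {}
): string {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    service,
    traceId,
    message,
    ...context, // business data: userId, orderId, amount, ...
  };
  const json = JSON.stringify(entry);
  console.log(json); // one JSON object per line, ready for aggregation
  return json;
}

// Reproduces the "good" log line from above
const line = logEvent("ERROR", "payment-service", "abc-123-def",
  "Payment processing failed",
  { userId: "user-456", orderId: "order-789", amount: 99.99 });
```

Because every entry is a JSON object on a single line, log aggregators can index each field and you can query by service, level, or trace ID instead of grepping free text.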
Log Aggregation Tools
- ELK Stack (Elasticsearch, Logstash, Kibana) — the classic open-source solution
- Loki + Grafana — lightweight, label-based log aggregation
- AWS CloudWatch Logs — native AWS logging
- Datadog Logs — commercial, full-featured
When Logs Fall Short
Logs tell you what happened but not how often or how fast. Searching through millions of log lines to answer "what's our p99 latency?" is inefficient. That's where metrics come in.
Pillar 2: Metrics — How Is It Performing?
Metrics are numerical measurements collected at regular intervals. Unlike logs (which capture individual events), metrics aggregate data into time-series that show trends, patterns, and anomalies.
Types of Metrics
- Counters — values that only go up (total requests, total errors)
- Gauges — values that go up and down (current CPU usage, active connections)
- Histograms — distribution of values (request latency percentiles)
- Summaries — pre-calculated percentiles over a sliding window
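The difference between these types is easiest to see in code. Below is a toy sketch of counters, gauges, and histograms (illustrative only; in production you'd use a client library such as prom-client, which also handles buckets and exposition):

```typescript
// Toy metric types. Real histograms use fixed buckets rather than storing
// every sample; this version keeps samples to make percentiles obvious.
class Counter {
  value = 0;
  inc(by = 1) { this.value += by; } // counters only ever go up
}

class Gauge {
  value = 0;
  set(v: number) { this.value = v; } // gauges go up and down
}

class Histogram {
  private samples: number[] = [];
  observe(v: number) { this.samples.push(v); }
  // Percentile over all recorded samples. A summary would instead keep
  // pre-calculated quantiles over a sliding time window.
  percentile(p: number): number {
    const sorted = [...this.samples].sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
    return sorted[idx];
  }
}

const requestsTotal = new Counter();
const activeConnections = new Gauge();
const requestLatencyMs = new Histogram();

activeConnections.set(12);
for (const ms of [40, 45, 50, 55, 400]) {
  requestsTotal.inc();
  requestLatencyMs.observe(ms);
}
```

Note how one slow outlier (400ms) dominates the p99 while leaving the median untouched; that is exactly why latency is tracked as a histogram rather than an average.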
Metrics Tooling
- Prometheus — the standard for cloud-native metrics collection
- Grafana — visualization and dashboarding
- AWS CloudWatch Metrics — native AWS monitoring
- Datadog / New Relic — commercial APM solutions
When Metrics Fall Short
Metrics tell you something is wrong (latency is high, error rate is up) but not where in the request path the problem is. A request touching 8 services might be slow, but which service is the bottleneck? That's where traces come in.
Pillar 3: Traces — Where Is the Bottleneck?
Distributed traces follow a single request as it flows through multiple services. Each service adds a "span" to the trace, recording when it started, how long it took, and what it did.
Key Concepts
- Trace — the entire journey of a request through the system
- Span — a single operation within a trace (e.g., "query database", "call payment API")
- Trace ID — a unique identifier that links all spans in a single request
- Parent-child relationships — spans are nested to show the call hierarchy
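These concepts can be modeled with a simple data structure. The span shape below is a hypothetical simplification of what tracing systems record; the "self time" calculation (a span's duration minus its children's) is one common way to pinpoint where time is actually spent:

```typescript
// Minimal trace model: spans share a traceId and reference their parent
// to form the call hierarchy. Timestamps are simplified to milliseconds.
interface Span {
  traceId: string;
  spanId: string;
  parentId?: string; // absent on the root span
  name: string;
  startMs: number;
  endMs: number;
}

const spans: Span[] = [
  { traceId: "abc-123", spanId: "1", name: "POST /checkout", startMs: 0, endMs: 480 },
  { traceId: "abc-123", spanId: "2", parentId: "1", name: "call payment API", startMs: 20, endMs: 430 },
  { traceId: "abc-123", spanId: "3", parentId: "2", name: "query database", startMs: 30, endMs: 420 },
];

// Self time = own duration minus time spent in child spans.
function selfTimeMs(span: Span, all: Span[]): number {
  const childTime = all
    .filter(s => s.parentId === span.spanId)
    .reduce((sum, s) => sum + (s.endMs - s.startMs), 0);
  return (span.endMs - span.startMs) - childTime;
}

// The bottleneck is the span with the largest self time.
const bottleneck = spans.reduce((a, b) =>
  selfTimeMs(b, spans) > selfTimeMs(a, spans) ? b : a);
```

Here the root span took 480ms, but almost all of that is inherited from its children; the database query's self time of 390ms identifies it as the real bottleneck.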
Tracing Tools
- Jaeger — open-source, originally from Uber
- Zipkin — open-source, originally from Twitter
- AWS X-Ray — native AWS tracing
- OpenTelemetry — vendor-neutral standard for instrumentation (the future of observability)
How the Three Pillars Work Together
The real power of observability comes from correlating all three signals. Here's a typical debugging workflow:
- Metrics alert fires — "p99 latency for /checkout exceeded 500ms"
- Check dashboards — metrics show latency spike started at 2:15 PM, correlates with traffic increase
- Find slow traces — filter traces for /checkout requests over 500ms
- Identify bottleneck — trace shows payment-service span taking 400ms (normally 50ms)
- Check logs — payment-service logs show "connection pool exhausted" errors starting at 2:15 PM
- Root cause found — database connection pool was too small for the traffic spike
Metrics tell you something is wrong. Traces tell you where. Logs tell you why. You need all three.
OpenTelemetry: The Unified Standard
OpenTelemetry (OTel) is rapidly becoming the industry standard for observability instrumentation. It provides a single set of APIs, SDKs, and tools to generate logs, metrics, and traces — regardless of which backend you use.
Benefits of OpenTelemetry:
- Vendor-neutral — instrument once, send data to any backend
- Auto-instrumentation — automatic tracing for popular frameworks (Spring Boot, Express, Django)
- Correlation — automatically links logs, metrics, and traces with shared context
- Wide adoption — supported by all major observability vendors
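In a typical OTel setup, services send telemetry to a Collector, which forwards it to whatever backend you choose. A minimal Collector pipeline looks roughly like this (endpoint and exporter names are placeholders; adjust to your backend):

```yaml
# Sketch of an OpenTelemetry Collector config: receive OTLP from
# instrumented services, batch, and export traces to a Jaeger backend.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317   # placeholder address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```

Swapping backends later means changing the exporter section, not re-instrumenting your services — that is the vendor-neutrality in practice.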
Getting Started: Practical Recommendations
- Start with metrics — set up Prometheus + Grafana. Monitor the four golden signals (latency, traffic, errors, saturation) for every service.
- Add structured logging — switch from plain text to JSON logs. Include correlation IDs in every log line.
- Introduce tracing — use OpenTelemetry to instrument your services. Start with the critical user-facing paths.
- Correlate everything — ensure trace IDs appear in logs and metrics labels so you can jump between signals.
- Set up alerts — alert on symptoms (high latency, error rate) not causes (high CPU). Let the observability stack help you find the cause.
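"Alert on symptoms" in practice often means a Prometheus alerting rule on a user-facing metric. The rule below is a sketch matching the /checkout scenario from earlier; the metric name, labels, and thresholds are examples, not a prescription:

```yaml
# Example Prometheus alerting rule: fire when p99 checkout latency
# stays above 500ms for 5 minutes. Metric/label names are illustrative.
groups:
  - name: checkout-latency
    rules:
      - alert: CheckoutHighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{route="/checkout"}[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "p99 latency for /checkout above 500ms for 5 minutes"
```

Notice the rule says nothing about CPU or connection pools; those are causes, and the traces and logs are what lead you to them once the symptom alert fires.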
Conclusion
Observability isn't about having more data — it's about having the right data to answer questions you haven't thought of yet. Logs, metrics, and traces each provide a unique perspective on system behavior. Together, they give you the ability to understand, debug, and optimize complex distributed systems.
Start simple, instrument incrementally, and focus on the signals that help you answer real questions about your system's health and performance.
At TechTrailCamp, observability is a core part of our AWS + DevOps and Microservices tracks. You'll learn to build observable systems from the ground up through hands-on, 1:1 mentoring.
Want to build observable systems?
Join TechTrailCamp's 1:1 training and learn production-grade observability practices.