The Three Pillars of Observability: Logs, Metrics, Traces
When a user reports that "the app is slow," how do you find the root cause? In a monolith, you might check a single log file and a few database queries. In a distributed system with dozens of microservices, the request might touch 10 services, 3 databases, and 2 message queues before returning a response.
Observability is the ability to understand what's happening inside your system by examining its external outputs. The three pillars — logs, metrics, and traces — each provide a different lens into system behavior. Together, they give you the complete picture.
Pillar 1: Logs — What Happened?
Logs are timestamped, immutable records of discrete events. They're the most familiar observability signal — developers have been writing console.log and logger.info since the beginning of software.
What Makes a Good Log
- Structured format — use JSON instead of plain text. Structured logs are searchable and parseable.
- Correlation IDs — include a request ID that follows the request across all services
- Context — include user ID, service name, endpoint, and relevant business data
- Appropriate levels — use ERROR for failures, WARN for degradation, INFO for business events, DEBUG for development
```jsonc
// Bad: unstructured, no context
"Payment failed"

// Good: structured, contextual
{
  "timestamp": "2026-02-08T10:23:45Z",
  "level": "ERROR",
  "service": "payment-service",
  "traceId": "abc-123-def",
  "userId": "user-456",
  "message": "Payment processing failed",
  "error": "Gateway timeout after 3000ms",
  "orderId": "order-789",
  "amount": 99.99
}
```
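A logger that emits lines like the one above can be sketched in a few lines of TypeScript. This is a minimal illustration, not a production logger (in practice you'd reach for a library such as pino or winston); the function and field names are hypothetical.

```typescript
// Minimal structured-logging sketch. The traceId is the correlation ID:
// the same value is logged by every service the request touches.
type Level = "DEBUG" | "INFO" | "WARN" | "ERROR";

function logEvent(
  level: Level,
  service: string,
  traceId: string,
  message: string,
  context: Record<string, unknown> = {}
): string {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    service,
    traceId,
    message,
    ...context, // business data: userId, orderId, amount, ...
  };
  const json = JSON.stringify(entry);
  console.log(json); // one JSON object per line, ready for aggregation
  return json;
}

// Reproduces the "good" log line from above
const line = logEvent("ERROR", "payment-service", "abc-123-def",
  "Payment processing failed",
  { userId: "user-456", orderId: "order-789", amount: 99.99 });
```

Because every entry is a JSON object on a single line, log aggregators can index each field and you can query by service, level, or trace ID instead of grepping free text.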
Log Aggregation Tools
- ELK Stack (Elasticsearch, Logstash, Kibana) — the classic open-source solution
- Loki + Grafana — lightweight, label-based log aggregation
- AWS CloudWatch Logs — native AWS logging
- Datadog Logs — commercial, full-featured
When Logs Fall Short
Logs tell you what happened but not how often or how fast. Searching through millions of log lines to answer "what's our p99 latency?" is inefficient. That's where metrics come in.
Pillar 2: Metrics — How Is It Performing?
Metrics are numerical measurements collected at regular intervals. Unlike logs (which capture individual events), metrics aggregate data into time-series that show trends, patterns, and anomalies.
Types of Metrics
- Counters — values that only go up (total requests, total errors)
- Gauges — values that go up and down (current CPU usage, active connections)
- Histograms — distribution of values (request latency percentiles)
- Summaries — pre-calculated percentiles over a sliding window
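The difference between these types is easiest to see in code. Below is a toy sketch of counters, gauges, and histograms (illustrative only; in production you'd use a client library such as prom-client, which also handles buckets and exposition):

```typescript
// Toy metric types. Real histograms use fixed buckets rather than storing
// every sample; this version keeps samples to make percentiles obvious.
class Counter {
  value = 0;
  inc(by = 1) { this.value += by; } // counters only ever go up
}

class Gauge {
  value = 0;
  set(v: number) { this.value = v; } // gauges go up and down
}

class Histogram {
  private samples: number[] = [];
  observe(v: number) { this.samples.push(v); }
  // Percentile over all recorded samples. A summary would instead keep
  // pre-calculated quantiles over a sliding time window.
  percentile(p: number): number {
    const sorted = [...this.samples].sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
    return sorted[idx];
  }
}

const requestsTotal = new Counter();
const activeConnections = new Gauge();
const requestLatencyMs = new Histogram();

activeConnections.set(12);
for (const ms of [40, 45, 50, 55, 400]) {
  requestsTotal.inc();
  requestLatencyMs.observe(ms);
}
```

Note how one slow outlier (400ms) dominates the p99 while leaving the median untouched; that is exactly why latency is tracked as a histogram rather than an average.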
Metrics Tooling
- Prometheus — the standard for cloud-native metrics collection
- Grafana — visualization and dashboarding
- AWS CloudWatch Metrics — native AWS monitoring
- Datadog / New Relic — commercial APM solutions
When Metrics Fall Short
Metrics tell you something is wrong (latency is high, error rate is up) but not where in the request path the problem is. A request touching 8 services might be slow, but which service is the bottleneck? That's where traces come in.
Pillar 3: Traces — Where Is the Bottleneck?
Distributed traces follow a single request as it flows through multiple services. Each service adds a "span" to the trace, recording when it started, how long it took, and what it did.
Key Concepts
- Trace — the entire journey of a request through the system
- Span — a single operation within a trace (e.g., "query database", "call payment API")
- Trace ID — a unique identifier that links all spans in a single request
- Parent-child relationships — spans are nested to show the call hierarchy
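These concepts can be modeled with a simple data structure. The span shape below is a hypothetical simplification of what tracing systems record; the "self time" calculation (a span's duration minus its children's) is one common way to pinpoint where time is actually spent:

```typescript
// Minimal trace model: spans share a traceId and reference their parent
// to form the call hierarchy. Timestamps are simplified to milliseconds.
interface Span {
  traceId: string;
  spanId: string;
  parentId?: string; // absent on the root span
  name: string;
  startMs: number;
  endMs: number;
}

const spans: Span[] = [
  { traceId: "abc-123", spanId: "1", name: "POST /checkout", startMs: 0, endMs: 480 },
  { traceId: "abc-123", spanId: "2", parentId: "1", name: "call payment API", startMs: 20, endMs: 430 },
  { traceId: "abc-123", spanId: "3", parentId: "2", name: "query database", startMs: 30, endMs: 420 },
];

// Self time = own duration minus time spent in child spans.
function selfTimeMs(span: Span, all: Span[]): number {
  const childTime = all
    .filter(s => s.parentId === span.spanId)
    .reduce((sum, s) => sum + (s.endMs - s.startMs), 0);
  return (span.endMs - span.startMs) - childTime;
}

// The bottleneck is the span with the largest self time.
const bottleneck = spans.reduce((a, b) =>
  selfTimeMs(b, spans) > selfTimeMs(a, spans) ? b : a);
```

Here the root span took 480ms, but almost all of that is inherited from its children; the database query's self time of 390ms identifies it as the real bottleneck.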
Tracing Tools
- Jaeger — open-source, originally from Uber
- Zipkin — open-source, originally from Twitter
- AWS X-Ray — native AWS tracing
- OpenTelemetry — vendor-neutral standard for instrumentation (the future of observability)
How the Three Pillars Work Together
The real power of observability comes from correlating all three signals. Here's a typical debugging workflow:
- Metrics alert fires — "p99 latency for /checkout exceeded 500ms"
- Check dashboards — metrics show latency spike started at 2:15 PM, correlates with traffic increase
- Find slow traces — filter traces for /checkout requests over 500ms
- Identify bottleneck — trace shows payment-service span taking 400ms (normally 50ms)
- Check logs — payment-service logs show "connection pool exhausted" errors starting at 2:15 PM
- Root cause found — database connection pool was too small for the traffic spike
Metrics tell you something is wrong. Traces tell you where. Logs tell you why. You need all three.
OpenTelemetry: The Unified Standard
OpenTelemetry (OTel) is rapidly becoming the industry standard for observability instrumentation. It provides a single set of APIs, SDKs, and tools to generate logs, metrics, and traces — regardless of which backend you use.
Benefits of OpenTelemetry:
- Vendor-neutral — instrument once, send data to any backend
- Auto-instrumentation — automatic tracing for popular frameworks (Spring Boot, Express, Django)
- Correlation — automatically links logs, metrics, and traces with shared context
- Wide adoption — supported by all major observability vendors
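In a typical OTel setup, services send telemetry to a Collector, which forwards it to whatever backend you choose. A minimal Collector pipeline looks roughly like this (endpoint and exporter names are placeholders; adjust to your backend):

```yaml
# Sketch of an OpenTelemetry Collector config: receive OTLP from
# instrumented services, batch, and export traces to a Jaeger backend.
receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317   # placeholder address

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/jaeger]
```

Swapping backends later means changing the exporter section, not re-instrumenting your services — that is the vendor-neutrality in practice.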
Getting Started: Practical Recommendations
- Start with metrics — set up Prometheus + Grafana. Monitor the four golden signals (latency, traffic, errors, saturation) for every service.
- Add structured logging — switch from plain text to JSON logs. Include correlation IDs in every log line.
- Introduce tracing — use OpenTelemetry to instrument your services. Start with the critical user-facing paths.
- Correlate everything — ensure trace IDs appear in logs and metrics labels so you can jump between signals.
- Set up alerts — alert on symptoms (high latency, error rate) not causes (high CPU). Let the observability stack help you find the cause.
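"Alert on symptoms" in practice often means a Prometheus alerting rule on a user-facing metric. The rule below is a sketch matching the /checkout scenario from earlier; the metric name, labels, and thresholds are examples, not a prescription:

```yaml
# Example Prometheus alerting rule: fire when p99 checkout latency
# stays above 500ms for 5 minutes. Metric/label names are illustrative.
groups:
  - name: checkout-latency
    rules:
      - alert: CheckoutHighP99Latency
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{route="/checkout"}[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "p99 latency for /checkout above 500ms for 5 minutes"
```

Notice the rule says nothing about CPU or connection pools; those are causes, and the traces and logs are what lead you to them once the symptom alert fires.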
Conclusion
Observability isn't about having more data — it's about having the right data to answer questions you haven't thought of yet. Logs, metrics, and traces each provide a unique perspective on system behavior. Together, they give you the ability to understand, debug, and optimize complex distributed systems.
Start simple, instrument incrementally, and focus on the signals that help you answer real questions about your system's health and performance.
At TechTrailCamp, observability is a core part of our AWS + DevOps and Microservices tracks. You'll learn to build observable systems from the ground up through hands-on, 1:1 mentoring.
Want to build observable systems?
Join TechTrailCamp's 1:1 training and learn production-grade observability practices.