What is Observability and Why It’s More Than Just Monitoring

Eglis Alvarez
Aug 31

In a world where applications are increasingly distributed, dynamic, and business-critical, observability has become an essential component of any modern technology strategy.

And yet, even today, many organizations are stuck with rudimentary monitoring practices such as checking CPU usage, disk space, or whether a process is “up.” This is far from enough, and nowhere near true observability.

The issue with this limited approach is that it provides only a superficial and fragmented view of the system. When an outage, performance degradation, or intermittent error occurs, teams relying on such basic monitoring end up fighting fires blindly. Without correlated data, diagnostics are slow, expensive, and often inaccurate.

Observability exists to close this gap. It not only detects that something is wrong, but also helps you understand the root cause by correlating metrics, logs, and traces. It’s the difference between receiving an alert that “the API is slow” and being able to demonstrate that the issue is in a specific microservice, caused by slow queries to the database.

Monitoring vs. Observability

Monitoring shows the tip of the iceberg (symptoms). Observability reveals what lies beneath (root causes).

Monitoring tries to answer one narrow question: “Is my system working as expected?”

Typical examples from “basic” setups:

- Is CPU above 80%?
- Is the API returning 200 OK?
- Does the server have enough disk space?

Prometheus alert example for CPU:

- alert: HighCPUUsage
  expr: 1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) > 0.80
  for: 5m
  labels:
    severity: warning
  annotations:
    description: "CPU usage > 80% on node {{ $labels.instance }}"

This detects the symptom, but not the cause.
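To make the `for: 5m` clause concrete, here is a minimal sketch of how a threshold alert moves from pending to firing. It is a simplified model, not Prometheus's actual rule evaluator; the function name `alert_states` and the sample values are hypothetical.

```python
from typing import Iterable, List, Tuple


def alert_states(
    samples: Iterable[Tuple[float, float]],
    threshold: float,
    hold_seconds: float,
) -> List[str]:
    """Classify each (timestamp_seconds, value) sample as inactive, pending, or firing.

    The alert fires only once the value has stayed above the threshold
    continuously for at least `hold_seconds` (the role of `for:` in the rule).
    """
    states: List[str] = []
    breach_start = None  # timestamp at which the current breach began
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            held_long_enough = ts - breach_start >= hold_seconds
            states.append("firing" if held_long_enough else "pending")
        else:
            breach_start = None  # any dip below the threshold resets the timer
            states.append("inactive")
    return states


# Hypothetical CPU utilisation sampled once per minute.
values = [0.5, 0.9, 0.9, 0.7, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
samples = [(i * 60, v) for i, v in enumerate(values)]
states = alert_states(samples, threshold=0.80, hold_seconds=300)
print(states)
```

The short spike at minutes 1 and 2 never fires because the dip at minute 3 resets the timer. Either way, the alert only tells you that CPU is high, not why.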

Observability, on the other hand, answers a much deeper question: “Why is my system behaving this way?”

The Three Pillars of Observability

Observability stands on three complementary data sources:

1. Logs

Detailed event records.

Example: “Error: database connection refused”.
With Loki + Promtail, logs can be centralized and queried efficiently.

Promtail config:

scrape_configs:
  - job_name: varlogs
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          __path__: /var/log/*.log

Grafana query with LogQL:

{app="payments-api"} |= "timeout" | json | line_format "{{.method}} {{.path}} dur={{.duration}}ms"
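The `| json` stage in that query only works if the application writes its logs as JSON. As a rough sketch using only the standard library, an application could emit such lines as follows. The field names `method`, `path`, and `duration` mirror the query above; the `JsonFormatter` class and the logger setup are assumptions for illustration, not part of Loki or Promtail.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Fields passed via `extra=` become attributes on the record.
            "method": getattr(record, "method", None),
            "path": getattr(record, "path", None),
            "duration": getattr(record, "duration", None),
        }
        return json.dumps(payload)


def build_logger(name: str = "payments-api") -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate output through the root logger
    return logger


logger = build_logger()
logger.warning(
    "upstream timeout",
    extra={"method": "POST", "path": "/pay", "duration": 1200},
)
```

A line produced this way contains the word "timeout", so it passes the `|= "timeout"` filter, and its JSON keys become available to `line_format`.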

2. Metrics

Numerical values over time.

Example: memory usage, requests per second, average latency.
Prometheus is the de facto standard to collect them.

Python example with prometheus_client:

import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Counter for request volume and histogram for latency, both labelled by endpoint.
REQS = Counter('http_requests_total', 'Total HTTP requests', ['endpoint'])
LAT = Histogram('http_request_seconds', 'Latency', ['endpoint'])

if __name__ == "__main__":
    start_http_server(8000)  # serve /metrics on port 8000
    while True:
        # Simulate a request that takes 50-200 ms and record its duration.
        with LAT.labels("/pay").time():
            time.sleep(random.uniform(0.05, 0.20))
        REQS.labels("/pay").inc()

This exposes metrics at http://localhost:8000/metrics.
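A histogram does not store individual request durations. It stores cumulative counts per bucket, and percentiles such as p95 are estimated from those counts afterwards. The sketch below shows the idea with linear interpolation inside the matching bucket, which is the approach Prometheus's `histogram_quantile` takes. It is a simplified illustration: the function name `estimate_quantile` and the bucket counts are hypothetical.

```python
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` holds (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound, ending with an upper bound of float("inf").
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total  # how many observations lie at or below the quantile
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                # Cannot interpolate in an unbounded bucket: use the last finite bound.
                return lower_bound
            in_bucket = count - lower_count
            if in_bucket == 0:
                return lower_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts for the /pay endpoint.
pay_buckets = [(0.1, 10), (0.25, 60), (0.5, 90), (1.0, 98), (float("inf"), 100)]
p95 = estimate_quantile(0.95, pay_buckets)
print(f"estimated p95: {p95:.3f}s")
```

With these counts the estimate lands in the 0.5 s to 1.0 s bucket, at roughly 0.81 s. Since the result is an interpolation, its accuracy depends on how well the bucket boundaries match the latencies you care about.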

3. Traces

Follow the path of a request across multiple services.

With OpenTelemetry + Jaeger/Tempo, you can instrument and capture them.

Python example:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

# Register a tracer provider that prints finished spans to the console.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("payments")

# The inner span becomes a child of the outer one, so both share a trace ID.
with tracer.start_as_current_span("payment_transaction"):
    with tracer.start_as_current_span("db_query"):
        print("Database query...")
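For a trace to span several services, each outgoing request has to carry the trace identity with it. A common way to do this is the W3C Trace Context `traceparent` HTTP header, which OpenTelemetry propagates by default. The sketch below builds and parses that header with the standard library only, to show what travels between services. The helper names are hypothetical, and in practice the OpenTelemetry instrumentation libraries handle this for you.

```python
import secrets
from typing import Tuple


def new_traceparent() -> str:
    """Start a new trace: version 00, random trace and span IDs, sampled flag 01."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def parse_traceparent(header: str) -> Tuple[str, str, str]:
    """Split a traceparent header into (trace_id, parent_span_id, flags)."""
    parts = header.split("-")
    if len(parts) != 4:
        raise ValueError(f"malformed traceparent: {header!r}")
    _version, trace_id, parent_id, flags = parts
    if len(trace_id) != 32 or len(parent_id) != 16 or len(flags) != 2:
        raise ValueError(f"malformed traceparent: {header!r}")
    return trace_id, parent_id, flags


def child_traceparent(incoming: str) -> str:
    """Build the header for a downstream call: same trace ID, new span ID."""
    trace_id, _parent_id, flags = parse_traceparent(incoming)
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


# The API service receives a request and calls the database service.
incoming = new_traceparent()
outgoing = child_traceparent(incoming)
print(incoming)
print(outgoing)
```

Because the trace ID is preserved across the hop, a backend such as Jaeger or Tempo can stitch spans from different services into a single trace.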

Practical Example

Imagine you manage a payment system and a customer reports that their transactions are taking too long.

With basic monitoring, you’d only see that average latency is increasing.
With observability, you go deeper:
- Prometheus metrics show that the p95 latency for the /pay endpoint jumped from 200 ms to 1200 ms:
  histogram_quantile(0.95, sum by (le) (rate(http_request_seconds_bucket{endpoint="/pay"}[5m])))
- Logs in Loki reveal DB connection timeout errors.
- Traces in Jaeger/Tempo show that the slowdown is concentrated in the payment-db-service microservice.

Instead of assumptions, observability gives you clear, actionable answers.
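The correlation step in this scenario can be sketched in a few lines: take the spans from the slow traces, find the service where the time is spent, then pull that service's log lines for the same trace IDs. All data, class names, and field names below are hypothetical, and the sketch assumes each span's self time (duration excluding child spans) is already available.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Span:
    trace_id: str
    service: str
    self_ms: float  # time spent in this span excluding child spans (assumed precomputed)


@dataclass
class LogLine:
    trace_id: str
    service: str
    message: str


def slowest_service(spans: List[Span]) -> str:
    """Return the service that accounts for the most self time across the spans."""
    totals: Dict[str, float] = {}
    for span in spans:
        totals[span.service] = totals.get(span.service, 0.0) + span.self_ms
    return max(totals, key=lambda service: totals[service])


def related_logs(service: str, spans: List[Span], logs: List[LogLine]) -> List[str]:
    """Return messages logged by `service` within the traces covered by `spans`."""
    trace_ids = {span.trace_id for span in spans}
    return [
        log.message
        for log in logs
        if log.service == service and log.trace_id in trace_ids
    ]


# Hypothetical spans from two slow /pay requests.
slow_spans = [
    Span("t1", "payments-api", 40.0),
    Span("t1", "payment-db-service", 1100.0),
    Span("t2", "payments-api", 35.0),
    Span("t2", "payment-db-service", 1050.0),
]
logs = [
    LogLine("t1", "payments-api", "request received"),
    LogLine("t1", "payment-db-service", "DB connection timeout"),
    LogLine("t3", "payment-db-service", "query ok"),
]

culprit = slowest_service(slow_spans)
evidence = related_logs(culprit, slow_spans, logs)
print(culprit, evidence)
```

The shared trace ID is what makes this join possible, which is why including it in log lines is worth the small extra effort.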

Why Observability Matters Today

In the era of monolithic applications, checking CPU, memory, and a handful of logs might have been enough. But today’s systems are fundamentally different:

- Distributed microservices.
- Ephemeral containers.
- Cloud and third-party dependencies.

Failures are no longer confined to a single server: they may appear intermittently, only in specific regions, or vanish between restarts.

This is why observability is no longer optional. Organizations that still operate with rudimentary monitoring are at a serious disadvantage: they detect symptoms but cannot explain or resolve them quickly. Those that embrace observability, on the other hand, can reduce MTTR drastically and operate with greater confidence.

Conclusion

The key difference is simple:

Monitoring = detecting symptoms.
Observability = understanding causes.

Teams relying on outdated, basic monitoring are blind to root causes. Teams that invest in observability gain faster diagnostics, lower costs, and higher reliability.

Adopting observability means fewer blind firefights, a significant reduction in MTTR (Mean Time to Resolution), and stronger trust in your system’s stability.

In upcoming articles, we’ll explore how to implement observability with Prometheus, Grafana, and Loki, and how to spin up a complete stack with Docker Compose for local testing.