Build a Production-Grade Observability Stack with Prometheus, Grafana, Loki, and Jaeger
This guide covers the implementation of a complete observability stack for distributed systems using Prometheus for metrics collection, Grafana for visualization, Loki for log aggregation, and Jaeger for distributed tracing.
Architecture Overview
The stack operates on the Three Pillars of Observability: Metrics (Prometheus), Logs (Loki), and Tracing (Jaeger). Prometheus uses a pull-based model to scrape metrics from instrumented applications and exporters. Grafana queries Prometheus for time-series data, Loki for logs, and Jaeger for trace data, providing unified dashboards. Jaeger collects distributed traces via OpenTelemetry instrumentation, enabling request flow analysis across microservices.
Prometheus Setup
Installation and Configuration
Deploy Prometheus via Docker or Kubernetes. Create a prometheus.yml configuration file (note: the data directory is set with the --storage.tsdb.path command-line flag, not in this file):
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'docker-cluster'
    monitor: 'prometheus'
storage:
  tsdb:
    out_of_order_time_window: 30m
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'application'
    static_configs:
      - targets: ['app-service:8080']
    metrics_path: '/metrics'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
rule_files:
  - 'alerting_rules.yml'
Create alerting_rules.yml:
groups:
  - name: application_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"
      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        annotations:
          summary: "High P95 latency"
          description: "P95 latency is {{ $value }} seconds"
Run Prometheus with exemplar storage enabled:
docker run -d \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v $(pwd)/alerting_rules.yml:/etc/prometheus/alerting_rules.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus:latest \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --enable-feature=exemplar-storage
Overriding the image's command replaces its default flags, so --storage.tsdb.path is passed explicitly to keep data on the mounted volume.
Metrics Collection
Prometheus scrapes each target's /metrics endpoint. Instrument applications using client libraries or exporters. Key metric types include:
- Counter: Monotonically increasing values (e.g., http_requests_total)
- Gauge: Values that can go up or down (e.g., memory_usage_bytes)
- Histogram: Count and sum of observed values in configurable buckets
- Summary: Count and sum plus quantiles over a sliding time window
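The four metric types can be instrumented with any Prometheus client library. A minimal sketch using the Python prometheus_client package (metric names and label values are illustrative):

```python
from prometheus_client import (
    CollectorRegistry, Counter, Gauge, Histogram, Summary, generate_latest,
)

registry = CollectorRegistry()

# Counter: only ever increases; resets on process restart
http_requests_total = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "status"], registry=registry)

# Gauge: arbitrary value that can go up or down
memory_usage_bytes = Gauge(
    "memory_usage_bytes", "Resident memory in bytes", registry=registry)

# Histogram: observations counted into configurable buckets
request_duration = Histogram(
    "http_request_duration_seconds", "Request latency",
    buckets=[0.05, 0.1, 0.5, 1, 5], registry=registry)

# Summary: client-side count and sum of observations
payload_size = Summary(
    "payload_size_bytes", "Payload sizes", registry=registry)

http_requests_total.labels(method="GET", status="200").inc()
memory_usage_bytes.set(128 * 1024 * 1024)
request_duration.observe(0.42)
payload_size.observe(512)

# Text exposition as served at /metrics
exposition = generate_latest(registry).decode()
```

Serving this registry over HTTP (e.g., with prometheus_client.start_http_server) gives Prometheus a scrape target matching the 'application' job above.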
Loki Setup
Installation and Configuration
Deploy Loki for log aggregation:
docker run -d \
-p 3100:3100 \
-v $(pwd)/loki-config.yml:/etc/loki/local-config.yaml \
-v loki-data:/loki \
grafana/loki:latest \
-config.file=/etc/loki/local-config.yaml
Create loki-config.yml:
server:
  http_listen_port: 3100
common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory
schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h
storage_config:
  filesystem:
    directory: /loki/chunks
limits_config:
  retention_period: 168h
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20
Deploy Promtail to forward logs:
docker run -d \
-v $(pwd)/promtail-config.yml:/etc/promtail/config.yml \
-v $(pwd)/app-logs:/var/log:ro \
grafana/promtail:latest \
-config.file=/etc/promtail/config.yml
Create promtail-config.yml:
server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/positions.yaml
clients:
  - url: http://loki:3100/loki/api/v1/push
scrape_configs:
  - job_name: application
    static_configs:
      - targets:
          - localhost
        labels:
          job: application
          __path__: /var/log/*.log
Note: Ensure log files exist in the mounted directory (./app-logs) before starting Promtail. Logs should be in plain text or JSON format.
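For sources Promtail cannot tail (short-lived jobs, serverless functions), applications can push directly to Loki's HTTP API instead. A sketch of building the push payload (stream labels are illustrative); each entry is a [nanosecond-timestamp, line] pair:

```python
import json
import time
import urllib.request

def build_loki_payload(labels: dict, lines: list) -> dict:
    """Build a request body for POST /loki/api/v1/push."""
    now_ns = str(time.time_ns())
    return {
        "streams": [
            {
                "stream": labels,            # label set identifying the stream
                "values": [[now_ns, line] for line in lines],
            }
        ]
    }

def push_to_loki(payload: dict,
                 url: str = "http://localhost:3100/loki/api/v1/push") -> None:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # Loki replies 204 No Content on success

payload = build_loki_payload({"job": "application"}, ["level=info msg=started"])
```

Keep label sets small and low-cardinality; per-request values belong in the log line, not in stream labels.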
Grafana Integration
Data Source Configuration
Create /etc/grafana/provisioning/datasources/datasources.yml:
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true
  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686
    editable: true
Access Grafana at http://localhost:3000 (default credentials: admin/admin). Datasources will be auto-provisioned on startup.
Dashboard Creation and Alerting
Use PromQL for queries. Common patterns:
# Request rate
rate(http_requests_total[5m])
# Error rate
rate(http_requests_total{status=~"5.."}[5m])
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# CPU usage
rate(process_cpu_seconds_total[5m]) * 100
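These PromQL expressions can also be run programmatically against Prometheus's HTTP API (GET /api/v1/query). A standard-library-only sketch, assuming the base URL from the Docker setup above:

```python
import json
import urllib.parse
import urllib.request

def parse_instant_result(body: dict) -> list:
    """Turn an /api/v1/query response into [(label-dict, float-value), ...]."""
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return [
        (sample["metric"], float(sample["value"][1]))
        for sample in body["data"]["result"]
    ]

def query_prometheus(expr: str, base_url: str = "http://localhost:9090") -> list:
    """Run an instant PromQL query and return parsed samples."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url) as resp:
        return parse_instant_result(json.load(resp))

# Example response shape, as documented for vector results:
canned = {"status": "success", "data": {"resultType": "vector", "result": [
    {"metric": {"job": "prometheus"}, "value": [1700000000, "0.25"]}]}}
samples = parse_instant_result(canned)
```

With a live server, query_prometheus('rate(http_requests_total[5m])') returns one (labels, value) pair per series.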
Use LogQL for log queries:
{job="application"} |= "error"
{job="application"} | logfmt | trace_id != ""
Grafana-managed alert rules can be provisioned from files placed under /etc/grafana/provisioning/alerting/. Note that Grafana's native rule schema differs from Prometheus rule files: each rule carries a uid, title, condition, and data query blocks, with thresholds expressed as server-side expressions. A minimal sketch (the datasource UID, folder, and rule uid are placeholders):
apiVersion: 1
groups:
  - orgId: 1
    name: application_alerts
    folder: Application
    interval: 1m
    rules:
      - uid: high-error-rate
        title: HighErrorRate
        condition: C
        for: 5m
        annotations:
          summary: "High error rate detected"
        data:
          - refId: A
            relativeTimeRange:
              from: 600
              to: 0
            datasourceUid: <prometheus-datasource-uid>
            model:
              expr: rate(http_requests_total{status=~"5.."}[5m])
              refId: A
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator:
                    type: gt
                    params: [0.05]
              refId: C
Alternatively, keep alert evaluation in Prometheus (alerting_rules.yml, configured earlier) routed through Alertmanager, and use Grafana purely for visualization.
Import pre-built dashboards from Grafana.com for quick setup (Node Exporter Full, Kubernetes Cluster Monitoring).
Jaeger Distributed Tracing
Deployment
Deploy Jaeger All-in-One with OTLP enabled:
docker run -d \
--name jaeger \
-p 5775:5775/udp \
-p 6831:6831/udp \
-p 6832:6832/udp \
-p 5778:5778 \
-p 16686:16686 \
-p 14268:14268 \
-p 14250:14250 \
-p 9411:9411 \
-p 4317:4317 \
-p 4318:4318 \
jaegertracing/all-in-one:latest
Access Jaeger UI at http://localhost:16686. OTLP endpoints are available at http://localhost:4317 (gRPC) and http://localhost:4318 (HTTP).
OpenTelemetry Instrumentation
Instrument applications using OpenTelemetry SDKs with OTLP exporters. Example for Go:
import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func initTracer(serviceName string) error {
	// otlptracegrpc.New requires a context as its first argument
	exporter, err := otlptracegrpc.New(
		context.Background(),
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceName(serviceName),
		)),
	)
	otel.SetTracerProvider(tp)
	return nil
}

// Usage in handlers
tracer := otel.Tracer("service-a")
ctx, span := tracer.Start(ctx, "process-request")
defer span.End()
For Python:
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "my-service"
})
otlp_exporter = OTLPSpanExporter(
    endpoint="localhost:4317",
    insecure=True,
)
trace.set_tracer_provider(TracerProvider(resource=resource))
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("operation"):
    # Your code here
    pass
Correlation and Integration
Trace to Metrics
Link traces to Prometheus metrics using exemplars. Exemplar storage is enabled by the --enable-feature=exemplar-storage flag (already passed in the Prometheus run command above); the in-memory exemplar buffer can be tuned in prometheus.yml:
storage:
  exemplars:
    max_exemplars: 100000
Attach exemplars when recording metrics. With Go's client_golang, histograms implement prometheus.ExemplarObserver, which accepts a trace ID alongside the observation (the OpenTelemetry metrics SDK can attach exemplars automatically when a sampled span is in the context):
histogram.(prometheus.ExemplarObserver).ObserveWithExemplar(
    latencySeconds,
    prometheus.Labels{"trace_id": span.SpanContext().TraceID().String()},
)
Trace to Logs
Configure Grafana to link traces to logs. In Jaeger data source settings:
- Navigate to Trace to logs section
- Select Loki data source
- Configure tag mapping for
trace_id - Enable Filter by trace ID
Instrument applications to include trace IDs in logs:
import "go.opentelemetry.io/otel/bridge/opentracing"
log.Printf("Processing request trace_id=%s", span.SpanContext().TraceID().String())
from opentelemetry import trace
tracer = trace.get_tracer(__name__)
current_span = trace.get_current_span()
print(f"Processing request trace_id={current_span.context.trace_id}")
Getting Started
- Deploy the stack using Docker Compose:
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerting_rules.yml:/etc/prometheus/alerting_rules.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--enable-feature=exemplar-storage'
  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"
      - "14268:14268"
      - "14250:14250"
      - "4317:4317"
      - "4318:4318"
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command:
      - '-config.file=/etc/loki/local-config.yaml'
  promtail:
    image: grafana/promtail:latest
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - ./app-logs:/var/log:ro
    command:
      - '-config.file=/etc/promtail/config.yml'
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
  app-service:
    image: nginx:alpine
    ports:
      - "8080:80"
volumes:
  grafana-storage:
  loki-data:
  prometheus-data:
Create alertmanager.yml:
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'
receivers:
  - name: 'web.hook'
- Start services: docker-compose up -d
- Instrument applications with OpenTelemetry SDKs using OTLP exporters
- Configure Prometheus scrape targets using service names
- Set up Grafana dashboards and alerts
- Verify traces appear in Jaeger UI and logs in Loki
Access points:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000
- Jaeger: http://localhost:16686
- Loki: http://localhost:3100