Observability & Monitoring

Build a Production-Grade Observability Stack with Prometheus, Grafana, Loki, and Jaeger

MatterAI Agent

Observability Stack Setup: Prometheus, Grafana, Loki, and Jaeger for Distributed Systems

This guide covers the implementation of a complete observability stack for distributed systems using Prometheus for metrics collection, Grafana for visualization, Loki for log aggregation, and Jaeger for distributed tracing.

Architecture Overview

The stack operates on the Three Pillars of Observability: Metrics (Prometheus), Logs (Loki), and Tracing (Jaeger). Prometheus uses a pull-based model to scrape metrics from instrumented applications and exporters. Grafana queries Prometheus for time-series data, Loki for logs, and Jaeger for trace data, providing unified dashboards. Jaeger collects distributed traces via OpenTelemetry instrumentation, enabling request flow analysis across microservices.

Prometheus Setup

Installation and Configuration

Deploy Prometheus via Docker or Kubernetes. Create a prometheus.yml configuration file:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'docker-cluster'
    monitor: 'prometheus'

storage:
  tsdb:
    out_of_order_time_window: 30m

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'node_exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'application'
    static_configs:
      - targets: ['app-service:8080']
    metrics_path: '/metrics'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alerting_rules.yml'

Create alerting_rules.yml:

groups:
  - name: application_alerts
    interval: 30s
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 5m
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value }} errors/sec"

      - alert: HighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        for: 5m
        annotations:
          summary: "High P95 latency"
          description: "P95 latency is {{ $value }} seconds"

Run Prometheus with exemplar storage enabled:

docker run -d \
  -p 9090:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v $(pwd)/alerting_rules.yml:/etc/prometheus/alerting_rules.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus:latest \
  --config.file=/etc/prometheus/prometheus.yml \
  --enable-feature=exemplar-storage

Metrics Collection

Applications and exporters expose metrics on a /metrics endpoint that Prometheus scrapes. Instrument services with the official client libraries, or deploy exporters for third-party systems (a sketch follows this list). Key metric types include:

  • Counter: Monotonically increasing values (e.g., http_requests_total)
  • Gauge: Values that can go up or down (e.g., memory_usage_bytes)
  • Histogram: Count and sum of observed values in configurable buckets
  • Summary: Count and sum plus quantiles over a sliding time window
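
As a concrete starting point, here is a minimal sketch of application-side instrumentation with the Prometheus Go client (client_golang). The metric names mirror the ones used in the alerting rules above; the handler and port are placeholders:

package main

import (
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Counter: total requests, labeled by status code (used by the HighErrorRate rule).
    httpRequestsTotal = promauto.NewCounterVec(prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total number of HTTP requests.",
    }, []string{"status"})

    // Histogram: request latency in seconds (used by the HighLatency rule).
    requestDuration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "http_request_duration_seconds",
        Help:    "HTTP request latency in seconds.",
        Buckets: prometheus.DefBuckets,
    })
)

func handler(w http.ResponseWriter, r *http.Request) {
    start := time.Now()
    w.Write([]byte("ok"))
    httpRequestsTotal.WithLabelValues("200").Inc()
    requestDuration.Observe(time.Since(start).Seconds())
}

func main() {
    http.HandleFunc("/", handler)
    // Expose /metrics on :8080 so the 'application' scrape job above can reach it.
    // EnableOpenMetrics is required later for exemplar exposure.
    http.Handle("/metrics", promhttp.HandlerFor(
        prometheus.DefaultGatherer,
        promhttp.HandlerOpts{EnableOpenMetrics: true},
    ))
    http.ListenAndServe(":8080", nil)
}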

Loki Setup

Installation and Configuration

Deploy Loki for log aggregation:

docker run -d \
  -p 3100:3100 \
  -v $(pwd)/loki-config.yml:/etc/loki/local-config.yaml \
  -v loki-data:/loki \
  grafana/loki:latest \
  -config.file=/etc/loki/local-config.yaml

Create loki-config.yml:

server:
  http_listen_port: 3100

common:
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

storage_config:
  filesystem:
    directory: /loki/chunks

limits_config:
  retention_period: 168h
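  # Note: deletion based on retention_period also requires the compactor with retention_enabled: true.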
  ingestion_rate_mb: 10
  ingestion_burst_size_mb: 20

Deploy Promtail to forward logs:

docker run -d \
  -v $(pwd)/promtail-config.yml:/etc/promtail/config.yml \
  -v $(pwd)/app-logs:/var/log:ro \
  grafana/promtail:latest \
  -config.file=/etc/promtail/config.yml

Create promtail-config.yml:

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: application
    static_configs:
      - targets:
          - localhost
        labels:
          job: application
          __path__: /var/log/*.log

Note: Ensure log files exist in the mounted directory (./app-logs) before starting Promtail. Logs should be in plain text or JSON format.
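
For example, here is a minimal sketch of a Go service writing logfmt-style lines with the standard log/slog package, so Promtail can ship them and LogQL's logfmt parser can read them; the app-logs/app.log path is an assumption matching the volume mount above:

package main

import (
    "log/slog"
    "os"
)

func main() {
    // Append to a file inside the directory mounted into Promtail as /var/log.
    f, err := os.OpenFile("app-logs/app.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
    if err != nil {
        panic(err)
    }
    defer f.Close()

    // slog's TextHandler emits key=value pairs that the LogQL `| logfmt` parser understands.
    logger := slog.New(slog.NewTextHandler(f, nil))
    logger.Info("request processed", "status", 200, "path", "/checkout")
    logger.Error("upstream timeout", "status", 504, "path", "/payment")
}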

Grafana Integration

Data Source Configuration

Create /etc/grafana/provisioning/datasources/datasources.yml:

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true

  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686
    editable: true

Access Grafana at http://localhost:3000 (default credentials: admin/admin). Datasources will be auto-provisioned on startup.

Dashboard Creation and Alerting

Use PromQL for queries. Common patterns:

# Request rate
rate(http_requests_total[5m])

# Error rate
rate(http_requests_total{status=~"5.."}[5m])

# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# CPU usage
rate(process_cpu_seconds_total[5m]) * 100

Use LogQL for log queries:

{job="application"} |= "error"

{job="application"} | logfmt | trace_id != ""

Grafana can also manage its own alert rules in addition to the Prometheus rules defined in alerting_rules.yml. Grafana-managed rules are created in the UI under Alerting > Alert rules, or provisioned from YAML files placed under /etc/grafana/provisioning/alerting/. Note that Grafana's alert-rule provisioning schema is not the Prometheus rule-file format: each rule references a data source UID and one or more queries plus an expression-based condition. For this stack, the simplest approach is to keep alert evaluation in Prometheus (alerting_rules.yml above) and route notifications through Alertmanager.

Import pre-built dashboards from Grafana.com for quick setup (Node Exporter Full, Kubernetes Cluster Monitoring).

Jaeger Distributed Tracing

Deployment

Deploy Jaeger All-in-One with OTLP enabled:

docker run -d \
  --name jaeger \
  -p 5775:5775/udp \
  -p 6831:6831/udp \
  -p 6832:6832/udp \
  -p 5778:5778 \
  -p 16686:16686 \
  -p 14268:14268 \
  -p 14250:14250 \
  -p 9411:9411 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

Access the Jaeger UI at http://localhost:16686. OTLP endpoints are available at http://localhost:4317 (gRPC) and http://localhost:4318 (HTTP). On older Jaeger versions, the OTLP receiver must be enabled explicitly with -e COLLECTOR_OTLP_ENABLED=true.

OpenTelemetry Instrumentation

Instrument applications using OpenTelemetry SDKs with OTLP exporters. Example for Go:

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.24.0"
)

func initTracer(serviceName string) error {
    // otlptracegrpc.New requires a context as its first argument.
    exporter, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("localhost:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return err
    }
    
    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceName(serviceName),
        )),
    )
    otel.SetTracerProvider(tp)
    return nil
}

// Usage in handlers
tracer := otel.Tracer("service-a")
ctx, span := tracer.Start(ctx, "process-request")
defer span.End()
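
For incoming HTTP traffic, spans can also be created automatically by middleware rather than by hand. A brief sketch using the go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp package, assuming the initTracer helper above (route and port are placeholders):

package main

import (
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
)

func main() {
    // initTracer is the helper defined in the snippet above.
    if err := initTracer("service-a"); err != nil {
        panic(err)
    }

    mux := http.NewServeMux()
    mux.HandleFunc("/checkout", func(w http.ResponseWriter, r *http.Request) {
        // r.Context() already carries the server span created by otelhttp.
        w.Write([]byte("ok"))
    })

    // Wrap the mux so every request produces a span exported to Jaeger via OTLP.
    http.ListenAndServe(":8080", otelhttp.NewHandler(mux, "http.server"))
}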

For Python:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource

resource = Resource.create({
    "service.name": "my-service"
})

otlp_exporter = OTLPSpanExporter(
    endpoint="localhost:4317",
    insecure=True,
)

trace.set_tracer_provider(TracerProvider(resource=resource))
trace.get_tracer_provider().add_span_processor(
    BatchSpanProcessor(otlp_exporter)
)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("operation"):
    # Your code here
    pass

Correlation and Integration

Trace to Metrics

Link traces to Prometheus metrics using exemplars. Exemplar storage is enabled by the --enable-feature=exemplar-storage flag, which the docker run command and Compose file above already pass; no additional prometheus.yml configuration is required.

Attach an exemplar when recording a metric. With the Prometheus Go client (client_golang), histogram observations accept an exemplar carrying the current trace ID:

histogram.(prometheus.ExemplarObserver).ObserveWithExemplar(
    latencySeconds,
    prometheus.Labels{"trace_id": span.SpanContext().TraceID().String()},
)

Trace to Logs

Configure Grafana to link traces to logs. In Jaeger data source settings:

  1. Navigate to Trace to logs section
  2. Select Loki data source
  3. Configure tag mapping for trace_id
  4. Enable Filter by trace ID

Instrument applications to include trace IDs in logs. In Go:

log.Printf("Processing request trace_id=%s", span.SpanContext().TraceID().String())

In Python:

from opentelemetry import trace

current_span = trace.get_current_span()
trace_id = trace.format_trace_id(current_span.get_span_context().trace_id)
print(f"Processing request trace_id={trace_id}")

Getting Started

  1. Deploy the stack using Docker Compose:
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alerting_rules.yml:/etc/prometheus/alerting_rules.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--enable-feature=exemplar-storage'

  alertmanager:
    image: prom/alertmanager:latest
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana-storage:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning

  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"
      - "14268:14268"
      - "14250:14250"
      - "4317:4317"
      - "4318:4318"

  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    volumes:
      - ./loki-config.yml:/etc/loki/local-config.yaml
      - loki-data:/loki
    command:
      - '-config.file=/etc/loki/local-config.yaml'

  promtail:
    image: grafana/promtail:latest
    volumes:
      - ./promtail-config.yml:/etc/promtail/config.yml
      - ./app-logs:/var/log:ro
    command:
      - '-config.file=/etc/promtail/config.yml'

  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"

  app-service:
    image: nginx:alpine
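    # Placeholder; replace with an instrumented service that serves /metrics on port 8080
    # inside the Compose network (the 'application' scrape job targets app-service:8080).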
    ports:
      - "8080:80"

volumes:
  grafana-storage:
  loki-data:
  prometheus-data:

Create alertmanager.yml:

global:
  resolve_timeout: 5m

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
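    # No notifier configured; add webhook_configs, email_configs, or slack_configs here to deliver alerts.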
  2. Start services: docker-compose up -d
  3. Instrument applications with OpenTelemetry SDKs using OTLP exporters
  4. Configure Prometheus scrape targets using service names
  5. Set up Grafana dashboards and alerts
  6. Verify traces appear in Jaeger UI and logs in Loki

Access points:

  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3000
  • Jaeger: http://localhost:16686
  • Loki: http://localhost:3100
