
Kubernetes Observability: The Complete Production Guide to Monitoring, Logging, and Tracing


When your 3 AM pager goes off, observability is the difference between a 5-minute fix and a 5-hour war room

The Night Everything Went Dark

It was 3:17 AM on a Tuesday when Sarah's phone erupted with alerts. Her team's e-commerce platform—handling thousands of orders per minute—was experiencing intermittent failures. Response times were spiking, some services were timing out, and worst of all, the team couldn't pinpoint the root cause.

What followed was a masterclass in why observability isn't just nice-to-have—it's mission-critical. This is the story of how we built a comprehensive observability stack that turned 5-hour outages into 5-minute resolutions.

🎯 The Three Pillars of Observability

Before we dive into the technical implementation, let's understand what true observability means in the context of Kubernetes:

1. Metrics: The What

Quantitative measurements that answer "what is happening?"

2. Logs: The Why

Detailed records that answer "why did it happen?"

3. Traces: The How

Request flows that answer "how did it propagate through the system?"

Let's build a production-grade observability stack that covers all three pillars.
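
To ground those definitions before we deploy anything, here is a minimal, hypothetical Go handler that emits all three signals for a single checkout request: a counter metric, a structured log line, and a trace span. The names (checkout_requests_total, the "shop" tracer, reserveInventory) are illustrative only and not part of the stack we build below.

// three_pillars.go - one request, three signals (illustrative sketch)
package main

import (
    "context"
    "log/slog"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "go.opentelemetry.io/otel"
)

// Metric (the "what"): how many checkouts happened, and how many failed.
var checkoutTotal = promauto.NewCounterVec(
    prometheus.CounterOpts{Name: "checkout_requests_total", Help: "Checkout attempts by status"},
    []string{"status"},
)

func checkoutHandler(w http.ResponseWriter, r *http.Request) {
    // Trace (the "how"): a span ties this hop into the end-to-end request path.
    ctx, span := otel.Tracer("shop").Start(r.Context(), "checkout")
    defer span.End()

    if err := reserveInventory(ctx); err != nil {
        checkoutTotal.WithLabelValues("error").Inc()
        // Log (the "why"): the detail you grep for when the metric spikes.
        slog.Error("checkout failed", "error", err, "order_id", r.URL.Query().Get("order_id"))
        http.Error(w, "checkout failed", http.StatusInternalServerError)
        return
    }
    checkoutTotal.WithLabelValues("ok").Inc()
    w.Write([]byte("ok"))
}

func reserveInventory(ctx context.Context) error { return nil } // stub for the sketch

func main() {
    http.HandleFunc("/checkout", checkoutHandler)
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}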

🔧 Building the Foundation: Prometheus Stack

Core Prometheus Configuration

# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
      external_labels:
        cluster: 'production'
        region: 'us-west-2'

    rule_files:
      - "/etc/prometheus/rules/*.yml"

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              - alertmanager:9093

    scrape_configs:
      # Kubernetes API Server
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https

      # Kubernetes Nodes
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
          - target_label: __address__
            replacement: kubernetes.default.svc:443
          - source_labels: [__meta_kubernetes_node_name]
            regex: (.+)
            target_label: __metrics_path__
            replacement: /api/v1/nodes/${1}/proxy/metrics

      # Kubernetes Pods
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
          - action: labelmap
            regex: __meta_kubernetes_pod_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_pod_name]
            action: replace
            target_label: kubernetes_pod_name

      # Service Discovery for custom applications
      - job_name: 'kubernetes-service-endpoints'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: kubernetes_namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: kubernetes_name
---
# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: monitoring
spec:
  replicas: 2  # each replica needs its own TSDB volume; for HA, prefer a StatefulSet (a single RWO PVC can only bind to one pod)
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus
      securityContext:
        runAsNonRoot: true
        runAsUser: 65534
        fsGroup: 65534
      containers:
      - name: prometheus
        image: prom/prometheus:v2.47.0
        args:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--storage.tsdb.retention.time=30d'
          - '--storage.tsdb.retention.size=50GB'
          - '--web.console.libraries=/etc/prometheus/console_libraries'
          - '--web.console.templates=/etc/prometheus/consoles'
          - '--web.enable-lifecycle'
          - '--web.enable-admin-api'
          - '--web.external-url=https://prometheus.company.com'
        ports:
        - containerPort: 9090
          name: web
        livenessProbe:
          httpGet:
            path: /-/healthy
            port: web
          initialDelaySeconds: 30
          timeoutSeconds: 30
        readinessProbe:
          httpGet:
            path: /-/ready
            port: web
          initialDelaySeconds: 30
          timeoutSeconds: 30
        resources:
          requests:
            memory: "2Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
        volumeMounts:
        - name: config-volume
          mountPath: /etc/prometheus
        - name: storage-volume
          mountPath: /prometheus
        - name: rules-volume
          mountPath: /etc/prometheus/rules
      volumes:
      - name: config-volume
        configMap:
          name: prometheus-config
      - name: storage-volume
        persistentVolumeClaim:
          claimName: prometheus-storage
      - name: rules-volume
        configMap:
          name: prometheus-rules
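
Once the ConfigMap and Deployment are applied, it is worth confirming that service discovery is actually finding targets before you trust any alerts. Here is a small, hypothetical check against Prometheus's /api/v1/targets endpoint; the in-cluster URL assumes the prometheus Service lives in the monitoring namespace, and from a workstation you would port-forward and hit localhost:9090 instead.

// check_targets.go - list scrape targets and their health via the Prometheus HTTP API
package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

type targetsResponse struct {
    Data struct {
        ActiveTargets []struct {
            ScrapeURL string `json:"scrapeUrl"`
            Health    string `json:"health"`
            LastError string `json:"lastError"`
        } `json:"activeTargets"`
    } `json:"data"`
}

func main() {
    // Assumed in-cluster address; adjust to your Service name or a local port-forward.
    resp, err := http.Get("http://prometheus.monitoring.svc.cluster.local:9090/api/v1/targets")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    var tr targetsResponse
    if err := json.NewDecoder(resp.Body).Decode(&tr); err != nil {
        panic(err)
    }
    for _, t := range tr.Data.ActiveTargets {
        // "up" means the last scrape succeeded; anything else shows the scrape error.
        fmt.Printf("%-60s %-5s %s\n", t.ScrapeURL, t.Health, t.LastError)
    }
}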

Advanced Alerting Rules

# prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: monitoring
data:
  kubernetes.yml: |
    groups:
    - name: kubernetes.rules
      rules:
      # High-level cluster health
      - alert: KubernetesNodeNotReady
        expr: kube_node_status_condition{condition="Ready",status="true"} == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.node }} is not ready"
          description: "Node {{ $labels.node }} has been NotReady for more than 10 minutes"

      - alert: KubernetesPodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has restarted {{ $value }} times in the last 15 minutes"

      # Resource utilization
      - alert: HighCPUUsage
        expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 80% on {{ $labels.instance }} for more than 10 minutes"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% on {{ $labels.instance }} for more than 10 minutes"

      # Application-specific alerts
      - alert: ApplicationHighLatency
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency detected"
          description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}"

      - alert: ApplicationHighErrorRate
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"

  storage.yml: |
    groups:
    - name: storage.rules
      rules:
      - alert: PersistentVolumeUsage
        expr: (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PV {{ $labels.persistentvolumeclaim }} usage high"
          description: "Persistent Volume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is {{ $value }}% full"

      - alert: PersistentVolumeInodeUsage
        expr: (kubelet_volume_stats_inodes_used / kubelet_volume_stats_inodes) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "PV {{ $labels.persistentvolumeclaim }} inode usage high"
          description: "Persistent Volume {{ $labels.persistentvolumeclaim }} inode usage is {{ $value }}%"

📊 Advanced Grafana Dashboards

Custom Dashboard for Application Metrics

{
  "dashboard": {
    "id": null,
    "title": "Application Performance Dashboard",
    "tags": ["kubernetes", "application"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Request Rate",
        "type": "stat",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m]))",
            "legendFormat": "Requests/sec"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "color": {
              "mode": "thresholds"
            },
            "thresholds": {
              "steps": [
                {"color": "green", "value": null},
                {"color": "yellow", "value": 1000},
                {"color": "red", "value": 5000}
              ]
            }
          }
        }
      },
      {
        "id": 2,
        "title": "Response Time Percentiles",
        "type": "timeseries",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "50th percentile"
          },
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "95th percentile"
          },
          {
            "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
            "legendFormat": "99th percentile"
          }
        ]
      },
      {
        "id": 3,
        "title": "Error Rate by Service",
        "type": "timeseries",
        "targets": [
          {
            "expr": "sum by (service) (rate(http_requests_total{status=~\"5..\"}[5m])) / sum by (service) (rate(http_requests_total[5m]))",
            "legendFormat": "{{ service }}"
          }
        ]
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "5s"
  }
}
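
Rather than importing this JSON by hand, you can push it through Grafana's dashboard HTTP API. A hedged sketch: it assumes the dashboard above is saved locally as dashboard.json (which already has the top-level "dashboard" wrapper the API expects), a Grafana Service reachable at grafana.monitoring.svc.cluster.local:3000, and a service-account token in the GRAFANA_TOKEN environment variable.

// push_dashboard.go - upload a dashboard via Grafana's /api/dashboards/db endpoint
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
    "os"
)

func main() {
    raw, err := os.ReadFile("dashboard.json") // assumed local copy of the JSON above
    if err != nil {
        panic(err)
    }

    var payload map[string]any
    if err := json.Unmarshal(raw, &payload); err != nil {
        panic(err)
    }
    payload["overwrite"] = true // replace an existing dashboard with the same title/uid

    body, err := json.Marshal(payload)
    if err != nil {
        panic(err)
    }

    req, err := http.NewRequest(http.MethodPost,
        "http://grafana.monitoring.svc.cluster.local:3000/api/dashboards/db",
        bytes.NewReader(body))
    if err != nil {
        panic(err)
    }
    req.Header.Set("Content-Type", "application/json")
    req.Header.Set("Authorization", "Bearer "+os.Getenv("GRAFANA_TOKEN"))

    resp, err := http.DefaultClient.Do(req)
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()
    fmt.Println("Grafana responded:", resp.Status) // 200 on success
}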

📝 Centralized Logging with EFK Stack

Fluent Bit Configuration for Log Collection

# fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging
data:
  fluent-bit.conf: |
    [SERVICE]
        Flush         5
        Log_Level     info
        Daemon        off
        Parsers_File  parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

    [INPUT]
        Name              tail
        Path              /var/log/containers/*.log
        Parser            cri
        Tag               kubernetes.*
        Refresh_Interval  5
        Mem_Buf_Limit     50MB
        Skip_Long_Lines   On

    [INPUT]
        Name systemd
        Tag  systemd.*
        Systemd_Filter _SYSTEMD_UNIT=kubelet.service
        Systemd_Filter _SYSTEMD_UNIT=docker.service

    [FILTER]
        Name                kubernetes
        Match               kubernetes.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kubernetes.var.log.containers.
        Merge_Log           On
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Annotations         Off
        Labels              On

    [FILTER]
        Name modify
        Match kubernetes.*
        Add cluster_name production
        Add environment prod

    [FILTER]
        Name grep
        Match kubernetes.*
        Exclude log ^\s*$

    [OUTPUT]
        Name            es
        Match           kubernetes.*
        Host            elasticsearch.logging.svc.cluster.local
        Port            9200
        Index           kubernetes-logs
        Type            _doc
        Logstash_Format On
        Logstash_Prefix kubernetes
        Logstash_DateFormat %Y.%m.%d
        Retry_Limit     False
        Suppress_Type_Name On

    [OUTPUT]
        Name            es
        Match           systemd.*
        Host            elasticsearch.logging.svc.cluster.local
        Port            9200
        Index           systemd-logs
        Type            _doc
        Logstash_Format On
        Logstash_Prefix systemd
        Logstash_DateFormat %Y.%m.%d
        Suppress_Type_Name On

  parsers.conf: |
    [PARSER]
        Name        cri
        Format      regex
        Regex       ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
        Time_Key    time
        Time_Format %Y-%m-%dT%H:%M:%S.%L%z

    [PARSER]
        Name        json
        Format      json
        Time_Key    time
        Time_Format %d/%b/%Y:%H:%M:%S %z

    [PARSER]
        Name        nginx
        Format      regex
        Regex       ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
        Time_Key    time
        Time_Format %d/%b/%Y:%H:%M:%S %z

---
# fluent-bit-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
  namespace: logging
  labels:
    app: fluent-bit
spec:
  selector:
    matchLabels:
      app: fluent-bit
  template:
    metadata:
      labels:
        app: fluent-bit
    spec:
      serviceAccountName: fluent-bit
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:2.1.10
        imagePullPolicy: Always
        ports:
        - containerPort: 2020
          name: http
        resources:
          requests:
            cpu: 100m
            memory: 128Mi
          limits:
            cpu: 500m
            memory: 512Mi
        volumeMounts:
        - name: varlog
          mountPath: /var/log
        - name: varlibdockercontainers
          mountPath: /var/lib/docker/containers
          readOnly: true
        - name: fluent-bit-config
          mountPath: /fluent-bit/etc/
        - name: mnt
          mountPath: /mnt
          readOnly: true
      terminationGracePeriodSeconds: 10
      volumes:
      - name: varlog
        hostPath:
          path: /var/log
      - name: varlibdockercontainers
        hostPath:
          path: /var/lib/docker/containers
      - name: fluent-bit-config
        configMap:
          name: fluent-bit-config
      - name: mnt
        hostPath:
          path: /mnt

🔍 Distributed Tracing with Jaeger

Jaeger All-in-One Deployment

# jaeger-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: tracing
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:1.50
        ports:
        - containerPort: 16686
          name: ui
        - containerPort: 14268
          name: collector
        - containerPort: 6831
          name: agent-compact
          protocol: UDP
        - containerPort: 6832
          name: agent-binary
          protocol: UDP
        - containerPort: 9411
          name: zipkin
        - containerPort: 5778
          name: admin
        env:
        - name: COLLECTOR_ZIPKIN_HOST_PORT
          value: ":9411"
        - name: SPAN_STORAGE_TYPE
          value: "elasticsearch"
        - name: ES_SERVER_URLS
          value: "http://elasticsearch.logging.svc.cluster.local:9200"
        - name: ES_INDEX_PREFIX
          value: "jaeger"
        resources:
          requests:
            memory: "1Gi"
            cpu: "500m"
          limits:
            memory: "2Gi"
            cpu: "1000m"
---
apiVersion: v1
kind: Service
metadata:
  name: jaeger
  namespace: tracing
spec:
  type: ClusterIP
  ports:
  - port: 16686
    targetPort: 16686
    name: ui
  - port: 14268
    targetPort: 14268
    name: collector
  - port: 9411
    targetPort: 9411
    name: zipkin
  - port: 6831
    targetPort: 6831
    name: agent-compact
    protocol: UDP
  - port: 6832
    targetPort: 6832
    name: agent-binary
    protocol: UDP
  - port: 5778
    targetPort: 5778
    name: admin
  selector:
    app: jaeger

Application Instrumentation for Tracing

# app-with-tracing.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8080"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: app
        image: your-registry/example-app:latest
        ports:
        - containerPort: 8080
          name: http
        env:
        - name: JAEGER_AGENT_HOST
          value: "jaeger.tracing.svc.cluster.local"
        - name: JAEGER_AGENT_PORT
          value: "6831"
        - name: JAEGER_SERVICE_NAME
          value: "example-app"
        - name: JAEGER_SAMPLER_TYPE
          value: "const"
        - name: JAEGER_SAMPLER_PARAM
          value: "1"
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5

🚀 Production Observability Patterns

Custom Metrics with Prometheus Client

// metrics.go - Go application metrics instrumentation
package main

import (
    "context"
    "fmt"
    "net/http"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/codes"
)

var (
    // HTTP metrics
    httpRequestsTotal = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "http_requests_total",
            Help: "Total number of HTTP requests",
        },
        []string{"method", "endpoint", "status"},
    )

    httpRequestDuration = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "http_request_duration_seconds",
            Help:    "HTTP request duration in seconds",
            Buckets: prometheus.DefBuckets,
        },
        []string{"method", "endpoint"},
    )

    // Business metrics
    ordersProcessed = prometheus.NewCounterVec(
        prometheus.CounterOpts{
            Name: "orders_processed_total",
            Help: "Total number of orders processed",
        },
        []string{"status", "payment_method"},
    )

    orderValue = prometheus.NewHistogramVec(
        prometheus.HistogramOpts{
            Name:    "order_value_dollars",
            Help:    "Order value in dollars",
            Buckets: []float64{10, 50, 100, 500, 1000, 5000},
        },
        []string{"customer_tier"},
    )

    // System metrics
    databaseConnections = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "database_connections_active",
            Help: "Number of active database connections",
        },
        []string{"database", "connection_pool"},
    )

    cacheHitRatio = prometheus.NewGaugeVec(
        prometheus.GaugeOpts{
            Name: "cache_hit_ratio",
            Help: "Cache hit ratio (0-1)",
        },
        []string{"cache_name"},
    )
)

func init() {
    // Register metrics with Prometheus
    prometheus.MustRegister(httpRequestsTotal)
    prometheus.MustRegister(httpRequestDuration)
    prometheus.MustRegister(ordersProcessed)
    prometheus.MustRegister(orderValue)
    prometheus.MustRegister(databaseConnections)
    prometheus.MustRegister(cacheHitRatio)
}

// Middleware for HTTP metrics and tracing
func instrumentHandler(handler http.HandlerFunc, endpoint string) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        // Start tracing
        tracer := otel.Tracer("example-app")
        ctx, span := tracer.Start(r.Context(), fmt.Sprintf("%s %s", r.Method, endpoint))
        defer span.End()

        // Add request attributes to span
        span.SetAttributes(
            attribute.String("http.method", r.Method),
            attribute.String("http.url", r.URL.String()),
            attribute.String("http.user_agent", r.UserAgent()),
        )

        // Start timer for request duration
        start := time.Now()

        // Wrap response writer to capture status code
        wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}

        // Call the actual handler
        handler(wrapped, r.WithContext(ctx))

        // Record metrics
        duration := time.Since(start).Seconds()
        status := fmt.Sprintf("%d", wrapped.statusCode)

        httpRequestsTotal.WithLabelValues(r.Method, endpoint, status).Inc()
        httpRequestDuration.WithLabelValues(r.Method, endpoint).Observe(duration)

        // Add response attributes to span
        span.SetAttributes(
            attribute.Int("http.status_code", wrapped.statusCode),
            attribute.Float64("http.response_time", duration),
        )
    }
}

type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

// Business logic with metrics
func processOrder(ctx context.Context, order Order) error {
    tracer := otel.Tracer("example-app")
    ctx, span := tracer.Start(ctx, "process_order")
    defer span.End()

    span.SetAttributes(
        attribute.String("order.id", order.ID),
        attribute.Float64("order.value", order.Value),
        attribute.String("order.payment_method", order.PaymentMethod),
    )

    // Simulate order processing
    if err := validateOrder(ctx, order); err != nil {
        ordersProcessed.WithLabelValues("failed", order.PaymentMethod).Inc()
        span.RecordError(err)
        return err
    }

    if err := chargePayment(ctx, order); err != nil {
        ordersProcessed.WithLabelValues("payment_failed", order.PaymentMethod).Inc()
        span.RecordError(err)
        return err
    }

    // Record successful order
    ordersProcessed.WithLabelValues("success", order.PaymentMethod).Inc()
    orderValue.WithLabelValues(order.CustomerTier).Observe(order.Value)

    span.SetStatus(codes.Ok, "Order processed successfully")
    return nil
}

type Order struct {
    ID            string
    Value         float64
    PaymentMethod string
    CustomerTier  string
}

func validateOrder(ctx context.Context, order Order) error {
    tracer := otel.Tracer("example-app")
    _, span := tracer.Start(ctx, "validate_order")
    defer span.End()

    // Validation logic here
    return nil
}

func chargePayment(ctx context.Context, order Order) error {
    tracer := otel.Tracer("example-app")
    _, span := tracer.Start(ctx, "charge_payment")
    defer span.End()

    // Payment processing logic here
    return nil
}

// Health check endpoints
func healthHandler(w http.ResponseWriter, r *http.Request) {
    w.WriteHeader(http.StatusOK)
    w.Write([]byte("OK"))
}

func readinessHandler(w http.ResponseWriter, r *http.Request) {
    // Check database connectivity, cache, etc.
    if isDatabaseHealthy() && isCacheHealthy() {
        w.WriteHeader(http.StatusOK)
        w.Write([]byte("Ready"))
    } else {
        w.WriteHeader(http.StatusServiceUnavailable)
        w.Write([]byte("Not Ready"))
    }
}

func isDatabaseHealthy() bool {
    // Database health check logic
    return true
}

func isCacheHealthy() bool {
    // Cache health check logic
    return true
}

func main() {
    // Set up routes with instrumentation
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/health", instrumentHandler(healthHandler, "/health"))
    http.HandleFunc("/ready", instrumentHandler(readinessHandler, "/ready"))
    
    // Start server
    fmt.Println("Server starting on :8080")
    if err := http.ListenAndServe(":8080", nil); err != nil {
        panic(err)
    }
}
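
One gap worth calling out: the Deployment sets JAEGER_AGENT_HOST and friends, but nothing above registers a tracer provider, so otel.Tracer() silently returns a no-op and no spans ever reach Jaeger. Here is a minimal bootstrap sketch that wires the SDK to those same environment variables; it assumes the go.opentelemetry.io/otel/exporters/jaeger exporter (deprecated upstream in favor of OTLP, but it matches the agent-style configuration used here). Call initTracing() at the start of main() and defer tp.Shutdown().

// tracing.go - OTel SDK bootstrap matching the JAEGER_* env vars in the manifest (sketch)
package main

import (
    "os"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/jaeger" // deprecated upstream; OTLP is the current path
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func initTracing() (*sdktrace.TracerProvider, error) {
    // Export spans to the Jaeger agent address injected by the Deployment.
    exp, err := jaeger.New(jaeger.WithAgentEndpoint(
        jaeger.WithAgentHost(os.Getenv("JAEGER_AGENT_HOST")),
        jaeger.WithAgentPort(os.Getenv("JAEGER_AGENT_PORT")),
    ))
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exp),
        sdktrace.WithResource(resource.NewSchemaless(
            attribute.String("service.name", os.Getenv("JAEGER_SERVICE_NAME")),
        )),
        // Mirror JAEGER_SAMPLER_TYPE=const / JAEGER_SAMPLER_PARAM=1: sample every request.
        sdktrace.WithSampler(sdktrace.AlwaysSample()),
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}

With that in place, the spans created by instrumentHandler and processOrder actually show up under the example-app service in the Jaeger UI.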

📈 Advanced Monitoring Strategies

SLI/SLO Monitoring

# slo-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: slo-rules
  namespace: monitoring
data:
  slo.yml: |
    groups:
    - name: slo.rules
      interval: 30s
      rules:
      # Availability SLI: percentage of successful requests
      - record: sli:availability:rate5m
        expr: |
          (
            sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
            /
            sum(rate(http_requests_total[5m])) by (service)
          )

      # Latency SLI: percentage of requests under 500ms
      - record: sli:latency:rate5m
        expr: |
          (
            sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) by (service)
            /
            sum(rate(http_request_duration_seconds_count[5m])) by (service)
          )

      # SLO: 99.9% availability
      - alert: SLOAvailabilityBreach
        expr: sli:availability:rate5m < 0.999
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: "Availability SLO breach for {{ $labels.service }}"
          description: "Availability is {{ $value | humanizePercentage }}, below 99.9% SLO"

      # SLO: 95% of requests under 500ms
      - alert: SLOLatencyBreach
        expr: sli:latency:rate5m < 0.95
        for: 5m
        labels:
          severity: warning
          slo: latency
        annotations:
          summary: "Latency SLO breach for {{ $labels.service }}"
          description: "{{ $value | humanizePercentage }} of requests are under 500ms, below 95% SLO"

      # Error budget consumed over the 30-day window (a 99.9% SLO leaves a 0.1% budget)
      - record: slo:error_budget:availability
        expr: |
          (
            1 - avg_over_time(sli:availability:rate5m[30d])
          ) / 0.001 * 100

      - alert: ErrorBudgetExhaustion
        expr: slo:error_budget:availability > 50
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Error budget 50% exhausted for {{ $labels.service }}"
          description: "{{ $value }}% of monthly error budget consumed"
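
A quick sanity check on what that 99.9% target actually buys you: the monthly error budget is pure arithmetic, independent of any cluster.

// error_budget.go - how much downtime a 99.9% monthly availability SLO allows
package main

import (
    "fmt"
    "time"
)

func main() {
    const slo = 0.999
    window := 30 * 24 * time.Hour // a 30-day month

    budget := time.Duration(float64(window) * (1 - slo))
    fmt.Printf("Error budget for %.1f%% over %v: %v\n", slo*100, window, budget.Round(time.Second))
    // Prints roughly 43m12s: once that much "bad time" is consumed, the SLO is blown.
}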

Multi-Cluster Monitoring Federation

# prometheus-federation.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-federation-config
  namespace: monitoring
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      external_labels:
        cluster: 'federation'
        region: 'global'

    scrape_configs:
    # Federate from production clusters
    - job_name: 'production-us-west'
      scrape_interval: 15s
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
          # Aggregate metrics
          - '{__name__=~"sli:.*"}'
          - '{__name__=~"slo:.*"}'
          # High-level cluster metrics
          - '{__name__=~"up|kube_node_status_condition"}'
          # Application metrics
          - '{__name__=~"http_requests_total|http_request_duration_seconds.*"}'
      static_configs:
        - targets:
          - 'prometheus.us-west.company.com'

    - job_name: 'production-eu-west'
      scrape_interval: 15s
      honor_labels: true
      metrics_path: '/federate'
      params:
        'match[]':
          - '{__name__=~"sli:.*"}'
          - '{__name__=~"slo:.*"}'
          - '{__name__=~"up|kube_node_status_condition"}'
          - '{__name__=~"http_requests_total|http_request_duration_seconds.*"}'
      static_configs:
        - targets:
          - 'prometheus.eu-west.company.com'

    rule_files:
      - "/etc/prometheus/rules/*.yml"

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
              - alertmanager.monitoring.svc.cluster.local:9093

🔧 Troubleshooting and Debugging

Comprehensive Debug Dashboard

# debug-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: debug-dashboard
  namespace: monitoring
data:
  dashboard.json: |
    {
      "dashboard": {
        "title": "Debug Dashboard",
        "panels": [
          {
            "title": "Request Flow",
            "type": "graph",
            "targets": [
              {
                "expr": "rate(http_requests_total[5m])",
                "legendFormat": "{{ method }} {{ status }}"
              }
            ]
          },
          {
            "title": "Error Rate Heatmap",
            "type": "heatmap",
            "targets": [
              {
                "expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
                "legendFormat": "{{ service }}"
              }
            ]
          },
          {
            "title": "Top Slow Endpoints",
            "type": "table",
            "targets": [
              {
                "expr": "topk(10, histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])))",
                "legendFormat": "{{ endpoint }}"
              }
            ]
          }
        ]
      }
    }

🎯 The Resolution: From 5 Hours to 5 Minutes

Remember Sarah's 3 AM nightmare? Here's how our comprehensive observability stack would have changed that story:

  1. Immediate Detection: Prometheus alerts would have fired within 2 minutes of the first error spike
  2. Rapid Diagnosis: Grafana dashboards showed exactly which service was failing and why
  3. Root Cause Analysis: Jaeger traces revealed the exact request path causing the bottleneck
  4. Quick Resolution: Elasticsearch logs shipped by Fluent Bit provided the specific error messages needed to fix the issue

The same outage that took 5 hours to resolve now takes 5 minutes. That's the power of proper observability.

🎯 Key Takeaways

  1. Instrument Everything: Metrics, logs, and traces should be first-class citizens in your architecture
  2. Define SLIs/SLOs: Know what "good" looks like and measure it continuously
  3. Alert on Symptoms, Not Causes: Alert on user-facing issues, not infrastructure hiccups
  4. Practice Incident Response: Your observability is only as good as your ability to act on it
  5. Iterate and Improve: Continuously refine your dashboards and alerts based on real incidents

Observability isn't just about collecting data—it's about turning that data into actionable insights that keep your systems running smoothly and your users happy. When implemented correctly, it transforms your operational capabilities from reactive firefighting to proactive system management.

The next time your pager goes off at 3 AM, you'll be ready.
