Kubernetes Observability: The Complete Production Guide to Monitoring, Logging, and Tracing
When your 3 AM pager goes off, observability is the difference between a 5-minute fix and a 5-hour war room
The Night Everything Went Dark
It was 3:17 AM on a Tuesday when Sarah's phone erupted with alerts. Her team's e-commerce platform, handling thousands of orders per minute, was experiencing intermittent failures. Response times were spiking, some services were timing out, and worst of all, nobody could pinpoint the root cause.
What followed was a masterclass in why observability isn't just nice-to-have—it's mission-critical. This is the story of how we built a comprehensive observability stack that turned 5-hour outages into 5-minute resolutions.
🎯 The Three Pillars of Observability
Before we dive into the technical implementation, let's understand what true observability means in the context of Kubernetes:
1. Metrics: The What
Quantitative measurements that answer "what is happening?"
2. Logs: The Why
Detailed records that answer "why did it happen?"
3. Traces: The How
Request flows that answer "how did it propagate through the system?"
Let's build a production-grade observability stack that covers all three pillars.
🔧 Building the Foundation: Prometheus Stack
Core Prometheus Configuration
# prometheus-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'production'
region: 'us-west-2'
rule_files:
- "/etc/prometheus/rules/*.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager:9093
scrape_configs:
# Kubernetes API Server
- job_name: 'kubernetes-apiservers'
kubernetes_sd_configs:
- role: endpoints
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
action: keep
regex: default;kubernetes;https
# Kubernetes Nodes
- job_name: 'kubernetes-nodes'
kubernetes_sd_configs:
- role: node
scheme: https
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
relabel_configs:
- action: labelmap
regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
# Kubernetes Pods
- job_name: 'kubernetes-pods'
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
target_label: __address__
- action: labelmap
regex: __meta_kubernetes_pod_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_pod_name]
action: replace
target_label: kubernetes_pod_name
# Service Discovery for custom applications
- job_name: 'kubernetes-service-endpoints'
kubernetes_sd_configs:
- role: endpoints
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
action: replace
target_label: __scheme__
regex: (https?)
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: ([^:]+)(?::\d+)?;(\d+)
replacement: $1:$2
- action: labelmap
regex: __meta_kubernetes_service_label_(.+)
- source_labels: [__meta_kubernetes_namespace]
action: replace
target_label: kubernetes_namespace
- source_labels: [__meta_kubernetes_service_name]
action: replace
target_label: kubernetes_name
---
# prometheus-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: prometheus
namespace: monitoring
spec:
  replicas: 1 # a single replica; multiple replicas cannot safely share the ReadWriteOnce PVC below (for HA, give each replica its own storage)
selector:
matchLabels:
app: prometheus
template:
metadata:
labels:
app: prometheus
spec:
serviceAccountName: prometheus
securityContext:
runAsNonRoot: true
runAsUser: 65534
fsGroup: 65534
containers:
- name: prometheus
image: prom/prometheus:v2.47.0
args:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--storage.tsdb.retention.size=50GB'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
- '--web.external-url=https://prometheus.company.com'
ports:
- containerPort: 9090
name: web
livenessProbe:
httpGet:
path: /-/healthy
port: web
initialDelaySeconds: 30
timeoutSeconds: 30
readinessProbe:
httpGet:
path: /-/ready
port: web
initialDelaySeconds: 30
timeoutSeconds: 30
resources:
requests:
memory: "2Gi"
cpu: "1000m"
limits:
memory: "4Gi"
cpu: "2000m"
volumeMounts:
- name: config-volume
mountPath: /etc/prometheus
- name: storage-volume
mountPath: /prometheus
- name: rules-volume
mountPath: /etc/prometheus/rules
volumes:
- name: config-volume
configMap:
name: prometheus-config
- name: storage-volume
persistentVolumeClaim:
claimName: prometheus-storage
- name: rules-volume
configMap:
name: prometheus-rules
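The Deployment above references a prometheus ServiceAccount and a prometheus-storage claim that aren't shown. Here's a minimal sketch of both; the ClusterRole grants only the read access the scrape configs need, while the storage class and size are assumptions to adjust for your cluster:
# prometheus-rbac-storage.yaml (sketch; storageClassName and size are assumptions)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources: ["nodes", "nodes/proxy", "nodes/metrics", "services", "endpoints", "pods"]
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-storage
  namespace: monitoring
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: gp3 # assumption: use a class that exists in your cluster
  resources:
    requests:
      storage: 100Gi # assumption: sized to fit the 30d/50GB retention above
The nonResourceURLs rule is what lets Prometheus scrape the API server's /metrics endpoint from the first job.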
Advanced Alerting Rules
# prometheus-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-rules
namespace: monitoring
data:
kubernetes.yml: |
groups:
- name: kubernetes.rules
rules:
# High-level cluster health
      - alert: KubernetesNodeNotReady
expr: kube_node_status_condition{condition="Ready",status="true"} == 0
for: 10m
labels:
severity: critical
annotations:
summary: "Node {{ $labels.node }} is not ready"
description: "Node {{ $labels.node }} has been not ready for more than 10 minutes"
- alert: KubernetesPodCrashLooping
        expr: increase(kube_pod_container_status_restarts_total[15m]) > 0
for: 5m
labels:
severity: warning
annotations:
summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
description: "Pod {{ $labels.namespace }}/{{ $labels.pod }} has been restarting {{ $value }} times in the last 15 minutes"
# Resource utilization
- alert: HighCPUUsage
expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 80% on {{ $labels.instance }} for more than 10 minutes"
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 10m
labels:
severity: critical
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 90% on {{ $labels.instance }} for more than 10 minutes"
# Application-specific alerts
- alert: ApplicationHighLatency
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
for: 5m
labels:
severity: warning
annotations:
summary: "High latency detected"
description: "95th percentile latency is {{ $value }}s for {{ $labels.job }}"
- alert: ApplicationHighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate detected"
description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"
storage.yml: |
groups:
- name: storage.rules
rules:
- alert: PersistentVolumeUsage
expr: (kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "PV {{ $labels.persistentvolumeclaim }} usage high"
description: "Persistent Volume {{ $labels.persistentvolumeclaim }} in namespace {{ $labels.namespace }} is {{ $value }}% full"
- alert: PersistentVolumeInodeUsage
expr: (kubelet_volume_stats_inodes_used / kubelet_volume_stats_inodes) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "PV {{ $labels.persistentvolumeclaim }} inode usage high"
description: "Persistent Volume {{ $labels.persistentvolumeclaim }} inode usage is {{ $value }}%"
📊 Advanced Grafana Dashboards
Custom Dashboard for Application Metrics
{
"dashboard": {
"id": null,
"title": "Application Performance Dashboard",
"tags": ["kubernetes", "application"],
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "stat",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m]))",
"legendFormat": "Requests/sec"
}
],
"fieldConfig": {
"defaults": {
"color": {
"mode": "thresholds"
},
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1000},
{"color": "red", "value": 5000}
]
}
}
}
},
{
"id": 2,
"title": "Response Time Percentiles",
"type": "timeseries",
"targets": [
{
"expr": "histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "50th percentile"
},
{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "95th percentile"
},
{
"expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "99th percentile"
}
]
},
{
"id": 3,
"title": "Error Rate by Service",
"type": "timeseries",
"targets": [
{
"expr": "rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m])",
"legendFormat": "{{ service }}"
}
]
}
],
"time": {
"from": "now-1h",
"to": "now"
},
"refresh": "5s"
}
}
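To ship a dashboard like this declaratively instead of clicking through the UI, one common pattern is a labeled ConfigMap picked up by Grafana's dashboard sidecar. This sketch assumes the sidecar shipped with the kube-prometheus-stack Helm chart, whose default watch label is used below; adjust if your Grafana is deployed differently:
# app-dashboard-configmap.yaml (sketch; assumes the Grafana dashboard sidecar is enabled)
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-performance-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1" # label the sidecar watches for (chart default)
data:
  app-performance.json: |
    { ... the dashboard JSON from above ... }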
📝 Centralized Logging with EFK Stack
Fluent Bit Configuration for Log Collection
# fluent-bit-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: fluent-bit-config
namespace: logging
data:
fluent-bit.conf: |
[SERVICE]
Flush 5
Log_Level info
Daemon off
Parsers_File parsers.conf
HTTP_Server On
HTTP_Listen 0.0.0.0
HTTP_Port 2020
[INPUT]
Name tail
Path /var/log/containers/*.log
Parser cri
Tag kubernetes.*
Refresh_Interval 5
Mem_Buf_Limit 50MB
Skip_Long_Lines On
[INPUT]
Name systemd
Tag systemd.*
Systemd_Filter _SYSTEMD_UNIT=kubelet.service
Systemd_Filter _SYSTEMD_UNIT=docker.service
[FILTER]
Name kubernetes
Match kubernetes.*
Kube_URL https://kubernetes.default.svc:443
Kube_CA_File /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
Kube_Token_File /var/run/secrets/kubernetes.io/serviceaccount/token
Kube_Tag_Prefix kubernetes.var.log.containers.
Merge_Log On
K8S-Logging.Parser On
K8S-Logging.Exclude Off
Annotations Off
Labels On
[FILTER]
Name modify
Match kubernetes.*
Add cluster_name production
Add environment prod
[FILTER]
Name grep
Match kubernetes.*
Exclude log ^\s*$
[OUTPUT]
Name es
Match kubernetes.*
Host elasticsearch.logging.svc.cluster.local
Port 9200
Index kubernetes-logs
Type _doc
Logstash_Format On
Logstash_Prefix kubernetes
Logstash_DateFormat %Y.%m.%d
Retry_Limit False
Suppress_Type_Name On
[OUTPUT]
Name es
Match systemd.*
Host elasticsearch.logging.svc.cluster.local
Port 9200
Index systemd-logs
Type _doc
Logstash_Format On
Logstash_Prefix systemd
        Logstash_DateFormat %Y.%m.%d
        Suppress_Type_Name On
parsers.conf: |
[PARSER]
Name cri
Format regex
Regex ^(?<time>[^ ]+) (?<stream>stdout|stderr) (?<logtag>[^ ]*) (?<log>.*)$
Time_Key time
Time_Format %Y-%m-%dT%H:%M:%S.%L%z
[PARSER]
Name json
Format json
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
[PARSER]
Name nginx
Format regex
Regex ^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$
Time_Key time
Time_Format %d/%b/%Y:%H:%M:%S %z
---
# fluent-bit-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
name: fluent-bit
namespace: logging
labels:
app: fluent-bit
spec:
selector:
matchLabels:
app: fluent-bit
template:
metadata:
labels:
app: fluent-bit
spec:
serviceAccountName: fluent-bit
tolerations:
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
containers:
- name: fluent-bit
image: fluent/fluent-bit:2.1.10
imagePullPolicy: Always
ports:
- containerPort: 2020
name: http
resources:
requests:
cpu: 100m
memory: 128Mi
limits:
cpu: 500m
memory: 512Mi
volumeMounts:
- name: varlog
mountPath: /var/log
- name: varlibdockercontainers
mountPath: /var/lib/docker/containers
readOnly: true
- name: fluent-bit-config
mountPath: /fluent-bit/etc/
- name: mnt
mountPath: /mnt
readOnly: true
terminationGracePeriodSeconds: 10
volumes:
- name: varlog
hostPath:
path: /var/log
- name: varlibdockercontainers
hostPath:
path: /var/lib/docker/containers
- name: fluent-bit-config
configMap:
name: fluent-bit-config
- name: mnt
hostPath:
path: /mnt
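With K8S-Logging.Parser On in the kubernetes filter above, individual workloads can opt into one of the parsers from parsers.conf through a pod annotation, so you don't need cluster-wide parsing rules per app. A sketch for an nginx workload using the nginx parser defined earlier (the deployment itself is illustrative):
# nginx-with-parser-annotation.yaml (sketch; illustrative workload)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-frontend
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx-frontend
  template:
    metadata:
      labels:
        app: nginx-frontend
      annotations:
        fluentbit.io/parser: nginx # re-parse this pod's logs with the 'nginx' parser
    spec:
      containers:
      - name: nginx
        image: nginx:1.25
        ports:
        - containerPort: 80
The kubernetes filter reads that annotation and re-parses the log field with the named parser before it reaches Elasticsearch.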
🔍 Distributed Tracing with Jaeger
Jaeger All-in-One Deployment
# jaeger-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
namespace: tracing
spec:
replicas: 1
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:1.50
ports:
- containerPort: 16686
name: ui
- containerPort: 14268
name: collector
- containerPort: 6831
name: agent-compact
protocol: UDP
- containerPort: 6832
name: agent-binary
protocol: UDP
        - containerPort: 9411
          name: zipkin
        - containerPort: 5778
          name: admin
env:
- name: COLLECTOR_ZIPKIN_HOST_PORT
value: ":9411"
- name: SPAN_STORAGE_TYPE
value: "elasticsearch"
- name: ES_SERVER_URLS
value: "http://elasticsearch.logging.svc.cluster.local:9200"
- name: ES_INDEX_PREFIX
value: "jaeger"
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "2Gi"
cpu: "1000m"
---
apiVersion: v1
kind: Service
metadata:
name: jaeger
namespace: tracing
spec:
type: ClusterIP
ports:
- port: 16686
targetPort: 16686
name: ui
- port: 14268
targetPort: 14268
name: collector
- port: 9411
targetPort: 9411
name: zipkin
- port: 6831
targetPort: 6831
name: agent-compact
protocol: UDP
- port: 6832
targetPort: 6832
name: agent-binary
protocol: UDP
- port: 5778
targetPort: 5778
name: admin
selector:
app: jaeger
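The Deployment above exposes the classic Jaeger agent and collector ports. Recent all-in-one images can also ingest OTLP directly, which is what OpenTelemetry SDKs speak by default; the snippet below sketches the additions (the enable flag and the standard OTLP port numbers are the assumptions here):
# jaeger-otlp-additions.yaml (sketch; excerpts to merge into the Deployment and Service above)
# --- add to the jaeger container's env: ---
- name: COLLECTOR_OTLP_ENABLED
  value: "true"
# --- add to the jaeger container's ports: ---
- containerPort: 4317
  name: otlp-grpc
- containerPort: 4318
  name: otlp-http
# --- add to the Service's ports: ---
- port: 4317
  targetPort: 4317
  name: otlp-grpc
- port: 4318
  targetPort: 4318
  name: otlp-http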
Application Instrumentation for Tracing
# app-with-tracing.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: example-app
namespace: default
spec:
replicas: 3
selector:
matchLabels:
app: example-app
template:
metadata:
labels:
app: example-app
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
containers:
- name: app
image: your-registry/example-app:latest
ports:
- containerPort: 8080
name: http
env:
- name: JAEGER_AGENT_HOST
value: "jaeger.tracing.svc.cluster.local"
- name: JAEGER_AGENT_PORT
value: "6831"
- name: JAEGER_SERVICE_NAME
value: "example-app"
- name: JAEGER_SAMPLER_TYPE
value: "const"
- name: JAEGER_SAMPLER_PARAM
value: "1"
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
livenessProbe:
httpGet:
path: /health
port: 8080
initialDelaySeconds: 30
periodSeconds: 10
readinessProbe:
httpGet:
path: /ready
port: 8080
initialDelaySeconds: 5
periodSeconds: 5
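One caveat on the tracing env vars above: JAEGER_SAMPLER_TYPE=const with param 1 records every single request, which is great while validating the pipeline but expensive at production traffic. A common adjustment is probabilistic sampling; a sketch (the 1% rate is an illustrative starting point):
# sampler-env-excerpt.yaml (sketch; replaces the two sampler env vars above)
- name: JAEGER_SAMPLER_TYPE
  value: "probabilistic"
- name: JAEGER_SAMPLER_PARAM
  value: "0.01" # keep roughly 1% of traces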
🚀 Production Observability Patterns
Custom Metrics with Prometheus Client
// metrics.go - Go application metrics instrumentation
package main
import (
"context"
"fmt"
"net/http"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
"go.opentelemetry.io/otel"
"go.opentelemetry.io/otel/attribute"
"go.opentelemetry.io/otel/trace"
)
var (
// HTTP metrics
httpRequestsTotal = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "Total number of HTTP requests",
},
[]string{"method", "endpoint", "status"},
)
httpRequestDuration = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "http_request_duration_seconds",
Help: "HTTP request duration in seconds",
Buckets: prometheus.DefBuckets,
},
[]string{"method", "endpoint"},
)
// Business metrics
ordersProcessed = prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "orders_processed_total",
Help: "Total number of orders processed",
},
[]string{"status", "payment_method"},
)
orderValue = prometheus.NewHistogramVec(
prometheus.HistogramOpts{
Name: "order_value_dollars",
Help: "Order value in dollars",
Buckets: []float64{10, 50, 100, 500, 1000, 5000},
},
[]string{"customer_tier"},
)
// System metrics
databaseConnections = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "database_connections_active",
Help: "Number of active database connections",
},
[]string{"database", "connection_pool"},
)
cacheHitRatio = prometheus.NewGaugeVec(
prometheus.GaugeOpts{
Name: "cache_hit_ratio",
Help: "Cache hit ratio (0-1)",
},
[]string{"cache_name"},
)
)
func init() {
// Register metrics with Prometheus
prometheus.MustRegister(httpRequestsTotal)
prometheus.MustRegister(httpRequestDuration)
prometheus.MustRegister(ordersProcessed)
prometheus.MustRegister(orderValue)
prometheus.MustRegister(databaseConnections)
prometheus.MustRegister(cacheHitRatio)
}
// Middleware for HTTP metrics and tracing
func instrumentHandler(handler http.HandlerFunc, endpoint string) http.HandlerFunc {
return func(w http.ResponseWriter, r *http.Request) {
// Start tracing
tracer := otel.Tracer("example-app")
ctx, span := tracer.Start(r.Context(), fmt.Sprintf("%s %s", r.Method, endpoint))
defer span.End()
// Add request attributes to span
span.SetAttributes(
attribute.String("http.method", r.Method),
attribute.String("http.url", r.URL.String()),
attribute.String("http.user_agent", r.UserAgent()),
)
// Start timer for request duration
start := time.Now()
// Wrap response writer to capture status code
wrapped := &responseWriter{ResponseWriter: w, statusCode: 200}
// Call the actual handler
handler(wrapped, r.WithContext(ctx))
// Record metrics
duration := time.Since(start).Seconds()
status := fmt.Sprintf("%d", wrapped.statusCode)
httpRequestsTotal.WithLabelValues(r.Method, endpoint, status).Inc()
httpRequestDuration.WithLabelValues(r.Method, endpoint).Observe(duration)
// Add response attributes to span
span.SetAttributes(
attribute.Int("http.status_code", wrapped.statusCode),
attribute.Float64("http.response_time", duration),
)
}
}
type responseWriter struct {
http.ResponseWriter
statusCode int
}
func (rw *responseWriter) WriteHeader(code int) {
rw.statusCode = code
rw.ResponseWriter.WriteHeader(code)
}
// Business logic with metrics
func processOrder(ctx context.Context, order Order) error {
tracer := otel.Tracer("example-app")
ctx, span := tracer.Start(ctx, "process_order")
defer span.End()
span.SetAttributes(
attribute.String("order.id", order.ID),
attribute.Float64("order.value", order.Value),
attribute.String("order.payment_method", order.PaymentMethod),
)
// Simulate order processing
if err := validateOrder(ctx, order); err != nil {
ordersProcessed.WithLabelValues("failed", order.PaymentMethod).Inc()
span.RecordError(err)
return err
}
if err := chargePayment(ctx, order); err != nil {
ordersProcessed.WithLabelValues("payment_failed", order.PaymentMethod).Inc()
span.RecordError(err)
return err
}
// Record successful order
ordersProcessed.WithLabelValues("success", order.PaymentMethod).Inc()
orderValue.WithLabelValues(order.CustomerTier).Observe(order.Value)
span.SetStatus(codes.Ok, "Order processed successfully")
return nil
}
type Order struct {
ID string
Value float64
PaymentMethod string
CustomerTier string
}
func validateOrder(ctx context.Context, order Order) error {
tracer := otel.Tracer("example-app")
_, span := tracer.Start(ctx, "validate_order")
defer span.End()
// Validation logic here
return nil
}
func chargePayment(ctx context.Context, order Order) error {
tracer := otel.Tracer("example-app")
_, span := tracer.Start(ctx, "charge_payment")
defer span.End()
// Payment processing logic here
return nil
}
// Health check endpoints
func healthHandler(w http.ResponseWriter, r *http.Request) {
w.WriteHeader(http.StatusOK)
w.Write([]byte("OK"))
}
func readinessHandler(w http.ResponseWriter, r *http.Request) {
// Check database connectivity, cache, etc.
if isDatabaseHealthy() && isCacheHealthy() {
w.WriteHeader(http.StatusOK)
w.Write([]byte("Ready"))
} else {
w.WriteHeader(http.StatusServiceUnavailable)
w.Write([]byte("Not Ready"))
}
}
func isDatabaseHealthy() bool {
// Database health check logic
return true
}
func isCacheHealthy() bool {
// Cache health check logic
return true
}
func main() {
// Set up routes with instrumentation
http.Handle("/metrics", promhttp.Handler())
http.HandleFunc("/health", instrumentHandler(healthHandler, "/health"))
http.HandleFunc("/ready", instrumentHandler(readinessHandler, "/ready"))
// Start server
fmt.Println("Server starting on :8080")
if err := http.ListenAndServe(":8080", nil); err != nil {
panic(err)
}
}
📈 Advanced Monitoring Strategies
SLI/SLO Monitoring
# slo-rules.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: slo-rules
namespace: monitoring
data:
slo.yml: |
groups:
- name: slo.rules
interval: 30s
rules:
# Availability SLI: percentage of successful requests
- record: sli:availability:rate5m
expr: |
(
sum(rate(http_requests_total{status!~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
)
# Latency SLI: percentage of requests under 500ms
- record: sli:latency:rate5m
expr: |
(
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) by (service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)
)
# SLO: 99.9% availability
- alert: SLOAvailabilityBreach
expr: sli:availability:rate5m < 0.999
for: 2m
labels:
severity: critical
slo: availability
annotations:
summary: "Availability SLO breach for {{ $labels.service }}"
description: "Availability is {{ $value | humanizePercentage }}, below 99.9% SLO"
# SLO: 95% of requests under 500ms
- alert: SLOLatencyBreach
expr: sli:latency:rate5m < 0.95
for: 5m
labels:
severity: warning
slo: latency
annotations:
summary: "Latency SLO breach for {{ $labels.service }}"
description: "{{ $value | humanizePercentage }} of requests are under 500ms, below 95% SLO"
      # Error budget burn rate (relative to a 99.9% availability SLO)
      - record: slo:error_budget_burn_rate:availability
        expr: |
          (
            1 - sli:availability:rate5m
          ) * 100 / 0.1 # 1.0 means errors exactly consume the 0.1% budget
      - alert: ErrorBudgetFastBurn
        expr: slo:error_budget_burn_rate:availability > 50
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Error budget burning fast for {{ $labels.service }}"
          description: "Error budget is being consumed at {{ $value }}x the sustainable rate for a 99.9% SLO"
Multi-Cluster Monitoring Federation
# prometheus-federation.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus-federation-config
namespace: monitoring
data:
prometheus.yml: |
global:
scrape_interval: 15s
external_labels:
cluster: 'federation'
region: 'global'
scrape_configs:
# Federate from production clusters
- job_name: 'production-us-west'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
# Aggregate metrics
- '{__name__=~"sli:.*"}'
- '{__name__=~"slo:.*"}'
# High-level cluster metrics
- '{__name__=~"up|kube_node_status_condition"}'
# Application metrics
- '{__name__=~"http_requests_total|http_request_duration_seconds.*"}'
static_configs:
- targets:
- 'prometheus.us-west.company.com'
- job_name: 'production-eu-west'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{__name__=~"sli:.*"}'
- '{__name__=~"slo:.*"}'
- '{__name__=~"up|kube_node_status_condition"}'
- '{__name__=~"http_requests_total|http_request_duration_seconds.*"}'
static_configs:
- targets:
- 'prometheus.eu-west.company.com'
rule_files:
- "/etc/prometheus/rules/*.yml"
alerting:
alertmanagers:
- static_configs:
- targets:
- alertmanager.monitoring.svc.cluster.local:9093
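Federation introduces a failure mode of its own: if the global Prometheus can't reach a cluster, that cluster silently vanishes from global dashboards. A small guard alert on the federation jobs defined above catches this (a sketch; the job regex matches the names used in the scrape configs):
# federation-health.yml (sketch)
groups:
- name: federation.rules
  rules:
  - alert: FederationTargetDown
    expr: up{job=~"production-.*"} == 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Federation scrape failing for {{ $labels.job }}"
      description: "The global Prometheus has not successfully scraped {{ $labels.job }} for 5 minutes; its metrics are stale in global views"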
🔧 Troubleshooting and Debugging
Comprehensive Debug Dashboard
# debug-dashboard.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: debug-dashboard
namespace: monitoring
data:
dashboard.json: |
{
"dashboard": {
"title": "Debug Dashboard",
"panels": [
{
"title": "Request Flow",
"type": "graph",
"targets": [
{
"expr": "rate(http_requests_total[5m])",
"legendFormat": "{{ method }} {{ status }}"
}
]
},
{
"title": "Error Rate Heatmap",
"type": "heatmap",
"targets": [
{
"expr": "rate(http_requests_total{status=~\"5..\"}[5m])",
"legendFormat": "{{ service }}"
}
]
},
{
"title": "Top Slow Endpoints",
"type": "table",
"targets": [
{
"expr": "topk(10, histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])))",
"legendFormat": "{{ endpoint }}"
}
]
}
]
}
}
🎯 The Resolution: From 5 Hours to 5 Minutes
Remember Sarah's 3 AM nightmare? Here's how our comprehensive observability stack would have changed that story:
- Immediate Detection: Prometheus alerts would have fired within 2 minutes of the first error spike
- Rapid Diagnosis: Grafana dashboards would have shown exactly which service was failing and why
- Root Cause Analysis: Jaeger traces would have revealed the exact request path causing the bottleneck
- Quick Resolution: EFK logs would have provided the specific error messages needed to fix the issue
The same outage that took 5 hours to resolve now takes 5 minutes. That's the power of proper observability.
🎯 Key Takeaways
- Instrument Everything: Metrics, logs, and traces should be first-class citizens in your architecture
- Define SLIs/SLOs: Know what "good" looks like and measure it continuously
- Alert on Symptoms, Not Causes: Alert on user-facing issues, not infrastructure hiccups
- Practice Incident Response: Your observability is only as good as your ability to act on it
- Iterate and Improve: Continuously refine your dashboards and alerts based on real incidents
Observability isn't just about collecting data—it's about turning that data into actionable insights that keep your systems running smoothly and your users happy. When implemented correctly, it transforms your operational capabilities from reactive firefighting to proactive system management.
The next time your pager goes off at 3 AM, you'll be ready.