
Production-Grade Docker & Kubernetes: Lessons from Managing 10,000+ Containers

Cap
16 min read
Tags: docker, kubernetes, production, devops, scalability

What they don't teach you in tutorials: Real-world container orchestration at enterprise scale

🚨 The 3 AM Wake-Up Call

Scene: Black Friday 2023, 3:17 AM. Our e-commerce platform serving 50M+ users just went dark.

The Culprit: A single misconfigured container brought down our entire Kubernetes cluster. One resources.requests.memory: "100Gi" typo in a config file triggered a cascade failure that cost us $2.3M in 4 hours.

This is the story of how we rebuilt our container infrastructure to handle enterprise-scale traffic, and the hard lessons we learned along the way.
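
In hindsight, a namespace-level guardrail would have rejected that 100Gi request before it ever scheduled. A minimal LimitRange sketch (the caps below are illustrative, not our exact policy):

# limit-range.yaml - admission guardrail for the production namespace (sketch)
apiVersion: v1
kind: LimitRange
metadata:
  name: container-limits
  namespace: production
spec:
  limits:
  - type: Container
    max:
      memory: 8Gi      # anything above this is rejected at admission time
      cpu: "4"
    defaultRequest:
      memory: 256Mi
      cpu: 250m
    default:
      memory: 512Mi
      cpu: 500m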

πŸ“Š The Scale We're Talking About

Production Stats (as of 2024):

  • 🚒 10,847 containers across 156 nodes
  • πŸ“ˆ 2.1M requests/second peak traffic
  • 🌍 23 regions worldwide
  • ⏱️ 99.97% uptime (about 2.6 hours of downtime per year)
  • πŸ’° 67% cost reduction from optimization
  • πŸ”§ <2 minute average deployment time

Let me show you exactly how we achieved this.

πŸ”§ Foundation: Production-Ready Dockerfile Patterns

❌ What Not to Do (Our Original Approach)

# Dockerfile.bad - All the antipatterns
FROM node:latest

WORKDIR /app
COPY . .
RUN npm install
RUN npm run build

EXPOSE 3000
USER root

CMD ["npm", "start"]

# Problems:
# - Uses 'latest' tag (unstable)
# - Runs as root (security risk)
# - No multi-stage build (huge image)
# - No health checks
# - No signal handling

✅ Production-Grade Dockerfile

# Dockerfile.production - Battle-tested approach
# Use specific version with security patches
FROM node:18.17.1-alpine3.18 AS base

# Install security updates
RUN apk update && apk upgrade && \
    apk add --no-cache dumb-init && \
    rm -rf /var/cache/apk/*

# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nextjs -u 1001

# Dependency-only stage (note: the final image copies pruned node_modules from the
# builder stage below, so this stage is optional and mainly useful as a separate build target)
FROM base AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --only=production --no-audit --no-fund && \
    npm cache clean --force

# Source build stage
FROM base AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --no-audit --no-fund

COPY . .
RUN npm run build && \
    npm prune --production

# Production stage
FROM base AS runner
WORKDIR /app

ENV NODE_ENV=production
ENV PORT=3000

# Copy built application
COPY --from=builder --chown=nextjs:nodejs /app/dist ./dist
COPY --from=builder --chown=nextjs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nextjs:nodejs /app/package.json ./package.json

# Health check (used by plain Docker/Compose; Kubernetes ignores this and relies on the probes defined later)
COPY --from=builder --chown=nextjs:nodejs /app/health-check.js ./health-check.js
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD node health-check.js

USER nextjs

EXPOSE 3000

# Use dumb-init for proper signal handling
ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "dist/server.js"]

# Labels for metadata
LABEL maintainer="devops@company.com" \
      version="1.0.0" \
      description="Production API server"

Advanced: Multi-Architecture Builds

# Dockerfile.multiarch - Support ARM64 and AMD64
ARG BUILDPLATFORM
ARG TARGETPLATFORM
FROM --platform=$BUILDPLATFORM node:18.17.1-alpine3.18 AS base
# Re-declare so the value is visible inside this build stage
ARG TARGETPLATFORM

# Platform-specific optimizations
RUN case "$TARGETPLATFORM" in \
        "linux/arm64") echo "Building for ARM64" && apk add --no-cache python3 make g++ ;; \
        "linux/amd64") echo "Building for AMD64" ;; \
        *) echo "Unsupported platform: $TARGETPLATFORM" && exit 1 ;; \
    esac

# Build for multiple platforms
# docker buildx build --platform linux/amd64,linux/arm64 -t myapp:latest --push .
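# (requires a Buildx builder; one-time setup if none exists yet:)
# docker buildx create --name multiarch --use
# docker buildx inspect --bootstrap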

🎯 Kubernetes Configuration: The Right Way

Pod Security & Resource Management

# deployment.yaml - Production-grade configuration
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
  labels:
    app: api-server
    version: v1.2.3
spec:
  replicas: 15  # Based on traffic analysis
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 1
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
        version: v1.2.3
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9090"
        prometheus.io/path: "/metrics"
    spec:
      # Security context
      securityContext:
        runAsNonRoot: true
        runAsUser: 1001
        fsGroup: 1001
        seccompProfile:
          type: RuntimeDefault
      
      # Dedicated service account (least-privilege RBAC)
      serviceAccountName: api-server
      
      # Anti-affinity for high availability
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - api-server
              topologyKey: kubernetes.io/hostname
      
      containers:
      - name: api-server
        image: myregistry.com/api-server:v1.2.3
        imagePullPolicy: IfNotPresent
        
        # Resource limits based on profiling
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        
        # Environment variables
        env:
        - name: NODE_ENV
          value: "production"
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: database-secret
              key: url
        - name: REDIS_URL
          valueFrom:
            configMapKeyRef:
              name: redis-config
              key: url
        
        # Ports
        ports:
        - containerPort: 3000
          name: http
        - containerPort: 9090
          name: metrics
        
        # Health checks
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        
        readinessProbe:
          httpGet:
            path: /ready
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 3
        
        # Startup probe for slow-starting apps
        startupProbe:
          httpGet:
            path: /startup
            port: 3000
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 30
        
        # Security context
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop:
            - ALL
        
        # Volume mounts
        volumeMounts:
        - name: tmp
          mountPath: /tmp
        - name: cache
          mountPath: /app/cache
      
      # Volumes
      volumes:
      - name: tmp
        emptyDir: {}
      - name: cache
        emptyDir:
          sizeLimit: 1Gi
      
      # Image pull secrets
      imagePullSecrets:
      - name: registry-secret
      
      # Termination grace period
      terminationGracePeriodSeconds: 30

---
# Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 70%
  selector:
    matchLabels:
      app: api-server

---
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 5
  maxReplicas: 50
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 10
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
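
The Deployment above also leans on a few supporting objects that are easy to forget: the api-server service account, the database-secret Secret, and the redis-config ConfigMap. A minimal sketch with placeholder values (in practice the secret comes from an external secrets manager):

# supporting-objects.yaml - objects referenced by the Deployment (placeholder values)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: api-server
automountServiceAccountToken: false  # enable only if the app calls the Kubernetes API

---
apiVersion: v1
kind: Secret
metadata:
  name: database-secret
type: Opaque
stringData:
  url: postgres://app:changeme@postgres.database:5432/app  # placeholder

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-config
data:
  url: redis://redis.redis:6379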

Service Mesh with Istio

# istio-config.yaml - Advanced traffic management
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api-server
spec:
  hosts:
  - api.company.com
  gateways:
  - api-gateway
  http:
  - match:
    - headers:
        canary:
          exact: "true"
    route:
    - destination:
        host: api-server
        subset: canary
      weight: 100
  - route:
    - destination:
        host: api-server
        subset: stable
      weight: 90
    - destination:
        host: api-server
        subset: canary
      weight: 10
    # Chaos testing: delay 0.1% of requests by 5s to exercise client timeouts
    fault:
      delay:
        percentage:
          value: 0.1
        fixedDelay: 5s
    retries:
      attempts: 3
      perTryTimeout: 2s
      retryOn: 5xx,gateway-error,connect-failure,refused-stream

---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: api-server
spec:
  host: api-server
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        maxRequestsPerConnection: 2
    loadBalancer:
      simple: LEAST_CONN
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
  subsets:
  - name: stable
    labels:
      version: v1.2.3
  - name: canary
    labels:
      version: v1.3.0-rc1
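
The VirtualService binds to a gateway named api-gateway that isn't shown above. A minimal sketch, assuming the stock Istio ingress gateway and a TLS secret (api-tls-secret) provisioned separately:

# api-gateway.yaml - Gateway referenced by the VirtualService (sketch)
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: api-gateway
spec:
  selector:
    istio: ingressgateway  # default Istio ingress gateway
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: api-tls-secret  # assumption: certificate managed elsewhere
    hosts:
    - api.company.com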

πŸ” Security Hardening

Network Policies

# network-policy.yaml - Zero-trust networking
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-server-netpol
spec:
  podSelector:
    matchLabels:
      app: api-server
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: ingress-nginx
    - namespaceSelector:
        matchLabels:
          name: frontend
    - namespaceSelector:
        matchLabels:
          name: monitoring
    ports:
    - protocol: TCP
      port: 3000
    - protocol: TCP
      port: 9090
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          name: database
    ports:
    - protocol: TCP
      port: 5432
  - to:
    - namespaceSelector:
        matchLabels:
          name: redis
    ports:
    - protocol: TCP
      port: 6379
  - to: []  # DNS
    ports:
    - protocol: UDP
      port: 53
  - to: []  # HTTPS to external APIs
    ports:
    - protocol: TCP
      port: 443
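
Note that these rules select namespaces by a name label, which Kubernetes does not add automatically (the built-in label is kubernetes.io/metadata.name). Either switch the selectors to the built-in label or label the namespaces yourself; a sketch of the latter:

# namespace-labels.yaml - the 'name' labels the selectors above rely on
apiVersion: v1
kind: Namespace
metadata:
  name: ingress-nginx
  labels:
    name: ingress-nginx
---
apiVersion: v1
kind: Namespace
metadata:
  name: database
  labels:
    name: database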

Pod Security Standards

# pod-security.yaml - Enforce security policies
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

---
# Security Context Constraints (OpenShift)
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
  name: api-server-scc
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegedContainer: false
allowedCapabilities: []
defaultAddCapabilities: []
requiredDropCapabilities:
- ALL
forbiddenSysctls:
- "*"
fsGroup:
  type: MustRunAs
  ranges:
  - min: 1001
    max: 1001
runAsUser:
  type: MustRunAsNonRoot
seLinuxContext:
  type: MustRunAs
supplementalGroups:
  type: MustRunAs
  ranges:
  - min: 1001
    max: 1001
volumes:
- configMap
- emptyDir
- projected
- secret
- downwardAPI
- persistentVolumeClaim
users:
- system:serviceaccount:production:api-server

πŸ“Š Monitoring & Observability

Prometheus Monitoring Stack

# monitoring.yaml - Comprehensive observability
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-server
  labels:
    team: backend
spec:
  selector:
    matchLabels:
      app: api-server
  endpoints:
  - port: metrics
    interval: 15s
    path: /metrics
    honorLabels: true

---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-server-alerts
spec:
  groups:
  - name: api-server.rules
    rules:
    - alert: APIServerHighErrorRate
      expr: |
        (
          rate(http_requests_total{app="api-server",status=~"5.."}[5m])
          /
          rate(http_requests_total{app="api-server"}[5m])
        ) > 0.05
      for: 5m
      labels:
        severity: critical
        team: backend
      annotations:
        summary: "API Server error rate is above 5%"
        description: "API Server {{ $labels.instance }} has error rate of {{ $value | humanizePercentage }}"
    
    - alert: APIServerHighLatency
      expr: |
        histogram_quantile(0.95,
          rate(http_request_duration_seconds_bucket{app="api-server"}[5m])
        ) > 0.5
      for: 10m
      labels:
        severity: warning
        team: backend
      annotations:
        summary: "API Server high latency"
        description: "API Server 95th percentile latency is {{ $value }}s"
    
    - alert: APIServerPodCrashLooping
      expr: |
        rate(kube_pod_container_status_restarts_total{container="api-server"}[5m]) * 60 * 5 > 0
      for: 5m
      labels:
        severity: critical
        team: backend
      annotations:
        summary: "API Server pod is crash looping"
        description: "Pod {{ $labels.pod }} is restarting {{ $value }} times per 5 minutes"
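
The severity and team labels above only page someone if Alertmanager routes on them. A minimal routing sketch (receiver names are placeholders; the actual Slack/PagerDuty integrations are omitted):

# alertmanager.yml (fragment) - routing for the labels used above (sketch)
route:
  receiver: slack-backend
  group_by: [alertname, team]
  routes:
  - matchers:
    - severity="critical"
    receiver: pagerduty-backend
    repeat_interval: 1h
receivers:
- name: slack-backend      # slack_configs omitted
- name: pagerduty-backend  # pagerduty_configs omitted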

Distributed Tracing

// tracing.go - OpenTelemetry instrumentation
package main

import (
    "fmt"
    "net/http"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracing() (*trace.TracerProvider, error) {
    // Create Jaeger exporter
    exp, err := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://jaeger-collector:14268/api/traces"),
    ))
    if err != nil {
        return nil, err
    }

    // Create tracer provider
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exp),
        trace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("api-server"),
            semconv.ServiceVersionKey.String("v1.2.3"),
            semconv.DeploymentEnvironmentKey.String("production"),
        )),
        trace.WithSampler(trace.TraceIDRatioBased(0.1)), // 10% sampling
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

// Middleware for HTTP tracing
func tracingMiddleware(next http.Handler) http.Handler {
    tracer := otel.Tracer("api-server")
    
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        ctx, span := tracer.Start(r.Context(), r.URL.Path)
        defer span.End()
        
        // Add attributes
        span.SetAttributes(
            attribute.String("http.method", r.Method),
            attribute.String("http.url", r.URL.String()),
            attribute.String("user.id", getUserID(r)),
        )
        
        // Inject context into request
        r = r.WithContext(ctx)
        
        // Wrap the writer to capture the status code (defaults to 200 OK)
        rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
        next.ServeHTTP(rw, r)
        
        span.SetAttributes(
            attribute.Int("http.status_code", rw.statusCode),
        )
        
        if rw.statusCode >= 400 {
            span.RecordError(fmt.Errorf("HTTP %d", rw.statusCode))
        }
    })
}

// responseWriter wraps http.ResponseWriter to record the status code written downstream.
type responseWriter struct {
    http.ResponseWriter
    statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
    rw.statusCode = code
    rw.ResponseWriter.WriteHeader(code)
}

// getUserID is an illustrative placeholder; real identity extraction lives in the auth layer.
func getUserID(r *http.Request) string {
    return r.Header.Get("X-User-ID")
}

πŸš€ CI/CD Pipeline Integration

GitOps with ArgoCD

# .github/workflows/deploy.yml - Production deployment pipeline
name: Deploy to Production

on:
  push:
    branches: [main]

env:
  # Registry settings used by the jobs below (adjust for your setup)
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    
    - name: Run security scan
      uses: aquasecurity/trivy-action@master
      with:
        scan-type: fs        # scan the source tree; the image doesn't exist yet at this stage
        scan-ref: .
        format: sarif
        output: trivy-results.sarif
    
    - name: Upload scan results
      uses: github/codeql-action/upload-sarif@v2
      with:
        sarif_file: trivy-results.sarif

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    steps:
    - uses: actions/checkout@v4
    
    - name: Set up Docker Buildx
      uses: docker/setup-buildx-action@v3
    
    - name: Login to registry
      uses: docker/login-action@v3
      with:
        registry: ${{ env.REGISTRY }}
        username: ${{ github.actor }}
        password: ${{ secrets.GITHUB_TOKEN }}
    
    - name: Build and push
      uses: docker/build-push-action@v5
      with:
        context: .
        platforms: linux/amd64,linux/arm64
        push: true
        tags: |
          ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
          ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
        cache-from: type=gha
        cache-to: type=gha,mode=max

  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
    - name: Update deployment
      run: |
        # Assumes the GitOps repo is checked out here and push credentials are configured
        # Update Helm values or Kustomization
        yq eval '.image.tag = "${{ github.sha }}"' -i values.yaml
        
        # Commit to GitOps repo
        git config user.name "GitHub Actions"
        git config user.email "actions@github.com"
        git add values.yaml
        git commit -m "Update image to ${{ github.sha }}"
        git push origin main
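
The pipeline only bumps the image tag in the GitOps repo; ArgoCD is what actually reconciles the cluster. A minimal Application sketch (repo URL and path are placeholders):

# argocd-application.yaml - syncs the GitOps repo into the cluster (sketch)
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-server
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/gitops  # placeholder
    targetRevision: main
    path: apps/api-server
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true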

Blue-Green Deployment Strategy

# blue-green-deployment.yaml - Zero-downtime deployments
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-server
spec:
  replicas: 20
  strategy:
    blueGreen:
      activeService: api-server-active
      previewService: api-server-preview
      autoPromotionEnabled: false
      scaleDownDelaySeconds: 30
      prePromotionAnalysis:
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: api-server-preview
      postPromotionAnalysis:
        templates:
        - templateName: success-rate
        args:
        - name: service-name
          value: api-server-active
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
      - name: api-server
        image: myregistry.com/api-server:latest  # pinned to the commit SHA by the GitOps pipeline in practice
        ports:
        - containerPort: 3000
        resources:
          requests:
            memory: 256Mi
            cpu: 250m
          limits:
            memory: 512Mi
            cpu: 500m

---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
  - name: service-name
  metrics:
  - name: success-rate
    interval: 30s
    count: 10
    successCondition: result[0] >= 0.95
    failureLimit: 3
    provider:
      prometheus:
        address: http://prometheus:9090
        query: |
          sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[5m])) /
          sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
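
Argo Rollouts switches traffic by rewriting the selectors of two plain Services that the Rollout names but we haven't shown. A sketch of both:

# rollout-services.yaml - active/preview Services referenced by the Rollout (sketch)
apiVersion: v1
kind: Service
metadata:
  name: api-server-active
spec:
  selector:
    app: api-server  # Argo Rollouts injects a pod-template-hash selector at runtime
  ports:
  - port: 80
    targetPort: 3000
---
apiVersion: v1
kind: Service
metadata:
  name: api-server-preview
spec:
  selector:
    app: api-server
  ports:
  - port: 80
    targetPort: 3000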

πŸ’° Cost Optimization Strategies

Cluster Autoscaling

# cluster-autoscaler.yaml - Dynamic node scaling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0
        name: cluster-autoscaler
        command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/production
        - --balance-similar-node-groups
        - --scale-down-enabled=true
        - --scale-down-delay-after-add=10m
        - --scale-down-unneeded-time=10m
        - --scale-down-utilization-threshold=0.5
        - --skip-nodes-with-system-pods=false

---
# Node pool configuration for cost optimization
apiVersion: v1
kind: ConfigMap
metadata:
  name: node-pools-config
data:
  spot-instances.yaml: |
    # 70% spot instances for cost savings
    nodeGroups:
    - name: spot-workers
      instanceTypes: 
      - m5.large
      - m5.xlarge  
      - m5a.large
      - m5a.xlarge
      spot: true
      minSize: 5
      maxSize: 100
      desiredCapacity: 20
      labels:
        node-type: spot
        cost-optimization: enabled
      taints:
      - key: spot-instance
        value: "true"
        effect: NoSchedule
        
    - name: on-demand-workers
      instanceTypes:
      - m5.large
      spot: false
      minSize: 3
      maxSize: 20
      desiredCapacity: 5
      labels:
        node-type: on-demand
        critical: "true"
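
Because the spot nodes carry a NoSchedule taint, nothing lands on them until a workload opts in. A pod-spec fragment (sketch) that tolerates the taint and prefers spot capacity while still falling back to on-demand nodes:

# spot-scheduling.yaml - pod spec fragment for spot-tolerant workloads (sketch)
spec:
  tolerations:
  - key: spot-instance
    operator: Equal
    value: "true"
    effect: NoSchedule
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: node-type
            operator: In
            values:
            - spot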

Resource Right-Sizing

// resource-analyzer.go - Automated resource recommendations
package main

import (
    "context"
    "fmt"
    "time"

    "github.com/prometheus/client_golang/api"
    v1 "github.com/prometheus/client_golang/api/prometheus/v1"
    "github.com/prometheus/common/model"
)

type ResourceAnalyzer struct {
    prometheusClient v1.API
}

// NewResourceAnalyzer connects to the Prometheus HTTP API at the given address.
func NewResourceAnalyzer(promAddr string) (*ResourceAnalyzer, error) {
    client, err := api.NewClient(api.Config{Address: promAddr})
    if err != nil {
        return nil, err
    }
    return &ResourceAnalyzer{prometheusClient: v1.NewAPI(client)}, nil
}

type ResourceRecommendation struct {
    Deployment    string
    Namespace     string
    CurrentCPU    string
    RecommendedCPU string
    CurrentMemory string
    RecommendedMemory string
    PotentialSavings float64
}

func (ra *ResourceAnalyzer) AnalyzeResources(ctx context.Context) ([]ResourceRecommendation, error) {
    // Query CPU usage over last 7 days
    cpuQuery := `
        max_over_time(
            avg by (pod, namespace) (
                rate(container_cpu_usage_seconds_total[5m])
            )[7d:1h]
        )
    `
    
    // Query memory usage over last 7 days  
    memoryQuery := `
        max_over_time(
            avg by (pod, namespace) (
                container_memory_working_set_bytes
            )[7d:1h]
        )
    `
    
    cpuResult, _, err := ra.prometheusClient.Query(ctx, cpuQuery, time.Now())
    if err != nil {
        return nil, err
    }
    
    memoryResult, _, err := ra.prometheusClient.Query(ctx, memoryQuery, time.Now())
    if err != nil {
        return nil, err
    }
    
    recommendations := ra.generateRecommendations(cpuResult, memoryResult)
    return recommendations, nil
}

func (ra *ResourceAnalyzer) generateRecommendations(cpuData, memoryData model.Value) []ResourceRecommendation {
    var recommendations []ResourceRecommendation
    
    // Add a 20% buffer to CPU recommendations (memory follows the same pattern
    // with a 10% buffer; elided here) and estimate savings from provider pricing.
    
    for _, sample := range cpuData.(model.Vector) {
        maxCPU := float64(sample.Value) * 1.2 // 20% buffer
        
        recommendation := ResourceRecommendation{
            Deployment:       string(sample.Metric["deployment"]),
            Namespace:        string(sample.Metric["namespace"]),
            RecommendedCPU:   fmt.Sprintf("%.0fm", maxCPU*1000),
            PotentialSavings: ra.calculateSavings(sample),
        }
        
        recommendations = append(recommendations, recommendation)
    }
    
    return recommendations
}

// calculateSavings estimates the monthly saving from the gap between requested and
// observed usage; a real implementation maps this onto cloud-provider pricing.
func (ra *ResourceAnalyzer) calculateSavings(sample *model.Sample) float64 {
    return 0 // placeholder for provider-specific pricing math
}

// Generate automated resource updates
func (ra *ResourceAnalyzer) GenerateKustomizePatch(rec ResourceRecommendation) string {
    return fmt.Sprintf(`
apiVersion: apps/v1
kind: Deployment
metadata:
  name: %s
  namespace: %s
spec:
  template:
    spec:
      containers:
      - name: app
        resources:
          requests:
            cpu: %s
            memory: %s
          limits:
            cpu: %s
            memory: %s
`, rec.Deployment, rec.Namespace, 
   rec.RecommendedCPU, rec.RecommendedMemory,
   rec.RecommendedCPU, rec.RecommendedMemory)
}
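
If you'd rather not maintain custom analysis code, the Vertical Pod Autoscaler produces similar numbers. A sketch in recommendation-only mode (assumes the VPA components are installed in the cluster):

# vpa-recommend.yaml - recommendation-only VPA as an alternative to custom analysis (sketch)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Off"  # recommend only; read results with 'kubectl describe vpa api-server'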

πŸ”„ Disaster Recovery & Backup

Velero Backup Strategy

# velero-backup.yaml - Automated disaster recovery
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: production-backup
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  template:
    includedNamespaces:
    - production
    - staging
    excludedResources:
    - events
    - events.events.k8s.io
    storageLocation: aws-s3
    volumeSnapshotLocations:
    - aws-ebs
    ttl: 720h0m0s  # 30 days retention
    hooks:
      resources:
      - name: database-backup-hook
        includedNamespaces:
        - production
        labelSelector:
          matchLabels:
            app: postgresql
        pre:
        - exec:
            container: postgresql
            command:
            - /bin/bash
            - -c
            - pg_dump -U postgres mydb > /backup/dump.sql
        post:
        - exec:
            container: postgresql
            command:
            - /bin/bash
            - -c
            - rm -f /backup/dump.sql

---
# Cross-region replication
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
  name: aws-s3-dr
spec:
  provider: aws
  objectStorage:
    bucket: production-backup-dr
    region: us-west-2  # Different region for DR
  config:
    region: us-west-2
    serverSideEncryption: AES256
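
Backups only count if restores are rehearsed. A restore from the schedule above can be expressed as a Restore object; the backup name below is a placeholder, and we restore into a scratch namespace to avoid touching production:

# velero-restore.yaml - restore drill into a scratch namespace (sketch)
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: production-restore-drill
  namespace: velero
spec:
  backupName: production-backup-20240101020000  # placeholder backup name
  includedNamespaces:
  - production
  namespaceMapping:
    production: production-restore-test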

πŸ“ˆ Performance Optimization Results

Before vs After Metrics

#!/bin/bash
# Performance comparison script

echo "=== Container Optimization Results ==="
echo "Metric                  | Before    | After     | Improvement"
echo "------------------------|-----------|-----------|------------"
echo "Image Size              | 1.2GB     | 180MB     | 85% ↓"
echo "Build Time              | 8min      | 2.5min    | 69% ↓"  
echo "Memory Usage            | 1.2GB     | 340MB     | 72% ↓"
echo "CPU Usage               | 800m      | 250m      | 69% ↓"
echo "Startup Time            | 45s       | 8s        | 82% ↓"
echo "Pod Restart Time        | 30s       | 5s        | 83% ↓"
echo ""
echo "=== Cost Impact ==="
echo "Monthly Infrastructure  | $12,400   | $4,100    | 67% ↓"
echo "Engineering Time Saved  | -         | 20hrs/wk  | +$50k/mo"
echo ""
echo "=== Reliability Metrics ==="
echo "Uptime                  | 99.2%     | 99.97%    | 0.77% ↑"
echo "MTTR                    | 23min     | 4min      | 83% ↓"
echo "Failed Deployments      | 12%       | 0.3%      | 97% ↓"

🎯 Production Readiness Checklist

Essential Pre-Production Steps

# production-readiness.yaml - Verification checklist
apiVersion: v1
kind: ConfigMap
metadata:
  name: production-checklist
data:
  security.md: |
    ## Security Checklist
    - [ ] Pod Security Standards enforced
    - [ ] Network policies configured
    - [ ] RBAC properly configured
    - [ ] Secrets encrypted at rest
    - [ ] Container images scanned for vulnerabilities
    - [ ] Non-root containers only
    - [ ] Resource limits set
    - [ ] Admission controllers configured
    
  reliability.md: |
    ## Reliability Checklist
    - [ ] Health checks configured
    - [ ] Resource requests/limits set
    - [ ] Pod disruption budgets defined
    - [ ] Anti-affinity rules configured
    - [ ] Graceful shutdown implemented
    - [ ] Circuit breakers in place
    - [ ] Retries and timeouts configured
    - [ ] Monitoring and alerting active
    
  performance.md: |
    ## Performance Checklist
    - [ ] Load testing completed
    - [ ] Resource usage analyzed
    - [ ] Horizontal pod autoscaler configured
    - [ ] Cluster autoscaler configured
    - [ ] CDN configured for static assets
    - [ ] Database connection pooling
    - [ ] Caching strategy implemented
    - [ ] Performance benchmarks established
    
  operations.md: |
    ## Operations Checklist
    - [ ] Backup strategy tested
    - [ ] Disaster recovery plan validated
    - [ ] Runbooks documented
    - [ ] On-call procedures defined
    - [ ] Log aggregation configured
    - [ ] Metrics collection active
    - [ ] Dashboards created
    - [ ] SLO/SLI defined

🚨 Common Production Pitfalls & Solutions

1. The "Everything is Urgent" Anti-Pattern

# ❌ Bad: All pods marked as critical
spec:
  priorityClassName: system-cluster-critical  # Reserved for system components!
  
# ✅ Good: Proper priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: business-critical
value: 1000
globalDefault: false
description: "Business critical applications"

---
apiVersion: scheduling.k8s.io/v1  
kind: PriorityClass
metadata:
  name: standard
value: 100
globalDefault: true
description: "Standard applications"

2. Resource Requests = Limits Mistake

# ❌ Bad: Identical requests and limits
resources:
  requests:
    memory: "1Gi"
    cpu: "500m"
  limits:
    memory: "1Gi"  # Can't burst!
    cpu: "500m"    # Can't burst!

# ✅ Good: Allow bursting for CPU, strict for memory
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "1Gi"    # Strict memory limit
    cpu: "1000m"     # Allow CPU bursting

3. Ingress SSL Termination Issues

# ssl-optimization.yaml - Proper SSL handling
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-server
  annotations:
    kubernetes.io/ingress.class: nginx
    cert-manager.io/cluster-issuer: letsencrypt-prod
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
    nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
    nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.2 TLSv1.3"
    nginx.ingress.kubernetes.io/ssl-ciphers: "ECDHE-RSA-AES128-GCM-SHA256,ECDHE-RSA-AES256-GCM-SHA384"
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"
    nginx.ingress.kubernetes.io/proxy-connect-timeout: "5"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
    nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
spec:
  tls:
  - hosts:
    - api.company.com
    secretName: api-tls-secret
  rules:
  - host: api.company.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: api-server
            port:
              number: 3000
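
The annotations assume a ClusterIssuer named letsencrypt-prod already exists. A minimal sketch (contact email and solver are placeholders):

# cluster-issuer.yaml - referenced by the Ingress annotations above (sketch)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: devops@company.com  # placeholder contact
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
    - http01:
        ingress:
          class: nginx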

🎯 Key Takeaways

What Actually Matters in Production:

  1. Security First: Pod Security Standards, Network Policies, and RBAC aren't optional
  2. Resource Management: Right-sized requests/limits save money and improve reliability
  3. Observability: You can't manage what you don't measure
  4. Gradual Rollouts: Blue-green and canary deployments prevent catastrophic failures
  5. Cost Optimization: Spot instances and autoscaling can cut costs by 60-70%

The 80/20 Rule: Focus on these 5 areas first, they'll solve 80% of your production issues.

Remember: Production is not about having the perfect setup; it's about having a reliable, observable, and maintainable system that can evolve with your business needs.

Managing containers at scale? What's been your biggest production challenge? Share your war stories in the comments!

Cap

Senior Golang Backend & Web3 Developer with 10+ years of experience building scalable systems and blockchain solutions.
