Production-Grade Docker & Kubernetes: Lessons from Managing 10,000+ Containers
What they don't teach you in tutorials: Real-world container orchestration at enterprise scale
The 3 AM Wake-Up Call
Scene: Black Friday 2023, 3:17 AM. Our e-commerce platform serving 50M+ users just went dark.
The Culprit: A single misconfigured container brought down our entire Kubernetes cluster. One resources.requests.memory: "100Gi"
typo in a config file triggered a cascade failure that cost us $2.3M in 4 hours.
This is the story of how we rebuilt our container infrastructure to handle enterprise-scale traffic, and the hard lessons we learned along the way.
The Scale We're Talking About
Production Stats (as of 2024):
- 10,847 containers across 156 nodes
- 2.1M requests/second peak traffic
- 23 regions worldwide
- 99.97% uptime (roughly 2.6 hours of downtime/year)
- 67% cost reduction from optimization
- <2 minute average deployment time
Let me show you exactly how we achieved this.
Foundation: Production-Ready Dockerfile Patterns
What Not to Do (Our Original Approach)
# Dockerfile.bad - All the antipatterns
FROM node:latest
WORKDIR /app
COPY . .
RUN npm install
RUN npm run build
EXPOSE 3000
USER root
CMD ["npm", "start"]
# Problems:
# - Uses 'latest' tag (unstable)
# - Runs as root (security risk)
# - No multi-stage build (huge image)
# - No health checks
# - No signal handling
Production-Grade Dockerfile
# Dockerfile.production - Battle-tested approach
# Use specific version with security patches
FROM node:18.17.1-alpine3.18 AS base
# Install security updates
RUN apk update && apk upgrade && \
apk add --no-cache dumb-init && \
rm -rf /var/cache/apk/*
# Create non-root user
RUN addgroup -g 1001 -S nodejs && \
adduser -S nextjs -u 1001
# Build stage
FROM base AS deps
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev --no-audit --no-fund && \
npm cache clean --force
# Source build stage
FROM base AS builder
WORKDIR /app
COPY package*.json ./
RUN npm ci --no-audit --no-fund
COPY . .
RUN npm run build
# Production stage
FROM base AS runner
WORKDIR /app
ENV NODE_ENV=production
ENV PORT=3000
# Copy built application and production-only dependencies
COPY --from=builder --chown=nextjs:nodejs /app/dist ./dist
COPY --from=deps --chown=nextjs:nodejs /app/node_modules ./node_modules
COPY --from=builder --chown=nextjs:nodejs /app/package.json ./package.json
# Health check
COPY --from=builder --chown=nextjs:nodejs /app/health-check.js ./health-check.js
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
CMD node health-check.js
USER nextjs
EXPOSE 3000
# Use dumb-init for proper signal handling
ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "dist/server.js"]
# Labels for metadata
LABEL maintainer="devops@company.com" \
version="1.0.0" \
description="Production API server"
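A quick local smoke test catches most image-level regressions before anything reaches the registry. A minimal sketch, assuming the file is saved as Dockerfile.production and the app listens on port 3000 (the tag and container names here are placeholders):
# Build the production image and check its size
docker build -f Dockerfile.production -t api-server:local .
docker images api-server:local --format '{{.Repository}}:{{.Tag}} -> {{.Size}}'
# Run it and confirm the process is not root
docker run -d --name api-server-test -p 3000:3000 api-server:local
docker exec api-server-test whoami          # expect: nextjs
# Watch the HEALTHCHECK move from "starting" to "healthy"
docker inspect --format '{{.State.Health.Status}}' api-server-test
# Clean up
docker rm -f api-server-test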
Advanced: Multi-Architecture Builds
# Dockerfile.multiarch - Support ARM64 and AMD64
ARG BUILDPLATFORM
ARG TARGETPLATFORM
FROM --platform=$BUILDPLATFORM node:18.17.1-alpine3.18 AS base
# TARGETPLATFORM must be re-declared after FROM to be visible inside the stage
ARG TARGETPLATFORM
# Platform-specific optimizations
RUN case "$TARGETPLATFORM" in \
"linux/arm64") echo "Building for ARM64" && apk add --no-cache python3 make g++ ;; \
"linux/amd64") echo "Building for AMD64" ;; \
*) echo "Unsupported platform: $TARGETPLATFORM" && exit 1 ;; \
esac
# Build for multiple platforms
# docker buildx build --platform linux/amd64,linux/arm64 -t myapp:latest --push .
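Cross-platform builds need a buildx builder with QEMU emulation registered; a short setup sketch (the builder name is arbitrary, and the binfmt step is only needed on hosts without emulators installed):
# One-time: register QEMU emulators and create a container-backed builder
docker run --privileged --rm tonistiigi/binfmt --install all
docker buildx create --name multiarch --driver docker-container --use
docker buildx inspect --bootstrap
# Build and push both architectures in a single command
docker buildx build \
  --platform linux/amd64,linux/arm64 \
  -f Dockerfile.multiarch \
  -t myregistry.com/api-server:v1.2.3 \
  --push .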
Kubernetes Configuration: The Right Way
Pod Security & Resource Management
# deployment.yaml - Production-grade configuration
apiVersion: apps/v1
kind: Deployment
metadata:
name: api-server
labels:
app: api-server
version: v1.2.3
spec:
  replicas: 15 # Based on traffic analysis; the HPA below adjusts this at runtime
strategy:
type: RollingUpdate
rollingUpdate:
maxUnavailable: 25%
maxSurge: 1
selector:
matchLabels:
app: api-server
template:
metadata:
labels:
app: api-server
version: v1.2.3
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "9090"
prometheus.io/path: "/metrics"
spec:
# Security context
securityContext:
runAsNonRoot: true
runAsUser: 1001
fsGroup: 1001
seccompProfile:
type: RuntimeDefault
      # Dedicated service account
serviceAccountName: api-server
# Anti-affinity for high availability
affinity:
podAntiAffinity:
preferredDuringSchedulingIgnoredDuringExecution:
- weight: 100
podAffinityTerm:
labelSelector:
matchExpressions:
- key: app
operator: In
values:
- api-server
topologyKey: kubernetes.io/hostname
containers:
- name: api-server
image: myregistry.com/api-server:v1.2.3
imagePullPolicy: IfNotPresent
# Resource limits based on profiling
resources:
requests:
memory: "256Mi"
cpu: "250m"
limits:
memory: "512Mi"
cpu: "500m"
# Environment variables
env:
- name: NODE_ENV
value: "production"
- name: DATABASE_URL
valueFrom:
secretKeyRef:
name: database-secret
key: url
- name: REDIS_URL
valueFrom:
configMapKeyRef:
name: redis-config
key: url
# Ports
ports:
- containerPort: 3000
name: http
- containerPort: 9090
name: metrics
# Health checks
livenessProbe:
httpGet:
path: /health
port: 3000
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
readinessProbe:
httpGet:
path: /ready
port: 3000
initialDelaySeconds: 5
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
# Startup probe for slow-starting apps
startupProbe:
httpGet:
path: /startup
port: 3000
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 30
# Security context
securityContext:
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
capabilities:
drop:
- ALL
# Volume mounts
volumeMounts:
- name: tmp
mountPath: /tmp
- name: cache
mountPath: /app/cache
# Volumes
volumes:
- name: tmp
emptyDir: {}
- name: cache
emptyDir:
sizeLimit: 1Gi
# Image pull secrets
imagePullSecrets:
- name: registry-secret
# Termination grace period
terminationGracePeriodSeconds: 30
---
# Pod Disruption Budget
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-server-pdb
spec:
minAvailable: 70%
selector:
matchLabels:
app: api-server
---
# Horizontal Pod Autoscaler
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-server-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
minReplicas: 5
maxReplicas: 50
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 60
policies:
- type: Percent
value: 50
periodSeconds: 60
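Before trusting a manifest like this in production, validate it server-side and watch the rollout, PDB, and HPA converge. A sketch, assuming the manifests above are saved as deployment.yaml and applied to a production namespace:
# Server-side validation without persisting anything
kubectl apply --dry-run=server -f deployment.yaml -n production
# Apply and wait for the RollingUpdate to complete
kubectl apply -f deployment.yaml -n production
kubectl rollout status deployment/api-server -n production --timeout=5m
# Confirm the autoscaler sees metrics and the disruption budget is satisfied
kubectl get hpa api-server-hpa -n production
kubectl get pdb api-server-pdb -n production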
Service Mesh with Istio
# istio-config.yaml - Advanced traffic management
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: api-server
spec:
hosts:
- api.company.com
gateways:
- api-gateway
http:
- match:
- headers:
canary:
exact: "true"
route:
- destination:
host: api-server
subset: canary
weight: 100
- route:
- destination:
host: api-server
subset: stable
weight: 90
- destination:
host: api-server
subset: canary
weight: 10
fault:
delay:
percentage:
value: 0.1
fixedDelay: 5s
retries:
attempts: 3
perTryTimeout: 2s
retryOn: 5xx,gateway-error,connect-failure,refused-stream
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: api-server
spec:
host: api-server
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 50
maxRequestsPerConnection: 2
loadBalancer:
simple: LEAST_CONN
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
subsets:
- name: stable
labels:
version: v1.2.3
- name: canary
labels:
version: v1.3.0-rc1
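The easiest way to exercise this routing is to compare a request carrying the canary header against normal traffic, and to let istioctl lint the configuration first. A sketch, assuming the gateway already serves api.company.com:
# Lint Istio configuration in the namespace
istioctl analyze -n production
# Force the canary subset via the header match
curl -s -H "canary: true" https://api.company.com/health
# Regular traffic is split 90/10 between stable and canary
for i in $(seq 1 20); do curl -s -o /dev/null -w "%{http_code}\n" https://api.company.com/health; done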
Security Hardening
Network Policies
# network-policy.yaml - Zero-trust networking
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: api-server-netpol
spec:
podSelector:
matchLabels:
app: api-server
policyTypes:
- Ingress
- Egress
ingress:
- from:
- namespaceSelector:
matchLabels:
name: ingress-nginx
- namespaceSelector:
matchLabels:
name: frontend
- namespaceSelector:
matchLabels:
name: monitoring
ports:
- protocol: TCP
port: 3000
- protocol: TCP
port: 9090
egress:
- to:
- namespaceSelector:
matchLabels:
name: database
ports:
- protocol: TCP
port: 5432
- to:
- namespaceSelector:
matchLabels:
name: redis
ports:
- protocol: TCP
port: 6379
- to: [] # DNS
ports:
- protocol: UDP
port: 53
- to: [] # HTTPS to external APIs
ports:
- protocol: TCP
port: 443
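Network policies fail silently, so it's worth probing them from a throwaway pod. A sketch, assuming the API pods run in a production namespace behind a Service named api-server:
# From the default namespace the request should time out (not allowed by the policy)
kubectl run netpol-probe --rm -it --restart=Never --image=busybox:1.36 -n default -- \
  wget -qO- --timeout=3 http://api-server.production.svc.cluster.local:3000/health
# From a namespace labeled name=frontend (allowed by the ingress rule) it should succeed
kubectl run netpol-probe --rm -it --restart=Never --image=busybox:1.36 -n frontend -- \
  wget -qO- --timeout=3 http://api-server.production.svc.cluster.local:3000/health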
Pod Security Standards
# pod-security.yaml - Enforce security policies
apiVersion: v1
kind: Namespace
metadata:
name: production
labels:
pod-security.kubernetes.io/enforce: restricted
pod-security.kubernetes.io/audit: restricted
pod-security.kubernetes.io/warn: restricted
---
# Security Context Constraints (OpenShift)
apiVersion: security.openshift.io/v1
kind: SecurityContextConstraints
metadata:
name: api-server-scc
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegedContainer: false
allowedCapabilities: []
defaultAddCapabilities: []
requiredDropCapabilities:
- ALL
forbiddenSysctls:
- "*"
fsGroup:
type: MustRunAs
ranges:
- min: 1001
max: 1001
runAsUser:
type: MustRunAsNonRoot
seLinuxContext:
type: MustRunAs
supplementalGroups:
type: MustRunAs
ranges:
- min: 1001
max: 1001
volumes:
- configMap
- emptyDir
- projected
- secret
- downwardAPI
- persistentVolumeClaim
users:
- system:serviceaccount:production:api-server
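Pod Security admission can be dry-run against an existing namespace to see which workloads would be rejected before enforcement is turned on:
# Report pods that would violate the restricted profile, without changing anything
kubectl label --dry-run=server --overwrite ns production \
  pod-security.kubernetes.io/enforce=restricted
# Verify the labels that are actually applied
kubectl get ns production --show-labels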
Monitoring & Observability
Prometheus Monitoring Stack
# monitoring.yaml - Comprehensive observability
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: api-server
labels:
team: backend
spec:
selector:
matchLabels:
app: api-server
endpoints:
- port: metrics
interval: 15s
path: /metrics
honorLabels: true
---
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: api-server-alerts
spec:
groups:
- name: api-server.rules
rules:
- alert: APIServerHighErrorRate
expr: |
        (
          sum by (instance) (rate(http_requests_total{app="api-server",status=~"5.."}[5m]))
          /
          sum by (instance) (rate(http_requests_total{app="api-server"}[5m]))
        ) > 0.05
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "API Server error rate is above 5%"
description: "API Server {{ $labels.instance }} has error rate of {{ $value | humanizePercentage }}"
- alert: APIServerHighLatency
expr: |
        histogram_quantile(0.95,
          sum by (le) (rate(http_request_duration_seconds_bucket{app="api-server"}[5m]))
        ) > 0.5
for: 10m
labels:
severity: warning
team: backend
annotations:
summary: "API Server high latency"
description: "API Server 95th percentile latency is {{ $value }}s"
- alert: APIServerPodCrashLooping
expr: |
rate(kube_pod_container_status_restarts_total{container="api-server"}[5m]) * 60 * 5 > 0
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "API Server pod is crash looping"
description: "Pod {{ $labels.pod }} is restarting {{ $value }} times per 5 minutes"
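Alert expressions are easy to get subtly wrong, so it helps to evaluate them against Prometheus by hand before paging anyone on them. A sketch using the HTTP API; the service name and namespace of the Prometheus instance are assumptions about a typical Prometheus Operator install:
# Port-forward Prometheus locally (adjust service/namespace to your stack)
kubectl -n monitoring port-forward svc/prometheus-operated 9090:9090 &
# Evaluate the error-rate expression the first alert is built on
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (instance) (rate(http_requests_total{app="api-server",status=~"5.."}[5m])) / sum by (instance) (rate(http_requests_total{app="api-server"}[5m]))' \
  | jq '.data.result'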
Distributed Tracing
// tracing.go - OpenTelemetry instrumentation
package main
import (
	"context"
	"fmt"
	"net/http"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/jaeger"
	"go.opentelemetry.io/otel/sdk/resource"
	"go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)
func initTracing() (*trace.TracerProvider, error) {
// Create Jaeger exporter
exp, err := jaeger.New(jaeger.WithCollectorEndpoint(
jaeger.WithEndpoint("http://jaeger-collector:14268/api/traces"),
))
if err != nil {
return nil, err
}
// Create tracer provider
tp := trace.NewTracerProvider(
trace.WithBatcher(exp),
trace.WithResource(resource.NewWithAttributes(
semconv.SchemaURL,
semconv.ServiceNameKey.String("api-server"),
semconv.ServiceVersionKey.String("v1.2.3"),
semconv.DeploymentEnvironmentKey.String("production"),
)),
trace.WithSampler(trace.TraceIDRatioBased(0.1)), // 10% sampling
)
otel.SetTracerProvider(tp)
return tp, nil
}
// Middleware for HTTP tracing
func tracingMiddleware(next http.Handler) http.Handler {
tracer := otel.Tracer("api-server")
return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
ctx, span := tracer.Start(r.Context(), r.URL.Path)
defer span.End()
// Add attributes
span.SetAttributes(
attribute.String("http.method", r.Method),
attribute.String("http.url", r.URL.String()),
attribute.String("user.id", getUserID(r)),
)
// Inject context into request
r = r.WithContext(ctx)
		// Custom response writer to capture the status code
		rw := &responseWriter{ResponseWriter: w, statusCode: http.StatusOK}
		next.ServeHTTP(rw, r)
		span.SetAttributes(
			attribute.Int("http.status_code", rw.statusCode),
		)
		if rw.statusCode >= 400 {
			span.RecordError(fmt.Errorf("HTTP %d", rw.statusCode))
		}
	})
}

// responseWriter wraps http.ResponseWriter so the middleware can record the status code.
type responseWriter struct {
	http.ResponseWriter
	statusCode int
}

func (rw *responseWriter) WriteHeader(code int) {
	rw.statusCode = code
	rw.ResponseWriter.WriteHeader(code)
}

// getUserID is application-specific; a header-based placeholder is assumed here.
func getUserID(r *http.Request) string {
	return r.Header.Get("X-User-ID")
}
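With 10% sampling enabled, the quickest sanity check is to open Jaeger and confirm spans arrive under the api-server service name. A sketch, assuming a typical Jaeger deployment in an observability namespace (service names vary by install method):
# Expose the Jaeger UI locally and look for the "api-server" service
kubectl -n observability port-forward svc/jaeger-query 16686:16686 &
xdg-open http://localhost:16686    # or: open http://localhost:16686 on macOS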
CI/CD Pipeline Integration
GitOps with ArgoCD
# .github/workflows/deploy.yml - Production deployment pipeline
name: Deploy to Production
on:
  push:
    branches: [main]
env:
  REGISTRY: ghcr.io                     # adjust to your registry
  IMAGE_NAME: ${{ github.repository }}
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run security scan
        uses: aquasecurity/trivy-action@master
        with:
          scan-type: fs      # scan the source tree; the image tag does not exist yet at this stage
          scan-ref: .
          format: sarif
          output: trivy-results.sarif
      - name: Upload scan results
        uses: github/codeql-action/upload-sarif@v2
        with:
          sarif_file: trivy-results.sarif
build-and-push:
needs: test
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Login to registry
uses: docker/login-action@v3
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Build and push
uses: docker/build-push-action@v5
with:
context: .
platforms: linux/amd64,linux/arm64
push: true
tags: |
${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }}
${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:latest
cache-from: type=gha
cache-to: type=gha,mode=max
  deploy:
    needs: build-and-push
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4   # check out the repo that holds values.yaml
      - name: Update deployment
        run: |
          # Update Helm values or Kustomization
          yq eval '.image.tag = "${{ github.sha }}"' -i values.yaml
          # Commit to GitOps repo
          git config user.name "GitHub Actions"
          git config user.email "actions@github.com"
          git add values.yaml
          git commit -m "Update image to ${{ github.sha }}"
          git push origin main
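Once the GitOps commit lands, ArgoCD reconciles it; the argocd CLI is handy for watching the sync and rolling back if the new image misbehaves (the application name here is an assumption):
# Watch ArgoCD pick up the new image tag
argocd app get api-server --refresh
argocd app wait api-server --health --timeout 300
# Roll back to a previous synced revision if needed
argocd app history api-server
argocd app rollback api-server <HISTORY_ID>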
Blue-Green Deployment Strategy
# blue-green-deployment.yaml - Zero-downtime deployments
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
name: api-server
spec:
replicas: 20
strategy:
blueGreen:
activeService: api-server-active
previewService: api-server-preview
autoPromotionEnabled: false
scaleDownDelaySeconds: 30
prePromotionAnalysis:
templates:
- templateName: success-rate
args:
- name: service-name
value: api-server-preview
postPromotionAnalysis:
templates:
- templateName: success-rate
args:
- name: service-name
value: api-server-active
selector:
matchLabels:
app: api-server
template:
metadata:
labels:
app: api-server
spec:
containers:
- name: api-server
image: myregistry.com/api-server:latest
ports:
- containerPort: 3000
resources:
requests:
memory: 256Mi
cpu: 250m
limits:
memory: 512Mi
cpu: 500m
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
name: success-rate
spec:
args:
- name: service-name
metrics:
- name: success-rate
interval: 30s
count: 10
successCondition: result[0] >= 0.95
failureLimit: 3
provider:
prometheus:
address: http://prometheus:9090
query: |
sum(rate(http_requests_total{service="{{args.service-name}}",status!~"5.."}[5m])) /
sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
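Because autoPromotionEnabled is false, someone (or some automation) has to promote the preview stack after the analysis passes; the Argo Rollouts kubectl plugin is the usual tool:
# Watch the blue-green rollout and its pre-promotion analysis
kubectl argo rollouts get rollout api-server --watch
# Promote preview -> active once the success-rate analysis is green
kubectl argo rollouts promote api-server
# Abort and stay on the current active ReplicaSet if it is not
kubectl argo rollouts abort api-server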
Cost Optimization Strategies
Cluster Autoscaling
# cluster-autoscaler.yaml - Dynamic node scaling
apiVersion: apps/v1
kind: Deployment
metadata:
name: cluster-autoscaler
namespace: kube-system
spec:
template:
spec:
containers:
      - image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.0
name: cluster-autoscaler
command:
- ./cluster-autoscaler
- --v=4
- --stderrthreshold=info
- --cloud-provider=aws
- --skip-nodes-with-local-storage=false
- --expander=least-waste
- --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/production
- --balance-similar-node-groups
- --scale-down-enabled=true
- --scale-down-delay-after-add=10m
- --scale-down-unneeded-time=10m
- --scale-down-utilization-threshold=0.5
- --skip-nodes-with-system-pods=false
---
# Node pool configuration for cost optimization
apiVersion: v1
kind: ConfigMap
metadata:
name: node-pools-config
data:
spot-instances.yaml: |
# 70% spot instances for cost savings
nodeGroups:
- name: spot-workers
instanceTypes:
- m5.large
- m5.xlarge
- m5a.large
- m5a.xlarge
spot: true
minSize: 5
maxSize: 100
desiredCapacity: 20
labels:
node-type: spot
cost-optimization: enabled
taints:
- key: spot-instance
value: "true"
effect: NoSchedule
- name: on-demand-workers
instanceTypes:
- m5.large
spot: false
minSize: 3
maxSize: 20
desiredCapacity: 5
labels:
node-type: on-demand
critical: "true"
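Because the spot pool carries a spot-instance=true:NoSchedule taint, only workloads that explicitly tolerate it will land there. A sketch of moving a stateless deployment onto spot capacity (reusing the api-server deployment from earlier sections):
# Tolerate the spot taint and prefer spot nodes for the api-server pods
kubectl -n production patch deployment api-server --type merge -p '{
  "spec": {"template": {"spec": {
    "tolerations": [{"key": "spot-instance", "operator": "Equal", "value": "true", "effect": "NoSchedule"}],
    "nodeSelector": {"node-type": "spot"}
  }}}
}'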
Resource Right-Sizing
// resource-analyzer.go - Automated resource recommendations
package main
import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

type ResourceAnalyzer struct {
	prometheusClient v1.API
}

// NewResourceAnalyzer wires up the Prometheus HTTP API client.
func NewResourceAnalyzer(promURL string) (*ResourceAnalyzer, error) {
	client, err := api.NewClient(api.Config{Address: promURL})
	if err != nil {
		return nil, err
	}
	return &ResourceAnalyzer{prometheusClient: v1.NewAPI(client)}, nil
}
type ResourceRecommendation struct {
Deployment string
Namespace string
CurrentCPU string
RecommendedCPU string
CurrentMemory string
RecommendedMemory string
PotentialSavings float64
}
func (ra *ResourceAnalyzer) AnalyzeResources(ctx context.Context) ([]ResourceRecommendation, error) {
// Query CPU usage over last 7 days
cpuQuery := `
max_over_time(
avg by (pod, namespace) (
rate(container_cpu_usage_seconds_total[5m])
)[7d:1h]
)
`
// Query memory usage over last 7 days
memoryQuery := `
max_over_time(
avg by (pod, namespace) (
container_memory_working_set_bytes
)[7d:1h]
)
`
cpuResult, _, err := ra.prometheusClient.Query(ctx, cpuQuery, time.Now())
if err != nil {
return nil, err
}
memoryResult, _, err := ra.prometheusClient.Query(ctx, memoryQuery, time.Now())
if err != nil {
return nil, err
}
recommendations := ra.generateRecommendations(cpuResult, memoryResult)
return recommendations, nil
}
func (ra *ResourceAnalyzer) generateRecommendations(cpuData, memoryData model.Value) []ResourceRecommendation {
var recommendations []ResourceRecommendation
// Add 20% buffer for CPU recommendations
// Add 10% buffer for memory recommendations
// Calculate cost savings based on cloud provider pricing
	for _, sample := range cpuData.(model.Vector) {
		maxCPU := float64(sample.Value) * 1.2 // 20% buffer
		recommendation := ResourceRecommendation{
			// Resolving the owning Deployment from the pod-level sample is assumed
			// to happen upstream (e.g. via a kube_pod_owner join in the query).
			Deployment:       string(sample.Metric["deployment"]),
			Namespace:        string(sample.Metric["namespace"]),
			RecommendedCPU:   fmt.Sprintf("%.0fm", maxCPU*1000),
			PotentialSavings: ra.calculateSavings(sample), // cloud pricing lookup, not shown
		}
recommendations = append(recommendations, recommendation)
}
return recommendations
}
// Generate automated resource updates
func (ra *ResourceAnalyzer) GenerateKustomizePatch(rec ResourceRecommendation) string {
return fmt.Sprintf(`
apiVersion: apps/v1
kind: Deployment
metadata:
name: %s
namespace: %s
spec:
template:
spec:
containers:
- name: app
resources:
requests:
cpu: %s
memory: %s
limits:
cpu: %s
memory: %s
`, rec.Deployment, rec.Namespace,
rec.RecommendedCPU, rec.RecommendedMemory,
rec.RecommendedCPU, rec.RecommendedMemory)
}
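Before acting on automated recommendations, compare live usage against what the Deployment actually requests; with metrics-server installed that is a one-liner:
# Live CPU/memory per container vs. configured requests
kubectl top pods -n production --containers
kubectl -n production get deploy api-server \
  -o jsonpath='{.spec.template.spec.containers[0].resources}{"\n"}'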
Disaster Recovery & Backup
Velero Backup Strategy
# velero-backup.yaml - Automated disaster recovery
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: production-backup
spec:
schedule: "0 2 * * *" # Daily at 2 AM
template:
includedNamespaces:
- production
- staging
excludedResources:
- events
- events.events.k8s.io
storageLocation: aws-s3
volumeSnapshotLocations:
- aws-ebs
ttl: 720h0m0s # 30 days retention
hooks:
resources:
- name: database-backup-hook
includedNamespaces:
- production
labelSelector:
matchLabels:
app: postgresql
pre:
- exec:
container: postgresql
command:
- /bin/bash
- -c
- pg_dump -U postgres mydb > /backup/dump.sql
post:
- exec:
container: postgresql
command:
- /bin/bash
- -c
- rm -f /backup/dump.sql
---
# Cross-region replication
apiVersion: velero.io/v1
kind: BackupStorageLocation
metadata:
name: aws-s3-dr
spec:
provider: aws
objectStorage:
bucket: production-backup-dr
region: us-west-2 # Different region for DR
config:
region: us-west-2
serverSideEncryption: AES256
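Backups only count once restores have been rehearsed; the velero CLI covers both on-demand backups from the schedule and test restores into a scratch namespace (the namespace mapping below is an assumption about naming):
# Trigger an on-demand backup using the schedule's template
velero backup create pre-change-test --from-schedule production-backup
# Rehearse a restore into a scratch namespace instead of overwriting production
velero restore create --from-backup pre-change-test \
  --namespace-mappings production:production-restore-test
velero restore get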
Performance Optimization Results
Before vs After Metrics
# Performance comparison script
#!/bin/bash
echo "=== Container Optimization Results ==="
echo "Metric                  | Before    | After     | Improvement"
echo "------------------------|-----------|-----------|------------"
echo "Image Size              | 1.2GB     | 180MB     | 85% ↓"
echo "Build Time              | 8min      | 2.5min    | 69% ↓"
echo "Memory Usage            | 1.2GB     | 340MB     | 72% ↓"
echo "CPU Usage               | 800m      | 250m      | 69% ↓"
echo "Startup Time            | 45s       | 8s        | 82% ↓"
echo "Pod Restart Time        | 30s       | 5s        | 83% ↓"
echo ""
echo "=== Cost Impact ==="
echo "Monthly Infrastructure  | \$12,400   | \$4,100    | 67% ↓"
echo "Engineering Time Saved  | -         | 20hrs/wk  | +\$50k/mo"
echo ""
echo "=== Reliability Metrics ==="
echo "Uptime                  | 99.2%     | 99.97%    | 0.77% ↑"
echo "MTTR                    | 23min     | 4min      | 83% ↓"
echo "Failed Deployments      | 12%       | 0.3%      | 97% ↓"
Production Readiness Checklist
Essential Pre-Production Steps
# production-readiness.yaml - Verification checklist
apiVersion: v1
kind: ConfigMap
metadata:
name: production-checklist
data:
security.md: |
## Security Checklist
- [ ] Pod Security Standards enforced
- [ ] Network policies configured
- [ ] RBAC properly configured
- [ ] Secrets encrypted at rest
- [ ] Container images scanned for vulnerabilities
- [ ] Non-root containers only
- [ ] Resource limits set
- [ ] Admission controllers configured
reliability.md: |
## Reliability Checklist
- [ ] Health checks configured
- [ ] Resource requests/limits set
- [ ] Pod disruption budgets defined
- [ ] Anti-affinity rules configured
- [ ] Graceful shutdown implemented
- [ ] Circuit breakers in place
- [ ] Retries and timeouts configured
- [ ] Monitoring and alerting active
performance.md: |
## Performance Checklist
- [ ] Load testing completed
- [ ] Resource usage analyzed
- [ ] Horizontal pod autoscaler configured
- [ ] Cluster autoscaler configured
- [ ] CDN configured for static assets
- [ ] Database connection pooling
- [ ] Caching strategy implemented
- [ ] Performance benchmarks established
operations.md: |
## Operations Checklist
- [ ] Backup strategy tested
- [ ] Disaster recovery plan validated
- [ ] Runbooks documented
- [ ] On-call procedures defined
- [ ] Log aggregation configured
- [ ] Metrics collection active
- [ ] Dashboards created
- [ ] SLO/SLI defined
Common Production Pitfalls & Solutions
1. The "Everything is Urgent" Anti-Pattern
# Bad: All pods marked as critical
spec:
priorityClassName: system-cluster-critical # Reserved for system components!
# Good: Proper priority classes
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: business-critical
value: 1000
globalDefault: false
description: "Business critical applications"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: standard
value: 100
globalDefault: true
description: "Standard applications"
2. Resource Requests = Limits Mistake
# Bad: Identical requests and limits
resources:
requests:
memory: "1Gi"
cpu: "500m"
limits:
memory: "1Gi" # Can't burst!
cpu: "500m" # Can't burst!
# Good: Allow bursting for CPU, strict for memory
resources:
requests:
memory: "512Mi"
cpu: "250m"
limits:
memory: "1Gi" # Strict memory limit
cpu: "1000m" # Allow CPU bursting
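The requests-vs-limits split also determines the pod's QoS class (Guaranteed when they match, Burstable when they differ), which drives eviction order under node pressure; it's easy to confirm what a pod ended up with:
# Show the QoS class Kubernetes assigned to each api-server pod
kubectl -n production get pods -l app=api-server \
  -o custom-columns='NAME:.metadata.name,QOS:.status.qosClass'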
3. Ingress SSL Termination Issues
# ssl-optimization.yaml - Proper SSL handling
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: api-server
annotations:
kubernetes.io/ingress.class: nginx
cert-manager.io/cluster-issuer: letsencrypt-prod
nginx.ingress.kubernetes.io/ssl-redirect: "true"
nginx.ingress.kubernetes.io/force-ssl-redirect: "true"
nginx.ingress.kubernetes.io/ssl-protocols: "TLSv1.2 TLSv1.3"
nginx.ingress.kubernetes.io/ssl-ciphers: "ECDHE-RSA-AES128-GCM-SHA256,ECDHE-RSA-AES256-GCM-SHA384"
nginx.ingress.kubernetes.io/proxy-body-size: "50m"
nginx.ingress.kubernetes.io/proxy-connect-timeout: "5"
nginx.ingress.kubernetes.io/proxy-send-timeout: "60"
nginx.ingress.kubernetes.io/proxy-read-timeout: "60"
spec:
tls:
- hosts:
- api.company.com
secretName: api-tls-secret
rules:
- host: api.company.com
http:
paths:
- path: /
pathType: Prefix
backend:
service:
name: api-server
port:
number: 3000
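Once cert-manager has issued the certificate, a couple of client-side checks confirm the redirect, the TLS floor, and the certificate chain; a sketch against the host from the manifest above:
# Certificate status as cert-manager sees it
kubectl -n production get certificate
# Plain HTTP should redirect to HTTPS
curl -sI http://api.company.com/ | head -n 1
# Inspect the served certificate's subject, issuer, and validity window
openssl s_client -connect api.company.com:443 -servername api.company.com </dev/null 2>/dev/null \
  | openssl x509 -noout -subject -issuer -dates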
Key Takeaways
What Actually Matters in Production:
- Security First: Pod Security Standards, Network Policies, and RBAC aren't optional
- Resource Management: Right-sized requests/limits save money and improve reliability
- Observability: You can't manage what you don't measure
- Gradual Rollouts: Blue-green and canary deployments prevent catastrophic failures
- Cost Optimization: Spot instances and autoscaling can cut costs by 60-70%
The 80/20 Rule: Focus on these 5 areas first; they'll solve 80% of your production issues.
Remember: Production is not about having the perfect setup; it's about having a reliable, observable, and maintainable system that can evolve with your business needs.
Managing containers at scale? What's been your biggest production challenge? Share your war stories in the comments!
Cap
Senior Golang Backend & Web3 Developer with 10+ years of experience building scalable systems and blockchain solutions.