Implementing Circuit Breaker Pattern in Go Microservices
🚨 When Everything Goes Wrong: My Black Friday Nightmare
3 AM. Pager buzzing. Half our microservices down. Customers can't check out. $2M in revenue evaporating by the minute.
The culprit? One failing payment service bringing down our entire e-commerce platform.
Here's how the Circuit Breaker Pattern saved our Black Friday (and how it can save yours too).
⚡ What Went Wrong
Our architecture looked solid on paper:
Order Service → Payment Service → Bank API
      ↓                ↓
User Service → Email Service → SMTP
      ↓
Inventory Service → Database
But when the payment service started timing out, every other service kept retrying. The result? A catastrophic cascade failure.
Classic mistake: No circuit breaker protection.
🔧 The Circuit Breaker Pattern Explained
Think of your house's electrical system. When there's a short circuit, the breaker "trips" to prevent a fire. Same concept for microservices.
Three States:
CLOSED    → Requests flow normally
    ↓  (failures exceed the threshold)
OPEN      → Service is failing, reject fast
    ↓  (reset timeout elapses)
HALF-OPEN → Testing recovery
💻 Building Our Circuit Breaker
I'll show you exactly how we implemented it:
package circuitbreaker

import (
    "errors"
    "sync"
    "time"
)

// ErrCircuitBreakerOpen is returned when the breaker rejects a call without trying it.
var ErrCircuitBreakerOpen = errors.New("circuit breaker is open")

type State int

const (
    StateClosed State = iota
    StateOpen
    StateHalfOpen
)

type CircuitBreaker struct {
    mu               sync.RWMutex
    state            State
    failureCount     int
    failureThreshold int
    resetTimeout     time.Duration
    lastFailureTime  time.Time
}
func (cb *CircuitBreaker) Execute(fn func() error) error {
    if !cb.allowRequest() {
        return ErrCircuitBreakerOpen
    }

    err := fn()
    cb.recordResult(err == nil)
    return err
}
func (cb *CircuitBreaker) allowRequest() bool {
    cb.mu.RLock()
    defer cb.mu.RUnlock()

    switch cb.state {
    case StateClosed:
        return true
    case StateOpen:
        // After the reset timeout, let a probe request through; recordResult
        // decides whether the breaker closes again or stays open.
        return time.Since(cb.lastFailureTime) > cb.resetTimeout
    case StateHalfOpen:
        return cb.failureCount < 3 // Limited probes
    }
    return false
}
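Execute above records the outcome with recordResult, which isn't shown. A minimal sketch consistent with the fields defined earlier, assuming a failed probe re-opens the breaker and any success closes it again:

func (cb *CircuitBreaker) recordResult(success bool) {
    cb.mu.Lock()
    defer cb.mu.Unlock()

    if success {
        // Any success (including a probe through an open breaker) resets and closes it.
        cb.failureCount = 0
        cb.state = StateClosed
        return
    }

    cb.failureCount++
    cb.lastFailureTime = time.Now()
    if cb.state == StateHalfOpen || cb.failureCount >= cb.failureThreshold {
        // A failed probe, or too many failures while closed, opens the breaker.
        cb.state = StateOpen
    }
}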
🏗️ Real Implementation: Payment Client
Here's our actual payment service client with circuit breaker:
type PaymentClient struct {
    baseURL string // base URL of the payment service (assumed; the original used a PostJSON helper)
    client  *http.Client
    breaker *CircuitBreaker
}

func (p *PaymentClient) ProcessPayment(ctx context.Context, req PaymentRequest) (*PaymentResponse, error) {
    var result *PaymentResponse
    err := p.breaker.Execute(func() error {
        body, err := json.Marshal(req)
        if err != nil {
            return err
        }
        httpReq, err := http.NewRequestWithContext(ctx, http.MethodPost, p.baseURL+"/payments", bytes.NewReader(body))
        if err != nil {
            return err
        }
        httpReq.Header.Set("Content-Type", "application/json")

        resp, err := p.client.Do(httpReq)
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        if resp.StatusCode >= 500 {
            return fmt.Errorf("payment service error: %d", resp.StatusCode)
        }
        return json.NewDecoder(resp.Body).Decode(&result)
    })
    return result, err
}
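The test later in the Testing Strategy section constructs this client with NewPaymentClient(server.URL), which is never shown. A minimal constructor sketch, assuming for brevity that the breaker type lives in the same package; the 5-failure threshold mirrors the "GOOD" config in the gotchas section, while the 30-second reset window and 2-second HTTP timeout are assumptions:

func NewPaymentClient(baseURL string) *PaymentClient {
    return &PaymentClient{
        baseURL: baseURL,
        client:  &http.Client{Timeout: 2 * time.Second}, // assumed client-side timeout
        breaker: &CircuitBreaker{
            failureThreshold: 5,                // matches the "GOOD" config below
            resetTimeout:     30 * time.Second, // assumed reset window
        },
    }
}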
🎯 The Game Changer: Graceful Degradation
When the circuit breaker opens, we don't just fail. We have a fallback:
func (os *OrderService) ProcessOrder(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    // order and paymentReq are decoded from the request body (omitted here).

    payment, err := os.paymentClient.ProcessPayment(ctx, paymentReq)
    if errors.Is(err, circuitbreaker.ErrCircuitBreakerOpen) {
        // Payment service is down - queue for later!
        orderID := os.queueOrder(order)
        w.WriteHeader(http.StatusAccepted)
        json.NewEncoder(w).Encode(OrderResponse{
            ID:      orderID,
            Status:  "pending",
            Message: "Order received, payment will be processed shortly",
        })
        return
    }
    if err != nil {
        http.Error(w, "Payment failed", http.StatusBadRequest)
        return
    }

    // Success path
    json.NewEncoder(w).Encode(OrderResponse{
        ID:        order.ID,
        Status:    "completed",
        PaymentID: payment.ID,
    })
}
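queueOrder isn't shown. A deliberately naive in-memory sketch, assuming an OrderService with a mutex-guarded pending slice (both assumptions); a production system would push to durable storage such as a database table or Kafka so queued orders survive a restart:

func (os *OrderService) queueOrder(order Order) string {
    // Hypothetical ID scheme; the real service presumably assigns IDs elsewhere.
    id := fmt.Sprintf("order-%d", time.Now().UnixNano())
    order.ID = id

    os.mu.Lock()
    os.pending = append(os.pending, order)
    os.mu.Unlock()

    return id
}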
📊 The Results Speak for Themselves
Before Circuit Breaker:
- MTTR: 45 minutes (manual recovery)
- Failed requests: 100% during outage
- Revenue lost: $2M
- Services affected: 15/20
After Circuit Breaker:
- MTTR: 2 minutes (automatic recovery)
- Failed requests: 0% (queued instead)
- Revenue lost: <$50k
- Services affected: 1/20
🚀 Advanced Features We Added
1. Adaptive Timeouts
type AdaptiveBreaker struct {
    *CircuitBreaker
    avgLatency time.Duration
    timeout    time.Duration
}

func (ab *AdaptiveBreaker) calculateTimeout() time.Duration {
    return ab.avgLatency * 3 // 3x average as timeout
}
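How avgLatency stays current isn't shown. One common approach, sketched here as an assumption rather than the actual implementation, is an exponentially weighted moving average updated after each successful call (callers are assumed to hold the breaker's lock):

func (ab *AdaptiveBreaker) recordLatency(observed time.Duration) {
    // EWMA with alpha = 0.2: smooths out spikes while still tracking trend changes.
    const alpha = 0.2
    if ab.avgLatency == 0 {
        ab.avgLatency = observed
        return
    }
    ab.avgLatency = time.Duration(alpha*float64(observed) + (1-alpha)*float64(ab.avgLatency))
}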
2. Health Check Probe
func (cb *CircuitBreaker) StartHealthCheck(healthURL string) {
    go func() {
        ticker := time.NewTicker(30 * time.Second)
        for range ticker.C {
            if cb.State() == StateOpen {
                if cb.checkHealth(healthURL) {
                    cb.forceHalfOpen() // Force recovery attempt
                }
            }
        }
    }()
}
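State, checkHealth, and forceHalfOpen are referenced above but never shown. A minimal sketch consistent with the breaker defined earlier (the 2-second probe timeout is an assumption):

func (cb *CircuitBreaker) State() State {
    cb.mu.RLock()
    defer cb.mu.RUnlock()
    return cb.state
}

func (cb *CircuitBreaker) checkHealth(healthURL string) bool {
    client := &http.Client{Timeout: 2 * time.Second}
    resp, err := client.Get(healthURL)
    if err != nil {
        return false
    }
    defer resp.Body.Close()
    return resp.StatusCode == http.StatusOK
}

func (cb *CircuitBreaker) forceHalfOpen() {
    cb.mu.Lock()
    defer cb.mu.Unlock()
    if cb.state == StateOpen {
        cb.state = StateHalfOpen
        cb.failureCount = 0
    }
}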
3. Metrics Integration
func (cb *CircuitBreaker) onStateChange(from, to State) {
    prometheus.CircuitBreakerState.WithLabelValues(cb.name).Set(float64(to))

    if to == StateOpen {
        prometheus.CircuitBreakerTrips.WithLabelValues(cb.name).Inc()
        log.Printf("🚨 Circuit breaker %s opened", cb.name)

        // Alert on-call engineer
        alertmanager.Send(Alert{
            Service:     cb.name,
            Severity:    "critical",
            Description: "Circuit breaker opened due to repeated failures",
        })
    }
}
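prometheus.CircuitBreakerState and prometheus.CircuitBreakerTrips above presumably come from an internal metrics wrapper rather than client_golang directly. With the standard client_golang library, equivalent metrics could be declared like this (a sketch; the package name, metric names, and label are assumptions):

package prometheus // hypothetical internal wrapper, shadowing the upstream package name

import (
    promclient "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    // CircuitBreakerState exposes the current state (0=closed, 1=open, 2=half-open).
    CircuitBreakerState = promauto.NewGaugeVec(promclient.GaugeOpts{
        Name: "circuit_breaker_state",
        Help: "Current circuit breaker state (0=closed, 1=open, 2=half-open).",
    }, []string{"service"})

    // CircuitBreakerTrips counts how many times a breaker has opened.
    CircuitBreakerTrips = promauto.NewCounterVec(promclient.CounterOpts{
        Name: "circuit_breaker_trips_total",
        Help: "Number of times the circuit breaker has opened.",
    }, []string{"service"})
)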
🧪 Testing Strategy
We learned testing circuit breakers is tricky. Here's our approach:
func TestCircuitBreakerWithRealService(t *testing.T) {
    // Use httptest.Server for realistic testing
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        time.Sleep(100 * time.Millisecond) // Simulate slow service
        w.WriteHeader(http.StatusInternalServerError)
    }))
    defer server.Close()

    client := NewPaymentClient(server.URL)

    // Should open after 5 failures
    for i := 0; i < 5; i++ {
        _, err := client.ProcessPayment(context.Background(), PaymentRequest{})
        assert.Error(t, err)
    }

    // Should now be open
    _, err := client.ProcessPayment(context.Background(), PaymentRequest{})
    assert.Equal(t, circuitbreaker.ErrCircuitBreakerOpen, err)
}
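One gap in this test: it never exercises recovery. A sketch of a recovery test, assuming it lives in the circuitbreaker package (so it can set unexported fields) and building on the recordResult sketch shown earlier; shrinking resetTimeout keeps the test fast without a fake clock:

func TestCircuitBreakerRecovers(t *testing.T) {
    cb := &CircuitBreaker{failureThreshold: 1, resetTimeout: 10 * time.Millisecond}

    // Trip the breaker with a single failure.
    _ = cb.Execute(func() error { return errors.New("boom") })

    // Immediately after tripping, calls are rejected fast.
    if err := cb.Execute(func() error { return nil }); !errors.Is(err, ErrCircuitBreakerOpen) {
        t.Fatalf("expected ErrCircuitBreakerOpen, got %v", err)
    }

    // After the reset timeout, a successful probe closes the breaker again.
    time.Sleep(20 * time.Millisecond)
    if err := cb.Execute(func() error { return nil }); err != nil {
        t.Fatalf("expected recovery, got %v", err)
    }
}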
⚠️ Gotchas We Discovered
1. Don't Set Thresholds Too Low
// BAD: Will trip on single network hiccup
config := Config{FailureThreshold: 1}
// GOOD: Allows for occasional failures
config := Config{FailureThreshold: 5}
2. Consider Different Error Types
func isRetriableError(err error) bool {
    // Don't trip breaker for client errors
    if httpErr, ok := err.(*HTTPError); ok {
        return httpErr.StatusCode >= 500
    }
    return true
}
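Where isRetriableError gets called isn't shown. One way to wire it in (a sketch; ExecuteWithFilter is a hypothetical variant, not part of the breaker shown earlier) is to skip the failure count for non-retriable errors, so a burst of 4xx responses can't trip the breaker:

func (cb *CircuitBreaker) ExecuteWithFilter(fn func() error) error {
    if !cb.allowRequest() {
        return ErrCircuitBreakerOpen
    }

    err := fn()
    if err != nil && !isRetriableError(err) {
        // Client errors pass through untouched and don't count against the breaker.
        return err
    }
    cb.recordResult(err == nil)
    return err
}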
3. Monitor Half-Open State
The half-open state is critical but often overlooked:
func (cb *CircuitBreaker) executeInHalfOpen(fn func() error) error {
    // Limited concurrency in half-open
    if !cb.halfOpenSemaphore.TryAcquire(1) {
        return ErrTooManyRequests
    }
    defer cb.halfOpenSemaphore.Release(1)

    return fn()
}
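halfOpenSemaphore and ErrTooManyRequests aren't part of the struct shown earlier. With golang.org/x/sync/semaphore they could be declared like this (a sketch; the probe limit of 3 mirrors the allowRequest snippet above):

import "golang.org/x/sync/semaphore"

var ErrTooManyRequests = errors.New("circuit breaker: too many half-open probes")

type CircuitBreaker struct {
    // ...fields from the earlier definition...
    halfOpenSemaphore *semaphore.Weighted // e.g. semaphore.NewWeighted(3) for 3 concurrent probes
}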
🎯 Key Takeaways
- Circuit breakers prevent cascade failures - One service's problems don't become everyone's problems
- Graceful degradation > complete failure - Queue operations when possible
- Monitor everything - You need visibility into breaker state changes
- Test with realistic scenarios - Unit tests aren't enough
- Tune based on real traffic - Every service has different failure patterns
🔮 What's Next?
In our next post, I'll show you how we built adaptive circuit breakers that adjust thresholds based on traffic patterns and bulkhead isolation to prevent resource exhaustion.
Question: Have you experienced cascade failures in your microservices? How did you handle it? Drop a comment below!
P.S. The full code is on GitHub - star it if this helped you!