Resilience, Risk, Governance, and Operations
Build resilient services, manage risk, and meet governance, compliance, and vendor obligations.
High Availability Design: Or, How to Stop Your Systems from Taking Personal Days
Uptime is a love language. Speak it fluently.
Remember when we sifted through packet captures and firewall logs at 2 a.m., trying to figure out why the app face-planted like a fainting goat? Cool. High availability (HA) is the part where we engineer the system so those incidents are rarer, smaller, and much less dramatic. Think of it as proactive incident response: we design away single points of failure so your IR team gets to sleep.
What Is High Availability (HA), Really?
High availability is the practice of designing systems to keep services running even when components fail. It’s not the same as disaster recovery (that’s when the whole building floods and we cry into our runbooks). HA is about surviving the everyday chaos: servers dying, links dropping, patches rebooting at the worst time.
Key vibes:
- Eliminate single points of failure (SPOFs). If one thing breaks and your service dies, that thing needs a buddy.
- Constrain failure domains. Keep problems local (rack, zone) and prevent blast radius from nuking the whole app.
- Maintain state smartly. Stateless front ends, replicated stateful back ends, and session sanity.
- Automate failover. Humans are great, but they click slowly at 3 a.m.
Hope is not a strategy. Redundancy is.
The Math-ish: What Do We Mean by “Available”?
We can approximate availability using:
Availability ≈ MTBF / (MTBF + MTTR)
- MTBF: Mean Time Between Failures
- MTTR: Mean Time To Repair
Redundancy raises the effective MTBF of the service (individual parts still fail, but the service shrugs it off), and automation drives MTTR down.
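If you want to see that math with real numbers, here is a minimal sketch (the 720-hour MTBF and 4-hour MTTR are made up for illustration) that also reproduces the downtime budgets in the table below:

```python
# Availability math in Python; the MTBF/MTTR figures are illustrative.

HOURS_PER_YEAR = 24 * 365  # back-of-napkin; ignores leap years

def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Approximate availability as MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def yearly_downtime_minutes(avail: float) -> float:
    """Turn an availability fraction into an allowed downtime budget."""
    return (1 - avail) * HOURS_PER_YEAR * 60

# A component that fails roughly monthly (MTBF ~720 h) and takes 4 h to fix:
a = availability(mtbf_hours=720, mttr_hours=4)
print(f"availability ≈ {a:.4%}")                                       # ≈ 99.45%
print(f"downtime budget ≈ {yearly_downtime_minutes(a):.0f} min/year")  # ≈ 2904

# The "nines" from the table below:
for sla in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{sla * 100:g}% -> {yearly_downtime_minutes(sla):,.1f} min/year")
```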
How much downtime do different “nines” imply?
| SLA | Max downtime/year |
|---|---|
| 99% | ~3 days, 15 hours |
| 99.9% | ~8 hours, 46 minutes |
| 99.99% | ~52 minutes |
| 99.999% | ~5 minutes |
If your boss says “five nines,” translate that to “we patch with zero downtime and every widget has a stunt double.”
Also know two cousins from incident response land:
- RTO (Recovery Time Objective): How fast we must recover.
- RPO (Recovery Point Objective): How much data loss is tolerable.
Synchronous replication helps RPO. Automation helps RTO. Budget helps everything.
Core HA Patterns (a.k.a. Redundancy Menu)
| Pattern | Use When | Pros | Tradeoffs |
|---|---|---|---|
| Active/Active | Scalable stateless services | Uses all capacity; faster failover | Complex state/session handling |
| Active/Passive | Simpler services, databases, firewalls | Easier failover logic | Idle hardware cost; slower cutover |
| N+1 | You can tolerate 1 failure | Cost-effective balance | Two failures = sad |
| 2N | Mission-critical components | Survives full side failure | $$$ and complexity |
| Geo-Redundancy | Region-level outages are a risk | Survives epic meltdowns | Data consistency, latency, cost |
Pro tip from log analysis: visibility drives design. If you can’t see a node’s health, your failover will be vibes-based—and vibes don’t pass audits.
HA by Layer (Because Everything Is Stacked Like a Nacho Platter)
Facility and Power
- Dual power feeds, dual PDUs, dual PSUs per server.
- UPS + generator with tested fuel contracts. (Untested generators are decorative.)
- Cooling redundancy (N+1 minimum). Heat has no chill.
Network
- Dual ISPs with automatic failover (BGP, SD-WAN). Don’t let a backhoe take you out.
- Redundant firewalls in an HA pair (active/standby); gateway failover via first-hop redundancy protocols like VRRP/HSRP.
- Load balancers (L4/L7) with health checks and connection draining (see the sketch after this list).
- Reduce DNS fragility with low TTLs, health-checked DNS (GSLB), or Anycast.
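To make the load-balancer bullet concrete, here is roughly the decision its health checking boils down to: only route to backends whose health endpoint answers. This is a toy sketch, not a real load balancer; the backend addresses and the /healthz path are placeholders, not anything your environment necessarily uses.

```python
# Toy view of what a load balancer's health checking boils down to.
# Backend addresses and the /healthz path are placeholders for illustration.
import itertools
import urllib.request

BACKENDS = ["http://10.0.1.11:8080", "http://10.0.1.12:8080"]  # hypothetical

def is_healthy(base_url: str, timeout: float = 2.0) -> bool:
    """A backend counts as healthy only if its health endpoint answers 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False  # timeouts, refused connections, DNS failures, etc.

_rr = itertools.cycle(range(len(BACKENDS)))

def pick_backend() -> str | None:
    """Round-robin across backends, skipping any that fail their health check."""
    for _ in range(len(BACKENDS)):
        candidate = BACKENDS[next(_rr)]
        if is_healthy(candidate):
            return candidate
    return None  # every backend is down: fail loudly, don't route blindly
```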
Compute and Application
- Stateless front ends behind a load balancer. Offload session state to Redis or DB.
- Rolling updates and blue/green deployments to patch without downtime.
- Health probes that actually test the thing that matters (dependency checks, not just “:80 is open”).
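On the service side, a probe that "tests the thing that matters" might look like the following sketch: the endpoint reports healthy only if a real dependency is reachable. The db.internal hostname, port 5432, and the /healthz path are assumptions for illustration.

```python
# Minimal deep health probe: report healthy only if the dependency we
# actually need (an assumed database at db.internal:5432) is reachable.
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

DB_HOST, DB_PORT = "db.internal", 5432  # hypothetical dependency

def dependency_ok(host: str, port: int, timeout: float = 1.0) -> bool:
    """Cheap reachability check; a real probe might run 'SELECT 1' instead."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        # 200 means "safe to send traffic here"; 503 tells the LB to drain us.
        status = 200 if dependency_ok(DB_HOST, DB_PORT) else 503
        self.send_response(status)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), HealthHandler).serve_forever()
```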
Data and State
- Synchronous replication for zero/near-zero RPO within a metro zone; asynchronous across regions for DR.
- Understand quorum and split-brain. Two nodes can argue forever; three can vote (see the sketch after this list).
- Even with HA, you still need backups. Replication just copies your mistakes faster.
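The quorum point is easiest to see with a few lines of arithmetic. This is a sketch of strict-majority voting, not any particular cluster product's election logic:

```python
# Why three nodes can vote and two can only argue: a strict majority
# is required, so a partitioned pair can never out-vote the other side.
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True only if a strict majority of the cluster is reachable."""
    return reachable_nodes > cluster_size // 2

# 2-node cluster split down the middle: each side sees 1 of 2 -> no quorum,
# so neither side can safely take writes (or both do, and you get split-brain).
print(has_quorum(1, 2))  # False

# 3-node cluster with one node isolated: the pair sees 2 of 3 -> quorum,
# the loner sees 1 of 3 -> steps down. Exactly one side keeps serving.
print(has_quorum(2, 3))  # True
print(has_quorum(1, 3))  # False
```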
A Concrete Example: Two-Zone Web App That Refuses To Die
```
                      Users
                        │
                 Anycast/DNS LB
                        │
          ┌─────────────┴──────────────┐
          │                            │
       Zone A                       Zone B
          │                            │
        LB A  (L7 health checks)     LB B  (L7 health checks)
          │                            │
    App A1  App A2  (stateless)  App B1  App B2  (stateless)
          │                            │
       Cache A ◄─── replication ───► Cache B
          │                            │
   DB A (Primary) ───── sync ─────► DB B (Standby, auto failover)

   Dual ISPs per zone, redundant firewalls, dual PSUs everywhere
```
- Normal ops: both zones serve traffic (active/active at app tier), DB is active/passive with synchronous replication.
- Zone A dies: GSLB routes to Zone B; app stays up; DB B promotes automatically; caches rehydrate.
- Maintenance: drain connections on LB A, patch A1/A2, rotate; repeat in Zone B. Zero user tears.
Observability (throwback to our IR unit):
- Logs flag promotions, health check failures, LB decisions.
- Metrics watch saturation (CPU, memory, queue length) and availability SLOs.
- Traces confirm end-to-end latency isn’t spiking during failovers.
Designing HA: A Playbook You Can Actually Use
- Define SLOs/SLAs. Pick your nines and attach real business impact to each one.
- Map the service. Draw dependencies; circle every SPOF in red like it owes you money.
- Choose patterns per layer. Active/active where stateless; active/passive for things that don't scale horizontally (many DBs, firewalls).
- Handle state deliberately. Externalize sessions; choose sync vs. async with explicit RPO math.
- Reduce shared fate. Separate zones/regions, diverse ISPs, different power/cooling.
- Automate failover with guardrails. Health checks that matter; rate-limit flaps; manual override for emergencies (see the sketch after this list).
- Test it. Break it on purpose: chaos drills, game days, and runbook reps.
- Observe everything. Centralized logging, metrics, and alerting tied to business SLOs. If your failover happens and nobody logs it, did it even occur?
- Patch without pain. Rolling, blue/green, or canary deployments; security patches should not create incidents.
- Document and train. Runbooks, diagrams, and "this is fine" fire memes kept to a tasteful minimum.
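Here is the "automate failover with guardrails" step as a deliberately small sketch: promote only after several consecutive failed checks (so flaps don't trigger ping-pong failovers), and keep a manual kill switch. primary_healthy() and promote_standby() are placeholders, not any real cluster manager's API.

```python
# Guardrailed failover loop (sketch). primary_healthy() and promote_standby()
# are placeholders for whatever your stack actually provides.
import time

FAILURES_BEFORE_FAILOVER = 3   # ride out brief flaps instead of ping-ponging
CHECK_INTERVAL_SECONDS = 10
AUTOMATIC_FAILOVER_ENABLED = True  # the manual override / kill switch

def primary_healthy() -> bool:
    raise NotImplementedError  # e.g., a deep health probe against the primary

def promote_standby() -> None:
    raise NotImplementedError  # e.g., tell the standby DB to take over

def watch_primary() -> None:
    consecutive_failures = 0
    while True:
        consecutive_failures = 0 if primary_healthy() else consecutive_failures + 1
        if consecutive_failures >= FAILURES_BEFORE_FAILOVER:
            if AUTOMATIC_FAILOVER_ENABLED:
                print("primary failed repeated checks; promoting standby")
                promote_standby()
                return  # one failover per run; humans decide what happens next
            print("primary looks down, but automatic failover is disabled")
        time.sleep(CHECK_INTERVAL_SECONDS)
```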
Security Gotchas (Because HA Shouldn’t Undermine Security)
- Fail-closed vs fail-open: Firewalls and auth services should fail-closed. Load balancers can fail-open only if you like chaos.
- Consistency is security: Replicate ACLs, WAF rules, and TLS certs across nodes. Stale configs = attack surface.
- Key management: HSMs/KMS must be redundant; quorum-based key escrow beats “the one USB in Karen’s desk.”
- DDoS meets HA: Over-provision, use upstream scrubbing/CDN, and make health checks resilient to noisy-but-not-fatal spikes.
- Cascading failure: Rate limits and circuit breakers prevent one slow dependency from taking the fleet down.
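The circuit-breaker idea from that last bullet, boiled down to a minimal sketch (not a production implementation, and not tied to any specific library):

```python
# Minimal circuit breaker: after too many consecutive failures, stop calling
# the slow dependency for a cooldown period and fail fast instead.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 5, reset_after_seconds: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after_seconds
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                raise RuntimeError("circuit open: failing fast instead of piling on")
            self.opened_at = None  # cooldown over; close and let traffic probe again
            self.failures = 0
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the count
        return result
```

Wrap calls to the shaky dependency (something like breaker.call(fetch_profile, user_id), where fetch_profile is whatever flaky call you're protecting) and one slow downstream stops consuming every worker in the fleet.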
Remember our IR logs? In HA land, the same logs power:
- Root cause of failover events (was it a node, a network, or the DB quorum?).
- Forensics on config drift causing uneven security posture across nodes.
- Capacity planning so “Friday traffic spike” doesn’t become “Friday outage.”
Migrating to HA: Baby Steps That Save Careers
- Start with the loudest SPOF (usually the database or the only firewall).
- Introduce a load balancer in front of a stateless service. Make sessions external.
- Split into two zones. Turn off one during a maintenance window. Celebrate when nothing breaks.
- Add health-checked DNS. Lower TTLs. Watch propagation in logs.
- Move to rolling deployments. Your incident channel goes quiet. You taste peace.
Quick Reality Check: HA ≠ DR ≠ Backup
- HA: Survive component/rack/zone failures, seconds to minutes.
- DR: Survive site/region disasters, minutes to hours.
- Backups: Survive oopsies and ransomware, restore hours to days—but clean.
All three are friends. Only one saves you from fat-finger Friday.
Key Takeaways (Tattoo These on Your Brain, Not Your Arm)
- SPOFs are a design choice. Choose not to.
- You can’t scale what you can’t observe. Logs/metrics make HA believable and debuggable.
- State is the final boss. Externalize or replicate it intentionally.
- Test your failovers. A plan not rehearsed is a wish.
- Patch like a pro. Rolling and blue/green keep uptime and security on the same team.
Systems don’t become highly available by accident. They become highly available because someone got tired of waking up.
Next time we’ll combine this with governance and ops: SLAs that mean something, change windows that don’t ruin weekends, and budgets that align with the number of nines you bravely promised.