Service Design
Learn how to design IT services, processes, and other aspects of service management.
Availability Management — Keep the Lights On (and the Users Happy)
"Availability is not a nice-to-have. It's the scoreboard for whether the business can actually do anything."
You already met the architects: Service Strategy set the vision, Service Catalog Management told the business what you actually offer, and Service Level Management negotiated the rules of the game (SLAs, OLAs, underpinning contracts). Availability Management is the practical coach who makes sure the team actually shows up to play, on time, with working shoes.
What is Availability Management? (short and stubbornly practical)
Availability Management ensures that IT services meet agreed availability targets in a cost-effective way. That sounds obvious — because it is. But it also means turning vague business expectations into measurable designs, controls, and operational behaviors that keep services usable when people need them.
Availability = the ability of a service to perform its agreed function when required. In ITIL terms we move from "we want it up" to "we design, measure, and improve so the service is up 99.95% between 08:00 and 20:00 on weekdays".
Why it matters (a reminder you’ll tell your CFO later)
- Downtime costs money, reputation, and sometimes lives (hello, healthcare systems).
- Availability requirements drive architecture, capacity, backup, and disaster recovery decisions.
- It forces meaningful collaboration: Availability depends on Design, Operations, Supplier Management, Incident and Problem Management, and yes — those SLAs you negotiated.
Ask yourself: if Service Level Management set an availability KPI, who owns achieving it? Availability Management. If Service Catalog declared the service exists during business hours, who designs the mechanisms to respect that? Availability Management.
Core activities — what Availability Management actually does
- Define availability requirements
  - Translate business needs (from SLAs) into technical requirements: uptime windows, acceptable downtime, peak loads.
- Design for availability
  - Architecture choices: redundancy, failover, load balancing, geographic distribution, resilient patterns.
- Implement controls and monitoring
  - Instrumentation, synthetic transactions, alerting thresholds, and dashboards.
- Measure and report
  - Gather metrics, compare against SLAs/OLAs, create management reports.
- Improve proactively
  - Root cause analysis (with Problem Management), design changes, and supplier remediation.
- Manage availability-related documentation
  - Availability plans, maintenance schedules, recovery procedures.
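To make "synthetic transactions" and "alerting thresholds" a little more concrete, here is a minimal sketch of a synthetic check. The function name `run_synthetic_check`, the three status labels, and the 2-second threshold are all assumptions made for illustration; a real probe would wrap an actual login or page request in the `transaction` callable.

```python
import time
from typing import Callable


def run_synthetic_check(
    transaction: Callable[[], None],
    slow_after_s: float = 2.0,  # hypothetical alerting threshold
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Run one scripted user transaction and classify the result.

    Returns "down" if the transaction raises, "degraded" if it succeeds
    but takes longer than slow_after_s, and "ok" otherwise.
    """
    start = clock()
    try:
        transaction()
    except Exception:
        return "down"
    elapsed = clock() - start
    return "degraded" if elapsed > slow_after_s else "ok"


if __name__ == "__main__":
    # Stand-in for a real scripted login; replace with an actual request.
    print(run_synthetic_check(lambda: None))
```

The clock is injectable so the check can be tested without waiting; scheduling the check and routing "degraded" or "down" results to alerting are left out of this sketch.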
Key metrics and formulas (bring a calculator, or a good spreadsheet)
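- MTTF (Mean Time To Failure): average operating time until a failure, used for non-repairable components that are replaced rather than repaired.
- MTBF (Mean Time Between Failures): average time between failures for repairable systems.
- MTTR (Mean Time To Repair): average time to restore service after a failure. ITIL also uses MTRS (Mean Time to Restore Service) for this end-to-end figure.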
Availability is often expressed as:
Availability = MTBF / (MTBF + MTTR)
Or, if you prefer business-speak: availability = uptime / (uptime + downtime).
Example: MTBF = 1000 hours, MTTR = 1 hour -> availability = 1000 / 1001 = 99.900%.
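The formula above is small enough to check in a few lines. This is only a sketch; the function name `availability` is an illustrative choice, and both inputs are assumed to be in the same unit (hours here).

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability as a fraction between 0 and 1."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Same numbers as the example: MTBF = 1000 h, MTTR = 1 h.
    print(f"{availability(1000, 1):.4%}")  # 99.9001%
```

Halving MTTR to 0.5 hours or doubling MTBF to 2000 hours gives the same availability, which is why both levers appear in the table that follows.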
Table: quick mental map
| Metric | What it tells you | How you improve it |
|---|---|---|
| MTTR | How fast you fix stuff | Better runbooks, automation, incident response, redundancy |
| MTBF | How often stuff fails | Better design, replacement of flaky components |
| Availability % | Combined result | Both above + architecture and testing |
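In practice MTBF and MTTR are derived from an outage log rather than given. The sketch below assumes one common convention (MTBF = total uptime divided by the number of failures, MTTR = total downtime divided by the number of failures); the function name and the sample outages are invented for illustration.

```python
from datetime import datetime


def mtbf_mttr(period_start, period_end, outages):
    """Return (mtbf_hours, mttr_hours) from a list of (start, end) outages.

    Assumes outages do not overlap and all fall inside the period.
    """
    if not outages:
        raise ValueError("need at least one outage to compute MTBF and MTTR")
    total_h = (period_end - period_start).total_seconds() / 3600
    down_h = sum((end - start).total_seconds() for start, end in outages) / 3600
    failures = len(outages)
    return (total_h - down_h) / failures, down_h / failures


if __name__ == "__main__":
    # Invented sample data: a 30-day period (720 h) with three outages.
    outages = [
        (datetime(2024, 1, 5, 9), datetime(2024, 1, 5, 10)),     # 1 h
        (datetime(2024, 1, 12, 14), datetime(2024, 1, 12, 16)),  # 2 h
        (datetime(2024, 1, 20, 8), datetime(2024, 1, 20, 11)),   # 3 h
    ]
    mtbf, mttr = mtbf_mttr(datetime(2024, 1, 1), datetime(2024, 1, 31), outages)
    print(mtbf, mttr)  # 238.0 2.0
```

With these numbers, MTBF / (MTBF + MTTR) = 238 / 240, which matches uptime / (uptime + downtime) = 714 / 720.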
Design patterns that actually work (and their trade-offs)
- Redundancy (active-active, active-passive)
  - Pros: reduces single points of failure
  - Cons: cost, complexity, potential for split-brain scenarios
- Failover and replication
  - Pros: continuity across component failure
  - Cons: data consistency challenges, RTO/RPO trade-offs
- Load balancing and elasticity
  - Pros: handles variable demand, reduces overload-related failures
  - Cons: needs smart capacity planning and test scenarios
- Circuit breakers & graceful degradation
  - Pros: prevents cascading failures
  - Cons: requires good design and monitoring for degraded modes
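Of the patterns above, the circuit breaker is the easiest to show in a few lines. The class below is a simplified sketch, not a production implementation: the name `CircuitBreaker`, its parameters, and the fallback-based degradation are all illustrative assumptions, and it ignores thread safety.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and serve a fallback."""

    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    def call(self, func, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                return fallback()  # open: degrade gracefully, skip the call
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open (or re-open) the breaker
            return fallback()
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A caller might wrap a database read as `breaker.call(read_live_data, read_cached_data)`, where both function names are hypothetical. The "Cons" line still applies: the degraded mode needs its own monitoring, or nobody notices the service has been serving the fallback for hours.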
Why trade-offs matter: you can chase five-nines (99.999%) availability, but your budget might stop you at a more realistic 99.9%. Availability Management is where the business asks, "How much are we willing to pay?"
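To see what each extra nine buys, it helps to turn the percentage into allowed downtime. The sketch below assumes a service that is required around the clock for a 365-day year; the function name `allowed_downtime_hours` is illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(target_percent, period_hours=HOURS_PER_YEAR):
    """Downtime permitted by an availability target over a period."""
    return period_hours * (1 - target_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_hours(target) * 60
    print(f"{target}% -> {minutes:,.1f} minutes per year")
```

Under these assumptions, 99.9% allows roughly 8.8 hours of downtime a year, while 99.999% allows a little over 5 minutes. That gap is what the extra redundancy, testing, and staffing have to pay for.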
How it links with other processes (because nothing is an island)
- Service Level Management: SLAs give the targets; Availability Management designs to meet them.
- Service Catalog Management: defines when services are required — the availability window.
- Incident Management: restores service; MTTR is driven here.
- Problem Management: eliminates root causes; improves MTBF.
- Change Management: changes can improve or harm availability — test and control.
- Supplier Management: third-party SLAs and availability obligations must be enforced.
Imagine a chain: Strategy -> Catalog -> SLAs -> Availability Design -> Operations. Break any link and the user is on hold.
Real-world example (because math needs drama)
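A university portal needs 99% availability during enrollment week (08:00–22:00). Measured around the clock over a 30-day month, 99% would allow about 7.2 hours of downtime. Measured only over the enrollment window (7 days × 14 hours = 98 service hours, assuming a seven-day week), it allows just under one hour, so the real tolerance is far lower than the monthly figure suggests.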
Steps Availability Management would take:
- Translate 99% into acceptable downtime during enrollment windows.
- Design for auto-scaling, read replicas for the database, and a maintenance window outside peak.
- Set up synthetic transactions to simulate student logins and detect slowdowns early.
- Define a failover plan and test it during a non-peak day. Update runbooks.
- Track MTTR and MTBF, report to SLM, recommend SLA adjustments or investment as needed.
If the team skips synthetic transactions and testing, they’ll learn the hard way: failures always choose the worst possible time.
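The first step in the list, translating 99% into acceptable downtime for the enrollment window, can be sketched as follows. The seven-day week is an assumption, the function names are illustrative, and the outage durations are invented sample values.

```python
def downtime_budget_minutes(target_percent, days, hours_per_day):
    """Allowed downtime, in minutes, inside an agreed service window."""
    service_minutes = days * hours_per_day * 60
    return service_minutes * (1 - target_percent / 100)


def remaining_budget_minutes(budget_minutes, outage_minutes):
    """Budget left after the outages recorded so far (negative = SLA breached)."""
    return budget_minutes - sum(outage_minutes)


if __name__ == "__main__":
    # Enrollment week: assumed 7 days, 08:00-22:00 = 14 service hours per day.
    budget = downtime_budget_minutes(99, days=7, hours_per_day=14)
    print(round(budget, 1))  # 58.8
    # Invented sample: two outages of 12 and 20 minutes during the window.
    print(round(remaining_budget_minutes(budget, [12, 20]), 1))  # 26.8
```

Tracking the remaining budget during the week gives the team an early signal to freeze risky changes before the target is missed rather than after.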
Contrasting perspectives: perfection vs pragmatism
- "Aim for 99.999% — do whatever it takes." — the technologist who loves redundancy and hates budgets.
- "99% is fine, let’s use the saved money for new features." — the product owner with a roadmap.
Availability Management is the mediator: it shows the cost, risk, and business impact of each step and recommends a cost-effective target aligned to business needs.
Closing — takeaways and a tiny action list
- Availability Management turns SLA targets into real-world designs, measurements, and improvements.
- It’s about both preventing failures (raise MTBF) and fixing them fast (lower MTTR).
- Collaboration is essential: SLM sets the goal, Availability Management provides the plan, Operations executes.
Quick checklist to get going:
- Verify SLA availability targets with SLM.
- Map critical components and single points of failure.
- Define MTBF and MTTR targets and monitoring strategy.
- Build and test failover and recovery procedures.
- Report regularly and feed improvements into Problem and Change Management.
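For the "map critical components and single points of failure" item, a rough estimate of end-to-end availability can help prioritize. The sketch below uses the standard series and parallel combination rules and assumes component failures are independent, which real systems often violate; the component names and availability figures are hypothetical.

```python
from math import prod


def series(*availabilities):
    """All components must be up: each one is a single point of failure."""
    return prod(availabilities)


def parallel(*availabilities):
    """Redundant components: the group is up if at least one member is up."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    web, db = 0.999, 0.995  # hypothetical component availabilities
    print(f"single database:    {series(web, db):.4%}")
    print(f"redundant database: {series(web, parallel(db, db)):.4%}")
```

In this example the database is the weakest link, so making it redundant improves the overall figure far more than tuning the web tier would.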
Final thought: availability isn’t just built into servers or code. It’s built into decisions about money, people, and priorities. Treat it as a strategic guardrail, not a post-mortem hobby.
Version note: This piece builds on Service Strategy, Service Catalog Management, and Service Level Management — use it to move from "what we want" to "how we make it stay working."