Service Operation
Delve into the practices required to manage service operations effectively.
Problem Management
Problem Management — The Detective Work of Service Operation (But With Fewer Magnifying Glasses)
"Incident Management puts out fires. Problem Management asks why the forest keeps catching fire."
You already know from Service Operation Overview and Incident Management that incidents are urgent, noisy, and demand immediate action. You also remember from Service Transition that when services move into operations, we do our best to avoid breaking things — but sometimes stuff still breaks. That's where Problem Management slides into the room with a cup of strong coffee and a flowchart.
This piece builds on those earlier lessons: Incident Management is the emergency room; Problem Management is the forensic lab and the prevention team rolled into one. Let's get into the how and why — with examples, steps, artifacts, and the occasional dramatic aside.
What Problem Management actually does (short and spicy)
- Reactive Problem Management: Investigate root causes of incidents that already happened. Stop the same chaos from reappearing.
- Proactive Problem Management: Hunt for patterns, trends, and ticking time bombs before users even notice.
Goal: reduce the number and impact of incidents over time by identifying root causes, creating workarounds, and pushing fixes through Change Management (remember Service Transition?).
How it fits with what you already know
- Incident Management -> restores service quickly. Problem Management -> prevents recurrence.
- Service Transition -> ensures changes are safe to operate. Problem Management -> feeds into Change Management when a permanent fix is needed.
- Continual Service Improvement -> uses Problem Management metrics to show whether the service is getting more stable.
Imagine Incident Management as the ambulance crew. They’ll splint the patient and stop the bleeding. Problem Management is the epidemiologist who figures out the contaminated water source so the community doesn't keep getting sick.
Core activities (the recipe, with fewer weird ingredients)
- Detection & Logging: Problems are raised from recurring incidents, trend analysis, supplier notifications, or proactive scans.
- Categorization & Prioritization: Classify by service, impact, and urgency. Prioritization criteria differ from incidents because you're balancing investigation effort against business value.
- Investigation & Diagnosis: Perform Root Cause Analysis (RCA) using techniques like 5 Whys, Ishikawa (fishbone), or fault-tree analysis.
- Workaround Identification: Provide immediate relief if a permanent fix will take time.
- Raise RFC (Request for Change): If a permanent fix is needed, push a change through Change Management, which is the handoff to Service Transition processes.
- Problem Resolution & Closure: Confirm the fix, update the KEDB (Known Error Database), close the problem record, and update the affected CI records in the CMDB if necessary.
- Major Problem Review: Capture post-resolution lessons learned and feed them into Continual Service Improvement.
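The activities above can be read as a small state machine. Here is a minimal Python sketch of that idea. It does not describe any particular ITSM tool: the state names, the `ALLOWED` transition table, and the `ProblemRecord` class are all assumptions chosen to mirror the list.

```python
from enum import Enum


class State(Enum):
    LOGGED = "logged"
    CATEGORIZED = "categorized"
    INVESTIGATING = "investigating"
    KNOWN_ERROR = "known_error"  # root cause known, workaround documented
    RFC_RAISED = "rfc_raised"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Hypothetical transition table mirroring the activity list above.
# A known error may be resolved directly when no change is required.
ALLOWED = {
    State.LOGGED: {State.CATEGORIZED},
    State.CATEGORIZED: {State.INVESTIGATING},
    State.INVESTIGATING: {State.KNOWN_ERROR},
    State.KNOWN_ERROR: {State.RFC_RAISED, State.RESOLVED},
    State.RFC_RAISED: {State.RESOLVED},
    State.RESOLVED: {State.CLOSED},
    State.CLOSED: set(),
}


class ProblemRecord:
    def __init__(self, problem_id: str):
        self.problem_id = problem_id
        self.state = State.LOGGED
        self.history = [State.LOGGED]

    def advance(self, new_state: State) -> None:
        """Move to new_state, rejecting transitions that skip a step."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.problem_id}: cannot go from "
                f"{self.state.value} to {new_state.value}"
            )
        self.state = new_state
        self.history.append(new_state)
```

Calling `advance(State.CLOSED)` on a freshly logged record raises a `ValueError`, which is the point: a problem cannot be closed before it has been investigated and resolved.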
Roles & responsibilities (who does what)
- Problem Manager: Owns the problem process, coordinates RCA, ensures KEDB is updated.
- Problem Analyst / Technical Lead: Performs deep-dive diagnostics.
- 2nd/3rd Line Support / Vendors: Provide specialized expertise and fixes.
- Change Manager & CAB: Approve and schedule permanent fixes.
Real-world example (the Monday Wi‑Fi mystery)
Scenario: Every Monday at 09:00, the office Wi‑Fi drops for 5–10 minutes and then returns. Users submit dozens of tickets every week.
- Incident Management: Restores network quickly each Monday.
- Problem Management (reactive): Logs the recurring issue, groups incidents, runs an RCA.
- Investigation steps: Correlate logs, check scheduled tasks, review wireless controller health.
- Root cause found: A scheduled backup job on a networked storage device floods the network from 08:59 to 09:06, saturating the uplinks and causing the Wi‑Fi controller to fail over.
- Workaround: Throttle backup bandwidth or reschedule backups.
- Permanent fix: Change network QoS and increase link capacity via a Change.
Result: Reduced repeat incidents and fewer Monday panic emails.
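The "group incidents" step in this scenario can be sketched in code. This is illustrative only: the ticket timestamps are made-up sample data, and the function name `recurring_slots` and its `min_weeks` threshold are assumptions, not part of any ITSM product.

```python
from collections import defaultdict
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def recurring_slots(timestamps, min_weeks=3):
    """Return (weekday, hour) slots with incidents in at least min_weeks distinct weeks."""
    weeks_by_slot = defaultdict(set)
    for ts in timestamps:
        iso = ts.isocalendar()
        slot = (DAYS[ts.weekday()], ts.hour)
        weeks_by_slot[slot].add((iso[0], iso[1]))  # (ISO year, ISO week number)
    return sorted(
        slot for slot, weeks in weeks_by_slot.items() if len(weeks) >= min_weeks
    )


# Made-up incident open times: three Mondays around 09:00 plus one unrelated ticket.
sample = [
    datetime(2026, 2, 2, 9, 1),
    datetime(2026, 2, 2, 9, 3),
    datetime(2026, 2, 9, 9, 0),
    datetime(2026, 2, 16, 9, 2),
    datetime(2026, 2, 18, 14, 30),
]

print(recurring_slots(sample))  # [('Monday', 9)]
```

Counting distinct weeks rather than raw tickets keeps a single noisy outage (dozens of tickets in one morning) from looking like a recurring pattern.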
Known Error Database (KEDB) — your brain in a box
The KEDB is the catalog of problems with their root causes and workarounds. It makes life easier for the Service Desk and speeds incident resolution.
Sample KEDB entry (pseudo-fields):
Problem ID: PRB-2026-0042
Title: Office Wi‑Fi outage during scheduled backups
Root Cause: Backup job saturates uplink causing controller failover
Workaround: Throttle backup or reschedule to 02:00
Permanent Fix: RFC-CHG-2026-078 (QoS + link upgrade)
Status: Resolved (previously Known Error)
Affected CIs: Wireless Controller WLC-01, Uplink Router RT-03
Date Opened: 2026-02-12
Date Resolved: 2026-03-01
This is the stuff that turns chaos into predictable maintenance.
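To show how a Service Desk might use such an entry, here is a rough Python sketch that matches a new incident against the KEDB. The field names follow the sample entry; the `KnownError` class, the `match_known_errors` function, and the idea of matching purely on affected CIs are simplifying assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class KnownError:
    problem_id: str
    title: str
    root_cause: str
    workaround: str
    affected_cis: set = field(default_factory=set)


def match_known_errors(kedb, incident_cis):
    """Return known errors that share at least one CI with the incident."""
    wanted = set(incident_cis)
    return [ke for ke in kedb if ke.affected_cis & wanted]


# One entry, copied from the sample KEDB record above.
kedb = [
    KnownError(
        problem_id="PRB-2026-0042",
        title="Office Wi-Fi outage during scheduled backups",
        root_cause="Backup job saturates uplink causing controller failover",
        workaround="Throttle backup or reschedule to 02:00",
        affected_cis={"WLC-01", "RT-03"},
    )
]

# A new incident is logged against the wireless controller.
for hit in match_known_errors(kedb, ["WLC-01"]):
    print(hit.problem_id, "->", hit.workaround)
```

A real KEDB search would likely also match on symptoms or keywords; CI overlap is simply the easiest signal to demonstrate.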
Quick comparison: Incident vs Problem (because people always confuse these)
| Aspect | Incident | Problem |
|---|---|---|
| Focus | Restore service | Find & fix root cause |
| Timeframe | Immediate | Medium to long term |
| Outcome | Workaround / restore | Root cause fix, KEDB entry |
| Trigger | A disruption reported by one or many users | Repeated incidents or a trend |
Measurements that matter (metrics your manager will ask for)
- Number of problems opened vs closed
- Mean time to identify root cause
- Percentage of incidents caused by known errors
- Number of repeat incidents per month
- Time from root cause diagnosis to RFC submission, and from RFC submission to implementation
These help show whether Problem Management is reducing incident noise or just creating paperwork.
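As a rough illustration, a few of these metrics can be computed from plain records. Everything in this sketch is an assumption: the tuple layout, the made-up dates and flags, and the `problem_metrics` function name. A real report would pull the same fields from your ITSM tool.

```python
from datetime import date
from statistics import mean


def problem_metrics(problems, incident_flags):
    """Summarize problem records and incident flags into a few headline numbers.

    problems: list of (opened, root_cause_found, closed_or_None) dates.
    incident_flags: list of bools, True if the incident matched a known error.
    """
    days_to_root_cause = [(found - opened).days for opened, found, _ in problems]
    return {
        "opened": len(problems),
        "closed": sum(1 for _, _, closed in problems if closed is not None),
        "mean_days_to_root_cause": mean(days_to_root_cause),
        "known_error_incident_pct": 100 * sum(incident_flags) / len(incident_flags),
    }


# Made-up sample data for illustration only.
problems = [
    (date(2026, 1, 5), date(2026, 1, 12), date(2026, 1, 30)),
    (date(2026, 1, 20), date(2026, 1, 23), None),  # still open
    (date(2026, 2, 12), date(2026, 2, 17), date(2026, 3, 1)),
]
incident_flags = [True, False, True, True, False, False, False, True]

print(problem_metrics(problems, incident_flags))
```

With this sample data the function reports 3 problems opened, 2 closed, a mean of 5 days to root cause, and 50% of incidents tied to known errors.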
Common pitfalls (and how to avoid them)
- Treating Problem Management like a ticket depository. Fix: Assign ownership and timelines.
- Not updating the KEDB. Fix: Make updates mandatory before closure.
- No links between Problem and Change records. Fix: Enforce RFC creation for permanent fixes and link each RFC to its problem record.
- Over-investigating low-impact problems. Fix: Prioritize by business impact and likelihood of recurrence.
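Several of these fixes can be enforced as a simple pre-closure check. The sketch below assumes a problem record is a plain dictionary; the keys `owner`, `kedb_updated`, `needs_permanent_fix`, and `rfc_id` are hypothetical names, not fields from a specific tool.

```python
def closure_blockers(problem: dict) -> list:
    """Return reasons a problem record should not be closed yet (empty means OK)."""
    blockers = []
    if not problem.get("owner"):
        blockers.append("no owner assigned")
    if not problem.get("kedb_updated"):
        blockers.append("KEDB not updated")
    if problem.get("needs_permanent_fix") and not problem.get("rfc_id"):
        blockers.append("permanent fix required but no RFC linked")
    return blockers


# Hypothetical record that is missing its KEDB update and RFC link.
draft = {"owner": "problem.manager", "kedb_updated": False,
         "needs_permanent_fix": True, "rfc_id": None}
print(closure_blockers(draft))
```

Returning a list of reasons, rather than a single yes/no, tells the person closing the record exactly what is still missing.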
Closing: The mindset shift (a tiny pep talk)
Problem Management isn't just bureaucratic paperwork disguised as detective work. It's the difference between running on a hamster wheel of incidents and building a calmer, more reliable IT world. In the context of Service Transition, Problem Management closes the loop: when a change causes instability, we fix the change process; when a recurring incident hurts the business, we fix the underlying cause and prevent future pain.
Final thought:
If Incident Management is firefighting, Problem Management is fire prevention. Both are heroic. One just wears a different helmet.
Key takeaways:
- Distinguish incidents from problems and treat each with the right process.
- Use RCA methods and the KEDB to shorten future incident lifecycles.
- Integrate tightly with Change Management and Service Transition for permanent fixes.
Want a mini exercise to lock this in? Pick a recurring issue at your org (printer jam, slow app load, login error). Map it: incidents → problem record → RCA → workaround → RFC. You'll see the arc from chaos to calm — and you might even feel like a superhero.