Problem Management — Forensics with Sass

Problem Management — The Detective Work of Service Operation (But With Fewer Magnifying Glasses)

"Incident Management puts out fires. Problem Management asks why the forest keeps catching fire."

You already know from Service Operation Overview and Incident Management that incidents are urgent, noisy, and demand immediate action. You also remember from Service Transition that when services move into operations, we do our best to avoid breaking things — but sometimes stuff still breaks. That's where Problem Management slides into the room with a cup of strong coffee and a flowchart.

This piece builds on those earlier lessons: Incident Management is the emergency room; Problem Management is the forensic lab and the prevention team rolled into one. Let's get into the how and why — with examples, steps, artifacts, and the occasional dramatic aside.

What Problem Management actually does (short and spicy)

Reactive Problem Management: Investigate root causes of incidents that already happened. Stop the same chaos from reappearing.
Proactive Problem Management: Hunt for patterns, trends, and ticking time bombs before users even notice.

Goal: reduce the number and impact of incidents over time by identifying root causes, creating workarounds, and pushing fixes through Change Management (remember Service Transition?).

How it fits with what you already know

Incident Management -> restores service quickly. Problem Management -> prevents recurrence.
Service Transition -> ensures changes are safe to operate. Problem Management -> feeds into Change Management when a permanent fix is needed.
Continual Service Improvement -> uses Problem Management metrics to show whether the service is getting more stable.

Imagine Incident Management as the ambulance crew. They’ll splint the patient and stop the bleeding. Problem Management is the epidemiologist who figures out the contaminated water source so the community doesn't keep getting sick.

Core activities (the recipe, with fewer weird ingredients)

Detection & Logging
- Problems are raised from recurring incidents, trend analysis, supplier notifications, or proactive scans.
Categorization & Prioritization
- Classify by service, impact, and urgency. The prioritization criteria differ from those used for incidents because you are balancing investigation effort against business value.
Investigation & Diagnosis
- Root Cause Analysis (RCA): use techniques like 5 Whys, Ishikawa (fishbone), or fault-tree analysis.
Workaround Identification
- Provide immediate relief if a permanent fix will take time.
Raise RFC (Request for Change)
- If a permanent fix is needed, push a change through Change Management — the handoff to Service Transition processes.
Problem Resolution & Closure
- Confirm the fix, update the KEDB (Known Error Database), close the problem record, and update the affected CI records in the CMDB if necessary.
Major Problem Review
- Post-resolution lessons learned; feed into Continual Service Improvement.
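
The prioritization step above can be sketched in a few lines of Python. This is a minimal illustration, not an ITIL-defined formula: the 1 to 3 scales, the `effort_days` field, and the scoring rule (impact times urgency, divided by estimated investigation effort) are all assumptions chosen to show the "effort vs business value" trade-off.

```python
from dataclasses import dataclass

# Hypothetical 1-3 scales; real organizations define their own matrices.
IMPACT = {"low": 1, "medium": 2, "high": 3}
URGENCY = {"low": 1, "medium": 2, "high": 3}


@dataclass
class Problem:
    problem_id: str
    impact: str
    urgency: str
    effort_days: float  # estimated investigation effort (assumed field)


def priority_score(p: Problem) -> float:
    """Business value (impact x urgency) per day of investigation effort."""
    value = IMPACT[p.impact] * URGENCY[p.urgency]
    # Floor the effort so tiny estimates do not dominate the ranking.
    return value / max(p.effort_days, 0.5)


def rank(problems):
    """Return problems ordered from highest to lowest score."""
    return sorted(problems, key=priority_score, reverse=True)


# Illustrative backlog; the IDs are made up.
backlog = [
    Problem("PRB-A", "high", "medium", 5),
    Problem("PRB-B", "medium", "medium", 1),
    Problem("PRB-C", "low", "low", 10),
]
ranked_ids = [p.problem_id for p in rank(backlog)]
print(ranked_ids)
```

In this sketch a cheap, medium-impact problem can outrank an expensive, high-impact one, which is the kind of judgment call the prioritization step is meant to make explicit.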

Roles & responsibilities (who does what)

Problem Manager: Owns the problem process, coordinates RCA, ensures KEDB is updated.
Problem Analyst / Technical Lead: Performs deep-dive diagnostics.
2nd/3rd Line Support / Vendors: Provide specialized expertise and fixes.
Change Manager & CAB: Approve and schedule permanent fixes.

Real-world example (the Monday Wi‑Fi mystery)

Scenario: Every Monday at 09:00, the office Wi‑Fi drops for 5–10 minutes and then returns. Users submit dozens of tickets every week.

Incident Management: Restores network quickly each Monday.
Problem Management (reactive): Logs the recurring issue, groups incidents, runs an RCA.
Investigation steps: Correlate logs, check scheduled tasks, review wireless controller health.
Root cause found: A scheduled backup job on a networked storage device floods the network with traffic from 08:59 to 09:06, saturating the uplinks and triggering a Wi‑Fi controller failover.
Workaround: Throttle backup bandwidth or reschedule the backups.
Permanent fix: Apply a network QoS change and increase link capacity, both raised through Change Management.

Result: Reduced repeat incidents and fewer Monday panic emails.
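
One way to spot a pattern like this is to bucket incident timestamps by weekday and hour. The sketch below assumes you can export incident open times from your ticketing tool; the function name `recurring_slots`, the threshold, and the sample timestamps are all illustrative.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def recurring_slots(timestamps, min_count=3):
    """Count incidents per (weekday, hour) and keep slots seen at least min_count times."""
    counts = Counter((WEEKDAYS[ts.weekday()], ts.hour) for ts in timestamps)
    return [(slot, n) for slot, n in counts.most_common() if n >= min_count]


# Made-up sample data: four Monday-morning incidents plus one unrelated ticket.
sample = [
    datetime(2026, 1, 19, 9, 1),
    datetime(2026, 1, 26, 9, 0),
    datetime(2026, 2, 2, 9, 3),
    datetime(2026, 2, 9, 9, 2),
    datetime(2026, 1, 21, 14, 30),
]
slots = recurring_slots(sample)
print(slots)
```

A slot that keeps showing up is a hint to go looking for scheduled jobs in the same window, which is how the backup job would surface in this scenario.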

Known Error Database (KEDB) — your brain in a box

The KEDB is the catalog of known errors: problems whose root cause has been identified and that have a documented workaround. It makes life easier for the Service Desk and speeds up incident resolution.

Sample KEDB entry (pseudo-fields):

Problem ID: PRB-2026-0042
Title: Office Wi‑Fi outage during scheduled backups
Root Cause: Backup job saturates uplink causing controller failover
Workaround: Throttle backup or reschedule to 02:00
Permanent Fix: RFC-CHG-2026-078 (QoS + link upgrade)
Status: Known Error
Affected CIs: Wireless Controller WLC-01, Uplink Router RT-03
Date Opened: 2026-02-12
Date Resolved: 2026-03-01

This is the stuff that turns chaos into predictable maintenance.
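
To show how the Service Desk might use such an entry, here is a small in-memory sketch. The `KnownError` class, the `find_known_errors` function, and the keyword-matching rule are assumptions for illustration; a real KEDB lives in your ITSM tool and has its own search.

```python
from dataclasses import dataclass, field


@dataclass
class KnownError:
    problem_id: str
    title: str
    root_cause: str
    workaround: str
    affected_cis: list = field(default_factory=list)


# The sample entry from above, reduced to the fields this sketch needs.
KEDB = [
    KnownError(
        problem_id="PRB-2026-0042",
        title="Office Wi-Fi outage during scheduled backups",
        root_cause="Backup job saturates uplink causing controller failover",
        workaround="Throttle backup or reschedule to 02:00",
        affected_cis=["WLC-01", "RT-03"],
    )
]


def find_known_errors(kedb, incident_text, ci=None):
    """Return entries that match the CI or share a keyword with the incident text."""
    # Ignore very short words so "the", "is", and similar do not match.
    words = {w.lower().strip(".,") for w in incident_text.split() if len(w) > 3}
    hits = []
    for entry in kedb:
        title_words = {w.lower() for w in entry.title.split()}
        if (ci and ci in entry.affected_cis) or words & title_words:
            hits.append(entry)
    return hits


matches = find_known_errors(KEDB, "Wi-Fi dropped again this morning")
print([m.workaround for m in matches])
```

The point of the sketch is the workflow: a new incident comes in, the agent checks for a matching known error, and the workaround is applied without a fresh investigation.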

Quick comparison: Incident vs Problem (because people always confuse these)

Aspect	Incident	Problem
Focus	Restore service	Find & fix root cause
Timeframe	Immediate	Medium to long term
Outcome	Workaround / restore	Root cause fix, KEDB entry
Trigger	A disruption reported by one or many users	Repeated incidents or a trend

Measurements that matter (metrics your manager will ask for)

Number of problems opened vs closed
Mean time to identify root cause (abbreviated here as MTTRC)
Percentage of incidents caused by known errors
Number of repeat incidents per month
Time from problem diagnosis to RFC submission, and from RFC submission to implementation

These help show whether Problem Management is reducing incident noise or just creating paperwork.
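
Two of these metrics are easy to compute once records are exported. The sketch below assumes incidents and problems arrive as plain dictionaries; the field names (`known_error_id`, `opened`, `root_cause_found`) and the sample dates are hypothetical, not taken from any particular tool.

```python
from datetime import date
from statistics import mean


def pct_incidents_from_known_errors(incidents):
    """Share of incidents linked to a known error, as a percentage."""
    if not incidents:
        return 0.0
    linked = sum(1 for i in incidents if i.get("known_error_id"))
    return 100.0 * linked / len(incidents)


def mean_days_to_root_cause(problems):
    """Average days from problem opened to root cause identified, or None if no data."""
    durations = [
        (p["root_cause_found"] - p["opened"]).days
        for p in problems
        if p.get("root_cause_found")
    ]
    return mean(durations) if durations else None


# Made-up records for illustration only.
incidents = [
    {"id": "INC-1", "known_error_id": "PRB-2026-0042"},
    {"id": "INC-2", "known_error_id": "PRB-2026-0042"},
    {"id": "INC-3", "known_error_id": "PRB-2026-0042"},
    {"id": "INC-4", "known_error_id": None},
]
problems = [
    {"opened": date(2026, 2, 12), "root_cause_found": date(2026, 2, 19)},
    {"opened": date(2026, 1, 5), "root_cause_found": date(2026, 1, 8)},
    {"opened": date(2026, 3, 2), "root_cause_found": None},  # still under investigation
]
known_error_pct = pct_incidents_from_known_errors(incidents)
avg_days = mean_days_to_root_cause(problems)
print(known_error_pct, avg_days)
```

Problems still under investigation are left out of the average here; whether to count them is a reporting decision worth agreeing on before the numbers reach a dashboard.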

Common pitfalls (and how to avoid them)

Treating Problem Management like a ticket depository. Fix: Assign ownership and timelines.
Not updating the KEDB. Fix: Make updates mandatory before closure.
No links between Problem and Change records. Fix: Enforce RFC creation for permanent fixes.
Over-investigating low-impact problems. Fix: Prioritize by business impact and likelihood of recurrence.
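
Several of these fixes can be enforced as a simple check before a problem record is closed. This is a sketch under stated assumptions: the record is a plain dictionary, and the field names (`owner`, `kedb_updated`, `needs_permanent_fix`, `rfc_id`) are invented for illustration.

```python
def closure_blockers(problem):
    """Return the reasons a problem record should not be closed yet."""
    blockers = []
    if not problem.get("owner"):
        blockers.append("no owner assigned")
    if not problem.get("kedb_updated"):
        blockers.append("KEDB not updated")
    # A permanent fix should always be traceable to a change record.
    if problem.get("needs_permanent_fix") and not problem.get("rfc_id"):
        blockers.append("permanent fix has no linked RFC")
    return blockers


# Hypothetical records: one ready to close, one not.
ready = {
    "owner": "problem.manager",
    "kedb_updated": True,
    "needs_permanent_fix": True,
    "rfc_id": "RFC-CHG-2026-078",
}
not_ready = {"owner": None, "kedb_updated": False, "needs_permanent_fix": True}

print(closure_blockers(ready))
print(closure_blockers(not_ready))
```

Most ITSM tools can express the same rule as a mandatory-field or workflow condition, so the code is mainly a way to spell out what "done" means.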

Closing: The mindset shift (a tiny pep talk)

Problem Management isn't just bureaucratic paperwork disguised as detective work. It's the difference between running on a hamster wheel of incidents and building a calmer, more reliable IT world. In the context of Service Transition, Problem Management closes the loop: when a change causes instability, we fix the change process; when a recurring incident hurts the business, we fix the underlying cause and prevent future pain.

Final thought:

If Incident Management is firefighting, Problem Management is fire prevention. Both are heroic. One just wears a different helmet.

Key takeaways:

Distinguish incidents from problems and treat each with the right process.
Use RCA methods and the KEDB to shorten future incident lifecycles.
Integrate tightly with Change Management and Service Transition for permanent fixes.

Want a mini exercise to lock this in? Pick a recurring issue at your org (printer jam, slow app load, login error). Map it: incidents → problem record → RCA → workaround → RFC. You'll see the arc from chaos to calm — and you might even feel like a superhero.
