What is Incident Management?
The process of detecting, responding to, and resolving service disruptions to minimize impact and restore normal operations.
Incident management follows a lifecycle: detection (alerts, user reports), triage (severity assessment), response (investigation, mitigation), resolution (fix applied), and post-mortem (blameless retrospective documenting what happened and how to prevent recurrence).
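The lifecycle above can be sketched as a small state machine. This is a minimal illustration, not a reference implementation: the `Incident` class, the `advance` method, and the rule that a severity must be set before leaving triage are all assumptions made for the example, and the severity label used below is hypothetical.

```python
from enum import Enum


class Stage(Enum):
    """Lifecycle stages, in the order the text lists them."""
    DETECTION = "detection"
    TRIAGE = "triage"
    RESPONSE = "response"
    RESOLUTION = "resolution"
    POST_MORTEM = "post-mortem"


# Enum members iterate in definition order, which gives the stage sequence.
LIFECYCLE = list(Stage)


class Incident:
    """Hypothetical incident record that only moves forward through the lifecycle."""

    def __init__(self, title):
        self.title = title
        self.stage = Stage.DETECTION
        self.severity = None  # assigned during triage (assumed rule for this sketch)

    def advance(self):
        """Move to the next stage and return it; raise if no valid move exists."""
        index = LIFECYCLE.index(self.stage)
        if index == len(LIFECYCLE) - 1:
            raise ValueError("incident has already reached post-mortem")
        if self.stage is Stage.TRIAGE and self.severity is None:
            raise ValueError("assign a severity before leaving triage")
        self.stage = LIFECYCLE[index + 1]
        return self.stage
```

A caller would create an `Incident`, call `advance()` to enter triage, set `severity` (for example to a label such as "SEV2"), and keep calling `advance()` as the response progresses. Modelling the stages as an ordered enum keeps an incident from skipping triage or reaching post-mortem before resolution.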
Key roles include the Incident Commander (coordinates the response), the Communications Lead (updates stakeholders), and Subject Matter Experts (carry out the technical investigation). Common tools include PagerDuty, Opsgenie, and Statuspage. Related SRE practices emphasize error budgets and toil reduction.
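As a rough illustration of the error budget idea mentioned above: an error budget is commonly taken to be the amount of unreliability an availability target permits, that is, one minus the target, applied over some window. The sketch below assumes a time-based availability target; the function names and the 30-day window are hypothetical choices for the example.

```python
def error_budget_minutes(slo_target, window_days):
    """Allowed downtime in minutes for an availability target over a window.

    slo_target is a fraction strictly between 0 and 1, e.g. 0.999 for 99.9%.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1 - slo_target) * total_minutes


def budget_remaining(slo_target, window_days, downtime_minutes):
    """Minutes of budget left after incidents; negative means the budget is spent."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes
```

Under these assumptions, a 99.9% target over 30 days allows about 43.2 minutes of downtime, so a single 30-minute incident would consume most of that window's budget.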