What Is Incident Response?

Incident response (IR) is the organized process an organization follows to detect, contain, and recover from a cybersecurity event that threatens the confidentiality, integrity, or availability of its systems and data. An event is any observable occurrence; an incident is an event (or series of events) that actually harms — or imminently threatens to harm — the business.

Why does a written plan matter? When ransomware is encrypting file servers at 3 a.m., nobody has time to invent a process. A tested incident response plan (IRP) defines who does what, who decides, who to call (legal, regulators, insurers, law enforcement), and how to communicate — before the pressure is on. Improvised response wastes the most valuable resource during an attack: time.

The goal of IR is not only to stop the bleeding but to do so while preserving evidence, limiting damage, and learning enough to prevent a repeat. Good IR is engineering discipline applied to chaos.

The Incident Response Lifecycle

Two frameworks dominate. They use different names but describe the same flow, so teams routinely map one onto the other.

  • NIST SP 800-61 (Revision 2) uses four phases: Preparation; Detection & Analysis; Containment, Eradication & Recovery; and Post-Incident Activity.
  • SANS PICERL uses six: Preparation, Identification, Containment, Eradication, Recovery, and Lessons Learned.

The alignment is direct: NIST's "Detection & Analysis" is SANS's "Identification," NIST's "Post-Incident Activity" is SANS's "Lessons Learned," and NIST collapses SANS's Containment, Eradication, and Recovery into one combined phase. Both are cyclical: what you learn at the end feeds back into preparation.
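The mapping between the two frameworks can be written down as a small lookup table. The sketch below is illustrative only; the names SANS_TO_NIST and nist_to_sans are assumptions made for this example and do not come from either framework.

```python
# Hypothetical lookup: each SANS PICERL phase mapped to the NIST SP 800-61
# phase that covers it. Order follows the lifecycle.
SANS_TO_NIST = {
    "Preparation": "Preparation",
    "Identification": "Detection & Analysis",
    "Containment": "Containment, Eradication & Recovery",
    "Eradication": "Containment, Eradication & Recovery",
    "Recovery": "Containment, Eradication & Recovery",
    "Lessons Learned": "Post-Incident Activity",
}


def nist_to_sans(nist_phase: str) -> list[str]:
    """Return the SANS phases covered by a NIST phase, in lifecycle order."""
    return [sans for sans, nist in SANS_TO_NIST.items() if nist == nist_phase]


if __name__ == "__main__":
    print(nist_to_sans("Containment, Eradication & Recovery"))
    # ['Containment', 'Eradication', 'Recovery']
```

A table like this is mostly useful when tickets or reports written against one framework need to be read by a team that uses the other.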

Preparation

Everything done before an incident: writing and approving the IRP, building the response team, defining severity levels and escalation paths, deploying logging and monitoring, maintaining asset inventories and network diagrams, securing communications channels, and rehearsing. Preparation also means hardening: patching, least privilege, and backups that are tested and kept offline. The work done here largely determines how smoothly the rest of the response goes.
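Severity levels and escalation paths are easier to rehearse when they are written in a form that can be checked. The following is a minimal sketch, assuming a four-level scale and a cumulative escalation path; the level names and contacts are hypothetical placeholders, and a real plan would define its own.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Assumed four-level severity scale; real plans define their own."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Hypothetical escalation path: the contact added at each severity level.
ESCALATION = {
    Severity.LOW: "on-shift SOC analyst",
    Severity.MEDIUM: "incident response lead",
    Severity.HIGH: "legal counsel",
    Severity.CRITICAL: "executive sponsor",
}


def contacts_for(severity: Severity) -> list[str]:
    """Everyone to notify at this severity: its own contact plus all lower levels."""
    return [ESCALATION[level] for level in Severity if level <= severity]


if __name__ == "__main__":
    print(contacts_for(Severity.HIGH))
```

Making the path cumulative is a design choice in this sketch: a higher severity adds people to the call rather than replacing the ones already involved.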

Detection & Analysis (Identification)

Recognizing that something is wrong and determining its scope. Signals come from SIEM alerts, EDR detections, user reports, threat-intel matches, or anomalies in logs. Analysts triage each alert: is it real, how severe, what is affected, and how far has it spread? This phase produces the incident's classification and priority, which drive everything downstream. Accurate scoping is critical — under-scoping leaves attackers in the network; over-scoping wastes effort.
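The triage questions above (is it real, how severe, what is affected, how far has it spread) can be turned into a simple priority rule. This is a sketch under stated assumptions: the 1 to 4 scales, the score thresholds, and the P1/P2/P3 labels are all hypothetical and would be set by each organization's own plan.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    is_true_positive: bool   # analyst verdict: is it real?
    severity: int            # assumed scale: 1 (low) to 4 (critical)
    asset_criticality: int   # assumed scale: 1 (test box) to 4 (core system)
    hosts_affected: int      # how far it has spread


def triage(alert: Alert) -> str:
    """Return an illustrative priority: 'closed', 'P3', 'P2', or 'P1'."""
    if not alert.is_true_positive:
        return "closed"  # false positive: no incident is opened
    score = alert.severity * alert.asset_criticality
    if alert.hosts_affected > 1:
        score += 4  # spread beyond one host raises urgency
    if score >= 12:
        return "P1"
    if score >= 6:
        return "P2"
    return "P3"


if __name__ == "__main__":
    print(triage(Alert(True, severity=2, asset_criticality=2, hosts_affected=3)))
```

The point of the example is that priority depends on scope as well as severity: the same alert moves from P3 to P2 once more than one host is involved.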

Containment, Eradication & Recovery

Containment stops the spread. Eradication removes the attacker's foothold — malware, malicious accounts, web shells, persistence mechanisms. Recovery restores systems to normal operation and confirms they are clean. These are detailed below.

Post-Incident Activity (Lessons Learned)

After the dust settles, the team documents what happened, why, and how the response performed. The output is concrete improvements: new detections, closed gaps, updated runbooks. Skipping this phase leaves the same gaps open and makes a repeat of the incident far more likely.

The Security Operations Center

The SOC is the team — and often the physical or virtual facility — responsible for continuous monitoring, detection, and response. SOCs are typically staffed around the clock and organized in tiers.

  • Tier 1 — Triage analysts: monitor alert queues, perform initial triage, close false positives, and escalate genuine incidents. The front line, highest alert volume.
  • Tier 2 — Incident responders: investigate escalated alerts in depth, correlate across data sources, and lead containment and remediation.
  • Tier 3 — Threat hunters and specialists: proactively search for threats that evade automated detection, perform forensics and malware analysis, and tune detections.
  • Incident response (IR) lead: coordinates major incidents end to end, manages communications, and interfaces with leadership and external parties.

Other common roles include the SOC manager, detection engineers who build and maintain detection rules, and threat intelligence analysts who track adversary behavior.

Core SOC Tooling

  • SIEM (Security Information and Event Management): aggregates and correlates logs from across the environment, applies detection rules, and raises alerts. The SOC's central nervous system.
  • EDR / XDR (Endpoint / Extended Detection and Response): agents on endpoints record process, file, and network activity, detect malicious behavior, and allow remote response such as isolating a host. XDR extends this correlation across endpoints, network, identity, and cloud.
  • SOAR (Security Orchestration, Automation and Response): automates repetitive workflows — enriching alerts, blocking IPs, opening tickets — via playbooks, reducing analyst toil and response time.
  • Threat intelligence: external and internal feeds of known-bad indicators and adversary tactics, used to prioritize and contextualize alerts.
  • Log management: the underlying collection, retention, and search of logs, essential for both detection and post-incident forensics.
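To make the SIEM's "correlate and apply detection rules" role concrete, here is a minimal sketch of one correlation rule: a successful login that follows a burst of failures from the same source address. It is not the rule language of any real SIEM product; the event schema, the five-minute window, and the threshold of five failures are assumptions chosen for illustration.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # assumed correlation window
THRESHOLD = 5                   # assumed number of failures before a success


def correlate(events):
    """Yield (src_ip, user, time) when a login succeeds after THRESHOLD or
    more failures from the same source IP within WINDOW.

    `events` is an iterable of dicts with the hypothetical keys
    time, src_ip, user, outcome, already sorted by time.
    """
    failures = defaultdict(deque)  # src_ip -> timestamps of recent failures
    for ev in events:
        recent = failures[ev["src_ip"]]
        # Drop failures that have aged out of the window.
        while recent and ev["time"] - recent[0] > WINDOW:
            recent.popleft()
        if ev["outcome"] == "failure":
            recent.append(ev["time"])
        elif ev["outcome"] == "success" and len(recent) >= THRESHOLD:
            yield ev["src_ip"], ev["user"], ev["time"]
            recent.clear()  # avoid re-alerting on the same burst


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 3, 0, 0)
    log = [
        {"time": t0 + timedelta(seconds=10 * i), "src_ip": "203.0.113.7",
         "user": "admin", "outcome": "failure"}
        for i in range(5)
    ]
    log.append({"time": t0 + timedelta(seconds=60), "src_ip": "203.0.113.7",
                "user": "admin", "outcome": "success"})
    for alert in correlate(log):
        print("possible brute force:", alert)
```

In a SOAR playbook, an alert like this would typically be enriched and ticketed automatically before an analyst looks at it.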

Detection Concepts

Indicators of compromise (IOCs) are forensic artifacts that suggest an intrusion — malicious file hashes, attacker IP addresses or domains, suspicious registry keys, or unusual outbound connections. IOCs are useful but reactive: they describe attacks already seen.
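IOC matching is, at its simplest, a set lookup against known-bad values. The sketch below assumes a hypothetical log record layout (dest_ip, dest_domain, file_sha256) and a hand-built IOC feed; the addresses and domain come from reserved documentation ranges, and the hash is computed from a harmless demo string rather than taken from any real sample.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """SHA-256 digest of some bytes as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def match_iocs(record: dict, iocs: dict) -> list:
    """Return sorted (field, value) pairs from `record` found in the IOC sets.

    `iocs` maps a record field name to a set of lowercase known-bad values.
    """
    hits = []
    for field, bad_values in iocs.items():
        value = record.get(field)
        if isinstance(value, str) and value.lower() in bad_values:
            hits.append((field, value.lower()))
    return sorted(hits)


if __name__ == "__main__":
    # Hypothetical feed; values are placeholders, not real indicators.
    feed = {
        "dest_ip": {"203.0.113.50"},
        "dest_domain": {"bad.example"},
        "file_sha256": {sha256_hex(b"demo payload")},
    }
    record = {
        "host": "ws-12",
        "dest_ip": "198.51.100.9",
        "dest_domain": "BAD.example",
        "file_sha256": sha256_hex(b"demo payload"),
    }
    print(match_iocs(record, feed))
```

The reactive nature of IOCs shows up directly in the code: a value that is not already in the feed can never match.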

The MITRE ATT&CK framework is a knowledge base of real-world adversary tactics (the "why," such as Persistence or Exfiltration) and techniques (the "how"). Mapping detections to ATT&CK lets a SOC reason about behavior rather than just static indicators, and measure coverage against the ways attackers actually operate.
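Coverage measurement can be sketched as a set comparison between the techniques a SOC cares about and the techniques its detections are mapped to. The technique IDs below are real ATT&CK identifiers, but the selection is a tiny hand-picked subset, and the detection rule names are hypothetical.

```python
# Small hand-picked subset of ATT&CK techniques per tactic, for illustration.
TECHNIQUES_BY_TACTIC = {
    "Initial Access": {"T1566"},         # Phishing
    "Persistence": {"T1547", "T1053"},   # Boot or Logon Autostart Execution; Scheduled Task/Job
    "Exfiltration": {"T1041", "T1048"},  # Exfiltration Over C2 Channel; Over Alternative Protocol
}

# Hypothetical detection rules and the technique each is mapped to.
DETECTIONS = {
    "rule-phish-attachment": "T1566",
    "rule-new-autostart-entry": "T1547",
    "rule-exfil-alt-protocol": "T1048",
}


def coverage(techniques_by_tactic: dict, detections: dict) -> dict:
    """Per tactic: (techniques covered, techniques tracked, sorted list of gaps)."""
    covered = set(detections.values())
    report = {}
    for tactic, techniques in techniques_by_tactic.items():
        hit = techniques & covered
        report[tactic] = (len(hit), len(techniques), sorted(techniques - covered))
    return report


if __name__ == "__main__":
    for tactic, (hit, total, gaps) in coverage(TECHNIQUES_BY_TACTIC, DETECTIONS).items():
        print(f"{tactic}: {hit}/{total} covered, gaps: {gaps}")
```

A count like this says only that a rule exists for a technique, not that the rule is effective, so it is best read as a list of gaps rather than a score.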

Not every alert is an incident. Analysts distinguish true positives (real malicious activity correctly flagged), false positives (benign activity flagged as bad), true negatives (benign activity correctly left alone), and the dangerous false negatives (real attacks that went undetected). Excessive false positives cause alert fatigue, which is itself a security risk.
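These four outcomes combine into standard quality measures for a detection. The sketch below uses the usual definitions of precision, recall, and false positive rate; the counts in the usage example are invented to show how a noisy rule looks in numbers.

```python
def alert_quality(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard measures computed from the four outcome counts.

    precision: share of raised alerts that were real attacks
    recall: share of real attacks that raised an alert
    false_positive_rate: share of benign activity that was flagged
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"precision": precision, "recall": recall, "false_positive_rate": fpr}


if __name__ == "__main__":
    # Invented counts: 200 alerts raised, only 40 of them real.
    print(alert_quality(tp=40, fp=160, tn=9800, fn=10))
```

With these invented counts the rule catches 80% of real attacks, yet only one alert in five is genuine, which is the kind of ratio that produces alert fatigue.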

Two key metrics gauge SOC performance:

  • MTTD (Mean Time to Detect): how long from compromise to discovery.
  • MTTR (Mean Time to Respond): how long from detection to containment or resolution.

Lowering both shrinks the attacker's window of opportunity, which directly limits damage.
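Both metrics are plain averages over incident timestamps. The sketch below assumes each incident record carries three hypothetical fields (compromised, detected, contained) and treats MTTR as detection to containment; the two sample incidents are invented.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_gap(incidents: list, start_key: str, end_key: str) -> timedelta:
    """Average time between two timestamps across a list of incident records."""
    seconds = mean([(i[end_key] - i[start_key]).total_seconds() for i in incidents])
    return timedelta(seconds=seconds)


def mttd(incidents: list) -> timedelta:
    """Mean time from compromise to detection."""
    return mean_gap(incidents, "compromised", "detected")


def mttr(incidents: list) -> timedelta:
    """Mean time from detection to containment."""
    return mean_gap(incidents, "detected", "contained")


if __name__ == "__main__":
    day = datetime(2024, 1, 1)
    sample = [  # invented incidents
        {"compromised": day, "detected": day + timedelta(hours=4),
         "contained": day + timedelta(hours=5)},
        {"compromised": day, "detected": day + timedelta(hours=2),
         "contained": day + timedelta(hours=5)},
    ]
    print("MTTD:", mttd(sample))  # 3:00:00
    print("MTTR:", mttr(sample))  # 2:00:00
```

In practice the compromise time is often only known after forensics, so MTTD is usually computed retrospectively once the investigation has established it.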

Containment, Eradication, and Recovery in Depth

Containment comes in two flavors. Short-term containment stops immediate spread fast: isolating a host from the network, disabling a compromised account, blocking a malicious domain. Long-term containment applies more durable measures, such as temporary firewall rules or a clean replacement system that takes over, while a full rebuild is prepared. A key decision is whether to isolate immediately or to observe the attacker briefly to understand scope, but observation must never come at the cost of letting damage grow.

Eradication removes the root cause: deleting malware, closing the exploited vulnerability, removing attacker-created accounts and persistence, and rotating compromised credentials. If the entry point is not fixed, the attacker simply returns.

Recovery restores systems to production from known-good backups or rebuilds, then validates that they are clean and functioning normally. This phase includes heightened monitoring to confirm the threat is truly gone before declaring the incident closed.

Phase                  | Goal                     | Example Action
Short-term containment | Stop spread now          | Isolate infected endpoint
Long-term containment  | Stabilize during cleanup | Temporary firewall / segmentation rules
Eradication            | Remove the threat        | Wipe malware, patch the exploited flaw
Recovery               | Return to normal safely  | Restore from clean backup, monitor
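Part of the recovery validation step can be automated by comparing restored files against a known-good manifest of hashes. This is a minimal sketch, assuming such a manifest (relative path to SHA-256) was recorded before the incident; the function names are hypothetical, and a clean result shows only that the files match the manifest, not that the whole system is free of the attacker.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(root: Path, manifest: dict) -> dict:
    """Compare files under `root` with a manifest {relative_path: sha256}.

    Returns files that are missing, files whose content differs, and files
    on disk that the manifest does not list.
    """
    result = {"missing": [], "modified": [], "unexpected": []}
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file():
            result["missing"].append(rel)
        elif sha256_file(target) != expected:
            result["modified"].append(rel)
    on_disk = {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}
    result["unexpected"] = sorted(on_disk - set(manifest))
    return result
```

The "unexpected" list is often the most interesting one after an incident, since dropped tools and web shells are files that no manifest ever listed.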

Post-Incident: Learning From the Event

Once recovery is complete, the team conducts a root cause analysis (RCA) to identify the underlying failure — not just the malware, but how it got in and why it was not caught sooner. A lessons learned review (ideally within a week or two, while memory is fresh) documents what worked, what did not, and assigns concrete follow-up actions with owners.

Tabletop exercises are facilitated discussions where the team walks through a simulated incident scenario to test the plan and find gaps — without touching production. Regular tabletops build muscle memory so that, under real pressure, the response is practiced rather than improvised.

Special Considerations for OT/ICS Incident Response

In operational technology (OT) and industrial control system (ICS) environments — power plants, water treatment, manufacturing — the priorities shift. The overriding concern is safety: a response action that protects data in IT could cause physical harm in OT.

  • You cannot always "pull the plug." Isolating or shutting down a controller mid-process can trip equipment, damage machinery, or endanger personnel. Containment must account for the physical process state.
  • Availability often outranks confidentiality. A turbine controller staying online safely can matter more than the secrecy of its data.
  • Engineering must be in the loop. Effective OT response is a partnership between security analysts and process/control engineers who understand what each device does and how stopping it affects the plant.
  • Legacy and fragile systems. Many ICS devices run old, unpatchable software and react badly to aggressive scanning, so even investigation techniques must be chosen carefully.
  • Specialized monitoring. OT-aware tools that understand industrial protocols are needed, since standard IT security tools may miss — or disrupt — control traffic.

The lifecycle phases still apply, but every action is filtered through the question: "Will this keep people and the physical process safe?"
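That safety question can be made an explicit gate in an OT containment runbook. The sketch below is a hypothetical illustration, not an established procedure: the OTAsset fields, the three outcomes, and the rule that nothing happens without engineering approval are all assumptions chosen to mirror the points above.

```python
from dataclasses import dataclass


@dataclass
class OTAsset:
    name: str
    process_running: bool   # is the device controlling a live physical process?
    safe_to_isolate: bool   # judgment supplied by a control engineer, never inferred


def plan_containment(asset: OTAsset, engineer_approved: bool) -> str:
    """Return an illustrative containment decision for an OT device."""
    if not engineer_approved:
        # Engineering must be in the loop before any action is taken.
        return "hold: escalate to process/control engineering"
    if asset.process_running and not asset.safe_to_isolate:
        # Pulling the plug mid-process could cause physical harm.
        return "monitor: keep controller online, restrict access around it"
    return "isolate: disconnect device from the network"


if __name__ == "__main__":
    turbine = OTAsset("turbine-plc-1", process_running=True, safe_to_isolate=False)
    print(plan_containment(turbine, engineer_approved=True))
```

The ordering of the checks is the design point: approval comes first, process state second, and isolation is the outcome only when both allow it.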