Why ICS Incident Response Is Not IT Incident Response

In IT incident response, the primary objectives are confidentiality and data integrity: contain the breach, eradicate the malware, restore systems from backup. In OT/ICS incident response, there is a third dimension that overrides the others: process safety and continuity. Taking a compromised PLC offline to image it for forensic analysis may be the correct IR action, but if that PLC controls a cooling pump for a reactor, taking it offline without a manual override in place could cause a physical safety incident. Every IR decision in an OT environment must be evaluated against its operational and safety consequences, not just its security merit. This requires ICS IR teams to include operations engineers and process safety experts, not just cybersecurity analysts.

The contrast with IT is stark: in IT, you isolate the compromised endpoint immediately (quarantine in EDR, disable AD account, block at NAC). In OT, "isolate" must be carefully defined. Isolating a PLC from the engineering network may be achievable mid-process. Isolating the DCS controller that runs the distillation column is a process shutdown decision that requires operations management approval, a defined shutdown sequence, and potentially regulatory notification. The IR plan must pre-define which OT isolation actions are pre-authorized for the IR team and which require escalation to operations management.
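The split between pre-authorized and escalated isolation actions can be written down as a simple lookup table. The following is a minimal sketch, not a prescribed format: the asset classes, action names, and the `required_authority` helper are all hypothetical, and a real plan would derive its entries from the site's own risk assessment.

```python
from dataclasses import dataclass
from enum import Enum


class Authority(Enum):
    IR_TEAM = "pre-authorized for IR team"
    OPERATIONS_MGMT = "requires operations management approval"


@dataclass(frozen=True)
class IsolationAction:
    asset_class: str   # hypothetical label, e.g. "engineering_workstation"
    action: str        # hypothetical label, e.g. "block_at_switch_acl"
    authority: Authority


# Hypothetical entries for illustration only.
ISOLATION_MATRIX = [
    IsolationAction("engineering_workstation", "block_at_switch_acl", Authority.IR_TEAM),
    IsolationAction("hmi", "disable_switch_port", Authority.IR_TEAM),
    IsolationAction("dcs_controller", "disconnect_from_network", Authority.OPERATIONS_MGMT),
    IsolationAction("sis_controller", "disconnect_from_network", Authority.OPERATIONS_MGMT),
]


def required_authority(asset_class: str, action: str) -> Authority:
    """Return who may approve the action.

    Combinations that are not listed escalate to operations management,
    so an unplanned action is never treated as pre-authorized.
    """
    for entry in ISOLATION_MATRIX:
        if entry.asset_class == asset_class and entry.action == action:
            return entry.authority
    return Authority.OPERATIONS_MGMT


if __name__ == "__main__":
    print(required_authority("hmi", "disable_switch_port").value)
    print(required_authority("dcs_controller", "disconnect_from_network").value)
```

Defaulting unknown combinations to escalation is a deliberate choice in this sketch: during an incident, the absence of a rule should slow the IR team down rather than grant authority.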

IR Phases Applied to OT: Preparation

Preparation is the phase where OT IR programs consistently underinvest. Preparation includes:

  • Maintaining an accurate OT asset inventory (the prerequisite for knowing what has been compromised).
  • Documenting golden-image baselines for engineering workstations and historian servers.
  • Backing up PLC ladder logic and project files to a secure, offline location (Rockwell Automation Logix Designer .ACD project files, Siemens TIA Portal project archives).
  • Creating network baseline documentation (what is normal traffic volume, which IPs communicate with which PLCs).
  • Pre-defining IR roles (who declares an OT incident, who has authority to isolate systems, who coordinates with operations).
  • Conducting tabletop exercises that specifically include operations scenarios.

The ICS Cybersecurity Incident Response Guide published by CISA and ICS-CERT provides a preparation checklist tailored to industrial environments.
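The network baseline documentation described above becomes useful during an incident when it can be compared mechanically against current traffic. The sketch below assumes flows have already been reduced to (source IP, destination IP, destination port) tuples by whatever monitoring tool is in use. The IP addresses and the `new_flows` helper are hypothetical. Port 44818 is the registered EtherNet/IP port and port 102 is used by Siemens S7 communication.

```python
def new_flows(baseline, observed):
    """Return flows present in `observed` but absent from `baseline`, sorted."""
    return sorted(set(observed) - set(baseline))


# Hypothetical baseline: which hosts normally talk to which PLCs.
BASELINE = {
    ("10.10.1.20", "10.10.5.11", 44818),  # engineering workstation -> PLC (EtherNet/IP)
    ("10.10.1.30", "10.10.5.11", 44818),  # HMI -> PLC (EtherNet/IP)
    ("10.10.1.20", "10.10.5.12", 102),    # engineering workstation -> PLC (S7)
}

# Hypothetical flows observed during the incident window.
OBSERVED = {
    ("10.10.1.20", "10.10.5.11", 44818),
    ("10.10.1.30", "10.10.5.11", 44818),
    ("10.10.9.77", "10.10.5.11", 44818),  # a source that is not in the baseline
}

if __name__ == "__main__":
    for src, dst, port in new_flows(BASELINE, OBSERVED):
        print(f"not in baseline: {src} -> {dst}:{port}")
```

A flow missing from the baseline is only a lead for investigation. It may be a newly commissioned device that was never documented.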

Identification: Detecting OT Incidents

OT incidents are detected through multiple channels: OT network monitoring tools (Dragos, Claroty, Nozomi) generating alerts on anomalous PLC traffic or new device appearances; SIEM correlation rules firing on authentication anomalies on OT Windows systems; operations staff reporting unexpected process behavior (valves opening without commands, setpoints changing without operator input, historian alarms spiking); and IT SOC escalations when lateral movement indicators are detected on systems adjacent to OT networks. The challenge is correlating a cybersecurity signal with an operational anomaly. A PLC fault alarm in the DCS may be caused by a sensor failure, a mechanical issue, or a cyberattack that modified control logic. The IR team must coordinate with process engineers to determine whether the operational anomaly is consistent with a cyber cause before initiating full OT IR procedures.
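One way to support that correlation is to pair security alerts with process anomalies that involve the same asset within a short time window. The following is a minimal sketch under stated assumptions: the record fields (`asset`, `time`, `summary`), the ten-minute default window, and the `correlate` function are all hypothetical and do not reflect the data model of any particular monitoring product.

```python
from datetime import datetime, timedelta


def correlate(cyber_alerts, process_events, window=timedelta(minutes=10)):
    """Pair each cyber alert with process events on the same asset within `window`."""
    pairs = []
    for alert in cyber_alerts:
        for event in process_events:
            same_asset = alert["asset"] == event["asset"]
            close_in_time = abs(alert["time"] - event["time"]) <= window
            if same_asset and close_in_time:
                pairs.append((alert["summary"], event["summary"]))
    return pairs


if __name__ == "__main__":
    t0 = datetime(2025, 1, 1, 12, 0)
    alerts = [
        {"asset": "PLC-7", "time": t0, "summary": "logic download from unknown host"},
    ]
    events = [
        {"asset": "PLC-7", "time": t0 + timedelta(minutes=4), "summary": "setpoint changed without operator input"},
        {"asset": "PLC-9", "time": t0 + timedelta(minutes=1), "summary": "sensor fault alarm"},
    ]
    for alert_summary, event_summary in correlate(alerts, events):
        print(f"{alert_summary} <-> {event_summary}")
```

A match only narrows the question. Process engineers still have to judge whether the anomaly is consistent with a cyber cause.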

Containment: OT-Safe Isolation Techniques

OT-safe containment must match the operational state of the process. Options ranked from least to most disruptive:

  • VLAN ACL modification: Block traffic at the switch ACL level to prevent the compromised system from reaching other OT assets, while leaving the system itself operational. This is appropriate when the compromised system is an HMI or engineering workstation that can be isolated from the network without affecting the PLC's autonomous control loop. Most PLCs continue running their control programs even with the network disconnected.
  • Physical network disconnection: Unplug the Ethernet cable or disable the switch port. Appropriate for the same scenarios as VLAN ACL modification but more reliable when there is uncertainty about whether a VLAN ACL will fully block traffic.
  • Process stabilization before isolation: For systems where isolation will cause a process upset, coordinate with operations to bring the process to a stable state (stable setpoints, safe operating conditions) before disconnecting the compromised system. This may require a controlled partial or full process shutdown.
  • Maintain surveillance while planning recovery: In some cases, particularly when the attacker is being actively observed and eradication can be planned, the correct containment action is to monitor without disrupting, while preparing for a coordinated, controlled shutdown and recovery. This requires careful judgment about whether the attacker is in a destructive phase.
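The four options above can be turned into a first-pass decision aid for a playbook. The sketch below is one possible encoding, not a rule from any standard: the function name, the four boolean inputs, and the order in which they are checked are assumptions made for illustration. The final decision still belongs to the IR lead and operations.

```python
def choose_containment(isolation_causes_upset: bool,
                       acl_block_trusted: bool,
                       attacker_observed: bool,
                       destructive_phase: bool) -> str:
    """Suggest one of the four containment options described in the text.

    The order of the checks is an assumption of this sketch.
    """
    # An observed, non-destructive attacker: keep watching and plan recovery.
    if attacker_observed and not destructive_phase:
        return "monitor_and_plan_recovery"
    # Isolation would upset the process: stabilize with operations first.
    if isolation_causes_upset:
        return "stabilize_process_then_isolate"
    # Isolation is safe: use the ACL only if it is trusted to block the traffic.
    if acl_block_trusted:
        return "vlan_acl_block"
    return "physical_disconnect"


if __name__ == "__main__":
    # Compromised HMI, isolation does not affect the control loop, ACL trusted.
    print(choose_containment(False, True, False, False))
    # Compromised controller interface, attacker appears to be in a destructive phase.
    print(choose_containment(True, False, True, True))
```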

Evidence Preservation from OT Systems

Forensic evidence collection in OT must occur without disrupting operations and must account for the ephemeral nature of PLC memory. Key evidence sources:

  • Network traffic captures: PCAP from the OT network sensor SPAN port, covering 72 hours pre-incident plus ongoing capture during IR.
  • PLC project file backup: use the vendor engineering software to export the current ladder logic or function block diagram from PLC memory to a project file. This captures any unauthorized logic modifications.
  • PLC diagnostic logs: many modern PLCs, including Siemens S7-1500 and Rockwell ControlLogix, maintain internal event logs of logic downloads, user authentications, and fault events. Export these before they roll over.
  • Historian export: export process data from the historian for the incident timeframe. This captures setpoint changes, alarm events, and process values that may reveal what the attacker did to the process.
  • Windows memory dump and disk image from compromised engineering workstations, using standard IT forensics tools (FTK Imager, Magnet RAM Capture) during a coordinated maintenance window.
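Once evidence files such as PCAPs, exported PLC projects, and historian exports have been collected, hashing them immediately gives later analysis a way to show the files were not altered. The following is a minimal sketch using only the Python standard library. The manifest fields and the `build_manifest` helper are assumptions for illustration, not a chain-of-custody standard.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large captures do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(evidence_dir, collector):
    """Return one manifest entry per file found under `evidence_dir`."""
    root = Path(evidence_dir)
    entries = []
    for path in sorted(root.rglob("*")):
        if path.is_file():
            entries.append({
                "file": str(path.relative_to(root)),
                "sha256": sha256_file(path),
                "size_bytes": path.stat().st_size,
                "collector": collector,  # hypothetical field: who gathered the file
                "hashed_at_utc": datetime.now(timezone.utc).isoformat(),
            })
    return entries


if __name__ == "__main__":
    import json
    import sys

    # Usage: python manifest.py <evidence_dir> <collector_name>
    print(json.dumps(build_manifest(sys.argv[1], sys.argv[2]), indent=2))
```

Storing the manifest separately from the evidence itself, for example on write-once media, makes the hashes harder to tamper with alongside the files.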

Malware Analysis: Industroyer and Triton Case Studies

Industroyer (CRASHOVERRIDE), attributed to Sandworm (Russian GRU), was deployed in the December 2016 Ukraine power grid attack. It consisted of modular payloads for four industrial protocols: IEC 60870-5-101, IEC 60870-5-104, IEC 61850 (using MMS), and OPC DA. Each module opened connections to substation RTUs using the legitimate protocol and issued commands to open circuit breakers, causing a power outage. The OPC DA module also scraped process data. The malware included a data wiper and port scanner. Analysis (published by ESET and Dragos) demonstrated that the attackers had deep knowledge of IEC protocol standards and the specific substation equipment in use. Recovery required manually resetting all breakers to their correct positions and verifying RTU configuration integrity.

Triton (TRISIS/HatMan) targeted Schneider Electric Triconex Safety Instrumented System (SIS) controllers at a petrochemical facility in Saudi Arabia in 2017. It has been attributed to a Russian government research institute, TsNIIKhM (also transliterated CNIIHM), first by FireEye and later by the U.S. Treasury. The malware communicated with the controllers over the proprietary TriStation protocol and exploited a zero-day in the Tricon controller firmware (covered by advisory ICSA-18-107-02, CVE-2018-8872 and CVE-2018-7522) to reprogram the SIS controllers. The assessed intent was to disable safety functions during a simultaneous cyberattack on the process control system, potentially causing a catastrophic physical accident. A bug in the malware caused the safety systems to fault and initiate a safe shutdown, unintentionally alerting the operators. Forensic analysis required offline examination of the Triconex controller firmware and comparison against known-good snapshots. This incident established that nation-state actors are willing to target safety systems, the hardest red line in ICS security.

Recovery Priorities, CISA CIRCIA Reporting, and Tabletop Exercises

Recovery priority order in OT IR: (1) Safety systems โ€” verify SIS controllers are running unmodified, authenticated firmware before resuming hazardous operations. (2) Emergency shutdown systems. (3) Primary control systems (DCS, SCADA). (4) Historian and monitoring. (5) Engineering workstations. (6) Business-layer OT systems (MES, ERP integration). Each recovery step requires verification that the system is clean (comparison against golden-image backup, PLC logic diff against last-known-good project file) before it is trusted to control a live process.
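The "PLC logic diff against last-known-good project file" check can be approximated with a plain text diff when the controller logic is available as text. The sketch below assumes both the known-good and the current logic have been exported to a text representation by the vendor engineering software. The structured-text snippets and the `logic_diff` helper are hypothetical. Binary project formats need the vendor's own compare tooling instead.

```python
import difflib


def logic_diff(known_good: str, current: str) -> list:
    """Return unified-diff lines between two text exports of PLC logic."""
    return list(difflib.unified_diff(
        known_good.splitlines(),
        current.splitlines(),
        fromfile="known_good",
        tofile="current",
        lineterm="",
    ))


# Hypothetical exported logic for illustration only.
KNOWN_GOOD = """\
IF Tank_Level > 80.0 THEN
    Inlet_Valve := FALSE;
END_IF;
"""

CURRENT = """\
IF Tank_Level > 95.0 THEN
    Inlet_Valve := FALSE;
END_IF;
"""

if __name__ == "__main__":
    changes = logic_diff(KNOWN_GOOD, CURRENT)
    if changes:
        print("\n".join(changes))
    else:
        print("logic matches last-known-good export")
```

An empty diff shows only that the exported text matches. It does not verify firmware or any part of the controller state that the export omits.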

CIRCIA (the Cyber Incident Reporting for Critical Infrastructure Act of 2022), administered by CISA, will require covered entities (critical infrastructure owners and operators in 16 sectors including energy, water, and chemical) to report substantial cyber incidents to CISA within 72 hours of reasonably believing that such an incident has occurred, and ransomware payments within 24 hours of making the payment. Final rules are pending (expected 2025 to 2026), but preparing for CIRCIA reporting now means: establish an incident detection and classification procedure that determines whether an incident is "substantial" under the definitions (significant disruption to operations, unauthorized access to OT systems), designate a CIRCIA reporting contact, and draft report templates. Voluntary reports through CISA's online reporting portal (the function formerly branded ICS-CERT) remain available for non-mandatory reporting. Tabletop exercises for OT environments should include: an initial alarm scenario (historian showing anomalous setpoint changes), an escalating IR scenario (discovery of malware on an engineering workstation), and a worst-case scenario (safety system fault coincident with process upset) to test the decision-making, communication, and authority escalation processes before a real incident occurs.
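The 72-hour and 24-hour CIRCIA windows mentioned in this section are easy to miscount under pressure, so an IR runbook can compute them when the clock starts. The sketch below assumes timezone-aware UTC timestamps. The function and key names are hypothetical, and the trigger events should be checked against the final rule once it is published.

```python
from datetime import datetime, timedelta, timezone

INCIDENT_REPORT_WINDOW = timedelta(hours=72)  # substantial cyber incident
RANSOM_REPORT_WINDOW = timedelta(hours=24)    # ransomware payment


def reporting_deadlines(incident_determined_at, ransom_paid_at=None):
    """Return the reporting deadlines implied by the two CIRCIA windows."""
    deadlines = {
        "incident_report_due": incident_determined_at + INCIDENT_REPORT_WINDOW,
    }
    if ransom_paid_at is not None:
        deadlines["ransom_report_due"] = ransom_paid_at + RANSOM_REPORT_WINDOW
    return deadlines


if __name__ == "__main__":
    determined = datetime(2025, 3, 1, 8, 0, tzinfo=timezone.utc)
    paid = datetime(2025, 3, 2, 12, 0, tzinfo=timezone.utc)
    for name, due in reporting_deadlines(determined, paid).items():
        print(f"{name}: {due.isoformat()}")
```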