Incident response at ultra-large scale

How can incident response be orchestrated in a manner that accounts for heterogeneous incident response practices in constituent systems and provides situational awareness at the pace and resolution necessary for human-machine decision-making?

Over the past 20 years, significant progress has been made in developing and maturing incident response and recovery capacity, whether delivered by in-house security operations centres (SOCs) or by third-party managed service providers. This capacity is supported by automation and tooling, often in the form of Security Information and Event Management (SIEM) systems that provide real-time information to human operators in a SOC. However, selecting the best response and recovery actions remains a largely human task. Orchestrating incident response at infrastructure scale requires research into the appropriate balance between human and machine decision-making.

Existing standards such as ISO/IEC 27035-2:2023 offer guidelines on how to plan and prepare for incidents and how to learn lessons from them, both in terms of system defences and the incident response approach. Given the high-level nature of such guidance, operationalisation happens through playbooks, which act as recipes for the steps and actions to take during incident response. However, playbooks remain very much a manual affair, often taking the form of natural language texts or flow charts, typically printed and placed in SOCs. Recent works have argued for more systematic model-based representations of playbooks, and have highlighted a) the lack of usability studies of playbooks, and b) a lack of specificity, even in playbooks that experts rate highly for completeness and correctness.

In the infrastructures under discussion, each constituent system will have its own playbook, which is unlikely to be formalised into any structured or systematic common model. Orchestrating a globally coordinated incident response on this scale is, therefore, a major research challenge. It is made even more challenging by dynamism: systems compose with the infrastructure or leave it. Furthermore, constituent systems’ playbooks will change over time in response to incidents. One therefore cannot start from the assumption that the playbooks are convergent or will remain so. The complexity is further compounded by the difficulty of obtaining contextual information in SIEMs: SOC workers are not involved in the design choices, configuration and operation of the specific organisational assets from which telemetry is fed into the SOC. Where contextual information is communicated, this happens informally, so it remains tacit rather than formally documented.