Incident Response Runbook
Severity, roles, communications, containment, recovery, and evidence capture
Use this runbook when a suspected or confirmed security event could affect production systems, customer data, delivery systems, credentials, or cloud infrastructure.
Severity model
Preserve evidence
Do not delete logs, rebuild hosts, rotate every credential, or change retention settings before the incident commander confirms evidence capture. Containment matters, but avoid destroying the timeline.
Roles
First 15 minutes
- Open a dedicated incident channel and name the incident commander.
- Record the initial report: who reported, when, affected systems, suspected impact, and current evidence.
- Assign severity and declare whether customer data, production availability, or privileged access may be affected.
- Freeze destructive changes unless needed for containment.
- Capture volatile evidence: cloud audit events, CI/CD logs, deployment history, identity provider logs, affected pod/host metadata, and relevant application logs.
- Decide the first containment action and owner.
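The checklist above can be captured as a structured record so that nothing is left blank in the first update. The following is a minimal sketch, not a prescribed schema: the class name `InitialReport`, its field names, and the `missing_items` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class InitialReport:
    """Hypothetical record of what the first-15-minutes checklist asks for."""
    reporter: str
    reported_at: datetime
    affected_systems: list[str]
    suspected_impact: str
    severity: str                       # label such as "SEV-2"
    commander: str
    customer_data_at_risk: bool = False
    availability_at_risk: bool = False
    privileged_access_at_risk: bool = False
    evidence: list[str] = field(default_factory=list)
    first_containment_action: str = ""
    containment_owner: str = ""

    def missing_items(self) -> list[str]:
        """Return the checklist items that are still blank."""
        missing = []
        if self.reported_at.tzinfo is None:
            missing.append("timezone on report time")
        if not self.affected_systems:
            missing.append("affected systems")
        if not self.evidence:
            missing.append("current evidence")
        if not self.first_containment_action:
            missing.append("first containment action")
        if not self.containment_owner:
            missing.append("containment owner")
        return missing


# Example values are invented for the sketch.
report = InitialReport(
    reporter="on-call engineer",
    reported_at=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    affected_systems=["payments-api"],
    suspected_impact="possible credential exposure",
    severity="SEV-2",
    commander="A. Example",
)
print(report.missing_items())
```

A record like this can be pasted into the incident channel; the list returned by `missing_items` shows the commander what still needs an owner.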
Investigation checklist
- What changed recently? Check deployments, infrastructure changes, access grants, and vendor changes.
- Which identities were used? Include human users, service accounts, workload identities, CI tokens, deploy keys, and break-glass accounts.
- Which systems trust the affected credential or artifact?
- What data could the actor read, modify, delete, or export?
- Are there signs of persistence, lateral movement, or repeated access?
- Are logs complete enough to establish a timeline?
- Is the vulnerability still exploitable?
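One way to approach the question of whether logs are complete enough for a timeline is to merge events from every source and look for long silent stretches. The sketch below assumes events have already been parsed into `(timestamp, source, message)` tuples with timezone-aware timestamps; the function names and the 30-minute threshold are illustrative choices, not part of this runbook.

```python
from datetime import datetime, timedelta, timezone


def build_timeline(*sources):
    """Merge (timestamp, source, message) events from several logs, oldest first."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])


def find_gaps(timeline, max_gap):
    """Return (start, end) pairs where consecutive events are more than max_gap apart."""
    gaps = []
    for previous, current in zip(timeline, timeline[1:]):
        if current[0] - previous[0] > max_gap:
            gaps.append((previous[0], current[0]))
    return gaps


# Invented sample events for the sketch.
t0 = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)
idp_events = [
    (t0, "idp", "login"),
    (t0 + timedelta(minutes=45), "idp", "token issued"),
]
cloud_events = [
    (t0 + timedelta(minutes=5), "cloud-audit", "role attached"),
]

timeline = build_timeline(idp_events, cloud_events)
gaps = find_gaps(timeline, timedelta(minutes=30))
print(gaps)
```

A gap is only a prompt for a question (was logging disabled, delayed, or simply quiet?), not proof that evidence is missing.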
Common response plays
Leaked production secret
- Identify the exact secret, scope, owner, and last known valid use.
- Search source control, CI logs, artifact registries, chat, ticket systems, and paste locations for exposure.
- Rotate or revoke the secret using the owner-approved procedure.
- Check audit logs from the time of exposure to revocation.
- Update dependent workloads and verify successful restart or reload.
- Add a prevention action: scanning rule, secret manager reference, or workload identity migration.
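The audit-log step in this play amounts to filtering secret usage to the window between exposure and revocation, then separating expected callers from everything else. The sketch below assumes audit events are already available as dictionaries with `time`, `secret_id`, and `caller` keys; those key names, the function name, and the sample data are hypothetical.

```python
from datetime import datetime, timezone


def review_secret_use(events, secret_id, exposed_at, revoked_at, expected_callers):
    """Return (uses inside the exposure window, the subset from unexpected callers)."""
    in_window = [
        event for event in events
        if event["secret_id"] == secret_id
        and exposed_at <= event["time"] <= revoked_at
    ]
    unexpected = [e for e in in_window if e["caller"] not in expected_callers]
    return in_window, unexpected


def at(hour, minute):
    """Helper for the invented sample data below."""
    return datetime(2024, 1, 15, hour, minute, tzinfo=timezone.utc)


events = [
    {"time": at(8, 0), "secret_id": "db-prod", "caller": "billing-worker"},
    {"time": at(10, 15), "secret_id": "db-prod", "caller": "billing-worker"},
    {"time": at(10, 40), "secret_id": "db-prod", "caller": "unknown-host"},
    {"time": at(10, 45), "secret_id": "cache-prod", "caller": "unknown-host"},
    {"time": at(12, 30), "secret_id": "db-prod", "caller": "billing-worker"},
]

in_window, unexpected = review_secret_use(
    events,
    secret_id="db-prod",
    exposed_at=at(10, 0),
    revoked_at=at(11, 0),
    expected_callers={"billing-worker"},
)
print(len(in_window), [e["caller"] for e in unexpected])
```

Uses by expected callers still deserve a look, since an attacker may be operating from a trusted workload; the split only helps prioritize.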
Compromised CI/CD token
- Disable the token and identify all workflows, repositories, and environments it could access.
- Review recent workflow runs, artifact uploads, package publishes, and deployment events.
- Revoke derived credentials and rotate environment secrets exposed to affected jobs.
- Validate runner isolation, pull request trigger policy, and protected environment rules.
- Rebuild affected artifacts from trusted source and redeploy if artifact integrity is uncertain.
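When artifact integrity is uncertain, one possible check is to compare SHA-256 digests of published artifacts against a rebuild from trusted source. This sketch assumes the build is reproducible (otherwise digests can differ for benign reasons) and that both sets of artifacts sit in local directories; the function names and file names are illustrative only.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_artifacts(published_dir, rebuilt_dir):
    """Return names of published files that differ from, or are absent in, the rebuild."""
    mismatched = []
    rebuilt = Path(rebuilt_dir)
    published_files = sorted(p for p in Path(published_dir).iterdir() if p.is_file())
    for artifact in published_files:
        candidate = rebuilt / artifact.name
        if not candidate.is_file() or sha256_of(artifact) != sha256_of(candidate):
            mismatched.append(artifact.name)
    return mismatched


# Demonstration with throwaway files; the contents are invented.
with tempfile.TemporaryDirectory() as published, tempfile.TemporaryDirectory() as rebuilt:
    Path(published, "app.tar").write_bytes(b"release-1")
    Path(rebuilt, "app.tar").write_bytes(b"release-1")
    Path(published, "cli.tar").write_bytes(b"tampered")
    Path(rebuilt, "cli.tar").write_bytes(b"release-1")
    mismatched = compare_artifacts(published, rebuilt)

print(mismatched)
```

Any name in the result is a candidate for replacement with the rebuilt artifact and for closer inspection of the workflow run that produced it.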
Suspicious cloud administrator activity
- Disable or restrict the identity unless it is needed to carry out containment.
- Export identity provider logs and cloud control-plane logs for the suspected window.
- Review new users, keys, roles, policies, network paths, snapshots, images, and forwarding rules.
- Check for persistence through access keys, OAuth apps, federated roles, or new CI secrets.
- Rotate impacted credentials and validate baseline policy state.
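Validating baseline policy state can be framed as a set difference between a known-good snapshot and the current role assignments. The sketch below assumes both have been exported into `{identity: set_of_roles}` mappings; the export step is provider-specific and not shown, and the function name and sample identities are hypothetical.

```python
def diff_policy_state(baseline, current):
    """Compare {identity: set_of_roles} mappings and report drift from the baseline."""
    added = {}
    removed = {}
    for identity in baseline.keys() | current.keys():
        before = baseline.get(identity, set())
        after = current.get(identity, set())
        if after - before:
            added[identity] = after - before      # grants not in the baseline
        if before - after:
            removed[identity] = before - after    # grants that disappeared
    return added, removed


# Invented snapshots for the sketch.
baseline = {
    "deploy-bot": {"deployer"},
    "alice": {"viewer", "admin"},
}
current = {
    "deploy-bot": {"deployer", "owner"},
    "alice": {"viewer"},
    "tmp-user": {"admin"},
}

added, removed = diff_policy_state(baseline, current)
print(added, removed)
```

Identities that appear only in `added` (such as `tmp-user` here) map to the "new users, keys, roles" review item; removed grants can matter too, for example when an actor strips access from responders.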
Communications template
Incident: <short name>
Severity: <SEV-1/2/3/4>
Status: Investigating / Contained / Recovering / Closed
Started: <timestamp and timezone>
Commander: <name>
Known impact: <what is confirmed, not guessed>
Current actions: <owners and next actions>
Next update: <timestamp>

For external communication, separate confirmed facts from investigation hypotheses. Include what happened, what data or services were affected, what action customers should take, and when the next update will arrive.
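If status updates are generated by tooling, the internal template can be filled in with a small helper that rejects values the template does not list. This is a sketch under assumptions: the function name, parameter names, and the choice of ISO 8601 timestamps are illustrative and not mandated by this runbook.

```python
from datetime import datetime, timezone

STATUSES = ("Investigating", "Contained", "Recovering", "Closed")
SEVERITIES = ("SEV-1", "SEV-2", "SEV-3", "SEV-4")


def render_update(name, severity, status, started, commander,
                  known_impact, current_actions, next_update):
    """Fill in the internal status template; raise ValueError on unlisted values."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity}")
    if status not in STATUSES:
        raise ValueError(f"unknown status: {status}")
    for label, moment in (("started", started), ("next_update", next_update)):
        if moment.tzinfo is None:
            raise ValueError(f"{label} needs a timezone")
    lines = [
        f"Incident: {name}",
        f"Severity: {severity}",
        f"Status: {status}",
        f"Started: {started.isoformat()}",
        f"Commander: {commander}",
        f"Known impact: {known_impact}",
        f"Current actions: {current_actions}",
        f"Next update: {next_update.isoformat()}",
    ]
    return "\n".join(lines)


# Invented example values.
update = render_update(
    name="ci-token-leak",
    severity="SEV-2",
    status="Investigating",
    started=datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
    commander="A. Example",
    known_impact="One CI token exposed in a public build log",
    current_actions="B. Example is disabling the token",
    next_update=datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc),
)
print(update)
```

Requiring timezone-aware timestamps keeps the "Started" and "Next update" fields unambiguous for readers in other regions.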
Recovery and closure
- Root cause identified or documented as unknown with rationale.
- Attack path closed and verified.
- Credentials rotated or revoked where needed.
- Artifacts, images, and deployments revalidated.
- Monitoring added for recurrence indicators.
- Customer/regulatory notifications completed if required.
- Corrective actions assigned owners and dates.
- Post-incident review completed within five business days for SEV-1/SEV-2.
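The five-business-day target for SEV-1/SEV-2 reviews can be turned into a concrete due date. The sketch below makes two assumptions that this runbook does not state: the count starts on the day after whatever start date the team chooses (for example, the closure date), and only weekends are skipped, so public holidays would need separate handling. The function names are hypothetical.

```python
from datetime import date, timedelta

REVIEW_REQUIRED = {"SEV-1", "SEV-2"}


def review_due_date(start, business_days=5):
    """Count forward Monday-to-Friday days from start; holidays are ignored here."""
    current = start
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:   # 0-4 are Monday through Friday
            remaining -= 1
    return current


def review_deadline(severity, start):
    """Return a due date for severities that require a review, else None."""
    if severity in REVIEW_REQUIRED:
        return review_due_date(start)
    return None


# 2024-01-18 is a Thursday, so the count skips one weekend.
due = review_deadline("SEV-1", date(2024, 1, 18))
print(due)
```

Lower severities return `None` here only because the checklist sets no deadline for them; teams may still choose to review those incidents.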
Post-incident review prompts
- What signal first identified the incident?
- What made triage faster or slower?
- Which controls worked as designed?
- Which assumptions were wrong?
- What evidence was missing?
- What prevention or detection improvement belongs in platform defaults?
References
- NIST SP 800-61 Rev. 2 describes the incident handling lifecycle in four phases: preparation; detection and analysis; containment, eradication, and recovery; and post-incident activity.
- CISA Incident Response resources provide federal and industry incident response guidance.
- FIRST CSIRT Services Framework describes incident management services for response teams.