Incident Response Runbook

Severity, roles, communications, containment, recovery, and evidence capture


Use this runbook when a suspected or confirmed security event could affect production systems, customer data, delivery systems, credentials, or cloud infrastructure.

Severity model

| Severity | Definition | Examples | Response target |
| --- | --- | --- | --- |
| SEV-1 | Active or likely customer-impacting compromise | Confirmed data exfiltration, attacker in production, compromised cloud root/admin | Immediate incident command |
| SEV-2 | High-risk security incident with contained or uncertain impact | Leaked production secret, compromised CI token, exploited internet-facing service | Same business day |
| SEV-3 | Security issue requiring coordinated remediation | Critical vulnerability with exposure, suspicious access, failed control | Next business day |
| SEV-4 | Low-risk issue or policy exception | Non-production secret, stale access, informational scanner finding | Normal backlog |
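
The severity model can be encoded so that triage tooling always returns the same response target for the same answers. The sketch below is one possible encoding, not an official classifier: the `Triage` fields are hypothetical yes/no questions derived from the definitions in the table, and a real decision still belongs to the incident commander.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"
    SEV4 = "SEV-4"


# Response targets copied from the severity table.
RESPONSE_TARGET = {
    Severity.SEV1: "Immediate incident command",
    Severity.SEV2: "Same business day",
    Severity.SEV3: "Next business day",
    Severity.SEV4: "Normal backlog",
}


@dataclass
class Triage:
    # Hypothetical triage answers; the field names are assumptions.
    customer_impact_active_or_likely: bool = False
    high_risk: bool = False  # e.g. leaked production secret, compromised CI token
    needs_coordinated_remediation: bool = False


def classify(triage: Triage) -> Severity:
    """Return the highest severity whose definition matches the answers."""
    if triage.customer_impact_active_or_likely:
        return Severity.SEV1
    if triage.high_risk:
        return Severity.SEV2
    if triage.needs_coordinated_remediation:
        return Severity.SEV3
    return Severity.SEV4


if __name__ == "__main__":
    severity = classify(Triage(high_risk=True))
    print(severity.value, "->", RESPONSE_TARGET[severity])
```

Checking the conditions from most to least severe means an incident that matches several definitions is always assigned the higher severity.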

Roles

| Role | Responsibility |
| --- | --- |
| Incident commander | Owns severity, timeline, decisions, and handoffs |
| Technical lead | Coordinates investigation, containment, eradication, and recovery |
| Communications lead | Manages internal updates, customer drafts, and leadership briefings |
| Evidence owner | Preserves logs, screenshots, tickets, and command history |
| Business owner | Decides customer, legal, regulatory, and contractual escalations |

First 15 minutes

  1. Open a dedicated incident channel and name the incident commander.
  2. Record the initial report: who reported, when, affected systems, suspected impact, and current evidence.
  3. Assign severity and declare whether customer data, production availability, or privileged access may be affected.
  4. Freeze destructive changes (deletions, rebuilds, log rotation) unless they are needed for containment.
  5. Capture volatile evidence: cloud audit events, CI/CD logs, deployment history, identity provider logs, affected pod/host metadata, and relevant application logs.
  6. Decide the first containment action and owner.
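
Step 2 lists the fields an initial report should carry. A minimal sketch of that record is shown below, assuming a hypothetical `InitialReport` class; the field names mirror the step but are not taken from any existing tool. The `missing_fields` helper makes it easy to see what still needs to be collected.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class InitialReport:
    # Fields follow step 2: who reported, when, affected systems,
    # suspected impact, and current evidence. Names are assumptions.
    reporter: str
    reported_at: datetime
    affected_systems: list[str]
    suspected_impact: str
    evidence: list[str] = field(default_factory=list)

    def missing_fields(self) -> list[str]:
        """List the fields that are still empty or ambiguous."""
        missing = []
        if not self.reporter.strip():
            missing.append("reporter")
        if self.reported_at.tzinfo is None:
            # A timestamp without a timezone is ambiguous in a timeline.
            missing.append("reported_at timezone")
        if not self.affected_systems:
            missing.append("affected_systems")
        if not self.suspected_impact.strip():
            missing.append("suspected_impact")
        if not self.evidence:
            missing.append("evidence")
        return missing


if __name__ == "__main__":
    report = InitialReport(
        reporter="on-call engineer",
        reported_at=datetime.now(timezone.utc),
        affected_systems=["ci-runner-pool"],
        suspected_impact="possible CI token exposure",
    )
    print("Still missing:", report.missing_fields())
```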

Investigation checklist

  • What changed recently? Check deployments, infrastructure changes, access grants, and vendor changes.
  • Which identities were used? Include human users, service accounts, workload identities, CI tokens, deploy keys, and break-glass accounts.
  • Which systems trust the affected credential or artifact?
  • What data could the actor read, modify, delete, or export?
  • Are there signs of persistence, lateral movement, or repeated access?
  • Are logs complete enough to establish a timeline?
  • Is the vulnerability still exploitable?
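
One way to approach the question of whether logs are complete enough for a timeline is to look for unusually long silences between consecutive events. The sketch below assumes event timestamps have already been parsed into `datetime` objects (all timezone-aware or all naive); `find_gaps` and the ten-minute threshold are illustrative choices, not a standard.

```python
from datetime import datetime, timedelta


def find_gaps(
    timestamps: list[datetime], max_gap: timedelta
) -> list[tuple[datetime, datetime]]:
    """Return (before, after) pairs where consecutive events are further
    apart than max_gap. A gap may mean missing logs or simply a quiet
    system, so each one needs a human check."""
    ordered = sorted(timestamps)
    return [
        (earlier, later)
        for earlier, later in zip(ordered, ordered[1:])
        if later - earlier > max_gap
    ]


if __name__ == "__main__":
    events = [
        datetime(2024, 1, 1, 10, 0),
        datetime(2024, 1, 1, 10, 5),
        datetime(2024, 1, 1, 11, 30),  # 85 minutes after the previous event
        datetime(2024, 1, 1, 11, 32),
    ]
    for earlier, later in find_gaps(events, timedelta(minutes=10)):
        print(f"No events between {earlier} and {later}")
```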

Common response plays

Leaked production secret

  1. Identify the exact secret, scope, owner, and last known valid use.
  2. Search source control, CI logs, artifact registries, chat, ticket systems, and public paste sites for exposure.
  3. Rotate or revoke the secret using the owner-approved procedure.
  4. Check audit logs from the time of exposure to revocation.
  5. Update dependent workloads and verify successful restart or reload.
  6. Add a prevention action: scanning rule, secret manager reference, or workload identity migration.
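
Step 4 amounts to filtering audit events for the affected credential between exposure and revocation. The following sketch assumes audit logs have already been exported and normalized into a hypothetical `AuditEvent` shape; real cloud and identity provider logs use different field names and need a mapping step first.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEvent:
    # Hypothetical normalized shape; not any provider's real schema.
    timestamp: datetime
    principal: str
    action: str
    credential_id: str


def events_in_exposure_window(
    events: list[AuditEvent],
    credential_id: str,
    exposed_at: datetime,
    revoked_at: datetime,
) -> list[AuditEvent]:
    """Return uses of the credential from exposure to revocation, oldest first."""
    if revoked_at < exposed_at:
        raise ValueError("revoked_at must not be earlier than exposed_at")
    matching = [
        event
        for event in events
        if event.credential_id == credential_id
        and exposed_at <= event.timestamp <= revoked_at
    ]
    return sorted(matching, key=lambda event: event.timestamp)


if __name__ == "__main__":
    utc = timezone.utc
    log = [
        AuditEvent(datetime(2024, 1, 1, 9, 0, tzinfo=utc), "deploy-bot", "read", "key-1"),
        AuditEvent(datetime(2024, 1, 1, 12, 0, tzinfo=utc), "unknown", "list", "key-1"),
        AuditEvent(datetime(2024, 1, 1, 12, 5, tzinfo=utc), "deploy-bot", "read", "key-2"),
    ]
    window = events_in_exposure_window(
        log,
        "key-1",
        exposed_at=datetime(2024, 1, 1, 11, 0, tzinfo=utc),
        revoked_at=datetime(2024, 1, 1, 13, 0, tzinfo=utc),
    )
    for event in window:
        print(event.timestamp.isoformat(), event.principal, event.action)
```

If the exposure time is uncertain, using the earliest plausible time keeps the window conservative.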

Compromised CI/CD token

  1. Disable the token and identify all workflows, repositories, and environments it could access.
  2. Review recent workflow runs, artifact uploads, package publishes, and deployment events.
  3. Revoke derived credentials and rotate environment secrets exposed to affected jobs.
  4. Validate runner isolation, pull request trigger policy, and protected environment rules.
  5. Rebuild affected artifacts from trusted source and redeploy if artifact integrity is uncertain.
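
For step 5, one way to decide which artifacts need a rebuild is to compare the SHA-256 digest of what is deployed against digests recorded from a trusted build. The sketch below assumes such a trusted digest list exists (for example, from a rebuild of a known-good commit); `artifacts_needing_rebuild` is a hypothetical helper, and artifacts are passed as in-memory bytes to keep the example short.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as lowercase hex."""
    return hashlib.sha256(data).hexdigest()


def artifacts_needing_rebuild(
    deployed: dict[str, bytes], trusted_digests: dict[str, str]
) -> list[str]:
    """Return names of artifacts whose digest is unknown or does not match.

    An artifact with no trusted digest is treated as suspect, because its
    integrity cannot be established.
    """
    suspect = []
    for name, content in deployed.items():
        expected = trusted_digests.get(name)
        if expected is None or sha256_hex(content) != expected.lower():
            suspect.append(name)
    return sorted(suspect)


if __name__ == "__main__":
    trusted = {"api.tar": sha256_hex(b"known-good build")}
    deployed = {
        "api.tar": b"known-good build",
        "worker.tar": b"built during the suspect window",
    }
    print("Rebuild:", artifacts_needing_rebuild(deployed, trusted))
```

A matching digest only shows the artifact equals the trusted build; it says nothing about whether the trusted build itself predates the compromise.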

Suspicious cloud administrator activity

  1. Disable or restrict the identity if it is not required for containment.
  2. Export identity provider logs and cloud control-plane logs for the suspected window.
  3. Review new users, keys, roles, policies, network paths, snapshots, images, and forwarding rules.
  4. Check for persistence through access keys, OAuth apps, federated roles, or new CI secrets.
  5. Rotate impacted credentials and validate baseline policy state.
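
Steps 3 and 5 both depend on comparing current state with a known-good baseline. A minimal sketch of that comparison follows, assuming both states have been exported as sets of identifiers per category; the category names and the `diff_against_baseline` helper are illustrative, and producing the exports is provider-specific.

```python
def diff_against_baseline(
    baseline: dict[str, set[str]], current: dict[str, set[str]]
) -> dict[str, dict[str, set[str]]]:
    """Return, per category, the identifiers added or removed since baseline.

    Categories with no change are omitted from the result.
    """
    report = {}
    for category in sorted(baseline.keys() | current.keys()):
        before = baseline.get(category, set())
        after = current.get(category, set())
        added = after - before
        removed = before - after
        if added or removed:
            report[category] = {"added": added, "removed": removed}
    return report


if __name__ == "__main__":
    baseline = {
        "users": {"alice", "bob"},
        "access_keys": {"key-1"},
    }
    current = {
        "users": {"alice", "bob", "svc-temp"},
        "access_keys": {"key-1"},
        "forwarding_rules": {"fwd-external"},
    }
    for category, change in diff_against_baseline(baseline, current).items():
        print(category, "added:", sorted(change["added"]), "removed:", sorted(change["removed"]))
```

Additions are the likelier persistence indicators, but removals matter too, for example a deleted logging policy.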

Communications template

Incident: <short name>
Severity: <SEV-1/2/3/4>
Status: Investigating / Contained / Recovering / Closed
Started: <timestamp and timezone>
Commander: <name>
Known impact: <what is confirmed, not guessed>
Current actions: <owners and next actions>
Next update: <timestamp>
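
Filling the template from structured fields helps keep updates consistent across an incident. The sketch below is one possible renderer: the `StatusUpdate` class is hypothetical, and the choice to reject timestamps without a timezone and to print ISO 8601 is an assumption drawn from the "timestamp and timezone" placeholder.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

STATUSES = ("Investigating", "Contained", "Recovering", "Closed")
SEVERITIES = ("SEV-1", "SEV-2", "SEV-3", "SEV-4")


@dataclass
class StatusUpdate:
    incident: str
    severity: str
    status: str
    started: datetime
    commander: str
    known_impact: str
    current_actions: str
    next_update: datetime

    def render(self) -> str:
        """Render the update in the same field order as the template."""
        if self.severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {self.severity}")
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")
        if self.started.tzinfo is None or self.next_update.tzinfo is None:
            raise ValueError("timestamps must include a timezone")
        return "\n".join(
            [
                f"Incident: {self.incident}",
                f"Severity: {self.severity}",
                f"Status: {self.status}",
                f"Started: {self.started.isoformat()}",
                f"Commander: {self.commander}",
                f"Known impact: {self.known_impact}",
                f"Current actions: {self.current_actions}",
                f"Next update: {self.next_update.isoformat()}",
            ]
        )


if __name__ == "__main__":
    update = StatusUpdate(
        incident="Leaked deploy key",
        severity="SEV-2",
        status="Investigating",
        started=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
        commander="A. Example",
        known_impact="Key exposed in a CI log; no confirmed use",
        current_actions="Platform team rotating key; security reviewing audit logs",
        next_update=datetime(2024, 1, 1, 13, 0, tzinfo=timezone.utc),
    )
    print(update.render())
```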

For external communication, separate confirmed facts from investigation hypotheses. Include what happened, what data or services were affected, what action customers should take, and when the next update will arrive.

Recovery and closure

  • Root cause identified or documented as unknown with rationale.
  • Attack path closed and verified.
  • Credentials rotated or revoked where needed.
  • Artifacts, images, and deployments revalidated.
  • Monitoring added for recurrence indicators.
  • Customer/regulatory notifications completed if required.
  • Corrective actions assigned owners and dates.
  • Post-incident review completed within five business days for SEV-1/SEV-2.
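
The five-business-day target for SEV-1/SEV-2 reviews can be turned into a concrete due date. The sketch below makes two assumptions the runbook does not state: the clock starts on the day the incident is closed, and business days are Monday to Friday with no holiday calendar.

```python
from datetime import date, timedelta
from typing import Optional


def add_business_days(start: date, days: int) -> date:
    """Return the date that is `days` business days after `start`.

    Assumes Monday to Friday are business days; holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0 = Monday ... 4 = Friday
            remaining -= 1
    return current


def review_due(closed_on: date, severity: str) -> Optional[date]:
    """Return the review deadline for SEV-1/SEV-2, or None for lower severities."""
    if severity in ("SEV-1", "SEV-2"):
        return add_business_days(closed_on, 5)
    return None


if __name__ == "__main__":
    # 2024-01-05 is a Friday, so the fifth business day is Friday 2024-01-12.
    print(review_due(date(2024, 1, 5), "SEV-1"))
```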

Post-incident review prompts

  • What signal first identified the incident?
  • What made triage faster or slower?
  • Which controls worked as designed?
  • Which assumptions were wrong?
  • What evidence was missing?
  • What prevention or detection improvement belongs in platform defaults?
