# Runner monitoring and observability

Prometheus metrics, Grafana dashboards, and alerting for self-hosted runners
Monitoring your self-hosted runners ensures you catch problems before they impact developer productivity. This guide covers metrics collection, dashboards, and alerting across GitHub Actions, GitLab CI, and Bazel Remote Execution runners.
## Architecture overview

```
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ GitHub Actions  │   │ GitLab Runner   │   │ Bazel Remote    │
│ Runner          │   │ :9252/metrics   │   │ Execution       │
│ (webhook/export)│   │ (native)        │   │ :9090/metrics   │
└────────┬────────┘   └────────┬────────┘   └────────┬────────┘
         │                     │                     │
         └──────────┬──────────┴──────────┬──────────┘
                    │                     │
             ┌──────▼───────┐      ┌──────▼───────┐
             │  Prometheus  │      │     Node     │
             │   (scrape)   │◄─────│   Exporter   │
             └──────┬───────┘      │    :9100     │
                    │              └──────────────┘
             ┌──────▼───────┐
             │   Grafana    │
             │  Dashboards  │
             └──────┬───────┘
                    │
             ┌──────▼───────┐
             │ Alertmanager │
             │ (PagerDuty,  │
             │  Slack, etc) │
             └──────────────┘
```

## Prometheus setup
### Scrape configuration
Add runner targets to your Prometheus configuration:
```yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "runner_alerts.yml"

scrape_configs:
  # System metrics from all runner hosts
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "runner-01:9100"
          - "runner-02:9100"
          - "runner-03:9100"
        labels:
          role: "ci-runner"

  # GitLab Runner native metrics
  - job_name: "gitlab-runner"
    static_configs:
      - targets:
          - "runner-01:9252"
          - "runner-02:9252"
    metrics_path: /metrics

  # Bazel Remote Execution / BuildBuddy metrics
  - job_name: "bazel-remote"
    static_configs:
      - targets:
          - "bazel-cache-01:9090"
    metrics_path: /metrics

  # GitHub Actions runner metrics (via exporter)
  - job_name: "github-runner-exporter"
    static_configs:
      - targets:
          - "runner-01:9500"
    metrics_path: /metrics
```

### Node exporter for system metrics
Install node exporter on every runner host to collect CPU, memory, disk, and network metrics:
```bash
# Install node exporter
sudo useradd --no-create-home --shell /bin/false node_exporter
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
```

## Platform-specific metrics
### GitLab Runner metrics
GitLab Runner exposes Prometheus metrics natively on port 9252. Enable it in `config.toml` or via the command line:
```bash
# Via command line
gitlab-runner run --listen-address ":9252"

# Via config.toml: add listen_address in the global section
# listen_address = ":9252"
```

Key metrics:
| Metric | Type | Description |
|---|---|---|
| `gitlab_runner_jobs` | gauge | Number of currently running jobs |
| `gitlab_runner_jobs_total` | counter | Total number of processed jobs |
| `gitlab_runner_errors_total` | counter | Total number of errors by type |
| `gitlab_runner_concurrent` | gauge | Current `concurrent` setting |
| `gitlab_runner_limit` | gauge | Current `limit` setting |
| `gitlab_runner_request_concurrency` | gauge | Current number of concurrent requests |
| `gitlab_runner_version_info` | gauge | Runner version (labels: `version`, `revision`) |
| `process_cpu_seconds_total` | counter | Runner process CPU usage |
| `process_resident_memory_bytes` | gauge | Runner process memory usage |
Example PromQL queries:
```promql
# Job throughput (jobs per minute)
rate(gitlab_runner_jobs_total[5m]) * 60

# Error rate
rate(gitlab_runner_errors_total[5m])

# Runner utilization (jobs / concurrent limit)
gitlab_runner_jobs / gitlab_runner_concurrent

# Job duration (if using custom metrics)
histogram_quantile(0.95, rate(gitlab_runner_job_duration_seconds_bucket[5m]))
```

### GitHub Actions runner metrics
GitHub Actions runners don't expose a native Prometheus endpoint. Use one of these approaches:
**Option 1: GitHub API polling exporter**
Create a lightweight exporter that polls the GitHub API for runner status:
```bash
#!/bin/bash
# github-runner-exporter.sh — simple metrics exporter
# Runs as a service, exposes metrics on :9500

REPO="YOUR-ORG/YOUR-REPO"
PAT="YOUR_PAT"
PORT=9500

while true; do
  RUNNERS=$(curl -s -H "Authorization: Bearer $PAT" \
    "https://api.github.com/repos/$REPO/actions/runners")

  TOTAL=$(echo "$RUNNERS" | jq '.total_count')
  ONLINE=$(echo "$RUNNERS" | jq '[.runners[] | select(.status == "online")] | length')
  BUSY=$(echo "$RUNNERS" | jq '[.runners[] | select(.busy == true)] | length')
  IDLE=$((ONLINE - BUSY))

  cat > /tmp/github_runner_metrics << METRICS
# HELP github_runner_total Total registered runners
# TYPE github_runner_total gauge
github_runner_total $TOTAL
# HELP github_runner_online Online runners
# TYPE github_runner_online gauge
github_runner_online $ONLINE
# HELP github_runner_busy Busy runners
# TYPE github_runner_busy gauge
github_runner_busy $BUSY
# HELP github_runner_idle Idle runners
# TYPE github_runner_idle gauge
github_runner_idle $IDLE
METRICS

  sleep 30
done &

# Serve metrics via a simple HTTP server
while true; do
  echo -e "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n$(cat /tmp/github_runner_metrics)" \
    | nc -l -p $PORT -q 1
done
```

**Option 2: Webhook-based metrics**
Use workflow_job webhooks to track job lifecycle events and export them as Prometheus metrics. This gives you job queue time, execution duration, and failure counts in real time.
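As a minimal sketch of the webhook approach (metric names and the port are illustrative, and a production version should verify GitHub's webhook signature), a stdlib-only receiver could count `workflow_job` events and expose them for Prometheus to scrape:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Counters keyed by workflow_job "action"; "failed" is derived from
# completed jobs whose conclusion is "failure".
COUNTERS = {"queued": 0, "in_progress": 0, "completed": 0, "failed": 0}

def apply_event(counters: dict, event: dict) -> dict:
    """Update counters from one workflow_job webhook payload."""
    action = event.get("action")
    if action in counters:
        counters[action] += 1
    if (action == "completed"
            and event.get("workflow_job", {}).get("conclusion") == "failure"):
        counters["failed"] += 1
    return counters

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # GitHub delivers the workflow_job payload as a JSON body
        length = int(self.headers.get("Content-Length", 0))
        apply_event(COUNTERS, json.loads(self.rfile.read(length) or b"{}"))
        self.send_response(204)
        self.end_headers()

    def do_GET(self):
        # Prometheus scrapes this endpoint (add it as a scrape target)
        body = "".join(
            f"github_workflow_job_{name}_total {value}\n"
            for name, value in COUNTERS.items()
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

def main() -> None:
    # Blocks forever; run behind a reverse proxy as a systemd service
    HTTPServer(("", 9500), Handler).serve_forever()
```

Because counters only increase, `rate()` queries over these metrics give job throughput and failure rates without polling the GitHub API.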
### Bazel Remote Execution metrics
BuildBuddy and bazel-remote both expose Prometheus-compatible metrics:
BuildBuddy metrics (port 9090):
| Metric | Description |
|---|---|
| `buildbuddy_remote_cache_hit_count` | Cache hits |
| `buildbuddy_remote_cache_miss_count` | Cache misses |
| `buildbuddy_remote_cache_size_bytes` | Total cache size |
| `buildbuddy_invocation_count` | Build invocations |
| `buildbuddy_action_count` | Remote actions executed |
| `buildbuddy_action_queue_length` | Queued actions waiting |
bazel-remote metrics (port 9090):
| Metric | Description |
|---|---|
| `bazel_remote_cache_hits` | Cache hit count by type (ac/cas) |
| `bazel_remote_cache_misses` | Cache miss count by type |
| `bazel_remote_disk_cache_size_bytes` | On-disk cache size |
| `bazel_remote_http_request_duration_seconds` | Request latency histogram |
```promql
# Cache hit rate
buildbuddy_remote_cache_hit_count / (buildbuddy_remote_cache_hit_count + buildbuddy_remote_cache_miss_count)

# Action queue depth (high = need more workers)
buildbuddy_action_queue_length

# Cache size growth rate
rate(buildbuddy_remote_cache_size_bytes[1h])
```

## Grafana dashboards
### Runner fleet overview dashboard
Create a dashboard that shows the health of your entire runner fleet at a glance:
```json
{
  "dashboard": {
    "title": "CI/CD Runner Fleet",
    "panels": [
      {
        "title": "Runner Fleet Status",
        "type": "stat",
        "targets": [
          {
            "expr": "count(up{job=~\"gitlab-runner|github-runner-exporter|node-exporter\", role=\"ci-runner\"} == 1)",
            "legendFormat": "Online"
          }
        ]
      },
      {
        "title": "Active Jobs",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(gitlab_runner_jobs) + sum(github_runner_busy)",
            "legendFormat": "Running"
          }
        ]
      },
      {
        "title": "CPU Usage by Runner",
        "type": "timeseries",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\", role=\"ci-runner\"}[5m])) * 100)",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Memory Usage by Runner",
        "type": "timeseries",
        "targets": [
          {
            "expr": "(1 - node_memory_MemAvailable_bytes{role=\"ci-runner\"} / node_memory_MemTotal_bytes{role=\"ci-runner\"}) * 100",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Disk Usage by Runner",
        "type": "timeseries",
        "targets": [
          {
            "expr": "100 - (node_filesystem_avail_bytes{mountpoint=\"/\", role=\"ci-runner\"} / node_filesystem_size_bytes{mountpoint=\"/\", role=\"ci-runner\"}) * 100",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "GitLab Job Throughput",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(gitlab_runner_jobs_total[5m]) * 60",
            "legendFormat": "jobs/min"
          }
        ]
      },
      {
        "title": "Bazel Cache Hit Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "buildbuddy_remote_cache_hit_count / (buildbuddy_remote_cache_hit_count + buildbuddy_remote_cache_miss_count) * 100",
            "legendFormat": "Hit %"
          }
        ]
      }
    ]
  }
}
```

### Key panels to include
| Panel | Query | Purpose |
|---|---|---|
| Online runners | `count(up{role="ci-runner"} == 1)` | Fleet health |
| CPU saturation | `avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))` | Capacity planning |
| Disk space | `node_filesystem_avail_bytes{mountpoint="/"}` | Prevent disk exhaustion |
| Network I/O | `rate(node_network_receive_bytes_total[5m])` | Bandwidth usage |
| GitLab errors | `rate(gitlab_runner_errors_total[5m])` | Error trends |
| Job queue depth | `gitlab_runner_request_concurrency` | Scaling signals |
| Cache hit rate | Bazel cache hits / (hits + misses) | Build efficiency |
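To keep the dashboard above in version control rather than hand-editing it in the UI, you can push it through Grafana's dashboard HTTP API (`POST /api/dashboards/db`). This is a sketch: the Grafana URL, token, and filename are placeholders, and it uses only the standard library.

```python
import json
import urllib.request

def make_payload(dashboard: dict) -> bytes:
    # Grafana's dashboard API expects the dashboard nested under
    # "dashboard", with "overwrite" allowing updates in place.
    return json.dumps({"dashboard": dashboard, "overwrite": True}).encode()

def provision(grafana_url: str, token: str, dashboard: dict) -> None:
    req = urllib.request.Request(
        f"{grafana_url}/api/dashboards/db",
        data=make_payload(dashboard),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    urllib.request.urlopen(req)  # raises on non-2xx responses

# Usage (placeholders):
# with open("runner-fleet-dashboard.json") as f:
#     provision("http://grafana.internal:3000", "YOUR_TOKEN",
#               json.load(f)["dashboard"])
```

Running this in CI whenever the dashboard JSON changes keeps Grafana in sync with the repository.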
## Alerting rules
### Prometheus alert rules
```yaml
# /etc/prometheus/runner_alerts.yml
groups:
  - name: runner-health
    interval: 30s
    rules:
      # Runner host is down
      - alert: RunnerHostDown
        expr: up{role="ci-runner"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Runner host {{ $labels.instance }} is down"
          description: "Runner host has been unreachable for more than 2 minutes."

      # High CPU usage
      - alert: RunnerHighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle", role="ci-runner"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on runner {{ $labels.instance }}"
          description: "CPU usage above 90% for 10 minutes. Consider adding more runners."

      # Disk space low
      - alert: RunnerDiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/", role="ci-runner"} / node_filesystem_size_bytes{mountpoint="/", role="ci-runner"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on runner {{ $labels.instance }}"
          description: "Less than 15% disk space remaining. Clean up work directories and Docker images."

      # Disk space critical
      - alert: RunnerDiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/", role="ci-runner"} / node_filesystem_size_bytes{mountpoint="/", role="ci-runner"}) * 100 < 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space on runner {{ $labels.instance }}"
          description: "Less than 5% disk space remaining. Jobs will fail."

      # Memory exhaustion
      - alert: RunnerMemoryHigh
        expr: (1 - node_memory_MemAvailable_bytes{role="ci-runner"} / node_memory_MemTotal_bytes{role="ci-runner"}) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on runner {{ $labels.instance }}"
          description: "Memory usage above 90% for 5 minutes."

  - name: runner-jobs
    interval: 30s
    rules:
      # GitLab runner error rate spike
      - alert: GitLabRunnerHighErrorRate
        expr: rate(gitlab_runner_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on GitLab Runner"
          description: "More than 6 errors per minute for 5 minutes."

      # All runners busy (queue building up)
      - alert: AllRunnersBusy
        expr: gitlab_runner_jobs == gitlab_runner_concurrent
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "All GitLab Runner slots are busy"
          description: "All concurrent job slots are in use for 10+ minutes. Jobs are queuing."

      # Bazel cache hit rate drop
      - alert: BazelCacheHitRateLow
        expr: buildbuddy_remote_cache_hit_count / (buildbuddy_remote_cache_hit_count + buildbuddy_remote_cache_miss_count) < 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Bazel cache hit rate below 50%"
          description: "Cache hit rate has dropped below 50%. Check for cache invalidation or configuration changes."

      # GitHub runners all offline
      - alert: GitHubRunnersOffline
        expr: github_runner_online == 0 and github_runner_total > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "All GitHub Actions runners are offline"
          description: "No online runners detected. Workflows will queue indefinitely."
```

### Alertmanager configuration
Route alerts to Slack and PagerDuty:
```yaml
# /etc/alertmanager/alertmanager.yml
global:
  slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

route:
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "slack-warnings"

  routes:
    - match:
        severity: critical
      receiver: "pagerduty-critical"
    - match:
        severity: warning
      receiver: "slack-warnings"

receivers:
  - name: "slack-warnings"
    slack_configs:
      - channel: "#ci-cd-alerts"
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"

  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_KEY"
```

## Integration with existing monitoring
If you're already running Prometheus and Grafana (for example, via the project's podman-compose stack), add the runner scrape targets to your existing configuration. The node exporter metrics are standard and work with any existing system dashboards.
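If you'd rather not edit the main scrape list every time a runner host is added or retired, Prometheus file-based service discovery is a low-friction option. This is a sketch: the job name and file path are illustrative.

```yaml
# In your existing prometheus.yml: discover runner hosts from a
# separate targets file, which Prometheus re-reads automatically.
scrape_configs:
  - job_name: "ci-runners"
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/runners.yml"

# /etc/prometheus/targets/runners.yml — edit or generate this file;
# no Prometheus reload is needed when it changes.
# - targets: ["runner-01:9100", "runner-02:9100"]
#   labels:
#     role: "ci-runner"
```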
For GlitchTip integration, runner failures can be reported as errors:
```bash
# Report runner failures to GlitchTip (Sentry-compatible)
curl -X POST "http://localhost:8000/api/GLITCHTIP_PROJECT_ID/store/" \
  -H "Content-Type: application/json" \
  -H "X-Sentry-Auth: Sentry sentry_key=YOUR_DSN_KEY" \
  -d '{
    "event_id": "'$(uuidgen | tr -d '-')'",
    "message": "Runner runner-01 is offline",
    "level": "error",
    "tags": {"runner": "runner-01", "platform": "gitlab"}
  }'
```

## Next steps
- Security hardening — Secure your runner infrastructure
- Troubleshooting — Diagnose and fix common runner issues
- Bare metal deployment — Infrastructure-level monitoring setup