Runner monitoring and observability

Prometheus metrics, Grafana dashboards, and alerting for self-hosted runners


Monitoring your self-hosted runners ensures you catch problems before they impact developer productivity. This guide covers metrics collection, dashboards, and alerting across GitHub Actions, GitLab CI, and Bazel Remote Execution runners.

Architecture overview

```
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ GitHub Actions  │   │ GitLab Runner   │   │ Bazel Remote    │
│ Runner          │   │ :9252/metrics   │   │ Execution       │
│ (webhook/export)│   │ (native)        │   │ :9090/metrics   │
└────────┬────────┘   └────────┬────────┘   └────────┬────────┘
         │                     │                     │
         └──────────┬──────────┴─────────┬───────────┘
                    │                    │
             ┌──────▼───────┐     ┌──────▼───────┐
             │  Prometheus  │     │     Node     │
             │   (scrape)   │◄────│   Exporter   │
             └──────┬───────┘     │    :9100     │
                    │             └──────────────┘
             ┌──────▼───────┐
             │   Grafana    │
             │  Dashboards  │
             └──────┬───────┘
                    │
             ┌──────▼───────┐
             │ Alertmanager │
             │ (PagerDuty,  │
             │  Slack, etc) │
             └──────────────┘
```

Prometheus setup

Scrape configuration

Add runner targets to your Prometheus configuration:

```yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "runner_alerts.yml"

scrape_configs:
  # System metrics from all runner hosts
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "runner-01:9100"
          - "runner-02:9100"
          - "runner-03:9100"
        labels:
          role: "ci-runner"

  # GitLab Runner native metrics
  - job_name: "gitlab-runner"
    static_configs:
      - targets:
          - "runner-01:9252"
          - "runner-02:9252"
    metrics_path: /metrics

  # Bazel Remote Execution / BuildBuddy metrics
  - job_name: "bazel-remote"
    static_configs:
      - targets:
          - "bazel-cache-01:9090"
    metrics_path: /metrics

  # GitHub Actions runner metrics (via exporter)
  - job_name: "github-runner-exporter"
    static_configs:
      - targets:
          - "runner-01:9500"
    metrics_path: /metrics
```

Node exporter for system metrics

Install node exporter on every runner host to collect CPU, memory, disk, and network metrics:

```bash
# Install node exporter
sudo useradd --no-create-home --shell /bin/false node_exporter
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
  --collector.systemd \
  --collector.processes
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
```

Platform-specific metrics

GitLab Runner metrics

GitLab Runner exposes Prometheus metrics natively on port 9252. Enable the listener in config.toml or via the command line:

```bash
# Via command line
gitlab-runner run --metrics-server ":9252"

# Via config.toml
# Add listen_address under the global section
```
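For the config.toml route, `listen_address` goes in the global section at the top of the file (the `concurrent` value here is just illustrative):

```toml
# /etc/gitlab-runner/config.toml
concurrent = 4
listen_address = ":9252"  # global section; exposes /metrics on port 9252
```

Restart the runner (or send it SIGHUP) after editing so the metrics server starts.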

Key metrics:

| Metric | Type | Description |
| --- | --- | --- |
| `gitlab_runner_jobs` | gauge | Number of currently running jobs |
| `gitlab_runner_jobs_total` | counter | Total number of processed jobs |
| `gitlab_runner_errors_total` | counter | Total number of errors by type |
| `gitlab_runner_concurrent` | gauge | Current `concurrent` setting |
| `gitlab_runner_limit` | gauge | Current `limit` setting |
| `gitlab_runner_request_concurrency` | gauge | Current number of concurrent requests |
| `gitlab_runner_version_info` | gauge | Runner version (labels: `version`, `revision`) |
| `process_cpu_seconds_total` | counter | Runner process CPU usage |
| `process_resident_memory_bytes` | gauge | Runner process memory usage |

Example PromQL queries:

```promql
# Job throughput (jobs per minute)
rate(gitlab_runner_jobs_total[5m]) * 60

# Error rate
rate(gitlab_runner_errors_total[5m])

# Runner utilization (jobs / concurrent limit)
gitlab_runner_jobs / gitlab_runner_concurrent

# Job duration (if using custom metrics)
histogram_quantile(0.95, rate(gitlab_runner_job_duration_seconds_bucket[5m]))
```

GitHub Actions runner metrics

GitHub Actions runners don't expose a native Prometheus endpoint. Use one of these approaches:

Option 1: GitHub API polling exporter

Create a lightweight exporter that polls the GitHub API for runner status:

```bash
#!/bin/bash
# github-runner-exporter.sh — simple metrics exporter
# Runs as a service, exposes metrics on :9500

REPO="YOUR-ORG/YOUR-REPO"
PAT="YOUR_PAT"
PORT=9500

while true; do
  RUNNERS=$(curl -s -H "Authorization: Bearer $PAT" \
    "https://api.github.com/repos/$REPO/actions/runners")

  TOTAL=$(echo "$RUNNERS" | jq '.total_count')
  ONLINE=$(echo "$RUNNERS" | jq '[.runners[] | select(.status == "online")] | length')
  BUSY=$(echo "$RUNNERS" | jq '[.runners[] | select(.busy == true)] | length')
  IDLE=$((ONLINE - BUSY))

  cat > /tmp/github_runner_metrics << METRICS
# HELP github_runner_total Total registered runners
# TYPE github_runner_total gauge
github_runner_total $TOTAL
# HELP github_runner_online Online runners
# TYPE github_runner_online gauge
github_runner_online $ONLINE
# HELP github_runner_busy Busy runners
# TYPE github_runner_busy gauge
github_runner_busy $BUSY
# HELP github_runner_idle Idle runners
# TYPE github_runner_idle gauge
github_runner_idle $IDLE
METRICS

  sleep 30
done &

# Serve metrics via a simple HTTP server
while true; do
  echo -e "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n$(cat /tmp/github_runner_metrics)" \
    | nc -l -p $PORT -q 1
done
```
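Since the script is meant to run as a service, a systemd unit mirroring the node_exporter one above can supervise it (the install path is an assumption):

```ini
# /etc/systemd/system/github-runner-exporter.service
[Unit]
Description=GitHub Actions runner metrics exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/github-runner-exporter.sh
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```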

Option 2: Webhook-based metrics

Use workflow_job webhooks to track job lifecycle events and export them as Prometheus metrics. This gives you job queue time, execution duration, and failure counts in real time.
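A minimal sketch of the webhook approach, assuming Python for the receiver: it counts `workflow_job` lifecycle events and derives queue time from the payload's `created_at`/`started_at` timestamps. Metric names and the port are illustrative; a production receiver should verify the `X-Hub-Signature-256` header and use the prometheus_client library.

```python
# Sketch: workflow_job webhook receiver exposing Prometheus text metrics.
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Event counts by action, plus observed queue times (started_at - created_at).
EVENTS = {"queued": 0, "in_progress": 0, "completed": 0}
QUEUE_SECONDS = []

GITHUB_TS = "%Y-%m-%dT%H:%M:%SZ"


def handle_event(event):
    """Update counters from one workflow_job webhook payload."""
    action = event.get("action")
    if action in EVENTS:
        EVENTS[action] += 1
    job = event.get("workflow_job") or {}
    # Queue time becomes known when the job transitions to in_progress.
    if action == "in_progress" and job.get("created_at") and job.get("started_at"):
        created = time.mktime(time.strptime(job["created_at"], GITHUB_TS))
        started = time.mktime(time.strptime(job["started_at"], GITHUB_TS))
        QUEUE_SECONDS.append(max(started - created, 0.0))


def render_metrics():
    """Render counters in the Prometheus text exposition format."""
    lines = []
    for action, count in EVENTS.items():
        lines.append('github_workflow_job_events_total{action="%s"} %d' % (action, count))
    lines.append("github_workflow_job_queue_seconds_sum %f" % sum(QUEUE_SECONDS))
    lines.append("github_workflow_job_queue_seconds_count %d" % len(QUEUE_SECONDS))
    return "\n".join(lines) + "\n"


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # GitHub delivers webhook payloads as POST requests.
        length = int(self.headers.get("Content-Length", 0))
        handle_event(json.loads(self.rfile.read(length)))
        self.send_response(204)
        self.end_headers()

    def do_GET(self):
        # Prometheus scrapes GET /metrics.
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)


def main(port=9500):
    HTTPServer(("", port), Handler).serve_forever()
```

Point the GitHub webhook (events: `workflow_job`) at the POST endpoint and add the port to the `github-runner-exporter` scrape job.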

Bazel Remote Execution metrics

BuildBuddy and bazel-remote both expose Prometheus-compatible metrics:

BuildBuddy metrics (port 9090):

| Metric | Description |
| --- | --- |
| `buildbuddy_remote_cache_hit_count` | Cache hits |
| `buildbuddy_remote_cache_miss_count` | Cache misses |
| `buildbuddy_remote_cache_size_bytes` | Total cache size |
| `buildbuddy_invocation_count` | Build invocations |
| `buildbuddy_action_count` | Remote actions executed |
| `buildbuddy_action_queue_length` | Queued actions waiting |

bazel-remote metrics (port 9090):

| Metric | Description |
| --- | --- |
| `bazel_remote_cache_hits` | Cache hit count by type (ac/cas) |
| `bazel_remote_cache_misses` | Cache miss count by type |
| `bazel_remote_disk_cache_size_bytes` | On-disk cache size |
| `bazel_remote_http_request_duration_seconds` | Request latency histogram |

Example PromQL queries:

```promql
# Cache hit rate
buildbuddy_remote_cache_hit_count / (buildbuddy_remote_cache_hit_count + buildbuddy_remote_cache_miss_count)

# Action queue depth (high = need more workers)
buildbuddy_action_queue_length

# Cache size growth rate
rate(buildbuddy_remote_cache_size_bytes[1h])
```

Grafana dashboards

Runner fleet overview dashboard

Create a dashboard that shows the health of your entire runner fleet at a glance:

```json
{
  "dashboard": {
    "title": "CI/CD Runner Fleet",
    "panels": [
      {
        "title": "Runner Fleet Status",
        "type": "stat",
        "targets": [
          {
            "expr": "count(up{job=~\"gitlab-runner|github-runner-exporter|node-exporter\", role=\"ci-runner\"} == 1)",
            "legendFormat": "Online"
          }
        ]
      },
      {
        "title": "Active Jobs",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(gitlab_runner_jobs) + sum(github_runner_busy)",
            "legendFormat": "Running"
          }
        ]
      },
      {
        "title": "CPU Usage by Runner",
        "type": "timeseries",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\", role=\"ci-runner\"}[5m])) * 100)",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Memory Usage by Runner",
        "type": "timeseries",
        "targets": [
          {
            "expr": "(1 - node_memory_MemAvailable_bytes{role=\"ci-runner\"} / node_memory_MemTotal_bytes{role=\"ci-runner\"}) * 100",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Disk Usage by Runner",
        "type": "timeseries",
        "targets": [
          {
            "expr": "100 - (node_filesystem_avail_bytes{mountpoint=\"/\", role=\"ci-runner\"} / node_filesystem_size_bytes{mountpoint=\"/\", role=\"ci-runner\"}) * 100",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "GitLab Job Throughput",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(gitlab_runner_jobs_total[5m]) * 60",
            "legendFormat": "jobs/min"
          }
        ]
      },
      {
        "title": "Bazel Cache Hit Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "buildbuddy_remote_cache_hit_count / (buildbuddy_remote_cache_hit_count + buildbuddy_remote_cache_miss_count) * 100",
            "legendFormat": "Hit %"
          }
        ]
      }
    ]
  }
}
```

Key panels to include

| Panel | Query | Purpose |
| --- | --- | --- |
| Online runners | `count(up{role="ci-runner"} == 1)` | Fleet health |
| CPU saturation | `avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))` | Capacity planning |
| Disk space | `node_filesystem_avail_bytes{mountpoint="/"}` | Prevent disk exhaustion |
| Network I/O | `rate(node_network_receive_bytes_total[5m])` | Bandwidth usage |
| GitLab errors | `rate(gitlab_runner_errors_total[5m])` | Error trends |
| Job queue depth | `gitlab_runner_request_concurrency` | Scaling signals |
| Cache hit rate | Bazel cache hits / (hits + misses) | Build efficiency |
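Heavier panel queries can be precomputed with Prometheus recording rules so dashboards stay fast as the fleet grows; a sketch, with illustrative rule names (add the file to `rule_files` alongside the alert rules):

```yaml
# /etc/prometheus/runner_recording_rules.yml
groups:
  - name: runner-dashboards
    interval: 30s
    rules:
      - record: ci:node_cpu_busy:percent
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle", role="ci-runner"}[5m])) * 100)
      - record: ci:bazel_cache_hit:ratio
        expr: buildbuddy_remote_cache_hit_count / (buildbuddy_remote_cache_hit_count + buildbuddy_remote_cache_miss_count)
```

Panels then query `ci:node_cpu_busy:percent` directly instead of re-evaluating the full expression on every refresh.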

Alerting rules

Prometheus alert rules

```yaml
# /etc/prometheus/runner_alerts.yml
groups:
  - name: runner-health
    interval: 30s
    rules:
      # Runner host is down
      - alert: RunnerHostDown
        expr: up{role="ci-runner"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Runner host {{ $labels.instance }} is down"
          description: "Runner host has been unreachable for more than 2 minutes."

      # High CPU usage
      - alert: RunnerHighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle", role="ci-runner"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on runner {{ $labels.instance }}"
          description: "CPU usage above 90% for 10 minutes. Consider adding more runners."

      # Disk space low
      - alert: RunnerDiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/", role="ci-runner"} / node_filesystem_size_bytes{mountpoint="/", role="ci-runner"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on runner {{ $labels.instance }}"
          description: "Less than 15% disk space remaining. Clean up work directories and Docker images."

      # Disk space critical
      - alert: RunnerDiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/", role="ci-runner"} / node_filesystem_size_bytes{mountpoint="/", role="ci-runner"}) * 100 < 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space on runner {{ $labels.instance }}"
          description: "Less than 5% disk space remaining. Jobs will fail."

      # Memory exhaustion
      - alert: RunnerMemoryHigh
        expr: (1 - node_memory_MemAvailable_bytes{role="ci-runner"} / node_memory_MemTotal_bytes{role="ci-runner"}) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on runner {{ $labels.instance }}"
          description: "Memory usage above 90% for 5 minutes."

  - name: runner-jobs
    interval: 30s
    rules:
      # GitLab runner error rate spike
      - alert: GitLabRunnerHighErrorRate
        expr: rate(gitlab_runner_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on GitLab Runner"
          description: "More than 6 errors per minute for 5 minutes."

      # All runners busy (queue building up)
      - alert: AllRunnersBusy
        expr: gitlab_runner_jobs == gitlab_runner_concurrent
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "All GitLab Runner slots are busy"
          description: "All concurrent job slots are in use for 10+ minutes. Jobs are queuing."

      # Bazel cache hit rate drop
      - alert: BazelCacheHitRateLow
        expr: buildbuddy_remote_cache_hit_count / (buildbuddy_remote_cache_hit_count + buildbuddy_remote_cache_miss_count) < 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Bazel cache hit rate below 50%"
          description: "Cache hit rate has dropped below 50%. Check for cache invalidation or configuration changes."

      # GitHub runners all offline
      - alert: GitHubRunnersOffline
        expr: github_runner_online == 0 and github_runner_total > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "All GitHub Actions runners are offline"
          description: "No online runners detected. Workflows will queue indefinitely."
```

Alertmanager configuration

Route alerts to Slack and PagerDuty:

```yaml
# /etc/alertmanager/alertmanager.yml
global:
  slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

route:
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "slack-warnings"

  routes:
    - match:
        severity: critical
      receiver: "pagerduty-critical"
    - match:
        severity: warning
      receiver: "slack-warnings"

receivers:
  - name: "slack-warnings"
    slack_configs:
      - channel: "#ci-cd-alerts"
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"

  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_KEY"
```

Integration with existing monitoring

If you're already running Prometheus and Grafana (for example, via the project's podman-compose stack), add the runner scrape targets to your existing configuration. The node exporter metrics are standard and work with any existing system dashboards.
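One low-friction way to add targets without editing the main job list for every new host is file-based service discovery; a sketch, with an assumed target-file path (Prometheus re-reads the file on change, no restart needed):

```yaml
# In the existing prometheus.yml
scrape_configs:
  - job_name: "node-exporter"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/runners.yml

# /etc/prometheus/targets/runners.yml
- targets:
    - "runner-01:9100"
    - "runner-02:9100"
  labels:
    role: "ci-runner"
```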

For GlitchTip integration, runner failures can be reported as errors:

```bash
# Report runner failures to GlitchTip (Sentry-compatible)
curl -X POST "http://localhost:8000/api/GLITCHTIP_PROJECT_ID/store/" \
  -H "Content-Type: application/json" \
  -H "X-Sentry-Auth: Sentry sentry_key=YOUR_DSN_KEY" \
  -d '{
    "event_id": "'$(uuidgen | tr -d '-')'",
    "message": "Runner runner-01 is offline",
    "level": "error",
    "tags": {"runner": "runner-01", "platform": "gitlab"}
  }'
```

Next steps