# Runner monitoring and observability

Prometheus metrics, Grafana dashboards, and alerting for self-hosted runners
Monitoring your self-hosted runners ensures you catch problems before they impact developer productivity. This guide covers metrics collection, dashboards, and alerting across GitHub Actions, GitLab CI, and Bazel Remote Execution runners.
## Architecture overview

```
┌─────────────────┐   ┌─────────────────┐   ┌─────────────────┐
│ GitHub Actions  │   │ GitLab Runner   │   │ Bazel Remote    │
│ Runner          │   │ :9252/metrics   │   │ Execution       │
│ (webhook/export)│   │ (native)        │   │ :9090/metrics   │
└────────┬────────┘   └────────┬────────┘   └────────┬────────┘
         │                     │                     │
         └──────────┬──────────┴──────────┬──────────┘
                    │                     │
             ┌──────▼───────┐      ┌──────▼───────┐
             │  Prometheus  │      │     Node     │
             │   (scrape)   │◄─────│   Exporter   │
             └──────┬───────┘      │    :9100     │
                    │              └──────────────┘
             ┌──────▼───────┐
             │   Grafana    │
             │  Dashboards  │
             └──────┬───────┘
                    │
             ┌──────▼───────┐
             │ Alertmanager │
             │ (PagerDuty,  │
             │  Slack, etc) │
             └──────────────┘
```

## Prometheus setup
### Scrape configuration
Add runner targets to your Prometheus configuration:
```yaml
# /etc/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "runner_alerts.yml"

scrape_configs:
  # System metrics from all runner hosts
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "runner-01:9100"
          - "runner-02:9100"
          - "runner-03:9100"
        labels:
          role: "ci-runner"

  # GitLab Runner native metrics
  - job_name: "gitlab-runner"
    static_configs:
      - targets:
          - "runner-01:9252"
          - "runner-02:9252"
    metrics_path: /metrics

  # Bazel Remote Execution / BuildBuddy metrics
  - job_name: "bazel-remote"
    static_configs:
      - targets:
          - "bazel-cache-01:9090"
    metrics_path: /metrics

  # GitHub Actions runner metrics (via exporter)
  - job_name: "github-runner-exporter"
    static_configs:
      - targets:
          - "runner-01:9500"
    metrics_path: /metrics
```

### Node exporter for system metrics
Install node exporter on every runner host to collect CPU, memory, disk, and network metrics:
```bash
# Install node exporter
sudo useradd --no-create-home --shell /bin/false node_exporter
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo cp node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
After=network.target

[Service]
Type=simple
User=node_exporter
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
```

## Platform-specific metrics
### GitLab Runner metrics
GitLab Runner exposes Prometheus metrics natively on port 9252. Enable it in `config.toml` or via the command line:
```bash
# Via command line
gitlab-runner run --listen-address ":9252"

# Via config.toml: add listen_address in the global section
# listen_address = ":9252"
```

Key metrics:
| Metric | Type | Description |
|---|---|---|
| `gitlab_runner_jobs` | gauge | Number of currently running jobs |
| `gitlab_runner_jobs_total` | counter | Total number of processed jobs |
| `gitlab_runner_errors_total` | counter | Total number of errors by type |
| `gitlab_runner_concurrent` | gauge | Current `concurrent` setting |
| `gitlab_runner_limit` | gauge | Current `limit` setting |
| `gitlab_runner_request_concurrency` | gauge | Current number of concurrent requests |
| `gitlab_runner_version_info` | gauge | Runner version (labels: `version`, `revision`) |
| `process_cpu_seconds_total` | counter | Runner process CPU usage |
| `process_resident_memory_bytes` | gauge | Runner process memory usage |
Example PromQL queries:
```promql
# Job throughput (jobs per minute)
rate(gitlab_runner_jobs_total[5m]) * 60

# Error rate
rate(gitlab_runner_errors_total[5m])

# Runner utilization (jobs / concurrent limit)
gitlab_runner_jobs / gitlab_runner_concurrent

# Job duration (if using custom metrics)
histogram_quantile(0.95, rate(gitlab_runner_job_duration_seconds_bucket[5m]))
```

### GitHub Actions runner metrics
GitHub Actions runners don't expose a native Prometheus endpoint. Use one of these approaches:
**Option 1: GitHub API polling exporter**
Create a lightweight exporter that polls the GitHub API for runner status:
```bash
#!/bin/bash
# github-runner-exporter.sh — simple metrics exporter
# Runs as a service, exposes metrics on :9500

REPO="YOUR-ORG/YOUR-REPO"
PAT="YOUR_PAT"
PORT=9500

while true; do
  RUNNERS=$(curl -s -H "Authorization: Bearer $PAT" \
    "https://api.github.com/repos/$REPO/actions/runners")

  TOTAL=$(echo "$RUNNERS" | jq '.total_count')
  ONLINE=$(echo "$RUNNERS" | jq '[.runners[] | select(.status == "online")] | length')
  BUSY=$(echo "$RUNNERS" | jq '[.runners[] | select(.busy == true)] | length')
  IDLE=$((ONLINE - BUSY))

  cat > /tmp/github_runner_metrics << METRICS
# HELP github_runner_total Total registered runners
# TYPE github_runner_total gauge
github_runner_total $TOTAL
# HELP github_runner_online Online runners
# TYPE github_runner_online gauge
github_runner_online $ONLINE
# HELP github_runner_busy Busy runners
# TYPE github_runner_busy gauge
github_runner_busy $BUSY
# HELP github_runner_idle Idle runners
# TYPE github_runner_idle gauge
github_runner_idle $IDLE
METRICS

  sleep 30
done &

# Serve metrics via a simple HTTP server
while true; do
  echo -e "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\n$(cat /tmp/github_runner_metrics)" \
    | nc -l -p $PORT -q 1
done
```

**Option 2: Webhook-based metrics**
Use workflow_job webhooks to track job lifecycle events and export them as Prometheus metrics. This gives you job queue time, execution duration, and failure counts in real time.
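As a minimal sketch of the webhook approach (metric names and the port are illustrative, and a production version should verify GitHub's webhook signature), a stdlib-only receiver could count `workflow_job` events and expose them for Prometheus to scrape:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Counters keyed by workflow_job "action"; "failed" is derived from
# completed jobs whose conclusion is "failure".
COUNTERS = {"queued": 0, "in_progress": 0, "completed": 0, "failed": 0}

def apply_event(counters: dict, event: dict) -> dict:
    """Update counters from one workflow_job webhook payload."""
    action = event.get("action")
    if action in counters:
        counters[action] += 1
    if (action == "completed"
            and event.get("workflow_job", {}).get("conclusion") == "failure"):
        counters["failed"] += 1
    return counters

class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # GitHub delivers the workflow_job payload as a JSON body
        length = int(self.headers.get("Content-Length", 0))
        apply_event(COUNTERS, json.loads(self.rfile.read(length) or b"{}"))
        self.send_response(204)
        self.end_headers()

    def do_GET(self):
        # Prometheus scrapes this endpoint (add it as a scrape target)
        body = "".join(
            f"github_workflow_job_{name}_total {value}\n"
            for name, value in COUNTERS.items()
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

def main() -> None:
    # Blocks forever; run behind a reverse proxy as a systemd service
    HTTPServer(("", 9500), Handler).serve_forever()
```

Because counters only increase, `rate()` queries over these metrics give job throughput and failure rates without polling the GitHub API.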
### Bazel Remote Execution metrics
BuildBuddy and bazel-remote both expose Prometheus-compatible metrics:
BuildBuddy metrics (port 9090):
| Metric | Description |
|---|---|
| `buildbuddy_remote_cache_hit_count` | Cache hits |
| `buildbuddy_remote_cache_miss_count` | Cache misses |
| `buildbuddy_remote_cache_size_bytes` | Total cache size |
| `buildbuddy_invocation_count` | Build invocations |
| `buildbuddy_action_count` | Remote actions executed |
| `buildbuddy_action_queue_length` | Queued actions waiting |
bazel-remote metrics (port 9090):
| Metric | Description |
|---|---|
| `bazel_remote_cache_hits` | Cache hit count by type (ac/cas) |
| `bazel_remote_cache_misses` | Cache miss count by type |
| `bazel_remote_disk_cache_size_bytes` | On-disk cache size |
| `bazel_remote_http_request_duration_seconds` | Request latency histogram |
```promql
# Cache hit rate
buildbuddy_remote_cache_hit_count / (buildbuddy_remote_cache_hit_count + buildbuddy_remote_cache_miss_count)

# Action queue depth (high = need more workers)
buildbuddy_action_queue_length

# Cache size growth rate
rate(buildbuddy_remote_cache_size_bytes[1h])
```

## Grafana dashboards
### Runner fleet overview dashboard
Create a dashboard that shows the health of your entire runner fleet at a glance:
```json
{
  "dashboard": {
    "title": "CI/CD Runner Fleet",
    "panels": [
      {
        "title": "Runner Fleet Status",
        "type": "stat",
        "targets": [
          {
            "expr": "count(up{job=~\"gitlab-runner|github-runner-exporter|node-exporter\", role=\"ci-runner\"} == 1)",
            "legendFormat": "Online"
          }
        ]
      },
      {
        "title": "Active Jobs",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(gitlab_runner_jobs) + sum(github_runner_busy)",
            "legendFormat": "Running"
          }
        ]
      },
      {
        "title": "CPU Usage by Runner",
        "type": "timeseries",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\", role=\"ci-runner\"}[5m])) * 100)",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Memory Usage by Runner",
        "type": "timeseries",
        "targets": [
          {
            "expr": "(1 - node_memory_MemAvailable_bytes{role=\"ci-runner\"} / node_memory_MemTotal_bytes{role=\"ci-runner\"}) * 100",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "Disk Usage by Runner",
        "type": "timeseries",
        "targets": [
          {
            "expr": "100 - (node_filesystem_avail_bytes{mountpoint=\"/\", role=\"ci-runner\"} / node_filesystem_size_bytes{mountpoint=\"/\", role=\"ci-runner\"}) * 100",
            "legendFormat": "{{instance}}"
          }
        ]
      },
      {
        "title": "GitLab Job Throughput",
        "type": "timeseries",
        "targets": [
          {
            "expr": "rate(gitlab_runner_jobs_total[5m]) * 60",
            "legendFormat": "jobs/min"
          }
        ]
      },
      {
        "title": "Bazel Cache Hit Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "buildbuddy_remote_cache_hit_count / (buildbuddy_remote_cache_hit_count + buildbuddy_remote_cache_miss_count) * 100",
            "legendFormat": "Hit %"
          }
        ]
      }
    ]
  }
}
```

### Key panels to include
| Panel | Query | Purpose |
|---|---|---|
| Online runners | `count(up{role="ci-runner"} == 1)` | Fleet health |
| CPU saturation | `avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))` | Capacity planning |
| Disk space | `node_filesystem_avail_bytes{mountpoint="/"}` | Prevent disk exhaustion |
| Network I/O | `rate(node_network_receive_bytes_total[5m])` | Bandwidth usage |
| GitLab errors | `rate(gitlab_runner_errors_total[5m])` | Error trends |
| Job queue depth | `gitlab_runner_request_concurrency` | Scaling signals |
| Cache hit rate | Bazel cache hits / (hits + misses) | Build efficiency |
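To keep the dashboard above in version control rather than hand-editing it in the UI, you can push it through Grafana's dashboard HTTP API (`POST /api/dashboards/db`). This is a sketch: the Grafana URL, token, and filename are placeholders, and it uses only the standard library.

```python
import json
import urllib.request

def make_payload(dashboard: dict) -> bytes:
    # Grafana's dashboard API expects the dashboard nested under
    # "dashboard", with "overwrite" allowing updates in place.
    return json.dumps({"dashboard": dashboard, "overwrite": True}).encode()

def provision(grafana_url: str, token: str, dashboard: dict) -> None:
    req = urllib.request.Request(
        f"{grafana_url}/api/dashboards/db",
        data=make_payload(dashboard),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    urllib.request.urlopen(req)  # raises on non-2xx responses

# Usage (placeholders):
# with open("runner-fleet-dashboard.json") as f:
#     provision("http://grafana.internal:3000", "YOUR_TOKEN",
#               json.load(f)["dashboard"])
```

Running this in CI whenever the dashboard JSON changes keeps Grafana in sync with the repository.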
## Alerting rules
### Prometheus alert rules
```yaml
# /etc/prometheus/runner_alerts.yml
groups:
  - name: runner-health
    interval: 30s
    rules:
      # Runner host is down
      - alert: RunnerHostDown
        expr: up{role="ci-runner"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Runner host {{ $labels.instance }} is down"
          description: "Runner host has been unreachable for more than 2 minutes."

      # High CPU usage
      - alert: RunnerHighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle", role="ci-runner"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on runner {{ $labels.instance }}"
          description: "CPU usage above 90% for 10 minutes. Consider adding more runners."

      # Disk space low
      - alert: RunnerDiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/", role="ci-runner"} / node_filesystem_size_bytes{mountpoint="/", role="ci-runner"}) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on runner {{ $labels.instance }}"
          description: "Less than 15% disk space remaining. Clean up work directories and Docker images."

      # Disk space critical
      - alert: RunnerDiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/", role="ci-runner"} / node_filesystem_size_bytes{mountpoint="/", role="ci-runner"}) * 100 < 5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space on runner {{ $labels.instance }}"
          description: "Less than 5% disk space remaining. Jobs will fail."

      # Memory exhaustion
      - alert: RunnerMemoryHigh
        expr: (1 - node_memory_MemAvailable_bytes{role="ci-runner"} / node_memory_MemTotal_bytes{role="ci-runner"}) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on runner {{ $labels.instance }}"
          description: "Memory usage above 90% for 5 minutes."

  - name: runner-jobs
    interval: 30s
    rules:
      # GitLab runner error rate spike
      - alert: GitLabRunnerHighErrorRate
        expr: rate(gitlab_runner_errors_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate on GitLab Runner"
          description: "More than 6 errors per minute for 5 minutes."

      # All runners busy (queue building up)
      - alert: AllRunnersBusy
        expr: gitlab_runner_jobs == gitlab_runner_concurrent
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "All GitLab Runner slots are busy"
          description: "All concurrent job slots are in use for 10+ minutes. Jobs are queuing."

      # Bazel cache hit rate drop
      - alert: BazelCacheHitRateLow
        expr: buildbuddy_remote_cache_hit_count / (buildbuddy_remote_cache_hit_count + buildbuddy_remote_cache_miss_count) < 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Bazel cache hit rate below 50%"
          description: "Cache hit rate has dropped below 50%. Check for cache invalidation or configuration changes."

      # GitHub runners all offline
      - alert: GitHubRunnersOffline
        expr: github_runner_online == 0 and github_runner_total > 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "All GitHub Actions runners are offline"
          description: "No online runners detected. Workflows will queue indefinitely."
```

### Alertmanager configuration
Route alerts to Slack and PagerDuty:
```yaml
# /etc/alertmanager/alertmanager.yml
global:
  slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"

route:
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "slack-warnings"

  routes:
    - match:
        severity: critical
      receiver: "pagerduty-critical"
    - match:
        severity: warning
      receiver: "slack-warnings"

receivers:
  - name: "slack-warnings"
    slack_configs:
      - channel: "#ci-cd-alerts"
        title: "{{ .GroupLabels.alertname }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"

  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "YOUR_PAGERDUTY_KEY"
```

## Integration with existing monitoring
If you're already running Prometheus and Grafana (for example, via the project's podman-compose stack), add the runner scrape targets to your existing configuration. The node exporter metrics are standard and work with any existing system dashboards.
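If you'd rather not edit the main scrape list every time a runner host is added or retired, Prometheus file-based service discovery is a low-friction option. This is a sketch: the job name and file path are illustrative.

```yaml
# In your existing prometheus.yml: discover runner hosts from a
# separate targets file, which Prometheus re-reads automatically.
scrape_configs:
  - job_name: "ci-runners"
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/runners.yml"

# /etc/prometheus/targets/runners.yml — edit or generate this file;
# no Prometheus reload is needed when it changes.
# - targets: ["runner-01:9100", "runner-02:9100"]
#   labels:
#     role: "ci-runner"
```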
For GlitchTip integration, runner failures can be reported as errors:
```bash
# Report runner failures to GlitchTip (Sentry-compatible)
curl -X POST "http://localhost:8000/api/GLITCHTIP_PROJECT_ID/store/" \
  -H "Content-Type: application/json" \
  -H "X-Sentry-Auth: Sentry sentry_key=YOUR_DSN_KEY" \
  -d '{
    "event_id": "'$(uuidgen | tr -d '-')'",
    "message": "Runner runner-01 is offline",
    "level": "error",
    "tags": {"runner": "runner-01", "platform": "gitlab"}
  }'
```

## Next steps
- Security hardening — Secure your runner infrastructure
- Troubleshooting — Diagnose and fix common runner issues
- Bare metal deployment — Infrastructure-level monitoring setup