Runners

Bare Metal Infrastructure for Self-Hosted Runners

Deploy and manage self-hosted runners on physical hardware infrastructure


Bare metal infrastructure provides maximum performance, security, and control for your self-hosted runners. This guide covers complete deployment automation for physical hardware across multiple operating systems and runner platforms.

Infrastructure Overview#

Benefits of bare metal runners#

Performance advantages:

  • Direct hardware access without virtualization overhead
  • Consistent performance for CPU-intensive builds
  • Faster I/O operations for large codebases
  • Dedicated resources with no noisy neighbors

Security benefits:

  • Complete control over the hardware stack
  • Physical security controls
  • Air-gapped environments for sensitive workloads
  • Custom hardening and compliance configurations

Cost efficiency:

  • Predictable costs for long-running workloads
  • Better price-performance ratio for high-utilization scenarios
  • No cloud egress charges for large artifact transfers
  • Simplified licensing for proprietary software

Considerations#

Management overhead:

  • Hardware procurement and lifecycle management
  • Operating system deployment and patching
  • Physical security and environmental controls
  • Power and cooling infrastructure

Scalability limitations:

  • Fixed capacity requires capacity planning
  • Longer provisioning times for new hardware
  • Physical space constraints
  • Manual intervention for hardware failures
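Fixed capacity makes up-front sizing important. A rough back-of-the-envelope estimate of fleet size (the throughput numbers below are illustrative assumptions, not measurements):

```shell
#!/bin/bash
# Estimate how many runners a fixed fleet needs:
# total busy minutes per hour, divided by the minutes each runner
# can contribute at a target utilization (leaving headroom for spikes).
jobs_per_hour=120        # assumed peak job arrival rate
avg_job_minutes=8        # assumed average job duration
target_utilization=70    # percent of each runner's hour to actually use

busy_minutes=$(( jobs_per_hour * avg_job_minutes ))
minutes_per_runner=$(( 60 * target_utilization / 100 ))
# Ceiling division: round up so demand never exceeds capacity
runners=$(( (busy_minutes + minutes_per_runner - 1) / minutes_per_runner ))
echo "Estimated runners needed: $runners"
```

With these assumed numbers the estimate comes out to 23 runners; re-run it with measured job data before ordering hardware.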

Hardware Requirements#

Development Build Runners#

Optimized for standard CI/CD workloads with moderate resource requirements.

| Component | Minimum | Recommended | Notes |
|-----------|---------|-------------|-------|
| CPU | 4 cores / 8 threads | 8 cores / 16 threads | Intel Xeon or AMD EPYC |
| RAM | 16 GB | 32 GB | ECC memory preferred |
| Storage | 256 GB SSD | 512 GB NVMe SSD | Local build cache |
| Network | 1 Gbps | 10 Gbps | Low latency to repositories |

High-Performance CI/CD#

For large monorepos, parallel builds, and intensive compilation workloads.

| Component | Minimum | Recommended | Notes |
|-----------|---------|-------------|-------|
| CPU | 16 cores / 32 threads | 32 cores / 64 threads | High-frequency cores |
| RAM | 64 GB | 128 GB | Large compilation jobs |
| Storage | 1 TB NVMe SSD | 2 TB NVMe SSD RAID 0 | Fast I/O for builds |
| Network | 10 Gbps | 25 Gbps | Artifact upload/download |

Enterprise Workloads#

High-availability configurations with redundancy and compliance requirements.

| Component | Minimum | Recommended | Notes |
|-----------|---------|-------------|-------|
| CPU | 24 cores / 48 threads | 48 cores / 96 threads | Dual socket preferred |
| RAM | 128 GB | 256 GB | ECC with memory mirroring |
| Storage | 2 TB NVMe SSD | 4 TB NVMe SSD RAID 1 | Redundant storage |
| Network | 10 Gbps bonded | 25 Gbps bonded | Network redundancy |

Specialized Build Requirements#

For GPU workloads, mobile development, and machine learning pipelines.

| Component | GPU Builds | Mobile Development | ML Pipelines |
|-----------|------------|--------------------|--------------|
| CPU | 16+ cores | 8+ cores | 32+ cores |
| RAM | 64+ GB | 32+ GB | 128+ GB |
| Storage | 1+ TB NVMe | 512+ GB NVMe | 2+ TB NVMe |
| Special | NVIDIA RTX/Tesla | macOS capability | CUDA/ROCm support |
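Before assigning a machine to one of these profiles it is worth probing what the hardware actually provides. A small Linux-only sketch (the thresholds mirror the tables above but are still illustrative):

```shell
#!/bin/bash
# Probe CPU count, memory, and GPU presence to suggest a runner profile (Linux)
cores=$(nproc)
mem_gb=$(( $(awk '/MemTotal/ {print $2}' /proc/meminfo) / 1024 / 1024 ))
gpu="none"
command -v nvidia-smi >/dev/null 2>&1 && gpu="nvidia"

if [ "$cores" -ge 32 ] && [ "$mem_gb" -ge 120 ]; then
    profile="ml-pipeline"
elif [ "$gpu" = "nvidia" ]; then
    profile="gpu-build"
elif [ "$cores" -ge 16 ]; then
    profile="high-performance"
else
    profile="development"
fi
echo "cores=$cores mem_gb=${mem_gb} gpu=$gpu -> profile=$profile"
```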

Network Configuration#

Physical Network Topology#

# Network architecture for bare metal runners
Internet
 │
 ├─ Firewall/Router (pfSense/FortiGate)
 │
 ├─ Management Network (VLAN 10 - 192.168.10.0/24)
 │    ├─ IPMI/BMC interfaces
 │    ├─ Network switches
 │    └─ Monitoring systems
 │
 ├─ Runner Network (VLAN 20 - 192.168.20.0/24)
 │    ├─ Linux runners
 │    ├─ Windows runners
 │    └─ macOS runners
 │
 └─ Storage Network (VLAN 30 - 192.168.30.0/24)
      ├─ NFS/SMB servers
      ├─ Backup systems
      └─ Artifact repositories

VLAN Configuration#

Management VLAN (10):

# Switch configuration for management VLAN
vlan 10 name "Management"
interface vlan 10
  ip address 192.168.10.1 255.255.255.0
  ip helper-address 192.168.10.10

Runner VLAN (20):

# Switch configuration for runner VLAN
vlan 20 name "Runners"
interface vlan 20
  ip address 192.168.20.1 255.255.255.0
  ip helper-address 192.168.20.10

Storage VLAN (30):

# Switch configuration for storage VLAN
vlan 30 name "Storage"
interface vlan 30
  ip address 192.168.30.1 255.255.255.0

Hybrid Cloud Connectivity#

Site-to-Site VPN Configuration:

#!/bin/bash
# IPSec VPN setup for hybrid cloud connectivity

# Install strongSwan
apt-get update
apt-get install -y strongswan strongswan-pki

# Configure IPSec (IKEv2 with modern ciphers; avoid SHA-1/modp1024)
cat > /etc/ipsec.conf << 'EOF'
config setup
    charondebug="ike 1, knl 1, cfg 0"
    uniqueids=no

conn aws-vpn
    auto=start
    left=%defaultroute
    leftid=203.0.113.12
    leftsubnet=192.168.0.0/16
    right=198.51.100.12
    rightsubnet=10.0.0.0/16
    ike=aes256-sha256-modp2048!
    esp=aes256-sha256!
    keyexchange=ikev2
    authby=secret
    dpddelay=30
    dpdtimeout=120
    dpdaction=restart
EOF

# Set shared secret (replace with a generated high-entropy value)
echo "203.0.113.12 198.51.100.12 : PSK 'your-shared-secret'" > /etc/ipsec.secrets
chmod 600 /etc/ipsec.secrets

# Start IPSec
systemctl enable strongswan
systemctl start strongswan
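The 'your-shared-secret' placeholder should never ship as-is; an IKE pre-shared key should be high-entropy. One way to generate one with standard openssl (assumed available on the gateway):

```shell
#!/bin/bash
# Generate a 32-byte random pre-shared key, base64-encoded (44 characters)
PSK=$(openssl rand -base64 32)

# Emit the ipsec.secrets line; write the file with mode 600
echo "203.0.113.12 198.51.100.12 : PSK \"${PSK}\""
```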

Operating System Deployment#

PXE Boot Infrastructure#

DHCP Server Configuration:

# /etc/dhcp/dhcpd.conf
default-lease-time 600;
max-lease-time 7200;

subnet 192.168.20.0 netmask 255.255.255.0 {
  range 192.168.20.100 192.168.20.200;
  option routers 192.168.20.1;
  option domain-name-servers 192.168.20.10;
  option broadcast-address 192.168.20.255;

  # PXE boot configuration
  filename "pxelinux.0";
  next-server 192.168.20.10;
}

TFTP Server Setup:

#!/bin/bash
# Install and configure TFTP server
# (pxelinux provides /usr/lib/PXELINUX/pxelinux.0; syslinux-common the .c32 modules)
apt-get install -y tftpd-hpa syslinux-common pxelinux

# Configure TFTP
cat > /etc/default/tftpd-hpa << 'EOF'
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/var/lib/tftpboot"
TFTP_ADDRESS="192.168.20.10:69"
TFTP_OPTIONS="--secure"
EOF

# Copy PXE boot files
cp /usr/lib/PXELINUX/pxelinux.0 /var/lib/tftpboot/
cp /usr/lib/syslinux/modules/bios/*.c32 /var/lib/tftpboot/
mkdir -p /var/lib/tftpboot/pxelinux.cfg

# Create PXE menu
cat > /var/lib/tftpboot/pxelinux.cfg/default << 'EOF'
DEFAULT menu.c32
PROMPT 0
TIMEOUT 300
ONTIMEOUT local

MENU TITLE PXE Boot Menu

LABEL local
  MENU LABEL Boot from local disk
  LOCALBOOT 0

LABEL ubuntu
  MENU LABEL Ubuntu 22.04 Automated Install
  KERNEL ubuntu/vmlinuz
  APPEND initrd=ubuntu/initrd.gz autoinstall ds=nocloud-net\;s=http://192.168.20.10/cloud-init/
EOF

systemctl enable tftpd-hpa
systemctl start tftpd-hpa

Automated Ubuntu Deployment#

Cloud-init Configuration:

# /var/www/html/cloud-init/user-data
#cloud-config
autoinstall:
  version: 1
  locale: en_US.UTF-8
  keyboard:
    layout: us

  network:
    network:
      version: 2
      ethernets:
        eno1:
          dhcp4: true

  storage:
    layout:
      name: lvm

  identity:
    hostname: runner-node
    username: runner
    password: '$6$rounds=4096$saltsaltsal$hash'

  ssh:
    install-server: true
    authorized-keys:
      - ssh-rsa AAAAB3NzaC1yc2EAAAA... runner@devopshub

  packages:
    - docker.io
    - git
    - curl
    - wget
    - unzip
    - build-essential

  # late-commands run in the installer environment; wrap them in
  # `curtin in-target` so they execute inside the installed system
  late-commands:
    - curtin in-target -- systemctl enable docker
    - curtin in-target -- usermod -aG docker runner
    - curtin in-target -- wget -O /tmp/install-runner.sh http://192.168.20.10/scripts/install-runner.sh
    - curtin in-target -- chmod +x /tmp/install-runner.sh
    - curtin in-target -- sudo -u runner /tmp/install-runner.sh
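The `password` field above must be a SHA-512 crypt hash (the `$6$...` form), not plaintext. A hash can be generated with `openssl passwd -6` (OpenSSL 1.1.1+); the password and fixed salt below are placeholders for demonstration only:

```shell
#!/bin/bash
# Generate a SHA-512 crypt hash for the autoinstall identity section.
# 'changeme' and the fixed salt are demo placeholders - substitute your own.
HASH=$(openssl passwd -6 -salt saltsaltsal 'changeme')
echo "password: '${HASH}'"
```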

Automated Windows Deployment#

Windows Deployment Services (WDS):

# Install WDS role
Install-WindowsFeature -Name WDS -IncludeManagementTools

# Configure WDS
wdsutil /initialize-server /reminst:"C:\RemoteInstall"
wdsutil /set-server /answerclients:all

# Add boot image
wdsutil /add-image /imagefile:"C:\Sources\boot.wim" /imagetype:boot

# Add install image with unattend
wdsutil /add-image /imagefile:"C:\Sources\install.wim" /imagetype:install

Unattend.xml for Automated Installation:

<?xml version="1.0" encoding="utf-8"?>
<unattend xmlns="urn:schemas-microsoft-com:unattend"
          xmlns:wcm="http://schemas.microsoft.com/WMIConfig/2002/State">
  <settings pass="windowsPE">
    <component name="Microsoft-Windows-Setup">
      <DiskConfiguration>
        <Disk wcm:action="add">
          <DiskID>0</DiskID>
          <WillWipeDisk>true</WillWipeDisk>
          <CreatePartitions>
            <CreatePartition wcm:action="add">
              <Order>1</Order>
              <Size>512</Size>
              <Type>Primary</Type>
            </CreatePartition>
            <CreatePartition wcm:action="add">
              <Order>2</Order>
              <Extend>true</Extend>
              <Type>Primary</Type>
            </CreatePartition>
          </CreatePartitions>
        </Disk>
      </DiskConfiguration>
      <UserData>
        <AcceptEula>true</AcceptEula>
        <ProductKey>
          <WillShowUI>Never</WillShowUI>
        </ProductKey>
      </UserData>
    </component>
  </settings>

  <settings pass="specialize">
    <component name="Microsoft-Windows-Shell-Setup">
      <ComputerName>WIN-RUNNER</ComputerName>
      <ProductKey>VK7JG-NPHTM-C97JM-9MPGT-3V66T</ProductKey>
    </component>
  </settings>

  <settings pass="oobeSystem">
    <component name="Microsoft-Windows-Shell-Setup">
      <OOBE>
        <HideEULAPage>true</HideEULAPage>
        <HideWirelessSetupInOOBE>true</HideWirelessSetupInOOBE>
        <NetworkLocation>Work</NetworkLocation>
        <ProtectYourPC>1</ProtectYourPC>
      </OOBE>
      <UserAccounts>
        <!-- Replace the example passwords before use; plaintext values are readable on disk -->
        <AdministratorPassword>
          <Value>P@ssw0rd123</Value>
          <PlainText>true</PlainText>
        </AdministratorPassword>
        <LocalAccounts>
          <LocalAccount wcm:action="add">
            <Password>
              <Value>P@ssw0rd123</Value>
              <PlainText>true</PlainText>
            </Password>
            <Description>Runner Service Account</Description>
            <DisplayName>runner</DisplayName>
            <Group>Administrators</Group>
            <Name>runner</Name>
          </LocalAccount>
        </LocalAccounts>
      </UserAccounts>
      <FirstLogonCommands>
        <SynchronousCommand wcm:action="add">
          <CommandLine>powershell.exe -ExecutionPolicy Bypass -File C:\Scripts\setup-runner.ps1</CommandLine>
          <Order>1</Order>
        </SynchronousCommand>
      </FirstLogonCommands>
    </component>
  </settings>
</unattend>

macOS Automated Deployment#

macOS NetInstall Server:

#!/bin/bash
# Create macOS NetInstall image
# Note: NetInstall/System Image Utility require an older macOS Server setup;
# Apple has deprecated NetBoot-based deployment on recent macOS releases.

# Install macOS Server tools
sudo installer -pkg /Applications/Server.app/Contents/ServerRoot/System/Installation/Packages/OSInstall.mpkg -target /

# Create NetInstall image
sudo /System/Library/CoreServices/System\ Image\ Utility.app/Contents/MacOS/System\ Image\ Utility \
  --source /Applications/Install\ macOS\ Monterey.app \
  --output /Users/Shared/NetInstall.nbi \
  --kind netinstall \
  --name "macOS Runner AutoInstall"

# Configure NetBoot service
sudo serveradmin settings netboot:sharepoint = "/Users/Shared"
sudo serveradmin start netboot

Platform Installation Automation#

GitHub Actions Runner Installation#

Cross-Platform Installation Script:

#!/bin/bash
# GitHub Actions runner installation

RUNNER_VERSION="2.311.0"
GITHUB_TOKEN="$1"
GITHUB_URL="$2"   # e.g. https://github.com/my-org/my-repo

# Detect OS
if [[ "$OSTYPE" == "linux-gnu"* ]]; then
  OS="linux"
  ARCH="x64"
elif [[ "$OSTYPE" == "darwin"* ]]; then
  OS="osx"
  ARCH="x64"
elif [[ "$OSTYPE" == "msys" ]]; then
  OS="win"
  ARCH="x64"
fi

# Windows runners ship as .zip; Linux/macOS as .tar.gz
EXT="tar.gz"
[[ "$OS" == "win" ]] && EXT="zip"

# Download runner
mkdir -p /opt/actions-runner && cd /opt/actions-runner
curl -o "actions-runner.${EXT}" -L "https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-${OS}-${ARCH}-${RUNNER_VERSION}.${EXT}"
if [[ "$EXT" == "zip" ]]; then
  unzip -o "actions-runner.${EXT}"
else
  tar xzf "actions-runner.${EXT}"
fi

# Generate registration token via the REST API
# (the API endpoint differs from the runner URL; use /orgs/<org> for org-level runners)
API_URL="https://api.github.com/repos/${GITHUB_URL#https://github.com/}"
REG_TOKEN=$(curl -s -X POST \
  -H "Authorization: token ${GITHUB_TOKEN}" \
  -H "Accept: application/vnd.github.v3+json" \
  "${API_URL}/actions/runners/registration-token" | \
  jq -r .token)

# Configure runner
CONFIG="./config.sh"
[[ "$OS" == "win" ]] && CONFIG="./config.cmd"
"$CONFIG" --url "${GITHUB_URL}" --token "${REG_TOKEN}" --unattended --replace

# Install as service
if [[ "$OS" == "linux" ]]; then
  sudo ./svc.sh install
  sudo ./svc.sh start
elif [[ "$OS" == "osx" ]]; then
  # On macOS the service scripts install a launchd agent for the current user
  ./svc.sh install
  ./svc.sh start
elif [[ "$OS" == "win" ]]; then
  ./svc.cmd install
  ./svc.cmd start
fi
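The download step above trusts the archive blindly. Each runner release publishes a SHA-256 checksum in its release notes that can be pinned and verified before extraction; the sketch below demonstrates the check against a stand-in file (the pinned value would normally come from the release notes, not be computed locally):

```shell
#!/bin/bash
set -e
# Verify a downloaded archive against a pinned SHA-256 before extracting
FILE=/tmp/actions-runner-demo.tar.gz
echo "demo payload" > "$FILE"                            # stand-in for the real download
EXPECTED_SHA256=$(sha256sum "$FILE" | awk '{print $1}')  # normally pinned in advance

if echo "${EXPECTED_SHA256}  ${FILE}" | sha256sum -c --quiet; then
    echo "checksum OK - safe to extract"
else
    echo "checksum mismatch - aborting" >&2
    exit 1
fi
```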

GitLab Runner Installation#

Multi-Executor GitLab Runner Setup:

#!/bin/bash
# GitLab Runner installation and configuration

# Install GitLab Runner
curl -L "https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh" | sudo bash
sudo apt-get install -y gitlab-runner

# Register multiple runners with different executors
gitlab-runner register \
  --non-interactive \
  --url "https://gitlab.com/" \
  --registration-token "$GITLAB_TOKEN" \
  --executor "docker" \
  --docker-image "alpine:latest" \
  --description "Docker executor runner" \
  --tag-list "docker,linux" \
  --run-untagged="true" \
  --locked="false" \
  --access-level="not_protected"

gitlab-runner register \
  --non-interactive \
  --url "https://gitlab.com/" \
  --registration-token "$GITLAB_TOKEN" \
  --executor "shell" \
  --description "Shell executor runner" \
  --tag-list "shell,linux" \
  --run-untagged="false" \
  --locked="false" \
  --access-level="not_protected"

# Configure concurrent jobs
sudo sed -i 's/concurrent = 1/concurrent = 4/' /etc/gitlab-runner/config.toml

# Start service
sudo systemctl enable gitlab-runner
sudo systemctl start gitlab-runner

Jenkins Agent Installation#

Jenkins Agent with JNLP:

#!/bin/bash
# Jenkins agent installation

JENKINS_URL="$1"
AGENT_NAME="$2"
AGENT_SECRET="$3"

# Install Java (required by the agent)
sudo apt-get install -y openjdk-17-jre-headless

# Create jenkins user
sudo useradd -m -s /bin/bash jenkins
sudo usermod -aG docker jenkins

# Download agent JAR
sudo -u jenkins mkdir -p /home/jenkins/agent
cd /home/jenkins/agent
sudo -u jenkins wget "${JENKINS_URL}/jnlpJars/agent.jar"

# Create systemd service
sudo tee /etc/systemd/system/jenkins-agent.service > /dev/null << EOF
[Unit]
Description=Jenkins Agent
After=network.target

[Service]
Type=simple
User=jenkins
WorkingDirectory=/home/jenkins/agent
ExecStart=/usr/bin/java -jar agent.jar -jnlpUrl ${JENKINS_URL}/computer/${AGENT_NAME}/jenkins-agent.jnlp -secret ${AGENT_SECRET} -workDir /home/jenkins/agent
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable jenkins-agent
sudo systemctl start jenkins-agent

Azure DevOps Agent Installation#

Azure Pipelines Agent Setup:

# Azure DevOps agent installation (PowerShell)
param(
    [Parameter(Mandatory=$true)]
    [string]$OrganizationUrl,

    [Parameter(Mandatory=$true)]
    [string]$PersonalAccessToken,

    [Parameter(Mandatory=$true)]
    [string]$Pool,

    [Parameter(Mandatory=$true)]
    [string]$AgentName
)

# Download and extract agent
$agentDir = "C:\agent"
New-Item -ItemType Directory -Path $agentDir -Force
Set-Location $agentDir

$webClient = New-Object System.Net.WebClient
$webClient.DownloadFile("https://vstsagentpackage.azureedge.net/agent/2.214.1/vsts-agent-win-x64-2.214.1.zip", "$agentDir\agent.zip")

Expand-Archive -Path "$agentDir\agent.zip" -DestinationPath $agentDir -Force

# Configure agent; --runAsService installs and starts the Windows service
# as part of configuration, so no separate service-install step is needed
.\config.cmd --unattended --url $OrganizationUrl --auth pat --token $PersonalAccessToken --pool $Pool --agent $AgentName --runAsService --windowsLogonAccount "NT AUTHORITY\SYSTEM"

Bazel Remote Execution Setup#

Bazel RBE Worker Configuration:

#!/bin/bash
# Bazel Remote Build Execution worker setup

# Install Bazel
curl -fsSL https://bazel.build/bazel-release.pub.gpg | gpg --dearmor > bazel.gpg
sudo mv bazel.gpg /etc/apt/trusted.gpg.d/
echo "deb [arch=amd64] https://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
sudo apt update && sudo apt install -y bazel

# Install BuildBuddy RBE worker
wget https://github.com/buildbuddy-io/buildbuddy/releases/latest/download/executor-linux-amd64
chmod +x executor-linux-amd64
sudo mv executor-linux-amd64 /usr/local/bin/buildbuddy-executor

# Create service user and configuration directory
sudo useradd -r -s /usr/sbin/nologin buildbuddy
sudo mkdir -p /etc/buildbuddy
sudo tee /etc/buildbuddy/config.yaml > /dev/null << EOF
executor:
  app_target: "grpc://buildbuddy.example.com:1985"
  root_directory: "/tmp/buildbuddy"
  local_cache_directory: "/tmp/buildbuddy-cache"
  local_cache_size_bytes: 10000000000  # 10GB

# Platform properties
platform:
  os: "linux"
  arch: "amd64"

# Resource limits
cpu_count: 8
memory_bytes: 16000000000  # 16GB
EOF

# Create systemd service
sudo tee /etc/systemd/system/buildbuddy-executor.service > /dev/null << EOF
[Unit]
Description=BuildBuddy RBE Executor
After=network.target

[Service]
Type=simple
User=buildbuddy
ExecStart=/usr/local/bin/buildbuddy-executor --config_file=/etc/buildbuddy/config.yaml
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable buildbuddy-executor
sudo systemctl start buildbuddy-executor

Storage Configuration#

Local Storage Optimization#

NVMe SSD Configuration for Build Performance:

#!/bin/bash
# Optimize NVMe SSDs for build workloads

# Check NVMe devices
nvme list

# Set up RAID 0 for maximum performance (multiple NVMe drives)
# Note: RAID 0 has no redundancy - use it only for reproducible build scratch space
mdadm --create --verbose /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1

# Format with optimized filesystem (journal disabled - data is disposable)
mkfs.ext4 -F -O ^has_journal /dev/md0
tune2fs -o discard /dev/md0

# Mount with performance optimizations
mkdir -p /opt/builds
echo '/dev/md0 /opt/builds ext4 defaults,noatime,discard,barrier=0 0 2' >> /etc/fstab
mount /opt/builds

# Set up build cache directories
mkdir -p /opt/builds/{docker,maven,gradle,npm,ccache}
chown -R runner:runner /opt/builds

Network-Attached Storage#

NFS Server for Shared Build Cache:

#!/bin/bash
# NFS server setup for shared build artifacts

# Install NFS server
apt-get update
apt-get install -y nfs-kernel-server

# Create shared directories
mkdir -p /srv/nfs/{artifacts,cache,tools}

# Configure exports
# no_root_squash is convenient for runners but risky; export only to the trusted runner VLAN
cat > /etc/exports << 'EOF'
/srv/nfs/artifacts 192.168.20.0/24(rw,sync,no_subtree_check,no_root_squash)
/srv/nfs/cache 192.168.20.0/24(rw,sync,no_subtree_check,no_root_squash)
/srv/nfs/tools 192.168.20.0/24(ro,sync,no_subtree_check,no_root_squash)
EOF

# Start NFS services
systemctl enable nfs-kernel-server
systemctl start nfs-kernel-server
exportfs -ra

# Client-side NFS mounting
# On runner nodes:
#   mkdir -p /mnt/{artifacts,cache,tools}
#   echo '192.168.30.10:/srv/nfs/artifacts /mnt/artifacts nfs defaults,_netdev 0 0' >> /etc/fstab
#   echo '192.168.30.10:/srv/nfs/cache /mnt/cache nfs defaults,_netdev 0 0' >> /etc/fstab
#   echo '192.168.30.10:/srv/nfs/tools /mnt/tools nfs defaults,_netdev 0 0' >> /etc/fstab
#   mount -a
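A missing or stale NFS mount tends to surface as a confusing build failure much later; a guard at job start-up fails fast instead. A minimal check using `mountpoint` from util-linux:

```shell
#!/bin/bash
# Fail fast when a required shared filesystem is not mounted
check_mount() {
    if mountpoint -q "$1"; then
        echo "OK: $1 is mounted"
    else
        echo "MISSING: $1 is not a mountpoint" >&2
        return 1
    fi
}

# "/" is always a mountpoint, so this shows the success path;
# on runner nodes check the shared caches instead:
#   check_mount /mnt/artifacts && check_mount /mnt/cache
check_mount /
```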

Backup and Archive Strategy#

Automated Backup with rsync and ZFS:

#!/bin/bash
# Backup strategy for runner infrastructure

# Create ZFS pool for backups
zpool create backuppool /dev/sdb /dev/sdc
zfs create backuppool/daily
zfs create backuppool/weekly
zfs create backuppool/monthly

# Automated backup script
cat > /usr/local/bin/backup-runners.sh << 'EOF'
#!/bin/bash
DATE=$(date +%Y%m%d)

# Backup runner configurations
rsync -av --delete /etc/gitlab-runner/ /backup/gitlab-runner-$DATE/
rsync -av --delete /opt/actions-runner/ /backup/actions-runner-$DATE/
rsync -av --delete /home/jenkins/ /backup/jenkins-$DATE/

# Create ZFS snapshots
zfs snapshot backuppool/daily@$DATE
zfs snapshot backuppool/weekly@$(date +%Y-W%U)
zfs snapshot backuppool/monthly@$(date +%Y-%m)

# Retention policy - keep 7 daily, 4 weekly, 12 monthly
# (-H -o name gives bare snapshot names suitable for xargs)
zfs list -H -t snapshot -o name | grep backuppool/daily | sort | head -n -7 | xargs -r -n1 zfs destroy
zfs list -H -t snapshot -o name | grep backuppool/weekly | sort | head -n -4 | xargs -r -n1 zfs destroy
zfs list -H -t snapshot -o name | grep backuppool/monthly | sort | head -n -12 | xargs -r -n1 zfs destroy
EOF

chmod +x /usr/local/bin/backup-runners.sh

# Schedule backups
echo "0 2 * * * root /usr/local/bin/backup-runners.sh" >> /etc/crontab
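Backups are only as good as their restores; writing a SHA-256 manifest at backup time lets the restore side detect silent corruption after a transfer. A sketch with stand-in files (real paths would be the backup directories above):

```shell
#!/bin/bash
set -e
# Write a SHA-256 manifest alongside the backups, then verify it
BACKUP_DIR=$(mktemp -d)                       # stand-in for /backup
echo "config data" > "$BACKUP_DIR/gitlab-runner.toml"
echo "agent data"  > "$BACKUP_DIR/agent.tar.gz"

# Create the manifest at backup time
( cd "$BACKUP_DIR" && sha256sum gitlab-runner.toml agent.tar.gz > MANIFEST.sha256 )

# Verify after any transfer, and again before a restore
if ( cd "$BACKUP_DIR" && sha256sum -c --quiet MANIFEST.sha256 ); then
    echo "backup integrity verified"
else
    echo "backup corrupted" >&2
    exit 1
fi
```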

DevOps Hub Agent Installation#

Cross-Platform Agent Deployment#

Universal Agent Installation Script:

#!/bin/bash
# DevOps Hub agent installation for bare metal runners

AGENT_VERSION="1.2.3"
API_TOKEN="$1"
HUB_URL="${2:-https://api.devopshub.com}"
AGENT_NAME="${3:-$(hostname)}"

# Detect platform
case "$(uname -s)" in
  Linux*)  PLATFORM=linux;;
  Darwin*) PLATFORM=macos;;
  MINGW*)  PLATFORM=windows;;
  *) echo "Unsupported platform"; exit 1;;
esac

case "$(uname -m)" in
  x86_64*)  ARCH=amd64;;
  arm64*)   ARCH=arm64;;
  aarch64*) ARCH=arm64;;
  *) echo "Unsupported architecture"; exit 1;;
esac

# Download agent
DOWNLOAD_URL="${HUB_URL}/downloads/agent/v${AGENT_VERSION}/devopshub-agent-${PLATFORM}-${ARCH}"
curl -fsSL "${DOWNLOAD_URL}" -o /tmp/devopshub-agent
chmod +x /tmp/devopshub-agent

# Install agent
sudo mkdir -p /opt/devopshub
sudo mv /tmp/devopshub-agent /opt/devopshub/
sudo ln -sf /opt/devopshub/devopshub-agent /usr/local/bin/

# Create configuration
sudo mkdir -p /etc/devopshub
sudo tee /etc/devopshub/config.yaml > /dev/null << EOF
agent:
  name: "${AGENT_NAME}"
  token: "${API_TOKEN}"
  hub_url: "${HUB_URL}"

# Capabilities auto-detection
auto_detect_capabilities: true

# Resource limits
max_concurrent_jobs: 4
max_memory_mb: 8192
max_disk_mb: 51200

# Platform-specific settings
platforms:
  - github-actions
  - gitlab-ci
  - jenkins
  - azure-devops
  - bazel-rbe

logging:
  level: info
  file: /var/log/devopshub-agent.log
EOF

# Capability detection
/opt/devopshub/devopshub-agent detect-capabilities > /tmp/capabilities.json

# Register agent
/opt/devopshub/devopshub-agent register \
  --config /etc/devopshub/config.yaml \
  --capabilities /tmp/capabilities.json

# Create systemd service (Linux)
if [[ "$PLATFORM" == "linux" ]]; then
  sudo tee /etc/systemd/system/devopshub-agent.service > /dev/null << EOF
[Unit]
Description=DevOps Hub Agent
After=network.target

[Service]
Type=simple
User=runner
ExecStart=/opt/devopshub/devopshub-agent run --config /etc/devopshub/config.yaml
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

  sudo systemctl daemon-reload
  sudo systemctl enable devopshub-agent
  sudo systemctl start devopshub-agent
fi

# Create launchd service (macOS)
if [[ "$PLATFORM" == "macos" ]]; then
  sudo tee /Library/LaunchDaemons/com.devopshub.agent.plist > /dev/null << EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.devopshub.agent</string>
  <key>ProgramArguments</key>
  <array>
    <string>/opt/devopshub/devopshub-agent</string>
    <string>run</string>
    <string>--config</string>
    <string>/etc/devopshub/config.yaml</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
EOF

  sudo launchctl load /Library/LaunchDaemons/com.devopshub.agent.plist
fi

echo "DevOps Hub agent installed successfully"
echo "Agent name: ${AGENT_NAME}"
echo "Capabilities: $(cat /tmp/capabilities.json)"

Agent Health Monitoring#

Health Check and Self-Healing:

#!/bin/bash
# DevOps Hub agent health monitoring

# Health check function
check_agent_health() {
    local status_url="http://localhost:8080/health"
    local response=$(curl -s -o /dev/null -w "%{http_code}" "$status_url")

    if [[ "$response" == "200" ]]; then
        echo "Agent healthy"
        return 0
    else
        echo "Agent unhealthy (HTTP $response)"
        return 1
    fi
}

# Service restart function
restart_agent() {
    echo "Restarting DevOps Hub agent..."

    case "$(uname -s)" in
        Linux*)
            sudo systemctl restart devopshub-agent
            ;;
        Darwin*)
            sudo launchctl unload /Library/LaunchDaemons/com.devopshub.agent.plist
            sleep 5
            sudo launchctl load /Library/LaunchDaemons/com.devopshub.agent.plist
            ;;
    esac
}

# Capability refresh
refresh_capabilities() {
    echo "Refreshing agent capabilities..."
    /opt/devopshub/devopshub-agent detect-capabilities > /tmp/capabilities.json
    /opt/devopshub/devopshub-agent update-capabilities --capabilities /tmp/capabilities.json
}

# Main monitoring loop
main() {
    local consecutive_failures=0
    local max_failures=3

    while true; do
        if check_agent_health; then
            consecutive_failures=0

            # Refresh capabilities daily
            if [[ $(date +%H:%M) == "02:00" ]]; then
                refresh_capabilities
            fi
        else
            ((consecutive_failures++))

            if [[ $consecutive_failures -ge $max_failures ]]; then
                restart_agent
                consecutive_failures=0
                sleep 30  # Wait for service to start
            fi
        fi

        sleep 60  # Check every minute
    done
}

# Run as daemon or one-shot
if [[ "${1}" == "--daemon" ]]; then
    main
else
    check_agent_health
fi

Power and Environmental Management#

Uninterruptible Power Supply (UPS) Integration#

Network UPS Tools (NUT) Configuration:

#!/bin/bash
# UPS monitoring and management with NUT

# Install NUT
apt-get update
apt-get install -y nut nut-client nut-server

# Configure UPS (APC Smart-UPS via USB)
cat > /etc/nut/ups.conf << 'EOF'
[apc-ups]
    driver = usbhid-ups
    port = auto
    desc = "APC Smart-UPS 1500VA"
    vendorid = 051d
EOF

# Configure NUT daemon
cat > /etc/nut/nut.conf << 'EOF'
MODE=netserver
EOF

# Configure users
cat > /etc/nut/upsd.users << 'EOF'
[admin]
    password = supersecret
    actions = SET
    instcmds = ALL

[upsmon]
    password = secret
    upsmon master
EOF

# Configure monitoring (user, password, and role must match upsd.users)
cat > /etc/nut/upsmon.conf << 'EOF'
MONITOR apc-ups@localhost 1 upsmon secret master
MINSUPPLIES 1
SHUTDOWNCMD "/sbin/shutdown -h +0"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower

NOTIFYCMD /etc/nut/notifyscript.sh
NOTIFYFLAG ONBATT SYSLOG+WALL+EXEC
NOTIFYFLAG LOWBATT SYSLOG+WALL+EXEC
NOTIFYFLAG FSD SYSLOG+WALL+EXEC
NOTIFYFLAG COMMOK SYSLOG+EXEC
NOTIFYFLAG COMMBAD SYSLOG+WALL+EXEC
NOTIFYFLAG SHUTDOWN SYSLOG+WALL+EXEC
NOTIFYFLAG REPLBATT SYSLOG+WALL+EXEC
NOTIFYFLAG NOCOMM SYSLOG+WALL+EXEC
EOF

# Notification script
cat > /etc/nut/notifyscript.sh << 'EOF'
#!/bin/bash
# UPS notification script

case "$1" in
    ONBATT)
        echo "UPS on battery power - gracefully stopping build jobs"
        # Signal agents to finish current jobs and stop accepting new ones
        systemctl stop jenkins-agent
        systemctl stop gitlab-runner
        systemctl stop devopshub-agent
        ;;
    LOWBATT)
        echo "UPS low battery - initiating emergency shutdown"
        # Force stop all services and prepare for shutdown
        docker stop $(docker ps -q)
        sync
        ;;
    COMMOK)
        echo "UPS communication restored"
        ;;
    COMMBAD)
        echo "UPS communication lost"
        ;;
    FSD)
        echo "Forced shutdown initiated"
        ;;
    SHUTDOWN)
        echo "System shutdown in progress"
        ;;
esac
EOF

chmod +x /etc/nut/notifyscript.sh

# Start NUT services
systemctl enable nut-server nut-client
systemctl start nut-server nut-client

Environmental Monitoring#

Temperature and Hardware Monitoring:

#!/bin/bash
# Environmental monitoring with lm-sensors and IPMI

# Install monitoring tools
apt-get install -y lm-sensors ipmitool smartmontools

# Detect sensors
sensors-detect --auto

# Create monitoring script
cat > /usr/local/bin/hardware-monitor.sh << 'EOF'
#!/bin/bash
# Hardware monitoring script

LOG_FILE="/var/log/hardware-monitor.log"
TEMP_THRESHOLD=75  # Celsius
FAN_THRESHOLD=500  # RPM

# Function to log with timestamp
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG_FILE"
}

# Check CPU temperatures
check_temperatures() {
    local temps=($(sensors | grep -E 'Core [0-9]+' | awk '{print $3}' | sed 's/+//g' | sed 's/°C//g'))

    for temp in "${temps[@]}"; do
        if (( $(echo "$temp > $TEMP_THRESHOLD" | bc -l) )); then
            log_message "WARNING: High CPU temperature: ${temp}°C"
            # Shed load: stop taking new jobs until temperatures recover
            if systemctl is-active gitlab-runner > /dev/null; then
                systemctl stop gitlab-runner || true
            fi
        fi
    done
}

# Check fan speeds
check_fans() {
    local fans=($(sensors | grep 'fan' | awk '{print $2}'))

    for fan in "${fans[@]}"; do
        if [[ "$fan" =~ ^[0-9]+$ ]] && (( fan < FAN_THRESHOLD )); then
            log_message "WARNING: Low fan speed: ${fan} RPM"
        fi
    done
}

# Check disk health
check_disks() {
    for disk in /dev/sd? /dev/nvme?n?; do
        if [[ -e "$disk" ]]; then
            # Health verdict is the last field of the overall-health line
            local health=$(smartctl -H "$disk" | grep "SMART overall-health" | awk '{print $NF}')
            if [[ "$health" != "PASSED" ]]; then
                log_message "CRITICAL: Disk health issue on $disk: $health"
            fi
        fi
    done
}

# IPMI monitoring (if available)
check_ipmi() {
    if command -v ipmitool &> /dev/null; then
        local chassis_status=$(ipmitool chassis status)

        if echo "$chassis_status" | grep -q "System Power.*off"; then
            log_message "WARNING: Chassis reports system power issues"
        fi
    fi
}

# Main monitoring function
main() {
    log_message "Starting hardware monitoring check"
    check_temperatures
    check_fans
    check_disks
    check_ipmi
    log_message "Hardware monitoring check completed"
}

main
EOF

chmod +x /usr/local/bin/hardware-monitor.sh

# Schedule monitoring
echo "*/5 * * * * root /usr/local/bin/hardware-monitor.sh" >> /etc/crontab

Disaster Recovery#

Infrastructure Backup and Recovery#

Complete Infrastructure Backup Strategy:

1
#!/bin/bash
2
# Comprehensive backup and recovery procedures
3
4
BACKUP_DIR="/backup/infrastructure"
5
REMOTE_BACKUP="backup-server:/srv/backup/runners"
6
DATE=$(date +%Y%m%d-%H%M%S)
7
8
# Create backup directories
9
mkdir -p "$BACKUP_DIR"/{system,configs,data,images}
10
11
# System configuration backup
12
backup_system_configs() {
13
echo "Backing up system configurations..."
14
15
# Essential system files
16
tar -czf "$BACKUP_DIR/system/system-configs-$DATE.tar.gz" \
17
/etc/fstab \
18
/etc/hosts \
19
/etc/network/ \
20
/etc/ssh/ \
21
/etc/sudoers \
22
/etc/sudoers.d/ \
23
/boot/grub/ \
24
/etc/systemd/system/ \
25
2>/dev/null || true
26
27
# Network configuration
28
cp -r /etc/netplan/ "$BACKUP_DIR/system/netplan-$DATE/" 2>/dev/null || true
29
30
# Firewall rules
31
iptables-save > "$BACKUP_DIR/system/iptables-$DATE.rules"
32
33
# Package list
34
dpkg --get-selections > "$BACKUP_DIR/system/packages-$DATE.list"
35
snap list > "$BACKUP_DIR/system/snap-packages-$DATE.list" 2>/dev/null || true
36
}
37
38
# Runner configurations backup
39
backup_runner_configs() {
40
echo "Backing up runner configurations..."
41
42
# GitHub Actions
43
if [[ -d /opt/actions-runner ]]; then
44
tar -czf "$BACKUP_DIR/configs/github-actions-$DATE.tar.gz" \
45
/opt/actions-runner/ \
46
--exclude="*.log" \
47
--exclude="_diag/" || true
48
fi
49
50
# GitLab Runner
51
if [[ -f /etc/gitlab-runner/config.toml ]]; then
52
cp /etc/gitlab-runner/config.toml "$BACKUP_DIR/configs/gitlab-runner-$DATE.toml"
53
fi
54
55
# Jenkins
56
if [[ -d /home/jenkins ]]; then
57
tar -czf "$BACKUP_DIR/configs/jenkins-$DATE.tar.gz" \
58
/home/jenkins/ \
59
--exclude="workspace/" \
60
--exclude="*.log" || true
61
fi
62
63
# DevOps Hub Agent
64
if [[ -d /etc/devopshub ]]; then
65
cp -r /etc/devopshub/ "$BACKUP_DIR/configs/devopshub-$DATE/"
66
fi
67
}

# Data backup
backup_data() {
    echo "Backing up important data..."

    # Build caches
    if [[ -d /opt/builds ]]; then
        tar -czf "$BACKUP_DIR/data/build-caches-$DATE.tar.gz" \
            /opt/builds/ccache/ \
            /opt/builds/maven/ \
            /opt/builds/gradle/ \
            2>/dev/null || true
    fi

    # Docker images (skip dangling <none> images)
    docker images --format '{{.Repository}}:{{.Tag}}' | \
        grep -v '<none>' | \
        while read -r image; do
            filename=$(echo "$image" | tr '/:' '_')
            docker save "$image" | gzip > "$BACKUP_DIR/images/$filename-$DATE.tar.gz"
        done
}

# System image creation
create_system_image() {
    echo "Creating system disk image..."

    # Create compressed disk image
    if command -v partclone.ext4 &> /dev/null; then
        partclone.ext4 -c -s /dev/sda1 | gzip > "$BACKUP_DIR/images/system-image-$DATE.img.gz"
    else
        dd if=/dev/sda bs=4M status=progress | gzip > "$BACKUP_DIR/images/system-image-$DATE.img.gz"
    fi
}

# Cleanup old backups
cleanup_old_backups() {
    echo "Cleaning up old backups..."

    # Keep 7 days of archives and 14 days of disk images
    find "$BACKUP_DIR" -name "*-*.tar.gz" -mtime +7 -delete
    find "$BACKUP_DIR" -name "*-*.img.gz" -mtime +14 -delete
}

# Sync to remote backup
sync_to_remote() {
    echo "Syncing to remote backup location..."

    if command -v rsync &> /dev/null; then
        rsync -av --delete "$BACKUP_DIR/" "$REMOTE_BACKUP/"
    fi
}

# Recovery procedures - generates recovery documentation
recovery_procedures() {
    cat > "$BACKUP_DIR/RECOVERY_PROCEDURES.txt" << 'RECOVERY_EOF'
DISASTER RECOVERY PROCEDURES
============================

SYSTEM RECOVERY
---------------

1. Boot from rescue media:
   mount /dev/sdb1 /mnt/backup
   gunzip -c /mnt/backup/images/system-image-YYYYMMDD.img.gz | dd of=/dev/sda bs=4M

2. Configuration restore:
   tar -xzf /mnt/backup/system/system-configs-YYYYMMDD.tar.gz -C /
   dpkg --set-selections < /mnt/backup/system/packages-YYYYMMDD.list
   apt-get dselect-upgrade

3. Runner configuration restore:
   tar -xzf /mnt/backup/configs/github-actions-YYYYMMDD.tar.gz -C /
   cp /mnt/backup/configs/gitlab-runner-YYYYMMDD.toml /etc/gitlab-runner/config.toml
   tar -xzf /mnt/backup/configs/jenkins-YYYYMMDD.tar.gz -C /
   cp -r /mnt/backup/configs/devopshub-YYYYMMDD/ /etc/devopshub/

4. Service restart:
   systemctl daemon-reload
   systemctl restart gitlab-runner jenkins-agent devopshub-agent

NETWORK RECOVERY
----------------

1. Restore network configuration:
   cp -r /mnt/backup/system/netplan-YYYYMMDD/* /etc/netplan/
   netplan apply

2. Restore firewall rules:
   iptables-restore < /mnt/backup/system/iptables-YYYYMMDD.rules

DATA RECOVERY
-------------

1. Restore build caches:
   tar -xzf /mnt/backup/data/build-caches-YYYYMMDD.tar.gz -C /
   chown -R runner:runner /opt/builds

2. Restore Docker images:
   cd /mnt/backup/images
   for image in *.tar.gz; do gunzip -c "$image" | docker load; done
RECOVERY_EOF
}

# Main execution
main() {
    echo "Starting infrastructure backup - $DATE"

    backup_system_configs
    backup_runner_configs
    backup_data

    # Only create a full system image on Sundays
    if [[ $(date +%u) -eq 7 ]]; then
        create_system_image
    fi

    recovery_procedures
    cleanup_old_backups
    sync_to_remote

    echo "Backup completed - $DATE"
}

# Execute if run directly
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
    main "$@"
fi
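A backup is only useful if it restores. A quick integrity pass over the archive tree catches truncated or corrupt files early; this is a hedged sketch — `verify_archives` is a helper introduced here, and the directory layout is assumed to match the `$BACKUP_DIR/{system,configs,data}` structure the script above creates:

```shell
#!/bin/bash
# verify_archives: check every backup archive under a backup root with
# `tar -tzf`, printing OK or CORRUPT per file. The {system,configs,data}
# subdirectories mirror those created by the backup script above.
verify_archives() {
    local dir="$1" archive
    for archive in "$dir"/{system,configs,data}/*.tar.gz; do
        [[ -e "$archive" ]] || continue
        if tar -tzf "$archive" > /dev/null 2>&1; then
            echo "OK $archive"
        else
            echo "CORRUPT $archive"
        fi
    done
}
```

Running this from cron shortly after the backup window turns silent corruption into an actionable log line instead of a surprise during recovery.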

Monitoring and Alerting#

Infrastructure Health Monitoring#

Comprehensive Monitoring with Prometheus and Grafana:

#!/bin/bash
# Deploy monitoring stack for bare metal runners

# Install Prometheus
useradd --no-create-home --shell /bin/false prometheus
mkdir -p /etc/prometheus /var/lib/prometheus
chown prometheus:prometheus /etc/prometheus /var/lib/prometheus

wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
cp prometheus-2.45.0.linux-amd64/prometheus /usr/local/bin/
cp prometheus-2.45.0.linux-amd64/promtool /usr/local/bin/
chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool

# Prometheus configuration for runners
cat > /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "runner_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter for system metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  # GitHub Actions runners
  - job_name: 'github-actions'
    static_configs:
      - targets: ['runner1:8080', 'runner2:8080', 'runner3:8080']

  # GitLab runners
  - job_name: 'gitlab-runner'
    static_configs:
      - targets: ['runner1:9252', 'runner2:9252', 'runner3:9252']

  # Jenkins agents
  - job_name: 'jenkins'
    static_configs:
      - targets: ['jenkins-master:8080']
    metrics_path: '/prometheus'

  # DevOps Hub agents
  - job_name: 'devopshub-agents'
    static_configs:
      - targets: ['runner1:8081', 'runner2:8081', 'runner3:8081']

  # IPMI exporter for hardware metrics
  - job_name: 'ipmi'
    static_configs:
      - targets: ['runner1', 'runner2', 'runner3']
    metrics_path: /ipmi
    params:
      module: [default]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9290

  # UPS monitoring
  - job_name: 'nut-exporter'
    static_configs:
      - targets: ['localhost:9199']
EOF

# Alerting rules
cat > /etc/prometheus/runner_rules.yml << 'EOF'
groups:
  - name: runner_alerts
    rules:
      - alert: RunnerDown
        expr: up{job=~".*runner.*"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Runner {{ $labels.instance }} is down"
          description: "Runner {{ $labels.instance }} has been down for more than 2 minutes."

      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 90% for more than 5 minutes."

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% for more than 5 minutes."

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 10% on the root filesystem."

      - alert: HighTemperature
        expr: node_hwmon_temp_celsius > 75
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High temperature on {{ $labels.instance }}"
          description: "Temperature is above 75°C on {{ $labels.instance }}."

      - alert: UPSOnBattery
        expr: nut_ups_status{flag="OB"} == 1
        for: 0m
        labels:
          severity: critical
        annotations:
          summary: "UPS on battery power"
          description: "UPS is running on battery power; mains supply lost."

      - alert: BuildJobsFailing
        expr: rate(gitlab_runner_failed_jobs_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High job failure rate"
          description: "Job failure rate is above 0.1 failed jobs per second over the last 5 minutes."
EOF

# Create systemd service
cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --web.listen-address=0.0.0.0:9090 \
    --web.external-url=

[Install]
WantedBy=multi-user.target
EOF

# Install Node Exporter
useradd --no-create-home --shell /bin/false node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz
tar xvf node_exporter-1.6.0.linux-amd64.tar.gz
cp node_exporter-1.6.0.linux-amd64/node_exporter /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter

cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes

[Install]
WantedBy=multi-user.target
EOF

# Install IPMI Exporter
wget https://github.com/prometheus-community/ipmi_exporter/releases/download/v1.6.1/ipmi_exporter-1.6.1.linux-amd64.tar.gz
tar xvf ipmi_exporter-1.6.1.linux-amd64.tar.gz
cp ipmi_exporter-1.6.1.linux-amd64/ipmi_exporter /usr/local/bin/
chown prometheus:prometheus /usr/local/bin/ipmi_exporter

cat > /etc/systemd/system/ipmi_exporter.service << 'EOF'
[Unit]
Description=IPMI Exporter
After=network.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/ipmi_exporter \
    --config.file=/etc/prometheus/ipmi.yml \
    --web.listen-address=0.0.0.0:9290

[Install]
WantedBy=multi-user.target
EOF

# Start services. Note: /etc/prometheus/ipmi.yml (the module configuration
# for your BMCs) must exist before ipmi_exporter will start.
systemctl daemon-reload
systemctl enable prometheus node_exporter ipmi_exporter
systemctl start prometheus node_exporter ipmi_exporter
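The HighMemoryUsage rule above is built on `MemAvailable` rather than `MemFree`, so reclaimable page cache does not trigger false alarms. The same arithmetic can be reproduced locally from `/proc/meminfo` to sanity-check what the alert will see; this is a sketch, and `mem_used_pct` is a helper defined here (values are in kB):

```shell
#!/bin/bash
# Mirror the HighMemoryUsage PromQL expression locally:
#   (MemTotal - MemAvailable) / MemTotal * 100
mem_used_pct() {
    local total avail
    total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
    echo $(( (total - avail) * 100 / total ))
}

echo "memory used: $(mem_used_pct)%"
# The alert fires once this value stays above 90 for five minutes.
```

Before starting the service, `promtool check config /etc/prometheus/prometheus.yml` and `promtool check rules /etc/prometheus/runner_rules.yml` will catch syntax errors in the generated files.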

Grafana Dashboard for Runner Infrastructure#

Runner Infrastructure Dashboard:

{
  "dashboard": {
    "id": null,
    "title": "Bare Metal Runner Infrastructure",
    "tags": ["runners", "infrastructure"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Runner Status Overview",
        "type": "stat",
        "targets": [
          {
            "expr": "up{job=~\".*runner.*\"}",
            "legendFormat": "{{ instance }}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "color": {
              "mode": "thresholds"
            },
            "thresholds": {
              "steps": [
                {"color": "red", "value": 0},
                {"color": "green", "value": 1}
              ]
            }
          }
        }
      },
      {
        "id": 2,
        "title": "System Resource Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "CPU {{ instance }}"
          },
          {
            "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
            "legendFormat": "Memory {{ instance }}"
          }
        ]
      },
      {
        "id": 3,
        "title": "Build Job Statistics",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(gitlab_runner_job_duration_seconds_count[5m])",
            "legendFormat": "Jobs/sec {{ instance }}"
          },
          {
            "expr": "rate(gitlab_runner_failed_jobs_total[5m])",
            "legendFormat": "Failed jobs/sec {{ instance }}"
          }
        ]
      },
      {
        "id": 4,
        "title": "Hardware Temperature",
        "type": "graph",
        "targets": [
          {
            "expr": "node_hwmon_temp_celsius",
            "legendFormat": "{{ instance }} {{ chip }} {{ sensor }}"
          }
        ],
        "yAxes": [
          {
            "unit": "celsius",
            "max": 100,
            "min": 0
          }
        ]
      },
      {
        "id": 5,
        "title": "Network I/O",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_network_receive_bytes_total[5m])",
            "legendFormat": "RX {{ instance }} {{ device }}"
          },
          {
            "expr": "rate(node_network_transmit_bytes_total[5m])",
            "legendFormat": "TX {{ instance }} {{ device }}"
          }
        ]
      },
      {
        "id": 6,
        "title": "Disk I/O",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_disk_read_bytes_total[5m])",
            "legendFormat": "Read {{ instance }} {{ device }}"
          },
          {
            "expr": "rate(node_disk_written_bytes_total[5m])",
            "legendFormat": "Write {{ instance }} {{ device }}"
          }
        ]
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "5s"
  }
}
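Before wiring this into Grafana, it is worth confirming the file parses as JSON at all. A minimal sketch, assuming the dashboard model above is saved as `runner-dashboard.json` (a hypothetical filename):

```shell
#!/bin/bash
# Validate the dashboard file before importing it into Grafana.
# "runner-dashboard.json" is an assumed name for the JSON model above.
if python3 -m json.tool runner-dashboard.json > /dev/null 2>&1; then
    echo "dashboard JSON is valid"
else
    echo "dashboard JSON is invalid or missing" >&2
fi
```

A valid file can then be imported through the Grafana UI (Dashboards → Import) or posted to Grafana's `/api/dashboards/db` HTTP API.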

This bare metal infrastructure guide gives enterprise teams end-to-end automation for deploying and managing self-hosted runners on physical hardware. It covers the major runner platforms and operating systems, and includes monitoring, alerting, and disaster recovery procedures suitable for production environments.

Next steps#