# Bare Metal Infrastructure for Self-Hosted Runners

Deploy and manage self-hosted runners on physical hardware infrastructure.

Bare metal infrastructure provides maximum performance, security, and control for your self-hosted runners. This guide covers end-to-end deployment automation for physical hardware across multiple operating systems and runner platforms.
## Infrastructure Overview

### Benefits of bare metal runners

**Performance advantages:**

- Direct hardware access without virtualization overhead
- Consistent performance for CPU-intensive builds
- Faster I/O operations for large codebases
- Dedicated resources with no noisy neighbors

**Security benefits:**

- Complete control over the hardware stack
- Physical security controls
- Air-gapped environments for sensitive workloads
- Custom hardening and compliance configurations

**Cost efficiency:**

- Predictable costs for long-running workloads
- Better price-performance ratio for high-utilization scenarios
- No cloud egress charges for large artifact transfers
- Simplified licensing for proprietary software

### Considerations

**Management overhead:**

- Hardware procurement and lifecycle management
- Operating system deployment and patching
- Physical security and environmental controls
- Power and cooling infrastructure

**Scalability limitations:**

- Fixed capacity requires up-front capacity planning
- Longer provisioning times for new hardware
- Physical space constraints
- Manual intervention for hardware failures
## Hardware Requirements

### Development Build Runners

Optimized for standard CI/CD workloads with moderate resource requirements.
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| CPU | 4 cores / 8 threads | 8 cores / 16 threads | Intel Xeon or AMD EPYC |
| RAM | 16 GB | 32 GB | ECC memory preferred |
| Storage | 256 GB SSD | 512 GB NVMe SSD | Local build cache |
| Network | 1 Gbps | 10 Gbps | Low latency to repositories |
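To make the table actionable, a small preflight check can compare a node against the "Minimum" column before it is registered as a runner. The thresholds below mirror the table; the commented detection commands are Linux-specific and illustrative, not part of any official tooling:

```shell
#!/bin/sh
# Preflight check against the development-runner "Minimum" column above.
# meets_minimums CORES THREADS RAM_GB DISK_GB -> exit 0 if the node qualifies.
meets_minimums() {
    cores=$1; threads=$2; ram_gb=$3; disk_gb=$4
    fail=0
    [ "$cores" -ge 4 ]     || { echo "FAIL: cores=$cores (need >= 4)"; fail=1; }
    [ "$threads" -ge 8 ]   || { echo "FAIL: threads=$threads (need >= 8)"; fail=1; }
    [ "$ram_gb" -ge 16 ]   || { echo "FAIL: ram=${ram_gb}GB (need >= 16)"; fail=1; }
    [ "$disk_gb" -ge 256 ] || { echo "FAIL: disk=${disk_gb}GB (need >= 256)"; fail=1; }
    return $fail
}

# On a live Linux node the inputs could be gathered roughly like this:
#   threads=$(nproc)
#   cores=$(lscpu -p=Core,Socket | grep -v '^#' | sort -u | wc -l)
#   ram_gb=$(( $(grep MemTotal /proc/meminfo | awk '{print $2}') / 1024 / 1024 ))
meets_minimums 8 16 32 512 && echo "node qualifies as a development build runner"
```

Running the same function with the "Recommended" thresholds gives a second tier for scheduling heavier jobs.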
### High-Performance CI/CD

For large monorepos, parallel builds, and intensive compilation workloads.
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| CPU | 16 cores / 32 threads | 32 cores / 64 threads | High-frequency cores |
| RAM | 64 GB | 128 GB | Large compilation jobs |
| Storage | 1 TB NVMe SSD | 2 TB NVMe SSD RAID 0 | Fast I/O for builds |
| Network | 10 Gbps | 25 Gbps | Artifact upload/download |
### Enterprise Workloads

High-availability configurations with redundancy and compliance requirements.
| Component | Minimum | Recommended | Notes |
|---|---|---|---|
| CPU | 24 cores / 48 threads | 48 cores / 96 threads | Dual socket preferred |
| RAM | 128 GB | 256 GB | ECC with memory mirroring |
| Storage | 2 TB NVMe SSD | 4 TB NVMe SSD RAID 1 | Redundant storage |
| Network | 10 Gbps bonded | 25 Gbps bonded | Network redundancy |
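The bonded network rows above correspond, on Linux, to a kernel bond interface. A minimal netplan sketch is shown below; the interface names, address, and LACP mode are assumptions for illustration, and the switch ports must be configured to match:

```yaml
# /etc/netplan/01-bond.yaml — example only; adjust NIC names and addresses
network:
  version: 2
  ethernets:
    eno1: {}
    eno2: {}
  bonds:
    bond0:
      interfaces: [eno1, eno2]
      parameters:
        mode: 802.3ad            # LACP; requires matching switch-side config
        mii-monitor-interval: 100
      addresses: [192.168.20.21/24]
      routes:
        - to: default
          via: 192.168.20.1
      nameservers:
        addresses: [192.168.20.10]
```

Apply with `netplan apply`; `mode: active-backup` is a simpler alternative when the switch cannot do LACP.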
### Specialized Build Requirements

For GPU workloads, mobile development, and machine learning pipelines.
| Component | GPU Builds | Mobile Development | ML Pipelines |
|---|---|---|---|
| CPU | 16+ cores | 8+ cores | 32+ cores |
| RAM | 64+ GB | 32+ GB | 128+ GB |
| Storage | 1+ TB NVMe | 512+ GB NVMe | 2+ TB NVMe |
| Special | NVIDIA RTX/Tesla | macOS capability | CUDA/ROCm support |
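When a mixed fleet includes these specialized nodes, it helps to derive a runner tag from what the hardware probe reports so jobs can target the right pool. A minimal sketch — the probe strings and tag names here are assumptions, not a real inventory API:

```shell
#!/bin/sh
# Map a hardware probe string (e.g. from `lspci` or `system_profiler`)
# to a runner tag matching the specialized-build columns above.
tag_for_hardware() {
    case "$1" in
        *NVIDIA*RTX*|*Tesla*) echo "gpu-build" ;;
        *Apple*|*Darwin*)     echo "mobile-macos" ;;
        *CUDA*|*ROCm*)        echo "ml-pipeline" ;;
        *)                    echo "general" ;;
    esac
}

# Example probes; on Linux a real probe might be: lspci | grep -i 'vga\|3d'
tag_for_hardware "NVIDIA Corporation GA102 [GeForce RTX 3090]"
tag_for_hardware "Apple M2 Pro / Darwin 22.6"
tag_for_hardware "AMD Instinct MI250 (ROCm)"
```

The tag can then be passed to whichever registration command the platform uses (for example GitLab's `--tag-list`).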
## Network Configuration

### Physical Network Topology
```text
# Network architecture for bare metal runners
Internet
  │
  ├─ Firewall/Router (pfSense/FortiGate)
  │
  ├─ Management Network (VLAN 10 - 192.168.10.0/24)
  │    ├─ IPMI/BMC interfaces
  │    ├─ Network switches
  │    └─ Monitoring systems
  │
  ├─ Runner Network (VLAN 20 - 192.168.20.0/24)
  │    ├─ Linux runners
  │    ├─ Windows runners
  │    └─ macOS runners
  │
  └─ Storage Network (VLAN 30 - 192.168.30.0/24)
       ├─ NFS/SMB servers
       ├─ Backup systems
       └─ Artifact repositories
```

### VLAN Configuration
**Management VLAN (10):**

```text
# Switch configuration for management VLAN
vlan 10 name "Management"
interface vlan 10
  ip address 192.168.10.1 255.255.255.0
  ip helper-address 192.168.10.10
```

**Runner VLAN (20):**

```text
# Switch configuration for runner VLAN
vlan 20 name "Runners"
interface vlan 20
  ip address 192.168.20.1 255.255.255.0
  ip helper-address 192.168.20.10
```

**Storage VLAN (30):**

```text
# Switch configuration for storage VLAN
vlan 30 name "Storage"
interface vlan 30
  ip address 192.168.30.1 255.255.255.0
```

### Hybrid Cloud Connectivity
**Site-to-Site VPN Configuration:**

```bash
#!/bin/bash
# IPSec VPN setup for hybrid cloud connectivity

# Install strongSwan
apt-get update
apt-get install -y strongswan strongswan-pki

# Configure IPSec
# Note: the sha1/modp1024 proposals below are kept for compatibility with
# older peers; prefer IKEv2 with e.g. aes256-sha256-modp2048 where both
# endpoints support it.
cat > /etc/ipsec.conf << 'EOF'
config setup
    charondebug="ike 1, knl 1, cfg 0"
    uniqueids=no

conn aws-vpn
    auto=start
    left=%defaultroute
    leftid=203.0.113.12
    leftsubnet=192.168.0.0/16
    right=198.51.100.12
    rightsubnet=10.0.0.0/16
    ike=aes256-sha1-modp1024!
    esp=aes256-sha1!
    keyexchange=ikev1
    authby=secret
    dpddelay=30
    dpdtimeout=120
    dpdaction=restart
EOF

# Set shared secret
echo "203.0.113.12 198.51.100.12 : PSK 'your-shared-secret'" > /etc/ipsec.secrets
chmod 600 /etc/ipsec.secrets

# Start IPSec
systemctl enable strongswan
systemctl start strongswan
```

## Operating System Deployment
### PXE Boot Infrastructure

**DHCP Server Configuration:**

```text
# /etc/dhcp/dhcpd.conf
default-lease-time 600;
max-lease-time 7200;

subnet 192.168.20.0 netmask 255.255.255.0 {
  range 192.168.20.100 192.168.20.200;
  option routers 192.168.20.1;
  option domain-name-servers 192.168.20.10;
  option broadcast-address 192.168.20.255;

  # PXE boot configuration
  filename "pxelinux.0";
  next-server 192.168.20.10;
}
```

**TFTP Server Setup:**

```bash
#!/bin/bash
# Install and configure TFTP server
# (pxelinux provides /usr/lib/PXELINUX/pxelinux.0 on Ubuntu)
apt-get install -y tftpd-hpa pxelinux syslinux-common

# Configure TFTP
cat > /etc/default/tftpd-hpa << 'EOF'
TFTP_USERNAME="tftp"
TFTP_DIRECTORY="/var/lib/tftpboot"
TFTP_ADDRESS="192.168.20.10:69"
TFTP_OPTIONS="--secure"
EOF

# Copy PXE boot files
cp /usr/lib/PXELINUX/pxelinux.0 /var/lib/tftpboot/
cp /usr/lib/syslinux/modules/bios/*.c32 /var/lib/tftpboot/
mkdir -p /var/lib/tftpboot/pxelinux.cfg

# Create PXE menu
cat > /var/lib/tftpboot/pxelinux.cfg/default << 'EOF'
DEFAULT menu.c32
PROMPT 0
TIMEOUT 300
ONTIMEOUT local

MENU TITLE PXE Boot Menu

LABEL local
  MENU LABEL Boot from local disk
  LOCALBOOT 0

LABEL ubuntu
  MENU LABEL Ubuntu 22.04 Automated Install
  KERNEL ubuntu/vmlinuz
  APPEND initrd=ubuntu/initrd.gz autoinstall ds=nocloud-net\;s=http://192.168.20.10/cloud-init/
EOF

systemctl enable tftpd-hpa
systemctl start tftpd-hpa
```

### Automated Ubuntu Deployment
**Cloud-init Configuration:**

```yaml
# /var/www/html/cloud-init/user-data
#cloud-config
autoinstall:
  version: 1
  locale: en_US.UTF-8
  keyboard:
    layout: us

  network:
    network:
      version: 2
      ethernets:
        eno1:
          dhcp4: true

  storage:
    layout:
      name: lvm

  identity:
    hostname: runner-node
    username: runner
    password: '$6$rounds=4096$saltsaltsal$hash'

  ssh:
    install-server: true
    authorized-keys:
      - ssh-rsa AAAAB3NzaC1yc2EAAAA... runner@devopshub

  packages:
    - docker.io
    - git
    - curl
    - wget
    - unzip
    - build-essential

  late-commands:
    # late-commands run in the installer environment; wrap anything that
    # must execute inside the installed system with `curtin in-target`
    - curtin in-target -- systemctl enable docker
    - curtin in-target -- usermod -aG docker runner
    - curtin in-target -- wget -O /tmp/install-runner.sh http://192.168.20.10/scripts/install-runner.sh
    - curtin in-target -- chmod +x /tmp/install-runner.sh
    - curtin in-target -- sudo -u runner /tmp/install-runner.sh
```

### Automated Windows Deployment
**Windows Deployment Services (WDS):**

```powershell
# Install WDS role
Install-WindowsFeature -Name WDS -IncludeManagementTools

# Configure WDS
wdsutil /initialize-server /reminst:"C:\RemoteInstall"
wdsutil /set-server /answerclients:all

# Add boot image
wdsutil /add-image /imagefile:"C:\Sources\boot.wim" /imagetype:boot

# Add install image with unattend
wdsutil /add-image /imagefile:"C:\Sources\install.wim" /imagetype:install
```

**Unattend.xml for Automated Installation:**

```xml
<?xml version="1.0" encoding="utf-8"?>
<unattend xmlns="urn:schemas-microsoft-com:unattend"
          xmlns:wcm="http://schemas.microsoft.com/WMIConfig/2002/State">
  <settings pass="windowsPE">
    <component name="Microsoft-Windows-Setup" processorArchitecture="amd64"
               publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS">
      <DiskConfiguration>
        <Disk wcm:action="add">
          <DiskID>0</DiskID>
          <WillWipeDisk>true</WillWipeDisk>
          <CreatePartitions>
            <CreatePartition wcm:action="add">
              <Order>1</Order>
              <Size>512</Size>
              <Type>Primary</Type>
            </CreatePartition>
            <CreatePartition wcm:action="add">
              <Order>2</Order>
              <Extend>true</Extend>
              <Type>Primary</Type>
            </CreatePartition>
          </CreatePartitions>
        </Disk>
      </DiskConfiguration>
      <UserData>
        <AcceptEula>true</AcceptEula>
        <ProductKey>
          <WillShowUI>Never</WillShowUI>
        </ProductKey>
      </UserData>
    </component>
  </settings>

  <settings pass="specialize">
    <component name="Microsoft-Windows-Shell-Setup" processorArchitecture="amd64"
               publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS">
      <ComputerName>WIN-RUNNER</ComputerName>
      <ProductKey>VK7JG-NPHTM-C97JM-9MPGT-3V66T</ProductKey>
    </component>
  </settings>

  <settings pass="oobeSystem">
    <component name="Microsoft-Windows-Shell-Setup" processorArchitecture="amd64"
               publicKeyToken="31bf3856ad364e35" language="neutral" versionScope="nonSxS">
      <OOBE>
        <HideEULAPage>true</HideEULAPage>
        <HideWirelessSetupInOOBE>true</HideWirelessSetupInOOBE>
        <NetworkLocation>Work</NetworkLocation>
        <ProtectYourPC>1</ProtectYourPC>
      </OOBE>
      <UserAccounts>
        <!-- Sample passwords: replace before use and prefer encrypted values -->
        <AdministratorPassword>
          <Value>P@ssw0rd123</Value>
          <PlainText>true</PlainText>
        </AdministratorPassword>
        <LocalAccounts>
          <LocalAccount wcm:action="add">
            <Password>
              <Value>P@ssw0rd123</Value>
              <PlainText>true</PlainText>
            </Password>
            <Description>Runner Service Account</Description>
            <DisplayName>runner</DisplayName>
            <Group>Administrators</Group>
            <Name>runner</Name>
          </LocalAccount>
        </LocalAccounts>
      </UserAccounts>
      <FirstLogonCommands>
        <SynchronousCommand wcm:action="add">
          <CommandLine>powershell.exe -ExecutionPolicy Bypass -File C:\Scripts\setup-runner.ps1</CommandLine>
          <Order>1</Order>
        </SynchronousCommand>
      </FirstLogonCommands>
    </component>
  </settings>
</unattend>
```

### macOS Automated Deployment
**macOS NetInstall Server:**

```bash
#!/bin/bash
# Create macOS NetInstall image

# Install macOS Server tools
sudo installer -pkg /Applications/Server.app/Contents/ServerRoot/System/Installation/Packages/OSInstall.mpkg -target /

# Create NetInstall image
sudo /System/Library/CoreServices/System\ Image\ Utility.app/Contents/MacOS/System\ Image\ Utility \
  --source /Applications/Install\ macOS\ Monterey.app \
  --output /Users/Shared/NetInstall.nbi \
  --kind netinstall \
  --name "macOS Runner AutoInstall"

# Configure NetBoot service
sudo serveradmin settings netboot:sharepoint = "/Users/Shared"
sudo serveradmin start netboot
```

## Platform Installation Automation
### GitHub Actions Runner Installation

**Cross-Platform Installation Script:**

```bash
#!/bin/bash
# GitHub Actions runner installation

RUNNER_VERSION="2.311.0"
GITHUB_TOKEN="$1"
GITHUB_URL="$2"   # runner config URL, e.g. https://github.com/my-org
GITHUB_API="$3"   # REST endpoint for registration tokens, e.g. https://api.github.com/orgs/my-org

# Detect OS
if [[ "$OSTYPE" == "linux-gnu"* ]]; then
    OS="linux"
    ARCH="x64"
elif [[ "$OSTYPE" == "darwin"* ]]; then
    OS="osx"
    ARCH="x64"
elif [[ "$OSTYPE" == "msys" ]]; then
    # Note: Windows runner packages are published as .zip; adapt the
    # download/extract steps below accordingly
    OS="win"
    ARCH="x64"
else
    echo "Unsupported OS: $OSTYPE" >&2
    exit 1
fi

# Download runner
mkdir -p /opt/actions-runner && cd /opt/actions-runner
curl -o actions-runner.tar.gz -L "https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-${OS}-${ARCH}-${RUNNER_VERSION}.tar.gz"
tar xzf actions-runner.tar.gz

# Generate registration token
REG_TOKEN=$(curl -s -X POST \
    -H "Authorization: token ${GITHUB_TOKEN}" \
    -H "Accept: application/vnd.github.v3+json" \
    "${GITHUB_API}/actions/runners/registration-token" | \
    jq -r .token)

# Configure runner
./config.sh --url "${GITHUB_URL}" --token "${REG_TOKEN}" --unattended --replace

# Install as service
if [[ "$OS" == "linux" ]]; then
    sudo ./svc.sh install
    sudo ./svc.sh start
elif [[ "$OS" == "osx" ]]; then
    ./svc.sh install
    ./svc.sh start
elif [[ "$OS" == "win" ]]; then
    powershell -Command ".\svc.cmd install"
    powershell -Command ".\svc.cmd start"
fi
```

### GitLab Runner Installation
**Multi-Executor GitLab Runner Setup:**

```bash
#!/bin/bash
# GitLab Runner installation and configuration

# Install GitLab Runner
curl -L "https://packages.gitlab.com/install/repositories/runner/gitlab-runner/script.deb.sh" | sudo bash
sudo apt-get install -y gitlab-runner

# Register multiple runners with different executors
# (run as root so the registrations land in /etc/gitlab-runner/config.toml)
sudo gitlab-runner register \
  --non-interactive \
  --url "https://gitlab.com/" \
  --registration-token "$GITLAB_TOKEN" \
  --executor "docker" \
  --docker-image "alpine:latest" \
  --description "Docker executor runner" \
  --tag-list "docker,linux" \
  --run-untagged="true" \
  --locked="false" \
  --access-level="not_protected"

sudo gitlab-runner register \
  --non-interactive \
  --url "https://gitlab.com/" \
  --registration-token "$GITLAB_TOKEN" \
  --executor "shell" \
  --description "Shell executor runner" \
  --tag-list "shell,linux" \
  --run-untagged="false" \
  --locked="false" \
  --access-level="not_protected"

# Configure concurrent jobs
sudo sed -i 's/concurrent = 1/concurrent = 4/' /etc/gitlab-runner/config.toml

# Start service
sudo systemctl enable gitlab-runner
sudo systemctl start gitlab-runner
```

### Jenkins Agent Installation
**Jenkins Agent with JNLP:**

```bash
#!/bin/bash
# Jenkins agent installation

JENKINS_URL="$1"
AGENT_NAME="$2"
AGENT_SECRET="$3"

# Create jenkins user
sudo useradd -m -s /bin/bash jenkins
sudo usermod -aG docker jenkins

# Download agent JAR
sudo -u jenkins mkdir -p /home/jenkins/agent
cd /home/jenkins/agent
sudo -u jenkins wget "${JENKINS_URL}/jnlpJars/agent.jar"

# Create systemd service
sudo tee /etc/systemd/system/jenkins-agent.service > /dev/null << EOF
[Unit]
Description=Jenkins Agent
After=network.target

[Service]
Type=simple
User=jenkins
WorkingDirectory=/home/jenkins/agent
ExecStart=/usr/bin/java -jar agent.jar -jnlpUrl ${JENKINS_URL}/computer/${AGENT_NAME}/jenkins-agent.jnlp -secret ${AGENT_SECRET} -workDir /home/jenkins/agent
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable jenkins-agent
sudo systemctl start jenkins-agent
```

### Azure DevOps Agent Installation
**Azure Pipelines Agent Setup:**

```powershell
# Azure DevOps agent installation (PowerShell)
param(
    [Parameter(Mandatory=$true)]
    [string]$OrganizationUrl,

    [Parameter(Mandatory=$true)]
    [string]$PersonalAccessToken,

    [Parameter(Mandatory=$true)]
    [string]$Pool,

    [Parameter(Mandatory=$true)]
    [string]$AgentName
)

# Download and extract agent
$agentDir = "C:\agent"
New-Item -ItemType Directory -Path $agentDir -Force
Set-Location $agentDir

$webClient = New-Object System.Net.WebClient
$webClient.DownloadFile("https://vstsagentpackage.azureedge.net/agent/2.214.1/vsts-agent-win-x64-2.214.1.zip", "$agentDir\agent.zip")

Expand-Archive -Path "$agentDir\agent.zip" -DestinationPath $agentDir -Force

# Configure agent
.\config.cmd --unattended --url $OrganizationUrl --auth pat --token $PersonalAccessToken --pool $Pool --agent $AgentName --runAsService --windowsLogonAccount "NT AUTHORITY\SYSTEM"

# Start service
.\svc.cmd install
.\svc.cmd start
```

### Bazel Remote Execution Setup
**Bazel RBE Worker Configuration:**

```bash
#!/bin/bash
# Bazel Remote Build Execution worker setup

# Install Bazel
curl -fsSL https://bazel.build/bazel-release.pub.gpg | gpg --dearmor > bazel.gpg
sudo mv bazel.gpg /etc/apt/trusted.gpg.d/
echo "deb [arch=amd64] https://storage.googleapis.com/bazel-apt stable jdk1.8" | sudo tee /etc/apt/sources.list.d/bazel.list
sudo apt update && sudo apt install -y bazel

# Install BuildBuddy RBE worker
wget https://github.com/buildbuddy-io/buildbuddy/releases/latest/download/executor-linux-amd64
chmod +x executor-linux-amd64
sudo mv executor-linux-amd64 /usr/local/bin/buildbuddy-executor

# Create service user and configuration
sudo useradd -r -s /usr/sbin/nologin buildbuddy
sudo mkdir -p /etc/buildbuddy
sudo tee /etc/buildbuddy/config.yaml > /dev/null << EOF
executor:
  app_target: "grpc://buildbuddy.example.com:1985"
  root_directory: "/tmp/buildbuddy"
  local_cache_directory: "/tmp/buildbuddy-cache"
  local_cache_size_bytes: 10000000000  # 10GB

  # Platform properties
  platform:
    os: "linux"
    arch: "amd64"

  # Resource limits
  cpu_count: 8
  memory_bytes: 16000000000  # 16GB
EOF

# Create systemd service
sudo tee /etc/systemd/system/buildbuddy-executor.service > /dev/null << EOF
[Unit]
Description=BuildBuddy RBE Executor
After=network.target

[Service]
Type=simple
User=buildbuddy
ExecStart=/usr/local/bin/buildbuddy-executor --config_file=/etc/buildbuddy/config.yaml
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable buildbuddy-executor
sudo systemctl start buildbuddy-executor
```

## Storage Configuration
### Local Storage Optimization

**NVMe SSD Configuration for Build Performance:**

```bash
#!/bin/bash
# Optimize NVMe SSDs for build workloads

# Check NVMe devices
nvme list

# Set up RAID 0 for maximum performance (multiple NVMe drives)
# WARNING: RAID 0 with a journal-less filesystem trades durability for
# speed; use this volume only for expendable data such as build caches.
mdadm --create --verbose /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1

# Format with optimized filesystem
mkfs.ext4 -F -O ^has_journal /dev/md0
tune2fs -o discard /dev/md0

# Mount with performance optimizations
mkdir -p /opt/builds
echo '/dev/md0 /opt/builds ext4 defaults,noatime,discard 0 2' >> /etc/fstab
mount /opt/builds

# Set up build cache directories
mkdir -p /opt/builds/{docker,maven,gradle,npm,ccache}
chown -R runner:runner /opt/builds
```

### Network-Attached Storage
**NFS Server for Shared Build Cache:**

```bash
#!/bin/bash
# NFS server setup for shared build artifacts

# Install NFS server
apt-get update
apt-get install -y nfs-kernel-server

# Create shared directories
mkdir -p /srv/nfs/{artifacts,cache,tools}

# Configure exports
cat > /etc/exports << 'EOF'
/srv/nfs/artifacts 192.168.20.0/24(rw,sync,no_subtree_check,no_root_squash)
/srv/nfs/cache 192.168.20.0/24(rw,sync,no_subtree_check,no_root_squash)
/srv/nfs/tools 192.168.20.0/24(ro,sync,no_subtree_check,no_root_squash)
EOF

# Start NFS services
systemctl enable nfs-kernel-server
systemctl start nfs-kernel-server
exportfs -ra

# Client-side NFS mounting
# On runner nodes:
#   mkdir -p /mnt/{artifacts,cache,tools}
#   echo '192.168.30.10:/srv/nfs/artifacts /mnt/artifacts nfs defaults,_netdev 0 0' >> /etc/fstab
#   echo '192.168.30.10:/srv/nfs/cache /mnt/cache nfs defaults,_netdev 0 0' >> /etc/fstab
#   echo '192.168.30.10:/srv/nfs/tools /mnt/tools nfs defaults,_netdev 0 0' >> /etc/fstab
#   mount -a
```

### Backup and Archive Strategy
**Automated Backup with rsync and ZFS:**

```bash
#!/bin/bash
# Backup strategy for runner infrastructure

# Create ZFS pool for backups
zpool create backuppool /dev/sdb /dev/sdc
zfs create backuppool/daily
zfs create backuppool/weekly
zfs create backuppool/monthly

# Automated backup script
cat > /usr/local/bin/backup-runners.sh << 'EOF'
#!/bin/bash
DATE=$(date +%Y%m%d)

# Backup runner configurations
rsync -av --delete /etc/gitlab-runner/ /backup/gitlab-runner-$DATE/
rsync -av --delete /opt/actions-runner/ /backup/actions-runner-$DATE/
rsync -av --delete /home/jenkins/ /backup/jenkins-$DATE/

# Create ZFS snapshots (weekly/monthly may already exist on later runs)
zfs snapshot backuppool/daily@$DATE
zfs snapshot backuppool/weekly@$(date +%Y-W%U) 2>/dev/null || true
zfs snapshot backuppool/monthly@$(date +%Y-%m) 2>/dev/null || true

# Retention policy - keep 7 daily, 4 weekly, 12 monthly
zfs list -H -t snapshot -o name | grep '^backuppool/daily@' | sort | head -n -7 | xargs -r -n1 zfs destroy
zfs list -H -t snapshot -o name | grep '^backuppool/weekly@' | sort | head -n -4 | xargs -r -n1 zfs destroy
zfs list -H -t snapshot -o name | grep '^backuppool/monthly@' | sort | head -n -12 | xargs -r -n1 zfs destroy
EOF

chmod +x /usr/local/bin/backup-runners.sh

# Schedule backups
echo "0 2 * * * root /usr/local/bin/backup-runners.sh" >> /etc/crontab
```

## DevOps Hub Agent Installation
### Cross-Platform Agent Deployment

**Universal Agent Installation Script:**

```bash
#!/bin/bash
# DevOps Hub agent installation for bare metal runners

AGENT_VERSION="1.2.3"
API_TOKEN="$1"
HUB_URL="${2:-https://api.devopshub.com}"
AGENT_NAME="${3:-$(hostname)}"

# Detect platform
case "$(uname -s)" in
    Linux*)  PLATFORM=linux;;
    Darwin*) PLATFORM=macos;;
    MINGW*)  PLATFORM=windows;;
    *) echo "Unsupported platform"; exit 1;;
esac

case "$(uname -m)" in
    x86_64*)  ARCH=amd64;;
    arm64*)   ARCH=arm64;;
    aarch64*) ARCH=arm64;;
    *) echo "Unsupported architecture"; exit 1;;
esac

# Download agent
DOWNLOAD_URL="${HUB_URL}/downloads/agent/v${AGENT_VERSION}/devopshub-agent-${PLATFORM}-${ARCH}"
curl -fsSL "${DOWNLOAD_URL}" -o /tmp/devopshub-agent
chmod +x /tmp/devopshub-agent

# Install agent
sudo mkdir -p /opt/devopshub
sudo mv /tmp/devopshub-agent /opt/devopshub/
sudo ln -sf /opt/devopshub/devopshub-agent /usr/local/bin/

# Create configuration
sudo mkdir -p /etc/devopshub
sudo tee /etc/devopshub/config.yaml > /dev/null << EOF
agent:
  name: "${AGENT_NAME}"
  token: "${API_TOKEN}"
  hub_url: "${HUB_URL}"

  # Capabilities auto-detection
  auto_detect_capabilities: true

  # Resource limits
  max_concurrent_jobs: 4
  max_memory_mb: 8192
  max_disk_mb: 51200

  # Platform-specific settings
  platforms:
    - github-actions
    - gitlab-ci
    - jenkins
    - azure-devops
    - bazel-rbe

logging:
  level: info
  file: /var/log/devopshub-agent.log
EOF

# Capability detection
/opt/devopshub/devopshub-agent detect-capabilities > /tmp/capabilities.json

# Register agent
/opt/devopshub/devopshub-agent register \
  --config /etc/devopshub/config.yaml \
  --capabilities /tmp/capabilities.json

# Create systemd service (Linux)
if [[ "$PLATFORM" == "linux" ]]; then
  sudo tee /etc/systemd/system/devopshub-agent.service > /dev/null << EOF
[Unit]
Description=DevOps Hub Agent
After=network.target

[Service]
Type=simple
User=runner
ExecStart=/opt/devopshub/devopshub-agent run --config /etc/devopshub/config.yaml
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

  sudo systemctl daemon-reload
  sudo systemctl enable devopshub-agent
  sudo systemctl start devopshub-agent
fi

# Create launchd service (macOS)
if [[ "$PLATFORM" == "macos" ]]; then
  sudo tee /Library/LaunchDaemons/com.devopshub.agent.plist > /dev/null << EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.devopshub.agent</string>
    <key>ProgramArguments</key>
    <array>
        <string>/opt/devopshub/devopshub-agent</string>
        <string>run</string>
        <string>--config</string>
        <string>/etc/devopshub/config.yaml</string>
    </array>
    <key>RunAtLoad</key>
    <true/>
    <key>KeepAlive</key>
    <true/>
</dict>
</plist>
EOF

  sudo launchctl load /Library/LaunchDaemons/com.devopshub.agent.plist
fi

echo "DevOps Hub agent installed successfully"
echo "Agent name: ${AGENT_NAME}"
echo "Capabilities: $(cat /tmp/capabilities.json)"
```

### Agent Health Monitoring
**Health Check and Self-Healing:**

```bash
#!/bin/bash
# DevOps Hub agent health monitoring

# Health check function
check_agent_health() {
    local status_url="http://localhost:8080/health"
    local response=$(curl -s -o /dev/null -w "%{http_code}" "$status_url")

    if [[ "$response" == "200" ]]; then
        echo "Agent healthy"
        return 0
    else
        echo "Agent unhealthy (HTTP $response)"
        return 1
    fi
}

# Service restart function
restart_agent() {
    echo "Restarting DevOps Hub agent..."

    case "$(uname -s)" in
        Linux*)
            sudo systemctl restart devopshub-agent
            ;;
        Darwin*)
            sudo launchctl unload /Library/LaunchDaemons/com.devopshub.agent.plist
            sleep 5
            sudo launchctl load /Library/LaunchDaemons/com.devopshub.agent.plist
            ;;
    esac
}

# Capability refresh
refresh_capabilities() {
    echo "Refreshing agent capabilities..."
    /opt/devopshub/devopshub-agent detect-capabilities > /tmp/capabilities.json
    /opt/devopshub/devopshub-agent update-capabilities --capabilities /tmp/capabilities.json
}

# Main monitoring loop
main() {
    local consecutive_failures=0
    local max_failures=3

    while true; do
        if check_agent_health; then
            consecutive_failures=0

            # Refresh capabilities daily
            if [[ $(date +%H:%M) == "02:00" ]]; then
                refresh_capabilities
            fi
        else
            ((consecutive_failures++))

            if [[ $consecutive_failures -ge $max_failures ]]; then
                restart_agent
                consecutive_failures=0
                sleep 30  # Wait for service to start
            fi
        fi

        sleep 60  # Check every minute
    done
}

# Run as daemon or one-shot
if [[ "${1}" == "--daemon" ]]; then
    main
else
    check_agent_health
fi
```

## Power and Environmental Management
### Uninterruptible Power Supply (UPS) Integration

**Network UPS Tools (NUT) Configuration:**

```bash
#!/bin/bash
# UPS monitoring and management with NUT

# Install NUT
apt-get update
apt-get install -y nut nut-client nut-server

# Configure UPS (APC Smart-UPS via USB)
cat > /etc/nut/ups.conf << 'EOF'
[apc-ups]
    driver = usbhid-ups
    port = auto
    desc = "APC Smart-UPS 1500VA"
    vendorid = 051d
EOF

# Configure NUT daemon
cat > /etc/nut/nut.conf << 'EOF'
MODE=netserver
EOF

# Configure users
cat > /etc/nut/upsd.users << 'EOF'
[admin]
    password = supersecret
    actions = SET
    instcmds = ALL
    upsmon master

[upsmon]
    password = secret
    upsmon slave
EOF

# Configure monitoring
cat > /etc/nut/upsmon.conf << 'EOF'
MONITOR apc-ups@localhost 1 upsmon secret master
MINSUPPLIES 1
SHUTDOWNCMD "/sbin/shutdown -h +0"
POLLFREQ 5
POLLFREQALERT 5
HOSTSYNC 15
DEADTIME 15
POWERDOWNFLAG /etc/killpower

NOTIFYCMD /etc/nut/notifyscript.sh
NOTIFYFLAG ONBATT SYSLOG+WALL+EXEC
NOTIFYFLAG LOWBATT SYSLOG+WALL+EXEC
NOTIFYFLAG FSD SYSLOG+WALL+EXEC
NOTIFYFLAG COMMOK SYSLOG+EXEC
NOTIFYFLAG COMMBAD SYSLOG+WALL+EXEC
NOTIFYFLAG SHUTDOWN SYSLOG+WALL+EXEC
NOTIFYFLAG REPLBATT SYSLOG+WALL+EXEC
NOTIFYFLAG NOCOMM SYSLOG+WALL+EXEC
EOF

# Notification script
cat > /etc/nut/notifyscript.sh << 'EOF'
#!/bin/bash
# UPS notification script

case "$1" in
    ONBATT)
        echo "UPS on battery power - gracefully stopping build jobs"
        # Signal agents to finish current jobs and stop accepting new ones
        systemctl stop jenkins-agent
        systemctl stop gitlab-runner
        systemctl stop devopshub-agent
        ;;
    LOWBATT)
        echo "UPS low battery - initiating emergency shutdown"
        # Force stop all services and prepare for shutdown
        docker stop $(docker ps -q)
        sync
        ;;
    COMMOK)
        echo "UPS communication restored"
        ;;
    COMMBAD)
        echo "UPS communication lost"
        ;;
    FSD)
        echo "Forced shutdown initiated"
        ;;
    SHUTDOWN)
        echo "System shutdown in progress"
        ;;
esac
EOF

chmod +x /etc/nut/notifyscript.sh

# Start NUT services
systemctl enable nut-server nut-client
systemctl start nut-server nut-client
```

### Environmental Monitoring
**Temperature and Hardware Monitoring:**

```bash
#!/bin/bash
# Environmental monitoring with lm-sensors and IPMI

# Install monitoring tools (bc is used for temperature comparisons)
apt-get install -y lm-sensors ipmitool smartmontools bc

# Detect sensors
sensors-detect --auto

# Create monitoring script
cat > /usr/local/bin/hardware-monitor.sh << 'EOF'
#!/bin/bash
# Hardware monitoring script

LOG_FILE="/var/log/hardware-monitor.log"
TEMP_THRESHOLD=75  # Celsius
FAN_THRESHOLD=500  # RPM

# Function to log with timestamp
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG_FILE"
}

# Check CPU temperatures
check_temperatures() {
    local temps=($(sensors | grep -E 'Core [0-9]+' | awk '{print $3}' | sed 's/+//g' | sed 's/°C//g'))

    for temp in "${temps[@]}"; do
        if (( $(echo "$temp > $TEMP_THRESHOLD" | bc -l) )); then
            log_message "WARNING: High CPU temperature: ${temp}°C"
            # Reduce runner concurrency; gitlab-runner reloads config.toml on change
            if systemctl is-active gitlab-runner > /dev/null; then
                sed -i 's/^concurrent = .*/concurrent = 1/' /etc/gitlab-runner/config.toml
            fi
        fi
    done
}

# Check fan speeds
check_fans() {
    local fans=($(sensors | grep 'fan' | awk '{print $2}'))

    for fan in "${fans[@]}"; do
        if [[ "$fan" =~ ^[0-9]+$ ]] && (( fan < FAN_THRESHOLD )); then
            log_message "WARNING: Low fan speed: ${fan} RPM"
        fi
    done
}

# Check disk health
check_disks() {
    for disk in /dev/sd? /dev/nvme?n?; do
        if [[ -e "$disk" ]]; then
            local health=$(smartctl -H "$disk" | grep "SMART overall-health" | awk '{print $6}')
            if [[ "$health" != "PASSED" ]]; then
                log_message "CRITICAL: Disk health issue on $disk: $health"
            fi
        fi
    done
}

# IPMI monitoring (if available)
check_ipmi() {
    if command -v ipmitool &> /dev/null; then
        local chassis_status=$(ipmitool chassis status)

        if echo "$chassis_status" | grep -q "System Power.*off"; then
            log_message "WARNING: Chassis reports system power issues"
        fi
    fi
}

# Main monitoring function
main() {
    log_message "Starting hardware monitoring check"
    check_temperatures
    check_fans
    check_disks
    check_ipmi
    log_message "Hardware monitoring check completed"
}

main
EOF

chmod +x /usr/local/bin/hardware-monitor.sh

# Schedule monitoring
echo "*/5 * * * * root /usr/local/bin/hardware-monitor.sh" >> /etc/crontab
```

## Disaster Recovery
Infrastructure Backup and Recovery#
Complete Infrastructure Backup Strategy:
```bash
#!/bin/bash
# Comprehensive backup and recovery procedures

BACKUP_DIR="/backup/infrastructure"
REMOTE_BACKUP="backup-server:/srv/backup/runners"
DATE=$(date +%Y%m%d-%H%M%S)

# Create backup directories
mkdir -p "$BACKUP_DIR"/{system,configs,data,images}

# System configuration backup
backup_system_configs() {
    echo "Backing up system configurations..."

    # Essential system files
    tar -czf "$BACKUP_DIR/system/system-configs-$DATE.tar.gz" \
        /etc/fstab \
        /etc/hosts \
        /etc/network/ \
        /etc/ssh/ \
        /etc/sudoers \
        /etc/sudoers.d/ \
        /boot/grub/ \
        /etc/systemd/system/ \
        2>/dev/null || true

    # Network configuration
    cp -r /etc/netplan/ "$BACKUP_DIR/system/netplan-$DATE/" 2>/dev/null || true

    # Firewall rules
    iptables-save > "$BACKUP_DIR/system/iptables-$DATE.rules"

    # Package lists
    dpkg --get-selections > "$BACKUP_DIR/system/packages-$DATE.list"
    snap list > "$BACKUP_DIR/system/snap-packages-$DATE.list" 2>/dev/null || true
}

# Runner configurations backup
backup_runner_configs() {
    echo "Backing up runner configurations..."

    # GitHub Actions
    if [[ -d /opt/actions-runner ]]; then
        tar -czf "$BACKUP_DIR/configs/github-actions-$DATE.tar.gz" \
            --exclude="*.log" \
            --exclude="_diag/" \
            /opt/actions-runner/ || true
    fi

    # GitLab Runner
    if [[ -f /etc/gitlab-runner/config.toml ]]; then
        cp /etc/gitlab-runner/config.toml "$BACKUP_DIR/configs/gitlab-runner-$DATE.toml"
    fi

    # Jenkins
    if [[ -d /home/jenkins ]]; then
        tar -czf "$BACKUP_DIR/configs/jenkins-$DATE.tar.gz" \
            --exclude="workspace/" \
            --exclude="*.log" \
            /home/jenkins/ || true
    fi

    # DevOps Hub Agent
    if [[ -d /etc/devopshub ]]; then
        cp -r /etc/devopshub/ "$BACKUP_DIR/configs/devopshub-$DATE/"
    fi
}

# Data backup
backup_data() {
    echo "Backing up important data..."

    # Build caches
    if [[ -d /opt/builds ]]; then
        tar -czf "$BACKUP_DIR/data/build-caches-$DATE.tar.gz" \
            /opt/builds/ccache/ \
            /opt/builds/maven/ \
            /opt/builds/gradle/ \
            2>/dev/null || true
    fi

    # Docker images
    docker images --format '{{.Repository}}:{{.Tag}}' | \
        grep -v '<none>' | \
        while read -r image; do
            filename=$(echo "$image" | tr '/:' '_')
            docker save "$image" | gzip > "$BACKUP_DIR/images/$filename-$DATE.tar.gz"
        done
}

# System image creation
create_system_image() {
    echo "Creating system disk image..."

    # Create compressed disk image (partclone only copies used blocks)
    if command -v partclone.ext4 &> /dev/null; then
        partclone.ext4 -c -s /dev/sda1 | gzip > "$BACKUP_DIR/images/system-image-$DATE.img.gz"
    else
        dd if=/dev/sda bs=4M status=progress | gzip > "$BACKUP_DIR/images/system-image-$DATE.img.gz"
    fi
}

# Cleanup old backups
cleanup_old_backups() {
    echo "Cleaning up old backups..."

    # Keep 7 days of archives and 14 days of disk images
    find "$BACKUP_DIR" -name "*-*.tar.gz" -mtime +7 -delete
    find "$BACKUP_DIR" -name "*-*.img.gz" -mtime +14 -delete
}

# Sync to remote backup
sync_to_remote() {
    echo "Syncing to remote backup location..."

    if command -v rsync &> /dev/null; then
        rsync -av --delete "$BACKUP_DIR/" "$REMOTE_BACKUP/"
    fi
}

# Recovery procedures - generates recovery documentation
recovery_procedures() {
    cat > "$BACKUP_DIR/RECOVERY_PROCEDURES.txt" << 'RECOVERY_EOF'
DISASTER RECOVERY PROCEDURES
=============================

SYSTEM RECOVERY
---------------

1. Boot from rescue media:
   mount /dev/sdb1 /mnt/backup
   gunzip -c /mnt/backup/images/system-image-YYYYMMDD.img.gz | dd of=/dev/sda bs=4M

2. Configuration restore:
   tar -xzf /mnt/backup/system/system-configs-YYYYMMDD.tar.gz -C /
   dpkg --set-selections < /mnt/backup/system/packages-YYYYMMDD.list
   apt-get dselect-upgrade

3. Runner configuration restore:
   tar -xzf /mnt/backup/configs/github-actions-YYYYMMDD.tar.gz -C /
   cp /mnt/backup/configs/gitlab-runner-YYYYMMDD.toml /etc/gitlab-runner/config.toml
   tar -xzf /mnt/backup/configs/jenkins-YYYYMMDD.tar.gz -C /
   cp -r /mnt/backup/configs/devopshub-YYYYMMDD/ /etc/devopshub/

4. Service restart:
   systemctl daemon-reload
   systemctl restart gitlab-runner jenkins-agent devopshub-agent

NETWORK RECOVERY
----------------

1. Restore network configuration:
   cp -r /mnt/backup/system/netplan-YYYYMMDD/* /etc/netplan/
   netplan apply

2. Restore firewall rules:
   iptables-restore < /mnt/backup/system/iptables-YYYYMMDD.rules

DATA RECOVERY
-------------

1. Restore build caches:
   tar -xzf /mnt/backup/data/build-caches-YYYYMMDD.tar.gz -C /
   chown -R runner:runner /opt/builds

2. Restore Docker images:
   cd /mnt/backup/images
   for image in *.tar.gz; do gunzip -c "$image" | docker load; done
RECOVERY_EOF
}

# Main execution
main() {
    echo "Starting infrastructure backup - $DATE"

    backup_system_configs
    backup_runner_configs
    backup_data

    # Only create a full system image on Sundays
    if [[ $(date +%u) -eq 7 ]]; then
        create_system_image
    fi

    recovery_procedures
    cleanup_old_backups
    sync_to_remote

    echo "Backup completed - $DATE"
}

# Execute if run directly
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
    main "$@"
fi
```

Monitoring and Alerting#
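A good first monitoring target is the backup job itself: the backup script above writes timestamped archives, so their age can be exposed as a Prometheus gauge through node_exporter's textfile collector. The sketch below assumes node_exporter is started with `--collector.textfile.directory` pointing at the same directory (that flag is not enabled in the service unit later in this guide, so add it if you use this); the metric name is an example, not from any exporter.

```bash
#!/bin/bash
# Publish backup freshness as a Prometheus gauge via node_exporter's
# textfile collector. The metric name runner_backup_age_seconds is an
# illustrative choice, not a standard exporter metric.

export_backup_age() {
    local backup_dir="$1" textfile_dir="$2"
    local latest now age

    # Epoch timestamp of the newest system-config archive (GNU find)
    latest=$(find "$backup_dir" -name 'system-configs-*.tar.gz' \
        -printf '%T@\n' 2>/dev/null | sort -n | tail -1)
    now=$(date +%s)

    if [[ -n "$latest" ]]; then
        age=$(( now - ${latest%.*} ))   # strip fractional seconds
    else
        age=-1                          # no backup found at all
    fi

    # Write to a temp file and rename so node_exporter never reads a partial file
    cat > "$textfile_dir/backup_status.prom.tmp" <<EOF
# HELP runner_backup_age_seconds Seconds since the last system-config backup (-1 if none).
# TYPE runner_backup_age_seconds gauge
runner_backup_age_seconds $age
EOF
    mv "$textfile_dir/backup_status.prom.tmp" "$textfile_dir/backup_status.prom"
}

# Example cron usage (hypothetical paths):
#   */15 * * * * /usr/local/bin/backup-metrics.sh
# where the script calls:
#   export_backup_age "/backup/infrastructure" "/var/lib/node_exporter/textfile"
```

Run from cron alongside the backup job, this pairs naturally with an alert rule such as `runner_backup_age_seconds > 93600` (a day plus some slack) or `runner_backup_age_seconds < 0` for a missing backup.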
Infrastructure Health Monitoring#
Comprehensive Monitoring with Prometheus and Grafana:
```bash
#!/bin/bash
# Deploy monitoring stack for bare metal runners

# Install Prometheus
useradd --no-create-home --shell /bin/false prometheus
mkdir -p /etc/prometheus /var/lib/prometheus
chown prometheus:prometheus /etc/prometheus /var/lib/prometheus

wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvf prometheus-2.45.0.linux-amd64.tar.gz
cp prometheus-2.45.0.linux-amd64/prometheus /usr/local/bin/
cp prometheus-2.45.0.linux-amd64/promtool /usr/local/bin/
cp -r prometheus-2.45.0.linux-amd64/consoles /etc/prometheus/
cp -r prometheus-2.45.0.linux-amd64/console_libraries /etc/prometheus/
chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtool
chown -R prometheus:prometheus /etc/prometheus

# Prometheus configuration for runners
cat > /etc/prometheus/prometheus.yml << 'EOF'
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "runner_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter for system metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

  # GitHub Actions runners
  - job_name: 'github-actions'
    static_configs:
      - targets: ['runner1:8080', 'runner2:8080', 'runner3:8080']

  # GitLab runners
  - job_name: 'gitlab-runner'
    static_configs:
      - targets: ['runner1:9252', 'runner2:9252', 'runner3:9252']

  # Jenkins agents
  - job_name: 'jenkins'
    static_configs:
      - targets: ['jenkins-master:8080']
    metrics_path: '/prometheus'

  # DevOps Hub agents
  - job_name: 'devopshub-agents'
    static_configs:
      - targets: ['runner1:8081', 'runner2:8081', 'runner3:8081']

  # IPMI exporter for hardware metrics
  - job_name: 'ipmi'
    static_configs:
      - targets: ['runner1', 'runner2', 'runner3']
    metrics_path: /ipmi
    params:
      module: [default]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9290

  # UPS monitoring
  - job_name: 'nut-exporter'
    static_configs:
      - targets: ['localhost:9199']
EOF

# Alerting rules
cat > /etc/prometheus/runner_rules.yml << 'EOF'
groups:
- name: runner_alerts
  rules:
  - alert: RunnerDown
    expr: up{job=~".*runner.*"} == 0
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Runner {{ $labels.instance }} is down"
      description: "Runner {{ $labels.instance }} has been down for more than 2 minutes."

  - alert: HighCPUUsage
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High CPU usage on {{ $labels.instance }}"
      description: "CPU usage is above 90% for more than 5 minutes."

  - alert: HighMemoryUsage
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 90
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High memory usage on {{ $labels.instance }}"
      description: "Memory usage is above 90% for more than 5 minutes."

  - alert: DiskSpaceLow
    expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Low disk space on {{ $labels.instance }}"
      description: "Disk space is below 10% on root filesystem."

  - alert: HighTemperature
    expr: node_hwmon_temp_celsius > 75
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "High temperature on {{ $labels.instance }}"
      description: "Temperature is above 75°C on {{ $labels.instance }}."

  - alert: UPSOnBattery
    expr: nut_ups_status{flag="OB"} == 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "UPS on battery power"
      description: "UPS is running on battery power, mains supply lost."

  - alert: BuildJobsFailing
    expr: rate(gitlab_runner_failed_jobs_total[5m]) > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High job failure rate"
      description: "Job failure rate is above 0.1 jobs/sec over the last 5 minutes."
EOF

# Create systemd service
cat > /etc/systemd/system/prometheus.service << 'EOF'
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries \
    --web.listen-address=0.0.0.0:9090

[Install]
WantedBy=multi-user.target
EOF

# Install Node Exporter
useradd --no-create-home --shell /bin/false node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.0/node_exporter-1.6.0.linux-amd64.tar.gz
tar xvf node_exporter-1.6.0.linux-amd64.tar.gz
cp node_exporter-1.6.0.linux-amd64/node_exporter /usr/local/bin/
chown node_exporter:node_exporter /usr/local/bin/node_exporter

cat > /etc/systemd/system/node_exporter.service << 'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
    --collector.systemd \
    --collector.processes

[Install]
WantedBy=multi-user.target
EOF

# Install IPMI Exporter
wget https://github.com/prometheus-community/ipmi_exporter/releases/download/v1.6.1/ipmi_exporter-1.6.1.linux-amd64.tar.gz
tar xvf ipmi_exporter-1.6.1.linux-amd64.tar.gz
cp ipmi_exporter-1.6.1.linux-amd64/ipmi_exporter /usr/local/bin/
chown prometheus:prometheus /usr/local/bin/ipmi_exporter

# Minimal IPMI exporter module configuration (referenced by the service below)
cat > /etc/prometheus/ipmi.yml << 'EOF'
modules:
  default:
    collectors:
      - ipmi
      - dcmi
      - chassis
EOF

cat > /etc/systemd/system/ipmi_exporter.service << 'EOF'
[Unit]
Description=IPMI Exporter
After=network.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/ipmi_exporter \
    --config.file=/etc/prometheus/ipmi.yml \
    --web.listen-address=0.0.0.0:9290

[Install]
WantedBy=multi-user.target
EOF

# Start services
systemctl daemon-reload
systemctl enable prometheus node_exporter ipmi_exporter
systemctl start prometheus node_exporter ipmi_exporter
```

Grafana Dashboard for Runner Infrastructure#
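With Prometheus scraping, Grafana can be pointed at it as a data source and dashboards provisioned over its HTTP API rather than clicked together by hand. A minimal helper is sketched below; the Grafana URL and token in the usage comment are placeholders, and it assumes a service-account token with dashboard write permission.

```bash
#!/bin/bash
# Post a dashboard JSON file to Grafana's dashboard API.

grafana_import() {
    local url="$1" token="$2" file="$3"
    # /api/dashboards/db expects a payload wrapped in a "dashboard" key;
    # the dashboard definition below already has that wrapper, so the
    # file can be posted as-is.
    curl -fsS -X POST "$url/api/dashboards/db" \
        -H "Authorization: Bearer $token" \
        -H "Content-Type: application/json" \
        -d @"$file"
}

# Example (hypothetical host and token):
#   grafana_import "http://grafana.internal:3000" "$GRAFANA_TOKEN" runner-dashboard.json
```

Keeping the dashboard JSON in version control and importing it this way makes the monitoring setup reproducible alongside the rest of the infrastructure automation.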
Runner Infrastructure Dashboard:
```json
{
  "dashboard": {
    "id": null,
    "title": "Bare Metal Runner Infrastructure",
    "tags": ["runners", "infrastructure"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Runner Status Overview",
        "type": "stat",
        "targets": [
          {
            "expr": "up{job=~\".*runner.*\"}",
            "legendFormat": "{{ instance }}"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "color": {
              "mode": "thresholds"
            },
            "thresholds": {
              "steps": [
                {"color": "red", "value": 0},
                {"color": "green", "value": 1}
              ]
            }
          }
        }
      },
      {
        "id": 2,
        "title": "System Resource Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
            "legendFormat": "CPU {{ instance }}"
          },
          {
            "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100",
            "legendFormat": "Memory {{ instance }}"
          }
        ]
      },
      {
        "id": 3,
        "title": "Build Job Statistics",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(gitlab_runner_job_duration_seconds_count[5m])",
            "legendFormat": "Jobs/sec {{ instance }}"
          },
          {
            "expr": "rate(gitlab_runner_failed_jobs_total[5m])",
            "legendFormat": "Failed jobs/sec {{ instance }}"
          }
        ]
      },
      {
        "id": 4,
        "title": "Hardware Temperature",
        "type": "graph",
        "targets": [
          {
            "expr": "node_hwmon_temp_celsius",
            "legendFormat": "{{ instance }} {{ chip }} {{ sensor }}"
          }
        ],
        "yAxes": [
          {
            "unit": "celsius",
            "max": 100,
            "min": 0
          }
        ]
      },
      {
        "id": 5,
        "title": "Network I/O",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_network_receive_bytes_total[5m])",
            "legendFormat": "RX {{ instance }} {{ device }}"
          },
          {
            "expr": "rate(node_network_transmit_bytes_total[5m])",
            "legendFormat": "TX {{ instance }} {{ device }}"
          }
        ]
      },
      {
        "id": 6,
        "title": "Disk I/O",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(node_disk_read_bytes_total[5m])",
            "legendFormat": "Read {{ instance }} {{ device }}"
          },
          {
            "expr": "rate(node_disk_written_bytes_total[5m])",
            "legendFormat": "Write {{ instance }} {{ device }}"
          }
        ]
      }
    ],
    "time": {
      "from": "now-1h",
      "to": "now"
    },
    "refresh": "5s"
  }
}
```

This bare metal infrastructure guide gives enterprise teams complete automation for deploying and managing self-hosted runners on physical hardware. It covers the major runner platforms and operating systems, and includes monitoring, alerting, and disaster recovery procedures suitable for production environments.