Managed Prometheus

Assistance-operated metrics, alerting, dashboards, and reliability signals for production systems


Managed Prometheus is for teams that need dependable metrics and alerting but do not want Prometheus itself to become another production platform nobody owns. Assistance operates the observability stack while your team owns service meaning, response decisions, and product reliability priorities.

Best-fit use cases

| Use case | Why Managed Prometheus fits |
| --- | --- |
| Infrastructure monitoring | Server, container, Kubernetes, network, storage, and platform metrics |
| Application health | Request rate, error rate, latency, saturation, queue depth, and custom metrics |
| Alerting cleanup | Replace noisy pages with actionable alerts tied to ownership and runbooks |
| SLO visibility | Build service-level indicators, error budget views, and reliability review dashboards |
| Managed service visibility | Monitor databases, Redis, Kafka, OpenSearch, registries, and platform dependencies |
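
Application health signals such as request rate and error rate are usually derived from counters. As a rough illustration, the sketch below computes a per-second rate and an error ratio from raw counter samples, treating any drop in value as a counter reset. It is a simplified model, not Prometheus's own `rate()` implementation (which also extrapolates to the window boundaries), and the function names and sample values are hypothetical.

```python
from typing import Sequence, Tuple

# (unix timestamp in seconds, counter value); hypothetical sample shape
Sample = Tuple[float, float]


def counter_increase(samples: Sequence[Sample]) -> float:
    """Total increase of a counter across the samples, tolerating resets."""
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the process restarted and the counter began again
        # from zero, so the whole current value counts as new increase.
        increase += curr - prev if curr >= prev else curr
    return increase


def per_second_rate(samples: Sequence[Sample]) -> float:
    """Average per-second increase between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


def error_ratio(total: Sequence[Sample], errors: Sequence[Sample]) -> float:
    """Share of requests that failed over the sampled window."""
    total_increase = counter_increase(total)
    return counter_increase(errors) / total_increase if total_increase > 0 else 0.0


# Illustrative data: the counter resets between t=15 and t=30.
total_requests = [(0, 100), (15, 160), (30, 20), (45, 80)]
failed_requests = [(0, 5), (15, 8), (30, 1), (45, 4)]
print(f"requests/s: {per_second_rate(total_requests):.2f}")
print(f"error ratio: {error_ratio(total_requests, failed_requests):.2%}")
```

In a running platform these values would come from PromQL queries rather than hand-rolled code; the sketch only shows what the dashboards are computing.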

What Assistance operates

| Area | Included managed service responsibility |
| --- | --- |
| Provisioning | Prometheus deployment, scrape topology, storage sizing, network placement, and secure defaults |
| Collection | Scrape configuration, service discovery patterns, exporter onboarding guidance, and target health monitoring |
| Alerting | Alertmanager setup, routing, severity labels, silences, inhibition rules, and integration with paging/chat tools |
| Dashboards | Grafana data source integration, base dashboards, and service health views where scoped |
| Retention | Local retention and long-term storage options such as Thanos/Cortex/Mimir-style patterns when required |
| Maintenance | Version lifecycle guidance, patching, configuration changes, maintenance windows, and rollback planning |
| Support | Platform incident response and escalation for covered observability services |
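
Storage sizing for local retention can be approximated with the rule of thumb from the Prometheus storage documentation: disk space is roughly retention seconds × ingested samples per second × bytes per sample, with about 1 to 2 bytes per sample after compression. The sketch below applies that arithmetic. The series count, scrape interval, and function name are illustrative assumptions, and real usage also depends on series churn and write-ahead log overhead.

```python
SECONDS_PER_DAY = 86_400


def estimate_disk_bytes(
    retention_days: float,
    active_series: int,
    scrape_interval_s: float,
    bytes_per_sample: float = 2.0,  # assumption: upper end of the 1-2 byte range
) -> float:
    """Rough local TSDB size for a given retention and ingestion volume."""
    if scrape_interval_s <= 0:
        raise ValueError("scrape_interval_s must be positive")
    # Each active series produces one sample per scrape interval.
    samples_per_second = active_series / scrape_interval_s
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


# Hypothetical environment: 500k active series scraped every 30 s, kept 15 days.
size = estimate_disk_bytes(retention_days=15, active_series=500_000, scrape_interval_s=30)
print(f"estimated disk: {size / 1e9:.1f} GB")
```

An estimate like this is a starting point for the design conversation; measured ingestion rates from the running platform should replace the assumptions once they are available.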

Ownership boundary

| Responsibility | Assistance owns | Customer owns |
| --- | --- | --- |
| Prometheus runtime | Deployment, scraping platform, retention, upgrades, monitoring, and platform incidents | Instrumenting application code and exposing meaningful metrics |
| Alert routing | Alertmanager configuration, integrations, routing mechanics, and noise-reduction implementation | Service owners, severity policy, escalation decisions, and response behavior |
| Dashboards | Platform dashboards and agreed service views | Business meaning, product KPIs, and interpretation of application-specific metrics |
| SLOs | Technical implementation of SLIs/SLO dashboards where scoped | Choosing user-facing objectives and accepting error-budget trade-offs |
| Access | Roles, data source permissions, credential rotation support | User approval, identity source, and internal access reviews |
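
To make the SLO row concrete: once a team has chosen an objective, the error budget and burn rate follow from simple arithmetic. The sketch below is a minimal illustration using the standard definitions (budget = 1 − target, burn rate = observed error ratio ÷ budget). The 99.9% target, 30-day window, and function names are assumptions chosen for the example, not recommended values.

```python
def error_budget_minutes(slo_target: float, window_days: float) -> float:
    """Minutes of full unavailability the objective tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return (1.0 - slo_target) * window_days * 24 * 60


def burn_rate(observed_error_ratio: float, slo_target: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    return observed_error_ratio / (1.0 - slo_target)


# Hypothetical objective: 99.9% of requests succeed over 30 days.
print(f"budget: {error_budget_minutes(0.999, 30):.1f} minutes")
print(f"burn rate at 0.5% errors: {burn_rate(0.005, 0.999):.1f}x")
```

The platform can compute and display these numbers; deciding what target is acceptable, and what to do when the budget runs out, stays with the service owner.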

Deployment options

| Option | When to use it |
| --- | --- |
| Assistance physical servers | Development platform monitoring, staging observability, and internal services |
| Customer cloud account | Production observability inside existing cloud/network/compliance boundary |
| Hybrid observability | Central managed Prometheus with remote write or federation across environments |
| SRE engagement | Combine Managed Prometheus with service ownership, incident response, SLO, and runbook work |

Reliability and support model

| Topic | Managed Prometheus approach |
| --- | --- |
| Availability | Scoped by topology, retention design, and support plan; HA pairs or long-term storage used where required |
| Data retention | Retention and downsampling defined by operational and compliance needs |
| Alert delivery | Integrations configured for agreed channels; escalation ownership must be defined by customer/team |
| Platform monitoring | Prometheus monitors itself: scrape failures, query pressure, storage, rule evaluation, and Alertmanager health |
| Response | Critical response targets scoped in the support agreement; 24/7 coverage available for covered production observability platforms |
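
To illustrate what downsampling means for long-term retention, the sketch below collapses raw samples into fixed-width time buckets and keeps count, sum, min, and max per bucket. This is a simplified model in the spirit of Thanos-style downsampling, not the implementation of any specific long-term store. The 5-minute resolution and the function name are assumptions.

```python
from typing import Dict, Iterable, Tuple


def downsample(
    samples: Iterable[Tuple[float, float]],
    resolution_s: int = 300,  # assumption: 5-minute buckets
) -> Dict[int, Dict[str, float]]:
    """Aggregate (timestamp, value) samples into fixed-width buckets."""
    buckets: Dict[int, Dict[str, float]] = {}
    for ts, value in samples:
        # Align each sample to the start of the bucket that contains it.
        start = int(ts // resolution_s) * resolution_s
        agg = buckets.setdefault(
            start, {"count": 0, "sum": 0.0, "min": value, "max": value}
        )
        agg["count"] += 1
        agg["sum"] += value
        agg["min"] = min(agg["min"], value)
        agg["max"] = max(agg["max"], value)
    # Return buckets in time order.
    return dict(sorted(buckets.items()))


# Hypothetical raw samples spanning two 5-minute buckets.
raw = [(0, 1.0), (60, 3.0), (299, 2.0), (300, 10.0)]
for start, agg in downsample(raw).items():
    print(start, agg)
```

Keeping min and max alongside the sum and count means that spikes remain visible on dashboards even after the raw samples have aged out.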

Onboarding

1. Observability assessment

We review current metrics, dashboards, alert history, incident pain points, service ownership, environments, retention needs, and existing tools.

2. Platform design

Assistance defines scrape architecture, retention, long-term storage, dashboards, alert routing, integrations, access model, and support tier.

3. Signal implementation

We configure targets, exporters, rules, dashboards, Alertmanager routes, and runbook links. Where needed, we help teams define service-level indicators.
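
As an illustration of how label-based alert routing ties alerts to owners, the sketch below picks a receiver for an alert by checking its labels against an ordered list of routes and falling back to a default. It is a deliberately flattened model: real Alertmanager routing is a tree with nested routes, regex matchers, and a `continue` option. The `Route` class, label names, and receiver names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Route:
    receiver: str
    # Every listed label must equal the given value for the route to match.
    matchers: Dict[str, str] = field(default_factory=dict)


def pick_receiver(
    labels: Dict[str, str], routes: List[Route], default_receiver: str
) -> str:
    """Return the receiver of the first matching route, else the default."""
    for route in routes:
        if all(labels.get(name) == value for name, value in route.matchers.items()):
            return route.receiver
    return default_receiver


# Hypothetical routing: critical payments alerts page, the rest go to chat.
routes = [
    Route("payments-pager", {"team": "payments", "severity": "critical"}),
    Route("payments-chat", {"team": "payments"}),
]
alert = {"alertname": "HighErrorRate", "team": "payments", "severity": "critical"}
print(pick_receiver(alert, routes, default_receiver="platform-default"))
```

Order matters in this model: the more specific route is listed first so that critical alerts reach the pager before the broader team route can claim them.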

4. Operate and refine

After go-live, we monitor platform health, tune noisy alerts, review capacity, and keep dashboards aligned with service ownership and incident response.
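
One way to decide which alerts to tune first is to rank them by how often they fire and how many firings resolve on their own within a few minutes. The sketch below is a minimal example of that kind of review over an exported alert history. The record fields (`alertname`, `duration_s`), the 5-minute threshold, and the function name are assumptions for illustration, not a format produced by Alertmanager.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def noise_report(
    history: Iterable[Dict[str, object]],
    short_lived_s: float = 300,  # assumption: under 5 minutes counts as short-lived
) -> List[Tuple[str, int, float]]:
    """Return (alertname, firings, short-lived share), most frequent first."""
    fired: Counter = Counter()
    short_lived: Counter = Counter()
    for event in history:
        name = str(event["alertname"])
        fired[name] += 1
        if float(event["duration_s"]) < short_lived_s:
            short_lived[name] += 1
    return [
        (name, count, short_lived[name] / count)
        for name, count in fired.most_common()
    ]


# Hypothetical alert history export.
history = [
    {"alertname": "HighCPU", "duration_s": 60},
    {"alertname": "HighCPU", "duration_s": 120},
    {"alertname": "DiskFull", "duration_s": 3600},
    {"alertname": "HighCPU", "duration_s": 900},
]
for name, count, share in noise_report(history):
    print(f"{name}: fired {count}x, {share:.0%} short-lived")
```

Alerts that fire often and mostly clear by themselves are typical candidates for a longer `for` duration, a higher threshold, or a lower severity, subject to the service owner's agreement.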

Supported capabilities

  • Prometheus servers, HA patterns, and federation/remote-write designs
  • Alertmanager routing, silencing, inhibition, and notification integrations
  • Grafana dashboards and data source configuration
  • Exporter onboarding for Linux, Kubernetes, PostgreSQL, MySQL, Redis, MongoDB, Kafka, Nginx, HAProxy, and common infrastructure
  • Long-term metric storage patterns where required
  • SLO dashboard implementation when paired with reliability work

Not included by default

  • Instrumenting every application endpoint
  • Defining business KPIs without product owner input
  • Providing blanket on-call response for services outside the support plan
  • Guaranteeing alert actionability when service ownership is undefined
  • Replacing all existing observability tools unless migration is scoped

Frequently asked questions

Can you work with our existing Grafana?
Yes. We can integrate with existing Grafana or operate Grafana as part of the managed observability platform when scoped.

Do you write application metrics?
We advise on instrumentation and can implement it as separate project work. By default, application teams own code-level metrics.

Can this reduce alert noise?
Yes, if service ownership and severity criteria are defined. We tune alerts to actionable conditions and connect them to dashboards and runbooks.

Do you provide on-call response for alerts?
Only for services explicitly covered by the support agreement. We can route alerts to your team, Assistance, or a shared model depending on scope.

What retention is available?
Retention is designed per plan and may include local storage plus long-term storage. We choose based on query needs, compliance, cost, and SLO review requirements.