Managed Prometheus

Assistance-operated metrics, alerting, dashboards, and reliability signals for production systems

Managed Prometheus is the metrics and alerting companion to Managed OpenSearch in our observability services. Use Prometheus for time-series signals, alert rules, SLO evidence, and infrastructure visibility; use OpenSearch for logs, search, and indexed operational data. Assistance operates the observability stack inside an agreed consulting or services boundary, while your team owns service meaning, response decisions, and product reliability priorities.

Best-fit use cases

Use case	Why Managed Prometheus fits
Infrastructure monitoring	Server, container, Kubernetes, network, storage, and platform metrics
Application health	Request rate, error rate, latency, saturation, queue depth, and custom metrics
Alerting cleanup	Replace noisy pages with actionable alerts tied to ownership and runbooks
SLO visibility	Build service-level indicators, error budget views, and reliability review dashboards
Managed service visibility	Monitor databases, Redis, Kafka, OpenSearch, registries, and platform dependencies
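
As a rough illustration of what an error budget view computes, the sketch below derives a request-based budget from an SLO target and request counts. The function name `error_budget`, the 99.9% target, and the request numbers are assumptions made for this example, not part of the service; real SLIs would come from Prometheus queries agreed with the service owner.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    allowed_failures: float    # failures the SLO tolerates in the window
    consumed_fraction: float   # share of the budget already spent (can exceed 1.0)
    remaining_failures: float  # allowed failures minus observed failures


def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> ErrorBudget:
    """Compute a request-based error budget for one SLO window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if failed_requests < 0 or total_requests < failed_requests:
        raise ValueError("expected 0 <= failed_requests <= total_requests")
    allowed = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed if allowed else 0.0
    return ErrorBudget(allowed, consumed, allowed - failed_requests)


# Illustrative numbers only: a 99.9% target over two million requests.
budget = error_budget(slo_target=0.999, total_requests=2_000_000, failed_requests=500)
print(f"allowed={budget.allowed_failures:.0f} consumed={budget.consumed_fraction:.0%}")
```

A consumed fraction above 1.0 means the window has overspent its budget. Deciding what to do about that remains a service-owner decision.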

What Assistance operates

Area	Included managed service responsibility
Provisioning	Prometheus deployment, scrape topology, storage sizing, network placement, and secure defaults
Collection	Scrape configuration, service discovery patterns, exporter onboarding guidance, and target health monitoring
Alerting	Alertmanager setup, routing, severity labels, silences, inhibition rules, and integration with paging/chat tools
Dashboards	Grafana data source integration, base dashboards, and service health views where scoped
Retention	Local retention and long-term storage options such as Thanos/Cortex/Mimir-style patterns when required
Maintenance	Version lifecycle guidance, patching, configuration changes, maintenance windows, and rollback planning
Support	Platform incident response and escalation for covered observability services
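
To make the routing and inhibition responsibilities concrete, here is a simplified Python model of label-based routing. It is not Alertmanager's implementation or configuration format; the route list, receiver names, and the `service` label are hypothetical, and the first-match behavior is an assumption of the sketch.

```python
# Hypothetical routes, checked in order; the first full match wins.
ROUTES = [
    {"match": {"severity": "critical"}, "receiver": "pager"},
    {"match": {"severity": "warning", "team": "platform"}, "receiver": "platform-chat"},
    {"match": {"severity": "warning"}, "receiver": "team-chat"},
]
DEFAULT_RECEIVER = "triage-inbox"  # catch-all for alerts no route claims


def pick_receiver(labels: dict) -> str:
    """Return the receiver of the first route whose labels all match."""
    for route in ROUTES:
        if all(labels.get(key) == value for key, value in route["match"].items()):
            return route["receiver"]
    return DEFAULT_RECEIVER


def is_inhibited(alert: dict, firing: list) -> bool:
    """Suppress a warning while a critical alert fires for the same service."""
    if alert.get("severity") != "warning":
        return False
    return any(
        other.get("severity") == "critical"
        and other.get("service") == alert.get("service")
        for other in firing
    )


critical = {"severity": "critical", "service": "api"}
warning = {"severity": "warning", "service": "api", "team": "platform"}
print(pick_receiver(critical), pick_receiver(warning), is_inhibited(warning, [critical]))
```

The sketch only shows the mechanics. Which severities page a person, and who receives them, is still set by the customer's severity policy.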

Metrics are shared evidence, not automatic reliability

Assistance operates Prometheus and alerting infrastructure. Your team owns service intent, SLO decisions, business impact definitions, and whether an alert requires product or application remediation. We help turn signals into an operating model, but service ownership must be explicit.

Ownership boundary

Responsibility	Assistance owns	Customer owns
Prometheus runtime	Deployment, scraping platform, retention, upgrades, monitoring, and platform incidents	Instrumenting application code and exposing meaningful metrics
Alert routing	Alertmanager configuration, integrations, routing mechanics, and noise-reduction implementation	Service owners, severity policy, escalation decisions, and response behavior
Dashboards	Platform dashboards and agreed service views	Business meaning, product KPIs, and interpretation of application-specific metrics
SLOs	Technical implementation of SLIs/SLO dashboards where scoped	Choosing user-facing objectives and accepting error-budget trade-offs
Access	Roles, data source permissions, credential rotation support	User approval, identity source, and internal access reviews

Deployment options

Option	When to use it
Assistance physical servers	Development platform monitoring, staging observability, and internal services
Customer cloud account	Production observability inside existing cloud/network/compliance boundary
Hybrid observability	Central managed Prometheus with remote write or federation across environments
SRE engagement	Combine Managed Prometheus with service ownership, incident response, SLO, and runbook work

Reliability and support model

Topic	Managed Prometheus approach
Availability	Scoped by topology, retention design, and support plan; HA pairs or long-term storage used where required
Data retention	Retention and downsampling defined by operational and compliance needs
Alert delivery	Integrations configured for agreed channels; escalation ownership must be defined by the customer team
Platform monitoring	Prometheus monitors itself: scrape failures, query pressure, storage, rule evaluation, and Alertmanager health
Response	Response targets are scoped in the support agreement; critical-incident coverage is available for production observability platforms under that agreement

Onboarding

1. Observability assessment

We review current metrics, dashboards, alert history, incident pain points, service ownership, environments, retention needs, and existing tools.

2. Platform design

Assistance defines scrape architecture, retention, long-term storage, dashboards, alert routing, integrations, access model, and support tier.

3. Signal implementation

We configure targets, exporters, rules, dashboards, Alertmanager routes, and runbook links. Where needed, we help teams define service-level indicators.
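
One way to keep rules tied to ownership and runbooks is to check each rule before it ships. The sketch below assumes rules are handled as dictionaries that mirror the Prometheus alerting rule fields (`alert`, `expr`, `for`, `labels`, `annotations`). The required `severity` label and `runbook_url` annotation are conventions we assume here, and the metric name, threshold, and URL are placeholders.

```python
import json

# Assumed conventions for this sketch, not a Prometheus requirement.
REQUIRED_LABELS = ("severity",)
REQUIRED_ANNOTATIONS = ("summary", "runbook_url")


def lint_rule(rule: dict) -> list:
    """Return a list of problems; an empty list means the rule passes."""
    problems = []
    for key in ("alert", "expr"):
        if not rule.get(key):
            problems.append(f"missing field: {key}")
    for label in REQUIRED_LABELS:
        if label not in rule.get("labels", {}):
            problems.append(f"missing label: {label}")
    for annotation in REQUIRED_ANNOTATIONS:
        if annotation not in rule.get("annotations", {}):
            problems.append(f"missing annotation: {annotation}")
    return problems


# Hypothetical rule: metric name, threshold, team, and URL are placeholders.
rule = {
    "alert": "HighErrorRate",
    "expr": 'sum(rate(http_requests_total{code=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05',
    "for": "10m",
    "labels": {"severity": "critical", "team": "checkout"},
    "annotations": {
        "summary": "More than 5% of requests are failing",
        "runbook_url": "https://runbooks.example.com/high-error-rate",
    },
}

print(lint_rule(rule) or "rule ok")
print(json.dumps(rule, indent=2))
```

A check like this could run in review or CI, so that a rule without an owner-facing runbook link is caught before it can page anyone.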

4. Operate and refine

After go-live, we monitor platform health, tune noisy alerts, review capacity, and keep dashboards aligned with service ownership and incident response.
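
A common lever when tuning noisy alerts is the hold duration (the `for` field on a Prometheus alerting rule), which keeps an alert pending until its condition has held continuously. The sketch below is a simplified model of that behavior, with a hypothetical function name and made-up timestamps, rather than Prometheus's own evaluation code.

```python
def alert_states(samples, hold_seconds):
    """Map (timestamp_seconds, condition_is_true) samples to alert states.

    An alert is "pending" while its condition holds but has not yet held
    for hold_seconds, "firing" once it has, and "inactive" otherwise.
    """
    states = []
    active_since = None  # timestamp at which the current true streak began
    for timestamp, condition_is_true in samples:
        if not condition_is_true:
            active_since = None  # any false evaluation resets the streak
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= hold_seconds else "pending")
    return states


# Evaluations every 60 seconds; the condition flaps once before it persists.
samples = [(0, True), (60, False), (120, True), (180, True), (240, True), (300, True)]
for (timestamp, _), state in zip(samples, alert_states(samples, hold_seconds=120)):
    print(timestamp, state)
```

With a hold of 120 seconds, the brief spike at the start never reaches the firing state; only the sustained condition does. Choosing the hold for each alert is a trade-off between noise and detection delay.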

Supported capabilities

Prometheus servers, HA patterns, and federation/remote-write designs
Alertmanager routing, silencing, inhibition, and notification integrations
Grafana dashboards and data source configuration
Exporter onboarding for Linux, Kubernetes, PostgreSQL, MySQL, Redis, MongoDB, Kafka, Nginx, HAProxy, and common infrastructure
Long-term metric storage patterns where required
SLO dashboard implementation when paired with reliability work

Not included by default

Instrumenting every application endpoint
Defining business KPIs without product owner input
Providing blanket on-call response for services outside the support plan
Guaranteeing alert actionability when service ownership is undefined
Replacing all existing observability tools unless migration is scoped

Related services

SRE as a Service: Turn metrics into SLOs, runbooks, and incident response practice
Managed OpenSearch: Logs, search, and indexed operational data
Managed Kafka: Metrics and alerting for streaming platforms
Managed PostgreSQL: Database monitoring and operational dashboards

Getting started

Request an observability assessment. We will review current metrics, alerts, service ownership, and retention needs before proposing a managed Prometheus model.


Frequently asked questions

Can you work with our existing Grafana?
Yes. We can integrate with existing Grafana or operate Grafana as part of the managed observability platform when scoped.

Do you write application metrics?
We advise on instrumentation and can implement it as separate project work. By default, application teams own code-level metrics.

Can this reduce alert noise?
Yes, if service ownership and severity criteria are defined. We tune alerts to actionable conditions and connect them to dashboards and runbooks.

Do you provide on-call response for alerts?
Only for services explicitly covered by the support agreement. We can route alerts to your team, Assistance, or a shared model depending on scope.

What retention is available?
Retention is designed per plan and may include local storage plus long-term storage. We choose the design based on query needs, compliance, cost, and SLO review requirements.
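
For a first-pass sizing of local retention, the Prometheus storage documentation gives a rule of thumb: disk space is roughly retention time in seconds times ingested samples per second times bytes per sample, with 1 to 2 bytes per sample being typical. The sketch below applies that estimate; the series count, scrape interval, and 15-day retention are illustrative assumptions, not plan defaults.

```python
SECONDS_PER_DAY = 86_400


def ingestion_rate(active_series: int, scrape_interval_seconds: float) -> float:
    """Approximate samples per second: one sample per series per scrape."""
    if scrape_interval_seconds <= 0:
        raise ValueError("scrape_interval_seconds must be positive")
    return active_series / scrape_interval_seconds


def estimate_disk_bytes(retention_days: float, samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rule-of-thumb local storage estimate, before any safety margin."""
    if min(retention_days, samples_per_second, bytes_per_sample) < 0:
        raise ValueError("inputs must be non-negative")
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


# Illustrative inputs: 500,000 active series scraped every 30 seconds.
rate = ingestion_rate(active_series=500_000, scrape_interval_seconds=30)
needed = estimate_disk_bytes(retention_days=15, samples_per_second=rate)
print(f"{rate:.0f} samples/s, about {needed / 1e9:.1f} GB for 15 days")
```

This is only a starting figure. Series churn, query load, and any long-term storage tier change the real requirement, which is why retention is settled during platform design.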