At Flipkart, we faced a challenge at the API Gateway: roughly 2,000 instances of the gateway application were emitting close to 40,000 metrics each, resulting in over 80 million time-series datapoints being pushed at once. The metrics ranged from requests per second and various error rates to custom JMX metrics and detailed latency metrics such as p95 and p99.
While StatsD worked at first, it did not scale with this volume: querying more than 3 hours of data became painful, the storage choked, and extracting long-term insights became practically impossible.
This article walks through how moving to Prometheus helped us in this situation, and explains the architecture of a federated Prometheus setup.
Some history: Gaps in existing tools
Before Prometheus, earlier monitoring tools had their own limitations:
- Push-based model: Tools like StatsD and Collectd, often combined with InfluxDB, relied on a push-based model for metric ingestion, which keeps the ingestion endpoint under constant write pressure and makes it hard to tell a healthy-but-quiet client from a dead one.
- Weaker query language: Complex queries, especially multi-dimensional aggregations, are more cumbersome in InfluxQL.
Example of a PromQL query with no straightforward InfluxQL equivalent:
# Flag routes that are serving traffic and whose p95 latency exceeds 300 ms
sum by (route) (rate(api_requests_total[5m]))
and on (route)
histogram_quantile(
  0.95,
  sum by (le, route) (rate(api_request_duration_seconds_bucket[5m]))
) > 0.3
- Ecosystem and Extensibility: Prometheus shines with first-class support for Kubernetes, Consul, AWS EC2, GCE, Azure, and file-based service discovery (see the Prometheus documentation for the full list). It also has a vibrant ecosystem of exporters. A minimal file-based discovery job is sketched after this list.
- Scaling Limitations: Prometheus supports hierarchical federation, making it easier to scale via metric filtering and enabling tiered aggregations. Most of the time a single instance is enough to handle a large number of metrics.
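To make the service-discovery point concrete, here is a minimal sketch of a file-based discovery job; the job name and target file path are hypothetical, not from our setup:
scrape_configs:
  - job_name: "file_sd_example"              # hypothetical job name
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json   # hypothetical path; Prometheus re-reads matching files on change
        refresh_interval: 1m                 # how often to re-check the files for new targets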
Architecture Components & Data Flow
- Service discovery & Exporters: Kubernetes pods (targets) expose metrics on an endpoint. Prometheus pulls from these endpoints.
- Prometheus: One Prometheus server is responsible for scraping and aggregating metrics. In our setup, this Prometheus server drops the instance label and exposes the aggregated metrics over the /federate endpoint; a federated Prometheus server (or job) then scrapes the selected aggregated metrics and writes them to long-term storage. A sample of what the federated endpoint exposes is shown after this list.
- Remote Storage & Grafana Dashboard: Long retention and visualization.
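For illustration, the series exposed on /federate after aggregation look roughly like the lines below; the metric names follow the recording rules shown later, while the labels, values, and timestamps are made up:
aggregated:requests_per_second{service="api-gateway",cluster="prod-1",route="/checkout"} 1423.5 1717400000000
aggregated:error_rate{service="api-gateway",cluster="prod-1",route="/checkout"} 0.0021 1717400000000
aggregated:http_latency_p95{service="api-gateway",cluster="prod-1",route="/checkout"} 0.27 1717400000000
Each line is a single pre-aggregated series with its latest sample value and a millisecond timestamp; note that no instance or pod label survives the rollup.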
Dialing Down on Cardinality
- Drop the instance label from metrics with the help of recording rules; keep stable labels like service & cluster.
- For counters (e.g., error count) and rate‑derived metrics (QPS/RPS/error‑rate), summing across instances gives the total while dropping the instance dimension.
- For latency (p95/p99), you can also publish avg/max/min across instances to capture typical vs. worst‑case — again without the instance label (see the sketch after this list).
- If a service has N instances × M metrics in one cluster, dropping the instance label reduces the series count by a factor of N. E.g., 2,000 pods × 40k metrics → ~80M raw series, which collapses to ~40k series per cluster after rollups.
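A minimal sketch of such a latency rollup, assuming each instance exposes a pre-computed p95 gauge named api_request_duration_p95_seconds (the metric name is illustrative, not from our gateway):
groups:
  - name: latency-rollups-example
    interval: 1m
    rules:
      # Typical latency across instances of a service
      - record: aggregated:http_latency_p95:avg
        expr: avg without(instance, pod) (api_request_duration_p95_seconds)
      # Worst single instance, handy for spotting one bad pod
      - record: aggregated:http_latency_p95:max
        expr: max without(instance, pod) (api_request_duration_p95_seconds)
When raw histogram buckets are available, aggregating them first (as in the recording rules under Implementation Details) gives a more faithful cross-instance quantile; the gauge rollup above mainly helps with typical-vs-worst-case comparisons.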
When Federation Helps (and When It Doesn’t)
Use it when:
- You have multiple clusters/regions and need org‑wide dashboards.
- You want a scalable Prometheus monitoring setup because a single server is hitting its CPU or memory limits.
Skip it when:
- You need metrics for per‑instance debugging globally.
- You have one small cluster where a single Prometheus + storage would suffice.
- The plan is to mirror all raw metrics upward.
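As a sketch of that last anti-pattern, a federation job whose matcher selects every series simply moves the full raw-series problem to the global server; the selective matchers used later in this article are the better default (the job name and target below are illustrative):
scrape_configs:
  - job_name: federate_everything        # anti-pattern: federates every raw series
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~".+"}'             # matches all series; moves the cardinality problem upstream
    static_configs:
      - targets: ['cluster-prometheus:9090']   # hypothetical cluster-level server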
Implementation Details
# Prometheus server configuration (example)
global:
  scrape_interval: 30s          # How frequently to scrape targets
  scrape_timeout: 30s           # How long until a scrape times out
  query_log_file: /dev/stdout   # File to which PromQL queries are logged

rule_files:
  - /etc/prometheus/rules.yml   # List of recording and alerting rules

scrape_configs:
  - job_name: "custom_job_1"    # Scrape job name
    kubernetes_sd_configs:      # K8s target service discovery rules
      - role: pod
    # Rewrite the label set of a target before it gets scraped
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_env_name]
        regex: production
        action: keep
      - source_labels: [__meta_kubernetes_pod_phase]
        regex: Running
        action: keep
    # Metric relabeling after the scrape and before ingestion
    # (rules run in order, so drop before renaming or the anchored regex no longer matches)
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "go_.*"          # Example: drop noisy metrics
        action: drop
      - source_labels: [__name__]
        regex: "(.+)"
        target_label: __name__
        replacement: "prefix_xyz_${1}"   # Example: rename metrics with a prefix

remote_write:
  - url: http://localhost:9205/write     # Write metrics to a remote storage endpoint
    # Metric relabeling after ingestion and before writing to the remote API
    write_relabel_configs:
      - source_labels: [__name__]
        regex: "experimental_.*"
        action: drop
# Federated Prometheus server configuration (example): scrapes selected
# aggregated series from the cluster-level server's /federate endpoint
scrape_configs:
  - job_name: federate
    scrape_interval: 30s
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"critical_.*"}'
        - '{__name__=~"cluster:.*|aggregated:.*"}'
    static_configs:
      - targets: ['localhost:9090']

# Recording rules (aggregation examples)
groups:
  - name: api-request-aggregates
    interval: 1m
    rules:
      - record: aggregated:requests_per_second
        expr: sum without(instance, pod) (rate(api_requests_total[1m]))
      - record: aggregated:error_rate
        expr: |
          sum without(instance, pod) (rate(api_request_errors_total[5m]))
          /
          sum without(instance, pod) (rate(api_requests_total[5m]))
      - record: aggregated:http_latency_p95
        expr: |
          histogram_quantile(
            0.95,
            sum without(instance, pod) (rate(api_request_duration_seconds_bucket[5m]))
          )
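Once these rolled-up series land in long-term storage, dashboards and alerts can stay cheap by querying them directly. A couple of illustrative PromQL examples, assuming the series keep a route label and using made-up label values:
# Top 5 routes by aggregated p95 latency in one cluster
topk(5, aggregated:http_latency_p95{cluster="prod-1"})

# Alert-style expression: error rate above 1% for the gateway
aggregated:error_rate{service="api-gateway"} > 0.01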