You want a time-series database that swallows millions of writes per second, answers real-time alerts in under a second, and still gives you a year of useful history without bankrupting the company. That is doable, but only if you design for observability’s special constraints: extreme write fan-out, skewed and growing cardinality, many small queries that touch recent data, and the need to age and compress older data cheaply.
Below I give a narrative, practitioner-grade playbook you can use to design or evaluate a time-series database for observability. Early on I summarize what real engineering teams are saying and what that implies for design choices. Then I walk through the core design ideas, an explicit how-to with implementation choices, a worked example with numbers, and a short FAQ.
What engineers working on observability storage actually say
Datadog's engineering team, describing their Husky project, emphasizes a vectorized, columnar, schemaless store built on commodity object storage, separating hot write/read paths from cheap long-term retention.
Grafana Labs / Prometheus ecosystem contributors point out that Prometheus’ local index model breaks down at high cardinality, which is why solutions like Thanos, Cortex, and Mimir push long-term data to object storage and introduce sharded, horizontally scalable ingest and index layers.
Observability vendors and practitioners repeatedly call out cardinality as the single largest cost and scalability risk, and recommend active cardinality management, tiered retention with downsampling, and rate limiting of label values as primary defenses.
Synthesis: separate concerns, engineer for cardinality first, and adopt tiered storage and progressive aggregation rather than trying to keep everything in hot, indexed storage forever. Those three themes will shape the rest of the design.
What an observability TSDB is, and why it is different
Time-series databases for observability store metrics, events, and derived series indexed by time and a set of labels or tags. They are different from analytics DBs because:
- ingest is extremely write-heavy and append-only, often with bursts during incidents,
- queries are small and time-focused, typically “recent N minutes” or rollups,
- cardinality (the number of unique series) can explode with tags and is the top cost driver,
- retention requirements are tiered, so cheap long-term storage is required.
Mechanically, you will trade memory and index complexity for lower latency on recent queries, and trade raw fidelity for cheaper long-term storage via downsampling and compaction.
Core design principles, in plain language
- Separate hot write/read from cold storage. Keep recent, frequently queried data in a write-optimized, indexed store. Push older data to object storage or columnar cold stores with cheaper query models. This is the architecture used in Thanos, Mimir, and Datadog’s Husky.
- Design for cardinality control, not unlimited labels. Assume cardinality will grow. Limit tag explosion by quotas, cardinality caps per metric, or by encouraging low-cardinality label schemas. Instrument overload handling for label explosions.
- Shard by series key and time, with small time-chunk units. Use a deterministic shard scheme that groups related series together for efficient IO and compaction, and store data in fixed time chunks (for example, 2h or 1d chunks). Chunks simplify compression and compaction.
- Use columnar/vectorized storage for efficient compression on similar fields. Vectorized encodings and columnar layouts give large compression gains on homogeneous time blocks, and speed up aggregations. Datadog and modern TSDB research both push in this direction.
- Make the index lean and layered. Keep a fast in-memory index for active series (hot index), and a compact on-disk index or object-backed index for cold series. Inverted label indexes are common, but must be memory-capped and sharded.
- Support multi-resolution retention and continuous downsampling. Store raw samples for a short hot window, then keep 1m aggregates for the mid term and 1h aggregates for the long term. Provide continuous aggregates so you never need to recompute old rollups.
- Design queries for streaming/mergeable aggregation. Queries should be able to merge compressed time-chunks and partial aggregates from multiple backends, enabling global queries across shards and tiers.
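To make the sharding and time-chunk principles concrete, here is a minimal sketch of deterministic shard assignment and chunk alignment. The shard count and 2h chunk width are assumptions for illustration, not prescriptions:

```python
import hashlib

NUM_SHARDS = 16            # assumption: fixed shard count for this sketch
CHUNK_SECONDS = 2 * 3600   # 2h chunks, as suggested above

def series_key(metric: str, labels: dict) -> str:
    """Canonical series key: metric name plus sorted label pairs,
    so label ordering never changes the shard."""
    parts = [metric] + [f"{k}={labels[k]}" for k in sorted(labels)]
    return "|".join(parts)

def shard_for(key: str) -> int:
    """Deterministic shard: hash of the full series key."""
    digest = hashlib.sha1(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS

def chunk_start(ts: int) -> int:
    """Align a unix timestamp to its fixed time-chunk boundary."""
    return ts - (ts % CHUNK_SECONDS)

key = series_key("http_requests_total", {"job": "api", "code": "200"})
print(shard_for(key), chunk_start(1_700_000_123))
```

Because the key is canonicalized before hashing, every writer and reader agrees on which shard owns a series, and every sample lands in exactly one chunk.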
How to build it — 4 practical steps
1) Choose the storage and indexing stack
Options and tradeoffs, in short:
- LSM + chunked columnar blocks: good write throughput and offline compaction; used in many TSDBs and works well with an inverted label index. Use if you need high ingest rates and compaction flexibility.
- Vectorized column store over object storage: decouples compute from storage, cheaper cold retention, good for analytic rollups. Datadog’s Husky is an example of the pattern. Use when you must store months or years cheaply.
- Hybrid architecture (Prometheus + long-term backend): local Prometheus for scraping and recent queries, with remote write to a horizontally scalable long-term system like Thanos, Mimir, or VictoriaMetrics. Good for incremental adoption.
Key implementation details:
- Store time in chunks, e.g., per 2h or per day, then compress each chunk with delta + integer compression or Gorilla encoding.
- Maintain an in-memory hot index of active series to translate query label matchers to series ids. Persist compact index snapshots to the object store.
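The timestamp half of Gorilla-style chunk compression can be sketched in a few lines: delta-of-delta encoding, where a regular scrape interval collapses into runs of zeros that later bit-packing compresses extremely well. This is a simplified integer version, not the full bit-level Gorilla format:

```python
def encode_delta_of_delta(timestamps):
    """Delta-of-delta: regular scrape intervals produce runs of zeros."""
    out = []
    prev = prev_delta = None
    for ts in timestamps:
        if prev is None:
            out.append(ts)                  # first timestamp stored raw
        elif prev_delta is None:
            prev_delta = ts - prev
            out.append(prev_delta)          # second entry: the first delta
        else:
            delta = ts - prev
            out.append(delta - prev_delta)  # near-zero for steady intervals
            prev_delta = delta
        prev = ts
    return out

def decode_delta_of_delta(encoded):
    """Exact inverse of the encoder."""
    if not encoded:
        return []
    out = [encoded[0]]
    if len(encoded) > 1:
        delta = encoded[1]
        out.append(out[0] + delta)
        for dd in encoded[2:]:
            delta += dd
            out.append(out[-1] + delta)
    return out

ts = [1000, 1015, 1030, 1046, 1060]
print(encode_delta_of_delta(ts))  # [1000, 15, 0, 1, -2]
```

A 15s scrape interval with small jitter yields mostly zeros and ±1s, which is why per-chunk timestamp storage drops to roughly a bit or two per sample in practice.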
2) Design the schema and cardinality controls
Schema rules that work in practice:
- Metric name is one dimension; labels/tags are others. Require a strict naming convention to avoid label drift.
- Avoid high-cardinality labels as first-class tags. If you must capture things like request_id or user_id, store them as logs or traces, or convert them into sampled attributes.
- Implement label value sampling, per-tenant label cardinality quotas, series eviction policies, and active monitoring of the active series count. Vendors often enforce hard limits per metric to avoid runaway costs.
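A per-tenant cardinality quota is simple to enforce at the write path. This is a minimal in-memory sketch; a real implementation would shard this state and persist it, and the quota number is an arbitrary placeholder:

```python
class CardinalityGuard:
    """Reject new series once a tenant's active-series quota is hit (sketch)."""

    def __init__(self, max_series_per_tenant: int):
        self.max = max_series_per_tenant
        self.active = {}  # tenant -> set of known series keys

    def admit(self, tenant: str, series_key: str) -> bool:
        seen = self.active.setdefault(tenant, set())
        if series_key in seen:
            return True   # samples for existing series are always accepted
        if len(seen) >= self.max:
            return False  # over quota: drop, or spill to a catch-all series
        seen.add(series_key)
        return True

guard = CardinalityGuard(max_series_per_tenant=2)
print(guard.admit("t1", "a"), guard.admit("t1", "b"), guard.admit("t1", "c"))
# the third *new* series is rejected; existing series keep flowing
```

The important property is that a quota breach only blocks series creation, never samples for series already admitted, so dashboards on existing metrics keep working during a label explosion.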
Make the index: inverted index mapping label (key,value) to series ids, plus a lightweight bloom filter per chunk to quickly skip chunks that cannot match a query.
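The inverted-index half of that can be sketched directly (the per-chunk bloom filter skip check is omitted for brevity); a production index would also be memory-capped and sharded as noted above:

```python
from collections import defaultdict

class LabelIndex:
    """Inverted index: (label key, value) -> set of series ids (sketch)."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, series_id: int, labels: dict):
        for k, v in labels.items():
            self.postings[(k, v)].add(series_id)

    def lookup(self, matchers: dict) -> set:
        """Intersect postings lists for all equality matchers."""
        sets = [self.postings.get(kv, set()) for kv in matchers.items()]
        return set.intersection(*sets) if sets else set()

idx = LabelIndex()
idx.add(1, {"job": "api", "code": "200"})
idx.add(2, {"job": "api", "code": "500"})
print(idx.lookup({"job": "api", "code": "500"}))  # {2}
```

Queries resolve label matchers to series ids first, then fetch only the chunks for those ids, which is what keeps small time-focused queries fast.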
3) Retention, downsampling, and tiering
A practical retention tier:
- Hot: raw samples, 7 days, low-latency queries.
- Warm: 30 days, 1m and 5m aggregates for most dashboards.
- Cold: 1 year, 1h aggregates stored in object storage for ad-hoc trend analysis.
Implementation notes:
- Compute continuous aggregates on ingest or via background jobs so older data does not require heavy recompute. Use compact, pre-aggregated chunk formats that are mergeable.
- When moving data to the cold tier, store metadata that maps time ranges and series keys to object locations, so queries can fetch only relevant objects.
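That cold-tier metadata catalog is essentially a time-range pruning problem. A minimal sketch, with hypothetical object-store paths:

```python
from dataclasses import dataclass

@dataclass
class ObjectMeta:
    """Metadata recorded when a chunk is moved to the cold tier (sketch)."""
    start_ts: int
    end_ts: int
    shard: int
    location: str  # hypothetical object-store key

def objects_for_query(catalog, q_start, q_end, shard):
    """Return only objects whose time range overlaps the query window."""
    return [m.location for m in catalog
            if m.shard == shard and m.start_ts <= q_end and m.end_ts >= q_start]

catalog = [
    ObjectMeta(0, 3599, 3, "s3://tsdb/cold/shard3/0.parquet"),
    ObjectMeta(3600, 7199, 3, "s3://tsdb/cold/shard3/3600.parquet"),
]
print(objects_for_query(catalog, 4000, 5000, 3))
```

The interval-overlap test (`start <= q_end and end >= q_start`) is what lets a year-long cold tier answer a one-hour query by fetching a single object.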
4) Query model and availability
Design queries so they:
- First hit the hot index and hot chunks for the recent time range, then fall back to warm/cold objects for older ranges.
- Use streaming merge aggregation: each shard returns partial aggregates per chunk and a coordinator merges them. This supports global queries without loading full series into memory.
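The key requirement for streaming merge is that partial aggregates are mergeable: any two partials combine into a valid partial, regardless of which shard or tier they came from. A minimal sketch using count/sum/min/max:

```python
def partial(samples):
    """Per-chunk partial aggregate: mergeable without the raw samples."""
    return {"count": len(samples), "sum": sum(samples),
            "min": min(samples), "max": max(samples)}

def merge(a, b):
    """Coordinator-side merge of two partials from any shards or tiers."""
    return {"count": a["count"] + b["count"], "sum": a["sum"] + b["sum"],
            "min": min(a["min"], b["min"]), "max": max(a["max"], b["max"])}

shard1 = partial([1.0, 2.0, 3.0])
shard2 = partial([10.0, 0.5])
total = merge(shard1, shard2)
print(total["sum"] / total["count"])  # global mean computed from partials only
```

Note that a global mean must be carried as (sum, count), never as per-shard means, because means are not mergeable; the same reasoning is why quantiles need sketch structures rather than exact per-shard percentiles.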
For availability and scale:
- Shard series by a hash of metric name and a subset of labels, or by tenant + metric.
- Replicate metadata and have multiple reader paths for HA.
- Consider a write path behind a message queue (Kafka or native remote write) to absorb bursts and provide backpressure.
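The backpressure half of that write path reduces to a bounded buffer whose "full" signal propagates to the producer. A minimal in-process sketch (a real deployment would put Kafka or a remote-write queue here):

```python
import queue

class IngestBuffer:
    """Bounded write buffer: a full queue signals backpressure (sketch)."""

    def __init__(self, capacity: int):
        self.q = queue.Queue(maxsize=capacity)

    def offer(self, batch) -> bool:
        """Non-blocking put; False tells the caller to retry or shed load."""
        try:
            self.q.put_nowait(batch)
            return True
        except queue.Full:
            return False

buf = IngestBuffer(capacity=2)
print(buf.offer("b1"), buf.offer("b2"), buf.offer("b3"))
# the third offer fails: backpressure instead of unbounded memory growth
```

Returning `False` rather than blocking lets the agent-side sender apply its own retry/drop policy, which is exactly the behavior remote-write throttling gives you at the protocol level.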
Worked example: capacity math you can use right now
Estimate storage for 1,000,000 active series, 1 sample/sec, 8 bytes per sample.
Step by step:
- Samples per second = 1,000,000.
- Samples per day = 1,000,000 * 86,400 = 86,400,000,000 samples/day.
- Bytes per day = 86,400,000,000 * 8 = 691,200,000,000 bytes/day.
- Convert to GiB, using 1 GiB = 1,073,741,824 bytes: 691,200,000,000 / 1,073,741,824 ≈ 644 GiB/day.
So raw storage would be roughly 644 GiB per day. If you keep raw data for 7 days, hot tier raw storage is about 4.5 TiB. If you downsample older data aggressively, that long term cost drops significantly, often by 10x or more depending on aggregation windows and compression. This arithmetic shows why cardinality is the dominant cost: doubling series doubles cost linearly.
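The arithmetic above is worth keeping as a reusable function so you can re-run it for your own series counts and sample sizes:

```python
def raw_bytes_per_day(active_series: int, samples_per_sec: float,
                      bytes_per_sample: int = 8) -> float:
    """Uncompressed storage per day, before any encoding gains."""
    return active_series * samples_per_sec * 86_400 * bytes_per_sample

bytes_day = raw_bytes_per_day(1_000_000, 1)
gib_day = bytes_day / 2**30
print(round(gib_day), round(gib_day * 7 / 1024, 1))  # GiB/day, TiB for 7 days
```

Because the formula is linear in `active_series`, this is also a one-line demonstration of the point above: doubling cardinality exactly doubles raw cost, before you even consider index memory.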
Practical engineering must-haves and micro-optimizations
- Sketches and approximate structures: use HyperLogLog to track cardinality cheaply and Count-Min to find heavy-hitter label values. Good for alerting on cardinality spikes.
- Chunk size tuning: smaller chunks help eviction and cold reads; larger chunks improve compression. Empirically tune between 1h and 12h based on your query patterns.
- Label cardinality telemetry: emit series_count per metric and label value histograms, and alert on jumps. That is how operators detect runaway label explosions early.
- Backpressure and quotas at ingestion: remote_write throttling, per-tenant rate limits, or queueing in Kafka reduce incident risk.
- Compression choices: Gorilla-style encoding with delta-of-delta for timestamps and XOR for float values, plus generic LZ4 on blocks for fast decompression. These are widely used in TSDBs.
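To show the flavor of a cardinality sketch without a full HyperLogLog implementation, here is a K-minimum-values (KMV) estimator, a simpler cousin with the same use case: cheap approximate distinct counts for spike alerting. The `k` value trades memory for accuracy (roughly 1/sqrt(k) relative error):

```python
import hashlib
import heapq

class KMVSketch:
    """K-minimum-values distinct-count estimator (sketch; not HLL,
    but serves the same cardinality-alerting purpose)."""

    def __init__(self, k: int = 256):
        self.k = k
        self.heap = []   # max-heap (negated) holding the k smallest hashes
        self.seen = set()

    def add(self, item: str):
        # Hash each item to a uniform value in [0, 1).
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big") / 2**64
        if h in self.seen or (len(self.heap) == self.k and -self.heap[0] <= h):
            return
        heapq.heappush(self.heap, -h)
        self.seen.add(h)
        if len(self.heap) > self.k:
            self.seen.discard(-heapq.heappop(self.heap))

    def estimate(self) -> float:
        if len(self.heap) < self.k:
            return float(len(self.heap))  # exact while below k distinct items
        # kth-smallest hash m implies roughly (k-1)/m distinct items seen.
        return (self.k - 1) / (-self.heap[0])

sk = KMVSketch(k=64)
for i in range(10_000):
    sk.add(f"series-{i}")
print(sk.estimate())  # approximate distinct count, roughly 10,000
```

Running one sketch per metric (or per tenant) over a sliding window and alerting when the estimate jumps gives you the early-warning signal described above, at a few kilobytes of memory per tracked metric.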
Short FAQ
Q: Should I store logs and traces in the same TSDB?
A: No. Logs and traces are usually stored in systems optimized for full text search and spans with richer schemas. Store high-cardinality identifiers in logs/traces, and export metrics and rollups to the TSDB. Use correlation references between systems when you need cross-lookup.
Q: How low should cardinality limits be?
A: It depends, but many systems set soft limits per metric and per tenant, then alert before hitting hard caps. There is no one number, but treat anything that produces millions of active series as a red flag.
Q: Do I need object storage?
A: For anything beyond short hot retention, yes. Object storage gives cheap, durable long-term retention and works well with a compute layer that knows how to fetch and merge objects.
Honest takeaway
Designing an observability time-series database is mostly about tradeoffs: latency vs cost, fidelity vs retention, and complexity vs manageability. If you get three things right (cardinality management, a tiered hot/warm/cold storage model, and a compact, sharded index for hot series), you will have a system that is operationally tractable and cost-efficient. If you skip them, your bills and outages will grow faster than your dashboards.

