The short version: An account-month total summed over millions of raw events is too slow to recompute on every dashboard load or invoice preview. usageDb precomputes it hour by hour: a background worker seals each completed hour into per-bucket rollup segments and advances a watermark in the manifest. Queries serve the sealed part of a range from rollups and fall back to a raw scan only for the still-open tail. The correctness story lives in three bounds the worker respects before moving the watermark, because a watermark that crosses unflushed or still-arriving data would produce a wrong bill.
This is Part 6 of the usageDb internals series. Part 5 covered the manifest: the index that names every segment and holds watermarks.hourly_rollup_ms. This article is about how that watermark gets set, and why moving it one hour too far is a billing bug, not a performance one.
(Previous: Part 5, the manifest and crash recovery. Next: Part 7, the query engine.)
The problem: month totals over millions of events
A usage-based bill is fundamentally an aggregate: sum the quantity of every metered event for an account, within a billing period, grouped by meter and model. For a busy account that is millions of rows of token counts and tool calls. Scanning every raw segment to compute it is correct, and usageDb can always do it, but it is far too slow to run on every dashboard refresh or invoice preview. The raw scan is the fallback, not the hot path.
The fast path is precomputation. Usage events are append-only and time-ordered, so once an hour is over its total never changes under normal operation. That makes the hour a natural unit to seal: aggregate it once, store the result, serve it cheaply forever after.
What a rollup row is
The builder in src/rollup/builder.rs groups events by a wide key: account, subscription, product, meter, model, source, unit, the hour start, and the canonical JSON of the event's dimensions. For each distinct key it keeps a running quantity_sum, an event_count, and the first and last event timestamps. One hour of an account's traffic collapses from thousands of raw events into a handful of HourlyRollupRecord rows, one per distinct billable combination. Keeping source and unit in the key matters because the public API lets callers filter and group by both. Dimensions are stored as canonical JSON so rollups from different segments are cross-comparable: the same dimension map always serializes to the same bytes.
Sealing an hour and advancing the watermark
The background worker in src/rollup/worker.rs ticks on an interval. Each tick computes a target watermark, then for every hour in [current_watermark, target) it scans the raw segments overlapping that hour, aggregates the events inside it, partitions by account bucket, and writes one rollup segment per bucket touched. The rollup segments and the new watermark commit to the manifest in a single atomic save, the same commit_manifest helper from Part 5:
self.state.commit_manifest(|manifest| {
for meta in &metas_to_commit {
manifest.rollup_segments.push(meta.clone());
}
manifest.watermarks.hourly_rollup_ms = target_hour;
}).await
Because the segments and the watermark move together, a crash mid-tick leaves the previous watermark in place and the next tick redoes the work. If the manifest save fails, the worker unlinks the orphaned segment files it just wrote.
The three safety bounds
The heart of correctness is how the worker picks target_hour. It is the minimum of three bounds, each preventing a specific way the watermark could lie.
Bound (a): a time target with a safety lag
The first bound is time_target = floor((now - safety_lag) / 1h) * 1h. The worker never seals an hour that is too recent. The safety lag exists because events can arrive slightly late: a collector retries, a clock skews, a batch is delayed in flight. Sealing an hour the instant it ends would seal it before the stragglers land, and those events would never make it into the rollup. The lag holds the watermark back so an hour is only sealed once it is comfortably in the past.
Bound (b): skip if a flush is in flight
The flusher seals a WAL file and then commits its events as a raw segment. In the window between those two steps, a WAL file is sealed but its segment is not yet in the manifest. Advancing the watermark there would seal hours from raw segments that do not yet include those in-flight events, and the rollup would undercount. So the worker detects the gap and skips the tick:
// active WAL is more than one ahead of the last sealed-and-committed id
// => a flush is in flight; those events aren't in raw segments yet.
if active_wal_id > last_sealed_wal_id + 1 {
return Ok(RollupTickStats { skipped_for_in_flight: true, .. });
}
The flusher will catch up, and a later tick proceeds against the complete set of raw segments.
Bound (c): never cross the memtable
Not all acknowledged events are in raw segments yet. Recently ingested events sit in the in-memory memtable until a flush trigger drains them. Those events are durable in the WAL, but the rollup worker only reads raw segment files, so anything still in the memtable is invisible to it. The third bound caps the target at the floor-of-hour of the oldest event still in the memtable:
let time_target = ((now - safety_lag) / HOUR_MS) * HOUR_MS;
let target_hour = match memtable_min_ts {
Some(ts) => time_target.min((ts / HOUR_MS) * HOUR_MS),
None => time_target,
};
If the oldest unflushed event belongs to hour H, the watermark stops at the start of H. That hour is not sealed until the event has reached a raw segment, so the rollup can never miss it. This is the bound that most directly enforces the invariant that the watermark never crosses unflushed data.
Force-drain: when old events would stall rollups forever
Bound (c) has a failure mode of its own. If a single old event trickles into the memtable and the memtable never reaches its size-based flush threshold, that one event holds the watermark hostage: the oldest-event cap pins it at that event's hour indefinitely, and rollups stall for every account.
The worker breaks the stall with a force-drain. If the memtable's oldest event has been sitting there longer than memtable_max_age_ms, the worker drains the memtable, rotates the WAL, and queues the batch for the flusher, then skips the current tick. These are the same primitives the ingest path uses when the size threshold trips. The next tick sees the events safely in a raw segment and the watermark resumes. Slow trickles of late data get flushed on a timer instead of blocking the pipeline.
Query routing: rollups plus a raw tail
On the read side, a RollupHourly query in src/query/executor.rs splits the requested range at the watermark with half-open semantics. The sealed part [from, min(to, watermark)) is served from rollup segments, each pruned by bucket so a single-account query only opens that account's segments. The open tail [max(from, watermark), to) falls back to a raw scan over segments and the memtable. A month query inside a sealed period is pure rollup; one taken right after the period ends is mostly rollups with a thin raw tail for the current hour.
Both halves feed the same aggregator. A rollup row is converted into a synthetic event whose quantity is the precomputed sum and whose timestamp is the hour start, so rollup-derived and raw-derived rows aggregate uniformly. The tests in tests/rollups.rs assert that SUM(quantity) agrees between the RawEvents and RollupHourly sources, and that an open hour past the watermark still shows up through the raw fallback.
The COUNT caveat, stated honestly
There is one place where rollups and raw scans deliberately disagree. Because each rollup row enters the aggregator as a single synthetic event, a COUNT over a RollupHourly query counts rollup rows, not underlying events. A meter that recorded ten thousand events in an hour contributes one rollup row, so the rollup COUNT for that hour is 1, not 10,000.
This is documented, not hidden. SUM is exact across both sources, which is what matters for billing. For an exact event count, query the RawEvents source. The rollup carries a per-row event_count, but the current query path does not surface it through COUNT; that is a deliberate MVP boundary, not a silent inconsistency.
Corrections inside rollups
A Correction event carries a negative quantity to walk back an overcount. The rollup builder does nothing special with it beyond arithmetic: a negative quantity adds into the same quantity_sum as any other event, so an hour with an original +100 and a later -30 correction rolls up to a net 70. The rollup total is original + correction by construction, which is exactly what an invoice should show. How corrections interact with closed billing periods is the subject of Part 9.
Rebuilding rollups when something goes wrong
Rollups are a cache, and a cache that cannot be rebuilt is a liability. The operator tool rebuild_rollups(from_ms, to_ms) drops every rollup segment overlapping the range from the manifest and rewinds the watermark to from_ms. The next tick refills the gap from raw segments. The raw segments are immutable and never touched, so this is always safe to run:
| Situation | What rebuild does |
|---|---|
| A rollup builder bug was fixed | Regenerate the cached rollups from the unchanged raw events with the corrected logic. |
| Late data arrived for an already-sealed hour | Rewind past that hour so the next tick re-aggregates it, picking up the now-present events. |
| Verifying rollup-vs-raw drift | Force a clean recompute and compare against a raw scan. |
The rollup files are removed from manifest.rollup_segments rather than deleted from disk immediately, so an in-flight query that already snapshotted the old manifest can still open them. usageDb also exposes a verify endpoint and a verify-period admin command that compare a rollup-sourced total against a raw scan to catch drift before it reaches a bill; that machinery is covered in Part 7.
Why the watermark is the whole design
Strip away the encodings and the segment formats and the rollup story reduces to one number. The watermark is a promise: everything below it has been completely and correctly aggregated, so it is safe to serve from the fast path; everything at or above it has not, so it must be scanned raw. The three bounds keep that promise from breaking, force-drain keeps it advancing, and rebuild reissues it when the underlying truth changes. That is how usageDb stays both fast enough to render a dashboard and correct enough to print an invoice.
usageDb internals: the full series
- Why a purpose-built usage database
- The ingest path and durability contract
- Idempotency and deduplication
- The columnar segment format
- The manifest and crash recovery
- Hourly rollups and the watermark
- The query engine
- Compaction
- Period lifecycle and frozen snapshots
- Property tests and simulation testing
usageDb is open source. The rollup worker, builder, and reader live under src/rollup/ in the repo at github.com/pbudzik/usagedb.
Part 6 of the usageDb internals series. usageDb is the open-source Rust storage engine behind UsageBox, built for usage-based AI billing where being fast and being correct are the same requirement.