The short version: usageDb flushes small immutable segments often, which is great for durability but bad for reads: a busy account ends up with dozens of tiny per-bucket files, each with its own dictionary. Compaction merges those small segments into one larger, well-sorted, well-compressed output per bucket, swaps it in through a single atomic manifest commit, and only deletes the old files after a reader grace period so no in-flight query is ever pulled out from under itself.
This is Part 8 of the usageDb internals series. usageDb is the open-source Rust storage engine behind UsageBox. Code: github.com/pbudzik/usagedb.
Part 7 covered the query engine. This part covers compaction.
Why compaction exists
The ingest path described in Part 2 trades latency for durability by flushing often. Every time the memtable crosses its size threshold it is drained, sorted, and written out as one or more immutable raw segments, one per bucket (where bucket = blake3(account_id) % bucket_count). A high-volume account therefore accumulates many small .seg files over time, each one the residue of a single flush.
Frequent small flushes are good for the durability contract but they punish reads in two specific ways:
- Pruning gets weaker per byte of useful data. The query engine prunes whole segments using
SegmentMetabefore opening any file. More files means more metadata to scan and more files to open for the same logical rows, so the fixed per-file cost dominates. - Compression gets worse. The columnar format from Part 4 dictionary-encodes the ID columns (
account_id,product_id,meter_id,model_id, and friends). Each segment carries its own dictionary. Ten tiny files each repeat the same handful of product and meter strings ten times over instead of sharing one dictionary across the merged whole.
Compaction is the background process that buys those reads back. It merges the small per-bucket segments into a single larger output, keeps the canonical billing sort order so the merge stays cheap and the result compresses well, and does the swap without ever losing an event or breaking a reader mid-query. The constraint that makes this interesting: the files are append-only and immutable, so you cannot edit a segment in place. You can only write a new one and atomically change which set the manifest points at.
The planner: choosing what to merge
The planner (src/compact/planner.rs) is deliberately small. It groups the raw segments by bucket and emits one CompactionPlan per bucket that holds more than max_small_segments segments. Each plan names a bucket and the list of segment IDs to merge. Inputs always share a bucket so the output can stay bucketed, which is what keeps the query engine's bucket(account_id) pruning valid afterward.
let mut by_bucket: HashMap<u32, Vec<&SegmentMeta>> = HashMap::new();
for seg in segments {
by_bucket.entry(seg.bucket).or_default().push(seg);
}
let mut plans = Vec::new();
for (bucket, bucket_segments) in by_bucket {
if bucket_segments.len() > self.max_small_segments {
let segment_ids = bucket_segments.iter()
.map(|s| s.segment_id.clone()).collect();
plans.push(CompactionPlan { bucket, segment_ids });
}
}
The default threshold is sixteen segments per bucket (compaction_max_small_segments), with a max_small_size_bytes guard of 32 MB carried alongside it. These mirror the spec's suggested thresholds: a small segment is under 32 MB, and compaction triggers once a bucket holds more than sixteen of them. Note that the trigger is a per-bucket count, not a global one, so a quiet account never drags a noisy one into a needless merge.
The worker: merge, sort, dedupe, write
The worker (src/compact/worker.rs) executes one plan at a time. It reads every event out of the input segments, then runs a single sort_and_dedupe pass before writing the output. The reads happen outside the manifest write lock, which is safe precisely because segment files are immutable: nothing can mutate them while the worker is mid-read.
The sort reproduces the canonical billing order established at flush time in Part 4: account_id, then product_id, meter_id, model_id (with a missing model sorting as the empty string), then timestamp_ms. Because each input was already written in that order, the merge is close to a merge of already-sorted runs, and the resulting contiguity is exactly what makes dictionary and delta encoding pay off. Dedup is by exact event_id: the first occurrence wins, later duplicates are dropped.
events.sort_by(|a, b| {
a.account_id.0.cmp(&b.account_id.0)
.then_with(|| a.product_id.0.cmp(&b.product_id.0))
.then_with(|| a.meter_id.0.cmp(&b.meter_id.0))
.then_with(|| /* model_id, empty string if None */)
.then_with(|| a.timestamp_ms.cmp(&b.timestamp_ms))
});
let mut seen: HashSet<String> = HashSet::with_capacity(events.len());
events.retain(|e| seen.insert(e.event_id.0.clone()));
The output is written to a fresh compacted_<uuid>.seg file using the same RawSegmentWriter the flusher uses, so it is byte-identical in format to any other raw segment. The worker builds its SegmentMeta with the same build_segment_meta helper, and it writes the dedupe sidecar index for the output too so the recovery speedup from Part 3 still applies. If the writer fails at any point, the half-written output is removed and the job aborts. The manifest is untouched, so a failed compaction is a no-op rather than a corruption.
The atomic swap: ReplacementRecord
The actual cutover is a single manifest commit. The worker calls commit_manifest_if with a closure that does three things atomically: removes the input segments from raw_segments, pushes the new output segment in, and appends a ReplacementRecord to compacted_replacements. The record names the old segment IDs, the new one, and the wall-clock millisecond at which the swap committed.
manifest.raw_segments.retain(|s| !input_set.contains(&s.segment_id));
manifest.raw_segments.push(new_meta.clone());
manifest.compacted_replacements.push(ReplacementRecord {
old_segments: plan.segment_ids.clone(),
new_segments: vec![output_id.clone()],
committed_at_ms: now_ms,
});
This rides on the atomic manifest commit from Part 5: the closure mutates a clone, and the new state is only published in memory after the new manifest generation has been durably written and CURRENT advanced. Readers either see the old segment set or the new one, never a half-applied mix. The _if variant adds one more guard against concurrency: before mutating, the closure re-checks that every planned input is still present in raw_segments. If a previous tick or a racing worker already removed any of them, it returns None, the manifest is left untouched, and the freshly written output file is deleted. That is why a second tick over an already-compacted manifest is a clean no-op rather than a double-merge.
The detail that makes the whole scheme safe is what the swap does not do: it does not delete the old files. They stay on disk. The manifest no longer references them, but the bytes are still there.
The reader grace period
Spec section 15.3 lists the compaction sequence as choose, merge, sort, dedupe, write, atomically update the manifest, then "delete old files after reader grace period." That last clause closes a race that is specific to append-only storage with snapshot reads.
A query in usageDb begins by snapshotting the manifest: it reads the current raw_segments list and then opens those files. Suppose a query snapshots the manifest, gets a list that includes segment raw_X, and is about to open it. Compaction commits, swaps raw_X out for compacted_Y, and the query is now holding a segment ID the live manifest no longer mentions. If compaction deleted raw_X immediately, the in-flight query would try to open a file that no longer exists and fail through no fault of its own. The output compacted_Y contains exactly the same events, but the query already committed to the old plan.
The grace period prevents this. Deletion is deferred: the file is only removed once its ReplacementRecord is older than compaction_grace_ms (30 seconds by default), which must exceed the longest expected query duration. The sweep runs as phase one of every tick, before any new merge:
let cutoff = now_ms - self.grace_ms;
let to_finalize = manifest.compacted_replacements.iter()
.filter(|r| r.committed_at_ms != 0 && r.committed_at_ms <= cutoff)
.cloned().collect();
// ... delete each old_segments file, then drop the finalized records
// from compacted_replacements in a follow-up manifest commit.
Only records past the cutoff are finalized: the old .seg files are deleted from disk, and the now-empty ReplacementRecords are dropped from the manifest in a follow-up commit. The integration tests pin this exactly: at commit time the files survive, a tick one millisecond inside the window leaves them in place, and a tick one millisecond past it deletes them and clears the record.
Compaction preserves the sum
The invariant that holds the whole feature accountable is simple to state: a compacted segment set must produce exactly the same SUM(quantity) and COUNT as the inputs it replaced. Compaction reorganizes storage, it never changes a billing answer. This is spec section 19.9, and it is checked two ways in the test suite covered in Part 10: a proptest property compaction_preserves_sum that aggregates before and after a forced compaction tick and asserts equality, and the deterministic simulation test that interleaves CompactTick with ingest, flush, rollup, and restart against a reference model and asserts the SUT still matches after every step. Because exact event_id dedup runs during the merge, the post-compaction count can only ever drop duplicates that were already logically identical, never real events.
Running in the background
Compaction is a background worker, structured like the hourly rollup worker from Part 6. It runs on a tokio interval (compaction_tick_interval_secs, default 60), skips its immediate first tick, and on shutdown notification breaks cleanly out of its loop. Each tick first sweeps pending deletions, then plans and runs any merges. Tests drive tick(now_ms) directly with an explicit clock so the grace-window behavior is exercised without waiting on wall time. Failures of an individual plan are logged and isolated, so one bad bucket does not stall compaction for the rest.
usageDb internals: the full series
- Why a purpose-built usage database
- The ingest path and durability contract
- Idempotency and deduplication
- The columnar segment format
- The manifest and crash recovery
- Hourly rollups and the watermark
- The query engine
- Compaction
- Period lifecycle and frozen snapshots
- Property tests and simulation testing
Next: Part 9, period lifecycle and frozen snapshots, on how a billing period is closed and what stays stable afterward.
usageDb is an open-source Rust storage engine for usage-based AI billing, developed alongside UsageBox. The source for this part lives in src/compact/worker.rs, src/compact/planner.rs, and src/storage/manifest.rs. It is an MVP scaffold and the spec is the source of truth.