Crash-Safe Metadata in usageDb: Atomic Manifest Commits and Generation Rollback

How usageDb keeps its single source of truth durable: temp-and-rename atomic commits with a parent-directory fsync, numbered manifest generations that roll back past a corrupt write, fail-closed recovery, and an exclusive process lock.

8 min read


The short version: the manifest is usageDb's single source of truth. It is the catalog of every raw and rollup segment, the rollup watermark, the highest sealed WAL id, and the list of closed billing periods. Lose it to a torn write and every other file on disk becomes an orphan with no index. So the manifest is committed atomically (write a temp file, fsync it, rename it, fsync the directory), versioned into numbered generations, and on startup it rolls back to the most recent valid generation if the current one is corrupt. It refuses to start only when no generation parses.

This is Part 5 of the usageDb internals series. Part 4 explained the immutable columnar .seg files that hold the actual usage events. But a .seg file by itself is meaningless. The engine has no idea it exists, what time range it covers, or which account IDs are inside it, until something points at it. That something is the manifest, and this article is about why it is the one structure in the database that has to update atomically, survive a crash, and survive its own corruption.

(Previous: Part 4, the columnar segment format. Next: Part 6, hourly rollups and the watermark.)

What the manifest holds

The Manifest struct in src/storage/manifest.rs is the index that ties the WAL, the segments, and the rollups together. Every field on it is load-bearing:

  • raw_segments and rollup_segments: vectors of SegmentMeta. Each entry names a segment file and carries its pruning metadata: min and max timestamp, bucket, row count, min/max account id, and the set of product, meter, and model IDs inside it. The query planner reads this to skip whole segments without opening them.
  • watermarks.hourly_rollup_ms: how far the rollup builder has sealed completed hours. Everything before the watermark is served from rollups; everything after falls back to a raw scan. Part 6 covers why this bound is so careful.
  • last_sealed_wal_id: the highest WAL file id whose contents are durably in committed segments. WAL files at or below this can be deleted; files above it must be replayed on recovery. This single integer is the join between the durable log and the durable segments.
  • compacted_replacements: ReplacementRecord entries that map old segment IDs to the compacted output that replaced them, with a commit timestamp for the reader grace period.
  • closed_periods: per-account billing periods that have been finalized, each with its frozen snapshot. Ingest rejects new Usage events landing inside one of these. Part 9 is entirely about this.
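As a rough illustration of how the fields above enable pruning, here is a trimmed-down sketch. The field names, integer types, and the can_skip helper are assumptions made for the example and are not copied from src/storage/manifest.rs; only the general shape (per-segment min/max metadata consulted before any file is opened) comes from the description above.

```rust
use std::collections::BTreeSet;

/// Pruning metadata for one segment file (field names are illustrative).
#[derive(Clone, Debug)]
pub struct SegmentMeta {
    pub file_name: String,
    pub min_ts_ms: i64,
    pub max_ts_ms: i64,
    pub row_count: u64,
    pub min_account_id: u64,
    pub max_account_id: u64,
    pub product_ids: BTreeSet<u32>,
}

/// A reduced manifest: only a few of the fields described in the article.
#[derive(Clone, Debug, Default)]
pub struct Manifest {
    pub raw_segments: Vec<SegmentMeta>,
    pub rollup_segments: Vec<SegmentMeta>,
    pub hourly_rollup_watermark_ms: i64,
    pub last_sealed_wal_id: u64,
}

/// Hypothetical planner check: returns true when the segment cannot hold
/// rows for `account_id` inside the inclusive range [from_ms, to_ms],
/// so it can be skipped without opening the file.
pub fn can_skip(seg: &SegmentMeta, account_id: u64, from_ms: i64, to_ms: i64) -> bool {
    let outside_time = seg.max_ts_ms < from_ms || seg.min_ts_ms > to_ms;
    let outside_accounts =
        account_id < seg.min_account_id || account_id > seg.max_account_id;
    outside_time || outside_accounts
}
```

A check like this is only as trustworthy as the metadata it reads, which is the reason a stale or partial manifest is so dangerous.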

If any of these are wrong, the database lies. A missing segment means silently dropped revenue; a watermark that is too high means an account-month total served from a rollup that does not yet include all the data. That is why the manifest cannot tolerate a partial write.

Why an in-place rewrite is unsafe

The naive way to persist the manifest is to truncate the file and write the new JSON in place. The problem is that a write is not atomic with respect to a crash. Lose power halfway through and you are left with a file that is neither the old manifest nor the new one: a torn write. JSON that ends mid-array will not parse, and the index to the entire database is gone. Every committed segment is still on disk, intact and immutable, but nothing points at them. The database is effectively empty.

usageDb avoids this with the standard atomic-replace dance: write the new content to a temporary file, fsync it, then rename it over the target. A rename within a filesystem is atomic, so a reader sees either the complete old file or the complete new file, never a half-written one. The final, easy-to-miss step is fsyncing the parent directory, because the rename itself is a directory metadata change and is not durable until the directory is synced.
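The sequence described above can be sketched with the standard library alone. This is a minimal illustration, not usageDb's actual helper: the function name atomic_write and the ".tmp" naming are assumptions, and syncing a directory by opening it as a File is a Unix behavior that does not carry over to Windows.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Atomically replaces `target` with `bytes` (hypothetical helper).
pub fn atomic_write(target: &Path, bytes: &[u8]) -> io::Result<()> {
    // A bare file name has an empty parent; treat that as the current directory.
    let dir = match target.parent() {
        Some(p) if !p.as_os_str().is_empty() => p,
        _ => Path::new("."),
    };
    let tmp = target.with_extension("tmp");

    // 1. Write the complete new content to a temp file and flush it to disk.
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?;
    drop(f);

    // 2. Swap it into place. Within one filesystem the rename is atomic.
    fs::rename(&tmp, target)?;

    // 3. Sync the parent directory so the rename itself survives a crash.
    //    Unix-only assumption: directories can be opened and fsynced.
    File::open(dir)?.sync_all()?;
    Ok(())
}
```

If a crash lands between steps 1 and 2, the only leftover is a stray temp file, which a startup cleanup pass can delete; the target still holds the complete old content.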

The commit helper: clone, mutate, save, publish

Atomicity on disk is only half the contract. The other half is that the in-memory manifest must never get ahead of the on-disk one. An earlier version of usageDb had several call sites that took the manifest write lock, mutated it in place, and then called save. If save failed, the in-memory state held the mutation but disk did not. Queries would see writes that had not reached durable storage, and after a crash, recovery would rebuild from the older on-disk manifest while the running process believed it had committed newer state. That was a P0 finding from an external review.

The fix routes every mutation through one helper in src/runtime/state.rs. It clones the manifest, mutates the clone, saves the clone, and only publishes the clone in memory after the save succeeds. A save failure leaves both disk and memory untouched:

pub async fn commit_manifest<F, T>(&self, op: F) -> std::io::Result<T>
where
    F: FnOnce(&mut Manifest) -> T,
{
    let mut guard = self.manifest.write().await;
    let mut next = guard.clone();
    let value = op(&mut next);
    next.save(&self.config.db_root)?;   // fails here => nothing published
    *guard = next;                        // publish only on success
    Ok(value)
}

A sibling, commit_manifest_if, takes a closure that returns Option<T> and skips the save entirely when it returns None. That covers cases where the closure must decide, while holding the write lock and therefore free of races, whether the change is needed at all. One example is a close-period call that rechecks whether a racing caller has already closed the period. The regression test in tests/p0_manifest_atomic.rs sabotages the manifest directory, calls the helper, and asserts that neither the generation counter nor the mutated field moved.
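To make the conditional variant concrete, here is a synchronous sketch of the same clone, mutate, save, publish pattern. It is an approximation under stated assumptions: it uses std::sync::RwLock rather than an async lock, the State type and its injected save closure are hypothetical stand-ins for the real runtime state and Manifest::save, and the generation bump is done inline instead of inside save.

```rust
use std::io;
use std::sync::RwLock;

#[derive(Clone, Debug, Default, PartialEq)]
pub struct Manifest {
    pub generation: u64,
    pub closed_periods: Vec<String>,
}

/// Hypothetical runtime state; `save` stands in for the on-disk commit.
pub struct State<S> {
    manifest: RwLock<Manifest>,
    save: S,
}

impl<S: Fn(&Manifest) -> io::Result<()>> State<S> {
    pub fn new(initial: Manifest, save: S) -> Self {
        State { manifest: RwLock::new(initial), save }
    }

    /// Runs `op` on a clone. `None` means "no change needed": nothing is
    /// saved or published. On `Some`, the clone is saved first and becomes
    /// visible in memory only if the save succeeded.
    pub fn commit_manifest_if<F, T>(&self, op: F) -> io::Result<Option<T>>
    where
        F: FnOnce(&mut Manifest) -> Option<T>,
    {
        let mut guard = self.manifest.write().unwrap();
        let mut next = guard.clone();
        let value = match op(&mut next) {
            Some(v) => v,
            None => return Ok(None),
        };
        next.generation += 1;
        (self.save)(&next)?; // fails here => memory is untouched
        *guard = next;       // publish only on success
        Ok(Some(value))
    }

    pub fn snapshot(&self) -> Manifest {
        self.manifest.read().unwrap().clone()
    }
}
```

Injecting the save step as a closure is purely a convenience for the sketch: it lets a test substitute a failing save and observe that the snapshot is unchanged, which mirrors what the regression test is described as asserting.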

Generations: numbered manifests and a CURRENT pointer

Atomic rename protects a single write. It does not protect against the new file being corrupt for some other reason: a bit flip on disk, a serialization bug, a botched migration. To survive that, usageDb does not overwrite one manifest file. It keeps a directory of numbered generations and a one-line pointer:

db_root/
  manifest/
    CURRENT                 (single line: latest valid generation u64)
    manifest-000001.json
    manifest-000002.json    (last KEEP_GENERATIONS = 10 retained)
    ...
  manifest.json             (legacy; auto-migrated to generation 1 on first load)

Every call to save bumps the generation counter, writes a fresh manifest-NNNNNN.json with the temp-and-rename dance, then atomically advances CURRENT the same way, and finally fsyncs the directory so both renames are durable. After the write, prune_old_generations deletes everything older than the last ten. Pruning failure is non-fatal: the worst case is a few extra files on disk, not a lost commit.
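The naming and retention rules above reduce to a few pure functions. The sketch below assumes the six-digit zero padding shown in the directory listing; the helper names generation_file_name, parse_generation, and generations_to_prune are invented for illustration and may not match the real prune_old_generations.

```rust
/// Number of generations retained, per the directory layout above.
pub const KEEP_GENERATIONS: usize = 10;

/// Formats a generation number as its file name, e.g. 1 -> manifest-000001.json.
pub fn generation_file_name(generation: u64) -> String {
    format!("manifest-{:06}.json", generation)
}

/// Parses a generation number out of a file name, ignoring anything
/// that is not a manifest generation (CURRENT, temp files, and so on).
pub fn parse_generation(file_name: &str) -> Option<u64> {
    file_name
        .strip_prefix("manifest-")?
        .strip_suffix(".json")?
        .parse()
        .ok()
}

/// Given the generation numbers present on disk, returns the ones that
/// fall outside the newest `keep` and are therefore safe to delete.
pub fn generations_to_prune(mut present: Vec<u64>, keep: usize) -> Vec<u64> {
    present.sort_unstable();
    let cut = present.len().saturating_sub(keep);
    present.truncate(cut);
    present
}
```

With generations 1 through 12 on disk and a keep count of ten, only 1 and 2 are selected for deletion. Because the selection runs after the new generation and CURRENT are already durable, a failed delete costs disk space and nothing else.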

A database created before generations existed has a single manifest.json at the db root. On first load, migrate_legacy parses it, saves it as generation 1 through the normal path, and removes the old file. If that legacy file will not parse, usageDb refuses to start rather than silently booting an empty database, because there are no older generations to fall back to.

Recovery: roll back, or fail closed

On startup, run_startup_recovery in src/runtime/recovery.rs calls Manifest::load, which reads CURRENT, parses the generation number, and tries to load that generation. If it parses cleanly, done. If it does not, the loader walks backwards through earlier generations until one parses, logging a warning that it rolled back. Billing data is not orphaned by a single corrupt write, because the previous nine generations all point at the same immutable segment files.

The crucial design choice is what happens when nothing parses. The loader does not start with an empty database. It returns an error and the process refuses to start, telling the operator to inspect the directory manually. Silently booting empty would mean serving zero usage for every account, the worst possible failure for a billing system. This fail-closed behavior is exercised directly: one test corrupts the only generation and asserts that load returns an error containing "no valid manifest generation", and the simulation harness in Part 10 drives a CorruptLatestManifestAndRestart op so the rollback path runs against the real recovery sequence with the model checking for divergence afterward.
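The rollback-or-fail-closed logic can be sketched independently of the file format. In this sketch the try_load closure is a hypothetical stand-in for "read manifest-NNNNNN.json and parse it", returning None for a missing or corrupt file; the Loaded type and the function name are likewise assumptions. Only the error text is taken from the test described above.

```rust
use std::io;

/// Result of a successful load (illustrative type).
#[derive(Debug, PartialEq)]
pub struct Loaded<M> {
    pub manifest: M,
    pub generation: u64,
    pub rolled_back: bool,
}

/// Tries the generation named by CURRENT, then walks backwards.
/// Never falls back to an empty manifest: if nothing parses, it errors.
pub fn load_with_rollback<M, F>(current: u64, try_load: F) -> io::Result<Loaded<M>>
where
    F: Fn(u64) -> Option<M>,
{
    let mut generation = current;
    // A real loader would likely stop at the oldest file present on disk
    // instead of counting all the way down to 1.
    while generation >= 1 {
        if let Some(manifest) = try_load(generation) {
            return Ok(Loaded {
                manifest,
                generation,
                rolled_back: generation != current,
            });
        }
        generation -= 1;
    }
    Err(io::Error::new(
        io::ErrorKind::InvalidData,
        "no valid manifest generation",
    ))
}
```

The important property is the absence of a default branch: there is no code path that produces a manifest out of thin air, so an unreadable directory always surfaces as an error the operator has to look at.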

The exclusive process lock

All of this atomicity assumes one writer. Two processes racing on the same manifest directory, say the HTTP server and an admin CLI command running at the same time, could interleave their generation bumps and clobber each other. usageDb prevents that with an OS-level file lock. DbLock::acquire in src/runtime/lock.rs opens db_root/LOCK and takes an exclusive flock for the lifetime of the process. It uses the non-blocking variant, so a second process gets a clear "another usagedb process holds the lock" error instead of hanging. The lock is released when the holder drops it; the file is intentionally left on disk.
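A non-blocking exclusive flock can be sketched as follows. This is an assumption-laden illustration rather than the code in src/runtime/lock.rs: it is Unix-only, it declares flock(2) directly instead of going through a crate, and the LOCK_EX and LOCK_NB values are the ones used on Linux and macOS. The `unsafe extern` spelling needs Rust 1.82 or newer; older toolchains write plain `extern "C"`.

```rust
use std::fs::{File, OpenOptions};
use std::io;
use std::os::raw::c_int;
use std::os::unix::io::AsRawFd;
use std::path::Path;

unsafe extern "C" {
    // flock(2) from the C library that std already links against.
    fn flock(fd: c_int, operation: c_int) -> c_int;
}

const LOCK_EX: c_int = 2; // exclusive lock (Linux/macOS value)
const LOCK_NB: c_int = 4; // return an error instead of blocking

/// Holds the lock for as long as the value lives (illustrative type).
pub struct DbLock {
    _file: File,
}

impl DbLock {
    pub fn acquire(db_root: &Path) -> io::Result<DbLock> {
        let file = OpenOptions::new()
            .create(true)
            .write(true)
            .open(db_root.join("LOCK"))?;
        let rc = unsafe { flock(file.as_raw_fd(), LOCK_EX | LOCK_NB) };
        if rc != 0 {
            let os_err = io::Error::last_os_error();
            if os_err.kind() == io::ErrorKind::WouldBlock {
                return Err(io::Error::new(
                    io::ErrorKind::WouldBlock,
                    "another usagedb process holds the lock",
                ));
            }
            return Err(os_err);
        }
        // Closing the file on drop releases the lock; LOCK itself stays on disk.
        Ok(DbLock { _file: file })
    }
}
```

Because flock locks belong to the open file description rather than to the process, a second acquire fails even from within the same process, which makes the behavior easy to test. Leaving the LOCK file in place avoids a race in which one process deletes the file while another has just opened it.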

Where the manifest fits in the recovery sequence

Loading the manifest is step one of recovery, and the rest of the sequence depends on it. Once the manifest is loaded, recovery reads last_sealed_wal_id, cleans up tmp files and WAL files at or below that id, then replays the unsealed WAL files above it back into both the dedupe cache and the memtable. Finally it scans raw segments inside the dedupe TTL window and re-registers their events, so a retry that crosses a restart is still caught as a duplicate. The WAL replay and dedupe re-registration are the subjects of Part 2 and Part 3. The point here is the ordering: nothing else can happen until the manifest is known good, because the manifest is what tells recovery which WAL files matter and which segments exist.
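The role of last_sealed_wal_id in that sequence can be shown with a small planning function. The WalPlan type and plan_wal_recovery name are hypothetical; the rule they encode (ids at or below the sealed id are deletable, ids above it must be replayed) is the one stated above.

```rust
/// Which WAL files recovery should delete and which it should replay.
#[derive(Debug, PartialEq)]
pub struct WalPlan {
    pub delete: Vec<u64>,
    pub replay: Vec<u64>,
}

/// Splits the WAL ids found on disk around `last_sealed_wal_id`.
/// Replay order is ascending so events are reapplied in log order.
pub fn plan_wal_recovery(last_sealed_wal_id: u64, mut wal_ids_on_disk: Vec<u64>) -> WalPlan {
    wal_ids_on_disk.sort_unstable();
    let (delete, replay): (Vec<u64>, Vec<u64>) = wal_ids_on_disk
        .into_iter()
        .partition(|id| *id <= last_sealed_wal_id);
    WalPlan { delete, replay }
}
```

For example, with a sealed id of 7 and files 6, 7, 8, and 9 on disk, files 6 and 7 are deleted and 8 and 9 are replayed. A rolled-back manifest carries an older sealed id, so the same rule simply replays more of the log, provided those WAL files are still present.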


usageDb is open source and developed alongside UsageBox. Read the manifest and recovery code at github.com/pbudzik/usagedb, then continue with Part 6 on hourly rollups and the watermark.

Key Topics

  • usageDb
  • database internals
  • Rust
  • crash recovery
  • atomic commit
  • manifest
  • fsync
  • fail-closed
