Proving usageDb Correct: Property Tests and Deterministic Simulation Testing

How usageDb, the open-source Rust usage database behind UsageBox, verifies its billing invariants: proptest property tests over thousands of random inputs, plus deterministic simulation testing that runs random crash, restart, and manifest-corruption sequences against a parallel reference model.

10 min read

usageDbdatabase internalsRustproperty testingdeterministic simulation testingproptestreference modelcrash recovery

The short version: You should not trust a database that bills your customers just because its unit tests pass. usageDb is verified two ways. Property tests assert the billing invariants from the spec over thousands of randomly generated event sequences, regenerated on every CI run. Deterministic simulation testing drives random sequences of ingest, flush, rollup, compaction, restart, and manifest-corruption ops against a simple parallel reference model, asserting that the real system agrees with the model after every single step. When either fails, proptest shrinks the failure to the shortest reproducible op trace and locks it into a regressions file so it can never silently come back.

This is the finale of the usageDb internals series, Part 10 of 10. The previous nine parts each ended with a claim that some invariant holds: events are never double-counted, rollups reconcile with raw scans, compaction preserves totals, a crash never loses an acknowledged event. This chapter is where those claims get earned. (Previous: Part 9, period lifecycle and frozen snapshots.)

A billing database that drops, duplicates, or miscounts events does not produce a slightly-wrong dashboard. It produces a wrong invoice. Example-based unit tests check the handful of cases an engineer thought of; they do not check the interleaving of a flush mid-batch, then a restart, then a retry of the same events. usageDb leans on two techniques that generate those cases for you.

The invariants come from the spec, not from the tests

Section 19 of the design spec lists the invariants the database must preserve. They are not assertions invented after the fact; they are the contract, and the tests reference them by number. The ones the suite exercises directly:

Invariant (spec section 19)What it means for billingImplemented in
1. Acknowledged events are recoverableIf we returned 200, the event survives a crash, flushed or not.Part 2
4. A raw segment is never counted twice in rollupsRe-running the rollup builder cannot inflate a total.Part 6
5. Same event_id, same payload is not double-countedA retried batch is free; the customer pays once.Part 3
6. Same event_id, different payload is visible as a conflictA buggy collector that reuses an id is surfaced, not silently merged.Part 3
8. Rollups reconcile with a raw audit scanThe fast monthly total equals the slow correctness fallback.Part 6
9. Compaction does not change logical resultsMerging small segments never alters a SUM.Part 8

Property tests: one invariant, thousands of random inputs

The property tests live in tests/properties.rs and use proptest. Property-based testing means describing the shape of valid input, then asserting a property holds for every input drawn from that shape. Here the shape is a batch of one to thirty valid usage events, with event IDs drawn from a deliberately small alphabet so collisions (and therefore the dedupe path) actually get exercised. Timestamps stay inside a single hour so the rollup path is deterministic, and quantities are small positive integers so a SUM stays well inside an i128.

Each test spins up a fresh tempdir-backed database, drives the same primitives the production pipeline uses (ingest, optional flush, optional rollup or compaction, restart), and compares two derived values for equality. One property reads almost like the English statement of the invariant:

// section 19.8 -- rollup totals reconcile with raw totals.
#[test]
fn raw_sum_equals_rollup_sum(events in arb_event_batch()) {
    let (raw, rollup) = rt.block_on(async {
        let harness = Harness::new();
        harness.ingest(events).await;
        harness.force_flush().await;
        harness.tick_rollup(HOUR_MS + 120_000).await; // seal hour 0
        (harness.query_sums_per_account(QuerySource::RawEvents).await,
         harness.query_sums_per_account(QuerySource::RollupHourly).await)
    });
    for acc in ACCOUNTS {
        prop_assert_eq!(raw[acc], rollup[acc]); // must agree, every time
    }
}

The properties cover: raw totals equal rollup totals; a retried batch re-accepts zero events and leaves every SUM unchanged; compaction over several accumulated segments preserves every account total; a restart after a flush preserves totals, and so does a restart with no prior flush at all (memtable events come back through WAL replay); a second rollup tick at the same clock changes nothing; and a same-id-different-payload event is reported as a conflict rather than accepted. Each property runs 32 randomized cases (a count chosen to keep CI fast given that every case does real disk I/O), regenerated from a fresh seed on every run. A bug that shows up one input in a hundred eventually surfaces on a CI run rather than hiding forever behind a fixed example.

Deterministic simulation testing: random histories against a reference model

Property tests check one invariant after one fixed pipeline. They do not catch the bug that only appears when a flush, a restart, and a retry happen in a particular order. That is the job of deterministic simulation testing (DST) in tests/dst.rs. DST generates a random sequence of five to thirty operations and applies each one to two things in lockstep: the real database (the system under test) and a deliberately dumb reference model. The reference model is the heart of the technique. It is a tiny in-memory struct tracking only what is needed to predict every observable query result: a map of acknowledged event-id hashes so duplicates are not double-counted, a per-account quantity total, and the set of closed billing periods. It knows nothing about rollup watermarks, segment layout, bucketing, or compaction. It is obviously correct precisely because it is too simple to be wrong, which makes it a trustworthy oracle for the complicated real system sitting next to it.

The op set is small and each op maps to a real subsystem:

OpEffect on the real system
IngestA small batch of random events through the dedupe and WAL path.
FlushDrain the memtable to one segment per bucket, rotate the WAL, commit the manifest.
RollupTickAdvance the rollup watermark past the completed hour.
CompactTickMerge any bucket holding two or more small segments.
RestartDrop in-process state and rebuild from disk via the real recovery path.
ClosePeriodFreeze a snapshot for an account and mark its period closed.
CorruptLatestManifestAndRestartOverwrite the newest manifest generation with garbage, then restart.

After every step, the driver asserts the system's raw SUM per account matches the model's. After the whole sequence it does a final flush, rebuilds the rollups, advances the watermark, runs a compaction pass, and asserts raw equals rollup for every account. The per-step assertion is what makes DST powerful: a bug that leaves the database in a subtly wrong state shows up on the next step, not at the very end where the cause is long gone. The check is one short loop:

for (idx, op) in ops.iter().cloned().enumerate() {
    let label = op_label(&op);
    harness = step(harness, &mut model, op).await;   // apply to SUT + model
    assert_raw_matches_model(&harness, &model, idx, label).await?;
}
// ... final flush + rebuild_rollups + rollup_tick + compact_tick ...
for acc in ACCOUNTS {
    prop_assert_eq!(raw[acc], rollup[acc]); // section 19.8 at the end
}

Ops are weighted toward the data-producing ones (ingest is five times likelier than a restart) so a sequence accumulates real events to reason about rather than churning through administrative no-ops.

Corruption and the resync trick

The CorruptLatestManifestAndRestart op is where DST earns its keep against the manifest generation-rollback path from Part 5. It overwrites the newest manifest generation file with garbage and runs the real restart. Recovery is supposed to detect the unparseable current generation and roll back to the previous valid one. The catch: rolling back can legitimately make some data invisible, because a segment referenced only by the corrupt generation is no longer in any manifest the system trusts, so it correctly drops out of query results.

A naive model would now disagree with a correctly-behaving system and report a false failure. The fix is the model's resync_from_sut method: after a corruption-and-restart op, the model discards its own state and re-derives it by scanning whatever the system believes after recovery, then models subsequent ops normally on that baseline. This is correct because the invariant under test is not "no data is ever lost on rollback" (rollback can lose the in-flight generation by design); it is "after recovery, the system is internally consistent." Any post-rollback panic, raw-versus-rollup drift, or ghost event still trips the per-step assertion on the very next op.

One case is deliberately excluded from the random driver: a manifest with a corrupt current generation and no older generation to fall back to. The correct behavior there is to refuse to start rather than silently boot an empty database and bill everyone zero. That fail-closed path would crash the test driver, so it gets its own dedicated test, corrupt_only_manifest_generation_is_fatal, which flushes one generation, corrupts it, and asserts that recovery returns an error instead of an empty database.

Shrinking: a 500-step haystack becomes a four-step recipe

Random testing has a notorious failure mode: when it breaks, the failing input is a giant unreadable blob. proptest solves this with shrinking. On failure it does not hand you the 500-step sequence that tripped the bug; it searches for the smallest input that still fails, repeatedly removing ops and simplifying events until any further reduction makes the test pass. A violation surfaces as a short, reproducible recipe an engineer can read.

Those minimized failures are checked into source control. usageDb has two regressions files: tests/properties.proptest-regressions and tests/dst.proptest-regressions. proptest reads them and re-runs every previously-seen failure before generating new cases. The DST file, for instance, locks in a shrunk trace of the form [RollupTick, Ingest, Ingest, Ingest, Restart]: advance the watermark on an empty database, ingest, restart. Once that bug is fixed, the recipe lives in the repo forever, so the regression cannot silently creep back on a future refactor.

The durability contract has its own integration tests

Underneath the randomized suites, tests/durability.rs pins down the WAL and recovery mechanics from Part 2 with direct example-based tests: WAL rotation seals the active file and starts a new one; delete_files_through removes only sealed files; recovery replays unsealed WAL files into both the memtable and the dedupe cache while skipping sealed files already durable in a segment; classifying an event does not mutate the dedupe cache (only an explicit commit does); and the event hash is deterministic and ignores the server-stamped ingested_at_ms. These are the low-level guarantees the property and DST harnesses build on, worth nailing down explicitly rather than only through random sequences.

Honest about MVP status

The README is upfront that usageDb is an MVP scaffold: some spec items are stubbed, rollup segments use a simpler encoding than raw segments, and a group-commit durability mode is not yet implemented. The testing rigor here is real regardless. The invariants under test are exactly the ones that matter for billing correctness, checked against random adversarial histories rather than a curated happy path. A feature being unfinished is a roadmap item; a billing invariant being unverified would be a liability, and that is the line this chapter draws.

That is the case for trusting the prior nine parts. Each described a mechanism; this one shows the mechanisms are checked together, under random crash and corruption sequences, against a model too simple to lie, with every failure minimized to a recipe and frozen into a regression. The code is open, so the claims are auditable rather than asserted.


That closes the usageDb internals series. usageDb is open source and developed alongside UsageBox; read the property and simulation suites at github.com/pbudzik/usagedb, or start over from Part 1, why a purpose-built usage database.

Key Topics

  • usageDb
  • database internals
  • Rust
  • property testing
  • deterministic simulation testing
  • proptest
  • reference model
  • crash recovery

Related Articles

Explore more articles on similar topics to deepen your understanding of usage-based billing.

Crash-Safe Metadata in usageDb: Atomic Manifest Commits and Generation Rollback

How usageDb keeps its single source of truth durable: temp-and-rename atomic commits with a parent-directory fsync, numb...

8 min readRead more

Why We Built usageDb: A Purpose-Built Rust Database for AI Usage and Billing

usageDb is an open-source Rust storage engine for AI usage metering and billing. Part 1 of a 10-part internals series: t...

8 min readRead more

Inside usageDb's Ingest Path: WAL, Memtable, and the Durability Contract

How usageDb turns an acknowledged usage event into a durable, billable fact: the three-phase ingest critical section, th...

9 min readRead more

Explore More Articles

Discover our complete collection of usage-based billing guides and implementation patterns.

View all articles