Inside usageDb's Ingest Path: WAL, Memtable, and the Durability Contract

How usageDb turns an acknowledged usage event into a durable, billable fact: the three-phase ingest critical section, the fsynced write-ahead log, Strict vs Fast durability modes, and the memtable re-insert rule that keeps a failed flush from silently stranding data.



The short version: usageDb's ingest path runs three phases under one critical section. It validates and classifies the batch, appends and syncs it to a write-ahead log, then commits dedupe state and the memtable. In the default Strict mode the WAL is fsynced before the batch is acknowledged, so once a client sees a 200, the event is on disk and billable. Nothing mutates the dedupe cache until the bytes are durable, and a failed background flush re-inserts events into the memtable instead of stranding them in a sealed log.

This is Part 2 of the usageDb internals series. Part 1 covered the architecture and why a billing engine needs invariants a general-purpose database will not give you. Here we follow a single batch of usage events from the HTTP handler down to fsync, and look at exactly where the durability contract is enforced. usageDb is the open-source Rust storage engine behind UsageBox; the source for everything below is on GitHub.

The whole point of a metering database is that an acknowledged write is a billable event you will never silently lose. That single sentence drives every design decision on the ingest path. If a batch is acked but later vanishes on a host crash, you under-bill. If it is double-counted on a retry, you over-bill and a customer disputes the invoice. The ingest path exists to make both outcomes impossible under the durability mode you have chosen.

The three-phase critical section

A batch arrives at POST /v1/usage/batch. The handler handle_ingest in src/api/http_server.rs first validates and classifies every event, then hands the survivors to ingest_critical_section, which holds three locks at once: the dedupe cache, the WAL, and the memtable. Inside that section the three phases run in strict order.

Phase 1: validate, stamp, classify

validate_event rejects anything that would corrupt downstream accounting: empty event_id, account_id, product_id, or meter_id; a non-positive timestamp_ms; more than 16 dimensions; or a Correction / Retraction that arrives without a correction_ref. A Usage event landing in an already-closed billing period is rejected too. Every surviving event is stamped server-side with ingested_at_ms = now_ms(), so a client with a wrong clock cannot poison the dedupe TTL eviction.
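The rules above can be sketched as a single function. This is a minimal illustration, not the engine's code: the UsageEvent shape, the EventKind enum, and the String error type are assumptions made for the example, and the closed-billing-period check is left out because it needs state the article does not show.

```rust
/// Hypothetical event kinds; only the distinction the rules rely on.
#[derive(Debug, Clone, PartialEq)]
pub enum EventKind {
    Usage,
    Correction,
    Retraction,
}

/// Hypothetical event shape, reduced to the fields the rules mention.
#[derive(Debug, Clone)]
pub struct UsageEvent {
    pub event_id: String,
    pub account_id: String,
    pub product_id: String,
    pub meter_id: String,
    pub timestamp_ms: i64,
    pub dimensions: Vec<(String, String)>,
    pub kind: EventKind,
    pub correction_ref: Option<String>,
}

pub const MAX_DIMENSIONS: usize = 16;

/// Returns Ok(()) if the event passes every rule, or a message naming the first violation.
pub fn validate_event(e: &UsageEvent) -> Result<(), String> {
    let required = [
        ("event_id", &e.event_id),
        ("account_id", &e.account_id),
        ("product_id", &e.product_id),
        ("meter_id", &e.meter_id),
    ];
    for (name, value) in required {
        if value.is_empty() {
            return Err(format!("{name} must not be empty"));
        }
    }
    if e.timestamp_ms <= 0 {
        return Err("timestamp_ms must be positive".to_string());
    }
    if e.dimensions.len() > MAX_DIMENSIONS {
        return Err(format!("at most {MAX_DIMENSIONS} dimensions are allowed"));
    }
    // Corrections and retractions must point at the event they amend.
    if e.kind != EventKind::Usage && e.correction_ref.is_none() {
        return Err("correction_ref is required for Correction / Retraction".to_string());
    }
    Ok(())
}
```

Returning the first violation keeps the sketch short; a real handler might collect every violation per event so a client can fix a batch in one pass.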

Then each event is classified against the dedupe cache, and this is the subtle part. Classification does not mutate anything. The HotDedupe::classify method in src/ingest/dedupe.rs takes &self, not &mut self:

pub fn classify(&self, event_id_hash: EventHash, payload_hash: EventHash) -> DedupeResult {
    if let Some(existing) = self.cache.get(&event_id_hash) {
        if existing.payload_hash == payload_hash {
            DedupeResult::ExactDuplicate
        } else {
            DedupeResult::PayloadConflict
        }
    } else {
        DedupeResult::NewEvent
    }
}

Splitting classification (read-only) from commit (the mutating insert) is what makes the durability ordering possible. We decide which events are new before writing the WAL, but we do not record them as seen until after the WAL is durable. Part 3 covers the dedupe and idempotency model in depth, including the blake3 128-bit identity hashing and the 7-day TTL window.
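To make the split concrete, here is a self-contained sketch of a cache with a read-only classify and a separate mutating commit. It assumes EventHash is a plain u128 and the cache is a HashMap; the real HotDedupe also handles TTL eviction, which is omitted here.

```rust
use std::collections::HashMap;

/// Assumption for this sketch: a 128-bit hash stored as a u128.
pub type EventHash = u128;

#[derive(Debug, PartialEq, Eq)]
pub enum DedupeResult {
    NewEvent,
    ExactDuplicate,
    PayloadConflict,
}

struct Entry {
    payload_hash: EventHash,
}

#[derive(Default)]
pub struct HotDedupe {
    cache: HashMap<EventHash, Entry>,
}

impl HotDedupe {
    /// Read-only: safe to call before the WAL write because it records nothing.
    pub fn classify(&self, event_id_hash: EventHash, payload_hash: EventHash) -> DedupeResult {
        match self.cache.get(&event_id_hash) {
            Some(existing) if existing.payload_hash == payload_hash => DedupeResult::ExactDuplicate,
            Some(_) => DedupeResult::PayloadConflict,
            None => DedupeResult::NewEvent,
        }
    }

    /// Mutating: called only after the WAL bytes are durable.
    pub fn commit(&mut self, event_id_hash: EventHash, payload_hash: EventHash) {
        self.cache.insert(event_id_hash, Entry { payload_hash });
    }
}
```

Because classify takes &self, calling it any number of times leaves the cache unchanged; an event only becomes "seen" when commit runs.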

Phase 2: append and sync to the WAL

Once the set of genuinely new events is known, they are appended to the write-ahead log. The WAL, in src/ingest/wal.rs, is a BufWriter<File> over a numbered file under wal/ (wal-000001.log, wal-000002.log, and so on). append_batch serializes each event to JSON and writes it as one line through the userspace buffer, so the kernel sees a few large writes per batch rather than one syscall per event. Crucially, that buffered write is not yet durable: until the buffer is flushed, some or all of the bytes may still live only in process memory.
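The buffering behavior is easy to observe in isolation. The sketch below is an illustration under stated assumptions: a Vec<u8> stands in for the file, records arrive already serialized, and this append_batch is a simplified stand-in rather than the engine's method.

```rust
use std::io::{BufWriter, Result as IoResult, Write};

/// Writes one line per pre-serialized record through the userspace buffer.
pub fn append_batch<W: Write>(wal: &mut BufWriter<W>, records: &[&str]) -> IoResult<()> {
    for record in records {
        wal.write_all(record.as_bytes())?;
        wal.write_all(b"\n")?;
    }
    Ok(())
}

/// Returns how many bytes the underlying writer has received before and after flush().
pub fn demo_buffering() -> IoResult<(usize, usize)> {
    // The Vec<u8> plays the role of the file; the capacity is an arbitrary choice.
    let mut wal = BufWriter::with_capacity(64 * 1024, Vec::new());
    append_batch(&mut wal, &[r#"{"event_id":"e1"}"#, r#"{"event_id":"e2"}"#])?;
    let before = wal.get_ref().len(); // still sitting in the BufWriter
    wal.flush()?;
    let after = wal.get_ref().len(); // now handed to the underlying writer
    Ok((before, after))
}
```

With a real File, the "after" state only means the bytes reached the kernel; getting them to disk takes the additional sync step.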

What happens next is governed entirely by Config.durability_mode (src/runtime/config.rs). This is the branch that decides whether you can trust a 200 response:

if !new_events.is_empty() {
    wal.append_batch(new_events.iter().map(|c| &c.event))
        .map_err(|e| AppError(anyhow::anyhow!("WAL append failed: {}", e)))?;
    match state.config.durability_mode {
        DurabilityMode::Strict => {
            wal.sync()       // flush + fsync before ack
                .map_err(|e| AppError(anyhow::anyhow!("WAL sync failed: {}", e)))?;
        }
        DurabilityMode::Fast => {
            wal.flush_buffer()  // flush to page cache only, no fsync
                .map_err(|e| AppError(anyhow::anyhow!("WAL flush failed: {}", e)))?;
        }
    }
}

The two modes differ by exactly one disk round-trip:

Mode | WAL method | Durability before ack | Use when
Strict (default) | sync() = flush() then sync_data() | Bytes are on the physical disk. A host crash after ack loses nothing. | Billing. You want a 200 to mean the event is recorded forever.
Fast | flush_buffer() = flush() only | Bytes are in the kernel page cache. A host crash can lose the tail. | At-least-once upstream retry pipelines that will replay anything unacked.

The implementation of the two WAL methods is small and tells the whole story. sync drains the userspace buffer to the kernel and then forces it to disk; flush_buffer stops after the kernel:

pub fn sync(&mut self) -> IoResult<()> {
    self.file.flush()?;             // BufWriter -> kernel
    self.file.get_ref().sync_data() // kernel -> disk (file data; fdatasync on Linux)
}

pub fn flush_buffer(&mut self) -> IoResult<()> {
    self.file.flush()               // BufWriter -> kernel, then stop
}

In Strict mode the fsync dominates the latency of a batch, and that is intentional. A metering write that has not hit the platter is not a write you can put on an invoice. If you are running behind a collector that already does at-least-once delivery with retries, Fast trades that fsync for throughput, because the collector will resend anything the host lost on crash. DurabilityMode in the spec also defines a Balanced group-commit mode that batches fsyncs across concurrent writers, but that is not yet implemented: the engine ships only Strict and Fast today.

Phase 3: commit dedupe and insert into the memtable

Only after the WAL append-and-sync returns successfully does the engine touch in-memory state. Each new event is committed into the dedupe cache and moved into the memtable:

let accepted = new_events.len();
for c in new_events {
    dedupe.commit(c.event_id_hash, c.payload_hash);
    memtable.insert(c.event);
}

This ordering is the heart of the durability contract. If Phase 2 fails, the ? propagates an error and the function returns before any commit runs. No dedupe entry is created, no event enters the memtable, and the client gets a 500. When that client retries the same batch, the events classify as NewEvent again, because the dedupe cache was never told they existed. There are no false duplicates: a write that was never made durable is never remembered as having happened. That is the property that lets an upstream retry loop be safe.
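The retry-safety property can be demonstrated with a toy model. Everything in this sketch is hypothetical scaffolding: WalSink, FlakyWal, and Engine are invented for the example, events are reduced to (event_id_hash, payload_hash) pairs, and duplicates within a single batch are ignored.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the WAL append-and-sync step.
pub trait WalSink {
    fn append_and_sync(&mut self, ids: &[u64]) -> Result<(), String>;
}

/// A WAL that fails once on demand, to simulate a failed fsync.
pub struct FlakyWal {
    pub fail_next: bool,
    pub durable: Vec<u64>,
}

impl WalSink for FlakyWal {
    fn append_and_sync(&mut self, ids: &[u64]) -> Result<(), String> {
        if self.fail_next {
            self.fail_next = false;
            return Err("simulated fsync failure".to_string());
        }
        self.durable.extend_from_slice(ids);
        Ok(())
    }
}

#[derive(Default)]
pub struct Engine {
    pub dedupe: HashMap<u64, u64>, // event_id_hash -> payload_hash
    pub memtable: Vec<u64>,
}

impl Engine {
    /// Returns the number of newly accepted events, or the WAL error.
    pub fn ingest(&mut self, wal: &mut impl WalSink, batch: &[(u64, u64)]) -> Result<usize, String> {
        // Phase 1: read-only classification; nothing is recorded yet.
        let new_events: Vec<(u64, u64)> = batch
            .iter()
            .copied()
            .filter(|(id, _)| !self.dedupe.contains_key(id))
            .collect();
        // Phase 2: make the bytes durable; on error, return before any commit.
        if !new_events.is_empty() {
            let ids: Vec<u64> = new_events.iter().map(|(id, _)| *id).collect();
            wal.append_and_sync(&ids)?;
        }
        // Phase 3: only now touch in-memory state.
        for (id, payload) in &new_events {
            self.dedupe.insert(*id, *payload);
            self.memtable.push(*id);
        }
        Ok(new_events.len())
    }
}
```

Run against a FlakyWal with fail_next set, the first ingest call returns an error and leaves dedupe and memtable empty, the retry accepts both events, and a third call accepts none.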

The memtable and WAL rotation

The memtable in src/ingest/memtable.rs is a VecDeque<UsageEvent> with approximate byte-size accounting. It serves two readers: queries snapshot it to see unflushed data, and the rollup worker inspects its oldest event timestamp so the watermark never advances past data that is still only in memory. When memtable.size_bytes() exceeds max_memtable_size_bytes (64 MiB by default), the engine drains the memtable and rotates the WAL while still inside the same critical section:

let drained = if memtable.size_bytes() > state.config.max_memtable_size_bytes {
    let drained_events = memtable.drain_all();
    let sealed_id = wal.rotate()
        .map_err(|e| AppError(anyhow::anyhow!("WAL rotate failed: {}", e)))?;
    Some(FlushMessage { events: drained_events, sealed_wal_id: sealed_id })
} else {
    None
};

Wal::rotate flushes and fsyncs the active file, closes it, opens the next-numbered file, and fsyncs the parent directory so the new file's directory entry is durable. It returns the id of the now-sealed file. The drained events plus that sealed id become a FlushMessage, sent over a channel to the background flusher after the locks are released. The sealed WAL file stays on disk: it is the durable copy of those events until a segment commit supersedes it.
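The rotation steps described above might look roughly like the following. This is a sketch under assumptions, not the code in src/ingest/wal.rs: the struct fields, open, and append_line are invented for the example, and syncing a directory by opening it as a File is a Unix-style technique that may not work on other platforms.

```rust
use std::fs::{self, File, OpenOptions};
use std::io::{BufWriter, Result as IoResult, Write};
use std::path::{Path, PathBuf};

/// Hypothetical WAL handle: one active numbered file inside `dir`.
pub struct Wal {
    dir: PathBuf,
    active_id: u64,
    file: BufWriter<File>,
}

fn wal_path(dir: &Path, id: u64) -> PathBuf {
    dir.join(format!("wal-{:06}.log", id))
}

fn open_append(path: &Path) -> IoResult<File> {
    OpenOptions::new().create(true).append(true).open(path)
}

impl Wal {
    pub fn open(dir: &Path) -> IoResult<Self> {
        fs::create_dir_all(dir)?;
        let file = BufWriter::new(open_append(&wal_path(dir, 1))?);
        Ok(Wal { dir: dir.to_path_buf(), active_id: 1, file })
    }

    pub fn append_line(&mut self, line: &str) -> IoResult<()> {
        writeln!(self.file, "{line}")
    }

    /// Seals the active file and returns its id.
    pub fn rotate(&mut self) -> IoResult<u64> {
        // 1. Drain the userspace buffer and force the data to disk.
        self.file.flush()?;
        self.file.get_ref().sync_data()?;
        let sealed_id = self.active_id;
        // 2. Open the next-numbered file; replacing the field drops (closes) the old one.
        let next = open_append(&wal_path(&self.dir, sealed_id + 1))?;
        self.file = BufWriter::new(next);
        self.active_id = sealed_id + 1;
        // 3. Sync the parent directory so the new directory entry is durable (Unix-like systems).
        File::open(&self.dir)?.sync_all()?;
        Ok(sealed_id)
    }
}
```

Syncing the old file before opening the new one means the sealed file is complete on disk by the time its id is handed to the flusher.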

The flusher and the re-insert-on-failure rule

The FlusherWorker in src/ingest/flusher.rs receives each FlushMessage, partitions the events by bucket, and writes one immutable raw segment per bucket. Once all segments are durable, it commits the manifest atomically, recording the new segments and advancing last_sealed_wal_id to the sealed WAL id. Only after that manifest commit succeeds does it call Wal::delete_files_through(sealed_id) to delete the WAL files whose events are now in committed segments. The ordering, segment then manifest then WAL deletion, is what makes crash recovery deterministic, and Part 5 covers the manifest atomic-commit and recovery path in full.
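As a small illustration of the last step, here is one way the "delete everything up to the sealed id" selection could be computed from file names. Both helper names are hypothetical, and the sketch only selects names; it assumes the wal-NNNNNN.log naming shown earlier and does not touch the filesystem.

```rust
/// Parses "wal-000042.log" into Some(42); any other name yields None.
pub fn parse_wal_id(name: &str) -> Option<u64> {
    name.strip_prefix("wal-")?.strip_suffix(".log")?.parse().ok()
}

/// Returns the WAL file names whose id is at or below `sealed_id`.
/// These are the files whose events are already covered by committed segments.
pub fn files_to_delete<'a>(names: &[&'a str], sealed_id: u64) -> Vec<&'a str> {
    names
        .iter()
        .copied()
        .filter(|name| parse_wal_id(name).map_or(false, |id| id <= sealed_id))
        .collect()
}
```

Selecting by parsed id rather than by string comparison keeps the rule correct even if the zero-padding width ever changes, and non-WAL files in the directory are simply ignored.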

The correctness detail worth dwelling on is what happens when the flush fails. A segment write can fail (disk full, I/O error) or the manifest commit can fail. Naively, those events would now be sitting in a sealed WAL file that no longer receives appends, invisible to queries and to the rollup worker, with no retry until the next process restart replays the WAL. That is a silent stall on data a client already saw acknowledged. usageDb avoids it by re-inserting the drained events back into the memtable:

if let Err(failure) = self.attempt_flush(events, sealed_wal_id).await {
    error!("Flush failed: {}, re-inserting {} events into memtable for retry",
        failure.reason, failure.events.len());
    if !failure.events.is_empty() {
        let mut memtable = self.state.memtable.lock().await;
        for event in failure.events {
            memtable.insert(event);
        }
    }
}

Back in the memtable, those events are visible to queries again and they will be picked up by the next flush trigger. There is one careful exception. If the segments wrote fine but the manifest commit failed, attempt_flush returns an empty retry list on purpose: the events are still durable in the sealed WAL file (it has not been deleted), so re-inserting them into the memtable would risk a double-flush when recovery replays that same WAL on the next restart. In that single case the orphaned segment files are removed and recovery is left to replay the WAL cleanly. The distinction between "segment write failed, retry via memtable" and "manifest failed, let WAL recovery handle it" is exactly the kind of edge that property tests and deterministic simulation testing exist to pin down, which is the subject of Part 10.
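The two failure cases can be modeled in a few lines. This is a synchronous toy, assuming an injected Fault in place of real I/O; the Event, Fault, and flush_or_reinsert names are invented, and only the shape of FlushFailure (a reason plus a retry list) follows the snippet above.

```rust
#[derive(Debug, Clone, PartialEq)]
pub struct Event(pub u64);

/// Hypothetical fault injection: where the simulated flush goes wrong.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Fault {
    None,
    SegmentWrite,
    ManifestCommit,
}

pub struct FlushFailure {
    pub reason: &'static str,
    /// Events the caller should put back into the memtable; may be empty.
    pub events: Vec<Event>,
}

pub fn attempt_flush(events: Vec<Event>, fault: Fault) -> Result<(), FlushFailure> {
    match fault {
        // Segment write failed: hand the events back for a memtable retry.
        Fault::SegmentWrite => Err(FlushFailure { reason: "segment write failed", events }),
        // Manifest commit failed: the sealed WAL still holds the events, so the
        // retry list is deliberately empty and WAL recovery owns them.
        Fault::ManifestCommit => Err(FlushFailure {
            reason: "manifest commit failed",
            events: Vec::new(),
        }),
        Fault::None => Ok(()),
    }
}

/// Mirrors the re-insert rule: whatever the failure returns goes back into the memtable.
pub fn flush_or_reinsert(memtable: &mut Vec<Event>, drained: Vec<Event>, fault: Fault) {
    if let Err(failure) = attempt_flush(drained, fault) {
        memtable.extend(failure.events);
    }
}
```

Encoding the decision in the retry list keeps the caller uniform: it always re-inserts what it is given, and the flush attempt alone decides whether that list is empty.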

Why this matters for billing

Tie it back to the invoice. In Strict mode, a 200 on /v1/usage/batch is a hard guarantee: the event is on disk, it survives a crash, and it will appear on the bill. A 500 is an equally hard guarantee in the other direction: nothing was recorded, the dedupe cache is untouched, and a retry is safe and will be accepted exactly once. There is no third state where an event is half-counted. The memtable re-insert rule extends that guarantee through the asynchronous flush: an acked event cannot get stuck invisibly between the WAL and a segment. For a system whose output is money, those are the properties that let you trust the number at the bottom of the invoice.
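On the client side, the 200 / 500 contract suggests a simple retry loop that resends the identical batch. The helper below is a hypothetical sketch: the send closure stands in for an HTTP call returning a status code, treating all 5xx codes as retryable is an assumption, and backoff is omitted for brevity.

```rust
/// Calls `send` until it returns 200, retrying on 5xx up to `max_attempts` times.
/// Returns Ok(attempts_used) on success, or Err(last_status) otherwise.
/// The batch must be resent unchanged (same event_ids) so server-side dedupe can apply.
pub fn send_with_retry(mut send: impl FnMut() -> u16, max_attempts: u32) -> Result<u32, u16> {
    let mut last = 0;
    for attempt in 1..=max_attempts {
        last = send();
        match last {
            200 => return Ok(attempt),
            500..=599 => continue,   // nothing was recorded; safe to resend
            _ => return Err(last),   // e.g. a validation error; retrying will not help
        }
    }
    Err(last)
}
```

A production client would add backoff with jitter between attempts; the point here is only that resending after a 500 cannot double-count.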

Next, Part 3 looks at how usageDb decides whether two events are the same: the blake3 identity and payload hashing, the NewEvent / ExactDuplicate / PayloadConflict classification, and how the dedupe cache is rebuilt on recovery so retries across a restart are still caught. Read it at Idempotency and deduplication.


