The short version: Usage data is enormous in volume but extremely repetitive. The same handful of account, product, meter, and model IDs appear millions of times, timestamps climb almost monotonically, and most quantities are small integers. usageDb's .seg files use a custom columnar format that picks a per-column encoding (dictionary, delta, zigzag-varint, run-length, or plain) tuned to each of those shapes, then zstd-compresses each column. The result turns a heavy event stream into tiny, cheap-to-scan files, which is what makes keeping immutable raw segments as a permanent billing audit trail actually affordable.
This is Part 4 of the usageDb internals series. Part 3 covered how every event is deduplicated before it reaches durable state. Once a memtable fills up, the flusher writes those events out as an immutable raw segment that never changes again: the audit trail that backs every invoice line. The catch is that "never delete the raw events" is only viable if the raw events are cheap to store. This part is about the on-disk format that makes them cheap. Everything below is grounded in src/storage/segment_format.rs and its siblings; the full source is on GitHub.
Why columnar, and why a custom format
A usage event is wide but boring. A single UsageEvent carries an event ID, a kind, an optional correction reference, an account ID, an optional subscription ID, a product ID, a meter ID, a timestamp, a quantity, a unit, a source, an optional model ID, a dimensions map, and a server-stamped ingestion timestamp. Stored row by row, every event repeats the full string "acc_constant" or "claude-sonnet-4" on its own: across ten million events for one account, ten million copies of the same string.
Columnar storage flips the layout: all the account_id values are stored together, all the timestamp_ms values together, and so on. That co-location is the whole game. Homogeneous data sitting next to itself is exactly what a compressor like zstd is good at, and it lets each column choose an encoding suited to its own shape instead of settling for one row format. usageDb does not lean on Parquet here; the .seg format is a small, purpose-built layout that the engine fully controls, so encodings can evolve without an external dependency dictating the file structure.
The .seg file layout
A segment file is a header, a run of self-describing columns, and a footer. The header names how many rows and columns to expect; each column carries its own name, encoding, codec, and compressed length; the footer carries a checksum and an end marker so truncation is detectable.
header:
    magic             b"UDBRAW1\n" (8 bytes)
    version           u8 = 1
    row_count         u32 LE
    num_columns       u16 LE
per column (repeated num_columns times):
    name_len          u16 LE
    name              utf-8 bytes
    encoding          u8 (0 Plain .. 4 Rle)
    codec             u8 (0 None, 1 Zstd, 2 Lz4)
    compressed_len    u32 LE
    compressed_bytes  bytes
footer:
    checksum          u64 LE (low 8 bytes of blake3 over everything above)
    magic_end         b"UDBEND01" (8 bytes)
The writer assembles the entire body into one buffer, computes the checksum over it, then appends the checksum and end magic before writing the file with a final sync_all(). Because each column header carries its own encoding and codec byte plus an explicit compressed_len, the file is fully self-describing: a reader walks it without any out-of-band schema and skips over any column it does not need.
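To make the "self-describing" claim concrete, here is a minimal sketch of a reader that walks the header and every column header using only the layout listed above, skipping each payload by its compressed_len. The names `ColumnMeta`, `take`, and `index_columns` are hypothetical and not taken from the usageDb source; the sketch stops before the footer and does no decompression.

```rust
use std::convert::TryInto;

const MAGIC_START: &[u8] = b"UDBRAW1\n";

/// Column metadata as described by the per-column header (hypothetical type).
#[derive(Debug, PartialEq)]
pub struct ColumnMeta {
    pub name: String,
    pub encoding: u8,
    pub codec: u8,
    /// Byte range of the compressed payload inside the file buffer.
    pub payload: std::ops::Range<usize>,
}

/// Hypothetical cursor helper: take `n` bytes or fail on truncation.
fn take<'a>(buf: &'a [u8], pos: &mut usize, n: usize) -> Result<&'a [u8], String> {
    let end = pos.checked_add(n).ok_or("length overflow")?;
    let slice = buf.get(*pos..end).ok_or("truncated segment")?;
    *pos = end;
    Ok(slice)
}

/// Walks the header and all column headers, returning row_count and an index
/// of where each column's compressed bytes live. Payloads are skipped, not read.
pub fn index_columns(buf: &[u8]) -> Result<(u32, Vec<ColumnMeta>), String> {
    let mut pos = 0usize;
    if take(buf, &mut pos, 8)? != MAGIC_START {
        return Err("bad start magic".into());
    }
    let version = take(buf, &mut pos, 1)?[0];
    if version != 1 {
        return Err(format!("unsupported version {version}"));
    }
    let row_count = u32::from_le_bytes(take(buf, &mut pos, 4)?.try_into().unwrap());
    let num_columns = u16::from_le_bytes(take(buf, &mut pos, 2)?.try_into().unwrap());

    let mut columns = Vec::with_capacity(num_columns as usize);
    for _ in 0..num_columns {
        let name_len = u16::from_le_bytes(take(buf, &mut pos, 2)?.try_into().unwrap()) as usize;
        let name = String::from_utf8(take(buf, &mut pos, name_len)?.to_vec())
            .map_err(|e| e.to_string())?;
        let encoding = take(buf, &mut pos, 1)?[0];
        let codec = take(buf, &mut pos, 1)?[0];
        let len = u32::from_le_bytes(take(buf, &mut pos, 4)?.try_into().unwrap()) as usize;
        let start = pos;
        take(buf, &mut pos, len)?; // skip the payload without decompressing it
        columns.push(ColumnMeta { name, encoding, codec, payload: start..pos });
    }
    Ok((row_count, columns))
}
```

A caller that only needs, say, timestamp_ms can look up that one entry in the returned index and decompress just its payload range, leaving every other column untouched.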
The five encodings and why each column uses one
The encoding byte tells the reader how to interpret a column's payload after decompression. usageDb defines five, and the writer in src/storage/segment_writer.rs assigns each column the one that fits its data shape.
| Encoding | Byte | On-disk layout | Columns | Why |
|---|---|---|---|---|
| Plain | 0 | bincode-serialized Vec<T> | event_id, correction_ref, dimensions | High-cardinality or structurally awkward columns where a dictionary would expand, not shrink. event_id is near-unique per row, so dictionary encoding it would just add an index column on top of the strings. |
| Dictionary | 1 | bincode (Vec<String>, Vec<u32>): unique values plus one 4-byte index per row | account_id, product_id, meter_id, model_id, source, unit, subscription_id | The big win. Collapses O(rows x string size) down to O(unique values x string size + 4 bytes/row). For ID-heavy workloads that is roughly a 1000x shrink on the column. |
| Delta | 2 | bincode Vec<i64> of running differences | timestamp_ms, ingested_at_ms | Timestamps are near-monotonic. Storing first-then-differences turns big absolute millisecond values into small deltas (often under a second) that zstd compresses dramatically better. |
| Zigzag | 3 | u32 count plus concatenated zigzag-varints | quantity | Quantities are i128 but usually small. Zigzag maps signed values so that small magnitudes (positive or negative) stay small, then varint packs them into one or two bytes instead of a flat 16. |
| Rle | 4 | bincode Vec<(u8, u32)> of (value, run length) | kind | Only three possible values and almost always a long run of Usage. Run-length encoding turns a 10,000-byte column into a single (0, 10000) pair plus framing. |
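The Rle row is the easiest one to see in code. The following is a minimal sketch of run-length encoding over a byte column into (value, run length) pairs, matching the Vec<(u8, u32)> shape in the table. The function names are hypothetical, and bincode serialization of the pairs is left out.

```rust
/// Run-length encode a byte column into (value, run length) pairs.
pub fn rle_encode(values: &[u8]) -> Vec<(u8, u32)> {
    let mut runs: Vec<(u8, u32)> = Vec::new();
    for &v in values {
        // Extend the current run if the value repeats and the counter has room.
        if let Some((last, n)) = runs.last_mut() {
            if *last == v && *n < u32::MAX {
                *n += 1;
                continue;
            }
        }
        runs.push((v, 1));
    }
    runs
}

/// Expand (value, run length) pairs back into one byte per row.
pub fn rle_decode(runs: &[(u8, u32)]) -> Vec<u8> {
    let mut out = Vec::new();
    for &(v, n) in runs {
        out.extend(std::iter::repeat(v).take(n as usize));
    }
    out
}
```

Encoding a column of 10,000 zero bytes yields the single pair (0, 10000) from the table, and a column that alternates values on every row degrades to one pair per row, which is why this encoding is reserved for kind.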
Dictionary: the workhorse
The dictionary encoder walks a column, assigns each first-seen string an index, and emits the unique values alongside a per-row index list. The dramatic case is real: the test suite writes 10,000 events that all share one account, product, meter, model, source, and unit, and asserts the whole segment lands under 250 KB on disk (observed around 150 KB locally). Without dictionary encoding those repeated ID strings alone would dominate the file. Nullable columns like subscription_id and model_id get an Option variant where None has no dictionary entry and is stored as a null index.
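A minimal sketch of that first-seen indexing, assuming a non-nullable string column and the (Vec<String>, Vec<u32>) shape from the encoding table. The function names are hypothetical, and the Option variant for nullable columns is deliberately omitted.

```rust
use std::collections::HashMap;

/// Assigns each first-seen string an index and returns
/// (unique values in first-seen order, one index per row).
pub fn dict_encode(column: &[&str]) -> (Vec<String>, Vec<u32>) {
    let mut seen: HashMap<&str, u32> = HashMap::new();
    let mut values: Vec<String> = Vec::new();
    let mut indices = Vec::with_capacity(column.len());
    for &s in column {
        let idx = *seen.entry(s).or_insert_with(|| {
            values.push(s.to_owned());
            (values.len() - 1) as u32
        });
        indices.push(idx);
    }
    (values, indices)
}

/// Rebuilds the column; returns None if any index points outside the dictionary.
pub fn dict_decode(values: &[String], indices: &[u32]) -> Option<Vec<String>> {
    indices
        .iter()
        .map(|&i| values.get(i as usize).cloned())
        .collect()
}
```

For a column where every row holds the same account ID, `values` has one entry and `indices` is a run of zeros, which is exactly the kind of input zstd reduces to almost nothing.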
Delta and zigzag: numeric packing
Delta encoding stores the first value then successive differences, and the reader reconstructs by running sum. It is correct even for out-of-order or negative sequences, which matters because a Correction can land with an older timestamp than its neighbors. Zigzag-varint round-trips every i128 boundary value, including i128::MIN and i128::MAX; the count prefix is what lets the variable-width decoder know exactly how many values to pull. A thousand small quantities pack into well under 8 KB on disk.
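Both numeric encodings fit in a short sketch. The code below assumes wrapping i64 arithmetic for deltas (so any sequence round-trips), a little-endian u32 count prefix, and LEB128-style 7-bit varint groups; the source only specifies "u32 count plus concatenated zigzag-varints", so those details and all function names are assumptions rather than the engine's actual code.

```rust
use std::convert::TryInto;

/// Delta: first value, then successive differences.
pub fn delta_encode(values: &[i64]) -> Vec<i64> {
    let mut prev = 0i64;
    values
        .iter()
        .map(|&v| {
            let d = v.wrapping_sub(prev);
            prev = v;
            d
        })
        .collect()
}

/// Reconstructs the original values by running sum.
pub fn delta_decode(deltas: &[i64]) -> Vec<i64> {
    let mut acc = 0i64;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.wrapping_add(d);
            acc
        })
        .collect()
}

/// Zigzag maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... so small magnitudes stay small.
fn zigzag(v: i128) -> u128 {
    ((v << 1) ^ (v >> 127)) as u128
}

fn unzigzag(u: u128) -> i128 {
    ((u >> 1) as i128) ^ -((u & 1) as i128)
}

/// Count prefix (u32 LE, assumed) followed by one varint per value.
pub fn zigzag_encode(values: &[i128]) -> Vec<u8> {
    let mut out = (values.len() as u32).to_le_bytes().to_vec();
    for &v in values {
        let mut u = zigzag(v);
        while u >= 0x80 {
            out.push(((u as u8) & 0x7f) | 0x80); // low 7 bits plus continuation flag
            u >>= 7;
        }
        out.push(u as u8);
    }
    out
}

/// Returns None on truncated input or an over-long varint.
pub fn zigzag_decode(buf: &[u8]) -> Option<Vec<i128>> {
    let count = u32::from_le_bytes(buf.get(..4)?.try_into().ok()?) as usize;
    let mut bytes = buf[4..].iter();
    let mut out = Vec::new();
    for _ in 0..count {
        let (mut u, mut shift) = (0u128, 0u32);
        loop {
            let b = *bytes.next()?;
            if shift >= 128 {
                return None;
            }
            u |= ((b & 0x7f) as u128) << shift;
            if b & 0x80 == 0 {
                break;
            }
            shift += 7;
        }
        out.push(unzigzag(u));
    }
    Some(out)
}
```

With this scheme a quantity of 1 or -1 costs a single byte, while i128::MIN and i128::MAX each take 19 bytes, so the worst case is slightly larger than a flat 16 but essentially never occurs in usage data.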
Per-column compression and integrity
After encoding, each column's byte buffer is compressed independently. The codec byte records which compressor was used: 0 None, 1 Zstd, 2 Lz4. In the current writer that is always zstd at level 3 (see src/storage/compression.rs), but the format reserves the other codecs so a future writer can trade ratio for speed per column without changing the file structure. Encoding and compression compound: dictionary or delta or zigzag first removes the redundancy that is specific to the data's shape, then zstd mops up whatever general-purpose redundancy is left.
Integrity is enforced on every open. The footer stores the low 8 bytes of a blake3 hash over the entire body. When the reader in src/storage/segment_reader.rs loads a file, it checks the start magic and the end magic, then recomputes the checksum and compares it against the footer; any mismatch is treated as a corrupt segment and fails loudly rather than returning silently wrong billing data. It also verifies that every column decoded to exactly row_count rows, so a partial or scrambled column cannot pass.
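The framing checks can be sketched without any dependencies by passing the hash in as a parameter. In the sketch below, `hash64` is a hypothetical stand-in for "low 8 bytes of blake3 over the body", and `verify_frame` is an illustrative name, not the reader's real function. The per-column row_count check is not shown.

```rust
use std::convert::TryInto;

const MAGIC_START: &[u8] = b"UDBRAW1\n";
const MAGIC_END: &[u8] = b"UDBEND01";
const FOOTER_LEN: usize = 16; // u64 checksum + 8-byte end magic

/// Returns the body (everything before the footer) only if the start magic,
/// end magic, and checksum all agree. `hash64` stands in for the real hash.
pub fn verify_frame<'a>(
    file: &'a [u8],
    hash64: impl Fn(&[u8]) -> u64,
) -> Result<&'a [u8], &'static str> {
    if file.len() < MAGIC_START.len() + FOOTER_LEN {
        return Err("truncated segment");
    }
    let (body, footer) = file.split_at(file.len() - FOOTER_LEN);
    if !body.starts_with(MAGIC_START) {
        return Err("bad start magic");
    }
    if &footer[8..] != MAGIC_END {
        return Err("bad end magic");
    }
    let stored = u64::from_le_bytes(footer[..8].try_into().unwrap());
    if stored != hash64(body) {
        return Err("checksum mismatch");
    }
    Ok(body)
}
```

Because the end magic sits in the last 8 bytes, a file cut short at any point fails the end-magic comparison before the checksum is even computed.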
A format built to evolve
Two design choices keep the format from painting future versions into a corner. First, the reader is permissive about encoding choices. A string column can arrive as Plain or Dictionary, an i64 column as Plain or Delta, and the reader handles either by dispatching on the per-column encoding byte. That means the writer is allowed to change its mind: a smarter future encoder can be deployed and old segments written by the previous writer still load correctly.
// reader dispatches on the per-column encoding byte
match encoding {
    Encoding::Plain => de(&bytes), // legacy segments
    Encoding::Delta => {
        // newer writer: first value, then successive differences
        let deltas: Vec<i64> = de(&bytes)?;
        // reconstruct by running sum
        let mut acc = 0i64;
        Ok(deltas.into_iter().map(|d| { acc += d; acc }).collect())
    }
    _ => Err(corrupt("i64 column has wrong encoding")),
}
Second, adding a column is additive. A new writer can emit an extra column and an old reader simply ignores it. The asymmetry is deliberate: a missing required column makes the reader fail loud, because that is corruption, not a benign version skew. The combination gives forward room to grow the schema while keeping the "never silently wrong" contract that billing demands.
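The asymmetry between extra and missing columns can be illustrated with a small sketch. It assumes the reader has already collected the file's columns into a map keyed by name; the `REQUIRED` list here is a hypothetical subset chosen for illustration, and `select_known` is not a function from the usageDb source.

```rust
use std::collections::HashMap;

/// Hypothetical subset of required column names, for illustration only.
const REQUIRED: &[&str] = &["event_id", "account_id", "timestamp_ms", "quantity"];

/// Picks out the columns this reader knows about.
/// Unknown columns in the file are ignored; a missing required one is an error.
pub fn select_known<'a>(
    file_columns: &'a HashMap<String, Vec<u8>>,
) -> Result<Vec<(&'static str, &'a [u8])>, String> {
    REQUIRED
        .iter()
        .map(|&name| {
            file_columns
                .get(name)
                .map(|bytes| (name, bytes.as_slice()))
                .ok_or_else(|| format!("corrupt segment: missing required column {name}"))
        })
        .collect()
}
```

The lookup is driven by the reader's own list rather than by iterating the file, which is what makes an extra column from a newer writer invisible instead of an error.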
Sort-on-flush makes the encodings work harder
One detail upstream of the format amplifies everything above. When the flusher writes a bucket's events, it first sorts them into the canonical billing order (account_id, product_id, meter_id, model_id, timestamp_ms) via sort_events_canonical (in src/ingest/flusher.rs). That ordering is not cosmetic. Sorting by account then product then meter clusters identical ID values into long contiguous runs, exactly what makes dictionary indices and zstd's history window most effective, and sorting timestamps within each group keeps delta values small. Sort-on-flush also matches the order compaction would produce, so later passes skip a re-sort: smaller files at write time and cheaper merges later, covered in Part 8.
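As a rough sketch of that ordering, the comparison can be expressed as a tuple compare over the five key fields. The `Event` struct below is a hypothetical, trimmed-down stand-in for UsageEvent, the function is named `sort_canonical_sketch` to make clear it is not the real sort_events_canonical, and placing events without a model_id before those with one is an assumption.

```rust
/// Trimmed-down, hypothetical stand-in for UsageEvent with only the sort keys.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub account_id: String,
    pub product_id: String,
    pub meter_id: String,
    pub model_id: Option<String>,
    pub timestamp_ms: i64,
}

/// Sorts by (account_id, product_id, meter_id, model_id, timestamp_ms).
/// `sort_by` is stable, so events with identical keys keep their arrival order.
pub fn sort_canonical_sketch(events: &mut [Event]) {
    events.sort_by(|a, b| {
        (&a.account_id, &a.product_id, &a.meter_id, &a.model_id, a.timestamp_ms).cmp(&(
            &b.account_id,
            &b.product_id,
            &b.meter_id,
            &b.model_id,
            b.timestamp_ms,
        ))
    });
}
```

After this sort, each ID column is a sequence of long constant runs and timestamp_ms is non-decreasing within each group, so the deltas only jump at group boundaries.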
Put together, the format answers the economics of a billing audit trail. Usage data is huge in row count but tiny in entropy, so the right per-column encodings plus sort-on-flush turn millions of events into files small enough that keeping every raw event forever is a reasonable default, not a luxury. The segments produced here are then tracked, committed, and recovered through the manifest, the subject of Part 5.
usageDb internals: the full series
- Why a purpose-built usage database
- The ingest path and durability contract
- Idempotency and deduplication
- The columnar segment format
- The manifest and crash recovery
- Hourly rollups and the watermark
- The query engine
- Compaction
- Period lifecycle and frozen snapshots
- Property tests and simulation testing
Part 4 of the usageDb internals series. usageDb is the open-source Rust storage engine behind UsageBox; the code is on GitHub.