Append-Only Storage and Two-Tier VACUUM: How KAI Handles Deletes Safely

Most databases treat storage as a mutable entity: update a record and its bytes are overwritten in place. This simplicity comes with a hidden cost - any concurrent reader that holds a pointer into the old location is now reading garbage. Managing that is the central problem of concurrency in mutable storage systems, and it’s why databases invented buffer pools, page locks, and multi-version concurrency control (MVCC).
KAI takes a different approach: the document store is append-only. Once a record’s bytes are written to disk, they are never moved or overwritten. This single invariant enables zero-copy reads via memory-mapped files, crash-safe appends, and a much simpler concurrency model. But it creates its own challenge: deletes. How do you remove data from a storage format that never changes data in place?
The answer borrows from PostgreSQL’s VACUUM and LSM-tree compaction: separate logical deletes (instant, correct, zero I/O) from physical reclamation (background, heavyweight, deferred).
The .tvdb Format
KAI’s document store is a binary file (metadata.tvdb) in the KAITVDB2 format. Each record is laid out as:
[16-byte header][rkyv-serialized payload][alignment padding to 16 bytes]
The 16-byte alignment is mandatory because rkyv’s zero-copy deserialization relies on relative pointers that assume the payload starts at a 16-byte boundary. If alignment breaks, the relative pointers compute wrong addresses and the deserialization reads arbitrary memory.
Appending a new record means:
- Serialize the
DocumentChunkstruct withrkyv. - Write the 16-byte header (containing the
chunk_idand payload length). - Write the serialized payload bytes.
- Pad to the next 16-byte boundary.
- Call a single
fsyncto flush to disk.
Reads use a memory-mapped view of the file (memmap2::Mmap). A lookup by chunk_id consults an in-memory DashMap<u64, DocLocation> to find the byte offset, then returns a &ArchivedDocumentChunk - a reference directly into the mapped memory region. No heap allocation. No deserialization. The data is already in memory courtesy of the OS page cache.
The Problem: Deletes in an Append-Only World
When a document is deleted, the naive approach is to simply skip it during search. KAI’s SIMD engine does this with a RoaringBitmap tombstone set - deleted internal vector slots are masked out during scoring. This works for the vector search path.
But the text storage has two additional issues:
Issue 1: Space bloat. The deleted record’s bytes remain in the .tvdb file forever. Under heavy CRUD workloads, the file grows without bound. A system that deletes and re-ingests millions of documents would eventually exhaust disk space while holding almost no live data - the classic “dead tuple” problem that PostgreSQL solves with VACUUM.
Issue 2: A correctness bug. Before the two-tier delete was implemented, DELETE /api/documents/:id only tombstoned the vector slot. The warm tier’s index maps (id_to_offset, parent_to_chunks) were not updated. This meant:
GET /api/documents/parent/:parent_doc_idstill returned the deleted chunk’s text.get_chunk(id)still resolved the deleted record.
A “deleted” document was only deleted from vector search - its text was still fully accessible. The logical delete in Tier 1 closes this gap by updating the warm tier maps immediately.
Tier 1: The Logical Delete
The logical delete runs synchronously inside the HTTP DELETE handler. It calls WarmTier::mark_deleted(chunk_id), which:
- Removes the
chunk_idfromid_to_offset- the record is no longer resolvable by ID. - Removes the
chunk_idfrom its parent’s list inparent_to_chunks- parent-document grouping is immediately correct. - Drops the parent entry entirely if no more sibling chunks remain.
- Adds the record’s byte span (header + payload + pad) to a running
dead_bytescounter.
After this call returns, the document is semantically gone. Any subsequent read, parent query, or search will not surface it. The bytes are still frozen in the .tvdb file - but from a correctness standpoint, the delete is done.
The dead_bytes counter is the trigger input for Tier 2. No explicit scheduling or periodic job is needed; the threshold is evaluated inside the delete handler itself.
Figure 1: The Two-Tier VACUUM Process. Tier 1 (Logical Delete) instantly marks records as deleted in the index. Tier 2 (Physical VACUUM) is a background process that copies live data to a new compacted file, reclaiming space via an atomic swap.
Tier 2: The Physical VACUUM
When dead_bytes exceeds a threshold (COMPACT_RATIO * file_len and COMPACT_MIN_BYTES floor), a background physical compaction is spawned. This is a stop-the-world swap with a pre-computed lock window:
Phase 1: Plan (under write lock, microseconds)
Capture the current state: record the highest chunk_id ever assigned (max_id) and the current file length. The live set is defined as “every chunk_id still present in id_to_offset” - Tier 1 has already removed all deleted ids, so the warm tier is self-describing.
Phase 2: Copy (the only O(file) work)
Walk the old mmap in physical order. For each record, check if it’s live:
- Live:
chunk_id ∈ live_setORchunk_id > max_id(brand-new ingest during copy - must preserve). - Dead:
chunk_id ≤ max_idAND absent fromlive_set.
Raw-copy the live records’ bytes into a staging buffer. “Raw-copy” is intentional - we don’t re-serialize with rkyv, because the bytes are already perfectly aligned (every record is a 16-byte multiple, so sequential copying preserves alignment automatically). Build fresh new_id_to_offset and new_parent_to_chunks maps as the copy proceeds. Write the staging buffer to a sibling temp file (metadata.tvdb.compact), then flush + sync_data.
Phase 3: Atomic Swap
fs::rename(temp → metadata_path). On POSIX, rename(2) is atomic - the directory entry switches from old to new with no intermediate state visible to other processes. On Windows, MoveFileEx with MOVEFILE_REPLACE_EXISTING provides the same guarantee.
After the rename: mmap the new file, assign metadata_map, id_to_offset, and parent_to_chunks together as one unit, and reset dead_bytes = 0.
Phase 4: Retirement (automatic)
The old Arc<Mmap> drops when its last reference goes. Because the swap required the write lock, and reads hold a read lock for the duration of their borrow, there is a mathematical guarantee: no reader can hold a borrow into the old mmap when the swap completes. RwLock semantics ensure the write lock is not granted until all read locks are released, and the old borrow’s lifetime is tied to the read lock. Dropping the old Arc<Mmap> therefore unmaps memory that no code can be borrowing - safe by construction.
The Liveness Rule and the Race Window
There is a subtle race. If compaction runs in pure stop-the-world mode (write lock held across the entire copy), there is no race - no concurrent write can happen. But if the future “smart timing” optimization is applied (lock released during the copy phase, re-acquired only for the swap), a new document could be ingested with a chunk_id > max_id during the copy.
KAI’s liveness rule handles this: keep a record iff chunk_id ∈ live_set OR chunk_id > max_id. The > max_id branch preserves any record ingested after the snapshot, regardless of whether it was captured in the original live set. This is not just defensive code - it’s the load-bearing correctness property that makes online compaction possible without a global pause.
Crash Safety
What happens if the process crashes mid-VACUUM? Three scenarios:
Crash before the rename: The temp file metadata.tvdb.compact is incomplete or missing. The live .tvdb is untouched. On next startup, a stale-temp cleanup pass removes any orphaned .compact siblings. No data is lost.
Crash during the rename: On both POSIX (ext4/xfs) and NTFS, rename is transactional at the filesystem level - either the old name resolves or the new one does, never a torn state. The database will load whichever file is present.
Crash after the rename: The new compacted file is authoritative. The old file no longer exists. The startup scan of the new file rebuilds the index maps from the compacted records.
In all cases, the invariant holds: the database at startup contains exactly the set of records that were live at the moment of the last completed compaction (or since startup if no compaction has run).
The Durable Delete Log
There is one more durability challenge, specific to KAI’s current architecture: vectors live only in RAM.
When the process restarts, the vector index is empty. The server rebuilds the in-memory vector index by scanning vectors from the .tv file - but that file is currently not written with live vectors (a known limitation; durable vector persistence is a future milestone). Effectively, vectors are lost on restart.
But the .tvdb text file is durable. On restart, the server scans it and reconstructs id_to_offset and parent_to_chunks from all the records it finds - including records for documents that had been deleted before the restart.
Without a fix, deleted documents would be resurrected: they’d reappear in parent queries and be resolvable by ID, even though they were deleted before the restart.
KAI solves this with a durable chunk-id delete log: data/deleted.bin. Every DELETE call appends the deleted chunk_id as an 8-byte little-endian integer to this append-only file, followed by an fsync. On startup, the server reads this log and replays each entry through mark_deleted before serving any traffic.
// On DELETE:
io::append_deleted_chunk(&deleted_path, chunk_id)?;
// On startup:
let deleted_ids = io::load_deleted_chunks(&deleted_path)?;
for id in deleted_ids {
warm_tier.mark_deleted(id); // idempotent
} The mark_deleted call is idempotent - replaying a delete for a chunk_id that’s already gone from the maps is a no-op. This makes the log safe to replay even if it contains duplicates (e.g., from a crash between two delete calls that targeted the same id).
The log grows ~8 bytes per delete. At 1 million deletes, that’s ~8 MB - negligible. Future work will truncate it safely inside the VACUUM compaction lock (once the physical compaction has removed every dead record, the log is no longer needed to handle resurrection on restart).
Monotonic ID Seeding
The final restart-safety guarantee is subtle but critical. On restart, the vector tier is empty (0 vectors). If the server seeded its next_id counter from the vector count, it would start issuing IDs from 0 - and the first new ingest would be assigned chunk_id = 0, which might already exist in the .tvdb for a document ingested in a previous session.
KAI avoids this by tracking a max_chunk_id high-water mark in the warm tier during startup scan. The server seeds its ID counter from max(max_chunk_id + 1, n_vectors). Since n_vectors = 0 on restart, the effective seed is max_chunk_id + 1 - safely beyond any ID that already exists in the .tvdb.
This also means physical compaction is safe for IDs: the compaction raw-copies records without renumbering chunk_ids. After compaction, max_chunk_id is still tracked correctly (the compactor preserves even the highest id among live records), so the next ingest after a post-compaction restart will still issue a fresh, non-colliding ID.
Design Trade-offs and Future Work
The two-tier VACUUM design is deliberately conservative:
What it achieves:
- Zero-copy reads via memory-mapped files, safe from concurrent writes.
- Instant logical delete correctness (no “zombie” text).
- Background physical space reclamation.
- Crash safety at every phase of the VACUUM.
- Monotonic ID guarantees across restarts.
What it defers:
- Online compaction: The current implementation holds the write lock across the entire copy phase. For large databases, this produces a brief stop-the-world pause. The §4.5 optimization (copy outside the lock, re-acquire only for the swap) would reduce this to O(appends-during-copy).
- Segmented storage: The true scalability end-state is Lucene-style immutable segments with incremental background merges. Each segment is a small
.tvdbfile; merges drop dead records without ever pausing reads. This is tracked as future work. - Delete-log truncation: The
deleted.binlog is never cleared. Safe truncation must happen atomically with a VACUUM (to avoid a resurrection race), and is deferred to avoid complexity in the current implementation.
The current design is directly inspired by PostgreSQL’s VACUUM (two-tier logical/physical split), LSM-tree compaction (atomic file swap, background scheduling), and mmap-based databases like LMDB (zero-copy reads under a reader-writer lock). It’s a pragmatic combination that achieves production-grade correctness for the use cases KAI targets today.
This article covers kai-core/src/storage.rs, kai-core/src/io.rs, and the kai-docs/tvdb-compaction.md design document in the KAI project.