KAI: A Training-Free Vector Search Engine

Vector search has become the backbone of modern AI applications - from semantic search and retrieval-augmented generation (RAG) to recommendation systems and multimodal similarity. But most production vector databases come with a significant hidden cost: training.

Classical approaches such as FAISS IVF-PQ or ScaNN require you to train an index on your data before you can query it. That means collecting a representative sample of the corpus, running an expensive k-means clustering, and building the inverted-file structure before you can start ingesting real data. Graph indexes such as HNSW skip the k-means step but still pay a substantial graph-construction cost. As the data distribution shifts, trained structures need re-training, which makes updates costly and raises operational complexity.

KAI takes a fundamentally different approach.

What Is KAI?

KAI is a vector search and research engine built entirely in Rust. Its core thesis is simple: a vector database should not need to know anything about your data before it can index it. There are no offline training phases, no cluster centroids derived from your corpus, and no approximate graph structures that degrade over time.

Instead, KAI uses a paradigm called data-oblivious quantization - the compression scheme is derived purely from the mathematical properties of unit-sphere geometry, not from the data itself. The result is an engine that achieves competitive recall and latency from the first document ingested, with no warm-up period.

KAI ships as a workspace of Rust crates:

kai-core - The mathematical engine: SIMD scoring, quantization, codebook generation, append-only storage.
kai-server - An async HTTP/REST API layer built on Axum and Tokio.
kai-cli - A command-line tool for benchmarking, daemon lifecycle, and index inspection.
kai-proto - Shared rkyv zero-copy data structures.
kai-tools - Python tooling for PDF ingestion, embedding, and search scripting.

The Core Insight: Data-Oblivious Quantization

Modern embedding models (OpenAI text-embedding-3, BGE-M3, Cohere, etc.) produce vectors that live on, or can be normalized onto, the surface of a high-dimensional unit sphere. Once a vector has unit length and its direction can be treated as uniformly random, you know something powerful about each of its coordinates: they follow a Beta distribution.

Specifically, each coordinate $x$ of a uniformly distributed unit vector on $S^{d-1}$ follows a $\mathrm{Beta}\left(\frac{d-1}{2}, \frac{d-1}{2}\right)$ distribution rescaled from $[0, 1]$ to $[-1, 1]$, with density proportional to $(1 - x^2)^{(d-3)/2}$. This distribution is fully characterized by the dimensionality $d$ - it requires no data to compute. KAI exploits this to pre-compute an optimal quantization codebook using the Lloyd-Max algorithm on this known prior distribution. The result: the quantizer is tuned to the mathematical geometry of the space, not to any particular corpus.
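As a rough illustration of how a codebook can be derived from the dimension alone, the sketch below runs Lloyd-Max iterations against the density $(1 - x^2)^{(d-3)/2}$ on a numerical grid. The function names, grid size, and truncation radius are assumptions made for this example and are not taken from KAI's source.

```rust
/// Unnormalized density of one coordinate of a uniform unit vector on S^{d-1}:
/// f(x) is proportional to (1 - x^2)^((d - 3) / 2) on [-1, 1].
fn coord_density(x: f64, d: usize) -> f64 {
    (1.0 - x * x).max(0.0).powf((d as f64 - 3.0) / 2.0)
}

/// Lloyd-Max on a fixed grid (hypothetical helper): returns 2^bits sorted centroids.
fn lloyd_max_codebook(d: usize, bits: u32, iters: usize) -> Vec<f64> {
    assert!(d >= 3, "this sketch assumes d >= 3");
    let k = 1usize << bits;
    let n = 20_000usize; // grid points used for numerical integration (assumption)
    // Truncate to about six standard deviations (sigma is roughly 1/sqrt(d)).
    let r = (6.0 / (d as f64).sqrt()).min(1.0);
    let xs: Vec<f64> = (0..n)
        .map(|i| -r + 2.0 * r * i as f64 / (n - 1) as f64)
        .collect();
    let ws: Vec<f64> = xs.iter().map(|&x| coord_density(x, d)).collect();
    // Start from evenly spaced centroids inside [-r, r].
    let mut c: Vec<f64> = (0..k)
        .map(|j| -r + 2.0 * r * (j as f64 + 0.5) / k as f64)
        .collect();
    for _ in 0..iters {
        let mut num = vec![0.0; k];
        let mut den = vec![0.0; k];
        let mut j = 0;
        for (&x, &w) in xs.iter().zip(&ws) {
            // Grid and centroids are both sorted, so the nearest index only grows.
            while j + 1 < k && (x - c[j + 1]).abs() < (x - c[j]).abs() {
                j += 1;
            }
            num[j] += w * x;
            den[j] += w;
        }
        // Move each centroid to the weighted mean of its cell.
        for j in 0..k {
            if den[j] > 0.0 {
                c[j] = num[j] / den[j];
            }
        }
    }
    c
}
```

Calling `lloyd_max_codebook(1536, 4, 50)` yields 16 centroids that depend only on the dimension and bit width, which is the sense in which the codebook needs no corpus.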

Before quantization, KAI applies a random orthogonal rotation to each vector. This rotation - generated from a seeded QR decomposition and stored deterministically - redistributes quantization error uniformly across all coordinates. Without it, dimensions that happen to carry more signal would dominate quantization loss.
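A seeded rotation of this kind can be sketched without external crates by orthogonalizing a pseudo-random Gaussian matrix, which produces the same Q factor a QR decomposition would. The SplitMix64 generator, the Gram-Schmidt routine, and all names below are stand-ins chosen for the example. KAI's actual generator and decomposition routine are not described in this article.

```rust
/// SplitMix64: a tiny deterministic PRNG, used only so the sketch needs no crates.
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
    /// Uniform in (0, 1], so the logarithm below is always finite.
    fn next_unit(&mut self) -> f64 {
        ((self.next_u64() >> 11) as f64 + 1.0) / (1u64 << 53) as f64
    }
    /// Standard normal sample via the Box-Muller transform.
    fn next_gaussian(&mut self) -> f64 {
        let (u1, u2) = (self.next_unit(), self.next_unit());
        (-2.0 * u1.ln()).sqrt() * (2.0 * std::f64::consts::PI * u2).cos()
    }
}

/// Returns a d x d matrix with orthonormal rows, reproducible from `seed`.
fn random_rotation(d: usize, seed: u64) -> Vec<Vec<f64>> {
    let mut rng = SplitMix64(seed);
    let mut q: Vec<Vec<f64>> = (0..d)
        .map(|_| (0..d).map(|_| rng.next_gaussian()).collect())
        .collect();
    // Modified Gram-Schmidt: orthogonalize each row against the finished ones.
    for i in 0..d {
        let (done, rest) = q.split_at_mut(i);
        let row = &mut rest[0];
        for prev in done.iter() {
            let proj: f64 = row.iter().zip(prev).map(|(a, b)| a * b).sum();
            for (a, b) in row.iter_mut().zip(prev) {
                *a -= proj * b;
            }
        }
        let norm = row.iter().map(|a| a * a).sum::<f64>().sqrt();
        for a in row.iter_mut() {
            *a /= norm;
        }
    }
    q
}

/// Applies the rotation to a vector (matrix-vector product).
fn rotate(q: &[Vec<f64>], v: &[f64]) -> Vec<f64> {
    q.iter()
        .map(|row| row.iter().zip(v).map(|(a, b)| a * b).sum::<f64>())
        .collect()
}
```

Because the matrix is a pure function of `(d, seed)`, it can be regenerated at startup instead of being learned, and applying it preserves vector norms and inner products.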

After quantization, each vector coordinate becomes a small integer (typically 4 bits). A full 1536-dimensional vector at float32 would occupy 6 KB; after quantization it occupies roughly 768 bytes - an 8× compression - with a carefully calibrated per-vector scale factor that compensates for the systematic shrinkage introduced by centroid reconstruction.
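The storage arithmetic can be checked with a small packing sketch: two 4-bit codes per byte turn 1536 codes into 768 bytes, against 1536 × 4 = 6144 bytes at float32. The low-nibble-first layout and the function names are assumptions for illustration. KAI's blocked SIMD layout is likely arranged differently.

```rust
/// Packs 4-bit codes (values 0..=15) two per byte, low nibble first.
fn pack_nibbles(codes: &[u8]) -> Vec<u8> {
    codes
        .chunks(2)
        .map(|pair| {
            let lo = pair[0] & 0x0F;
            // An odd trailing code is padded with a zero high nibble.
            let hi = pair.get(1).copied().unwrap_or(0) & 0x0F;
            lo | (hi << 4)
        })
        .collect()
}

/// Recovers the first `len` 4-bit codes from a packed buffer.
fn unpack_nibbles(packed: &[u8], len: usize) -> Vec<u8> {
    packed
        .iter()
        .flat_map(|&b| [b & 0x0F, b >> 4])
        .take(len)
        .collect()
}
```

The per-vector scale factor adds a few bytes on top of the packed codes, which is why the figure above is "roughly" 768 bytes.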

What Makes KAI Unique?

1. No Training Phase

The classical vector database workflow looks like: collect data → train → index → query. KAI’s workflow is: ingest → query. The codebook is a function of (bit_width, dimension), both of which are known at startup. You can point KAI at a brand-new empty database and start searching immediately.

2. Runtime SIMD Hardware Detection

Most systems are compiled for a fixed architecture. KAI queries the CPU at runtime and routes to the most capable kernel available: AVX-512 first, AVX2 as a fallback on x86, and NEON on ARM. This means the same binary runs at bare-metal speed whether deployed on a developer laptop or a cloud instance with AVX-512. On ARM (AWS Graviton), the NEON path uses vqtbl1q_u8 for sub-byte table lookups across 32-vector blocks.
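Runtime dispatch of this kind can be expressed with the feature-detection macros in the Rust standard library. The `Kernel` enum, the `select_kernel` function, and the scalar fallback are hypothetical names introduced for this sketch. Only the AVX-512, AVX2, and NEON ordering comes from the description above.

```rust
/// Which scoring kernel to use (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Kernel {
    Avx512,
    Avx2,
    Neon,
    Scalar, // assumed portable fallback, not mentioned in the text above
}

/// Probes the CPU once at startup and picks the most capable kernel.
fn select_kernel() -> Kernel {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            return Kernel::Avx512;
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            return Kernel::Avx2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return Kernel::Neon;
        }
    }
    Kernel::Scalar
}
```

A real engine would typically call this once and cache a function pointer to the chosen kernel, so the detection cost is not paid per query.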

3. Append-Only Storage with Crash-Safe VACUUM

The document store (.tvdb format) is strictly append-only. Writes never move existing data, which means the memory-mapped file used for zero-copy reads is always safe - there is no risk of reading a byte range that has been overwritten mid-read. Deletes are handled by a two-tier system borrowed from database internals: a fast logical delete (removes the document from index maps immediately) followed by a background physical compaction (VACUUM) that rewrites only the live records once dead bytes cross a threshold.
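The two-tier delete can be modeled in memory as follows. This is a simplified sketch, assuming a `Vec<u8>` in place of the .tvdb file and a dead-byte ratio as the threshold. The `AppendLog` type and its methods are invented for the example.

```rust
use std::collections::HashMap;

/// In-memory stand-in for an append-only record file.
struct AppendLog {
    data: Vec<u8>,                       // bytes are only ever appended
    index: HashMap<u64, (usize, usize)>, // id -> (offset, len) of live records
    dead_bytes: usize,                   // bytes no longer reachable via `index`
}

impl AppendLog {
    fn new() -> Self {
        AppendLog { data: Vec::new(), index: HashMap::new(), dead_bytes: 0 }
    }

    /// Appends a record; existing bytes are never moved or overwritten.
    fn append(&mut self, id: u64, payload: &[u8]) {
        let offset = self.data.len();
        self.data.extend_from_slice(payload);
        if let Some((_, old_len)) = self.index.insert(id, (offset, payload.len())) {
            self.dead_bytes += old_len; // a replaced record becomes dead space
        }
    }

    /// Logical delete: drop the index entry and leave the bytes in place.
    fn delete(&mut self, id: u64) -> bool {
        match self.index.remove(&id) {
            Some((_, len)) => {
                self.dead_bytes += len;
                true
            }
            None => false,
        }
    }

    fn get(&self, id: u64) -> Option<&[u8]> {
        self.index.get(&id).map(|&(off, len)| &self.data[off..off + len])
    }

    /// Physical compaction: rewrite live records once the dead ratio
    /// reaches `threshold` (a fraction between 0 and 1).
    fn vacuum_if_needed(&mut self, threshold: f64) -> bool {
        if self.data.is_empty()
            || (self.dead_bytes as f64) < threshold * self.data.len() as f64
        {
            return false;
        }
        let mut fresh = Vec::with_capacity(self.data.len() - self.dead_bytes);
        for (off, len) in self.index.values_mut() {
            let new_off = fresh.len();
            fresh.extend_from_slice(&self.data[*off..*off + *len]);
            *off = new_off;
        }
        self.data = fresh;
        self.dead_bytes = 0;
        true
    }
}
```

Deletes stay cheap because they only touch the index map, while the cost of rewriting is deferred until enough dead space has accumulated to justify it.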

4. Zero-Copy Document Retrieval

Documents are serialized with rkyv, a zero-copy deserialization library. On a retrieval call, the engine computes the byte offset of the record in the memory-mapped file and returns a typed reference directly into that mapped region - no heap allocation, no deserialization. Lifetime safety is guaranteed by a RwLock that prevents the mmap from being retired while a reference into it is alive.
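The lifetime guarantee can be illustrated with the standard library alone. In the sketch below a `Vec<u8>` stands in for the memory-mapped region and raw bytes stand in for an rkyv archived type. The `Store` and `RecordRef` names are hypothetical. The point is only that a returned reference carries a read guard, so the buffer cannot be swapped out while the reference exists.

```rust
use std::ops::Range;
use std::sync::{RwLock, RwLockReadGuard};

/// Stand-in for the mapped file; a writer must take the write lock to retire it.
struct Store {
    mapped: RwLock<Vec<u8>>,
}

/// A borrowed view of one record that keeps the read lock held.
struct RecordRef<'a> {
    guard: RwLockReadGuard<'a, Vec<u8>>,
    range: Range<usize>,
}

impl<'a> RecordRef<'a> {
    /// Returns the record bytes without copying them.
    fn bytes(&self) -> &[u8] {
        &self.guard[self.range.clone()]
    }
}

impl Store {
    fn new(bytes: Vec<u8>) -> Self {
        Store { mapped: RwLock::new(bytes) }
    }

    /// Looks up a record by byte offset and length; no heap allocation occurs.
    fn get(&self, offset: usize, len: usize) -> Option<RecordRef<'_>> {
        let guard = self.mapped.read().unwrap();
        let end = offset.checked_add(len)?;
        if end > guard.len() {
            return None;
        }
        Some(RecordRef { guard, range: offset..end })
    }
}
```

While any `RecordRef` is alive, an attempt to take the write lock (for example, to replace the buffer after compaction) cannot succeed, which mirrors the retirement rule described above.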

5. Async/Sync Architecture with CPU Affinity

KAI’s server strictly separates the async I/O world (Axum + Tokio) from the SIMD compute world (Rayon). Without this separation, a spike in HTTP traffic would cause context switches that evict the cache lines holding the quantized vector array mid-scan, causing severe latency spikes under load. The server-isolation feature routes all SIMD work through a dedicated Rayon thread pool, keeping the scan threads on their own CPU cores.

Interesting Design Decisions

Per-Vector Scale Correction

A naïve quantization scheme compresses the vector and multiplies the score by ||v|| (the original vector’s norm) to partially undo compression loss. KAI goes further. It computes the inner product between the rotated unit vector and its centroid reconstruction (x_hat), then stores a scale of ||v|| / <u, x_hat>. This is a RaBitQ-style correction that makes the dot-product estimator unbiased - when the quantization is perfect, the correction collapses to 1 and doesn’t change anything. In practice, it meaningfully improves recall on embeddings with high-variance norms.
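The correction is small enough to write out directly. The sketch below omits the rotation step and uses plain `f32` slices. The function names are illustrative assumptions, and the code does not guard against a zero vector or a non-positive inner product.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// scale = ||v|| / <u, x_hat>, where u = v / ||v|| and x_hat is the
/// centroid reconstruction of u.
fn correction_scale(v: &[f32], x_hat: &[f32]) -> f32 {
    let norm = dot(v, v).sqrt();
    let u: Vec<f32> = v.iter().map(|x| x / norm).collect();
    norm / dot(&u, x_hat)
}

/// Estimated <query, v>, computed from the reconstruction and the stored scale.
fn estimate_dot(query: &[f32], x_hat: &[f32], scale: f32) -> f32 {
    scale * dot(query, x_hat)
}
```

For v = (3, 4) and a reconstruction shrunk to 0.9·u = (0.54, 0.72), the stored scale is 5 / 0.9, and the estimate for the query (1, 2) comes out at 11, the true inner product. Multiplying by ||v|| alone would have given 9.9.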

The MemTable Pattern (from LSM Trees)

Incoming vectors are not immediately placed into the SIMD-searched hot tier. They are first staged in an in-memory MemTable in their compressed (bit-packed) form. A search pass transparently covers both the warm mmap tier and the MemTable, so freshly ingested vectors are queryable without waiting for compaction. Background VACUUM flushes the MemTable into the hot tier and rebuilds the blocked layout, keeping the search path consistent.
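A minimal model of this pattern keeps two collections and has every search scan both. The sketch uses uncompressed `f32` vectors and a brute-force scan for brevity, whereas the text above describes bit-packed storage. The `TieredIndex` type and its methods are assumptions for illustration.

```rust
/// Two-tier index: a main tier plus a staging MemTable.
struct TieredIndex {
    main_tier: Vec<(u64, Vec<f32>)>, // stands in for the mmap-backed blocked layout
    memtable: Vec<(u64, Vec<f32>)>,  // freshly ingested vectors
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

impl TieredIndex {
    /// New vectors are staged in the MemTable and are searchable immediately.
    fn ingest(&mut self, id: u64, v: Vec<f32>) {
        self.memtable.push((id, v));
    }

    /// One search pass covers both tiers and returns the top-k by inner product.
    fn search(&self, query: &[f32], k: usize) -> Vec<(u64, f32)> {
        let mut hits: Vec<(u64, f32)> = self
            .main_tier
            .iter()
            .chain(self.memtable.iter())
            .map(|(id, v)| (*id, dot(query, v)))
            .collect();
        hits.sort_by(|a, b| b.1.total_cmp(&a.1));
        hits.truncate(k);
        hits
    }

    /// Background flush: move staged vectors into the main tier.
    fn flush(&mut self) {
        self.main_tier.append(&mut self.memtable);
    }
}
```

Search results are the same before and after `flush`, which is the property that lets compaction run in the background without affecting query correctness.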

Windows Daemonization via WMI

Running a long-lived server process that outlives the terminal window is trivial on Unix (fork + setsid). On Windows, it requires breaking out of the Job Object that associates child processes with their parent terminal. KAI solves this with a WMI call (Win32_Process.Create) that spawns the server process in a completely independent context - a technique borrowed from Windows service managers.

Tombstone-Based Delete Log

Because vectors live only in RAM (the .tv file is not yet persistent across restarts), a naive restart would “resurrect” any deleted documents by re-scanning the .tvdb text file. KAI prevents this by maintaining a durable, append-only delete log (deleted.bin) keyed by chunk_id. On startup, this log is replayed to re-apply all logical deletes before serving any requests.
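Replaying such a log is straightforward. The sketch below assumes each tombstone is a fixed 8-byte little-endian `u64` chunk_id, which is a guess made for illustration. The actual record format of deleted.bin is not specified here.

```rust
use std::collections::HashSet;
use std::io::{self, Read, Write};

/// Appends one tombstone (assumed format: 8-byte little-endian chunk_id).
fn append_tombstone<W: Write>(log: &mut W, chunk_id: u64) -> io::Result<()> {
    log.write_all(&chunk_id.to_le_bytes())
}

/// Rebuilds the set of deleted chunk ids from the log at startup.
fn replay_tombstones<R: Read>(mut log: R) -> io::Result<HashSet<u64>> {
    let mut bytes = Vec::new();
    log.read_to_end(&mut bytes)?;
    // chunks_exact skips a trailing partial record, e.g. from a crash mid-write.
    Ok(bytes
        .chunks_exact(8)
        .map(|chunk| {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            u64::from_le_bytes(buf)
        })
        .collect())
}
```

Both functions are generic over `Read` and `Write`, so the same code works against a `std::fs::File` opened in append mode or an in-memory buffer in tests.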

Use Cases

Retrieval-Augmented Generation (RAG): KAI’s parent-document grouping (parent_doc_id) is a first-class feature. A PDF can be split into overlapping chunks, all sharing the same parent_doc_id. A search returns the most relevant chunks, and a single API call (GET /api/documents/parent/:id) returns the full original document - the pattern that feeds the best context windows to LLMs.

Semantic Code Search: Embedding models trained on code (UniXcoder, CodeBERT) produce 768–1536-dimensional vectors. KAI’s dynamic dimensionality (KAI_DIM env var) handles any model out of the box.

Document Deduplication at Scale: With bulk ingestion and millisecond-latency search, KAI can compare incoming documents against an existing corpus in real time to detect near-duplicates before they’re stored.

Offline/Air-Gapped Deployments: Because there is no training phase and no external service dependency, KAI can be dropped into an air-gapped environment, given a local embedding model (e.g., via ollama or a bundled ONNX runtime), and made fully self-contained.

Multi-Architecture Edge Inference: The Docker build uses $TARGETARCH to inject the right compiler flags at build time (-C target-cpu=x86-64-v4 for x86, native for ARM). Combined with runtime SIMD detection, KAI produces containers that run at bare-metal speed on both Intel/AMD cloud instances and AWS Graviton without maintaining separate images.

Benchmarking KAI

KAI ships with a first-class CLI benchmarking suite built on statrs for statistical precision:

# Measure raw SIMD encoding and query latency (P50, P95, P99)
kai benchmark --synth latency

# Measure in-memory compaction efficiency
kai benchmark --synth compaction

# Measure sequential append throughput
kai benchmark --synth ingest

# Measure async concurrency under parallel query load
kai benchmark --synth concurrency

# Export results as machine-readable JSON for CI pipelines
kai benchmark --synth all --json

The latency benchmarks measure the SIMD scan path directly, bypassing HTTP overhead, to isolate the mathematical engine’s performance.
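For readers unfamiliar with the P50/P95/P99 notation, the following sketch computes percentiles from a sorted list of latency samples using the nearest-rank definition. That definition is an assumption for this example. The benchmark suite may use a different estimator, such as an interpolating one.

```rust
/// Nearest-rank percentile over latencies sorted in ascending order.
/// `p` is given in percent, e.g. 99.0 for P99.
fn percentile(sorted: &[f64], p: f64) -> f64 {
    assert!(!sorted.is_empty() && (0.0..=100.0).contains(&p));
    // Rank is the smallest 1-based position covering p percent of the samples.
    let rank = (p * sorted.len() as f64 / 100.0).ceil() as usize;
    sorted[rank.max(1) - 1]
}
```

With 100 samples valued 1 through 100, P50 is 50, P95 is 95, and P99 is 99. Tail percentiles matter more than the mean here because a single cache-miss stall shows up at P99 long before it moves the average.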

Where KAI Is Headed

KAI is a research engine as much as a production tool. Several major capabilities are explicitly in progress:

Durable vector persistence - the current release keeps vectors in RAM; a versioned .tv v3 format with an explicit id-map is the next milestone.
Segmented warm tier - Lucene-style immutable segments with background merge-compaction, removing the full-file-rewrite pause that the current VACUUM requires.
Online compaction - moving the VACUUM copy phase outside the write lock for zero-pause reclamation on large databases.
Ingest atomicity - a write-ahead log spanning both the text tier and the vector tier to make ingestion crash-proof end-to-end.

Even in its current form, KAI demonstrates that a high-performance, production-grade vector database does not require the complexity of offline training pipelines. The mathematical structure of the embedding space is enough.

KAI is an open research project. The source is organized as a Rust workspace in kai-core, kai-server, kai-cli, kai-proto, and kai-tools.