Training-Free Vector Quantization: Lloyd-Max Codebooks on the Unit Sphere

Vector quantization is the art of approximating a high-dimensional floating-point vector with a compact integer representation. Do it well and you get an 8× memory reduction with acceptable recall loss. Do it naively and your search quality collapses. The interesting question is: what does “do it well” mean when you deliberately refuse to look at the data?
KAI’s answer draws from two bodies of theory: the geometry of uniform distributions on the unit sphere, and the classical Lloyd-Max algorithm for optimal scalar quantization. Together they produce a codebook that is provably optimal for the statistical distribution that unit-sphere vectors actually follow - without ever needing a training corpus.
The Problem with Corpus-Dependent Quantization
Most vector databases that use Product Quantization (PQ) or OPQ work as follows:
- Collect a large representative sample of your vectors.
- Divide each vector into sub-vectors.
- Run -means on each sub-vector space to find cluster centroids.
- Store the centroid index for each sub-vector.
This approach achieves high compression quality for the specific corpus you trained on. But it has serious operational problems:
- Upfront training cost. For 1536-dimensional vectors at 8 bits, you’re running 256 independent -means jobs on potentially millions of sub-vectors.
- Distribution shift. If your corpus changes significantly (new topics, new embedding model), your codebook degrades and eventually needs retraining.
- Cold-start problem. A brand-new deployment with no existing data has nothing to train on.
KAI sidesteps all of this. The codebook is derived from the mathematical prior of the space, not from samples.
The Geometry of Normalized Vectors
Modern embedding models produce vectors that are almost always -normalized before use (or should be, for cosine similarity). A normalized vector lives on the surface of the unit hypersphere .
What does a uniformly random point on look like, coordinate by coordinate? This is a classical result in probability theory. For a vector drawn uniformly from , each coordinate follows a marginal distribution:
This is a symmetric Beta distribution. For large (e.g., ), the parameters are very large, and the Beta distribution becomes tightly concentrated around zero - essentially a Gaussian with mean 0 and small variance. For small , it’s more spread out.
The crucial point is: this distribution is completely determined by . You don’t need data. You don’t need samples. You only need to know the dimensionality of your embedding space.
The Lloyd-Max Algorithm
Given a known distribution, the optimal scalar quantizer that minimizes mean-squared quantization error can be found by the Lloyd-Max algorithm. For quantization levels (corresponding to bits), Lloyd-Max iterates:
- Boundary update: Place the boundary between level and level at the midpoint of the corresponding centroids.
- Centroid update: Set each centroid to the conditional expectation (mean) of the distribution over its cell.
- Repeat until convergence.
For a symmetric distribution like , the centroids are symmetric around zero, and the algorithm converges quickly (200 iterations is more than enough).
KAI’s codebook module implements this precisely:
pub fn codebook(bits: usize, dim: usize) -> (Vec<f32>, Vec<f32>) {
lloyd_max(bits, dim, 200, 1e-12)
}
fn lloyd_max(bits: usize, dim: usize, max_iter: usize, tol: f64) -> (Vec<f32>, Vec<f32>) {
let a = (dim as f64 - 1.0) / 2.0;
let beta = Beta::new(a, a).unwrap(); // Beta(a,a) on [0,1], shifted to [-1,1]
let n_levels = 1usize << bits;
// Initialize centroids within ±3 std devs of the distribution
// ...
for _ in 0..max_iter {
// Boundaries = midpoints between consecutive centroids
// Centroid update = conditional mean via numerical integration
// Check convergence by max centroid displacement
}
} The centroid update requires computing conditional means of the Beta distribution over each cell. This is done via adaptive Simpson’s rule numerical integration - the integration intervals are non-uniform (some cells are very narrow near the tails), so adaptive refinement is necessary for the required tolerance of 1e-12.
The output is two arrays: boundaries (the decision thresholds for quantization) and centroids (the representative values for each quantization level). These are precomputed once at startup and cached for the lifetime of the index.
Orthogonal Rotation: Making Coordinates Exchangeable
There is a subtlety: real embedding vectors are not uniformly distributed on the sphere. They have systematic structure - some dimensions carry more information than others. A semantic embedding of the word “king” doesn’t have random coordinate values; certain dimensions are systematically large or small.
If the coordinates aren’t exchangeable (identically distributed), the per-dimension Lloyd-Max codebook is suboptimal for high-information dimensions. The standard solution from the PQ literature is the Optimized Product Quantization (OPQ) rotation, which rotates the vectors to find the direction of minimum quantization distortion - but this requires training data.
KAI uses a different approach: a random orthogonal rotation applied to every vector before quantization. A random rotation doesn’t minimize distortion for your specific corpus, but it does something else: it redistributes error uniformly across all dimensions. After a random rotation, no single dimension systematically carries more information than any other - the rotation mixes the structure across all coordinates. The resulting coordinate-wise distribution is closer to the uniform-on-sphere prior that the codebook was designed for.
The rotation matrix is generated once, deterministically, using a seeded QR decomposition:
pub fn make_rotation_matrix(dim: usize) -> Vec<f32> {
let mut rng = ChaCha8Rng::seed_from_u64(ROTATION_SEED);
// Fill dim×dim matrix with standard Gaussian entries
let mut g = faer::Mat::<f64>::zeros(dim, dim);
for j in 0..dim {
for i in 0..dim {
g.write(i, j, rng.sample(StandardNormal));
}
}
// QR decomposition gives an orthogonal matrix Q
let qr = g.qr();
let q = qr.compute_thin_q();
// Sign correction: ensure diag(R) > 0 for uniqueness
// ...
} The result is a dim × dim orthogonal matrix stored in row-major f32 format. Applying it to a vector is a matrix-vector multiply - accelerated via faer’s BLAS-backed matmul with Rayon parallelism for batches, and faer::Parallelism::None for single-vector encodes (to avoid Rayon overhead on small workloads).
Why ChaCha8Rng? It’s a cryptographically-decent PRNG with reproducible cross-platform output. The seed (ROTATION_SEED = 42) ensures that the rotation matrix is identical across all replicas of the service and across restarts - critical because the rotation must be the same at ingest time and at query time.
The Encoding Pipeline
With the rotation matrix and codebook in hand, encoding a vector proceeds in five steps:
Step 1: Normalize
Store the original norm for later use. From this point forward, all operations are on unit vectors.
Step 2: Rotate
The transpose is intentional - for an orthogonal matrix , applying is the same as applying , which takes the vector from the original space into the “aligned” space where the codebook applies.
Step 3: Quantize
For each dimension , find the quantization level by counting how many boundaries the value exceeds:
// For each boundary threshold b:
for b in boundaries {
for idx in 0..n * dim {
if rotated[idx] > *b {
codes[idx] += 1;
}
}
} This is a cumulative comparison - equivalent to a binary search, but vectorizable because all dimensions can be processed independently in parallel.
Step 4: Compute the Per-Vector Scale
This is the RaBitQ-inspired correction. After quantization, reconstruct what the vector would look like if decoded: . The inner product measures how well the codebook represents this particular vector. The stored scale is:
Where is the per-dimension centroid reconstruction. When quantization is perfect (x_hat = u_rotated), the inner product is 1.0 and scale = ||v|| - the standard norm correction. When quantization introduces error (the centroid is offset from the true coordinate value), the inner product is less than 1, and the scale inflates to compensate.
for i in 0..n {
let mut inner = 0.0f64;
for j in 0..dim {
let r = rotated[row_start + j] as f64;
let c = centroids[codes[row_start + j] as usize] as f64;
inner += r * c;
}
scales[i] = norms[i] / inner.max(1e-10) as f32;
} This is computed in f64 to avoid accumulated rounding errors across 1536 multiplications.
Step 5: Bit-Pack
The quantization codes (integers in [0, 2^bits - 1]) are packed into a bit-plane layout. For 4-bit codes and 1536 dimensions:
- Each code occupies 4 bits → 2 codes per byte
- Total storage:
1536 / 2 = 768bytes per vector - The bit-plane format separates the 4 bit-planes across the vector, which is the representation expected by the SIMD repack step
The final 768-byte packed code, together with the 4-byte f32 scale, fully represents the original 6 KB float32 vector. That’s an 8× compression ratio with a provably near-optimal codebook for the embedding space geometry.
Why This Works: The Theory Behind the Practice
The reason data-oblivious quantization is competitive with corpus-trained quantization comes down to a key property of modern embedding models: they produce vectors that are already well-distributed on the sphere.
Embedding models are trained with contrastive objectives (cosine similarity losses, InfoNCE, etc.) that explicitly push vectors towards the sphere surface and encourage isotropy. A well-trained embedding model’s output vectors are approximately uniformly distributed on - which is exactly the distribution that KAI’s Lloyd-Max codebook is optimized for.
This is the core insight: the embedding model’s training objective and KAI’s quantization objective are aligned by design. A corpus-trained PQ codebook would specialize for the particular covariance structure of your corpus; a sphere-optimal codebook is optimal for the universal structure that the model itself imposes.
The random orthogonal rotation completes the picture: it ensures that the remaining structure (the non-uniform directions in your specific corpus) doesn’t systematically concentrate quantization error in any particular dimension.
Practical Implications
Recall. At 4 bits and 1536 dimensions, KAI’s quantization achieves recall@10 values competitive with brute-force float32 search on typical RAG corpora. At 2 bits, compression is 16× with some recall degradation; at 8 bits, the codebook quality approaches float32 but with 4× compression.
Startup time. The codebook is computed at process startup via the Lloyd-Max iteration. For bits=4 and dim=1536, this takes tens of milliseconds - not a hot path.
Universality. The same codebook works for any embedding model that outputs 1536-dimensional vectors: OpenAI text-embedding-3-large, Cohere embed-english-v3.0, and others. No retraining when you switch models.
Cross-architecture consistency. The rotation matrix generation uses ChaCha8Rng with a fixed seed, producing identical output on x86, ARM, and any other platform Rust compiles to. This is essential for correctness: a query encoded on one machine must produce scores compatible with documents encoded on another.
This article covers kai-core/src/codebook.rs, kai-core/src/rotation.rs, and kai-core/src/encode.rs in the KAI project. The companion article on SIMD search covers how these quantized codes are consumed by the scoring pipeline.