Inside the SIMD Search Pipeline: How KAI Scans Millions of Vectors in Milliseconds

The bottleneck in any vector database is the inner-product scan. Given a query vector and a database of $N$ compressed document vectors, you need to compute a similarity score for every document and return the top- $k$ . At a million vectors and 1536 dimensions, that’s an enormous amount of arithmetic - and it has to happen in tens of milliseconds to be useful in a production system.

KAI’s SIMD pipeline is the component that makes this feasible. This article unpacks how it works, why the memory layout matters as much as the computation, and how the same codebase achieves peak throughput on both x86 (AVX2/AVX-512) and ARM (NEON).

The Scoring Problem

After quantization (covered in the companion article on RaBitQ), each document vector is represented as $d$ small integers - typically 4-bit values (nibbles) for a 1536-dimensional vector. A search query goes through the same encoding pipeline: normalize, rotate, quantize. The result is a query with the same nibble-per-dimension representation.

The scoring function approximates cosine similarity as:

score(query, doc) ≈ per_vec_scale × Σ_j lut[query_nibble_j][doc_nibble_j]

Where lut is a lookup table (LUT) precomputed from the query’s quantized values and the codebook centroids. The key insight: instead of computing floating-point multiply-accumulates over 1536 floats per document, we do integer table lookups over 1536 nibbles per document.

For 4-bit codes, each nibble takes one of 16 values, so the LUT has 16 entries. The per-group accumulation becomes a table lookup and a byte add - operations that fit neatly into SIMD registers.

The Nibble-Split Trick

A naive LUT approach would require a 256-entry table (for full-byte codes) per group - too large to keep hot in registers. SIMD architectures like AVX2 support efficient 16-entry shuffle operations (vpshufb on x86, vqtbl1q_u8 on ARM NEON). These operate on a 16-byte vector as a lookup into a 16-entry table, producing 16 results in a single instruction.

KAI splits each byte into two nibbles (low 4 bits and high 4 bits) and maintains two 16-entry tables per group:

byte = doc_code_byte

lo_nibble = byte & 0x0F → look up in lut_lo (low-half table)

hi_nibble = byte >> 4 → look up in lut_hi (high-half table)

score_contribution = lut_lo[lo_nibble] + lut_hi[hi_nibble];

Both lookups are performed simultaneously using a single vpshufb / vqtbl1q_u8 instruction across 16 bytes at once. This is the nibble-split LUT technique, and it’s the same idea used in FAISS’s PQ scanning kernel - adapted here for KAI’s own quantization scheme.

The Blocked Memory Layout

The memory layout of the quantized codes is as important as the computation itself. Consider two possible layouts for storing the codes of $N$ vectors, each with $G$ groups:

Layout A (vector-major): All groups for vector 0, then all groups for vector 1, etc.
Layout B (blocked/SIMD-friendly): All $B$ vectors of block 0 interleaved group-by-group.

With Layout A, scoring a single query against 32 consecutive vectors would require strided memory accesses - the bytes for group 0 of vectors 0..31 are spread across $G$ stride distances. SIMD units hate strided access.

With Layout B (the blocked layout KAI uses), all bytes for group 0 of vectors 0..31 sit contiguously in memory. A single 32-byte AVX2 load or a pair of 16-byte NEON loads picks up the group 0 bytes for all 32 vectors in one instruction.

KAI uses a block size of BLOCK = 32 vectors. Each block stores its codes in a layout like:

[group_0_vec_0..31][group_1_vec_0..31]…[group_G-1_vec_0..31]

This means every inner-loop iteration processes 32 vectors simultaneously, and every memory access is maximally dense.

The perm0 Interleave (AVX2 Cross-Lane Issue)

AVX2 has a subtle constraint: the vpshufb shuffle operates within each 128-bit lane independently. A 256-bit AVX2 register is logically two 128-bit lanes, and elements cannot cross between them. If 32 vector codes are stored naively (16 bytes from vectors 0..15 in the low lane, 16 bytes from vectors 16..31 in the high lane), there is no problem. But after accumulation, when we want to add the contributions of vectors 0 and 16 together, they land in different lanes and can’t be added with a simple horizontal add.

KAI uses the perm0 permutation to pre-interleave the codes at repack time:

let perm0: [usize; 16] = [0, 8, 1, 9, 2, 10, 3, 11, 4, 12, 5, 13, 6, 14, 7, 15];

This shuffles vectors so that corresponding high-lane and low-lane elements are interleaved. After scoring, the partial sums can be correctly reduced without any cross-lane penalty. It’s the FAISS trick, adapted to KAI’s layout. On ARM NEON, this problem doesn’t arise (no 128-bit lane constraint), so the ARM path uses a simpler sequential layout with no permutation.

The Flush Pattern: Preventing u8 Saturation

The inner accumulator for each block is a u8 - 8-bit unsigned integer. A single table lookup returns a value in [0, 255]. After enough additions, it overflows.

For a 1536-dimensional, 4-bit quantized vector, there are $1536 / 2 = 768$ byte groups. Adding 768 lookup results into a u8 accumulator would overflow repeatedly (768 > 255). To prevent this, KAI uses a flush pattern: it accumulates into u8 for FLUSH_EVERY = 256 groups at most, then widens to u16 accumulators and resets.

The outer loop structure looks like:

for batch in 0..n_batches {
    // u8 accumulation for up to FLUSH_EVERY groups
    let mut accum_u8 = [vdupq_n_u8(0); 4];  // NEON, 4 × 16-lane u8

    for g in batch_start..batch_end {
        // table lookup + add into accum_u8
    }

    // Widen u8 → u16 and add to float accumulators
    fa[i] = vaddq_f32(fa[i], vmulq_f32(v_scale, vcvtq_f32_u32(vpaddlq_u16(...))));
}

This “widening flush” is a standard technique in SIMD accumulation pipelines. The choice of 256 is deliberate: at 4-bit codes with a maximum table value of 15, 256 additions gives a worst-case sum of $256 t i m e s 15 = 3840$ , safely within u16 range after widening.

The 4-Group Unrolled Inner Loop

KAI’s NEON path unrolls the inner loop 4x to hide instruction latency:

while g + 3 < g_end {
    // Load 4 sets of LUT pairs and code bytes simultaneously
    let (lp0, lp1, lp2, lp3) = ...; // LUT pointers
    let (cp0, cp1, cp2, cp3) = ...; // code pointers

    for (lp, cp) in [(lp0,cp0), (lp1,cp1), (lp2,cp2), (lp3,cp3)] {
        let lut_hi = vld1q_u8(lp);
        let lut_lo = vld1q_u8(lp.add(16));
        let c0 = vld1q_u8(cp);
        let c1 = vld1q_u8(cp.add(16));
        // hi nibble lookup + lo nibble lookup + add
        let s0 = vaddq_u8(vqtbl1q_u8(lut_hi, vshrq_n_u8(c0, 4)), 
                          vqtbl1q_u8(lut_lo, vandq_u8(c0, mask)));
        // ... accumulate
    }
    g += 4;
}

vqtbl1q_u8 has a latency of roughly 2 cycles and a throughput of 1 per cycle on modern ARM cores. By issuing 4 independent lookup pairs back-to-back before waiting for results, the CPU can pipeline multiple in-flight lookups and hide per-instruction latency.

The Tombstone Mask

Not all vectors in the database should be returned. Deleted documents are represented as tombstoned internal slot IDs stored in a RoaringBitmap. Before the SIMD scan, the engine checks whether any live (non-tombstoned) vector falls within each 32-vector block. If an entire block is tombstoned, it is skipped entirely - no lookup, no accumulation.

// block_has_allowed: returns true if any slot in the block is live

if !block_has_allowed(&tombstones, block_idx) {
    BLOCKS_SKIPPED_BY_MASK.fetch_add(1, Ordering::Relaxed);
    continue; // skip the entire SIMD block
}

This is tracked by a global atomic counter (BLOCKS_SKIPPED_BY_MASK) for telemetry and test verification. In a database with heavy deletes, this early-exit dramatically reduces effective scan work.

Transparent MemTable Search

KAI’s bulk-ingest feature maintains a MemTable: an in-memory buffer of recently ingested vectors that haven’t yet been flushed into the cold blocked-layout tier. Searches transparently cover both:

The warm-tier mmap (memory-mapped .tv file, SIMD-blocked layout)
The MemTable (in-memory, SIMD-blocked layout)

The scores from both passes are merged into a single priority queue, and the top- $k$ are returned. From the caller’s perspective, a freshly ingested document is searchable immediately - there is no ingest lag.

Score Pipeline Summary

Putting it all together, a single search request through KAI’s engine follows this path:

HTTP POST /api/search

│

├─ [1] Deserialize query vector (rkyv, zero-copy)

├─ [2] Normalize + rotate query (faer BLAS, Rayon parallel)

├─ [3] Quantize query → nibbles (Lloyd-Max boundaries)

├─ [4] Build per-group LUTs (nibble_scores × n_groups)

│

├─ [5] SIMD scan - warm tier (blocked mmap)

│ └─ per 32-vector block: nibble-split LUT + flush + widening

├─ [6] SIMD scan - MemTable (in-memory staged vectors, if bulk-ingest)

│

├─ [7] Merge scores, apply tombstone mask, heapify top-k

├─ [8] For each top-k hit: warm-tier zero-copy text retrieval (rkyv)

└─ [9] Serialize response JSON

The profiling endpoint (?profile=true) exposes microsecond-level timings for each phase: json_parse_us, lock_acquire_us, simd_scan_us, rkyv_resolve_us, total_us. In practice, simd_scan_us is the dominant term for large databases, and it scales linearly with the number of non-tombstoned vectors.

What This Achieves in Practice

At 4-bit quantization on a 1536-dimensional space, a single vector occupies 768 bytes in the blocked layout. A million vectors occupy ~768 MB - comfortably within L3 cache on many server configurations. The SIMD scan of a cold (uncached) million-vector database on a modern AVX2 machine runs in the low single-digit milliseconds. On a warm (cached) database, it drops into the hundreds of microseconds.

The design deliberately trades a small recall loss (the cost of quantization) for a very large throughput gain. For typical embedding model outputs and real-world corpora, the recall@10 at 4-bit quantization with RaBitQ-style scale correction is competitive with exact brute-force float32 search.

This article focuses on kai-core/src/simd.rs and kai-core/src/pack.rs in the KAI project. The quantization scheme that produces the codes fed into this pipeline is covered in the companion article on Lloyd-Max codebooks and RaBitQ.