
Bridging Async and SIMD: The Tokio + Rayon Architecture of a High-Performance Rust Server

Writing a high-performance server in Rust is easy to do wrong. The ecosystem’s defaults - Tokio for async I/O, Rayon for data-parallel CPU work - are individually excellent, but combining them naively in a database workload produces a system where HTTP spikes destroy search latency and SIMD throughput collapses under concurrent load.

KAI’s architecture is a careful study in keeping these two worlds separate. This article explains the design, the specific failure modes it avoids, and the Rust patterns that make it work cleanly.


The Two Worlds of a Vector Server

A vector search server has two fundamentally different workload profiles:

I/O-bound work: Accepting TCP connections, parsing JSON request bodies, serializing JSON responses, writing to disk on ingest, reading the health check state. This is bursty, latency-sensitive, and mostly waiting - the ideal workload for an async runtime like Tokio. The CPU is barely touched; you just want to handle thousands of concurrent operations without blocking OS threads.

CPU-bound work: The SIMD scan over 768 MB of quantized codes. Matrix-vector multiply for the rotation step. Bit-packing during ingest encoding. This work is compute-intensive, cache-sensitive, and wants to run on dedicated CPU cores without interruption. Context switches mid-scan evict the quantized vector array from L3 cache, causing the next scan to re-fetch from DRAM - a catastrophic latency penalty.

If you run the SIMD scan directly on Tokio’s async worker threads, you get the worst of both worlds: the Tokio executor is starved (its threads are blocked on CPU work), and the SIMD scan is interrupted by I/O scheduling and cache evictions from concurrent HTTP parsing.


Axum + Tokio: The HTTP Layer

KAI’s server is built on Axum, a high-level async web framework from the Tokio project. Each route handler is an async fn that runs on Tokio’s multi-threaded executor:

// Route definition
use axum::{
    routing::{delete, get, post},
    Router,
};

let app = Router::new()
    .route("/api/search", post(search_endpoint))
    .route("/api/documents", post(ingest_endpoint))
    .route("/api/documents/batch", post(batch_ingest_endpoint))
    .route("/api/documents/:id", delete(delete_endpoint))
    .route("/api/documents/parent/:parent_doc_id", get(parent_endpoint))
    .route("/api/health", get(health_endpoint));

// Runs inside an async fn that returns a Result
axum::serve(listener, app).await?;

Axum handles request routing, body parsing, and response serialization. Each handler receives typed extractors (Json<T>, Path<String>, State<AppState>) and returns impl IntoResponse. The framework dispatches requests to Tokio’s thread pool, which multiplexes hundreds of concurrent connections across a handful of OS threads.

The shared state (AppState) wraps the index and warm tier in Arc<RwLock<_>>:

#[derive(Clone)]
struct AppState {
    index: Arc<RwLock<KaiIndex>>,
    warm_tier: Arc<RwLock<WarmTier>>,
}

Arc gives shared ownership across threads. RwLock allows multiple concurrent readers (searches) or a single exclusive writer (ingest, delete, compaction). The crucial constraint: the RwLock must never be held across an .await point. A task suspended at .await keeps its guard while the executor runs other tasks, so every search or ingest that needs the lock stalls until that task is polled again. If the stalled tasks are blocking the very worker threads the suspended task needs in order to resume, the server deadlocks.
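One way to keep a guard away from any .await is to confine it to a small synchronous helper that copies out only what the handler needs. The sketch below assumes std::sync::RwLock; the Index struct and the vector_count helper are hypothetical stand-ins, not KAI's actual types.

```rust
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for the real index type.
pub struct Index {
    pub vectors: Vec<f32>,
    pub dim: usize, // assumed to be non-zero
}

/// Takes the read lock, copies out a small value, and releases the lock
/// before returning. The guard never escapes this function, so a caller
/// can `.await` afterwards without holding the lock.
pub fn vector_count(index: &Arc<RwLock<Index>>) -> usize {
    let guard = index.read().unwrap();
    guard.vectors.len() / guard.dim
    // `guard` is dropped here, releasing the read lock.
}
```

Because the helper returns a plain usize, nothing borrowed from the guard can leak into the async part of a handler.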


spawn_blocking: Crossing the Async/Sync Boundary

For CPU-intensive work, KAI uses tokio::task::spawn_blocking. This moves work onto a dedicated thread pool managed by Tokio (separate from the async worker threads) and allows the async handler to await its completion without blocking the executor:

async fn search_endpoint(
    State(state): State<AppState>,
    Json(req): Json<SearchRequest>,
) -> impl IntoResponse {
    // Async: deserialize, validate (fast, non-blocking)
    
    let result = tokio::task::spawn_blocking(move || {
        // Sync: acquire read lock, run SIMD scan
        let index = state.index.read().unwrap();
        let warm_tier = state.warm_tier.read().unwrap();
        simd_search(&index, &warm_tier, &req)
    }).await.unwrap();
    
    Json(result)
}

spawn_blocking bridges async → sync. The async handler suspends at .await, the blocking thread runs the SIMD scan to completion, and then the async handler resumes with the result. The Tokio executor is free to handle other requests during the scan.

For ingest, the operation is both I/O-bound (write to .tvdb, fsync) and CPU-bound (encode the vector). spawn_blocking handles both phases - the blocking thread takes the write lock, encodes, appends, and syncs before returning.


The Server Isolation Feature: Dedicated SIMD Threads

spawn_blocking solves the executor starvation problem. But there’s a second, subtler issue: Tokio’s blocking thread pool is shared with any other blocking work in the process. If a slow ingest (with its disk fsync) runs on the same thread pool as a SIMD search, the search thread may share CPU scheduling time with the ingest thread. More critically, Rayon’s internal thread pool (used for the matrix multiply and batch encoding) can itself contend with Tokio’s blocking pool.

The server-isolation feature adds a dedicated Rayon thread pool for SIMD operations:

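// Assumes `Lazy` comes from the once_cell crate
use once_cell::sync::Lazy;

#[cfg(feature = "server-isolation")]
static SEARCH_POOL: Lazy<rayon::ThreadPool> = Lazy::new(|| {
    rayon::ThreadPoolBuilder::new()
        // 0 lets Rayon choose the thread count (by default, the number of logical CPUs)
        .num_threads(0)
        .thread_name(|i| format!("search-workers-{i}"))
        .build()
        .unwrap()
});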

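When this feature is enabled, SIMD searches are dispatched through SEARCH_POOL.install(|| ...) rather than the global Rayon pool. This gives vector scoring its own worker threads and work queue, so global Rayon tasks cannot occupy search threads or queue work ahead of a scan. A separate pool does not by itself pin threads to cores. Reserving cores for vector scoring additionally requires setting thread affinity explicitly (for example, from the builder's start_handler), so that Tokio’s threads and any global Rayon tasks run on the remaining cores.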

The benefit shows up under high concurrent load. Without server isolation, a burst of 50 concurrent search requests causes Tokio and Rayon to contend for the same CPU cores, producing L3 cache evictions and inter-thread interference that spikes P99 latency. With server isolation, the search pool runs uninterrupted on its dedicated cores, and the L3 cache stays warm with the quantized vector array across consecutive scans.
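The isolation idea can be illustrated without Rayon. The sketch below is a std-only stand-in, not Rayon and not KAI's implementation: a fixed set of named threads pulls jobs from one queue, and install blocks the caller until a pool thread has run the closure. DedicatedPool and its methods are hypothetical names, and core pinning is not shown.

```rust
use std::sync::mpsc::{self, Sender};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};

type Job = Box<dyn FnOnce() + Send + 'static>;

/// A fixed set of named worker threads that only run jobs sent to this pool.
pub struct DedicatedPool {
    sender: Option<Sender<Job>>,
    workers: Vec<JoinHandle<()>>,
}

impl DedicatedPool {
    pub fn new(num_threads: usize) -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let receiver = Arc::new(Mutex::new(receiver));
        let workers = (0..num_threads)
            .map(|i| {
                let receiver = Arc::clone(&receiver);
                thread::Builder::new()
                    .name(format!("search-workers-{i}"))
                    .spawn(move || loop {
                        // The mutex guard is a temporary, released before the job runs.
                        let next = receiver.lock().unwrap().recv();
                        match next {
                            Ok(job) => job(),
                            Err(_) => break, // queue closed: the pool is shutting down
                        }
                    })
                    .expect("failed to spawn worker thread")
            })
            .collect();
        Self { sender: Some(sender), workers }
    }

    /// Runs `f` on one of the pool's threads and blocks until it returns.
    pub fn install<T, F>(&self, f: F) -> T
    where
        F: FnOnce() -> T + Send + 'static,
        T: Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            let _ = tx.send(f());
        });
        self.sender
            .as_ref()
            .expect("pool is shut down")
            .send(job)
            .expect("pool has no workers");
        rx.recv().expect("worker dropped the job")
    }
}

impl Drop for DedicatedPool {
    fn drop(&mut self) {
        drop(self.sender.take()); // close the queue so workers leave their loops
        for worker in self.workers.drain(..) {
            let _ = worker.join();
        }
    }
}
```

Work submitted through install only ever runs on threads named search-workers-N, which is the property the dedicated Rayon pool provides for SIMD scans. Rayon adds work stealing and scoped parallelism on top of this.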


The MemTable: Ingest Without Blocking Search

The bulk-ingest feature introduces a MemTable - an in-memory buffer that stages recently ingested vectors before they’re merged into the cold SIMD-blocked tier. This is the same pattern used by LSM-tree databases (LevelDB, RocksDB): write to a fast in-memory structure first, flush to durable storage in the background.

The MemTable in KAI is protected by its own Arc<RwLock<BulkIngestBuffer>>:

pub struct BulkIngestBuffer {
    vectors: Vec<f32>,
    metadata: Vec<String>,
}

On ingest, vectors are quantized inline (encode → bit-pack) and appended to the staging buffer. When staging_ids.len() reaches batch_threshold, a repack (pack::repack) converts the staging buffer into SIMD-blocked layout.
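The threshold check described above might be sketched as follows. StagingBuffer, push, and take_batch are hypothetical names; only staging_ids and batch_threshold come from the text, and the flat byte layout of the encoded vectors is an assumption.

```rust
/// Hypothetical staging buffer: ids plus already-encoded, bit-packed codes.
pub struct StagingBuffer {
    staging_ids: Vec<u64>,
    staging_codes: Vec<u8>,
    code_len: usize, // bytes per encoded vector (assumed fixed)
    batch_threshold: usize,
}

impl StagingBuffer {
    pub fn new(code_len: usize, batch_threshold: usize) -> Self {
        Self {
            staging_ids: Vec::new(),
            staging_codes: Vec::new(),
            code_len,
            batch_threshold,
        }
    }

    /// Appends one encoded vector. Returns true once the batch threshold
    /// is reached, signalling that the caller should repack.
    pub fn push(&mut self, id: u64, code: &[u8]) -> bool {
        assert_eq!(code.len(), self.code_len, "unexpected code length");
        self.staging_ids.push(id);
        self.staging_codes.extend_from_slice(code);
        self.staging_ids.len() >= self.batch_threshold
    }

    /// Hands the staged batch to the caller and leaves the buffer empty.
    pub fn take_batch(&mut self) -> (Vec<u64>, Vec<u8>) {
        (
            std::mem::take(&mut self.staging_ids),
            std::mem::take(&mut self.staging_codes),
        )
    }
}
```

A caller would pass the batch returned by take_batch to the repack step. The repack itself (pack::repack in the text) is not shown here.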

Searches cover both the main index (warm tier, loaded from the .tv file) and the MemTable. The scores from both passes are merged before top-k selection:

// In search:
let warm_scores = simd::search(&index.blocked_codes, …);
let memtable_scores = simd::search(&memtable.blocked_codes, …);
let merged = merge_topk(warm_scores, memtable_scores, k);

This gives zero-lag searchability: a vector ingested a millisecond ago is already in the MemTable and will be returned by search. There is no “index build” step to wait for.
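A merge step like the merge_topk call in the search snippet could be as simple as the sketch below. The (id, score) tuple representation and the "higher score is better" convention are assumptions; the source does not show the actual types.

```rust
/// Merges two candidate lists and keeps the k highest-scoring entries.
/// Assumes each entry is (chunk id, score) and that higher scores rank first.
pub fn merge_topk(
    warm_scores: Vec<(u64, f32)>,
    memtable_scores: Vec<(u64, f32)>,
    k: usize,
) -> Vec<(u64, f32)> {
    let mut merged = warm_scores;
    merged.extend(memtable_scores);
    // Sort descending by score; total_cmp gives a total order even with NaN.
    merged.sort_unstable_by(|a, b| b.1.total_cmp(&a.1));
    merged.truncate(k);
    merged
}
```

Sorting the full candidate list is the simplest correct choice. For large candidate sets, a bounded heap or select_nth_unstable_by would avoid ordering entries that are discarded anyway.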

The lock ordering matters here. KAI’s codebase has a strict invariant: no handler holds the index lock and the warm_tier lock simultaneously, and neither is held across an .await. Violating this would create deadlock possibilities that are extremely hard to debug under concurrent load. The MemTable has its own separate lock, acquired independently.


Zero-Copy Serialization with rkyv

A search that returns 10 text results has to fetch 10 document chunks from the warm tier and serialize them to JSON. In a naïve implementation, this involves:

  1. malloc a new DocumentChunk struct.
  2. Copy the bytes from the mmap into it.
  3. Serialize to JSON.

KAI uses rkyv to skip step 1 and step 2. The warm tier stores serialized ArchivedDocumentChunk structs directly in the mmap. A retrieval call returns an &ArchivedDocumentChunk - a reference into the mapped memory with zero copy:

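pub fn get_chunk(&self, chunk_id: u64) -> Option<&ArchivedDocumentChunk> {
    let offset = self.id_to_offset.get(&chunk_id)?;
    // archived_root expects the archive's root object at the END of the slice.
    // If chunks are stored back to back, bound the slice at this chunk's end
    // or use rkyv::archived_value with the root position instead.
    let payload = &self.metadata_map[*offset..];
    // SAFETY: payload must hold a valid, correctly aligned archive, and the
    // caller's read guard must keep the mmap alive while the reference is used.
    Some(unsafe { rkyv::archived_root::<DocumentChunk>(payload) })
}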

The lifetime of this reference is tied to the RwLock read guard held by the caller. As long as the guard is alive, the mmap is guaranteed not to be retired (the VACUUM swap requires a write lock). This is the formal safety invariant that makes the unsafe block sound.
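The borrow relationship can be illustrated without rkyv or mmap. In the sketch below, a plain byte buffer stands in for the mapped file; the WarmTier layout, the id_to_range table, and chunk_len are hypothetical and exist only to show how the compiler ties the returned reference to the guard.

```rust
use std::collections::HashMap;
use std::sync::{RwLock, RwLockReadGuard};

// Hypothetical stand-in for the warm tier: bytes plus an id -> (offset, len) table.
pub struct WarmTier {
    pub bytes: Vec<u8>,
    pub id_to_range: HashMap<u64, (usize, usize)>,
}

impl WarmTier {
    /// The returned slice borrows from `self`, so it cannot outlive
    /// whatever is keeping this WarmTier borrowed (here, a read guard).
    pub fn get_chunk(&self, chunk_id: u64) -> Option<&[u8]> {
        let &(offset, len) = self.id_to_range.get(&chunk_id)?;
        self.bytes.get(offset..offset + len)
    }
}

/// Uses a chunk while the read guard is alive and returns only owned data.
pub fn chunk_len(tier: &RwLock<WarmTier>, chunk_id: u64) -> Option<usize> {
    let guard: RwLockReadGuard<'_, WarmTier> = tier.read().unwrap();
    let chunk = guard.get_chunk(chunk_id)?; // borrows from `guard`
    Some(chunk.len())
    // Returning `chunk` itself would not compile: it cannot outlive `guard`.
}
```

While any such reference is alive, the read guard is alive too, so a writer that wants to swap the underlying storage has to wait.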

The rkyv ArchivedDocumentChunk can be serialized to JSON directly from its archived form, without materializing an owned DocumentChunk. For a database that returns thousands of results per second, this is a meaningful throughput gain.


Profiling: Built-in Microsecond Timing

Every endpoint supports a ?profile=true query parameter that adds a timings field to the response. Profiling is opt-in for debugging: when the parameter is absent, as it is in most requests, no timings are collected and the only cost is checking the flag.

{
  "results": [...],
  "timings": {
    "json_parse_us": 12,
    "lock_acquire_us": 3,
    "simd_scan_us": 847,
    "rkyv_resolve_us": 41,
    "total_us": 912
  }
}

Timings are captured with std::time::Instant::now() around each phase inside the spawn_blocking closure. The granularity is microseconds on both Linux (via clock_gettime(CLOCK_MONOTONIC)) and Windows (via QueryPerformanceCounter).
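A minimal sketch of this kind of phase timing, using only std::time::Instant. The helper names timed_us and run_phase are hypothetical, and the Option slot is just one way to skip measurement when profiling is off.

```rust
use std::time::Instant;

/// Runs `f` and returns its result with the elapsed time in whole microseconds.
pub fn timed_us<T>(f: impl FnOnce() -> T) -> (T, u128) {
    let start = Instant::now();
    let out = f();
    (out, start.elapsed().as_micros())
}

/// Runs one phase, recording its duration in `slot` only when profiling is on.
pub fn run_phase<T>(profile: bool, slot: &mut Option<u128>, f: impl FnOnce() -> T) -> T {
    if profile {
        let (out, us) = timed_us(f);
        *slot = Some(us);
        out
    } else {
        f()
    }
}
```

Each phase (lock acquisition, SIMD scan, chunk resolution) would get its own slot, and the filled slots would be serialized into the timings object.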

The simd_scan_us field is the most operationally useful. It isolates the mathematical engine’s contribution from JSON overhead and lock contention, making it easy to profile the effect of adding more vectors or enabling server isolation.


Graceful Shutdown and Daemon Lifecycle

KAI’s server is designed to run as a background daemon. The CLI manages its lifecycle:

kai server start # spawn the server, break away from terminal

kai server status # check PID, vector count, health

kai server stop # send POST /api/admin/shutdown

The shutdown endpoint triggers the server's graceful shutdown (in Axum, the signal passed to with_graceful_shutdown on the future returned by axum::serve), which stops accepting new connections and waits for all in-progress handlers to complete before the process exits. This ensures no in-flight ingest is interrupted mid-write.

On Windows, the daemon breakaway uses WMI (Win32_Process.Create) to spawn the server in a context completely independent of the parent terminal’s Job Object. Without this, closing the terminal window kills the server - a common source of data loss in naive daemon implementations. On Unix, a standard double-fork + setsid achieves the equivalent.


Summary: The Architecture at a Glance

Each layer has a clear responsibility. The Tokio executor handles concurrency without ever touching the vector math. The blocking pool handles synchronous I/O and lock acquisition. The Rayon pool handles SIMD computation with dedicated CPU cores. The shared state uses RwLock to allow maximum read concurrency with exclusive write access during ingests and compaction.

This separation is what makes KAI’s latency characteristics stable under load - adding more concurrent HTTP connections doesn’t degrade search latency, because the SIMD work runs on isolated cores that HTTP traffic never touches.


This article covers kai-server/src/main.rs, kai-core/src/index.rs, and the server-isolation and bulk-ingest feature implementations in the KAI project.
