This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
vLLM-lite is a lightweight LLM inference engine written in Rust, implementing key vLLM innovations: Continuous Batching, Paged KV Cache, Prefix Caching, Speculative Decoding, and an OpenAI-compatible API.
```bash
# Build the workspace
cargo build --workspace

# Run tests (all features)
just test
# or: cargo test --all-features --workspace

# Run a single test
cargo test -p vllm-core -- <test_name>
cargo test -p vllm-model -- <test_name>

# Clippy linting (required before commit)
just clippy

# Format code
just fmt

# Run the server
cargo run -p vllm-server

# Fast check (faster than build)
cargo check -p vllm-model

# Run all CI checks locally
just ci
```

The project is organized as a Rust workspace with 4 crates:
- vllm-traits (`crates/traits/`): Interface definitions - the `ModelBackend` trait, shared types (`SeqId`, `TokenId`, `Batch`)
- vllm-core (`crates/core/`): Core inference engine - Engine, Scheduler, KV cache management, metrics
- vllm-model (`crates/model/`): ML model implementations - Qwen3, attention layers, PagedKvCache, SafeTensors loader
- vllm-server (`crates/server/`): HTTP server - OpenAI-compatible API endpoints, config, auth
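To make the crate boundaries concrete, here is a minimal sketch of the shapes `vllm-traits` exposes. The field layouts and the `forward` signature are assumptions for illustration; the real definitions in `crates/traits/` may differ.

```rust
// Hypothetical sketch of the vllm-traits shapes; real signatures may differ.
pub type SeqId = u64;
pub type TokenId = u32;

/// One step's worth of work: which sequences run, and their input tokens.
pub struct Batch {
    pub seq_ids: Vec<SeqId>,
    pub tokens: Vec<Vec<TokenId>>,
}

/// The abstraction that lets the engine treat target and draft models alike.
pub trait ModelBackend {
    /// Run one forward pass, returning the next token for each sequence.
    fn forward(&mut self, batch: &Batch) -> Vec<TokenId>;
}

// Toy backend used only to show the trait in action: echoes the last token.
struct EchoBackend;
impl ModelBackend for EchoBackend {
    fn forward(&mut self, batch: &Batch) -> Vec<TokenId> {
        batch.tokens.iter().map(|t| *t.last().unwrap_or(&0)).collect()
    }
}

fn main() {
    let batch = Batch {
        seq_ids: vec![1, 2],
        tokens: vec![vec![10, 11], vec![20]],
    };
    assert_eq!(EchoBackend.forward(&batch), vec![11, 20]);
}
```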
Engine (`crates/core/src/engine.rs`): Main orchestrator managing scheduler, target/draft models, and metrics

```rust
pub struct Engine<M: ModelBackend> {
    pub scheduler: Scheduler,
    pub target_model: M,
    pub draft_model: M,
    pub max_draft_tokens: usize,
    pub metrics: MetricsCollector,
}
```

Scheduler (`crates/core/src/scheduler.rs`): Batch scheduling with waiting/running/finished queues, prefix cache, block allocator
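The core of continuous batching is an admission loop over those queues: each step, waiting sequences join the running set as long as a token budget and the sequence cap allow. The sketch below is illustrative; the struct fields and method names are assumptions, not the real `scheduler.rs` API.

```rust
use std::collections::VecDeque;

// Illustrative continuous-batching admission; names/fields are assumptions.
struct Seq {
    id: u64,
    prompt_len: usize,
}

struct Scheduler {
    waiting: VecDeque<Seq>,
    running: Vec<Seq>,
    max_batched_tokens: usize, // 4096 by default
    max_seqs: usize,           // 256 by default
}

impl Scheduler {
    /// Admit waiting sequences while the per-step token budget and the
    /// concurrent-sequence cap both allow it.
    fn schedule(&mut self) {
        // Each already-running sequence consumes one decode token this step.
        let mut budget = self.max_batched_tokens.saturating_sub(self.running.len());
        while let Some(seq) = self.waiting.front() {
            if self.running.len() >= self.max_seqs || seq.prompt_len > budget {
                break;
            }
            budget -= seq.prompt_len;
            let admitted = self.waiting.pop_front().unwrap();
            self.running.push(admitted);
        }
    }
}

fn main() {
    let mut s = Scheduler {
        waiting: VecDeque::from([
            Seq { id: 1, prompt_len: 3000 },
            Seq { id: 2, prompt_len: 2000 }, // exceeds the remaining budget
        ]),
        running: Vec::new(),
        max_batched_tokens: 4096,
        max_seqs: 256,
    };
    s.schedule();
    assert_eq!(s.running.len(), 1);
    assert_eq!(s.waiting.len(), 1);
    assert_eq!(s.running[0].id, 1);
}
```

The key design point this illustrates: scheduling is re-run every step, so sequence 2 is not rejected, only deferred until finished sequences free up budget.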
GqaAttention (`crates/model/src/qwen3/attention.rs`): Grouped-query attention with `forward_prefill` and `forward_decode` methods
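In grouped-query attention, several query heads share one KV head, shrinking the KV cache. The snippet below shows only that head-index mapping (the function name is hypothetical; the real prefill/decode kernels live in `attention.rs`):

```rust
// Hypothetical helper: which KV head serves a given query head under GQA.
fn kv_head_for(q_head: usize, num_q_heads: usize, num_kv_heads: usize) -> usize {
    assert_eq!(num_q_heads % num_kv_heads, 0, "q heads must divide evenly");
    let group_size = num_q_heads / num_kv_heads; // query heads per KV head
    q_head / group_size
}

fn main() {
    // e.g. 16 query heads over 4 KV heads: groups of 4 query heads per KV head.
    assert_eq!(kv_head_for(0, 16, 4), 0);
    assert_eq!(kv_head_for(5, 16, 4), 1);
    assert_eq!(kv_head_for(15, 16, 4), 3);
}
```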
- Make changes
- Run verification: `just ci` (fmt-check → clippy → doc-check → nextest)
- Commit with format: `<type>(<scope>): <subject>` - types: feat, fix, refactor, test, docs, chore
- Single GPU worker thread (avoid GPU contention)
- Block size: 16 tokens per KV block
- Max batched tokens: 4096 default
- Max concurrent sequences: 256 default
- Uses the `ModelBackend` trait for model abstraction
- CUDA support via Candle
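The 16-token block size means a sequence's KV footprint is a ceiling division, which is also what lets prefix caching share a prompt's full leading blocks across sequences. A quick sketch of the arithmetic (helper name is illustrative):

```rust
// 16 tokens per KV block, per the defaults above.
const BLOCK_SIZE: usize = 16;

/// Blocks required to hold a sequence's KV cache: ceil(seq_len / 16).
fn blocks_needed(seq_len: usize) -> usize {
    seq_len.div_ceil(BLOCK_SIZE)
}

fn main() {
    assert_eq!(blocks_needed(16), 1);
    assert_eq!(blocks_needed(17), 2); // one extra token opens a new block
    assert_eq!(blocks_needed(4096), 256); // a full default batch of tokens
}
```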