All notable changes to vLLM-lite will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- **MambaBlock Weight Loading**
  - Added `MambaBlock::from_weights` method to load SSM layer weights
  - Implemented full weight loading for Qwen3.5 Mamba models
  - Supports fallback for `embed_tokens` and `lm_head` weight names
  - Supports tied embeddings (`tie_word_embeddings`)
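The weight-name fallback described above can be sketched as a lookup over candidate names. This is a minimal illustration, not the crate's actual API: `WeightMap` and `load_with_fallback` are hypothetical stand-ins for the real safetensors-backed loader.

```rust
use std::collections::HashMap;

/// Stand-in for a safetensors weight map (illustration only).
type WeightMap = HashMap<String, Vec<f32>>;

/// Try each candidate name in order and return the first weight found
/// (hypothetical helper, not the real loader API).
fn load_with_fallback<'a>(weights: &'a WeightMap, names: &[&str]) -> Option<&'a Vec<f32>> {
    names.iter().find_map(|n| weights.get(*n))
}

fn main() {
    let mut weights = WeightMap::new();
    // Some checkpoints prefix weights with "model.", others do not.
    weights.insert("model.embed_tokens.weight".to_string(), vec![0.1, 0.2]);

    let embed = load_with_fallback(&weights, &["embed_tokens.weight", "model.embed_tokens.weight"])
        .expect("embedding weight missing");

    // Tied embeddings: when tie_word_embeddings is set and lm_head is
    // absent from the checkpoint, reuse the embedding matrix.
    let lm_head = load_with_fallback(&weights, &["lm_head.weight", "model.lm_head.weight"])
        .unwrap_or(embed);

    assert_eq!(lm_head, embed);
    println!("loaded {} embedding values", embed.len());
}
```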
- **Scheduler Module Split**
  - Split monolithic `scheduler.rs` into focused submodules
  - Created `scheduler/queue.rs` with `RequestQueue` for queue management
  - Created `scheduler/preemption.rs` with `PreemptionManager` for preemption decisions
  - Created `scheduler/eviction.rs` with `EvictionPolicy` for block eviction
  - Fully integrated all modules into the `Scheduler` struct
- **KV Cache Layer Separation**
  - Split `core/kv_cache.rs` into `kv_cache/block_allocator.rs` and `kv_cache/prefix_cache.rs`
  - Created `model/paged_tensor/` module (separating logical and physical KV cache)
    - `tensor_store.rs` for GPU KV tensor management
    - `quantization.rs` for INT8/FP8 quantization
  - Added a deprecated alias in `kv_cache.rs` for backward compatibility
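The core of a paged-KV block allocator is a free list of fixed-size block ids. The sketch below is illustrative only; the names do not come from the `block_allocator.rs` module mentioned above.

```rust
/// Minimal sketch of a paged KV block allocator: hands out fixed-size
/// block ids from a free list (names are illustrative, not the crate's API).
struct BlockAllocator {
    free: Vec<usize>,
}

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        // Reversed so allocation order starts at block 0; blocks are
        // recycled LIFO, but any order is correct.
        Self { free: (0..num_blocks).rev().collect() }
    }
    fn num_free(&self) -> usize { self.free.len() }
    fn allocate(&mut self) -> Option<usize> { self.free.pop() }
    fn release(&mut self, block: usize) { self.free.push(block); }
}

fn main() {
    let mut alloc = BlockAllocator::new(4);
    let b0 = alloc.allocate().unwrap();
    let b1 = alloc.allocate().unwrap();
    assert_eq!(alloc.num_free(), 2);
    alloc.release(b0);
    assert_eq!(alloc.num_free(), 3);
    println!("allocated blocks {b0} and {b1}");
}
```

Returning `None` when no block is free is the signal the scheduler uses to stop admitting (or to preempt) requests.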
- **Kernel Layer Extraction**
  - Created `model/kernels/` directory for GPU kernels
  - Moved `flash_attention.rs` → `kernels/flash_attention.rs`
  - Moved `fused_kernel.rs` → `kernels/fused_mlp.rs`
  - Moved `cuda_graph.rs` from core to `model/kernels/cuda_graph.rs`
  - Updated `components/` to use the kernels module
- **Quantization Support**
  - FP16 support
  - INT8 weight-only quantization (`QuantizedLinear`, `quantize_2d`)
  - INT8 KV cache with per-layer scaling
  - `QuantizationCalibrator` for calibration
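Weight-only INT8 quantization is typically per-row absmax scaling; a minimal sketch of that scheme follows. It is an assumption that the crate's `quantize_2d` works this way — the helper names here are hypothetical.

```rust
/// Per-row absmax INT8 quantization: scale = absmax / 127, so the
/// largest-magnitude element maps to ±127 (illustrative sketch).
fn quantize_row(row: &[f32]) -> (Vec<i8>, f32) {
    let absmax = row.iter().fold(0f32, |m, x| m.max(x.abs()));
    let scale = if absmax == 0.0 { 1.0 } else { absmax / 127.0 };
    let q = row.iter().map(|x| (x / scale).round() as i8).collect();
    (q, scale)
}

fn dequantize_row(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let row = [0.5f32, -1.0, 0.25];
    let (q, scale) = quantize_row(&row);
    let back = dequantize_row(&q, scale);
    // Round-trip error is bounded by half a quantization step per element.
    for (a, b) in row.iter().zip(&back) {
        assert!((a - b).abs() <= scale / 2.0 + 1e-6);
    }
    println!("quantized {:?} with scale {scale}", q);
}
```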
- **Compute Optimization**
  - Flash Attention framework with software fallback (`FlashAttention`, `ScaledDotProductAttention`)
  - Sliding window attention support
  - CUDA Graph framework (`CudaGraph`, `CudaGraphExecutor`)
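A software fallback like the `ScaledDotProductAttention` mentioned above amounts to the naive formula `softmax(QKᵀ/√d)·V`. The single-head sketch below is illustrative only and does not mirror the crate's tensor types.

```rust
/// Naive single-head scaled dot-product attention, the kind of software
/// fallback used when a fused kernel is unavailable (sketch only).
fn sdpa(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let d = q[0].len() as f32;
    q.iter().map(|qi| {
        // scores = q · kᵀ / sqrt(d)
        let scores: Vec<f32> = k.iter()
            .map(|kj| qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() / d.sqrt())
            .collect();
        // Numerically stable softmax over the score row.
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
        let sum: f32 = exps.iter().sum();
        // Weighted sum over value rows.
        (0..v[0].len()).map(|c| {
            exps.iter().zip(v).map(|(w, vj)| w / sum * vj[c]).sum()
        }).collect()
    }).collect()
}

fn main() {
    let q = vec![vec![1.0, 0.0]];
    let kv = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let out = sdpa(&q, &kv, &kv);
    println!("{:?}", out[0]);
}
```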
- **Scheduling Optimization**
  - PD Separation (Prefill/Decode separation)
  - Chunked Prefill with configurable chunk size
  - Dynamic batch size based on available KV blocks
  - Priority-based scheduling (`Priority`, `enable_priority_scheduling`)
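Dynamic batching against free KV blocks combines naturally with priority ordering: admit the highest-priority requests first while block budget remains. This is a minimal sketch under assumed semantics; `Request` and `schedule_batch` here are illustrative, not the scheduler's real types.

```rust
/// Hypothetical request descriptor for the sketch below.
struct Request { priority: u8, blocks_needed: usize }

/// Admit requests highest-priority-first while enough free KV blocks
/// remain; the rest stay in the waiting queue (illustrative only).
fn schedule_batch(mut waiting: Vec<Request>, mut free_blocks: usize) -> Vec<Request> {
    // Higher priority value runs first in this sketch.
    waiting.sort_by(|a, b| b.priority.cmp(&a.priority));
    let mut batch = Vec::new();
    for req in waiting {
        if req.blocks_needed <= free_blocks {
            free_blocks -= req.blocks_needed;
            batch.push(req);
        }
    }
    batch
}

fn main() {
    let waiting = vec![
        Request { priority: 1, blocks_needed: 8 },
        Request { priority: 3, blocks_needed: 6 },
        Request { priority: 2, blocks_needed: 4 },
    ];
    let batch = schedule_batch(waiting, 10);
    // Priority 3 (6 blocks) and priority 2 (4 blocks) fit; priority 1 waits.
    assert_eq!(batch.len(), 2);
    println!("batched {} requests", batch.len());
}
```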
- **Distributed**
  - Multi-GPU tensor parallelism (`DeviceMesh`, `ColumnParallelLinear`, `RowParallelLinear`, `AllReduce`)
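The idea behind column parallelism can be shown in a single process: each "device" owns a slice of the weight's output columns, and concatenating the partial outputs stands in for an all-gather (a row-parallel split would instead need an all-reduce sum). This sketch is purely illustrative and unrelated to the crate's `DeviceMesh` API.

```rust
/// Multiply a vector by a set of weight columns; w_cols[j] is the j-th
/// output column, so output[j] = x · w_cols[j] (illustration only).
fn matmul_vec(x: &[f32], w_cols: &[Vec<f32>]) -> Vec<f32> {
    w_cols.iter()
        .map(|col| x.iter().zip(col).map(|(a, b)| a * b).sum())
        .collect()
}

fn main() {
    let x = vec![1.0, 2.0];
    // Full weight has 4 output columns, split 2 + 2 across two shards.
    let shard0 = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let shard1 = vec![vec![1.0, 1.0], vec![2.0, 0.0]];

    let mut out = matmul_vec(&x, &shard0);
    out.extend(matmul_vec(&x, &shard1)); // stands in for all-gather

    assert_eq!(out, vec![1.0, 2.0, 3.0, 2.0]);
    println!("{:?}", out);
}
```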
- Request timeout support (`timeout` parameter)
- Graceful shutdown (SIGINT/SIGTERM handling)
- YAML configuration file support
- Environment variable overrides (`VLLM_HOST`, `VLLM_PORT`, etc.)
- Structured JSON logging with file rotation
- Grafana dashboard (`docs/grafana/dashboard.json`)
- Config validation on startup
- Error retry support (`retries` parameter)
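Environment-variable overrides layer on top of file/default config. The variable names below match the list above; the `ServerConfig` struct and parsing are an illustrative assumption, not the server's actual code.

```rust
use std::env;

/// Hypothetical config struct for this sketch.
struct ServerConfig { host: String, port: u16 }

/// Apply VLLM_HOST / VLLM_PORT overrides on top of an existing config.
fn apply_env_overrides(mut cfg: ServerConfig) -> ServerConfig {
    if let Ok(host) = env::var("VLLM_HOST") {
        cfg.host = host;
    }
    if let Ok(port) = env::var("VLLM_PORT") {
        // Ignore unparsable values rather than panic; real code should
        // surface this during the startup validation pass.
        if let Ok(p) = port.parse() {
            cfg.port = p;
        }
    }
    cfg
}

fn main() {
    env::set_var("VLLM_PORT", "9000");
    let cfg = apply_env_overrides(ServerConfig { host: "0.0.0.0".into(), port: 8000 });
    assert_eq!(cfg.port, 9000);
    println!("{}:{}", cfg.host, cfg.port);
}
```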
- Real-time metrics collection with `/v1/stats` and `/metrics` endpoints
- Quantization utilities (`crates/model/src/quantize.rs`)
- Tiled Attention for memory optimization
  - INT8 quantization support in the KV cache
  - Forward pass with tiled attention strategy
  - Comprehensive test suite for tiled attention
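The memory saving in tiled attention comes from the streaming-softmax trick: process scores tile by tile while keeping a running max and normalizer, so the full score row is never materialized. The one-dimensional sketch below illustrates the recurrence; it is an assumption about the technique in general, not this crate's implementation.

```rust
/// Streaming (online) softmax-weighted sum over score/value tiles:
/// maintains a running max, denominator, and accumulator, rescaling
/// previous partial sums whenever the max grows (illustrative sketch).
fn streaming_softmax_weighted_sum(score_tiles: &[Vec<f32>], value_tiles: &[Vec<f32>]) -> f32 {
    let (mut max, mut denom, mut acc) = (f32::NEG_INFINITY, 0f32, 0f32);
    for (scores, values) in score_tiles.iter().zip(value_tiles) {
        for (&s, &v) in scores.iter().zip(values) {
            let new_max = max.max(s);
            let rescale = (max - new_max).exp(); // shrink old terms if max grew
            denom = denom * rescale + (s - new_max).exp();
            acc = acc * rescale + (s - new_max).exp() * v;
            max = new_max;
        }
    }
    acc / denom
}

fn main() {
    // Processing in two tiles must agree with one flat pass.
    let tiled = streaming_softmax_weighted_sum(
        &[vec![1.0, 2.0], vec![3.0, 0.5]],
        &[vec![10.0, 20.0], vec![30.0, 5.0]],
    );
    let flat = streaming_softmax_weighted_sum(
        &[vec![1.0, 2.0, 3.0, 0.5]],
        &[vec![10.0, 20.0, 30.0, 5.0]],
    );
    assert!((tiled - flat).abs() < 1e-4);
    println!("weighted sum = {tiled}");
}
```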
- Improved documentation structure (README.md, docs/README.md, ROADMAP.md)
- Added detailed development roadmap
- Clippy warnings and code quality improvements
- Test compatibility with new AttentionConfig
- Continuous Batching - Dynamic batch scheduling with decode-priority
- Paged KV Cache - Memory-efficient cache management with LRU eviction
- Prefix Caching - Exact match and prefix hit support
- Speculative Decoding - Draft-target verification architecture
- Qwen3 Model Integration - Support for Qwen2.5-0.5B model with real weights
- OpenAI-compatible API - `/v1/completions`, `/v1/chat/completions`
- Streaming (SSE) - Real-time token streaming
- Sampling - Temperature, Top-P support
- Chunked Prefill - Process long prompts in chunks
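Prefix caching of the kind listed above is commonly keyed by chained hashes of full token blocks: a new prompt reuses KV blocks for as long as its leading block hashes are already cached. The sketch below illustrates that scheme; the hashing layout and helper names are assumptions, not this project's actual format.

```rust
use std::collections::HashMap;
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash each full block of token ids, chained with the previous block's
/// hash so a block's identity includes its entire prefix (sketch only).
fn block_hashes(tokens: &[u32], block_size: usize) -> Vec<u64> {
    let mut prev = 0u64;
    tokens.chunks_exact(block_size).map(|chunk| {
        let mut h = DefaultHasher::new();
        (prev, chunk).hash(&mut h);
        prev = h.finish();
        prev
    }).collect()
}

/// Count how many leading blocks of a new prompt are already cached.
fn cached_prefix_blocks(cache: &HashMap<u64, usize>, tokens: &[u32], block_size: usize) -> usize {
    block_hashes(tokens, block_size)
        .into_iter()
        .take_while(|h| cache.contains_key(h))
        .count()
}

fn main() {
    let block_size = 2;
    let mut cache = HashMap::new();
    // Cache the blocks of a previously served prompt.
    for (i, h) in block_hashes(&[1, 2, 3, 4], block_size).iter().enumerate() {
        cache.insert(*h, i);
    }
    // A new prompt sharing only the first block gets one block for free.
    let hits = cached_prefix_blocks(&cache, &[1, 2, 9, 9], block_size);
    assert_eq!(hits, 1);
    println!("{hits} cached prefix block(s)");
}
```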
- 3-Crate Structure:
  - `vllm-core`: Scheduler, Engine, KV Cache, Types
  - `vllm-model`: Qwen3, Attention, MLP
  - `vllm-server`: HTTP API (axum)
- Rust (edition 2021)
- Candle (ML backend)
- Axum (HTTP)
- Tokio (async runtime)
- SafeTensors (weight loading)
No migration needed - initial release.
- Limited model support (Qwen3 only)
- No multi-GPU support
- Quantization in progress
Thanks to all contributors and the vLLM project for inspiration.