Changelog

All notable changes to vLLM-lite will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

Added

MambaBlock Weight Loading
- Added MambaBlock::from_weights method to load SSM layer weights
- Implemented full weight loading for Qwen3.5 Mamba models
- Supports fallback for embed_tokens and lm_head weight names
- Supports tied embeddings (tie_word_embeddings)
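
The name fallback and tied-embedding behavior listed above could look roughly like the following. This is a minimal sketch, not the actual `MambaBlock::from_weights` code: the `Weights` alias, both helper functions, and the candidate weight names are hypothetical, and a plain `Vec<f32>` stands in for a real tensor type.

```rust
use std::collections::HashMap;

/// Stand-in for a tensor; a real loader would hold backend tensors instead.
type Weights = Vec<f32>;

/// Returns the first candidate name that is present in the checkpoint.
fn find_weight<'a>(
    ckpt: &'a HashMap<String, Weights>,
    candidates: &[&str],
) -> Option<&'a Weights> {
    candidates.iter().find_map(|name| ckpt.get(*name))
}

/// Resolves the embedding and output-projection weights.
/// With tied embeddings, a missing lm_head falls back to the embedding matrix.
fn resolve_embed_and_head(
    ckpt: &HashMap<String, Weights>,
    tie_word_embeddings: bool,
) -> Result<(Weights, Weights), String> {
    // Hypothetical candidate names; real checkpoints may use other prefixes.
    let embed = find_weight(ckpt, &["model.embed_tokens.weight", "embed_tokens.weight"])
        .ok_or_else(|| "embed_tokens weight not found".to_string())?;
    let head = match find_weight(ckpt, &["lm_head.weight", "model.lm_head.weight"]) {
        Some(w) => w,
        None if tie_word_embeddings => embed,
        None => return Err("lm_head weight not found and embeddings are not tied".to_string()),
    };
    Ok((embed.clone(), head.clone()))
}
```

Trying the candidate names in order keeps the loader tolerant of checkpoints exported with or without a `model.` prefix, while an untied model with no `lm_head` still fails loudly.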

Refactored

Architecture Refactoring

Scheduler Module Split
- Split monolithic scheduler.rs into focused submodules
- Created scheduler/queue.rs with RequestQueue for queue management
- Created scheduler/preemption.rs with PreemptionManager for preemption decisions
- Created scheduler/eviction.rs with EvictionPolicy for block eviction
- Fully integrated all modules into Scheduler struct

KV Cache Layer Separation
- Split core/kv_cache.rs into kv_cache/block_allocator.rs and kv_cache/prefix_cache.rs
- Created model/paged_tensor/ module (separating logical and physical KV cache)
  - tensor_store.rs for GPU KV tensor management
  - quantization.rs for INT8/FP8 quantization
- Added deprecated alias in kv_cache.rs for backward compatibility

Kernel Layer Extraction
- Created model/kernels/ directory for GPU kernels
- Moved flash_attention.rs → kernels/flash_attention.rs
- Moved fused_kernel.rs → kernels/fused_mlp.rs
- Moved cuda_graph.rs from core to model/kernels/cuda_graph.rs
- Updated components/ to use kernels module

Added

Phase 4: Performance Optimization

Quantization Support
- FP16 support
- INT8 Weight-Only quantization (QuantizedLinear, quantize_2d)
- INT8 KV Cache with per-layer scaling
- QuantizationCalibrator for calibration
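
As an illustration of the INT8 weight-only idea, the sketch below quantizes a single row with one symmetric scale. It assumes per-row symmetric scaling; the real `quantize_2d` and `QuantizedLinear` may use a different granularity or signature, and both function names here are hypothetical.

```rust
/// Quantizes one row symmetrically to INT8 and returns (values, scale).
/// Each value is recovered approximately as `q as f32 * scale`.
fn quantize_row_int8(row: &[f32]) -> (Vec<i8>, f32) {
    let max_abs = row.iter().fold(0.0f32, |m, x| m.max(x.abs()));
    // An all-zero row gets scale 1.0 to avoid dividing by zero.
    let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
    let q: Vec<i8> = row
        .iter()
        .map(|x| (x / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

/// Maps INT8 values back to f32 using the stored scale.
fn dequantize_row_int8(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}
```

With this scheme the round-trip error per element is bounded by about half the scale, which is why calibration of the scale matters for rows with outliers.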

Compute Optimization
- Flash Attention framework with software fallback (FlashAttention, ScaledDotProductAttention)
- Sliding window attention support
- CUDA Graph framework (CudaGraph, CudaGraphExecutor)

Scheduling Optimization
- PD Separation (Prefill/Decode separation)
- Chunked Prefill with configurable chunk size
- Dynamic Batch Size based on available KV blocks
- Priority-based scheduling (Priority, enable_priority_scheduling)
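
Two of the scheduling items above, chunked prefill and KV-block-aware batch sizing, reduce to small calculations. The sketch below is illustrative only: `prefill_chunks`, `dynamic_batch_size`, and their parameters are hypothetical names and do not reflect the scheduler's actual API.

```rust
/// Splits a prompt into prefill chunks of at most `chunk_size` tokens.
fn prefill_chunks(prompt: &[u32], chunk_size: usize) -> Vec<&[u32]> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    prompt.chunks(chunk_size).collect()
}

/// Caps the batch size by how many sequences the free KV blocks can hold.
fn dynamic_batch_size(free_blocks: usize, blocks_per_seq: usize, max_batch: usize) -> usize {
    if blocks_per_seq == 0 {
        // No KV blocks needed per sequence, so only the configured cap applies.
        return max_batch;
    }
    (free_blocks / blocks_per_seq).min(max_batch)
}
```

For example, a 5-token prompt with a chunk size of 2 yields three chunks, the last holding a single token, and 10 free blocks at 4 blocks per sequence admit at most 2 sequences regardless of a larger configured cap.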

Distributed
- Multi-GPU Tensor Parallelism (DeviceMesh, ColumnParallelLinear, RowParallelLinear, AllReduce)

Phase 5: Production Readiness

Request timeout support (timeout parameter)
Graceful shutdown (SIGINT/SIGTERM handling)
YAML configuration file support
Environment variable overrides (VLLM_HOST, VLLM_PORT, etc.)
Structured JSON logging with file rotation
Grafana dashboard (docs/grafana/dashboard.json)
Config validation on startup
Error retry support (retries parameter)
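
The interaction between file-based configuration and environment overrides might be layered as in the sketch below. Only the variable names `VLLM_HOST` and `VLLM_PORT` come from the list above; the `ServerConfig` struct and both functions are hypothetical and do not describe the server's real configuration types.

```rust
use std::env;

#[derive(Debug, Clone, PartialEq)]
struct ServerConfig {
    host: String,
    port: u16,
}

/// Applies VLLM_HOST / VLLM_PORT overrides on top of values loaded from a file.
/// `lookup` abstracts the environment so the logic can be tested without
/// mutating process state.
fn apply_env_overrides<F>(mut cfg: ServerConfig, lookup: F) -> Result<ServerConfig, String>
where
    F: Fn(&str) -> Option<String>,
{
    if let Some(host) = lookup("VLLM_HOST") {
        cfg.host = host;
    }
    if let Some(port) = lookup("VLLM_PORT") {
        // Reject malformed ports at startup instead of silently keeping the old value.
        cfg.port = port
            .parse::<u16>()
            .map_err(|e| format!("invalid VLLM_PORT {port:?}: {e}"))?;
    }
    Ok(cfg)
}

/// Convenience wrapper that reads the real process environment.
fn apply_process_env(cfg: ServerConfig) -> Result<ServerConfig, String> {
    apply_env_overrides(cfg, |key| env::var(key).ok())
}
```

Returning an error for an unparsable port fits the "config validation on startup" item: a bad override stops the server early rather than surfacing later as a bind failure.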

Core Features

Real-time metrics collection with /v1/stats and /metrics endpoints
Quantization utilities (crates/model/src/quantize.rs)
Tiled Attention for memory optimization
INT8 quantization support in KV Cache
Forward pass with tiled attention strategy
Comprehensive test suite for tiled attention

Changed

Improved documentation structure (README.md, docs/README.md, ROADMAP.md)
Added detailed development roadmap

Fixed

Resolved Clippy warnings and made general code quality improvements
Restored test compatibility with the new AttentionConfig

[0.1.0] - 2026-03-31

Added

Continuous Batching - Dynamic batch scheduling with decode-priority
Paged KV Cache - Memory-efficient cache management with LRU eviction
Prefix Caching - Exact match and prefix hit support
Speculative Decoding - Draft-target verification architecture
Qwen3 Model Integration - Support for Qwen2.5-0.5B model with real weights
OpenAI-compatible API - /v1/completions, /v1/chat/completions
Streaming (SSE) - Real-time token streaming
Sampling - Temperature, Top-P support
Chunked Prefill - Process long prompts in chunks
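
For readers unfamiliar with the sampling options, temperature and Top-P can be sketched as below. This is a generic illustration of the two techniques, assuming `temperature > 0`; the function names are hypothetical and the project's sampler may be organized differently.

```rust
/// Converts logits to probabilities with temperature scaling (softmax).
/// Assumes `temperature > 0`; lower values sharpen the distribution.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the maximum logit for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Returns the token ids kept by Top-P filtering, most probable first.
/// Tokens are added until their cumulative probability reaches `top_p`.
fn top_p_candidates(probs: &[f32], top_p: f32) -> Vec<usize> {
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    let mut kept = Vec::new();
    let mut cumulative = 0.0f32;
    for idx in order {
        kept.push(idx);
        cumulative += probs[idx];
        if cumulative >= top_p {
            break;
        }
    }
    kept
}
```

A sampler would then renormalize the probabilities of the kept ids and draw one token from that reduced set.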

Architecture

3-Crate Structure:
- vllm-core: Scheduler, Engine, KV Cache, Types
- vllm-model: Qwen3, Attention, MLP
- vllm-server: HTTP API (axum)

Dependencies

Rust (edition 2021)
Candle (ML backend)
Axum (HTTP)
Tokio (async runtime)
SafeTensors (weight loading)

Migration Guides

Upgrading to 0.1.0

No migration needed - initial release.

Known Issues

Limited model support (Qwen3 only)
No multi-GPU support
Quantization support is still in progress

Credits

Thanks to all contributors and the vLLM project for inspiration.
