Changelog

All notable changes to vLLM-lite will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

[Unreleased]

Added

  • MambaBlock Weight Loading
    • Added MambaBlock::from_weights method to load SSM layer weights
    • Implemented full weight loading for Qwen3.5 Mamba models
    • Supports fallback for embed_tokens and lm_head weight names
    • Supports tied embeddings (tie_word_embeddings)
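
The name fallback and tied-embedding behaviour listed above could look roughly like the sketch below. It is an illustration only: the `Weights` map, the helper names, and the exact tensor name candidates are assumptions made for this example and are not taken from `MambaBlock::from_weights`.

```rust
use std::collections::HashMap;

/// Toy stand-in for a loaded checkpoint: tensor name -> flat f32 data.
/// (Hypothetical type, used only for this sketch.)
type Weights = HashMap<String, Vec<f32>>;

/// Return the first tensor stored under any of the candidate names.
fn first_present<'a>(weights: &'a Weights, candidates: &[&str]) -> Option<&'a Vec<f32>> {
    candidates.iter().find_map(|name| weights.get(*name))
}

/// Resolve the embedding and output-projection tensors.
/// With tied embeddings, the output projection reuses the embedding matrix
/// when no separate lm_head tensor is stored.
fn resolve_embed_and_head(
    weights: &Weights,
    tie_word_embeddings: bool,
) -> Result<(Vec<f32>, Vec<f32>), String> {
    // Candidate names are assumptions; real checkpoints may use other prefixes.
    let embed = first_present(weights, &["model.embed_tokens.weight", "embed_tokens.weight"])
        .ok_or_else(|| "no embed_tokens weight found".to_string())?;
    let head = match first_present(weights, &["lm_head.weight", "model.lm_head.weight"]) {
        Some(h) => h.clone(),
        None if tie_word_embeddings => embed.clone(),
        None => return Err("no lm_head weight found and embeddings are not tied".to_string()),
    };
    Ok((embed.clone(), head))
}
```

Trying a short list of candidate names keeps the loader tolerant of checkpoints exported with or without a `model.` prefix, while the explicit error for the untied case avoids silently reusing the wrong matrix.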

Refactored

Architecture Refactoring

  • Scheduler Module Split

    • Split monolithic scheduler.rs into focused submodules
    • Created scheduler/queue.rs with RequestQueue for queue management
    • Created scheduler/preemption.rs with PreemptionManager for preemption decisions
    • Created scheduler/eviction.rs with EvictionPolicy for block eviction
    • Fully integrated all modules into the Scheduler struct
  • KV Cache Layer Separation

    • Split core/kv_cache.rs into kv_cache/block_allocator.rs and kv_cache/prefix_cache.rs
    • Created model/paged_tensor/ module (separating logical and physical KV cache)
    • tensor_store.rs for GPU KV tensor management
    • quantization.rs for INT8/FP8 quantization
    • Added deprecated alias in kv_cache.rs for backward compatibility
  • Kernel Layer Extraction

    • Created model/kernels/ directory for GPU kernels
    • Moved flash_attention.rs → kernels/flash_attention.rs
    • Moved fused_kernel.rs → kernels/fused_mlp.rs
    • Moved cuda_graph.rs from core to model/kernels/cuda_graph.rs
    • Updated components/ to use kernels module
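
For readers wondering what a deprecated alias for backward compatibility (as mentioned under KV Cache Layer Separation) can look like in Rust, here is a minimal sketch. The `BlockAllocator` type, its methods, and the `KvCacheAllocator` alias name are hypothetical and only illustrate the pattern; the actual items in kv_cache.rs may differ.

```rust
/// New home of the allocator after the split (hypothetical, simplified).
pub mod block_allocator {
    #[derive(Debug, Default)]
    pub struct BlockAllocator {
        free: Vec<usize>,
    }

    impl BlockAllocator {
        /// Create an allocator that owns block ids 0..n.
        pub fn with_blocks(n: usize) -> Self {
            // Reversed so that `pop` hands out the lowest id first.
            Self { free: (0..n).rev().collect() }
        }

        /// Hand out a free block id, or None when the pool is exhausted.
        pub fn allocate(&mut self) -> Option<usize> {
            self.free.pop()
        }
    }
}

/// Old name kept so existing callers still compile; the compiler emits a
/// deprecation warning that points them at the new path.
#[deprecated(note = "use block_allocator::BlockAllocator instead")]
pub type KvCacheAllocator = block_allocator::BlockAllocator;
```

A type alias is used here instead of a re-export because the `#[deprecated]` attribute is reliably honoured on type aliases.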

Added

Phase 4: Performance Optimization

  • Quantization Support
    • FP16 support
    • INT8 Weight-Only quantization (QuantizedLinear, quantize_2d)
    • INT8 KV Cache with per-layer scaling
    • QuantizationCalibrator for calibration
  • Compute Optimization
    • Flash Attention framework with software fallback (FlashAttention, ScaledDotProductAttention)
    • Sliding window attention support
    • CUDA Graph framework (CudaGraph, CudaGraphExecutor)
  • Scheduling Optimization
    • PD Separation (Prefill/Decode separation)
    • Chunked Prefill with configurable chunk size
    • Dynamic Batch Size based on available KV blocks
    • Priority-based scheduling (Priority, enable_priority_scheduling)
  • Distributed
    • Multi-GPU Tensor Parallelism (DeviceMesh, ColumnParallelLinear, RowParallelLinear, AllReduce)
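
As background for the INT8 weight-only entry above, the sketch below shows one common scheme: symmetric quantization with one scale per row. It is not the project's `quantize_2d` or `QuantizedLinear`; the function names and the per-row scaling choice are assumptions made for illustration.

```rust
/// Row-wise symmetric INT8 quantization of a row-major `rows x cols` matrix.
/// Returns the quantized values plus one f32 scale per row.
fn quantize_rows_int8(data: &[f32], rows: usize, cols: usize) -> (Vec<i8>, Vec<f32>) {
    assert!(cols > 0, "cols must be non-zero");
    assert_eq!(data.len(), rows * cols, "shape mismatch");
    let mut q = Vec::with_capacity(data.len());
    let mut scales = Vec::with_capacity(rows);
    for row in data.chunks(cols) {
        let max_abs = row.iter().fold(0.0f32, |m, x| m.max(x.abs()));
        // An all-zero row gets scale 1.0 to avoid dividing by zero.
        let scale = if max_abs == 0.0 { 1.0 } else { max_abs / 127.0 };
        scales.push(scale);
        q.extend(
            row.iter()
                .map(|x| (x / scale).round().clamp(-127.0, 127.0) as i8),
        );
    }
    (q, scales)
}

/// Inverse mapping: multiply each quantized value by its row scale.
fn dequantize_rows_int8(q: &[i8], scales: &[f32], cols: usize) -> Vec<f32> {
    q.chunks(cols)
        .zip(scales)
        .flat_map(|(row, s)| row.iter().map(move |v| (*v as f32) * (*s)))
        .collect()
}
```

With this scheme the round-trip error per element is bounded by about half of that row's scale, which is why rows with large outliers lose the most precision.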

Phase 5: Production Readiness

  • Request timeout support (timeout parameter)
  • Graceful shutdown (SIGINT/SIGTERM handling)
  • YAML configuration file support
  • Environment variable overrides (VLLM_HOST, VLLM_PORT, etc.)
  • Structured JSON logging with file rotation
  • Grafana dashboard (docs/grafana/dashboard.json)
  • Config validation on startup
  • Error retry support (retries parameter)
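
The environment variable overrides listed above can be layered on top of file-based configuration along these lines. This is a sketch under assumptions: the `ServerConfig` struct and the `apply_env_overrides` function are hypothetical; only the variable names `VLLM_HOST` and `VLLM_PORT` come from this changelog.

```rust
/// Minimal server settings (hypothetical struct for this sketch).
#[derive(Debug, Clone, PartialEq)]
struct ServerConfig {
    host: String,
    port: u16,
}

/// Apply overrides on top of values already loaded from a config file.
/// `lookup` abstracts the environment so the logic is easy to test;
/// a real binary would pass `|k| std::env::var(k).ok()`.
fn apply_env_overrides<F>(mut cfg: ServerConfig, lookup: F) -> Result<ServerConfig, String>
where
    F: Fn(&str) -> Option<String>,
{
    if let Some(host) = lookup("VLLM_HOST") {
        cfg.host = host;
    }
    if let Some(port) = lookup("VLLM_PORT") {
        // Reject malformed ports instead of silently keeping the old value.
        cfg.port = port
            .parse()
            .map_err(|e| format!("invalid VLLM_PORT {port:?}: {e}"))?;
    }
    Ok(cfg)
}
```

Returning an error for an unparsable value fits naturally with validating the configuration at startup, since a typo in the environment then fails fast rather than being ignored.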

Core Features

  • Real-time metrics collection with /v1/stats and /metrics endpoints
  • Quantization utilities (crates/model/src/quantize.rs)
  • Tiled Attention for memory optimization
  • INT8 quantization support in KV Cache
  • Forward pass with tiled attention strategy
  • Comprehensive test suite for tiled attention

Changed

  • Improved documentation structure (README.md, docs/README.md, ROADMAP.md)
  • Added detailed development roadmap

Fixed

  • Clippy warnings and related code quality issues
  • Test compatibility with new AttentionConfig

[0.1.0] - 2026-03-31

Added

  • Continuous Batching - Dynamic batch scheduling with decode-priority
  • Paged KV Cache - Memory-efficient cache management with LRU eviction
  • Prefix Caching - Exact match and prefix hit support
  • Speculative Decoding - Draft-target verification architecture
  • Qwen3 Model Integration - Support for Qwen2.5-0.5B model with real weights
  • OpenAI-compatible API - /v1/completions, /v1/chat/completions
  • Streaming (SSE) - Real-time token streaming
  • Sampling - Temperature, Top-P support
  • Chunked Prefill - Process long prompts in chunks
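
To make the Temperature and Top-P entry above concrete, here is a small sketch of the usual technique: a temperature-scaled softmax followed by nucleus (top-p) filtering. The function names are hypothetical and the code is not taken from the project's sampler; drawing the final token from the filtered distribution is left out.

```rust
/// Softmax over logits divided by `temperature` (which must be > 0).
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    assert!(temperature > 0.0, "temperature must be positive");
    // Subtract the maximum logit for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Keep the smallest set of most-probable tokens whose cumulative probability
/// reaches `top_p`, then renormalize. Returns (token_id, probability) pairs.
fn top_p_filter(probs: &[f32], top_p: f32) -> Vec<(usize, f32)> {
    let mut ranked: Vec<(usize, f32)> = probs.iter().copied().enumerate().collect();
    // Sort by probability, highest first.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    let mut cumulative = 0.0f32;
    let mut keep = 0;
    for (_, p) in &ranked {
        cumulative += *p;
        keep += 1;
        if cumulative >= top_p {
            break;
        }
    }
    ranked.truncate(keep);
    let total: f32 = ranked.iter().map(|(_, p)| *p).sum();
    ranked.into_iter().map(|(i, p)| (i, p / total)).collect()
}
```

Lower temperatures sharpen the distribution before filtering, so fewer tokens are typically needed to reach the `top_p` mass.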

Architecture

  • 3-Crate Structure:
    • vllm-core: Scheduler, Engine, KV Cache, Types
    • vllm-model: Qwen3, Attention, MLP
    • vllm-server: HTTP API (axum)

Dependencies

  • Rust (edition 2021)
  • Candle (ML backend)
  • Axum (HTTP)
  • Tokio (async runtime)
  • SafeTensors (weight loading)

Migration Guides

Upgrading to 0.1.0

No migration needed - initial release.


Known Issues

  • Limited model support (Qwen3 only)
  • No multi-GPU support
  • Quantization in progress

Credits

Thanks to all contributors and the vLLM project for inspiration.