[claude] Add IVF (Inverted File) vector index layout for Vortex #7470

Open
connortsui20 wants to merge 3 commits into develop from claude/add-ivf-vector-index-ZTeQF

Conversation

@connortsui20
Contributor

Summary

This PR introduces a complete IVF (Inverted File) vector index implementation for Vortex, enabling efficient approximate nearest neighbor search on vector columns through k-means clustering and cluster-based pruning.

What's included

The implementation spans three layers:

  1. In-memory index (IvfIndex, IvfBuildConfig): k-means clustering and probe selection without layout machinery. Provides the core clustering algorithm and query-time cluster selection.

  2. Layout integration (IvfLayout, IvfStrategy, IvfReader): A first-class Vortex layout that stores data sorted by cluster with an auxiliary centroid child. IvfStrategy writes the layout by clustering data and creating one chunk per cluster. IvfReader transparently prunes chunks at read time when the filter is a cosine-similarity expression against a constant query vector.

  3. TurboQuant integration (TurboQuantIvfIndex): Builds the IVF index directly over TQ-compressed data. Centroids live in the SORF-rotated quantized space, so queries are rotated once instead of decompressing every database vector.

Key features

  • k-means++ initialization for better centroid placement followed by Lloyd's algorithm refinement
  • Cluster-based pruning at read time: when a cosine-similarity filter is detected, only the closest clusters are scanned
  • End-to-end write/read workflow, documented in the crate-level docs, covering session setup, Vector extension type support, and file I/O integration
  • Comprehensive test coverage including unit tests, layout round-trip tests, file I/O tests, and recall benchmarks

Implementation details

  • K-means clustering operates on f32 vectors in row-major layout
  • Cluster assignments are stored as u32 indices
  • The layout stores data as a chunked layout (one chunk per cluster) with an auxiliary centroid child
  • Query pruning extracts constant query vectors from cosine-similarity expressions and uses centroid distances to eliminate non-probed clusters
  • TurboQuant support materializes rotated f32 coordinates from quantized codes for clustering
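
For readers unfamiliar with the refinement step, the following is a minimal sketch of Lloyd's algorithm over row-major f32 data with u32 cluster assignments, matching the data shapes listed above. It is illustrative only and is not the code in this PR: the function names (`sq_dist`, `assign`, `update`, `lloyd`), the use of squared Euclidean distance, and the choice to leave empty clusters unchanged are all assumptions.

```rust
/// Squared Euclidean distance between two equal-length slices.
fn sq_dist(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Assign every row of `data` (row-major, `dim` floats per row) to its nearest centroid.
fn assign(data: &[f32], centroids: &[f32], dim: usize) -> Vec<u32> {
    data.chunks_exact(dim)
        .map(|row| {
            let mut best = 0u32;
            let mut best_d = f32::INFINITY;
            for (c, centroid) in centroids.chunks_exact(dim).enumerate() {
                let d = sq_dist(row, centroid);
                if d < best_d {
                    best_d = d;
                    best = c as u32;
                }
            }
            best
        })
        .collect()
}

/// One Lloyd update: each centroid becomes the mean of its assigned rows.
/// Clusters with no rows keep their previous centroid (an assumption of this sketch).
fn update(data: &[f32], assignments: &[u32], centroids: &mut [f32], dim: usize) {
    let k = centroids.len() / dim;
    let mut sums = vec![0f32; k * dim];
    let mut counts = vec![0usize; k];
    for (row, &a) in data.chunks_exact(dim).zip(assignments) {
        let a = a as usize;
        counts[a] += 1;
        for (s, x) in sums[a * dim..(a + 1) * dim].iter_mut().zip(row) {
            *s += x;
        }
    }
    for c in 0..k {
        if counts[c] > 0 {
            for j in 0..dim {
                centroids[c * dim + j] = sums[c * dim + j] / counts[c] as f32;
            }
        }
    }
}

/// Alternate update and assignment until assignments stop changing
/// or `max_iters` rounds have run. Returns the final assignments.
fn lloyd(data: &[f32], centroids: &mut [f32], dim: usize, max_iters: usize) -> Vec<u32> {
    let mut assignments = assign(data, centroids, dim);
    for _ in 0..max_iters {
        update(data, &assignments, centroids, dim);
        let next = assign(data, centroids, dim);
        if next == assignments {
            break;
        }
        assignments = next;
    }
    assignments
}
```

With four 2-D points forming two tight pairs and one initial centroid on each pair, the sketch converges in a single round and each centroid lands on the midpoint of its pair.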

Testing

  • Added comprehensive unit tests in src/tests.rs covering empty indices, single vectors, small datasets, and cluster consistency
  • Added layout round-trip tests in src/layout/tests.rs verifying write/read correctness and pruning behavior
  • Added end-to-end file I/O tests in src/file_tests.rs demonstrating the production workflow
  • Added recall regression tests in src/recall_tests.rs verifying quality metrics on synthetic clustered data
  • All tests pass with the new implementation

API Changes

Exports new public modules and types:

  • vortex_ivf::IvfIndex, IvfBuildConfig for in-memory clustering
  • vortex_ivf::layout::IvfLayout, IvfStrategy, IvfReader for layout integration
  • vortex_ivf::tq::TurboQuantIvfIndex for TurboQuant-aware indexing
  • vortex_ivf::layout::register_ivf_layout() for session registration

Also makes vortex_tensor::utils public to support casting utilities needed by IVF.

https://claude.ai/code/session_01AQwoFbonU23fEamWhFWhSo

claude added 3 commits April 16, 2026 02:58
…earch

Implements an Inverted File (IVF) index in the vortex-tensor crate that
clusters vectors into K groups using k-means++ initialization and Lloyd's
algorithm, then at query time only searches the nprobes most promising
clusters based on centroid similarity.

Key components:
- `IvfIndex`: Core index with k-means clustering, probe selection (cosine
  similarity to centroids), and boolean mask generation for row filtering
- `IvfPartitionedIndex`: Augments the index with cluster boundary offsets
  and a permutation array for efficient range-based reads of sorted data
- `search::build_ivf_index`: Builds an IVF index from a Vortex Vector
  extension array by materializing vectors to f32
- `search::ivf_similarity_search`: Accelerated cosine similarity search
  that combines IVF cluster pruning with the existing expression tree
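
As a rough illustration of the probe-selection and mask-generation steps described for `IvfIndex`, the sketch below ranks centroids by cosine similarity to the query, keeps the top `nprobes`, and marks the rows that belong to those clusters. The function names and signatures (`cosine`, `select_probes`, `probe_mask`) are hypothetical and do not reflect the actual API in this PR.

```rust
/// Cosine similarity; returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Return the ids of the `nprobes` centroids most similar to `query`,
/// best first. `centroids` is row-major with `dim` floats per centroid.
fn select_probes(query: &[f32], centroids: &[f32], dim: usize, nprobes: usize) -> Vec<u32> {
    let mut scored: Vec<(u32, f32)> = centroids
        .chunks_exact(dim)
        .enumerate()
        .map(|(c, centroid)| (c as u32, cosine(query, centroid)))
        .collect();
    // Sort by descending similarity; total_cmp gives a total order on f32.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(nprobes).map(|(c, _)| c).collect()
}

/// Build a per-row boolean mask: true if the row's cluster was probed.
fn probe_mask(assignments: &[u32], probes: &[u32], num_clusters: usize) -> Vec<bool> {
    let mut probed = vec![false; num_clusters];
    for &p in probes {
        probed[p as usize] = true;
    }
    assignments.iter().map(|&a| probed[a as usize]).collect()
}
```

Rows whose mask entry is false are skipped entirely, which is where the read-time savings come from; rows whose entry is true still need the exact similarity check.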

The implementation uses the existing SplitMix64 PRNG for deterministic
k-means++ initialization (no new dependencies). Tested with 18 tests
covering: empty/single-vector edge cases, cluster correctness, probe
recall quality, integration with Vortex Vector arrays, and end-to-end
search through TurboQuant-compressed data.
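
To make the "deterministic k-means++ initialization" point concrete, here is a hedged sketch of D²-weighted seeding driven by a SplitMix64 generator. The generator below is the standard public-domain SplitMix64 step written out locally; it is not the existing Vortex PRNG referenced above, and `kmeans_pp_init` with its signature is a hypothetical name chosen for illustration.

```rust
/// Standard SplitMix64 generator (written out here; not the Vortex implementation).
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in [0, 1) built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

/// k-means++ seeding: returns the row indices chosen as initial centroids.
/// `data` is row-major with `dim` floats per row. The same seed always
/// yields the same choice.
fn kmeans_pp_init(data: &[f32], dim: usize, k: usize, seed: u64) -> Vec<usize> {
    let n = if dim == 0 { 0 } else { data.len() / dim };
    if n == 0 || k == 0 {
        return Vec::new();
    }
    let mut rng = SplitMix64(seed);
    // First seed: a (nearly) uniform random row.
    let mut chosen = vec![(rng.next_u64() % n as u64) as usize];
    // d2[i] = squared distance from row i to its nearest chosen seed.
    let mut d2 = vec![f64::INFINITY; n];
    while chosen.len() < k.min(n) {
        let last = chosen[chosen.len() - 1];
        let seed_row = &data[last * dim..(last + 1) * dim];
        for i in 0..n {
            let row = &data[i * dim..(i + 1) * dim];
            let d: f64 = row
                .iter()
                .zip(seed_row)
                .map(|(a, b)| ((a - b) as f64).powi(2))
                .sum();
            d2[i] = d2[i].min(d);
        }
        let total: f64 = d2.iter().sum();
        if total == 0.0 {
            break; // every remaining row coincides with a chosen seed
        }
        // Sample a row with probability proportional to d2.
        let mut target = rng.next_f64() * total;
        let mut pick = 0;
        for (i, &w) in d2.iter().enumerate() {
            if w > 0.0 {
                pick = i;
                if target < w {
                    break;
                }
                target -= w;
            }
        }
        chosen.push(pick);
    }
    chosen
}
```

Because already-chosen rows have zero weight, the sketch never picks the same row twice, and it returns fewer than `k` seeds when the data has fewer than `k` distinct rows.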

Signed-off-by: "Claude" <noreply@anthropic.com>

https://claude.ai/code/session_01AQwoFbonU23fEamWhFWhSo
Creates a new `vortex-ivf` crate that implements a full IVF (Inverted File)
vector index as a first-class Vortex layout. Data is clustered by k-means++,
written one chunk per cluster, and paired with a centroid child layout that
the reader consults at read time.

New crate layout:
- `vortex-ivf/src/lib.rs`                      core IvfIndex + k-means
- `vortex-ivf/src/kmeans.rs`                   k-means++ clustering (deterministic, no new deps)
- `vortex-ivf/src/partitioned.rs`              cluster-boundary bookkeeping for range reads
- `vortex-ivf/src/search.rs`                   in-memory IVF-accelerated cosine search
- `vortex-ivf/src/tq.rs`                       TurboQuant-aware builder that clusters in
                                               the SORF-rotated quantized space and rotates
                                               queries instead of inverting SORF N times
- `vortex-ivf/src/layout/mod.rs`               IvfLayout module docs + session registration
- `vortex-ivf/src/layout/metadata.rs`          Prost metadata (dim, nprobes, num_clusters)
- `vortex-ivf/src/layout/query.rs`             extract literal query vector from a
                                               CosineSimilarity expression tree
- `vortex-ivf/src/layout/reader.rs`            IvfReader: pruning_evaluation probes the
                                               centroids and returns a mask that eliminates
                                               all rows in non-probed clusters
- `vortex-ivf/src/layout/vtable.rs`            VTable binding (data + centroids children)
- `vortex-ivf/src/layout/writer.rs`            IvfStrategy: LayoutStrategy that runs k-means,
                                               partitions rows into one chunk per cluster,
                                               and writes centroids as an auxiliary child
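
The cluster-boundary bookkeeping and the one-chunk-per-cluster write both rest on grouping row indices by cluster. A counting sort is one plausible way to do that; the sketch below is an assumption about the approach, not a copy of `partitioned.rs` or `writer.rs`, and the `partition` function is a hypothetical name.

```rust
/// Group rows by cluster with a counting sort.
///
/// Returns `(offsets, perm)` where cluster `c` owns the sorted positions
/// `offsets[c]..offsets[c + 1]`, and `perm[p]` is the original row index
/// stored at sorted position `p`. Rows keep their relative order within
/// a cluster.
fn partition(assignments: &[u32], num_clusters: usize) -> (Vec<usize>, Vec<u32>) {
    // Count rows per cluster, shifted by one so a prefix sum yields offsets.
    let mut offsets = vec![0usize; num_clusters + 1];
    for &a in assignments {
        offsets[a as usize + 1] += 1;
    }
    for c in 0..num_clusters {
        offsets[c + 1] += offsets[c];
    }
    // Scatter each row into the next free slot of its cluster.
    let mut cursor = offsets[..num_clusters].to_vec();
    let mut perm = vec![0u32; assignments.len()];
    for (row, &a) in assignments.iter().enumerate() {
        let slot = &mut cursor[a as usize];
        perm[*slot] = row as u32;
        *slot += 1;
    }
    (offsets, perm)
}
```

With this shape, reading a probed cluster is a single contiguous range read over the sorted data, and `perm` maps results back to original row ids when needed.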

Key design choices:
- Separate crate. vortex-layout depends transitively on vortex-btrblocks, which
  optionally depends on vortex-tensor, so putting IVF in either crate creates a
  cargo cycle. A dedicated crate depending on both is the only clean option.
- `OwnedLayoutChildren::layout_children` is now public so out-of-tree layout
  writers can construct `Arc<dyn LayoutChildren>` for `VTable::build`.
- TurboQuant integration lives in `tq.rs`. Rather than decompress the whole
  column, the builder descends to the `Dict(codes, centroids)` FSL and
  materializes each row's rotated coordinates via a single dict lookup. The
  query is rotated once by SORF before probing, so the centroid comparison
  happens entirely in the quantized/rotated space.
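
The reason a single query rotation suffices can be stated in one line, under the assumption that the SORF transform acts as an orthogonal matrix $R$ (that is, $R^\top R = I$). The symbols $q$ (query) and $x$ (database vector or centroid) are introduced here for illustration:

```latex
\langle Rq, Rx \rangle = q^\top R^\top R\, x = q^\top x,
\qquad
\lVert Rq \rVert_2 = \lVert q \rVert_2
\quad\Longrightarrow\quad
\cos(Rq, Rx) = \cos(q, x)
```

Under that assumption, ranking centroids against the rotated query in the rotated space gives the same order as ranking them in the original space, up to whatever error the quantization itself introduces.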

Test coverage (24 tests pass):
- 13 core IVF index tests (build, probe, recall, edge cases)
- 3 partitioned-index tests (cluster ranges and selectivity)
- 2 TQ-aware tests (self-match through TQ compression, non-TQ rejection)
- 2 in-memory IVF search tests (with and without TQ compression)
- 4 end-to-end layout tests (write, project, pruning on cosine-similarity expr,
  pass-through for non-cosine expressions)

Signed-off-by: "Claude" <noreply@anthropic.com>

https://claude.ai/code/session_01AQwoFbonU23fEamWhFWhSo
Adds three pieces that prove the IVF layout works end-to-end and document how
to use it as a real production workload.

1. Recall benchmark (`src/recall_tests.rs`, 4 tests):

   Measures recall@K against brute-force ground truth on a clustered synthetic
   corpus (dim=128, 2000 rows, 16 clusters, 50 queries, seed=42):

   | nprobes | clusters read | avg recall@10 | scan fraction |
   |---------|---------------|---------------|---------------|
   |     2   |     2/16      |      1.000    |     0.139     |
   |     4   |     4/16      |      1.000    |     0.282     |
   |     8   |     8/16      |      1.000    |     0.533     |
   |    16   |    16/16      |      1.000    |     1.000     |

   Perfect recall at 14% of the data scanned. On well-clustered data like
   this the index is effectively lossless; the docstring flags that real
   embedding corpora typically need ~sqrt(num_clusters) probes for the same
   quality.
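
   For reference, the two metrics in the table can be computed as in the sketch below. This is an illustrative reconstruction, not the code in `src/recall_tests.rs`; the function names and the assumption that result lists are sorted best-first and contain distinct row ids are mine.

```rust
use std::collections::HashSet;

/// recall@k: fraction of the exact top-k row ids that also appear in the
/// approximate top-k. Both slices are assumed sorted best-first.
fn recall_at_k(exact: &[u32], approx: &[u32], k: usize) -> f64 {
    let k = k.min(exact.len());
    if k == 0 {
        return 1.0; // nothing to find, so nothing was missed
    }
    let truth: HashSet<u32> = exact[..k].iter().copied().collect();
    let found: HashSet<u32> = approx.iter().take(k).copied().collect();
    truth.intersection(&found).count() as f64 / k as f64
}

/// Fraction of all rows that live in the probed clusters.
fn scan_fraction(cluster_sizes: &[usize], probes: &[u32]) -> f64 {
    let total: usize = cluster_sizes.iter().sum();
    if total == 0 {
        return 0.0;
    }
    let scanned: usize = probes.iter().map(|&p| cluster_sizes[p as usize]).sum();
    scanned as f64 / total as f64
}
```

   Averaging `recall_at_k` over all queries gives the "avg recall@10" column, and averaging `scan_fraction` gives the "scan fraction" column; scan fraction differs from nprobes/num_clusters whenever clusters are unevenly sized.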

2. File round-trip (`src/file_tests.rs`, 2 tests):

   End-to-end test through the full `VortexWriteOptions` → `VortexOpenOptions`
   API. Writes a `Vector<16, f32>` column with `IvfStrategy` as the top-level
   strategy, opens the resulting bytes as a `VortexFile`, and runs
   `scan().with_filter(CosineSimilarity(root, query) > 0.5)`. Confirms:
   - the file round-trips all rows when no filter is applied,
   - with nprobes=1 the cosine filter returns only rows from the one
     probed cluster.

3. Production documentation (`src/lib.rs` crate-level docs):

   Step-by-step write/read workflow with compile-checked doctests:
   - session setup (ArraySession, LayoutSession, ScalarFnSession, RuntimeSession,
     default encodings, tensor initialize, `register_ivf_layout`),
   - ingest (raw f32 → `FixedSizeList<f32>` → `Vector<dim, f32>` extension array),
   - write (`IvfStrategy` as the top-level layout strategy, with per-cluster
     data strategy and centroid strategy),
   - read (build a literal `Vector<dim, f32>` query scalar, wrap in
     `CosineSimilarity > threshold`, scan — pruning is automatic).

   Includes the observed recall table and tuning guidance.

Test count: 31 unit + 4 doc tests (was 24 + 0). All passing, clippy clean.

Signed-off-by: "Claude" <noreply@anthropic.com>

https://claude.ai/code/session_01AQwoFbonU23fEamWhFWhSo
@connortsui20 changed the title from "Add IVF (Inverted File) vector index layout for Vortex" to "[claude] Add IVF (Inverted File) vector index layout for Vortex" on Apr 16, 2026
@connortsui20 added the changelog/skip label (Do not list PR in the changelog) on Apr 16, 2026
Labels

changelog/skip Do not list PR in the changelog

2 participants