[claude] Add IVF (Inverted File) vector index layout for Vortex #7470
Open
connortsui20 wants to merge 3 commits into develop
…earch
Implements an Inverted File (IVF) index in the vortex-tensor crate that
clusters vectors into K groups using k-means++ initialization and Lloyd's
algorithm, then at query time only searches the nprobes most promising
clusters based on centroid similarity.
Key components:
- `IvfIndex`: Core index with k-means clustering, probe selection (cosine
  similarity to centroids), and boolean mask generation for row filtering
- `IvfPartitionedIndex`: Augments the index with cluster boundary offsets and
  a permutation array for efficient range-based reads of sorted data
- `search::build_ivf_index`: Builds an IVF index from a Vortex Vector
  extension array by materializing vectors to f32
- `search::ivf_similarity_search`: Accelerated cosine similarity search that
  combines IVF cluster pruning with the existing expression tree
The implementation uses the existing SplitMix64 PRNG for deterministic
k-means++ initialization (no new dependencies).
Tested with 18 tests covering: empty/single-vector edge cases, cluster
correctness, probe recall quality, integration with Vortex Vector arrays, and
end-to-end search through TurboQuant-compressed data.
Signed-off-by: "Claude" <noreply@anthropic.com>
https://claude.ai/code/session_01AQwoFbonU23fEamWhFWhSo
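The probe step described in this commit can be sketched independently of Vortex. The following is a minimal illustration, not the PR's code: `cosine_similarity`, `select_probes`, and `probe_mask` are hypothetical names, centroids are assumed to be plain `Vec<f32>` rows, and `assignments` is assumed to hold each row's cluster id.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "dimension mismatch");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Ids of the `nprobes` centroids most similar to `query`, best first.
fn select_probes(centroids: &[Vec<f32>], query: &[f32], nprobes: usize) -> Vec<usize> {
    let mut scored: Vec<(usize, f32)> = centroids
        .iter()
        .enumerate()
        .map(|(id, c)| (id, cosine_similarity(c, query)))
        .collect();
    // Descending similarity; total_cmp gives a total order even if NaN appears.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.into_iter().take(nprobes).map(|(id, _)| id).collect()
}

/// Row mask: `true` for rows whose assigned cluster is among `probes`.
fn probe_mask(assignments: &[usize], num_clusters: usize, probes: &[usize]) -> Vec<bool> {
    let mut probed = vec![false; num_clusters];
    for &p in probes {
        probed[p] = true;
    }
    assignments.iter().map(|&c| probed[c]).collect()
}
```

Rows where the mask is `false` never need their similarity computed, which is where the savings come from when `nprobes` is much smaller than the number of clusters.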
Creates a new `vortex-ivf` crate that implements a full IVF (Inverted File)
vector index as a first-class Vortex layout. Data is clustered by k-means++,
written one chunk per cluster, and paired with a centroid child layout that
the reader consults at read time.
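As a rough illustration of "one chunk per cluster", the sketch below groups row ids by cluster with a counting sort, producing boundary offsets plus a permutation. The names `ClusterPartition` and `partition_by_cluster` are hypothetical and the real writer operates on Vortex arrays rather than index vectors.

```rust
/// Cluster-sorted row order: `permutation[offsets[c]..offsets[c + 1]]` lists
/// the original row ids assigned to cluster `c`.
struct ClusterPartition {
    offsets: Vec<usize>,
    permutation: Vec<usize>,
}

/// Group rows by cluster id using a stable counting sort.
fn partition_by_cluster(assignments: &[usize], num_clusters: usize) -> ClusterPartition {
    // Count rows per cluster, then prefix-sum the counts into start offsets.
    let mut offsets = vec![0usize; num_clusters + 1];
    for &c in assignments {
        offsets[c + 1] += 1;
    }
    for c in 0..num_clusters {
        offsets[c + 1] += offsets[c];
    }
    // Scatter row ids into their cluster's range, preserving input order.
    let mut cursor = offsets[..num_clusters].to_vec();
    let mut permutation = vec![0usize; assignments.len()];
    for (row, &c) in assignments.iter().enumerate() {
        permutation[cursor[c]] = row;
        cursor[c] += 1;
    }
    ClusterPartition { offsets, permutation }
}

impl ClusterPartition {
    /// Original row ids that would be written into the chunk for `cluster`.
    fn cluster_rows(&self, cluster: usize) -> &[usize] {
        &self.permutation[self.offsets[cluster]..self.offsets[cluster + 1]]
    }
}
```

With this bookkeeping, reading a probed cluster is a contiguous range read over the sorted data, and the permutation maps results back to original row ids.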
New crate layout:
- `vortex-ivf/src/lib.rs`: core IvfIndex + k-means
- `vortex-ivf/src/kmeans.rs`: k-means++ clustering (deterministic, no new deps)
- `vortex-ivf/src/partitioned.rs`: cluster-boundary bookkeeping for range reads
- `vortex-ivf/src/search.rs`: in-memory IVF-accelerated cosine search
- `vortex-ivf/src/tq.rs`: TurboQuant-aware builder that clusters in the
  SORF-rotated quantized space and rotates queries instead of inverting
  SORF N times
- `vortex-ivf/src/layout/mod.rs`: IvfLayout module docs + session registration
- `vortex-ivf/src/layout/metadata.rs`: Prost metadata (dim, nprobes, num_clusters)
- `vortex-ivf/src/layout/query.rs`: extract literal query vector from a
  CosineSimilarity expression tree
- `vortex-ivf/src/layout/reader.rs`: IvfReader: pruning_evaluation probes the
  centroids and returns a mask that eliminates all rows in non-probed clusters
- `vortex-ivf/src/layout/vtable.rs`: VTable binding (data + centroids children)
- `vortex-ivf/src/layout/writer.rs`: IvfStrategy: LayoutStrategy that runs
  k-means, partitions rows into one chunk per cluster, and writes centroids
  as an auxiliary child
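For readers unfamiliar with deterministic k-means++ seeding, here is a small self-contained sketch of the idea. It is not the contents of `kmeans.rs`: the `SplitMix64` struct is a from-scratch copy of the well-known generator, `kmeans_pp_seed` is a hypothetical name, and squared Euclidean distance is an assumption (the PR may weight candidates differently).

```rust
/// SplitMix64: a tiny deterministic PRNG (same seed, same sequence).
struct SplitMix64 {
    state: u64,
}

impl SplitMix64 {
    fn new(seed: u64) -> Self {
        Self { state: seed }
    }

    fn next_u64(&mut self) -> u64 {
        self.state = self.state.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.state;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }

    /// Uniform f64 in [0, 1) built from the top 53 bits.
    fn next_f64(&mut self) -> f64 {
        (self.next_u64() >> 11) as f64 / (1u64 << 53) as f64
    }
}

fn squared_distance(a: &[f32], b: &[f32]) -> f64 {
    a.iter().zip(b).map(|(x, y)| f64::from(x - y).powi(2)).sum()
}

/// k-means++ seeding: returns the row indices chosen as initial centroids.
/// Each new seed is drawn with probability proportional to its squared
/// distance from the nearest seed picked so far.
fn kmeans_pp_seed(vectors: &[Vec<f32>], k: usize, seed: u64) -> Vec<usize> {
    let n = vectors.len();
    let k = k.min(n);
    let mut chosen = Vec::with_capacity(k);
    if k == 0 {
        return chosen;
    }
    let mut rng = SplitMix64::new(seed);
    // First seed: uniform over rows (modulo bias is negligible for a sketch).
    let first = (rng.next_u64() % n as u64) as usize;
    chosen.push(first);
    let mut min_dist: Vec<f64> = vectors
        .iter()
        .map(|v| squared_distance(v, &vectors[first]))
        .collect();
    while chosen.len() < k {
        let total: f64 = min_dist.iter().sum();
        let next = if total > 0.0 {
            let target = rng.next_f64() * total;
            let (mut acc, mut pick) = (0.0, 0);
            for (i, &d) in min_dist.iter().enumerate() {
                if d > 0.0 {
                    pick = i; // last eligible row doubles as a rounding fallback
                    acc += d;
                    if acc > target {
                        break;
                    }
                }
            }
            pick
        } else {
            // Every remaining row coincides with a seed; take any unchosen row.
            (0..n).find(|i| !chosen.contains(i)).unwrap()
        };
        chosen.push(next);
        for (i, v) in vectors.iter().enumerate() {
            min_dist[i] = min_dist[i].min(squared_distance(v, &vectors[next]));
        }
    }
    chosen
}
```

Because the only source of randomness is the seeded generator, two builds over the same data with the same seed pick the same initial centroids, which keeps tests reproducible.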
Key design choices:
- Separate crate. vortex-layout depends transitively on vortex-btrblocks, which
optionally depends on vortex-tensor, so putting IVF in either crate creates a
cargo cycle. A dedicated crate depending on both is the only clean option.
- `OwnedLayoutChildren::layout_children` is now public so out-of-tree layout
writers can construct `Arc<dyn LayoutChildren>` for `VTable::build`.
- TurboQuant integration lives in `tq.rs`. Rather than decompress the whole
column, the builder descends to the `Dict(codes, centroids)` FSL and
materializes each row's rotated coordinates via a single dict lookup. The
query is rotated once by SORF before probing, so the centroid comparison
happens entirely in the quantized/rotated space.
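The reason rotating the query once is sufficient is that an orthogonal rotation preserves inner products and norms, so similarities computed in the rotated space equal those in the original space. The sketch below demonstrates this with a stand-in rotation (random sign flips followed by a normalized Walsh-Hadamard transform). It is an assumption-laden illustration: the function names are hypothetical, and the actual SORF transform and TurboQuant dictionary lookup in Vortex are not reproduced here.

```rust
/// In-place fast Walsh-Hadamard transform scaled by 1/sqrt(n), which makes
/// the transform orthogonal. The length must be a power of two.
fn normalized_fwht(v: &mut [f32]) {
    let n = v.len();
    assert!(n.is_power_of_two(), "length must be a power of two");
    let mut h = 1;
    while h < n {
        for block in (0..n).step_by(2 * h) {
            for i in block..block + h {
                let (a, b) = (v[i], v[i + h]);
                v[i] = a + b;
                v[i + h] = a - b;
            }
        }
        h *= 2;
    }
    let scale = 1.0 / (n as f32).sqrt();
    for x in v.iter_mut() {
        *x *= scale;
    }
}

/// Stand-in rotation: flip signs (entries of `signs` are +1.0 or -1.0),
/// then apply the normalized Hadamard transform.
fn rotate(v: &[f32], signs: &[f32]) -> Vec<f32> {
    assert_eq!(v.len(), signs.len(), "dimension mismatch");
    let mut out: Vec<f32> = v.iter().zip(signs).map(|(x, s)| x * s).collect();
    normalized_fwht(&mut out);
    out
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Rotate the query once, then compare against centroids that already live
/// in the rotated space. Returns the index with the largest inner product.
fn nearest_in_rotated_space(
    query: &[f32],
    signs: &[f32],
    rotated_centroids: &[Vec<f32>],
) -> Option<usize> {
    let rotated_query = rotate(query, signs); // one rotation per query
    rotated_centroids
        .iter()
        .enumerate()
        .map(|(i, c)| (i, dot(&rotated_query, c)))
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .map(|(i, _)| i)
}
```

The cost trade-off is one rotation of the query versus N inverse rotations of stored vectors, which is why keeping centroids in the rotated space pays off as the column grows.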
Test coverage (24 tests pass):
- 13 core IVF index tests (build, probe, recall, edge cases)
- 3 partitioned-index tests (cluster ranges and selectivity)
- 2 TQ-aware tests (self-match through TQ compression, non-TQ rejection)
- 2 in-memory IVF search tests (with and without TQ compression)
- 4 end-to-end layout tests (write, project, pruning on cosine-similarity expr,
pass-through for non-cosine expressions)
Signed-off-by: "Claude" <noreply@anthropic.com>
https://claude.ai/code/session_01AQwoFbonU23fEamWhFWhSo
Adds three pieces that prove the IVF layout works end-to-end and document how
to use it as a real production workload.
1. Recall benchmark (`src/recall_tests.rs`, 4 tests):
Measures recall@K against brute-force ground truth on a clustered synthetic
corpus (dim=128, 2000 rows, 16 clusters, 50 queries, seed=42):
| nprobes | clusters read | avg recall@10 | scan fraction |
|---------|---------------|---------------|---------------|
| 2 | 2/16 | 1.000 | 0.139 |
| 4 | 4/16 | 1.000 | 0.282 |
| 8 | 8/16 | 1.000 | 0.533 |
| 16 | 16/16 | 1.000 | 1.000 |
Perfect recall at 14% of the data scanned. On well-clustered data like
this the index is effectively lossless; the docstring flags that real
embedding corpora typically need ~sqrt(num_clusters) probes for the same
quality.
2. File round-trip (`src/file_tests.rs`, 2 tests):
End-to-end test through the full `VortexWriteOptions` → `VortexOpenOptions`
API. Writes a `Vector<16, f32>` column with `IvfStrategy` as the top-level
strategy, opens the resulting bytes as a `VortexFile`, and runs
`scan().with_filter(CosineSimilarity(root, query) > 0.5)`. Confirms:
- the file round-trips all rows when no filter is applied,
- with nprobes=1 the cosine filter returns only rows from the one
probed cluster.
3. Production documentation (`src/lib.rs` crate-level docs):
Step-by-step write/read workflow with compile-checked doctests:
- session setup (ArraySession, LayoutSession, ScalarFnSession, RuntimeSession,
default encodings, tensor initialize, `register_ivf_layout`),
- ingest (raw f32 → `FixedSizeList<f32>` → `Vector<dim, f32>` extension array),
- write (`IvfStrategy` as the top-level layout strategy, with per-cluster
data strategy and centroid strategy),
- read (build a literal `Vector<dim, f32>` query scalar, wrap in
`CosineSimilarity > threshold`, scan — pruning is automatic).
Includes the observed recall table and tuning guidance.
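For reference, the two metrics reported in the benchmark table can be computed along these lines. This is a generic sketch with hypothetical function names, not the code in `src/recall_tests.rs`; `truth` is assumed to be the brute-force top-K row ids and `approx` the top-K returned after IVF pruning.

```rust
use std::collections::HashSet;

/// Fraction of the ground-truth top-K ids that also appear in the
/// approximate result. An empty ground truth counts as perfect recall.
fn recall_at_k(truth: &[usize], approx: &[usize]) -> f64 {
    if truth.is_empty() {
        return 1.0;
    }
    let found: HashSet<usize> = approx.iter().copied().collect();
    let hits = truth.iter().filter(|&&id| found.contains(&id)).count();
    hits as f64 / truth.len() as f64
}

/// Fraction of all rows that live in the probed clusters.
fn scan_fraction(cluster_sizes: &[usize], probed: &[usize]) -> f64 {
    let total: usize = cluster_sizes.iter().sum();
    if total == 0 {
        return 0.0;
    }
    let scanned: usize = probed.iter().map(|&c| cluster_sizes[c]).sum();
    scanned as f64 / total as f64
}
```

Averaging `recall_at_k` over all queries gives the "avg recall@10" column, and `scan_fraction` depends on cluster sizes, which is why probing 2 of 16 clusters reads 13.9% of rows rather than exactly 12.5%.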
Test count: 31 unit + 4 doc tests (was 24 + 0). All passing, clippy clean.
Signed-off-by: "Claude" <noreply@anthropic.com>
https://claude.ai/code/session_01AQwoFbonU23fEamWhFWhSo
Summary
This PR introduces a complete IVF (Inverted File) vector index implementation for Vortex, enabling efficient approximate nearest neighbor search on vector columns through k-means clustering and cluster-based pruning.
What's included
The implementation spans three layers:
- In-memory index (`IvfIndex`, `IvfBuildConfig`): k-means clustering and probe selection without layout machinery. Provides the core clustering algorithm and query-time cluster selection.
- Layout integration (`IvfLayout`, `IvfStrategy`, `IvfReader`): A first-class Vortex layout that stores data sorted by cluster with an auxiliary centroid child. `IvfStrategy` writes the layout by clustering data and creating one chunk per cluster. `IvfReader` transparently prunes chunks at read time when the filter is a cosine-similarity expression against a constant query vector.
- TurboQuant integration (`TurboQuantIvfIndex`): Builds the IVF index directly over TQ-compressed data. Centroids live in the SORF-rotated quantized space, so queries are rotated once instead of decompressing every database vector.
Key features
Implementation details
Testing
- `src/tests.rs` covering empty indices, single vectors, small datasets, and cluster consistency
- `src/layout/tests.rs` verifying write/read correctness and pruning behavior
- `src/file_tests.rs` demonstrating the production workflow
- `src/recall_tests.rs` verifying quality metrics on synthetic clustered data
API Changes
Exports new public modules and types:
- `vortex_ivf::IvfIndex`, `IvfBuildConfig` for in-memory clustering
- `vortex_ivf::layout::IvfLayout`, `IvfStrategy`, `IvfReader` for layout integration
- `vortex_ivf::tq::TurboQuantIvfIndex` for TurboQuant-aware indexing
- `vortex_ivf::layout::register_ivf_layout()` for session registration
Also makes `vortex_tensor::utils` public to support casting utilities needed by IVF.
https://claude.ai/code/session_01AQwoFbonU23fEamWhFWhSo