feat: support semantic search in AI chat and embedding ability #1510

Open · hgaol wants to merge 6 commits into apache:dev from hgaol:1468

Conversation

@hgaol (Member) commented Mar 2, 2026

Implementation of #1468: adds a semantic search tool to keep the user experience consistent.

Settings:
[screenshot]

Calling the semantic search tool:
[screenshot]

Logs:
[screenshot]

@LinkinStars LinkinStars self-requested a review March 5, 2026 12:22
// SearchSimilar performs brute-force cosine similarity search in Go.
func (r *embeddingRepo) SearchSimilar(ctx context.Context, queryVector []float32, topK int) ([]SimilarResult, error) {
@LinkinStars (Member)
The biggest issue with this approach lies here: all data queries are performed in-memory. While this works for small datasets, it is undoubtedly unacceptable for large-scale data.
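For context, the in-memory approach under discussion amounts to scoring every stored vector against the query and keeping the top K, which is O(N·D) per query. The sketch below is illustrative only; the type and function names are hypothetical and differ from the PR's actual `embeddingRepo` code.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// SimilarResult pairs an object ID with its cosine-similarity score.
// (Hypothetical shape, loosely mirroring the PR's SimilarResult.)
type SimilarResult struct {
	ObjectID string
	Score    float64
}

// cosine computes cosine similarity between two equal-length vectors.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// searchSimilar scans every stored vector, scores it against the query,
// and returns the topK best matches — brute force, no index.
func searchSimilar(stored map[string][]float32, query []float32, topK int) []SimilarResult {
	results := make([]SimilarResult, 0, len(stored))
	for id, vec := range stored {
		results = append(results, SimilarResult{ObjectID: id, Score: cosine(query, vec)})
	}
	sort.Slice(results, func(i, j int) bool { return results[i].Score > results[j].Score })
	if len(results) > topK {
		results = results[:topK]
	}
	return results
}

func main() {
	stored := map[string][]float32{
		"q1": {1, 0},
		"q2": {0, 1},
	}
	// [1, 0.1] is nearly parallel to q1's vector, so q1 wins.
	fmt.Println(searchSimilar(stored, []float32{1, 0.1}, 1)[0].ObjectID) // → q1
}
```

Every query touches every vector, which is exactly why this does not scale past small datasets.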

Therefore, my suggestion is that if users are using PostgreSQL, they could directly utilize PostgreSQL + pgvector to store vector data and perform searches within the database.
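A pgvector version of the same search pushes the ranking into PostgreSQL via its `<=>` cosine-distance operator, letting the database (and any ivfflat/hnsw index) do the work. This is only a sketch under assumed table and column names (`embeddings`, `object_id`, `embedding`):

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
	"strconv"
	"strings"
)

// toPgVector renders a float32 slice in pgvector's text input format,
// e.g. []float32{0.1, 0.2} -> "[0.1,0.2]".
func toPgVector(v []float32) string {
	parts := make([]string, len(v))
	for i, f := range v {
		parts[i] = strconv.FormatFloat(float64(f), 'g', -1, 32)
	}
	return "[" + strings.Join(parts, ",") + "]"
}

// searchSimilarPg pushes the similarity search down into PostgreSQL.
// `embedding <=> $1` is pgvector's cosine-distance operator; ordering by
// it and limiting to topK keeps the scan (or index walk) in the database.
// Table/column names here are hypothetical.
func searchSimilarPg(ctx context.Context, db *sql.DB, query []float32, topK int) (*sql.Rows, error) {
	const q = `
		SELECT object_id, 1 - (embedding <=> $1::vector) AS score
		FROM embeddings
		ORDER BY embedding <=> $1::vector
		LIMIT $2`
	return db.QueryContext(ctx, q, toPgVector(query), topK)
}

func main() {
	fmt.Println(toPgVector([]float32{0.1, 0.2, 0.3})) // → [0.1,0.2,0.3]
}
```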

Furthermore, a more 'ideal' approach would be to expose this component as a plugin, allowing user Q&A data to be synchronized to external systems. This design is similar to how search plugins operate. For instance, an Elasticsearch (ES) plugin synchronizes Q&A data to ES for retrieval. The benefit of a plugin-based implementation is extensibility: users aren't restricted to the built-in database and can use their own custom vector databases. The required interfaces would likely mirror those of a search plugin, such as a data synchronization interface and a search interface.

What do you think?

@hgaol (Member, Author)

First of all, thank you @LinkinStars for the careful review! A goal of this PR is also to make the design clear, so feel free to raise concerns and suggestions!

> The biggest issue with this approach lies here: all data queries are performed in-memory. While this works for small datasets, it is undoubtedly unacceptable for large-scale data.
>
> Therefore, my suggestion is that if users are using PostgreSQL, they could directly utilize PostgreSQL + pgvector to store vector data and perform searches within the database.

It makes sense. It's better to store vectors in a vector DB. And I think it could support multiple vector DBs, while still supporting the main database or memory for test purposes. WDYT?

> Furthermore, a more 'ideal' approach would be to expose this component as a plugin, allowing user Q&A data to be synchronized to external systems. This design is similar to how search plugins operate. For instance, an Elasticsearch (ES) plugin synchronizes Q&A data to ES for retrieval. The benefit of a plugin-based implementation is extensibility: users aren't restricted to the built-in database and can use their own custom vector databases. The required interfaces would likely mirror those of a search plugin, such as a data synchronization interface and a search interface.

About the plugin: do you mean making all the functionality a plugin, or just the storage for vectors? Per your statement, I assume the latter?

@LinkinStars (Member)

> And I think it could support multiple vector DBs
>
> About plugin, do you mean make all the functionalities as a plugin, or just the storage for vectors? Per your statement, I assume the later one?

Yes, if implemented via plugins, it can support different vector databases. I think we can omit the in-memory implementation since it won't be used in most cases.

You're right: it's just a matter of providing vector storage functionality as a plugin. Of course, vector search capabilities are also included.

@hgaol (Member, Author)

Agreed, I love this idea! I'll update the PR per our discussions.

@LinkinStars LinkinStars mentioned this pull request Mar 10, 2026
@hgaol (Member, Author) commented Mar 20, 2026

Sorry for the late update; I've been kind of busy recently, but will start this feature at my earliest convenience! Most likely I can start next week. Actually, it's already WIP.

@LinkinStars (Member)

> Sorry for the late update; I've been kind of busy recently, but will start this feature at my earliest convenience! Most likely I can start next week. Actually, it's already WIP.

@hgaol Don't worry. We have other features in the works as well. So take your time. ♥️

hgaol added 3 commits April 12, 2026 13:22
…mbeddings

- Added a new vector search syncer to aggregate questions and answers with comments for vector embedding.
- Introduced a new VectorSearch interface and related structures for managing vector storage and similarity search.
- Refactored embedding service to delegate semantic search to the new vector search plugin.
- Removed embedding-related fields from SiteAIProvider and UI forms as part of the transition to the new vector search architecture.
- Updated plugin registration to include vector search capabilities.
- Cleaned up embedding service methods and removed unused dependencies.
@hgaol (Member, Author) commented Apr 12, 2026

Updated to use the new vector search plugin. See the demo below.

vector-search-plugin.mp4

Here's a summary of the design and implementation.


Vector Search & Semantic Search Design

Architecture Layers

┌──────────────────────────────────────────────┐
│  AI Chat / MCP Tool ("semantic_search")      │  ← Controller layer
├──────────────────────────────────────────────┤
│  EmbeddingService                            │  ← Service layer (thin facade)
├──────────────────────────────────────────────┤
│  plugin.VectorSearch interface               │  ← Plugin abstraction
├──────────────────────────────────────────────┤
│  pgvector / elasticsearch / weaviate / ...   │  ← Plugin implementations
└──────────────────────────────────────────────┘

Plugin Interface (plugin/vector_search.go)

  • RegisterSyncer(ctx, syncer) — core provides a syncer for bulk data pull
  • SearchSimilar(ctx, query, topK) — returns []VectorSearchResult{ObjectID, ObjectType, Metadata, Score}
  • UpdateContent(ctx, content) — upserts a document with embedding
  • DeleteContent(ctx, objectID) — removes a document
  • ConfigReceiver(config) / ConfigFields() — plugin config lifecycle

GenerateEmbedding() is the shared embedding utility used by plugins.

Content Syncing (vector_search_sync/syncer.go)

Core implements VectorSearchSyncer with:

  • GetQuestionsPage(page, pageSize)
  • GetAnswersPage(page, pageSize)

Each indexed document aggregates question/answer/comment text. Metadata stores deshortened IDs for reconstruction at query time.

Sync is triggered from RegisterSyncer() (startup + config update flow).

Startup & Activation Flow

initPluginData():
  1. Load plugin status from DB
  2. Call ConfigReceiver for configured plugins
     -> parse config always
     -> if active: run heavy init (probe embedding + connect/schema checks)
     -> if inactive: skip heavy init (IsEnabled guard)
  3. Call RegisterSyncer for vector search plugins
     -> if active/initialized: trigger full sync
     -> if inactive/uninitialized: skip sync

On admin config save:

  1. ConfigReceiver
  2. UpdatePluginConfig -> RegisterSyncer -> full sync

Current Behavior Summary

  • Active plugin on startup: does one probe embedding call, then full sync to vector storage.
  • Inactive plugin on startup: parses config only; no probe embedding and no sync.
  • Config save for active plugin: re-runs init path and full sync.

Semantic Search Query Flow

User query -> MCP tool "semantic_search"
  -> EmbeddingService.SearchSimilar(query, topK)
  -> plugin.SearchSimilar() returns scored IDs + metadata
  -> handler fetches full DB content (question/answers/comments)
  -> returns structured semantic search response

@hgaol (Member, Author) commented Apr 12, 2026

There are also some follow-ups I don't have a clear answer for yet. Posting here for discussion.

Follow-up: when to calculate embeddings and sync to vector storage

Comparison with Search Plugin

| Aspect         | Search Plugin                   | VectorSearch Plugin   |
|----------------|---------------------------------|-----------------------|
| Bulk sync      | Yes                             | Yes                   |
| Real-time sync | Yes (create/update/delete hooks)| No                    |
| Trigger        | Event-driven + startup/config   | Startup/config only   |
| Consistency    | Near real-time                  | Eventually consistent |

Current Gap

UpdateContent() / DeleteContent() exist in plugin.VectorSearch, but are not called from question/answer service events. So after the initial sync, content changes are not reflected until the next full re-sync.

Options

  1. Manual (current)

    • Re-sync only on plugin config save/update
    • Simple, but stale results between syncs
  2. Real-time

    • Add event hooks to call vector search update/delete
    • Can be async (goroutine / queue) to avoid write-path latency
    • Higher embedding API call volume
  3. Scheduled (cron)

    • Periodic bulk sync via cron expression
    • Good for off-peak syncing
    • Delayed freshness until next run

@hgaol (Member, Author) commented Apr 12, 2026

I'll create another PR in the plugin repo later for a memory vector search plugin. For test purposes, memory needs fewer settings than the database option. I also implemented some other third-party vector store plugins like weaviate, but some of them haven't been validated yet. I'll submit PRs for them once the validations are done.

@hgaol (Member, Author) commented Apr 12, 2026

Here's the PR for the memory vector search plugin: apache/answer-plugins#315
