SOLR-18187: Document enrichment with LLMs #4259
nicolo-rinaldi wants to merge 21 commits into apache:main
…tUpdateProcessorFactory
- multivalued outputField - outputField different from Str/Text, with numeric, boolean and date
…t with LLMs' module
    restTestHarness.delete(ManagedChatModelStore.REST_END_POINT + "/model1");
  }

  private UpdateRequestProcessor createUpdateProcessor(
Can't this always be generalised and used for all the tests? In some of them, you are now repeating this code with small changes...
This is the same as createUpdateProcessor apart from the creation of the request and getInstance().
Maybe we can factor out the Solr request + getInstance() and use that method here as well, calling it something like "initializeUpdateProcessorFactory"?
What do you think?
I created a function initializeUpdateProcessorFactory that is used inside createUpdateProcessor. This way, the code in initializeUpdateProcessorFactory can be reused.
Why can't some tests use these new functions?
E.g. init_multipleInputFields_shouldInitAllFields
I kept them independent of the model creation, so that they only check that the Factory is initialized properly. I can look into changing this if you want.
…fields and updated documentation
=== Models

* A model in this module is a chat model, that answers with text given a prompt.
* A model in this Solr module is a reference to an external API that runs the Large Language Model responsible for chat
Exactly one of the following parameters is required: `prompt` or `promptFile`.

Another important feature of this module is that one (or more) `inputField` needs to be injected in the prompt. This is
            .messages(UserMessage.from(prompt))
            .build();
    String rawJson = chatModel.chat(chatRequest).aiMessage().text();
    Object parsed = Utils.fromJSONString(rawJson);
Is parsing an 'Object' necessary?
  public SolrChatModel getModel(String modelName) {
    return store.getModel(modelName);
  }
This entire class feels exactly the same as the one I implemented for embedding models.
Can't we use the same class for multiple storage solutions?
That way you instantiate different endpoints but the same class.
It feels like a lot of duplicated code.
            "model '" + name + "' already exists. Please use a different name");
      }
    }
  }
See above regarding the duplicated code.
    // as for now, only a plain text as prompt is sent to the model (no support for
    // tools/skills/agents)
    // chatModel.chat returns the parsed value from the structured JSON response
    Object value = chatModel.chat(injectedPrompt, responseFormat);
Why 'value'? Isn't it the output? Also, does langchain4j return an 'Object'? Isn't that weak typing?
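To make the typing concern concrete, here is a minimal sketch of how a loosely typed parsed value could be narrowed to the type the output field expects before it is stored. The class `TypedOutputSketch` and the method `narrow` are hypothetical names used only for illustration; they are not part of this PR or of langchain4j, and the mapping of JSON numbers to Long/Double and of dates to ISO-8601 strings is an assumption.

```java
import java.time.Instant;

/** Hypothetical helper for illustration only; not the code in this PR. */
final class TypedOutputSketch {

  /** Narrows a loosely typed value to the expected type, failing loudly on a mismatch. */
  static <T> T narrow(Object value, Class<T> expected) {
    if (value == null) {
      throw new IllegalArgumentException("null cannot be narrowed to " + expected.getSimpleName());
    }
    // Assumption: JSON numbers may arrive as any Number subtype, so widen or convert explicitly.
    if (expected == Long.class && value instanceof Number) {
      return expected.cast(((Number) value).longValue());
    }
    if (expected == Double.class && value instanceof Number) {
      return expected.cast(((Number) value).doubleValue());
    }
    // Assumption: dates arrive as ISO-8601 strings such as 2024-01-01T00:00:00Z.
    if (expected == Instant.class && value instanceof String) {
      return expected.cast(Instant.parse((String) value));
    }
    if (expected.isInstance(value)) {
      return expected.cast(value);
    }
    throw new IllegalArgumentException(
        "expected " + expected.getSimpleName() + " but got " + value.getClass().getSimpleName());
  }
}
```

With something like this, the call site could hold a `Long`, `Boolean`, or `Instant` instead of a bare `Object`, and a response of the wrong shape would fail at one well-defined point.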
    String injectedPrompt = prompt;
    for (String fieldName : inputFields) {
      SolrInputField field = doc.get(fieldName);
      if (isNullOrEmpty(field)) {
If a field is null, you skip enrichment for the entire document? That looks suspicious; we should discuss what it means to have multiple input fields.
My assumption would be that you define a processor for a list of fields, to enrich each of them in the same way.
Shouldn't a null field skip only that field rather than the full document?
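For discussion, a minimal sketch of the behaviour being questioned: every input field is injected into one shared prompt, and a single missing value skips the whole document. A plain `Map` stands in for `SolrInputDocument`, the class `PromptInjectionSketch` and method `inject` are hypothetical, and the `{fieldName}` string replacement is an assumption about how injection works.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch; a plain Map stands in for SolrInputDocument. */
final class PromptInjectionSketch {

  /**
   * Replaces each {fieldName} placeholder with the field's value.
   * Returns Optional.empty() when any input field is missing or blank,
   * which corresponds to skipping enrichment for the whole document.
   */
  static Optional<String> inject(
      String prompt, List<String> inputFields, Map<String, Object> doc) {
    String injected = prompt;
    for (String fieldName : inputFields) {
      Object value = doc.get(fieldName);
      if (value == null || value.toString().isBlank()) {
        return Optional.empty(); // one missing field skips the entire document
      }
      injected = injected.replace("{" + fieldName + "}", value.toString());
    }
    return Optional.of(injected);
  }
}
```

Under the alternative reading (one processor enriching each listed field in the same way), the loop would instead issue one prompt per field and `continue` past a missing one, so the two interpretations lead to different control flow.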
 * <li>Exactly one of {@code prompt} or {@code promptFile} must be provided.
 * <li>Every declared {@code inputField} must have a corresponding {@code {fieldName}} placeholder
 *     in the prompt.
 * <li>Every {@code {placeholder}} in the prompt must correspond to a declared {@code inputField}.
This huge comment is unreadable code-wise and looks suspicious; it should probably be improved.
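One option would be to let a small, well-named method carry the two placeholder rules from that Javadoc, so the comment can shrink. The sketch below is only an illustration: `PromptPlaceholderCheck` and `validate` are hypothetical names, and the assumption is that a placeholder is a field name wrapped in single braces, e.g. `{title}`.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not the class from this PR. */
final class PromptPlaceholderCheck {

  // Assumption: a placeholder is a field name wrapped in single braces, e.g. {title}.
  private static final Pattern PLACEHOLDER =
      Pattern.compile("\\{([A-Za-z_][A-Za-z0-9_]*)\\}");

  /** Returns human-readable problems; an empty list means prompt and inputFields agree. */
  static List<String> validate(String prompt, List<String> inputFields) {
    // Collect every distinct placeholder name that appears in the prompt.
    Set<String> placeholders = new LinkedHashSet<>();
    Matcher matcher = PLACEHOLDER.matcher(prompt);
    while (matcher.find()) {
      placeholders.add(matcher.group(1));
    }

    List<String> problems = new ArrayList<>();
    // Rule: every declared inputField needs a matching placeholder.
    for (String field : inputFields) {
      if (!placeholders.contains(field)) {
        problems.add("inputField '" + field + "' has no {" + field + "} placeholder");
      }
    }
    // Rule: every placeholder needs a declared inputField.
    for (String placeholder : placeholders) {
      if (!inputFields.contains(placeholder)) {
        problems.add("placeholder {" + placeholder + "} is not a declared inputField");
      }
    }
    return problems;
  }
}
```

Returning all problems at once, rather than throwing on the first, would let the configuration error message list every mismatch in one go.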
https://issues.apache.org/jira/browse/SOLR-18187
Description
The goal of this PR is to add a way to integrate LLMs directly into Solr at index time, to fill fields that might be useful (e.g., categories or tags).
Solution
This PR adds LLM-based document enrichment to Solr's indexing pipeline via a new DocumentEnrichmentUpdateProcessorFactory in the language-models module. The processor lets users enrich documents at index time by calling an LLM (via https://github.com/langchain4j/langchain4j) with a configurable prompt built from one or more existing document fields (inputFields), and storing the model's response in an output field. The output field can be of several types (string, text, int, long, float, double, boolean, or date) and can be single-valued or multi-valued. Structured output is used so that the model's response matches the output field type.
The implementation takes inspiration from the text-to-vector feature in the same module, to stay consistent with the conventions already used in the language-models module.
Note: this PR was developed with assistance from Claude Code (Anthropic).
Tests
Tests covering configuration validation (missing required params, conflicting params, invalid field types, placeholder mismatches), and processor initialization.
Tests covering single-valued and multi-valued output fields of all supported types, multi-input-field prompts, prompt file loading, error handling (model exceptions, ambiguous/malformed JSON responses, unsupported model types), and skipNullOrMissingFieldValues behaviour. All the supported models have been tested.