07. MultiVectorRetriever
LangChain provides a special feature for querying documents efficiently in a variety of situations: MultiVectorRetriever. It lets you store and manage multiple vectors per document, which can significantly improve the accuracy and efficiency of information retrieval.
Let's look at some ways to create multiple vectors per document using MultiVectorRetriever.
Ways to create multiple vectors per document
Small chunk generation : Split the document into smaller units, then generate a separate embedding for each chunk. This lets retrieval focus on specific parts of the document. This approach can be implemented with ParentDocumentRetriever, which makes it easier to drill down into details.
Summary embedding : Generate a summary of each document and create an embedding from that summary. A summary embedding is a great help in quickly grasping the core content of a document: instead of analyzing the entire document, you can maximize efficiency by using only the key summary parts.
Hypothetical questions : Create suitable hypothetical questions for each document and create embeddings based on those questions. This method is useful when you want a deep exploration of a particular topic or content: the hypothetical questions let the document's content be approached from a variety of perspectives, enabling a broader understanding.
Manual addition : Users can directly add specific questions or queries to be considered when searching documents. This gives users more detailed control over the search process and allows customized searches tailored to their needs.
Document used for practice
Software Policy and Research Institute (SPRi) - AI Brief, December 2023
Authors: Jaeheung Lee and Ji-soo Lee (AI Policy Research Lab, SPRi)
Link: https://spri.kr/posts/view/23669
File name: SPRI_AI_Brief_2023년12월호_F.pdf
Note : download the file above into the data folder before running the examples.
# Configuration file for managing API keys as environment variables
from dotenv import load_dotenv

# Load the API key information
load_dotenv()
Perform a preprocessing step that loads data from a text file and splits the loaded documents into chunks of a specified size.
The split documents can then be used for vectorization and retrieval.
The original documents loaded from the data folder are stored in the docs variable.
Chunk + Original Document Search
When searching over a large amount of information, it can be useful to embed it in smaller units.
MultiVectorRetriever lets you store and manage documents as multiple vectors.
The original documents are saved in the docstore, and the embedded chunks in the vectorstore.
Splitting documents into smaller units allows more accurate searches, while the full content of the original document can still be retrieved when needed.
Here we define parent_text_splitter for splitting into larger chunks
and child_text_splitter for splitting into smaller chunks.
Generate the larger chunks, the parent documents.
Check the doc_id recorded in parent_docs.
Generate the child documents, which are relatively smaller chunks.
Check the doc_id recorded in child_docs.
Check the number of chunks produced by each split.
Add the set of newly created small splits to the vector store.
Next, map the parent documents to the generated UUIDs and add them to the docstore.
The mset() method saves document IDs and document contents to the document store as key-value pairs.
Perform a similarity search and print the first, most similar document chunk.
Here, the retriever.vectorstore.similarity_search method searches within the child + parent document chunks.
This time, run the query using the retriever.invoke() method.
The retriever.invoke() method searches the full contents of the original documents.
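The two search calls can be sketched as below; this assumes the retriever built above and an OpenAI API key, and the query string is a hypothetical example:

```python
# Hypothetical query; replace with a question about your own documents
query = "What does the December 2023 AI Brief say about generative AI?"

if __name__ == "__main__":
    # similarity_search returns the embedded chunks themselves
    chunks = retriever.vectorstore.similarity_search(query)  # noqa: F821
    print(chunks[0].page_content)

    # invoke() maps the matched chunks back to the full original documents
    originals = retriever.invoke(query)  # noqa: F821
    print(originals[0].page_content[:300])
```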
The type of search the retriever performs against the vector database by default is a similarity search.
LangChain vector stores also support search via Max Marginal Relevance, so if you want to use that instead, just set the search_type attribute.
Setting the retriever object's search_type attribute to SearchType.mmr specifies that the Maximal Marginal Relevance (MMR) algorithm should be used at search time.
Saving summaries to the vector store
A summary can often capture a chunk's content more precisely, leading to better search results.
Here we explain how to generate summaries and how to embed them.
Use the chain.batch method to summarize the documents in the docs list. Here the max_concurrency parameter is set to 10 so that up to 10 documents can be processed simultaneously.
Print the summaries to check the results.
Initialize the Chroma vector store that will index the child chunks, using OpenAIEmbeddings as the embedding function.
Use "doc_id" as the key indicating the document ID.
The number of summaries matches the number of original documents.
Save the summarized documents together with their metadata (the document ID generated for each summary).
Perform a similarity search using the vectorstore object's similarity_search method.
Add summary_docs to the vector store with retriever.vectorstore.add_documents(summary_docs), then map doc_ids to docs and save them to the document store with retriever.docstore.mset(list(zip(doc_ids, docs))).
Last edited by: Aug. 31, 2024, 12:15 a.m.
Explore document content using Hypothetical Queries
An LLM can also be used to generate a list of hypothetical questions for a particular document.
Creating hypothetical questions helps capture the main topics and concepts of a document, and can make readers more curious about its content.
The questions created this way can be embedded, which allows the content of the document to be explored and understood in more depth.
Below is an example of using Function Calling to generate hypothetical questions.
Set functions and function_call so that the hypothetical-question generation function is called, and parse the generated questions with JsonKeyOutputFunctionsParser, extracting the value for the questions key.
Use ChatPromptTemplate to define a prompt template that generates 3 hypothetical questions based on a given document.
Use the chain.batch method to process the split_docs data with multiple simultaneous requests.
Print the answers for the documents: the output contains the three hypothetical queries generated.
Below is the process of storing the hypothetical queries in the vector store, in the same way as before.
Add metadata (the document ID) to the question_docs list.
Add the hypothetical queries to the vector store, and the original documents to the docstore.
Below are the results of a similarity search, performed with the vectorstore object's similarity_search method. Since only the hypothetical queries we created have been added, the document whose hypothetical query has the highest similarity to the query is returned.
Finally, search for documents related to the query using the retriever object's invoke method.