07. MultiVectorRetriever

In LangChain, MultiVectorRetriever is a feature that lets you query documents efficiently in a variety of situations. It stores and manages multiple vectors per document, which can significantly improve the accuracy and efficiency of information retrieval.

Let's look at a few ways to create multiple vectors per document with MultiVectorRetriever.

Methods for creating multiple vectors per document

  1. Smaller chunks : Split the document into smaller units, then generate a separate embedding for each chunk. This lets retrieval focus on specific parts of the document. The approach can be implemented with ParentDocumentRetriever and makes it easier to navigate to details.

  2. Summary embedding : Generate a summary of each document and create an embedding from that summary. A summary embedding is a great help in quickly grasping a document's core content: instead of analyzing the entire document, you use only the key summary, which maximizes efficiency.

  3. Hypothetical questions : Create a suitable hypothetical question for each document and build an embedding from that question. This method is useful when you want a deep exploration of a particular topic or content: hypothetical questions let a document's content be approached from a variety of perspectives, enabling a broader understanding.

  4. Manual addition : Users can directly add specific questions or queries to be considered during search. This gives users finer control over the search process and enables searches customized to their needs.

Documents used for practice

Software Policy & Research Institute (SPRi) AI Brief, December 2023

  • Authors: Jaeheung Lee (AI Policy Research Lab, Senior Researcher), Ji-soo Lee (AI Policy Research Lab, Researcher)

  • Link: https://spri.kr/posts/view/23669

  • File name: SPRI_AI_Brief_2023년12월호_F.pdf

Reference: Download the file above into the data folder.


# Configuration file for managing API keys as environment variables
from dotenv import load_dotenv

# Load the API key information
load_dotenv()


Next, perform preprocessing: load data from a text file and split the loaded documents into chunks of a specified size.

The split documents can later be used for vectorization and retrieval.

The original documents loaded from the data are stored in the docs variable.


Chunk + Original Document Search

When searching for large amounts of information, it may be useful to embed information in smaller units.

MultiVectorRetriever lets you store and manage a document as multiple vectors.

Store the original documents in the docstore and the embedded (split) documents in the vectorstore.

This splits documents into smaller units, allowing for more accurate search, while the contents of the original document can still be retrieved when needed.

Here we define parent_text_splitter for splitting into larger chunks and child_text_splitter for splitting into smaller chunks.

Generate the parent documents, the larger chunks.

Check the doc_id assigned to parent_docs.

Generate the child documents, the relatively smaller chunks.

Check the doc_id assigned to child_docs.

Check the number of chunks produced by each split.

Add the newly created set of small splits to the vector store.

Next, map the parent documents to the generated UUIDs and add them to the docstore.

  • The mset() method saves the document ID and document content to the document store as key-value pairs.

Perform a similarity search, outputting the first (most similar) document chunk.

Here, the retriever.vectorstore.similarity_search method searches within the child + parent document chunks.


This time, run the query using the retriever.invoke() method.

The retriever.invoke() method returns the full contents of the original document.


The search type the retriever performs on the vector database is, by default, a similarity search.

LangChain vector stores also support searching via Max Marginal Relevance, so if you want to use that instead, just set the search_type attribute.

  • Set the retriever object's search_type property to SearchType.mmr.

  • This specifies that the Maximum Marginal Relevance (MMR) algorithm should be used at search time.


Saving summaries to the vector store

A summary can often distill a chunk's content more accurately, leading to better search results.

Here we explain how to generate summaries and how to embed them.

Use the chain.batch method to summarize the documents in the docs list. Here, the max_concurrency parameter is set to 10 so that up to 10 documents can be processed simultaneously.


Output the summarized content to confirm the results.


Initialize a Chroma vector store to index the child chunks, using OpenAIEmbeddings as the embedding function.

  • Use "doc_id" as the key indicating the document ID.


The number of summary documents matches the number of original documents.


Create Document objects from the summaries, saving the document ID of the source document in each summary's metadata.


  • retriever.vectorstore.add_documents(summary_docs) adds summary_docs to the vector store.

  • retriever.docstore.mset(list(zip(doc_ids, docs))) maps doc_ids to docs and saves them in the document store.

Perform a similarity search using the vectorstore object's similarity_search method.

Last edited: Aug. 31, 2024, 12:15 a.m.

Explore document content using Hypothetical Queries

An LLM can also be used to generate a list of questions that could be asked about a particular document.

Generating such questions helps capture the main topics and concepts of a document and can make readers more curious about its content. The questions created this way can then be embedded, which allows the document's content to be explored and understood in more depth.

Below is an example that uses Function Calling to generate hypothetical questions.

Use ChatPromptTemplate to define a prompt template that generates three hypothetical questions based on a given document.

  • Set functions and function_call so that the hypothetical-question generation function is called.

  • Use JsonKeyOutputFunctionsParser to parse the generated questions and extract the value of the questions key.

Use the chain.batch method to process multiple requests for the split_docs data simultaneously.

  • The output contains the three generated hypothetical queries.

Output the answers for the documents.

Below is the process of storing the hypothetical queries in the vector store, in the same way as before.

Add the metadata (document ID) to the question_docs list.

Add the hypothetical-query documents to the vector store and the original documents to the docstore.

Perform a similarity search using the vectorstore object's similarity_search method.

Below are the results of the similarity search. Since only the hypothetical queries we created were added, the document whose hypothetical query has the highest similarity is returned.

Finally, use the retriever object's invoke() method to search for documents related to the query.
