05. ParentDocumentRetriever

Balancing document search and document splitting

When dividing documents into appropriately sized pieces (chunks) for retrieval, two important and conflicting factors must be considered:

  1. You may want small documents: this lets the embedding of each document reflect its meaning most accurately. If a document is too long, its embedding can lose meaning.

  2. You may want documents long enough to preserve the context of each chunk.

Role of ParentDocumentRetriever

A tool called ParentDocumentRetriever is used to balance these two requirements. It splits documents into small pieces and manages those pieces. During a search, it first finds the small pieces, then uses the identifier (ID) of the original document (or larger piece) to which they belong to recover the overall context.

The term 'parent document' here refers to the original document from which the small pieces were split. This can be a full document or another relatively large piece. In this way, you can capture the meaning of the document accurately while still maintaining the overall context.
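
The lookup described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not LangChain's implementation: the fixed-width splitter, the word-overlap score, and every name used here (`split_fixed`, `overlap_score`, `build_index`, `retrieve_parents`) are hypothetical and exist only for this example.

```python
import uuid


def split_fixed(text, size):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def overlap_score(query, chunk):
    """Toy relevance score: number of query words that appear in the chunk."""
    chunk_words = set(chunk.lower().split())
    return sum(1 for word in query.lower().split() if word in chunk_words)


def build_index(documents, child_size):
    """Store each parent under an ID and index its children with that ID."""
    docstore = {}      # parent_id -> full parent text
    child_index = []   # list of (child_text, parent_id) pairs
    for text in documents:
        parent_id = str(uuid.uuid4())
        docstore[parent_id] = text
        for child in split_fixed(text, child_size):
            child_index.append((child, parent_id))
    return docstore, child_index


def retrieve_parents(query, docstore, child_index, k=2):
    """Find the k best children, then return their parents without duplicates."""
    ranked = sorted(
        child_index,
        key=lambda item: overlap_score(query, item[0]),
        reverse=True,
    )
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids]


documents = [
    "Embeddings map text to vectors. Short chunks give precise embeddings.",
    "Bananas are yellow. Apples are often red or green.",
]
docstore, child_index = build_index(documents, child_size=35)
results = retrieve_parents("precise embeddings", docstore, child_index, k=2)
print(results)
```

The search runs over the small pieces, but the caller receives the parent text. When several matching pieces share one parent, that parent is returned only once.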

Summary

  • Leverage hierarchies between documents: ParentDocumentRetriever uses the hierarchy between documents to make document retrieval more efficient.

  • Improved search performance: it quickly finds relevant documents and effectively finds the documents that best answer a given question.

To load multiple text files, create TextLoader objects and load the data.


# Load settings so that API keys can be managed as environment variables.
from dotenv import load_dotenv

# Load API key information
load_dotenv()


Retrieving the entire document

In this mode, we want to retrieve entire documents, so we specify only child_splitter.

  • Later, we will also specify parent_splitter and compare the results.


The retriever.add_documents(docs, ids=None) function adds a list of documents.

  • If ids is None, the IDs are generated automatically.

  • Setting add_to_docstore=False skips adding the documents to the docstore again, which avoids duplicates. In that case, ids values are required so that duplicates can be checked.
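
The ID rules above can be mimicked with a small stand-in class. This is only a sketch of the behavior as this page describes it, not the library's code: `ToyParentStore`, its `child_size` parameter, and the error messages are hypothetical names chosen for illustration.

```python
import uuid


class ToyParentStore:
    """Toy stand-in that mimics the ID rules described above (not LangChain code)."""

    def __init__(self):
        self.docstore = {}     # parent_id -> parent text
        self.child_index = []  # (child_text, parent_id) pairs

    def add_documents(self, documents, ids=None, add_to_docstore=True, child_size=20):
        if ids is None:
            if not add_to_docstore:
                # Without IDs, children cannot be linked to parents stored earlier.
                raise ValueError("ids are required when add_to_docstore is False")
            ids = [str(uuid.uuid4()) for _ in documents]
        if len(ids) != len(documents):
            raise ValueError("ids and documents must have the same length")
        for parent_id, text in zip(ids, documents):
            # Index every child piece together with the ID of its parent.
            for start in range(0, len(text), child_size):
                self.child_index.append((text[start:start + child_size], parent_id))
            if add_to_docstore:
                self.docstore[parent_id] = text
        return ids


store = ToyParentStore()
generated = store.add_documents(["first document text", "second document text"])
keys = list(store.docstore)
print(len(keys))
```

In this sketch, passing ids=None generates one ID per document, and the docstore ends up with one key per document added.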


Listing the keys in the store should return two keys, because we added two documents.

  • Call the yield_keys() method of the store object and convert the returned keys to a list.


Now let's call the vector store search function.

Since we are storing small chunks, we will be able to confirm that small chunks are returned as a result of the search.

Perform a similarity search using the similarity_search method of the vectorstore object.


Print sub_docs[0].page_content.


Now let's search through the retriever itself. In this process, the retriever returns the documents in which the small chunks are located, so relatively large documents are returned.

Use the invoke() method of the retriever object to search for documents related to the query.


Print part of the content of the retrieved document (retrieved_docs[0]).


Adjusting the size of the larger chunks

As the previous result shows, an entire document can be too large to be useful as a search result.

In this case, what we really want to do is first split the raw document into larger chunks, then into smaller chunks.

We then index the small chunks but return the larger chunks at search time (still not the whole document).

  • Use RecursiveCharacterTextSplitter to create the parent and child documents.

  • For parent documents, chunk_size is set to 1000.

  • For child documents, chunk_size is set to 200, so they are smaller than the parent documents.
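
To get a feel for how the two sizes relate, the sketch below splits a synthetic text in two stages. It uses a hypothetical fixed-width helper (`split_fixed`) rather than RecursiveCharacterTextSplitter, which prefers to break on separators such as paragraphs and sentences, so real chunk counts will differ. Only the sizes 1000 and 200 come from the text above.

```python
def split_fixed(text, size):
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Synthetic 2500-character document (an assumption made for this example).
text = "x" * 2500

# Stage 1: larger parent chunks (chunk_size=1000 in the text above).
parents = split_fixed(text, 1000)

# Stage 2: smaller child chunks cut from each parent (chunk_size=200).
children_per_parent = [split_fixed(parent, 200) for parent in parents]

print(len(parents))
print([len(children) for children in children_per_parent])
```

Each child belongs to exactly one parent, which is what allows a match on a 200-character child to be answered with its 1000-character parent.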


Initialize ParentDocumentRetriever with the following parameters.

  • The vectorstore parameter specifies the vector store that holds the document vectors.

  • The docstore parameter specifies the document store that holds the document data.

  • The child_splitter parameter specifies the text splitter used to create the child documents.

  • The parent_splitter parameter specifies the text splitter used to create the parent documents.

ParentDocumentRetriever handles hierarchical document structures by splitting and storing parent and child documents separately. This lets you make effective use of both parent and child documents at search time.


Add docs to the retriever object. This adds the new documents to the set of documents the retriever can search.


Now you can see that the number of documents is much higher. These are the larger chunks.


Let's check that the underlying vector store still returns small chunks.

Perform a similarity search using the similarity_search method of the vectorstore object.


This time, use the invoke() method of the retriever object to search for documents.
