05. ParentDocumentRetriever

Balancing document search and document splitting

When splitting documents into appropriately sized pieces (chunks) for retrieval, two conflicting requirements must be balanced:

You may want small documents: their embeddings then reflect their meaning most accurately. If a document is too long, its embedding can lose meaning.
You may want documents long enough to preserve the context of each chunk.

Role of ParentDocumentRetriever

ParentDocumentRetriever is a tool for balancing these two requirements. It splits documents into small chunks and manages them. During retrieval, it first finds the small chunks, then uses the identifier (ID) of the original document (or larger chunk) that each one belongs to in order to return the broader context.

The term 'parent document' here refers to the source document from which the small chunks were split. This can be the full original document or another, relatively large chunk. This way, the meaning of the document is captured accurately while the overall context is preserved.
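The idea of indexing small chunks but returning their parent can be sketched in plain Python. This is a toy illustration under stated assumptions, not how the library is implemented: the names split_fixed, overlap_score, and retrieve_parent are hypothetical, splitting is by fixed character count, and word overlap stands in for embedding similarity.

```python
# Toy illustration of "index small, return big".
# All names are hypothetical; this is not LangChain's implementation.

def split_fixed(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def overlap_score(query: str, chunk: str) -> int:
    """Stand-in for embedding similarity: count shared lowercase words."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


# Parent documents, keyed by ID (this plays the role of the document store).
parents = {
    "doc-a": "Word2Vec maps words to vectors. Similar words end up close together.",
    "doc-b": "Crawling collects data by visiting web pages in an automated manner.",
}

# Index: every small chunk remembers the ID of the parent it came from.
child_index = [
    (parent_id, chunk)
    for parent_id, text in parents.items()
    for chunk in split_fixed(text, 32)
]


def retrieve_parent(query: str) -> str:
    """Find the best-matching small chunk, then return its whole parent."""
    best_parent_id, _ = max(
        child_index, key=lambda item: overlap_score(query, item[1])
    )
    return parents[best_parent_id]


print(retrieve_parent("web pages"))
```

The match is made against a 32-character chunk, but the caller receives the full parent text, which is the trade-off the retriever is designed to provide.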

Summary

Leverages the hierarchy between documents: ParentDocumentRetriever uses the parent-child hierarchy between documents to increase the efficiency of document retrieval.
Improved retrieval performance: relevant documents are found quickly, and the documents that best answer a given question are found effectively.

Create TextLoader objects to load multiple text files, then load the data.

# Configuration for managing API keys as environment variables.
from dotenv import load_dotenv

# Load API key information.
load_dotenv()

True

# Set up LangSmith tracing. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH11-Retriever")

Start tracking LangSmith. 
[Project name] 
CH11-Retriever

from langchain.storage import InMemoryStore
from langchain_community.document_loaders import TextLoader
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever

loaders = [
    # load the file.
    TextLoader("./data/appendix-keywords.txt"),
]

docs = []
for loader in loaders:
    # Use the loader to load a document and add it to the docs list.
    docs.extend(loader.load())

Retrieving the full document

In this mode, we want to retrieve the full document, so we specify only child_splitter.

Later, we will also specify parent_splitter and compare the results.

# Create a child splitter.
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

# Create a DB.
vectorstore = Chroma(
    collection_name="full_documents", embedding_function=OpenAIEmbeddings()
)

store = InMemoryStore()

# Create the retriever.
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

Add the list of documents with the retriever.add_documents(docs, ids=None) function.

If ids is None, the IDs are generated automatically.
Setting add_to_docstore=False skips adding the documents to the docstore, which avoids storing duplicates. In that case, however, ids must be provided so that duplicates can be identified.
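The rule for these two arguments can be sketched as a small standalone function. This is a hypothetical helper (resolve_parent_ids is not part of the library) that only mirrors the behavior described above: IDs are generated when none are given, and explicit IDs are required when the docstore is not being written.

```python
import uuid


def resolve_parent_ids(num_docs, ids=None, add_to_docstore=True):
    """Hypothetical helper mirroring the ids / add_to_docstore rule."""
    if ids is None:
        if not add_to_docstore:
            # Without IDs there is no way to link chunks to existing parents.
            raise ValueError("ids must be provided when add_to_docstore=False")
        # No IDs given: generate one unique ID per document.
        return [str(uuid.uuid4()) for _ in range(num_docs)]
    if len(ids) != num_docs:
        raise ValueError("expected exactly one id per document")
    return list(ids)


generated = resolve_parent_ids(2)
print(len(generated))
print(resolve_parent_ids(1, ids=["my-doc"], add_to_docstore=False))
```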

# Add the documents to the retriever. docs is a list of documents, and ids is a list of unique identifiers for the documents.
retriever.add_documents(docs, ids=None, add_to_docstore=True)

This code should return one key, because we added one document.

Call the yield_keys() method of the store object and convert the returned keys to a list.

# Return a list of all keys in the store.
list(store.yield_keys())

 ['c2a89a0f-a690-4915-af68-2ea432fb6e51']

Now let's call the vector store search function.

Since we are storing small chunks, we will be able to confirm that small chunks are returned as a result of the search.

Perform a similarity search using the similarity_search method of the vectorstore object.

# Perform a similarity search.
sub_docs = vectorstore.similarity_search("Word2Vec")

Print sub_docs[0].page_content.

# Print the page_content attribute of the first element in the sub_docs list.
print(sub_docs[0].page_content)

Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity

Now let's search through the full retriever. Because it returns the documents in which the matching small chunks are located, relatively large documents will be returned.
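The step from small-chunk hits back to parent documents can be sketched as follows. This is a simplified illustration rather than the library's code: it assumes each child chunk carries its parent's ID in its metadata under a key (called doc_id here), and unique_parent_ids is a hypothetical helper name.

```python
def unique_parent_ids(child_hits, id_key="doc_id"):
    """Collect parent IDs from child-chunk metadata, in hit order, without repeats."""
    seen = []
    for metadata in child_hits:
        parent_id = metadata.get(id_key)
        # Skip chunks with no parent ID and parents that were already collected.
        if parent_id is not None and parent_id not in seen:
            seen.append(parent_id)
    return seen


# Metadata of four child hits: two point to the same parent, one has no ID.
hits = [{"doc_id": "p2"}, {"doc_id": "p1"}, {"doc_id": "p2"}, {}]
print(unique_parent_ids(hits))  # ['p2', 'p1']
```

Because several small chunks can come from the same parent, the number of documents returned can be smaller than the number of chunk hits.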

Use the invoke() method of the retriever object to retrieve documents related to the query.

# Search and retrieve documents.
retrieved_docs = retriever.invoke("Word2Vec")

Print part of the retrieved document (retrieved_docs[0]).

# Print the length of the page content of the retrieved document.
print(
    f"Document length: {len(retrieved_docs[0].page_content)}",
    end="\n\n=====================\n\n",
)

# Print part of the document.
print(retrieved_docs[0].page_content[2000:2500])

Document length: 5733 

===================== 

 Innovating data storage and processing by introducing computing is an example of digital transformation. 
Related Keywords: innovation, technology, business model 

Crawling 

Definition: Crawl is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis. 
Example: Crawl is a Google search engine to visit a web site on the Internet to collect and index content. 
Associates: data collection, web scraping, search engine 

Word2Vec 

Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 

Definition: LLM is a large language model trained with large-scale text data

Adjusting the size of larger chunks

As the previous result shows, the full document may be too large to be useful as a retrieval result.

In this case, what we really want is to first split the raw document into larger chunks, and then split those into smaller chunks.

We then index the small chunks, but at query time return the larger chunks (still not the full document).

Use RecursiveCharacterTextSplitter to create the parent and child documents.
The parent documents use a chunk_size of 1000.
The child documents use a chunk_size of 200, so they are smaller than the parent documents.
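To make the two-level split concrete, here is a minimal sketch using the same sizes (1000 and 200). It assumes plain fixed-size character slicing through a hypothetical split_fixed helper. RecursiveCharacterTextSplitter instead splits on separators such as paragraph and line breaks, so real chunk counts and boundaries will differ.

```python
def split_fixed(text: str, size: int) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Placeholder text of 2500 characters standing in for a raw document.
raw_document = "x" * 2500

# First level: larger parent chunks.
parent_chunks = split_fixed(raw_document, 1000)

# Second level: each parent chunk is split again into small child chunks,
# grouped by the index of the parent they came from.
child_chunks = {
    parent_index: split_fixed(parent, 200)
    for parent_index, parent in enumerate(parent_chunks)
}

print([len(p) for p in parent_chunks])
print({i: len(children) for i, children in child_chunks.items()})
```

Each child chunk stays tied to exactly one parent chunk, which is what lets a match on a 200-character piece return the surrounding 1000-character piece.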

# A text splitter used to generate the parent document.
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
# A text splitter used to generate child documents.
# Child chunks must be smaller than the parent chunks.
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)
# A vector store to use for indexing child chunks.
vectorstore = Chroma(
    collection_name="split_parents", embedding_function=OpenAIEmbeddings()
)
# This is the storage layer of the parent document.
store = InMemoryStore()

The following code initializes ParentDocumentRetriever.

The vectorstore parameter specifies the vector store that holds the document vectors.
The docstore parameter specifies the document store that holds the document data.
The child_splitter parameter specifies the text splitter used to create the child documents.
The parent_splitter parameter specifies the text splitter used to create the parent documents.

ParentDocumentRetriever handles the hierarchical document structure by splitting and storing parent and child documents separately. This lets retrieval make effective use of both.

retriever = ParentDocumentRetriever(
    # Specifies a vector storage.
    vectorstore=vectorstore,
    # Specify a document repository.
    docstore=store,
    # Specifies a subdocument divider.
    child_splitter=child_splitter,
    # Specifies the parent document divider.
    parent_splitter=parent_splitter,
)

Add docs to the retriever object. This adds the new documents to the set of documents the retriever can search.

retriever.add_documents(docs)  # Adds a document to the retriever.

Now the store holds many more documents. These are the larger chunks.

# Get the keys from the store, convert them to a list, and return its length.
len(list(store.yield_keys()))

Let's confirm that the underlying vector store still returns the small chunks.

Perform a similarity search using the similarity_search method of the vectorstore object.

# Perform a similarity search.
sub_docs = vectorstore.similarity_search("Word2Vec")
# Print the page_content attribute of the first element in the sub_docs list.
print(sub_docs[0].page_content)

Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity

This time, use the invoke() method of the retriever object to retrieve documents.

# Search and retrieve documents.
retrieved_docs = retriever.invoke("Word2Vec")

# Print the page content of the first retrieved document.
print(retrieved_docs[0].page_content)

 Definition: Transformers are a type of deep-learning model used in natural language processing, mainly used for translation, summary, text generation, etc. This is based on the Attention mechanism. 
Example: Google translators use transformer models to perform translations between different languages. 
Associated Keywords: deep learning, natural language processing, Attention 

HuggingFace 

Definition: HuggingFace is a library that provides a variety of pre-trained models and tools for natural language processing. This helps researchers and developers do NLP work easily. 
Example: You can use HuggingFace's Transformers library to do emotional analysis, text generation, and more. 
Associates: Natural language processing, deep learning, library 

Digital Transformation 

Definition: Digital transformation is the process of leveraging technology to transform a company's services, culture and operations. This focuses on improving the business model and increasing competitiveness through digital technology. 
Example: Innovating data storage and processing by introducing cloud computing is an example of digital transformation. 
Related Keywords: innovation, technology, business model 

Crawling 

Definition: Crawl is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis. 
Example: Crawl is a Google search engine to visit a web site on the Internet to collect and index content. 
Associates: data collection, web scraping, search engine 

Word2Vec 

Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model)

Last updated 5 months ago