02. ContextualCompressionRetriever

One of the difficulties a retrieval system faces is that, when data is ingested into the system, it is not known in advance which specific queries it will need to handle.

This means that the information most relevant to a query may be buried in documents containing large amounts of unrelated text.

Passing these entire documents to your application can lead to more expensive LLM calls and lower quality responses.

The ContextualCompressionRetriever is designed to solve this problem.

The idea is simple: instead of immediately returning the retrieved documents as-is, you compress them using the context of the given query, so that only relevant information is returned.

"Compression" here means both compressing the contents of individual documents and filtering out documents entirely.

The ContextualCompressionRetriever passes the query to the base retriever, takes the initially retrieved documents, and passes them through a Document Compressor.

The Document Compressor takes the list of documents and shrinks it, either by reducing the contents of individual documents or by dropping documents entirely.

Source: https://drive.google.com/uc?id=1CtNgWODXZudxAWSRiWgSGEoTNrUFT98v


```python
# Configuration for managing API keys as environment variables (.env file)
from dotenv import load_dotenv

# Load the API key information
load_dotenv()
```


True


pretty_print_docs is a helper function that prints a list of documents in a readable format.

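A minimal sketch of such a helper (the exact implementation may differ):

```python
def pretty_print_docs(docs):
    # Print each document's page_content, separated by a numbered divider
    print(
        f"\n{'-' * 100}\n".join(
            f"Document {i + 1}:\n\n{d.page_content}" for i, d in enumerate(docs)
        )
    )
```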

Base Retriever Setup

Let's start by initializing a simple vector store retriever and storing a text document in it as chunks.

When you ask an example question, you can see that the retriever returns one or two relevant documents along with a few unrelated ones.

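A minimal sketch of such a setup, assuming an OpenAI API key is configured; the file path and example query are illustrative, not from the original:

```python
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Load a sample document and split it into chunks
# ("data/sample.txt" is an illustrative path)
loader = TextLoader("data/sample.txt")
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)
docs = loader.load_and_split(text_splitter)

# Index the chunks in a FAISS vector store and expose it as a retriever
retriever = FAISS.from_documents(docs, OpenAIEmbeddings()).as_retriever()

# An example query returns a mix of relevant and irrelevant chunks
docs = retriever.invoke("What is semantic search?")
pretty_print_docs(docs)
```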

ContextualCompression

Here we create a DocumentCompressor using LLMChainExtractor and apply it to the retriever, producing a ContextualCompressionRetriever.

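A sketch of this step, assuming the `retriever` and `pretty_print_docs` defined above and a configured OpenAI key; the model choice is illustrative:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

# LLM-backed compressor that extracts only the query-relevant passages
llm = ChatOpenAI(temperature=0)  # model choice is illustrative
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=retriever,  # base retriever from the previous step
)

compressed_docs = compression_retriever.invoke("What is semantic search?")
pretty_print_docs(compressed_docs)
```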

Filter documents using LLM

LLMChainFilter

LLMChainFilter is a simpler but more robust compressor that uses an LLM chain to decide which of the initially retrieved documents to filter out and which to return.

This filter returns documents selectively, without altering (compressing) their contents.

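A sketch of this step under the same assumptions as above (existing `retriever` and `pretty_print_docs`, configured OpenAI key):

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainFilter
from langchain_openai import ChatOpenAI

# The filter asks the LLM a yes/no question per document and keeps only "yes"
llm_filter = LLMChainFilter.from_llm(ChatOpenAI(temperature=0))

compression_retriever = ContextualCompressionRetriever(
    base_compressor=llm_filter,
    base_retriever=retriever,  # base retriever from earlier
)

pretty_print_docs(compression_retriever.invoke("What is semantic search?"))
```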

EmbeddingsFilter

Performing an additional LLM call for each retrieved document is expensive and slow.

EmbeddingsFilter provides a cheaper and faster option: it embeds the documents and the query, and returns only those documents whose embeddings are sufficiently similar to the query.

This saves you money and time while maintaining the relevance of your search results.

Below is the process of compressing and retrieving relevant documents using EmbeddingsFilter together with ContextualCompressionRetriever.

  • The EmbeddingsFilter keeps only documents at or above the specified similarity threshold (0.86).

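A sketch of this step, reusing the `retriever` and `pretty_print_docs` from above and assuming an OpenAI key is configured:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain_openai import OpenAIEmbeddings

# Keep only documents whose embedding similarity to the query is >= 0.86
embeddings_filter = EmbeddingsFilter(
    embeddings=OpenAIEmbeddings(), similarity_threshold=0.86
)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=embeddings_filter,
    base_retriever=retriever,
)

pretty_print_docs(compression_retriever.invoke("What is semantic search?"))
```

No LLM call is made per document here, only embedding lookups, which is why this option is cheaper and faster.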

Creating a Pipeline (Compressor + Document Transformer)

Using DocumentCompressorPipeline, you can combine multiple compressors sequentially.

Along with compressors, you can add a BaseDocumentTransformer to the pipeline; a transformer does not perform contextual compression but simply applies a transformation to the set of documents.

For example, a TextSplitter can be used as a document transformer to split documents into smaller chunks, and EmbeddingsRedundantFilter can be used to filter out duplicate documents based on embedding similarity between documents (by default, documents with a similarity of 0.95 or higher are considered duplicates).

Below, a compressor pipeline is created that first splits documents into smaller chunks, then removes duplicates, and finally filters by relevance to the query.

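A sketch of such a pipeline; the chunk size and thresholds are illustrative, and an OpenAI key is assumed to be configured:

```python
from langchain.retrievers.document_compressors import (
    DocumentCompressorPipeline,
    EmbeddingsFilter,
)
from langchain_community.document_transformers import EmbeddingsRedundantFilter
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

embeddings = OpenAIEmbeddings()

# 1) Split documents into smaller chunks
splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)
# 2) Drop near-duplicate chunks (default: similarity >= 0.95 counts as duplicate)
redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)
# 3) Keep only chunks relevant to the query
relevant_filter = EmbeddingsFilter(embeddings=embeddings, similarity_threshold=0.86)

# The stages run in order over the retrieved documents
pipeline_compressor = DocumentCompressorPipeline(
    transformers=[splitter, redundant_filter, relevant_filter]
)
```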

Initialize the ContextualCompressionRetriever, passing pipeline_compressor as base_compressor and retriever as base_retriever.

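A sketch of the final step, assuming the `pipeline_compressor`, `retriever`, and `pretty_print_docs` defined above:

```python
from langchain.retrievers import ContextualCompressionRetriever

compression_retriever = ContextualCompressionRetriever(
    base_compressor=pipeline_compressor,  # pipeline built above
    base_retriever=retriever,             # base retriever from earlier
)

pretty_print_docs(compression_retriever.invoke("What is semantic search?"))
```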
