01. VectorStore-backed Retriever

A VectorStore-backed retriever is a retriever that searches for documents using a vector store.

It queries the text stored in the vector store using the search methods the vector store implements, such as similarity search and MMR.

Run the code below to create the VectorStore.

# Use a .env file to manage API keys as environment variables.
from dotenv import load_dotenv

# Load the API key information.
load_dotenv()

True

# Set up LangSmith tracing. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH11-Retriever")

 Start tracking LangSmith. 
[Project name] 
CH11-Retriever

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader

# Load the file using TextLoader.
loader = TextLoader("./data/appendix-keywords.txt")

# Load the document.
documents = loader.load()

# Create a CharacterTextSplitter that splits text by character.
# The chunk size is 300 and there is no overlap between chunks.
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)

# Split the loaded document.
split_docs = text_splitter.split_documents(documents)

# Create the OpenAI embeddings object.
embeddings = OpenAIEmbeddings()

# Create a FAISS vector database from the split documents and the embeddings.
db = FAISS.from_documents(split_docs, embeddings)

Initializing a VectorStoreRetriever from a VectorStore (as_retriever)

The as_retriever method initializes and returns a VectorStoreRetriever based on the VectorStore object. Through this method you can set various search options to perform document searches tailored to your needs.

Parameters

**kwargs : Keyword arguments to pass to the search function
search_type : Search type ("similarity", "mmr", "similarity_score_threshold")
search_kwargs : Additional search options
- k : Number of documents to return (default: 4)
- score_threshold : Minimum similarity threshold for the similarity_score_threshold search
- fetch_k : Number of documents to pass to the MMR algorithm (default: 20)
- lambda_mult : Diversity of the MMR results (between 0 and 1, default: 0.5)
- filter : Filtering based on document metadata

Return value

VectorStoreRetriever : Initialized VectorStoreRetriever object

Notes

Several search strategies can be used (similarity, MMR, threshold-based).
The MMR (Maximal Marginal Relevance) algorithm lets you control the diversity of the search results.
Metadata filtering lets you retrieve only the documents that meet specific conditions.
Tags can be added to the retriever through the tags parameter.

Caution

search_type and search_kwargs must be combined appropriately.
When using MMR, the fetch_k and k values need to be balanced.
If score_threshold is set too high, the search may return no results.
When using filter, you need to know the metadata structure of the dataset precisely.
The closer lambda_mult is to 0, the higher the diversity; the closer it is to 1, the higher the similarity.

# Create a retriever from the vector database and assign it to the retriever variable.
retriever = db.as_retriever()
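
The cautions above can be turned into a small pre-flight check. The sketch below is only an illustration in plain Python: `check_search_config` is a hypothetical helper, not part of LangChain, and the defaults it assumes (k=4, fetch_k=20, lambda_mult=0.5) are the ones listed in the parameter table above.

```python
VALID_SEARCH_TYPES = ("similarity", "mmr", "similarity_score_threshold")


def check_search_config(search_type="similarity", search_kwargs=None):
    """Hypothetical helper: return a list of problems (empty list means OK)."""
    kwargs = search_kwargs or {}
    problems = []

    if search_type not in VALID_SEARCH_TYPES:
        problems.append(f"unknown search_type: {search_type!r}")

    # A threshold search needs a numeric score_threshold.
    if search_type == "similarity_score_threshold":
        if not isinstance(kwargs.get("score_threshold"), (int, float)):
            problems.append("score_threshold is required for this search_type")

    # MMR re-ranks fetch_k candidates down to k results, so fetch_k >= k.
    if search_type == "mmr":
        if kwargs.get("fetch_k", 20) < kwargs.get("k", 4):
            problems.append("fetch_k should be at least k")
        if not 0 <= kwargs.get("lambda_mult", 0.5) <= 1:
            problems.append("lambda_mult should be between 0 and 1")

    return problems


print(check_search_config("mmr", {"k": 2, "fetch_k": 10, "lambda_mult": 0.6}))
print(check_search_config("similarity_score_threshold"))
```

Running a check like this before calling as_retriever makes mismatched combinations visible early instead of surfacing as empty or surprising search results.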

Retriever invoke()

The invoke method is the main entry point of a Retriever and is used to retrieve related documents. It calls the Retriever synchronously and returns the documents relevant to the given query.

Parameters

input : Search query string
config : Retriever configuration (Optional[RunnableConfig])
**kwargs : Additional arguments to pass to the Retriever

Return value

List[Document] : List of related documents

# Retrieve related documents.
docs = retriever.invoke("What is Embedding?")

for doc in docs:
    print(doc.page_content)
    print("=========================================================")

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Associated Keywords: natural language processing, vectorization, deep learning 

Token 
========================================================= 
Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 
========================================================= 
Semantic Search 

Definition: A semantic search is a search method that goes beyond a simple keyword match for a user's query and grasps its meaning and returns related results. 
Example: When a user searches for a "solar planet", it returns information about the related planet, such as "Vegetic", "Mars", etc. 
Associates: natural language processing, search algorithms, data mining 

Embedding 
========================================================= 
Definition: Crawl is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis. 
Example: Crawl is a Google search engine to visit a web site on the Internet to collect and index content. 
Associates: data collection, web scraping, search engine 

Word2Vec 
=========================================================

Maximal Marginal Relevance (MMR)

MMR (Maximal Marginal Relevance) is one way to avoid duplication among the documents retrieved for a query.

Instead of simply retrieving only the most relevant items, MMR considers both a document's relevance to the query and how different it is from the documents already selected.

Set the search_type parameter to "mmr" to use the MMR (Maximal Marginal Relevance) search algorithm.
k : Number of documents to return (default: 4)
fetch_k : Number of documents to pass to the MMR algorithm (default: 20)
lambda_mult : Diversity of the MMR results (0 to 1, default: 0.5; 0: maximum diversity, 1: minimum diversity, i.e. relevance only)

# Specify MMR (Maximal Marginal Relevance) as the search type.
retriever = db.as_retriever(
    search_type="mmr", search_kwargs={"k": 2, "fetch_k": 10, "lambda_mult": 0.6}
)

# Search for related documents.
docs = retriever.invoke("What is Embedding?")

# Print the retrieved documents.
for doc in docs:
    print(doc.page_content)
    print("=========================================================")

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Associated Keywords: natural language processing, vectorization, deep learning 

Token 
========================================================= 
Semantic Search 

Definition: A semantic search is a search method that goes beyond a simple keyword match for a user's query and grasps its meaning and returns related results. 
Example: When a user searches for a "solar planet", it returns information about the related planet, such as "Vegetic", "Mars", etc. 
Associates: natural language processing, search algorithms, data mining 

Embedding 
=========================================================
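
To make the trade-off behind lambda_mult concrete, here is a minimal plain-Python sketch of the MMR selection rule: at each step it picks the candidate with the best value of lambda_mult * relevance - (1 - lambda_mult) * redundancy. The function names and the tiny 2-D vectors are made up for illustration; this is not LangChain's implementation, and in a real retriever the candidate pool would be the fetch_k nearest documents.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=4, lambda_mult=0.5):
    """Return the indices of up to k documents chosen by MMR."""
    if not doc_vecs or k <= 0:
        return []
    relevance = [cosine(query_vec, v) for v in doc_vecs]
    # The first pick is always the document most similar to the query.
    selected = [max(range(len(doc_vecs)), key=lambda i: relevance[i])]
    while len(selected) < min(k, len(doc_vecs)):
        best_idx, best_score = None, -math.inf
        for i in range(len(doc_vecs)):
            if i in selected:
                continue
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(cosine(doc_vecs[i], doc_vecs[j]) for j in selected)
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected


# Toy data: documents 0 and 1 are near-duplicates, document 2 is different.
query = [1.0, 1.0]
docs = [[1.0, 0.9], [1.0, 0.8], [0.2, 1.0]]

print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # [0, 1]
print(mmr_select(query, docs, k=2, lambda_mult=0.5))  # [0, 2]
```

With lambda_mult=1.0 the near-duplicate is returned because only relevance counts; lowering it to 0.5 swaps in the less similar but more distinct document.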

Similarity score threshold search (similarity_score_threshold)

You can set a similarity score threshold and use a search method that returns only the documents whose score is at or above that threshold.

By setting the threshold appropriately, you can filter out less relevant documents and keep only the most similar ones.

Set the search_type parameter to "similarity_score_threshold" to perform a search based on the similarity score threshold.

Pass {"score_threshold": 0.8} in the search_kwargs parameter to set the similarity score threshold to 0.8. This means that only documents with a similarity score of 0.8 or higher are returned.

retriever = db.as_retriever(
    # Set the search type to "similarity_score_threshold".
    search_type="similarity_score_threshold",
    # Setting the threshold
    search_kwargs={"score_threshold": 0.8},
)

# Retrieve and print related documents.
for doc in retriever.invoke("What is Word2Vec?"):
    print(doc.page_content)
    print("=========================================================")

Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 
=========================================================
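
The effect of the threshold can be sketched without a vector store. The example below assumes scores normalized to the 0 to 1 range and uses made-up (text, score) pairs; `filter_by_threshold` is a hypothetical helper for illustration, not a LangChain function.

```python
def filter_by_threshold(scored_docs, score_threshold):
    """Keep (text, score) pairs with score >= score_threshold, best first."""
    kept = [(text, score) for text, score in scored_docs if score >= score_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up similarity scores for three chunks.
scored = [("Embedding", 0.79), ("Word2Vec", 0.86), ("Crawling", 0.41)]

print(filter_by_threshold(scored, 0.8))   # [('Word2Vec', 0.86)]
print(filter_by_threshold(scored, 0.95))  # []
```

The second call shows the failure mode mentioned earlier: a threshold that is too high leaves no results at all.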

top_k setting

You can specify search keyword arguments (kwargs) such as k to use when searching.

The k parameter is the number of top results to return from the search.

Setting k to 1 in search_kwargs returns a single document as the search result.

# Set k.
retriever = db.as_retriever(search_kwargs={"k": 1})

# Retrieve related documents.
docs = retriever.invoke("What is Embedding?")

# Print the retrieved documents.
for doc in docs:
    print(doc.page_content)
    print("=========================================================")

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Associated Keywords: natural language processing, vectorization, deep learning 

Token 
=========================================================
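
Conceptually, k just truncates the ranked list. As a rough sketch in plain Python (the vectors and the `top_k` helper are invented for illustration and do not reflect how FAISS searches internally):

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, docs, k=4):
    """docs is a list of (text, vector); return the k most similar texts."""
    ranked = heapq.nlargest(k, docs, key=lambda d: cosine(query_vec, d[1]))
    return [text for text, _ in ranked]


# Toy 2-D "embeddings" for three chunks.
chunks = [
    ("Embedding", [1.0, 0.0]),
    ("Word2Vec", [0.9, 0.1]),
    ("Crawling", [0.0, 1.0]),
]

print(top_k([1.0, 0.0], chunks, k=1))  # ['Embedding']
print(top_k([1.0, 0.0], chunks, k=2))  # ['Embedding', 'Word2Vec']
```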

Dynamic settings (Configurable)

Use ConfigurableField to adjust search settings dynamically.
ConfigurableField sets the unique identifier, name, and description of a search parameter.
To adjust the search settings, specify them through the config parameter.
The search settings are stored under the configurable key of the dictionary passed to the config parameter.
The search settings are passed together with the search query, so they can be adjusted for each query.

from langchain_core.runnables import ConfigurableField

# Set k=1 as the default, then expose search_type and search_kwargs as configurable fields.
retriever = db.as_retriever(search_kwargs={"k": 1}).configurable_fields(
    search_type=ConfigurableField(
        id="search_type",
        name="Search Type",
        description="The search type to use",
    ),
    search_kwargs=ConfigurableField(
        # Set a unique identifier for the search parameters
        id="search_kwargs",
        # Set the name of the search parameter
        name="Search Kwargs",
        # Write a description for your search parameters
        description="The search kwargs to use",
    ),
)

Below is an example with dynamic search settings.

# Specify the search settings. Setting k=3 makes the FAISS search return the 3 most similar documents.
config = {"configurable": {"search_kwargs": {"k": 3}}}

# Retrieve related documents.
docs = retriever.invoke("What is Embedding?", config=config)

# Print the retrieved documents.
for doc in docs:
    print(doc.page_content)
    print("=========================================================")

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Associated Keywords: natural language processing, vectorization, deep learning 

Token 
========================================================= 
Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 
========================================================= 
Semantic Search 

Definition: A semantic search is a search method that goes beyond a simple keyword match for a user's query and grasps its meaning and returns related results. 
Example: When a user searches for a "solar planet", it returns information about the related planet, such as "Vegetic", "Mars", etc. 
Associates: natural language processing, search algorithms, data mining 

Embedding 
=========================================================

# Specify the search settings. Only documents with a score of 0.8 or higher are returned.
config = {
    "configurable": {
        "search_type": "similarity_score_threshold",
        "search_kwargs": {
            "score_threshold": 0.8,
        },
    }
}

# Retrieve related documents.
docs = retriever.invoke("What is Word2Vec?", config=config)

# Print the retrieved documents.
for doc in docs:
    print(doc.page_content)
    print("=========================================================")

 Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 
=========================================================

# Specify the search settings. Use the MMR search type.
config = {
    "configurable": {
        "search_type": "mmr",
        "search_kwargs": {"k": 2, "fetch_k": 10, "lambda_mult": 0.6},
    }
}

# Retrieve related documents.
docs = retriever.invoke("What is Word2Vec?", config=config)

# Print the retrieved documents.
for doc in docs:
    print(doc.page_content)
    print("=========================================================")

Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 
========================================================= 
Definition: Crawl is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis. 
Example: Crawl is a Google search engine to visit a web site on the Internet to collect and index content. 
Associates: data collection, web scraping, search engine 

Word2Vec 
=========================================================
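
The idea behind per-call configuration can be sketched as "defaults plus overrides". The toy below is only an illustration of that idea and not LangChain's internals: `ToyRetrieverSettings` and `resolve_settings` are hypothetical names, and the toy assumes that a configured field replaces the default value as a whole rather than being merged into it.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class ToyRetrieverSettings:
    """Hypothetical stand-in for a retriever's configurable fields."""
    search_type: str = "similarity"
    search_kwargs: dict = field(default_factory=dict)


def resolve_settings(defaults, config=None):
    """Return the settings for one call: defaults overridden by config['configurable']."""
    overrides = (config or {}).get("configurable", {})
    allowed = {
        key: value
        for key, value in overrides.items()
        if key in ("search_type", "search_kwargs")
    }
    # replace() builds a new object, so the defaults stay untouched.
    return replace(defaults, **allowed)


defaults = ToyRetrieverSettings(search_kwargs={"k": 1})
mmr_config = {
    "configurable": {
        "search_type": "mmr",
        "search_kwargs": {"k": 2, "fetch_k": 10, "lambda_mult": 0.6},
    }
}

print(resolve_settings(defaults))              # default settings
print(resolve_settings(defaults, mmr_config))  # per-call override
```

Because the defaults are never mutated, one retriever object can serve many queries with different settings, which is the point of passing config alongside each query.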

Separate query and passage embedding models (e.g., Upstage embeddings)

By default, a retriever uses the same embedding model for queries and documents.

However, some embedding providers offer separate models for queries and for documents (passages).

In that case, the query is embedded with the query embedding model and the documents are embedded with the document embedding model.

from langchain_community.vectorstores import FAISS
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_upstage import UpstageEmbeddings

# Load the file using TextLoader.
loader = TextLoader("./data/appendix-keywords.txt")

# Load the document.
documents = loader.load()

# Create a CharacterTextSplitter that splits text by character.
# The chunk size is 300 and there is no overlap between chunks.
text_splitter = CharacterTextSplitter(chunk_size=300, chunk_overlap=0)

# Split the loaded document.
split_docs = text_splitter.split_documents(documents)

# Create the Upstage embeddings object, using the passage (document) model.
doc_embedder = UpstageEmbeddings(model="solar-embedding-1-large-passage")

# Create a FAISS vector database using segmented text and embeddings.
db = FAISS.from_documents(split_docs, doc_embedder)

Below is an example of creating an Upstage embedding for queries and converting query sentences to vectors to perform vector similarity searches.

# Create the Upstage embeddings object for queries, using the query model.
query_embedder = UpstageEmbeddings(model="solar-embedding-1-large-query")

# Convert the query sentence into a vector.
query_vector = query_embedder.embed_query("What is Embedding?")

# Perform a vector similarity search and return the two most similar documents.
db.similarity_search_by_vector(query_vector, k=2)

[Document (metadata={'source':'./data/appendix-keywords.txt'}, page_content=' Definition: Embedding is the process of converting text data such as words or sentences into a continuous vector of low dimensions. This allows the computer to understand and process the text.\n Example: Expresses the word "apple" in a vector such as [0.65, -0.23, 0.17].\n.Keyword: Natural language processing, vectorization, deep learning\n\nToken'), Document (metadata={'source':'app. This creates a vector based on the contextual similarity of the word.\n Example: In the Word2Vec model, "king" and "Queen" are represented by vectors in close positions with each other.\nangi-keyword: natural language processing, embedding, semantic similarity\nLLM (Large Language Model)')]
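
One way to keep the two models together is a thin wrapper that routes documents to the passage model and queries to the query model. The sketch below is an assumption-laden illustration: `QueryPassageEmbeddings` and `ToyEmbedder` are hypothetical classes, the toy embedder only tags its vectors so the routing is visible, and in a real project the wrapper would likely subclass `langchain_core.embeddings.Embeddings` and receive the two `UpstageEmbeddings` objects created above (not tested here).

```python
class QueryPassageEmbeddings:
    """Hypothetical wrapper: documents go to one embedder, queries to another."""

    def __init__(self, passage_embedder, query_embedder):
        self.passage_embedder = passage_embedder
        self.query_embedder = query_embedder

    def embed_documents(self, texts):
        # Documents (passages) are embedded with the passage model.
        return self.passage_embedder.embed_documents(texts)

    def embed_query(self, text):
        # Queries are embedded with the query model.
        return self.query_embedder.embed_query(text)


class ToyEmbedder:
    """Stand-in for a real model: [tag, text length], so routing is visible."""

    def __init__(self, tag):
        self.tag = tag

    def embed_documents(self, texts):
        return [[self.tag, float(len(t))] for t in texts]

    def embed_query(self, text):
        return [self.tag, float(len(text))]


# Tag 0.0 marks the passage model, tag 1.0 marks the query model.
embedder = QueryPassageEmbeddings(ToyEmbedder(0.0), ToyEmbedder(1.0))

print(embedder.embed_documents(["ab", "abc"]))  # [[0.0, 2.0], [0.0, 3.0]]
print(embedder.embed_query("abcd"))             # [1.0, 4.0]
```

The design choice is that callers only see one object with `embed_documents` and `embed_query`, so the query/passage split does not leak into the search code.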


Last updated 5 months ago