02. FAISS

Facebook AI Similarity Search (Faiss) is a library for efficient similarity search and clustering of dense vectors.

Faiss contains algorithms that search in sets of vectors of any size, including sets that may not fit in RAM.

It also includes supporting code for evaluation and parameter tuning.

References: the LangChain FAISS documentation and the FAISS documentation.

# Configuration file for managing API keys as environment variables
from dotenv import load_dotenv

# Load API key information
load_dotenv()

True

# Set up LangSmith tracing. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH10-VectorStores")

Start tracking LangSmith. 
[Project name] 
CH10-VectorStores

Load the sample dataset.

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter


# Text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=0)

# Load text files -> convert to List[Document]
loader1 = TextLoader("data/nlp-keywords.txt")
loader2 = TextLoader("data/finance-keywords.txt")

# Split document
split_doc1 = loader1.load_and_split(text_splitter)
split_doc2 = loader2.load_and_split(text_splitter)

# Check the number of documents
len(split_doc1), len(split_doc2)

 (11, 6)

Creating a VectorStore

Main initialization parameters

Indexing parameters
embedding_function (Embeddings): Embedding function to use

Client parameters
index (Any): FAISS index to use
docstore (Docstore): Document store to use
index_to_docstore_id (Dict[int, str]): Mapping from index position to document store ID

Reference

FAISS is a library for high-performance vector search and clustering.
This class integrates FAISS with LangChain's VectorStore interface.
You can build an efficient vector search system by combining an embedding function, a FAISS index, and a document store.

import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_openai import OpenAIEmbeddings

# Embeddings
embeddings = OpenAIEmbeddings()

# Compute the embedding dimension size
dimension_size = len(embeddings.embed_query("hello world"))
print(dimension_size)

# Create a FAISS vector store
db = FAISS(
    embedding_function=OpenAIEmbeddings(),
    index=faiss.IndexFlatL2(dimension_size),
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
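
To make the role of faiss.IndexFlatL2 more concrete, the following is a minimal pure-Python sketch of an exhaustive ("flat") L2 search: the query is compared against every stored vector and the nearest ones are returned. It only illustrates the idea and is not how Faiss is implemented; the names l2_squared and flat_l2_search are hypothetical.

```python
from typing import List, Tuple


def l2_squared(a: List[float], b: List[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(
    vectors: List[List[float]], query: List[float], k: int
) -> List[Tuple[int, float]]:
    """Compare the query with every stored vector (no approximation) and
    return the k closest as (position, squared distance), nearest first."""
    scored = [(pos, l2_squared(vec, query)) for pos, vec in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy 2-dimensional "embeddings"; real embeddings have dimension_size entries.
stored = [[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]]
nearest = flat_l2_search(stored, query=[0.9, 1.2], k=2)
print(nearest)  # positions of the two nearest vectors with their distances
```

The returned positions play the same role as the integer keys of index_to_docstore_id: they are what gets mapped back to document IDs.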

Creating a FAISS vector store (from_documents)

The from_documents class method creates a FAISS vector store from a list of documents and an embedding function.

Parameters

documents (List[Document]): List of documents to add to the vector store
embedding (Embeddings): Embedding function to use
**kwargs: Additional keyword arguments

How it works

Extracts the text content (page_content) and metadata from each document in the list.
Calls the from_texts method with the extracted texts and metadata.

Return value

VectorStore: Vector store instance initialized with the documents and embeddings

Reference

This method creates the vector store by calling from_texts internally.
Each document's page_content is used as the text, and its metadata is used as the metadata.
If additional configuration is needed, pass it through kwargs.
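
The unpacking step described above can be sketched in a few lines. This is an illustration under stated assumptions, not LangChain's source: Doc is a hypothetical stand-in for the Document class, and unpack_documents is a made-up helper name.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def unpack_documents(docs: List[Doc]) -> Tuple[List[str], List[Dict[str, str]]]:
    """Split documents into parallel lists of texts and metadata,
    which is the shape of input that from_texts expects."""
    texts = [doc.page_content for doc in docs]
    metadatas = [doc.metadata for doc in docs]
    return texts, metadatas


docs = [
    Doc("Embedding: converts text into vectors.", {"source": "data/nlp-keywords.txt"}),
    Doc("Token: a smaller unit of text.", {"source": "data/nlp-keywords.txt"}),
]
texts, metadatas = unpack_documents(docs)
print(texts)
print(metadatas)
```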

# Create the DB
db = FAISS.from_documents(documents=split_doc1, embedding=OpenAIEmbeddings())

# Check the document store IDs
db.index_to_docstore_id

{0:'e7a34419-3e8d-4c4c-aa4d-31d2c7d488e8', 1:'f4be9a4f-8361-400d-9f4c-fef29f67351f', 2:'42c10ad6e99b1710-

# Check the stored documents (ID: Document)
db.docstore._dict

{'e7a34419-3e8d-4c4c-aa4d-31d2c7d488e8': Document (metadata={'source':'data/nlp-keywords.txt'}, page_content=' Semantic SearchnEmbedding\n\n Definition: Embedding is the process of converting text data such as words or sentences into a low-dimensional continuous vector. This allows the computer to understand and process the text.\n Example: Expresses the word "apple" in a vector such as [0.65, -0.23, 0.17].\nAssociation keyword: natural language processing, vectorization, deep learning\n\nToken\n\n Definition: Tokens mean splitting text into smaller units. This can usually be a word, sentence, or verse.\n Example: Split the sentence "I go to school" into "I", "To school", "Goes".\nAssociation keyword: tokenization, natural language processing, parsing\n\nTokenizer'),
...
(omitted)
...
'68364bed-e221-422e-ab05-3b6470444dbc': Document (metadata={'source':'data/nlp-keywords.txt'}, page_content=' definition: Multimodal is a different type of data mode (eg It is used to extract or predict richer and more accurate information through interactions between different types of data.\n Example: A system that analyzes images and descriptive text together to perform more accurate image classification is an example of multimodal technology. \Non-guide keyword: data fusion, artificial, deep learning')

Creating a FAISS vector store (from_texts)

The from_texts class method creates a FAISS vector store from a list of texts and an embedding function.

Parameters

texts (List[str]): List of texts to add to the vector store
embedding (Embeddings): Embedding function to use
metadatas (Optional[List[dict]]): List of metadata. Default is None
ids (Optional[List[str]]): List of document IDs. Default is None
**kwargs: Additional keyword arguments

How it works

Embeds the texts using the provided embedding function.
Calls the __from method with the embedded vectors to create a FAISS instance.

Return value

FAISS: The created FAISS vector store instance

Reference

This method is a user-friendly interface that handles document embedding, in-memory document store creation, and FAISS index initialization in one step.
It is a convenient way to get started quickly.

Caution

When processing large amounts of text, pay attention to memory usage.
To use metadata or IDs, provide them as lists of the same length as the text list.
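
The same-length rule above can be made explicit with a small sketch. It uses plain dictionaries in place of the real document store and leaves out embedding entirely; build_store is a hypothetical helper, and raising ValueError on a length mismatch is an assumption of this sketch, not a statement about LangChain's behavior.

```python
import uuid
from typing import Dict, List, Optional, Tuple


def build_store(
    texts: List[str],
    metadatas: Optional[List[dict]] = None,
    ids: Optional[List[str]] = None,
) -> Tuple[Dict[str, dict], Dict[int, str]]:
    """Return (docstore, index_to_docstore_id) for the given texts."""
    if metadatas is not None and len(metadatas) != len(texts):
        raise ValueError("metadatas must have the same length as texts")
    if ids is not None and len(ids) != len(texts):
        raise ValueError("ids must have the same length as texts")

    # Fall back to random UUIDs and empty metadata when none are given.
    ids = ids if ids is not None else [str(uuid.uuid4()) for _ in texts]
    metadatas = metadatas if metadatas is not None else [{} for _ in texts]

    docstore = {
        doc_id: {"page_content": text, "metadata": meta}
        for doc_id, text, meta in zip(ids, texts, metadatas)
    }
    index_to_docstore_id = dict(enumerate(ids))
    return docstore, index_to_docstore_id


docstore, index_to_id = build_store(
    ["Hello, it's really nice to meet you.", "My name is Teddy."],
    metadatas=[{"source": "Text document"}, {"source": "Text document"}],
    ids=["doc1", "doc2"],
)
print(index_to_id)
```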

# Create from a list of strings
db2 = FAISS.from_texts(
    ["Hello, it's really nice to meet you.", "My name is Teddy."],
    embedding=OpenAIEmbeddings(),
    metadatas=[{"source": "Text document"}, {"source": "Text document"}],
    ids=["doc1", "doc2"],
)

Check the stored result and verify that the specified id values were applied.

# Check the stored contents
db2.docstore._dict

{'doc1': Document(metadata={'source': 'Text document'}, page_content="Hello, it's really nice to meet you."), 'doc2': Document(metadata={'source': 'Text document'}, page_content='My name is Teddy.')}

Similarity Search

The similarity_search method searches for the documents most similar to a given query.

Parameters

query (str): Query text used to find similar documents
k (int): Number of documents to return. Default is 4
filter (Optional[Union[Callable, Dict[str, Any]]]): Metadata filter function or dictionary. Default is None
fetch_k (int): Number of documents to fetch before filtering. Default is 20
**kwargs: Additional keyword arguments

Return value

List[Document]: List of documents most similar to the query

How it works

Calls similarity_search_with_score internally to retrieve documents together with their similarity scores.
Drops the scores from the search results and returns only the documents.

Main features

The filter parameter enables metadata-based filtering.
fetch_k sets how many documents are fetched before filtering, so that enough documents remain after filtering to return k results.

Considerations

Search quality depends heavily on the quality of the embedding model used.
On large datasets, it is important to balance search speed and accuracy by tuning k and fetch_k.
If complex filtering is required, pass a custom function to the filter parameter for fine-grained control.

Optimization tips

For frequently used queries, caching the results can speed up repeated searches.
Setting fetch_k too high can slow down the search, so experiment to find an appropriate value.
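
The interplay of k, fetch_k, and filter can be sketched without a real index. The code below assumes the candidates are already ranked by distance and that the filter is applied only to the first fetch_k of them, as described above; filtered_search and make_doc are hypothetical names, and documents are plain dictionaries rather than Document objects.

```python
from typing import Any, Dict, List, Optional, Tuple

Doc = Dict[str, Any]  # stand-in: {"page_content": ..., "metadata": {...}}


def filtered_search(
    ranked: List[Tuple[Doc, float]],
    k: int = 4,
    fetch_k: int = 20,
    metadata_filter: Optional[Dict[str, Any]] = None,
) -> List[Doc]:
    """`ranked` holds (document, distance) pairs sorted nearest first."""
    # With a filter, take fetch_k candidates first; without one, k is enough.
    candidates = ranked[: fetch_k if metadata_filter is not None else k]
    if metadata_filter is not None:
        candidates = [
            (doc, dist)
            for doc, dist in candidates
            if all(doc["metadata"].get(key) == val for key, val in metadata_filter.items())
        ]
    return [doc for doc, _ in candidates[:k]]


def make_doc(text: str, source: str) -> Doc:
    return {"page_content": text, "metadata": {"source": source}}


ranked = [
    (make_doc("a", "nlp.txt"), 0.1),
    (make_doc("b", "nlp.txt"), 0.2),
    (make_doc("c", "finance.txt"), 0.3),
    (make_doc("d", "finance.txt"), 0.4),
]

# fetch_k=2 only sees the two nlp.txt documents, so nothing survives the filter.
too_few = filtered_search(ranked, k=2, fetch_k=2, metadata_filter={"source": "finance.txt"})
# A larger fetch_k reaches the finance.txt documents.
enough = filtered_search(ranked, k=2, fetch_k=4, metadata_filter={"source": "finance.txt"})
print(len(too_few), [doc["page_content"] for doc in enough])
```

In this sketch, a filtered search can return fewer than k documents (or none) when fetch_k is too small for the filter to find matches among the nearest candidates.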

# Similarity search
db.similarity_search("Tell me about TF IDF")

[Document (metadata={'source':'data/nlp-keywords.txt'}, page_content=' Definition: TF-IDF is a statistical measure used to evaluate the importance of words within a document. This takes into account the frequency of words in a document and the scarcity of those words in the entire set of documents.\n Example: Words that do not appear frequently in many documents have high TF-IDF values.\nAssociation Keywords: natural language processing, information retrieval, data mining\n\nDeep Learning\n\n Definition: Deep learning is an area of machine learning that solves complex problems using the artificial neural network. This focuses on learning high-level expressions from data.\n Example: Dip-learning models are utilized in image recognition, speech recognition, natural language processing, etc.\nAssociation keyword: artificial neural network, machine learning, data analysis\n\nSchema\n\n Definition: Schema is a database or file Defines the structure, it provides a blueprint of how data is stored and organized.\n Example: Table schema in relational database  
... 
(omitted)
... 
Document (metadata={'source':'data/nlp-keywords.txt'}, page_content=' Definition: CSV (Comma-Separated Values) is a file format that stores data, each data value is comma Is separated by Used to simply store and exchange data in tabular form.\n Example: CSV files with headers called name, age, job may contain data such as Hong Gil-dong, 30, developer.\NAssociation keyword: data format, file processing, data exchange\n\nJSON\n\n Definition: JSON (JavaScript Object Notation) is a lightweight data exchange format, using readable text for both people and machines}is data in JSON format.\nAssociation: Data exchange, web development, API\n\nTransformer\n\n Definition: Transformers are a type of deep learning model used in natural language processing, mainly used for translation, summary, text generation, etc. This is based on the Attention mechanism.\n Example: Google Translator uses a transformer model to perform translations between different languages.\nAssociation Keywords: deep learning, natural language processing, Attention\n\nHuggingFace')]

You can specify the number of search results with the k value.

# Specify the k value
db.similarity_search("Tell me about TF IDF", k=2)

[Document (metadata={'source':'data/nlp-keywords.txt'}, page_content=' Definition: TF-IDF is a statistical measure used to evaluate the importance of words within a document. This takes into account the frequency of words in a document and the scarcity of those words in the entire set of documents.\n Example: Words that do not appear frequently in many documents have high TF-IDF values.\nAssociation Keywords: natural language processing, information retrieval, data mining\n\nDeep Learning\n\n Definition: Deep learning is an area of machine learning that solves complex problems using the artificial neural network. This focuses on learning high-level expressions from data.\n Example: Dip-learning models are utilized in image recognition, speech recognition, natural language processing, etc.\nAssociation keyword: artificial neural network, machine learning, data analysis\n\nSchema\n\n Definition: Schema is a database or file Defines the structure, it provides a blueprint of how data is stored and organized.\n Example: Table schema in relational database Document (metadata={'source':'data/nlp-keywords.txt'}, page_content=' Definition: Open source means software that has source code released and can be freely used, modified, and distributed by anyone. This plays an important role in promoting collaboration and innovation.\n Example: The Linux operating system is a representative open source project.  \nAssociation keyword: software development, community, technical collaboration\n\nStructured Data\n\n Definition: Structured data is data organized according to a defined format or schema. This can be easily retrieved and analyzed from databases, spreadsheets, etc..\n Example: A customer information table stored in a relational database is an example of structured data.\nAssociation: Database, data analysis, data modeling\n\nParser\n\n Definition: Parser is given data (String, file, etc.) is a tool to analyze and convert it into a structured form. 
It is used for parsing of programming languages or processing file data.\n Example: Parsing HTML documents to create a DOM structure for a web page is an example of parsing.\nAssociation: parsing, compiler, data processing\n\nTF-IDF (Term Frequency-Inverse Document Frequency)')]

You can filter on metadata by using the filter parameter.

# Use filter
db.similarity_search(
    "Tell me about TF IDF", filter={"source": "data/nlp-keywords.txt"}, k=2
)

[]

Adding from documents (add_documents)

The add_documents method adds documents to the vector store or updates existing ones.

Parameters

documents (List[Document]): List of documents to add to the vector store
**kwargs: Additional keyword arguments

Return value

List[str]: List of IDs of the added texts

How it works

Extracts the text content and metadata from the documents.
Calls the add_texts method to perform the actual addition.

Main features

Document objects can be handled directly, which is convenient.
ID handling logic is included to keep documents unique.
It is built on top of add_texts, which improves code reuse.

from langchain_core.documents import Document

# Specify page_content and metadata
db.add_documents(
    [
        Document(
            page_content="Hello! This time I will add a new document.",
            metadata={"source": "mydata.txt"},
        )
    ],
    ids=["new_doc1"],
)

['new_doc1']

# Check the added data
db.similarity_search("hello", k=1)

[Document(metadata={'source': 'mydata.txt'}, page_content='Hello! This time I will add a new document.')]

Adding from texts (add_texts)

The add_texts method embeds texts and adds them to the vector store.

Parameters

texts (Iterable[str]): Texts to add to the vector store
metadatas (Optional[List[dict]]): List of metadata associated with the texts (optional)
ids (Optional[List[str]]): List of unique identifiers for the texts (optional)
**kwargs: Additional keyword arguments

Return value

List[str]: List of IDs of the texts added to the vector store

How it works

Converts the input text iterable to a list.
Embeds the texts using the _embed_documents method.
Calls the __add method to add the embedded texts to the vector store.

# Add new data
db.add_texts(
    ["This time, we add text data.", "This is the second text data we added."],
    metadatas=[{"source": "mydata.txt"}, {"source": "mydata.txt"}],
    ids=["new_doc2", "new_doc3"],
)

['new_doc2','new_doc3']

# Check the added data
db.index_to_docstore_id

{0:'e7a34419-3e8d-4c4c-aa4d-31d2c7d488e8', 1:'f4be9a4f-8361-400d-9f4c-fef29f67351f', 2:'42c10ad6e99b1710-0b7b-4d6d-9932-7267501afa47', 7:'2ef86c91-1e36-4037-ab64-2845acd67080', 8:'63b2f1-3ecb-4b99-82f

Delete Documents

The delete method deletes the documents with the specified IDs from the vector store.

Parameters

ids (Optional[List[str]]): List of IDs of the documents to delete
**kwargs: Additional keyword arguments (not used by this method)

Return value

Optional[bool]: True if the deletion succeeded, False if it failed, None if not implemented

How it works

Validates the given IDs.
Finds the index positions corresponding to the IDs to delete.
Removes those entries from the FAISS index.
Deletes the documents with those IDs from the document store.
Updates the index-to-ID mapping.

Main features

ID-based deletion enables precise document management.
Deletion is applied to both the FAISS index and the document store.
The index is renumbered after deletion to keep the data consistent.

Caution

Deletion is irreversible, so perform it carefully.
Concurrency control is not implemented, so take care in multi-threaded environments.
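
The delete steps above, including the renumbering of index positions, can be sketched with plain Python containers. This is a simplified model rather than the library's implementation: the vector index is a list, the document store is a dict, and delete_by_ids is a hypothetical name.

```python
from typing import Dict, List, Tuple


def delete_by_ids(
    vectors: List[List[float]],
    docstore: Dict[str, str],
    index_to_docstore_id: Dict[int, str],
    ids: List[str],
) -> Tuple[List[List[float]], Dict[int, str]]:
    """Remove the given IDs and return the new (vectors, mapping)."""
    to_delete = set(ids)

    # 1. Validate: every ID must exist in the mapping.
    missing = to_delete - set(index_to_docstore_id.values())
    if missing:
        raise ValueError(f"IDs not found: {sorted(missing)}")

    # 2. Find the positions that survive the deletion.
    kept = [
        pos for pos in sorted(index_to_docstore_id)
        if index_to_docstore_id[pos] not in to_delete
    ]

    # 3. Drop the vectors and 4. the documents.
    new_vectors = [vectors[pos] for pos in kept]
    for doc_id in to_delete:
        del docstore[doc_id]

    # 5. Renumber so positions stay contiguous (0, 1, 2, ...).
    new_mapping = {new: index_to_docstore_id[old] for new, old in enumerate(kept)}
    return new_vectors, new_mapping


vectors = [[0.0], [1.0], [2.0]]
docstore = {"a": "first", "b": "second", "c": "third"}
mapping = {0: "a", 1: "b", 2: "c"}

vectors, mapping = delete_by_ids(vectors, docstore, mapping, ["b"])
print(mapping)
```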

# Add data for deletion
ids = db.add_texts(
    ["Add data for deletion.", "This is the second data for deletion."],
    metadatas=[{"source": "mydata.txt"}, {"source": "mydata.txt"}],
    ids=["delete_doc1", "delete_doc2"],
)

# Check the ID to be deleted
print(ids)

['delete_doc1','delete_doc2']

You can delete documents by passing their ids to delete.

# Delete by ids
db.delete(ids)

True

# Print the result after deletion
db.index_to_docstore_id

{0:'e7a34419-3e8d-4c4c-aa4d-31d2c7d488e8', 1:'f4be9a4f-8361-400d-9f4c-fef29f67351f', 2:'42c10ad6e99b1710-0b7b-4d6d-9932-7267501afa47', 7:'2ef86c91-1e36-4037-ab64-2845acd67080', 8:'63b2f1-3ecb-4b99-82f

Save and load

Save to local (Save Local)

The save_local method saves the FAISS index, the document store, and the index-to-document-ID mapping to local disk.

Parameters

folder_path (str): Folder path to save to
index_name (str): Name of the index file to save (default: "index")

How it works

Creates the specified folder path (ignored if it already exists).
Saves the FAISS index as a separate file.
Saves the document store and the index-to-document-ID mapping in pickle format.

Considerations

You need write permission for the target path.
For large data, saving can take considerable storage space and time.
Consider the security risks of using pickle.

# Save to local disk
db.save_local(folder_path="faiss_db", index_name="faiss_index")

Load from local (Load Local)

The load_local class method loads a FAISS index, document store, and index-to-document-ID mapping saved on local disk.

Parameters

folder_path (str): Folder path where the files to load are stored
embeddings (Embeddings): Embedding object used to embed queries
index_name (str): Name of the index file to load (default: "index")
allow_dangerous_deserialization (bool): Whether to allow deserialization of pickle files (default: False)

Return value

FAISS: The loaded FAISS object

How it works

Checks the deserialization risk and requires explicit permission from the user.
Loads the FAISS index from its separate file.
Uses pickle to load the document store and the index-to-document-ID mapping.
Creates and returns a FAISS object from the loaded data.
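
The explicit opt-in before unpickling can be illustrated with the standard library alone. The sketch below covers only the pickle part (document store and ID mapping), not the FAISS index file; save_metadata and load_metadata are hypothetical helpers, and the "<index_name>.pkl" file name is an assumption made for this example.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Dict, Tuple


def save_metadata(
    folder_path: str,
    docstore: Dict[str, str],
    index_to_docstore_id: Dict[int, str],
    index_name: str = "index",
) -> None:
    """Create the folder if needed and pickle the store and the mapping."""
    folder = Path(folder_path)
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / f"{index_name}.pkl", "wb") as f:
        pickle.dump((docstore, index_to_docstore_id), f)


def load_metadata(
    folder_path: str,
    index_name: str = "index",
    allow_dangerous_deserialization: bool = False,
) -> Tuple[Dict[str, str], Dict[int, str]]:
    """Refuse to unpickle unless the caller explicitly opts in."""
    if not allow_dangerous_deserialization:
        raise ValueError(
            "Unpickling can execute arbitrary code. Only load files you trust, "
            "then pass allow_dangerous_deserialization=True."
        )
    with open(Path(folder_path) / f"{index_name}.pkl", "rb") as f:
        return pickle.load(f)


with tempfile.TemporaryDirectory() as tmp:
    save_metadata(tmp, {"doc1": "hello"}, {0: "doc1"}, index_name="faiss_index")
    restored_docstore, restored_mapping = load_metadata(
        tmp, index_name="faiss_index", allow_dangerous_deserialization=True
    )

print(restored_docstore, restored_mapping)
```

The guard matters because a pickle file from an untrusted source can run code during loading, which is why the flag defaults to False.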

# Load saved data
loaded_db = FAISS.load_local(
    folder_path="faiss_db",
    index_name="faiss_index",
    embeddings=embeddings,
    allow_dangerous_deserialization=True,
)

# Check the loaded data
loaded_db.index_to_docstore_id

{0:'e7a34419-3e8d-4c4c-aa4d-31d2c7d488e8', 1:'f4be9a4f-8361-400d-9f4c-fef29f67351f', 2:'42c10ad6e99b1710-0b7b-4d6d-9932-7267501afa47', 7:'2ef86c91-1e36-4037-ab64-2845acd67080', 8:'63b2f1-3ecb-4b99-82f

Merging FAISS objects (Merge From)

The merge_from method merges another FAISS object into the current FAISS object.

Parameters

target (FAISS): The FAISS object to merge into the current object

How it works

Checks whether the document stores can be merged.
Assigns index numbers to the new documents based on the length of the existing index.
Merges the FAISS indexes.
Extracts the documents and ID information from the target FAISS object.
Adds the extracted information to the current document store and index-to-document-ID mapping.

Main features

Merges the indexes, document stores, and index-to-document-ID mappings of both FAISS objects.
Keeps index numbers contiguous while merging.
Checks in advance whether the document stores can be merged.

Caution

The target FAISS object and the current object must have compatible structures.
Handle duplicate IDs with care. The current implementation does not check for duplicates.
If an exception occurs during the merge, the objects may be left partially merged.
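
The index-offset step is the core of the merge and can be sketched with plain containers. This is a simplified model, not the library's code: vectors are lists, document stores are dicts, and merge_stores is a hypothetical name. As in the description above, the sketch performs no duplicate-ID check, so a duplicate ID would silently overwrite the existing document.

```python
from typing import Dict, List


def merge_stores(
    vectors: List[List[float]],
    docstore: Dict[str, str],
    mapping: Dict[int, str],
    other_vectors: List[List[float]],
    other_docstore: Dict[str, str],
    other_mapping: Dict[int, str],
) -> None:
    """Merge the `other_*` store into the first one, in place."""
    # New documents continue numbering where the current index ends.
    offset = len(mapping)

    # Merge the vector index.
    vectors.extend(other_vectors)

    # Copy documents and shift their index positions by the offset.
    for pos in sorted(other_mapping):
        doc_id = other_mapping[pos]
        docstore[doc_id] = other_docstore[doc_id]
        mapping[offset + pos] = doc_id


vectors = [[0.0], [1.0]]
docstore = {"a": "first", "b": "second"}
mapping = {0: "a", 1: "b"}

merge_stores(
    vectors, docstore, mapping,
    other_vectors=[[5.0], [6.0]],
    other_docstore={"x": "from other 1", "y": "from other 2"},
    other_mapping={0: "x", 1: "y"},
)
print(mapping)
```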

# Load saved data
db = FAISS.load_local(
    folder_path="faiss_db",
    index_name="faiss_index",
    embeddings=embeddings,
    allow_dangerous_deserialization=True,
)

# Create a new FAISS vector store
db2 = FAISS.from_documents(documents=split_doc2, embedding=OpenAIEmbeddings())

# Check data in db
db.index_to_docstore_id

{0:'e7a34419-3e8d-4c4c-aa4d-31d2c7d488e8', 1:'f4be9a4f-8361-400d-9f4c-fef29f67351f', 2:'42c10ad6e99b1710-0b7b-4d6d-9932-7267501afa47', 7:'2ef86c91-1e36-4037-ab64-2845acd67080', 8:'63b2f1-3ecb-4b99-82f

# Check data in db2
db2.index_to_docstore_id

{0: '832fe184-5aec-469e-9cf0-1df951cafb66', 1: '734c91d1-001e-4165-87bb-bab8149d7cf9', 2:'c786c99a-3

Use merge_from to merge the two DBs.

# merge db + db2
db.merge_from(db2)

# Check merged data
db.index_to_docstore_id

{0:'e7a34419-3e8d-4c4c-aa4d-31d2c7d488e8', 1:'f4be9a4f-8361-400d-9f4c-fef29f67351f', 2:'42c10ad6e99b1710-0b7b-4d6d-9932-7267501afa47', 7:'2ef86c91-1e36-4037-ab64-2845acd67080', 8:'63b2f734c91d1-001e-4165-87bb-bab8149d7cf9', 16:'c786c99a-8630-4ef4-ad40-4b1b83446243', 17:'b4f8aaaa-ad09-4282-  67f3d25f-8836-4371-ad83-ffdb1dd84e06', 19:'d62a6812-716f-400e-84dd-0529b7c90e01'}

Converting to a retriever (as_retriever)

The as_retriever method creates a VectorStoreRetriever object based on the current vector store.

Parameters

**kwargs: Keyword arguments to pass to the search function
search_type (Optional[str]): Search type ("similarity", "mmr", "similarity_score_threshold")
search_kwargs (Optional[Dict]): Additional keyword arguments to pass to the search function

Return value

VectorStoreRetriever: Retriever object based on the vector store

Main features

Support for various search types

"similarity": Similarity-based search (default)
"mmr": Maximal Marginal Relevance search
"similarity_score_threshold": Threshold-based similarity search

Customizing search parameters

k: Number of documents to return
score_threshold: Similarity score threshold
fetch_k: Number of documents to pass to the MMR algorithm
lambda_mult: MMR diversity control parameter
filter: Document metadata based filtering

Considerations

You can tune the quality and diversity of the search results by choosing the search type and parameters appropriately.
On large datasets, you can balance performance and accuracy by adjusting fetch_k and k.
You can use filtering to search only documents that meet certain conditions.

Optimization tips

For MMR search, raise fetch_k and adjust lambda_mult to balance diversity and relevance.
Use threshold-based search to return only highly relevant documents.

Caution

Improper parameter settings can degrade search performance or result quality.
On large datasets, a high k value can increase search time.

The examples that follow start with a similarity search that retrieves four documents, the default.

# Create a new FAISS vector store
db = FAISS.from_documents(
    documents=split_doc1 + split_doc2, embedding=OpenAIEmbeddings()
)

The default retriever returns 4 documents.

# Convert to a retriever
retriever = db.as_retriever()
# Perform a search
retriever.invoke("Tell me about Word2Vec")

[Document (metadata={'source':'data/nlp-keywords.txt'}, page_content=' Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It creates vectors based on the contextual similarity of words.\n Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in close positions to each other.\nAssociation keyword: natural language processing, embedding, semantic similarity\nLLM (Large Language Model)\n\n Definition: LLM refers to a large language model trained with large text data. These models are used for various natural language understanding and creation tasks.\n Example: OpenAI's GPT series is a representative large language model.\nAssociation Keyword: Natural Language Processing, Diplearning, Text Generation\n\nFAISS (Facebook AI Similarity Search)\n\n Definition: FAISS is a high-speed similarity search library developed by Facebook, especially when effectively retrieving analog vectors from large vectors. FAISS can be used to quickly find similar images out of millions of image vectors.\NAssociation Keywords: vector search, machine learning, database optimization\n\nOpen Source'),  
... 
(omitted)
... 
Document (metadata={'source':'data/nlp-keywords.txt'}, page_content=' Definition: TF-IDF is a statistical measure used to evaluate the importance of words within a document. This takes into account the frequency of words in a document and the scarcity of those words in the entire set of documents.\n Example: Words that do not appear frequently in many documents have high TF-IDF values.\nAssociation Keywords: natural language processing, information retrieval, data mining\n\nDeep Learning\n\n Definition: Deep learning is an area of machine learning that solves complex problems using the artificial neural network. This focuses on learning high-level expressions from data.\n Example: Dip-learning models are utilized in image recognition, speech recognition, natural language processing, etc.\nAssociation keyword: artificial neural network, machine learning, data analysis\n\nSchema\n\n Definition: Schema is a database or file Defines the structure, it provides a blueprint of how data is stored and organized.\n Example: Table schema in relational database

Retrieving more documents with higher diversity

k: Number of documents to return (default: 4)
fetch_k: Number of documents to pass to the MMR algorithm (default: 20)
lambda_mult: Controls the diversity of MMR results (0 to 1, where 0 is maximum diversity and 1 is minimum diversity; default: 0.5)
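
To show how these three parameters interact, here is a minimal pure-Python sketch of Maximal Marginal Relevance. It assumes cosine similarity and that `candidates` already holds the fetch_k vectors nearest to the query; the function names cosine and mmr are hypothetical, and the library's own implementation may differ in details.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(
    query: List[float],
    candidates: List[List[float]],
    k: int = 4,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick k candidate positions, trading relevance to the query
    against similarity to the candidates that were already picked."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    def score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [1.0, 0.10],  # 0: very relevant
    [1.0, 0.12],  # 1: near-duplicate of 0
    [0.6, 0.80],  # 2: less relevant but different
]

relevance_only = mmr(query, candidates, k=2, lambda_mult=1.0)
diverse = mmr(query, candidates, k=2, lambda_mult=0.25)
print(relevance_only, diverse)
```

With lambda_mult=1.0 the near-duplicate is picked second, while a low lambda_mult prefers the different candidate. Raising fetch_k simply gives this selection step a larger pool to choose from.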

# Perform MMR search
retriever = db.as_retriever(
    search_type="mmr", search_kwargs={"k": 6, "lambda_mult": 0.25, "fetch_k": 10}
)
retriever.invoke("Tell me about Word2Vec")

[Document (metadata={'source':'data/nlp-keywords.txt'}, page_content=' Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It creates vectors based on the contextual similarity of words.\n Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in close positions to each other.\nAssociation keyword: natural language processing, embedding, semantic similarity\nLLM (Large Language Model)\n\n Definition: LLM refers to a large language model trained with large text data. These models are used for various natural language understanding and creation tasks.\n Example: OpenAI's GPT series is a representative large language model.\nAssociation Keyword: Natural Language Processing, Diplearning, Text Generation\n\nFAISS (Facebook AI Similarity Search)\n\n Definition: FAISS is a high-speed similarity search library developed by Facebook, especially when effectively retrieving analog vectors from large vectors. FAISS can be used to quickly find similar images out of millions of image vectors.\NAssociation Keywords: vector search, machine learning, database optimization\n\nOpen Source'),  
... 
(omitted)
... 
Document (metadata={'source':'data/nlp-keywords.txt'}, page_content=' Definition: Keyword search is the process of finding information based on keywords entered by the user. This is used as a basic search method in most search engines and database systems.\n Example: When a user searches for "Coffee Shop Seoul", it returns a list of related coffee shops.\nAssociation Keywords: Search Engine, Data Search, Information Search\n\nPage Rank\n\n Definition: The page rank is an algorithm that evaluates the importance of a web page, mainly used to rank search engine results. This analyzes and evaluates the link structure between web pages.\n Example: Google search engines use page rank algorithms to rank search results.\nAssociation: Search engine optimization, web analytics, link analysis\n\ndata mining\n\n Definition: Data mining is the process of discovering useful information from large amounts of data. It utilizes technologies such as statistics, machine learning, pattern recognition, etc..\n Example: It is an example of data mining that retailers analyze customer purchase data to develop a sales strategy.\N Associated Keyword: Big Data, Pattern Recognition, Predictive Analysis\n\n Multimodal')]

Fetch more documents for the MMR algorithm, but return only the top 2

# Perform MMR search, return only the top 2
retriever = db.as_retriever(search_type="mmr", search_kwargs={"k": 2, "fetch_k": 10})
retriever.invoke("Tell me about Word2Vec")

[Document (metadata={'source':'data/nlp-keywords.txt'}, page_content=' Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. This creates a vector based on the contextual similarity of the word.\n Example: In the Word2Vec model, "king" and "kingdom" are represented by vectors in close positions to each other.\nAssociation keyword: natural language processing, embedding, semantic similarity\nLLM (Large Language Model)\n\n Definition: LLM refers to a large language model trained with large text data. These models are used for various natural language understanding and creation tasks.\n Example: OpenAI's GPT series is a representative large language model.\nAssociation Keyword: Natural Language Processing, Diplearning, Text Generation\n\nFAISS (Facebook AI Similarity Search)\n\n Definition: FAISS is a high-speed similarity search library developed by Facebook, especially when effectively searching for similar vectors in large vectors. FAISS can be used to quickly find similar images among millions of image vectors.\nAssociate: Vector Search, Machine Learning, Database Optimization\n\nOpen Source'), Document (metadata={'source':'data/nlp-keywords.txt'}, page_content='GPT  GPT is a proactive language model pre-trained with a large dataset, utilized for a variety of text-based tasks. This can generate a natural language based on the text entered.\n Example: A chatbot that generates detailed answers to questions provided by the user can use the GPT model.\nAssociation Keywords: natural language processing, text generation, deepening\n\nInstructGPT\n\n Definition: InstructGPT is a GPT model optimized to perform specific tasks according to the user's instructions. 
This model is designed to produce more accurate and relevant results.\n Example: If a user provides specific instructions such as "draft email", InstructGPT will create an email based on the relevant content.\NAssociation keyword: Artificial, natural language understanding, Command-based processing\n\nKeyword Search')]

Retrieve only documents whose similarity is above a certain threshold

# Perform a threshold-based search
retriever = db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.8}
)

retriever.invoke("Tell me about Word2Vec")

[Document (metadata={'source':'data/nlp-keywords.txt'}, page_content=' Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It creates vectors based on the contextual similarity of words.\n Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in close positions to each other.\nAssociation keyword: natural language processing, embedding, semantic similarity\nLLM (Large Language Model)\n\n Definition: LLM refers to a large language model trained with large text data. These models are used for various natural language understanding and creation tasks.\n Example: OpenAI's GPT series is a representative large language model.\nAssociation Keyword: Natural Language Processing, Diplearning, Text Generation\n\nFAISS (Facebook AI Similarity Search)\n\n Definition: FAISS is a high-speed similarity search library developed by Facebook, especially when effectively retrieving analog vectors from large vectors. FAISS can be used to quickly find similar images out of millions of image vectors.\NAssociation Keywords: vector search, machine learning, database optimization\n\nOpen Source')]
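
The idea behind threshold-based retrieval can be sketched as a simple filter over scored results. The sketch assumes relevance scores between 0 and 1 where higher means more similar; the scores below are made-up illustrative values, and filter_by_score is a hypothetical name.

```python
from typing import List, Tuple


def filter_by_score(
    scored: List[Tuple[str, float]], score_threshold: float
) -> List[str]:
    """Keep only the documents whose relevance score reaches the threshold."""
    return [doc for doc, score in scored if score >= score_threshold]


# (document, relevance score) pairs; the scores are invented for illustration.
scored = [
    ("doc about Word2Vec", 0.86),
    ("doc about TF-IDF", 0.74),
    ("doc about ESG", 0.41),
]

hits = filter_by_score(scored, score_threshold=0.8)
print(hits)
```

Unlike a fixed k, a threshold can return any number of documents, including none, depending on how well the store matches the query.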

Search only the single most similar document

# Set k=1 to retrieve only the single most similar document
retriever = db.as_retriever(search_kwargs={"k": 1})

retriever.invoke("Tell me about Word2Vec")

[Document (metadata={'source':'data/nlp-keywords.txt'}, page_content=' Definition: Word2Vec is a natural language processing technique that maps words to a vector space to represent semantic relationships between words. It creates vectors based on the contextual similarity of words.\n Example: In the Word2Vec model, "king" and "queen" are represented by vectors positioned close to each other.\nAssociation Keywords: natural language processing, embedding, semantic similarity\nLLM (Large Language Model)\n\n Definition: LLM refers to a large language model trained on large-scale text data. These models are used for various natural language understanding and generation tasks.\n Example: OpenAI's GPT series is a representative large language model.\nAssociation Keywords: natural language processing, deep learning, text generation\n\nFAISS (Facebook AI Similarity Search)\n\n Definition: FAISS is a high-speed similarity search library developed by Facebook, designed to search effectively for similar vectors in large sets of vectors. FAISS can be used to quickly find similar images among millions of image vectors.\nAssociation Keywords: vector search, machine learning, database optimization\n\nOpen Source')]
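
The effect of k can be sketched as a plain top-k selection over already-scored candidates. The snippet below is a simplified illustration using only the standard library. The (text, score) pairs and the helper name top_k are hypothetical, and a real FAISS index performs this selection inside the index rather than over a Python list.

```python
import heapq


def top_k(scored_docs, k):
    """Return the k highest-scoring (text, score) pairs, best first."""
    return heapq.nlargest(k, scored_docs, key=lambda pair: pair[1])


# Hypothetical (text, score) pairs, as if already scored against a query.
scored = [("FAISS", 0.62), ("Word2Vec", 0.91), ("LLM", 0.70)]

best_one = top_k(scored, k=1)
best_two = top_k(scored, k=2)
print(best_one)
print(best_two)
```

When no k is passed in search_kwargs, the retriever falls back to the vector store's default (4 in LangChain's similarity search, to the best of my knowledge), so setting k=1 is what narrows the result to a single document.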

Apply specific metadata filters

# Restrict the search to documents whose metadata matches the filter, returning at most 2
retriever = db.as_retriever(
    search_kwargs={"filter": {"source": "data/finance-keywords.txt"}, "k": 2}
)
retriever.invoke("Tell me about ESG")

[Document (metadata={'source':'data/finance-keywords.txt'}, page_content=' Definition: ESG is an investment approach that takes into account the environmental, social and governance aspects of the enterprise.\n Example: The S&P 500 ESG index is an index consisting of companies with excellent ESG performance\nP 500 companies have the largest purchase of their own shares.\nAssociation Keywords: shareholder value, capital management, stock price stimulus\n\nCyclical Stocks\n\n Definition: Cyclical stocks are the shares of companies whose performance varies greatly depending on the economic situation.\n Example: Auto companies such as Ford and General Motors are representative cyclical stocks included in the S&P 500. Defensive stocks are stocks of companies with stable performance regardless of economic fluctuations.\n Example: Consumer staples companies such as Procter & Gamble and Johnson & Johnson are representative defensive stocks within the S&P 500.\nAssociation Keywords: stable return, low volatility, risk management'), Document (metadata={'source':'data/finance-key  It's an activity that analyzes competitiveness, etc. to help you make investment decisions.\n Example: Goldman Sachs analysts have announced quarterly earnings prospects for S&P 500 companies.\nAssociation Keywords: investment analysis, corporate valuation, market outlook\n\nCorporate Governance\n\n Definition: Corporate governance means the systems and processes for corporate management and control.\n Example: S&P 500 companies\nMergers and Acquisitions (M&A)\n\n Definition: M&A refers to the process by which companies buy or merge with other companies.\n Example: As Microsoft acquired Activision Blizzard, the landscape of the game industry within the S&P 500 has changed.\nAssociation Keywords: corporate strategy, synergy, corporate value\n\nESG (Environmental, Social, and Governance)')]
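
A metadata filter of this form can be read as an exact-match test on each document's metadata, applied together with the usual ranking. The following standard-library sketch shows that reading. The candidate list, scores, and helper names are hypothetical, and the ordering of steps (filter, then rank) is a simplification. As far as I know, the LangChain FAISS wrapper first fetches fetch_k candidates from the index and only then applies the filter, so a very selective filter may return fewer than k documents unless fetch_k is raised.

```python
def matches_filter(metadata, filter_dict):
    """True when every key in filter_dict has an equal value in metadata."""
    return all(metadata.get(key) == value for key, value in filter_dict.items())


def filtered_top_k(scored_docs, filter_dict, k):
    """Keep documents whose metadata matches, then return the k best by score."""
    kept = [doc for doc in scored_docs if matches_filter(doc["metadata"], filter_dict)]
    return sorted(kept, key=lambda doc: doc["score"], reverse=True)[:k]


# Hypothetical scored candidates; the scores are made up for illustration.
candidates = [
    {"text": "Word2Vec", "metadata": {"source": "data/nlp-keywords.txt"}, "score": 0.40},
    {"text": "ESG", "metadata": {"source": "data/finance-keywords.txt"}, "score": 0.88},
    {"text": "M&A", "metadata": {"source": "data/finance-keywords.txt"}, "score": 0.55},
    {"text": "Cyclical Stocks", "metadata": {"source": "data/finance-keywords.txt"}, "score": 0.30},
]

results = filtered_top_k(candidates, {"source": "data/finance-keywords.txt"}, k=2)
print([doc["text"] for doc in results])
```

Only finance documents survive the filter, which matches the output above, where both returned documents come from data/finance-keywords.txt.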

Last updated 5 months ago