02. FAISS
Facebook AI Similarity Search (Faiss) is a library for efficient similarity search and clustering of dense vectors.
Faiss contains algorithms that search in sets of vectors of any size, including sets that may not fit in RAM.
It also includes supporting code for evaluation and parameter tuning.
Reference
- LangChain FAISS documentation
- FAISS documentation
# Load API keys that are managed as environment variables in a configuration file
from dotenv import load_dotenv
# Load the API key information
load_dotenv()
True
# Set up LangSmith tracing. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging
# Enter a project name.
logging.langsmith("CH10-VectorStores")
Load the sample dataset.
VectorStore creation
Main initialization parameters
Indexing parameters
- embedding_function (Embeddings): Embedding function to use
Client parameters
- index (Any): FAISS index to use
- docstore (Docstore): Document store to use
- index_to_docstore_id (Dict[int, str]): Mapping from index position to document store ID
Note
FAISS is a library for high-performance vector search and clustering.
This class integrates FAISS with LangChain's VectorStore interface.
You can build an efficient vector search system by combining an embedding function, a FAISS index, and a document store.
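To make the relationship between the three components concrete, here is a minimal pure-Python sketch. It is not LangChain's implementation: `ToyVectorStore` and `toy_embed` are hypothetical names, a plain list stands in for the FAISS index, and the letter-count "embedding" exists only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def toy_embed(text: str) -> List[float]:
    """Hypothetical embedding: counts of the letters a, b, c (illustration only)."""
    lowered = text.lower()
    return [float(lowered.count(ch)) for ch in "abc"]


@dataclass
class ToyVectorStore:
    embedding_function: Callable[[str], List[float]]
    # A plain list stands in for the FAISS index (assumption for this sketch).
    index: List[List[float]] = field(default_factory=list)
    # Document ID -> stored text.
    docstore: Dict[str, str] = field(default_factory=dict)
    # Position in the index -> document ID.
    index_to_docstore_id: Dict[int, str] = field(default_factory=dict)

    def add(self, doc_id: str, text: str) -> None:
        position = len(self.index)
        self.index.append(self.embedding_function(text))
        self.docstore[doc_id] = text
        self.index_to_docstore_id[position] = doc_id


store = ToyVectorStore(embedding_function=toy_embed)
store.add("doc-1", "abc")
store.add("doc-2", "bbb")
```

A search returns index positions, and `index_to_docstore_id` is what turns those positions back into documents.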
Creating a FAISS vector store (from_documents)
The from_documents class method creates a FAISS vector store from a list of documents and an embedding function.
Parameters
- documents (List[Document]): List of documents to add to the vector store
- embedding (Embeddings): Embedding function to use
- **kwargs: Additional keyword arguments
How it works
- Extracts the text content (page_content) and metadata from each document in the list.
- Calls the from_texts method with the extracted texts and metadata.
Return value
- VectorStore: Vector store instance initialized with the documents and embeddings
Note
- This method creates the vector store by calling from_texts internally.
- Each document's page_content is used as the text, and its metadata is used as the metadata.
- If additional settings are required, you can pass them through kwargs.
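The delegation described above comes down to splitting each document into its text and its metadata. The sketch below shows only that step, under stated assumptions: `Doc` is a hypothetical stand-in for LangChain's Document class, and `split_documents` is an illustrative helper, not a library function.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a Document with page_content and metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def split_documents(documents: List[Doc]) -> Tuple[List[str], List[Dict[str, str]]]:
    """Separate documents into parallel lists of texts and metadata."""
    texts = [doc.page_content for doc in documents]
    metadatas = [doc.metadata for doc in documents]
    return texts, metadatas


docs = [
    Doc("hello", {"source": "a.txt"}),
    Doc("world", {"source": "b.txt"}),
]
texts, metadatas = split_documents(docs)
```

The two lists stay aligned by position, which is why the text-based method can accept them directly.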
Creating a FAISS vector store (from_texts)
The from_texts class method creates a FAISS vector store from a list of texts and an embedding function.
Parameters
- texts (List[str]): List of texts to add to the vector store
- embedding (Embeddings): Embedding function to use
- metadatas (Optional[List[dict]]): List of metadata. Default is None
- ids (Optional[List[str]]): List of document IDs. Default is None
- **kwargs: Additional keyword arguments
How it works
- Embeds the texts using the provided embedding function.
- Creates a FAISS instance by calling the __from method with the embedded vectors.
Return value
- FAISS: The created FAISS vector store instance
Note
- This method is a convenience interface that handles document embedding, creation of the in-memory document store, and FAISS index initialization in one step.
- It is a convenient way to get started quickly.
Caution
- When processing large amounts of text, pay attention to memory usage.
- To use metadata or IDs, provide them as lists of the same length as the text list.
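The same-length requirement for metadata and IDs can be checked before any embedding work is done. The following is a sketch only: `prepare_inputs` is a hypothetical helper, and generating UUIDs for missing IDs is an assumption of this example rather than a statement about the library.

```python
import uuid
from typing import Dict, Iterable, List, Optional, Tuple


def prepare_inputs(
    texts: Iterable[str],
    metadatas: Optional[List[Dict]] = None,
    ids: Optional[List[str]] = None,
) -> List[Tuple[str, str, Dict]]:
    """Validate list lengths and return aligned (id, text, metadata) rows."""
    texts = list(texts)
    if metadatas is not None and len(metadatas) != len(texts):
        raise ValueError("metadatas must have the same length as texts")
    if ids is not None and len(ids) != len(texts):
        raise ValueError("ids must have the same length as texts")
    if metadatas is None:
        metadatas = [{} for _ in texts]
    if ids is None:
        # Assumption for this sketch: fall back to random UUID strings.
        ids = [str(uuid.uuid4()) for _ in texts]
    return list(zip(ids, texts, metadatas))


rows = prepare_inputs(["a", "b"], [{"k": 1}, {"k": 2}], ["id1", "id2"])
auto_rows = prepare_inputs(["x"])
```

Passing a single ID for two texts raises `ValueError` instead of silently misaligning the data.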
Check the stored results. Confirm that the id values you specified were stored correctly.
Similarity Search
The similarity_search method searches for the documents most similar to a given query.
Parameters
- query (str): Query text used to find similar documents
- k (int): Number of documents to return. Default is 4
- filter (Optional[Union[Callable, Dict[str, Any]]]): Metadata filter function or dictionary. Default is None
- fetch_k (int): Number of documents to fetch before filtering. Default is 20
- **kwargs: Additional keyword arguments
Return value
- List[Document]: List of documents most similar to the query
How it works
- Calls similarity_search_with_score internally to search for documents together with their similarity scores.
- Extracts only the documents from the search results and returns them without the scores.
Main features
- The filter parameter enables metadata-based filtering.
- fetch_k controls how many documents are retrieved before filtering, so that the desired number of documents remains after filtering.
Considerations when using
- Search performance depends heavily on the quality of the embedding model used.
- On large datasets, adjust k and fetch_k to balance search speed and accuracy.
- If complex filtering is required, pass a custom function to the filter parameter for fine-grained control.
Optimization tips
- For frequently used queries, cache the results to speed up repeated searches.
- Setting fetch_k too high can slow down the search, so experiment to find an appropriate value.
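The interplay of k, fetch_k, and the metadata filter is easier to see on toy data. This pure-Python sketch is not the library's search code: `toy_similarity_search` is a hypothetical function, the vectors are hand-written, and Euclidean distance is assumed as the metric.

```python
import math
from typing import Dict, List, Optional, Tuple

Item = Tuple[List[float], str, Dict[str, str]]  # (vector, text, metadata)


def l2_distance(a: List[float], b: List[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def toy_similarity_search(
    query_vec: List[float],
    items: List[Item],
    k: int = 4,
    fetch_k: int = 20,
    metadata_filter: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Rank by distance; with a filter, fetch fetch_k candidates first, then filter, then cut to k."""
    ranked = sorted(items, key=lambda item: l2_distance(query_vec, item[0]))
    if metadata_filter is None:
        return [text for _, text, _ in ranked[:k]]
    candidates = ranked[:fetch_k]
    matching = [
        item for item in candidates
        if all(item[2].get(key) == value for key, value in metadata_filter.items())
    ]
    return [text for _, text, _ in matching[:k]]


items = [
    ([0.0, 0.0], "a", {"source": "x"}),
    ([1.0, 0.0], "b", {"source": "y"}),
    ([2.0, 0.0], "c", {"source": "x"}),
    ([3.0, 0.0], "d", {"source": "y"}),
]
plain = toy_similarity_search([0.0, 0.0], items, k=2)
wide = toy_similarity_search([0.0, 0.0], items, k=2, fetch_k=4, metadata_filter={"source": "y"})
narrow = toy_similarity_search([0.0, 0.0], items, k=2, fetch_k=2, metadata_filter={"source": "y"})
```

With `fetch_k=2` only one matching document survives the filter, so fewer than k results come back. This is the situation a larger fetch_k is meant to avoid.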
You can specify the number of search results with the k value.
You can filter results by metadata using the filter parameter.
Adding documents (add_documents)
The add_documents method adds documents to the vector store or updates existing ones.
Parameters
- documents (List[Document]): List of documents to add to the vector store
- **kwargs: Additional keyword arguments
Return value
- List[str]: List of IDs of the added texts
How it works
- Extracts the text content and metadata from the documents.
- Calls the add_texts method to perform the actual addition.
Main features
- Document objects can be handled directly, which is convenient.
- ID handling logic is included to ensure that each document is unique.
- It is built on the add_texts method, which improves code reuse.
Adding texts (add_texts)
The add_texts method embeds texts and adds them to the vector store.
Parameters
- texts (Iterable[str]): Texts to add to the vector store
- metadatas (Optional[List[dict]]): List of metadata associated with the texts (optional)
- ids (Optional[List[str]]): List of unique identifiers for the texts (optional)
- **kwargs: Additional keyword arguments
Return value
- List[str]: List of IDs of the texts added to the vector store
How it works
- Converts the input text iterable into a list.
- Embeds the texts using the _embed_documents method.
- Calls the __add method to add the embedded texts to the vector store.
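The three steps above (materialize the iterable, embed, append) can be sketched on a dictionary-based toy store. Everything here is hypothetical: `toy_add_texts`, the `auto-N` fallback ID scheme, and the length-based embedding are illustrations, not the library's behavior.

```python
from typing import Callable, Dict, Iterable, List, Optional


def toy_add_texts(
    store: Dict,
    texts: Iterable[str],
    embed: Callable[[str], List[float]],
    ids: Optional[List[str]] = None,
) -> List[str]:
    """Embed texts and append them to the toy store, returning their IDs."""
    texts = list(texts)  # materialize first, so a generator is only consumed once
    start = len(store["vectors"])
    if ids is None:
        # Hypothetical fallback ID scheme for this sketch.
        ids = [f"auto-{start + i}" for i in range(len(texts))]
    for offset, (doc_id, text) in enumerate(zip(ids, texts)):
        store["vectors"].append(embed(text))
        store["docs"][doc_id] = text
        store["index_to_id"][start + offset] = doc_id
    return ids


def length_embed(text: str) -> List[float]:
    """Toy embedding: a one-dimensional vector holding the text length."""
    return [float(len(text))]


store = {"vectors": [], "docs": {}, "index_to_id": {}}
first_ids = toy_add_texts(store, (word for word in ["hi", "hello"]), length_embed, ids=["new1", "new2"])
second_ids = toy_add_texts(store, ["bye"], length_embed)
```

New entries continue from the current index length, so positions in the mapping stay contiguous across calls.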
Deleting documents (delete)
The delete method deletes the documents corresponding to the specified IDs from the vector store.
Parameters
- ids (Optional[List[str]]): List of IDs of the documents to delete
- **kwargs: Additional keyword arguments (not used in this method)
Return value
- Optional[bool]: True if the deletion succeeded, False if it failed, None if not implemented
How it works
- Validates the given IDs.
- Finds the index positions corresponding to the IDs to delete.
- Removes those entries from the FAISS index.
- Deletes the documents with those IDs from the document store.
- Updates the index-to-ID mapping.
Main features
- ID-based deletion allows precise document management.
- Deletion is applied to both the FAISS index and the document store.
- Data consistency is maintained by renumbering the index after deletion.
Caution
- Deletion is irreversible, so perform it carefully.
- Concurrency control is not implemented, so take care in multi-threaded environments.
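The renumbering step is the part most easily overlooked, so here is a toy version of the whole sequence. This is a sketch under assumptions: `toy_delete` is a hypothetical function operating on plain lists and dictionaries, and it returns new structures instead of mutating a real FAISS index.

```python
from typing import Dict, List, Tuple


def toy_delete(
    vectors: List[List[float]],
    docstore: Dict[str, str],
    index_to_id: Dict[int, str],
    ids: List[str],
) -> Tuple[List[List[float]], Dict[str, str], Dict[int, str]]:
    """Remove the given IDs and renumber the remaining index positions from 0."""
    known = set(index_to_id.values())
    missing = [doc_id for doc_id in ids if doc_id not in known]
    if missing:
        raise ValueError(f"IDs not found: {missing}")
    doomed = set(ids)
    kept_positions = [pos for pos in sorted(index_to_id) if index_to_id[pos] not in doomed]
    new_vectors = [vectors[pos] for pos in kept_positions]
    new_docstore = {doc_id: text for doc_id, text in docstore.items() if doc_id not in doomed}
    new_mapping = {new_pos: index_to_id[old_pos] for new_pos, old_pos in enumerate(kept_positions)}
    return new_vectors, new_docstore, new_mapping


vectors = [[0.0], [1.0], [2.0]]
docstore = {"a": "A", "b": "B", "c": "C"}
index_to_id = {0: "a", 1: "b", 2: "c"}
new_vectors, new_docstore, new_mapping = toy_delete(vectors, docstore, index_to_id, ["b"])
```

After deleting "b", document "c" moves from position 2 to position 1, so the mapping has no gaps.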
You can delete documents by passing their ids to delete.
Save and load
Saving locally (save_local)
The save_local method saves the FAISS index, the document store, and the index-to-document-ID mapping to local disk.
Parameters
- folder_path (str): Folder path to save to
- index_name (str): Name of the index file to save (default: "index")
How it works
- Creates the specified folder path (ignored if it already exists).
- Saves the FAISS index as a separate file.
- Saves the document store and the index-to-document-ID mapping in pickle format.
Considerations when using
- You need write permission for the target path.
- For large datasets, saving can require considerable storage space and time.
- Consider the security risks of using pickle.
Loading locally (load_local)
The load_local class method loads a FAISS index, document store, and index-to-document-ID mapping saved on local disk.
Parameters
- folder_path (str): Folder path where the files to load are stored
- embeddings (Embeddings): Embedding object used to embed queries
- index_name (str): Name of the index file to load (default: "index")
- allow_dangerous_deserialization (bool): Whether to allow deserialization of the pickle file (default: False)
Return value
- FAISS: The loaded FAISS object
How it works
- Checks the deserialization risk and requires explicit permission from the user.
- Loads the FAISS index separately.
- Loads the document store and the index-to-document-ID mapping using pickle.
- Creates and returns a FAISS object from the loaded data.
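The explicit opt-in for pickle loading can be illustrated with the standard library alone. This sketch covers only the pickled part (document store and mapping). `toy_save`, `toy_load`, and the `<index_name>.pkl` file name are assumptions made for the example, and no FAISS index file is written.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Dict, Tuple


def toy_save(folder_path: str, docstore: Dict[str, str], index_to_id: Dict[int, str],
             index_name: str = "index") -> None:
    """Create the folder if needed and pickle the docstore together with the mapping."""
    folder = Path(folder_path)
    folder.mkdir(parents=True, exist_ok=True)
    with open(folder / f"{index_name}.pkl", "wb") as handle:
        pickle.dump((docstore, index_to_id), handle)


def toy_load(folder_path: str, index_name: str = "index",
             allow_dangerous_deserialization: bool = False) -> Tuple[Dict[str, str], Dict[int, str]]:
    """Refuse to unpickle unless the caller explicitly accepts the risk."""
    if not allow_dangerous_deserialization:
        raise ValueError("Unpickling can execute arbitrary code; only load files you trust.")
    with open(Path(folder_path) / f"{index_name}.pkl", "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp_dir:
    toy_save(tmp_dir, {"a": "A"}, {0: "a"})
    loaded = toy_load(tmp_dir, allow_dangerous_deserialization=True)
    try:
        toy_load(tmp_dir)
        refused = False
    except ValueError:
        refused = True
```

Defaulting the flag to False means that loading a file from an untrusted source fails loudly unless the caller makes a deliberate choice.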
Merging FAISS objects (merge_from)
The merge_from method merges another FAISS object into the current FAISS object.
Parameters
- target (FAISS): The FAISS object to merge into the current object
How it works
- Checks whether the document stores can be merged.
- Assigns index numbers to the new documents, starting from the length of the existing index.
- Merges the FAISS indexes.
- Extracts the documents and ID information from the target FAISS object.
- Adds the extracted information to the current document store and index-to-document-ID mapping.
Main features
- Merges the indexes, document stores, and index-to-document-ID mappings of both FAISS objects.
- Keeps index numbers contiguous across the merge.
- Checks in advance whether the document stores can be merged.
Caution
- The structure of the target FAISS object must be compatible with the current object.
- Handle duplicate IDs with care. The current implementation does not check for duplicates.
- If an exception occurs during merging, the objects may be left partially merged.
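The index-offset idea behind merging, along with one way to address the duplicate-ID and partial-merge cautions, can be sketched on toy dictionaries. `toy_merge` is a hypothetical function. The up-front duplicate check is an addition made in this sketch, not something the text attributes to the library.

```python
from typing import Dict


def toy_merge(current: Dict, target: Dict) -> Dict:
    """Merge target into current, shifting target positions by the current index length."""
    # Extra guard added in this sketch: check for duplicates before mutating anything,
    # so a failure cannot leave the store partially merged.
    overlap = set(current["docs"]) & set(target["docs"])
    if overlap:
        raise ValueError(f"Duplicate IDs: {sorted(overlap)}")
    offset = len(current["vectors"])
    current["vectors"].extend(target["vectors"])
    current["docs"].update(target["docs"])
    for position, doc_id in target["index_to_id"].items():
        current["index_to_id"][offset + position] = doc_id
    return current


db1 = {"vectors": [[0.0], [1.0]], "docs": {"a": "A", "b": "B"}, "index_to_id": {0: "a", 1: "b"}}
db2 = {"vectors": [[2.0]], "docs": {"c": "C"}, "index_to_id": {0: "c"}}
merged = toy_merge(db1, db2)
```

Position 0 of the target becomes position 2 in the merged store, which keeps the numbering contiguous.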
Use merge_from to merge the two databases.
Converting to a retriever (as_retriever)
The as_retriever method creates a VectorStoreRetriever object based on the current vector store.
Parameters
- **kwargs: Keyword arguments to pass to the search function
- search_type (Optional[str]): Search type ("similarity", "mmr", "similarity_score_threshold")
- search_kwargs (Optional[Dict]): Additional keyword arguments to pass to the search function
Return value
- VectorStoreRetriever: Retriever object based on the vector store
Main features
Support for various search types
- "similarity": Similarity-based search (default)
- "mmr": Maximal Marginal Relevance search
- "similarity_score_threshold": Threshold-based similarity search
Customizing search parameters
- k: Number of documents to return
- score_threshold: Similarity score threshold
- fetch_k: Number of documents to pass to the MMR algorithm
- lambda_mult: MMR diversity control parameter
- filter: Filtering based on document metadata
Considerations when using
- Choosing the search type and parameters appropriately lets you adjust the quality and diversity of the search results.
- On large datasets, adjust fetch_k and k to balance performance and accuracy.
- Use the filtering function to search only documents that meet certain conditions.
Optimization tips
- For MMR search, raise fetch_k and adjust lambda_mult to balance diversity and relevance.
- Use threshold-based search to return only highly relevant documents.
Caution
- Improper parameter settings can degrade search performance or result quality.
- On large datasets, a high k value can increase search time.
Performing a similarity search with the default settings returns four documents.
The default retriever returns 4 documents.
Retrieve more documents with higher diversity
- k: Number of documents to return (default: 4)
- fetch_k: Number of documents to pass to the MMR algorithm (default: 20)
- lambda_mult: Controls the diversity of MMR results (0 to 1, default: 0.5)
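To show how lambda_mult trades relevance against diversity, here is a small pure-Python sketch of greedy Maximal Marginal Relevance. It is an illustration under assumptions, not the library's code: `toy_mmr` is a hypothetical function, cosine similarity is assumed as the measure, and the candidate list plays the role of the fetch_k documents retrieved beforehand.

```python
import math
from typing import List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def toy_mmr(
    query_vec: List[float],
    candidates: List[Tuple[str, List[float]]],
    k: int = 4,
    lambda_mult: float = 0.5,
) -> List[str]:
    """Greedily pick k items, scoring each as
    lambda_mult * relevance - (1 - lambda_mult) * max similarity to items already picked."""
    selected: List[Tuple[str, List[float]]] = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item: Tuple[str, List[float]]) -> float:
            relevance = cosine(query_vec, item[1])
            redundancy = max((cosine(item[1], chosen[1]) for chosen in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [label for label, _ in selected]


query = [1.0, 0.0]
candidates = [("a", [1.0, 0.0]), ("b", [1.0, 0.1]), ("c", [1.0, 1.0])]
relevance_only = toy_mmr(query, candidates, k=2, lambda_mult=1.0)
diverse = toy_mmr(query, candidates, k=2, lambda_mult=0.3)
```

With `lambda_mult=1.0` the near-duplicate "b" is chosen second. With `lambda_mult=0.3` the penalty for resembling "a" outweighs relevance, so "c" is chosen instead.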
Fetch more documents for the MMR algorithm, but return only the top two
Search only documents with similarities above a certain threshold
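Threshold-based retrieval can be sketched as a simple filter over scored results. This example assumes scores are relevance values where higher means more similar. `filter_by_threshold` and the sample scores are hypothetical.

```python
from typing import List, Tuple


def filter_by_threshold(scored_docs: List[Tuple[str, float]], score_threshold: float) -> List[str]:
    """Keep only documents whose score is at or above the threshold (higher = more similar)."""
    return [doc for doc, score in scored_docs if score >= score_threshold]


scored = [("doc-a", 0.91), ("doc-b", 0.80), ("doc-c", 0.42)]
kept = filter_by_threshold(scored, score_threshold=0.8)
```

Unlike a fixed k, the number of results varies with the query, and a strict threshold can return nothing at all.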
Search only the single most similar document
Apply specific metadata filters