02. FAISS

Facebook AI Similarity Search (Faiss) is a library for efficient similarity search and clustering of dense vectors.

Faiss contains algorithms that search in sets of vectors of any size, including sets that may not fit in RAM.

It also includes supporting code for evaluation and parameter tuning.

Reference

  • LangChain FAISS documentation

  • FAISS documentation

# Configuration for managing API keys as environment variables
from dotenv import load_dotenv

# Load API key information
load_dotenv()

True

# Set up LangSmith tracing. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH10-VectorStores")

Load the sample dataset.

VectorStore creation

Main initialization parameters

Indexing parameters

  • embedding_function (Embeddings): Embedding function to use

Client parameters

  • index (Any): FAISS index to use

  • docstore (Docstore): Document store to use

  • index_to_docstore_id (Dict[int, str]): Mapping from index position to document store ID

Reference

  • FAISS is a library for high-performance vector search and clustering.

  • This class integrates FAISS with LangChain's VectorStore interface.

  • You can build an efficient vector search system by combining an embedding function, a FAISS index, and a document store.
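
To make the roles of the three client parameters concrete, here is a minimal plain-Python sketch of how an index, a document store, and the index-to-ID mapping fit together. It is a toy model, not the LangChain class: the Document stand-in, the list used as an index, and the lookup helper are all assumptions made for illustration.

```python
class Document:
    """Hypothetical stand-in for a document with text and metadata."""

    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata or {}


# Toy "index": the vector at list position i is the i-th indexed vector.
index = [[1.0, 0.0], [0.0, 1.0]]

# Document store: document ID -> document.
docstore = {
    "doc-a": Document("apples are red", {"source": "fruit.txt"}),
    "doc-b": Document("the sky is blue", {"source": "sky.txt"}),
}

# Mapping from index position to document store ID.
index_to_docstore_id = {0: "doc-a", 1: "doc-b"}


def lookup(position):
    """Resolve an index position, as a vector search would return, to a document."""
    return docstore[index_to_docstore_id[position]]


print(lookup(1).page_content)  # the sky is blue
```

The index only knows positions and vectors, so every search result has to pass through the mapping before a document can be returned.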

Creating a FAISS vector store (from_documents)

The from_documents class method creates a FAISS vector store from a list of documents and an embedding function.

Parameters

  • documents (List[Document]): List of documents to add to the vector store

  • embedding (Embeddings): Embedding function to use

  • **kwargs: Additional keyword arguments

How it works

  1. Extracts the text content (page_content) and metadata from the document list.

  2. Calls the from_texts method with the extracted text and metadata.

Return value

  • VectorStore: Vector store instance initialized with the documents and embeddings

Reference

  • This method creates the vector store by calling from_texts internally.

  • Each document's page_content is used as the text, and its metadata is used as the metadata.

  • If additional settings are required, you can pass them through kwargs.
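
The two-step delegation described above can be sketched in a few lines of plain Python. This is only an illustration of the control flow: the Document stand-in, the toy from_texts that returns plain records, and toy_embedding are assumptions, not LangChain's implementation.

```python
class Document:
    """Hypothetical stand-in for a document with text and metadata."""

    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata or {}


def from_texts(texts, embedding, metadatas=None, **kwargs):
    """Toy from_texts: embeds each text and returns plain records."""
    metadatas = metadatas or [{} for _ in texts]
    return [
        {"text": text, "vector": embedding(text), "metadata": metadata}
        for text, metadata in zip(texts, metadatas)
    ]


def from_documents(documents, embedding, **kwargs):
    """Split documents into texts and metadata, then delegate to from_texts."""
    texts = [doc.page_content for doc in documents]
    metadatas = [doc.metadata for doc in documents]
    return from_texts(texts, embedding, metadatas=metadatas, **kwargs)


def toy_embedding(text):
    """Hypothetical embedding: (character count, word count)."""
    return [float(len(text)), float(len(text.split()))]


docs = [Document("hello world", {"source": "a.txt"}), Document("faiss")]
store = from_documents(docs, toy_embedding)
print(store[0]["vector"], store[0]["metadata"])  # [11.0, 2.0] {'source': 'a.txt'}
```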

Creating a FAISS vector store (from_texts)

The from_texts class method creates a FAISS vector store from a list of texts and an embedding function.

Parameters

  • texts (List[str]): List of texts to add to the vector store

  • embedding (Embeddings): Embedding function to use

  • metadatas (Optional[List[dict]]): List of metadata. The default is None

  • ids (Optional[List[str]]): List of document IDs. The default is None

  • **kwargs: Additional keyword arguments

How it works

  1. Embeds the texts using the provided embedding function.

  2. Calls the __from method with the embedded vectors to create a FAISS instance.

Return value

  • FAISS: The created FAISS vector store instance

Reference

  • This method is a user-friendly interface that handles document embedding, creation of the in-memory document store, and initialization of the FAISS database in one step.

  • It is a convenient way to get started quickly.

Caution

  • When processing large amounts of text, pay attention to memory usage.

  • To use metadata or IDs, you must provide them as lists of the same length as the text list.
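
The length requirement in the caution above is easy to get wrong, so here is a small sketch that makes it explicit. It is a toy model under stated assumptions: the dictionary layout, the explicit length checks, the use of uuid4 for missing IDs, and toy_embedding are choices of this sketch, not a description of the library's internals.

```python
import uuid


def from_texts(texts, embedding, metadatas=None, ids=None):
    """Toy from_texts that validates list lengths before building the store."""
    texts = list(texts)
    if metadatas is not None and len(metadatas) != len(texts):
        raise ValueError("metadatas must have the same length as texts")
    if ids is not None and len(ids) != len(texts):
        raise ValueError("ids must have the same length as texts")
    metadatas = metadatas or [{} for _ in texts]
    # Assumption of this sketch: random IDs are generated when none are given.
    ids = ids or [str(uuid.uuid4()) for _ in texts]

    vectors = [embedding(text) for text in texts]  # step 1: embed the texts
    return {                                       # step 2: assemble the store
        "index": vectors,
        "docstore": {
            doc_id: {"text": text, "metadata": metadata}
            for doc_id, text, metadata in zip(ids, texts, metadatas)
        },
        "index_to_docstore_id": dict(enumerate(ids)),
    }


def toy_embedding(text):
    """Hypothetical embedding: a single number, the character count."""
    return [float(len(text))]


db = from_texts(
    ["hello", "faiss"],
    toy_embedding,
    metadatas=[{"source": "a.txt"}, {"source": "b.txt"}],
    ids=["doc1", "doc2"],
)
print(db["index_to_docstore_id"])  # {0: 'doc1', 1: 'doc2'}
```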

Check the stored result and verify that the specified id values were stored correctly.

Similarity Search

The similarity_search method searches for the documents most similar to a given query.

Parameters

  • query (str): Search query text used to find similar documents

  • k (int): Number of documents to return. The default is 4

  • filter (Optional[Union[Callable, Dict[str, Any]]]): Metadata filtering function or dictionary. The default is None

  • fetch_k (int): Number of documents to fetch before filtering. The default is 20

  • **kwargs: Additional keyword arguments

Return value

  • List[Document]: List of documents most similar to the query

How it works

  1. Calls the similarity_search_with_score method internally to search for documents together with their similarity scores.

  2. Extracts only the documents from the search results and returns them without the scores.

Main features

  • Metadata-based filtering is possible with the filter parameter.

  • With fetch_k you can adjust how many documents are fetched before filtering, so that the desired number of documents remains after filtering.

Considerations when using

  • Search performance depends heavily on the quality of the embedding model used.

  • On large datasets, it is important to balance search speed and accuracy by tuning k and fetch_k appropriately.

  • If complex filtering is required, you can pass a custom function to the filter parameter for fine-grained control.

Optimization tips

  • For frequently used queries, you can cache the results to speed up repeated searches.

  • Setting fetch_k too high can slow down the search, so it is a good idea to experiment to find an appropriate value.
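
The interaction between k, filter, and fetch_k is easier to see in a toy model. The sketch below ranks a handful of in-memory records by Euclidean distance, takes fetch_k candidates when a filter is present, filters them, and keeps the top k. It assumes a pre-computed query vector instead of query text, and the record layout and helper names are made up for illustration; it is not LangChain's code.

```python
import math


def l2(a, b):
    """Euclidean distance; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_search_with_score(store, query_vec, k=4, filter=None, fetch_k=20):
    ranked = sorted(store, key=lambda rec: l2(rec["vector"], query_vec))
    # Without a filter, k candidates are enough; with one, fetch_k are taken first.
    candidates = ranked[: (k if filter is None else fetch_k)]
    if filter is not None:
        if callable(filter):
            keep = filter
        else:
            keep = lambda md: all(md.get(key) == val for key, val in filter.items())
        candidates = [rec for rec in candidates if keep(rec["metadata"])]
    return [(rec, l2(rec["vector"], query_vec)) for rec in candidates[:k]]


def similarity_search(store, query_vec, **kwargs):
    """Same search, with the scores stripped from the result."""
    return [rec for rec, _ in similarity_search_with_score(store, query_vec, **kwargs)]


store = [
    {"text": "a", "vector": [0.0, 0.0], "metadata": {"source": "x"}},
    {"text": "b", "vector": [1.0, 0.0], "metadata": {"source": "y"}},
    {"text": "c", "vector": [2.0, 0.0], "metadata": {"source": "x"}},
    {"text": "d", "vector": [3.0, 0.0], "metadata": {"source": "y"}},
]
query = [0.0, 0.0]

top2 = [r["text"] for r in similarity_search(store, query, k=2)]
only_y = [r["text"] for r in similarity_search(store, query, k=2, filter={"source": "y"})]
starved = [r["text"] for r in similarity_search(store, query, k=2, filter={"source": "y"}, fetch_k=2)]
print(top2, only_y, starved)  # ['a', 'b'] ['b', 'd'] ['b']
```

The last call shows why fetch_k matters: with only two candidates fetched before filtering, one survives the filter, so fewer than k documents come back.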

You can specify the number of search results with the k value.

You can filter on metadata with the filter parameter.

Adding documents (add_documents)

The add_documents method adds documents to the vector store or updates them.

Parameters

  • documents (List[Document]): List of documents to add to the vector store

  • **kwargs: Additional keyword arguments

Return value

  • List[str]: List of IDs of the added texts

How it works

  1. Extracts the text content and metadata from the documents.

  2. Calls the add_texts method to perform the actual addition.

Main features

  • It is convenient because it handles Document objects directly.

  • It includes ID handling logic to ensure that each document is unique.

  • It builds on the add_texts method, which improves code reuse.

Adding texts (add_texts)

The add_texts method embeds texts and adds them to the vector store.

Parameters

  • texts (Iterable[str]): Texts to add to the vector store

  • metadatas (Optional[List[dict]]): List of metadata associated with the texts (optional)

  • ids (Optional[List[str]]): List of unique identifiers for the texts (optional)

  • **kwargs: Additional keyword arguments

Return value

  • List[str]: List of IDs of the texts added to the vector store

How it works

  1. Converts the input text iterable to a list.

  2. Embeds the texts using the _embed_documents method.

  3. Calls the __add method to add the embedded texts to the vector store.
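
The three steps above can be mirrored in a small in-memory class. ToyStore, its attribute names, the duplicate-ID check, and the uuid4 fallback are assumptions of this sketch; it is meant to show the sequence of operations rather than the library's actual code.

```python
import uuid


class ToyStore:
    """Hypothetical in-memory store that follows the three steps listed above."""

    def __init__(self, embedding):
        self.embedding = embedding
        self.index = []                 # vectors, by position
        self.docstore = {}              # id -> (text, metadata)
        self.index_to_docstore_id = {}  # position -> id

    def add_texts(self, texts, metadatas=None, ids=None):
        texts = list(texts)                               # 1. iterable to list
        vectors = [self.embedding(t) for t in texts]      # 2. embed the texts
        return self._add(texts, vectors, metadatas, ids)  # 3. store them

    def _add(self, texts, vectors, metadatas, ids):
        metadatas = metadatas or [{} for _ in texts]
        ids = ids or [str(uuid.uuid4()) for _ in texts]
        duplicates = set(ids) & set(self.docstore)
        if duplicates:
            raise ValueError(f"IDs already present: {sorted(duplicates)}")
        start = len(self.index)
        for offset, (doc_id, text, vec, md) in enumerate(zip(ids, texts, vectors, metadatas)):
            self.index.append(vec)
            self.docstore[doc_id] = (text, md)
            self.index_to_docstore_id[start + offset] = doc_id
        return ids


db = ToyStore(lambda text: [float(len(text))])
# A generator is accepted because add_texts materializes it first.
new_ids = db.add_texts((t for t in ["hello", "hi"]), ids=["new1", "new2"])
print(new_ids, db.index_to_docstore_id)  # ['new1', 'new2'] {0: 'new1', 1: 'new2'}
```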

Delete Documents

The delete method removes the documents corresponding to the specified IDs from the vector store.

Parameters

  • ids (Optional[List[str]]): List of IDs of the documents to delete

  • **kwargs: Additional keyword arguments (not used in this method)

Return value

  • Optional[bool]: True if the deletion succeeded, False if it failed, None if not implemented

How it works

  1. Validates the given IDs.

  2. Finds the index positions corresponding to the IDs to delete.

  3. Removes those entries from the FAISS index.

  4. Deletes the documents with those IDs from the document store.

  5. Updates the index and the ID mapping.

Main features

  • ID-based deletion allows precise document management.

  • Deletion is performed on both the FAISS index and the document store.

  • Data consistency is maintained by renumbering the index after deletion.

Caution

  • Delete operations are irreversible and must be performed carefully.

  • Concurrency control is not implemented, so take care in multi-threaded environments.
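
The renumbering in step 5 is the part that is easiest to overlook, so the sketch below walks through all five steps on a toy store built from plain lists and dictionaries. The store layout and the delete function are assumptions made for illustration, not the library's implementation.

```python
def delete(store, ids):
    """Toy delete that follows the five steps listed above."""
    if not ids:
        raise ValueError("No ids provided to delete.")
    # 1. Validate the IDs against the current mapping.
    id_to_position = {doc_id: pos for pos, doc_id in store["index_to_docstore_id"].items()}
    missing = set(ids) - set(id_to_position)
    if missing:
        raise ValueError(f"Unknown ids: {sorted(missing)}")
    # 2. Find the index positions that belong to those IDs.
    positions = {id_to_position[doc_id] for doc_id in ids}
    # 3. Remove the vectors at those positions from the index.
    store["index"] = [vec for pos, vec in enumerate(store["index"]) if pos not in positions]
    # 4. Remove the documents from the document store.
    for doc_id in set(ids):
        del store["docstore"][doc_id]
    # 5. Renumber the surviving IDs so positions stay contiguous.
    remaining = [
        doc_id
        for pos, doc_id in sorted(store["index_to_docstore_id"].items())
        if pos not in positions
    ]
    store["index_to_docstore_id"] = dict(enumerate(remaining))
    return True


db = {
    "index": [[0.0], [1.0], [2.0]],
    "docstore": {"a": "doc a", "b": "doc b", "c": "doc c"},
    "index_to_docstore_id": {0: "a", 1: "b", 2: "c"},
}
deleted = delete(db, ["b"])
print(deleted, db["index_to_docstore_id"])  # True {0: 'a', 1: 'c'}
```

After the call, document c has moved from position 2 to position 1, which is why positions should never be cached across deletions.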

You can delete documents by passing ids to delete.

Save and load

Saving locally (save_local)

The save_local method saves the FAISS index, the document store, and the index-to-document-ID mapping to local disk.

Parameters

  • folder_path (str): Path of the folder to save to

  • index_name (str): Name of the index file to save (default: "index")

How it works

  1. Creates the specified folder path (ignored if it already exists).

  2. Saves the FAISS index as a separate file.

  3. Saves the document store and the index-to-document-ID mapping in pickle format.

Considerations when using

  • You need write permission for the save path.

  • For large datasets, saving can take considerable storage space and time.

  • You should consider the security risks of using pickle.
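
To illustrate why pickle carries a security risk, the following sketch shows that loading a pickle can invoke a callable chosen by whoever wrote the file. The class name is made up, and the payload here is deliberately harmless (it only calls int); it demonstrates the mechanism, not an attack.

```python
import pickle


class NotWhatItSeems:
    """On loading, pickle calls the callable that __reduce__ returned at dump time."""

    def __reduce__(self):
        # Harmless here: int("42"). A hostile file could name any importable callable.
        return (int, ("42",))


payload = pickle.dumps(NotWhatItSeems())
loaded = pickle.loads(payload)
print(type(loaded).__name__, loaded)  # int 42
```

The loaded object is not an instance of the class that was dumped: the file decided what to call. This is why pickle files should only be loaded from sources you trust.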

Loading from local disk (load_local)

The load_local class method loads a FAISS index, document store, and index-to-document-ID mapping saved on local disk.

Parameters

  • folder_path (str): Path of the folder where the files to load are stored

  • embeddings (Embeddings): Embedding object to use for generating query embeddings

  • index_name (str): Name of the index file to load (default: "index")

  • allow_dangerous_deserialization (bool): Whether to allow deserialization of the pickle file (default: False)

Return value

  • FAISS: The loaded FAISS object

How it works

  1. Checks the deserialization risk and requires explicit permission from the user.

  2. Loads the FAISS index separately.

  3. Uses pickle to load the document store and the index-to-document-ID mapping.

  4. Creates and returns a FAISS object from the loaded data.
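
The save and load steps can be sketched as a round trip on a toy store. Everything here is an assumption made for illustration: the store is a plain dictionary, JSON stands in for the real binary index format, the ".faiss" and ".pkl" file suffixes are a naming choice of the sketch, and the embeddings parameter is omitted. The point is the explicit opt-in before any pickle file is read.

```python
import json
import pickle
import tempfile
from pathlib import Path


def save_local(store, folder_path, index_name="index"):
    folder = Path(folder_path)
    folder.mkdir(parents=True, exist_ok=True)  # 1. create the folder if needed
    # 2. The index goes into its own file (JSON is a stand-in format here).
    (folder / f"{index_name}.faiss").write_text(json.dumps(store["index"]))
    # 3. Document store and ID mapping are pickled together.
    with open(folder / f"{index_name}.pkl", "wb") as f:
        pickle.dump((store["docstore"], store["index_to_docstore_id"]), f)


def load_local(folder_path, index_name="index", allow_dangerous_deserialization=False):
    # 1. Refuse to unpickle unless the caller explicitly opts in.
    if not allow_dangerous_deserialization:
        raise ValueError(
            "Loading relies on pickle; pass allow_dangerous_deserialization=True "
            "only for files you trust."
        )
    folder = Path(folder_path)
    index = json.loads((folder / f"{index_name}.faiss").read_text())  # 2. load the index
    with open(folder / f"{index_name}.pkl", "rb") as f:               # 3. load the rest
        docstore, index_to_docstore_id = pickle.load(f)
    # 4. Reassemble the store from the loaded parts.
    return {"index": index, "docstore": docstore, "index_to_docstore_id": index_to_docstore_id}


store = {
    "index": [[1.0, 0.0], [0.0, 1.0]],
    "docstore": {"doc-a": "apples are red", "doc-b": "the sky is blue"},
    "index_to_docstore_id": {0: "doc-a", 1: "doc-b"},
}

with tempfile.TemporaryDirectory() as tmp:
    save_local(store, tmp, index_name="toy_index")
    restored = load_local(tmp, index_name="toy_index", allow_dangerous_deserialization=True)

print(restored == store)  # True
```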

Merging FAISS objects (merge_from)

The merge_from method merges another FAISS object into the current FAISS object.

Parameters

  • target (FAISS): The target FAISS object to merge into the current object

How it works

  1. Checks whether the document store can be merged.

  2. Sets the index positions for the new documents based on the length of the existing index.

  3. Merges the FAISS indexes.

  4. Extracts the documents and ID information from the target FAISS object.

  5. Adds the extracted information to the current document store and index-to-document-ID mapping.

Main features

  • Merges the indexes, document stores, and index-to-document-ID mappings of both FAISS objects.

  • Merges while keeping the index numbers contiguous.

  • Checks in advance whether the document stores can be merged.

Caution

  • The structure of the target FAISS object must be compatible with that of the current object.

  • Be careful with duplicate IDs. The current implementation does not check for duplicates.

  • If an exception occurs during merging, the objects may be left partially merged.
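
The offsetting of index positions in step 2 is the core of the merge, and a toy version makes it visible. The dictionary-based stores and the merge_from function below are assumptions made for illustration and are not the library's implementation.

```python
def merge_from(current, target):
    """Toy merge: append target's vectors and shift its positions by the current length."""
    starting_len = len(current["index_to_docstore_id"])
    current["index"].extend(target["index"])
    for position, doc_id in target["index_to_docstore_id"].items():
        # No duplicate check here: an existing ID would be silently overwritten.
        current["docstore"][doc_id] = target["docstore"][doc_id]
        current["index_to_docstore_id"][starting_len + position] = doc_id


db1 = {
    "index": [[0.0], [1.0]],
    "docstore": {"a": "doc a", "b": "doc b"},
    "index_to_docstore_id": {0: "a", 1: "b"},
}
db2 = {
    "index": [[2.0], [3.0]],
    "docstore": {"c": "doc c", "d": "doc d"},
    "index_to_docstore_id": {0: "c", 1: "d"},
}

merge_from(db1, db2)
print(db1["index_to_docstore_id"])  # {0: 'a', 1: 'b', 2: 'c', 3: 'd'}
```

Positions 0 and 1 of the target become positions 2 and 3 of the merged store, which keeps the numbering contiguous.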

Use merge_from to merge the two databases.

Converting to a retriever (as_retriever)

The as_retriever method creates a VectorStoreRetriever object based on the current vector store.

Parameters

  • **kwargs: Keyword arguments to pass to the search function

  • search_type (Optional[str]): Search type ("similarity", "mmr", "similarity_score_threshold")

  • search_kwargs (Optional[Dict]): Additional keyword arguments to pass to the search function

Return value

  • VectorStoreRetriever: Retriever object based on the vector store

Main features

Support for various search types

  • "similarity": Similarity-based search (default)

  • "mmr": Maximal Marginal Relevance search

  • "similarity_score_threshold": Threshold-based similarity search

Customizing search parameters

  • k: Number of documents to return

  • score_threshold: Similarity score threshold

  • fetch_k: Number of documents to pass to the MMR algorithm

  • lambda_mult: MMR diversity control parameter

  • filter: Filtering based on document metadata

Considerations when using

  • You can adjust the quality and diversity of search results by choosing the search type and parameters appropriately.

  • On large datasets, you can balance performance and accuracy by tuning fetch_k and k.

  • You can use the filtering feature to search only documents that meet certain conditions.

Optimization tips

  • For MMR search, you can balance diversity and relevance by raising fetch_k and adjusting lambda_mult.

  • You can use threshold-based search to return only highly relevant documents.

Caution

  • Improper parameter settings can affect search performance or the quality of results.

  • On large datasets, setting a high k value can increase search time.

Perform a similarity search to retrieve the 4 documents set as the default.

The default retriever returns 4 documents.

Retrieve more documents with higher diversity

  • k: Number of documents to return (default: 4)

  • fetch_k: Number of documents to pass to the MMR algorithm (default: 20)

  • lambda_mult: Controls the diversity of MMR results (0 to 1, default: 0.5)
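
To show how these three parameters interact, here is a plain-Python sketch of the usual maximal marginal relevance selection rule. It is not LangChain's code: the cosine helper, the mmr function, and the example vectors are assumptions, and the candidates list plays the role of the fetch_k documents. The sketch assumes the common convention that lambda_mult near 1 favors relevance and near 0 favors diversity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def mmr(query, candidates, k=4, lambda_mult=0.5):
    """Return indices into candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            relevance = cosine(query, candidates[i])
            # Redundancy: similarity to the closest already-selected candidate.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.3]
candidates = [[1.0, 0.0], [1.0, 0.1], [0.6, 0.8]]  # the first two are near-duplicates

print(mmr(query, candidates, k=2, lambda_mult=1.0))  # [1, 0]
print(mmr(query, candidates, k=2, lambda_mult=0.5))  # [1, 2]
```

With lambda_mult=1.0 the two near-duplicate vectors are both returned; with 0.5 the second pick is penalized for resembling the first, so the more distinct candidate is chosen instead.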

Fetch more documents for the MMR algorithm, but return only the top two

Retrieve only documents whose similarity is above a certain threshold
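
As a rough illustration of threshold-based retrieval, the sketch below keeps only results whose score reaches score_threshold, up to k of them. It assumes scores are relevance values between 0 and 1 where higher means more similar; the function name and the sample scores are made up for the example.

```python
def search_with_threshold(scored_docs, score_threshold, k=4):
    """scored_docs: (document, relevance score) pairs; higher score = more similar."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [doc for doc, score in ranked[:k] if score >= score_threshold]


scored = [("doc-a", 0.91), ("doc-b", 0.42), ("doc-c", 0.85)]
print(search_with_threshold(scored, score_threshold=0.8))  # ['doc-a', 'doc-c']
```

Unlike a plain top-k search, the result can be shorter than k, or even empty, when few documents clear the threshold.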

Search only the single most similar document

Apply specific metadata filters
