01. Chroma
This notebook covers how to get started with the Chroma vector store.
Chroma is an AI-native, open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.
# Configuration file for managing API keys as environment variables.
from dotenv import load_dotenv

# Load API key information.
load_dotenv()
True
# Set up LangSmith tracing. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH10-VectorStores")
Load the sample dataset.
VectorStore creation
Vector store creation (from_documents)
The from_documents class method creates a vector store from a list of documents.
Parameters
documents (List[Document]): List of documents to add to the vector store.
embedding (Optional[Embeddings]): Embedding function. The default is None.
ids (Optional[List[str]]): List of document IDs. The default is None.
collection_name (str): Name of the collection to create.
persist_directory (Optional[str]): Directory in which to store the collection. The default is None.
client_settings (Optional[chromadb.config.Settings]): Chroma client settings.
client (Optional[chromadb.Client]): Chroma client instance.
collection_metadata (Optional[Dict]): Collection configuration information. The default is None.
Note
If persist_directory is specified, the collection is stored in that directory. If it is not specified, the data is held temporarily in memory.
This method internally calls the from_texts method to create the vector store.
The page_content of each Document is used as the text, and its metadata is used as the metadata.
Return value
Chroma: The created Chroma vector store instance.
When creating the store, pass a list of Document objects as the documents parameter. Specify the embedding model to use for embedding. You can also specify collection_name, which plays the role of a namespace.
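The note above says that from_documents delegates to from_texts, using page_content as the text and metadata as the metadata. A minimal sketch of that decomposition in plain Python follows. The Document class here is a simplified stand-in defined only for this example, and split_documents is a hypothetical helper name, not part of LangChain or Chroma.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Simplified stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_documents(documents):
    """Split documents into parallel lists of texts and metadata dicts."""
    texts = [doc.page_content for doc in documents]
    metadatas = [doc.metadata for doc in documents]
    return texts, metadatas


docs = [
    Document("Chroma is a vector database.", {"source": "intro.txt"}),
    Document("Embeddings map text to vectors.", {"source": "nlp.txt"}),
]

# These two lists are what a from_texts-style method would receive.
texts, metadatas = split_documents(docs)
print(texts)
print(metadatas)
```

The file names in the metadata are made up for the example.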
When persist_directory is specified, the store is saved to disk as files.
Load the data stored in DB_PATH.
Check the data stored in the loaded VectorStore.
If you specify a different collection_name, no results are returned because no data is stored under that name.
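To make the relationship between persist_directory and collection_name concrete, here is a toy model in plain Python. It is only a sketch of the observable behavior described above (data saved under one collection name is not visible under another). It stores JSON files, which is an assumption of this example and not how Chroma lays out data on disk. The names save_collection, load_collection, my_db, and my_db2 are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_collection(persist_directory, collection_name, texts):
    """Write one collection to <persist_directory>/<collection_name>.json."""
    directory = Path(persist_directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{collection_name}.json"
    path.write_text(json.dumps(texts), encoding="utf-8")


def load_collection(persist_directory, collection_name):
    """Return the stored texts, or an empty list if the name is unknown."""
    path = Path(persist_directory) / f"{collection_name}.json"
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


with tempfile.TemporaryDirectory() as db_path:
    save_collection(db_path, "my_db", ["first text", "second text"])
    same_name = load_collection(db_path, "my_db")    # finds the saved data
    other_name = load_collection(db_path, "my_db2")  # nothing stored here

print(same_name)
print(other_name)
```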
Vector store creation (from_texts)
The from_texts class method creates a vector store from a list of texts.
Parameters
texts (List[str]): List of texts to add to the collection.
embedding (Optional[Embeddings]): Embedding function. The default is None.
metadatas (Optional[List[dict]]): List of metadata. The default is None.
ids (Optional[List[str]]): List of document IDs. The default is None.
collection_name (str): Name of the collection to create. The default is _LANGCHAIN_DEFAULT_COLLECTION_NAME (a constant whose value is "langchain").
persist_directory (Optional[str]): Directory in which to store the collection. The default is None.
client_settings (Optional[chromadb.config.Settings]): Chroma client settings.
client (Optional[chromadb.Client]): Chroma client instance.
collection_metadata (Optional[Dict]): Collection configuration information. The default is None.
Note
If persist_directory is specified, the collection is stored in that directory. If it is not specified, the data is held temporarily in memory.
If ids is not provided, IDs are generated automatically using UUIDs.
Return value
The created vector store instance.
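The ID rule in the note above (explicit IDs are kept, otherwise UUIDs are generated) can be sketched as follows. This is an illustration in plain Python, not the library's code, and resolve_ids is a hypothetical helper name.

```python
import uuid


def resolve_ids(texts, ids=None):
    """Return the given IDs, or one freshly generated UUID string per text."""
    if ids is None:
        return [str(uuid.uuid4()) for _ in texts]
    return list(ids)


texts = ["hello", "world"]
auto_ids = resolve_ids(texts)                      # random UUID strings
explicit_ids = resolve_ids(texts, ids=["1", "2"])  # kept exactly as given
print(auto_ids)
print(explicit_ids)
```

Because auto-generated IDs differ on every run, passing explicit IDs is the simpler choice when you plan to update or delete specific entries later.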
Similarity search
The similarity_search method performs a similarity search in the Chroma database and returns the documents most similar to the given query.
Parameters
query (str): Query text to search for.
k (int, optional): Number of results to return. The default is 4.
filter (Dict[str, str], optional): Metadata filter. The default is None.
Note
Adjust the k value to get the desired number of results.
Use the filter parameter to search only documents that meet certain metadata conditions.
This method returns only the documents, without score information. If you also need scores, use the similarity_search_with_score method directly.
Return value
List[Document]: List of the documents most similar to the query text.
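To show how k and filter interact, here is a minimal sketch of a top-k search over a tiny in-memory list. It is not Chroma's implementation. It uses cosine similarity for illustration (the metric a real collection uses depends on its configuration), hand-written 2-dimensional vectors instead of a real embedding model, and an exact-match metadata filter. All data and helper names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_search(records, query_vector, k=4, filter=None):
    """Return the contents of the k records most similar to query_vector.

    filter mirrors the parameter name above: a record is kept only if every
    key/value pair in filter equals the corresponding metadata entry.
    """
    candidates = [
        record for record in records
        if filter is None
        or all(record["metadata"].get(key) == value for key, value in filter.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda record: cosine_similarity(record["vector"], query_vector),
        reverse=True,
    )
    return [record["content"] for record in ranked[:k]]


records = [
    {"content": "doc A", "vector": [1.0, 0.0], "metadata": {"source": "a.txt"}},
    {"content": "doc B", "vector": [0.8, 0.6], "metadata": {"source": "b.txt"}},
    {"content": "doc C", "vector": [0.0, 1.0], "metadata": {"source": "a.txt"}},
]
query = [1.0, 0.0]

top_two = similarity_search(records, query, k=2)
only_a = similarity_search(records, query, k=2, filter={"source": "a.txt"})
print(top_two)
print(only_a)
```

Note that the filter is applied before ranking, so a filtered search can return documents that would not appear in the unfiltered top k.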
You can specify the number of search results with the k value.
You can use metadata information in filter to narrow the search results.
Next, use a different source in filter and check the search results.
Add documents to the vector store
The add_documents method adds documents to the vector store or updates existing ones.
Parameters
documents (List[Document]): List of documents to add to the vector store.
**kwargs: Additional keyword arguments.
  ids: List of document IDs (when provided, these take precedence over the IDs carried by the documents).
Note
The add_texts method must be implemented.
The page_content of each Document is used as the text, and its metadata is used as the metadata.
If a document has an ID and no IDs are provided in kwargs, the document's ID is used.
A ValueError is raised if the number of IDs in kwargs does not match the number of documents.
Return value
List[str]: List of IDs of the added texts.
Exceptions
NotImplementedError: Raised when the add_texts method is not implemented.
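The ID precedence rule described above can be sketched in plain Python. This is an illustration of the stated behavior, not the library's source. The Document class is a simplified stand-in and resolve_document_ids is a hypothetical helper name.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Document:
    """Simplified stand-in for a LangChain Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)
    id: Optional[str] = None


def resolve_document_ids(documents, **kwargs):
    """IDs passed in kwargs win; otherwise fall back to each document's own ID."""
    if "ids" in kwargs:
        ids = list(kwargs["ids"])
        if len(ids) != len(documents):
            raise ValueError("The number of ids must match the number of documents.")
        return ids
    return [doc.id for doc in documents]


docs = [Document("first", id="doc-1"), Document("second", id="doc-2")]

from_docs = resolve_document_ids(docs)
from_kwargs = resolve_document_ids(docs, ids=["new-1", "new-2"])

try:
    resolve_document_ids(docs, ids=["only-one"])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = str(exc)

print(from_docs)
print(from_kwargs)
print(mismatch_error)
```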
The add_texts method embeds texts and adds them to the vector store.
Parameters
texts (Iterable[str]): List of texts to add to the vector store.
metadatas (Optional[List[dict]]): List of metadata. The default is None.
ids (Optional[List[str]]): List of document IDs. The default is None.
Note
If ids is not provided, IDs are generated automatically using UUIDs.
If an embedding function is set, the texts are embedded.
If metadata is provided:
Texts with and without metadata are separated and processed.
For texts without metadata, the metadata is filled with an empty dictionary.
An upsert is performed on the collection to add the texts, embeddings, and metadata.
Return value
List[str]: List of IDs of the added texts.
Exceptions
ValueError: Raised when complex metadata causes an error. The error message includes guidance on how to filter the metadata.
When you add a text under an existing ID, an upsert is performed and the existing document is replaced.
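The upsert behavior and the empty-dictionary fill described above can be modeled with an ordinary dict keyed by ID. This is a simplified sketch, not Chroma's implementation: the embedding step is left out, and the add_texts function below is a local toy with the same name as the method, written only for this example.

```python
import uuid


def add_texts(collection, texts, metadatas=None, ids=None):
    """Upsert texts into collection (a dict keyed by ID) and return the IDs used."""
    texts = list(texts)
    if ids is None:
        ids = [str(uuid.uuid4()) for _ in texts]
    if metadatas is None:
        metadatas = []
    # Texts without metadata get an empty dictionary.
    padded = [dict(m) if m else {} for m in metadatas]
    padded += [{} for _ in range(len(texts) - len(padded))]
    for doc_id, text, metadata in zip(ids, texts, padded):
        # Assigning to an existing key replaces the old entry (upsert).
        collection[doc_id] = {"text": text, "metadata": metadata}
    return list(ids)


collection = {}
add_texts(collection, ["old text"], metadatas=[{"source": "v1"}], ids=["1"])
add_texts(collection, ["new text", "another"], metadatas=[{"source": "v2"}], ids=["1", "2"])
print(collection)
```

After the second call, ID "1" holds the new text and the collection still contains only two entries.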
Delete documents from the vector store
The delete method deletes the documents with the specified IDs from the vector store.
Parameters
ids (Optional[List[str]]): List of IDs of the documents to delete. The default is None.
Note
This method internally calls the collection's delete method.
If ids is None, it does nothing.
Return value
None
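A toy version of the delete semantics described above, again using a dict keyed by ID. It follows the stated rule that ids=None is a no-op. Silently ignoring IDs that are not present is an assumption of this sketch.

```python
def delete(collection, ids=None):
    """Remove the given IDs from collection; do nothing when ids is None."""
    if ids is None:
        return None
    for doc_id in ids:
        # Assumption of this sketch: unknown IDs are skipped without error.
        collection.pop(doc_id, None)
    return None


collection = {"1": "first", "2": "second", "3": "third"}

delete(collection)  # ids is None, so nothing is removed
size_after_noop = len(collection)

delete(collection, ids=["1", "3"])
print(collection)
```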
reset_collection
The reset_collection method resets (reinitializes) the vector store's collection.
Convert the vector store to a Retriever
The as_retriever method creates a VectorStoreRetriever based on the vector store.
Parameters
**kwargs: Keyword arguments to pass to the search function.
search_type (Optional[str]): Search type ("similarity", "mmr", "similarity_score_threshold").
search_kwargs (Optional[Dict]): Additional arguments to pass to the search function.
  k: Number of documents to return (default: 4).
  score_threshold: Minimum similarity threshold.
  fetch_k: Number of documents to pass to the MMR algorithm (default: 20).
  lambda_mult: Controls the diversity of MMR results (0 to 1, default: 0.5). A value of 1 gives minimum diversity and 0 gives maximum diversity.
  filter: Filter on document metadata.
Return value
VectorStoreRetriever: A retriever instance based on the vector store.
Create the DB.
Perform a similarity search. With the default settings, four documents are returned.
Retrieve more documents with higher diversity
k: Number of documents to return (default: 4).
fetch_k: Number of documents to pass to the MMR algorithm (default: 20).
lambda_mult: Controls the diversity of MMR results (0 to 1, default: 0.5).
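To show what k, fetch_k, and lambda_mult do, here is a minimal sketch of maximal marginal relevance (MMR) in plain Python. It is a textbook-style formulation, not the code LangChain or Chroma runs. The vectors are hand-written stand-ins for embeddings, and mmr_select is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vector, vectors, k=4, fetch_k=20, lambda_mult=0.5):
    """Return the indices of up to k vectors chosen by maximal marginal relevance."""
    relevance = [cosine_similarity(query_vector, v) for v in vectors]
    # Keep only the fetch_k candidates most similar to the query.
    candidates = sorted(
        range(len(vectors)), key=lambda i: relevance[i], reverse=True
    )[:fetch_k]
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            # Penalize candidates that resemble something already selected.
            redundancy = max(
                (cosine_similarity(vectors[i], vectors[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


query = [1.0, 0.0]
vectors = [[1.0, 0.0], [0.99, 0.14], [0.6, 0.8]]

relevance_only = mmr_select(query, vectors, k=2, lambda_mult=1.0)
more_diverse = mmr_select(query, vectors, k=2, lambda_mult=0.3)
print(relevance_only)
print(more_diverse)
```

With lambda_mult=1.0 the two nearest vectors are picked even though they are almost identical. With lambda_mult=0.3 the near-duplicate is skipped in favor of the more distinct third vector.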
Fetch more documents for the MMR algorithm, but return only the top two
Retrieve only documents whose similarity is above a certain threshold
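Threshold-based retrieval can be pictured as a filter over scored results. The sketch below assumes scores are relevance scores where higher means more similar. The scores and document names are invented for illustration, and filter_by_score is a hypothetical helper, not a library function.

```python
def filter_by_score(scored_docs, score_threshold, k=4):
    """Keep at most k (content, score) pairs whose score meets the threshold."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [pair for pair in ranked if pair[1] >= score_threshold][:k]


scored_docs = [("doc C", 0.40), ("doc A", 0.92), ("doc B", 0.81)]
kept = filter_by_score(scored_docs, score_threshold=0.8)
print(kept)
```

Unlike a plain top-k search, this can return fewer than k documents, or none at all, when nothing is similar enough.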
Retrieve only the single most similar document
Apply a specific metadata filter
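The retriever variants in this section differ only in search_type and search_kwargs. The sketch below collects one plausible configuration per variant as plain dictionaries. The keys come from the parameter list above, while the concrete values (k=6, fetch_k=10, 0.8, the file name, and so on) are illustrative assumptions, not the values used in the original notebook. validate_config is a hypothetical helper that checks a configuration before use.

```python
ALLOWED_SEARCH_TYPES = ("similarity", "mmr", "similarity_score_threshold")

# Illustrative values only; adjust them to your own data.
retriever_configs = {
    "default": {"search_type": "similarity", "search_kwargs": {}},
    "diverse": {
        "search_type": "mmr",
        "search_kwargs": {"k": 6, "fetch_k": 10, "lambda_mult": 0.25},
    },
    "mmr_top_two": {"search_type": "mmr", "search_kwargs": {"k": 2, "fetch_k": 10}},
    "thresholded": {
        "search_type": "similarity_score_threshold",
        "search_kwargs": {"score_threshold": 0.8},
    },
    "single_best": {"search_type": "similarity", "search_kwargs": {"k": 1}},
    "filtered": {
        "search_type": "similarity",
        "search_kwargs": {"filter": {"source": "example.txt"}},
    },
}


def validate_config(config):
    """Reject unknown search types and threshold searches without a threshold."""
    search_type = config["search_type"]
    if search_type not in ALLOWED_SEARCH_TYPES:
        raise ValueError(f"Unsupported search_type: {search_type!r}")
    if (
        search_type == "similarity_score_threshold"
        and "score_threshold" not in config["search_kwargs"]
    ):
        raise ValueError("similarity_score_threshold requires score_threshold")
    return config


valid_names = [name for name, cfg in retriever_configs.items() if validate_config(cfg)]

try:
    validate_config({"search_type": "keyword", "search_kwargs": {}})
    invalid_rejected = False
except ValueError:
    invalid_rejected = True

print(valid_names)
print(invalid_rejected)
```

Assuming db is an existing vector store, a configuration would then be passed along the lines of db.as_retriever(**retriever_configs["diverse"]).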
Multimodal Search
Chroma supports multi-modal collections, that is, collections that can contain and query multiple forms of data.
Dataset
Use a small subset of the COCO object detection dataset hosted on Hugging Face.
Only some of the images in the dataset are downloaded locally and used to create the multi-modal collection.
Multimodal Embeddings
Use multimodal embeddings to create embeddings for images and text.
In this tutorial, we use OpenClipEmbeddingFunction to embed the image.
Model benchmark
Model                              Training data  Resolution  # of samples seen  ImageNet zero-shot acc.
ConvNext-Base                      LAION-2B       256px       13B                71.5%
ConvNext-Large                     LAION-2B       320px       29B                76.9%
ConvNext-XXLarge                   LAION-2B       256px       34B                79.5%
ViT-B/32                           DataComp-1B    256px       34B                72.8%
ViT-B/16                           DataComp-1B    224px       13B                73.5%
ViT-L/14                           LAION-2B       224px       32B                75.3%
ViT-H/14                           LAION-2B       224px       32B                78.0%
ViT-L/14                           DataComp-1B    224px       13B                79.2%
ViT-G/14                           LAION-2B       224px       34B                80.1%
ViT-L/14 (Original CLIP)           WIT            224px       13B                75.5%
ViT-SO400M/14 (SigLIP)             WebLI          224px       45B                82.0%
ViT-SO400M-14-SigLIP-384 (SigLIP)  WebLI          384px       45B                83.1%
ViT-H/14-quickgelu (DFN)           DFN-5B         224px       39B                83.4%
ViT-H-14-378-quickgelu (DFN)       DFN-5B         378px       44B                84.4%
In the example below, set model_name and checkpoint and use them.
model_name: Name of the OpenCLIP model.
checkpoint: Name of the training data of the OpenCLIP model.
   model_name       checkpoint
0  RN50             openai
1  RN50             yfcc15m
2  RN50             cc12m
3  RN50-quickgelu   openai
4  RN50-quickgelu   yfcc15m
5  RN50-quickgelu   cc12m
6  RN101            openai
7  RN101            yfcc15m
8  RN101-quickgelu  openai
9  RN101-quickgelu  yfcc15m
Save the image paths as a list.
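One way to collect image paths into a list is sketched below with the standard library. The temporary folder, the file names, and the set of accepted extensions are assumptions made so the example is self-contained. In practice you would point list_image_paths (a hypothetical helper) at the folder where the dataset images were downloaded.

```python
import tempfile
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_image_paths(image_dir):
    """Return sorted paths (as strings) of image files directly inside image_dir."""
    return sorted(
        str(path)
        for path in Path(image_dir).iterdir()
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES
    )


# Build a throwaway folder with two "images" and one unrelated file.
with tempfile.TemporaryDirectory() as image_dir:
    for name in ("0.jpg", "1.PNG", "notes.txt"):
        (Path(image_dir) / name).touch()
    image_uris = list_image_paths(image_dir)
    image_names = [Path(uri).name for uri in image_uris]

print(image_names)
```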
Create descriptions for the images.
Below, we calculate the similarity between the image descriptions and the text.
Compute and visualize the similarities between the text and the image descriptions.
Vectorstore creation and image addition
Create the Vectorstore and add the images.
Below is a helper class that displays the image search results as images.