01. Chroma

This notebook covers how to get started with the Chroma vector store.

Chroma is an AI-native, open-source vector database focused on developer productivity and happiness. Chroma is licensed under Apache 2.0.

# Load API keys stored in a .env file as environment variables.
from dotenv import load_dotenv

# Load the API key information.
load_dotenv()

True

# Set up LangSmith tracing. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH10-VectorStores")

Load the sample dataset.

VectorStore creation

Creating a vector store (from_documents)

The from_documents class method creates a vector store from a list of documents.

Parameters

  • documents (List[Document]): List of documents to add to the vector store

  • embedding (Optional[Embeddings]): Embedding function. The default is None

  • ids (Optional[List[str]]): List of document IDs. The default is None

  • collection_name (str): Name of the collection to create.

  • persist_directory (Optional[str]): Directory in which to store the collection. The default is None

  • client_settings (Optional[chromadb.config.Settings]): Chroma client settings

  • client (Optional[chromadb.Client]): Chroma client instance

  • collection_metadata (Optional[Dict]): Collection configuration metadata. The default is None

Notes

  • If persist_directory is specified, the collection is stored in that directory. If it is not specified, the data is held temporarily in memory.

  • Internally, this method creates the vector store by calling the from_texts method.

  • The page_content of each Document is used as the text, and its metadata is used as the metadata.

Return value

  • Chroma: The created Chroma vector store instance

When creating the store, pass a list of Document objects as the documents parameter. Specify the embedding model to use with embedding; you can also specify collection_name, which plays the role of a namespace.
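
The hand-off from from_documents to from_texts can be pictured with a minimal, standard-library-only sketch. The Document class and the split_documents helper below are hypothetical stand-ins written for illustration, not the library's own code; they only show how page_content and metadata might be separated before the texts are embedded.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Document:
    """Hypothetical stand-in for LangChain's Document (illustration only)."""

    page_content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def split_documents(documents: list[Document]) -> tuple[list[str], list[dict[str, Any]]]:
    """Separate documents into the parallel texts and metadatas lists."""
    texts = [doc.page_content for doc in documents]
    metadatas = [doc.metadata for doc in documents]
    return texts, metadatas


docs = [
    Document("Chroma stores embeddings.", {"source": "a.txt"}),
    Document("Collections act as namespaces.", {"source": "b.txt"}),
]
texts, metadatas = split_documents(docs)
print(texts)
print(metadatas)
```

The two lists stay aligned by position, which is why a text-based constructor can accept them as separate parameters.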

When persist_directory is specified, the collection is saved to disk as files.

Load the data stored in DB_PATH.

Check the data stored in the loaded VectorStore.

If you specify a different collection_name, no results are returned because that collection has no stored data.

Creating a vector store (from_texts)

The from_texts class method creates a vector store from a list of texts.

Parameters

  • texts (List[str]): List of texts to add to the collection

  • embedding (Optional[Embeddings]): Embedding function. The default is None

  • metadatas (Optional[List[dict]]): List of metadata. The default is None

  • ids (Optional[List[str]]): List of document IDs. The default is None

  • collection_name (str): Name of the collection to create. The default is _LANGCHAIN_DEFAULT_COLLECTION_NAME ("langchain")

  • persist_directory (Optional[str]): Directory in which to store the collection. The default is None

  • client_settings (Optional[chromadb.config.Settings]): Chroma client settings

  • client (Optional[chromadb.Client]): Chroma client instance

  • collection_metadata (Optional[Dict]): Collection configuration metadata. The default is None

Notes

  • If persist_directory is specified, the collection is stored in that directory. If it is not specified, the data is held temporarily in memory.

  • If ids is not provided, IDs are generated automatically using UUIDs.

Return value

  • The created vector store instance
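
As a rough illustration of the automatic ID behavior noted above, the sketch below generates one ID per text when none are supplied. The resolve_ids helper is a hypothetical name, and the use of random (version 4) UUIDs is an assumption made for this example.

```python
import uuid


def resolve_ids(texts, ids=None):
    """Return the caller's ids, or one random UUID string per text."""
    if ids is None:
        # Assumption: version 4 (random) UUIDs, one per text.
        return [str(uuid.uuid4()) for _ in texts]
    return list(ids)


texts = ["first text", "second text"]
auto_ids = resolve_ids(texts)
manual_ids = resolve_ids(texts, ids=["doc-1", "doc-2"])
print(auto_ids)
print(manual_ids)
```

Supplying your own ids is useful when you later need to update or delete specific entries, since generated UUIDs differ on every run.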

Similarity search

The similarity_search method performs a similarity search in the Chroma database and returns the documents most similar to the given query.

Parameters

  • query (str): Query text to search for

  • k (int, optional): Number of results to return. The default is 4.

  • filter (Dict[str, str], optional): Metadata filter. The default is None.

Notes

  • Adjust the k value to get the desired number of results.

  • Use the filter parameter to search only documents that meet certain metadata conditions.

  • This method returns only the documents, without score information. If you also need the scores, use the similarity_search_with_score method directly.

Return value

  • List[Document]: List of documents most similar to the query text
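
To make the roles of k and filter concrete, here is a small standard-library sketch of a filtered top-k search. It is not Chroma's implementation: the in-memory store, the hand-written vectors, and the choice of cosine similarity are all assumptions for illustration (the real metric depends on how the collection is configured).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_search(query_vec, store, k=4, filter=None):
    """Return the texts of the k entries most similar to query_vec.

    store is a list of (text, vector, metadata) tuples; filter is an
    optional dict of metadata key/value pairs that must all match.
    """
    candidates = [
        (text, vec)
        for text, vec, meta in store
        if filter is None or all(meta.get(key) == val for key, val in filter.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


store = [
    ("apples are red", [1.0, 0.0], {"source": "fruit.txt"}),
    ("bananas are yellow", [0.8, 0.6], {"source": "fruit.txt"}),
    ("cars have wheels", [0.0, 1.0], {"source": "cars.txt"}),
]
top_two = similarity_search([1.0, 0.1], store, k=2)
cars_only = similarity_search([1.0, 0.1], store, k=2, filter={"source": "cars.txt"})
print(top_two)
print(cars_only)
```

Note that the filter is applied before ranking, so a filtered search can return fewer than k results when few entries match.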

You can specify the number of search results with the k value.

You can use metadata information in filter to filter the search results.

Next, use a different source in filter and check the search results.

Adding documents to the vector store

The add_documents method adds documents to the vector store or updates existing ones.

Parameters

  • documents (List[Document]): List of documents to add to the vector store

  • **kwargs: Additional keyword arguments

    • ids: List of document IDs (when provided, takes priority over the IDs set on the documents)

Notes

  • The add_texts method must be implemented.

  • The page_content of each Document is used as the text, and its metadata is used as the metadata.

  • If a document has an ID and no IDs are provided in kwargs, the document's ID is used.

  • A ValueError is raised if the number of IDs in kwargs does not match the number of documents.

Return value

  • List[str]: List of IDs of the added texts

Exceptions

  • NotImplementedError: Raised when the add_texts method is not implemented
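
The ID precedence rules above can be summarized in a short sketch. The Document class and resolve_document_ids helper are hypothetical, and the fallback to a generated UUID for documents without any ID is an assumption of this example rather than something stated in the notes.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Document:
    """Hypothetical stand-in for LangChain's Document (illustration only)."""

    page_content: str
    id: Optional[str] = None


def resolve_document_ids(documents, ids=None):
    """Pick an ID per document: explicit ids win, then the document's own id."""
    if ids is not None:
        if len(ids) != len(documents):
            raise ValueError("The number of ids must match the number of documents.")
        return list(ids)
    # Assumption: documents with no id fall back to a generated UUID.
    return [doc.id if doc.id is not None else str(uuid.uuid4()) for doc in documents]


docs = [Document("first", id="a"), Document("second", id="b")]
own_ids = resolve_document_ids(docs)
override_ids = resolve_document_ids(docs, ids=["x", "y"])
try:
    resolve_document_ids(docs, ids=["only-one"])
    mismatch_raised = False
except ValueError:
    mismatch_raised = True
print(own_ids, override_ids, mismatch_raised)
```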

The add_texts method embeds the texts and adds them to the vector store.

Parameters

  • texts (Iterable[str]): List of texts to add to the vector store

  • metadatas (Optional[List[dict]]): List of metadata. The default is None

  • ids (Optional[List[str]]): List of document IDs. The default is None

Notes

  • If ids is not provided, IDs are generated automatically using UUIDs.

  • If an embedding function is set, the texts are embedded.

  • If metadata is provided:

    • Texts with and without metadata are separated and processed.

    • For texts without metadata, the metadata is filled with an empty dictionary.

  • An upsert is performed on the collection to add the texts, embeddings, and metadata.

Return value

  • List[str]: List of IDs of the added texts

Exceptions

  • ValueError: Raised when complex metadata causes an error; the message includes guidance on how to filter the metadata.

When you add to an existing ID, an upsert is performed and the existing document is replaced.
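
Upsert semantics are easy to see with a plain dictionary standing in for the collection. This is only a conceptual sketch: the upsert function and the dictionary layout are hypothetical and skip embedding entirely.

```python
def upsert(collection, ids, texts, metadatas=None):
    """Insert new entries or overwrite existing ones, keyed by ID."""
    if metadatas is None:
        metadatas = [None for _ in texts]
    for doc_id, text, metadata in zip(ids, texts, metadatas):
        # Texts without metadata get an empty dictionary.
        collection[doc_id] = {"text": text, "metadata": metadata or {}}
    return list(ids)


collection = {}
upsert(collection, ["1", "2"], ["old text", "other text"], [{"source": "a"}, None])
upsert(collection, ["1"], ["new text"])  # same ID: replaces the earlier entry
print(collection["1"]["text"])
print(len(collection))
```

Because the second call reuses ID "1", the collection still holds two entries and the first one now carries the new text.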

Deleting documents from the vector store

The delete method deletes the documents with the specified IDs from the vector store.

Parameters

  • ids (Optional[List[str]]): List of IDs of the documents to delete. The default is None

Notes

  • Internally, this method calls the collection's delete method.

  • If ids is None, nothing is done.

Return value

  • None

reset_collection

The reset_collection method resets the vector store's collection to its initial, empty state.

Converting the vector store to a retriever

The as_retriever method creates a VectorStoreRetriever based on the vector store.

Parameters

  • **kwargs: Keyword arguments to pass to the search function

  • search_type (Optional[str]): Search type ("similarity", "mmr", "similarity_score_threshold")

  • search_kwargs (Optional[Dict]): Additional arguments to pass to the search function

    • k: Number of documents to return (default: 4)

    • score_threshold: Minimum similarity threshold

    • fetch_k: Number of documents to pass to the MMR algorithm (default: 20)

    • lambda_mult: Controls the diversity of MMR results (0 to 1, default: 0.5); 1 gives minimum diversity and 0 gives maximum diversity

    • filter: Filter on document metadata

Return value

  • VectorStoreRetriever: Retriever instance based on the vector store

Create the DB.

Performing a similarity search returns four documents, the default.

Retrieve more documents with higher diversity

  • k: Number of documents to return (default: 4)

  • fetch_k: Number of documents to pass to the MMR algorithm (default: 20)

  • lambda_mult: Controls the diversity of MMR results (0 to 1, default: 0.5); 1 gives minimum diversity and 0 gives maximum diversity
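
The interaction between k, fetch_k, and lambda_mult is easier to follow with a toy version of maximal marginal relevance (MMR). The sketch below is a standard-library illustration under stated assumptions, not the retriever's actual code: the candidates list plays the role of the fetch_k vectors already returned by a plain similarity search, and cosine similarity is assumed as the metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, candidates, k=4, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            # Reward similarity to the query, penalize similarity to earlier picks.
            relevance = cosine_similarity(query_vec, candidates[i])
            redundancy = max(
                (cosine_similarity(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


candidates = [[1.0, 0.0], [0.99, 0.14], [0.6, 0.8]]
query = [1.0, 0.3]
relevance_only = mmr(query, candidates, k=2, lambda_mult=1.0)
balanced = mmr(query, candidates, k=2, lambda_mult=0.5)
print(relevance_only)
print(balanced)
```

With lambda_mult=1.0 the two near-duplicate vectors are both chosen, while lambda_mult=0.5 swaps the second pick for the more distinct third vector. A larger fetch_k gives the algorithm a wider pool from which to find such distinct results.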

Fetch more documents for the MMR algorithm, but return only the top two

Retrieve only documents whose similarity is above a certain threshold

Retrieve only the single most similar document

Apply a specific metadata filter

Multimodal Search

Chroma supports multimodal collections, that is, collections that can contain and be queried by multiple modalities of data.

Dataset

We use a small subset of the coco object detection dataset hosted on Hugging Face.

Only some of the images in the dataset are downloaded locally and used to create a multimodal collection.

Multimodal Embeddings

Use multimodal embeddings to create embeddings for images and text.

In this tutorial, we use OpenClipEmbeddingFunction to embed the images.

Model benchmark

| Model | Training data | Resolution | # of samples seen | ImageNet zero-shot acc. |
|---|---|---|---|---|
| ConvNext-Base | LAION-2B | 256px | 13B | 71.5% |
| ConvNext-Large | LAION-2B | 320px | 29B | 76.9% |
| ConvNext-XXLarge | LAION-2B | 256px | 34B | 79.5% |
| ViT-B/32 | DataComp-1B | 256px | 34B | 72.8% |
| ViT-B/16 | DataComp-1B | 224px | 13B | 73.5% |
| ViT-L/14 | LAION-2B | 224px | 32B | 75.3% |
| ViT-H/14 | LAION-2B | 224px | 32B | 78.0% |
| ViT-L/14 | DataComp-1B | 224px | 13B | 79.2% |
| ViT-G/14 | LAION-2B | 224px | 34B | 80.1% |
| ViT-L/14 (Original CLIP) | WIT | 224px | 13B | 75.5% |
| ViT-SO400M/14 (SigLIP) | WebLI | 224px | 45B | 82.0% |
| ViT-SO400M-14-SigLIP-384 (SigLIP) | WebLI | 384px | 45B | 83.1% |
| ViT-H/14-quickgelu (DFN) | DFN-5B | 224px | 39B | 83.4% |
| ViT-H-14-378-quickgelu (DFN) | DFN-5B | 378px | 44B | 84.4% |

In the example below, model_name and checkpoint are set and used.

  • model_name: Name of the OpenCLIP model

  • checkpoint: Name of the training data for the OpenCLIP model

|   | model_name | checkpoint |
|---|---|---|
| 0 | RN50 | openai |
| 1 | RN50 | yfcc15m |
| 2 | RN50 | cc12m |
| 3 | RN50-quickgelu | openai |
| 4 | RN50-quickgelu | yfcc15m |
| 5 | RN50-quickgelu | cc12m |
| 6 | RN101 | openai |
| 7 | RN101 | yfcc15m |
| 8 | RN101-quickgelu | openai |
| 9 | RN101-quickgelu | yfcc15m |

Save the image paths in a list.

Create a description for each image.

Below, we calculate the similarity between the generated image descriptions and the text.

Compute and visualize the similarity between the text and the image descriptions.

Creating the vector store and adding images

Create a vector store and add the images.

Below is a helper class for displaying the image search results as images.
