03. Pinecone
Pinecone is a high-performance vector database that provides efficient vector storage and retrieval for AI and machine-learning applications.
Let's compare vector databases such as Pinecone, Chroma, and Faiss.
Advantages of Pinecone
Scalability: Provides excellent scalability for large datasets.
Ease of management: Fully managed service, with less infrastructure management burden.
Real-time updates: Real-time insertion, update, and deletion of data is possible.
High availability: As a cloud-based service, it provides high availability and durability.
API friendly: Easily integrated through RESTful/Python API.
Disadvantages of Pinecone
Cost: It can be relatively expensive compared to Chroma or Faiss.
Customization limits: Because it is a fully managed service, there may be restrictions on detailed customization.
Data location: Since you need to store data in the cloud, there may be data sovereignty issues.
Compared to Chroma or Faiss:
Chroma and FAISS are open source and can run locally, offering low initial cost, easy control over your data, and a high degree of freedom for customization. However, they may be more limited than Pinecone when it comes to large-scale scalability.
The choice should take into account the size of the project, the required performance, the budget, and so on. Pinecone tends to be advantageous in large production environments, while Chroma or Faiss may be more suitable for small projects or experimental environments.
Update guide
The features below are custom implementations, so make sure to update the custom library before proceeding.
Stopword dictionary for Korean processing
Load the Korean stopword list in advance (it is used later by the tokenizer).
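A minimal sketch of preparing the Korean stopword list, assuming the stopwords are kept in a plain-text file with one word per line (the custom library may ship its own loader):

```python
# Load Korean stopwords (one word per line); the file path is an assumption
with open("./korean_stopwords.txt", encoding="utf-8") as f:
    stopwords = [line.strip() for line in f if line.strip()]

print(stopwords[:10])
```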
Data preprocessing
Below is the preprocessing process for general documents. It reads all .pdf files under ROOT_DIR and stores them in document_list.
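A sketch of the loading and splitting step described above, using standard LangChain components (the loader, splitter, and chunk sizes are assumptions):

```python
import glob

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

ROOT_DIR = "./data"  # assumption: the PDF files live under this directory

# Read every .pdf file under ROOT_DIR into document_list
document_list = []
for path in sorted(glob.glob(f"{ROOT_DIR}/**/*.pdf", recursive=True)):
    document_list.extend(PyMuPDFLoader(path).load())

# Split the loaded documents into chunks for indexing
text_splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50)
split_docs = text_splitter.split_documents(document_list)
print(f"number of split documents: {len(split_docs)}")
```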
Preprocess the documents so they can be stored in the Pinecone DB. You can specify metadata_keys during this process.
If you want to tag additional metadata, add it to the documents during the preprocessing step before proceeding.
split_docs: A List[Document] containing the results of document splitting.
metadata_keys: A list containing the metadata keys to be added to the documents.
min_length: Specifies the minimum length of a document. Documents shorter than this length are excluded.
use_basename: Specifies whether to use the file name based on the source path. The default is False.
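A sketch of this step. The helper below is an illustrative re-implementation of the behaviour described by the parameters above (the tutorial's custom library provides its own version); it returns the document texts plus a dict of metadata lists:

```python
import os
from typing import List

from langchain_core.documents import Document

def preprocess_documents(
    split_docs: List[Document],
    metadata_keys: List[str],
    min_length: int,
    use_basename: bool = False,
):
    """Illustrative preprocessing: filter short documents and collect metadata."""
    contents, metadatas = [], {key: [] for key in metadata_keys}
    for doc in split_docs:
        text = doc.page_content
        if len(text) < min_length:  # exclude documents shorter than min_length
            continue
        contents.append(text)
        for key in metadata_keys:
            value = doc.metadata.get(key, "")
            if key == "source" and use_basename:
                value = os.path.basename(str(value))  # keep only the file name
            metadatas[key].append(value)
    return contents, metadatas

contents, metadatas = preprocess_documents(
    split_docs, metadata_keys=["source", "page"], min_length=5, use_basename=True
)
```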
Preprocessing of documents
Extracts required metadata information.
Keeps only documents longer than the minimum length.
Specifies whether to use the document's basename. The default is False.
Here, basename means the last part of the file path.
For example, /Users/teddy/data/document.pdf would be document.pdf.
Issue API Key
Profile - Account - Projects - Starter - API keys - Issue
Add the following to your .env file:
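The Pinecone client reads the key from the PINECONE_API_KEY environment variable, so a single entry like the one below is enough (the value is a placeholder):

```
PINECONE_API_KEY="YOUR_PINECONE_API_KEY"
```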
Creating a new VectorStore index
Create a new index for Pinecone.

Create a Pinecone index.
Note - metric specifies how similarity is measured. If you are considering hybrid search, set metric to dotproduct.
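A sketch using the Pinecone SDK directly; the index name, dimension, cloud, and region are assumptions, and the dimension must match your embedding model (e.g. 3072 for text-embedding-3-large):

```python
import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

# metric="dotproduct" so the index can also serve hybrid (dense + sparse) search
pc.create_index(
    name="teddynote-db-index",  # assumption: any unique index name works
    dimension=3072,             # must match the embedding model's output size
    metric="dotproduct",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("teddynote-db-index")
```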
Below is an example using a paid pod-based index. Paid pods provide more extended features than the free serverless option.
reference: https://docs.pinecone.io/guides/indexes/choose-a-pod-type-and-size
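A sketch of creating a pod-based index instead; the environment and pod type are assumptions, see the linked guide for the available options:

```python
from pinecone import PodSpec

pc.create_index(
    name="teddynote-db-index2",  # assumption: a separate index for the pod example
    dimension=3072,
    metric="dotproduct",
    spec=PodSpec(environment="us-west1-gcp", pod_type="p1.x1", pods=1),
)
```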
Creating a Sparse Encoder
Create a Sparse Encoder.
It applies the Kiwi tokenizer together with Korean stopword processing, and the Sparse Encoder is then fitted on the document contents. The fitted encoder is used to create the sparse vectors when documents are saved to the VectorStore.
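A sketch of the building blocks: a Kiwi-based tokenizer that drops the stopwords loaded earlier, and a BM25 sparse encoder from pinecone-text. How the tutorial's custom helper wires the tokenizer into the encoder is an implementation detail of that library and is only indicated here:

```python
from kiwipiepy import Kiwi
from pinecone_text.sparse import BM25Encoder

kiwi = Kiwi()

def kiwi_tokenize(text: str) -> list[str]:
    """Tokenize Korean text with Kiwi and drop stopwords."""
    return [token.form for token in kiwi.tokenize(text) if token.form not in stopwords]

# BM25-based sparse encoder; the custom helper is assumed to plug kiwi_tokenize
# and the stopword list into the encoder's tokenization step.
sparse_encoder = BM25Encoder()
```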
Fit the Sparse Encoder on the corpus.
save_path: The path where the Sparse Encoder is saved. The encoder, saved in pickle format, is loaded again later to embed queries, so specify the path you want to save it to.
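A sketch of fitting the encoder and saving it in pickle format (the save path is an assumption):

```python
import pickle

# Fit BM25 statistics on the preprocessed document contents
sparse_encoder.fit(contents)

# Persist the fitted encoder so it can be reloaded when embedding queries later
save_path = "./sparse_encoder.pkl"  # assumption: any writable path works
with open(save_path, "wb") as f:
    pickle.dump(sparse_encoder, f)
```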
[Optional] Use the code below when you need to reload the Sparse Encoder that you fitted and saved earlier.
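A matching sketch for reloading the fitted encoder from the pickle file:

```python
import pickle

with open(save_path, "rb") as f:
    sparse_encoder = pickle.load(f)
```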
Pinecone: Upsert to DB Index
context: The contents of the document.
page: The page number of the document.
source: The source of the document.
values: The dense embedding of the document obtained through the Embedder.
sparse values: The sparse embedding of the document obtained through the Sparse Encoder.
Upsert the documents in batches without distributed processing. If the number of documents is not large, use the method below.
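A sketch of a simple batched upsert against the index created above; the embedding model, namespace, and batch size are assumptions, and the tutorial's custom helper wraps comparable logic:

```python
from uuid import uuid4

from langchain_openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings(model="text-embedding-3-large")  # assumption: any embedding model works
namespace = "teddynote-namespace-01"                         # assumption
batch_size = 32

for start in range(0, len(contents), batch_size):
    batch_texts = contents[start : start + batch_size]
    dense = embedder.embed_documents(batch_texts)          # dense vectors (values)
    sparse = sparse_encoder.encode_documents(batch_texts)  # sparse vectors (sparse values)
    vectors = [
        {
            "id": str(uuid4()),
            "values": dense[i],
            "sparse_values": sparse[i],
            "metadata": {
                "context": batch_texts[i],
                "page": metadatas["page"][start + i],
                "source": metadatas["source"][start + i],
            },
        }
        for i in range(len(batch_texts))
    ]
    index.upsert(vectors=vectors, namespace=namespace)
```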
Below is a method for quickly upserting a large number of documents using parallel processing. Use it when uploading large volumes.
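A sketch of parallelizing the same batched upsert with a thread pool, reusing the objects from the previous sketch (the worker count is an assumption; upserts are I/O bound, so threads help):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def upsert_batch(start: int) -> int:
    """Embed and upsert one batch; same per-batch logic as above."""
    batch_texts = contents[start : start + batch_size]
    dense = embedder.embed_documents(batch_texts)
    sparse = sparse_encoder.encode_documents(batch_texts)
    vectors = [
        {
            "id": str(uuid4()),
            "values": dense[i],
            "sparse_values": sparse[i],
            "metadata": {
                "context": batch_texts[i],
                "page": metadatas["page"][start + i],
                "source": metadatas["source"][start + i],
            },
        }
        for i in range(len(batch_texts))
    ]
    index.upsert(vectors=vectors, namespace=namespace)
    return len(vectors)

with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(upsert_batch, s) for s in range(0, len(contents), batch_size)]
    total = sum(f.result() for f in as_completed(futures))
print(f"upserted {total} vectors")
```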
Index lookup/delete
The describe_index_stats method provides statistical information about the contents of an index. This method provides information such as the number of vectors per namespace and the number of dimensions.
Parameters * filter (Optional[Dict[str, Union[str, float, int, bool, List, dict]]]): A filter that returns only statistics for vectors that meet a certain condition. Defaults to None * **kwargs: Additional keyword arguments.
Return value * DescribeIndexStatsResponse: An object containing statistics information about the index.
Usage examples * Basic usage: index.describe_index_stats() * Applying a filter: index.describe_index_stats(filter={'key': 'value'})
Note - metadata filtering is available to paid users only.
Delete namespace
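A sketch of clearing a namespace (the namespace name is the one used during upsert):

```python
# Delete every vector in the namespace
index.delete(delete_all=True, namespace=namespace)
```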
Below is a feature exclusive to paid users: metadata filtering is available to paid users only.
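A sketch of deleting by metadata filter, which is only available on paid (pod-based) plans; the filter value is a placeholder:

```python
# Delete only the vectors whose source metadata matches the filter (paid feature)
index.delete(
    filter={"source": {"$eq": "document.pdf"}},  # placeholder file name
    namespace=namespace,
)
```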
Create a Retriever
Setting PineconeKiwiHybridRetriever initialization parameters
The init_pinecone_index function and the PineconeKiwiHybridRetriever class implement a hybrid retrieval system using Pinecone, which combines dense and sparse vectors to perform effective document retrieval.
Pinecone Index Initialization
The init_pinecone_index function initializes a Pinecone index and sets up the necessary components.
Parameters * index_name (str): Pinecone index name * namespace (str): Namespace to use * api_key (str): Pinecone API key * sparse_encoder_pkl_path (str): Path to the sparse encoder pickle file * stopwords (List[str]): List of stopwords * tokenizer (str): Tokenizer to use (default: "kiwi") * embeddings (Embeddings): Embedding model * top_k (int): Maximum number of documents to return (default: 10) * alpha (float): Weighting parameter to balance the dense and sparse vectors (default: 0.5)
Main functions 1. Initialize Pinecone index and output statistics information 2. Load sparse encoder (BM25) and set tokenizer 3. Specify namespace
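A sketch of calling init_pinecone_index with the parameters listed above; the import path, embedding model, and concrete values are assumptions, and the return value is assumed to bundle the initialized components:

```python
import os

from langchain_openai import OpenAIEmbeddings

# from <your_custom_library> import init_pinecone_index  # import path is an assumption

pinecone_params = init_pinecone_index(
    index_name="teddynote-db-index",                 # Pinecone index name
    namespace="teddynote-namespace-01",              # namespace to use
    api_key=os.environ["PINECONE_API_KEY"],          # Pinecone API key
    sparse_encoder_pkl_path="./sparse_encoder.pkl",  # fitted sparse encoder
    stopwords=stopwords,                             # Korean stopword list
    tokenizer="kiwi",
    embeddings=OpenAIEmbeddings(model="text-embedding-3-large"),
    top_k=5,
    alpha=0.5,
)
```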
PineconeKiwiHybridRetriever
The PineconeKiwiHybridRetriever class implements a hybrid retriever that combines Pinecone and Kiwi.
Key properties * embeddings: Embedding model for dense vector transformation * sparse_encoder: Encoder for sparse vector transformation * index: Pinecone index object * top_k: Maximum number of documents to return * alpha: Weighting parameter for dense and sparse vectors * namespace: Namespace within the Pinecone index
Features * HybridSearch Retriever that combines dense and sparse vectors * Optimization of search strategy through weight adjustment * Various dynamic metadata filtering can be applied (using search_kwargs: filter, k, rerank, rerank_model, top_n, etc.)
Usage example 1. Initialize the required components with the init_pinecone_index function 2. Create a PineconeKiwiHybridRetriever instance with the initialized components 3. Perform a hybrid search using the created retriever
Create a PineconeKiwiHybridRetriever.
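A sketch of creating the retriever from the initialized components, assuming init_pinecone_index returns them as keyword arguments:

```python
# from <your_custom_library> import PineconeKiwiHybridRetriever  # import path is an assumption

pinecone_retriever = PineconeKiwiHybridRetriever(**pinecone_params)
```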
General Search
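A sketch of a plain hybrid search; the query text is a placeholder:

```python
docs = pinecone_retriever.invoke("your query here")
for doc in docs:
    print(doc.page_content)
    print(doc.metadata)
    print("=" * 50)
```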
Use dynamic search_kwargs - k: specifies the maximum number of documents to return
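A sketch, assuming the retriever accepts a search_kwargs argument at call time as described above:

```python
# Return at most one document for this call only
docs = pinecone_retriever.invoke("your query here", search_kwargs={"k": 1})
```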
Use dynamic search_kwargs - alpha: A parameter to adjust the weights of dense and sparse vectors. Specify a value between 0 and 1. 0.5 is the default, and the closer it is to 1, the higher the weight of dense vectors.
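A sketch of the two extremes of the alpha weighting (query text is a placeholder):

```python
# alpha=1: rely only on the dense (semantic) vectors
docs = pinecone_retriever.invoke("your query here", search_kwargs={"alpha": 1, "k": 1})

# alpha=0: rely only on the sparse (keyword) vectors
docs = pinecone_retriever.invoke("your query here", search_kwargs={"alpha": 0, "k": 1})
```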
Metadata Filtering

Using dynamic search_kwargs - filter: Apply metadata filtering
(Example) Search only documents with pages less than 5.
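A sketch using Pinecone's metadata filter syntax ($lt for "less than"):

```python
# Search only documents whose page metadata is less than 5
docs = pinecone_retriever.invoke(
    "your query here",
    search_kwargs={"filter": {"page": {"$lt": 5}}, "k": 5},
)
```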
Using dynamic search_kwargs - filter: apply metadata filtering
(Example) Search only within the document whose source is the SPRi AI Brief August-issue industry-trends PDF.
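A sketch restricting the search to a single source document; the file name is a placeholder for the SPRi AI Brief PDF mentioned above:

```python
# Search only within one source document ($eq for exact match)
docs = pinecone_retriever.invoke(
    "your query here",
    search_kwargs={"filter": {"source": {"$eq": "SPRi_AI_Brief_August.pdf"}}, "k": 3},
)
```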
Applying reranking
You can get reranked retrieval results simply by passing the rerank options through search_kwargs.
(However, the reranker is a paid feature, so please check the pricing in advance.)
Reference Documents
Pinecone Rerank Document
Model and Rate System
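A sketch of enabling reranking through search_kwargs; the model name is one of the rerankers listed in Pinecone's rerank documentation, and availability depends on your plan:

```python
# Rerank the hybrid search results with a hosted reranker (paid feature)
docs = pinecone_retriever.invoke(
    "your query here",
    search_kwargs={"rerank": True, "rerank_model": "bge-reranker-v2-m3", "top_n": 3},
)
```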