04. RAPTOR: Long Context Summary
Install

```bash
pip install -qU langchain umap-learn scikit-learn langchain_community tiktoken langchain-openai langchainhub chromadb langchain-anthropic
```

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
The RAPTOR paper presents an interesting approach to indexing and retrieving documents.
TeddyNote's paper summary (note)
The leafs are the set of starting documents; the leafs are embedded and clustered.
Each cluster is then summarized into higher-level (more abstract) information across similar documents.
This process is performed recursively, forming a "tree" from the original documents (leafs) up to increasingly abstract summaries.
This can be applied at a variety of scales; leafs can be:
Text chunks within a single document (as shown in the paper)
Full documents (as shown below)
With longer-context LLMs, this can be done over entire documents.
Documents
Let's apply this to LangChain's LCEL documentation.
In this case, each doc is a unique web page of the LCEL docs.
Contexts range from less than 2,000 tokens to more than 10,000 tokens.
The following describes the process of extracting text data from web documents, counting the tokens in each text, and visualizing the counts as a histogram.
The num_tokens_from_string function uses the tiktoken library to count the tokens in a string for a given encoding name.
The RecursiveUrlLoader class recursively loads web documents from a specified URL; in the process, BeautifulSoup is used to extract the text from the HTML documents.
Documents are loaded from multiple URLs, and all text data is collected into one list.
For each document's text, the num_tokens_from_string function is called to count its tokens, and the counts are saved to a list.
The distribution of token counts is visualized as a matplotlib histogram, with the token count on the x-axis and the frequency of documents with that count on the y-axis. Histograms help you understand the distribution of the data, in particular the length distribution of the text.
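A minimal sketch of this step; the LCEL docs URL and the max_depth value are illustrative assumptions, not values taken from the original page:

```python
import tiktoken
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup as Soup
from langchain_community.document_loaders import RecursiveUrlLoader


def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Return the number of tokens in a string for the given encoding name."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))


# Recursively load the LCEL docs, extracting text from the HTML with BeautifulSoup.
# Additional loaders over other URLs can be concatenated into `docs` the same way.
url = "https://python.langchain.com/docs/expression_language/"
loader = RecursiveUrlLoader(
    url=url, max_depth=20, extractor=lambda x: Soup(x, "html.parser").text
)
docs = loader.load()
docs_texts = [d.page_content for d in docs]

# Count tokens per document and plot the length distribution
counts = [num_tokens_from_string(t, "cl100k_base") for t in docs_texts]
plt.hist(counts, bins=30)
plt.xlabel("Token count")
plt.ylabel("Number of documents")
plt.title("Token counts per document")
plt.show()
```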
The next step sorts and concatenates the document texts, then counts the tokens of the result.
The documents (docs) are sorted based on the "source" key of their metadata, and the sorted list is reversed.
The contents of the reversed documents are joined using a specific separator ("\n\n\n --- \n\n\n").
The number of tokens in the concatenated content is computed with the num_tokens_from_string function and printed, using the "cl100k_base" encoding.
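A sketch of the sorting and concatenation, reusing docs and num_tokens_from_string from above:

```python
# Sort by the "source" metadata key, reverse, and join with the separator
d_sorted = sorted(docs, key=lambda doc: doc.metadata["source"])
d_reversed = list(reversed(d_sorted))
concatenated_content = "\n\n\n --- \n\n\n".join(
    [doc.page_content for doc in d_reversed]
)
print(
    "Num tokens in all context: %s"
    % num_tokens_from_string(concatenated_content, "cl100k_base")
)
```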
The text is split using RecursiveCharacterTextSplitter.
The chunk_size_tok variable sets the size of each text chunk to 2,000 tokens.
The text splitter is initialized with RecursiveCharacterTextSplitter's from_tiktoken_encoder method, with the chunk size (chunk_size) set accordingly and the chunk overlap (chunk_overlap) set to 0.
Calling the initialized splitter's split_text method divides the concatenated text stored in the concatenated_content variable; the result is stored in the texts_split variable.
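A sketch of the splitting step:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size_tok = 2000
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=chunk_size_tok, chunk_overlap=0
)
texts_split = text_splitter.split_text(concatenated_content)
```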
Model
Various models can be tested, including the new Claude 3 series.
Don't forget to set the relevant API keys:
OPENAI_API_KEY, and ANTHROPIC_API_KEY when using Anthropic.
ChatOpenAI or ChatAnthropic, together with OpenAIEmbeddings, are used to implement the chat model.
OpenAIEmbeddings instantiates OpenAI's embedding function. The chat model is initialized with ChatOpenAI or ChatAnthropic, with temperature set to 0.
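A sketch of the key setup; the values are placeholders:

```python
import os

# Set the relevant API keys (placeholders; set only the ones you use)
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."
```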
Use cache-backed embeddings.
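A sketch of cache-backed embeddings; the ./cache/ path and the namespace choice are assumptions:

```python
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_openai import OpenAIEmbeddings

# Wrap the OpenAI embeddings in a local file-backed cache
store = LocalFileStore("./cache/")
underlying_embd = OpenAIEmbeddings()
embd = CacheBackedEmbeddings.from_bytes_store(
    underlying_embd, store, namespace=underlying_embd.model
)
```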
Initialize the model.
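A sketch of the model initialization; the specific model names are illustrative assumptions:

```python
from langchain_openai import ChatOpenAI
# from langchain_anthropic import ChatAnthropic

model = ChatOpenAI(temperature=0, model="gpt-4-turbo-preview")
# Or, for the Claude 3 series:
# model = ChatAnthropic(temperature=0, model="claude-3-opus-20240229")
```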
Tree building
The clustering approach in tree building contains some interesting ideas.
GMM (Gaussian Mixture Model)
Models the distribution of data points across different clusters.
Evaluates the model's Bayesian Information Criterion (BIC) to determine the optimal number of clusters.
UMAP (Uniform Manifold Approximation and Projection)
Supports the clustering.
Reduces the dimensionality of high-dimensional data.
UMAP helps emphasize natural groupings based on the similarity of data points.
Local and global clustering
Used to analyze data at various scales.
Effectively captures both fine-grained and broader patterns within the data.
Thresholding
Applied to determine cluster membership in the context of the GMM.
Based on the probability distributions (a data point can be assigned to more than one cluster).
The code for the GMM and thresholding is from Sarthi et al., as referenced in the two sources below:
Full credit goes to the original authors.
The global_cluster_embeddings function uses UMAP to perform global dimensionality reduction of the embeddings.
The input embeddings (embeddings) are reduced to the specified dimensionality (dim) using UMAP.
n_neighbors specifies the number of neighbors to consider for each point; if not provided, it defaults to the square root of the number of embeddings. metric specifies the distance metric used by UMAP.
The embeddings, reduced to the specified dimensionality, are returned as a numpy array.
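A sketch following the reference implementation from Sarthi et al. cited above:

```python
from typing import Optional

import numpy as np
import umap

RANDOM_SEED = 224  # fixed seed for reproducibility


def global_cluster_embeddings(
    embeddings: np.ndarray,
    dim: int,
    n_neighbors: Optional[int] = None,
    metric: str = "cosine",
) -> np.ndarray:
    """Globally reduce embedding dimensionality with UMAP."""
    if n_neighbors is None:
        # Default: square root of the number of embeddings
        n_neighbors = int((len(embeddings) - 1) ** 0.5)
    return umap.UMAP(
        n_neighbors=n_neighbors, n_components=dim, metric=metric
    ).fit_transform(embeddings)
```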
The local_cluster_embeddings function performs local dimensionality reduction of the embedding data.
The input embeddings (embeddings) are reduced to the specified dimensionality (dim) using UMAP.
The number of neighbors to consider for each point (num_neighbors) and the distance metric (metric) are taken as parameters.
Finally, the dimensionality-reduced embeddings are returned as a numpy array.
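A sketch of the local counterpart:

```python
def local_cluster_embeddings(
    embeddings: np.ndarray,
    dim: int,
    num_neighbors: int = 10,
    metric: str = "cosine",
) -> np.ndarray:
    """Locally reduce embedding dimensionality with UMAP."""
    return umap.UMAP(
        n_neighbors=num_neighbors, n_components=dim, metric=metric
    ).fit_transform(embeddings)
```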
The get_optimal_clusters function determines the optimal number of clusters for the given embedding data. This is done by fitting Gaussian Mixture Models and computing the Bayesian Information Criterion (BIC).
The input embeddings (embeddings) are provided as a numpy array.
The maximum number of clusters to consider (max_clusters) defaults to 50.
A fixed random state (random_state) is used for reproducibility.
The function fits a model for each candidate number of clusters and computes the BIC value for each.
The number of clusters with the minimum BIC value is determined and returned as the optimal number of clusters.
This function is useful for automatically finding the number of clusters that best describes the data in a clustering problem.
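A sketch of the BIC search:

```python
from sklearn.mixture import GaussianMixture


def get_optimal_clusters(
    embeddings: np.ndarray,
    max_clusters: int = 50,
    random_state: int = RANDOM_SEED,
) -> int:
    """Pick the cluster count that minimizes the BIC of a fitted GMM."""
    max_clusters = min(max_clusters, len(embeddings))
    n_clusters = np.arange(1, max_clusters)
    bics = []
    for n in n_clusters:
        gm = GaussianMixture(n_components=n, random_state=random_state)
        gm.fit(embeddings)
        bics.append(gm.bic(embeddings))
    return n_clusters[np.argmin(bics)]
```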
The GMM_cluster function clusters the embeddings using a Gaussian Mixture Model (GMM). The process is based on a probability threshold.
The input embeddings (embeddings) are provided as a numpy array.
threshold is the probability threshold for assigning an embedding to a given cluster, and random_state is the seed for reproducibility of the results.
The get_optimal_clusters function is called to determine the optimal number of clusters.
A Gaussian mixture model with that number of clusters is initialized and fit on the input embeddings.
The probability of belonging to each cluster is computed for every embedding; if a probability exceeds the given threshold, the embedding is assigned to that cluster.
The function finally returns a tuple of the embeddings' cluster labels and the determined number of clusters.
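A sketch of the soft clustering step:

```python
def GMM_cluster(embeddings: np.ndarray, threshold: float, random_state: int = 0):
    """Soft-cluster embeddings with a GMM; a point may belong to several clusters."""
    n_clusters = get_optimal_clusters(embeddings)
    gm = GaussianMixture(n_components=n_clusters, random_state=random_state)
    gm.fit(embeddings)
    probs = gm.predict_proba(embeddings)
    # Keep every cluster whose membership probability clears the threshold
    labels = [np.where(prob > threshold)[0] for prob in probs]
    return labels, n_clusters
```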
The perform_clustering function reduces the dimensionality of the embeddings, performs global clustering with a Gaussian mixture model, and then performs local clustering within each global cluster, returning the combined clustering results.
The input embeddings (embeddings) first undergo dimensionality reduction, using UMAP to reduce them to the dimensionality specified by dim.
Global clustering is performed on the reduced embeddings with a Gaussian mixture model (GMM); cluster assignment is governed by the probability threshold (threshold).
Additional local clustering is then performed within each global cluster: based on the global results, dimensionality reduction and GMM clustering are applied again to only the embeddings belonging to that global cluster.
Finally, global and local cluster IDs are assigned to every embedding, and a list of cluster-ID arrays is returned, with one entry per embedding in the order of the embeddings.
This function combines clustering at the global and local levels, yielding more granular clusters and enabling more effective analysis of complex data structures.
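A condensed sketch of the global-then-local procedure:

```python
def perform_clustering(
    embeddings: np.ndarray, dim: int, threshold: float
) -> list[np.ndarray]:
    """Global UMAP + GMM clustering, then local clustering inside each global cluster."""
    if len(embeddings) <= dim + 1:
        # Too little data to cluster meaningfully: one cluster for everything
        return [np.array([0]) for _ in range(len(embeddings))]

    # Global step
    reduced_global = global_cluster_embeddings(embeddings, dim)
    global_clusters, n_global = GMM_cluster(reduced_global, threshold)

    all_local_clusters = [np.array([]) for _ in range(len(embeddings))]
    total_clusters = 0

    for i in range(n_global):
        # Select the embeddings assigned to global cluster i
        mask = np.array([i in gc for gc in global_clusters])
        cluster_embs = embeddings[mask]
        if len(cluster_embs) == 0:
            continue
        if len(cluster_embs) <= dim + 1:
            local_clusters, n_local = [np.array([0]) for _ in cluster_embs], 1
        else:
            reduced_local = local_cluster_embeddings(cluster_embs, dim)
            local_clusters, n_local = GMM_cluster(reduced_local, threshold)

        # Record local cluster IDs, offset so IDs stay globally unique
        idxs = np.where(mask)[0]
        for local_id in range(n_local):
            for pos, orig_idx in enumerate(idxs):
                if local_id in local_clusters[pos]:
                    all_local_clusters[orig_idx] = np.append(
                        all_local_clusters[orig_idx], local_id + total_clusters
                    )
        total_clusters += n_local

    return all_local_clusters
```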
The embed function generates embeddings for a list of text documents.
A list of text documents (texts) is taken as input.
The embd object's embed_documents method creates the embeddings of the text documents.
The resulting embeddings are converted to a numpy.ndarray and returned.
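A sketch, reusing the embd object initialized earlier:

```python
def embed(texts: list[str]) -> np.ndarray:
    """Embed a list of texts with embd and return them as a numpy array."""
    text_embeddings = embd.embed_documents(texts)
    return np.array(text_embeddings)
```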
The embed_cluster_texts function embeds and clusters a list of texts, returning a pandas.DataFrame that contains the original texts, their embeddings, and the assigned cluster labels.
Embeddings are generated for the given list of texts.
Clustering is performed on the generated embeddings, using the perform_clustering function defined above.
A pandas.DataFrame is initialized to hold the results, and the original texts, the embedding lists, and the cluster labels are each stored in it.
This function combines embedding generation and clustering of text data into one step, facilitating structural analysis and grouping of the texts.
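A sketch; dim=10 and threshold=0.1 are the values used in the reference implementation:

```python
import pandas as pd


def embed_cluster_texts(texts: list[str]) -> pd.DataFrame:
    """Embed texts, cluster the embeddings, and collect everything in a DataFrame."""
    text_embeddings_np = embed(texts)
    cluster_labels = perform_clustering(text_embeddings_np, dim=10, threshold=0.1)
    df = pd.DataFrame()
    df["text"] = texts
    df["embd"] = list(text_embeddings_np)
    df["cluster"] = cluster_labels
    return df
```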
The fmt_txt function formats the text documents in a pandas DataFrame as a single string.
It receives a DataFrame as input; the DataFrame must have a 'text' column containing the text documents to format.
All text documents are joined using a specific delimiter ("--- --- \n --- ---") and the function returns the resulting single string.
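A sketch:

```python
def fmt_txt(df: pd.DataFrame) -> str:
    """Join the 'text' column of a DataFrame into one delimiter-separated string."""
    unique_txt = df["text"].tolist()
    return "--- --- \n --- --- ".join(unique_txt)
```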
This step performs the process of embedding and clustering the text data, then generating a summary for each cluster; a sketch follows this description.
Embeddings are generated for the given list of texts, and clustering proceeds based on similarity. This results in the df_clusters data frame, which contains the original texts, their embeddings, and the cluster assignments.
The data frame is then expanded so cluster assignments are easy to handle: each row becomes a new record containing a text, its embedding, and a single cluster.
Unique cluster identifiers are extracted from the expanded data frame, and the text of each cluster is formatted to generate a summary. The summaries are stored in the df_summary data frame, which contains each cluster's summary, the given level of detail, and the cluster identifier.
Finally, the function returns a tuple of two data frames: the first contains the original texts, embeddings, and cluster assignments, and the second contains the summary, level of detail, and cluster identifier for each cluster.
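A sketch of the full step as an embed_cluster_summarize_texts function (the name follows the reference implementation); the summarization prompt wording is an assumption:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


def embed_cluster_summarize_texts(
    texts: list[str], level: int
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Embed and cluster texts, then summarize each cluster with the chat model."""
    df_clusters = embed_cluster_texts(texts)

    # Expand: one row per (text, cluster) pair
    expanded_list = [
        {"text": row["text"], "embd": row["embd"], "cluster": cluster}
        for _, row in df_clusters.iterrows()
        for cluster in row["cluster"]
    ]
    expanded_df = pd.DataFrame(expanded_list)
    all_clusters = expanded_df["cluster"].unique()

    # Summarization prompt (wording is illustrative)
    template = """Here is a subset of LangChain Expression Language docs.
Give a detailed summary of the documentation provided.

Documentation:
{context}
"""
    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | model | StrOutputParser()

    summaries = []
    for i in all_clusters:
        df_cluster = expanded_df[expanded_df["cluster"] == i]
        formatted_txt = fmt_txt(df_cluster)
        summaries.append(chain.invoke({"context": formatted_txt}))

    df_summary = pd.DataFrame(
        {
            "summaries": summaries,
            "level": [level] * len(summaries),
            "cluster": list(all_clusters),
        }
    )
    return df_clusters, df_summary
```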
This function implements the process of recursively embedding, clustering, and summarizing the text data; see the sketch after this description.
The given list of texts is embedded, clustered, and summarized, and the results of each step are stored.
The function runs up to the specified maximum recursion level, or stops once the number of unique clusters is 1.
At each recursion step, the current level's clustering and summary results are returned as data frames and stored in a results dictionary.
If the current level is below the maximum recursion level and the number of unique clusters is greater than 1, the function calls itself recursively, using the current level's summaries as the input texts for the next level.
It finally returns a dictionary containing each level's cluster data frame and summary data frame.
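A sketch of the recursion (the function name follows the reference implementation):

```python
def recursive_embed_cluster_summarize(
    texts: list[str], level: int = 1, n_levels: int = 3
) -> dict[int, tuple[pd.DataFrame, pd.DataFrame]]:
    """Embed, cluster, and summarize; feed summaries into the next level."""
    results = {}
    df_clusters, df_summary = embed_cluster_summarize_texts(texts, level)
    results[level] = (df_clusters, df_summary)

    # Recurse while there is more than one unique cluster and depth remains
    unique_clusters = df_summary["cluster"].nunique()
    if level < n_levels and unique_clusters > 1:
        new_texts = df_summary["summaries"].tolist()
        results.update(
            recursive_embed_cluster_summarize(new_texts, level + 1, n_levels)
        )
    return results
```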
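As a sketch of how the pieces above fit together (leaf_texts and results are the names the next section relies on; three levels is an assumption):

```python
# Build the tree from the raw documents
leaf_texts = docs_texts
results = recursive_embed_cluster_summarize(leaf_texts, level=1, n_levels=3)
```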
The paper reports the best performance with collapsed tree retrieval.
This involves flattening the tree structure into a single layer, then applying a k-nearest neighbor (kNN) search across all nodes simultaneously.
We will briefly perform this process below.
This section describes using the Chroma vector store to vectorize the text data and make it searchable.
First, the text data stored in leaf_texts is copied to the all_texts variable.
From the result data (results), the summarized text at each level is extracted and added to all_texts: the values of the summaries column of each level's DataFrame are converted to a list, and the extracted summaries are appended to all_texts.
A Chroma vector store is then built from all the text data (all_texts): calling the Chroma.from_texts function vectorizes the texts and creates the vector store.
To make the generated vector store searchable, a retriever is initialized using the .as_retriever() method.
Through this process, the text data, including summaries at every level, is vectorized into a searchable Chroma vector store.
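A sketch of the collapsed-tree indexing:

```python
from langchain_community.vectorstores import Chroma

# Flatten the tree: leaf texts plus every level's summaries in one list
all_texts = leaf_texts.copy()
for level in sorted(results.keys()):
    summaries = results[level][1]["summaries"].tolist()
    all_texts.extend(summaries)

# Index everything in Chroma and expose it as a retriever
vectorstore = Chroma.from_texts(texts=all_texts, embedding=embd)
retriever = vectorstore.as_retriever()
```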
Save DB locally.
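A sketch of persisting and re-opening the store; the ./chroma_db path is illustrative:

```python
# Persist the collection to disk, then re-open it from the same path
db = Chroma.from_texts(
    texts=all_texts, embedding=embd, persist_directory="./chroma_db"
)

db = Chroma(persist_directory="./chroma_db", embedding_function=embd)
retriever = db.as_retriever()
```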
Define the Retrieval-Augmented Generation (RAG) chain and use it to request a specific code example.
hub.pull is used to pull the RAG prompt.
A format_docs function is defined for formatting documents; it joins and returns the page content of the retrieved documents.
The RAG chain is then composed: documents from the retriever are formatted with the format_docs function, while RunnablePassthrough() passes the question through unchanged.
The chain pipes the prompt and the model, then parses the final output as a string through StrOutputParser().
Using the rag_chain.invoke method, the question "How to define a RAG chain? Give me a specific code example." is processed.
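A sketch of the chain, using the standard rlm/rag-prompt from the hub:

```python
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Pull a standard RAG prompt from the hub
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    """Join the retrieved documents' page content into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


# retriever -> format -> prompt -> model -> string
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

answer = rag_chain.invoke(
    "How to define a RAG chain? Give me a specific code example."
)
print(answer)
```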