04. RAPTOR: Long Context Summary
Install

```bash
pip install -qU langchain umap-learn scikit-learn langchain_community tiktoken langchain-openai langchainhub chromadb langchain-anthropic
```

RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval
The RAPTOR paper presents an interesting approach to indexing and retrieving documents.
TeddyNote's paper summary (note)
The leafs are the set of starting documents; the leafs are embedded and clustered.
Each cluster is then summarized into higher-level (more abstract) information across similar documents.
This process is performed recursively, forming a "tree" from the original documents (leafs) up to increasingly abstract summaries.
This can be applied at a variety of scales; leafs can be:
Text chunks within a single document (as shown in the paper)
Full documents (as shown below)
With longer-context LLMs, this can be done over entire documents.
Documents
Let's apply this to LangChain's LCEL documentation.
In this case, each doc is a unique web page of the LCEL docs.
Contexts range from less than 2,000 tokens to more than 10,000 tokens.
The following describes the process of extracting text data from web documents, counting the tokens in each text, and visualizing the counts as a histogram.
The num_tokens_from_string function uses the tiktoken library to count the tokens in a string for a given encoding name.
The RecursiveUrlLoader class recursively loads web documents from a specified URL; in the process, BeautifulSoup is used to extract the text from the HTML documents.
Documents are loaded from multiple URLs, and all text data is collected into one list.
For each document's text, the num_tokens_from_string function is called to count its tokens, and the counts are saved to a list.
The distribution of token counts is visualized as a matplotlib histogram, with the token count on the x-axis and the frequency of documents with that count on the y-axis. Histograms help you understand the distribution of the data, in particular the length distribution of the text.
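A minimal sketch of this step; the LCEL docs URL and the max_depth value are illustrative assumptions, not values taken from the original page:

```python
import tiktoken
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup as Soup
from langchain_community.document_loaders import RecursiveUrlLoader


def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Return the number of tokens in a string for the given encoding name."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))


# Recursively load the LCEL docs, extracting text from the HTML with BeautifulSoup.
# Additional loaders over other URLs can be concatenated into `docs` the same way.
url = "https://python.langchain.com/docs/expression_language/"
loader = RecursiveUrlLoader(
    url=url, max_depth=20, extractor=lambda x: Soup(x, "html.parser").text
)
docs = loader.load()
docs_texts = [d.page_content for d in docs]

# Count tokens per document and plot the length distribution
counts = [num_tokens_from_string(t, "cl100k_base") for t in docs_texts]
plt.hist(counts, bins=30)
plt.xlabel("Token count")
plt.ylabel("Number of documents")
plt.title("Token counts per document")
plt.show()
```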
The next step sorts and concatenates the document texts, then counts the tokens of the result.
The documents (docs) are sorted based on the "source" key of their metadata, and the sorted list is reversed.
The contents of the reversed documents are joined using a specific separator ("\n\n\n --- \n\n\n").
The number of tokens in the concatenated content is computed with the num_tokens_from_string function and printed, using the "cl100k_base" encoding.
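A sketch of the sorting and concatenation, reusing docs and num_tokens_from_string from above:

```python
# Sort by the "source" metadata key, reverse, and join with the separator
d_sorted = sorted(docs, key=lambda doc: doc.metadata["source"])
d_reversed = list(reversed(d_sorted))
concatenated_content = "\n\n\n --- \n\n\n".join(
    [doc.page_content for doc in d_reversed]
)
print(
    "Num tokens in all context: %s"
    % num_tokens_from_string(concatenated_content, "cl100k_base")
)
```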
The text is split using RecursiveCharacterTextSplitter.
The chunk_size_tok variable sets the size of each text chunk to 2,000 tokens.
The text splitter is initialized with RecursiveCharacterTextSplitter's from_tiktoken_encoder method, with the chunk size (chunk_size) set accordingly and the chunk overlap (chunk_overlap) set to 0.
Calling the initialized splitter's split_text method divides the concatenated text stored in the concatenated_content variable; the result is stored in the texts_split variable.
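A sketch of the splitting step:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size_tok = 2000
text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=chunk_size_tok, chunk_overlap=0
)
texts_split = text_splitter.split_text(concatenated_content)
```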
Model
Various models can be tested, including the new Claude 3 series.
Don't forget to set the relevant API keys:
OPENAI_API_KEY, and ANTHROPIC_API_KEY when using Anthropic.
ChatOpenAI or ChatAnthropic, together with OpenAIEmbeddings, are used to implement the chat model.
OpenAIEmbeddings instantiates OpenAI's embedding function. The chat model is initialized with ChatOpenAI or ChatAnthropic, with temperature set to 0.
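A sketch of the key setup; the values are placeholders:

```python
import os

# Set the relevant API keys (placeholders; set only the ones you use)
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ANTHROPIC_API_KEY"] = "sk-ant-..."
```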
Use cache-backed embeddings.
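A sketch of cache-backed embeddings; the ./cache/ path and the namespace choice are assumptions:

```python
from langchain.embeddings import CacheBackedEmbeddings
from langchain.storage import LocalFileStore
from langchain_openai import OpenAIEmbeddings

# Wrap the OpenAI embeddings in a local file-backed cache
store = LocalFileStore("./cache/")
underlying_embd = OpenAIEmbeddings()
embd = CacheBackedEmbeddings.from_bytes_store(
    underlying_embd, store, namespace=underlying_embd.model
)
```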
Initialize the model.
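A sketch of the model initialization; the specific model names are illustrative assumptions:

```python
from langchain_openai import ChatOpenAI
# from langchain_anthropic import ChatAnthropic

model = ChatOpenAI(temperature=0, model="gpt-4-turbo-preview")
# Or, for the Claude 3 series:
# model = ChatAnthropic(temperature=0, model="claude-3-opus-20240229")
```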
Tree building
The clustering approach in tree building contains some interesting ideas.
GMM (Gaussian Mixture Model)
Models the distribution of data points across different clusters.
Evaluates the model's Bayesian Information Criterion (BIC) to determine the optimal number of clusters.
UMAP (Uniform Manifold Approximation and Projection)
Supports the clustering.
Reduces the dimensionality of high-dimensional data.
UMAP helps emphasize natural groupings based on the similarity of data points.
Local and global clustering
Used to analyze data at various scales.
Effectively captures both fine-grained and broader patterns within the data.
Thresholding
Applied to determine cluster membership in the context of the GMM.
Based on the probability distributions (a data point can be assigned to more than one cluster).
The code for the GMM and thresholding is from Sarthi et al., as referenced in the two sources below:
Full credit goes to the original authors.
The global_cluster_embeddings function uses UMAP to perform global dimensionality reduction of the embeddings.
The input embeddings (embeddings) are reduced to the specified dimensionality (dim) using UMAP.
n_neighbors specifies the number of neighbors to consider for each point; if not provided, it defaults to the square root of the number of embeddings. metric specifies the distance metric used by UMAP.
The embeddings, reduced to the specified dimensionality, are returned as a numpy array.
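A sketch following the reference implementation from Sarthi et al. cited above:

```python
from typing import Optional

import numpy as np
import umap

RANDOM_SEED = 224  # fixed seed for reproducibility


def global_cluster_embeddings(
    embeddings: np.ndarray,
    dim: int,
    n_neighbors: Optional[int] = None,
    metric: str = "cosine",
) -> np.ndarray:
    """Globally reduce embedding dimensionality with UMAP."""
    if n_neighbors is None:
        # Default: square root of the number of embeddings
        n_neighbors = int((len(embeddings) - 1) ** 0.5)
    return umap.UMAP(
        n_neighbors=n_neighbors, n_components=dim, metric=metric
    ).fit_transform(embeddings)
```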
The local_cluster_embeddings function performs local dimensionality reduction of the embedding data.
The input embeddings (embeddings) are reduced to the specified dimensionality (dim) using UMAP.
The number of neighbors to consider for each point (num_neighbors) and the distance metric (metric) are taken as parameters.
Finally, the dimensionality-reduced embeddings are returned as a numpy array.
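A sketch of the local counterpart:

```python
def local_cluster_embeddings(
    embeddings: np.ndarray,
    dim: int,
    num_neighbors: int = 10,
    metric: str = "cosine",
) -> np.ndarray:
    """Locally reduce embedding dimensionality with UMAP."""
    return umap.UMAP(
        n_neighbors=num_neighbors, n_components=dim, metric=metric
    ).fit_transform(embeddings)
```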
The get_optimal_clusters function determines the optimal number of clusters for the given embedding data. This is done by fitting Gaussian Mixture Models and computing the Bayesian Information Criterion (BIC).
The input embeddings (embeddings) are provided as a numpy array.
The maximum number of clusters to consider (max_clusters) defaults to 50.
A fixed random state (random_state) is used for reproducibility.
The function fits a model for each candidate number of clusters and computes the BIC value for each.
The number of clusters with the minimum BIC value is determined and returned as the optimal number of clusters.
This function is useful for automatically finding the number of clusters that best describes the data in a clustering problem.
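A sketch of the BIC search:

```python
from sklearn.mixture import GaussianMixture


def get_optimal_clusters(
    embeddings: np.ndarray,
    max_clusters: int = 50,
    random_state: int = RANDOM_SEED,
) -> int:
    """Pick the cluster count that minimizes the BIC of a fitted GMM."""
    max_clusters = min(max_clusters, len(embeddings))
    n_clusters = np.arange(1, max_clusters)
    bics = []
    for n in n_clusters:
        gm = GaussianMixture(n_components=n, random_state=random_state)
        gm.fit(embeddings)
        bics.append(gm.bic(embeddings))
    return n_clusters[np.argmin(bics)]
```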
The GMM_cluster function clusters the embeddings using a Gaussian Mixture Model (GMM). The process is based on a probability threshold.
The input embeddings (embeddings) are provided as a numpy array.
threshold is the probability threshold for assigning an embedding to a given cluster, and random_state is the seed for reproducibility of the results.
The get_optimal_clusters function is called to determine the optimal number of clusters.
A Gaussian mixture model with that number of clusters is initialized and fit on the input embeddings.
The probability of belonging to each cluster is computed for every embedding; if a probability exceeds the given threshold, the embedding is assigned to that cluster.
The function finally returns a tuple of the embeddings' cluster labels and the determined number of clusters.
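A sketch of the soft clustering step:

```python
def GMM_cluster(embeddings: np.ndarray, threshold: float, random_state: int = 0):
    """Soft-cluster embeddings with a GMM; a point may belong to several clusters."""
    n_clusters = get_optimal_clusters(embeddings)
    gm = GaussianMixture(n_components=n_clusters, random_state=random_state)
    gm.fit(embeddings)
    probs = gm.predict_proba(embeddings)
    # Keep every cluster whose membership probability clears the threshold
    labels = [np.where(prob > threshold)[0] for prob in probs]
    return labels, n_clusters
```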
The perform_clustering function reduces the dimensionality of the embeddings, performs global clustering with a Gaussian mixture model, and then performs local clustering within each global cluster, returning the combined clustering results.
The input embeddings (embeddings) first undergo dimensionality reduction, using UMAP to reduce them to the dimensionality specified by dim.
Global clustering is performed on the reduced embeddings with a Gaussian mixture model (GMM); cluster assignment is governed by the probability threshold (threshold).
Additional local clustering is then performed within each global cluster: based on the global results, dimensionality reduction and GMM clustering are applied again to only the embeddings belonging to that global cluster.
Finally, global and local cluster IDs are assigned to every embedding, and a list of cluster-ID arrays is returned, with one entry per embedding in the order of the embeddings.
This function combines clustering at the global and local levels, yielding more granular clusters and enabling more effective analysis of complex data structures.
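A condensed sketch of the global-then-local procedure:

```python
def perform_clustering(
    embeddings: np.ndarray, dim: int, threshold: float
) -> list[np.ndarray]:
    """Global UMAP + GMM clustering, then local clustering inside each global cluster."""
    if len(embeddings) <= dim + 1:
        # Too little data to cluster meaningfully: one cluster for everything
        return [np.array([0]) for _ in range(len(embeddings))]

    # Global step
    reduced_global = global_cluster_embeddings(embeddings, dim)
    global_clusters, n_global = GMM_cluster(reduced_global, threshold)

    all_local_clusters = [np.array([]) for _ in range(len(embeddings))]
    total_clusters = 0

    for i in range(n_global):
        # Select the embeddings assigned to global cluster i
        mask = np.array([i in gc for gc in global_clusters])
        cluster_embs = embeddings[mask]
        if len(cluster_embs) == 0:
            continue
        if len(cluster_embs) <= dim + 1:
            local_clusters, n_local = [np.array([0]) for _ in cluster_embs], 1
        else:
            reduced_local = local_cluster_embeddings(cluster_embs, dim)
            local_clusters, n_local = GMM_cluster(reduced_local, threshold)

        # Record local cluster IDs, offset so IDs stay globally unique
        idxs = np.where(mask)[0]
        for local_id in range(n_local):
            for pos, orig_idx in enumerate(idxs):
                if local_id in local_clusters[pos]:
                    all_local_clusters[orig_idx] = np.append(
                        all_local_clusters[orig_idx], local_id + total_clusters
                    )
        total_clusters += n_local

    return all_local_clusters
```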
The embed function generates embeddings for a list of text documents.
A list of text documents (texts) is taken as input.
The embd object's embed_documents method creates the embeddings of the text documents.
The resulting embeddings are converted to a numpy.ndarray and returned.
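A sketch, reusing the embd object initialized earlier:

```python
def embed(texts: list[str]) -> np.ndarray:
    """Embed a list of texts with embd and return them as a numpy array."""
    text_embeddings = embd.embed_documents(texts)
    return np.array(text_embeddings)
```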
The embed_cluster_texts function embeds and clusters a list of texts, returning a pandas.DataFrame that contains the original texts, their embeddings, and the assigned cluster labels.
Embeddings are generated for the given list of texts.
Clustering is performed on the generated embeddings, using the perform_clustering function defined above.
A pandas.DataFrame is initialized to hold the results, and the original texts, the embedding lists, and the cluster labels are each stored in it.
This function combines embedding generation and clustering of text data into one step, facilitating structural analysis and grouping of the texts.
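A sketch; dim=10 and threshold=0.1 are the values used in the reference implementation:

```python
import pandas as pd


def embed_cluster_texts(texts: list[str]) -> pd.DataFrame:
    """Embed texts, cluster the embeddings, and collect everything in a DataFrame."""
    text_embeddings_np = embed(texts)
    cluster_labels = perform_clustering(text_embeddings_np, dim=10, threshold=0.1)
    df = pd.DataFrame()
    df["text"] = texts
    df["embd"] = list(text_embeddings_np)
    df["cluster"] = cluster_labels
    return df
```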
The fmt_txt function formats the text documents in a pandas DataFrame as a single string.
It receives a DataFrame as input; the DataFrame must have a 'text' column containing the text documents to format.
All text documents are joined using a specific delimiter ("--- --- \n --- ---") and the function returns the resulting single string.
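A sketch:

```python
def fmt_txt(df: pd.DataFrame) -> str:
    """Join the 'text' column of a DataFrame into one delimiter-separated string."""
    unique_txt = df["text"].tolist()
    return "--- --- \n --- --- ".join(unique_txt)
```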
This step performs the process of embedding and clustering the text data, then generating a summary for each cluster; a sketch follows this description.
Embeddings are generated for the given list of texts, and clustering proceeds based on similarity. This results in the df_clusters data frame, which contains the original texts, their embeddings, and the cluster assignments.
The data frame is then expanded so cluster assignments are easy to handle: each row becomes a new record containing a text, its embedding, and a single cluster.
Unique cluster identifiers are extracted from the expanded data frame, and the text of each cluster is formatted to generate a summary. The summaries are stored in the df_summary data frame, which contains each cluster's summary, the given level of detail, and the cluster identifier.
Finally, the function returns a tuple of two data frames: the first contains the original texts, embeddings, and cluster assignments, and the second contains the summary, level of detail, and cluster identifier for each cluster.
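A sketch of the full step as an embed_cluster_summarize_texts function (the name follows the reference implementation); the summarization prompt wording is an assumption:

```python
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser


def embed_cluster_summarize_texts(
    texts: list[str], level: int
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Embed and cluster texts, then summarize each cluster with the chat model."""
    df_clusters = embed_cluster_texts(texts)

    # Expand: one row per (text, cluster) pair
    expanded_list = [
        {"text": row["text"], "embd": row["embd"], "cluster": cluster}
        for _, row in df_clusters.iterrows()
        for cluster in row["cluster"]
    ]
    expanded_df = pd.DataFrame(expanded_list)
    all_clusters = expanded_df["cluster"].unique()

    # Summarization prompt (wording is illustrative)
    template = """Here is a subset of LangChain Expression Language docs.
Give a detailed summary of the documentation provided.

Documentation:
{context}
"""
    prompt = ChatPromptTemplate.from_template(template)
    chain = prompt | model | StrOutputParser()

    summaries = []
    for i in all_clusters:
        df_cluster = expanded_df[expanded_df["cluster"] == i]
        formatted_txt = fmt_txt(df_cluster)
        summaries.append(chain.invoke({"context": formatted_txt}))

    df_summary = pd.DataFrame(
        {
            "summaries": summaries,
            "level": [level] * len(summaries),
            "cluster": list(all_clusters),
        }
    )
    return df_clusters, df_summary
```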
This function implements the process of recursively embedding, clustering, and summarizing the text data; see the sketch after this description.
The given list of texts is embedded, clustered, and summarized, and the results of each step are stored.
The function runs up to the specified maximum recursion level, or stops once the number of unique clusters is 1.
At each recursion step, the current level's clustering and summary results are returned as data frames and stored in a results dictionary.
If the current level is below the maximum recursion level and the number of unique clusters is greater than 1, the function calls itself recursively, using the current level's summaries as the input texts for the next level.
It finally returns a dictionary containing each level's cluster data frame and summary data frame.
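A sketch of the recursion (the function name follows the reference implementation):

```python
def recursive_embed_cluster_summarize(
    texts: list[str], level: int = 1, n_levels: int = 3
) -> dict[int, tuple[pd.DataFrame, pd.DataFrame]]:
    """Embed, cluster, and summarize; feed summaries into the next level."""
    results = {}
    df_clusters, df_summary = embed_cluster_summarize_texts(texts, level)
    results[level] = (df_clusters, df_summary)

    # Recurse while there is more than one unique cluster and depth remains
    unique_clusters = df_summary["cluster"].nunique()
    if level < n_levels and unique_clusters > 1:
        new_texts = df_summary["summaries"].tolist()
        results.update(
            recursive_embed_cluster_summarize(new_texts, level + 1, n_levels)
        )
    return results
```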
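As a sketch of how the pieces above fit together (leaf_texts and results are the names the next section relies on; three levels is an assumption):

```python
# Build the tree from the raw documents
leaf_texts = docs_texts
results = recursive_embed_cluster_summarize(leaf_texts, level=1, n_levels=3)
```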
The paper reports the best performance with collapsed tree retrieval.
This involves flattening the tree structure into a single layer, then applying a k-nearest neighbor (kNN) search across all nodes simultaneously.
We will briefly perform this process below.
This section describes using the Chroma vector store to vectorize the text data and make it searchable.
First, the text data stored in leaf_texts is copied to the all_texts variable.
From the result data (results), the summarized text at each level is extracted and added to all_texts: the values of the summaries column of each level's DataFrame are converted to a list, and the extracted summaries are appended to all_texts.
A Chroma vector store is then built from all the text data (all_texts): calling the Chroma.from_texts function vectorizes the texts and creates the vector store.
To make the generated vector store searchable, a retriever is initialized using the .as_retriever() method.
Through this process, the text data, including summaries at every level, is vectorized into a searchable Chroma vector store.
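A sketch of the collapsed-tree indexing:

```python
from langchain_community.vectorstores import Chroma

# Flatten the tree: leaf texts plus every level's summaries in one list
all_texts = leaf_texts.copy()
for level in sorted(results.keys()):
    summaries = results[level][1]["summaries"].tolist()
    all_texts.extend(summaries)

# Index everything in Chroma and expose it as a retriever
vectorstore = Chroma.from_texts(texts=all_texts, embedding=embd)
retriever = vectorstore.as_retriever()
```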
Save DB locally.
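A sketch of persisting and re-opening the store; the ./chroma_db path is illustrative:

```python
# Persist the collection to disk, then re-open it from the same path
db = Chroma.from_texts(
    texts=all_texts, embedding=embd, persist_directory="./chroma_db"
)

db = Chroma(persist_directory="./chroma_db", embedding_function=embd)
retriever = db.as_retriever()
```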
Define the Retrieval-Augmented Generation (RAG) chain and use it to request a specific code example.
hub.pull is used to pull the RAG prompt.
A format_docs function is defined for formatting documents; it joins and returns the page content of the retrieved documents.
The RAG chain is then composed: documents from the retriever are formatted with the format_docs function, while RunnablePassthrough() passes the question through unchanged.
The chain pipes the prompt and the model, then parses the final output as a string through StrOutputParser().
Using the rag_chain.invoke method, the question "How to define a RAG chain? Give me a specific code example." is processed.
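A sketch of the chain, using the standard rlm/rag-prompt from the hub:

```python
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Pull a standard RAG prompt from the hub
prompt = hub.pull("rlm/rag-prompt")


def format_docs(docs):
    """Join the retrieved documents' page content into one context string."""
    return "\n\n".join(doc.page_content for doc in docs)


# retriever -> format -> prompt -> model -> string
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)

answer = rag_chain.invoke(
    "How to define a RAG chain? Give me a specific code example."
)
print(answer)
```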