01. Document summary

This tutorial takes a look at several ways to summarize documents with an LLM.

Below is an overview of the methods covered.

  • Stuff: summarize the entire document in a single prompt

  • Map-Reduce: summarize each split in parallel, then merge the partial summaries in a batch "reduce" step

  • Map-Refine: summarize each split, then merge the partial summaries sequentially, refining as you go

  • Chain of Density: run N rounds, adding missing entities at each pass to densify and improve the summary

  • Clustering-Map-Refine: group the document's chunks into N clusters, then run a Refine summary over the chunk closest to each cluster's centroid

Commonly used summarization methods

The central question when building a summarizer is how to fit the documents into the LLM's context window. Here are some common approaches:

  1. Stuff : simply "stuff" all of the documents into a single prompt. This is the simplest approach.

  2. Map-reduce : summarize each document individually in a "map" step, then combine those summaries into a final summary in a "reduce" step.

  3. Refine : build the answer by traversing the input documents and iteratively updating it. For each document, the chain receives the non-document inputs, the current document, and the latest intermediate answer, and produces a new answer.


# Load API keys managed in a .env configuration file as environment variables.
from dotenv import load_dotenv

# Load the API key information
load_dotenv()


Stuff

The stuff documents chain ("stuff" as in "to cram" or "to fill") is the simplest of the document chains. It takes a list of documents, inserts them all into a prompt, and passes that prompt to the LLM.

This chain is well suited to applications where the documents are small and only a few are passed in per call.

Load data.
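The original loading cell is not preserved here, so the following is a minimal sketch that assumes the data is a PDF at a hypothetical path data/sample.pdf and uses PyPDFLoader from langchain_community.

# A minimal sketch: "data/sample.pdf" is a hypothetical stand-in path.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("data/sample.pdf")
docs = loader.load()  # one Document per page
print(f"Number of pages: {len(docs)}")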


Below is a prompt that asks the model to write the summary in Korean.
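The original cell is not preserved, so here is a sketch of a stuff chain; the prompt wording and model name are assumptions. It uses create_stuff_documents_chain, which fills every document into the prompt's {context} slot.

from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# The prompt wording is an assumption: a Korean bullet-point summary.
prompt = PromptTemplate.from_template(
    "Please summarize the following text as concise bullet points in Korean:\n\n{context}"
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model name is an assumption

# "Stuff": every loaded document is inserted into the single {context} slot.
stuff_chain = create_stuff_documents_chain(llm, prompt)
answer = stuff_chain.invoke({"context": docs})
print(answer)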


Map-Reduce

Map-Reduce summarization is a technique for efficiently summarizing long documents.

The method consists of a "map" step, which splits the document into small chunks and summarizes each one, and a "reduce" step, which combines the per-chunk summaries.

  1. In the map phase, each chunk is summarized in parallel.

  2. In the reduce phase, those summaries are combined into one final summary.

This approach is particularly useful for large documents, since it lets you work around the token limits of language models.

Load data.


Map

The map phase creates a summary for each Chunk.

(Strictly speaking, the textbook map step generates a summary of each chunk, but here I change it to extract the key points instead. This makes no practical difference, since the reduce phase combines the partial outputs into a summary anyway.)

I found this approach works better, but whether the map phase summarizes or extracts key points is up to you.


Create the map_chain.
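The original cell is missing; below is one plausible sketch of the map chain, where the prompt wording and model name are assumptions (here it extracts key points rather than summarizing, as discussed above).

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Map prompt (wording is an assumption): extract the key points of one chunk.
map_prompt = PromptTemplate.from_template(
    "Extract the key points of the following text as short bullet points:\n\n{doc}"
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model name is an assumption
map_chain = map_prompt | llm | StrOutputParser()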


Call batch() to generate a summary for each document.
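A sketch of the batch call, treating each loaded page as a chunk (an assumption):

# Run the map chain over every chunk in parallel.
doc_summaries = map_chain.batch([{"doc": doc.page_content} for doc in docs])
print(f"Number of partial summaries: {len(doc_summaries)}")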


Reduce

In the Reduce phase, the key points extracted in the map phase are combined into one final summary.


Generate Reduce Chain.
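The original cell is missing; a sketch of the reduce chain, with assumed prompt wording and model name:

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Reduce prompt (wording is an assumption): merge the partial results.
reduce_prompt = PromptTemplate.from_template(
    "The following are partial summaries of a single document:\n\n{doc_summaries}\n\n"
    "Combine them into one coherent final summary in Korean."
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model name is an assumption
reduce_chain = reduce_prompt | llm | StrOutputParser()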


Below is an example of streaming output using Reduce Chain.
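A sketch, feeding in the partial summaries produced by the map step:

# Stream the final summary token by token.
for token in reduce_chain.stream({"doc_summaries": "\n\n".join(doc_summaries)}):
    print(token, end="", flush=True)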


Map-Refine

The Map-Refine method is another approach to document summarization. It is similar to map-reduce, with a few differences.

  1. Map step: divide the document into multiple small chunks and create a summary for each chunk individually.

  2. Refine step: process the generated summaries sequentially, gradually improving the final summary. At each step, the summary is updated by combining the previous summary with the information from the next chunk.

  3. Repeat: repeat the refine step until all chunks have been processed.

  4. Final summary: the summary obtained after the last chunk has been processed is the final result.

The advantage of map-refine is that the summary improves gradually while preserving the order of the documents, which is especially useful when the document's context matters. However, unlike map-reduce, the refine step is inherently sequential, so it is hard to parallelize and can take longer on large documents.
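The original cells are not preserved; for reference, here is a sketch of the refine chain itself (the prompt wording and model name are assumptions). The sequential loop that applies it appears in the Refine section below.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI

# Refine prompt (wording is an assumption): fold new context into the
# running summary without discarding what is already there.
refine_prompt = PromptTemplate.from_template(
    "Here is the summary so far:\n\n{previous_summary}\n\n"
    "Refine it using this additional context:\n\n{current_summary}"
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model name is an assumption
refine_chain = refine_prompt | llm | StrOutputParser()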


Note that the output streams cumulatively: rather than simply concatenating the streamed chunks, which would repeat text, print each update with a carriage return so that the next chunk overwrites the previous one.

This is partial JSON streaming: each streamed chunk is the same list of JSON dicts as before, with a new suffix appended.
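As an illustration only (the update strings below are made up), carriage-return printing overwrites the previous update instead of appending to it:

import time

# Simulated cumulative updates, as produced by partial JSON streaming:
# each chunk repeats the text so far with a new suffix appended.
updates = [
    '[{"summary": "The document"}]',
    '[{"summary": "The document explains"}]',
    '[{"summary": "The document explains LLM summarization."}]',
]
for update in updates:
    print(f"\r{update}", end="", flush=True)  # "\r" overwrites the line
    time.sleep(0.2)
print()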


Check the data to summarize.


Chain of Density

The "Chain of Density" (CoD) prompt is a technique developed to improve summary generation with GPT-4.

  • Paper: https://arxiv.org/pdf/2309.04269

The method starts with a summary containing few entities, then repeatedly folds in missing important entities without increasing the length. Studies have shown that CoD-generated summaries are more abstractive than those from ordinary prompts, fuse information better, and approach the density of human-written summaries.

  1. Progressive improvement: CoD first generates a simple summary with few entities, then adds important entities step by step. Because the length of the summary is held constant during this process, the information density rises, producing a summary that is readable yet informative.

  2. Balance of information density and readability: CoD regulates the information density of the summary to find the optimal balance between informativeness and readability. Studies have shown that people prefer CoD summaries that are denser than typical GPT-4 summaries, but not as dense as human-written ones.

  3. Better abstraction and information fusion: CoD-generated summaries are more abstractive, fuse information well, and show less lead bias (over-reliance on the beginning of the source text). This improves the overall quality and readability of the summary.

Chain of Density prompt input parameters:

  1. content_category : the type of content (e.g. article, video transcript, blog post, research paper). Default: Article

  2. content : the content to summarize

  3. entity_range : the range of how many entities to select from the content and add to the summary per round. Default: 1-3

  4. max_words : the maximum number of words per summary. Default: 80

  5. iterations : the number of entity-densification rounds. The total number of summaries produced is iterations + 1. For an 80-word summary, 3 iterations is ideal; for longer summaries, 4-5 rounds can help, as can widening entity_range to, say, 1-4. Default: 3

The code below uses the Chain of Density prompt to build a chain that generates a text summary. The first chain shows the intermediate results, and the second chain extracts only the final summary.
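The original cells are not preserved; below is a rough sketch of such a pair of chains. The prompt is a heavily abbreviated stand-in for the full CoD prompt from the paper, and the model name is an assumption.

from langchain_core.output_parsers import JsonOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Heavily abbreviated stand-in for the full Chain of Density prompt.
cod_prompt = ChatPromptTemplate.from_template(
    "Article: {content}\n\n"
    "Generate increasingly entity-dense summaries of the {content_category} "
    "above. Repeat {iterations} times: find {entity_range} informative "
    "entities missing from the previous summary, then rewrite the summary in "
    "at most {max_words} words so it also covers them. Answer only with a "
    'JSON list of dicts with keys "missing_entities" and "denser_summary".'
)

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model name is an assumption

# First chain: yields the whole list of intermediate summaries (partial JSON).
cod_chain = cod_prompt | llm | JsonOutputParser()

# Second chain: keeps only the last, densest summary.
cod_final_chain = cod_chain | (lambda results: results[-1]["denser_summary"])

print(
    cod_final_chain.invoke(
        {
            "content": "\n\n".join(doc.page_content for doc in docs),  # reuses docs loaded earlier
            "content_category": "Article",
            "entity_range": "1-3",
            "max_words": 80,
            "iterations": 3,
        }
    )
)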


Now weave the steps so far into a single chain.

Below is an example that creates map_reduce_chain.
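The original cell is missing; here is a sketch that wires the map and reduce chains sketched above into one runnable (a custom composition, not LangChain's built-in map-reduce chain):

from langchain_core.runnables import RunnableLambda

def map_reduce(docs):
    # Map: extract the key points of each chunk in parallel.
    doc_summaries = map_chain.batch([{"doc": d.page_content} for d in docs])
    # Reduce: merge all partial results into one final summary.
    return reduce_chain.invoke({"doc_summaries": "\n\n".join(doc_summaries)})

map_reduce_chain = RunnableLambda(map_reduce)
print(map_reduce_chain.invoke(docs))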


Refine

The Refine phase sequentially processes the partial results produced in the preceding map phase and gradually improves the final summary.
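A sketch of the sequential refine loop, reusing the map_chain and refine_chain sketched earlier:

from langchain_core.runnables import RunnableLambda

def map_refine(docs):
    # Map: extract the key points of each chunk in parallel.
    summaries = map_chain.batch([{"doc": d.page_content} for d in docs])
    # Refine: fold each subsequent partial result into the running summary.
    previous_summary = summaries[0]
    for current_summary in summaries[1:]:
        previous_summary = refine_chain.invoke(
            {"previous_summary": previous_summary, "current_summary": current_summary}
        )
    return previous_summary

map_refine_chain = RunnableLambda(map_refine)
print(map_refine_chain.invoke(docs))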


Output a summary of the first document.


Create the map_chain.


Clustering-Map-Refine

gkamradt, the author of the original tutorial this section is based on, made an interesting proposal for summarizing long documents.

The background is as follows.

  1. Both map-reduce and map-refine are time-consuming and expensive.

  2. He therefore suggests dividing the document's chunks into N clusters, taking the chunk closest to each cluster's centroid as the representative of that cluster, and then summarizing only those representatives in the map-reduce (or map-refine) way.

In practice the cost is reasonable and the results are satisfactory, so here we adapt and share the code from the original author's tutorial.


Running the code below combines the text into a single document, so that the content is not split along page boundaries.

The combined text is about 28K characters.
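A sketch of the combining step, assuming the pages were loaded into docs as before:

# Join all pages into one text so content is not split at page boundaries.
text = "\n\n".join(doc.page_content for doc in docs)
print(f"Total characters: {len(text)}")  # about 28K in this example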


Split the single text into multiple documents using RecursiveCharacterTextSplitter.
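A sketch; the chunk size and overlap are assumptions and may differ from the original values:

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
split_docs = text_splitter.create_documents([text])
print(f"Number of documents: {len(split_docs)}")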


Check the number of documents produced by the split. Here, the text was divided into 79 documents.


Embed documents using the Upstage Embeddings model.
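A sketch, assuming the langchain_upstage integration and an UPSTAGE_API_KEY in the environment; the model name is an assumption:

from langchain_upstage import UpstageEmbeddings

embeddings = UpstageEmbeddings(model="solar-embedding-1-large")  # model name is an assumption
vectors = embeddings.embed_documents([doc.page_content for doc in split_docs])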


The 79 documents are grouped into 10 clusters, using KMeans for the clustering.
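A sketch of the clustering step:

from sklearn.cluster import KMeans

# Group the embedding vectors into 10 clusters.
num_clusters = 10
kmeans = KMeans(n_clusters=num_clusters, random_state=42).fit(vectors)
print(kmeans.labels_)  # cluster label for each document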


Check the labeled results.


Next, find and store the index of the embedding closest to each cluster's centroid.
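A sketch using numpy to pick the representative document for each cluster:

import numpy as np

# Index of the embedding closest to each cluster's centroid.
closest_indices = []
for i in range(num_clusters):
    distances = np.linalg.norm(
        np.array(vectors) - kmeans.cluster_centers_[i], axis=1
    )
    closest_indices.append(int(np.argmin(distances)))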


Sort the indices in ascending order so the documents are summarized in their original order.
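For example:

# Keep the representative documents in their original order.
selected_indices = sorted(closest_indices)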


Output the 10 selected documents. In this step, the documents are created as Document objects.
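A sketch of building the Document objects for the 10 representatives:

from langchain_core.documents import Document

selected_docs = [
    Document(page_content=split_docs[i].page_content) for i in selected_indices
]
for doc in selected_docs:
    print(doc.page_content[:100], "...")

These selected_docs can then be passed to a chain such as the map_refine_chain sketched earlier to produce the final summary at a fraction of the cost of summarizing every chunk.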

