01. Document summary
This tutorial looks at several ways to summarize a document.
Below is an overview of the methods covered.
Stuff: summarize the entire document in a single prompt
Map-Reduce: split the document, summarize each chunk, then merge the summaries in a single batch step
Map-Refine: split the document, summarize each chunk, then merge the summaries incrementally
Chain of Density: run N iterations, adding missing entities at each pass to improve the summary
Clustering-Map-Refine: split the document's chunks into N clusters, then run a Refine summary over the documents closest to each cluster's centroid
Well-known summarization methods
The central question when building a summarizer is how to get the document into the LLM's context window. Here are some well-known ways to do this:
Stuff: simply "stuff" all the documents into a single prompt. This is the simplest approach.
Map-Reduce: summarize each document individually in the "map" step, then combine the summaries into a final summary in the "reduce" step.
Refine: build the answer by traversing the input documents and updating it iteratively. For each document, the chain is given the non-document inputs, the current document, and the latest intermediate answer, and produces a new answer.
# Configuration file for managing API KEY as environment variable.
from dotenv import load_dotenv
# Load API KEY information
load_dotenv()
Stuff
The stuff documents chain ("stuff" here means "to fill" or "to cram in") is the simplest document chain. It takes a list of documents, inserts them all into a prompt, and passes that prompt to the LLM.
This chain is a good fit when the documents are small and only a few are passed in on most calls.
Load data.
Below is a prompt instructing the model to write the summary in Korean.
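The original code block is not reproduced here, but the stuff approach can be sketched in plain Python. The build_stuff_prompt helper is hypothetical; in the tutorial itself a LangChain prompt plus LLM chain plays this role.

```python
# Stuff: put every document into a single prompt and send it to the LLM once.
def build_stuff_prompt(docs):
    # Join all documents, then prepend the summarization instruction.
    context = "\n\n".join(docs)
    return (
        "Please write a concise summary of the following text in Korean:\n\n"
        + context
    )

docs = ["First section of the document.", "Second section of the document."]
prompt = build_stuff_prompt(docs)
# The prompt now contains every document; pass it to the LLM in one call.
```

This only works while all the documents fit inside the model's context window, which is exactly the limitation the later methods address.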
Map-Reduce
Map-Reduce summarization is a technique for efficiently summarizing long documents.
The method consists of a "map" step that splits the document into small chunks and summarizes each one, and a "reduce" step that combines the chunk summaries.
In the map phase, each chunk is summarized in parallel.
In the reduce phase, these summaries are combined into one final summary.
This approach is particularly useful for large documents, since it lets you work around the model's token limits.
Load data.
Map
The map phase creates a summary for each chunk.
(Strictly speaking, the original method generates a summary per chunk, but here we change it to extracting the key content instead. Since the reduce phase combines the outputs anyway, this makes no practical difference.)
I found this variant to work better, but you can decide at your own discretion whether the map phase should summarize or extract key content.
Create the map_chain.
Call batch() to generate a summary for each document.
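The map step can be sketched in plain Python. The extract_key_points function is a stand-in for the LLM-backed map_chain; LangChain's batch() runs the chain over all inputs in parallel, which a thread pool mimics here.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for the LLM-backed map_chain; a real chain would call the model.
def extract_key_points(doc: str) -> str:
    return "Key points of: " + doc

def map_batch(docs):
    # batch() in LangChain executes the chain over the inputs concurrently;
    # a thread pool gives the same shape with the stub above.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(extract_key_points, docs))

chunks = ["chunk one text", "chunk two text", "chunk three text"]
key_points = map_batch(chunks)  # one result per input chunk
```

Because the chunks are independent, this phase parallelizes well, which is the main speed advantage of map-reduce over map-refine.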
Reduce
In the reduce phase, the key content produced in the map phase is combined into one final summary.
Generate Reduce Chain.
Below is an example of streaming output using Reduce Chain.
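The reduce step can also be sketched without the original code. The merge_summaries function is a stand-in for the reduce chain; the commented line shows where a real chain would stream tokens.

```python
# Stand-in for the reduce chain's prompt construction and LLM call.
def merge_summaries(summaries):
    # Build one prompt that asks the model to fuse the chunk summaries.
    joined = "\n".join("- " + s for s in summaries)
    return (
        "Combine the following chunk summaries into one final summary "
        "in Korean:\n" + joined
    )

chunk_summaries = ["summary of chunk 1", "summary of chunk 2"]
reduce_prompt = merge_summaries(chunk_summaries)
# With a real chain you would stream the answer token by token, e.g.:
# for token in reduce_chain.stream({"summaries": chunk_summaries}):
#     print(token, end="", flush=True)
```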
Map-Refine
The Map-refine method is another approach for document summarization, similar to map-reduce, but with some differences.
Map step: divide the document into multiple small chunks and create a summary for each chunk individually.
Refine step: process the generated summaries sequentially and gradually improve the final summary. At each step, the summary is updated by combining the previous summary with information from the new chunk.
Repeat: repeat the refine step until all chunks are processed.
Final summary: the summary obtained after processing the last chunk is the final result.
The advantage of map-refine is that you can gradually improve the summary while preserving the order of the document, which is especially useful when the document's context matters. However, unlike map-reduce, the refine step is sequential, so it is hard to parallelize and can take longer on large documents.
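The sequential refine loop described above can be sketched as follows. The refine function is a stand-in for the LLM call that folds new information into the running summary.

```python
# Stand-in for the LLM call that refines a summary with new information.
def refine(previous_summary: str, new_chunk_summary: str) -> str:
    if not previous_summary:
        return new_chunk_summary
    return previous_summary + " / " + new_chunk_summary

def map_refine(chunk_summaries):
    # Sequentially fold each chunk summary into the running summary.
    # Each iteration depends on the previous one, so this cannot be
    # parallelized the way the map step can.
    summary = ""
    for chunk_summary in chunk_summaries:
        summary = refine(summary, chunk_summary)
    return summary

final = map_refine(["intro summary", "body summary", "conclusion summary"])
```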
So rather than simply appending output, we want each streamed update for the next chunk to overwrite the previous one on screen, which requires carriage-return printing.
Note that the output is streamed as partial JSON: each streamed chunk is a list of the same JSON dicts, with new content appended at the end.
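The carriage-return printing technique can be shown with a small sketch: each update is written over the previous one with "\r" instead of on a new line.

```python
import sys

def stream_overwrite(updates):
    # Print each update over the previous one using a carriage return,
    # so the line appears to be refined in place rather than appended.
    for text in updates:
        sys.stdout.write("\r" + text)
        sys.stdout.flush()
    sys.stdout.write("\n")
    return updates[-1]  # the text left on screen at the end

last = stream_overwrite([
    "summary v1",
    "summary v1 + chunk 2",
    "summary v1 + chunk 2 + chunk 3",
])
```

Note this works cleanly because each update is at least as long as the previous one; otherwise leftover characters from the prior line would remain visible.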
Check the data to summarize.
Chain of Density
The "Chain of Density" (CoD) prompt is a technique developed to improve summary generation with GPT-4.
Paper: https://arxiv.org/pdf/2309.04269
The method first creates a summary with few entities, then repeatedly folds in missing important entities without increasing the length. Studies have shown that CoD-generated summaries are more abstract than those from plain prompts, fuse information better, and have a density close to human-written summaries.
Progressive improvement: CoD first generates a simple summary with few entities, then adds important entities step by step. Because the summary length is held constant, the information density rises, producing a summary that is both readable and informative.
Balance of information density and readability: CoD tunes the information density of the summary to find the best balance between informativeness and readability. Studies have shown that people prefer CoD summaries that are denser than typical GPT-4 summaries, but not as dense as human-written ones.
Better abstraction and information fusion: CoD-generated summaries are more abstract, fuse information well, and show less lead bias (less reliance on the beginning of the original text). This improves the overall quality and readability of the summary.
Chain of Density prompt input parameters:
content_category: type of content (e.g. article, video transcript, blog post, research paper). Default: Article
content: the content to summarize
entity_range: how many entities to select from the content and add to the summary per round. Default: 1-3
max_words: maximum number of words per summary. Default: 80
iterations: number of entity-densification rounds. The total number of summaries produced is iterations + 1. For an 80-word summary, 3 iterations is ideal; for longer summaries, 4-5 rounds, and widening entity_range to e.g. 1-4, can also help. Default: 3
The code below uses the Chain of Density prompt to build a chain that generates a text summary. The first chain shows the intermediate results, and the second chain extracts only the final summary.
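The CoD iteration logic can be sketched in plain Python. Both helper functions are stand-ins for the two LLM sub-tasks in each round (entity selection and length-constrained rewriting).

```python
# Stand-ins for the two LLM sub-tasks in each Chain of Density round:
def find_missing_entities(content, summary, entity_range=(1, 3)):
    # Pick up to entity_range[1] content words not yet in the summary.
    candidates = [w for w in content.split() if w not in summary]
    return candidates[: entity_range[1]]

def rewrite_with_entities(summary, entities, max_words=80):
    # Fold the new entities in without exceeding the word budget.
    words = (summary + " " + " ".join(entities)).split()
    return " ".join(words[:max_words])

def chain_of_density(content, iterations=3):
    summaries = ["(sparse initial summary)"]
    for _ in range(iterations):
        entities = find_missing_entities(content, summaries[-1])
        summaries.append(rewrite_with_entities(summaries[-1], entities))
    return summaries  # iterations + 1 summaries in total

result = chain_of_density(
    "Solar flares disrupt radio communication worldwide", iterations=3
)
```

This mirrors the parameter semantics above: iterations rounds yield iterations + 1 summaries, each denser than the last while the word budget stays fixed.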
Now weave the steps so far into a single chain.
Below is an example that creates map_reduce_chain.
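The composition can be sketched as a plain function that feeds the map step's output into the reduce step; both sub-chains are stubs standing in for the LLM-backed chains built earlier.

```python
# Stand-ins for the two sub-chains built earlier.
def map_chain(docs):
    return ["key points of " + d for d in docs]

def reduce_chain(summaries):
    return "final summary <- " + "; ".join(summaries)

def map_reduce_chain(docs):
    # Composing the two steps yields the full map-reduce pipeline.
    return reduce_chain(map_chain(docs))

out = map_reduce_chain(["doc A", "doc B"])
```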
Refine
The refine phase sequentially processes the chunks produced in the preceding map phase and gradually improves the final summary.
Output a summary of the first document.
Create the map_chain.
Clustering-Map-Refine
gkamradt, the original author of this tutorial, made an interesting suggestion for summarizing long documents.
The background is as follows:
Both the map-reduce and map-refine methods are time-consuming and expensive.
So he suggests splitting the documents into N clusters, treating the document closest to each cluster's centroid as the representative of that cluster, and summarizing only those representatives with map-reduce (or map-refine).
In practice the cost is reasonable and the results are satisfactory, so here we adapt and share the original author's tutorial code.
Running the code below combines the text into one document. The purpose of combining is to avoid the text being split along page boundaries.
The combined text is about 28K characters.
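The combining step can be sketched as follows; the pages list is a placeholder for the per-page texts loaded earlier.

```python
# Placeholder for the per-page texts loaded earlier.
pages = ["Page one text.", "Page two text.", "Page three text."]

# Join everything into a single string so that later splitting is not
# constrained by page boundaries.
full_text = "\n".join(pages)
num_chars = len(full_text)  # in the tutorial this comes out to about 28K
```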
Use RecursiveCharacterTextSplitter to split the single text into multiple documents.
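RecursiveCharacterTextSplitter is a LangChain class; as a rough illustration of what it does, here is a simplified fixed-size splitter with overlap (the real class additionally tries to split on separators such as paragraphs and sentences first).

```python
def split_text(text, chunk_size=100, chunk_overlap=20):
    # Simplified sketch of a character splitter: fixed-size windows that
    # overlap so context is not lost at chunk boundaries.
    chunks, start = [], 0
    step = chunk_size - chunk_overlap
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

docs = split_text("x" * 250, chunk_size=100, chunk_overlap=20)
```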
Check the number of split documents. Here the text was split into 79 documents.
Embed documents using the Upstage Embeddings model.
Cluster the 79 documents into 10 clusters. Here, KMeans is used for the clustering.
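The clustering step can be sketched with scikit-learn's KMeans; random vectors stand in for the real Upstage document embeddings.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy embeddings standing in for the real Upstage document embeddings.
rng = np.random.default_rng(42)
vectors = rng.normal(size=(79, 8))  # 79 documents, 8-dimensional vectors

# Fit KMeans with 10 clusters; labels_ assigns each document a cluster id.
kmeans = KMeans(n_clusters=10, random_state=42, n_init=10).fit(vectors)
labels = kmeans.labels_
```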
Check the labeled results.
Next, find and save the embedding closest to the centroid of each cluster.
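The centroid-nearest selection can be sketched with NumPy alone; the tiny 1-D vectors below are illustrative stand-ins for real embeddings and cluster labels.

```python
import numpy as np

def closest_to_centroids(vectors, labels, n_clusters):
    # For each cluster, pick the index of the document whose embedding
    # is nearest to the cluster's mean (its centroid).
    closest = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        centroid = vectors[idx].mean(axis=0)
        dists = np.linalg.norm(vectors[idx] - centroid, axis=1)
        closest.append(int(idx[np.argmin(dists)]))
    # Sort ascending so the representatives are summarized in document order.
    return sorted(closest)

# Toy data: two clusters of 1-D "embeddings".
vectors = np.array([[0.0], [0.1], [0.5], [5.0], [5.2], [5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
selected = closest_to_centroids(vectors, labels, 2)
```

The final sorted() call also covers the ascending sort mentioned in the next step.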
Sort the indices in ascending order so the documents are summarized in order.
Output the 10 selected documents. In this step, create documents using Document objects.