CH07 Text Splitting (Text Splitter)
Splitting documents is the second stage of a Retrieval-Augmented Generation (RAG) pipeline. It is an important step that turns the loaded documents into a form the system can use effectively.
The goal of this stage is to break large, complex documents into small chunks that an LLM can handle efficiently, so that when a user later asks a question, only the most relevant pieces of information are selected and passed to the model.
(Example) How much did Google invest in Anthropic?
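To make this concrete, here is a minimal sketch. The toy document text, variable names, and the keyword filter are all invented for illustration; a real RAG system would score chunks against the question with embeddings, which is covered in a later stage.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# A toy document; the wording, including the investment sentence, is invented.
document = (
    "Anthropic is an AI safety company founded in 2021. "
    "It develops the Claude family of large language models. "
    "Google invested an undisclosed amount in Anthropic. "
    "The company is headquartered in San Francisco."
)

# Split into deliberately small chunks so each one covers roughly one fact.
splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=20)
chunks = splitter.split_text(document)

# Stand-in for retrieval: keep only chunks that mention the question's terms.
keywords = {"google", "invest"}
relevant = [c for c in chunks if any(k in c.lower() for k in keywords)]
print(relevant)  # only the chunk(s) about Google's investment reach the LLM
```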
The need for splitting
Pinpoint information retrieval (accuracy): Splitting documents into small units makes it possible to fetch only the chunks relevant to the question (query). Because each chunk focuses on a specific topic or passage, the retrieved context is more likely to be relevant.
Resource optimization (efficiency): Feeding an entire document into an LLM is expensive, and padding the prompt with loosely related excerpts from many sources makes it harder for the model to answer well; such problems can also lead to hallucination. Splitting therefore also serves to extract only the information actually needed to answer.
The document splitting process
Identify the document structure: Identify the structure of each document type, such as PDF files, web pages, and e-books. This may include locating headers, footers, page numbers, section titles, and so on.
Select the splitting unit: Decide which unit to split the document into. Depending on the content and purpose of the document, this can be by page, by section, or by paragraph (a structure-aware example follows this list).
Select the unit size (chunk size): Decide how many tokens (or characters) each chunk should contain.
Chunk overlap: It is common to let adjacent chunks overlap slightly so that context carries over across the split boundaries.
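As an example of the structure-identification and unit-selection steps, the sketch below uses MarkdownHeaderTextSplitter from langchain_text_splitters to split a Markdown document at its section headers, so each chunk corresponds to one section. The sample document is invented.

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

# A small Markdown document with a clear section structure (invented sample).
markdown_document = """# Annual Report

## Investments
Details about this year's investments.

## Products
Details about this year's products.
"""

# Split at level-1 and level-2 headers; the header text becomes chunk metadata.
headers_to_split_on = [("#", "Header 1"), ("##", "Header 2")]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

for section in splitter.split_text(markdown_document):
    print(section.metadata, "->", section.page_content)
```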
Chunk size & chunk overlap
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Step 2: Split the documents
# (docs is the list of Documents produced by the loader in the previous step.)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
splits = text_splitter.split_documents(docs)
```
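To see what chunk_size and chunk_overlap actually do, the short sketch below (sample text invented) splits a string with deliberately small values and prints the resulting chunks; the repeated characters at each boundary show the overlap.

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = (
    "Retrieval-Augmented Generation combines a retriever with a generator. "
    "The retriever finds relevant chunks, and the generator writes the answer."
)

# Deliberately tiny values so the overlap is easy to see in the output.
splitter = RecursiveCharacterTextSplitter(chunk_size=50, chunk_overlap=15)

for i, chunk in enumerate(splitter.split_text(text)):
    print(f"[chunk {i}] {chunk!r}")
# Adjacent chunks share up to 15 characters, preserving local context
# across the split points.
```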
Chunk split visualization
A chunk visualization site created by Greg Kamradt lets you explore how different chunk size and overlap settings split a given text.