06. Markdown Header Text Splitting (MarkdownHeaderTextSplitter)
Understanding and efficiently handling the structure of a Markdown file matters when working with documents. Embedding text in a way that accounts for the document's overall context and structure helps produce vector representations that capture a wider range of meanings and topics.
For this reason, you sometimes want to divide the content by a specific part of the Markdown file, namely by header. For example, you may want to create chunks in which each chunk groups the related information found under one header. This approach makes use of the document's structural elements while keeping text that shares a context together.
MarkdownHeaderTextSplitter is a tool for this task. It splits a document according to a specified set of headers, so the content under each header group can be managed as a separate chunk. This lets you handle the content at a finer granularity while preserving the document's overall structure, which is useful in later processing steps.
%pip install -qU langchain-text-splitters

Use MarkdownHeaderTextSplitter to split Markdown-formatted text into header-based units.
- It splits the text based on the Markdown document's headers (#, ##, ###, and so on).
- The markdown_document variable is assigned a document in Markdown format.
- The headers_to_split_on list defines each Markdown header level and a name for that level as a tuple.
- A markdown_splitter object is created from the MarkdownHeaderTextSplitter class, with headers_to_split_on passed as the parameter that sets the header levels to split on.
- Calling the split_text method splits markdown_document according to header level.
from langchain_text_splitters import MarkdownHeaderTextSplitter
# Defines a document in Markdown format as a string.
markdown_document = "# Title\n\n## 1. SubTitle\n\nHi this is Jim\n\nHi this is Joe\n\n### 1-1. Sub-SubTitle \n\nHi this is Lance \n\n## 2. Baz\n\nHi this is Molly"
print(markdown_document)
By default, MarkdownHeaderTextSplitter removes the headers it splits on from the content of each output chunk.
You can disable this behavior by setting strip_headers=False.
Within each Markdown header group, you can then apply any text splitter you want.
First, use MarkdownHeaderTextSplitter to split the Markdown document by header.
Next, split the output of the preceding MarkdownHeaderTextSplitter step again with RecursiveCharacterTextSplitter.