06. Markdown Header Text Splitting (MarkdownHeaderTextSplitter)

Understanding a markdown file's structure and handling it efficiently matters a great deal when working with documents. In particular, embedding text in a way that accounts for the document's overall context and structure helps produce vector representations that capture a broader range of meanings and topics.

For that reason, you will sometimes want to divide the content along specific parts of the markdown file, namely its headers. For example, you may want each 'chunk' to hold the related information that sits under a single header. This makes use of the document's structural elements while keeping text that shares a context together.

MarkdownHeaderTextSplitter is the tool for this. It splits a document according to a specified set of headers so that the content under each header group can be managed as a separate chunk. This lets you process the content at a finer granularity while preserving the document's overall structure.
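To make the idea concrete, here is a simplified pure-Python sketch of header-based grouping. It is not the library's implementation: the function name split_by_headers, the sample text, and the "Header N" metadata keys are all hypothetical and exist only for illustration.

```python
def split_by_headers(text, headers):
    """Group the lines of a markdown string by the headers above them.

    `headers` maps a marker such as "##" to a metadata key such as "Header 2".
    Returns a list of (metadata, content) pairs.
    """
    # Check longer markers first so "###" is never mistaken for "#".
    markers = sorted(headers, key=len, reverse=True)
    active = {}  # header level (number of '#') -> (metadata key, header text)
    chunks = []
    body = []

    def flush():
        # Emit the collected body lines together with the headers above them.
        if body:
            metadata = {key: title for _, (key, title) in sorted(active.items())}
            chunks.append((metadata, "\n".join(body)))
            body.clear()

    for raw in text.split("\n"):
        line = raw.strip()
        marker = next(
            (m for m in markers if line == m or line.startswith(m + " ")), None
        )
        if marker is None:
            if line:
                body.append(line)
            continue
        flush()
        level = len(marker)
        # A new header closes every header at the same or a deeper level.
        for lvl in [l for l in active if l >= level]:
            del active[lvl]
        active[level] = (headers[marker], line[len(marker):].strip())
    flush()
    return chunks


sample = "# Title\n\n## A\n\nfirst\n\n### A-1\n\nsecond\n\n## B\n\nthird"
chunks = split_by_headers(
    sample, {"#": "Header 1", "##": "Header 2", "###": "Header 3"}
)
for metadata, content in chunks:
    print(metadata, "->", content)
```

Each chunk carries the chain of headers it sits under, which is the property that makes header-based chunks useful as context for embeddings.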

%pip install -qU langchain-text-splitters

Use MarkdownHeaderTextSplitter to split markdown text into header-based units.

  • The text is split on markdown headers (#, ##, ###, and so on).

  • The markdown_document variable holds the document as a markdown string.

  • The headers_to_split_on list defines each markdown header level and a name for that level as a tuple.

  • A markdown_splitter object is created from the MarkdownHeaderTextSplitter class, with headers_to_split_on passed as the parameter that sets which header levels to split on.

  • Calling the split_text method splits markdown_document by header level.

from langchain_text_splitters import MarkdownHeaderTextSplitter

# Defines a document in Markdown format as a string.
markdown_document = "# Title\n\n## 1. SubTitle\n\nHi this is Jim\n\nHi this is Joe\n\n### 1-1. Sub-SubTitle \n\nHi this is Lance \n\n## 2. Baz\n\nHi this is Molly"
print(markdown_document)

By default, MarkdownHeaderTextSplitter removes the headers it splits on from the content of the output chunks.

You can disable this behavior by setting strip_headers=False.

Within each markdown group, you can then apply any text splitter you like.

First, use MarkdownHeaderTextSplitter to split the markdown document by header.

Then split the results from MarkdownHeaderTextSplitter again with RecursiveCharacterTextSplitter.
