06. Markdown Header Text Splitting (MarkdownHeaderTextSplitter)

Understanding the structure of a Markdown file and handling it efficiently can be very important when working with documents. In particular, embedding text in a way that takes the document's overall context and structure into account helps produce vector representations that capture a broader range of meanings and topics.

In this context, there are times when you want to divide the content by a specific part of the Markdown file, namely by header. For example, you may want to create 'chunks' of related information based on what falls under each header in a document. This approach makes effective use of the document's structural elements while keeping text that shares a common context together.

To address this, you can use the MarkdownHeaderTextSplitter tool. It splits a document according to a specified set of headers, so the content under each header group can be managed as a separate chunk. This lets you handle the content in finer detail while preserving the document's overall structure, which can be useful in various downstream processing steps.
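The idea of header-based splitting can be sketched in plain Python. This is only a minimal illustration under simplifying assumptions: the function name split_by_headers is hypothetical, fenced code blocks and other Markdown edge cases are ignored, and it is not how the library itself is implemented.

```python
from typing import Dict, List, Tuple


def split_by_headers(
    text: str, headers_to_split_on: List[Tuple[str, str]]
) -> List[Tuple[str, Dict[str, str]]]:
    """Illustrative only: group body lines under the headers that precede them."""
    # Depth of each named level, taken from the length of its marker ("##" -> 2).
    levels = {name: len(marker) for marker, name in headers_to_split_on}
    chunks: List[Tuple[str, Dict[str, str]]] = []
    active: Dict[str, str] = {}  # header name -> title currently in effect
    body: List[str] = []

    def flush() -> None:
        # Emit the collected body lines together with a copy of the active headers.
        content = "\n".join(body).strip()
        if content:
            chunks.append((content, dict(active)))
        body.clear()

    for line in text.splitlines():
        stripped = line.strip()
        for marker, name in headers_to_split_on:
            # The trailing space keeps "#" from matching a "##" line.
            if stripped.startswith(marker + " "):
                flush()
                # A new header closes every header at the same or a deeper level.
                for other in [n for n in active if levels[n] >= len(marker)]:
                    del active[other]
                active[name] = stripped[len(marker):].strip()
                break
        else:
            # Not a header we split on: keep non-empty lines as chunk content.
            if stripped:
                body.append(stripped)
    flush()
    return chunks


sample = "# Title\n\n## 1. SubTitle\n\nHi this is Jim\n\n## 2. Baz\n\nHi this is Molly"
for content, metadata in split_by_headers(sample, [("#", "Header 1"), ("##", "Header 2")]):
    print(content, metadata)
```

Each chunk carries the titles of the headers above it, which is the same kind of information MarkdownHeaderTextSplitter stores in a chunk's metadata.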


%pip install -qU langchain-text-splitters

Use MarkdownHeaderTextSplitter to split Markdown-formatted text into header units.

It splits the text based on the Markdown document's headers (#, ##, ###, etc.).
A Markdown-formatted document is assigned to the markdown_document variable.
The headers_to_split_on list defines each Markdown header level and the name of that level as tuples.
A markdown_splitter object is created from the MarkdownHeaderTextSplitter class, with headers_to_split_on passed as the parameter that sets the header levels to split on.
Calling the split_text method splits markdown_document according to header level.


from langchain_text_splitters import MarkdownHeaderTextSplitter

# Defines a document in Markdown format as a string.
markdown_document = "# Title\n\n## 1. SubTitle\n\nHi this is Jim\n\nHi this is Joe\n\n### 1-1. Sub-SubTitle \n\nHi this is Lance \n\n## 2. Baz\n\nHi this is Molly"
print(markdown_document)


# Title 

## 1. SubTitle

Hi this is Jim 

Hi this is Joe 

### 1-1. Sub-SubTitle  

Hi this is Lance  

## 2. Baz 

Hi this is Molly


headers_to_split_on = [  # Defines the header levels to split on and the name of each level.
    (
        "#",
        "Header 1",
    ),  # Header level 1 is denoted by '#' and is named 'Header 1'.
    (
        "##",
        "Header 2",
    ),  # Header level 2 is denoted by '##' and named 'Header 2'.
    (
        "###",
        "Header 3",
    ),  # Header level 3 is represented as '###' and has the name 'Header 3'.
]

# Creates a MarkdownHeaderTextSplitter object that splits text based on Markdown headers.
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# Splits markdown_document by header and stores the result in md_header_splits.
md_header_splits = markdown_splitter.split_text(markdown_document)
# Prints the split results.
for header in md_header_splits:
    print(f"{header.page_content}")
    print(f"{header.metadata}", end="\n=====================\n")


Hi this is Jim
Hi this is Joe
{'Header 1': 'Title', 'Header 2': '1. SubTitle'}
=====================
Hi this is Lance
{'Header 1': 'Title', 'Header 2': '1. SubTitle', 'Header 3': '1-1. Sub-SubTitle'}
=====================
Hi this is Molly
{'Header 1': 'Title', 'Header 2': '2. Baz'}
=====================

By default, MarkdownHeaderTextSplitter removes the headers it splits on from the content of the output chunks.

You can disable this by setting strip_headers=False.


markdown_splitter = MarkdownHeaderTextSplitter(
    # Specifies the header to split.
    headers_to_split_on=headers_to_split_on,
    # Sets the header not to be removed.
    strip_headers=False,
)
# Splits a Markdown document based on its header.
md_header_splits = markdown_splitter.split_text(markdown_document)
# Prints the split results.
for header in md_header_splits:
    print(f"{header.page_content}")
    print(f"{header.metadata}", end="\n=====================\n")


# Title
## 1. SubTitle
Hi this is Jim
Hi this is Joe
{'Header 1': 'Title', 'Header 2': '1. SubTitle'}
=====================
### 1-1. Sub-SubTitle
Hi this is Lance
{'Header 1': 'Title', 'Header 2': '1. SubTitle', 'Header 3': '1-1. Sub-SubTitle'}
=====================
## 2. Baz
Hi this is Molly
{'Header 1': 'Title', 'Header 2': '2. Baz'}
=====================

Within each Markdown header group, you can then apply whichever text splitter you want.


from langchain_text_splitters import RecursiveCharacterTextSplitter

markdown_document = "# Intro \n\n## History \n\nMarkdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\nMarkdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n## Rise and divergence \n\nAs Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n#### Standardization \n\nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n## Implementations \n\nImplementations of Markdown are available for over a dozen programming languages."
print(markdown_document)


# Intro  

## History  

Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]

Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.

## Rise and divergence

As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for

additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.

#### Standardization

From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.

## Implementations

Implementations of Markdown are available for over a dozen programming languages.

First, use MarkdownHeaderTextSplitter to split the Markdown document by header.


headers_to_split_on = [
    ("#", "Header 1"),  # Specifies the header level to split and the name of that level.
    ("##", "Header 2"),
]

# Splits the Markdown document by header level.
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)
# Prints the split results.
for header in md_header_splits:
    print(f"{header.page_content}")
    print(f"{header.metadata}", end="\n=====================\n")


# Intro
## History
Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]
Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.
{'Header 1': 'Intro', 'Header 2': 'History'}
=====================
## Rise and divergence
As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for
additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.
#### Standardization
From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.
{'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}
=====================
## Implementations
Implementations of Markdown are available for over a dozen programming languages.
{'Header 1': 'Intro', 'Header 2': 'Implementations'}
=====================

Next, split the result of the previous MarkdownHeaderTextSplitter step again with RecursiveCharacterTextSplitter.


chunk_size = 200  # Specifies the size of the split chunks.
chunk_overlap = 20  # Specifies the number of overlapping characters between split chunks.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Splits the document into character units.
splits = text_splitter.split_documents(md_header_splits)
# Prints the split results.
for header in splits:
    print(f"{header.page_content}")
    print(f"{header.metadata}", end="\n=====================\n")


# Intro
## History
{'Header 1': 'Intro', 'Header 2': 'History'}
=====================
Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its
{'Header 1': 'Intro', 'Header 2': 'History'}
=====================
readers in its source code form.[9]
{'Header 1': 'Intro', 'Header 2': 'History'}
=====================
Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.
{'Header 1': 'Intro', 'Header 2': 'History'}
=====================
## Rise and divergence
As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for
{'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}
=====================
additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.
#### Standardization
{'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}
=====================
From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.
{'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}
=====================
## Implementations
Implementations of Markdown are available for over a dozen programming languages.
{'Header 1': 'Intro', 'Header 2': 'Implementations'}
=====================
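Some of the smaller chunks above no longer contain their header text, although the header metadata is still attached to every chunk. One possible way to keep that context when embedding is to prepend the header path to the chunk text. The sketch below works on a plain string and dict standing in for a chunk's page_content and metadata; the helper name with_header_context and the bracketed prefix format are assumptions made for illustration, not part of the library.

```python
from typing import Dict


def with_header_context(page_content: str, metadata: Dict[str, str]) -> str:
    """Prefix a chunk with its header path, e.g. '[Intro > History]'."""
    # Sorting the keys puts 'Header 1' before 'Header 2'; this assumes level
    # names like the ones used in headers_to_split_on above.
    path = " > ".join(metadata[key] for key in sorted(metadata))
    return f"[{path}]\n{page_content}" if path else page_content


# Stand-in values mirroring one chunk from the output above.
example = with_header_context(
    "Markdown is widely used in blogging, instant messaging, online forums, "
    "collaborative software, documentation pages, and readme files.",
    {"Header 1": "Intro", "Header 2": "History"},
)
print(example)
```

The same helper could be applied to each chunk's page_content and metadata before the text is passed to an embedding model.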


Last updated 7 months ago