06. Markdown Header Text Split (MarkdownHeaderTextSplitter)
Understanding the structure of a Markdown file and handling it efficiently matters when working with documents. In particular, embedding text in a way that takes the document's overall context and structure into account helps produce vector representations that better capture its range of meanings and topics.
For this reason, you sometimes want to divide the content by a specific part of the Markdown file, namely by header. For example, you may want to create "chunks" of related information based on what sits under each header in a document. The goal is to use the document's structural elements while keeping text that shares a common context together.
MarkdownHeaderTextSplitter is a tool for this task. It splits a document according to a specified set of headers, so the content under each header group is managed as a separate chunk. This lets you handle the content at a finer granularity while preserving the document's overall structure, which is useful in later processing steps.
%pip install -qU langchain-text-splitters
MarkdownHeaderTextSplitter splits Markdown text into units based on its headers.
It splits the text at Markdown headers (#, ##, ###, and so on).
The markdown_document variable is assigned a document in Markdown format.
The headers_to_split_on list defines each Markdown header level and the name of that level as a tuple.
Create a markdown_splitter object from the MarkdownHeaderTextSplitter class, passing the header levels to split on through the headers_to_split_on parameter.
Call the split_text method to split markdown_document by header level.
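To make the mechanics concrete, here is a minimal pure-Python sketch of header-based splitting. It is not the library's implementation: split_by_headers and the dictionary layout of its result are hypothetical names chosen for illustration, and the sketch ignores details such as fenced code blocks inside the Markdown.

```python
import re


def split_by_headers(text, headers_to_split_on):
    """Split Markdown text into one chunk per header section.

    Illustrative stand-in for the idea behind header-based splitting;
    this is an assumption-laden sketch, not LangChain's code.
    """
    names = dict(headers_to_split_on)  # marker ("#", "##") -> metadata key
    chunks = []   # finished chunks: {"content": ..., "metadata": ...}
    active = {}   # header level (int) -> (metadata key, header text)
    lines = []    # body lines collected for the current section

    def flush():
        # Emit the collected body, tagged with every header currently in scope.
        body = "\n".join(lines).strip()
        if body:
            metadata = {key: title for _, (key, title) in sorted(active.items())}
            chunks.append({"content": body, "metadata": metadata})
        lines.clear()

    for raw in text.splitlines():
        line = raw.strip()
        match = re.match(r"^(#+)\s+(.*)$", line)
        if match and match.group(1) in names:
            flush()
            level = len(match.group(1))
            # A new header closes every header at the same or a deeper level.
            for lvl in [lvl for lvl in active if lvl >= level]:
                del active[lvl]
            active[level] = (names[match.group(1)], match.group(2).strip())
        elif line:
            # Headers that are not in headers_to_split_on stay in the body.
            lines.append(line)
    flush()
    return chunks


sample = "# Title\n\n## 1. SubTitle\n\nHi this is Jim\n\n## 2. Baz\n\nHi this is Molly"
for chunk in split_by_headers(sample, [("#", "Header 1"), ("##", "Header 2")]):
    print(chunk["content"], chunk["metadata"])
```

The key design point is that metadata is carried as a stack of open headers: a level-2 header replaces the previous level-2 header and everything nested under it, while the level-1 header stays in scope.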
from langchain_text_splitters import MarkdownHeaderTextSplitter
# Defines a document in Markdown format as a string.
markdown_document = "# Title\n\n## 1. SubTitle\n\nHi this is Jim\n\nHi this is Joe\n\n### 1-1. Sub-SubTitle \n\nHi this is Lance \n\n## 2. Baz\n\nHi this is Molly"
print(markdown_document)
By default, MarkdownHeaderTextSplitter removes the headers it splits on from the content of each output chunk.
You can disable this by setting strip_headers=False.
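When headers are stripped, their text is still available in each chunk's metadata. As a rough sketch of how that metadata relates to the removed lines, the helper below rebuilds header lines from a metadata dictionary and prepends them to the chunk text. restore_headers is a hypothetical helper written for this illustration; it is not part of langchain-text-splitters.

```python
def restore_headers(page_content, metadata, headers_to_split_on):
    """Prepend header lines rebuilt from metadata to a stripped chunk.

    Hypothetical helper for illustration only (an assumption, not a
    library function).
    """
    header_lines = [
        f"{marker} {metadata[name]}"
        for marker, name in headers_to_split_on
        if name in metadata  # skip levels this chunk does not sit under
    ]
    return "\n".join(header_lines + [page_content])


headers_to_split_on = [("#", "Header 1"), ("##", "Header 2"), ("###", "Header 3")]
metadata = {
    "Header 1": "Title",
    "Header 2": "1. SubTitle",
    "Header 3": "1-1. Sub-SubTitle",
}
print(restore_headers("Hi this is Lance", metadata, headers_to_split_on))
```

This rebuilds the full header path for the chunk. That differs from strip_headers=False, which keeps only the header lines that physically fall inside the chunk, so the two approaches can produce different text for nested sections.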
Within each Markdown header group, you can then apply any text splitter you want.
First, use MarkdownHeaderTextSplitter to split the Markdown document by header.
Then split the MarkdownHeaderTextSplitter results again with RecursiveCharacterTextSplitter.
# Title
## 1. SubTitle
Hi this is Jim
Hi this is Joe
### 1-1. Sub-SubTitle
Hi this is Lance
## 2. Baz
Hi this is Molly
# Defines the header levels by which the document will be divided and the names of those levels.
headers_to_split_on = [
    ("#", "Header 1"),  # Header level 1 is denoted by '#' and is named 'Header 1'.
    ("##", "Header 2"),  # Header level 2 is denoted by '##' and is named 'Header 2'.
    ("###", "Header 3"),  # Header level 3 is denoted by '###' and is named 'Header 3'.
]
# Creates a MarkdownHeaderTextSplitter object that splits text based on Markdown headers.
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
# Splits markdown_document by header and saves the result to md_header_splits.
md_header_splits = markdown_splitter.split_text(markdown_document)
# Prints the split results.
for header in md_header_splits:
    print(f"{header.page_content}")
    print(f"{header.metadata}", end="\n=====================\n")
Hi this is Jim
Hi this is Joe
{'Header 1': 'Title', 'Header 2': '1. SubTitle'}
=====================
Hi this is Lance
{'Header 1': 'Title', 'Header 2': '1. SubTitle', 'Header 3': '1-1. Sub-SubTitle'}
=====================
Hi this is Molly
{'Header 1': 'Title', 'Header 2': '2. Baz'}
=====================
markdown_splitter = MarkdownHeaderTextSplitter(
    # Specifies the headers to split on.
    headers_to_split_on=headers_to_split_on,
    # Keeps the headers in the chunk content instead of removing them.
    strip_headers=False,
)
# Splits the Markdown document based on its headers.
md_header_splits = markdown_splitter.split_text(markdown_document)
# Prints the split results.
for header in md_header_splits:
    print(f"{header.page_content}")
    print(f"{header.metadata}", end="\n=====================\n")
# Title
## 1. SubTitle
Hi this is Jim
Hi this is Joe
{'Header 1': 'Title', 'Header 2': '1. SubTitle'}
=====================
### 1-1. Sub-SubTitle
Hi this is Lance
{'Header 1': 'Title', 'Header 2': '1. SubTitle', 'Header 3': '1-1. Sub-SubTitle'}
=====================
## 2. Baz
Hi this is Molly
{'Header 1': 'Title', 'Header 2': '2. Baz'}
=====================
from langchain_text_splitters import RecursiveCharacterTextSplitter
markdown_document = "# Intro \n\n## History \n\nMarkdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\nMarkdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n## Rise and divergence \n\nAs Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n#### Standardization \n\nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n## Implementations \n\nImplementations of Markdown are available for over a dozen programming languages."
print(markdown_document)
# Intro
## History
Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]
Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.
## Rise and divergence
As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for
additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.
#### Standardization
From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.
## Implementations
Implementations of Markdown are available for over a dozen programming languages.
headers_to_split_on = [
    ("#", "Header 1"),  # Specifies the header level to split on and the name of that level.
    ("##", "Header 2"),
]
# Splits the Markdown document by header level.
markdown_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=headers_to_split_on, strip_headers=False
)
md_header_splits = markdown_splitter.split_text(markdown_document)
# Prints the split results.
for header in md_header_splits:
    print(f"{header.page_content}")
    print(f"{header.metadata}", end="\n=====================\n")
# Intro
## History
Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]
Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.
{'Header 1': 'Intro', 'Header 2': 'History'}
=====================
## Rise and divergence
As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for
additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.
#### Standardization
From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.
{'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}
=====================
## Implementations
Implementations of Markdown are available for over a dozen programming languages.
{'Header 1': 'Intro', 'Header 2': 'Implementations'}
=====================
chunk_size = 200  # Specifies the maximum size of each split chunk, in characters.
chunk_overlap = 20  # Specifies the number of overlapping characters between adjacent chunks.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)
# Splits the header-based chunks further by character count.
splits = text_splitter.split_documents(md_header_splits)
# Prints the split results.
for header in splits:
    print(f"{header.page_content}")
    print(f"{header.metadata}", end="\n=====================\n")
# Intro
## History
{'Header 1': 'Intro', 'Header 2': 'History'}
=====================
Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its
{'Header 1': 'Intro', 'Header 2': 'History'}
=====================
readers in its source code form.[9]
{'Header 1': 'Intro', 'Header 2': 'History'}
=====================
Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.
{'Header 1': 'Intro', 'Header 2': 'History'}
=====================
## Rise and divergence
As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for
{'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}
=====================
additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.
#### Standardization
{'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}
=====================
From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.
{'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}
=====================
## Implementations
Implementations of Markdown are available for over a dozen programming languages.
{'Header 1': 'Intro', 'Header 2': 'Implementations'}
=====================
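The two-stage split above relies on every smaller piece inheriting the header metadata of the chunk it came from. The sketch below illustrates that idea with plain dictionaries and fixed-size character windows. It is a deliberate simplification: RecursiveCharacterTextSplitter tries to break on separators such as paragraph and line boundaries before cutting mid-text, which this sketch does not attempt. split_with_overlap and the dictionary layout are hypothetical names used only for this example.

```python
def split_with_overlap(docs, chunk_size, chunk_overlap):
    """Cut each document into fixed-size character windows with overlap.

    Simplified illustration (an assumption, not the library's algorithm):
    every piece receives a copy of its parent's metadata.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    pieces = []
    for doc in docs:
        text = doc["content"]
        if not text:
            continue
        start = 0
        while True:
            pieces.append(
                {
                    "content": text[start:start + chunk_size],
                    "metadata": dict(doc["metadata"]),  # copy, do not share
                }
            )
            if start + chunk_size >= len(text):
                break  # this window reached the end of the text
            start += step
    return pieces


docs = [{"content": "abcdefghij", "metadata": {"Header 1": "Intro"}}]
for piece in split_with_overlap(docs, chunk_size=4, chunk_overlap=1):
    print(piece["content"], piece["metadata"])
```

Copying the metadata for each piece, rather than sharing one dictionary, keeps later edits to one chunk's metadata from leaking into its siblings.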