02. Recursive character text split (RecursiveCharacterTextSplitter)

RecursiveCharacterTextSplitter

This text splitter is the recommended choice for general text.

It takes a list of separator characters as a parameter.

The splitter tries each separator in the given order, splitting the text until the chunks are small enough.

The default separator list is ["\n\n", "\n", " ", ""].

In other words, the text is split recursively in the order paragraph -> sentence -> word.

This keeps paragraphs (then sentences, then words) together for as long as possible, since those tend to be the most strongly related pieces of text.

How the text is split: by the character list ["\n\n", "\n", " ", ""].
How the chunk size is measured: by the number of characters.
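The separator order described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the name recursive_split is hypothetical, pieces are greedily re-joined with the separator they were split on, separators at chunk boundaries are dropped, and chunk_overlap is ignored.

```python
DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]


def recursive_split(text, separators, chunk_size):
    """Split text on the first separator, recursing with finer separators when needed."""
    if len(text) <= chunk_size:
        return [text] if text else []

    sep = separators[0] if separators else ""
    rest = separators[1:]

    # The empty separator is the last resort: cut at fixed character positions.
    if sep == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits, so keep it together with its neighbours.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, rest, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee"
chunks = recursive_split(sample, DEFAULT_SEPARATORS, chunk_size=7)
print(chunks)  # ['aaa bbb', 'ccc ddd', 'eee']
```

The first paragraph fits within 7 characters and stays whole. The second does not, so it falls through "\n" (not present) to " " and is split at word boundaries.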

%pip install -qU langchain-text-splitters

Open the appendix-keywords.txt file and read its contents.
Store what was read in the file variable.

# Open appendix-keywords.txt and create a file object called f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file contents and store them in the file variable.

Print part of the contents read from the file.

# Print the first 500 characters read from the file.
print(file[:500])

Semantic Search 

Definition: A semantic search is a search method that goes beyond a simple keyword match for a user's query and grasps its meaning and returns related results. 
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated keywords: natural language processing, search algorithms, data mining

Embedding 

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Associated Keywords: natural language processing, vectorization, deep learning 

Token 

Definition: A token is a smaller unit produced by splitting text. It is usually a word, sentence, or phrase.
Example: Split the sentence "I go to school" into "I", "to school", and "go". 
Associated Keyword: Tokenization, Natural Language

from langchain_text_splitters import RecursiveCharacterTextSplitter

This example uses RecursiveCharacterTextSplitter to split text into small chunks.

Set chunk_size to 250 to limit the size of each chunk.
Set chunk_overlap to 50 to allow up to 50 characters of overlap between adjacent chunks.
Set length_function to len so that text length is measured in characters.
Set is_separator_regex to False so that separators are treated as literal strings rather than regular expressions.
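To make the roles of chunk_size and chunk_overlap concrete, here is a minimal sketch using fixed-width windows. It is an illustration only: the helper name sliding_windows is hypothetical, and the real splitter cuts at separators, so its chunks are usually shorter than chunk_size and the actual overlap can be anywhere from 0 to chunk_overlap characters.

```python
def sliding_windows(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each window starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    # Stop once a window would contain nothing beyond the overlap.
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


alphabet = "abcdefghijklmnopqrstuvwxyz"
windows = sliding_windows(alphabet, chunk_size=10, chunk_overlap=4)
for window in windows:
    print(window)
# abcdefghij
# ghijklmnop
# mnopqrstuv
# stuvwxyz
```

Each window repeats the last 4 characters of the previous one, which is the context that chunk_overlap is meant to preserve across chunk boundaries.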

text_splitter = RecursiveCharacterTextSplitter(
    # Set the chunk size very small. This is for example purposes only.
    chunk_size=250,
    # Sets the number of overlapping characters between chunks.
    chunk_overlap=50,
    # Specifies a function to calculate the length of a string.
    length_function=len,
    # Sets whether to use regular expressions as delimiters.
    is_separator_regex=False,
)
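With length_function=len, a chunk_size of 250 means 250 Python characters, which is not the same as 250 bytes or 250 words. The short sketch below compares a few ways of measuring the same text. The word_count helper is a hypothetical alternative length function, shown only to illustrate that the measured size depends on the function chosen.

```python
term = "임베딩"  # Korean for "embedding"

char_length = len(term)                  # characters, as measured by len
byte_length = len(term.encode("utf-8"))  # bytes in UTF-8
print(char_length, byte_length)  # 3 9


def word_count(text):
    """Hypothetical length function: count whitespace-separated words."""
    return len(text.split())


sentence = "Embedding converts text into vectors"
print(len(sentence), word_count(sentence))  # 36 5
```

For text in languages such as Korean the gap between character count and byte count is large, so it helps to keep in mind which unit chunk_size refers to.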

Use text_splitter to split the file text into documents.
The split documents are stored in the texts list.
print(texts[0]) and print(texts[1]) output the first and second split documents.

# Use text_splitter to split the file text into documents.
texts = text_splitter.create_documents([file])
print(texts[0])  # Print the first split document.
print("===" * 20)
print(texts[1])  # Print the second split document.

page_content='Semantic Search\n\nDefinition: A semantic search is a search method that goes beyond simple keyword matching, grasps the meaning of the query, and returns related results.\nExample: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".\nAssociated keywords: natural language processing, search algorithms, data mining'
============================================================
page_content='Embedding\n\nDefinition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.\nExample: The word "apple" is expressed as a vector such as [0.65, -0.23, 0.17].\nAssociated keywords: natural language processing, vectorization, deep learning\n\nToken'

Use the text_splitter.split_text() function to split the file text.

# Splits text and returns the first two elements of the split text.
text_splitter.split_text(file)[:2]

['Semantic Search\n\nDefinition: A semantic search is a search method that goes beyond simple keyword matching, grasps the meaning of the query, and returns related results.\nExample: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".\nAssociated keywords: natural language processing, search algorithms, data mining',
 'Embedding\n\nDefinition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.\nExample: The word "apple" is expressed as a vector such as [0.65, -0.23, 0.17].\nAssociated keywords: natural language processing, vectorization, deep learning\n\nToken']
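Adjacent chunks do not always share text, because overlap is only carried over when it fits at a separator boundary. A small helper like the one below can be used to inspect a list of chunk strings. The name shared_overlap and the sample chunks are made up for illustration and are not output from the splitter.

```python
def shared_overlap(left, right):
    """Return the longest suffix of left that is also a prefix of right."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


# Made-up chunks standing in for the result of split_text().
sample_chunks = [
    "the quick brown fox",
    "brown fox jumps over",
    "jumps over the lazy dog",
]

for left, right in zip(sample_chunks, sample_chunks[1:]):
    overlap = shared_overlap(left, right)
    print(len(left), repr(overlap))
# 19 'brown fox'
# 20 'jumps over'
```

Running the same loop over a real list of chunks shows each chunk's length and how much text it shares with the next one.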


