02. Recursive character text splitting (RecursiveCharacterTextSplitter)
RecursiveCharacterTextSplitter
This text splitter is the recommended choice for general text.
It is parameterized by a list of characters.
It tries to split the text on each character in the list, in order, until the chunks are small enough.
The default character list is ["\n\n", "\n", " ", ""].
In other words, the text is split recursively in the order paragraph -> sentence -> word.
This keeps paragraphs (then sentences, then words) together for as long as possible, since those units tend to be the most strongly related pieces of text.
How the text is split: by the character list ["\n\n", "\n", " ", ""].
How the chunk size is measured: by the number of characters.
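The paragraph -> sentence -> word fallback described above can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function name recursive_split is hypothetical, and unlike the real splitter this sketch drops the separators and does not merge small pieces back together up to the chunk size.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified sketch: split on the first separator, then recurse with
    the remaining separators on any piece that is still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut the text at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(sep):
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


text = "First paragraph.\n\nSecond paragraph is a bit longer\nand has two lines."
chunks = recursive_split(text, ["\n\n", "\n", " ", ""], chunk_size=20)
for chunk in chunks:
    print(repr(chunk))
```

The first paragraph and the last line survive intact because they already fit within 20 characters, while the long middle line falls through to word-level splitting.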
%pip install -qU langchain-text-splitters
Open the appendix-keywords.txt file and read its contents.
Store the text that was read in the file variable.
# Open appendix-keywords.txt and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file contents and store them in the file variable.
Print part of the content read from the file.
An example of splitting text into small chunks using RecursiveCharacterTextSplitter.
Set chunk_size to 250 to limit the size of each chunk.
Set chunk_overlap to 50 to allow up to 50 characters of overlap between adjacent chunks.
Set length_function to len so that text length is measured in characters.
Set is_separator_regex to False so that separators are not treated as regular expressions.
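To make the relationship between chunk_size and chunk_overlap concrete, here is a minimal sketch using a fixed-size sliding window. The helper name sliding_chunks is hypothetical, and this is deliberately simpler than what the splitter does: as far as I understand, the real splitter builds overlap from whole separator-delimited pieces, so the actual overlap can be smaller than the configured value.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Illustration only: fixed-size windows where each window starts
    (chunk_size - chunk_overlap) characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last window already reached the end of the text.
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=4)
print(chunks)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last 4 characters of the previous one, which is the role chunk_overlap plays: context near a boundary appears in both neighboring chunks.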
Use text_splitter to split the file text into documents.
The split documents are stored in the texts list.
print(texts[0]) and print(texts[1]) print the first and second of the split documents.
Split the file text using the text_splitter.split_text() function.
# Prints some of the content read from the file.
print(file[:500])
Semantic Search
Definition: A semantic search is a search method that goes beyond a simple keyword match for a user's query and grasps its meaning and returns related results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated Keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17].
Associated Keywords: natural language processing, vectorization, deep learning
Token
Definition: A token is a smaller unit that text is split into. This is usually a word, sentence, or phrase.
Example: Split the sentence "I go to school" into "I", "to school", and "go".
Associated Keyword: Tokenization, Natural Language
from langchain_text_splitters import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    # Set the chunk size very small. This is for example purposes only.
    chunk_size=250,
    # Set the number of overlapping characters between chunks.
    chunk_overlap=50,
    # Specify the function used to calculate the length of a string.
    length_function=len,
    # Set whether separators are treated as regular expressions.
    is_separator_regex=False,
)
# Use text_splitter to split the file text into documents.
texts = text_splitter.create_documents([file])
print(texts[0]) # Prints the first document of the split document.
print("===" * 20)
print(texts[1]) # Prints the second document of a split document.
page_content='Semantic Search\n\nDefinition: A semantic search is a search method that goes beyond simple keyword matching, grasps the meaning of the query, and returns related results.\nExample: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".\nAssociated keywords: natural language processing, search algorithms, data mining\n\nEmbedding'
============================================================
page_content='Embedding\n\n Definition: Embedding is the process of converting text data such as words or sentences into a continuous vector of low dimensions. This allows the computer to understand and process the text.\n Example: Expresses the word "apple" in a vector such as [0.65, -0.23, 0.17].\nAssociation keyword: natural language processing, vectorization, deep learning\n\nToken'
# Splits text and returns the first two elements of the split text.
text_splitter.split_text(file)[:2]
['Semantic Search\n\nDefinition: A semantic search is a search method that goes beyond simple keyword matching, grasps the meaning of the query, and returns related results.\nExample: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".\nAssociated keywords: natural language processing, search algorithms, data mining\n\nEmbedding', 'Embedding\n\n Definition: Embedding is the process of converting text data such as words or sentences into a continuous vector of low dimensions. This allows the computer to understand and process the text.\n Example: Expresses the word "apple" in a vector such as [0.65, -0.23, 0.17].\nAssociation keyword: natural language processing, vectorization, deep learning\n\nToken']
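When inspecting split results like the ones above, it can help to check each chunk's length and how much text it shares with the previous chunk. The following is a small sketch in plain Python; the helper names shared_overlap and summarize_chunks are hypothetical and not part of langchain-text-splitters, and the sample list is a shortened stand-in for real output.

```python
def shared_overlap(left, right):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def summarize_chunks(chunks, chunk_size):
    """Report length, size-limit compliance, and overlap for each chunk."""
    report = []
    for i, chunk in enumerate(chunks):
        overlap = shared_overlap(chunks[i - 1], chunk) if i > 0 else 0
        report.append({
            "index": i,
            "length": len(chunk),
            "within_limit": len(chunk) <= chunk_size,
            "overlap_with_previous": overlap,
        })
    return report


# Shortened stand-in for the list returned by split_text.
sample = ["Semantic Search\n\nEmbedding", "Embedding\n\nToken"]
report = summarize_chunks(sample, chunk_size=250)
for row in report:
    print(row)
```

The same function should work on a list of strings such as the one returned by text_splitter.split_text(file); for documents produced by create_documents, pass the page_content of each document instead.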