03. Token Text Split (TokenTextSplitter)
Language models have a token limit, so the text you pass them must not exceed it.
TokenTextSplitter is useful when you need to create chunks based on the number of tokens in the text.
tiktoken
%pip install --upgrade --quiet langchain-text-splitters tiktoken
Open the ./data/appendix-keywords.txt file, read its contents, and save them to the file variable.
# Open data/appendix-keywords.txt and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the contents of the file and store them in the file variable.
Print part of the content read from the file.
print(file[:500])
Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user's query and returning relevant results.
Example: When a user searches for "planets of the solar system", it returns information about related planets such as "Jupiter" and "Mars".
Associated Keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17].
Associated Keywords: natural language processing, vectorization, deep learning
Token
Definition: A token is the result of splitting text into smaller units. These are usually words, sentences, or phrases.
Example: The sentence "I go to school" is split into "I", "go", and "to school".
Associated Keywords: Tokenization, Natural Language
Split text using CharacterTextSplitter
Initialize a tiktoken-encoder-based text splitter with the from_tiktoken_encoder method.
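A minimal sketch of this step, reusing the file variable loaded earlier; the chunk_size and chunk_overlap values are illustrative assumptions, not taken from the original:

from langchain_text_splitters import CharacterTextSplitter

# Initialize a splitter that counts and merges splits with the tiktoken encoder.
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=300,   # assumed maximum chunk size, in tokens
    chunk_overlap=0,  # assumed overlap between chunks
)
texts = text_splitter.split_text(file)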
Print the number of chunks produced by the split.
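A sketch continuing from the splitter above, where texts is the list returned by split_text:

print(len(texts))  # number of chunks produced
print(texts[0])    # preview the first chunk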
Reference
When you use CharacterTextSplitter.from_tiktoken_encoder, the text is split only by the CharacterTextSplitter, and the tiktoken tokenizer is used only to merge the splits. (This means a split can end up larger than the chunk size as measured by the tiktoken tokenizer.) Using RecursiveCharacterTextSplitter.from_tiktoken_encoder ensures that splits are no larger than the chunk size in tokens allowed by the language model; each split is recursively re-split if it is larger. You can also load a tiktoken splitter directly, which guarantees that each split is smaller than the chunk size.
TokenTextSplitter
Use the TokenTextSplitter class to split text into token units.
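A minimal sketch of TokenTextSplitter usage; the parameter values are assumptions:

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    chunk_size=200,   # assumed chunk size, in tokens
    chunk_overlap=0,  # assumed overlap
)
texts = text_splitter.split_text(file)
print(texts[0])  # preview the first chunk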
spaCy
spaCy is an open-source software library for advanced natural language processing, written in the Python and Cython programming languages.
Another alternative to NLTK is to use the spaCy tokenizer.
How the text is split: by the spaCy tokenizer.
How the chunk size is measured: by the number of characters.
Pip command to upgrade the spaCy library to the latest version.
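A sketch of the install step, following the pip style used earlier on this page:

%pip install --upgrade --quiet spacy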
Download the en_core_web_sm model.
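One common way to download the model (a sketch; the original command may have differed):

!python -m spacy download en_core_web_sm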
Open the appendix-keywords.txt file and read its contents.
Check the content by printing part of it.
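A sketch of the two steps above, reusing the path from the tiktoken section:

with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # store the file contents in the file variable

print(file[:350])  # preview part of the content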
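Create the spaCy-based splitter; the chunk_size value here is an illustrative assumption:

from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(
    chunk_size=200,  # assumed chunk size, in characters
)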
Split the file text using the split_text method of the text_splitter object.
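A sketch of the split step:

texts = text_splitter.split_text(file)
print(texts[0])  # preview the first chunk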
SentenceTransformers
SentenceTransformersTokenTextSplitter is a text splitter specialized for sentence-transformer models.
Its default behavior is to split the text into chunks that fit the token window of the sentence-transformer model you want to use.
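A sketch creating the splitter; chunk_overlap=0 mirrors the LangChain documentation example, but treat it as an assumption here:

from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)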
Check sample text.
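The original sample was lost in extraction; as a stand-in, assume the previously loaded file contents serve as the text to split (the text_to_split name is referenced below):

text_to_split = file  # assumption: reuse the earlier sample file as the text to split
print(text_to_split[:350])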
The following code counts the number of tokens in the file variable and prints the count after excluding the start and end tokens.
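A sketch using the splitter's count_tokens method; subtracting 2 accounts for the start and end tokens the model adds:

count_start_and_stop_tokens = 2  # start and end tokens added to each sequence
text_token_count = splitter.count_tokens(text=file) - count_start_and_stop_tokens
print(text_token_count)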
Split the text stored in the text_to_split variable into chunks using the splitter.split_text() function.
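A sketch of the split step:

text_chunks = splitter.split_text(text=text_to_split)
print(text_chunks[0])  # preview the first chunk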
NLTK
The Natural Language Toolkit (NLTK) is a collection of libraries and programs for English natural language processing (NLP), written in the Python programming language.
Instead of simply splitting on "\n\n", you can use NLTK to split text based on NLTK tokenizers.
How the text is split: by the NLTK tokenizer.
How the chunk size is measured: by the number of characters.
Pip command to install the nltk library. NLTK (Natural Language Toolkit) is a Python library for natural language processing.
It supports a variety of NLP tasks, including text preprocessing, tokenization, morphological analysis, and part-of-speech tagging.
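A sketch, following the pip style used earlier:

%pip install --upgrade --quiet nltk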
Check sample text
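A short sketch, assuming file still holds the sample text loaded earlier:

print(file[:350])  # preview part of the sample text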
Create a text splitter using the NLTKTextSplitter class. Setting the chunk_size parameter to 1000 splits the text into chunks of up to 1000 characters.
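A sketch; downloading punkt is an assumption about the environment (NLTK's sentence tokenizer requires it):

import nltk
nltk.download("punkt")  # sentence tokenizer data used by NLTKTextSplitter

from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(
    chunk_size=1000,  # split into chunks of up to 1000 characters
)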
Split the file text using the split_text method of the text_splitter object.
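A sketch of the split step:

texts = text_splitter.split_text(file)
print(texts[0])  # preview the first chunk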
KoNLPy
KoNLPy (Korean NLP in Python) is a Python package for Korean natural language processing (NLP).
Token splitting involves splitting text into smaller, more manageable units called tokens.
These tokens are often words, phrases, symbols, or other meaningful elements that are important for further processing and analysis.
In languages such as English, token splitting usually means separating words by spaces and punctuation marks.
How effective token splitting is depends heavily on the tokenizer's understanding of the language structure, which is what ensures meaningful tokens are produced.
Tokenizers designed for English lack the ability to understand the distinct semantic structures of other languages such as Korean, and therefore cannot be used effectively for Korean processing.
Splitting Korean tokens with KoNLPy's Kkma analyzer
For Korean text, KoNLPy includes a morpheme analyzer called Kkma (Korean Knowledge Morpheme Analyzer).
Kkma provides detailed morphological analysis of Korean text.
It decomposes sentences into words and words into their morphemes, and identifies the part of speech of each token.
It can also divide blocks of text into individual sentences, which is especially useful for processing long texts.
Considerations for use
Kkma is known for its detailed analysis, but note that this precision can affect processing speed. Kkma is therefore best suited for applications that prioritize analytical depth over fast text processing.
Pip command to install the KoNLPy library.
KoNLPy is a Python package for Korean natural language processing, providing features such as morphological analysis, part-of-speech tagging, and parsing.
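A sketch, following the pip style used earlier (note that KoNLPy's analyzers also require a Java runtime):

%pip install --upgrade --quiet konlpy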
Check sample text.
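The original Korean sample was lost in extraction; this inline stand-in is purely illustrative and is assigned to file so the steps below can refer to it:

# Stand-in for the lost sample: a short Korean text assigned to the file variable.
file = "안녕하세요. 이것은 한국어 텍스트 분할 예시입니다. KoNLPy는 한국어 자연어 처리를 위한 파이썬 패키지입니다."
print(file)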
This is an example of splitting Korean text using KonlpyTextSplitter.
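A minimal sketch creating the splitter with its default settings:

from langchain_text_splitters import KonlpyTextSplitter

text_splitter = KonlpyTextSplitter()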
Use text_splitter to split file into sentence units.
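A sketch of the split step:

texts = text_splitter.split_text(file)
print(texts[0])  # sentence units produced by the Kkma analyzer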
Hugging Face tokenizer
Hugging Face provides a variety of tokenizers.
In this code, we use GPT2TokenizerFast, one of Hugging Face's tokenizers, to calculate the token length of the text.
The text is split as follows:
It is split on the character unit that is passed in.
The chunk size is measured as follows:
By the number of tokens calculated by the Hugging Face tokenizer.
Create a tokenizer object using the GPT2TokenizerFast class. Call the from_pretrained method to load the pretrained "gpt2" tokenizer model.
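A sketch of this step; it assumes the transformers package is installed:

from transformers import GPT2TokenizerFast

# Load the pretrained "gpt2" tokenizer.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")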
Check the sample text.
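A short sketch; re-reading the sample file so file again holds the English sample text:

with open("./data/appendix-keywords.txt") as f:
    file = f.read()
print(file[:350])  # preview part of the sample text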
Initialize the text splitter with a Hugging Face tokenizer via the from_huggingface_tokenizer method.
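A sketch of this step; the chunk_size and chunk_overlap values are illustrative assumptions:

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer,
    chunk_size=300,   # assumed chunk size, measured in tokens
    chunk_overlap=0,  # assumed overlap
)
texts = text_splitter.split_text(file)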
Check the first element of the split result.
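A one-line sketch of this check:

print(texts[0])  # the first chunk of the split result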