03. Token Text Split (TokenTextSplitter)

Language models have a token limit, so your input should not exceed the number of tokens the model allows.

TokenTextSplitter is useful for generating chunks based on the number of tokens in the text.

tiktoken


%pip install --upgrade --quiet langchain-text-splitters tiktoken
  • Open the ./data/appendix-keywords.txt file and read its contents.

  • Save the file contents to a variable.


# Open data/appendix-keywords.txt and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file contents and store them in the file variable.

Print part of the contents read from the file.


# Print the first 500 characters of the file
print(file[:500])


Semantic Search 

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user's query and returning relevant results. 
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars". 
Associated keywords: natural language processing, search algorithms, data mining 

Embedding 

Definition: Embedding is the process of converting text data, such as words or sentences, into low-dimensional continuous vectors. This allows computers to understand and process the text. 
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17]. 
Associated keywords: natural language processing, vectorization, deep learning 

Token 

Definition: A token is a smaller unit obtained by splitting text. It is usually a word, sentence, or phrase. 
Example: The sentence "I go to school" is split into the tokens "I", "go", "to", and "school". 
Associated keywords: tokenization, natural language processing 

Splitting text with CharacterTextSplitter

  • Use the from_tiktoken_encoder method to initialize a tiktoken encoder-based text splitter, as shown below.

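A minimal sketch of the initialization (the encoding_name, chunk_size, and chunk_overlap values here are illustrative assumptions, not the tutorial's exact settings):

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # assumed encoding; match it to your model
    chunk_size=300,               # target chunk size, measured in tokens
    chunk_overlap=50,             # tokens shared between adjacent chunks
)
texts = text_splitter.split_text(file)  # split the text read earlier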

Print the number of chunks produced by the split.

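A sketch of the check:

print(len(texts))  # number of chunks produced
print(texts[0])    # preview the first chunk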

Reference

  • When CharacterTextSplitter.from_tiktoken_encoder is used, the text is split only by CharacterTextSplitter, and the tiktoken tokenizer is used to merge the splits. (This means a split can end up larger than the chunk size as measured by the tiktoken tokenizer.)

  • Using RecursiveCharacterTextSplitter.from_tiktoken_encoder keeps each split no larger than the chunk size in tokens allowed by the language model, and any split that is larger is split again recursively. You can also load a tiktoken splitter directly, which guarantees that each split is smaller than the chunk size.

TokenTextSplitter

  • Use the TokenTextSplitter class to split text into token units, as shown below.

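A minimal sketch, with illustrative parameter values:

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    chunk_size=300,   # assumed chunk size, in tokens
    chunk_overlap=0,  # no overlap between adjacent chunks
)
texts = text_splitter.split_text(file)
print(texts[0])  # preview the first chunk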

spaCy

spaCy is an open-source software library for advanced natural language processing, written in the Python and Cython programming languages.

An alternative to NLTK is to use the spaCy tokenizer.

  1. How the text is split: by the spaCy tokenizer.

  2. How the chunk size is measured: by the number of characters.

Pip command to upgrade the spaCy library to the latest version:

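%pip install --upgrade --quiet spacy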

Download the en_core_web_sm model:

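!python -m spacy download en_core_web_sm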

  • Open the appendix-keywords.txt file and read its contents.

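# As before: read the sample file into the file variable.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()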

Check the content by printing part of it.

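A sketch (the exact preview length isn't shown here, so 350 characters is an arbitrary choice):

print(file[:350])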

  • Split the text in the file variable using the split_text method of the text_splitter object, as shown in the sketch below.

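A minimal sketch assuming a SpacyTextSplitter; the chunk_size and pipeline values are illustrative:

from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(
    chunk_size=200,             # assumed maximum chunk size, in characters
    pipeline="en_core_web_sm",  # the spaCy model downloaded above
)
texts = text_splitter.split_text(file)  # split the file contents
print(texts[0])                         # preview the first chunk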

SentenceTransformers

SentenceTransformersTokenTextSplitter is a text splitter specialized for sentence-transformer models.

The default behavior is to split the text into chunks that fit the token window of the sentence-transformer model you want to use.

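A minimal sketch; chunk_overlap=0 is an assumed setting, and the class loads a default sentence-transformers model unless model_name is overridden:

from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)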

Check the sample text.

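print(file[:350])  # preview the sample text (length is an arbitrary choice)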

Next, count the number of tokens in the file variable, and print the count after excluding the start and stop tokens.

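A sketch of the count, assuming the model adds one start and one stop token:

count_start_and_stop_tokens = 2  # special tokens added at the start and end
text_token_count = splitter.count_tokens(text=file) - count_start_and_stop_tokens
print(text_token_count)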

Split the text stored in the text_to_split variable into chunks using the splitter.split_text() method.

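A sketch; text_to_split is assumed here to hold the file contents:

text_to_split = file
text_chunks = splitter.split_text(text=text_to_split)
print(text_chunks[0])  # preview the first chunk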

NLTK

The Natural Language Toolkit (NLTK) is a collection of libraries and programs for English natural language processing (NLP), written in the Python programming language.

Instead of simply splitting on "\n\n", you can use NLTK to split text based on NLTK tokenizers.

  1. How the text is split: by the NLTK tokenizer.

  2. How the chunk size is measured: by the number of characters.

  3. Pip command to install the nltk library (shown below).

  4. NLTK (Natural Language Toolkit) is a Python library for natural language processing.

  5. It supports a range of NLP tasks, including text preprocessing, tokenization, morphological analysis, and part-of-speech tagging.

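The install command, plus the punkt data that NLTK's sentence tokenizer needs (downloading punkt is an extra step assumed here, not shown in the original):

%pip install --upgrade --quiet nltk

import nltk
nltk.download("punkt")  # tokenizer data used for sentence splitting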

Check the sample text.

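print(file[:350])  # preview the sample text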

  • Create a text splitter using the NLTKTextSplitter class, as shown below.

  • Set the chunk_size parameter to 1000 so the text is split into chunks of up to 1000 characters.

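A minimal sketch of the setup described above:

from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(
    chunk_size=1000,  # chunks of up to 1000 characters
)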

Split the text in the file variable using the split_text method of the text_splitter object.

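texts = text_splitter.split_text(file)  # split the file contents
print(texts[0])                         # preview the first chunk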

KoNLPy

KoNLPy (Korean NLP in Python) is a Python package for Korean natural language processing (NLP).

Token splitting involves splitting text into smaller, more manageable units called tokens.

These tokens are often words, phrases, symbols, or other meaningful elements that are important for further processing and analysis.

In languages such as English, token splitting usually involves separating words by spaces and punctuation marks.

The effectiveness of token splitting depends heavily on the tokenizer's understanding of the language's structure, which is what ensures that meaningful tokens are produced.

Tokenizers designed for English lack the ability to understand the distinct semantic structures of other languages, such as Korean, and therefore cannot be used effectively for Korean processing.

Korean token splitting using KoNLPy's Kkma analyzer

For Korean text, KoNLPy includes a morphological analyzer called Kkma (Korean Knowledge Morpheme Analyzer).

Kkma provides detailed morphological analysis of Korean text.

It can decompose sentences into words and words into their respective morphemes, identifying the part of speech for each token.

It can segment blocks of text into individual sentences, which is particularly useful for processing long texts.

Considerations when using Kkma

Kkma is renowned for its detailed analysis, but it should be noted that this precision can affect processing speed. Kkma is therefore best suited for applications where analytical depth is prioritized over rapid text processing.

  • Pip command to install the KoNLPy library (shown below).

  • KoNLPy is a Python package for Korean natural language processing, providing features such as morphological analysis, part-of-speech tagging, and parsing.

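%pip install --upgrade --quiet konlpy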

Check the Korean sample text.

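A sketch assuming a Korean sample file; the path below is hypothetical, so adjust it to your own data:

with open("./data/appendix-keywords-CJK.txt") as f:  # hypothetical path
    korean_file = f.read()
print(korean_file[:350])  # preview the Korean sample text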

The following is an example of splitting Korean text using KonlpyTextSplitter.

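A minimal sketch:

from langchain_text_splitters import KonlpyTextSplitter

text_splitter = KonlpyTextSplitter()  # uses the Kkma analyzer internally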

Split the text into sentence units using text_splitter.

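texts = text_splitter.split_text(korean_file)  # korean_file from the sketch above
print(texts[0])  # preview the first sentence-based chunk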

Hugging Face tokenizer

Hugging Face provides a variety of tokenizers.

In this code, we calculate the token length of the text using GPT2TokenizerFast, one of the Hugging Face tokenizers.

The text splitting method is as follows:

  • The text is split on the characters passed in.

Here's how the chunk size is measured:

  • By the number of tokens calculated by the Hugging Face tokenizer.

  • Create a tokenizer object using the GPT2TokenizerFast class.

  • Load the pretrained "gpt2" tokenizer model by calling the from_pretrained method, as shown below.

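A minimal sketch of the two steps above:

from transformers import GPT2TokenizerFast

# Load the pretrained "gpt2" tokenizer
hf_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")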

Check the sample text.

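print(file[:350])  # preview the sample text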

Initialize the text splitter with the Hugging Face tokenizer via the from_huggingface_tokenizer method.

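A minimal sketch, assuming a CharacterTextSplitter as in the LangChain docs; the parameter values are illustrative:

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    hf_tokenizer,     # the Hugging Face tokenizer created above
    chunk_size=300,   # assumed chunk size, measured in tokens
    chunk_overlap=50,
)
texts = text_splitter.split_text(file)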

Check the first element of the split result.

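print(texts[0])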
