03. Token Text Split (TokenTextSplitter)

Language models have a token limit, so your input should not exceed the number of tokens the model allows.

TokenTextSplitter is useful for generating chunks based on the number of tokens in the text.

tiktoken


%pip install --upgrade --quiet langchain-text-splitters tiktoken
  • Open the ./data/appendix-keywords.txt file and read its contents.

  • Save the file contents to a variable.


# Open data/appendix-keywords.txt and create a file object named f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file contents and store them in the file variable.

Print part of the contents read from the file.


# Print the first 500 characters of the file
print(file[:500])


Semantic Search 

Definition: Semantic search is a search method that goes beyond simple keyword matching by understanding the meaning of the user's query and returning relevant results. 
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars". 
Associated keywords: natural language processing, search algorithms, data mining 

Embedding 

Definition: Embedding is the process of converting text data, such as words or sentences, into low-dimensional continuous vectors. This allows computers to understand and process the text. 
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17]. 
Associated keywords: natural language processing, vectorization, deep learning 

Token 

Definition: A token is a smaller unit obtained by splitting text. It is usually a word, sentence, or phrase. 
Example: The sentence "I go to school" is split into the tokens "I", "go", "to", and "school". 
Associated keywords: tokenization, natural language processing 

Splitting text with CharacterTextSplitter

  • Use the from_tiktoken_encoder method to initialize a tiktoken encoder-based text splitter, as shown below.

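A minimal sketch of the initialization (the encoding_name, chunk_size, and chunk_overlap values here are illustrative assumptions, not the tutorial's exact settings):

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # assumed encoding; match it to your model
    chunk_size=300,               # target chunk size, measured in tokens
    chunk_overlap=50,             # tokens shared between adjacent chunks
)
texts = text_splitter.split_text(file)  # split the text read earlier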

Print the number of chunks produced by the split.

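A sketch of the check:

print(len(texts))  # number of chunks produced
print(texts[0])    # preview the first chunk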

Reference

  • When CharacterTextSplitter.from_tiktoken_encoder is used, the text is split only by CharacterTextSplitter, and the tiktoken tokenizer is used to merge the splits. (This means a split can end up larger than the chunk size as measured by the tiktoken tokenizer.)

  • Using RecursiveCharacterTextSplitter.from_tiktoken_encoder keeps each split no larger than the chunk size in tokens allowed by the language model, and any split that is larger is split again recursively. You can also load a tiktoken splitter directly, which guarantees that each split is smaller than the chunk size.

TokenTextSplitter

  • Use the TokenTextSplitter class to split text into token units, as shown below.

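A minimal sketch, with illustrative parameter values:

from langchain_text_splitters import TokenTextSplitter

text_splitter = TokenTextSplitter(
    chunk_size=300,   # assumed chunk size, in tokens
    chunk_overlap=0,  # no overlap between adjacent chunks
)
texts = text_splitter.split_text(file)
print(texts[0])  # preview the first chunk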

spaCy

spaCy is an open-source software library for advanced natural language processing, written in the Python and Cython programming languages.

An alternative to NLTK is to use the spaCy tokenizer.

  1. How the text is split: by the spaCy tokenizer.

  2. How the chunk size is measured: by the number of characters.

Pip command to upgrade the spaCy library to the latest version:

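%pip install --upgrade --quiet spacy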

Download the en_core_web_sm model:

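!python -m spacy download en_core_web_sm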

  • Open the appendix-keywords.txt file and read its contents.

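# As before: read the sample file into the file variable.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()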

Check the content by printing part of it.

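A sketch (the exact preview length isn't shown here, so 350 characters is an arbitrary choice):

print(file[:350])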

  • Split the text in the file variable using the split_text method of the text_splitter object, as shown in the sketch below.

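A minimal sketch assuming a SpacyTextSplitter; the chunk_size and pipeline values are illustrative:

from langchain_text_splitters import SpacyTextSplitter

text_splitter = SpacyTextSplitter(
    chunk_size=200,             # assumed maximum chunk size, in characters
    pipeline="en_core_web_sm",  # the spaCy model downloaded above
)
texts = text_splitter.split_text(file)  # split the file contents
print(texts[0])                         # preview the first chunk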

SentenceTransformers

SentenceTransformersTokenTextSplitter is a text splitter specialized for sentence-transformer models.

The default behavior is to split the text into chunks that fit the token window of the sentence-transformer model you want to use.

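A minimal sketch; chunk_overlap=0 is an assumed setting, and the class loads a default sentence-transformers model unless model_name is overridden:

from langchain_text_splitters import SentenceTransformersTokenTextSplitter

splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0)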

Check the sample text.

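print(file[:350])  # preview the sample text (length is an arbitrary choice)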

Next, count the number of tokens in the file variable, and print the count after excluding the start and stop tokens.

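A sketch of the count, assuming the model adds one start and one stop token:

count_start_and_stop_tokens = 2  # special tokens added at the start and end
text_token_count = splitter.count_tokens(text=file) - count_start_and_stop_tokens
print(text_token_count)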

Split the text stored in the text_to_split variable into chunks using the splitter.split_text() method.

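A sketch; text_to_split is assumed here to hold the file contents:

text_to_split = file
text_chunks = splitter.split_text(text=text_to_split)
print(text_chunks[0])  # preview the first chunk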

NLTK

The Natural Language Toolkit (NLTK) is a collection of libraries and programs for English natural language processing (NLP), written in the Python programming language.

Instead of simply splitting on "\n\n", you can use NLTK to split text based on NLTK tokenizers.

  1. How the text is split: by the NLTK tokenizer.

  2. How the chunk size is measured: by the number of characters.

  3. Pip command to install the nltk library (shown below).

  4. NLTK (Natural Language Toolkit) is a Python library for natural language processing.

  5. It supports a range of NLP tasks, including text preprocessing, tokenization, morphological analysis, and part-of-speech tagging.

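The install command, plus the punkt data that NLTK's sentence tokenizer needs (downloading punkt is an extra step assumed here, not shown in the original):

%pip install --upgrade --quiet nltk

import nltk
nltk.download("punkt")  # tokenizer data used for sentence splitting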

Check the sample text.

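print(file[:350])  # preview the sample text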

  • Create a text splitter using the NLTKTextSplitter class, as shown below.

  • Set the chunk_size parameter to 1000 so the text is split into chunks of up to 1000 characters.

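A minimal sketch of the setup described above:

from langchain_text_splitters import NLTKTextSplitter

text_splitter = NLTKTextSplitter(
    chunk_size=1000,  # chunks of up to 1000 characters
)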

Split the text in the file variable using the split_text method of the text_splitter object.

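texts = text_splitter.split_text(file)  # split the file contents
print(texts[0])                         # preview the first chunk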

KoNLPy

KoNLPy (Korean NLP in Python) is a Python package for Korean natural language processing (NLP).

Token splitting involves splitting text into smaller, more manageable units called tokens.

These tokens are often words, phrases, symbols, or other meaningful elements that are important for further processing and analysis.

In languages such as English, token splitting usually involves separating words by spaces and punctuation marks.

The effectiveness of token splitting depends heavily on the tokenizer's understanding of the language's structure, which is what ensures that meaningful tokens are produced.

Tokenizers designed for English lack the ability to understand the distinct semantic structures of other languages, such as Korean, and therefore cannot be used effectively for Korean processing.

Korean token splitting using KoNLPy's Kkma analyzer

For Korean text, KoNLPy includes a morphological analyzer called Kkma (Korean Knowledge Morpheme Analyzer).

Kkma provides detailed morphological analysis of Korean text.

It can decompose sentences into words and words into their respective morphemes, identifying the part of speech for each token.

It can segment blocks of text into individual sentences, which is particularly useful for processing long texts.

Considerations when using Kkma

Kkma is renowned for its detailed analysis, but it should be noted that this precision can affect processing speed. Kkma is therefore best suited for applications where analytical depth is prioritized over rapid text processing.

  • Pip command to install the KoNLPy library (shown below).

  • KoNLPy is a Python package for Korean natural language processing, providing features such as morphological analysis, part-of-speech tagging, and parsing.

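%pip install --upgrade --quiet konlpy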

Check the Korean sample text.

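A sketch assuming a Korean sample file; the path below is hypothetical, so adjust it to your own data:

with open("./data/appendix-keywords-CJK.txt") as f:  # hypothetical path
    korean_file = f.read()
print(korean_file[:350])  # preview the Korean sample text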

The following is an example of splitting Korean text using KonlpyTextSplitter.

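A minimal sketch:

from langchain_text_splitters import KonlpyTextSplitter

text_splitter = KonlpyTextSplitter()  # uses the Kkma analyzer internally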

Split the text into sentence units using text_splitter.

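texts = text_splitter.split_text(korean_file)  # korean_file from the sketch above
print(texts[0])  # preview the first sentence-based chunk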

Hugging Face tokenizer

Hugging Face provides a variety of tokenizers.

In this code, we calculate the token length of the text using GPT2TokenizerFast, one of the Hugging Face tokenizers.

The text splitting method is as follows:

  • The text is split on the characters passed in.

Here's how the chunk size is measured:

  • By the number of tokens calculated by the Hugging Face tokenizer.

  • Create a tokenizer object using the GPT2TokenizerFast class.

  • Load the pretrained "gpt2" tokenizer model by calling the from_pretrained method, as shown below.

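A minimal sketch of the two steps above:

from transformers import GPT2TokenizerFast

# Load the pretrained "gpt2" tokenizer
hf_tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")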

Check the sample text.

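print(file[:350])  # preview the sample text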

Initialize the text splitter with the Hugging Face tokenizer via the from_huggingface_tokenizer method.

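A minimal sketch, assuming a CharacterTextSplitter as in the LangChain docs; the parameter values are illustrative:

from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter.from_huggingface_tokenizer(
    hf_tokenizer,     # the Hugging Face tokenizer created above
    chunk_size=300,   # assumed chunk size, measured in tokens
    chunk_overlap=50,
)
texts = text_splitter.split_text(file)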

Check the first element of the split result.

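print(texts[0])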
