04. Semantic chunker
Split text based on semantic similarity.
Reference
This method splits the text into sentences, groups them into windows of three sentences, and then merges groups that are similar in the embedding space.
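The first two steps can be sketched in plain Python. This is only an illustration, not the library's code: the helper names `split_sentences` and `group_with_neighbors` are made up here, and the sketch assumes that "grouping three sentences" means joining each sentence with one neighbor on each side.

```python
import re


def split_sentences(text):
    """Split on whitespace that follows '.', '?' or '!'."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def group_with_neighbors(sentences, buffer_size=1):
    """Join each sentence with `buffer_size` neighbors on each side.

    With buffer_size=1 every group holds up to three sentences
    (fewer at the start and end of the text).
    """
    groups = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        groups.append(" ".join(sentences[start:end]))
    return groups


sentences = split_sentences("Cats purr. Dogs bark. Birds sing. Fish swim.")
groups = group_with_neighbors(sentences)
for group in groups:
    print(group)
```

Each group, rather than each single sentence, would then be embedded, which smooths out noise from very short sentences.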
Install the dependency packages.
%pip install -qU langchain_experimental langchain_openai
Load the sample text and print part of its content.
# Open data/appendix-keywords.txt and bind the file object to f.
with open("./data/appendix-keywords.txt", encoding="utf-8") as f:
    file = f.read()  # Read the whole file into the variable `file`.

# Print the first 350 characters of the file.
print(file[:350])
Semantic Search
Definition: A semantic search is a search method that goes beyond a simple keyword match for a user's query and grasps its meaning and returns related results.
Example: When a user searches for "planets in the solar system", it returns information about related planets, such as "Jupiter", "Mars", etc.
Related keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is represented as a vector such as [0.65, -0.23, 0.17].
Related keywords: Natural Language
Creating a SemanticChunker
SemanticChunker is one of LangChain's experimental features. It divides text into semantically similar chunks.
This allows you to process and analyze text data more effectively.
Use SemanticChunker to split text into semantically related chunks.
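A minimal sketch of how the chunker might be created is shown below. The wrapper function `make_chunker` is hypothetical and exists only to keep the imports local. `SemanticChunker`, `OpenAIEmbeddings`, `breakpoint_threshold_type`, and `breakpoint_threshold_amount` are the library's own names. The sketch assumes the packages installed earlier are available and that an OpenAI API key is set in the `OPENAI_API_KEY` environment variable.

```python
def make_chunker(threshold_type="percentile", threshold_amount=None):
    """Build a SemanticChunker backed by OpenAI embeddings.

    Imports are deferred so that this helper can be defined even when
    the packages are not installed yet.
    """
    from langchain_experimental.text_splitter import SemanticChunker
    from langchain_openai import OpenAIEmbeddings

    kwargs = {"breakpoint_threshold_type": threshold_type}
    if threshold_amount is not None:
        # Leave the amount unset to use the library default for the chosen type.
        kwargs["breakpoint_threshold_amount"] = threshold_amount
    return SemanticChunker(OpenAIEmbeddings(), **kwargs)


# Typical notebook usage (needs network access and an API key):
# text_splitter = make_chunker()
```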
Splitting text
Use text_splitter to split the text in file into document units.
Check the split chunks.
You can use the create_documents() function to convert chunks into documents.
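The splitting and inspection steps could look roughly like this. The helper `preview_documents` is a hypothetical convenience function, not part of LangChain. It only assumes that the splitter exposes `create_documents()` and that each returned document has a `page_content` attribute, as LangChain `Document` objects do.

```python
def preview_documents(text_splitter, text, limit=5):
    """Split `text` into documents, print the first `limit`, return them all."""
    docs = text_splitter.create_documents([text])
    for i, doc in enumerate(docs[:limit]):
        print(f"[Chunk {i}]")
        print(doc.page_content)
        print("=" * 60)
    return docs


# Typical notebook usage (assumes `text_splitter` and `file` from earlier cells):
# chunks = text_splitter.split_text(file)        # list of plain strings
# docs = preview_documents(text_splitter, file)  # list of Document objects
# print(len(docs))                               # number of chunks produced
```

`split_text()` is convenient when only the raw strings are needed, while `create_documents()` is the better fit when the chunks go straight into a vector store or retriever.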
Breakpoints
This chunker works by deciding where to break between sentences. It does so by looking at the difference between the embeddings of two adjacent sentences.
If the difference exceeds a certain threshold, the text is split at that point.
Reference video: https://youtu.be/8OJC21T2SL4?si=PzUtNGYJ_KULq3-w&t=2580
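The breakpoint rule described in this section can be sketched without any library. The sketch assumes that the "difference" is the cosine distance between adjacent embeddings. The toy 2-D vectors and the threshold of 0.5 are made up for illustration, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def adjacent_distances(embeddings):
    """Distance between each embedding and the one that follows it."""
    return [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]


def split_at_breakpoints(sentences, distances, threshold):
    """Start a new chunk wherever the distance to the previous sentence exceeds the threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy vectors standing in for real sentence embeddings.
sentences = ["A.", "B.", "C.", "D."]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
distances = adjacent_distances(embeddings)
chunks = split_at_breakpoints(sentences, distances, threshold=0.5)
print(chunks)
```

The three methods that follow differ only in how the threshold is derived from the list of distances.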
Percentile
The default splitting method is percentile-based (percentile).
In this method, all differences between sentences are computed, and the text is split wherever a difference exceeds the specified percentile of those differences.
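As a rough illustration, the percentile threshold can be computed by hand. The distances below are invented, the value 70 is only an example, and the `percentile` helper is a hypothetical stand-in that assumes linear interpolation between ranks (the convention NumPy uses by default).

```python
def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Invented distances between adjacent sentence groups.
distances = [0.10, 0.12, 0.08, 0.55, 0.11, 0.60, 0.09]

threshold = percentile(distances, 70)
# Indices whose distance exceeds the threshold mark the split points.
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)
```

A higher percentile yields fewer, larger chunks. A lower one splits more aggressively.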
Check the split result.
Print the length of docs.
Standard Deviation
In this method, the text is split wherever a difference is greater than the number of standard deviations given by breakpoint_threshold_amount.
Set the breakpoint_threshold_type parameter to "standard_deviation" to use the standard deviation as the chunk splitting criterion.
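One plausible reading of this rule is sketched below. It assumes the threshold is the mean distance plus `amount` times the population standard deviation. The function name `std_threshold`, the distances, and the amount of 1.25 are illustrative, not taken from the library.

```python
import statistics


def std_threshold(distances, amount):
    """Mean distance plus `amount` population standard deviations."""
    return statistics.fmean(distances) + amount * statistics.pstdev(distances)


# Invented distances: four similar transitions and one clear topic shift.
distances = [0.1, 0.1, 0.1, 0.1, 0.6]

threshold = std_threshold(distances, amount=1.25)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)
```

Because the threshold scales with the spread of the distances, this variant adapts to texts whose sentences are uniformly similar or uniformly diverse.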
Check the split results.
Print the length of docs.
Interquartile
In this method, chunks are split using the interquartile range (IQR).
Set the breakpoint_threshold_type parameter to "interquartile" to use the interquartile range as the chunk splitting criterion.
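A hedged sketch of an IQR-based threshold follows. It assumes the threshold is the mean distance plus `amount` times the interquartile range. The function name `iqr_threshold`, the distances, and the amount of 0.5 are made up for illustration.

```python
import statistics


def iqr_threshold(distances, amount):
    """Mean distance plus `amount` times the interquartile range (Q3 - Q1)."""
    # method="inclusive" interpolates linearly between data points.
    q1, _median, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    return statistics.fmean(distances) + amount * (q3 - q1)


# Invented distances with one outlier at the end.
distances = [0.1, 0.2, 0.3, 0.4, 1.0]

threshold = iqr_threshold(distances, amount=0.5)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)
```

The IQR ignores the extreme tails of the distribution, so a single very large distance inflates this threshold less than it would inflate one based on the standard deviation.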
Print the length of docs.