04. Semantic chunker

Split text based on semantic similarity.

Reference

This method splits the text into sentences, groups them in windows of three, and then merges groups that are similar in embedding space.
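The first two steps, plus the distance measurement that the merge step relies on, can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the naive regex sentence splitter, the `buffer_size` name, and the sample text are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter (assumption): break after '.', '?' or '!' followed by whitespace.
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]


def windows(sentences, buffer_size=1):
    # Join each sentence with its neighbours; buffer_size=1 gives groups of up to three.
    out = []
    for i in range(len(sentences)):
        lo = max(0, i - buffer_size)
        hi = min(len(sentences), i + buffer_size + 1)
        out.append(" ".join(sentences[lo:hi]))
    return out


def toy_embed(text):
    # Hypothetical stand-in for an embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_distance(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / (norm_a * norm_b) if norm_a and norm_b else 1.0


def adjacent_distances(text, buffer_size=1):
    # Distance between each window and the next one.
    sentences = split_sentences(text)
    vectors = [toy_embed(w) for w in windows(sentences, buffer_size)]
    distances = [
        cosine_distance(vectors[i], vectors[i + 1])
        for i in range(len(vectors) - 1)
    ]
    return sentences, distances


sample = (
    "Cats purr when happy. Cats sleep most of the day. "
    "Stocks fell sharply today. Stocks often recover later."
)
sentences, distances = adjacent_distances(sample)
```

With this sample, the largest distance falls between the second and third sentences, which is where the topic changes.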

Install the dependency packages.

%pip install -qU langchain_experimental langchain_openai

Load the sample text and print part of its contents.

# Open data/appendix-keywords.txt and bind the file object to f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the whole file into the variable file.

# Print the first 350 characters of the file.
print(file[:350])

Semantic Search 

Definition: A semantic search is a search method that goes beyond a simple keyword match for a user's query and grasps its meaning and returns related results. 
Example: When a user searches for "planets in the solar system", it returns information about related planets, such as "Jupiter", "Mars", etc. 
Related keywords: natural language processing, search algorithms, data mining 

Embedding 

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Related keywords: natural language 

Creating a SemanticChunker

SemanticChunker is one of LangChain's experimental features. It divides text into semantically similar chunks.

This allows you to process and analyze text data more effectively.

Use SemanticChunker to split text into semantically related chunks.

Splitting text

  • Use text_splitter to split the text stored in file into document units.
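As a hedged sketch, creating the splitter and splitting the text might look like the following. The helper names `build_splitter` and `split_file_text` are hypothetical and exist only for this example. The sketch assumes the packages installed earlier are available and that an OpenAI API key is set in the environment. The imports sit inside the function so that the definitions load even where those packages are missing.

```python
def build_splitter(threshold_type="percentile", threshold_amount=None):
    # Imported lazily so that this module loads even without the packages installed.
    from langchain_experimental.text_splitter import SemanticChunker
    from langchain_openai.embeddings import OpenAIEmbeddings

    # Passing None for the amount lets the library pick its own default for the type.
    return SemanticChunker(
        OpenAIEmbeddings(),
        breakpoint_threshold_type=threshold_type,
        breakpoint_threshold_amount=threshold_amount,
    )


def split_file_text(text, splitter):
    # split_text returns plain strings; create_documents wraps chunks as documents.
    chunks = splitter.split_text(text)
    docs = splitter.create_documents([text])
    return chunks, docs
```

A typical call would be `text_splitter = build_splitter()` followed by `chunks, docs = split_file_text(file, text_splitter)`, after which `chunks[0]` and `docs[0].page_content` can be printed for inspection.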

Check the resulting chunks.

You can use the create_documents() function to convert chunks into documents.

Breakpoints

This chunker works by deciding where to "break" between sentences. It does so by looking at the difference between the embeddings of two adjacent sentences.

If the difference exceeds a certain threshold, the text is split at that point.
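The thresholding rule can be sketched as a small function. This is an illustration under assumptions: the distances are made-up numbers rather than real embedding differences, and `split_at_breakpoints` is a hypothetical helper, not a library function.

```python
def split_at_breakpoints(sentences, distances, threshold):
    # distances[i] is the embedding difference between sentence i and sentence i + 1.
    if not sentences:
        return []
    if len(distances) != len(sentences) - 1:
        raise ValueError("expected one distance per adjacent sentence pair")
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Difference exceeds the threshold: close the current chunk here.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = ["A1.", "A2.", "B1.", "B2.", "C1."]
distances = [0.10, 0.80, 0.15, 0.70]  # made-up values for illustration
chunks = split_at_breakpoints(sentences, distances, threshold=0.5)
# chunks == ["A1. A2.", "B1. B2.", "C1."]
```

The only thing that changes between the methods below is how the threshold is chosen.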

  • Reference video: https://youtu.be/8OJC21T2SL4?si=PzUtNGYJ_KULq3-w&t=2580

Percentile

The default splitting method is percentile-based (Percentile).

In this method, the differences between all adjacent sentences are calculated, and the text is split wherever a difference exceeds the specified percentile of those differences.
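A minimal sketch of the percentile rule follows. The `percentile` helper is written out here with linear interpolation between ranks, which is an assumption about the convention; the distances and the 75th-percentile setting are made-up values for illustration.

```python
def percentile(values, q):
    # Linear interpolation between the two closest ranks (assumed convention).
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


# Made-up differences between adjacent sentences.
distances = [0.10, 0.12, 0.80, 0.15, 0.11, 0.70, 0.13, 0.14, 0.09]

threshold = percentile(distances, 75)
# Indices of the sentence gaps where the text would be split.
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
# threshold is 0.15 and breakpoints == [2, 5]
```

Raising the percentile produces fewer, larger chunks, and lowering it produces more, smaller ones.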

Check the split result.

Output the length of docs.

Standard Deviation

In this method, the text is split wherever a difference is greater than the number of standard deviations specified by breakpoint_threshold_amount.

  • Set the breakpoint_threshold_type parameter to "standard_deviation" to use the standard deviation as the chunk splitting criterion.
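The rule can be sketched as below. Two points are assumptions made for this illustration and are not stated in the text above: the threshold is taken as the mean difference plus `amount` standard deviations, and the population standard deviation is used. The distances are made-up values.

```python
import statistics


def std_threshold(distances, amount):
    # Assumed rule: mean difference plus `amount` population standard deviations.
    return statistics.fmean(distances) + amount * statistics.pstdev(distances)


def find_breakpoints(distances, threshold):
    # Indices of the sentence gaps whose difference exceeds the threshold.
    return [i for i, d in enumerate(distances) if d > threshold]


# Made-up differences between adjacent sentences.
distances = [0.10, 0.12, 0.80, 0.15, 0.11, 0.70, 0.13, 0.14, 0.09]

loose = find_breakpoints(distances, std_threshold(distances, amount=1.25))
strict = find_breakpoints(distances, std_threshold(distances, amount=2.0))
# loose == [2, 5], strict == [2]
```

A larger amount demands a more extreme difference before splitting, so it yields fewer chunks.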

Check the split results.

Output the length of docs.

Interquartile

In this method, chunks are split using the interquartile range (IQR).

  • Set the breakpoint_threshold_type parameter to "interquartile" to use the interquartile range as the chunk splitting criterion.
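A rough sketch of an IQR-based threshold is shown below. The exact rule is an assumption made for illustration: the threshold is taken as the mean difference plus `amount` times the IQR, with quartiles computed by linear interpolation. The distances and the value 1.5 are made up for the example.

```python
import statistics


def percentile(values, q):
    # Linear interpolation between the two closest ranks (assumed convention).
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def iqr_threshold(distances, amount):
    # Assumed rule: mean difference plus `amount` times the interquartile range.
    iqr = percentile(distances, 75) - percentile(distances, 25)
    return statistics.fmean(distances) + amount * iqr


# Made-up differences between adjacent sentences.
distances = [0.10, 0.12, 0.80, 0.15, 0.11, 0.70, 0.13, 0.14, 0.09]

threshold = iqr_threshold(distances, amount=1.5)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
# breakpoints == [2, 5]
```

Because the IQR ignores the extreme values, a few very large differences do not inflate the threshold the way they inflate a standard deviation.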

Output the length of docs.
