This method splits the text into sentences, groups them into windows of three sentences, and then merges groups that are similar in the embedding space.
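The first two steps (sentence splitting and grouping into windows of three) can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name `sentence_windows`, its `buffer_size` parameter and the punctuation-based sentence regex are hypothetical choices for this sketch, not a description of what the library does internally.

```python
import re


def sentence_windows(text: str, buffer_size: int = 1) -> list[str]:
    """Split text into sentences and join each one with its neighbors.

    With buffer_size=1, every window holds up to three sentences:
    the previous one, the current one, and the next one.
    """
    # Assumption: a sentence ends at '.', '?' or '!' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    windows = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        windows.append(" ".join(sentences[start:end]))
    return windows


text = "Cats purr. Dogs bark. Birds sing. Fish swim."
for window in sentence_windows(text):
    print(window)
```

Each window, rather than each bare sentence, would then be embedded, so that a single short sentence does not dominate the similarity comparison.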
# Open data/appendix-keywords.txt and create a file object called f.
with open("./data/appendix-keywords.txt") as f:
    file = f.read()  # Read the file's contents and store them in the file variable.
# Print part of the content read from the file.
print(file[:350])
Semantic Search
Definition: A semantic search is a search method that goes beyond a simple keyword match for a user's query and grasps its meaning and returns related results.
Example: When a user searches for a "solar planet", it returns information about the related planet, such as "Vegetic", "Mars", etc.
Associated Keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17].
Associated Keywords: Natural Language
Creating a SemanticChunker
SemanticChunker is one of LangChain's experimental features; it splits text into semantically similar chunks.
This allows you to process and analyze text data more effectively.
Use SemanticChunker to split the text into semantically related chunks.
Splitting the text
Use text_splitter to split the text stored in file into document-sized chunks.
Check the resulting chunks.
You can use the create_documents() function to convert chunks into documents.
Breakpoints
This chunker works by deciding where to break between sentences. It does so by looking at the difference between the embeddings of two adjacent sentences.
If the difference exceeds a certain threshold, the text is split at that point.
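The breakpoint idea can be illustrated with a small standalone sketch. It assumes cosine distance as the measure of "difference" and uses hand-written two-dimensional toy vectors in place of real embeddings; the function names and the fixed threshold of 0.3 are hypothetical and chosen only for this example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def split_on_breakpoints(sentences, embeddings, threshold):
    """Start a new chunk whenever adjacent sentences differ by more than threshold."""
    if not sentences:
        return []
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


sentences = ["Cats purr.", "Cats nap a lot.", "SQL queries tables.", "SQL joins tables."]
# Toy 2-D vectors standing in for real sentence embeddings (an assumption).
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
print(split_on_breakpoints(sentences, embeddings, threshold=0.3))
```

Here only the jump from the second to the third sentence exceeds the threshold, so the text is split into one chunk about cats and one about SQL.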
# Load the API key settings that are managed as environment variables.
from dotenv import load_dotenv
# Load the API key information.
load_dotenv()
True
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings
# Initialize a semantic chunk splitter using OpenAI embeddings.
text_splitter = SemanticChunker(OpenAIEmbeddings())
chunks = text_splitter.split_text(file)
# Print the first of the resulting chunks.
print(chunks[0])
Semantic Search
Definition: A semantic search is a search method that goes beyond a simple keyword matching of a user, grasping its meaning and returning related results. Example: When a user searches for "Solar Planet", it returns information about related planets such as "Vegetic", "Mars", etc. Associated Keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. Associated Keywords: natural language processing, vectorization, dipping
Token
Definition: Token means splitting text into smaller units. This can usually be a word, sentence, or verse. Example: Split the sentence "I go to school" into "I", "to school", and "go". Associated Keywords: tokenization, natural language processing, parsing
Tokenizer
Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing. Example: Split the sentence "I love programming." into ["I", "love", "programming", "."]. Associated Keywords: tokenization, natural language processing, parsing
VectorStore
Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks. Example: Save the word embedding vectors to the database for quick access. Associated Keywords: embedding, database, vectorization
SQL
Definition: Structured Query Language (SQL) is a programming language for managing data in a database. You can do a variety of things, including data lookup, modification, insertion, deletion, etc. Example: SELECT * FROM users WHERE age > 18; views user information over 18 years old. Associated Keywords: database, query, data management
CSV
Definition: CSV (Comma-Separated Values) is a file format that stores data, and each data value is separated by commas. Used to simply store and exchange tabular data. Example: CSV files with headers named name, age, job may contain data such as Hong Gil-dong, 30, developer. Associates: data format, file processing, data exchange
JSON
Definition: JSON (JavaScript Object Notation) is a lightweight data exchange format that uses readable text for both people and machines to represent data objects. Example: {" Name": "Flood Road", "Age": 30, "Occupation": "Developer" } is data in JSON format. Associates: Data Exchange, Web Development, API
Transformer
Definition: Transformers are a type of deep-learning model used in natural language processing, mainly used for translation, summary, text generation, etc. This is based on the Attention mechanism. Example: Google translators use transformer models to perform translations between different languages. Associated Keywords: deep learning, natural language processing, Attention
HuggingFace
Definition: HuggingFace is a library that provides a variety of pre-trained models and tools for natural language processing. This helps researchers and developers do NLP work easily. Example: You can use HuggingFace's Transformers library to do emotional analysis, text generation, and more. Associates: Natural language processing, deep learning, library
Digital Transformation
Definition: Digital transformation is the process of leveraging technology to transform a company's services, culture and operations. This focuses on improving the business model and increasing competitiveness through digital technology. Example: Innovating data storage and processing by introducing cloud computing is an example of digital transformation. Related Keywords: innovation, technology, business model
Crawling
Definition: Crawling is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis. Example: Crawling is when the Google search engine visits websites on the Internet to collect and index their content. Associates: data collection, web scraping, search engine
Word2Vec
Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words.
# Split the text into documents using text_splitter.
docs = text_splitter.create_documents([file])
print(docs[0].page_content)  # Print the content of the first document.
Semantic Search
Definition: A semantic search is a search method that goes beyond a simple keyword matching of a user, grasping its meaning and returning related results. Example: When a user searches for "Solar Planet", it returns information about related planets such as "Vegetic", "Mars", etc. Associated Keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. Associated Keywords: natural language processing, vectorization, dipping
Token
Definition: Token means splitting text into smaller units. This can usually be a word, sentence, or verse. Example: Split the sentence "I go to school" into "I", "to school", and "go". Associated Keywords: tokenization, natural language processing, parsing
Tokenizer
Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing. Example: Split the sentence "I love programming." into ["I", "love", "programming", "."]. Associated Keywords: tokenization, natural language processing, parsing
VectorStore
Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks. Example: Save the word embedding vectors to the database for quick access. Associated Keywords: embedding, database, vectorization
SQL
Definition: Structured Query Language (SQL) is a programming language for managing data in a database. You can do a variety of things, including data lookup, modification, insertion, deletion, etc. Example: SELECT * FROM users WHERE age > 18; views user information over 18 years old. Associated Keywords: database, query, data management
CSV
Definition: CSV (Comma-Separated Values) is a file format that stores data, and each data value is separated by commas. Used to simply store and exchange tabular data. Example: CSV files with headers named name, age, job may contain data such as Hong Gil-dong, 30, developer. Associates: data format, file processing, data exchange
JSON
Definition: JSON (JavaScript Object Notation) is a lightweight data exchange format that uses readable text for both people and machines to represent data objects. Example: {" Name": "Flood Road", "Age": 30, "Occupation": "Developer" } is data in JSON format. Associates: Data Exchange, Web Development, API
Transformer
Definition: Transformers are a type of deep-learning model used in natural language processing, mainly used for translation, summary, text generation, etc. This is based on the Attention mechanism. Example: Google translators use transformer models to perform translations between different languages. Associated Keywords: deep learning, natural language processing, Attention
HuggingFace
Definition: HuggingFace is a library that provides a variety of pre-trained models and tools for natural language processing. This helps researchers and developers do NLP work easily. Example: You can use HuggingFace's Transformers library to do emotional analysis, text generation, and more. Associates: Natural language processing, deep learning, library
Digital Transformation
Definition: Digital transformation is the process of leveraging technology to transform a company's services, culture and operations. This focuses on improving the business model and increasing competitiveness through digital technology. Example: Innovating data storage and processing by introducing cloud computing is an example of digital transformation. Related Keywords: innovation, technology, business model
Crawling
Definition: Crawling is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis. Example: Crawling is when the Google search engine visits websites on the Internet to collect and index their content. Associates: data collection, web scraping, search engine
Word2Vec
Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words.
text_splitter = SemanticChunker(
    # Initialize the semantic chunker using the OpenAI embedding model.
    OpenAIEmbeddings(),
    # Set the breakpoint threshold type to percentile.
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=70,
)
# Split the text into documents using text_splitter.
docs = text_splitter.create_documents([file])
# Print the first five documents.
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the content of the current document.
    print("===" * 20)
[Chunk 0]
Semantic Search
Definition: A semantic search is a search method that goes beyond a simple keyword matching of a user, grasping its meaning and returning related results. Example: When a user searches for "Solar Planet", it returns information about related planets such as "Vegetic", "Mars", etc. Associated Keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
============================================================
[Chunk 1]
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. Associated Keywords: natural language processing, vectorization, dipping
Token
Definition: Token means splitting text into smaller units. This can usually be a word, sentence, or verse.
============================================================
[Chunk 2]
Example: Split the sentence "I go to school" into "I", "to school", and "go". Associated Keywords: tokenization, natural language processing, parsing
Tokenizer
Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing.
============================================================
[Chunk 3]
Example: Split the sentence "I love programming." into ["I", "love", "programming", "."]. Associated Keywords: tokenization, natural language processing, parsing
VectorStore
Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks.
============================================================
[Chunk 4]
Example: Save the word embedding vectors to the database for quick access. Associated Keywords: embedding, database, vectorization
SQL
Definition: Structured Query Language (SQL) is a programming language for managing data in a database. You can do a variety of things, including data lookup, modification, insertion, deletion, etc.
============================================================
print(len(docs))  # Print the number of documents in docs.
60
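To make the percentile setting more concrete, the sketch below picks a threshold from a list of made-up distances between adjacent sentences and marks every distance above it as a breakpoint. The `percentile` helper (linear interpolation between ranks) and the sample values are assumptions for illustration; how the library computes its threshold internally is not shown in this section.

```python
def percentile(values, q):
    """Return the q-th percentile using linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Hypothetical distances between adjacent sentences (not real embedding output).
distances = [0.05, 0.10, 0.08, 0.60, 0.07, 0.12, 0.55, 0.09, 0.06, 0.11, 0.50]

threshold = percentile(distances, 70)
# A breakpoint is any position whose distance exceeds the threshold.
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, breakpoints)

# A lower percentile gives a lower threshold and therefore more breakpoints.
more_breakpoints = [i for i, d in enumerate(distances) if d > percentile(distances, 50)]
print(len(breakpoints), len(more_breakpoints))
```

Under this reading, lowering breakpoint_threshold_amount produces more, smaller chunks, and raising it produces fewer, larger ones.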
text_splitter = SemanticChunker(
    # Initialize the semantic chunker using the OpenAI embedding model.
    OpenAIEmbeddings(),
    # Use standard deviation as the splitting criterion.
    breakpoint_threshold_type="standard_deviation",
    breakpoint_threshold_amount=1.25,
)
# Split the text into documents using text_splitter.
docs = text_splitter.create_documents([file])
# Print the first five documents.
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the content of the current document.
    print("===" * 20)
[Chunk 0]
Semantic Search
Definition: A semantic search is a search method that goes beyond a simple keyword matching of a user, grasping its meaning and returning related results. Example: When a user searches for "Solar Planet", it returns information about related planets such as "Vegetic", "Mars", etc. Associated Keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. Associated Keywords: natural language processing, vectorization, dipping
Token
Definition: Token means splitting text into smaller units. This can usually be a word, sentence, or verse. Example: Split the sentence "I go to school" into "I", "to school", and "go". Associated Keywords: tokenization, natural language processing, parsing
Tokenizer
Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing. Example: Split the sentence "I love programming." into ["I", "love", "programming", "."]. Associated Keywords: tokenization, natural language processing, parsing
VectorStore
Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks.
============================================================
[Chunk 1]
Example: Save the word embedding vectors to the database for quick access. Associated Keywords: embedding, database, vectorization
SQL
Definition: Structured Query Language (SQL) is a programming language for managing data in a database. You can do a variety of things, including data lookup, modification, insertion, deletion, etc.
============================================================
[Chunk 2]
Example: SELECT * FROM users WHERE age > 18; views user information over 18 years old. Associated Keywords: database, query, data management
CSV
Definition: CSV (Comma-Separated Values) is a file format that stores data, and each data value is separated by commas. Used to simply store and exchange tabular data. Example: CSV files with headers named name, age, job may contain data such as Hong Gil-dong, 30, developer. Associates: data format, file processing, data exchange
JSON
Definition: JSON (JavaScript Object Notation) is a lightweight data exchange format that uses readable text for both people and machines to represent data objects. Example: {" Name": "Flood Road", "Age": 30, "Occupation": "Developer" } is data in JSON format. Associates: Data Exchange, Web Development, API
Transformer
Definition: Transformers are a type of deep-learning model used in natural language processing, mainly used for translation, summary, text generation, etc. This is based on the Attention mechanism.
============================================================
[Chunk 3]
Example: Google translators use transformer models to perform translations between different languages. Associated Keywords: deep learning, natural language processing, Attention
HuggingFace
Definition: HuggingFace is a library that provides a variety of pre-trained models and tools for natural language processing. This helps researchers and developers do NLP work easily.
============================================================
[Chunk 4]
Example: You can use HuggingFace's Transformers library to do emotional analysis, text generation, and more. Associates: Natural language processing, deep learning, library
Digital Transformation
Definition: Digital transformation is the process of leveraging technology to transform a company's services, culture and operations. This focuses on improving the business model and increasing competitiveness through digital technology.
============================================================
print(len(docs))  # Print the number of documents in docs.
14
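One plausible reading of the standard-deviation criterion is a threshold of the mean distance plus a multiple of the standard deviation, with breakpoint_threshold_amount acting as the multiplier. The sketch below follows that assumption with made-up distances; the formula and the use of the population standard deviation are assumptions of this example rather than something stated in this section.

```python
import statistics

# Hypothetical distances between adjacent sentences (not real embedding output).
distances = [0.05, 0.10, 0.08, 0.60, 0.07, 0.12, 0.55, 0.09, 0.06, 0.11, 0.50]

amount = 1.25  # plays the role of breakpoint_threshold_amount
mean = statistics.mean(distances)
std = statistics.pstdev(distances)  # population standard deviation (assumed)

# Assumed rule: split wherever the distance is more than
# `amount` standard deviations above the mean distance.
threshold = mean + amount * std
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(round(threshold, 4), breakpoints)
```

Because the threshold is tied to the spread of the distances, only jumps that are unusually large for this particular text count as breakpoints.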
text_splitter = SemanticChunker(
    # Initialize the semantic chunker using the OpenAI embedding model.
    OpenAIEmbeddings(),
    # Set the breakpoint threshold type to interquartile range.
    breakpoint_threshold_type="interquartile",
    breakpoint_threshold_amount=0.5,
)
# Split the text into documents using text_splitter.
docs = text_splitter.create_documents([file])
# Print the first five documents.
for i, doc in enumerate(docs[:5]):
    print(f"[Chunk {i}]", end="\n\n")
    print(doc.page_content)  # Print the content of the current document.
    print("===" * 20)
[Chunk 0]
Semantic Search
Definition: A semantic search is a search method that goes beyond a simple keyword matching of a user, grasping its meaning and returning related results. Example: When a user searches for "Solar Planet", it returns information about related planets such as "Vegetic", "Mars", etc. Associated Keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
============================================================
[Chunk 1]
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. Associated Keywords: natural language processing, vectorization, dipping
Token
Definition: Token means splitting text into smaller units. This can usually be a word, sentence, or verse. Example: Split the sentence "I go to school" into "I", "to school", and "go". Associated Keywords: tokenization, natural language processing, parsing
Tokenizer
Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing.
============================================================
[Chunk 2]
Example: Split the sentence "I love programming." into ["I", "love", "programming", "."]. Associated Keywords: tokenization, natural language processing, parsing
VectorStore
Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks.
============================================================
[Chunk 3]
Example: Save the word embedding vectors to the database for quick access. Associated Keywords: embedding, database, vectorization
SQL
Definition: Structured Query Language (SQL) is a programming language for managing data in a database. You can do a variety of things, including data lookup, modification, insertion, deletion, etc.
============================================================
[Chunk 4]
Example: SELECT * FROM users WHERE age > 18; views user information over 18 years old. Associated Keywords: database, query, data management
CSV
Definition: CSV (Comma-Separated Values) is a file format that stores data, and each data value is separated by commas. Used to simply store and exchange tabular data.
============================================================
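The interquartile criterion can be sketched in the same spirit. The example below assumes the threshold is the mean distance plus a multiple of the interquartile range (Q3 minus Q1), with breakpoint_threshold_amount as the multiplier; both that formula and the `percentile` helper are assumptions made for illustration, and the distances are made up.

```python
import statistics


def percentile(values, q):
    """Return the q-th percentile using linear interpolation between ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


# Hypothetical distances between adjacent sentences (not real embedding output).
distances = [0.05, 0.10, 0.08, 0.60, 0.07, 0.12, 0.55, 0.09, 0.06, 0.11, 0.50]

amount = 0.5  # plays the role of breakpoint_threshold_amount
iqr = percentile(distances, 75) - percentile(distances, 25)

# Assumed rule: split wherever the distance exceeds mean + amount * IQR.
threshold = statistics.mean(distances) + amount * iqr
breakpoints = [i for i, d in enumerate(distances) if d > threshold]
print(round(iqr, 3), round(threshold, 4), breakpoints)
```

Compared with the standard deviation, the interquartile range is less sensitive to a few extreme distances, which is one reason to prefer it on texts with occasional very sharp topic changes.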