FlashRank is an ultra-light and ultra-fast Python library for adding reranking to existing search and retrieval pipelines. It is based on SoTA cross-encoders.
This notebook shows how to use FlashRank for document compression and retrieval.
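To make the reranking idea concrete, here is a minimal, library-free sketch of the second stage of a retrieve-then-rerank pipeline. It is not FlashRank's implementation: `toy_pair_score` is a hypothetical stand-in for a cross-encoder (it only measures word overlap between the query and a passage), and the `rerank` helper and its `top_n` parameter are names chosen for this illustration.

```python
from dataclasses import dataclass


@dataclass
class ScoredPassage:
    text: str
    score: float


def toy_pair_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for a cross-encoder.

    Scores the (query, passage) pair jointly as the fraction of
    query terms that also appear in the passage.
    """
    query_terms = set(query.lower().split())
    passage_terms = set(passage.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & passage_terms) / len(query_terms)


def rerank(query: str, passages: list[str], top_n: int = 3) -> list[ScoredPassage]:
    """Score every candidate against the query and keep the top_n best."""
    scored = [ScoredPassage(p, toy_pair_score(query, p)) for p in passages]
    scored.sort(key=lambda s: s.score, reverse=True)
    return scored[:top_n]


# Candidates as a first-stage retriever might return them (made-up texts)
candidates = [
    "pandas is a data analysis library",
    "word2vec maps words to vectors",
    "faiss searches similar vectors",
    "word2vec vectors capture similarity",
]
reranked = rerank("word2vec words vectors", candidates, top_n=2)
for item in reranked:
    print(f"{item.score:.2f}  {item.text}")
```

A real cross-encoder replaces the word-overlap score with a model that reads the query and passage together, but the surrounding score, sort, and truncate steps stay the same.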
Setup
# installation
# !pip install -qU flashrank
def pretty_print_docs(docs):
    # Print each document with its metadata, separated by a dashed line
    print(
        f"\n{'-' * 100}\n".join(
            [
                f"Document {i+1}:\n\n{d.page_content}\nMetadata: {d.metadata}"
                for i, d in enumerate(docs)
            ]
        )
    )
FlashrankRerank
First, load the data for a simple example and create a base retriever.
After that, wrap the base retriever in a ContextualCompressionRetriever and use FlashrankRerank as the compressor.
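The loading code in this example configures its text splitter with chunk_size=500 and chunk_overlap=100. As a rough illustration of what those two numbers mean, here is a simplified sliding-window splitter. It is only a sketch: `split_with_overlap` is a hypothetical helper, and RecursiveCharacterTextSplitter additionally tries to break on separators such as paragraph and line boundaries, which this version ignores.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size chunks where neighbours share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghij" * 120  # 1,200 characters of dummy text
chunks = split_with_overlap(sample, chunk_size=500, chunk_overlap=100)
print([len(c) for c in chunks])
```

The overlap means a definition that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text in the index.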
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

# Load the document
documents = TextLoader("./data/appendix-keywords.txt").load()

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

# Split the document into chunks
texts = text_splitter.split_documents(documents)

# Add a unique ID to each chunk
for idx, text in enumerate(texts):
    text.metadata["id"] = idx

# Create the base retriever (returns the 10 most similar chunks)
retriever = FAISS.from_documents(
    texts, OpenAIEmbeddings()
).as_retriever(search_kwargs={"k": 10})

# Query
query = "Explain Word2Vec to me."

# Retrieve documents
docs = retriever.invoke(query)

# Print the retrieved documents
pretty_print_docs(docs)
Document 1:
Crawling
Definition: Crawling is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis.
Example: The Google search engine visiting websites on the Internet to collect and index their content is an example of crawling.
Associated Keywords: data collection, web scraping, search engine
Word2Vec
Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words.
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other.
Associated Keywords: natural language processing, embedding, semantic similarity
LLM (Large Language Model)
Metadata: {'source': './data/appendix-keywords.txt', 'id': 5}
----------------------------------------------------------------------------------------------------
Document 2:
Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching: it grasps the meaning of the user's query and returns related results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated Keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17].
Associated Keywords: natural language processing, vectorization, deep learning
Token
Metadata: {'source': './data/appendix-keywords.txt', 'id': 0}
----------------------------------------------------------------------------------------------------
Document 3:
Token
Definition: A token is a smaller unit obtained by splitting text. It is usually a word, a sentence, or a phrase.
Example: Split the sentence "I go to school" into "I", "go", and "to school".
Associated Keywords: tokenization, natural language processing, parsing
Tokenizer
Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing.
Example: Split the sentence "I love programming." into ["I", "love", "programming", "."].
Associated Keywords: tokenization, natural language processing, parsing
VectorStore
Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks.
Example: Save the word embedding vectors to the database for quick access.
Associated Keywords: embedding, database, vectorization
SQL
Metadata: {'source': './data/appendix-keywords.txt', 'id': 1}
----------------------------------------------------------------------------------------------------
Document 4:
Parser
Definition: Parser is a tool that analyzes a given data (string, files, etc.) and converts it into a structured form. It is used for parsing programming languages or processing file data.
Example: Parsing an HTML document to create a DOM structure for a web page is an example of parsing.
Associated Keywords: parsing, compiler, data processing
TF-IDF (Term Frequency-Inverse Document Frequency)
Definition: TF-IDF is a statistical measure used to evaluate the importance of words within a document. This takes into account the frequency of words in a document and the scarcity of those words in the entire set of documents.
Example: Words that do not appear frequently in many documents have high TF-IDF values.
Associated Keywords: natural language processing, information retrieval, data mining
Deep Learning
Metadata: {'source': './data/appendix-keywords.txt', 'id': 8}
----------------------------------------------------------------------------------------------------
Document 5:
Definition: LLM refers to a large language model trained with large-scale text data. These models are used for various natural language understanding and creation tasks.
Example: OpenAI's GPT series is a representative large language model.
Associated Keywords: natural language processing, deep learning, text generation
FAISS (Facebook AI Similarity Search)
Definition: FAISS is a high-speed similarity search library developed by Facebook, specifically designed to effectively search for similar vectors in large vector sets.
Example: FAISS can be used to quickly find similar images among millions of image vectors.
Associated Keywords: vector search, machine learning, database optimization
Open Source
Metadata: {'source': './data/appendix-keywords.txt', 'id': 6}
----------------------------------------------------------------------------------------------------
Document 6:
Pandas
Definition: Pandas is a library that provides data analysis and manipulation tools for the Python programming language. This allows you to perform data analysis tasks efficiently.
Example: You can use pandas to read CSV files, clean data, and perform various analyses.
Associated Keywords: data analysis, Python, data processing
GPT (Generative Pretrained Transformer)
Definition: GPT is a generative language model pre-trained on a large dataset and used for a variety of text-based tasks. It can generate natural language based on the input text.
Example: A chatbot that generates detailed answers to questions provided by the user can use the GPT model.
Associated Keywords: natural language processing, text generation, deep learning
InstructGPT
Metadata: {'source': './data/appendix-keywords.txt', 'id': 11}
----------------------------------------------------------------------------------------------------
Document 7:
HuggingFace
Definition: HuggingFace is a library that provides a variety of pre-trained models and tools for natural language processing. This helps researchers and developers do NLP work easily.
Example: You can use HuggingFace's Transformers library to do sentiment analysis, text generation, and more.
Associated Keywords: natural language processing, deep learning, library
Digital Transformation
Definition: Digital transformation is the process of leveraging technology to transform a company's services, culture and operations. This focuses on improving the business model and increasing competitiveness through digital technology.
Example: Innovating data storage and processing by introducing cloud computing is an example of digital transformation.
Associated Keywords: innovation, technology, business model
Crawling
Metadata: {'source': './data/appendix-keywords.txt', 'id': 4}
----------------------------------------------------------------------------------------------------
Document 8:
Deep Learning
Definition: Deep learning is a field of machine learning that uses artificial neural networks to solve complex problems. It focuses on learning high-level representations from data.
Example: Deep learning models are used in image recognition, speech recognition, and natural language processing.
Associated Keywords: artificial neural network, machine learning, data analysis
Schema
Definition: A schema defines the structure of a database or file, providing a blueprint of how data is stored and organized.
Example: A table schema in a relational database defines column names, data types, key constraints, and more.
Associated Keywords: database, data modeling, data management
DataFrame
Metadata: {'source': './data/appendix-keywords.txt', 'id': 9}
----------------------------------------------------------------------------------------------------
Document 9:
DataFrame
Definition: DataFrame is a table-shaped data structure consisting of rows and columns, mainly used for data analysis and processing.
Example: In the Pandas library, DataFrame can have columns of various data types, facilitating data manipulation and analysis.
Associated Keywords: data analysis, pandas, data processing
Attention mechanism
Definition: The Attention mechanism is a technique that lets a deep learning model pay more 'attention' to important information. It is mainly used with sequence data (e.g., text, time series data).
Example: In the translation model, the Attention mechanism focuses more on important parts of the input sentence, producing accurate translations.
Associated Keywords: deep learning, natural language processing, sequence modeling
Pandas
Metadata: {'source': './data/appendix-keywords.txt', 'id': 10}
----------------------------------------------------------------------------------------------------
Document 10:
PageRank
Definition: PageRank is an algorithm that evaluates the importance of web pages, mainly used to rank search engine results. It works by analyzing the link structure between web pages.
Example: The Google search engine uses the PageRank algorithm to rank search results.
Associated Keywords: search engine optimization, web analytics, link analysis
Data mining
Definition: Data mining is the process of discovering useful information from large amounts of data. It utilizes technologies such as statistics, machine learning, and pattern recognition.
Example: Retailers analyzing customer purchase data to develop a sales strategy is an example of data mining.
Associated Keywords: big data, pattern recognition, predictive analysis
Multimodal
Metadata: {'source': './data/appendix-keywords.txt', 'id': 13}
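Conceptually, the compression retriever used in the next step chains two components: the base retriever fetches candidates, and the compressor reorders and trims them. The sketch below mimics that flow with plain Python functions. `ToyCompressionRetriever`, `retrieve_all`, and `keep_best` are hypothetical names for illustration and do not reflect LangChain's internal implementation.

```python
from typing import Callable

# Type aliases for the two pluggable stages (assumed shapes for this sketch)
Retriever = Callable[[str], list[str]]
Compressor = Callable[[str, list[str]], list[str]]


class ToyCompressionRetriever:
    """Runs a base retriever, then passes its results through a compressor."""

    def __init__(self, base_compressor: Compressor, base_retriever: Retriever):
        self.base_compressor = base_compressor
        self.base_retriever = base_retriever

    def invoke(self, query: str) -> list[str]:
        candidates = self.base_retriever(query)          # stage 1: broad recall
        return self.base_compressor(query, candidates)   # stage 2: rerank and trim


corpus = [
    "word2vec maps words to vectors",
    "pandas reads csv files",
    "embedding turns text into vectors",
]


def retrieve_all(query: str) -> list[str]:
    # Stand-in for a vector store lookup: returns every document
    return list(corpus)


def keep_best(query: str, docs: list[str]) -> list[str]:
    # Stand-in for a reranker: order by shared words, keep only the best one
    terms = set(query.lower().split())
    ranked = sorted(docs, key=lambda d: len(terms & set(d.split())), reverse=True)
    return ranked[:1]


toy_retriever = ToyCompressionRetriever(
    base_compressor=keep_best, base_retriever=retrieve_all
)
result = toy_retriever.invoke("word2vec vectors")
print(result)
```

Because the two stages are independent, the first stage can stay cheap and broad (a larger k) while the second stage spends more compute on only a handful of candidates.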
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain_openai import ChatOpenAI

# Initialize the LLM
llm = ChatOpenAI(temperature=0)

# Initialize the document compressor (reranker)
compressor = FlashrankRerank(model="ms-marco-MultiBERT-L-12")

# Initialize the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

# Retrieve the compressed (reranked) documents
compressed_docs = compression_retriever.invoke(
    "Explain Word2Vec to me."
)

# Print the document IDs
print([doc.metadata["id"] for doc in compressed_docs])

# Print the reranked documents with their relevance scores
pretty_print_docs(compressed_docs)
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"
Document 1:
Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching: it grasps the meaning of the user's query and returns related results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated Keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17].
Associated Keywords: natural language processing, vectorization, deep learning
Token
Metadata: {'source': './data/appendix-keywords.txt', 'id': 0, 'relevance_score': 0.9997491}
----------------------------------------------------------------------------------------------------
Document 2:
HuggingFace
Definition: HuggingFace is a library that provides a variety of pre-trained models and tools for natural language processing. This helps researchers and developers do NLP work easily.
Example: You can use HuggingFace's Transformers library to do sentiment analysis, text generation, and more.
Associated Keywords: natural language processing, deep learning, library
Digital Transformation
Definition: Digital transformation is the process of leveraging technology to transform a company's services, culture and operations. This focuses on improving the business model and increasing competitiveness through digital technology.
Example: Innovating data storage and processing by introducing cloud computing is an example of digital transformation.
Associated Keywords: innovation, technology, business model
Crawling
Metadata: {'source': './data/appendix-keywords.txt', 'id': 4, 'relevance_score': 0.997148}
----------------------------------------------------------------------------------------------------
Document 3:
DataFrame
Definition: DataFrame is a table-shaped data structure consisting of rows and columns, mainly used for data analysis and processing.
Example: In the Pandas library, DataFrame can have columns of various data types, facilitating data manipulation and analysis.
Associated Keywords: data analysis, pandas, data processing
Attention mechanism
Definition: The Attention mechanism is a technique that lets a deep learning model pay more 'attention' to important information. It is mainly used with sequence data (e.g., text, time series data).
Example: In the translation model, the Attention mechanism focuses more on important parts of the input sentence, producing accurate translations.
Associated Keywords: deep learning, natural language processing, sequence modeling
Pandas
Metadata: {'source': './data/appendix-keywords.txt', 'id': 10, 'relevance_score': 0.9997084}
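Each reranked document carries a relevance_score in its metadata, so the results can be post-processed with ordinary Python. The following is a minimal sketch using the IDs and scores from the output above. The `Doc` class is a simplified stand-in for LangChain's Document, the helper names are made up for this example, and the 0.999 threshold is an arbitrary value chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Simplified stand-in for a document with page_content and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# IDs and scores copied from the reranked output; contents abbreviated
compressed_docs = [
    Doc("Semantic Search ...", {"id": 0, "relevance_score": 0.9997491}),
    Doc("HuggingFace ...", {"id": 4, "relevance_score": 0.997148}),
    Doc("DataFrame ...", {"id": 10, "relevance_score": 0.9997084}),
]


def sort_by_relevance(docs: list[Doc]) -> list[Doc]:
    """Return the documents ordered from highest to lowest relevance_score."""
    return sorted(docs, key=lambda d: d.metadata["relevance_score"], reverse=True)


def filter_by_score(docs: list[Doc], min_score: float) -> list[Doc]:
    """Keep only documents whose relevance_score is at least min_score."""
    return [d for d in docs if d.metadata["relevance_score"] >= min_score]


ranked_ids = [d.metadata["id"] for d in sort_by_relevance(compressed_docs)]
kept_ids = [d.metadata["id"] for d in filter_by_score(compressed_docs, 0.999)]
print(ranked_ids)  # [0, 10, 4]
print(kept_ids)    # [0, 10]
```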