04. FlashRank Reranker

FlashRank is an ultra-light, ultra-fast Python library for adding reranking to existing search and retrieval pipelines. It is based on SoTA cross-encoders.

This notebook shows how to use flashrank for document compression and retrieval.

Setup

# installation
# !pip install -qU flashrank

def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [
                f"Document {i+1}:\n\n{d.page_content}\nMetadata: {d.metadata}"
                for i, d in enumerate(docs)
            ]
        )
    )

FlashrankRerank

Load the data for a simple example and create a retriever.

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

# load document
documents = TextLoader("./data/appendix-keywords.txt").load()

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

# split document
texts = text_splitter.split_documents(documents)

# Add a unique ID to each text
for idx, text in enumerate(texts):
    text.metadata["id"] = idx

# Create the retriever
retriever = FAISS.from_documents(
    texts, OpenAIEmbeddings()
).as_retriever(search_kwargs={"k": 10})

# question (Korean for "Explain Word2Vec to me.")
query = "Word2Vec 에 대해서 설명해줘."

# document search
docs = retriever.invoke(query)

# print document
pretty_print_docs(docs)

 Document 1: 

Crawling 

Definition: Crawling is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis.
Example: The Google search engine visiting websites on the Internet to collect and index their content is an example of crawling.
Associated Keywords: data collection, web scraping, search engine

Word2Vec 

Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 
Metadata: {'source':'./data/appendix-keywords.txt','id': 5} 
---------------------------------------------------------------------------------------------------- 
Document 2: 

Semantic Search 

Definition: Semantic search is a search method that goes beyond simple keyword matching, grasping the meaning of a user's query to return relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated Keywords: natural language processing, search algorithms, data mining

Embedding 

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Associated Keywords: natural language processing, vectorization, deep learning 

Token 
Metadata: {'source':'./data/appendix-keywords.txt','id': 0} 
---------------------------------------------------------------------------------------------------- 
Document 3: 

Token 

Definition: A token is a smaller unit produced by splitting text. It is usually a word, sentence, or phrase.
Example: The sentence "I go to school" is split into "I", "go", and "to school".
Associated Keywords: tokenization, natural language processing, parsing 

Tokenizer 

Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing.
Example: The sentence "I love programming." is split into ["I", "love", "programming", "."].
Associated Keywords: tokenization, natural language processing, parsing 

VectorStore 

Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks. 
Example: Save the word embedding vectors to the database for quick access. 
Associated Keywords: Embedding, Database, Vectorization

SQL 
Metadata: {'source':'./data/appendix-keywords.txt','id': 1} 
---------------------------------------------------------------------------------------------------- 
Document 4: 

Parser 

Definition: A parser is a tool that analyzes given data (strings, files, etc.) and converts it into a structured form. It is used for parsing programming languages or processing file data.
Example: Parsing an HTML document to create the DOM structure of a web page is an example of parsing.
Associated Keywords: parsing, compiler, data processing

TF-IDF (Term Frequency-Inverse Document Frequency) 

Definition: TF-IDF is a statistical measure used to evaluate the importance of words within a document. This takes into account the frequency of words in a document and the scarcity of those words in the entire set of documents. 
Example: Words that do not appear frequently in many documents have high TF-IDF values. 
Associated Keywords: natural language processing, information retrieval, data mining

Deep Learning 
Metadata: {'source':'./data/appendix-keywords.txt','id': 8} 
---------------------------------------------------------------------------------------------------- 
Document 5: 

Definition: LLM refers to a large language model trained on large-scale text data. These models are used for various natural language understanding and generation tasks.
Example: OpenAI's GPT series is a representative large language model. 
Associated Keywords: natural language processing, deep learning, text generation 

FAISS (Facebook AI Similarity Search) 

Definition: FAISS is a high-speed similarity search library developed by Facebook, specifically designed to effectively search for similar vectors in large vector sets. 
Example: FAISS can be used to quickly find similar images among millions of image vectors. 
Associated Keywords: vector search, machine learning, database optimization 

Open Source 
Metadata: {'source':'./data/appendix-keywords.txt','id': 6} 
---------------------------------------------------------------------------------------------------- 
Document 6: 

Pandas 

Definition: Pandas is a library that provides data analysis and manipulation tools for the Python programming language. It lets you perform data analysis tasks efficiently.
Example: You can use pandas to read CSV files, clean data, and perform various analyses.
Associated Keywords: data analysis, Python, data processing

GPT (Generative Pretrained Transformer) 

Definition: GPT is a generative language model pre-trained on a large dataset and used for a variety of text-based tasks. It can generate natural language based on the text entered.
Example: A chatbot that generates detailed answers to questions provided by the user can use the GPT model.
Associated Keywords: natural language processing, text generation, deep learning

InstructGPT 
Metadata: {'source':'./data/appendix-keywords.txt','id': 11} 
---------------------------------------------------------------------------------------------------- 
Document 7: 

HuggingFace 

Definition: HuggingFace is a library that provides a variety of pre-trained models and tools for natural language processing. It helps researchers and developers do NLP work easily.
Example: You can use HuggingFace's Transformers library for sentiment analysis, text generation, and more.
Associated Keywords: natural language processing, deep learning, library

Digital Transformation 

Definition: Digital transformation is the process of leveraging technology to transform a company's services, culture and operations. This focuses on improving the business model and increasing competitiveness through digital technology. 
Example: Innovating data storage and processing by introducing cloud computing is an example of digital transformation. 
Related Keywords: innovation, technology, business model 

Crawling 
Metadata: {'source':'./data/appendix-keywords.txt','id': 4} 
---------------------------------------------------------------------------------------------------- 
Document 8: 

Deep Learning 

Definition: Deep learning is a field of machine learning that uses artificial neural networks to solve complex problems. It focuses on learning high-level representations from data.
Example: Deep learning models are used in image recognition, speech recognition, and natural language processing.
Associated Keywords: artificial neural networks, machine learning, data analysis

Schema 

Definition: A schema defines the structure of a database or file, providing a blueprint of how data is stored and organized. 
Example: A table schema in a relational database defines column names, data types, key constraints, and more. 
Associated Keywords: database, data modeling, data management

DataFrame 
Metadata: {'source':'./data/appendix-keywords.txt','id': 9} 
---------------------------------------------------------------------------------------------------- 
Document 9: 

DataFrame 

Definition: DataFrame is a table-shaped data structure consisting of rows and columns, mainly used for data analysis and processing. 
Example: In the Pandas library, DataFrame can have columns of various data types, facilitating data manipulation and analysis. 
Associated Keywords: data analysis, pandas, data processing 

Attention mechanism 

Definition: The Attention mechanism is a technique that lets a deep learning model pay more 'attention' to important information. It is mainly used with sequence data (e.g., text, time series data).
Example: In the translation model, the Attention mechanism focuses more on important parts of the input sentence, producing accurate translations. 
Associated Keywords: deep learning, natural language processing, sequence modeling 

Pandas 
Metadata: {'source':'./data/appendix-keywords.txt','id': 10} 
---------------------------------------------------------------------------------------------------- 
Document 10: 

Page Rank 

Definition: PageRank is an algorithm that evaluates the importance of a web page, mainly used to rank search engine results. Importance is evaluated by analyzing the link structure between web pages.
Example: The Google search engine uses the PageRank algorithm to rank search results.
Associated Keywords: search engine optimization, web analytics, link analysis

Data mining 

Definition: Data mining is the process of discovering useful information from large amounts of data. It utilizes technologies such as statistics, machine learning, and pattern recognition. 
Example: A retailer analyzing customer purchase data to develop a sales strategy is an example of data mining.
Associated Keywords: Big Data, Pattern Recognition, Predictive Analysis

Multimodal 
Metadata: {'source':'./data/appendix-keywords.txt','id': 13}
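
Several of the chunks above end with the heading of the next entry or share text with a neighboring chunk (for example, "Token" closes Document 2 and opens Document 3). This is most likely a side effect of size-limited chunking with overlap. The sketch below is a simplified character-window splitter, not the actual `RecursiveCharacterTextSplitter` algorithm (which also prefers to break on separators such as newlines); the function name `sliding_chunks` and the sample text are assumptions for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Hypothetical sample text; small sizes make the overlap easy to see
text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=4)
for i, chunk in enumerate(chunks):
    print(i, chunk)
```

Each window repeats the last four characters of the previous one, which is how a heading at the end of one chunk can reappear at the start of the next.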

Now let's wrap the base retriever in a ContextualCompressionRetriever and use FlashrankRerank as the compressor.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import FlashrankRerank
from langchain_openai import ChatOpenAI

# LLM Initialization
llm = ChatOpenAI(temperature=0)

# Initialize document compressor
compressor = FlashrankRerank(model="ms-marco-MultiBERT-L-12")

# Initialize the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

# Retrieve the reranked documents
compressed_docs = compression_retriever.invoke(
    "Explain Word2Vec to me."
)

# Print the document IDs
print([doc.metadata["id"] for doc in compressed_docs])

 INFO:httpx:HTTP Request: POST https://api.openai.com/v1/embeddings "HTTP/1.1 200 OK"

 [0, 4, 10]
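
Only three IDs come back because the reranker keeps its top-scoring candidates out of the ten that the base retriever returned. Conceptually, the wrapper runs two stages: fetch candidates, then rescore and truncate. The sketch below mimics that flow with plain Python stand-ins. `Doc`, `ToyRetriever`, `ToyReranker`, `ToyCompressionRetriever`, and the word-overlap scorer are hypothetical and are not LangChain or FlashRank classes; `top_n=3` is an assumption chosen to match the three results shown above.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a document with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def word_overlap(query: str, text: str) -> float:
    """Toy relevance score: share of query words that occur in the text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0


class ToyRetriever:
    """First stage: return the first k documents (stands in for vector search)."""

    def __init__(self, docs: list[Doc], k: int):
        self.docs, self.k = docs, k

    def invoke(self, query: str) -> list[Doc]:
        return self.docs[: self.k]


class ToyReranker:
    """Second stage: rescore candidates against the query and keep the top_n."""

    def __init__(self, top_n: int = 3):
        self.top_n = top_n

    def compress_documents(self, documents: list[Doc], query: str) -> list[Doc]:
        # Copy each document and record its score in the metadata
        scored = [
            Doc(d.page_content,
                {**d.metadata, "relevance_score": word_overlap(query, d.page_content)})
            for d in documents
        ]
        scored.sort(key=lambda d: d.metadata["relevance_score"], reverse=True)
        return scored[: self.top_n]


class ToyCompressionRetriever:
    """Chain the two stages: retrieve candidates, then rerank them."""

    def __init__(self, base_compressor: ToyReranker, base_retriever: ToyRetriever):
        self.base_compressor, self.base_retriever = base_compressor, base_retriever

    def invoke(self, query: str) -> list[Doc]:
        candidates = self.base_retriever.invoke(query)
        return self.base_compressor.compress_documents(candidates, query)


corpus = [
    Doc("pandas is a data analysis library", {"id": 0}),
    Doc("a tokenizer splits text into tokens", {"id": 1}),
    Doc("word2vec maps words to vectors", {"id": 2}),
    Doc("faiss searches similar vectors", {"id": 3}),
    Doc("word2vec vectors capture similarity", {"id": 4}),
]
pipeline = ToyCompressionRetriever(ToyReranker(top_n=3), ToyRetriever(corpus, k=5))
results = pipeline.invoke("word2vec vectors")
print([(d.metadata["id"], d.metadata["relevance_score"]) for d in results])
```

The reranked copies carry a `relevance_score` entry in their metadata, which mirrors the extra field visible in the reranked output on this page.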

Compare the results after the reranker is applied.

# Print the reranked documents
pretty_print_docs(compressed_docs)

 Document 1: 

Semantic Search 

Definition: Semantic search is a search method that goes beyond simple keyword matching, grasping the meaning of a user's query to return relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated Keywords: natural language processing, search algorithms, data mining

Embedding 

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Associated Keywords: natural language processing, vectorization, deep learning 

Token 
Metadata: {'source':'./data/appendix-keywords.txt','id': 0,'relevance_score': 0.9997491} 
---------------------------------------------------------------------------------------------------- 
Document 2: 

HuggingFace 

Definition: HuggingFace is a library that provides a variety of pre-trained models and tools for natural language processing. It helps researchers and developers do NLP work easily.
Example: You can use HuggingFace's Transformers library for sentiment analysis, text generation, and more.
Associated Keywords: natural language processing, deep learning, library

Digital Transformation 

Definition: Digital transformation is the process of leveraging technology to transform a company's services, culture and operations. This focuses on improving the business model and increasing competitiveness through digital technology. 
Example: Innovating data storage and processing by introducing cloud computing is an example of digital transformation. 
Related Keywords: innovation, technology, business model 

Crawling 
Metadata: {'source':'./data/appendix-keywords.txt','id': 4,'relevance_score': 0.997148} 
---------------------------------------------------------------------------------------------------- 
Document 3: 

DataFrame 

Definition: DataFrame is a table-shaped data structure consisting of rows and columns, mainly used for data analysis and processing. 
Example: In the Pandas library, DataFrame can have columns of various data types, facilitating data manipulation and analysis. 
Associated Keywords: data analysis, pandas, data processing 

Attention mechanism 

Definition: The Attention mechanism is a technique that lets a deep learning model pay more 'attention' to important information. It is mainly used with sequence data (e.g., text, time series data).
Example: In the translation model, the Attention mechanism focuses more on important parts of the input sentence, producing accurate translations. 
Associated Keywords: deep learning, natural language processing, sequence modeling 

Pandas 
Metadata: {'source':'./data/appendix-keywords.txt','id': 10,'relevance_score': 0.9997084}
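
To see how far the reranker moved each document, you can compare the ID order from the base retriever with the ID order after reranking. The sketch below hard-codes the two ID lists from the outputs shown on this page; the helper name `rank_changes` is an assumption for illustration. In practice you would build the lists from `docs` and `compressed_docs`.

```python
def rank_changes(before_ids: list[int], after_ids: list[int]) -> list[tuple[int, int, int]]:
    """Return (doc_id, old_rank, new_rank) for each reranked document, using 1-based ranks."""
    old_rank = {doc_id: pos for pos, doc_id in enumerate(before_ids, start=1)}
    return [
        (doc_id, old_rank[doc_id], new_pos)
        for new_pos, doc_id in enumerate(after_ids, start=1)
    ]


# IDs in the order printed by the base retriever (Document 1 through Document 10)
before_ids = [5, 0, 1, 8, 6, 11, 4, 9, 10, 13]
# IDs in the order returned after reranking
after_ids = [0, 4, 10]

changes = rank_changes(before_ids, after_ids)
for doc_id, old, new in changes:
    print(f"id={doc_id}: rank {old} -> {new}")
```

In this run the documents originally ranked 2nd, 7th, and 9th were kept. The chunk with id 5, which contains the Word2Vec definition and was ranked first by the base retriever, does not appear in the reranked list, so it is worth inspecting reranker output for your own data and queries.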

