03. Jina Reranker

This notebook shows how to use Jina Reranker for document compression and retrieval.

API key issuance

# Use a configuration file (.env) to manage API keys as environment variables
from dotenv import load_dotenv

# Load the API key information
load_dotenv()

 True
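
Both the embeddings and the reranker used below need API keys, so it can help to confirm they were actually loaded before going further. The following is a minimal sketch, assuming the keys are exposed as `OPENAI_API_KEY` and `JINA_API_KEY`; these variable names and the `missing_keys` helper are assumptions for illustration, not part of the original notebook.

```python
import os

# Assumed variable names; adjust them to whatever your .env file defines.
REQUIRED_KEYS = ("OPENAI_API_KEY", "JINA_API_KEY")


def missing_keys(required=REQUIRED_KEYS, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Report which keys, if any, still need to be configured.
missing = missing_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("All API keys are set.")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.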

# Set up LangSmith tracking. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("Reranker")

 Start tracking LangSmith. 
[Project name] 
Reranker

Jina Reranker

Load the data for a simple example and create a retriever.

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

# Load the document
documents = TextLoader("./data/appendix-keywords.txt").load()

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)

# Split the document into chunks
texts = text_splitter.split_documents(documents)

# Build a FAISS vector store and expose it as a retriever
retriever = FAISS.from_documents(
    texts, OpenAIEmbeddings()
).as_retriever(search_kwargs={"k": 10})

# Query
query = "Word2Vec explain to me."

# Retrieve documents
docs = retriever.invoke(query)

# Print the documents (pretty_print_docs is the helper defined later in this section;
# define it before running this cell)
pretty_print_docs(docs)

 Document 1: 

Crawling 

Definition: Crawling is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis.
Example: Google's search engine visiting websites on the Internet to collect and index their content is crawling.
Associated Keywords: data collection, web scraping, search engine

Word2Vec 

Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 
---------------------------------------------------------------------------------------------------- 
Document 2: 

Semantic Search 

Definition: Semantic search is a search method that goes beyond simple keyword matching to grasp the meaning of a user's query and return relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated Keywords: natural language processing, search algorithms, data mining

Embedding 

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Associated Keywords: natural language processing, vectorization, deep learning 

Token 
---------------------------------------------------------------------------------------------------- 
Document 3: 

Token 

Definition: A token is a smaller unit obtained by splitting text. It is usually a word, sentence, or phrase.
Example: Split the sentence "I go to school" into "I", "to school", and "go". 
Associated Keywords: tokenization, natural language processing, parsing 

Tokenizer 

Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing.
Example: Split the sentence "I love programming." into ["I", "love", "programming", "."].
Associated Keywords: tokenization, natural language processing, parsing 

VectorStore 

Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks. 
Example: Save the word embedding vectors to the database for quick access. 
Associated Keywords: embedding, database, vectorization 

SQL 
---------------------------------------------------------------------------------------------------- 
Document 4: 

Parser 

Definition: A parser is a tool that analyzes given data (strings, files, etc.) and converts it into a structured form. It is used for parsing programming languages or processing file data.
Example: Parsing an HTML document to create the DOM structure of a web page is an example of parsing.
Associated Keywords: parsing, compiler, data processing

TF-IDF (Term Frequency-Inverse Document Frequency) 

Definition: TF-IDF is a statistical measure used to evaluate the importance of words within a document. This takes into account the frequency of words in a document and the scarcity of those words in the entire set of documents. 
Example: Words that do not appear frequently in many documents have high TF-IDF values. 
Associated Keywords: natural language processing, information retrieval, data mining

Deep Learning 
---------------------------------------------------------------------------------------------------- 
Document 5: 

Definition: LLM refers to a large language model trained with large-scale text data. These models are used for various natural language understanding and creation tasks. 
Example: OpenAI's GPT series is a representative large language model. 
Associated Keywords: natural language processing, deep learning, text generation 

FAISS (Facebook AI Similarity Search) 

Definition: FAISS is a high-speed similarity search library developed by Facebook, specifically designed to effectively search for similar vectors in large vector sets. 
Example: FAISS can be used to quickly find similar images among millions of image vectors. 
Associated Keywords: vector search, machine learning, database optimization 

Open Source 
---------------------------------------------------------------------------------------------------- 
Document 6: 

Pandas 

Definition: Pandas is a library that provides data analysis and manipulation tools for the Python programming language. This allows you to perform data analysis tasks efficiently. 
Example: You can use pandas to read CSV files, clean data, and perform various analyses.
Associated Keywords: data analysis, Python, data processing

GPT (Generative Pretrained Transformer) 

Definition: GPT is a generative language model pre-trained on a large dataset and used for a variety of text-based tasks. It can generate natural language based on the text entered.
Example: A chatbot that generates detailed answers to questions provided by the user can use the GPT model.
Associated Keywords: natural language processing, text generation, deep learning

InstructGPT 
---------------------------------------------------------------------------------------------------- 
Document 7: 

HuggingFace 

Definition: HuggingFace is a library that provides a variety of pre-trained models and tools for natural language processing. This helps researchers and developers do NLP work easily. 
Example: You can use HuggingFace's Transformers library for sentiment analysis, text generation, and more.
Associated Keywords: natural language processing, deep learning, library

Digital Transformation 

Definition: Digital transformation is the process of leveraging technology to transform a company's services, culture and operations. This focuses on improving the business model and increasing competitiveness through digital technology. 
Example: Innovating data storage and processing by introducing cloud computing is an example of digital transformation. 
Associated Keywords: innovation, technology, business model

Crawling 
---------------------------------------------------------------------------------------------------- 
Document 8: 

Deep Learning 

Definition: Deep learning is a field of machine learning that uses artificial neural networks to solve complex problems. It focuses on learning high-level representations from data.
Example: Deep learning models are used in image recognition, speech recognition, and natural language processing.
Associated Keywords: artificial neural network, machine learning, data analysis

Schema 

Definition: A schema defines the structure of a database or file, providing a blueprint of how data is stored and organized. 
Example: A table schema in a relational database defines column names, data types, key constraints, and more. 
Associated Keywords: database, data modeling, data management

DataFrame 
---------------------------------------------------------------------------------------------------- 
Document 9: 

DataFrame 

Definition: DataFrame is a table-shaped data structure consisting of rows and columns, mainly used for data analysis and processing. 
Example: In the Pandas library, DataFrame can have columns of various data types, facilitating data manipulation and analysis. 
Associated Keywords: data analysis, pandas, data processing 

Attention mechanism 

Definition: The Attention mechanism is a deep learning technique that lets a model pay more "attention" to important information. It is mainly used with sequence data (e.g., text, time series data).
Example: In the translation model, the Attention mechanism focuses more on important parts of the input sentence, producing accurate translations. 
Associated Keywords: deep learning, natural language processing, sequence modeling 

Pandas 
---------------------------------------------------------------------------------------------------- 
Document 10: 

Page Rank 

Definition: Page Rank is an algorithm that evaluates the importance of a web page, mainly used to rank search engine results. Importance is evaluated by analyzing the link structure between web pages.
Example: Google's search engine uses the Page Rank algorithm to rank search results.
Associated Keywords: search engine optimization, web analytics, link analysis

Data mining 

Definition: Data mining is the process of discovering useful information from large amounts of data. It utilizes technologies such as statistics, machine learning, and pattern recognition. 
Example: A retailer analyzing customer purchase data to develop a sales strategy is an example of data mining.
Associated Keywords: big data, pattern recognition, predictive analysis

Multimodal 
---------------------------------------------------------------------------------------------------- 
Document 11: 

JSON 

Definition: JSON (JavaScript Object Notation) is a lightweight data exchange format that represents data objects as text readable by both people and machines.
Example: {"Name": "Hong Gil-dong", "Age": 30, "Occupation": "Developer"} is data in JSON format.
Associated Keywords: data exchange, web development, API

Transformer 

Definition: The Transformer is a type of deep learning model used in natural language processing, mainly for translation, summarization, text generation, etc. It is based on the Attention mechanism.
Example: Google Translate uses Transformer models to translate between different languages.
Associated Keywords: deep learning, natural language processing, Attention 

HuggingFace 
---------------------------------------------------------------------------------------------------- 
Document 12: 

InstructGPT 

Definition: InstructGPT is a GPT model optimized to perform specific tasks according to the user's instructions. This model is designed to produce more accurate and relevant results.
Example: If a user provides a specific instruction such as "draft an email", InstructGPT creates an email based on the relevant content.
Associated Keywords: artificial intelligence, natural language understanding, instruction-based processing

Keyword Search 

Definition: Keyword search is the process of finding information based on keywords entered by the user. It is used as a basic search method in most search engines and database systems. 
Example: When a user searches for "Coffee Shop Seoul", it returns a list of related coffee shops. 
Associated Keywords: search engine, data search, information retrieval

Page Rank 
---------------------------------------------------------------------------------------------------- 
Document 13: 

SQL 

Definition: SQL (Structured Query Language) is a programming language for managing data in a database. It supports a variety of operations, including querying, modifying, inserting, and deleting data.
Example: SELECT * FROM users WHERE age > 18; retrieves information on users older than 18.
Associated Keywords: database, query, data management 

CSV 

Definition: CSV (Comma-Separated Values) is a file format for storing data in which each value is separated by a comma. It is used to store and exchange tabular data in a simple way.
Example: A CSV file with the headers name, age, job may contain data such as Hong Gil-dong, 30, developer.
Associated Keywords: data format, file processing, data exchange

JSON 
---------------------------------------------------------------------------------------------------- 
Document 14: 

Open Source 

Definition: Open source refers to software whose source code is released so that anyone can freely use, modify, and distribute it. It plays an important role in promoting collaboration and innovation.
Example: The Linux operating system is a representative open source project.
Associated Keywords: software development, community, technology collaboration

Structured Data 

Definition: Structured data is data organized according to a defined format or schema. It can be easily retrieved and analyzed from databases, spreadsheets, etc. 
Example: A customer information table stored in a relational database is an example of structured data. 
Associated Keywords: database, data analysis, data modeling

Parser 
---------------------------------------------------------------------------------------------------- 
Document 15: 

Multimodal 

Definition: Multimodal is a technique that combines and processes different types of data modes (e.g. text, images, sounds, etc.). It is used to extract or predict richer and more accurate information through interactions between different types of data.
Example: A system that analyzes images and descriptive text together to perform more accurate image classification is an example of multimodal technology.
Associated Keywords: data fusion, artificial intelligence, deep learning
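
The splitter used above is configured with `chunk_size=500` and `chunk_overlap=100`. To give a feel for what those two numbers mean, here is a simplified sketch: a plain fixed-width sliding window over characters. It is not how `RecursiveCharacterTextSplitter` works internally (that class splits on separators first), and `sliding_chunks` is a hypothetical helper written only for this illustration.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Split `text` into windows of up to `chunk_size` characters, where each
    window repeats the last `chunk_overlap` characters of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# A 1200-character dummy string stands in for a real document.
sample = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(sample)
print([len(c) for c in chunks])          # [500, 500, 400]
print(chunks[0][-100:] == chunks[1][:100])  # True
```

The overlap means that text near a chunk boundary appears in two neighboring chunks, which is one reason the same passage can show up in more than one retrieved document.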

Reranking with JinaRerank

Now let's wrap the base retriever with ContextualCompressionRetriever, using Jina Reranker as the compressor.

from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.document_compressors import JinaRerank

# Initialize the JinaRerank compressor (keeps the 3 highest-scoring documents)
compressor = JinaRerank(model="jina-reranker-v2-base-multilingual", top_n=3)

# Initialize the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)

# Retrieve documents and rerank them with the compressor
compressed_docs = compression_retriever.invoke(
    "Word2Vec explain to me."
)
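
Conceptually, the compression retriever first fetches candidates from the base retriever, then scores each candidate against the query and keeps only the `top_n` best. The sketch below illustrates that flow with a toy word-overlap scorer standing in for the Jina model; `overlap_score` and `rerank` are hypothetical helpers, and the real relevance scores come from the Jina API rather than from word counting.

```python
def overlap_score(query, text):
    """Toy stand-in scorer: fraction of query words that appear in `text`."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


def rerank(query, texts, top_n=3, score=overlap_score):
    """Score every candidate against the query and keep the `top_n` best."""
    scored = [(score(query, t), i) for i, t in enumerate(texts)]
    # Highest score first; ties keep the original retrieval order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [texts[i] for _, i in scored[:top_n]]


candidates = [
    "crawling collects data from web pages",
    "word2vec maps words to vectors",
    "a tokenizer splits text into tokens",
    "embedding vectors represent text",
]
print(rerank("explain word2vec vectors", candidates, top_n=2))
# ['word2vec maps words to vectors', 'embedding vectors represent text']
```

Because the reranker only reorders and trims what the base retriever returned, a relevant chunk that the base retriever misses cannot be recovered at this stage.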

def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )

# Pretty-print the compressed documents
pretty_print_docs(compressed_docs)

 Document 1: 

Crawling 

Definition: Crawling is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis.
Example: Google's search engine visiting websites on the Internet to collect and index their content is crawling.
Associated Keywords: data collection, web scraping, search engine

Word2Vec 

Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words. 
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other. 
Associated Keywords: natural language processing, embedding, semantic similarity 
LLM (Large Language Model) 
---------------------------------------------------------------------------------------------------- 
Document 2: 

Token 

Definition: A token is a smaller unit obtained by splitting text. It is usually a word, sentence, or phrase.
Example: Split the sentence "I go to school" into "I", "to school", and "go". 
Associated Keywords: tokenization, natural language processing, parsing 

Tokenizer 

Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing.
Example: Split the sentence "I love programming." into ["I", "love", "programming", "."].
Associated Keywords: tokenization, natural language processing, parsing 

VectorStore 

Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks. 
Example: Save the word embedding vectors to the database for quick access. 
Associated Keywords: embedding, database, vectorization 

SQL 
---------------------------------------------------------------------------------------------------- 
Document 3: 

Semantic Search 

Definition: Semantic search is a search method that goes beyond simple keyword matching to grasp the meaning of a user's query and return relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated Keywords: natural language processing, search algorithms, data mining

Embedding 

Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text. 
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17]. 
Associated Keywords: natural language processing, vectorization, deep learning 

Token
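
To see why the documents were reordered, it can be useful to look at the score attached to each result. To my understanding, the `JinaRerank` compressor in `langchain_community` stores the score under `metadata["relevance_score"]`; treat that key as an assumption and check your installed version. The sketch below uses a hypothetical `FakeDoc` stand-in with made-up scores so it runs without an API key; in the notebook you would pass `compressed_docs` instead.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document, for illustration only."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def first_line(text):
    """Return the first non-empty line of `text`, or an empty string."""
    stripped = text.strip()
    return stripped.splitlines()[0] if stripped else ""


def score_table(docs):
    """Return one 'rank  score  title' line per document."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        score = doc.metadata.get("relevance_score")  # assumed metadata key
        shown = "n/a" if score is None else f"{score:.3f}"
        lines.append(f"{rank}  {shown}  {first_line(doc.page_content)}")
    return lines


# Made-up scores, not real Jina output.
demo = [
    FakeDoc("Crawling\n\nDefinition: ...", {"relevance_score": 0.5}),
    FakeDoc("Token\n\nDefinition: ..."),
]
print("\n".join(score_table(demo)))
```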

Last updated 5 months ago