# Configuration file for managing API keys as environment variables
from dotenv import load_dotenv
# Load API key information
load_dotenv()
True
# Set up LangSmith tracing. https://smith.langchain.com
# !pip install langchain-teddynote
from langchain_teddynote import logging
# Enter a project name.
logging.langsmith("Reranker")
Start tracking LangSmith.
[Project name]
Reranker
Jina Reranker
Load the data for a simple example and create a retriever.
Reranking with JinaRerank
Now let's wrap the base retriever in a ContextualCompressionRetriever, using JinaRerank as the compressor.
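To make the idea concrete, here is a minimal sketch of the retrieve-then-rerank pattern in plain Python. It is not how JinaRerank scores documents: `overlap_score` is a hypothetical stand-in that simply counts shared words, and `rerank` is an illustrative helper, not a LangChain API. The point is only the shape of the step: score every candidate against the query, sort, and keep the top `top_n`.

```python
from typing import Callable, List, Tuple


def overlap_score(query: str, text: str) -> float:
    """Toy relevance score (assumption): fraction of query terms found in the text."""
    query_terms = set(query.lower().split())
    text_terms = set(text.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & text_terms) / len(query_terms)


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float] = overlap_score,
    top_n: int = 3,
) -> List[Tuple[float, str]]:
    """Score each candidate against the query and keep the top_n best."""
    scored = [(score_fn(query, text), text) for text in candidates]
    # Highest score first; the sort is stable, so ties keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


candidates = [
    "Crawling collects data by visiting web pages.",
    "Word2Vec maps words to vectors.",
    "Embedding converts text into vectors.",
    "Word2Vec embedding places similar words close together.",
]
for score, text in rerank("word2vec embedding", candidates, top_n=2):
    print(f"{score:.2f}  {text}")
```

A real reranker replaces `overlap_score` with a learned model that reads the query and the document together, which is why it can reorder candidates that a vector search ranked by embedding similarity alone.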
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
# load document
documents = TextLoader("./data/appendix-keywords.txt").load()
# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
# split document
texts = text_splitter.split_documents(documents)
# Initialize the retriever
retriever = FAISS.from_documents(
    texts, OpenAIEmbeddings()
).as_retriever(search_kwargs={"k": 10})
# question
query = "Word2Vec explain to me."
# document search
docs = retriever.invoke(query)
# print document
pretty_print_docs(docs)
Document 1:
Crawling
Definition: Crawling is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis.
Example: The Google search engine visiting websites on the Internet to collect and index their content is an example of crawling.
Associated Keywords: data collection, web scraping, search engine
Word2Vec
Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words.
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other.
Associated Keywords: natural language processing, embedding, semantic similarity
LLM (Large Language Model)
----------------------------------------------------------------------------------------------------
Document 2:
Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching: it grasps the meaning of the user's query and returns relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated Keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17].
Associated Keywords: natural language processing, vectorization, deep learning
Token
----------------------------------------------------------------------------------------------------
Document 3:
Token
Definition: A token is a smaller unit produced by splitting text. It is usually a word, sentence, or phrase.
Example: The sentence "I go to school" is split into "I", "go", and "to school".
Associated Keywords: tokenization, natural language processing, parsing
Tokenizer
Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing.
Example: The sentence "I love programming." is split into ["I", "love", "programming", "."].
Associated Keywords: tokenization, natural language processing, parsing
VectorStore
Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks.
Example: Save the word embedding vectors to the database for quick access.
Associated Keywords: embedding, database, vectorization
SQL
----------------------------------------------------------------------------------------------------
Document 4:
Parser
Definition: Parser is a tool that analyzes a given data (string, files, etc.) and converts it into a structured form. It is used for parsing programming languages or processing file data.
Example: Parsing an HTML document to create a DOM structure for a web page is an example of parsing.
Associated Keywords: parsing, compiler, data processing
TF-IDF (Term Frequency-Inverse Document Frequency)
Definition: TF-IDF is a statistical measure used to evaluate the importance of words within a document. This takes into account the frequency of words in a document and the scarcity of those words in the entire set of documents.
Example: Words that appear in only a few documents have high TF-IDF values.
Associated Keywords: Natural language processing, information retrieval, data mining
Deep Learning
----------------------------------------------------------------------------------------------------
Document 5:
Definition: LLM refers to a large language model trained on large-scale text data. These models are used for various natural language understanding and generation tasks.
Example: OpenAI's GPT series is a representative large language model.
Associated Keywords: natural language processing, deep learning, text generation
FAISS (Facebook AI Similarity Search)
Definition: FAISS is a high-speed similarity search library developed by Facebook, specifically designed to search efficiently for similar vectors in large vector sets.
Example: FAISS can be used to quickly find similar images among millions of image vectors.
Associated Keywords: vector search, machine learning, database optimization
Open Source
----------------------------------------------------------------------------------------------------
Document 6:
Pandas
Definition: Pandas is a library that provides data analysis and manipulation tools for the Python programming language. This allows you to perform data analysis tasks efficiently.
Example: You can use Pandas to read CSV files, clean data, and perform various analyses.
Associated Keywords: Data Analysis, Python, Data Processing
GPT (Generative Pretrained Transformer)
Definition: GPT is a generative language model pre-trained on a large dataset and used for a variety of text-based tasks. It can generate natural language based on the input text.
Example: A chatbot that generates detailed answers to questions provided by the user can use the GPT model.
Associated Keywords: Natural language processing, text generation, deep learning
InstructGPT
----------------------------------------------------------------------------------------------------
Document 7:
HuggingFace
Definition: HuggingFace is a library that provides a variety of pre-trained models and tools for natural language processing. This helps researchers and developers do NLP work easily.
Example: You can use HuggingFace's Transformers library to do sentiment analysis, text generation, and more.
Associated Keywords: Natural language processing, deep learning, library
Digital Transformation
Definition: Digital transformation is the process of leveraging technology to transform a company's services, culture and operations. This focuses on improving the business model and increasing competitiveness through digital technology.
Example: Innovating data storage and processing by introducing cloud computing is an example of digital transformation.
Associated Keywords: innovation, technology, business model
Crawling
----------------------------------------------------------------------------------------------------
Document 8:
Deep Learning
Definition: Deep learning is a field of machine learning that uses artificial neural networks to solve complex problems. It focuses on learning high-level representations from data.
Example: Deep learning models are used in image recognition, speech recognition, and natural language processing.
Associated Keywords: Artificial Neural Network, Machine Learning, Data Analysis
Schema
Definition: A schema defines the structure of a database or file, providing a blueprint of how data is stored and organized.
Example: A table schema in a relational database defines column names, data types, key constraints, and more.
Associated Keywords: database, data modeling, data management
DataFrame
----------------------------------------------------------------------------------------------------
Document 9:
DataFrame
Definition: DataFrame is a table-shaped data structure consisting of rows and columns, mainly used for data analysis and processing.
Example: In the Pandas library, DataFrame can have columns of various data types, facilitating data manipulation and analysis.
Associated Keywords: data analysis, pandas, data processing
Attention mechanism
Definition: The Attention mechanism is a deep learning technique that lets a model pay more "attention" to important information. It is mainly used with sequence data (e.g., text, time series data).
Example: In the translation model, the Attention mechanism focuses more on important parts of the input sentence, producing accurate translations.
Associated Keywords: deep learning, natural language processing, sequence modeling
Pandas
----------------------------------------------------------------------------------------------------
Document 10:
Page Rank
Definition: PageRank is an algorithm that evaluates the importance of a web page, mainly used to rank search engine results. Importance is evaluated by analyzing the link structure between web pages.
Example: The Google search engine uses the PageRank algorithm to rank search results.
Associated Keywords: Search engine optimization, web analytics, link analysis
Data mining
Definition: Data mining is the process of discovering useful information from large amounts of data. It utilizes technologies such as statistics, machine learning, and pattern recognition.
Example: A retailer analyzing customer purchase data to develop a sales strategy is an example of data mining.
Associated Keywords: Big Data, Pattern Recognition, Predictive Analysis
Multimodal
----------------------------------------------------------------------------------------------------
Document 11:
JSON
Definition: JSON (JavaScript Object Notation) is a lightweight data exchange format that represents data objects as text readable by both people and machines.
Example: {"Name": "Hong Gil-dong", "Age": 30, "Occupation": "Developer"} is data in JSON format.
Associated Keywords: Data Exchange, Web Development, API
Transformer
Definition: The Transformer is a type of deep learning model used in natural language processing, mainly for translation, summarization, text generation, etc. It is based on the Attention mechanism.
Example: Google Translate uses a transformer model to translate between different languages.
Associated Keywords: deep learning, natural language processing, Attention
HuggingFace
----------------------------------------------------------------------------------------------------
Document 12:
InstructGPT
Definition: InstructGPT is a GPT model optimized to perform specific tasks according to user's instructions. This model is designed to produce more accurate and relevant results.
Example: If a user provides specific instructions such as "draft email", InstructGPT creates an email based on the relevant content.
Associated Keywords: Artificial Intelligence, Natural Language Understanding, Instruction-Based Processing
Keyword Search
Definition: Keyword search is the process of finding information based on keywords entered by the user. It is used as a basic search method in most search engines and database systems.
Example: When a user searches for "Coffee Shop Seoul", it returns a list of related coffee shops.
Associated Keywords: search engine, data search, information search
Page Rank
----------------------------------------------------------------------------------------------------
Document 13:
SQL
Definition: Structured Query Language (SQL) is a programming language for managing data in a database. It supports a variety of operations, including querying, updating, inserting, and deleting data.
Example: SELECT * FROM users WHERE age > 18; retrieves information on users older than 18.
Associated Keywords: database, query, data management
CSV
Definition: CSV (Comma-Separated Values) is a file format for storing data in which each value is separated by a comma. It is used to store and exchange tabular data simply.
Example: A CSV file with the headers name, age, and job may contain a row such as Hong Gil-dong, 30, developer.
Associated Keywords: data format, file processing, data exchange
JSON
----------------------------------------------------------------------------------------------------
Document 14:
Open Source
Definition: Open source refers to software whose source code is publicly released so that anyone can freely use, modify, and distribute it. It plays an important role in promoting collaboration and innovation.
Example: The Linux operating system is a representative open source project.
Associated Keywords: software development, community, technology collaboration
Structured Data
Definition: Structured data is data organized according to a defined format or schema. It can be easily retrieved and analyzed from databases, spreadsheets, etc.
Example: A customer information table stored in a relational database is an example of structured data.
Associated Keywords: database, data analysis, data modeling
Parser
----------------------------------------------------------------------------------------------------
Document 15:
Multimodal
Definition: Multimodal is a technique that combines and processes different types of data modes (e.g. text, images, sounds, etc.). It is used to extract or predict richer and more accurate information through interactions between different types of data.
Example: A system that analyzes images and descriptive text together to perform more accurate image classification is an example of multimodal technology.
Associated Keywords: Data Fusion, Artificial Intelligence, Deep Learning
from langchain.retrievers import ContextualCompressionRetriever
from langchain_community.document_compressors import JinaRerank
# Initialize the JinaRerank compressor
compressor = JinaRerank(model="jina-reranker-v2-base-multilingual", top_n=3)
# Initialize the contextual compression retriever
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
# Retrieve the relevant documents and rerank them
compressed_docs = compression_retriever.invoke(
    "Word2Vec explain to me."
)
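# Helper that prints each document, separated by a dashed line
def pretty_print_docs(docs):
    print(
        f"\n{'-' * 100}\n".join(
            [f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]
        )
    )
# Print the reranked documents
pretty_print_docs(compressed_docs)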
Document 1:
Crawling
Definition: Crawling is the process of collecting data by visiting web pages in an automated manner. It is often used for search engine optimization or data analysis.
Example: The Google search engine visiting websites on the Internet to collect and index their content is an example of crawling.
Associated Keywords: data collection, web scraping, search engine
Word2Vec
Definition: Word2Vec is a natural language processing technique that maps words to vector spaces to represent meaningful relationships between words. It produces vectors based on the contextual similarity of words.
Example: In the Word2Vec model, "King" and "Queen" are represented by vectors in positions close to each other.
Associated Keywords: natural language processing, embedding, semantic similarity
LLM (Large Language Model)
----------------------------------------------------------------------------------------------------
Document 2:
Token
Definition: A token is a smaller unit produced by splitting text. It is usually a word, sentence, or phrase.
Example: The sentence "I go to school" is split into "I", "go", and "to school".
Associated Keywords: tokenization, natural language processing, parsing
Tokenizer
Definition: A tokenizer is a tool for splitting text data into tokens. It is used to preprocess data in natural language processing.
Example: The sentence "I love programming." is split into ["I", "love", "programming", "."].
Associated Keywords: tokenization, natural language processing, parsing
VectorStore
Definition: Vector Store is a system that stores data converted to vector format. It is used for search, classification and other data analysis tasks.
Example: Save the word embedding vectors to the database for quick access.
Associated Keywords: embedding, database, vectorization
SQL
----------------------------------------------------------------------------------------------------
Document 3:
Semantic Search
Definition: Semantic search is a search method that goes beyond simple keyword matching: it grasps the meaning of the user's query and returns relevant results.
Example: When a user searches for "solar system planets", it returns information about related planets such as "Jupiter" and "Mars".
Associated Keywords: natural language processing, search algorithms, data mining
Embedding
Definition: Embedding is the process of converting text data, such as words or sentences, into a low-dimensional, continuous vector. This allows the computer to understand and process the text.
Example: The word "apple" is expressed in vectors such as [0.65, -0.23, 0.17].
Associated Keywords: natural language processing, vectorization, deep learning
Token
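Comparing the two outputs by eye shows that the reranker kept the Word2Vec chunk first and swapped the next two. A small helper can make that comparison explicit. The sketch below is illustrative only: `rank_changes` is a hypothetical function, and each chunk is identified by its first heading, copied by hand from the printed results rather than read from the retriever objects.

```python
from typing import List, Tuple


def rank_changes(before: List[str], after: List[str]) -> List[Tuple[int, int, str]]:
    """Return (new_rank, old_rank, title) for each reranked item, using 1-based ranks."""
    old_rank = {title: i for i, title in enumerate(before, start=1)}
    return [
        (new, old_rank[title], title) for new, title in enumerate(after, start=1)
    ]


# First heading of each chunk, in the order the base retriever returned them
# (only the first three results are listed here).
before = ["Crawling", "Semantic Search", "Token"]
# Order after JinaRerank with top_n=3, taken from the reranked output.
after = ["Crawling", "Token", "Semantic Search"]

for new, old, title in rank_changes(before, after):
    print(f"{title}: rank {old} -> {new}")
```

The helper assumes every reranked title also appears in the original list and that titles are unique; with real `Document` objects, a stable identifier stored in the metadata would be a safer key than the first heading.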