03. Using Various Modules for Each Function of RAG
1. Query processing
The query-processing stage receives the user's question, processes it, and finds the relevant data. The following components are required for this:
Data source connection: To find answers to questions, you need to connect to various text data sources. LangChain makes it easy to establish connections with a variety of data sources.
Data indexing and retrieval: To find relevant information in a data source efficiently, the data must be indexed. LangChain automates the indexing process and provides the tools needed to retrieve data related to a question.
2. Answer generation
After the relevant data has been found, an answer to the user's question must be generated from it. At this stage, the following component is important:
Answer generation model: LangChain can generate answers from the retrieved data using advanced natural language processing (NLP) models. These models take the user's question and the retrieved data as input and generate an appropriate answer.
Architecture
As outlined in the Q&A introduction, we will build a typical RAG application. It has two main components:
Indexing: a pipeline that collects data from sources and indexes it. This usually happens offline.
Retrieval and generation: the actual RAG chain, which receives the user query at run time, retrieves the relevant data from the index, and then passes that data to the model.
The full sequence from raw data to answer is as follows.
Indexing
Load: First, the data must be loaded. DocumentLoaders are used for this.
Split: Text splitters break large Documents into smaller chunks. This is useful both for indexing the data and for passing it to the model, since large chunks are harder to search over and do not fit in the model's finite context window.
Store: The splits need to be stored and indexed somewhere so they can be searched later. This is often done using a VectorStore and an Embeddings model.
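A minimal sketch of this indexing pipeline, assuming an OpenAI API key is set and a PDF exists at data/sample.pdf (the path, chunk sizes, and the FAISS choice are illustrative, not prescribed by this tutorial):

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load: read the source document into Document objects.
docs = PyPDFLoader("data/sample.pdf").load()

# Split: break the large documents into smaller chunks.
splits = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Store: embed the chunks and index them in a vector store for later search.
vectorstore = FAISS.from_documents(splits, OpenAIEmbeddings())
```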
Retrieval and generation
LangSmith: https://smith.langchain.com/public/17ef6df2-b012-4f8e-b0a8-62894d82c097/r
Document: data/SPRI_AI_Brief_2023년12월호_F.pdf (page 14)
LangSmith: https://smith.langchain.com/public/2b2913c9-6b9c-4a19-bb16-dc2256e2fdbf/r
Document: data/SPRI_AI_Brief_2023년12월호_F.pdf (page 12)
LangSmith: https://smith.langchain.com/public/4449e744-f0a0-42d2-a3df-855bd7f41652/r
Document: data/SPRI_AI_Brief_2023년12월호_F.pdf (page 10)
RAG template experiment
The open-source leaderboard below tracks models whose performance improves day by day.
You can easily download and use open-source models uploaded to Hugging Face.
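As a sketch, such an open-source model could be loaded through LangChain's HuggingFacePipeline wrapper (the model id and generation settings below are assumptions, not this tutorial's choices):

```python
from langchain_community.llms import HuggingFacePipeline

# Downloads the model from the Hugging Face hub and runs it locally.
llm = HuggingFacePipeline.from_model_id(
    model_id="beomi/llama-2-ko-7b",  # hypothetical model choice
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 256},
)
```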
You can check the token usage in the following way:
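For example, with LangChain's get_openai_callback context manager (assumes `llm` is the ChatOpenAI instance created in the LLM step; the question is hypothetical):

```python
from langchain_community.callbacks import get_openai_callback

# Track token usage for all OpenAI calls made inside the context manager.
with get_openai_callback() as cb:
    response = llm.invoke("Summarize the December issue of the AI Brief.")
    print(cb.total_tokens)       # total tokens consumed
    print(cb.prompt_tokens)      # tokens in the prompt
    print(cb.completion_tokens)  # tokens in the completion
    print(cb.total_cost)         # estimated cost in USD
```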
Detailed pricing can be checked in the OpenAI API model list / pricing table.
gpt-3.5-turbo: OpenAI's GPT-3.5-turbo model
gpt-4-turbo-preview: OpenAI's GPT-4-turbo-preview model
Select one of the OpenAI models.
Step 7: Create a language model (Create LLM)
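A minimal sketch, assuming the gpt-3.5-turbo option from the list above:

```python
from langchain_openai import ChatOpenAI

# temperature=0 makes answers more deterministic; the model name is one of the options above.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
```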
Many verified prompts are uploaded to the LangSmith hub.
You can save money and time by using these verified prompts as they are or with slight modifications.
https://smith.langchain.com/hub/search?q=rag
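For example, the widely used rlm/rag-prompt can be pulled straight from the hub:

```python
from langchain import hub

# Pull a verified RAG prompt from the LangSmith hub.
prompt = hub.pull("rlm/rag-prompt")
```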
[TIP2]
If important information is missing from the results returned by the retriever, you need to modify the retriever's logic.
If the results from the retriever contain plenty of information but the llm fails to pick out the important parts, or does not output them in the form you want, you need to fix the prompt.
[TIP1]
Prompt engineering plays an important role in getting the results we want based on the given data (context).
Step 6: Create a prompt (Create Prompt)
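A minimal prompt sketch; the wording below is an illustrative assumption, not this tutorial's exact prompt:

```python
from langchain_core.prompts import ChatPromptTemplate

# A simple RAG prompt with placeholders for retrieved context and the user question.
prompt = ChatPromptTemplate.from_template(
    """Answer the question using only the following context.
If you don't know the answer, just say that you don't know.

Context: {context}

Question: {question}"""
)
```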
Ensemble Retriever
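A sketch of combining a keyword-based retriever with the vector-store retriever (the 50/50 weights are an assumption; `splits` and `vectorstore` come from the earlier steps):

```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever  # requires the rank_bm25 package

# Keyword-based retriever built from the split documents.
bm25_retriever = BM25Retriever.from_documents(splits)

# Vector-store retriever from the vector store created earlier.
faiss_retriever = vectorstore.as_retriever()

# Combine both retrievers; tune the weights per use case.
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever],
    weights=[0.5, 0.5],
)
```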
Generating multiple queries
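A sketch using MultiQueryRetriever, which has the LLM rephrase the user question into several variants and merges the retrieved results (assumes the `llm` and `vectorstore` created earlier):

```python
from langchain.retrievers.multi_query import MultiQueryRetriever

# The LLM generates several rewordings of the question, retrieves documents
# for each, and returns the unique union of the results.
multiquery_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(),
    llm=llm,
)
```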
Search using maximum marginal relevance (MMR).
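For example (the k and fetch_k values are assumptions):

```python
# MMR balances relevance to the query with diversity among the returned chunks.
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20},
)
```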
With similarity_score_threshold, a similarity-based search returns only the results whose score is at or above score_threshold.
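For example (the 0.8 threshold is an assumption):

```python
# Only documents scoring at or above the threshold are returned.
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"score_threshold": 0.8},
)
```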
By default, cosine similarity-based search (similarity) is applied.
Similarity-based search
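A minimal sketch (the query is hypothetical; k controls how many documents are returned, with 4 as the default):

```python
# Default similarity search over the vector store.
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4})
docs = retriever.invoke("What was announced about generative AI?")
```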
Call as_retriever() on the VectorStore created above to generate a Retriever.
A retriever does not need to store documents itself; it only returns (retrieves) them.
A retriever is an interface that returns documents given an unstructured query.
Step 5: Create a retriever (Create Retriever)
Step 4: Create a vector store (Create Vectorstore)
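A minimal sketch using FAISS (other vector stores work the same way; `splits` comes from Step 2):

```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Embed the split documents and index them for similarity search.
vectorstore = FAISS.from_documents(documents=splits, embedding=OpenAIEmbeddings())
```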
Free open-source-based embedding
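A sketch using a locally run Hugging Face model (the model name is an assumption):

```python
from langchain_community.embeddings import HuggingFaceEmbeddings

# A free embedding model that runs locally, with no per-token cost.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
```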
MODEL                     ROUGH PAGES PER DOLLAR    EXAMPLE PERFORMANCE ON MTEB EVAL
text-embedding-3-small    62,500                    62.3%
text-embedding-3-large    9,615                     64.6%
text-embedding-ada-002    12,500                    61.0%
The default value is text-embedding-ada-002.
The following is the list of embedding models supported by OpenAI.
Paid embedding (OpenAI)
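A minimal sketch:

```python
from langchain_openai import OpenAIEmbeddings

# Uses text-embedding-ada-002 unless another model is specified.
embeddings = OpenAIEmbeddings()
# embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # newer, cheaper option
```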
Note: https://python.langchain.com/docs/integrations/text_embedding
Step 3: Embedding
It works by splitting the text into sentences at a high level, grouping them into sets of three sentences, and then merging sentences that are similar in the embedding space.
source: Greg Kamradt's Notebook
Split text based on semantic similarity.
Semantic Similarity
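A sketch using the experimental SemanticChunker (assumes the text has already been loaded into state_of_the_union):

```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Split where the embedding distance between adjacent sentence groups spikes.
text_splitter = SemanticChunker(OpenAIEmbeddings())
docs = text_splitter.create_documents([state_of_the_union])
```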
It tries the separators in the specified list in order and splits the given document.
It keeps splitting in that order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""].
This generally has the effect of keeping all paragraphs (then sentences, then words) together as long as possible, since these appear to be the most strongly semantically related pieces of text.
The RecursiveCharacterTextSplitter class provides the ability to split text recursively. It takes chunk_size, the size of the chunks to split into; chunk_overlap, the size of the overlap between adjacent chunks; length_function, a function that calculates the length of a chunk; and is_separator_regex, a parameter that specifies whether the separator is a regular expression. In the example, the chunk size is set to 100, the overlap size to 20, the length function to len, and is_separator_regex to False to indicate that the separator is not a regular expression.
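A sketch of the example as described (assumes the text has already been loaded into state_of_the_union):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,            # maximum size of each chunk
    chunk_overlap=20,          # overlap between adjacent chunks
    length_function=len,       # measure chunk length by number of characters
    is_separator_regex=False,  # treat separators as plain strings, not regex
)
texts = text_splitter.create_documents([state_of_the_union])
```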
How the text is split: by a list of separators.
How the chunk size is measured: by the number of characters (len).
This text splitter is the recommended splitter for generic text.
RecursiveCharacterTextSplitter
This code uses the text_splitter object's create_documents method to split the given text (state_of_the_union) into several documents, saves the result in the texts variable, and then outputs the first document in texts. This process can be viewed as an early step in processing and analyzing text data, and is especially useful for dividing large text data into manageable units.
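A sketch of the code described above, assuming state_of_the_union holds the loaded text (the chunk_size and chunk_overlap values are illustrative assumptions):

```python
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",          # split on double newlines
    chunk_size=1000,           # maximum chunk length
    chunk_overlap=100,         # overlap between adjacent chunks
    length_function=len,       # measure length by character count
    is_separator_regex=False,  # separator is a plain string
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])  # output the first document
```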
The separator parameter specifies the string used to split chunks, here two newline characters ("\n\n"). chunk_size determines the maximum length of each chunk, and chunk_overlap specifies the number of characters that overlap between adjacent chunks. length_function determines the function used to calculate the length of a chunk; by default the len function, which returns the length of a string, is used. is_separator_regex is a boolean value that determines whether separator is interpreted as a regular expression.
The CharacterTextSplitter class provides the ability to split text into chunks of a certain size.
Visualization example: https://chunkviz.up.railway.app/
How the text is split: by a single character.
How the chunk size is measured: by the number of characters (len).
This is the simplest method. It splits on a single character (the default is "\n\n") and measures chunk length by the number of characters.
CharacterTextSplitter
Step 2: Split Documents
The following is an example of loading .py files.
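A sketch (the directory path is an assumption):

```python
from langchain_community.document_loaders import DirectoryLoader, PythonLoader

# Load every .py file under the current directory.
loader = DirectoryLoader(".", glob="**/*.py", loader_cls=PythonLoader)
docs = loader.load()
```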
Next is an example of loading all the .pdf files in a folder.
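A sketch using DirectoryLoader with PyPDFLoader (the data path is an assumption):

```python
from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader

# Load every .pdf file in the data folder, one Document per page.
loader = DirectoryLoader("data", glob="*.pdf", loader_cls=PyPDFLoader)
docs = loader.load()
```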
Below is an example of loading all the .txt files in a folder.
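A sketch (the data path is an assumption):

```python
from langchain_community.document_loaders import DirectoryLoader, TextLoader

# Load every .txt file in the data folder.
loader = DirectoryLoader("data", glob="*.txt", loader_cls=TextLoader)
docs = loader.load()
```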
Load all files within a folder
TXT file
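A sketch (the file path and encoding are assumptions):

```python
from langchain_community.document_loaders import TextLoader

# Load a single text file into one Document.
loader = TextLoader("data/sample.txt", encoding="utf-8")
docs = loader.load()
```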
For CSV files, data is referenced by row number instead of page number.
CSV
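A sketch (the file path is hypothetical):

```python
from langchain_community.document_loaders.csv_loader import CSVLoader

# Each row becomes one Document; the metadata records the row number.
loader = CSVLoader(file_path="data/sample.csv")
docs = loader.load()
```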
Below, a BBC news article is used. If you want to try an article written in English, uncomment the relevant code and run it.
(Example)
bs4.SoupStrainer conveniently lets you pick out only the elements you want from a web page.
[Note]
WebBaseLoader uses bs4.SoupStrainer to parse only what is needed from the specified web page.
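A sketch (the URL and CSS class names are assumptions):

```python
import bs4
from langchain_community.document_loaders import WebBaseLoader

# Keep only the selected elements of the page while parsing.
loader = WebBaseLoader(
    web_paths=("https://example.com/news/article-1",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(class_=("article-title", "article-body")),
    ),
)
docs = loader.load()
```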
Web page
Step 1: Load Documents
Here you can set various options for each step or apply new techniques.
Below is an example using the basic RAG model covered earlier.
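A hedged end-to-end sketch of such a basic RAG pipeline, combining Steps 1 through 7 (the chunk sizes, model choice, and question are illustrative; the prompt is the public rlm/rag-prompt from the hub):

```python
from langchain import hub
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Step 1: load the practice PDF.
docs = PyPDFLoader("data/SPRI_AI_Brief_2023년12월호_F.pdf").load()

# Step 2: split into chunks.
splits = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Steps 3-4: embed the chunks and store them in a vector store.
vectorstore = FAISS.from_documents(splits, OpenAIEmbeddings())

# Step 5: create a retriever over the vector store.
retriever = vectorstore.as_retriever()

# Steps 6-7: pull a verified prompt from the hub and create the LLM.
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Compose the chain: retrieve context, fill the prompt, generate, parse to string.
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What does the report say about generative AI?")
```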
Taking a closer look at each module
LangSmith is not required, but it is useful. If you want to use LangSmith, sign up via the link above and then set the environment variables to start logging traces.
Applications built with LangChain make multiple LLM calls across multiple stages. As these applications become more and more complex, the ability to inspect exactly what is happening inside a chain or agent becomes very important. The best way to do this is to use LangSmith.
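A sketch of the environment variables that enable tracing (the key value is a placeholder):

```python
import os

# Enable LangSmith tracing for all LangChain runs in this process.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
```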
Set the API key.
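For example, keys can be loaded from a .env file using the python-dotenv package:

```python
from dotenv import load_dotenv

# Reads OPENAI_API_KEY (and other keys) from a .env file into the environment.
load_dotenv()
```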
Environment setup
Copy the downloaded practice files into the data folder.
Authors: Jaeheung Lee (AI Policy Research Lab, Principal Researcher), Ji-soo Lee (AI Policy Research Lab, Senior Researcher)
Link: https://spri.kr/posts/view/23669
File name: SPRI_AI_Brief_2023년12월호_F.pdf
Software Policy & Research Institute (SPRi), December 2023
Document used for practice
Retrieve: Given a user input, use a Retriever to search for the relevant splits in the store.
Generate: A ChatModel / LLM generates the answer using a prompt that includes both the question and the retrieved data.