03. RAG Modules by Function

1. Question Processing

Question processing receives the user's question, processes it, and finds the relevant data. The following components are required for this:

  • Data source connection: To answer questions, you need to connect to various text data sources. LangChain makes it easy to establish these connections.

  • Data indexing and retrieval: To find relevant information effectively, the data source must be indexed. LangChain automates the indexing process and provides the tools needed to retrieve data related to the question.

2. Answer Generation

After finding the relevant data, you generate an answer to the user's question based on it. The following component matters at this stage:

  • Answer generation model: LangChain provides the ability to generate answers from retrieved data using advanced natural language processing (NLP) models. These models take the user's question and the retrieved data as input and generate an appropriate answer.

Architecture

As outlined in the Q&A introduction, we will build a typical RAG application. It has two main components:

  • Indexing: a pipeline that collects data from sources and indexes it. This usually happens offline.

  • Retrieval and generation: the actual RAG chain, which takes the user's query at run time, retrieves the relevant data from the index, and passes it to the model.

The full sequence from raw data to answer is as follows.

Indexing

  1. Load: First, load the data. We will use DocumentLoaders for this.

  2. Split: Text splitters break large Documents into smaller chunks. This is useful both for indexing the data and for passing it to the model, since large chunks are harder to search over and do not fit in the model's finite context window.

  3. Store: We need somewhere to store and index the splits so they can be searched later. This is often done using a VectorStore and an Embeddings model.
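The three indexing steps can be sketched end to end in plain Python. Everything here (the loader, the fixed-size splitter, the hash-based embedding, and the in-memory store) is a simplified stand-in for the real LangChain components, purely for illustration:

```python
import hashlib

def load(path_to_text: dict) -> list[str]:
    # Toy "DocumentLoader": each value is one document's raw text.
    return list(path_to_text.values())

def split(docs: list[str], chunk_size: int = 40) -> list[str]:
    # Toy "text splitter": fixed-size character chunks.
    chunks = []
    for doc in docs:
        chunks += [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
    return chunks

def embed(text: str) -> list[float]:
    # Toy deterministic "embedding": normalized bytes of an MD5 digest.
    digest = hashlib.md5(text.encode()).digest()
    return [b / 255 for b in digest]

def store(chunks: list[str]) -> list[tuple[list[float], str]]:
    # Toy in-memory "vector store": (vector, chunk) pairs.
    return [(embed(c), c) for c in chunks]

docs = load({"a.txt": "LangChain helps build RAG pipelines. " * 3})
index = store(split(docs))
```

A real pipeline would swap each stand-in for its LangChain counterpart, but the data flow — load, split, embed, store — is the same.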

Retrieval and generation

  • LangSmith: https://smith.langchain.com/public/17ef6df2-b012-4f8e-b0a8-62894d82c097/r

Documentation: data/SPRI_AI_Brief_2023 December _F.pdf (page 14)

  • LangSmith: https://smith.langchain.com/public/2b2913c9-6b9c-4a19-bb16-dc2256e2fdbf/r

Documentation: data/SPRI_AI_Brief_2023 December _F.pdf (page 12)

  • LangSmith: https://smith.langchain.com/public/4449e744-f0a0-42d2-a3df-855bd7f41652/r

Documentation: data/SPRI_AI_Brief_2023 December _F.pdf (page 10)

RAG template experiment

Below is an open-source model leaderboard whose entries improve in performance day by day.

You can easily download and use the open-source models uploaded to Hugging Face.

You can check the token usage in the following way:

Detailed pricing is available in the OpenAI API model list / pricing table.
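For a rough sense of how token counts translate into cost, here is a small estimator. The per-1K-token prices below are placeholder assumptions for the sketch, not current rates — always check OpenAI's pricing table before relying on them:

```python
# Hypothetical per-1K-token prices in USD (illustrative only; real rates
# change, so verify against the OpenAI pricing page).
PRICES = {
    "gpt-3.5-turbo": {"prompt": 0.0005, "completion": 0.0015},
    "gpt-4-turbo-preview": {"prompt": 0.01, "completion": 0.03},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    # Cost = (prompt tokens / 1000) * prompt price + (completion tokens / 1000) * completion price.
    p = PRICES[model]
    return (prompt_tokens / 1000) * p["prompt"] + (completion_tokens / 1000) * p["completion"]

cost = estimate_cost("gpt-3.5-turbo", 1200, 300)
```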

  • gpt-3.5-turbo : OpenAI's GPT-3.5-turbo model

  • gpt-4-turbo-preview : OpenAI's GPT-4-turbo-preview model

Select one of the OpenAI models.

Step 7: Create a language model (Create LLM)

  1. Many verified prompts are uploaded to the LangSmith hub.

  2. You can save money and time by using these verified prompts as-is or with slight modifications.

  3. https://smith.langchain.com/hub/search?q=rag

[TIP2]

  1. If important information is missing from the retriever's results, you need to modify the retriever's logic.

  2. If the retriever's results contain plenty of information but the LLM fails to pick out the important parts, or does not output them in the form you want, you need to fix the prompt.

[TIP1]

Prompt engineering plays an important role in getting the results we want based on the given data (context).
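A typical RAG prompt interpolates the retrieved context and the user's question. The template below is a generic sketch of that pattern, not a specific prompt from the LangSmith hub:

```python
RAG_PROMPT = """You are an assistant for question-answering tasks.
Use the following retrieved context to answer the question.
If you don't know the answer, say that you don't know.

Context:
{context}

Question: {question}

Answer:"""

def build_prompt(context_chunks: list[str], question: str) -> str:
    # Join the retrieved chunks and fill both template slots.
    return RAG_PROMPT.format(context="\n\n".join(context_chunks), question=question)

prompt = build_prompt(["Chunk one.", "Chunk two."], "What is RAG?")
```

Changing only this template (for example, adding answer-format instructions) is often enough to fix cases where the LLM has the right context but answers poorly.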

Step 6: Create a prompt (Create Prompt)

Ensemble Retriever
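An ensemble retriever combines the ranked result lists of several retrievers (for example, a keyword retriever and a dense-vector retriever), commonly via reciprocal rank fusion (RRF). A minimal sketch with invented rankings:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Score each document by the sum of 1 / (k + rank) over every ranking
    # that contains it, then sort by descending fused score.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_b", "doc_a", "doc_d"]  # e.g. BM25 results (made up)
vector_hits = ["doc_a", "doc_c", "doc_b"]   # e.g. dense results (made up)
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that appear high in multiple rankings (here doc_a) rise to the top of the fused list.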

Create various queries

Search using maximal marginal relevance (MMR).
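MMR trades off relevance to the query against redundancy among already-selected results. A toy implementation over precomputed similarity scores (all documents and numbers here are invented):

```python
def mmr(query_sim: dict[str, float],
        doc_sim: dict[tuple[str, str], float],
        k: int = 2, lam: float = 0.5) -> list[str]:
    # Greedily pick documents maximizing:
    #   lam * sim(query, d) - (1 - lam) * max sim(d, already selected)
    selected: list[str] = []
    candidates = set(query_sim)
    while candidates and len(selected) < k:
        def score(d):
            redundancy = max((doc_sim.get((d, s), doc_sim.get((s, d), 0.0))
                              for s in selected), default=0.0)
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query_sim = {"d1": 0.9, "d2": 0.85, "d3": 0.4}               # query relevance
doc_sim = {("d1", "d2"): 0.95, ("d1", "d3"): 0.1, ("d2", "d3"): 0.1}
picked = mmr(query_sim, doc_sim)
```

Note that plain top-k similarity would return d1 and d2, which are near-duplicates; MMR prefers the less redundant d3.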

similarity_score_threshold performs a similarity-based search and returns only the results whose score is at or above score_threshold.
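The idea behind a score-threshold retriever is simple filtering; a sketch over made-up (document, score) pairs:

```python
def threshold_search(scored_docs: list[tuple[str, float]],
                     score_threshold: float) -> list[str]:
    # Keep only documents scoring at or above the threshold,
    # ordered by descending score.
    kept = [(doc, s) for doc, s in scored_docs if s >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept]

results = threshold_search([("a", 0.92), ("b", 0.55), ("c", 0.80)], 0.8)
```

Unlike top-k search, the number of results is not fixed: it shrinks to zero when nothing is similar enough.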

  • By default, cosine similarity is applied.
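Cosine similarity is the dot product of the two vectors divided by the product of their norms; a minimal implementation:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim = cosine_similarity([1.0, 0.0], [1.0, 1.0])
```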

Similarity-based search

Call as_retriever() on the created VectorStore to generate a Retriever.

A retriever does not need to store documents; it only returns (retrieves) them.

A retriever is an interface that returns documents given an unstructured query.

Step 5: Create a retriever (Create Retriever)

Step 4: Create a vector store (Create Vectorstore)

Free open-source embeddings

MODEL | ROUGH PAGES PER DOLLAR | EXAMPLE PERFORMANCE ON MTEB EVAL
text-embedding-3-small | 62,500 | 62.3%
text-embedding-3-large | 9,615 | 64.6%
text-embedding-ada-002 | 12,500 | 61.0%

  • The default is text-embedding-ada-002.

The following is the list of embedding models supported by OpenAI.

Paid embeddings (OpenAI)

Note: https://python.langchain.com/docs/integrations/text_embedding

Step 3: Embedding

It splits the text at a high level into sentences, groups them into sets of three, and then merges groups that are similar in embedding space.
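The grouping-and-merging idea can be sketched without a real embedding model. Here a toy word-overlap (Jaccard) score stands in for embedding similarity, and sentences are merged while adjacent ones stay similar:

```python
def overlap_sim(a: str, b: str) -> float:
    # Toy stand-in for embedding similarity: Jaccard overlap of word sets.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    # Merge each sentence into the current chunk while its similarity to the
    # previous sentence stays above the threshold; otherwise start a new chunk.
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if overlap_sim(prev, sent) >= threshold:
            current.append(sent)
        else:
            chunks.append(" ".join(current))
            current = [sent]
    chunks.append(" ".join(current))
    return chunks

chunks = semantic_chunks([
    "Cats are small animals.",
    "Cats are popular pets.",
    "Stock markets fell today.",
])
```

A real semantic splitter replaces overlap_sim with distances between sentence embeddings, but the boundary-detection logic is the same.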

Source: Greg Kamradt's Notebook

Split text based on semantic similarity.

Semantic Similarity

  • It tries the specified list of separators in sequence and splits the given document accordingly.

  • It tries splitting in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""].

  • This generally has the effect of keeping paragraphs (then sentences, then words) together as long as possible, since these tend to be the most semantically related pieces of text.

The RecursiveCharacterTextSplitter class provides the ability to split text recursively. It takes chunk_size, the size of the chunks to split into; chunk_overlap, the amount of overlap between adjacent chunks; length_function, a function that calculates a chunk's length; and is_separator_regex, which specifies whether the separators are regular expressions. In the example, the chunk size is set to 100, the overlap to 20, the length function to len, and is_separator_regex to False to indicate that the separators are not regular expressions.
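The recursive strategy can be sketched in plain Python. This is a simplified stand-in for LangChain's RecursiveCharacterTextSplitter (no chunk_overlap handling) that shows the core idea: try coarse separators first, recurse with finer ones only when a piece is still too large:

```python
def recursive_split(text, chunk_size=100, separators=("\n\n", "\n", " ", "")):
    # Text that already fits becomes a single chunk.
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    # Pick the first separator actually present in the text.
    sep = next((s for s in separators if s and s in text), "")
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging while it still fits
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # The piece alone is too big: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, separators))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks

chunks = recursive_split("para one\n\n" + "word " * 40, chunk_size=50)
```

Whole paragraphs survive when they fit; only oversized pieces get broken down further.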

  1. How the text is split: by a list of separators

  2. How the chunk size is measured: by the number of characters (len)

This is the recommended text splitter for generic plain text.

RecursiveCharacterTextSplitter

This code uses the text_splitter object's create_documents method to split the given text (state_of_the_union) into several documents, saves the result in the texts variable, and then prints the first document. This is an early step in processing and analyzing text data, and is especially useful for dividing large texts into manageable units.

  • The separator parameter specifies the string used to separate chunks, here two newline characters ("\n\n").

  • chunk_size determines the maximum length of each chunk.

  • chunk_overlap specifies the number of characters that overlap between adjacent chunks.

  • length_function determines the function used to calculate a chunk's length; by default the len function, which returns the length of the string, is used.

  • is_separator_regex is a boolean value that determines whether separator is interpreted as a regular expression.
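The parameters above can be illustrated with a simplified character splitter. This is a toy stand-in for CharacterTextSplitter that uses a fixed-size sliding window rather than separator-aware merging, just to show what chunk_size and chunk_overlap do:

```python
def char_split(text: str, chunk_size: int = 10, chunk_overlap: int = 3,
               length_function=len) -> list[str]:
    # Slide a window of chunk_size over the text, stepping back
    # chunk_overlap characters each time so adjacent chunks share context.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, length_function(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= length_function(text):
            break
    return chunks

chunks = char_split("abcdefghijklmnop", chunk_size=10, chunk_overlap=3)
```

The last 3 characters of each chunk reappear at the start of the next, which helps retrieval when an answer straddles a chunk boundary.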

The CharacterTextSplitter class provides the ability to split text into chunks of a certain size.

Visualization example: https://chunkviz.up.railway.app/

  1. How the text is split: by a single character

  2. How the chunk size is measured: by the number of characters (len)

This is the simplest method. It splits on a single character (the default is "\n\n") and measures chunk length by the number of characters.

CharacterTextSplitter

Step 2: Split Documents


Next is an example of loading a .py file.

Python

Next is an example of loading all .pdf files in a folder.

Below is an example of loading all .txt files in a folder.

Load all files within a folder

TXT file

CSV data is referenced by row number instead of page number.

CSV

PDF

Below is a BBC news article. If you want to try an article written in English, uncomment the code below and run it.

(Example)

  • bs4.SoupStrainer conveniently lets you extract only the elements you want from a web page.

[Note]

WebBaseLoader uses bs4.SoupStrainer to parse only what is needed from the specified web page.

Web page

Step 1: Load Documents

Here you can set various options for each step or apply new techniques.

Below is an example using the basic RAG model covered earlier.

Take a closer look at each module

LangSmith is not required, but it is useful. If you want to use LangSmith, sign up via the link above and set an environment variable to start logging traces.

Applications built with LangChain make multiple LLM calls across multiple stages. As these applications become more complex, the ability to investigate exactly what is happening inside a chain or agent becomes very important. The best way to do this is to use LangSmith.

Set the API key.

Environment setup

Please copy the files downloaded for practice into the data folder.

  • Authors: Jaeheung Lee and Ji-soo Lee (AI Policy Research Lab)

  • Link: https://spri.kr/posts/view/23669

  • File name: SPRI_AI_Brief_December 23 issue_F.pdf

Software Policy & Research Institute (SPRi), December 2023

Documents used for practice

  1. Retrieve: Given the user input, use the Retriever to search for the relevant splits in storage.

  2. Generate: A ChatModel / LLM generates the answer using a prompt that includes the question and the retrieved data.
