01. PDF document-based QA (Question-Answering)
Understanding the basic RAG structure
1. Pre-processing - steps 1~4
In the pre-processing phase, we go through four steps that load, split, embed, and store documents from the data source into the vector DB (storage).
Step 1, Document Load: load the content of the document.
Step 2, Text Split: split the document into chunks according to specific criteria.
Step 3, Embedding: embed the split chunks and prepare them for storage.
Step 4, Save to VectorDB: store the embedded chunks in the DB.
(A minimal sketch of these four steps follows below.)
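The sketch below is a minimal, illustrative version of these four steps; it assumes the practice PDF sits in a data folder and uses an OpenAI embedding model with a FAISS vector store, but the loader, chunk sizes, and vector store are example choices rather than requirements.

# Steps 1~4: Load -> Split -> Embed -> Store (illustrative sketch)
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Step 1: Document Load -- read the PDF into Document objects
loader = PyMuPDFLoader("data/SPRI_AI_Brief_2023년12월호_F.pdf")
docs = loader.load()

# Step 2: Text Split -- cut the documents into overlapping chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
split_documents = text_splitter.split_documents(docs)

# Step 3: Embedding -- the model used to turn each chunk into a vector
embeddings = OpenAIEmbeddings()

# Step 4: Save to VectorDB -- embed the chunks and store them in a FAISS index
vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)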
2. RAG execution (Runtime) - steps 5~8
# Run the chain (Run Chain)
# Enter a query about the document and print the answer.
question = "What is the name of the AI developed by Samsung Electronics?"
response = chain.invoke(question)
print(response)

Output:
The name of the AI developed by Samsung Electronics is 'Samsung Gauss'.
Full code
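A minimal end-to-end sketch consolidating steps 1~8 and running the chain once; it assumes the practice PDF is in a data folder, that OPENAI_API_KEY is set in the environment, and uses OpenAI models with a FAISS vector store; all of these are illustrative choices rather than the only way to build the pipeline.

# Full RAG pipeline sketch: steps 1~8 plus a single chain run
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Steps 1~4: load, split, embed, store
docs = PyMuPDFLoader("data/SPRI_AI_Brief_2023년12월호_F.pdf").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
split_documents = splitter.split_documents(docs)
vectorstore = FAISS.from_documents(split_documents, embedding=OpenAIEmbeddings())

# Steps 5~8: retriever, prompt, LLM, chain
retriever = vectorstore.as_retriever()
prompt = PromptTemplate.from_template(
    "Answer the question using only the following context.\n\n"
    "#Context:\n{context}\n\n#Question:\n{question}\n\n#Answer:"
)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Run the chain with a query about the document
print(chain.invoke("What is the name of the AI developed by Samsung Electronics?"))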
Enter and run a query (question) against the chain you created, as shown in the Run Chain example above.
(You can set various options for each step or apply new techniques.)
You can find the structure that best fits your documents by changing the contents of each step's module to suit your situation.
Below is skeleton code for understanding the basic RAG structure.
RAG basic pipeline (steps 1~8)
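A comment-only outline of the eight steps; the angle-bracket placeholders stand for your own choice of loader, splitter, embedding model, vector store, prompt, and LLM, and the Full code section above shows one concrete way to fill them in.

# RAG basic pipeline skeleton (steps 1~8)
# Step 1: Document Load  -> loader = <document loader>; docs = loader.load()
# Step 2: Text Split     -> text_splitter = <splitter>; split_documents = text_splitter.split_documents(docs)
# Step 3: Embedding      -> embeddings = <embedding model>
# Step 4: Save VectorDB  -> vectorstore = <vector store>.from_documents(split_documents, embeddings)
# Step 5: Retriever      -> retriever = vectorstore.as_retriever()
# Step 6: Prompt         -> prompt = <prompt template with {context} and {question}>
# Step 7: LLM            -> llm = <chat model: GPT-3.5, GPT-4, Claude, ...>
# Step 8: Chain          -> chain = {"context": retriever, "question": RunnablePassthrough()} | prompt | llm | <output parser>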
Applications built with LangChain make multiple LLM calls across multiple stages. As these applications become more and more complex, the ability to investigate exactly what is happening inside a chain or agent becomes very important, and the best way to do this is to use LangSmith.
LangSmith is not required, but it is useful. If you want to use LangSmith, sign up for it and then set environment variables to enable trace logging.
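A sketch of the environment variables that LangSmith tracing reads; the API key and project name are placeholders to replace with your own values.

import os

# Enable LangSmith trace logging for this process (placeholder values)
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"
os.environ["LANGCHAIN_PROJECT"] = "<your-project-name>"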
Set API KEY.
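One common way to do this, assuming the python-dotenv package is installed, is to keep the key in a .env file and load it at the start of the notebook; any other way of setting the OPENAI_API_KEY environment variable works just as well.

# Load OPENAI_API_KEY (and any other keys) from a .env file in the project root
from dotenv import load_dotenv

load_dotenv()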
Preferences
Please copy the files downloaded for practice into the data folder.
Authors: Jaeheung Lee (AI Policy Research Lab), Ji-soo Lee (AI Policy Research Lab)
File name:
SPRI_AI_Brief_2023년12월호_F.pdf
Software Policy & Research Institute (SPRi), December 2023 issue
Document used for practice
Step 5, Retriever: define a retriever that searches the DB based on the query and returns the results. Retrievers are divided by search algorithm into dense retrievers (similarity-based search) and sparse retrievers (keyword-based search).
Step 6, Prompt: create the prompt used to perform RAG. The content retrieved from the document is inserted into the prompt's context, and you can shape the format of the answer through prompt engineering.
Step 7, LLM: define the model (GPT-3.5, GPT-4, Claude, etc.).
Step 8, Chain: create a chain that connects the prompt, the LLM, and the output. (A minimal sketch of steps 5~8 follows below.)
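A minimal sketch of steps 5~8, continuing from the pre-processing sketch earlier on this page (it assumes the vectorstore object built in steps 1~4); the prompt wording and the gpt-3.5-turbo model are illustrative choices.

# Steps 5~8: Retriever -> Prompt -> LLM -> Chain (illustrative sketch)
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# Step 5: Retriever -- dense (similarity-based) search over the vector store built in steps 1~4
retriever = vectorstore.as_retriever()

# Step 6: Prompt -- the retrieved chunks are injected into {context}
# (the retrieved Document list is stringified directly into the prompt for simplicity)
prompt = PromptTemplate.from_template(
    "Answer the question using only the following context.\n\n"
    "#Context:\n{context}\n\n#Question:\n{question}\n\n#Answer:"
)

# Step 7: LLM -- define the model (GPT-3.5, GPT-4, Claude, etc.)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Step 8: Chain -- connect retriever, prompt, LLM, and output parser
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)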