02. Naver News Article QA (Question-Answering)

Understanding the Basic Structure of RAG

1. Pre-processing (Steps 1-4)

In the pre-processing phase, we go through four steps that load, split, embed, and store the documents from the data source into a vector DB (storage); a short sketch follows the list below.

  • Step 1 (Document Load): load the content of the documents.

  • Step 2 (Text Split): split the documents into chunks according to specific criteria.

  • Step 3 (Embedding): embed the split chunks.

  • Step 4 (Store in Vector DB): save the embedded chunks to the vector DB.
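As a reference, here is a minimal sketch of the four pre-processing steps using LangChain. The file path, chunk sizes, and the choice of FAISS are placeholders; the Naver news tutorial below uses a web loader and Chroma instead.

```python
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# Step 1: Document Load - read the raw documents (the path is a placeholder).
docs = TextLoader("data/sample.txt").load()

# Step 2: Text Split - cut the documents into overlapping chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
splits = splitter.split_documents(docs)

# Step 3: Embedding + Step 4: Store - embed the chunks and save them to a vector DB.
vectorstore = FAISS.from_documents(splits, embedding=OpenAIEmbeddings())
```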

2. RAG Execution (Runtime, Steps 5-8)

  • Step 5 (Retriever): define a retriever that searches the DB based on the query and returns results. Retrievers are divided by search algorithm into dense and sparse retrievers.

  • Dense: similarity-based search (FAISS, DPR)

  • Sparse: keyword-based search (BM25, TF-IDF)

  • Step 6 (Prompt): create the prompt used to perform RAG. The content retrieved from the documents is inserted into the prompt's context, and you can shape the format of the answer through prompt engineering.

  • Step 7 (LLM): define the model (GPT-3.5, GPT-4, Claude, etc.).

  • Step 8 (Chain): create a chain that connects the prompt, the LLM, and the output parsing (a sketch follows this list).
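Continuing the sketch above, the runtime steps might look like the following. The prompt text, the example question, and the model name are placeholders, and a sparse retriever such as BM25Retriever could be swapped in for the dense one.

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

# Step 5: Retriever - dense similarity search over the vector store built above.
retriever = vectorstore.as_retriever()

# Step 6: Prompt - the retrieved context is injected into the prompt.
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the following context.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)

# Step 7: LLM - the model name is an example (GPT-4, Claude, etc. also work).
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Step 8: Chain - retriever -> prompt -> LLM -> output parser.
rag_chain = (
    {
        "context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

print(rag_chain.invoke("What is this document about?"))
```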

Environment Setup

Set the API key.

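For example, a minimal sketch that reads OPENAI_API_KEY from a .env file (this assumes the python-dotenv package; you can also export the variable in your shell or set os.environ directly):

```python
import os
from dotenv import load_dotenv

# Load OPENAI_API_KEY (and any other keys) from a .env file in the project root.
load_dotenv()

# Alternatively, set the key directly; the value below is a placeholder.
# os.environ["OPENAI_API_KEY"] = "sk-..."
```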

Applications built with LangChain often make multiple LLM calls across multiple stages. As these applications become more complex, the ability to inspect exactly what is happening inside a chain or agent becomes very important. The best way to do this is to use LangSmith.

LangSmith is not required, but it is useful. If you want to use LangSmith, sign up via the link above, then set the following environment variables to start logging traces.

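A sketch of the environment variables LangSmith uses for tracing; the API key and project name are placeholders to replace with your own values.

```python
import os

# Turn on LangSmith tracing for every chain run in this process.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.smith.langchain.com"
os.environ["LANGCHAIN_API_KEY"] = "<your-langsmith-api-key>"  # placeholder
os.environ["LANGCHAIN_PROJECT"] = "Naver-News-QA"             # placeholder project name
```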

Naver News-based QA (Question-Answering) chatbot

In this tutorial, we will build a news article QA app that can answer questions about the content of a Naver news article.

In this guide, we will use an OpenAI chat model, OpenAI embeddings, and a Chroma vector store.

First, we implement a simple indexing pipeline and RAG chain in about 20 lines of code, using the following libraries.

  • bs4 is a library for parsing web pages.

  • langchain is a library that provides a wide range of AI-related features; here we use its text splitting (RecursiveCharacterTextSplitter), document loading (WebBaseLoader), vector stores (Chroma, FAISS), output parsing (StrOutputParser), runnable pass-through (RunnablePassthrough), and so on.

  • langchain_openai provides access to OpenAI's chat model (ChatOpenAI) and embeddings (OpenAIEmbeddings).

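A sketch of the imports used below; exact module paths can vary slightly between LangChain versions.

```python
import bs4
from langchain import hub
from langchain_community.document_loaders import WebBaseLoader
from langchain_community.vectorstores import Chroma, FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
```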

After loading the contents of the web page, splitting the text into chunks, and indexing them, we implement the process of generating new content by retrieving the relevant text snippets.

WebBaseLoader uses bs4.SoupStrainer to parse only the parts it needs from the specified web page.

[Note]

  • bs4.SoupStrainer conveniently lets you extract only the elements you want from a web page.

(Example)

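A minimal sketch of loading one article. The URL is an example, and the CSS class names are assumptions about Naver's news markup; inspect the article page and substitute the classes that hold the headline and body.

```python
# Example article URL and CSS classes - replace them with your own target article.
url = "https://n.news.naver.com/article/437/0000378416"

loader = WebBaseLoader(
    web_paths=(url,),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            "div",
            attrs={"class": ["newsct_article _article_body", "media_end_head_title"]},
        )
    ),
)

docs = loader.load()
print(f"Number of documents: {len(docs)}")
```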

RecursiveCharacterTextSplitter splits the document into chunks of the specified size.

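For example (the chunk size and overlap are typical starting values, not fixed requirements):

```python
# Split the article into chunks of about 1,000 characters with 100 characters of overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
splits = text_splitter.split_documents(docs)
print(f"Number of chunks: {len(splits)}")
```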

A vector store such as FAISS or Chroma creates vector representations of the documents based on these chunks.

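For example, indexing the chunks with OpenAI embeddings into a Chroma store (FAISS.from_documents works the same way):

```python
# Embed each chunk and index it in a Chroma vector store.
vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())
```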

The retriever created with vectorstore.as_retriever() fetches the relevant chunks, and the ChatOpenAI model generates new content using the prompt pulled with hub.pull.

Finally, StrOutputParser parses the generated result into a string.


You can download the teddynote/rag-prompt-korean prompt from the hub and use it; in that case, writing a separate prompt is not needed.

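A sketch of the chain. It assumes the teddynote/rag-prompt-korean prompt takes context and question inputs, like the standard RAG prompts on the hub, and the model name is only an example.

```python
# Create a retriever that performs similarity search over the indexed article.
retriever = vectorstore.as_retriever()

# Pull a ready-made Korean RAG prompt from the hub (no separate prompt writing needed).
prompt = hub.pull("teddynote/rag-prompt-korean")

# The model name is an example; GPT-4, Claude, and others can be swapped in.
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


def format_docs(docs):
    # Join the retrieved chunks into a single context string for the prompt.
    return "\n\n".join(doc.page_content for doc in docs)


# Retriever -> prompt -> LLM -> string output parser.
rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
```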

For streaming output, use stream_response.

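For example (the question text is a placeholder; stream_response is the helper used in this tutorial series for printing streamed chunks, and the plain loop shown here does the same thing):

```python
# Ask a question about the article and stream the answer as it is generated.
answer = rag_chain.stream("Summarize the main points of this news article.")

# If stream_response is available, stream_response(answer) prints the chunks;
# the loop below is an equivalent manual version.
for chunk in answer:
    print(chunk, end="", flush=True)
```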

LangSmith Trace View

You can ask further questions in the same way; each execution of the chain is recorded and can be inspected in the LangSmith trace view.
