03. RAG Modules by Function

1. Question Processing

Question processing receives the user's question, processes it, and finds the relevant data. The following components are required for this:

  • Data source connection: To answer questions, you need to connect to various text data sources. LangChain makes it easy to establish these connections.

  • Data indexing and retrieval: To find relevant information effectively, the data source must be indexed. LangChain automates the indexing process and provides the tools needed to retrieve data related to the question.

2. Answer Generation

After finding the relevant data, you generate an answer to the user's question based on it. The following component matters at this stage:

  • Answer generation model: LangChain provides the ability to generate answers from retrieved data using advanced natural language processing (NLP) models. These models take the user's question and the retrieved data as input and generate an appropriate answer.

Architecture

As outlined in the Q&A introduction, we will build a typical RAG application. It has two main components:

  • Indexing: a pipeline that collects data from sources and indexes it. This usually happens offline.

  • Retrieval and generation: the actual RAG chain, which takes the user's query at run time, retrieves the relevant data from the index, and passes it to the model.

The full sequence from raw data to answer is as follows.

Indexing

  1. Load: First, load the data. We will use DocumentLoaders for this.

  2. Split: Text splitters break large Documents into smaller chunks. This is useful both for indexing the data and for passing it to the model, since large chunks are harder to search over and do not fit in the model's finite context window.

  3. Store: We need somewhere to store and index the splits so they can be searched later. This is often done using a VectorStore and an Embeddings model.
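The three indexing steps can be sketched end to end in plain Python. Everything here (the loader, the fixed-size splitter, the hash-based embedding, and the in-memory store) is a simplified stand-in for the real LangChain components, purely for illustration:

```python
import hashlib

def load(path_to_text: dict) -> list[str]:
    # Toy "DocumentLoader": each value is one document's raw text.
    return list(path_to_text.values())

def split(docs: list[str], chunk_size: int = 40) -> list[str]:
    # Toy "text splitter": fixed-size character chunks.
    chunks = []
    for doc in docs:
        chunks += [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
    return chunks

def embed(text: str) -> list[float]:
    # Toy deterministic "embedding": normalized bytes of an MD5 digest.
    digest = hashlib.md5(text.encode()).digest()
    return [b / 255 for b in digest]

def store(chunks: list[str]) -> list[tuple[list[float], str]]:
    # Toy in-memory "vector store": (vector, chunk) pairs.
    return [(embed(c), c) for c in chunks]

docs = load({"a.txt": "LangChain helps build RAG pipelines. " * 3})
index = store(split(docs))
```

A real pipeline would swap each stand-in for its LangChain counterpart, but the data flow — load, split, embed, store — is the same.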

Retrieval and generation

  • LangSmith: https://smith.langchain.com/public/17ef6df2-b012-4f8e-b0a8-62894d82c097/r

Documentation: data/SPRI_AI_Brief_2023 December _F.pdf (page 14)

  • LangSmith: https://smith.langchain.com/public/2b2913c9-6b9c-4a19-bb16-dc2256e2fdbf/r

Documentation: data/SPRI_AI_Brief_2023 December _F.pdf (page 12)

  • LangSmith: https://smith.langchain.com/public/4449e744-f0a0-42d2-a3df-855bd7f41652/r

Documentation: data/SPRI_AI_Brief_2023 December _F.pdf (page 10)

RAG template experiment

Below is an open-source model leaderboard whose entries improve in performance day by day.

You can easily download and use the open-source models uploaded to Hugging Face.

You can check the token usage in the following way:

Detailed pricing is available in the OpenAI API model list / pricing table.
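For a rough sense of how token counts translate into cost, here is a small estimator. The per-1K-token prices below are placeholder assumptions for the sketch, not current rates — always check OpenAI's pricing table before relying on them:

```python
# Hypothetical per-1K-token prices in USD (illustrative only; real rates
# change, so verify against the OpenAI pricing page).
PRICES = {
    "gpt-3.5-turbo": {"prompt": 0.0005, "completion": 0.0015},
    "gpt-4-turbo-preview": {"prompt": 0.01, "completion": 0.03},
}

def estimate_cost(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    # Cost = (prompt tokens / 1000) * prompt price + (completion tokens / 1000) * completion price.
    p = PRICES[model]
    return (prompt_tokens / 1000) * p["prompt"] + (completion_tokens / 1000) * p["completion"]

cost = estimate_cost("gpt-3.5-turbo", 1200, 300)
```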

  • gpt-3.5-turbo : OpenAI's GPT-3.5-turbo model

  • gpt-4-turbo-preview : OpenAI's GPT-4-turbo-preview model

Select one of the OpenAI models.

Step 7: Create a language model (Create LLM)

  1. Many verified prompts are uploaded to the LangSmith hub.

  2. You can save money and time by using these verified prompts as-is or with slight modifications.

  3. https://smith.langchain.com/hub/search?q=rag

[TIP2]

  1. If important information is missing from the retriever's results, you need to modify the retriever's logic.

  2. If the retriever's results contain plenty of information but the LLM fails to pick out the important parts, or does not output them in the form you want, you need to fix the prompt.

[TIP1]

Prompt engineering plays an important role in getting the results we want based on the given data (context).
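A typical RAG prompt interpolates the retrieved context and the user's question. The template below is a generic sketch of that pattern, not a specific prompt from the LangSmith hub:

```python
RAG_PROMPT = """You are an assistant for question-answering tasks.
Use the following retrieved context to answer the question.
If you don't know the answer, say that you don't know.

Context:
{context}

Question: {question}

Answer:"""

def build_prompt(context_chunks: list[str], question: str) -> str:
    # Join the retrieved chunks and fill both template slots.
    return RAG_PROMPT.format(context="\n\n".join(context_chunks), question=question)

prompt = build_prompt(["Chunk one.", "Chunk two."], "What is RAG?")
```

Changing only this template (for example, adding answer-format instructions) is often enough to fix cases where the LLM has the right context but answers poorly.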

Step 6: Create a prompt (Create Prompt)

Ensemble Retriever
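An ensemble retriever combines the ranked result lists of several retrievers (for example, a keyword retriever and a dense-vector retriever), commonly via reciprocal rank fusion (RRF). A minimal sketch with invented rankings:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Score each document by the sum of 1 / (k + rank) over every ranking
    # that contains it, then sort by descending fused score.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_b", "doc_a", "doc_d"]  # e.g. BM25 results (made up)
vector_hits = ["doc_a", "doc_c", "doc_b"]   # e.g. dense results (made up)
fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
```

Documents that appear high in multiple rankings (here doc_a) rise to the top of the fused list.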

Create various queries

Search using maximal marginal relevance (MMR).
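MMR trades off relevance to the query against redundancy among already-selected results. A toy implementation over precomputed similarity scores (all documents and numbers here are invented):

```python
def mmr(query_sim: dict[str, float],
        doc_sim: dict[tuple[str, str], float],
        k: int = 2, lam: float = 0.5) -> list[str]:
    # Greedily pick documents maximizing:
    #   lam * sim(query, d) - (1 - lam) * max sim(d, already selected)
    selected: list[str] = []
    candidates = set(query_sim)
    while candidates and len(selected) < k:
        def score(d):
            redundancy = max((doc_sim.get((d, s), doc_sim.get((s, d), 0.0))
                              for s in selected), default=0.0)
            return lam * query_sim[d] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query_sim = {"d1": 0.9, "d2": 0.85, "d3": 0.4}               # query relevance
doc_sim = {("d1", "d2"): 0.95, ("d1", "d3"): 0.1, ("d2", "d3"): 0.1}
picked = mmr(query_sim, doc_sim)
```

Note that plain top-k similarity would return d1 and d2, which are near-duplicates; MMR prefers the less redundant d3.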

similarity_score_threshold performs a similarity-based search and returns only the results whose score is at or above score_threshold.
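The idea behind a score-threshold retriever is simple filtering; a sketch over made-up (document, score) pairs:

```python
def threshold_search(scored_docs: list[tuple[str, float]],
                     score_threshold: float) -> list[str]:
    # Keep only documents scoring at or above the threshold,
    # ordered by descending score.
    kept = [(doc, s) for doc, s in scored_docs if s >= score_threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in kept]

results = threshold_search([("a", 0.92), ("b", 0.55), ("c", 0.80)], 0.8)
```

Unlike top-k search, the number of results is not fixed: it shrinks to zero when nothing is similar enough.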

  • By default, cosine similarity is applied.
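Cosine similarity is the dot product of the two vectors divided by the product of their norms; a minimal implementation:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim = cosine_similarity([1.0, 0.0], [1.0, 1.0])
```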

Similarity-based search

Call as_retriever() on the created VectorStore to generate a Retriever.

A retriever does not need to store documents; it only returns (retrieves) them.

A retriever is an interface that returns documents given an unstructured query.

Step 5: Create a retriever (Create Retriever)

Step 4: Create a vector store (Create Vectorstore)

Free open-source embeddings

MODEL | ROUGH PAGES PER DOLLAR | EXAMPLE PERFORMANCE ON MTEB EVAL
text-embedding-3-small | 62,500 | 62.3%
text-embedding-3-large | 9,615 | 64.6%
text-embedding-ada-002 | 12,500 | 61.0%

  • The default is text-embedding-ada-002.

The following is the list of embedding models supported by OpenAI.

Paid embeddings (OpenAI)

Note: https://python.langchain.com/docs/integrations/text_embedding

Step 3: Embedding

It splits the text at a high level into sentences, groups them into sets of three, and then merges groups that are similar in embedding space.
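The grouping-and-merging idea can be sketched without a real embedding model. Here a toy word-overlap (Jaccard) score stands in for embedding similarity, and sentences are merged while adjacent ones stay similar:

```python
def overlap_sim(a: str, b: str) -> float:
    # Toy stand-in for embedding similarity: Jaccard overlap of word sets.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[str]:
    # Merge each sentence into the current chunk while its similarity to the
    # previous sentence stays above the threshold; otherwise start a new chunk.
    chunks, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if overlap_sim(prev, sent) >= threshold:
            current.append(sent)
        else:
            chunks.append(" ".join(current))
            current = [sent]
    chunks.append(" ".join(current))
    return chunks

chunks = semantic_chunks([
    "Cats are small animals.",
    "Cats are popular pets.",
    "Stock markets fell today.",
])
```

A real semantic splitter replaces overlap_sim with distances between sentence embeddings, but the boundary-detection logic is the same.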

Source: Greg Kamradt's Notebook

Split text based on semantic similarity.

Semantic Similarity

  • It tries the specified list of separators in sequence and splits the given document accordingly.

  • It tries splitting in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""].

  • This generally has the effect of keeping paragraphs (then sentences, then words) together as long as possible, since these tend to be the most semantically related pieces of text.

The RecursiveCharacterTextSplitter class provides the ability to split text recursively. It takes chunk_size, the size of the chunks to split into; chunk_overlap, the amount of overlap between adjacent chunks; length_function, a function that calculates a chunk's length; and is_separator_regex, which specifies whether the separators are regular expressions. In the example, the chunk size is set to 100, the overlap to 20, the length function to len, and is_separator_regex to False to indicate that the separators are not regular expressions.
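The recursive strategy can be sketched in plain Python. This is a simplified stand-in for LangChain's RecursiveCharacterTextSplitter (no chunk_overlap handling) that shows the core idea: try coarse separators first, recurse with finer ones only when a piece is still too large:

```python
def recursive_split(text, chunk_size=100, separators=("\n\n", "\n", " ", "")):
    # Text that already fits becomes a single chunk.
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    # Pick the first separator actually present in the text.
    sep = next((s for s in separators if s and s in text), "")
    if sep == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging while it still fits
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # The piece alone is too big: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, separators))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks

chunks = recursive_split("para one\n\n" + "word " * 40, chunk_size=50)
```

Whole paragraphs survive when they fit; only oversized pieces get broken down further.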

  1. How the text is split: by a list of separators

  2. How the chunk size is measured: by the number of characters (len)

This is the recommended text splitter for generic plain text.

RecursiveCharacterTextSplitter

This code uses the text_splitter object's create_documents method to split the given text (state_of_the_union) into several documents, saves the result in the texts variable, and then prints the first document. This is an early step in processing and analyzing text data, and is especially useful for dividing large texts into manageable units.

  • The separator parameter specifies the string used to separate chunks, here two newline characters ("\n\n").

  • chunk_size determines the maximum length of each chunk.

  • chunk_overlap specifies the number of characters that overlap between adjacent chunks.

  • length_function determines the function used to calculate a chunk's length; by default the len function, which returns the length of the string, is used.

  • is_separator_regex is a boolean value that determines whether separator is interpreted as a regular expression.
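The parameters above can be illustrated with a simplified character splitter. This is a toy stand-in for CharacterTextSplitter that uses a fixed-size sliding window rather than separator-aware merging, just to show what chunk_size and chunk_overlap do:

```python
def char_split(text: str, chunk_size: int = 10, chunk_overlap: int = 3,
               length_function=len) -> list[str]:
    # Slide a window of chunk_size over the text, stepping back
    # chunk_overlap characters each time so adjacent chunks share context.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, length_function(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= length_function(text):
            break
    return chunks

chunks = char_split("abcdefghijklmnop", chunk_size=10, chunk_overlap=3)
```

The last 3 characters of each chunk reappear at the start of the next, which helps retrieval when an answer straddles a chunk boundary.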

The CharacterTextSplitter class provides the ability to split text into chunks of a certain size.

Visualization example: https://chunkviz.up.railway.app/

  1. How the text is split: by a single character

  2. How the chunk size is measured: by the number of characters (len)

This is the simplest method. It splits on a single character (the default is "\n\n") and measures chunk length by the number of characters.

CharacterTextSplitter

Step 2: Split Documents


Next is an example of loading a .py file.

Python

Next is an example of loading all .pdf files in a folder.

Below is an example of loading all .txt files in a folder.

Load all files within a folder

TXT file

CSV data is referenced by row number instead of page number.

CSV

PDF

Below is a BBC news article. If you want to try an article written in English, uncomment the code below and run it.

(Example)

  • bs4.SoupStrainer conveniently lets you extract only the elements you want from a web page.

[Note]

WebBaseLoader uses bs4.SoupStrainer to parse only what is needed from the specified web page.

Web page

Step 1: Load Documents

Here you can set various options for each step or apply new techniques.

Below is an example using the basic RAG model covered earlier.

Take a closer look at each module

LangSmith is not required, but it is useful. If you want to use LangSmith, sign up via the link above and set an environment variable to start logging traces.

Applications built with LangChain make multiple LLM calls across multiple stages. As these applications become more complex, the ability to investigate exactly what is happening inside a chain or agent becomes very important. The best way to do this is to use LangSmith.

Set the API key.

Environment setup

Please copy the files downloaded for practice into the data folder.

  • Authors: Jaeheung Lee and Ji-soo Lee (AI Policy Research Lab)

  • Link: https://spri.kr/posts/view/23669

  • File name: SPRI_AI_Brief_December 23 issue_F.pdf

Software Policy & Research Institute (SPRi), December 2023

Documents used for practice

  1. Retrieve: Given the user input, use the Retriever to search for the relevant splits in storage.

  2. Generate: A ChatModel / LLM generates the answer using a prompt that includes the question and the retrieved data.
