Evaluating the performance of a RAG (Retrieval-Augmented Generation) pipeline is very important.
However, manually writing hundreds of QA (question-context-answer) samples from documents is time consuming and labor intensive. In addition, human-written questions often fail to reach the level of complexity required for a thorough evaluation, which ultimately hurts the quality of the evaluation.
Synthetic data generation can reduce the developer time spent on building such a dataset by up to 90%.
Uncomment the line below, run it to install the package, and then proceed.
# !pip install -qU ragas
# API keys are managed as environment variables in a .env configuration file
from dotenv import load_dotenv
# Load API KEY information
load_dotenv()
True
# Set up LangSmith tracking. https://smith.langchain.com
# !pip install -qU langchain-teddynote
from langchain_teddynote import logging
# Enter a project name.
logging.langsmith("CH16-Evaluations")
Documents utilized for practice
Software Policy & Research Institute (SPRi) AI Brief, December 2023
Authors: Jaeheung Lee (Senior Researcher, AI Policy Lab), Lee Ji-soo (Associate Researcher, AI Policy Lab)
Link: https://spri.kr/posts/view/23669
File name: SPRI_AI_Brief_2023년12월호_F.pdf
Please copy the downloaded file into the data folder before running the examples.
Document preprocessing
Load documents.
Each document object contains a metadata dictionary, accessible through the metadata attribute, which can be used to store additional information about the document.
Make sure the metadata dictionary includes a filename key.
This key is used during test dataset generation: the filename attribute identifies chunks that belong to the same document.
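For example, the filename key can simply be copied from each document's source metadata, as in the short sketch below (the full preprocessing code appears later in this section):

# Set a filename key on every loaded document, copied from the source path
for doc in docs:
    doc.metadata["filename"] = doc.metadata["source"]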
Generate the dataset
Initialize the DocumentStore, using a custom LLM and embeddings.
Generate the TestSet.
Distribution by question type
simple: simple questions
reasoning: questions that require reasoning
multi_context: questions that require multiple contexts
conditional: conditional questions
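The distribution is expressed as a dictionary mapping each evolution type to the proportion of questions to generate for it, for example (the same dictionary appears in the full code at the end of this section):

# 40% simple, 20% reasoning, 20% multi_context, 20% conditional questions
distributions = {simple: 0.4, reasoning: 0.2, multi_context: 0.2, conditional: 0.2}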
documents: document data
test_size: the number of questions to create
distributions: distribution by question type
with_debugging_logs: whether to print debugging logs
Save the dataset stored in the DataFrame as a CSV file.
Execute evaluation only for specific tags
Tag settings (enter the desired tag)
Edit Rule
Instead of evaluating every step, you can run the evaluation only for runs with a specific Tag by setting a Tag.
Tag creation
Save after checking the Grade.
Be sure to check Preview, then turn off Preview mode again to move on to the next step.
Caution
Use Preview to make sure the data is being entered in the correct place.
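A tag is attached to a run through a RunnableConfig, as in the short sketch below (the same configuration appears in the full code at the end of this section):

from langchain_core.runnables import RunnableConfig

# Runs invoked with this config carry the "hallucination_eval" tag,
# so an online evaluation rule filtered on that tag will pick them up
hallucination_config = RunnableConfig(tags=["hallucination_eval"])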
from langchain_community.document_loaders import PDFPlumberLoader
# Create a document loader
loader = PDFPlumberLoader("data/SPRI_AI_Brief_2023년12월호_F.pdf")
# Loading documents
docs = loader.load()
# Exclude the table of contents (first 3 pages) and the last page
docs = docs[3:-1]
# Number of pages in the document
len(docs)
19
# Set metadata (filename must exist)
for doc in docs:
doc.metadata["filename"] = doc.metadata["source"]
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context, conditional
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset.extractor import KeyphraseExtractor
from ragas.testset.docstore import InMemoryDocumentStore
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Dataset Generator
generator_llm = ChatOpenAI(model="gpt-4o-mini")
# Dataset critic
critic_llm = ChatOpenAI(model="gpt-4o-mini")
# Document Embedding
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
# Sets the text splitter.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
# Wrap LangChain's ChatOpenAI model with LangchainLLMWrapper to make it compatible with Ragas.
langchain_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
# Initialize the key phrase extractor. It uses the LLM defined above.
keyphrase_extractor = KeyphraseExtractor(llm=langchain_llm)
# Create ragas_embeddings
ragas_embeddings = LangchainEmbeddingsWrapper(embeddings)
# Initializes an InMemoryDocumentStore.
# This is a repository that stores and manages documents in memory.
docstore = InMemoryDocumentStore(
splitter=splitter,
embeddings=ragas_embeddings,
extractor=keyphrase_extractor,
)
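# Create the test set generator: a minimal sketch, assuming the ragas 0.1.x
# TestsetGenerator API (adjust to the version you have installed)
generator = TestsetGenerator(
    generator_llm=LangchainLLMWrapper(generator_llm),
    critic_llm=LangchainLLMWrapper(critic_llm),
    embeddings=ragas_embeddings,
    docstore=docstore,
)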
# Determine distribution by question type
# simple: simple questions, reasoning: questions that require reasoning, multi_context: questions spanning multiple contexts, conditional: conditional questions
distributions = {simple: 0.4, reasoning: 0.2, multi_context: 0.2, conditional: 0.2}
# Create a test set
# documents: document data, test_size: number of questions to generate, distributions: distribution by question type, with_debugging_logs: whether to print debugging logs
testset = generator.generate_with_langchain_docs(
documents=docs, test_size=10, distributions=distributions, with_debugging_logs=True
)
# Convert the generated test set to a pandas DataFrame
test_df = testset.to_pandas()
test_df
# Output the top 5 rows of a DataFrame
test_df.head()
# Save DataFrame as CSV file
test_df.to_csv("data/ragas_synthetic_dataset.csv", index=False)
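# (Optional) A quick usage sketch: the saved CSV can be reloaded later,
# e.g. for evaluation, using pandas (already available via the DataFrame above)
import pandas as pd

loaded_df = pd.read_csv("data/ragas_synthetic_dataset.csv")
loaded_df.head()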
from langchain_core.runnables import RunnableConfig

# Set the tags first; these configs are used in the evaluation requests below
hallucination_config = RunnableConfig(tags=["hallucination_eval"])
context_recall_config = RunnableConfig(tags=["context_recall_eval"])
all_eval_config = RunnableConfig(tags=["hallucination_eval", "context_recall_eval"])
# Request all evaluations (evaluation_runnable is the chain being evaluated, defined elsewhere in this tutorial)
_ = evaluation_runnable.invoke(
    "What is the name of the generative AI developed by Samsung Electronics?", config=all_eval_config
)
# Request only the Context Recall evaluation
_ = evaluation_runnable.invoke(
    "What is the name of the generative AI developed by Samsung Electronics?",
    config=context_recall_config,
)
# Request only the Hallucination evaluation
_ = evaluation_runnable.invoke(
    "What is the name of the generative AI developed by Samsung Electronics?", config=hallucination_config
)