# Configuration file for managing the API KEY as an environment variable
from dotenv import load_dotenv
# Load API KEY information
load_dotenv()
True
# Set up LangSmith tracking. https://smith.langchain.com
# !pip install -qU langchain-teddynote
from langchain_teddynote import logging
# Enter a project name.
logging.langsmith("CH16-Evaluations")
Define functions for RAG performance testing
We will create a RAG system to use for testing.
Create a function named ask_question. It receives a dictionary called inputs as input and returns a dictionary containing the answer.
Custom Evaluator Configuration
Simply keep the input-parameter and return-value format of the custom function shown below.
Custom function
Input: receives a Run and an Example, and returns a dict as output.
Return value: organized in the format {"key": "score_name", "score": score}.
Below, we define a simple example function that returns a random score between 1 and 10 regardless of the answer.
Custom LLM-as-Judge
This time, we will create an LLM Chain and use it as an evaluator.
First, define a function that returns the context, answer, and question.
Next, create a custom LLM evaluator.
The evaluation prompt can be adjusted freely.
We pass the answer and context generated by the previously created context_answer_rag_answer function to custom_llm_evaluator to run the evaluation.
Define the custom_evaluator function (see the sketch at the end of this section).
run.outputs : retrieves the answer, context, and question generated by the RAG chain.
example.outputs : retrieves the ground-truth answer from the dataset.
from myrag import PDFRAG
from langchain_openai import ChatOpenAI
# Creating a PDFRAG object
rag = PDFRAG(
    "data/SPRI_AI_Brief_December 2023 issue_F.pdf",
    ChatOpenAI(model="gpt-4o-mini", temperature=0),
)
# Create a retriever
retriever = rag.create_retriever()
# Create a chain
chain = rag.create_chain(retriever)
# Generate answers to questions
chain.invoke("What is the name of the generative AI developed by Samsung Electronics?")
"The name of the generated AI developed by the Samsung is'Samsung Gauss'."
# Create a function that answers the question
def ask_question(inputs: dict):
    return {"answer": chain.invoke(inputs["question"])}
from langsmith.schemas import Run, Example
import random
def random_score_evaluator(run: Run, example: Example) -> dict:
    # Return a random score between 1 and 10
    score = random.randint(1, 10)
    return {"key": "random_score", "score": score}
# RAG function that returns the context along with the answer
def context_answer_rag_answer(inputs: dict):
    context = retriever.invoke(inputs["question"])
    return {
        "context": "\n".join([doc.page_content for doc in context]),
        "answer": chain.invoke(inputs["question"]),
        "question": inputs["question"],
    }
from langchain import hub
# Get Evaluator Prompt
llm_evaluator_prompt = hub.pull("teddynote/context-answer-evaluator")
llm_evaluator_prompt.pretty_print()
As an LLM evaluator (judge), please assess the LLM's response to the given question. Evaluate the response's accuracy, comprehensiveness, and context precision based on the provided context. After your evaluation, return only the numerical scores in the following format:
Accuracy: [score]
Comprehensiveness: [score]
Context Precision: [score]
Final: [normalized score]
Grading rubric:
Accuracy (0-10 points):
Evaluate how well the answer aligns with the information provided in the given context.
0 points: The answer is completely inaccurate or contradicts the provided context
4 points: The answer partially aligns with the context but contains significant inaccuracies
7 points: The answer mostly aligns with the context but has minor inaccuracies or omissions
10 points: The answer fully aligns with the provided context and is completely accurate
Comprehensiveness (0-10 points):
0 points: The answer is completely inadequate or irrelevant
3 points: The answer is accurate but too brief to fully address the question
7 points: The answer covers main aspects but lacks detail or misses minor points
10 points: The answer comprehensively covers all aspects of the question
Context Precision (0-10 points):
Evaluate how precisely the answer uses the information from the provided context.
0 points: The answer doesn't use any information from the context or uses it entirely incorrectly
4 points: The answer uses some information from the context but with significant misinterpretations
7 points: The answer uses most of the relevant context information correctly but with minor misinterpretations
10 points: The answer precisely and correctly uses all relevant information from the context
Final Normalized Score:
Calculate by summing the scores for accuracy, comprehensiveness, and context precision, then dividing by 30 to get a score between 0 and 1.
Formula: (Accuracy + Comprehensiveness + Context Precision) / 30
#Given question:
{question}
#LLM's response:
{answer}
#Provided context:
{context}
Please evaluate the LLM's response according to the criteria above.
In your output, include only the numerical score for FINAL NORMALIZED SCORE without any additional explanation or reasoning.
ex) 0.81
#Final Normalized Score (Just the number):
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
# Create the evaluator chain
custom_llm_evaluator = (
    llm_evaluator_prompt
    | ChatOpenAI(temperature=0.0, model="gpt-4o-mini")
    | StrOutputParser()
)
# Generate an answer
output = context_answer_rag_answer(
    {"question": "What is the name of the generative AI developed by Samsung Electronics?"}
)
# Run the score evaluation
custom_llm_evaluator.invoke(output)
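The custom_evaluator function described earlier is not shown in this section, so here is a minimal sketch consistent with that description; the score key name "custom_score" and the commented handling of example.outputs are assumptions for illustration.

from langsmith.schemas import Run, Example

def custom_evaluator(run: Run, example: Example) -> dict:
    # run.outputs: the answer, context, and question produced by the RAG chain
    llm_answer = run.outputs.get("answer", "")
    context = run.outputs.get("context", "")
    question = run.outputs.get("question", "")

    # example.outputs: the ground-truth answer from the dataset
    # (available for reference; the evaluation prompt above does not use it)
    # ground_truth = example.outputs.get("answer", "")

    # Ask the LLM judge for a normalized score between 0 and 1
    score = custom_llm_evaluator.invoke(
        {"question": question, "answer": llm_answer, "context": context}
    )
    return {"key": "custom_score", "score": float(score)}

For run.outputs to contain the answer, context, and question, the target function used in the experiment would need to return those keys, as context_answer_rag_answer does above.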