05. LLM-as-Judge

Let's take advantage of the Off-the-shelf Evaluators provided by LangSmith.

Off-the-shelf Evaluators are predefined prompt-based LLM evaluators.

They have the advantage of being easy to use, but you need to define your own evaluator to use more advanced features.

By default, the following three pieces of information are passed to the LLM evaluator:

  • input : the question. Usually the question from the dataset is used.

  • prediction : the answer generated by the LLM. Usually the model's answer is used.

  • reference : reference information such as the ground-truth answer, the Context, etc.

Reference - https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations


# installation
# !pip install -U langsmith langchain-teddynote


# Configuration file for managing API KEY as environment variable
from dotenv import load_dotenv

# Load API KEY information
load_dotenv()


Define functions for RAG performance testing

We will create a RAG system to use for testing.
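A minimal sketch of such a pipeline is shown below. It assumes a local PDF under data/, the langchain-openai and FAISS integrations, and placeholder values for the file path, model name, and chunk sizes; adapt these to your own setup.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Load and split the source document (the file path is a placeholder)
docs = PyPDFLoader("data/sample.pdf").load()
splits = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50).split_documents(docs)

# Build the vector store and retriever
retriever = FAISS.from_documents(splits, OpenAIEmbeddings()).as_retriever()

# Simple RAG prompt and chain
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)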


Create a function named ask_question. It takes a dictionary called inputs as input and returns a dictionary called answer.
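A possible implementation, assuming the chain object from the sketch above and a dataset whose inputs contain a question key:

def ask_question(inputs: dict) -> dict:
    # inputs comes from a dataset example; the generated answer is returned as a dictionary
    return {"answer": chain.invoke(inputs["question"])}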


Define a function to print the evaluator's prompt so its contents can be inspected.
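One possible sketch is below; the evaluator.evaluator.prompt attribute chain is an assumption about the wrapper's internals and may differ between langsmith versions.

def print_evaluator_prompt(evaluator):
    # LangChainStringEvaluator wraps a LangChain evaluator chain whose prompt can be inspected
    return evaluator.evaluator.prompt.pretty_print()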


Question-Answer Evaluator

This is the evaluator with the most basic features. It evaluates a question (Query) against an answer (Answer).

The user input is defined as input, the answer generated by the LLM as prediction, and the correct answer as reference.

(However, in the evaluator's prompt these variables are named query, result, and answer.)

  • query : Question

  • result : LLM answer

  • answer : Correct answer

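As a sketch, the off-the-shelf "qa" evaluator can be created and run like this (the dataset name and experiment prefix are placeholders):

from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Create the off-the-shelf question-answer evaluator
qa_evaluator = LangChainStringEvaluator("qa")

dataset_name = "RAG_EVAL_DATASET"  # placeholder: use the dataset created in the previous chapter

evaluate(
    ask_question,
    data=dataset_name,
    evaluators=[qa_evaluator],
    experiment_prefix="RAG_EVAL",
    metadata={"variant": "QA evaluator"},
)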

Run the evaluation, then open the output URL to check the results.


Context-based Answer Evaluator

  • LangChainStringEvaluator("context_qa") : Instructs the LLM chain to use the reference "context" to determine accuracy.

  • LangChainStringEvaluator("cot_qa") : Similar to the "context_qa" evaluator, but differs in that it instructs the LLM to use chain-of-thought reasoning before reaching its final verdict.

First, you need to define a function that also returns the Context: context_answer_rag_answer

Then create a LangChainStringEvaluator. When creating it, map the return values of the function defined above appropriately through prepare_data.

Details

  • run : the results generated by the LLM ( context , answer , input )

  • example : the data defined in the dataset ( question and answer )

For LangChainStringEvaluator to perform the evaluation, the following three pieces of information are needed:

  • prediction : the answer generated by the LLM

  • reference : the answer defined in the dataset

  • input : the question defined in the dataset

However, since LangChainStringEvaluator("context_qa") uses reference as the Context, it is defined differently. (Note) To use the context_qa evaluator below, we defined a function that returns context, answer, and question.
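Putting this together, a sketch of the context-returning function and the two evaluators could look like the following; the output keys ( context , answer , question ) mirror the description above, and the dataset field names are assumptions.

def context_answer_rag_answer(inputs: dict) -> dict:
    # Return the retrieved context along with the generated answer and the original question
    context = retriever.invoke(inputs["question"])
    return {
        "context": "\n".join(doc.page_content for doc in context),
        "answer": chain.invoke(inputs["question"]),
        "question": inputs["question"],
    }

# cot_qa / context_qa evaluators: the retrieved context is passed as the reference
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa",
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],   # answer generated by the LLM
        "reference": run.outputs["context"],   # context is used as the reference
        "input": example.inputs["question"],   # question defined in the dataset
    },
)

context_qa_evaluator = LangChainStringEvaluator(
    "context_qa",
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],
        "input": example.inputs["question"],
    },
)

evaluate(
    context_answer_rag_answer,
    data=dataset_name,
    evaluators=[cot_qa_evaluator, context_qa_evaluator],
    experiment_prefix="RAG_EVAL_CONTEXT",
)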


Run the evaluation, then open the output URL to check the results.


In the evaluation results, an answer that does not match the given Ground Truth can still be rated CORRECT as long as it is correct with respect to the Context.

Criteria

If there is no reference label (a ground-truth answer) or it is hard to obtain one, you can use the "criteria" or "score" evaluators to evaluate a run against a set of custom criteria.

This is useful when you want to monitor high-level semantic aspects of the model's answers: LangChainStringEvaluator("criteria", config={ "criteria": <one of the criteria below> }) (see the sketch after the table below).

Criterion : Description

  • conciseness : Evaluates whether the answer is concise and simple

  • relevance : Evaluates whether the answer is relevant to the question

  • correctness : Evaluates whether the answer is correct

  • coherence : Evaluates whether the answer is coherent

  • harmfulness : Evaluates whether the answer is harmful

  • maliciousness : Evaluates whether the answer is malicious or harmful

  • helpfulness : Evaluates whether the answer is helpful

  • controversiality : Evaluates whether the answer is controversial

  • misogyny : Evaluates whether the answer demeans women

  • criminality : Evaluates whether the answer promotes crime
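A sketch with two arbitrarily chosen criteria:

# Criteria evaluators need no reference answer; each one judges a single aspect
criteria_evaluators = [
    LangChainStringEvaluator("criteria", config={"criteria": "conciseness"}),
    LangChainStringEvaluator("criteria", config={"criteria": "relevance"}),
]

evaluate(
    ask_question,
    data=dataset_name,
    evaluators=criteria_evaluators,
    experiment_prefix="CRITERIA_EVAL",
)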


Using an evaluator when a correct answer exists (labeled_criteria)

If a correct answer exists, the LLM can evaluate by comparing the generated answer against that correct answer.

As in the example below, the correct answer is passed as reference and the LLM-generated answer as prediction.

This mapping is defined through a separate prepare_data setting.

In addition, the LLM used to judge the answers is defined through the llm key of config.
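A sketch under these assumptions (the helpfulness wording and the judge model are placeholders):

from langchain_openai import ChatOpenAI

labeled_criteria_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": "Is the submission helpful, taking the reference answer into account?"
        },
        # LLM that performs the judging
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],      # answer generated by the LLM
        "reference": example.outputs["answer"],   # correct answer from the dataset
        "input": example.inputs["question"],
    },
)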


Below is an example of evaluating relevance.

This time, the context is passed as reference through prepare_data.
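A sketch of such a relevance evaluator and the evaluation run:

relevance_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": "relevance",
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],      # this time the context is the reference
        "input": example.inputs["question"],
    },
)

evaluate(
    context_answer_rag_answer,
    data=dataset_name,
    evaluators=[labeled_criteria_evaluator, relevance_evaluator],
    experiment_prefix="LABELED_CRITERIA_EVAL",
)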


Run the evaluation, then open the output URL to check the results.


Custom Score Evaluator (labeled_score_string)

Below is an example of creating an evaluator that returns a score. The score can be normalized with normalize_by; the converted score becomes a value between 0 and 1.

The accuracy criterion below is defined arbitrarily by the user. You can use it by writing a suitable prompt.
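A sketch (the accuracy wording, the normalize_by value, and the judge model are placeholders):

labeled_score_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        # "accuracy" is a user-defined criterion; word the description however you need
        "criteria": {
            "accuracy": "How accurate is the submission compared to the reference answer?"
        },
        "normalize_by": 10,   # the raw 1-10 score is normalized into the 0-1 range
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    },
)

evaluate(
    context_answer_rag_answer,
    data=dataset_name,
    evaluators=[labeled_score_evaluator],
    experiment_prefix="LABELED_SCORE_EVAL",
)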


Run the evaluation, then open the output URL to check the results.

