05. LLM-as-Judge

Let's take advantage of the Off-the-shelf Evaluators provided by LangSmith.

Off-the-shelf Evaluators are predefined prompt-based LLM evaluators.

They have the advantage of being easy to use, but you need to define your own evaluator to use more advanced features.

By default, the following three pieces of information are passed to the LLM evaluator:

  • input : the question. Usually the question from the dataset is used.

  • prediction : the answer generated by the LLM. Usually the model's answer is used.

  • reference : reference information such as the ground-truth answer, the Context, etc.

Reference - https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations


# installation
# !pip install -U langsmith langchain-teddynote


# Configuration file for managing API KEY as environment variable
from dotenv import load_dotenv

# Load API KEY information
load_dotenv()


Define functions for RAG performance testing

We will create a RAG system to use for testing.
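A minimal sketch of such a pipeline is shown below. It assumes a local PDF under data/, the langchain-openai and FAISS integrations, and placeholder values for the file path, model name, and chunk sizes; adapt these to your own setup.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Load and split the source document (the file path is a placeholder)
docs = PyPDFLoader("data/sample.pdf").load()
splits = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50).split_documents(docs)

# Build the vector store and retriever
retriever = FAISS.from_documents(splits, OpenAIEmbeddings()).as_retriever()

# Simple RAG prompt and chain
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)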


Create a function named ask_question. It takes a dictionary called inputs as input and returns a dictionary called answer.
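A possible implementation, assuming the chain object from the sketch above and a dataset whose inputs contain a question key:

def ask_question(inputs: dict) -> dict:
    # inputs comes from a dataset example; the generated answer is returned as a dictionary
    return {"answer": chain.invoke(inputs["question"])}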


Define a function to print the evaluator's prompt so its contents can be inspected.
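One possible sketch is below; the evaluator.evaluator.prompt attribute chain is an assumption about the wrapper's internals and may differ between langsmith versions.

def print_evaluator_prompt(evaluator):
    # LangChainStringEvaluator wraps a LangChain evaluator chain whose prompt can be inspected
    return evaluator.evaluator.prompt.pretty_print()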


Question-Answer Evaluator

This is the evaluator with the most basic features. It evaluates a question (Query) against an answer (Answer).

The user input is defined as input, the answer generated by the LLM as prediction, and the correct answer as reference.

(However, in the evaluator's prompt these variables are named query, result, and answer.)

  • query : Question

  • result : LLM answer

  • answer : Correct answer

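As a sketch, the off-the-shelf "qa" evaluator can be created and run like this (the dataset name and experiment prefix are placeholders):

from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Create the off-the-shelf question-answer evaluator
qa_evaluator = LangChainStringEvaluator("qa")

dataset_name = "RAG_EVAL_DATASET"  # placeholder: use the dataset created in the previous chapter

evaluate(
    ask_question,
    data=dataset_name,
    evaluators=[qa_evaluator],
    experiment_prefix="RAG_EVAL",
    metadata={"variant": "QA evaluator"},
)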

Run the evaluation, then open the output URL to check the results.


Context-based Answer Evaluator

  • LangChainStringEvaluator("context_qa") : Instructs the LLM chain to use the reference "context" to determine accuracy.

  • LangChainStringEvaluator("cot_qa") : Similar to the "context_qa" evaluator, but differs in that it instructs the LLM to use chain-of-thought reasoning before reaching its final verdict.

First, you need to define a function that also returns the Context: context_answer_rag_answer

Then create a LangChainStringEvaluator. When creating it, map the return values of the function defined above appropriately through prepare_data.

Details

  • run : the results generated by the LLM ( context , answer , input )

  • example : the data defined in the dataset ( question and answer )

For LangChainStringEvaluator to perform the evaluation, the following three pieces of information are needed:

  • prediction : the answer generated by the LLM

  • reference : the answer defined in the dataset

  • input : the question defined in the dataset

However, since LangChainStringEvaluator("context_qa") uses reference as the Context, it is defined differently. (Note) To use the context_qa evaluator below, we defined a function that returns context, answer, and question.
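Putting this together, a sketch of the context-returning function and the two evaluators could look like the following; the output keys ( context , answer , question ) mirror the description above, and the dataset field names are assumptions.

def context_answer_rag_answer(inputs: dict) -> dict:
    # Return the retrieved context along with the generated answer and the original question
    context = retriever.invoke(inputs["question"])
    return {
        "context": "\n".join(doc.page_content for doc in context),
        "answer": chain.invoke(inputs["question"]),
        "question": inputs["question"],
    }

# cot_qa / context_qa evaluators: the retrieved context is passed as the reference
cot_qa_evaluator = LangChainStringEvaluator(
    "cot_qa",
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],   # answer generated by the LLM
        "reference": run.outputs["context"],   # context is used as the reference
        "input": example.inputs["question"],   # question defined in the dataset
    },
)

context_qa_evaluator = LangChainStringEvaluator(
    "context_qa",
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],
        "input": example.inputs["question"],
    },
)

evaluate(
    context_answer_rag_answer,
    data=dataset_name,
    evaluators=[cot_qa_evaluator, context_qa_evaluator],
    experiment_prefix="RAG_EVAL_CONTEXT",
)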


Run the evaluation, then open the output URL to check the results.


In the evaluation results, an answer that does not match the given Ground Truth can still be rated CORRECT as long as it is correct with respect to the Context.

Criteria

If there is no reference label (a ground-truth answer) or it is hard to obtain one, you can use the "criteria" or "score" evaluators to evaluate a run against a set of custom criteria.

This is useful when you want to monitor high-level semantic aspects of the model's answers: LangChainStringEvaluator("criteria", config={ "criteria": <one of the criteria below> }) (see the sketch after the table below).

Criterion : Description

  • conciseness : Evaluates whether the answer is concise and simple

  • relevance : Evaluates whether the answer is relevant to the question

  • correctness : Evaluates whether the answer is correct

  • coherence : Evaluates whether the answer is coherent

  • harmfulness : Evaluates whether the answer is harmful

  • maliciousness : Evaluates whether the answer is malicious or harmful

  • helpfulness : Evaluates whether the answer is helpful

  • controversiality : Evaluates whether the answer is controversial

  • misogyny : Evaluates whether the answer demeans women

  • criminality : Evaluates whether the answer promotes crime
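A sketch with two arbitrarily chosen criteria:

# Criteria evaluators need no reference answer; each one judges a single aspect
criteria_evaluators = [
    LangChainStringEvaluator("criteria", config={"criteria": "conciseness"}),
    LangChainStringEvaluator("criteria", config={"criteria": "relevance"}),
]

evaluate(
    ask_question,
    data=dataset_name,
    evaluators=criteria_evaluators,
    experiment_prefix="CRITERIA_EVAL",
)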


Using an evaluator when a correct answer exists (labeled_criteria)

If a correct answer exists, the LLM can evaluate by comparing the generated answer against that correct answer.

As in the example below, the correct answer is passed as reference and the LLM-generated answer as prediction.

This mapping is defined through a separate prepare_data setting.

In addition, the LLM used to judge the answers is defined through the llm key of config.
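A sketch under these assumptions (the helpfulness wording and the judge model are placeholders):

from langchain_openai import ChatOpenAI

labeled_criteria_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": "Is the submission helpful, taking the reference answer into account?"
        },
        # LLM that performs the judging
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],      # answer generated by the LLM
        "reference": example.outputs["answer"],   # correct answer from the dataset
        "input": example.inputs["question"],
    },
)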


Below is an example of evaluating relevance.

This time, the context is passed as reference through prepare_data.
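A sketch of such a relevance evaluator and the evaluation run:

relevance_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": "relevance",
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],      # this time the context is the reference
        "input": example.inputs["question"],
    },
)

evaluate(
    context_answer_rag_answer,
    data=dataset_name,
    evaluators=[labeled_criteria_evaluator, relevance_evaluator],
    experiment_prefix="LABELED_CRITERIA_EVAL",
)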


Run the evaluation, then open the output URL to check the results.


Custom Score Evaluator (labeled_score_string)

Below is an example of creating an evaluator that returns a score. The score can be normalized with normalize_by; the converted score becomes a value between 0 and 1.

The accuracy criterion below is defined arbitrarily by the user. You can use it by writing a suitable prompt.
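A sketch (the accuracy wording, the normalize_by value, and the judge model are placeholders):

labeled_score_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        # "accuracy" is a user-defined criterion; word the description however you need
        "criteria": {
            "accuracy": "How accurate is the submission compared to the reference answer?"
        },
        "normalize_by": 10,   # the raw 1-10 score is normalized into the 0-1 range
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    },
)

evaluate(
    context_answer_rag_answer,
    data=dataset_name,
    evaluators=[labeled_score_evaluator],
    experiment_prefix="LABELED_SCORE_EVAL",
)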


Run the evaluation, then open the output URL to check the results.

