05. LLM-as-Judge
Let's take advantage of the Off-the-shelf Evaluators provided by LangSmith.
Off-the-shelf Evaluators are predefined, prompt-based LLM evaluators.
They have the advantage of being easy to use, but you need to define your own evaluator to use more advanced features.
By default, the following three pieces of information are passed to the LLM evaluator:
input: the question. Usually the question from the dataset is used.
prediction: the answer generated by the LLM. Usually the model's answer is used.
reference: auxiliary information such as the correct answer, the Context, etc.
Reference: https://docs.smith.langchain.com/evaluation/faq/evaluator-implementations
# installation
# !pip install -U langsmith langchain-teddynote
# Configuration file for managing API KEY as environment variable
from dotenv import load_dotenv
# Load API KEY information
load_dotenv()
Define functions for RAG performance testing
We will create a RAG system to use for testing.
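The original notebook builds the RAG pipeline with a helper from langchain-teddynote. As a minimal stand-in, here is a sketch of an equivalent chain built from standard LangChain components; the PDF path, chunking parameters, and model name are assumptions, not taken from the original.
# A minimal RAG chain sketch (file path and model name are placeholders)
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the source PDF and split it into chunks
docs = PyMuPDFLoader("data/sample.pdf").load()
splits = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

# Index the chunks and build a retriever
retriever = FAISS.from_documents(splits, OpenAIEmbeddings()).as_retriever()

# Prompt that answers the question using only the retrieved context
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context.\n\n"
    "Context:\n{context}\n\nQuestion:\n{question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    # Concatenate retrieved documents into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

# Compose the chain: retrieve -> fill prompt -> generate -> parse to string
chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)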
Create a function named ask_question. It takes a dictionary as inputs and returns a dictionary as answer.
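A sketch of ask_question, assuming the chain defined above and that the dataset stores the question under the "question" key:
# Receives the dataset example's inputs dict and returns the answer as a dict
def ask_question(inputs: dict) -> dict:
    return {"answer": chain.invoke(inputs["question"])}

# Quick manual check with a hand-written question
# ask_question({"question": "What is this document about?"})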
Next, define a function for printing out the evaluator's prompt.
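A sketch of such a helper, assuming the LangChainStringEvaluator wrapper exposes the underlying LangChain evaluator (and its prompt template) through its evaluator attribute:
# Pretty-print the prompt template an off-the-shelf evaluator uses internally
def print_evaluator_prompt(evaluator):
    evaluator.evaluator.prompt.pretty_print()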
Question-Answer Evaluator
This is the evaluator with the most basic features. It evaluates a question (Query) and an answer (Answer).
The user input is defined as input, the answer generated by the LLM as prediction, and the correct answer as reference.
(Inside the prompt, however, the variables are named query, result, and answer.)
query: the question
result: the LLM's answer
answer: the correct answer
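A sketch of running the basic "qa" evaluator; the dataset name and experiment prefix are placeholders, not taken from the original.
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Create the basic QA evaluator
qa_evaluator = LangChainStringEvaluator("qa")

# Inspect the prompt it will use
print_evaluator_prompt(qa_evaluator)

dataset_name = "RAG_EVAL_DATASET"  # placeholder dataset name

# Run the evaluation; the output contains a URL to the results in LangSmith
evaluate(
    ask_question,
    data=dataset_name,
    evaluators=[qa_evaluator],
    experiment_prefix="RAG_EVAL",
    metadata={"variant": "QA Evaluator"},
)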
Run the evaluation, then open the URL in the output to check the results.

Context-Based Answer Evaluators
LangChainStringEvaluator("context_qa"): Instruct the LLM chain to use the reference "context" to determine its accuracy.LangChainStringEvaluator("cot_qa"):"cot_qa"has"context_qa"Similar to the evaluator, but differs in that it instructs you to use the'inference' of LLM before deciding on the final judgment. Reference
First, define a function that returns the Context: context_answer_rag_answer
Then, create a LangChainStringEvaluator. When creating it, map the return values of the function defined above appropriately through prepare_data.
Details
run: the results generated by the LLM (context, answer, input)
example: the data defined in the dataset (question and answer)
For the LangChainStringEvaluator to perform the evaluation, the following three pieces of information are needed:
prediction: the answer generated by the LLM
reference: the answer defined in the dataset
input: the question defined in the dataset
However, LangChainStringEvaluator("context_qa") uses reference as the Context, so it is defined differently here. (Note) Below, a function that returns context, answer, and question is defined in order to use the context_qa evaluator; see the sketch that follows.
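Putting the steps above together, a sketch that reuses the chain, retriever, and placeholder dataset name defined in the earlier sketches:
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Return the retrieved context together with the generated answer and the question
def context_answer_rag_answer(inputs: dict) -> dict:
    context_docs = retriever.invoke(inputs["question"])
    return {
        "context": "\n\n".join(doc.page_content for doc in context_docs),
        "answer": chain.invoke(inputs["question"]),
        "question": inputs["question"],
    }

# Map run/example fields onto prediction/reference/input;
# here reference is the retrieved context, not the ground-truth answer
def prepare_context_data(run, example):
    return {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],
        "input": example.inputs["question"],
    }

cot_qa_evaluator = LangChainStringEvaluator("cot_qa", prepare_data=prepare_context_data)
context_qa_evaluator = LangChainStringEvaluator("context_qa", prepare_data=prepare_context_data)

evaluate(
    context_answer_rag_answer,
    data=dataset_name,
    evaluators=[cot_qa_evaluator, context_qa_evaluator],
    experiment_prefix="RAG_EVAL",
    metadata={"variant": "COT_QA & Context_QA Evaluator"},
)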
Run the evaluation, then open the URL in the output to check the results.

In the evaluation results, note that even if the generated answer does not match the Ground Truth, it is rated CORRECT as long as it is correct with respect to the given Context.
Criteria
If there is no reference label (ground-truth answer), or it is difficult to obtain one, you can use the "criteria" or "score" evaluators to evaluate a run against a set of custom criteria.
This is useful when you want to monitor high-level semantic aspects of the model's answers: LangChainStringEvaluator("criteria", config={"criteria": <one of the criteria below>}). A sketch of using the criteria evaluator follows the list below.
Available criteria and what each one evaluates:
conciseness: whether the answer is concise and simple
relevance: whether the answer is relevant to the question
correctness: whether the answer is correct
coherence: whether the answer is coherent
harmfulness: whether the answer is harmful
maliciousness: whether the answer is malicious
helpfulness: whether the answer is helpful
controversiality: whether the answer is controversial
misogyny: whether the answer demeans women
criminality: whether the answer promotes criminal behavior
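A sketch of running two of the criteria above; no reference answer is needed, and the dataset name and experiment prefix are the same placeholders as before.
from langsmith.evaluation import evaluate, LangChainStringEvaluator

# Evaluate conciseness and relevance without a ground-truth answer
criteria_evaluators = [
    LangChainStringEvaluator("criteria", config={"criteria": "conciseness"}),
    LangChainStringEvaluator("criteria", config={"criteria": "relevance"}),
]

evaluate(
    ask_question,
    data=dataset_name,
    evaluators=criteria_evaluators,
    experiment_prefix="CRITERIA_EVAL",
    metadata={"variant": "Criteria Evaluator"},
)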

Using an Evaluator when a correct answer exists (labeled_criteria)
If a correct answer exists, the LLM can evaluate the generated answer by comparing it against that correct answer.
As in the example below, the correct answer is passed as reference and the LLM-generated answer as prediction.
This mapping is defined through a separate prepare_data setting.
In addition, the LLM used to judge the answers is defined through the llm key of config.
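A sketch of a labeled_criteria evaluator that judges helpfulness against the dataset answer; the judge model is an assumption.
from langchain_openai import ChatOpenAI
from langsmith.evaluation import evaluate, LangChainStringEvaluator

helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": "helpfulness",
        # LLM used to judge the answers
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    # reference = ground-truth answer, prediction = generated answer
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    },
)

evaluate(
    ask_question,
    data=dataset_name,
    evaluators=[helpfulness_evaluator],
    experiment_prefix="LABELED_CRITERIA_EVAL",
    metadata={"variant": "labeled_criteria Evaluator"},
)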
Below is an example of evaluating relevance.
This time, the context is passed as reference through prepare_data.
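A sketch that reuses context_answer_rag_answer from earlier so the retrieved context is available to pass as reference:
from langchain_openai import ChatOpenAI
from langsmith.evaluation import evaluate, LangChainStringEvaluator

relevance_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": "relevance",
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    # This time reference is the retrieved context rather than the dataset answer
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": run.outputs["context"],
        "input": example.inputs["question"],
    },
)

evaluate(
    context_answer_rag_answer,
    data=dataset_name,
    evaluators=[relevance_evaluator],
    experiment_prefix="LABELED_CRITERIA_EVAL",
    metadata={"variant": "relevance Evaluator"},
)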
Run the evaluation, then open the URL in the output to check the results.

Custom score Evaluator (labeled_score_string)
Below is an example of creating an evaluator that returns a score. The score can be normalized with normalize_by; the normalized score is a value between 0 and 1.
The accuracy below is a criterion defined arbitrarily by the user. You can use it by defining a prompt that suits your needs.
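A sketch of a labeled_score_string evaluator with a user-defined accuracy criterion, scored on a 1-10 scale and normalized by 10; the criterion wording and judge model are assumptions.
from langchain_openai import ChatOpenAI
from langsmith.evaluation import evaluate, LangChainStringEvaluator

labeled_score_evaluator = LangChainStringEvaluator(
    "labeled_score_string",
    config={
        # User-defined criterion; adjust the wording to your own needs
        "criteria": {
            "accuracy": "How accurate is this prediction compared to the reference answer?"
        },
        # Raw 1-10 scores are divided by 10, giving a value between 0 and 1
        "normalize_by": 10,
        "llm": ChatOpenAI(model="gpt-4o-mini", temperature=0),
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["answer"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    },
)

evaluate(
    ask_question,
    data=dataset_name,
    evaluators=[labeled_score_evaluator],
    experiment_prefix="LABELED_SCORE_EVAL",
    metadata={"variant": "labeled_score Evaluator"},
)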
Run the evaluation, then open the URL in the output to check the results.