10. Summary Evaluators
Some metrics can only be defined at the level of the whole experiment, not for each individual run. For example, you may want to compute a classifier's aggregate evaluation score across all runs started from a dataset. Evaluators of this kind are called summary_evaluators. Instead of a single Run and Example, they receive a list of each.
# installation
# !pip install -qU langsmith langchain-teddynote
# Configuration file for managing API KEY as environment variable
from dotenv import load_dotenv
# Load API KEY information
load_dotenv()
True
# Set up LangSmith tracking. https://smith.langchain.com
# !pip install -qU langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH16-Evaluations")
Define functions for RAG performance testing
We will create a RAG system to use for testing.
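A minimal sketch of such a RAG pipeline, assuming a hypothetical source PDF at data/sample.pdf and standard LangChain components (PyMuPDFLoader, FAISS, OpenAIEmbeddings); the chunk sizes and prompt are illustrative, not the tutorial's exact setup.

# Sketch of a simple RAG pipeline (file path, chunking, and prompt are assumptions).
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the document and split it into chunks.
docs = PyMuPDFLoader("data/sample.pdf").load()
splits = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50).split_documents(docs)

# Index the chunks in FAISS and expose a retriever.
retriever = FAISS.from_documents(splits, OpenAIEmbeddings()).as_retriever()

def format_docs(docs):
    # Join retrieved chunks into one context string for the prompt.
    return "\n\n".join(d.page_content for d in docs)

# Prompt that grounds the answer in the retrieved context.
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the context.\n\nContext:\n{context}\n\nQuestion: {question}"
)

def build_chain(llm):
    # Compose retriever, prompt, and model into one runnable chain.
    return (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )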
We define functions that generate answers to a question using the GPT-4o-mini model and an Ollama model.
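A minimal sketch, reusing the build_chain helper above; the Ollama model name "llama3" is an assumption, so substitute whichever model you have pulled locally.

from langchain_ollama import ChatOllama
from langchain_openai import ChatOpenAI

# Answer chains backed by GPT-4o-mini and a local Ollama model.
gpt_chain = build_chain(ChatOpenAI(model="gpt-4o-mini", temperature=0))
ollama_chain = build_chain(ChatOllama(model="llama3", temperature=0))

def context_for(question: str) -> str:
    # Return the retrieved chunks so evaluators can inspect the context later.
    return "\n".join(doc.page_content for doc in retriever.invoke(question))

def ask_question_gpt(inputs: dict) -> dict:
    question = inputs["question"]
    return {"answer": gpt_chain.invoke(question), "context": context_for(question)}

def ask_question_ollama(inputs: dict) -> dict:
    question = inputs["question"]
    return {"answer": ollama_chain.invoke(question), "context": context_for(question)}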
OpenAIRelevanceGrader is used to evaluate whether the question, the retrieved context, and the answer are relevant to one another.

target="retrieval-question": evaluates whether the question and the retrieved context are relevant.
target="retrieval-answer": evaluates whether the answer and the retrieved context are relevant.
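A sketch of creating and invoking the two graders; the import path, the create() builder, and the invoke payload keys follow langchain-teddynote examples but may differ across versions, so treat them as assumptions.

from langchain_openai import ChatOpenAI
from langchain_teddynote.evaluator import OpenAIRelevanceGrader

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Grader for question <-> retrieved context relevance.
rq_grader = OpenAIRelevanceGrader(llm=llm, target="retrieval-question").create()

# Grader for answer <-> retrieved context relevance.
ra_grader = OpenAIRelevanceGrader(llm=llm, target="retrieval-answer").create()

# Each grader returns a binary relevance verdict ("yes" / "no") in its score field.
rq_grader.invoke(
    {"input": "What is a summary evaluator?", "context": "A summary evaluator scores a whole experiment ..."}
)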
Aggregating relevance assessments with a Summary Evaluator
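A minimal sketch of the summary evaluator itself. Unlike a per-run evaluator, it receives lists of runs and examples and returns a single aggregate score; the grader calls and the .score attribute assume the graders sketched above, and the dataset is assumed to store its input under a "question" key.

from typing import List

from langsmith.schemas import Example, Run

def relevance_score_summary_evaluator(runs: List[Run], examples: List[Example]) -> dict:
    rq_hits, ra_hits = 0, 0
    for run, example in zip(runs, examples):
        question = example.inputs["question"]  # assumes a "question" input key
        context = run.outputs.get("context", "")
        answer = run.outputs.get("answer", "")

        # Grade question<->context and answer<->context relevance ("yes" / "no").
        if rq_grader.invoke({"input": question, "context": context}).score == "yes":
            rq_hits += 1
        if ra_grader.invoke({"input": answer, "context": context}).score == "yes":
            ra_hits += 1

    # One experiment-level score: the mean of the two relevance rates.
    final_score = (rq_hits + ra_hits) / (2 * len(runs))
    return {"key": "relevance_score", "score": final_score}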
Proceed with the evaluation.
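A sketch of running the experiment with LangSmith's evaluate, passing the function through summary_evaluators; the dataset name "RAG_EVAL_DATASET" and the experiment prefix are assumptions, so use your own dataset's name.

from langsmith.evaluation import evaluate

experiment_results = evaluate(
    ask_question_gpt,  # target function whose outputs are evaluated
    data="RAG_EVAL_DATASET",  # hypothetical dataset name
    summary_evaluators=[relevance_score_summary_evaluator],
    experiment_prefix="SUMMARY_EVAL",
    metadata={"model": "gpt-4o-mini"},
)

The same call with ask_question_ollama produces a comparable experiment for the Ollama-backed chain.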
Check the results.
(Note) The summary evaluation result cannot be checked for individual examples in the dataset; it can only be checked at the experiment (Experiment) level.
