10. Summary Evaluators

Some metrics can only be defined at the level of an entire experiment, not for each individual run.

For example, you may want to compute a classifier's aggregate evaluation score across all runs of an experiment started from a dataset.

These are called summary_evaluators.

Instead of a single Run and Example, these evaluators receive a list of each.
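
In code form, a summary evaluator is simply a function that accepts the full lists of runs and examples and returns one aggregate result. The sketch below shows the minimal shape (the function and metric names are illustrative, not from the original):

from typing import List
from langsmith.schemas import Example, Run

def my_summary_evaluator(runs: List[Run], examples: List[Example]) -> dict:
    # Receives every Run and Example of the experiment at once
    # and returns a single aggregate metric for the whole experiment.
    score = sum(1 for run in runs if run.error is None) / len(runs)
    return {"key": "success_rate", "score": score}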


# installation
# !pip install -qU langsmith langchain-teddynote


# Manage the API key as an environment variable via a .env configuration file
from dotenv import load_dotenv

# Load the API key information
load_dotenv()


True


# Set up LangSmith tracing. https://smith.langchain.com
# !pip install -qU langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH16-Evaluations")


Define functions for RAG performance testing

We will create a RAG system to use for testing.

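The original notebook cell is not shown on this page; the following is a minimal sketch of one way to build such a RAG system with standard LangChain components (the document path, chunk sizes, and variable names are assumptions, not the original code):

from langchain_community.document_loaders import PyMuPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# Load a source document and split it into chunks (path is illustrative)
docs = PyMuPDFLoader("data/sample.pdf").load()
splits = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=50).split_documents(docs)

# Index the chunks in FAISS and expose a retriever
retriever = FAISS.from_documents(splits, OpenAIEmbeddings()).as_retriever()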

Create functions that generate answers to a question using both the GPT-4o-mini model and an Ollama model.

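The original cell is omitted here; below is a sketch of how the two answer functions might look, reusing the retriever defined above (the prompt wording, Ollama model name, and helper names are assumptions):

from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langchain_ollama import ChatOllama

prompt = PromptTemplate.from_template(
    "Answer the question using only the context.\n\n#Context:\n{context}\n\n#Question:\n{question}"
)

def make_ask(llm):
    # Build a function that maps a dataset input to the model's answer,
    # keeping the question, retrieved context, and answer in the output.
    def ask(inputs: dict) -> dict:
        context = retriever.invoke(inputs["question"])
        chain = prompt | llm | StrOutputParser()
        answer = chain.invoke({"context": context, "question": inputs["question"]})
        return {"question": inputs["question"], "context": context, "answer": answer}
    return ask

gpt_chain = make_ask(ChatOpenAI(model="gpt-4o-mini", temperature=0))
ollama_chain = make_ask(ChatOllama(model="llama3"))  # Ollama model name is illustrative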

OpenAIRelevanceGrader is used to evaluate whether the question, the retrieved context, and the answer are relevant to one another; a sketch of its use follows this list.

  • target="retrieval-question": evaluates whether the question and the retrieved context are relevant.

  • target="retrieval-answer": evaluates whether the answer and the retrieved context are relevant.
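
The corresponding notebook cells are omitted on this page; the sketch below shows how the two graders might be created and called (the constructor and invocation details are assumptions about the langchain_teddynote API, not verified code):

from langchain_openai import ChatOpenAI
from langchain_teddynote.evaluator import OpenAIRelevanceGrader

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Grader for question <-> retrieved-context relevance
rq_grader = OpenAIRelevanceGrader(llm=llm, target="retrieval-question").create()

# Grader for answer <-> retrieved-context relevance
ra_grader = OpenAIRelevanceGrader(llm=llm, target="retrieval-answer").create()

# Each grader is assumed to return a result whose score is "yes" or "no"
rq_grader.invoke({"input": "sample question", "context": "retrieved context"})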


A summary evaluator that aggregates the relevance assessments

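The original cell is omitted; a sketch of a summary evaluator combining both graders follows. It counts a run as relevant only when the question-context and answer-context graders both return "yes" (the grader output shape is an assumption):

from typing import List
from langsmith.schemas import Example, Run

def relevance_score_summary_evaluator(runs: List[Run], examples: List[Example]) -> dict:
    correct = 0
    for run, example in zip(runs, examples):
        question = example.inputs["question"]
        context = run.outputs["context"]
        answer = run.outputs["answer"]

        # A run counts as relevant only if both graders say "yes"
        rq = rq_grader.invoke({"input": question, "context": context})
        ra = ra_grader.invoke({"input": answer, "context": context})
        if rq.score == "yes" and ra.score == "yes":
            correct += 1

    # One aggregate score for the entire experiment
    return {"key": "relevance_score", "score": correct / len(runs)}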

Proceed with the evaluation.

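A sketch of the evaluate call, passing the function above through the summary_evaluators parameter (the dataset name and experiment prefix are illustrative):

from langsmith.evaluation import evaluate

experiment_results = evaluate(
    gpt_chain,
    data="RAG_EVAL_DATASET",  # dataset name is illustrative
    summary_evaluators=[relevance_score_summary_evaluator],
    experiment_prefix="SUMMARY_EVAL",
)

The same call can be repeated with ollama_chain to compare the two models under the same aggregate metric.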

Check the results.

(Note) The summary evaluation score cannot be viewed on individual dataset examples; it is only visible at the experiment (Experiment) level.
