09. Experiment evaluation comparison

The Compare feature in LangSmith makes it easy to compare experiment results side by side.


# installation
# !pip install -qU langsmith langchain-teddynote


# Configuration file for managing the API key as an environment variable
from dotenv import load_dotenv

# Load the API key information
load_dotenv()


True


# Set up LangSmith tracing. https://smith.langchain.com
# !pip install -qU langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH16-Evaluations")


Define functions for RAG performance testing

We will create a RAG system to use for testing.
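The code cells for this step did not survive extraction, so here is a hedged sketch of the retrieve-then-generate pattern a RAG system like this follows. The retriever and generator below are plain-Python stand-ins (the document names `docs`, `stub_retrieve`, and `stub_generate` are illustrative assumptions, not the chapter's actual components, which would be built from a vector store and a chat model):

```python
from typing import Callable, List


def make_rag_chain(
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str], str],
) -> Callable[[str], str]:
    """Compose a minimal RAG chain: retrieve context, then generate an answer.

    `retrieve` and `generate` are injected so the same chain shape can wrap
    GPT-4o-mini, an Ollama model, or (as here) simple stubs.
    """
    def answer(question: str) -> str:
        context = "\n".join(retrieve(question))
        prompt = (
            "Answer the question using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        return generate(prompt)
    return answer


# Stub components so the sketch runs without any API keys (assumption:
# the real chapter wires in a document retriever and an LLM instead).
docs = [
    "LangSmith supports dataset-based evaluation.",
    "Experiments can be compared in a comparison view.",
]

def stub_retrieve(question: str) -> List[str]:
    # Naive keyword match instead of a real vector search.
    words = question.lower().split()
    return [d for d in docs if any(w in d.lower() for w in words)] or docs

def stub_generate(prompt: str) -> str:
    # Echo the last prompt line instead of calling an LLM.
    return "Stub answer for: " + prompt.splitlines()[-1]

rag_answer = make_rag_chain(stub_retrieve, stub_generate)
```

Swapping `stub_generate` for a real model call turns this into the two chains compared later, without changing the chain's shape.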


Create functions that generate answers to questions using the GPT-4o-mini model and an Ollama model.
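A LangSmith evaluation target is a function that maps a dataset example's inputs to outputs. A hedged sketch of wrapping each chain in such a function (the `EchoChain` stub and the assumption that examples carry a `"question"` key are illustrative, not from the source):

```python
from typing import Any, Callable, Dict


def make_target(chain: Any) -> Callable[[Dict[str, str]], Dict[str, str]]:
    """Wrap a chain into a LangSmith-style target function.

    The target receives a dataset example's inputs (assumed here to hold a
    "question" key) and returns a dict of outputs for the evaluators.
    """
    def target(inputs: Dict[str, str]) -> Dict[str, str]:
        return {"answer": chain.invoke(inputs["question"])}
    return target


# Placeholder chains: any object with `.invoke(question) -> str` works,
# so the two real chains (GPT-4o-mini and Ollama) slot in the same way.
class EchoChain:
    def __init__(self, name: str):
        self.name = name

    def invoke(self, question: str) -> str:
        return f"[{self.name}] answer to: {question}"


ask_gpt = make_target(EchoChain("gpt-4o-mini"))
ask_ollama = make_target(EchoChain("ollama"))
```

Because both targets share one wrapper, the only difference between the two experiments is the chain passed in, which keeps the comparison fair.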


Evaluate the answers using the GPT-4o-mini model and the Ollama model.

Run the evaluation once for each of the two chains.
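Running one experiment per chain against the same dataset is what makes them comparable later. A sketch of that loop follows; the dataset name, experiment prefixes, and placeholder target functions are assumptions, and the actual `evaluate()` calls are left commented out because they require a LangSmith API key:

```python
# Placeholder target functions standing in for the two chains built earlier.
def ask_gpt(inputs):
    # Would invoke the GPT-4o-mini chain.
    return {"answer": "gpt answer"}

def ask_ollama(inputs):
    # Would invoke the Ollama chain.
    return {"answer": "ollama answer"}

# One (target, experiment_prefix) pair per chain; distinct prefixes make
# the two experiments easy to pick out on the dataset's Experiments tab.
experiments = [
    (ask_gpt, "MODEL_COMPARE_GPT"),
    (ask_ollama, "MODEL_COMPARE_OLLAMA"),
]

# The real runs would use langsmith's evaluate() (needs an API key and an
# existing dataset; "RAG_EVAL_DATASET" is a hypothetical name):
#
# from langsmith.evaluation import evaluate
# for target, prefix in experiments:
#     evaluate(
#         target,
#         data="RAG_EVAL_DATASET",
#         experiment_prefix=prefix,
#     )
```

Each run then appears as a separate experiment on the dataset in LangSmith, which is exactly what the comparison view consumes.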


Use the comparison view to examine the results.

How to create a comparison view

  1. On the dataset's Experiments tab, select the experiments you want to compare.

  2. Click the "Compare" button at the bottom.

  3. The comparison view opens, showing the selected experiments side by side.
