08. ROUGE, BLEU, METEOR, SemScore-based heuristic evaluation

Heuristic evaluation

Heuristic evaluation is a method of reasoning used to reach a quick, approximate judgment when there is not enough time or information for a fully reasoned one.

(Compared with using an LLM as judge, it also has the advantage of saving time and money.)

(Note) If you need to install or update the libraries, uncomment and run the code below before proceeding.


# !pip install -qU langsmith langchain-teddynote rouge-score


# Configuration file for managing API KEY as environment variable
from dotenv import load_dotenv

# Load API KEY information
load_dotenv()


True


# Set up LangSmith tracking. https://smith.langchain.com
# !pip install -qU langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH16-Evaluations")


Define functions for RAG performance testing

We will create a RAG system to use for testing.

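The original code for this step is not shown here. Below is a minimal sketch of the kind of RAG pipeline used for testing; the sample document, model name, and prompt are illustrative assumptions, and the FAISS vector store requires the faiss-cpu package.

from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Build a small in-memory vector store to act as the retriever.
vectorstore = FAISS.from_texts(
    ["Seoul is the capital of South Korea."],  # illustrative sample document
    OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()

# A chain that answers questions using only the retrieved context.
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the following context.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
chain = prompt | llm | StrOutputParser()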

Create a function named ask_question. It receives a dictionary called inputs and returns a dictionary called answer.

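A minimal sketch of this function, assuming the retriever and chain built above:

def ask_question(inputs: dict) -> dict:
    # Retrieve context for the question, then generate an answer with the chain.
    context = retriever.invoke(inputs["question"])
    answer = chain.invoke({"context": context, "question": inputs["question"]})
    return {"answer": answer}

# Example call
ask_question({"question": "What is the capital of South Korea?"})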

Using a Korean morphological analyzer

A Korean morphological analyzer is a tool that splits Korean sentences into morphemes, the smallest units of meaning, and identifies the part of speech of each morpheme.

Main features of a morphological analyzer

  • Splits sentences into morpheme units

  • Tags each morpheme with its part of speech

  • Extracts the base form of each morpheme

You can use a Korean morphological analyzer through the kiwipiepy library.

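A minimal sketch of morphological analysis with kiwipiepy (the sample sentence is an illustrative assumption):

from kiwipiepy import Kiwi

kiwi = Kiwi()

# Split a sentence into morphemes; each token carries a surface form and a POS tag.
for token in kiwi.tokenize("안녕하세요. 형태소 분석기를 사용해 봅니다."):
    print(token.form, token.tag)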

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score

  • An evaluation metric used to assess the quality of automatic summarization and machine translation.

  • Measures how many of the important keywords in the reference text appear in the generated text.

  • Calculated based on n-gram overlap

ROUGE-1

  • Measures similarity at the word (unigram) level.

  • Evaluates individual word matches between the two sentences.

ROUGE-2

  • Measures similarity of pairs of consecutive words (bigrams).

  • Evaluates matches of two consecutive words between the two sentences.

ROUGE-L

  • Measures similarity based on the Longest Common Subsequence (LCS).

  • Considers word order at the sentence level without requiring contiguous matches.

  • Allows more flexible evaluation and naturally reflects similarity in sentence structure.

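A minimal sketch of computing ROUGE with the rouge-score package installed above. Passing a morpheme-level tokenizer makes the n-gram comparison meaningful for Korean; the KiwiTokenizer wrapper and the sample sentences are illustrative assumptions.

from rouge_score import rouge_scorer
from kiwipiepy import Kiwi

class KiwiTokenizer:
    # rouge-score accepts any object exposing a tokenize() method
    # that returns a list of tokens.
    def __init__(self):
        self.kiwi = Kiwi()

    def tokenize(self, text):
        return [t.form for t in self.kiwi.tokenize(text)]

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL"], tokenizer=KiwiTokenizer()
)
scores = scorer.score(
    "대한민국의 수도는 서울입니다.",  # reference
    "서울은 대한민국의 수도입니다.",  # prediction
)
print(scores["rougeL"].fmeasure)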

BLEU (Bilingual Evaluation Understudy) score

Mainly used for machine translation evaluation. Measures how similar the generated text is to the reference text.

Calculated based on n-gram precision

Calculation method

  • N-gram precision: for each n from 1 to 4, calculate the fraction of n-grams in the machine translation that also appear in the reference translation.

  • Brevity penalty: a penalty is applied if the machine translation is shorter than the reference translation.

  • Final score: the geometric mean of the n-gram precisions is multiplied by the brevity penalty to yield the final BLEU score.

Limitations

  • Checks only surface string matches without considering meaning.

  • Does not distinguish the importance of individual words.

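A minimal sketch of computing BLEU with NLTK's implementation on morpheme-level tokens; the sample sentences are illustrative assumptions, and smoothing is applied because short sentences often have no matching 3-grams or 4-grams.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from kiwipiepy import Kiwi

kiwi = Kiwi()

def tokenize(text):
    # Morpheme-level tokens work better than whitespace tokens for Korean.
    return [t.form for t in kiwi.tokenize(text)]

reference = tokenize("대한민국의 수도는 서울입니다.")
candidate = tokenize("서울은 대한민국의 수도입니다.")

# sentence_bleu expects a list of reference token lists and one candidate token list.
score = sentence_bleu(
    [reference], candidate, smoothing_function=SmoothingFunction().method1
)
print(score)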

METEOR score

An evaluation metric developed to assess the quality of machine translation.

  • It was developed to compensate for the shortcomings of BLEU.

  • In addition to simple word matching, it takes into account linguistic factors such as stemming, synonym matching, and paraphrasing.

  • Takes word order into account in the evaluation.

  • Supports multiple reference translations.

  • Produces a score between 0 and 1; the closer to 1, the better the translation.

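A minimal sketch using NLTK's METEOR implementation. NLTK's stemming and WordNet synonym matching are English-language resources, so the sample sentences here are in English and are illustrative assumptions.

import nltk
from nltk.translate.meteor_score import meteor_score

# WordNet data is required for METEOR's synonym matching.
nltk.download("wordnet", quiet=True)

reference = "the cat sat on the mat".split()
hypothesis = "a cat was sitting on the mat".split()

# meteor_score expects pre-tokenized references and hypothesis (NLTK >= 3.6.6).
print(meteor_score([reference], hypothesis))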

SemScore

The SemScore paper proposes a simple but effective evaluation metric called SEMSCORE, which compares model output directly with gold-standard responses using semantic textual similarity (STS). A comparative evaluation of the outputs of 12 major instruction-tuned LLMs against 8 widely used text generation evaluation metrics showed that SEMSCORE correlated with human evaluation better than all the other metrics.

A SentenceTransformer model is used to generate sentence embeddings, and the cosine similarity between the two sentences is computed. The model used in the paper is all-mpnet-base-v2.

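A minimal sketch of SemScore-style scoring with the sentence-transformers library, using the all-mpnet-base-v2 model from the paper; the sample sentences are illustrative assumptions.

from sentence_transformers import SentenceTransformer, util

# The model used in the SemScore paper.
model = SentenceTransformer("all-mpnet-base-v2")

embeddings = model.encode(
    [
        "The capital of South Korea is Seoul.",  # gold-standard response
        "Seoul is the capital of South Korea.",  # model output
    ]
)
# Cosine similarity between the two sentence embeddings.
print(util.cos_sim(embeddings[0], embeddings[1]).item())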

The evaluators summarizing the metrics above are defined as follows.

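The original evaluator definitions are not shown here. Below is a minimal sketch of how one metric can be wrapped as a custom LangSmith evaluator, assuming the ROUGE scorer defined earlier; the evaluator key and the answer output field are illustrative assumptions.

from langsmith.schemas import Example, Run

def rouge_evaluator(run: Run, example: Example) -> dict:
    # Compare the generated answer against the reference answer with ROUGE-L.
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = scorer.score(reference, prediction)["rougeL"].fmeasure
    return {"key": "ROUGE-L", "score": score}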

Run the evaluation using the heuristic evaluators.

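A minimal sketch of running the evaluation with the LangSmith SDK; the dataset name and experiment prefix are illustrative assumptions.

from langsmith.evaluation import evaluate

experiment_results = evaluate(
    ask_question,  # the target function defined above
    data="RAG_EVAL_DATASET",  # hypothetical dataset name
    evaluators=[rouge_evaluator],
    experiment_prefix="Heuristic-EVAL",
)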

Check the results.
