08. ROUGE, BLEU, METEOR, SemScore-based heuristic evaluation

Heuristic evaluation

Heuristic evaluation is a method of reasoning used to reach a quick, approximate judgment when there is not enough time or information for a fully reasoned one.

(Compared with using an LLM as judge, it also has the advantage of saving time and money.)

(Note) If you need to install or update the libraries, uncomment and run the code below before proceeding.


# !pip install -qU langsmith langchain-teddynote rouge-score


# Configuration file for managing API KEY as environment variable
from dotenv import load_dotenv

# Load API KEY information
load_dotenv()


True


# Set up LangSmith tracking. https://smith.langchain.com
# !pip install -qU langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH16-Evaluations")


Define functions for RAG performance testing

We will create a RAG system to use for testing.

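The original code for this step is not shown here. Below is a minimal sketch of the kind of RAG pipeline used for testing; the sample document, model name, and prompt are illustrative assumptions, and the FAISS vector store requires the faiss-cpu package.

from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Build a small in-memory vector store to act as the retriever.
vectorstore = FAISS.from_texts(
    ["Seoul is the capital of South Korea."],  # illustrative sample document
    OpenAIEmbeddings(),
)
retriever = vectorstore.as_retriever()

# A chain that answers questions using only the retrieved context.
prompt = ChatPromptTemplate.from_template(
    "Answer the question using only the following context.\n\n"
    "Context:\n{context}\n\nQuestion: {question}"
)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
chain = prompt | llm | StrOutputParser()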

Create a function named ask_question. It receives a dictionary called inputs and returns a dictionary called answer.

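A minimal sketch of this function, assuming the retriever and chain built above:

def ask_question(inputs: dict) -> dict:
    # Retrieve context for the question, then generate an answer with the chain.
    context = retriever.invoke(inputs["question"])
    answer = chain.invoke({"context": context, "question": inputs["question"]})
    return {"answer": answer}

# Example call
ask_question({"question": "What is the capital of South Korea?"})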

Using a Korean morphological analyzer

A Korean morphological analyzer is a tool that splits Korean sentences into morphemes, the smallest units of meaning, and identifies the part of speech of each morpheme.

Main features of a morphological analyzer

  • Splits sentences into morpheme units

  • Tags each morpheme with its part of speech

  • Extracts the base form of each morpheme

You can use a Korean morphological analyzer through the kiwipiepy library.

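A minimal sketch of morphological analysis with kiwipiepy (the sample sentence is an illustrative assumption):

from kiwipiepy import Kiwi

kiwi = Kiwi()

# Split a sentence into morphemes; each token carries a surface form and a POS tag.
for token in kiwi.tokenize("안녕하세요. 형태소 분석기를 사용해 봅니다."):
    print(token.form, token.tag)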

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score

  • An evaluation metric used to assess the quality of automatic summarization and machine translation.

  • Measures how many of the important keywords in the reference text appear in the generated text.

  • Calculated based on n-gram overlap

ROUGE-1

  • Measures similarity at the word (unigram) level.

  • Evaluates individual word matches between the two sentences.

ROUGE-2

  • Measures similarity of pairs of consecutive words (bigrams).

  • Evaluates matches of two consecutive words between the two sentences.

ROUGE-L

  • Measures similarity based on the Longest Common Subsequence (LCS).

  • Considers word order at the sentence level without requiring contiguous matches.

  • Allows more flexible evaluation and naturally reflects similarity in sentence structure.

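A minimal sketch of computing ROUGE with the rouge-score package installed above. Passing a morpheme-level tokenizer makes the n-gram comparison meaningful for Korean; the KiwiTokenizer wrapper and the sample sentences are illustrative assumptions.

from rouge_score import rouge_scorer
from kiwipiepy import Kiwi

class KiwiTokenizer:
    # rouge-score accepts any object exposing a tokenize() method
    # that returns a list of tokens.
    def __init__(self):
        self.kiwi = Kiwi()

    def tokenize(self, text):
        return [t.form for t in self.kiwi.tokenize(text)]

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL"], tokenizer=KiwiTokenizer()
)
scores = scorer.score(
    "대한민국의 수도는 서울입니다.",  # reference
    "서울은 대한민국의 수도입니다.",  # prediction
)
print(scores["rougeL"].fmeasure)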

BLEU (Bilingual Evaluation Understudy) score

Mainly used for machine translation evaluation. Measures how similar the generated text is to the reference text.

Calculated based on n-gram precision

Calculation method

  • N-gram precision: for each n from 1 to 4, calculate the fraction of n-grams in the machine translation that also appear in the reference translation.

  • Brevity penalty: a penalty is applied if the machine translation is shorter than the reference translation.

  • Final score: the geometric mean of the n-gram precisions is multiplied by the brevity penalty to yield the final BLEU score.

Limitations

  • Checks only surface string matches without considering meaning.

  • Does not distinguish the importance of individual words.

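A minimal sketch of computing BLEU with NLTK's implementation on morpheme-level tokens; the sample sentences are illustrative assumptions, and smoothing is applied because short sentences often have no matching 3-grams or 4-grams.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from kiwipiepy import Kiwi

kiwi = Kiwi()

def tokenize(text):
    # Morpheme-level tokens work better than whitespace tokens for Korean.
    return [t.form for t in kiwi.tokenize(text)]

reference = tokenize("대한민국의 수도는 서울입니다.")
candidate = tokenize("서울은 대한민국의 수도입니다.")

# sentence_bleu expects a list of reference token lists and one candidate token list.
score = sentence_bleu(
    [reference], candidate, smoothing_function=SmoothingFunction().method1
)
print(score)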

METEOR score

An evaluation metric developed to assess the quality of machine translation.

  • It was developed to compensate for the shortcomings of BLEU.

  • In addition to simple word matching, it takes into account linguistic factors such as stemming, synonym matching, and paraphrasing.

  • Takes word order into account in the evaluation.

  • Supports multiple reference translations.

  • Produces a score between 0 and 1; the closer to 1, the better the translation.

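A minimal sketch using NLTK's METEOR implementation. NLTK's stemming and WordNet synonym matching are English-language resources, so the sample sentences here are in English and are illustrative assumptions.

import nltk
from nltk.translate.meteor_score import meteor_score

# WordNet data is required for METEOR's synonym matching.
nltk.download("wordnet", quiet=True)

reference = "the cat sat on the mat".split()
hypothesis = "a cat was sitting on the mat".split()

# meteor_score expects pre-tokenized references and hypothesis (NLTK >= 3.6.6).
print(meteor_score([reference], hypothesis))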

SemScore

The SemScore paper proposes a simple but effective evaluation metric called SEMSCORE, which compares model output directly with gold-standard responses using semantic textual similarity (STS). A comparative evaluation of the outputs of 12 major instruction-tuned LLMs against 8 widely used text generation evaluation metrics showed that SEMSCORE correlated with human evaluation better than all the other metrics.

A SentenceTransformer model is used to generate sentence embeddings, and the cosine similarity between the two sentences is computed. The model used in the paper is all-mpnet-base-v2.

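A minimal sketch of SemScore-style scoring with the sentence-transformers library, using the all-mpnet-base-v2 model from the paper; the sample sentences are illustrative assumptions.

from sentence_transformers import SentenceTransformer, util

# The model used in the SemScore paper.
model = SentenceTransformer("all-mpnet-base-v2")

embeddings = model.encode(
    [
        "The capital of South Korea is Seoul.",  # gold-standard response
        "Seoul is the capital of South Korea.",  # model output
    ]
)
# Cosine similarity between the two sentence embeddings.
print(util.cos_sim(embeddings[0], embeddings[1]).item())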

The evaluators summarizing the metrics above are defined as follows.

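The original evaluator definitions are not shown here. Below is a minimal sketch of how one metric can be wrapped as a custom LangSmith evaluator, assuming the ROUGE scorer defined earlier; the evaluator key and the answer output field are illustrative assumptions.

from langsmith.schemas import Example, Run

def rouge_evaluator(run: Run, example: Example) -> dict:
    # Compare the generated answer against the reference answer with ROUGE-L.
    prediction = run.outputs["answer"]
    reference = example.outputs["answer"]
    score = scorer.score(reference, prediction)["rougeL"].fmeasure
    return {"key": "ROUGE-L", "score": score}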

Run the evaluation using the heuristic evaluators.

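A minimal sketch of running the evaluation with the LangSmith SDK; the dataset name and experiment prefix are illustrative assumptions.

from langsmith.evaluation import evaluate

experiment_results = evaluate(
    ask_question,  # the target function defined above
    data="RAG_EVAL_DATASET",  # hypothetical dataset name
    evaluators=[rouge_evaluator],
    experiment_prefix="Heuristic-EVAL",
)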

Check the results.
