08. ROUGE, BLEU, METEOR, SemScore-based heuristic evaluation
Heuristic evaluation
Heuristic evaluation is a method of judgment that can be applied quickly and easily when a fully reasoned judgment is not possible due to limited time or information.
(Compared with using an LLM as a judge, it also saves time and money.)
(Note) Uncomment the code below to proceed after updating the library.
# !pip install -qU langsmith langchain-teddynote rouge-score
# Configuration file for managing API KEY as environment variable
from dotenv import load_dotenv
# Load API KEY information
load_dotenv()
True
# Set up LangSmith tracking. https://smith.langchain.com
# !pip install -qU langchain-teddynote
from langchain_teddynote import logging
# Enter a project name.
logging.langsmith("CH16-Evaluations")
Define functions for RAG performance testing
We will create a RAG system to use for testing.
Create a function named ask_question. It receives a dictionary as inputs and returns a dictionary as answer.
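The function described above might be sketched as follows; note that rag_chain_invoke is a hypothetical stand-in for the RAG chain built earlier in the tutorial, not part of any library.

```python
# Hedged sketch: `rag_chain_invoke` is a hypothetical placeholder for the
# actual RAG chain's invoke() call built earlier in the tutorial.
def rag_chain_invoke(question: str) -> str:
    # Placeholder for retrieval + generation
    return f"Answer to: {question}"


def ask_question(inputs: dict) -> dict:
    # Receive a dictionary as `inputs`, return a dictionary as `answer`
    return {"answer": rag_chain_invoke(inputs["question"])}
```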
Using a Korean morphological analyzer
A Korean morphological analyzer is a tool that splits Korean sentences into morphemes, the smallest units of meaning, and determines the part of speech of each morpheme.
Main features of a morphological analyzer - Splits sentences into morpheme units - Tags each morpheme with its part of speech - Extracts the base form of each morpheme. You can use a Korean morphological analyzer via the Kiwipiepy library.
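As a minimal sketch, morpheme tokenization with Kiwipiepy can look like this; the whitespace fallback is only there so the snippet runs even without the library installed.

```python
# Hedged sketch using the Kiwipiepy library (pip install kiwipiepy).
# Falls back to plain whitespace splitting if the library is unavailable.
try:
    from kiwipiepy import Kiwi

    _kiwi = Kiwi()

    def tokenize(text: str) -> list:
        # Return the surface form of each morpheme in the sentence
        return [token.form for token in _kiwi.tokenize(text)]

except ImportError:
    def tokenize(text: str) -> list:
        # Fallback: simple whitespace tokenization (no morpheme analysis)
        return text.split()
```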
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score
A metric used to evaluate the quality of automatic summarization and machine translation.
Measures how many of the important keywords in the reference text appear in the generated text.
Calculated based on n-gram overlap.
ROUGE-1 - Measures similarity at the individual word (unigram) level. - Evaluates word-by-word matches between the two sentences.
ROUGE-2 - Measures similarity at the level of two consecutive words (bigrams). - Evaluates matches of consecutive word pairs between the two sentences.
ROUGE-L - Measures similarity based on the Longest Common Subsequence (LCS). - Considers word order at the sentence level without requiring consecutive matches. - Allows more flexible evaluation and naturally reflects similarity in sentence structure.
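To make the n-gram overlap concrete, here is a minimal pure-Python ROUGE-N (F1) sketch over whitespace tokens; a real evaluation would use the rouge-score package installed above.

```python
# Minimal illustration of ROUGE-N as F1 over n-gram overlap.
# Whitespace tokenization only; real implementations handle stemming etc.
from collections import Counter


def rouge_n(reference: str, candidate: str, n: int = 1) -> float:
    def ngrams(text):
        toks = text.split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    ref, cand = ngrams(reference), ngrams(candidate)
    if not ref or not cand:
        return 0.0
    # Clipped overlap: count each n-gram at most as often as it appears in both
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    recall = overlap / sum(ref.values())
    precision = overlap / sum(cand.values())
    return 2 * precision * recall / (precision + recall)
```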
BLEU (Bilingual Evaluation Understudy) score
Mainly used for machine translation evaluation. Measures how similar the generated text is to the reference text.
Calculated based on n-gram precision.
Calculation method - N-gram precision: compute how many of the n-grams (from 1-gram to 4-gram) in the machine translation output also appear in the reference translation. - Brevity Penalty: a penalty is applied if the machine translation is shorter than the reference translation. - Final score: multiply the geometric mean of the n-gram precisions by the brevity penalty to obtain the final BLEU score.
Limitations - Checks only literal string matches without considering meaning. - Does not distinguish the importance of individual words.
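The calculation steps above can be sketched in pure Python. This simplified version adds add-one smoothing (an assumption, not part of the original BLEU definition) so that short sentences with missing higher-order n-grams do not collapse the geometric mean to zero.

```python
# Simplified sentence-level BLEU sketch: n-gram precisions (1- to 4-gram),
# geometric mean, and brevity penalty. Add-one smoothing is an assumption.
import math
from collections import Counter


def bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    ref_toks, cand_toks = reference.split(), candidate.split()
    precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref_toks[i:i + n])
                             for i in range(len(ref_toks) - n + 1))
        cand_ngrams = Counter(tuple(cand_toks[i:i + n])
                              for i in range(len(cand_toks) - n + 1))
        overlap = sum((ref_ngrams & cand_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference
    if len(cand_toks) > len(ref_toks):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_toks) / max(len(cand_toks), 1))
    return bp * geo_mean
```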
METEOR score
A metric developed to evaluate the quality of machine translation.
It was developed to address the shortcomings of BLEU.
Beyond simple word matching, it takes into account various linguistic factors such as stemming, synonym matching, and paraphrase matching.
It also considers word order in the evaluation.
Multiple reference translations can be used.
Produces a score between 0 and 1; the closer to 1, the better the translation.
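A simplified, exact-match-only sketch of the METEOR idea follows: a recall-weighted F-mean reduced by a fragmentation penalty for out-of-order matches. The real metric additionally matches stems, synonyms, and paraphrases.

```python
# Exact-match-only sketch of METEOR's core formula. The constants
# (recall weight 9, penalty 0.5 * (chunks/matches)^3) follow the original
# METEOR paper; stem/synonym/paraphrase matching is omitted.
def meteor_like(reference: str, candidate: str) -> float:
    ref, cand = reference.split(), candidate.split()
    # Greedy one-to-one alignment of exact matches, in candidate order
    used = [False] * len(ref)
    alignment = []
    for ci, tok in enumerate(cand):
        for ri, rtok in enumerate(ref):
            if not used[ri] and tok == rtok:
                used[ri] = True
                alignment.append((ci, ri))
                break
    m = len(alignment)
    if m == 0:
        return 0.0
    precision, recall = m / len(cand), m / len(ref)
    # F-mean weighted heavily toward recall
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Chunks: maximal runs contiguous in both candidate and reference
    chunks = 1
    for (c1, r1), (c2, r2) in zip(alignment, alignment[1:]):
        if c2 != c1 + 1 or r2 != r1 + 1:
            chunks += 1
    penalty = 0.5 * (chunks / m) ** 3
    return f_mean * (1 - penalty)
```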
SemScore
The SemScore paper proposes a simple but effective evaluation metric called SEMSCORE, which compares model output directly with gold-standard responses using semantic textual similarity (STS). In a comparative evaluation of the outputs of 12 major instruction-tuned LLMs against 8 widely used text generation evaluation metrics, the proposed SEMSCORE metric outperformed all other metrics in terms of correlation with human evaluation.
A SentenceTransformer model is used to generate sentence embeddings, and the cosine similarity between the two embeddings is calculated. - The model used in the paper is all-mpnet-base-v2.
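SemScore ultimately reduces to a cosine similarity between two embedding vectors. A minimal sketch follows; the commented lines show how the embeddings would come from all-mpnet-base-v2 via the sentence-transformers library.

```python
# Cosine similarity between two embedding vectors, the core of SemScore.
import math


def cosine_similarity(u, v) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# In practice the vectors come from a sentence-embedding model, e.g.:
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-mpnet-base-v2")
#   u, v = model.encode([generated_text, reference_text])
```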
The evaluators summarized above are defined as follows:
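As a hedged sketch of what such an evaluator can look like, here is a ROUGE-1 metric wrapped as a callable returning the {"key", "score"} dictionary shape that LangSmith custom evaluators use; the run/example field names are assumptions, and getattr is used only so the snippet can be exercised with plain dicts.

```python
# Hedged sketch: a heuristic evaluator in the shape LangSmith expects --
# a callable taking a run and an example and returning {"key", "score"}.
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    # Unigram-overlap F1 over whitespace tokens
    ref, cand = Counter(reference.split()), Counter(candidate.split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(cand.values()), overlap / sum(ref.values())
    return 2 * p * r / (p + r)


def rouge_evaluator(run, example) -> dict:
    # Accept either objects with an `.outputs` attribute (Run/Example)
    # or plain dicts, for easy local testing. The "answer" key is an
    # assumption matching the ask_question function defined earlier.
    run_out = getattr(run, "outputs", run)
    ref_out = getattr(example, "outputs", example)
    return {"key": "rouge1", "score": rouge1_f1(ref_out["answer"], run_out["answer"])}
```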
Run the evaluation using the heuristic evaluators.
Check the results.
