02. Evaluation using RAGAS

Uncomment the line below and run it to install the required packages, then continue.


# !pip install -qU faiss-cpu ragas


# Configuration file for managing API KEY as environment variable
from dotenv import load_dotenv

# Load API KEY information
load_dotenv()


True


# Set up LangSmith tracking. https://smith.langchain.com
# !pip install -qU langchain-teddynote
from langchain_teddynote import logging

# Enter a project name.
logging.langsmith("CH16-Evaluations")


Load from a saved CSV file

  • Load the data/ragas_synthetic_dataset.csv file.


The answers generated by the LLM are stored in the 'answer' column.


Call batch() to get answers for a batch dataset.


  • Batch: https://wikidocs.net/233345

Create a batch dataset. Batch datasets are useful for processing a large number of questions at once.


Answer Evaluation

Context Recall

Context Recall measures how well the retrieved context aligns with the ground truth answer.

It is computed using the question, ground truth, and retrieved context; the value ranges from 0 to 1, with higher values indicating better performance.

To estimate context recall from a ground truth answer, each claim in the ground truth answer is analyzed to see if it can be attributed to the retrieved context. In an ideal scenario, all claims in the ground truth answer should be attributed to the retrieved context.
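The procedure above corresponds to the following ratio (per the RAGAS documentation):

```latex
\text{Context Recall} = \frac{|\text{GT claims that can be attributed to the retrieved context}|}{|\text{Total number of claims in GT}|}
```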

Context Precision

Context Precision is a metric that evaluates whether the ground-truth related items in the contexts are ranked high. Ideally, all related chunks should appear at the top. This metric is calculated using question, ground_truth, and contexts, and has a value between 0 and 1. A higher score indicates better precision.

The calculation formula for Context Precision@K is as follows:
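In the notation of the RAGAS documentation, where $v_k \in \{0, 1\}$ is the relevance indicator at rank $k$:

```latex
\text{Context Precision@K} = \frac{\sum_{k=1}^{K} \big(\text{Precision@}k \times v_k\big)}{\text{Total number of relevant items in the top } K \text{ results}}
\qquad
\text{Precision@}k = \frac{\text{true positives@}k}{\text{true positives@}k + \text{false positives@}k}
```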

Answer Relevancy

Answer Relevancy is a metric that evaluates how appropriate the generated answer is to the given prompt. The main characteristics and calculation method of this metric are summarized as follows:

  1. Purpose: To evaluate the relevance of the generated answers.

  2. Score interpretation: A lower score indicates an answer that contains incomplete or redundant information, while a higher score indicates better relevance.

  3. Elements used in calculations: question, context, answer

How to calculate Answer Relevancy:

  • Defined as the average cosine similarity between the original question and the artificial questions generated based on the answer.

  • Formula:
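Written out, with $E_{g_i}$ the embedding of the $i$-th generated question, $E_o$ the embedding of the original question, and $N$ the number of generated questions (3 by default):

```latex
\text{Answer Relevancy} = \frac{1}{N}\sum_{i=1}^{N} \cos(E_{g_i}, E_o)
= \frac{1}{N}\sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\lVert E_{g_i}\rVert \, \lVert E_o\rVert}
```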

Faithfulness

Faithfulness is a metric that measures the factual consistency of the generated answer relative to the given context. Its main features are:

  1. Purpose: To evaluate the factual consistency of answers against their context.

  2. Calculation elements: uses the answer and the retrieved context.

  3. Score range: Scaled from 0 to 1, with higher being better.

How to calculate the Faithfulness score:

Faithfulness = (Number of claims in the answer that can be inferred from the given context) / (Total number of claims in the answer)

This metric is useful for assessing how faithful generated answers are to the given context, and is particularly important for measuring the accuracy and reliability of question-answering systems.

Example:

  • Question: "Where and when was Einstein born?"

  • Context: "Albert Einstein (born March 14, 1879) is a German theoretical physicist, considered one of the greatest and most influential scientists in history."

  • High-faithfulness answer: "Einstein was born on March 14, 1879 in Germany."

  • Low-faithfulness answer: "Einstein was born on March 20, 1879 in Germany."

Computation process:

  1. Identify the claims in the generated answer.

  2. Check each claim against the given context to see whether it can be inferred from that context.

  3. Calculate the score using the formula above.
