Uncomment the line below and run it to install the required packages before proceeding.
# !pip install -qU faiss-cpu ragas
# Manage API keys as environment variables using a .env configuration file
from dotenv import load_dotenv

# Load the API key information
load_dotenv()
True
# Set up LangSmith tracking. https://smith.langchain.com
# !pip install -qU langchain-teddynote
from langchain_teddynote import logging
# Enter a project name.
logging.langsmith("CH16-Evaluations")
Load from a saved CSV file

Load the data/ragas_synthetic_dataset.csv file.
import pandas as pd

df = pd.read_csv("data/ragas_synthetic_dataset.csv")
df.head()
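The steps that follow treat test_dataset as a Hugging Face datasets.Dataset (they call column_names, remove_columns, and add_column), but the conversion step is not shown on this page. A minimal sketch, assuming the contexts column was serialized as a stringified list when the CSV was saved:

import ast
from datasets import Dataset

# Convert the pandas DataFrame into a Hugging Face Dataset.
test_dataset = Dataset.from_pandas(df)

# The 'contexts' column is stored in the CSV as a stringified list,
# so parse it back into a real Python list for each example.
def convert_to_list(example):
    return {"contexts": ast.literal_eval(example["contexts"])}

test_dataset = test_dataset.map(convert_to_list)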
Next, create the RAG chain that will generate the answers to be evaluated.

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyMuPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import PromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Step 1: Load Documents
loader = PyMuPDFLoader("data/SPRI_AI_Brief_2023년12월호_F.pdf")
docs = loader.load()

# Step 2: Split Documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
split_documents = text_splitter.split_documents(docs)

# Step 3: Create Embeddings
embeddings = OpenAIEmbeddings()

# Step 4: Create DB and save
# Create a vector store.
vectorstore = FAISS.from_documents(documents=split_documents, embedding=embeddings)

# Step 5: Create a Retriever
# Searches the information contained in the documents.
retriever = vectorstore.as_retriever()

# Step 6: Create a Prompt
prompt = PromptTemplate.from_template(
    """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
#Context:
{context}
#Question:
{question}
#Answer:"""
)

# Step 7: Create a language model (LLM)
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)

# Step 8: Create a Chain
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

Batch: https://wikidocs.net/233345

Create a batch dataset. Batch datasets are useful for processing a large number of questions at once.

batch_dataset = [question for question in test_dataset["question"]]
batch_dataset[:3]

['What are the key considerations regarding AI foundation models in the context of EU regulations?', "What are the key features and improvements of the LLM 'Tongyi Qianwen 2.0' compared to its previous version, including any changes in performance metrics, architectural design, and application areas?", 'What is the purpose of the AAAI Conference?']

Call batch() to get answers for the batch dataset.

answer = chain.batch(batch_dataset)
answer[:3]

['The key considerations regarding AI foundation models in the context of EU regulations include:\n\n1. **Regulatory Approach**: There is a significant debate on whether to impose regulations on AI foundation models. Some countries, like France, Germany, and Italy, oppose broad regulations on foundation models and instead propose a "mandatory self-regulation" approach. This involves the development of voluntary codes of conduct that companies must adhere to, rather than imposing direct regulations on the models themselves.\n\n2. **Model Cards**: These countries suggest that companies developing foundation models should create "model cards" that summarize the machine learning technology, the model\'s capabilities, and its limitations. These model cards would help AI supervisory bodies assess compliance with the voluntary codes of conduct.\n\n3. **Impact Assessment and Sanctions**: The proposed approach includes a mechanism where AI supervisory bodies would evaluate compliance based on the model cards. If violations are found, immediate sanctions would not be imposed. Instead, there would be an analysis of the violation and its impact before any sanctions are applied.\n\n4. **Industry Influence**: The opposition to strict regulations is partly driven by concerns from AI companies in these countries. For instance, French AI company Mistral and German AI company Aleph Alpha have lobbied against stringent regulations, fearing that such measures could put them at a competitive disadvantage compared to companies in the US and China.\n\n5. **Risk-Based Regulation**: The proposed self-regulation approach aligns with the principle of risk-based AI regulation, which aims to be technology-neutral and focus on the specific risks associated with different AI applications rather than imposing blanket regulations on all AI technologies.\n\nThese considerations reflect the ongoing negotiations and differing viewpoints within the EU on how best to regulate AI foundation models while balancing innovation and risk management.', "The key features and improvements of the LLM 'Tongyi Qianwen 2.0' compared to its previous version include:\n\n1. **Performance Enhancements**:\n - **Improved Capabilities**: Tongyi Qianwen 2.0 has enhanced performance in understanding complex instructions, writing advertising copy, reasoning, and memorization compared to the 1.0 version.\n - **Benchmark Tests**: It outperforms major AI models like Llama-2-70B and GPT-3.5 in various benchmark tests, including language understanding (MMLU), mathematics (GSM8k), and question answering (ARC-C).\n\n2. **Architectural and Functional Upgrades**:\n - **All-in-One AI Model Building Platform**: Alibaba Cloud has introduced an all-in-one AI model building platform called 'GenAI' to simplify the model development and application building process. This platform provides comprehensive tools for data management, model deployment, evaluation, and rapid engineering.\n\n3. **Application Areas**:\n - **Industry-Specific Models**: Alibaba Cloud has released industry-specific generative AI models to improve business outcomes across various sectors, including customer support, legal consultation, healthcare, finance, document management, audio and video management, code development, and character creation.\n\n4. **Accessibility**:\n - **Public Availability**: Tongyi Qianwen 2.0 is available to the public through Alibaba Cloud's website and mobile app, and developers can access it via API.\n\n5. **Future Plans**:\n - **Open Source Initiative**: Alibaba Cloud plans to open-source a model with 720 billion parameters by the end of the year to further support AI development.\n\nThese improvements and features make Tongyi Qianwen 2.0 a more powerful and versatile LLM compared to its predecessor.", 'The purpose of the AAAI Conference on Artificial Intelligence is to promote AI research and provide opportunities for interaction among researchers, practitioners, scientists, students, and engineers in the AI field. The conference includes presentations on AI-related technologies, special tracks, invited speakers, workshops, tutorials, poster sessions, topic presentations, competitions, and exhibition programs.']

Store the answers generated by the LLM in the 'answer' column.

# Overwrite or add the 'answer' column
if "answer" in test_dataset.column_names:
    test_dataset = test_dataset.remove_columns(["answer"]).add_column("answer", answer)
else:
    test_dataset = test_dataset.add_column("answer", answer)
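If you run into rate limits while batching, note that batch() also accepts a config dict; for example, max_concurrency caps the number of concurrent requests. A brief illustration (the value 10 is arbitrary):

# Cap the number of concurrent LLM calls during batching (illustrative value).
answer = chain.batch(batch_dataset, config={"max_concurrency": 10})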
Answer Evaluation
Context Recall
Context Recall measures the extent to which the retrieved context aligns with the ground-truth answer.
It is computed using the question, ground truth, and retrieved context, and the value ranges from 0 to 1, with higher values indicating better performance.
To estimate context recall from a ground truth answer, each claim in the ground truth answer is analyzed to see if it can be attributed to the retrieved context. In an ideal scenario, all claims in the ground truth answer should be attributed to the retrieved context.
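As a rough illustration of the idea (RAGAS itself uses an LLM to judge whether each ground-truth claim is supported), the score reduces to a simple ratio:

# Illustrative only, not the RAGAS implementation: suppose an LLM has judged
# whether each claim in the ground-truth answer is attributable to the context.
claim_supported = [True, True, False]

context_recall = sum(claim_supported) / len(claim_supported)
print(context_recall)  # 2 of 3 claims supported -> ~0.667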
Context Precision
Context Precision is a metric that evaluates whether the ground-truth related items in the contexts are ranked high. Ideally, all related chunks should appear at the top. This metric is calculated using question, ground_truth, and contexts, and has a value between 0 and 1. A higher score indicates better precision.
The calculation formula for Context Precision@K is as follows:

Context Precision@K = ( Σ (Precision@k × v_k) for k = 1..K ) / (total number of relevant items in the top K results)

where Precision@k = true positives@k / (true positives@k + false positives@k), and v_k ∈ {0, 1} is the relevance indicator at rank k.
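A minimal sketch of this formula (illustrative only, not the RAGAS implementation), given 0/1 relevance judgments for the top-K retrieved chunks:

# Compute Context Precision@K from a list of 0/1 relevance judgments.
def context_precision_at_k(relevance: list) -> float:
    """relevance[i] = 1 if the chunk at rank i+1 is relevant, else 0."""
    if sum(relevance) == 0:
        return 0.0
    score = 0.0
    hits = 0
    for k, v in enumerate(relevance, start=1):
        if v:
            hits += 1
            score += hits / k  # Precision@k × v_k (v_k = 1 here)
    return score / sum(relevance)

# Relevant chunks at ranks 1 and 3 -> (1/1 + 2/3) / 2 ≈ 0.833
print(context_precision_at_k([1, 0, 1]))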
Answer Relevancy
Answer Relevancy is a metric that evaluates how appropriate the generated answer is to the given prompt. The main characteristics and calculation method of this metric are summarized as follows:
Purpose: To evaluate the relevance of the generated answers.
Score interpretation: A lower score indicates an answer that contains incomplete or redundant information, while a higher score indicates better relevance.
Elements used in calculations: question, context, answer
How Answer Relevancy is calculated:
- It is defined as the average cosine similarity between the original question and artificial questions generated based on the answer.
- Formula:

Answer Relevancy = (1/N) × Σ cos(E_gi, E_o) for i = 1..N

where E_gi is the embedding of the i-th generated question, E_o is the embedding of the original question, and N is the number of generated questions.
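A small sketch of the formula (illustrative only: RAGAS uses an LLM to generate the artificial questions and an embedding model for the vectors; the numbers below are dummy values):

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

e_original = np.array([0.9, 0.1, 0.3])      # embedding of the original question (dummy)
e_generated = [np.array([0.8, 0.2, 0.3]),   # embeddings of questions generated from the answer (dummy)
               np.array([0.7, 0.1, 0.4])]

answer_relevancy = np.mean([cosine(e_original, e_g) for e_g in e_generated])
print(answer_relevancy)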
Faithfulness
Faithfulness is a metric that measures the factual consistency of the generated answer relative to the given context. Its main features are:
Purpose: To evaluate the factual consistency of answers against their context.
Calculation elements: Uses the answer and the retrieved context.
Score range: Scaled from 0 to 1, with higher being better.
How the Faithfulness score is calculated:

Faithfulness = (number of claims in the answer that can be inferred from the given context) / (total number of claims in the answer)

Computation process:
1. Identify the claims made in the generated answer.
2. Check each claim against the given context to see whether it can be inferred from the context.

This metric is useful for assessing how faithful the generated answers are to the given context, and it is particularly important for measuring the accuracy and reliability of question-answering systems.

Example:
Question: "Where and when was Einstein born?"
Context: "Albert Einstein (born March 14, 1879) is a German theoretical physicist, considered one of the greatest and most influential scientists in history."
High-faithfulness answer: "Einstein was born on March 14, 1879 in Germany."
Low-faithfulness answer: "Einstein was born on March 20, 1879 in Germany."
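With the four metrics defined, the evaluation itself can be run with RAGAS. The sketch below is a minimal example assuming the test_dataset prepared above (with question, answer, contexts, and ground_truth columns); metric import paths and column names may differ across RAGAS versions.

from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# Run all four metrics over the batch-answered dataset.
result = evaluate(
    dataset=test_dataset,
    metrics=[context_precision, context_recall, answer_relevancy, faithfulness],
)

# Inspect per-question scores as a DataFrame.
result_df = result.to_pandas()
result_df.head()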