LLM evaluation
Techniques to evaluate LLMs.
Resources
- https://github.com/tjunlp-lab/awesome-llms-evaluation-papers
- LLM Evaluation Metrics: The Ultimate LLM Evaluation Guide - Confident AI
- Along with the standard loss metric, which only shows that your fine-tuning is working and the LLM is learning something from your data, you can define the following metrics:
- Heuristics (Levenshtein distance, perplexity, BLEU, ROUGE) and similarity scores (e.g., BERTScore) between the predictions and the ground truth (GT), which are similar to classic metrics (see the first sketch after this list).
- LLM-as-a-judge metrics that test for standard issues such as hallucination and moderation, based solely on the user’s input and the predictions (a judge prompt sketch follows this list).
- LLM-as-a-judge metrics that test for the same standard issues, based on the user’s input, the predictions and the GT.
- LLM-as-a-judge metrics that test the RAG pipeline on problems such as recall and precision, based on the user’s input, the predictions, the GT, and the RAG context.
- Custom business metrics that build on the four points above. In our case, we want to check that the writing style and voice are consistent with the user’s input and context and are a good fit for social media and blog posts.
- Usually, heuristic metrics don’t work well for assessing GenAI systems because they measure exact matches between the generated output and the GT. They don’t account for synonyms or for two sentences that share the same idea but use entirely different words.
- Therefore, LLM systems are primarily evaluated with similarity scores and LLM judges.
- All the LLM-as-a-judge metrics are based on well-crafted prompts that check for particular criteria.
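A minimal sketch of the heuristic and similarity metrics from the list above, assuming the Hugging Face `evaluate` package (with its `bleu`, `rouge` and `bertscore` modules) is installed; the Levenshtein-style ratio uses the standard library instead of a dedicated package.

```python
from difflib import SequenceMatcher

import evaluate  # assumes the Hugging Face `evaluate` package is installed

prediction = "The new feature ships next week on our blog."
ground_truth = "We will release the new feature on the blog next week."

# Character-level similarity, a stand-in for a Levenshtein ratio.
lev_ratio = SequenceMatcher(None, prediction, ground_truth).ratio()

# N-gram overlap metrics: exact-match oriented, so they penalize paraphrases.
bleu = evaluate.load("bleu").compute(predictions=[prediction], references=[[ground_truth]])
rouge = evaluate.load("rouge").compute(predictions=[prediction], references=[ground_truth])

# Embedding-based similarity: more tolerant of synonyms and rephrasing.
bertscore = evaluate.load("bertscore").compute(
    predictions=[prediction], references=[ground_truth], lang="en"
)

print(f"Levenshtein-style ratio: {lev_ratio:.2f}")
print(f"BLEU: {bleu['bleu']:.2f}, ROUGE-L: {rouge['rougeL']:.2f}")
print(f"BERTScore F1: {bertscore['f1'][0]:.2f}")
```

On paraphrased pairs like this one, the n-gram scores drop sharply while BERTScore stays high, which is exactly the weakness of exact-match heuristics noted above.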
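And a minimal LLM-as-a-judge sketch for the hallucination-style checks above. The prompt, the 1-5 scale and the `call_judge_llm` callable are illustrative assumptions, not a specific framework’s API; in practice this would be wired to whatever LLM client or evaluation library you use.

```python
import json

# Illustrative judge prompt; real frameworks ship their own, more refined versions.
HALLUCINATION_JUDGE_PROMPT = """\
You are an impartial evaluator. Given a user input, an optional ground-truth
answer and a model prediction, rate how factually grounded the prediction is.

User input:
{user_input}

Ground truth (may be empty):
{ground_truth}

Prediction:
{prediction}

Return ONLY a JSON object: {{"score": <integer 1-5>, "reason": "<one sentence>"}}
where 1 means heavily hallucinated and 5 means fully grounded."""


def judge_hallucination(user_input: str, prediction: str, ground_truth: str = "",
                        call_judge_llm=None) -> dict:
    """Score a prediction with an LLM judge; `call_judge_llm` is any callable
    (hypothetical here) that takes a prompt string and returns the judge model's
    raw text reply."""
    prompt = HALLUCINATION_JUDGE_PROMPT.format(
        user_input=user_input, ground_truth=ground_truth, prediction=prediction
    )
    raw_reply = call_judge_llm(prompt)
    return json.loads(raw_reply)  # e.g. {"score": 4, "reason": "..."}
```

The same pattern covers both the GT-free and the GT-based variants: the only difference is whether the ground-truth field is filled in.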
Benchmarks
- Chat with Open Large Language Models (lmsys.org)
- MMLU Benchmark (Multi-task Language Understanding) | Papers With Code
- Open LLM Leaderboard 2 - a Hugging Face Space by open-llm-leaderboard
- Chatbot Arena
- Some relevant benchmarks for LLMs: the first three could serve as primary benchmarks and the last three as secondary (a sketch for running a few of them locally follows this list).
- MixEval: A dynamic benchmark evaluating LLMs using real-world user queries and benchmarks, achieving a 0.96 model ranking correlation with Chatbot Arena.
- IFEval: Assess the ability to follow detailed, verifiable instructions. For example, a prompt might instruct, "Write an article with more than 800 words" or "Wrap your response in double quotation marks".
- Arena-Hard: Evaluating LLMs using challenging user queries that reflect real-world preferences. It is the successor to MT-Bench and similar to AlpacaEval 2.0, focusing on multi-turn conversations and instruction-following tasks.
- MMLU (Pro/Redux): Testing on diverse subjects, evaluating zero-shot and few-shot settings.
- GSM8K: Diverse grade school math problems for testing multi-step arithmetic reasoning.
- HumanEval: Evaluating code generation models using hand-crafted programming problems and unit tests.
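A hedged sketch of running a few of these benchmarks locally with EleutherAI's lm-evaluation-harness, assuming the `lm_eval` package is installed and that the task names below match the version you have; the model name is only a placeholder.

```python
# Sketch only: assumes `pip install lm-eval` (EleutherAI lm-evaluation-harness)
# and hardware able to host the placeholder model.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                                # Hugging Face backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    tasks=["mmlu", "gsm8k", "ifeval"],                         # benchmarks from the list above
    num_fewshot=5,
    batch_size=8,
)

for task, metrics in results["results"].items():
    print(task, metrics)
```

MixEval, Arena-Hard and Chatbot Arena are run through their own tooling or hosted leaderboards, so they are not part of the call above.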
RAG evaluation
- https://github.com/yhpeter/awesome-rag-evaluation
- Evaluation Metrics for RAG Systems | by Gaurav Nukala | The Deep Hub | Medium
- How to Measure RAG from Accuracy to Relevance?
- LLM & RAG Evaluation Framework: Complete Guide
- When working with RAG, we have an extra dimension to check: the retrieved context.
- Thus, we have four dimensions and have to evaluate the interactions between them (a small record sketch follows this list):
- the user’s input;
- the retrieved context;
- the generated output;
- the expected output (the GT, which we may not always have).
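As a small illustration, one way to carry those four dimensions through an evaluation pipeline is a single record per test case; the field names below are illustrative, not taken from any specific framework.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RAGEvalSample:
    """One RAG evaluation case covering all four dimensions."""
    user_input: str                        # the user's question or instruction
    retrieved_context: List[str]           # chunks returned by the retriever
    generated_output: str                  # what the LLM produced
    expected_output: Optional[str] = None  # the GT, which we may not always have
```

Each metric then reads the subset of fields it needs: retrieval metrics look at `user_input` and `retrieved_context`, generation metrics at all four.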
- We can evaluate RAG in two steps:
- the retrieval step - metrics such as Normalized Discounted Cumulative Gain (NDCG) that check the quality of recommendation and information retrieval systems (see the NDCG sketch after this list)
- NDCG is a ranking quality metric. It compares the actual ranking to an ideal order in which all relevant items sit at the top of the list.
- the generation step - strategies similar to those used for general LLM evaluation, while also considering the context dimension. These are LLM-as-a-judge metrics built on crafted prompts, e.g. Hallucination, ContextRecall and ContextPrecision (a context-precision prompt sketch also follows this list)
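A minimal sketch of the retrieval-step check with scikit-learn's `ndcg_score`, assuming binary relevance labels for the retrieved chunks (graded labels work the same way):

```python
import numpy as np
from sklearn.metrics import ndcg_score

# Ground-truth relevance of the candidate chunks for one query
# (1 = relevant to the question, 0 = not relevant).
true_relevance = np.asarray([[1, 0, 1, 0, 0]])

# Scores the retriever actually assigned to those same chunks.
retriever_scores = np.asarray([[0.9, 0.8, 0.3, 0.2, 0.1]])

# 1.0 would mean every relevant chunk was ranked above every irrelevant one.
print("NDCG@5:", ndcg_score(true_relevance, retriever_scores, k=5))
```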
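For the generation step, a context-precision judge can reuse the judge pattern sketched earlier in this note; the prompt and the chunk-counting scheme below are illustrative, not a specific framework's built-in metric.

```python
# Formatted and sent to the judge model exactly like the hallucination prompt above.
CONTEXT_PRECISION_JUDGE_PROMPT = """\
You are evaluating a RAG system. Given the user input and the retrieved context
chunks, decide for each chunk whether it was actually needed to answer the question.

User input:
{user_input}

Retrieved context chunks:
{retrieved_context}

Expected output (GT):
{expected_output}

Return ONLY a JSON object: {{"relevant_chunks": <int>, "total_chunks": <int>}}.
Context precision = relevant_chunks / total_chunks."""
```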
- Evaluating RAG Performance: A Comprehensive Guide | by Christian Grech | Medium