LLM Ops
Large Language Model Ops (LLMOps) encompasses the practices, techniques and tools used for the operational management of large language models in production environments.
Resources
- Data-centric MLOps and LLMOps | Databricks
- LLMOps: Operationalizing Large Language Models | Databricks
- LLMops 101: A Detailed Insight into Large Language Model Operations | Medium
- LLMs-based systems evaluation: A tremendous pillar of LLMOps | by Wael SAIDENI | Medium
Metrics and evaluation
- Some relevant metrics for LLMs; the first three could serve as primary metrics and the last three as secondary ones.
- MixEval: A dynamic benchmark evaluating LLMs using real-world user queries and benchmarks, achieving a 0.96 model ranking correlation with Chatbot Arena.
- IFEval: Assesses the ability to follow detailed, verifiable instructions. For example, a prompt might instruct, "Write an article with more than 800 words" or "Wrap your response in double quotation marks".
- Arena-Hard: Evaluates LLMs on challenging user queries that reflect real-world preferences. A successor to MT-Bench and similar to AlpacaEval 2.0, focusing on multi-turn conversations and instruction-following tasks.
- MMLU (Pro/Redux): Testing on diverse subjects, evaluating zero-shot and few-shot settings.
- GSM8K: Diverse grade school math problems for testing multi-step arithmetic reasoning.
- HumanEval: Evaluating code generation models using hand-crafted programming problems and unit tests.
- For LLMs, along with the standard loss metric (which only shows that fine-tuning is working and the LLM is learning something from your data), you can define the following metrics:
- Heuristics (Levenshtein, perplexity, BLEU, ROUGE) and similarity scores (e.g., BERTScore) between the predictions and the ground truth (GT), which are similar to classic metrics.
- LLM-as-judges to test against standard issues such as hallucination and moderation, based solely on the user's input and predictions.
- LLM-as-judges to test against standard issues such as hallucination and moderation, based on the user's input, predictions and GT.
- LLM-as-judges to test the RAG pipeline on problems such as recall and precision, based on the user's input, predictions, GT, and the RAG context.
- Implementing custom business metrics that leverage the four metric types above. In our case, we want to check that the writing style and voice are consistent with the user's input and context and fit for social media and blog posts.
- Usually, heuristic metrics don't work well when assessing GenAI systems as they measure exact matches between the generated output and GT. They don't consider synonyms or that two sentences share the same idea but use entirely different words.
- Therefore, LLM systems are primarily evaluated with similarity scores and LLM judges.
- All the LLM-as-a-judge metrics are based on well-crafted prompts that check for particular criteria.
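A minimal sketch of the two families above, assuming a normalized Levenshtein similarity as the heuristic and a hypothetical 1-to-5 faithfulness rubric for the LLM judge (the rubric, the model name and the use of the OpenAI client are illustrative assumptions, not a standard):

```python
# Sketch: heuristic metric vs. LLM-as-a-judge.
# The judge rubric and model name are illustrative assumptions, not a fixed standard.
from openai import OpenAI


def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def levenshtein_similarity(pred: str, gt: str) -> float:
    # Normalized to [0, 1]; exact-match flavored, so paraphrases score poorly.
    return 1 - levenshtein(pred, gt) / max(len(pred), len(gt), 1)


JUDGE_PROMPT = """You are an evaluator. Given the user's input and the model's answer,
rate from 1 to 5 how faithful the answer is to the input (5 = no hallucination).
Reply with only the number.

User input: {user_input}
Model answer: {prediction}"""


def judge_hallucination(user_input: str, prediction: str, model: str = "gpt-4o-mini") -> int:
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            user_input=user_input, prediction=prediction)}],
    )
    return int(response.choices[0].message.content.strip())
```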
RAG evaluation
- LLM & RAG Evaluation Framework: Complete Guide
- When working with RAG, we have an extra dimension that we have to check: the retrieved context.
- Thus, we have 4 dimensions and have to evaluate the interactions between them:
- the user's input;
- the retrieved context;
- the generated output;
- the expected output (the GT, which we may not always have).
- We can evaluate a RAG system in two steps:
- the retrieval step - metrics such as NDCG that check the quality of recommendation and information retrieval systems (see the NDCG sketch after this list)
- the generation step - strategies similar to those used for LLM evaluation, but with the context dimension included: LLM-as-a-judge metrics based on crafted prompts, e.g., Hallucination, ContextRecall and ContextPrecision
- Evaluating RAG Performance: A Comprehensive Guide | by Christian Grech | Medium
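A minimal sketch of the retrieval-step metric mentioned above: NDCG computed from graded relevance labels of the retrieved chunks (producing those labels, e.g. from GT or a judge model, is assumed to happen elsewhere):

```python
# Sketch: NDCG for the retrieval step. Relevance labels for each retrieved
# chunk (graded 0-2 here) are assumed to come from GT or an LLM judge.
import math


def dcg(relevances):
    # DCG = sum(rel_i / log2(rank + 1)) with 1-indexed ranks.
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(relevances, k):
    # relevances: relevance of each retrieved chunk, in retrieval order.
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Example: 5 retrieved chunks for one query, graded 0 (irrelevant) to 2 (highly relevant).
print(ndcg_at_k([2, 0, 1, 2, 0], k=5))  # ≈ 0.89
```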
LLMs in production
- Building generative AI applications with foundation models – Amazon Bedrock – AWS
- https://docs.bentoml.org/en/v1.1.11/quickstarts/deploy-a-transformer-model-with-bentoml.html
- How to deploy Meta Llama models with Azure Machine Learning studio - Azure Machine Learning | Microsoft Learn
Courses
Code
- #CODE GitHub - AgentOps-AI/agentops
- Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks like CrewAI, LangChain, and AutoGen (a minimal usage sketch follows this list).
- AgentOps
- Agent Tracking with AgentOps | AutoGen
- AgentOps, the Best Tool for AutoGen Agent Observability | AutoGen
- #CODE LangSmith - langchain-ai/langsmith-cookbook (github.com)
- #CODE Opik - Open-source end-to-end LLM Development Platform
- Confidently evaluate, test and monitor LLM applications.
- Opik by Comet | Opik Documentation
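A hedged usage sketch for the AgentOps SDK listed above: init() opens a traced session and end_session() closes it, with calls through supported LLM clients recorded in between. These entry points follow the project README at the time of writing and change between SDK versions, so treat them as assumptions to check against the current docs.

```python
# Hedged sketch of AgentOps-style monitoring. The init()/end_session() entry
# points may differ in newer SDK versions; check the current docs.
import agentops
from openai import OpenAI

agentops.init(api_key="YOUR_AGENTOPS_API_KEY")  # starts a traced session

client = OpenAI()  # calls through supported clients are recorded automatically
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Draft a short post about LLMOps."}],
)
print(response.choices[0].message.content)

agentops.end_session("Success")  # closes the session with an end state
```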
Serving LLMs, VLMs
- #CODE vllm-project/vllm - high-throughput and memory-efficient inference and serving engine for LLMs
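A minimal offline-inference sketch with vLLM's Python API; the model id is an arbitrary example of a Hugging Face-hosted checkpoint, not a recommendation:

```python
# Minimal offline-inference sketch with vLLM's Python API.
# The model id is an arbitrary example of a Hugging Face-hosted checkpoint.
from vllm import LLM, SamplingParams

prompts = ["Summarize what LLMOps covers in one sentence."]
sampling_params = SamplingParams(temperature=0.7, max_tokens=128)

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

For serving over HTTP, vLLM also ships an OpenAI-compatible server (e.g. started with `vllm serve <model-id>`), so clients written against the OpenAI SDK can point at it by changing only the base URL.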