ODSC's Ai X Podcast
How to Evaluate Large Language Models and RAG Applications with Pasquale Antonante


Update: 2024-06-19

Description

How to Evaluate Large Language Models and RAG Applications

In this episode, Pasquale Antonante, Co-Founder & CTO of Relari AI, joins us to discuss evaluation methods for LLM and RAG applications. Pasquale has a PhD from MIT, where he focused on the reliability of complex AI systems. At Relari AI, they are building an open-source platform to simulate, test, and validate complex generative AI (GenAI) applications.

During the interview, we’ll discuss Relari AI's approach to improving generative AI and RAG applications, which was inspired by testing methodologies from the autonomous vehicle industry.

We’ll cover topics like the complexity of GenAI workflows, the challenges in evaluating these systems, and various evaluation methods, including reference-free, reference-based, and synthetic-data-based approaches. We’ll also explore metrics like precision, recall, faithfulness, and relevance, and compare GPT auto-evaluators with simulated user feedback.
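
To make the "GPT auto-evaluator" idea concrete, here is a minimal LLM-as-judge sketch. It illustrates the general pattern only, not Relari's implementation; the prompt wording, 1-5 grading scale, and model name are arbitrary illustrative choices.

```python
# Minimal LLM-as-judge sketch: grade whether a generated answer is faithful
# to the retrieved context. Prompt, scale, and model are illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_faithfulness(question: str, context: str, answer: str) -> str:
    prompt = (
        "You are evaluating a RAG system.\n"
        f"Question: {question}\n"
        f"Retrieved context: {context}\n"
        f"Generated answer: {answer}\n"
        "On a scale of 1-5, how faithful is the answer to the context? "
        "Reply with the number followed by a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic grading
    )
    return response.choices[0].message.content

print(judge_faithfulness(
    question="When was BM25 introduced?",
    context="Okapi BM25 was developed in the 1980s and 1990s at City University, London.",
    answer="BM25 dates back to work at City University, London in the 1980s-90s.",
))
```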

Finally, we'll highlight Relari's continuous-eval open-source project and the future of leveraging synthetic data for LLM finetuning.
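
Continuous-eval packages metrics like these as composable Python classes. The sketch below shows a reference-based retrieval metric; the module path, class name, and input keys follow the project's README examples from memory and may differ in the current release, so treat them as assumptions and check the repository.

```python
# Sketch of a reference-based retrieval metric with continuous-eval.
# Class and key names are assumptions based on the project's documented
# examples (https://github.com/relari-ai/continuous-eval) and may vary by version.
from continuous_eval.metrics.retrieval import PrecisionRecallF1

datum = {
    "retrieved_context": [
        "Paris is the capital of France.",
        "Lyon is known for its gastronomy.",
    ],
    "ground_truth_context": ["Paris is the capital of France."],
}

metric = PrecisionRecallF1()
print(metric(**datum))  # e.g. a dict of context precision / recall / F1 scores
```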


Topics

- Guest background and an introduction to the startup Relari AI

- What the LLM industry can learn from the autonomous vehicle space

- What do companies view as the biggest challenge to the adoption of generative AI?

- Why are GenAI application workflows and pipelines so complex?

- Explanation of how Retrieval-Augmented Generation (RAG) works and its benefits over traditional generation models

- The challenges of evaluating these workflows

- Different ways to evaluate LLM pipelines 

- Reference-free, reference-based, and synthetic-data-based evaluation for LLMs and RAG

- Measuring precision, recall, faithfulness, relevance, and correctness in RAG systems (see the code sketch after this list)

- The key metrics used to evaluate RAG pipelines

- Semantic metrics and LLM-based metrics

- GPT auto-evaluators versus the advantages of simulated user feedback evaluators

- The role human evaluation plays in assessing the quality of generated text

- The continuous-eval open-source project and the various metrics contained therein

- Leveraging synthetic data to improve LLM finetuning

- What’s next for Relari?
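
As a concrete illustration of the reference-based retrieval metrics above, the sketch below labels retrieved chunks against a hand-curated ground truth and computes precision, recall, and F1 with scikit-learn (linked in the show notes). The chunk IDs and labels are invented for illustration.

```python
# Reference-based retrieval evaluation sketch: compare the retriever's output
# against hand-labeled relevant chunks and compute precision/recall/F1.
# The chunk IDs below are made up for illustration.
from sklearn.metrics import precision_recall_fscore_support

ground_truth_relevant = {"doc_2", "doc_5", "doc_7"}   # chunks a human marked relevant
retrieved = ["doc_1", "doc_2", "doc_9"]               # chunks the retriever returned

# Build binary labels over the union of retrieved and relevant chunks.
all_chunks = sorted(ground_truth_relevant | set(retrieved))
y_true = [1 if c in ground_truth_relevant else 0 for c in all_chunks]
y_pred = [1 if c in retrieved else 0 for c in all_chunks]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary", zero_division=0
)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```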


Show Notes:

Learn more about Pasquale:

https://www.linkedin.com/in/pasquale-antonante/

https://www.mit.edu/~antonap/

https://scholar.google.com/citations?user=7Vvpd-YAAAAJ&hl=it

Learn more about Relari:

https://www.relari.ai/

https://github.com/relari-ai/continuous-eval

Task-Aware Risk Estimation of Perception Failures for Autonomous Vehicles

https://arxiv.org/abs/2305.01870

BM25

https://en.wikipedia.org/wiki/Okapi_BM25

Precision, Recall, F1 score

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html

A Practical Guide to RAG Pipeline Evaluation (Part 1: Retrieval)

https://blog.relari.ai/a-practical-guide-to-rag-pipeline-evaluation-part-1-27a472b09893

A Practical Guide to RAG Pipeline Evaluation (Part 2: Generation)

https://blog.relari.ai/a-practical-guide-to-rag-evaluation-part-2-generation-c79b1bde0f5d

How important is a Golden Dataset for LLM evaluation?

https://blog.relari.ai/how-important-is-a-golden-dataset-for-llm-pipeline-evaluation-4ef6deb14dc5

Case Study: Reference-free vs Reference-based evaluation of RAG pipeline

https://blog.relari.ai/case-study-reference-free-vs-reference-based-evaluation-of-rag-pipeline-9a49ef49866c


This episode was sponsored by:  

Ai+ Training https://aiplus.training/ 

Home to hundreds of hours of on-demand, self-paced AI training, ODSC interviews, free webinars, and certifications in in-demand skills like LLMs and Prompt Engineering.

And created in partnership with ODSC https://odsc.com/ 

The Leading AI Training Conference, featuring expert-led, hands-on workshops, training sessions, and talks on cutting-edge AI topics.

Never miss an episode, subscribe now!
