O'Reilly – Evaluating Large Language Models (LLMs) 2025-2

Evaluating Large Language Models (LLMs) is an O'Reilly course that introduces the main methods for evaluating LLMs. Whether you are a data scientist, a machine learning engineer, or an AI enthusiast, it will help you build a deep understanding of how these models are evaluated, covering everything from the fundamentals of evaluation to practical applications.
What you will learn:
- Foundations of LLM evaluation: why evaluation matters, the difference between generative and understanding tasks, and key metrics for common tasks
- Evaluating generative tasks: scoring multiple-choice and free-text responses, and using LLMs to judge one another's output (see the first sketch after this list)
- Evaluating understanding tasks: evaluating embedding and classification tasks, and building an LLM classifier with BERT and GPT (see the second sketch after this list)
- Using benchmarks effectively: the role of benchmarks, a look inside common benchmarks, and evaluating LLMs against them (see the third sketch after this list)
- Probing LLMs for a world model: exploring the knowledge stored in LLMs and probing them to play games
- Evaluating LLM fine-tuning: fine-tuning objectives, metrics for success, and a practical demonstration
- Case studies: evaluating AI agents, retrieval-augmented generation (RAG) systems, recommendation engines, and more
- The future of LLM evaluation: emerging trends in how LLMs are evaluated
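For readers who want a concrete picture of the "LLM as a judge" idea before taking the course, here is a minimal sketch (not taken from the course material) in which one model grades another model's answer against a reference. The judge model, prompt wording, and 1-5 rubric are all illustrative assumptions.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading an answer to a question.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Rate the candidate from 1 (wrong) to 5 (fully correct).
Reply with the number only."""

def judge(question: str, reference: str, candidate: str) -> int:
    # Ask the judge model for a 1-5 score; temperature=0 for repeatable grading.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, candidate=candidate)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

print(judge("What is 2 + 2?", "4", "The answer is 4."))  # expected: 5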
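Likewise, a minimal sketch of scoring an LLM-based classifier, again assumed rather than taken from the course: once the model's predicted labels are collected, standard metrics such as accuracy and F1 apply. The labels below are made-up illustration data.

from sklearn.metrics import accuracy_score, f1_score, classification_report

y_true = ["spam", "ham", "spam", "ham", "spam"]   # gold labels
y_pred = ["spam", "ham", "ham",  "ham", "spam"]   # labels predicted by the LLM

print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
print(classification_report(y_true, y_pred))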
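Finally, a benchmark evaluation in its simplest form is a loop over question-answer pairs with an aggregate score. The sketch below is illustrative only; ask_model is a hypothetical stand-in for a real LLM call, and the two-item "benchmark" is made up.

def ask_model(question: str) -> str:
    # placeholder: replace with a real LLM call
    return {"What is the capital of France?": "Paris",
            "What is 2 + 2?": "4"}.get(question, "")

benchmark = [
    ("What is the capital of France?", "Paris"),
    ("What is 2 + 2?", "4"),
]

# Exact-match accuracy: fraction of questions answered verbatim correctly.
correct = sum(ask_model(q).strip() == a for q, a in benchmark)
print(f"exact-match accuracy: {correct / len(benchmark):.2f}")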
This course is suitable for people who:
- Want a deep understanding of large language models
- Want to learn how to evaluate these models
- Work in machine learning, natural language processing, or artificial intelligence
- Plan to use large language models in a variety of applications
Evaluating Large Language Models (LLMs) Course Details
- Publisher: O'Reilly
- Instructor: Sinan Ozdemir
- Training level: Beginner to advanced
- Training duration: 7 hours and 56 minutes
Course headings
- Introduction
- Evaluating Large Language Models (LLMs): Introduction
- Lesson 1: Foundations of LLM Evaluation
Learning objectives
1.1 Introduction to Evaluation: Why It Matters
1.2 Generative versus Understanding Tasks
1.3 Key Metrics for Common Tasks
- Lesson 2: Evaluating Generative Tasks
Learning objectives
2.1 Evaluating Multiple-Choice Tasks
2.2 Evaluating Free Text Response Tasks
2.3 AIs Supervising AIs: LLM as a Judge
- Lesson 3: Evaluating Understanding Tasks
Learning objectives
3.1 Evaluating Embedding Tasks
3.2 Evaluating Classification Tasks
3.3 Building an LLM Classifier with BERT and GPT
- Lesson 4: Using Benchmarks Effectively
Learning objectives
4.1 The Role of Benchmarks
4.2 Interrogating Common Benchmarks
4.3 Evaluating LLMs with Benchmarks
- Lesson 5: Probing LLMs for a World Model
Learning objectives
5.1 Probing LLMs for Knowledge
5.2 Probing LLMs to Play Games
- Lesson 6: Evaluating LLM Fine-Tuning
Learning objectives
6.1 Fine-Tuning Objectives
6.2 Metrics for Fine-Tuning Success
6.3 Practical Demonstration: Evaluating Fine-Tuning
6.4 Evaluating and Cleaning Data
- Lesson 7: Case Studies
Learning objectives
7.1 Evaluating AI Agents: Task Automation and Tool Integration
7.2 Measuring Retrieval-Augmented Generation (RAG) Systems
7.3 Building and Evaluating a Recommendation Engine Using LLMs
7.4 Using Evaluation to Combat AI Drift
7.5 Time-Series Regression
- Lesson 8: Summary of Evaluation and Looking Ahead
Learning objectives
8.1 When and How to Evaluate
8.2 Looking Ahead: Trends in LLM Evaluation
- Summary
- Evaluating Large Language Models (LLMs): Summary
Installation Guide
After extracting, watch with your preferred video player.
Subtitles: None
Quality: 720p
Download link
File(s) password: www.downloadly.ir
File size: 1.8 GB