180: Reinforcement Learning

Update: 2025-03-17

Description

Intro topic: Grills

News/Links:

You can’t call yourself a senior until you’ve worked on a legacy project
- https://www.infobip.com/developers/blog/seniors-working-on-a-legacy-project
Recraft might be the most powerful AI image platform I’ve ever used — here’s why
- https://www.tomsguide.com/ai/ai-image-video/recraft-might-be-the-most-powerful-ai-image-platform-ive-ever-used-heres-why
NASA has a list of 10 rules for software development
- https://www.cs.otago.ac.nz/cosc345/resources/nasa-10-rules.htm
AMD Radeon RX 9070 XT performance estimates leaked: 42% to 66% faster than Radeon RX 7900 GRE
- https://www.tomshardware.com/tech-industry/amd-estimates-of-radeon-rx-9070-xt-performance-leaked-42-percent-66-percent-faster-than-radeon-rx-7900-gre

Book of the Show

Patrick:
- The Player of Games (Iain M. Banks)
  - https://a.co/d/1ZpUhGl (non-affiliate)
Jason:
- Basic Roleplaying Universal Game Engine
  - https://amzn.to/3ES4p5i

Patreon Plug https://www.patreon.com/programmingthrowdown?ty=h

Tool of the Show

Patrick:
- Pokémon Sword and Shield
Jason:
- Features and Labels ( https://fal.ai )

Topic: Reinforcement Learning

Three types of AI
- Supervised Learning
- Unsupervised Learning
- Reinforcement Learning
Online vs Offline RL
Optimization algorithms
- Value optimization
  - SARSA
  - Q-Learning
- Policy optimization
  - Policy Gradients
  - Actor-Critic
  - Proximal Policy Optimization
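
To make the two value-optimization entries concrete, here is a minimal tabular sketch of how the SARSA and Q-Learning updates differ. The table layout, the function names, and the `alpha` and `gamma` values are illustrative assumptions, not taken from the episode.

```python
# Tabular value updates. q is a list of rows, one row per state,
# one entry per action. alpha (step size) and gamma (discount) are
# illustrative values chosen for this sketch.

def sarsa_update(q, s, a, r, s_next, a_next, alpha=0.5, gamma=0.9):
    """On-policy: bootstrap from the action the agent actually takes next."""
    target = r + gamma * q[s_next][a_next]
    q[s][a] += alpha * (target - q[s][a])


def q_learning_update(q, s, a, r, s_next, alpha=0.5, gamma=0.9):
    """Off-policy: bootstrap from the best-valued action in the next state."""
    target = r + gamma * max(q[s_next])
    q[s][a] += alpha * (target - q[s][a])


def make_table():
    # Two states, two actions; state 1 already has some value estimates.
    return [[0.0, 0.0], [1.0, 3.0]]


q_sarsa = make_table()
sarsa_update(q_sarsa, s=0, a=0, r=1.0, s_next=1, a_next=0)

q_ql = make_table()
q_learning_update(q_ql, s=0, a=0, r=1.0, s_next=1)

print(q_sarsa[0][0], q_ql[0][0])
```

The same transition gives a smaller update under SARSA here because the next action it follows (action 0) is worth less than the best action that Q-Learning assumes.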
Value vs Policy Optimization
- Value optimization is more intuitive (Value loss)
- Policy optimization is less intuitive at first (policy gradients)
- Converting values to policies in deep learning is difficult
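
Since policy gradients are the less intuitive half of this comparison, a small sketch may help. It shows one REINFORCE-style step for a softmax policy over discrete actions. The function names, learning rate, and two-action setup are assumptions made for illustration only.

```python
import math


def softmax(logits):
    """Turn raw action preferences into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, action, reward, lr=0.1):
    """One REINFORCE step: move logits along reward * grad log pi(action).

    For a softmax policy, the gradient of log pi(action) with respect to
    logit i is (1 if i == action else 0) - pi[i].
    """
    probs = softmax(logits)
    return [
        x + lr * reward * ((1.0 if i == action else 0.0) - p)
        for i, (x, p) in enumerate(zip(logits, probs))
    ]


logits = [0.0, 0.0]  # start with a uniform policy over two actions
new_logits = policy_gradient_step(logits, action=1, reward=1.0)
print(softmax(new_logits))
```

There is no value table anywhere in this sketch: the sampled action simply becomes more likely when its reward is positive, which is the core idea behind the policy-optimization family.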
Imitation Learning
- Supervised policy learning
- Often used to bootstrap reinforcement learning
Policy Evaluation
- Propensity scoring versus model-based
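
As a rough illustration of the propensity-scoring side, the sketch below estimates the value of a new policy from logged data using inverse propensity scoring. The log format, the `always_a` policy, and the numbers are hypothetical. A model-based evaluator would instead fit a reward model to the logs and score the new policy against that model.

```python
def ips_value(logged, target_policy):
    """Inverse propensity scoring estimate of a target policy's value.

    logged: list of (context, action, reward, logging_propensity), where
    logging_propensity is the probability the logging policy gave to the
    action it actually took. target_policy(context, action) returns the
    probability the new policy would give to that same action.
    """
    total = 0.0
    for context, action, reward, propensity in logged:
        weight = target_policy(context, action) / propensity
        total += weight * reward
    return total / len(logged)


# Hypothetical logs from a policy that picked "a" or "b" uniformly at random.
logged = [
    ("u1", "a", 1.0, 0.5),
    ("u2", "b", 0.0, 0.5),
    ("u3", "a", 1.0, 0.5),
    ("u4", "b", 0.0, 0.5),
]


def always_a(context, action):
    """Candidate policy to evaluate: always choose action "a"."""
    return 1.0 if action == "a" else 0.0


estimate = ips_value(logged, always_a)
print(estimate)
```

Reweighting by the ratio of probabilities corrects for the fact that the logs were collected by a different policy, at the cost of high variance when propensities are small.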
Challenges in training an RL model
- Two optimization loops
  - Collecting feedback vs updating the model
- Difficult optimization target
  - Policy evaluation
RLHF & GRPO
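
One piece of GRPO that is easy to show in code is the group-relative advantage: several responses are sampled for the same prompt, and each reward is normalized against the group's mean and standard deviation rather than against a learned critic. This is a sketch of that step only. The function name, the example rewards, the `eps` term, and the choice of population standard deviation are assumptions.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four responses sampled from one prompt.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat the group average get a positive advantage and the rest get a negative one, so the policy update needs no separate value network.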

Patrick Wheeler and Jason Gauci