Programming Throwdown
180: Reinforcement Learning

Update: 2025-03-17
Description

Intro topic: Grills

News/Links:

Book of the Show


Patreon Plug https://www.patreon.com/programmingthrowdown?ty=h


Tool of the Show

  • Patrick: 
    • Pokemon Sword and Shield
  • Jason: 

Topic: Reinforcement Learning

  • Three types of machine learning
    • Supervised Learning
    • Unsupervised Learning
    • Reinforcement Learning
  • Online vs Offline RL
  • Optimization algorithms
    • Value optimization
      • SARSA
      • Q-Learning
    • Policy optimization
      • Policy Gradients
      • Actor-Critic
      • Proximal Policy Optimization
  • Value vs Policy Optimization
    • Value optimization is more intuitive (Value loss)
    • Policy optimization is less intuitive at first (policy gradients)
    • Converting values to policies in deep learning is difficult
  • Imitation Learning
    • Supervised policy learning
    • Often used to bootstrap reinforcement learning
  • Policy Evaluation
    • Propensity scoring versus model-based
  • Challenges in training an RL model
    • Two optimization loops
      • Collecting feedback vs updating the model
    • Difficult optimization target
      • Policy evaluation
  • RLHF & GRPO
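
To make the value optimization bullets concrete, here is a minimal tabular sketch of the two update rules named above. It is not taken from the episode. The table layout (a list of per-state action values), the function names, and the step size and discount values are all assumptions chosen for illustration.

```python
# Tabular value updates, sketched under assumed names and hyperparameters.
# q is a list of lists: q[state][action] holds the current value estimate.

def q_learning_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Q-Learning (off-policy): bootstrap from the best action in next_state."""
    target = reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])


def sarsa_update(q, state, action, reward, next_state, next_action, alpha=0.1, gamma=0.9):
    """SARSA (on-policy): bootstrap from the action the agent actually took next."""
    target = reward + gamma * q[next_state][next_action]
    q[state][action] += alpha * (target - q[state][action])


def make_q():
    # Two states, two actions; state 1 already has some learned values.
    return [[0.0, 0.0], [1.0, 2.0]]


q_a = make_q()
q_learning_update(q_a, state=0, action=0, reward=1.0, next_state=1)

q_b = make_q()
sarsa_update(q_b, state=0, action=0, reward=1.0, next_state=1, next_action=0)
```

The only difference is the bootstrap term: Q-Learning uses the maximum over next actions, while SARSA uses the value of the next action that was actually chosen, so the two can disagree whenever the behavior is not greedy.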

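For the "propensity scoring" side of policy evaluation, one common form is inverse propensity scoring: logged rewards are reweighted by how much more (or less) likely the new policy is to take the logged action. The sketch below assumes a hypothetical log format of (state, action, reward, logging probability) tuples and a hypothetical `target_policy` callable; neither comes from the episode.

```python
# Inverse propensity scoring (IPS) for offline policy evaluation.
# The log format and function names are illustrative assumptions.

def ips_estimate(logged, target_policy):
    """Estimate the average reward of target_policy from logged interactions.

    logged: list of (state, action, reward, logging_prob) tuples, where
            logging_prob is the probability the logging policy gave that action.
    target_policy: callable (state, action) -> probability under the new policy.
    """
    total = 0.0
    for state, action, reward, logging_prob in logged:
        weight = target_policy(state, action) / logging_prob
        total += weight * reward
    return total / len(logged)


def always_a(state, action):
    # A deterministic policy that always picks action "a".
    return 1.0 if action == "a" else 0.0


# The logging policy picked "a" or "b" with probability 0.5 each;
# action "a" paid reward 1.0 and action "b" paid 0.0.
logged = [
    ("s", "a", 1.0, 0.5),
    ("s", "b", 0.0, 0.5),
    ("s", "a", 1.0, 0.5),
    ("s", "b", 0.0, 0.5),
]
estimate = ips_estimate(logged, always_a)
```

A model-based evaluator would instead fit a reward model to the log and query it for the actions the new policy would take. IPS avoids that model at the cost of high variance when logging probabilities are small.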

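As a rough illustration of the GRPO bullet: GRPO scores each sampled response relative to the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative normalization step. The function name, the `eps` term, and the use of the population standard deviation are assumptions, and real implementations may differ in these details.

```python
# Group-relative advantages, the normalization step used in GRPO-style training.
# Names and the choice of population standard deviation are assumptions.
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards in the group match.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses sampled for one prompt, scored 1.0 (good) or 0.0 (bad).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that beat their group average get a positive advantage and the rest get a negative one, which then weights the policy gradient update.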

★ Support this podcast on Patreon ★
Patrick Wheeler and Jason Gauci