DPO vs. RLHF

DPO

Andrew Ng's quote 👇

  1. It has been half a year since DPO was proposed (submitted 29 May 2023, v1), and most of the top methods on LLM leaderboards are based on DPO.
  2. From the DPO paper's abstract: "In this paper we introduce a new parameterization of the reward model in RLHF that enables extraction of the corresponding optimal policy in closed form, allowing us to solve the standard RLHF problem with only a simple classification loss." (A sketch of that loss follows the list below.)

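The "simple classification loss" in the quote is the core of DPO: plugging the closed-form optimal policy back into the RLHF objective turns reward learning plus RL into a single logistic loss over preference pairs,

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

A minimal PyTorch sketch of this loss (variable names are my own, not from the paper; each `*_logps` tensor is assumed to hold the summed token log-probability of a whole response under the corresponding model):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO as a binary classification loss over preference pairs.

    Each *_logps tensor holds the summed log-probability of a full
    response (preferred y_w or dispreferred y_l) under the trainable
    policy or the frozen reference model. beta scales the implicit
    KL penalty against the reference.
    """
    # Implicit reward of each response: beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin: -log sigma(r_w - r_l)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage: random log-probs for a batch of 4 preference pairs
logps = [torch.randn(4) for _ in range(4)]
print(dpo_loss(*logps))  # scalar loss
```

Note that no explicit reward model and no RL loop (e.g. PPO) appear anywhere: the policy's own log-probability ratios against the reference act as the implicit reward, which is exactly why the paper can claim to solve the standard RLHF problem with only a classification loss.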