
Direct Preference Optimization instead of RLHF

Can you still do cutting-edge research on LLMs if you do not have massive compute resources?

Reinforcement learning from human feedback (RLHF) became a key algorithm for LLM training thanks to the InstructGPT paper, which adapted the technique to that purpose. A typical implementation works as follows:

  • Get humans to compare pairs of LLM outputs, generated in response to the same prompt, and indicate which one they prefer. For example, humans typically prefer the more helpful, less toxic output.
  • Use the human preferences to learn a reward function. The reward function, typically represented by a transformer network, is trained to give a higher reward (or score) to the outputs that the humans preferred (a sketch of this pairwise loss follows the list).
  • Finally, using the learned reward, run a reinforcement learning algorithm to tune the LLM to (i) maximize the reward of the answers generated, while (ii) not letting the LLM change too much (as a form of regularization).
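
For the second step, the reward model is commonly trained with a pairwise (Bradley-Terry) loss on the human comparisons. Here is a minimal PyTorch-style sketch, assuming the reward model has already mapped each output in a preference pair to a scalar score; the function name and arguments are illustrative, not from any particular library:

```python
import torch
import torch.nn.functional as F

def reward_loss(scores_chosen: torch.Tensor,
                scores_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss for reward-model training.

    The probability that a human prefers the chosen output is modeled
    as sigmoid(score_chosen - score_rejected); we maximize its log
    over the batch of comparisons.
    """
    return -F.logsigmoid(scores_chosen - scores_rejected).mean()
```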

This is a relatively complex algorithm. It needs to separately represent a reward function and an LLM, and the final reinforcement learning step is notoriously sensitive to the choice of hyperparameters.
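
Concretely, goals (i) and (ii) are usually combined into a single KL-regularized objective. In standard notation, with $\pi_\theta$ the LLM being tuned, $\pi_{\mathrm{ref}}$ the model before RL, $r$ the learned reward, and $\beta$ the regularization strength:

$$
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\, \pi_\theta(y \mid x) \,\big\|\, \pi_{\mathrm{ref}}(y \mid x) \,\big]
$$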

DPO (direct preference optimization) dramatically simplifies the whole thing. Rather than needing separate transformer networks to represent a reward function and an LLM, the authors show how, given an LLM, you can derive the reward function (plus regularization term) that the LLM is best at maximizing. This collapses the two transformer networks into one: you now need to train only the LLM and no longer have to deal with a separately trained reward function. The DPO algorithm trains the LLM directly, so as to make the reward function implicitly defined by the LLM consistent with the human preferences. Further, the authors show that DPO is better at achieving RLHF's optimization objective (that is, (i) and (ii) above) than most implementations of RLHF itself.
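
A minimal sketch of the resulting training loss, in PyTorch style: the implicit reward of a response is $\beta$ times the log-ratio of its probability under the policy being trained versus a frozen reference copy, and the loss is a logistic loss on the margin between the preferred and dispreferred responses. Variable names here are illustrative, not taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss. Each argument is a batch of summed per-token log
    probabilities of a full response under either the policy being
    trained or the frozen reference model."""
    # Implicit reward of each response: beta * log(pi_theta / pi_ref).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Logistic loss on the reward margin: push the preferred response's
    # implicit reward above the dispreferred one's.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that no reinforcement learning loop or sampling is involved: training reduces to supervised-style gradient steps directly on the preference data.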
