Self-Rewarding Language Models
Self-Rewarding Language Models (Self-Rewarding LMs) provide their own rewards via an LLM-as-a-Judge mechanism during Iterative DPO training, and the resulting model outperforms systems such as Claude 2 and GPT-4 0613 on the AlpacaEval 2.0 leaderboard.
The goal is to move toward superhuman performance by overcoming the ceiling imposed by fixed, human-crafted reward models, letting the model continuously improve both its instruction-following and its reward-modeling abilities.
The method employs LLM-as-a-Judge prompting so the Self-Rewarding LM assigns rewards to its own candidate responses, and Iterative DPO training so that both instruction-following and reward-modeling abilities improve across iterations: each round's model generates and scores new preference pairs that train the next round's model. Llama 2 70B is fine-tuned over three such iterations; a sketch of one iteration's data step follows below.
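To make the loop concrete, here is a minimal Python sketch of the data-creation step of one self-rewarding iteration. The `generate(prompt, temperature)` callable, the judge rubric wording, the candidate count, and the score-parsing regex are illustrative assumptions, not the paper's exact prompt or hyperparameters.

```python
import re
from typing import Callable, List, Tuple

# Assumed interface: generate(prompt, temperature) -> one sampled completion
# from the current model M_t (e.g. a fine-tuned Llama 2 70B behind any API).
GenerateFn = Callable[[str, float], str]

# Illustrative LLM-as-a-Judge rubric (not the paper's verbatim prompt): the
# model awards up to 5 additive points and ends with "Score: <total>".
JUDGE_TEMPLATE = (
    "Review the user's question and the corresponding response. Award points "
    "additively, up to 5, for relevance, coverage, helpfulness, clarity, and "
    "expert quality. Conclude with the line \"Score: <total points>\".\n\n"
    "User: {prompt}\nResponse: {response}\n"
)

def judge_score(generate: GenerateFn, prompt: str, response: str) -> float:
    """Self-reward: the same model scores its own candidate response."""
    verdict = generate(JUDGE_TEMPLATE.format(prompt=prompt, response=response), 0.0)
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", verdict)
    return float(match.group(1)) if match else 0.0

def build_preference_pairs(
    generate: GenerateFn,
    prompts: List[str],
    num_candidates: int = 4,
) -> List[Tuple[str, str, str]]:
    """Data-creation step of one self-rewarding iteration: sample candidate
    responses, self-judge them, and keep (prompt, chosen, rejected) pairs."""
    pairs: List[Tuple[str, str, str]] = []
    for prompt in prompts:
        candidates = [generate(prompt, 0.7) for _ in range(num_candidates)]
        scored = sorted((judge_score(generate, prompt, r), r) for r in candidates)
        (low, worst), (high, best) = scored[0], scored[-1]
        if high > low:  # ties carry no preference signal, so skip them
            pairs.append((prompt, best, worst))
    return pairs
```

The resulting (prompt, chosen, rejected) pairs would then drive a standard DPO update to produce the next model M_{t+1}, and the loop repeats; the paper runs three such iterations.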
After three rounds of Iterative DPO training, the refined Llama 2 70B outperforms established systems including Claude 2, Gemini Pro, and GPT-4 0613 on AlpacaEval 2.0, showing the potential of self-rewarding mechanisms for continual, autonomous improvement in language models.