Direct Nash Optimization: Teaching language models to self-improve with general preferences
Microsoft Research
Published September 3, 2024, 18:59
Corby Rosset, Senior Researcher at Microsoft Research AI Frontiers, discusses teaching language models to self-improve using a preference oracle such as GPT-4, framing alignment as a two-player game whose optimal policy lies at a Nash equilibrium, and achieving state-of-the-art win rates against GPT-4 Turbo on benchmarks such as AlpacaEval and MT-Bench.
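For context, the two-player game mentioned above is typically written as a minimax objective over policies. The following is a sketch in the style of the general-preference literature, not a formula quoted from the talk: here x denotes a prompt, \pi and \pi' are the two competing policies, and \mathcal{P}(y \succ y' \mid x) is the preference oracle's probability of preferring response y over y'.

\[
\pi^\star \;=\; \arg\max_{\pi} \,\min_{\pi'} \;
\mathbb{E}_{x,\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
\big[\, \mathcal{P}(y \succ y' \mid x) \,\big]
\]

At the Nash equilibrium of this game, the policy \pi^\star is preferred over any competing policy at least half the time, which is what makes it a natural optimality target when preferences are general (possibly intransitive) rather than induced by a single reward function.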
Microsoft Research Forum, September 3, 2024
See more at aka.ms/ResearchForum-Sep2024