Grouped Reward Policy Optimization, a reinforcement learning algorithm for training language models with group-based reward estimation.
No results
Select a result to preview
Grouped Reward Policy Optimization, a reinforcement learning algorithm for training language models with group-based reward estimation.
No results