Grouped Reward Policy Optimization, a reinforcement learning algorithm for training language models with group-based reward estimation.

表格 0 results

No results

Powered by Forestry.md