zby on Jan 25, 2025, on: TinyZero: Reproduction of DeepSeek R1 Zero in coun...

I think this is where the relative rewards come into play: they sample many thinking traces for the same problem and reward the ones that reach a correct answer. This works at the model's current 'cutting edge', on problems it solves only some of the time, which is exactly where it can still be improved.
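
To make the "relative rewards" idea concrete, here is a minimal sketch. It assumes a GRPO-style setup, which the comment does not spell out: each sampled trace gets reward 1.0 if its final answer is correct and 0.0 otherwise, and each reward is then normalized against the mean and standard deviation of its own group. The function name `group_relative_advantages` and the example rewards are made up for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each sampled trace relative to the other traces in its group.

    Assumption: GRPO-style normalization, (reward - group mean) / group std.
    `eps` avoids division by zero when every trace gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group: 8 traces sampled for one problem, 2 of them correct.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])

# Hypothetical groups where the problem is too easy or too hard for the model.
all_correct = group_relative_advantages([1.0] * 8)
all_wrong = group_relative_advantages([0.0] * 8)

print(mixed)        # correct traces get positive values, incorrect ones negative
print(all_correct)  # every value is 0.0: nothing to learn from
print(all_wrong)    # every value is 0.0: nothing to learn from
```

Under these assumptions, a group only produces a training signal when some traces succeed and others fail. Problems the model always solves or never solves contribute zero advantage, so the updates concentrate on the problems at the edge of its ability.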