Positive TinyStories GPT
Reinforcement Learning Alignment Pipeline
Pretraining → SFT → Policy Gradient RL
Model Size: 0.81M Params | Block Size: 64 | Char-Level GPT
What You Are Seeing
- Left: a GPT trained normally on TinyStories.
- Right: the same GPT aligned using reinforcement learning with a sentiment reward.
- RL objective: maximize positivity while keeping the policy close to the pretrained model via a KL penalty.
- Loss = −E[reward × log_prob] + β · KL(policy ‖ reference)
[Live demo panels: Pretrained GPT vs. RL-Aligned GPT, each with a running Positivity Score.]