Description
We are seeking a Research Scientist or Engineer to lead the development of next-generation post-training recipes for Gemini. In this role, you will move beyond standard tuning; you will architect the Reward Modeling and Reinforcement Learning strategies that define how our most capable models learn. You will focus specifically on "hard" capabilities—such as improving chain-of-thought reasoning and complex instruction following—where synthetic data and distillation fall short. You will work horizontally to ensure these recipes scale across text, audio, and multimodal domains, establishing the gold standard for how Gemini evolves.
Key responsibilities:
- Frontier Recipe Development: Design and validate novel post-training pipelines (SFT, RLHF, RLAIF) specifically for frontier-class models where no "teacher" model exists.
- Advance Reward Modeling: Lead research into next-gen Reward Models, including investigating new architectures, reducing reward hacking, and improving signal-to-noise ratios in preference data.
- Unlock "Thinking" Capabilities: Develop innovative methods to improve the model's internal reasoning (chain-of-thought), focusing on correctness, logic, and self-correction in multi-step tasks.
- Revamp RL Paradigms: Critically re-evaluate and optimize RL prompts and feedback mechanisms to extract maximum performance from the underlying base models.
- Solve the "Flywheel" Challenge: Create robust mechanisms to turn user signals and interactions into training data that continuously improves the model without introducing regressions or bias.
- Horizontal Impact: Collaborate across teams to apply these advanced recipes to various model sizes and modalities (e.g., Audio), ensuring consistent high-quality behavior.