Description
Anthropic's ML Performance and Scaling team trains our production pretrained models, work that directly shapes the company's future and our mission to build safe, beneficial AI systems. As a Research Engineer on this team, you'll ensure our frontier models train reliably, efficiently, and at scale. This is demanding, high-impact work that requires both deep technical expertise and a genuine passion for the craft of large-scale ML systems.
This role lives at the boundary between research and engineering. You'll work across our entire production training stack: performance optimization, hardware debugging, experimental design, and launch coordination. During launches, the team works in tight lockstep, responding to production issues that can't wait for tomorrow.
Responsibilities:
- Own critical aspects of our production pretraining pipeline, including model operations, performance optimization, observability, and reliability
- Debug and resolve complex issues across the full stack—from hardware errors and networking to training dynamics and evaluation infrastructure
- Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance
- Respond to on-call incidents during model launches, diagnosing problems quickly and coordinating solutions across teams
- Build and maintain production logging, monitoring dashboards, and evaluation infrastructure
- Add new capabilities to the training codebase, such as long context support or novel architectures
- Collaborate closely with teammates across SF and London, as well as with the Tokens, Architectures, and Systems teams
- Contribute to the team's institutional knowledge by documenting systems, debugging approaches, and lessons learned