Anthropic · San Francisco · Hybrid

Research Engineer, Pretraining Scaling

10/1/2025

Description

Anthropic's ML Performance and Scaling team trains our production pretrained models, work that directly shapes the company's future and our mission to build safe, beneficial AI systems. As a Research Engineer on this team, you'll ensure our frontier models train reliably, efficiently, and at scale. This is demanding, high-impact work that requires both deep technical expertise and a genuine passion for the craft of large-scale ML systems.

This role lives at the boundary between research and engineering. You'll work across our entire production training stack: performance optimization, hardware debugging, experimental design, and launch coordination. During launches, the team works in tight lockstep, responding to production issues that can't wait for tomorrow.

Responsibilities

  • Own critical aspects of our production pretraining pipeline, including model operations, performance optimization, observability, and reliability
  • Debug and resolve complex issues across the full stack—from hardware errors and networking to training dynamics and evaluation infrastructure
  • Design and run experiments to improve training efficiency, reduce step time, increase uptime, and enhance model performance
  • Respond to on-call incidents during model launches, diagnosing problems quickly and coordinating solutions across teams
  • Build and maintain production logging, monitoring dashboards, and evaluation infrastructure
  • Add new capabilities to the training codebase, such as long context support or novel architectures
  • Collaborate closely with teammates across SF and London, as well as with the Tokens, Architectures, and Systems teams
  • Contribute to the team's institutional knowledge by documenting systems, debugging approaches, and lessons learned

Qualifications

You may be a good fit if you:

  • Have hands-on experience training large language models, or deep expertise with JAX, TPU, PyTorch, or large-scale distributed systems
  • Genuinely enjoy both research and engineering work—you'd describe your ideal split as roughly 50/50 rather than heavily weighted toward one or the other
  • Are excited about being on call for production systems, working long days during launches, and solving hard problems under pressure
  • Thrive when working on whatever is most impactful, even if that changes day-to-day based on what the production model needs
  • Excel at debugging complex, ambiguous problems across multiple layers of the stack
  • Communicate clearly and collaborate effectively, especially when coordinating across time zones or during high-stress incidents
  • Are passionate about the work itself and want to refine your craft as a research engineer
  • Care about the societal impacts of AI and responsible scaling

Strong candidates may also have:

  • Previous experience training LLMs or working extensively with JAX/TPU, PyTorch, or other ML frameworks at scale
  • Contributed to open-source LLM frameworks (e.g., open_lm, llm-foundry, mesh-transformer-jax)
  • Published research on model training, scaling laws, or ML systems
  • Experience with production ML systems, observability tools, or evaluation infrastructure
  • A background as a systems engineer, quant, or in another role requiring both technical depth and operational excellence

Compensation

Annual salary: $315,000 - $560,000 USD

Application

View the listing at its original source and apply.