The Robustness Team is part of the Alignment Science team and conducts critical safety research and engineering to ensure AI systems can be deployed safely. As part of Anthropic's broader safeguards organization, we work on both immediate safety challenges and longer-term research initiatives, with projects spanning jailbreak robustness, automated red-teaming, monitoring techniques, and applied threat modeling. We prioritize techniques that will enable the safe deployment of more advanced AI systems (ASL-3 and beyond), taking a pragmatic approach to fundamental AI safety challenges while maintaining strong research rigor.
You take a pragmatic approach to running machine learning experiments that help us understand and steer the behavior of powerful AI systems. You care about making AI helpful, honest, and harmless, and are interested in the ways this could be challenging in the context of human-level capabilities. You could describe yourself as both a scientist and an engineer. You'll focus both on risks from powerful future systems (like those we would designate as ASL-3 or ASL-4 under our Responsible Scaling Policy) and on better understanding the risks occurring today. You will work in collaboration with other teams, including Interpretability, Fine-Tuning, and the Frontier Red Team.
These papers give an overview of the topics the team works on:
- Best-of-N Jailbreaking
- Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
- Rapid Response: Mitigating LLM Jailbreaks with a Few Examples
- Many-shot Jailbreaking
- When Do Universal Image Jailbreaks Transfer Between Vision-Language Models?
Note: The team currently prefers candidates who can be based in the Bay Area. For this role, we conduct all interviews in Python.