Anthropic · San Francisco/New York City · Hybrid

Research Engineer, Safeguards Labs

4/18/2026

Description

We're hiring research engineers to define and execute the Labs research agenda. You'll scope your own projects, run experiments end-to-end, and decide when an idea is ready to hand off to a production team — or when to kill it and move on. The team is small and being built deliberately around a roughly 3:1 mix of researchers to software engineers, so each person has substantial latitude over what they work on and high leverage on the team's direction.

Responsibilities:

  • Lead and contribute to research projects investigating new methods for detecting misuse of Claude, identifying malicious organizations and accounts, strengthening model safeguards, and other safety needs.
  • Design and run offline analyses over model usage data to surface abuse patterns, build classifiers and detection systems, and evaluate their effectiveness.
  • Develop and iterate on prototypes that could eventually feed signals into the real-time safeguards path, partnering with engineers on tech transfer.
  • Contribute to a broader research portfolio investigating methods for detecting abusive behavior in chat-based or agentive workflows, and for training the model to robustly refrain from dangerous responses or behaviors without over-refusing.
  • Build evaluations and methodologies for measuring whether safeguards actually work, including in agentic settings.
  • Write up findings clearly so they inform decisions across Trust & Safety, research, and product teams.

Qualifications

  • Have a track record of independently driving research projects from ambiguous problem statements to concrete results, ideally in AI, ML, security, integrity, or a related technical field.
  • Are comfortable scoping your own work and switching between research, engineering, and analysis as a project demands.
  • Have working familiarity with how large language models operate — sampling, prompting, training — even if LLMs aren't your primary background.
  • Are proficient in Python and comfortable working with large datasets.
  • Care about the societal impacts of AI and want your work to directly reduce real-world harm.
  • Experience building and training machine learning models, including classifiers for abuse, fraud, integrity, or security applications.
  • Knowledge of evaluation methodologies for language models and experience designing evals.
  • Experience with agentic environments and evaluating model behavior in them.
  • Background in trust and safety, integrity, fraud detection, threat intelligence, or adversarial ML.
  • Experience with red teaming, jailbreak research, or interpretability methods like steering vectors.
  • A history of taking research prototypes and transferring them into production systems.

Benefits

$350,000 - $850,000 USD

Application

View listing at origin and apply!

Fast-track your ML job hunt :

Be the first to hear about new sota jobs + exclusive salary research + career cheatsheets.