
As a member of this API & power-users team, you will improve the capabilities, reliability, and product fit of OpenAI’s agentic models for power users and API developers. You might design evals from real developer workflows, build training environments around production-like tool use, turn qualitative model failures into training data, evals, or post-training interventions, or drive a behavior improvement from discovery through post-training, integration, and launch.
This role is intentionally broad. The strongest candidates are comfortable turning ambiguous model behavior problems into concrete progress, whether that means improving tool use, planning, instruction following, recovery from mistakes, or how models behave in API-based workflows. You should be excited to work across research, engineering, data, evals, and product to make models better at acting in real workflows.
You will work closely with researchers, engineers, API/product teams, Codex, infrastructure, and safety/alignment partners to decide which behaviors matter, how to measure them, how to train them, and when they are ready for major model runs. This is a high-agency role for people who want their work to show up directly in frontier models used by expert users and developers.
In this role, you will:
Design and run experiments that improve model behavior in API and power-user workflows: function calling, tool use, coding, planning, long-horizon execution, factuality, instruction following, error recovery, and calibrated reasoning.
Build evals, graders, and environments from real developer and power-user workflows, then turn observed failures into training data, model-behavior hypotheses, and shipped improvements.
Partner with API teams and power users to identify high-leverage behavior gaps and convert product signals into post-training interventions.
Improve how models behave when composed into systems: using tools reliably, respecting developer intent, handling partial failures, asking for clarification when appropriate, and maintaining coherence across multi-step tasks.
Own end-to-end model behavior projects, from qualitative failure analysis through data generation, training experiments, eval design, integration into major runs, and launch readiness.
Develop feedback loops that use power-user traces, API usage patterns, and production-like environments to discover the next frontier of agentic model failures and gaps.
Help decide which agentic capabilities, behavioral fixes, and partner-team integrations are ready for inclusion in major model runs.
Debug hard failures in shipped or near-shipped models by moving between traces, evals, training data, model outputs, and product context.
Work on early-training and alignment interventions, including data mixtures, objectives, synthetic data, and eval loops that shape downstream agent behavior.
Improve the machinery for large-scale training and launch: experiment velocity, reliability, observability, reproducibility, cost, latency, and production readiness.
Take on cross-functional projects that touch model training, product infrastructure, and the production agent harness, such as multi-agent systems or training directly against production-like environments.
You might thrive in this role if you:
Have strong technical fundamentals in ML, software engineering, systems, statistics, or applied research, and can quickly learn across unfamiliar parts of the stack.
Have hands-on experience with LLMs, post-training, RL/RLHF/RLAIF, evals, graders, synthetic data, coding agents, tool-using agents, API products, or production ML systems.
Have strong taste for model behavior: you can look at a transcript, trace, eval failure, or API interaction and form concrete hypotheses about what the model needs to learn.
Are excited by ambiguous capability problems where the signal is noisy, the failures are qualitative, and the solution may involve data, training, evals, product changes, or all of the above.
Care deeply about developer and expert-user experience, especially how models behave when embedded in real user workflows, API products, and agent harnesses.
Are comfortable working across research, product, infrastructure, data, evals, and safety boundaries, and can communicate clearly with each group.
Like building load-bearing systems and processes when that is what the team needs, even if the work is not glamorous.
Want to train and ship the models that make agents genuinely useful for developers, enterprises, researchers, and everyday users.