Embedded directly in a product team, such as search, chat, documents, or audio, you'll improve AI-powered features through rigorous evaluation, prompt and orchestration design, and rapid experimentation. You'll own your domain's AI quality end to end: define what "good" looks like, measure it, run experiments, and ship what works. You'll work with Science to deliver measurable improvements to quality, latency, safety, and reliability.
• Design and run evaluations for your product area: reference tests, heuristics, and model-graded checks tailored to search relevance, chat quality, document understanding, or audio performance (see the eval sketch below).
• Define and track metrics that matter: task success, helpfulness, hallucination proxies, safety flags, latency, cost.
• Own prompt and orchestration design: write, test, and iterate on prompts and system prompts as a core part of your work.
• Run A/B tests on prompts, models, and configurations; analyze the results; and make rollout or rollback decisions based on the data (see the A/B sketch below).
• Set up observability for LLM calls: structured logging, tracing, dashboards, and alerts (see the logging sketch below).
• Operate model releases: canary and shadow traffic, sign-offs, SLO-based rollback criteria, and regression detection (see the rollback sketch below).
• Improve core behaviors in your product area, whether that's memory policies, intent classification, routing, tool-call reliability, or retrieval quality.
• Create templates and documentation so other teams can author evals and ship safely.
• Partner with Science to diagnose regressions and lead post-mortems.
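The sketch below illustrates the model-graded check mentioned in the evals bullet: a judge model grades each response against a per-case rubric, and results are aggregated into a pass rate. It is a minimal illustration, not a prescribed harness; `EvalCase`, `grade`, `pass_rate`, and the `call_judge` callable are hypothetical names standing in for whatever model client and eval framework the team actually uses.

```python
# Minimal sketch of a model-graded eval: a judge model scores each response
# against a rubric, and we aggregate pass rates. `call_judge` is a placeholder
# for whatever model client your stack provides (hypothetical).
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    query: str
    response: str
    rubric: str  # e.g. "Answer must cite at least one retrieved document."

def grade(case: EvalCase, call_judge: Callable[[str], str]) -> bool:
    prompt = (
        "You are grading an AI response.\n"
        f"Rubric: {case.rubric}\n"
        f"Query: {case.query}\n"
        f"Response: {case.response}\n"
        "Reply with PASS or FAIL only."
    )
    return call_judge(prompt).strip().upper().startswith("PASS")

def pass_rate(cases: list[EvalCase], call_judge: Callable[[str], str]) -> float:
    results = [grade(c, call_judge) for c in cases]
    return sum(results) / len(results) if results else 0.0
```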
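For the A/B readout, here is a minimal sketch of a two-proportion z-test on task-success counts between a control prompt and a candidate prompt, using only the Python standard library. The sample counts, the 0.05 threshold, and the decision rule are illustrative assumptions, not the team's actual criteria.

```python
# Minimal sketch of an A/B readout for a prompt change: two-proportion z-test
# on task-success counts. Thresholds and numbers below are illustrative.
from math import sqrt
from statistics import NormalDist

def two_proportion_z(successes_a: int, n_a: int, successes_b: int, n_b: int) -> tuple[float, float]:
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return z, p_value

# Example: control prompt (A) vs. candidate prompt (B) on task success.
z, p = two_proportion_z(successes_a=412, n_a=1000, successes_b=447, n_b=1000)
decision = "roll out" if p < 0.05 and z > 0 else "hold / roll back"
print(f"z={z:.2f}, p={p:.3f} -> {decision}")
```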
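For observability, a minimal sketch of structured logging around an LLM call: one JSON log line per call with a trace id, latency, payload sizes, and error status, which downstream dashboards and alerts can consume. The field names and the `call_model` callable are hypothetical, not a fixed schema.

```python
# Minimal sketch of structured logging around an LLM call. Emits one JSON
# line per call; field names are illustrative.
import json, logging, time, uuid
from typing import Callable

logger = logging.getLogger("llm_calls")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def logged_llm_call(call_model: Callable[[str], str], prompt: str, model: str) -> str:
    record = {"trace_id": str(uuid.uuid4()), "model": model, "prompt_chars": len(prompt)}
    start = time.perf_counter()
    try:
        response = call_model(prompt)
        record.update(status="ok", response_chars=len(response))
        return response
    except Exception as exc:
        record.update(status="error", error=type(exc).__name__)
        raise
    finally:
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 1)
        logger.info(json.dumps(record))
```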
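For SLO-based rollback criteria, a minimal sketch of a canary check that compares a canary window against the control baseline. The thresholds and metric names are illustrative assumptions, not the actual SLOs.

```python
# Minimal sketch of an SLO-based rollback check for a canary release.
# Thresholds below are illustrative, not prescribed values.
from dataclasses import dataclass

@dataclass
class WindowMetrics:
    task_success_rate: float
    p95_latency_ms: float
    safety_flag_rate: float

def should_rollback(canary: WindowMetrics, control: WindowMetrics) -> bool:
    return (
        canary.task_success_rate < control.task_success_rate - 0.02  # >2-point success drop
        or canary.p95_latency_ms > control.p95_latency_ms * 1.2      # >20% latency regression
        or canary.safety_flag_rate > 0.01                            # absolute safety ceiling
    )
```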