Description
Role summary
Embedded directly in a product team such as search, chat, documents, or audio, you'll improve AI-powered features through rigorous evaluation, prompt and orchestration design, and rapid experimentation. You'll own your domain's AI quality end-to-end: define what "good" looks like, measure it, run experiments, and ship what works. You'll work with Science to deliver measurable improvements to quality, latency, safety, and reliability.
- Design and run evaluations for your product area: reference tests, heuristics, and model-graded checks tailored to search relevance, chat quality, document understanding, or audio performance.
- Define and track metrics that matter: task success, helpfulness, hallucination proxies, safety flags, latency, and cost.
- Own prompt and orchestration design: write, test, and iterate on prompts and system prompts as a core part of your work.
- Run A/B tests on prompts, models, and configurations; analyze results; make rollout or rollback decisions from data.
- Set up observability for LLM calls: structured logging, tracing, dashboards, alerts.
- Operate model releases: canary and shadow traffic, sign-offs, SLO-based rollback criteria, regression detection.
- Improve core behaviors in your product area, whether that's memory policies, intent classification, routing, tool-call reliability, or retrieval quality.
- Create templates and documentation so other teams can author evals and ship safely.
- Partner with Science to diagnose regressions and lead post-mortems.