Embedded directly in the Le Chat product team, you will build the evaluation and A/B testing framework, add end-to-end observability, and run a reliable model release process. You will work with Science to ship measurable improvements to quality, latency, safety, and reliability.
• Build and maintain an LLM evaluation framework (reference tests, heuristics, model-graded checks); a sketch of one such check follows this list.
• Define and track metrics: task success, helpfulness, hallucination proxies, safety flags, latency/cost.
• Run A/B tests for prompts, models, and system prompts; analyze results and recommend rollout or rollback (see the readout sketch after this list).
• Set up observability for LLM calls: structured logging, tracing, dashboards, and alerts (see the logging sketch after this list).
• Operate the model release process: canary and shadow traffic, sign-offs, SLO-based rollback criteria, regression detection (a rollback-criteria sketch follows this list).
• Improve core behaviors: memory write/retrieve policies and evals, intent classification, follow-ups, routing, tool-call reliability.
• Create templates and docs so other teams can author evals and ship safely.
• Partner with Science to diagnose regressions and lead post-mortems.
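
A minimal sketch of one eval case that combines a reference test with a model-graded check, assuming Python tooling; `call_model`, the grading rubric, and the case schema are hypothetical placeholders, not an existing internal API:

```python
# Minimal sketch of one eval case: a cheap reference test plus a model-graded check.
# `call_model`, the rubric, and the case schema are hypothetical placeholders.
import json

def call_model(prompt: str) -> str:
    """Placeholder for the production model client."""
    raise NotImplementedError

GRADER_RUBRIC = (
    "You are grading an assistant answer. "
    'Return JSON: {"score": 0 or 1, "reason": "..."}. '
    "Score 1 only if the answer is correct, grounded, and safe."
)

def reference_check(answer: str, must_contain: list[str]) -> bool:
    """Heuristic reference test: required facts appear in the answer."""
    return all(fragment.lower() in answer.lower() for fragment in must_contain)

def model_graded_check(question: str, answer: str) -> dict:
    """LLM-as-judge: ask a grader model to score the answer against a rubric."""
    grader_prompt = f"{GRADER_RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    return json.loads(call_model(grader_prompt))

def run_case(case: dict) -> dict:
    """Run one eval case and return both the heuristic and graded results."""
    answer = call_model(case["prompt"])
    return {
        "case_id": case["id"],
        "reference_pass": reference_check(answer, case["must_contain"]),
        "graded": model_graded_check(case["prompt"], answer),
    }
```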
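A sketch of how an A/B readout on task success could be analyzed with a two-proportion z-test; the alpha value, example numbers, and rollout wording are illustrative, not release policy:

```python
# Sketch of an A/B readout on task success using a two-proportion z-test.
# Thresholds and example counts are illustrative, not fixed policy.
from math import sqrt
from statistics import NormalDist

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates B - A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

def recommend(success_a: int, n_a: int, success_b: int, n_b: int, alpha: float = 0.05) -> str:
    """Turn the test result into a rollout / rollback / hold recommendation."""
    z, p = two_proportion_z(success_a, n_a, success_b, n_b)
    if p >= alpha:
        return "no significant difference: hold"
    return "roll out B" if z > 0 else "roll back B"

# Example: variant B moves task success from 71% to 74% over ~5k sessions each.
print(recommend(3550, 5000, 3700, 5000))
```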
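A sketch of structured, per-call logging that emits one JSON line per LLM request and carries a trace id so calls can be joined with downstream tool invocations; the field names and status values are assumptions, not an existing schema:

```python
# Sketch of structured logging for LLM calls: one JSON line per request,
# keyed by a trace id. Field names and status values are illustrative.
import json
import logging
import time
import uuid

logger = logging.getLogger("llm_calls")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_llm_call(model: str, prompt_tokens: int, completion_tokens: int,
                 latency_ms: float, status: str, trace_id: str | None = None) -> str:
    """Emit one structured log line for an LLM call and return its trace id."""
    trace_id = trace_id or str(uuid.uuid4())
    logger.info(json.dumps({
        "ts": time.time(),
        "trace_id": trace_id,
        "model": model,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "latency_ms": latency_ms,
        "status": status,  # e.g. ok | timeout | safety_block | error
    }))
    return trace_id
```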
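A sketch of SLO-based rollback criteria for a canary release, comparing canary metrics against the baseline; the thresholds (2 points of task success, 15% p95 latency, 50% safety flag rate) are illustrative assumptions, not agreed SLOs:

```python
# Sketch of SLO-based rollback checks for a canary: any violation triggers rollback.
# Thresholds are illustrative assumptions, not agreed SLOs.
from dataclasses import dataclass

@dataclass
class Metrics:
    task_success: float      # fraction of sessions marked successful
    p95_latency_ms: float
    safety_flag_rate: float

def should_rollback(baseline: Metrics, canary: Metrics) -> list[str]:
    """Return the list of violated SLOs for the canary; non-empty means roll back."""
    violations = []
    if canary.task_success < baseline.task_success - 0.02:
        violations.append("task success dropped > 2 pts")
    if canary.p95_latency_ms > baseline.p95_latency_ms * 1.15:
        violations.append("p95 latency regressed > 15%")
    if canary.safety_flag_rate > baseline.safety_flag_rate * 1.5:
        violations.append("safety flag rate up > 50%")
    return violations

# Example usage:
base = Metrics(task_success=0.74, p95_latency_ms=1800, safety_flag_rate=0.004)
canary = Metrics(task_success=0.73, p95_latency_ms=2200, safety_flag_rate=0.004)
print(should_rollback(base, canary))  # ['p95 latency regressed > 15%']
```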