Description
- Develop robust methodologies to assess the performance of foundation models (e.g., LLMs, vision-language models) across diverse tasks.
- Leverage LLMs as judges for subjective and open-ended model evaluations (e.g., summarization, reasoning, or multimodal generation tasks); see the sketch after this list.
- Build, curate, and lead the development of evaluation datasets and benchmarks.
- Apply advanced proficiency in at least one scripting language, preferably Python.
- Collaborate with research, engineering, and product teams to define evaluation goals aligned with user experience and product quality.
- Conduct failure analysis and uncover edge cases to improve model robustness.
- Contribute to our tools and infrastructure to automate and scale evaluation processes.
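
The LLM-as-judge and automation bullets above are the kind of work a small rubric-scoring harness illustrates. The sketch below is illustrative only and not part of the posting: `judge_fn`, `RUBRIC`, `build_judge_prompt`, and `score_response` are hypothetical names, and the summarization rubric with a 1-5 scale is an assumption.

```python
"""Minimal sketch of an LLM-as-judge evaluation loop (illustrative, assumptions noted above)."""

import re
from typing import Callable, Optional

# Hypothetical rubric: score a candidate summary for faithfulness and coverage on 1-5.
RUBRIC = (
    "You are an impartial judge. Rate the candidate summary for faithfulness "
    "and coverage on a 1-5 scale. Reply with a line 'Score: <n>' followed by "
    "a brief justification."
)


def build_judge_prompt(source: str, candidate: str) -> str:
    """Assemble the rubric, source document, and candidate output into one judge prompt."""
    return f"{RUBRIC}\n\nSource:\n{source}\n\nCandidate summary:\n{candidate}"


def score_response(source: str, candidate: str, judge_fn: Callable[[str], str]) -> Optional[int]:
    """Ask the judge model for a rating and parse the 1-5 score from its reply."""
    reply = judge_fn(build_judge_prompt(source, candidate))
    match = re.search(r"Score:\s*([1-5])", reply)
    return int(match.group(1)) if match else None  # None flags an unparseable verdict


if __name__ == "__main__":
    # Stub judge so the sketch runs without any API; swap in a real LLM call here.
    def fake_judge(prompt: str) -> str:
        return "Score: 4\nThe summary is faithful but omits one supporting detail."

    print(score_response("Full article text...", "Short summary...", fake_judge))
```

Requiring a structured "Score: <n>" line keeps judge verdicts machine-parseable, so unparseable or out-of-range replies can be logged for failure analysis and the loop can be batched over a benchmark without manual review.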