Description
As a staff engineer on the ML Compute team, you will:
- Drive large-scale training initiatives to support our most complex models.
- Operationalize large-scale ML workloads on Kubernetes.
- Enhance distributed cloud training techniques for foundation models.
- Design and integrate end-to-end lifecycles for distributed ML systems.
- Develop tools and services to optimize ML systems beyond model selection.
- Architect a robust MLOps platform to support seamless ML operations.
- Collaborate with cross-functional engineers to solve large-scale ML training challenges.
- Research and implement new patterns and technologies to improve system performance, maintainability, and design.
- Lead complex technical projects, defining requirements and tracking progress with team members.
- Mentor engineers in areas of your expertise, fostering skill growth and knowledge sharing.
- Cultivate a team centered on collaboration, technical excellence, and innovation.