Google DeepMind · Mountain View

Research Engineer (or Scientist), Speech and Language

11/27/2025

Description

We are looking for a top-notch Research Scientist or Research Engineer to join a fast-paced audio and language team that is fundamental to Gemini audio, music and speech generation, representation learning, and diffusion theory projects. The ideal candidate is hungry to dive into both theory and code: someone who will drive independent research initiatives, work with teams on large-scale AI, and develop solutions to fundamental questions in machine learning and AI.

Key responsibilities:

  • Design, rapidly implement in code, and rigorously evaluate cutting-edge deep learning algorithms and data curation for multimodal generative AI, with a particular emphasis on audio and video synthesis.
  • Report and present research findings and developments clearly and efficiently both internally and externally, verbally and in writing.
  • Thrive under uncertainty, driving team collaborations toward ambitious research goals while also making significant individual contributions.

Qualifications

  • MS or PhD in Computer Science, Artificial Intelligence, Machine Learning, Computer Vision, Speech Processing, or equivalent practical experience.
  • Proven experience in deep learning research and development, particularly in generative AI for video and audio synthesis, including diffusion models and autoregressive generative models.
  • Exceptional engineering skills in Python and deep learning frameworks (e.g., JAX, TensorFlow, PyTorch), with a track record of building high-quality research prototypes and systems. Self-motivated to pick up new technologies, adapt, and move quickly.
  • Strong publication record at top-tier machine learning, computer vision, and graphics conferences (e.g., NeurIPS, ICLR, ICML, SIGGRAPH, CVPR, ICCV).
  • Knowledge of probabilistic machine learning and generative modeling (e.g., diffusion models, autoregressive models, GANs, flows, hierarchical VAEs, DDPMs).
  • Demonstrated experience in large-scale training of multimodal generative models.
  • Sequence processing experience with TensorFlow, PyTorch, or JAX.
  • Bonus: knowledge of speech processing and language understanding, in particular text-to-speech synthesis and prosody modeling.

Application

View listing at origin and apply!