Yinghao Aaron Li


Yinghao Aaron Li is an artificial intelligence research scientist known for his contributions to speech synthesis, voice conversion, and multimodal large language models. He currently works as an AI research scientist and has developed several notable text-to-speech (TTS) models, including StyleTTS, StyleTTS 2, and DMOSpeech 2. [1]

Education

Li completed his doctoral studies at Columbia University, earning a Ph.D. from the Department of Electrical Engineering. His research at Columbia focused on generative speech modeling, including text-to-speech synthesis and voice conversion, under the supervision of Professor Nima Mesgarani. Throughout his academic career, his work has been published in various IEEE journals and presented at prominent AI and computational linguistics conferences. [2] [3] [10]

Career

During his Ph.D. studies, Li undertook a research internship at Adobe, where his work contributed to the development of the DMOSpeech project. Upon completing his doctorate, Li announced in 2025 that he had accepted a position as an AI Research Scientist, with his work there focusing on multimodal large language models. His projects are often made publicly available through platforms such as GitHub, where models like StyleTTS 2 have gained significant traction within the open-source community.

Li's research primarily addresses challenges in generating natural, diverse, and efficient human-like speech. His work spans style-based generative models, zero-shot synthesis, metric-optimized TTS, and integrated spoken dialogue systems. [1] [2] [3] [10] [11]

StyleTTS and StyleTTS 2

StyleTTS is a style-based generative model for text-to-speech synthesis designed to produce speech with naturalistic prosodic variations and emotional tones. The model synthesizes speech directly from a style vector, which is a latent variable sampled from a reference speech signal, without requiring explicit prosody modeling. This approach allows for the generation of diverse speech outputs from the same text input. [5]
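
The core idea can be illustrated with a short, self-contained sketch. The class names and dimensions below (StyleEncoder, Decoder) are placeholders rather than the actual StyleTTS code: a style vector is pooled from a reference mel-spectrogram and conditions the decoder, so the same text can be rendered with different prosody depending on the reference.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins for StyleTTS components; names and sizes are placeholders.
class StyleEncoder(nn.Module):
    """Pools a reference mel-spectrogram into a fixed-size latent style vector."""
    def __init__(self, n_mels=80, style_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU(), nn.Linear(256, style_dim))

    def forward(self, mel):                      # mel: (batch, frames, n_mels)
        return self.net(mel).mean(dim=1)         # average over time -> (batch, style_dim)

class Decoder(nn.Module):
    """Generates acoustic features from text embeddings conditioned on a style vector."""
    def __init__(self, text_dim=256, style_dim=128, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(text_dim + style_dim, n_mels)

    def forward(self, text_emb, style):          # text_emb: (batch, T, text_dim)
        style = style.unsqueeze(1).expand(-1, text_emb.size(1), -1)
        return self.proj(torch.cat([text_emb, style], dim=-1))

style_encoder, decoder = StyleEncoder(), Decoder()
text_emb = torch.randn(1, 50, 256)               # the same encoded input text
ref_a = torch.randn(1, 120, 80)                  # reference utterance A (e.g. calm)
ref_b = torch.randn(1, 120, 80)                  # reference utterance B (e.g. excited)

# Different references yield different prosody for identical text,
# with no explicit prosody labels or predictors involved.
mel_a = decoder(text_emb, style_encoder(ref_a))
mel_b = decoder(text_emb, style_encoder(ref_b))
```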

Building on this foundation, StyleTTS 2 was developed to advance toward human-level TTS quality. It integrates style diffusion and adversarial training with large speech language models. This model improves upon its predecessor by enhancing the naturalness and speaker similarity of the synthesized speech. The project gained considerable attention in the open-source community, accumulating over 5,900 stars on GitHub and forming the basis for other popular TTS projects. [6] [1]
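
The adversarial component can be sketched roughly as follows. The speech language model and discriminator below are toy stand-ins (a small GRU and a linear head), not the pretrained SLM or the released StyleTTS 2 code; the sketch only shows how SLM features feed a real-versus-synthesized discriminator.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: a small GRU plays the role of the pretrained speech language
# model and a linear head plays the discriminator.
slm = nn.GRU(input_size=80, hidden_size=256, batch_first=True)
discriminator = nn.Linear(256, 1)

def slm_features(mel):
    out, _ = slm(mel)            # (batch, frames, 256) frame-level SLM features
    return out.mean(dim=1)       # pooled utterance-level representation

real_mel = torch.randn(4, 120, 80)   # ground-truth speech features
fake_mel = torch.randn(4, 120, 80)   # output of the TTS generator

# Discriminator: score real speech as 1 and synthesized speech as 0.
d_real = discriminator(slm_features(real_mel))
d_fake = discriminator(slm_features(fake_mel.detach()))
loss_d = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
          F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))

# Generator: fool the discriminator into scoring synthesized speech as real.
g_fake = discriminator(slm_features(fake_mel))
loss_g = F.binary_cross_entropy_with_logits(g_fake, torch.ones_like(g_fake))
```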

StyleTTS-ZS

Presented at the 2025 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), StyleTTS-ZS is an efficient, high-quality zero-shot TTS model. The model addresses common issues in large-scale TTS, such as slow inference speeds and reliance on complex neural codec representations. It introduces a method that uses distilled time-varying style diffusion to capture diverse speaker identities and prosodies from a short reference audio clip.

Key features of StyleTTS-ZS include:

  • Time-Varying Style Codes: Represents speech using input text and fixed-length, time-varying discrete style codes to capture prosodic variations.
  • Efficient Latent Diffusion: A diffusion model is used to sample the style code, enabling efficient synthesis.
  • Classifier-Free Guidance: This technique is employed during the diffusion process to achieve high similarity to the reference speaker's voice (a sketch of the guidance step appears below).
  • Distillation for Speed: The style diffusion model is distilled using a perceptual loss with a small dataset, which reduces inference time by 90% while maintaining speech quality.

The model was demonstrated to be 10 to 20 times faster than other state-of-the-art large-scale zero-shot TTS systems at the time of its publication. [7]
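
The classifier-free guidance step can be illustrated with a minimal sketch. The denoiser below is a hypothetical stand-in; the actual StyleTTS-ZS sampler differs in its noise schedule, conditioning, and style-code representation.

```python
import torch

# Minimal sketch of classifier-free guidance over a style-code diffusion step.
def guided_denoise(denoiser, noisy_style, timestep, ref_embedding, guidance_scale=3.0):
    cond = denoiser(noisy_style, timestep, ref_embedding)                       # conditioned on the reference
    uncond = denoiser(noisy_style, timestep, torch.zeros_like(ref_embedding))   # condition dropped
    # Move the estimate toward the conditional direction to boost speaker similarity.
    return uncond + guidance_scale * (cond - uncond)

# Toy usage with a stand-in denoiser.
toy_denoiser = lambda x, t, c: x - 0.1 * (x - c.mean())
style_code = guided_denoise(toy_denoiser, torch.randn(1, 128), torch.tensor(10), torch.randn(1, 128))
```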

DMOSpeech 2

DMOSpeech 2 represents an advancement in metric-optimized speech synthesis and was Li's final project during his Ph.D. It extends the work of the original DMOSpeech by incorporating reinforcement learning (RL) to optimize the duration predictor, a component previously not optimized for perceptual metrics. The system aims to create a more complete metric-optimized pipeline for zero-shot TTS.

The core innovations of DMOSpeech 2 are:

  • RL for Duration Prediction: A duration policy framework is implemented using Group Relative Policy Optimization (GRPO), with speaker similarity and word error rate (WER) serving as reward signals. This allows the model to learn phoneme durations that improve the perceptual quality and intelligibility of the output; a simplified reward sketch appears at the end of this section.
  • Teacher-Guided Sampling: A hybrid sampling approach that uses a larger "teacher" model for the initial denoising steps before switching to a more efficient "student" model. This method improves output diversity while maintaining fast inference speeds (sketched directly below).
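
A minimal sketch of the hybrid denoising loop, assuming hypothetical teacher and student denoisers (the real DMOSpeech 2 samplers operate on speech latents with proper noise schedules):

```python
import torch

# The teacher handles the first, diversity-preserving steps; the distilled
# student finishes the trajectory in far fewer, faster steps.
def hybrid_sample(teacher, student, noise, total_steps=8, teacher_steps=2):
    x = noise
    for t in reversed(range(total_steps)):
        if t >= total_steps - teacher_steps:
            x = teacher(x, t)    # early steps with the larger teacher
        else:
            x = student(x, t)    # remaining steps with the fast distilled student
    return x

# Toy stand-ins for the two denoisers.
teacher = lambda x, t: 0.9 * x
student = lambda x, t: 0.5 * x
speech_latent = hybrid_sample(teacher, student, torch.randn(1, 256))
```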

Evaluations showed that DMOSpeech 2 achieved superior performance across all metrics compared to previous systems and could perform inference in just 4-8 steps, reducing sampling steps by half without degrading quality. [8] [1]
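
The duration-policy reward described in the first bullet above can be sketched in simplified form. The helper names and numbers below are illustrative, not taken from the paper; they only show how speaker similarity and WER combine into a reward and how advantages are normalized within a sampled group of candidates.

```python
import torch

def reward(speaker_similarity, word_error_rate, wer_weight=1.0):
    # Higher speaker similarity and lower WER both raise the reward.
    return speaker_similarity - wer_weight * word_error_rate

def group_relative_advantages(rewards):
    r = torch.as_tensor(rewards, dtype=torch.float32)
    return (r - r.mean()) / (r.std() + 1e-6)

# One group of four duration candidates sampled for the same input text.
candidate_rewards = [reward(0.82, 0.05), reward(0.78, 0.02),
                     reward(0.85, 0.12), reward(0.80, 0.04)]
advantages = group_relative_advantages(candidate_rewards)
# In the policy-gradient update, each candidate's log-probability under the
# duration policy would be weighted by its advantage.
```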

Style-Talker

Style-Talker is a framework designed for fast and natural spoken dialogue generation, presented at CoLM 2024. It addresses the latency and prosody limitations of traditional cascaded systems that chain together automatic speech recognition (ASR), a large language model (LLM), and a TTS model. Style-Talker fine-tunes an audio LLM and a style-based TTS model to work in concert.

The system operates by taking user input audio and using the transcribed chat history and speech style to generate both the text and the speaking style for the response. While the response is being synthesized and played, the system processes the next turn's input audio in parallel to extract its transcription and style. This pipeline design significantly reduces latency and allows the model to incorporate paralinguistic information from the input speech into the output, resulting in more natural and coherent dialogue. Experiments showed that Style-Talker was over 50% faster than conventional cascaded and speech-to-speech baseline models. [9]
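
A conceptual sketch of this overlap is shown below, with stub functions standing in for the ASR, audio LLM, and TTS components (the names are placeholders, not the Style-Talker code): the next turn's audio is analyzed on a worker thread while the current response is synthesized and played.

```python
import threading
import queue

# Stubs standing in for the real components.
transcribe = lambda audio: "hello there"                                  # ASR stub
extract_style = lambda audio: [0.1, 0.9]                                  # style-extraction stub
audio_llm = lambda history, text, style: ("hi, how can I help?", style)   # response text + style
tts = lambda text, style: b"\x00" * 16000                                 # style-based TTS stub
play = lambda wav: None                                                   # audio playback stub

def analyze_turn(audio, out_q):
    """Transcribe the next turn and extract its speaking style."""
    out_q.put({"text": transcribe(audio), "style": extract_style(audio)})

def respond(history, turn):
    """Generate and play the response for the current turn."""
    text, style = audio_llm(history, turn["text"], turn["style"])
    play(tts(text, style))

history, analysis_q = [], queue.Queue()
current_turn = {"text": "hello there", "style": [0.1, 0.9]}
next_audio = b"..."                                   # next turn's input audio

worker = threading.Thread(target=analyze_turn, args=(next_audio, analysis_q))
worker.start()                  # analyze the next turn in parallel ...
respond(history, current_turn)  # ... while the current response is synthesized and played
worker.join()
next_turn = analysis_q.get()    # transcription and style, ready for the next exchange
```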

Other Research

Li has also contributed to several other projects in speech processing. His work includes PL-BERT, a phoneme-level BERT model for enhancing TTS prosody; SLMGAN, which uses speech language model representations for unsupervised zero-shot voice conversion; and an examination of the Mamba architecture for speech-related tasks titled Speech Slytherin. These projects further demonstrate his focus on improving the efficiency, naturalness, and controllability of generative speech models. [2]

References
