Explore the development of Moshi, the first speech-to-speech model, from its inception to real-time inference. This session covers key decisions, such as training at scale on large datasets and the shift from traditional frameworks like PyTorch to Rust/Candle. Learn how these choices affected performance, and how 500,000 Moshi sessions were served on optimized L4 GPU clusters, keeping computational demand low while maintaining real-time accuracy.
Laurent Mazaré, CTO at Kyutai