Moshi is the lowest latency conversational AI ever released.
On July 4, kyutai_labs introduced Moshi, the lowest latency conversational AI ever released. Moshi can perform small talk, explain various concepts and engage in roleplay using many emotions and speaking styles. In this video, watch Moshi talk like a pirate and in a spooky whisper!
Talk to Moshi here: moshi.chat/?qu... .
____________________________________
More Info:
According to Philipp Schmid, @_philschmid on X,
Moshi:
Expresses and understands emotions, e.g. can speak with a “French accent”
Listens and generates Audio/Speech
Generates realistic, human-like speech in a variety of accents
Supports 2 streams of audio to listen and speak at the same time
Used joint pre-training on a mix of text and audio
Used synthetic text data from Helium, a 7B LLM created by Kyutai
Is fine-tuned on 100k “oral-style” synthetic conversations converted to audio with TTS
Learned its voice from synthetic data generated by a separate TTS model
Achieves an end-to-end latency of 200 ms
Has a smaller variant that runs on a MacBook or consumer-size GPU. 🤯
Uses watermarking to detect AI-generated audio (WIP)
Will be released open source!!!
____________________________________
All clips used for fair use commentary, criticism, and educational purposes. See Hosseinzadeh v. Klein, 276 F.Supp.3d 34 (S.D.N.Y. 2017); Equals Three, LLC v. Jukin Media, Inc., 139 F. Supp. 3d 1094 (C.D. Cal. 2015).
____________________________________
artificial intelligence, technology, AI, large language models, LLMs, interactive