1 Model Overview
CosyVoice is a family of next-generation generative speech-synthesis large models developed by the Tongyi Lab of Alibaba DAMO Academy. Its core goal is to use large-model techniques to deeply fuse text understanding with speech generation, delivering a highly human-like synthesis experience. The family includes the original CosyVoice and its upgrade, CosyVoice 2.0, which differ significantly in architecture, performance, and target scenarios. Key breakthroughs include:
- A MOS score of 5.53, close to real human speech;
- First-packet latency as low as 150 ms, roughly 60% lower than traditional pipelines;
- Support for multiple languages and dialects (Chinese, English, Japanese, Korean, Cantonese, Sichuanese, etc.), including natural synthesis of mixed Chinese-English sentences;
- Fine-grained generation controls such as emotion control and environmental-sound insertion (e.g. `[laughter]`).
2 Model Capabilities by Application Scenario
| Model | Core capability | Use cases | Technical notes |
| --- | --- | --- | --- |
| CosyVoice-300M | Zero-shot voice cloning, cross-lingual generation | Personalized voice cloning, cross-language dubbing (e.g. Chinese → English) | Needs only ~3 s of reference audio; supports 5 languages; no built-in voices, a user-supplied sample is required |
| CosyVoice-300M-Instruct | Fine-grained emotion/prosody control (rich-text instructions) | Emotional dubbing (e.g. ads, audiobooks), tone and delivery adjustment | Supports natural-language instructions (e.g. "cheerful tone") and rich-text tags (e.g. `<laugh>`) |
| CosyVoice-300M-SFT | Synthesis with built-in voices (no sample needed) | Quickly generating fixed voices (e.g. courseware, navigation prompts) | 7 built-in pretrained voices (Chinese/English/Japanese/Korean/Cantonese, male and female); no cloning sample required |
| CosyVoice2-0.5B | Multilingual streaming synthesis, low-latency real-time response | Live streaming, real-time conversational customer service, bidirectional voice interaction | 0.5 B parameters; bidirectional streaming synthesis (first-packet latency ≤ 150 ms); broad language support |
Users can choose the model that best fits their business needs.
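The selection logic in the table above can be captured in a small helper. This is an illustrative sketch: the function name and the boolean criteria are hypothetical, not part of the CosyVoice API; only the returned model names come from the table.

```python
# Hypothetical helper mapping business requirements to a CosyVoice model name,
# following the comparison table above. Names and criteria are illustrative.

def pick_cosyvoice_model(need_cloning: bool = False,
                         need_instruct: bool = False,
                         need_streaming: bool = False) -> str:
    """Return a suggested model name for the given requirements."""
    if need_streaming:
        # CosyVoice2-0.5B is the only bidirectional streaming model in the table.
        return "CosyVoice2-0.5B"
    if need_instruct:
        # Instruction-following emotion/prosody control.
        return "CosyVoice-300M-Instruct"
    if need_cloning:
        # Zero-shot cloning from a ~3 s reference clip.
        return "CosyVoice-300M"
    # Default: built-in pretrained voices, no reference audio required.
    return "CosyVoice-300M-SFT"

print(pick_cosyvoice_model(need_streaming=True))  # CosyVoice2-0.5B
print(pick_cosyvoice_model())                     # CosyVoice-300M-SFT
```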
3 Demos for Each Scenario
3.1 CosyVoice-300M
```python
import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice
from cosyvoice.utils.file_utils import load_wav
import torchaudio


# zero-shot usage: clone the voice in the reference clip
def inference_zero_shot_300M(cosyvoice, tts_text):
    prompt_speech_16k = load_wav('asset/zero_shot_prompt.wav', 16000)
    for i, j in enumerate(cosyvoice.inference_zero_shot(
            tts_text, '希望你以后能夠做的比我還好呦。', prompt_speech_16k, stream=False)):
        torchaudio.save('asset/test_data/zero_shot3_{}.wav'.format(i),
                        j['tts_speech'], cosyvoice.sample_rate)


# cross-lingual usage
def inference_cross_lingual_300M(cosyvoice, tts_text):
    prompt_speech_16k = load_wav('asset/cross_lingual_prompt.wav', 16000)
    for i, j in enumerate(cosyvoice.inference_cross_lingual(
            tts_text, prompt_speech_16k, stream=False)):
        torchaudio.save('asset/test_data/cross_lingual_{}.wav'.format(i),
                        j['tts_speech'], cosyvoice.sample_rate)


# voice conversion (vc) usage
def inference_vc_300M(cosyvoice):
    prompt_speech_16k = load_wav('asset/zero_shot_prompt.wav', 16000)
    source_speech_16k = load_wav('asset/cross_lingual_prompt.wav', 16000)
    for i, j in enumerate(cosyvoice.inference_vc(
            source_speech_16k, prompt_speech_16k, stream=False)):
        torchaudio.save('asset/test_data/vc_{}.wav'.format(i),
                        j['tts_speech'], cosyvoice.sample_rate)


if __name__ == '__main__':
    # model path; change to pretrained_models/CosyVoice-300M-25Hz for 25 Hz inference
    # optional flags: load_jit=False, load_trt=False, fp16=False
    cosyvoice = CosyVoice('hub/models/iic/CosyVoice-300M')
    inference_zero_shot_300M(cosyvoice, '今天是個好日子,我們一起去旅游吧')
    inference_cross_lingual_300M(cosyvoice, '今天是個好日子,我們一起去旅游吧')
```
3.2 CosyVoice-300M-Instruct
```python
import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice
import torchaudio


# instruct usage; supports rich-text tags such as <laughter></laughter>,
# <strong></strong>, [laughter], [breath]
def inference_instruct(cosyvoice, tts_text):
    for i, j in enumerate(cosyvoice.inference_instruct(
            tts_text, '中文男',
            'Theo \'Crimson\', is a fiery, passionate rebel leader. Fights with fervor for justice, but struggles with impulsiveness.',
            stream=False)):
        torchaudio.save('asset/cosyvoice-instruct/instruct_{}.wav'.format(i),
                        j['tts_speech'], cosyvoice.sample_rate)


if __name__ == '__main__':
    # the Instruct variant must be loaded for instruction-following synthesis
    cosyvoice = CosyVoice('/hub/models/iic/CosyVoice-300M-Instruct')
    inference_instruct(cosyvoice, '在面對挑戰時,他展現了非凡的<strong>勇氣</strong>與<strong>智慧</strong>。')
```
3.3 CosyVoice-300M-SFT
```python
import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice
import torchaudio


# sft usage: synthesize with a built-in pretrained voice, no reference audio needed
def inference_sft(cosyvoice, tts_text):
    print(cosyvoice.list_available_spks())  # list the built-in speakers
    # change stream=True for chunked streaming inference
    for i, j in enumerate(cosyvoice.inference_sft(tts_text, '中文女', stream=False)):
        torchaudio.save('asset/cosyvoice-sft/sft_{}.wav'.format(i),
                        j['tts_speech'], cosyvoice.sample_rate)


if __name__ == '__main__':
    cosyvoice = CosyVoice('/hub/models/iic/CosyVoice-300M-SFT',
                          load_jit=False, load_trt=False, fp16=False)
    inference_sft(cosyvoice, '今天是個好日子,我們一起去旅游吧')
```
3.4 CosyVoice2-0.5B
```python
import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio


# zero-shot usage
def inference_zero_shot_05B(cosyvoice, tts_text):
    prompt_speech_16k = load_wav('asset/zero_shot_prompt.wav', 16000)
    for i, j in enumerate(cosyvoice.inference_zero_shot(
            tts_text, '希望你以后能夠做的比我還好呦。', prompt_speech_16k, stream=False)):
        torchaudio.save('asset/CosyVoice2-05B/zero_shot_{}.wav'.format(i),
                        j['tts_speech'], cosyvoice.sample_rate)


# fine-grained control; for supported tags, see cosyvoice/tokenizer/tokenizer.py#L248
def inference_cross_lingual_05B(cosyvoice, tts_text):
    prompt_speech_16k = load_wav('asset/zero_shot_prompt.wav', 16000)
    for i, j in enumerate(cosyvoice.inference_cross_lingual(
            tts_text, prompt_speech_16k, stream=False)):
        torchaudio.save('asset/CosyVoice2-05B/fine_grained_control_{}.wav'.format(i),
                        j['tts_speech'], cosyvoice.sample_rate)


# instruct usage
def inference_instruct2_05B(cosyvoice, tts_text):
    prompt_speech_16k = load_wav('asset/zero_shot_prompt.wav', 16000)
    for i, j in enumerate(cosyvoice.inference_instruct2(
            tts_text, '用四川話說這句話', prompt_speech_16k, stream=False)):
        torchaudio.save('asset/CosyVoice2-05B/instruct1_{}.wav'.format(i),
                        j['tts_speech'], cosyvoice.sample_rate)


if __name__ == '__main__':
    cosyvoice = CosyVoice2('/hub/models/iic/CosyVoice2-0.5B',
                           load_jit=False, load_trt=False, fp16=False)
    tts_text = '收到好友從遠方寄來的生日禮物,那份意外的驚喜與深深的祝福讓我心中充滿了甜蜜的快樂,笑容如花兒般綻放。'
    # inference_zero_shot_05B(cosyvoice, tts_text)
    # inference_cross_lingual_05B(cosyvoice, tts_text)
    inference_instruct2_05B(cosyvoice, tts_text)
```
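The demos above write one wav file per generator chunk. When a single output file is wanted, the chunks can be collected and concatenated instead. Below is a minimal sketch with plain lists standing in for audio tensors; `collect_chunks` is a hypothetical helper, and with real CosyVoice output you would concatenate the `j['tts_speech']` tensors (e.g. with `torch.cat`) rather than Python lists.

```python
# Hypothetical helper: gather streamed synthesis chunks into one sample sequence.
# Plain lists are used here as stand-ins for audio tensors, so the sketch
# runs without torch.

def collect_chunks(generator):
    """Concatenate the 'tts_speech' payloads yielded by a synthesis generator."""
    samples = []
    for chunk in generator:
        samples.extend(chunk["tts_speech"])
    return samples

# Simulated generator yielding two chunks of samples.
fake_stream = iter([{"tts_speech": [0.1, 0.2]}, {"tts_speech": [0.3]}])
print(collect_chunks(fake_stream))  # [0.1, 0.2, 0.3]
```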
The demos above are minimal, but the results in practice are already very good. You can use the HTTP interface provided by the CosyVoice framework, or build your own customized service with FastAPI.
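When wrapping CosyVoice behind your own service, it helps to keep request validation separate from synthesis so the two can be tested independently. The sketch below shows only the validation layer for a hypothetical `/tts` endpoint; the endpoint path, field names, and defaults are illustrative assumptions, not part of the CosyVoice HTTP API.

```python
import json

# Hypothetical request schema for a custom TTS endpoint wrapping CosyVoice.
# Field names ("text", "speaker", "stream") are illustrative assumptions.

def parse_tts_request(body: str) -> dict:
    """Validate a JSON request body for a /tts endpoint and apply defaults."""
    data = json.loads(body)
    if not isinstance(data.get("text"), str) or not data["text"].strip():
        raise ValueError("'text' must be a non-empty string")
    return {
        "text": data["text"],
        "speaker": data.get("speaker", "中文女"),  # default to a built-in SFT voice
        "stream": bool(data.get("stream", False)),
    }

req = parse_tts_request('{"text": "今天是個好日子", "stream": true}')
print(req["speaker"])  # 中文女
```

In a FastAPI handler, the returned dict would then be passed to the loaded model (e.g. `inference_sft`) and the resulting audio streamed or returned as a file.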
CosyVoice code repository: https://github.com/FunAudioLLM/CosyVoice.git
CosyVoice2-0.5B on ModelScope: CosyVoice Speech Generation Large Model 2.0-0.5B