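Introduction to SparkTTS
Spark-TTS is a new system built on BiCodec, which was proposed by the SparkAudio team. BiCodec is a single-stream speech codec that strategically decomposes speech into two complementary token types: low-bitrate semantic tokens for linguistic content, and fixed-length global tokens for speaker-specific attributes. This disentangled representation, combined with the Qwen2.5 LLM and a chain-of-thought (CoT) generation approach, enables both coarse-grained attribute control (e.g., gender, pitch level) and fine-grained parameter adjustment (e.g., precise pitch values, speaking rate).
It was developed by a team from the Hong Kong University of Science and Technology, Shanghai Jiao Tong University, Nanyang Technological University, and other institutions. Unlike MaskGCT from the Chinese University of Hong Kong, SparkTTS is built on a large language model.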
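To make the two token types a little more concrete, here is a small illustrative sketch in Python. It is not BiCodec's actual API: the class `SpeechTokens`, the function `with_speaker`, and all token values are hypothetical. It only models the idea that linguistic content and speaker identity live in separate token sequences, so one can be swapped without touching the other.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SpeechTokens:
    """Hypothetical container for a disentangled speech representation."""
    semantic: Tuple[int, ...]       # variable length, grows with the utterance
    global_tokens: Tuple[int, ...]  # fixed length, describes the speaker


def with_speaker(content: SpeechTokens, reference: SpeechTokens) -> SpeechTokens:
    """Keep the linguistic content of `content`, take the voice from `reference`."""
    if len(content.global_tokens) != len(reference.global_tokens):
        raise ValueError("global token sequences must have the same fixed length")
    return SpeechTokens(content.semantic, reference.global_tokens)


# Made-up token ids, purely for illustration.
utterance = SpeechTokens(semantic=(12, 7, 7, 93, 4), global_tokens=(1, 2, 3, 4))
target_voice = SpeechTokens(semantic=(55, 8), global_tokens=(9, 9, 8, 7))

converted = with_speaker(utterance, target_voice)
print(converted.semantic)       # content tokens are unchanged
print(converted.global_tokens)  # speaker tokens come from the reference
```

In the real system the tokens are produced and consumed by neural networks, so this is only a mental model of why voice cloning and attribute control can be handled separately from the text content.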
SparkTTS architecture
MaskGCT architecture
Demo site
You can try the model out on the following site.
Spark TTS - Text-to-Speech AI Model
Windows installation
1. Download Spark-TTS
- Go to the Spark-TTS GitHub page.
- Click "Code" > "Download ZIP", then extract it.
2. Create a Conda environment
conda create -n sparktts python=3.12 -y
conda activate sparktts
3. Install Dependencies
pip install -r requirements.txt
4. Install PyTorch (Auto-Detect CUDA or CPU)
I use an RTX 4080 graphics card with CUDA 12.4, so the PyTorch build I installed is 2.5.1+cu124.
Download CUDA 12.4.
Install the PyTorch +cu124 build:
conda install pytorch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 pytorch-cuda=12.4 -c pytorch -c nvidia
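After the install it is worth confirming that PyTorch actually sees the GPU. The following is a minimal sketch; the helper name `describe_torch` is my own, and it imports PyTorch lazily so the script still runs (and simply reports `installed: False`) in an environment where PyTorch is missing.

```python
import importlib.util


def describe_torch():
    """Return a small report about the installed PyTorch build."""
    if importlib.util.find_spec("torch") is None:
        return {"installed": False}
    import torch  # imported lazily so the script also runs without PyTorch

    return {
        "installed": True,
        "version": torch.__version__,
        "cuda_build": torch.version.cuda,  # None for CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }


report = describe_torch()
print(report)
```

If `cuda_available` comes back `False` on a machine with an NVIDIA card, the usual suspects are a CPU-only PyTorch build or a driver that is too old for the CUDA version.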
5. Download the Model
mkdir pretrained_models
git clone https://huggingface.co/SparkAudio/Spark-TTS-0.5B pretrained_models/Spark-TTS-0.5B
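Large files in Hugging Face repositories are typically stored with Git LFS. If `git lfs` is not installed, the clone can finish with small text pointer files in place of the real weights, and the model then fails to load. The sketch below scans the model directory for such pointer stubs. The function name `find_lfs_pointers` is hypothetical; the prefix it looks for is the first line of every Git LFS pointer file.

```python
from pathlib import Path

LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def find_lfs_pointers(model_dir):
    """Return files under model_dir that are still Git LFS pointer stubs."""
    root = Path(model_dir)
    if not root.is_dir():
        return []
    stubs = []
    for path in sorted(root.rglob("*")):
        # Skip Git's own bookkeeping and anything that is not a regular file.
        if ".git" in path.parts or not path.is_file():
            continue
        with path.open("rb") as handle:
            if handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX:
                stubs.append(path)
    return stubs


for stub in find_lfs_pointers("pretrained_models/Spark-TTS-0.5B"):
    print("not downloaded yet:", stub)
```

An empty result means no pointer stubs were found. If files are listed, installing Git LFS and running `git lfs pull` inside the cloned directory should fetch the real content.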
Problems encountered
Running python webUI.py produced the following error:
variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://www.intel.com/software/products/support/.
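Solutions
1. Delete libiomp5md.dll
D:\Users\Yao\anaconda3\Library\bin\libiomp5md.dll
2. Set a temporary environment variable: KMP_DUPLICATE_LIB_OK=TRUE
set KMP_DUPLICATE_LIB_OK=TRUE
I also set this variable permanently in the Windows environment variable settings.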
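The same workaround can be applied from inside a Python script, as long as the variable is set before anything that loads OpenMP (torch, numpy, and so on) is imported. The sketch below does that and also lists the copies of libiomp5md.dll in the active environment, which helps to see where the duplicate comes from. The helper `find_openmp_copies` is hypothetical, and the two folders it searches are an assumption about a typical Conda layout on Windows.

```python
import os
import sys
from pathlib import Path

# Must run before importing torch, numpy, or anything else that loads OpenMP.
# As the error message itself warns, this is a workaround and may hide real problems.
os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"


def find_openmp_copies(roots, name="libiomp5md.dll"):
    """List every copy of the Intel OpenMP runtime below the given folders."""
    copies = []
    for root in roots:
        root = Path(root)
        if root.is_dir():
            copies.extend(p for p in root.rglob(name) if p.is_file())
    return sorted(copies)


# Assumed locations: Conda's own libraries and the installed Python packages.
search_roots = [
    Path(sys.prefix) / "Library",
    Path(sys.prefix) / "Lib" / "site-packages",
]
for dll in find_openmp_copies(search_roots):
    print(dll)
```

If more than one path is printed, two packages each ship their own OpenMP runtime, which is the situation the error message describes.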
Results
The output quality is noticeably better than MaskGCT, and synthesis is fast.
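To put a number on "fast", one option is the real-time factor: the time spent synthesising divided by the duration of the audio produced. The following is a minimal sketch using only the standard library. The function names are my own, and it assumes the output is a PCM-encoded WAV file, which is what Python's `wave` module can read.

```python
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file, computed from its header."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def real_time_factor(elapsed_seconds, wav_path):
    """Synthesis time divided by audio duration; below 1.0 is faster than real time."""
    duration = wav_duration_seconds(wav_path)
    if duration <= 0:
        raise ValueError("audio file is empty")
    return elapsed_seconds / duration
```

In practice one would take `time.perf_counter()` before and after the synthesis call and pass the difference, together with the path of the saved file, to `real_time_factor`.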
Calling SparkTTS from Python
The script below is my rewrite of how SparkTTS is called from Python:
from datetime import datetime
import os
import soundfile as sf
import torch
import logging
from cli.SparkTTS import SparkTTS
from sparktts.utils.token_parser import LEVELS_MAP_UI


# Initialize model
def initialize_model(model_dir="pretrained_models/Spark-TTS-0.5B", device=0):
    """Load the model once at the beginning."""
    logging.info(f"Loading model from: {model_dir}")
    device = torch.device(f"cuda:{device}")
    model = SparkTTS(model_dir, device)
    return model


def run_tts(
    text,
    model,
    prompt_text=None,
    prompt_speech=None,
    gender=None,
    pitch=None,
    speed=None,
    save_dir="example/results",
):
    """Perform TTS inference and save the generated audio."""
    logging.info(f"Saving audio to: {save_dir}")

    if prompt_text is not None:
        prompt_text = None if len(prompt_text) <= 1 else prompt_text

    # Ensure the save directory exists
    os.makedirs(save_dir, exist_ok=True)

    # Generate unique filename using timestamp
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    save_path = os.path.join(save_dir, f"{timestamp}.wav")

    logging.info("Starting inference...")

    # Perform inference and save the output audio
    with torch.no_grad():
        wav = model.inference(
            text,
            prompt_speech,
            prompt_text,
            gender,
            pitch,
            speed,
        )
        sf.write(save_path, wav, samplerate=16000)

    logging.info(f"Audio saved at: {save_path}")
    return save_path


# Define callback function for voice cloning
def voice_clone(text, prompt_text, prompt_wav_upload, prompt_wav_record):
    """
    Gradio callback to clone voice using text and optional prompt speech.
    - text: The input text to be synthesised.
    - prompt_text: Additional textual info for the prompt (optional).
    - prompt_wav_upload/prompt_wav_record: Audio files used as reference.
    """
    prompt_speech = prompt_wav_upload if prompt_wav_upload else prompt_wav_record
    # Treat a missing or near-empty transcript as "no prompt text"
    prompt_text_clean = None if prompt_text is None or len(prompt_text) < 2 else prompt_text
    audio_output_path = run_tts(
        text,
        model,
        prompt_text=prompt_text_clean,
        prompt_speech=prompt_speech,
    )
    return audio_output_path


# Define callback function for creating new voices
def voice_creation(text, gender, pitch, speed):
    """
    Gradio callback to create a synthetic voice with adjustable parameters.
    - text: The input text for synthesis.
    - gender: 'male' or 'female'.
    - pitch/speed: Ranges mapped by LEVELS_MAP_UI.
    """
    pitch_val = LEVELS_MAP_UI[int(pitch)]
    speed_val = LEVELS_MAP_UI[int(speed)]
    audio_output_path = run_tts(
        text,
        model,
        gender=gender,
        pitch=pitch_val,
        speed=speed_val,
    )
    return audio_output_path


# model_dir must be defined before the model is loaded
model_dir = "pretrained_models/Spark-TTS-0.5B"
device = 0
model = initialize_model(model_dir, device=device)
text="僅僅懂得應用科學本身是不夠的!對人類本身及其命運的關心必然總是培養出努力學習各種技術的興趣;對尚未解決的物質起源和商品分配的問題的關心——為了我們思想意識的建立,將會給整個人類帶來幸福而不是災難。"
# Raw strings keep the backslashes in Windows paths from being read as escape sequences
#prompt_wav_upload=r"E:\yao2025\Spark-TTS-main\src\demos\魯豫\luyu_zh.wav"
prompt_wav_upload=r"E:\yao2025\yaoaudio.wav"
prompt_text="朋友們,今天我要對你們說,盡管眼下困難重重,但我依然懷有一個夢。這個夢深深植根于美國夢之中。我夢想有一天,這個國家將會奮起,實現其立國信條的真諦,我們認為這些真理不言而喻:人人生而平等。我夢想有一天,在佐治亞洲的紅色山崗上,昔日奴隸的兒子能夠同昔日奴隸主的兒子同席而坐,親如手足。"
prompt_wav_record=None
print("TTS ....")
audio_output_path=voice_clone(text, prompt_text, prompt_wav_upload, prompt_wav_record)
"""
pitch: pitch level
speed: speaking speed
Both are given as a level from 1 to 5 and looked up in the following map:
LEVELS_MAP_UI = {1: 'very_low', 2: 'low', 3: 'moderate', 4: 'high', 5: 'very_high'}
"""
#audio_output_path=voice_creation(text,"female","5","5")
print(audio_output_path)
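Because `voice_creation` indexes `LEVELS_MAP_UI` directly, a level outside 1 to 5 fails with a bare `KeyError`. The following is a minimal sketch of a more forgiving lookup. `LEVELS` is a local copy of the map quoted above so the snippet runs on its own, and `level_name` is a hypothetical helper, not part of Spark-TTS.

```python
LEVELS = {1: "very_low", 2: "low", 3: "moderate", 4: "high", 5: "very_high"}


def level_name(level):
    """Convert a UI level such as 5 or "5" to its label, with a clear error."""
    try:
        key = int(level)
    except (TypeError, ValueError):
        raise ValueError(f"level must be an integer from 1 to 5, got {level!r}") from None
    if key not in LEVELS:
        raise ValueError(f"level must be an integer from 1 to 5, got {level!r}")
    return LEVELS[key]


print(level_name("5"))  # very_high
print(level_name(3))    # moderate
```

With a helper like this, the pitch and speed arguments could be validated before the model is called, so a typo shows up as a readable message instead of a stack trace from deep inside the script.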