Open-Source Models in Practice - Speech-to-Text - Whisper Model - AIGC Application Exploration (Part 5)

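I. Introduction

The previous installment covered how to deploy the Whisper-large-v3-turbo model with vLLM. In practice, however, the model can only process 30 seconds of audio at a time. This installment works through a real business scenario: processing a complete audio recording and generating the corresponding subtitle file.

Related articles:

Open-Source Models in Practice - Speech-to-Text - Whisper Model - AIGC Application Exploration (Part 1)

Open-Source Models in Practice - Speech-to-Text - Whisper Model - AIGC Application Exploration (Part 2)

Open-Source Models in Practice - Speech-to-Text - Whisper Model - AIGC Application Exploration (Part 3)

Open-Source Models in Practice - Speech-to-Text - Whisper Model - AIGC Application Exploration (Part 4)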

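II. Terminology

2.1. Speech-to-Text

Also known as speech recognition or automatic speech recognition (ASR), speech-to-text is a technology that converts spoken audio into written text. It uses computer programs and algorithms to listen to speech input and convert it into readable text output.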

2.2. Whisper-large-v3-turbo

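Whisper-large-v3-turbo is an optimized speech transcription model released by OpenAI in October 2024. It is derived from Whisper large-v3 and aims to balance speed and accuracy. Its key characteristics are as follows:

1. Technical improvements

Fewer decoder layers: reduced from 32 to 4, which significantly lowers computational cost.
Faster transcription: roughly 8x faster than large-v3 according to OpenAI's published relative-speed figures, which puts it close to the tiny model (about 10x) and makes near-real-time use practical.
Inference optimization: can be combined with torch.compile and scaled dot-product attention (F.scaled_dot_product_attention) to speed up inference further and reduce latency.
Parameter count: 809 million, between medium (769 million) and large (1,550 million); the model is about 1.6 GB in size.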
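
The quoted model size can be sanity-checked with a quick back-of-the-envelope calculation. The sketch below assumes the weights are stored as 16-bit floats (2 bytes per parameter), in line with the `--dtype float16` setting used when starting vLLM later; the function name is hypothetical, and the figure ignores activations and runtime overhead.

```python
def weight_size_gb(num_params: int, bytes_per_param: int = 2) -> float:
    """Approximate size of the raw weights in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


turbo_params = 809_000_000  # parameter count quoted above
print(f"{weight_size_gb(turbo_params):.2f} GB")  # 1.62 GB
```

The same function with `bytes_per_param=4` gives the footprint for 32-bit weights, roughly double.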

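2. Performance

Quality retention: on high-quality recordings (such as the FLEURS dataset) it performs close to large-v2, with comparable cross-language capability.
Multilingual support: covers 99 languages, though accuracy is weaker for some of them, such as Thai and Cantonese.
VRAM requirement: about 6 GB, well below the roughly 10 GB needed by the large model, which makes it suitable for deployment on edge devices.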

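3. Application scenarios

Real-time transcription: suited to low-latency scenarios such as meeting notes and live-stream subtitles.
Long-audio processing: supports chunked or sequential algorithms for very long audio, balancing speed and accuracy.
Local deployment: the lightweight design makes it easy to integrate on mobile devices or local servers.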
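
Because a single request covers at most about 30 seconds of audio, a long recording has to be cut into consecutive windows first. A minimal sketch of that boundary calculation follows; the helper name is hypothetical, and milliseconds are used because that is the unit pydub slices by.

```python
from typing import List, Tuple


def chunk_bounds(total_ms: float, chunk_ms: float = 30_000) -> List[Tuple[float, float]]:
    """Return consecutive (start, end) windows covering [0, total_ms]."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    total_ms = float(total_ms)
    bounds = []
    start = 0.0
    while start < total_ms:
        end = min(start + chunk_ms, total_ms)  # the last window may be shorter
        bounds.append((start, end))
        start = end
    return bounds


print(chunk_bounds(75_000))
# [(0.0, 30000.0), (30000.0, 60000.0), (60000.0, 75000.0)]
```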

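4. Integration and usage

Developer friendly: can be called through the Hugging Face Transformers library or OpenAI's official tooling, with Python sample code available.
Transcription focus: the training data contains no translation data, so speech translation is not supported; the model is tuned for pure transcription.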

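5. Comparative advantages

Balance of speed and quality: noticeably faster than large-v3 with very little loss in quality.
Cost-effectiveness: the parameter count is close to medium, but performance is better, which suits resource-constrained scenarios.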

III. Environment Setup

3.1. Base environment

conda create -n test python=3.10
conda activate test
pip install pydub -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install openai -i https://pypi.tuna.tsinghua.edu.cn/simple
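
pydub relies on an external ffmpeg (or libav) binary to decode and encode MP3 files, so it is worth confirming that one is on the PATH before going further. A small check using only the Python standard library (the function name is hypothetical):

```python
import shutil


def find_audio_backend():
    """Return the path of the first decoder binary found on PATH, or None."""
    for binary in ("ffmpeg", "avconv"):
        path = shutil.which(binary)
        if path:
            return path
    return None


backend = find_audio_backend()
print(backend or "No ffmpeg/avconv found; MP3 loading in pydub will likely fail.")
```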


3.2. Download the model

Hugging Face:

https://huggingface.co/openai/whisper-large-v3-turbo/tree/main

ModelScope:

git clone https://www.modelscope.cn/iic/Whisper-large-v3-turbo.git

Download complete (Hugging Face is the recommended source).

IV. Implementation

4.1. Start the vLLM service

vllm serve /data/model/whisper-large-v3-turbo  --swap-space 16 --disable-log-requests --max-num-seqs 256 --host 0.0.0.0 --port 9000  --dtype float16 --max-parallel-loading-workers 1  --max-model-len 448 --enforce-eager --gpu-memory-utilization 0.99 --task transcription

[Screenshot: service startup output]

[Screenshot: GPU usage]
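
Model loading can take a while after `vllm serve` is launched, so client scripts may want to wait until the HTTP endpoint responds. The sketch below polls the OpenAI-compatible `/v1/models` route with the standard library; the function name, timeouts, and polling interval are illustrative assumptions, and the host and port in the usage comment mirror the command above.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str, timeout_s: float = 60.0,
                    interval_s: float = 2.0, request_timeout_s: float = 5.0) -> bool:
    """Poll GET {base_url}/models until it returns HTTP 200 or time runs out."""
    deadline = time.monotonic() + timeout_s
    url = base_url.rstrip("/") + "/models"
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=request_timeout_s) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet; retry after a short pause
        time.sleep(interval_s)
    return False


# Example call (assumes the service from 4.1 is listening on port 9000):
# wait_for_server("http://127.0.0.1:9000/v1")
```

Returning a boolean rather than raising keeps the decision about what to do on timeout with the caller.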

4.2. Define the STT utility class

This class sends requests to the privately deployed speech-to-text service.

# -*- coding: utf-8 -*-
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://127.0.0.1:9000/v1"
model = "/data/model/whisper-large-v3-turbo"
language = "en"
response_format = "json"
temperature = 0.0


class STT:
    def __init__(self):
        # Client pointed at the local vLLM OpenAI-compatible endpoint
        self.client = OpenAI(
            api_key=openai_api_key,
            base_url=openai_api_base,
        )

    def request(self, audio_path):
        # Send one audio file and return the transcribed text
        with open(str(audio_path), "rb") as f:
            transcription = self.client.audio.transcriptions.create(
                file=f,
                model=model,
                language=language,
                response_format=response_format,
                temperature=temperature,
            )
        if transcription:
            return transcription.text
        else:
            return ''


if __name__ == '__main__':
    audio_path = r'E:\temp\0.mp3'
    stt = STT()
    text = stt.request(audio_path)
    print(f'text: {text}')

[Screenshot: call result]

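4.3. Split the audio and generate the subtitle file

Requirements:

Subtitle data is aggregated per minute.
The subtitle file is saved in JSON format, with each entry structured as follows: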

{
    "time_begin": 0.0,
    "time_end": 60000.0,
    "text": "Hello World,Hello World,Hello World,Hello World,Hello World!"
}
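
The `time_begin` and `time_end` values are in milliseconds. As a small illustration of how an entry might be rendered for a quick visual check, the sketch below converts them to `HH:MM:SS`; the helper name is hypothetical and the entry is a shortened copy of the sample above.

```python
def format_ms(ms: float) -> str:
    """Convert a millisecond offset into an HH:MM:SS string."""
    total_seconds = int(ms // 1000)
    hours, rem = divmod(total_seconds, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


entry = {"time_begin": 0.0, "time_end": 60000.0, "text": "Hello World"}
print(f"[{format_ms(entry['time_begin'])} - {format_ms(entry['time_end'])}] {entry['text']}")
# [00:00:00 - 00:01:00] Hello World
```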

import json
import os.path

from pydub import AudioSegment

from com.ai.uitl.stt_util import STT

stt = STT()


def create_directory_if_not_exists(directory_path):
    # Check whether the directory exists
    if not os.path.exists(directory_path):
        try:
            # Create the directory
            os.makedirs(directory_path)
            print(f"Directory '{directory_path}' created.")
        except Exception as e:
            print(f"Error while creating directory '{directory_path}': {e}")
    else:
        print(f"Directory '{directory_path}' already exists.")


def split(file_name, input_dir, output_dir, duration, json_file_output):
    create_directory_if_not_exists(output_dir)
    input_path = os.path.join(input_dir, file_name)

    # Load the audio file
    audio = AudioSegment.from_file(input_path, format="mp3")

    # Total length of the audio, converted to milliseconds
    duration_seconds = audio.duration_seconds
    duration_milliseconds = duration_seconds * 1000

    start_time, end_time = 0.00, 0.00
    index = 0
    text = ''
    all_objs = []
    one_minute_obj = {}

    # Walk through the audio in windows of `duration` milliseconds
    while end_time < duration_milliseconds:
        start_time = end_time
        end_time = start_time + duration
        if end_time > duration_milliseconds:
            end_time = duration_milliseconds

        # Cut the current window out of the audio
        cropped_audio = audio[start_time:end_time]
        output_file_name = f'{file_name}_{index}.mp3'
        output_path = os.path.join(output_dir, output_file_name)

        # Save the cut segment
        cropped_audio.export(output_path, format="mp3")

        result = index % 2
        if result == 0:
            # Even index: first half of a one-minute entry
            text = stt.request(output_path)
            one_minute_obj['time_begin'] = start_time
        else:
            # Odd index: second half, so the entry is complete
            text = text + stt.request(output_path)
            one_minute_obj['time_end'] = end_time
            one_minute_obj['text'] = text
            all_objs.append(one_minute_obj)
            one_minute_obj = {}

        index += 1

    # An odd number of segments leaves one half-filled entry to flush
    result = index % 2
    if result != 0:
        one_minute_obj['text'] = text
        one_minute_obj['time_end'] = end_time
        all_objs.append(one_minute_obj)

    # Write the JSON data to the output file
    with open(json_file_output, 'w', encoding='utf-8') as json_file:
        json.dump(all_objs, json_file, ensure_ascii=False, indent=4)


if __name__ == '__main__':
    file_arr = ['1277.mp3', '1279.mp3']
    input_dir = r"E:\temp"
    for file_name in file_arr:
        temp_json_file_name = file_name + '_字幕文件.json'
        output_dir = r"E:\temp\output"
        output_dir = os.path.join(output_dir, file_name)
        json_file_output = os.path.join(output_dir, temp_json_file_name)
        # 30000.00 ms = one 30-second segment per request
        split(file_name, input_dir, output_dir, 30000.00, json_file_output)
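
The pairing logic in `split` (two 30-second segments per entry) can also be expressed as a standalone step that merges already-transcribed segments into fixed windows. The sketch below is one possible way to do that, not the code used above: each segment is assumed to be a `(start_ms, end_ms, text)` tuple, and the names `merge_chunks` and `window_ms` are illustrative.

```python
from typing import Dict, List, Tuple

Chunk = Tuple[float, float, str]  # (start_ms, end_ms, text); assumed layout


def merge_chunks(chunks: List[Chunk], window_ms: float = 60_000) -> List[Dict]:
    """Group consecutive chunks into entries that span at most window_ms."""
    entries: List[Dict] = []
    current = None
    for start, end, text in chunks:
        # Open a new entry when none is open or this chunk would overflow it.
        if current is None or end - current["time_begin"] > window_ms:
            if current is not None:
                entries.append(current)
            current = {"time_begin": start, "time_end": end, "text": text}
        else:
            current["time_end"] = end
            current["text"] += text
    if current is not None:
        entries.append(current)  # flush the last, possibly shorter, entry
    return entries


demo = [(0.0, 30000.0, "a"), (30000.0, 60000.0, "b"), (60000.0, 75000.0, "c")]
print(merge_chunks(demo))
```

Keeping the merge separate from the transcription calls makes it possible to change the window size without re-running the speech-to-text requests.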
