[Faster-Whisper] Offline speech recognition for local videos and subtitle generation
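- 1 Preface
- 2 Tools
- 2.1 ffmpeg media converter
- 2.1.1 Background
- Overview
- Documentation
- 2.1.2 Installation
- Windows installation
- Python installation
- 2.1.3 Inspecting files
- Viewing the format and codecs of audio/video files
- 2.1.4 Video processing
- Converting video formats
- Setting the video bitrate
- Trimming video
- 2.1.5 Audio processing
- Extracting audio from video
- Converting audio formats
- GPU acceleration
- 2.2 faster-whisper speech recognition model
- 2.2.1 Background
- Reference documentation
- Model comparison
- 2.2.2 Installation
- Windows installation
- Python installation
- Model download
- 2.2.3 whisper-faster parameters
- All parameters
- All parameters (Chinese translation)
- Performance tuning
- Subtitle length
- 2.2.4 Examples
- small
- large-v2
- large-v2 on a whole folder
- Showing progress
- 2.2.5 Errors
- CUDA Out of Memory
- 3 Hands-on demo
- 3.1 Windows-only demo
- Full workflow
- Test machine environment
- Extracting audio from the video with ffmpeg
- Generating subtitles with faster-whisper
- 3.2 Addendum
- Batch audio extraction with Python
- 4 Summary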
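1 Preface
I usually study from videos that I have downloaded for offline viewing, and these videos generally have no subtitles. I happened to notice PotPlayer's subtitle-generation feature (生成有聲字幕), which uses the faster-whisper model, so I decided to try the model on its own.
Faster-Whisper is a speech recognition model that converts audio to text. A tool for extracting the audio from a video is also needed: ffmpeg.
The plan is therefore to use ffmpeg to extract the audio from the video, then hand the audio to Faster-Whisper to turn it into subtitles.
For tool installation, see 2 Tools. [I work on Windows, so only the Windows builds are required; to drive the tools from Python, install the Python packages as well.]
For a demonstration of generating subtitles for a video, see 3 Hands-on demo.
Full workflow:
- Install ffmpeg
- Download faster-whisper
- Download a faster-whisper model
- Use ffmpeg to extract the audio from the video
- Run faster-whisper with the chosen model to perform speech recognition and generate subtitles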
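The last two steps of the workflow can be sketched as two external commands assembled in Python. This is only a minimal sketch under stated assumptions: the helper names `extract_audio_cmd`, `transcribe_cmd` and `run_pipeline` are hypothetical, and writing the MP3 next to the video is an illustrative choice. The ffmpeg flags (`-i`, `-vn`, `-acodec mp3`) and the whisper-faster.exe options (`--model`, `--output_format`) are the ones that appear later in this article.

```python
import subprocess
from pathlib import Path


def extract_audio_cmd(video: str) -> list[str]:
    """Build the ffmpeg command that extracts an MP3 track from a video."""
    audio = str(Path(video).with_suffix(".mp3"))  # assumption: same folder, .mp3 suffix
    # -vn drops the video stream; -acodec mp3 encodes the audio as MP3.
    return ["ffmpeg", "-i", video, "-vn", "-acodec", "mp3", audio]


def transcribe_cmd(audio: str, model: str = "small") -> list[str]:
    """Build the whisper-faster.exe command that writes an .srt subtitle."""
    return ["whisper-faster.exe", audio, "--model", model, "--output_format", "srt"]


def run_pipeline(video: str, model: str = "small") -> None:
    """Run both steps; both executables must be reachable from the current shell."""
    extract = extract_audio_cmd(video)
    subprocess.run(extract, check=True)
    subprocess.run(transcribe_cmd(extract[-1], model), check=True)
```

Calling `run_pipeline("lecture.mp4", "large-v2")` (with a hypothetical file name) would run ffmpeg first and then the recognizer on the extracted audio.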
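2 Tools
2.1 ffmpeg media converter
2.1.1 Background
Overview
ffmpeg is a universal media converter. It can read a wide variety of inputs, including live grabbing/recording devices, filter them, and transcode them into many output formats.
FFmpeg is a cross-platform, open-source multimedia framework for recording, converting, and streaming audio and video. It supports nearly all mainstream audio and video formats (both codecs and container formats) and provides a rich set of filters, effects, and processing functions. It is widely used for video editing, streaming services, format conversion, and audio/video analysis.
Documentation
Official
ffmpeg Documentation
Third-party
完整的 FFmpeg 命令使用教程_ffmpeg使用教程-CSDN博客
FFmpeg教程(超級詳細版) - 個人文章 - SegmentFault 思否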
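2.1.2 Installation
Windows installation
- Download from the official site: Download FFmpeg
- Direct link to the Windows builds page: Builds - CODEX FFMPEG @ gyan.dev
- Extract the archive after downloading
- Add the bin directory to the PATH environment variable
  - GUI: add the entry by hand in the system environment variable dialog
  - Command line:
    - CMD (must be run as administrator)
      - setx PATH "%PATH%;D:\ffmpeg-2025-05-12-git-full_build\bin" /M
      - setx: command-line tool for setting environment variables.
      - PATH: name of the environment variable to set.
      - %PATH%;D:\ffmpeg-2025-05-12-git-full_build\bin: the new value. %PATH% expands to the current PATH value, and D:\ffmpeg-2025-05-12-git-full_build\bin is the directory being added.
      - /M: modify the system environment variable (applies to all users). Without /M, only the current user's variable is changed.
      - Note: setx truncates values longer than 1024 characters, so check the length of the current PATH before using this approach.
    - PowerShell
      - The equivalent PowerShell command is more involved and is not recorded here; the CMD version is simpler. Do not run the CMD command above in PowerShell: PowerShell does not expand %PATH%, so the existing PATH value would be overwritten.
      - CMD and PowerShell are different shells with different syntax; make sure each command is run in the right one.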
setx PATH "%PATH%;D:\ffmpeg-2025-05-12-git-full_build\bin" /M
# Session
PS C:\Users\h1369> echo %Path%
PS C:\Users\h1369> setx PATH "%PATH%;D:\ffmpeg-2025-05-12-git-full_build\bin" /M
SUCCESS: Specified value was saved.
PS C:\Users\h1369>
Open a new command prompt and verify:
ffmpeg -version
# Session
Windows PowerShell
Copyright (C) Microsoft Corporation. All rights reserved.
Install the latest PowerShell for new features and improvements! https://aka.ms/PSWindows
PS C:\Users\h1369> ffmpeg -version
ffmpeg version 2025-05-12-git-8ce32a7cbb-full_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers
built with gcc 15.1.0 (Rev2, Built by MSYS2 project)
configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-lcms2 --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-libsnappy --enable-zlib --enable-librist --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-libbluray --enable-libcaca --enable-libdvdnav --enable-libdvdread --enable-sdl2 --enable-libaribb24 --enable-libaribcaption --enable-libdav1d --enable-libdavs2 --enable-libopenjpeg --enable-libquirc --enable-libuavs3d --enable-libxevd --enable-libzvbi --enable-libqrencode --enable-librav1e --enable-libsvtav1 --enable-libvvenc --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxavs2 --enable-libxeve --enable-libxvid --enable-libaom --enable-libjxl --enable-libvpx --enable-mediafoundation --enable-libass --enable-frei0r --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-liblensfun --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-dxva2 --enable-d3d11va --enable-d3d12va --enable-ffnvcodec --enable-libvpl --enable-nvdec --enable-nvenc --enable-vaapi --enable-libshaderc --enable-vulkan --enable-libplacebo --enable-opencl --enable-libcdio --enable-libgme --enable-libmodplug --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libshine --enable-libtheora --enable-libtwolame --enable-libvo-amrwbenc --enable-libcodec2 --enable-libilbc --enable-libgsm --enable-liblc3 --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-ladspa --enable-libbs2b --enable-libflite --enable-libmysofa --enable-librubberband --enable-libsoxr --enable-chromaprint
libavutil 60. 2.100 / 60. 2.100
libavcodec 62. 3.101 / 62. 3.101
libavformat 62. 0.102 / 62. 0.102
libavdevice 62. 0.100 / 62. 0.100
libavfilter 11. 0.100 / 11. 0.100
libswscale 9. 0.100 / 9. 0.100
libswresample 6. 0.100 / 6. 0.100
Exiting with exit code 0
PS C:\Users\h1369>
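The same check can be made from Python before starting any processing. The sketch below uses only the standard library; the function names `find_tool` and `tool_version` are hypothetical, and it assumes the tool accepts `-version` as ffmpeg does above.

```python
import shutil
import subprocess
from typing import Optional


def find_tool(name: str) -> Optional[str]:
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


def tool_version(name: str) -> Optional[str]:
    """Return the first line printed by `<name> -version`, or None if the tool is missing."""
    path = find_tool(name)
    if path is None:
        return None
    result = subprocess.run([path, "-version"], capture_output=True, text=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None
```

If `tool_version("ffmpeg")` returns None, the PATH change has not taken effect in the current session yet.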
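Python installation
pip install ffmpeg-python
Python example
import ffmpeg

# Paths of the input video and the output audio file
video_file = r"D:\BaiduNetdiskDownload\數據遷移原理.ts"
audio_file = r"D:\BaiduNetdiskDownload\數據遷移原理.wav"

# Step 1: convert the .ts video to an audio file with ffmpeg-python
# Create the ffmpeg input stream
input_stream = ffmpeg.input(video_file)

# Configure the output stream: 16-bit PCM, 44.1 kHz, stereo
output_stream = input_stream.output(audio_file, acodec='pcm_s16le', ar='44100', ac='2')

# Run the conversion (ffmpeg-python calls the ffmpeg executable, which must be on PATH)
output_stream.run()
print('Converted the .ts video to an audio file')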
2.1.3 Inspecting files
Viewing the format and codecs of audio/video files
Both methods show:
- Video codec, color space, resolution, frame rate
- Audio codec, sample rate, channels, audio bitrate
Method 1: ffprobe (detailed)
ffprobe -i filename
Method 2: ffmpeg (concise)
ffmpeg -i filename
# Annotated output
PS C:\Users\h1369> ffprobe -i "D:\BaiduNetdiskDownload\HyperCDP技術.ts"
# FFmpeg version and build information
ffprobe version 2025-05-12-git-8ce32a7cbb-full_build-www.gyan.dev Copyright (c) 2007-2025 the FFmpeg developers
  built with gcc 15.1.0 (Rev2, Built by MSYS2 project)
  configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads 。。。。。
  libavutil 60. 2.100 / 60. 2.100
  libavcodec 62. 3.101 / 62. 3.101
  libavformat 62. 0.102 / 62. 0.102
  libavdevice 62. 0.100 / 62. 0.100
  libavfilter 11. 0.100 / 11. 0.100
  libswscale 9. 0.100 / 9. 0.100
  libswresample 6. 0.100 / 6. 0.100
# Basic media file information
# mpegts: container format (MPEG-2 transport stream, common in live streaming and broadcast TV)
Input #0, mpegts, from 'D:\BaiduNetdiskDownload\HyperCDP技術.ts':
  # Total duration (Duration), start time (start), overall bitrate (bitrate)
  Duration: 03:01:03.65, start: 1.513556, bitrate: 558 kb/s
  Program 1
    Metadata:
      service_name : Service01
      service_provider: FFmpeg
  # Video stream
  # Video codec (Video), color space yuv420p, resolution 1728x1080, frame rate 25 fps, time base 90k tbn
  Stream #0:0[0x100]: Video: h264 (High) ([27][0][0][0] / 0x001B), yuv420p(tv, bt709/bt709/iec61966-2-1, progressive), 1728x1080 [SAR 1:1 DAR 8:5], 25 fps, 25 tbr, 90k tbn, Start 1.560000
  # Audio stream
  # Audio codec (Audio), sample rate 44100 Hz, channels stereo, audio bitrate 153 kb/s
  Stream #0:1[0x101](und): Audio: aac (LC) ([15][0][0][0] / 0x000F), 44100 Hz, stereo, fltp, 153 kb/s, Start 1.513556
PS C:\Users\h1369>
PS C:\Users\h1369> ffmpeg -i "D:\BaiduNetdiskDownload\HyperCDP技術.ts"
ffmpeg version 2025-05-12-git-8ce32a7cbb-full_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers
  built with gcc 15.1.0 (Rev2, Built by MSYS2 project)
  configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --。。。。。。
  libavutil 60. 2.100 / 60. 2.100
  libavcodec 62. 3.101 / 62. 3.101
  libavformat 62. 0.102 / 62. 0.102
  libavdevice 62. 0.100 / 62. 0.100
  libavfilter 11. 0.100 / 11. 0.100
  libswscale 9. 0.100 / 9. 0.100
  libswresample 6. 0.100 / 6. 0.100
Input #0, mpegts, from 'D:\BaiduNetdiskDownload\HyperCDP技術.ts':
  Duration: 03:01:03.65, start: 1.513556, bitrate: 558 kb/s
  Program 1
    Metadata:
      service_name : Service01
      service_provider: FFmpeg
  # Video stream
  Stream #0:0[0x100]: Video: h264 (High) ([27][0][0][0] / 0x001B), yuv420p(tv, bt709/bt709/iec61966-2-1, progressive), 1728x1080 [SAR 1:1 DAR 8:5], 25 fps, 25 tbr, 90k tbn, Start 1.560000
  # Audio stream
  Stream #0:1[0x101](und): Audio: aac (LC) ([15][0][0][0] / 0x000F), 44100 Hz, stereo, fltp, 153 kb/s, Start 1.513556
At least one output file must be specified
PS C:\Users\h1369>
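When the stream information is needed inside a script, ffprobe can emit JSON instead of the human-readable report. The following is a sketch rather than the author's method: `probe` and `summarize` are hypothetical helper names, and the field names (`codec_type`, `codec_name`, `width`, `height`, `pix_fmt`, `sample_rate`, `channels`) are the keys ffprobe writes with `-print_format json -show_streams`.

```python
import json
import subprocess


def probe(path: str) -> dict:
    """Run ffprobe and return its JSON report (requires ffprobe on PATH)."""
    cmd = ["ffprobe", "-v", "error", "-print_format", "json",
           "-show_format", "-show_streams", path]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return json.loads(out)


def summarize(report: dict) -> list[str]:
    """Turn an ffprobe JSON report into one summary line per audio/video stream."""
    lines = []
    for s in report.get("streams", []):
        if s.get("codec_type") == "video":
            lines.append(
                f"video {s.get('codec_name')} {s.get('width')}x{s.get('height')} {s.get('pix_fmt')}"
            )
        elif s.get("codec_type") == "audio":
            lines.append(
                f"audio {s.get('codec_name')} {s.get('sample_rate')} Hz {s.get('channels')} ch"
            )
    return lines
```

`summarize(probe(path))` would then give a compact view of the same codec, resolution, and sample-rate values shown in the console output above.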
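2.1.4 Video processing
Converting video formats
Example: .ts to MP4
Method 1: let ffmpeg pick its default encoders for the output format
ffmpeg -i input.ts output.mp4
ffmpeg -i "HyperCDP技術.ts" "HyperCDP技術.mp4"
Method 2: compatible conversion (re-encode explicitly to standard MP4 codecs)
ffmpeg -i input.ts -c:v libx264 -crf 23 -c:a aac -b:a 128k output.mp4
# Parameters
-c:v libx264 # re-encode the video as H.264 (the most widely compatible video codec for MP4)
-crf 23 # quality control (default 23; lower values give better quality and larger files)
-c:a aac # re-encode the audio as AAC (the standard MP4 audio codec)
-b:a 128k # set the audio bitrate to 128 kbps (a balance between quality and size)
# Optional extras (not used in the command above)
-progress pipe:1 # write machine-readable progress information to stdout during conversion
-bsf:v h264_mp4toannexb # bitstream filter that converts H.264 from the MP4 (length-prefixed) form to Annex B; it applies when going from MP4 to .ts, not from .ts to MP4
-copyts # keep the original input timestamps instead of shifting them to start at zero
Setting the video bitrate
# Set the video bitrate of the output file to 64 kbit/s:
ffmpeg -i input.avi -b:v 64k -bufsize 64k output.mp4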
Trimming video
FFmpeg can also cut a clip out of a video.
For example, to extract the segment between 00:00:30 and 00:00:50:
ffmpeg -ss 00:00:30 -to 00:00:50 -i input.mp4 output.mp4
ffmpeg -ss 00:00:30 -to 00:00:50 -i "[新聞30分]國內簡訊-1.mp4" "[新聞30分]國內簡訊-裁剪后.mp4"
# Parameters
-ss 00:00:30 # start the clip at 00:00:30
-to 00:00:50 # end the clip at 00:00:50
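The length of the resulting clip is simply the end timestamp minus the start timestamp. As a small worked sketch (the function names `to_seconds` and `clip_length` are hypothetical, and only the HH:MM:SS[.fraction] form used above is handled):

```python
def to_seconds(ts: str) -> float:
    """Convert an HH:MM:SS[.fraction] timestamp to seconds."""
    hours, minutes, seconds = ts.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def clip_length(start: str, end: str) -> float:
    """Length in seconds of the clip selected by -ss <start> -to <end>."""
    length = to_seconds(end) - to_seconds(start)
    if length <= 0:
        raise ValueError("end must be later than start")
    return length


print(clip_length("00:00:30", "00:00:50"))  # 20.0
```

Checking the length up front catches swapped `-ss` and `-to` values before ffmpeg is invoked.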
2.1.5 Audio processing
Extracting audio from video
ffmpeg -i input_video -vn -acodec mp3 output_audio
ffmpeg -i "D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp4" -vn -acodec mp3 "D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp3"
-vn: disable the video stream so that only audio is written.
-acodec mp3: encode the audio as MP3.
PS D:\Users\Desktop\新建文件夾> ffmpeg -i "D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp4" -vn -acodec mp3 "D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp3"
ffmpeg version 2025-05-12-git-8ce32a7cbb-full_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers
  built with gcc 15.1.0 (Rev2, Built by MSYS2 project)
  configuration: --enable-gpl 。。。。
  libavutil 60. 2.100 / 60. 2.100
  libavcodec 62. 3.101 / 62. 3.101
  libavformat 62. 0.102 / 62. 0.102
  libavdevice 62. 0.100 / 62. 0.100
  libavfilter 11. 0.100 / 11. 0.100
  libswscale 9. 0.100 / 9. 0.100
  libswresample 6. 0.100 / 6. 0.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp4':
  Metadata:
    major_brand : mp42
    minor_version : 0
    compatible_brands: mp42isom
  Duration: 00:01:42.17, start: 0.000000, bitrate: 2680 kb/s
  Stream #0:0[0x1](und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 192 kb/s (default)
    Metadata:
      vendor_id : [0][0][0][0]
  Stream #0:1[0x2](und): Video: h264 (High) (avc1 / 0x31637661), yuv420p(progressive), 1916x1076 [SAR 1:1 DAR 479:269], 2485 kb/s, 29.93 fps, 30 tbr, 100k tbn (default)
    Metadata:
      vendor_id : [0][0][0][0]
      encoder : JVT/AVC Coding
Stream mapping:
  Stream #0:0 -> #0:0 (aac (native) -> mp3 (libmp3lame))
Press [q] to stop, [?] for help
Output #0, mp3, to 'D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp3':
  Metadata:
    major_brand : mp42
    minor_version : 0
    compatible_brands: mp42isom
    TSSE : Lavf62.0.102
  Stream #0:0(und): Audio: mp3, 48000 Hz, stereo, fltp (default)
    Metadata:
      encoder : Lavc62.3.101 libmp3lame
      vendor_id : [0][0][0][0]
[out#0/mp3 @ 00000269a61c48c0] video:0KiB audio:1598KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: 0.020112%
size= 1598KiB time=00:01:42.17 bitrate= 128.1kbits/s speed=81.4x elapsed=0:00:01.25
PS D:\Users\Desktop\新建文件夾>
PS D:\Users\Desktop\新建文件夾> ls
    Directory: D:\Users\Desktop\新建文件夾
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 2025/6/18 10:54 1636169 [新聞30分]國內簡訊-1.mp3
-a---- 2025/6/18 10:18 34232134 [新聞30分]國內簡訊-1.mp4
-a---- 2025/6/18 10:22 35120886 [新聞30分]國內簡訊-2.mp4
PS D:\Users\Desktop\新建文件夾>
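The numbers in this log are consistent with each other: size is roughly bitrate times duration. A short sketch of that arithmetic, assuming ffmpeg's kbits/s means 1000 bits per second and ignoring container overhead (the function name `estimated_size_bytes` is hypothetical):

```python
def estimated_size_bytes(bitrate_kbps: float, duration_s: float) -> float:
    """Approximate stream size in bytes: bitrate (kbit/s, 1 kbit = 1000 bit) times duration."""
    return bitrate_kbps * 1000 * duration_s / 8


duration = 1 * 60 + 42.17                      # time=00:01:42.17 from the log
size = estimated_size_bytes(128.1, duration)   # bitrate=128.1kbits/s from the log
print(f"{size / 1024:.0f} KiB")                # close to the 1598KiB ffmpeg reported
```

The same formula gives a quick estimate of how large an extracted audio track will be before converting a long video.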
Converting audio formats
This works the same way as video format conversion:
ffmpeg -i input.ts output.mp3
GPU acceleration
ffmpeg 硬件加速視頻轉碼指南 - afnewiung - 博客園
C:\Users\HN>ffmpeg -hwaccels
ffmpeg version 2025-05-15-git-12b853530a-full_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers
  built with gcc 15.1.0 (Rev4, Built by MSYS2 project)
  configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-。。。
  libswresample 6. 0.100 / 6. 0.100
Hardware acceleration methods:
cuda
vaapi
dxva2
qsv
d3d11va
opencl
vulkan
d3d12va
amf
C:\Users\HN>
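To check for a particular method from a script, the list printed by `ffmpeg -hwaccels` can be parsed. This is a sketch under the assumption that the output keeps the layout shown above (a header line followed by one method per line); `parse_hwaccels` and `available_hwaccels` are hypothetical names, and `-hide_banner` is used only to suppress the version banner.

```python
import subprocess


def parse_hwaccels(output: str) -> list[str]:
    """Extract the method names listed after the 'Hardware acceleration methods:' header."""
    methods = []
    seen_header = False
    for line in output.splitlines():
        line = line.strip()
        if line == "Hardware acceleration methods:":
            seen_header = True
        elif seen_header and line:
            methods.append(line)
    return methods


def available_hwaccels() -> list[str]:
    """Ask the local ffmpeg build which hardware acceleration methods it supports."""
    result = subprocess.run(["ffmpeg", "-hide_banner", "-hwaccels"],
                            capture_output=True, text=True, check=True)
    return parse_hwaccels(result.stdout)
```

A check such as `"cuda" in available_hwaccels()` then tells whether the build was compiled with CUDA support; whether the GPU can actually be used still depends on the installed driver.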
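2.2 faster-whisper speech recognition model
2.2.1 Background
Reference documentation
基于OpenAI的Whisper構建的高效語音識別模型:faster-whisper-CSDN博客
Model comparison
(AI-generated summary)
Model | Accuracy | Speed | Parameters | Languages | Typical use |
---|---|---|---|---|---|
tiny | Lower | Very fast (faster than base) | 39M | Multilingual | Real-time translation, low-resource devices, quick transcription |
tiny.en | Lower (English) | Very fast (faster than base.en) | 39M | English | English only, maximum speed |
base | Medium | Fast (faster than small) | 74M | Multilingual | Everyday use, balance of accuracy and speed |
base.en | Medium (English) | Fast (faster than small.en) | 74M | English | English only, improved accuracy |
small | Fairly high | Medium (faster than medium) | 244M | Multilingual | Subtitle generation, meeting notes, standard needs |
small.en | Fairly high (English) | Medium (faster than medium.en) | 244M | English | English only, higher accuracy |
medium | High | Slower (faster than large-v2) | 769M | Multilingual | Academic research, professional audio processing, demanding scenarios |
medium.en | High (English) | Slower (faster than large-v2) | 769M | English | English only, very high accuracy |
large-v1 | Highest | Slow (baseline) | 1550M | Multilingual | High-quality transcription, long-form text |
large-v2 | Highest | Slow (baseline) | 1550M | Multilingual | Improved large model, better than v1 overall |
large-v3 | Highest | Slow (baseline) | 1550M | Multilingual | Latest large model, tuned for long-form and complex audio |
large | Highest | Slow (baseline) | 1550M | Multilingual | Same as large-v3 |
distil-large-v2 | Close to large-v2 | Faster (faster than large-v2) | 756M | English | Distilled version, more resource-efficient |
distil-medium.en | Close to medium.en | Medium-fast (faster than medium.en) | 394M | English | Distilled English model, balance of efficiency and accuracy |
distil-small.en | Close to small.en | Fast (faster than small.en) | 166M | English | Small distilled English model, lightweight |
distil-large-v3 | Close to large-v3 | Faster (faster than large-v3) | 756M | English | Latest distilled large model, tuned for efficiency |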
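2.2.2 Installation
After downloading faster-whisper, you also need to download a model.
Windows installation
Windows build of faster-whisper:
Releases · Purfview/whisper-standalone-win
Python installation
Install faster-whisper from PyPI (faster-whisper · PyPI); it requires Python 3.9 or later.
pip install faster-whisper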
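Once the package is installed, transcription from Python looks roughly like the sketch below. It is an illustration under stated assumptions, not the workflow used in the rest of this article (which uses the standalone Windows executable): `srt_timestamp`, `to_srt` and `transcribe_to_srt` are hypothetical helpers, and running on CPU with int8 is only one possible choice. `WhisperModel` and its `transcribe` method are the package's documented entry points; the import is kept inside the function so the formatting helpers work without the library.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments) -> str:
    """Render (start, end, text) triples as SRT text."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "small") -> str:
    """Transcribe an audio file with faster-whisper and return SRT text."""
    from faster_whisper import WhisperModel  # imported lazily; needs the package installed

    model = WhisperModel(model_name, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(audio_path, beam_size=5)
    return to_srt((s.start, s.end, s.text) for s in segments)
```

Writing the returned string to a file named after the video with an `.srt` extension lets most players load it as a subtitle track.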
Model download
guillaumekln (Guillaume Klein)
large-v3: https://huggingface.co/Systran/faster-whisper-large-v3/tree/main
large-v2: https://huggingface.co/guillaumekln/faster-whisper-large-v2/tree/main
large-v1: https://huggingface.co/guillaumekln/faster-whisper-large-v1/tree/main
medium: https://huggingface.co/guillaumekln/faster-whisper-medium/tree/main
small: https://huggingface.co/guillaumekln/faster-whisper-small/tree/main
base: https://huggingface.co/guillaumekln/faster-whisper-base/tree/main
tiny: https://huggingface.co/guillaumekln/faster-whisper-tiny/tree/main
Mirror for users in mainland China:
https://aifasthub.com/models/guillaumekln
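The models can also be fetched with a script instead of the browser. The sketch below assumes the separate `huggingface_hub` package is installed (`pip install huggingface_hub`) and uses its `snapshot_download` function; the repository ids are taken from the URLs above, while `MODEL_REPOS`, `repo_for` and `download_model` are hypothetical names.

```python
MODEL_REPOS = {
    "large-v3": "Systran/faster-whisper-large-v3",
    "large-v2": "guillaumekln/faster-whisper-large-v2",
    "large-v1": "guillaumekln/faster-whisper-large-v1",
    "medium": "guillaumekln/faster-whisper-medium",
    "small": "guillaumekln/faster-whisper-small",
    "base": "guillaumekln/faster-whisper-base",
    "tiny": "guillaumekln/faster-whisper-tiny",
}


def repo_for(model: str) -> str:
    """Map a model name to its Hugging Face repository id."""
    try:
        return MODEL_REPOS[model]
    except KeyError:
        raise ValueError(f"unknown model: {model}") from None


def download_model(model: str, target_dir: str) -> str:
    """Download all files of the model repository into target_dir and return the path."""
    from huggingface_hub import snapshot_download  # imported lazily; needs network access

    return snapshot_download(repo_id=repo_for(model), local_dir=target_dir)
```

Where the downloaded folder has to be placed for whisper-faster.exe depends on its `--model_dir` setting, described in the parameter list below.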
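2.2.3 whisper-faster parameters
All parameters
Accepted values for --model:
['tiny', 'tiny.en', 'base', 'base.en', 'small', 'small.en', 'medium', 'medium.en', 'large-v1', 'large-v2', 'large-v3', 'large', 'distil-large-v2', 'distil-medium.en', 'distil-small.en', 'distil-large-v3']
# Usage:
whisper-faster.exe [options] audio
The audio argument can be a file wildcard, a file list (txt, m3u, m3u8, lst), or a directory for batch processing. Note: non-media files in a list or directory are filtered out by extension.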
PS D:\Users\Desktop\字幕\Whisper-Faster> .\whisper-faster.exe -h
usage: whisper-faster.exe [-h] [--model MODEL] [--model_dir MODEL_DIR] [--device DEVICE] [--output_dir OUTPUT_DIR]
    [--output_format {lrc,txt,text,vtt,srt,tsv,json,all}] [--verbose VERBOSE] [--task {transcribe,translate}]
    [--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}]
    [--language_detection_threshold LANGUAGE_DETECTION_THRESHOLD] [--language_detection_segments LANGUAGE_DETECTION_SEGMENTS]
    [--temperature TEMPERATURE] [--best_of BEST_OF] [--beam_size BEAM_SIZE] [--patience PATIENCE] [--length_penalty LENGTH_PENALTY]
    [--repetition_penalty REPETITION_PENALTY] [--no_repeat_ngram_size NO_REPEAT_NGRAM_SIZE] [--suppress_blank SUPPRESS_BLANK]
    [--suppress_tokens SUPPRESS_TOKENS] [--initial_prompt INITIAL_PROMPT] [--prefix PREFIX]
    [--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT] [--prompt_reset_on_temperature PROMPT_RESET_ON_TEMPERATURE]
    [--without_timestamps WITHOUT_TIMESTAMPS] [--max_initial_timestamp MAX_INITIAL_TIMESTAMP]
    [--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK]
    [--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD] [--logprob_threshold LOGPROB_THRESHOLD]
    [--no_speech_threshold NO_SPEECH_THRESHOLD] [--v3_offsets_off]
    [--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD] [--hallucination_silence_th_temp {0.0,0.2,0.5,0.8,1.0}]
    [--clip_timestamps CLIP_TIMESTAMPS] [--no_speech_strict_lvl {0,1,2}] [--word_timestamps WORD_TIMESTAMPS]
    [--highlight_words HIGHLIGHT_WORDS] [--prepend_punctuations PREPEND_PUNCTUATIONS] [--append_punctuations APPEND_PUNCTUATIONS]
    [--threads THREADS] [--version] [--vad_filter VAD_FILTER] [--vad_threshold VAD_THRESHOLD]
    [--vad_min_speech_duration_ms VAD_MIN_SPEECH_DURATION_MS] [--vad_max_speech_duration_s VAD_MAX_SPEECH_DURATION_S]
    [--vad_min_silence_duration_ms VAD_MIN_SILENCE_DURATION_MS] [--vad_speech_pad_ms VAD_SPEECH_PAD_MS]
    [--vad_window_size_samples VAD_WINDOW_SIZE_SAMPLES] [--vad_dump] [--max_new_tokens MAX_NEW_TOKENS] [--chunk_length CHUNK_LENGTH]
    [--compute_type {default,auto,int8,int8_float16,int8_float32,int8_bfloat16,int16,float16,float32,bfloat16}] [--batch_recursive]
    [--beep_off] [--skip] [--checkcuda] [--print_progress] [--postfix] [--check_files] [--PR163_off] [--hallucinations_list_off]
    [--one_word {0,1,2}] [--sentence] [--standard] [--standard_asia] [--max_comma MAX_COMMA] [--max_comma_cent {50,60,70,80,90,100}]
    [--max_gap MAX_GAP] [--max_line_width MAX_LINE_WIDTH] [--max_line_count MAX_LINE_COUNT]
    [--min_dist_to_end {0,4,5,6,7,8,9,10,11,12}] [--prompt_max {16,32,64,128,223}] [--reprompt {0,1,2}]
    [--prompt_reset_on_no_end {0,1,2}] [--ff_dump] [--ff_track {1,2,3,4,5,6}] [--ff_fc] [--ff_mp3] [--ff_sync] [--ff_rnndn_sh]
    [--ff_rnndn_xiph] [--ff_fftdn [0 - 97]] [--ff_tempo [0.5 - 2.0]] [--ff_gate] [--ff_speechnorm] [--ff_loudnorm]
    [--ff_silence_suppress noise duration] [--ff_lowhighpass]
    audio [audio ...]

positional arguments:
  audio
      audio file(s). You can enter a file wildcard, filelist (txt. m3u, m3u8, lst) or directory to do batch processing. Note: non-media files in list or directory are filtered out by extension.

optional arguments:
  -h, --help
      show this help message and exit
  --model MODEL, -m MODEL
      name of the Whisper model to use (default: medium)
  --model_dir MODEL_DIR
      the path to save model files; uses D:\Users\Desktop\字幕\Whisper-Faster\_models by default (default: None)
  --device DEVICE, -d DEVICE
      Device to use. Default is 'cuda' if CUDA device is detected, else is 'cpu'. If CUDA GPU is a second device then set 'cuda:1'. (default: cuda)
  --output_dir OUTPUT_DIR, -o OUTPUT_DIR
      directory to save the outputs. By default the same folder where the executable file is or where media file is if --batch_recursive=True. '.'- sets to the current folder. 'source' - sets to where media file is. (default: default)
  --output_format {lrc,txt,text,vtt,srt,tsv,json,all}, -f {lrc,txt,text,vtt,srt,tsv,json,all}
      format of the output file; if not specified srt will be produced (default: srt)
  --verbose VERBOSE, -v VERBOSE
      whether to print out debug messages (default: False)
  --task {transcribe,translate}
      whether to perform X->X speech recognition ('transcribe') or X->English translation ('translate') (default: transcribe)
  --language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}, -l {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}
      language spoken in the audio, specify None to perform language detection (default: None)
  --language_detection_threshold LANGUAGE_DETECTION_THRESHOLD
      If the maximum probability of the language tokens is higher than this value, the language is detected. (default: None)
  --language_detection_segments LANGUAGE_DETECTION_SEGMENTS
      Number of segments/chunks to consider for the language detection. (default: 1)
  --temperature TEMPERATURE
      temperature to use for sampling (default: 0)
  --best_of BEST_OF, -bo BEST_OF
      number of candidates when sampling with non-zero temperature (default: 5)
  --beam_size BEAM_SIZE, -bs BEAM_SIZE
      number of beams in beam search, only applicable when temperature is zero (default: 5)
  --patience PATIENCE, -p PATIENCE
      optional patience value to use in beam decoding, as in https://arxiv.org/abs/2204.05424, the default (1.0) is equivalent to conventional beam search (default: 2.0)
  --length_penalty LENGTH_PENALTY
      optional token length penalty coefficient (alpha) as in https://arxiv.org/abs/1609.08144, uses simple length normalization by default (default: 1.0)
  --repetition_penalty REPETITION_PENALTY
      Penalty applied to the score of previously generated tokens (set > 1.0 to penalize). (default: 1.0)
  --no_repeat_ngram_size NO_REPEAT_NGRAM_SIZE
      Prevent repetitions of ngrams with this size (set 0 to disable). (default: 0)
  --suppress_blank SUPPRESS_BLANK
      Suppress blank outputs at the beginning of the sampling. (default: True)
  --suppress_tokens SUPPRESS_TOKENS
      comma-separated list of token ids to suppress during sampling; '-1' will suppress most special characters except common punctuations (default: -1)
  --initial_prompt INITIAL_PROMPT, -prompt INITIAL_PROMPT
      optional text to provide context as a prompt for the first window. Use 'None' to disable it. Note: 'auto' and 'default' are experimental ~universal prompt presets, they work if --language is set. (default: auto)
  --prefix PREFIX
      Optional text to provide as a prefix for the first window (default: None)
  --condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT, -condition CONDITION_ON_PREVIOUS_TEXT
      if True, provide the previous output of the model as a prompt for the next window; disabling may make the text inconsistent across windows, but the model becomes less prone to getting stuck in a failure loop. If disabled then you may want to disable --reprompt too. (default: True)
  --prompt_reset_on_temperature PROMPT_RESET_ON_TEMPERATURE
      Resets prompt if temperature is above this value. Arg has effect only if condition_on_previous_text is True. (default: 0.5)
  --without_timestamps WITHOUT_TIMESTAMPS
      Only sample text tokens. (default: False)
  --max_initial_timestamp MAX_INITIAL_TIMESTAMP
      The initial timestamp cannot be later than this. (default: 1.0)
  --temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK, -fallback TEMPERATURE_INCREMENT_ON_FALLBACK
      temperature to increase when falling back when the decoding fails to meet either of the thresholds below. To disable fallback set it to 'None'. (default: 0.2)
  --compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD
      if the gzip compression ratio is higher than this value, treat the decoding as failed (default: 2.4)
  --logprob_threshold LOGPROB_THRESHOLD
      if the average log probability is lower than this value, treat the decoding as failed (default: -1.0)
  --no_speech_threshold NO_SPEECH_THRESHOLD
      if the probability of the <|nospeech|> token is higher than this value AND the decoding has failed due to 'logprob_threshold', consider the segment as silence (default: 0.6)
  --v3_offsets_off
      Disables custom offsets to the defaults of pseudo-vad thresholds when 'large-v3' models are in use. Note: Offsets made to combat 'large-v3' hallucinations. (default: False)
  --hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD, -hst HALLUCINATION_SILENCE_THRESHOLD
      (Experimental) When word_timestamps is True, skip silent periods longer than this threshold (in seconds) when a possible hallucination is detected. Optimal value is somewhere between 2 - 8 seconds. Inactive if None. (default: None)
  --hallucination_silence_th_temp {0.0,0.2,0.5,0.8,1.0}, -hst_temp {0.0,0.2,0.5,0.8,1.0}
      (Experimental) Additional heuristic for '--hallucination_silence_threshold'. If temperature is higher that this threshold then consider segment as possible hallucination ignoring the hst score. Inactive if 1.0. (default: 1.0)
  --clip_timestamps CLIP_TIMESTAMPS
      Comma-separated list start,end,start,end,... timestamps (in seconds) of clips to process. The last end timestamp defaults to the end of the file. VAD is auto-disabled. (default: 0)
  --no_speech_strict_lvl {0,1,2}
      (experimental) Level of stricter actions when no_speech_prob > 0.93. Use beam_size=5 if this is enabled. Options: 0 - Disabled (do nothing), 1 - Reset propmt (see condition_on_previous_text), 2 - Invalidate the cached encoder output (if no_speech_threshold is not None). Arg meant to combat cases where the model is getting stuck in a failure loop or outputs nonsense (default: 0)
  --word_timestamps WORD_TIMESTAMPS, -wt WORD_TIMESTAMPS
      Extract word-level timestamps and refine the results based on them (default: True)
  --highlight_words HIGHLIGHT_WORDS, -hw HIGHLIGHT_WORDS
      underline each word as it is spoken AKA karaoke in srt and vtt output formats (default: False)
  --prepend_punctuations PREPEND_PUNCTUATIONS
      if word_timestamps is True, merge these punctuation symbols with the next word (default: "'“?([{-)
  --append_punctuations APPEND_PUNCTUATIONS
      if word_timestamps is True, merge these punctuation symbols with the previous word (default: "'.。,,!!??::”)]}、)
  --threads THREADS
      number of threads used for CPU inference; By default number of the real cores but no more that 4 (default: 0)
  --version
      Show Faster-Whisper's version number
  --vad_filter VAD_FILTER, -vad VAD_FILTER
      Enable the voice activity detection (VAD) to filter out parts of the audio without speech. (default: True)
  --vad_threshold VAD_THRESHOLD
      Probabilities above this value are considered as speech. (default: 0.45)
  --vad_min_speech_duration_ms VAD_MIN_SPEECH_DURATION_MS
      Final speech chunks shorter min_speech_duration_ms are thrown out. (default: 350)
  --vad_max_speech_duration_s VAD_MAX_SPEECH_DURATION_S
      Maximum duration of speech chunks in seconds. Longer will be split at the timestamp of the last silence. (default: None)
  --vad_min_silence_duration_ms VAD_MIN_SILENCE_DURATION_MS
      In the end of each speech chunk time to wait before separating it. (default: 3000)
  --vad_speech_pad_ms VAD_SPEECH_PAD_MS
      Final speech chunks are padded by speech_pad_ms each side. (default: 900)
  --vad_window_size_samples VAD_WINDOW_SIZE_SAMPLES
      Size of audio chunks fed to the silero VAD model. Values other than 512, 1024, 1536 may affect model perfomance!!! (default: 1536)
  --vad_dump
      Dumps VAD timings to a subtitle file for inspection. (default: False)
  --max_new_tokens MAX_NEW_TOKENS
      Maximum number of new tokens to generate per-chunk. (default: None)
  --chunk_length CHUNK_LENGTH
      The length of audio segments. If it is not None, it will overwrite the default chunk_length of the FeatureExtractor. (default: None)
  --compute_type {default,auto,int8,int8_float16,int8_float32,int8_bfloat16,int16,float16,float32,bfloat16}, -ct {default,auto,int8,int8_float16,int8_float32,int8_bfloat16,int16,float16,float32,bfloat16}
      Type of quantization to use (see https://opennmt.net/CTranslate2/quantization.html). (default: auto)
  --batch_recursive, -br
      Enables recursive batch processing. Note: If set then it changes defaults of --output_dir. (default: False)
  --beep_off
      Disables the beep sound when operation is finished. (default: False)
  --skip
      Skips media file if subtitle exists. Works if input is wildcard or directory. (default: False)
  --checkcuda, -cc
      Returns CUDA device count. (for Subtitle Edit's internal use)
  --print_progress, -pp
      Prints progress bar instead of transcription. (default: False)
  --postfix
      Adds language as a postfix to subtitle's filename. (default: False)
  --check_files
      Checks input files for errors before passing all them for transcription. Works if input is wildcard or directory. (default: False)
  --PR163_off
      (For dev experiments) Disables PR163. . (default: False)
  --hallucinations_list_off
      (For dev experiments) Disables hallucinations_list, allows hallucinations added to prompt. (default: False)
  --one_word {0,1,2}
      0) Disabled. 1) Outputs srt and vtt subtitles with one word per line. 2) As '1', plus removes whitespace and ensures >= 50ms for sub lines. Note: VAD may slightly reduce the accuracy of timestamps on some lines. (default: 0)
  --sentence
      Enables splitting lines to sentences for srt and vtt subs. Every sentence starts in the new segment. By default meant to output whole sentence per line for better translations, but not limited to, read about '--max_...' parameters. Note: has no effect on 'highlight_words'. (default: False)
  --standard
      Quick hardcoded preset to split lines in standard way. 42 chars per 2 lines with max_comma_cent=70 and --sentence are activated automatically. (default: False)
  --standard_asia
      Quick hardcoded preset to split lines in standard way for some Asian languages. 16 chars per 2 lines with max_comma_cent=80 and --sentence are activated automatically. (default: False)
  --max_comma MAX_COMMA
      (requires --sentence) After this line length a comma is treated as the end of sentence. Note: disabled if it's over or equal to --max_line_width. (default: 250)
  --max_comma_cent {50,60,70,80,90,100}
      (requires --sentence) Percentage of --max_line_width when it starts breaking the line after comma. Note: 100 = disabled. (default: 100)
  --max_gap MAX_GAP
      (requires --sentence) Threshold for a gap length in seconds, longer gaps are treated as dots. (default: 3.0)
  --max_line_width MAX_LINE_WIDTH
      The maximum number of characters in a line before breaking the line. (default: 1000)
  --max_line_count MAX_LINE_COUNT
      The maximum number of lines in one sub segment. (default: 1)
  --min_dist_to_end {0,4,5,6,7,8,9,10,11,12}
      (requires --sentence) If from words like 'the', 'Mr.' and ect. to the end of line distance is less than set then it starts in a new line. Note: 0 = disabled. (default: 0)
  --prompt_max {16,32,64,128,223}
      (experimental) The maximum size of prompt. (default: 223)
  --reprompt {0,1,2}
      (experimental) 0) Disabled. 1) Inserts initial_prompt after the prompt resets. 2) Ensures that initial_prompt is present in prompt for all windows/chunks. Note: auto-disabled if initial_prompt=None. It's similar to 'hotwords' feature. (default: 2)
  --prompt_reset_on_no_end {0,1,2}
      (experimental) Resets prompt if there is no end of sentence in window/chunk. 0 - disabled, 1 - looks for period, 2 - looks for period or comma. Note: it's auto-disabled if reprompt=0. (default: 2)
  --ff_dump
      Dumps pre-processed audio by the filters to the 16000Hz file and prevents deletion of some intermediate audio files. (default: False)
  --ff_track {1,2,3,4,5,6}
      Audio track selector. 1 - selects the first audio track. (default: 1)
  --ff_fc
      Selects only front-center channel (FC) to process. (default: False)
  --ff_mp3
      Audio filter: Conversion to MP3 and back. (default: False)
  --ff_sync
      Audio filter: Stretch/squeeze samples to the given timestamps, with a maximum of 3600 samples per second compensation. Input file must be container that support storing PTS like mp4, mkv... (default: False)
  --ff_rnndn_sh
      Audio filter: Suppress non-speech with GregorR's SH model using Recurrent Neural Networks. Notes: It's more aggressive than Xiph, discards singing. (default: False)
  --ff_rnndn_xiph
      Audio filter: Suppress non-speech with Xiph's original model using Recurrent Neural Networks. (default: False)
  --ff_fftdn [0 - 97]
      Audio filter: General denoise with Fast Fourier Transform. Notes: 12 - normal strength, 0 - disabled. (default: 0)
  --ff_tempo [0.5 - 2.0]
      Audio filter: Adjust audio tempo. Values below 1.0 slows down audio, above - speeds up. 1.0 = disabled. (default: 1.0)
  --ff_gate
      Audio filter: Reduce lower parts of a signal. (default: False)
  --ff_speechnorm
      Audio filter: Extreme and fast speech amplification. (default: False)
  --ff_loudnorm
      Audio filter: EBU R128 loudness normalization. (default: False)
  --ff_silence_suppress noise duration
      Audio filter: Suppress quiet parts of audio. Takes two values. First value - noise tolerance in decibels [-70 - 0] (0=disabled), second value - minimum silence duration in seconds [0.1 - 10]. (default: [0, 3.0])
  --ff_lowhighpass
      Audio filter: Pass 50Hz - 7800 band. sinc + afir. (default: False)
PS D:\Users\Desktop\字幕\Whisper-Faster>
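All parameters (translated)
-h, --help  Show this help message and exit.
--model MODEL, -m MODEL  Name of the Whisper model to use. (default: medium)
--model_dir MODEL_DIR  Path where the model files are stored; defaults to D:\Users\Desktop\字幕\Whisper-Faster\_models. (default: None)
--device DEVICE, -d DEVICE  Device to use. Defaults to 'cuda' if a CUDA device is detected, otherwise 'cpu'. If the CUDA GPU is the second device, set 'cuda:1'. (default: cuda)
--output_dir OUTPUT_DIR, -o OUTPUT_DIR  Directory to save the output. By default, the folder containing the executable, or the location of the media file if --batch_recursive=True. '.' means the current folder. 'source' means the location of the media file. (default: default)
--output_format {lrc,txt,text,vtt,srt,tsv,json,all}, -f {lrc,txt,text,vtt,srt,tsv,json,all}  Format of the output file; srt is produced if not specified. (default: srt)
--verbose VERBOSE, -v VERBOSE  Whether to print debug messages. (default: False)
--task {transcribe,translate}  Whether to perform X->X speech recognition ('transcribe') or X->English translation ('translate'). (default: transcribe)
--language {af,am,ar,as,az,ba,be,bg,bn,bo,br,bs,ca,cs,cy,da,de,el,en,es,et,eu,fa,fi,fo,fr,gl,gu,ha,haw,he,hi,hr,ht,hu,hy,id,is,it,ja,jw,ka,kk,km,kn,ko,la,lb,ln,lo,lt,lv,mg,mi,mk,ml,mn,mr,ms,mt,my,ne,nl,nn,no,oc,pa,pl,ps,pt,ro,ru,sa,sd,si,sk,sl,sn,so,sq,sr,su,sv,sw,ta,te,tg,th,tk,tl,tr,tt,uk,ur,uz,vi,yi,yo,yue,zh,Afrikaans,Albanian,Amharic,Arabic,Armenian,Assamese,Azerbaijani,Bashkir,Basque,Belarusian,Bengali,Bosnian,Breton,Bulgarian,Burmese,Cantonese,Castilian,Catalan,Chinese,Croatian,Czech,Danish,Dutch,English,Estonian,Faroese,Finnish,Flemish,French,Galician,Georgian,German,Greek,Gujarati,Haitian,Haitian Creole,Hausa,Hawaiian,Hebrew,Hindi,Hungarian,Icelandic,Indonesian,Italian,Japanese,Javanese,Kannada,Kazakh,Khmer,Korean,Lao,Latin,Latvian,Letzeburgesch,Lingala,Lithuanian,Luxembourgish,Macedonian,Malagasy,Malay,Malayalam,Maltese,Mandarin,Maori,Marathi,Moldavian,Moldovan,Mongolian,Myanmar,Nepali,Norwegian,Nynorsk,Occitan,Panjabi,Pashto,Persian,Polish,Portuguese,Punjabi,Pushto,Romanian,Russian,Sanskrit,Serbian,Shona,Sindhi,Sinhala,Sinhalese,Slovak,Slovenian,Somali,Spanish,Sundanese,Swahili,Swedish,Tagalog,Tajik,Tamil,Tatar,Telugu,Thai,Tibetan,Turkish,Turkmen,Ukrainian,Urdu,Uzbek,Valencian,Vietnamese,Welsh,Yiddish,Yoruba}, -l {same values as --language}  Language spoken in the audio; specify None to perform language detection. (default: None)
--language_detection_threshold LANGUAGE_DETECTION_THRESHOLD  If the maximum probability of the language tokens is higher than this value, the language is detected. (default: None)
--language_detection_segments LANGUAGE_DETECTION_SEGMENTS  Number of segments/chunks used for language detection. (default: 1)
--temperature TEMPERATURE  Temperature to use for sampling. (default: 0)
--best_of BEST_OF, -bo BEST_OF  Number of candidates when sampling with non-zero temperature. (default: 5)
--beam_size BEAM_SIZE, -bs BEAM_SIZE  Number of beams in beam search, only applicable when the temperature is zero. (default: 5)
--patience PATIENCE, -p PATIENCE  Optional patience value used in beam decoding, as described in https://arxiv.org/abs/2204.05424; the default (1.0) is equivalent to conventional beam search. (default: 2.0)
--length_penalty LENGTH_PENALTY  Optional token length penalty coefficient (alpha), as described in https://arxiv.org/abs/1609.08144; simple length normalization is used by default. (default: 1.0)
--repetition_penalty REPETITION_PENALTY  Penalty applied to the scores of previously generated tokens (set > 1.0 to penalize). (default: 1.0)
--no_repeat_ngram_size NO_REPEAT_NGRAM_SIZE  Prevents repetitions of ngrams of this size (set 0 to disable). (default: 0)
--suppress_blank SUPPRESS_BLANK  Suppresses blank output at the beginning of sampling. (default: True)
--suppress_tokens SUPPRESS_TOKENS  Comma-separated list of token IDs to suppress during sampling; '-1' suppresses most special characters except common punctuation. (default: -1)
--initial_prompt INITIAL_PROMPT, -prompt INITIAL_PROMPT  Optional text that gives the first window context as a prompt. Use 'None' to disable it. Note: 'auto' and 'default' are experimental ~universal prompt presets; they work if --language is set. (default: auto)
--prefix PREFIX  Optional text that gives the first window a prefix. (default: None)
--condition_on_previous_text CONDITION_ON_PREVIOUS_TEXT, -condition CONDITION_ON_PREVIOUS_TEXT  If True, the model's previous output is used as the prompt for the next window; disabling it may make the text inconsistent between windows, but the model is less likely to get stuck in a failure loop. If disabled, you may also want to disable --reprompt. (default: True)
--prompt_reset_on_temperature PROMPT_RESET_ON_TEMPERATURE  Resets the prompt if the temperature is above this value. Only has an effect when condition_on_previous_text is True. (default: 0.5)
--without_timestamps WITHOUT_TIMESTAMPS  Only sample text tokens. (default: False)
--max_initial_timestamp MAX_INITIAL_TIMESTAMP  The initial timestamp cannot be later than this value. (default: 1.0)
--temperature_increment_on_fallback TEMPERATURE_INCREMENT_ON_FALLBACK, -fallback TEMPERATURE_INCREMENT_ON_FALLBACK  Temperature to add when falling back because decoding failed to meet either of the thresholds below. To disable fallback, set it to 'None'. (default: 0.2)
--compression_ratio_threshold COMPRESSION_RATIO_THRESHOLD  If the gzip compression ratio is higher than this value, the decoding is treated as failed. (default: 2.4)
--logprob_threshold LOGPROB_THRESHOLD  If the average log probability is lower than this value, the decoding is treated as failed. (default: -1.0)
--no_speech_threshold NO_SPEECH_THRESHOLD  If the probability of the <|nospeech|> token is higher than this value and the decoding failed because of 'logprob_threshold', the segment is treated as silence. (default: 0.6)
--v3_offsets_off  Disables the custom offsets to the pseudo-VAD threshold defaults when the 'large-v3' model is used. Note: the offsets are there to suppress hallucinations of 'large-v3'. (default: False)
--hallucination_silence_threshold HALLUCINATION_SILENCE_THRESHOLD, -hst HALLUCINATION_SILENCE_THRESHOLD  (experimental) When word_timestamps is True, skips silent periods longer than this threshold (in seconds) when a possible hallucination is detected. Optimal values are between 2 and 8 seconds. Inactive if None. (default: None)
--hallucination_silence_th_temp {0.0,0.2,0.5,0.8,1.0}, -hst_temp {0.0,0.2,0.5,0.8,1.0}  (experimental) Additional heuristic for '--hallucination_silence_threshold'. If the temperature is higher than this threshold, the segment is treated as a possible hallucination, ignoring the hst score. Inactive if 1.0. (default: 1.0)
--clip_timestamps CLIP_TIMESTAMPS  Comma-separated list of start,end,start,end,... timestamps (in seconds) of the clips to process. The last end timestamp defaults to the end of the file. VAD is disabled automatically. (default: 0)
--no_speech_strict_lvl {0,1,2}  (experimental) Level of strict actions when no_speech_prob > 0.93. If enabled, use beam_size=5. Options: 0 - disabled (does nothing), 1 - resets the prompt (see condition_on_previous_text), 2 - invalidates the cached encoder output (if no_speech_threshold is not None). The parameter is meant for cases where the model gets stuck in a failure loop or outputs nonsense. (default: 0)
--word_timestamps WORD_TIMESTAMPS, -wt WORD_TIMESTAMPS  Extracts word-level timestamps and refines the results based on them. (default: True)
--highlight_words HIGHLIGHT_WORDS, -hw HIGHLIGHT_WORDS  Underlines each word as it is spoken in the srt and vtt output formats (i.e. karaoke effect). (default: False)
--prepend_punctuations PREPEND_PUNCTUATIONS  If word_timestamps is True, merges these punctuation symbols with the next word. (default: "'“?([{-)
--append_punctuations APPEND_PUNCTUATIONS  If word_timestamps is True, merges these punctuation symbols with the previous word. (default: "'.。,,!!??::”)]}、)
--threads THREADS  Number of threads used for CPU inference; by default the number of real cores, but no more than 4. (default: 0)
--version  Shows the Faster-Whisper version number.
--vad_filter VAD_FILTER, -vad VAD_FILTER  Enables voice activity detection (VAD) to filter out the parts of the audio without speech. (default: True)
--vad_threshold VAD_THRESHOLD  Probabilities above this value are considered speech. (default: 0.45)
--vad_min_speech_duration_ms VAD_MIN_SPEECH_DURATION_MS  Final speech chunks shorter than min_speech_duration_ms are discarded. (default: 350)
--vad_max_speech_duration_s VAD_MAX_SPEECH_DURATION_S  Maximum duration of speech chunks in seconds. Longer chunks are split at the timestamp of the last silence. (default: None)
--vad_min_silence_duration_ms VAD_MIN_SILENCE_DURATION_MS  Time to wait at the end of each speech chunk before separating it. (default: 3000)
--vad_speech_pad_ms VAD_SPEECH_PAD_MS  Final speech chunks are padded by speech_pad_ms on each side. (default: 900)
--vad_window_size_samples VAD_WINDOW_SIZE_SAMPLES  Size of the audio chunks fed to the silero VAD model. Values other than 512, 1024, 1536 may affect model performance!!! (default: 1536)
--vad_dump  Dumps the VAD timestamps to a subtitle file for inspection. (default: False)
--max_new_tokens MAX_NEW_TOKENS  Maximum number of new tokens generated per chunk. (default: None)
--chunk_length CHUNK_LENGTH  Length of the audio segments. If not None, it overrides the default chunk_length of the FeatureExtractor. (default: None)
--compute_type {default,auto,int8,int8_float16,int8_float32,int8_bfloat16,int16,float16,float32,bfloat16}, -ct {default,auto,int8,int8_float16,int8_float32,int8_bfloat16,int16,float16,float32,bfloat16}  Type of quantization to use (see https://opennmt.net/CTranslate2/quantization.html). (default: auto)
--batch_recursive, -br  Enables recursive batch processing. Note: if set, it changes the default of --output_dir. (default: False)
--beep_off  Disables the beep sound when the operation finishes. (default: False)
--skip  Skips media files whose subtitles already exist. Works if the input is a wildcard or a directory. (default: False)
--checkcuda, -cc  Returns the number of CUDA devices. (for internal use by Subtitle Edit)
--print_progress, -pp  Prints a progress bar instead of the transcription. (default: False)
--postfix  Adds the language as a postfix to the subtitle file name. (default: False)
--check_files  Checks the input files for errors before passing them all to transcription. Works if the input is a wildcard or a directory. (default: False)
--PR163_off  (for dev experiments) Disables PR163. (default: False)
--hallucinations_list_off  (for dev experiments) Disables hallucinations_list, allowing hallucinations to be added to the prompt. (default: False)
--one_word {0,1,2}  0) Disabled. 1) Outputs srt and vtt subtitles with one word per line. 2) As '1', and additionally removes whitespace and ensures subtitle lines last >= 50 ms. Note: VAD may slightly reduce the accuracy of timestamps on some lines. (default: 0)
--sentence  Enables splitting lines into sentences for srt and vtt subtitles. Every sentence starts in a new segment. By default it aims to output a whole sentence per line for better translations, but it is not limited to that; see the '--max_...' parameters. Note: has no effect on 'highlight_words'. (default: False)
--standard  Quick hardcoded preset to split lines in the standard way. 42 chars per 2 lines with max_comma_cent=70 and --sentence are activated automatically. (default: False)
--standard_asia  Quick hardcoded preset to split lines in the standard way for some Asian languages. 16 chars per 2 lines with max_comma_cent=80 and --sentence are activated automatically. (default: False)
--max_comma MAX_COMMA  (requires --sentence) Beyond this line length, a comma is treated as the end of a sentence. Note: disabled if the value is greater than or equal to --max_line_width. (default: 250)
--max_comma_cent {50,60,70,80,90,100}  (requires --sentence) Percentage of --max_line_width at which lines start breaking after a comma. Note: 100 = disabled. (default: 100)
--max_gap MAX_GAP  (requires --sentence) Threshold for a gap length in seconds; longer gaps are treated as dots. (default: 3.0)
--max_line_width MAX_LINE_WIDTH  Maximum number of characters in a line before the line is broken. (default: 1000)
--max_line_count MAX_LINE_COUNT  Maximum number of lines in one subtitle segment. (default: 1)
--min_dist_to_end {0,4,5,6,7,8,9,10,11,12}  (requires --sentence) If the distance from words like 'the', 'Mr.' etc. to the end of the line is less than the set value, a new line is started. Note: 0 = disabled. (default: 0)
--prompt_max {16,32,64,128,223}  (experimental) Maximum size of the prompt. (default: 223)
--reprompt {0,1,2}  (experimental) 0) Disabled. 1) Inserts initial_prompt after the prompt resets. 2) Ensures that initial_prompt is present in the prompt for all windows/chunks. Note: auto-disabled if initial_prompt=None. Similar to a 'hotwords' feature. (default: 2)
--prompt_reset_on_no_end {0,1,2}  (experimental) Resets the prompt if there is no end of sentence in the window/chunk. 0 - disabled, 1 - looks for a period, 2 - looks for a period or comma. Note: auto-disabled if reprompt=0. (default: 2)
--ff_dump  Dumps the audio pre-processed by the filters to a 16000Hz file and prevents deletion of some intermediate audio files. (default: False)
--ff_track {1,2,3,4,5,6}  Audio track selector. 1 - selects the first audio track. (default: 1)
--ff_fc  Selects only the front-center channel (FC) for processing. (default: False)
--ff_mp3  Audio filter: conversion to MP3 and back. (default: False)
--ff_sync  Audio filter: stretches/squeezes samples to the given timestamps, with a maximum compensation of 3600 samples per second. The input file must be a container that supports storing PTS, like mp4, mkv... (default: False)
--ff_rnndn_sh  Audio filter: suppresses non-speech with GregorR's SH model using recurrent neural networks. Note: more aggressive than the Xiph model, discards singing. (default: False)
--ff_rnndn_xiph  Audio filter: suppresses non-speech with Xiph's original model using recurrent neural networks. (default: False)
--ff_fftdn [0 - 97]  Audio filter: general denoising with the Fast Fourier Transform. Note: 12 - normal strength, 0 - disabled. (default: 0)
--ff_tempo [0.5 - 2.0]  Audio filter: adjusts the audio tempo. Values below 1.0 slow the audio down, values above 1.0 speed it up. 1.0 = disabled. (default: 1.0)
--ff_gate  Audio filter: reduces the lower parts of a signal. (default: False)
--ff_speechnorm  Audio filter: extreme and fast speech amplification. (default: False)
--ff_loudnorm  Audio filter: EBU R128 loudness normalization. (default: False)
--ff_silence_suppress noise duration  Audio filter: suppresses quiet parts of the audio. Takes two values. First value - noise tolerance in decibels [-70 - 0] (0 = disabled), second value - minimum silence duration in seconds [0.1 - 10]. (default: [0, 3.0])
--ff_lowhighpass  Audio filter: passes the 50Hz - 7800Hz band. Uses sinc + afir. (default: False)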
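Performance tuning
# Use CUDA, set the number of CPU threads, and choose the quantization type
whisper-faster --device cuda --threads 8 --compute_type int8_float16
- Change the length of audio the Whisper model processes per chunk, in seconds [with the small model, 30 was the largest value that worked in my tests]
--chunk_length 20
- Change the quantization type
--compute_type int8_float16
Differences between the quantization types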
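Quantization type | Precision | Memory usage | Speed | Suited to |
---|---|---|---|---|
float32 | Highest | Full model size (e.g. medium = 3GB) | Slowest | Maximum precision when VRAM is plentiful (≥8GB) |
float16 | High | About 50% of float32 | Fast (GPU-accelerated) | NVIDIA GPUs (with Tensor Cores) when precision and speed both matter |
int8 | Medium | About 25% of float32 | Fast (CPU/GPU) | Limited VRAM (≤4GB), where a slight precision loss (WER up by about 1-3%) is acceptable |
int8_float16 | Medium-high | About 25-30% of float32 | Fastest | Recommended! Mixed precision: keeps float16 precision for key layers on top of int8, balancing memory and precision |
Taking the medium.en model processing 1 hour of audio as an example: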
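Setting | Peak memory | Processing time (RTX 3060) | WER (word error rate) |
---|---|---|---|
float32 | 3.2GB | 15 min | 4.5% |
float16 | 1.6GB | 10 min | 4.6% |
int8 | 0.8GB | 8 min | 5.0% |
int8_float16 | 0.9GB | 7 min | 4.7% |
- Set the number of threads used for CPU inference
--threads parameter (CPU inference only)
Purpose: sets the number of threads used for CPU inference, to make better use of a multi-core CPU.
Values: 0 (auto-detect; by default no more than 4 threads), or set it manually to the number of CPU cores (e.g. 4 or 8).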
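To get a feel for the ratios in the first table, here is a minimal sketch that turns them into a rough memory estimate. The ratios are simply the approximate figures quoted above, not measurements, and `estimate_memory_gb` and `types_that_fit` are hypothetical helper names; the estimate covers the model weights only and ignores working memory during decoding.

```python
# Approximate memory ratios relative to float32, copied from the table above.
# They are rough figures from this article, not measured values.
RATIO_RANGES = {
    "float32": (1.00, 1.00),
    "float16": (0.50, 0.50),
    "int8": (0.25, 0.25),
    "int8_float16": (0.25, 0.30),
}


def estimate_memory_gb(float32_size_gb, compute_type):
    """Return a (low, high) memory estimate in GB for the given compute type."""
    low, high = RATIO_RANGES[compute_type]
    return (float32_size_gb * low, float32_size_gb * high)


def types_that_fit(float32_size_gb, budget_gb):
    """List the compute types whose upper estimate fits within budget_gb."""
    return [
        name
        for name in RATIO_RANGES
        if estimate_memory_gb(float32_size_gb, name)[1] <= budget_gb
    ]


if __name__ == "__main__":
    # 3.0 GB is the float32 size the table quotes for the medium model.
    for name in RATIO_RANGES:
        low, high = estimate_memory_gb(3.0, name)
        print(f"{name:>13}: {low:.2f}-{high:.2f} GB")
    print("fits in 1 GB:", types_that_fit(3.0, 1.0))
```

On a 2 GB card such as the MX150 used later in this article, this kind of estimate is only a starting point; the chunk length and decoding settings also affect how much memory is actually needed.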
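Subtitle length
# When several sentences get recognized as one very long line [the console output during the run may still show long lines, but the written file is split]:
--standard # Standard preset: 42 characters per line, at most 2 lines, --sentence activated automatically
--standard_asia # Preset for some Asian languages: 16 characters per line, at most 2 lines, higher comma threshold (max_comma_cent=80)
# Optional
--vad_min_silence_duration_ms 500 --vad_threshold 0.5 # Shorten the minimum silence duration to 0.5 s (500 ms) and set the speech probability threshold to 0.5
--sentence --max_line_width 30 --max_line_count 2 # Split by sentence, at most 30 characters per line and 2 lines per segment
--max_comma 25 --max_comma_cent 70 # After 25 characters a comma counts as the end of a sentence; lines start breaking after a comma at 70% of the line width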
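To make the two limits concrete, the sketch below shows what "16 characters per line, 2 lines per segment" means for a long run of text. It is an illustration only, not the tool's algorithm: the real splitter also looks at sentence ends, commas and timing, and `split_into_cues` is a hypothetical name.

```python
def split_into_cues(text, max_line_width=16, max_line_count=2):
    """Cut text into lines of at most max_line_width characters, then group
    the lines into cues of at most max_line_count lines each.

    Illustration only: it cuts purely by character count.
    """
    lines = [
        text[i:i + max_line_width]
        for i in range(0, len(text), max_line_width)
    ]
    return [
        lines[i:i + max_line_count]
        for i in range(0, len(lines), max_line_count)
    ]


if __name__ == "__main__":
    sample = "今年消費品以舊換新五大品類合計帶動銷售額一點一萬億元發放直達消費者的補貼"
    for number, cue in enumerate(split_into_cues(sample), start=1):
        print(number, cue)
```

With the defaults (`--max_line_width 1000`, `--max_line_count 1`) nothing would ever be cut by length, which is why long segments stay on one line unless a preset or the `--max_...` options are set.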
2.2.4 Examples
small
# Test machine: CPU: I5 8250U, RAM: 16G, GPU: MX150 2G
# Directory layout
D:\Users\Desktop\字幕
├── faster-whisper-large-v2
├── faster-whisper-small
│   ├── config.json
│   ├── model.bin
│   ├── preprocessor_config.json
│   ├── tokenizer.json
│   └── vocabulary.txt
├── faster-whisper-tiny
├── output_audio.srt
└── Whisper-Faster
    ├── cublas64_11.dll
    ├── cublasLt64_11.dll
    ├── cudnn_cnn_infer64_8.dll
    ├── cudnn_ops_infer64_8.dll
    ├── Wav47B5.tmp
    ├── whisper-faster.exe
    └── zlibwapi.dll
.\whisper-faster.exe --model_dir "D:\Users\Desktop\字幕" --model small -l zh --chunk_length 20 --output_format srt --output_dir "D:\Users\Desktop\字幕" "D:\BaiduNetdiskDownload\存儲數據遷移原理.wav"
With 8 CPU threads:
.\whisper-faster.exe --device cuda --threads 8 --compute_type int8 --standard_asia --chunk_length 20 --model small --model_dir "D:\Users\Desktop\字幕" --output_format srt --output_dir "D:\Users\Desktop\字幕" "D:\BaiduNetdiskDownload\HyperCDP技術.wav"
Output
PS D:\Users\Desktop\字幕\Whisper-Faster> .\whisper-faster.exe --model_dir "D:\Users\Desktop\字幕" --model small -l zh --chunk_length 20 --output_format srt --output_dir "D:\Users\Desktop\字幕" "D:\BaiduNetdiskDownload\存儲數據遷移原理.wav"
Standalone Faster-Whisper r192.3 running on: CUDA
Starting work on: D:\BaiduNetdiskDownload\存儲數據遷移原理.wav
[00:00.580 --> 00:32.830] Hello,Hello,現在可以嗎?現在可以嗎?有聲音嗎?有聲音嗎?
。。。。。
[03:11:16.700 --> 03:11:21.680] 那接下來呢,我們今天上午呢就到這個地方,大家扣6,我們就下課了,下午兩點鐘我們繼續。
Transcription speed: 0.61 audio seconds/s
Subtitles are written to 'D:\Users\Desktop\字幕' directory.
Operation finished in: 18967 seconds
PS D:\Users\Desktop\字幕\Whisper-Faster>
large-v2
# Test machine: CPU: AMD R7 3700X, RAM: 48G, GPU: RTX2070 8G
# Transcribing 3 hours of audio: CPU usage around 33%, memory usage under 1G, GPU fully loaded
# Directory layout
D:\faster-whisper
├── faster-whisper-large-v2
│   ├── config.json
│   ├── gitattributes
│   ├── model.bin
│   ├── README.md
│   ├── tokenizer.json
│   └── vocabulary.txt
└── Faster-Whisper-XXL
    ├── _xxl_data
    ├── faster-whisper-xxl.exe
    ├── ffmpeg.exe
    └── One Click Transcribe.bat
.\faster-whisper-xxl.exe --device cuda --threads 12 --compute_type int8_float16 --standard_asia --chunk_length 30 --language zh --model large-v2 --model_dir "D:\faster-whisper" --output_format srt --output_dir "D:\faster-whisper" "G:\學習視頻\異步遠程復制原理.wav"
# Output
PS D:\faster-whisper\Faster-Whisper-XXL> .\faster-whisper-xxl.exe --device cuda --threads 12 --compute_type int8_float16 --standard_asia --chunk_length 30 --language zh --model large-v2 --model_dir "D:\faster-whisper" --output_format srt --output_dir "D:\faster-whisper" "G:\學習視頻\異步遠程復制原理.wav"
Standalone Faster-Whisper-XXL r245.4 running on: CUDA
Starting to process: G:\學習視頻\異步遠程復制原理.wav
Starting sequential faster-whisper inference.
[01:18.410 --> 01:49.220] 接下來是昨日。
。。。。。
[03:05:45.530 --> 03:05:46.670] 下午兩點鐘我們再繼續啊。
[03:06:13.500 --> 03:06:14.540] 各位同學我們下課了啊。
Transcription speed: 12.25 audio seconds/s
Subtitles are written to 'D:\faster-whisper' directory.
Operation finished in: 0:16:06.705
PS D:\faster-whisper\Faster-Whisper-XXL>
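The two runs report their totals in slightly different formats (`18967 seconds` versus `0:16:06.705`). As a minimal sketch, the helpers below parse both summary lines so the runs can be compared; they assume the log lines look exactly like the ones shown here, and the function names are hypothetical.

```python
import re


def parse_elapsed_seconds(line):
    """Parse 'Operation finished in: 18967 seconds' or
    'Operation finished in: 0:16:06.705' into seconds."""
    value = line.split("Operation finished in:", 1)[1].strip()
    if value.endswith("seconds"):
        return float(value.split()[0])
    hours, minutes, seconds = value.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_speed(line):
    """Parse 'Transcription speed: 12.25 audio seconds/s' into a float."""
    match = re.search(r"Transcription speed:\s*([\d.]+)", line)
    if match is None:
        raise ValueError(f"no speed found in: {line!r}")
    return float(match.group(1))


if __name__ == "__main__":
    small = parse_speed("Transcription speed: 0.61 audio seconds/s")
    large = parse_speed("Transcription speed: 12.25 audio seconds/s")
    print(f"speed ratio: {large / small:.1f}x")
    print(parse_elapsed_seconds("Operation finished in: 0:16:06.705"), "seconds")
```

The ratio works out to roughly 20x, but the two runs differ in GPU, model, quantization and tool version, so it is not a like-for-like comparison of the models.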
large-v2 on an entire folder
Generate subtitles for every audio file under D:\HN\桌面\新建文件夾 and write them to the D:\HN\桌面\字幕 directory:
.\faster-whisper-xxl.exe --device cuda --threads 16 --compute_type int8_float16 --standard_asia --language zh --model large-v2 --model_dir "D:\faster-whisper" --output_format srt --output_dir "D:\HN\桌面\字幕" D:\HN\桌面\新建文件夾
PS D:\HN\桌面\新建文件夾> ls
目錄: D:\HN\桌面\新建文件夾
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 2025/5/18 23:20 172193166 周末班-上午.mp3
-a---- 2025/5/18 23:20 166684464 周末班-下午.mp3
PS D:\HN\桌面\新建文件夾>
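After a folder run it can be handy to check which inputs did not get a subtitle file. The sketch below assumes each subtitle is named after its input file with an `.srt` extension (as in the single-file examples in this article, without `--postfix`); the suffix set and the function name are illustrative assumptions.

```python
from pathlib import Path

# Assumed set of audio extensions; adjust it to the files you actually have.
AUDIO_SUFFIXES = {".mp3", ".wav", ".m4a", ".flac"}


def files_missing_subtitles(media_dir, subtitle_dir):
    """Return the audio files in media_dir that have no same-named .srt
    file in subtitle_dir."""
    media_dir, subtitle_dir = Path(media_dir), Path(subtitle_dir)
    missing = []
    for path in sorted(media_dir.iterdir()):
        if path.is_file() and path.suffix.lower() in AUDIO_SUFFIXES:
            if not (subtitle_dir / (path.stem + ".srt")).exists():
                missing.append(path)
    return missing


if __name__ == "__main__":
    for path in files_missing_subtitles(r"D:\HN\桌面\新建文件夾", r"D:\HN\桌面\字幕"):
        print("no subtitles yet:", path.name)
```

The help text above also lists a `--skip` option that skips media files whose subtitles already exist, which may make re-running the folder command cheaper than filtering by hand.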
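Checking progress
While the tool runs, the cmd window title turns into a progress indicator:
Default:"47%|5128/10864|32:08<<35:57|2.66 audioseconds/s"
47%: the current progress is 47%.
5128/10864: the amount of audio processed so far (5128) and the total (10864). The unit appears to be seconds of audio rather than frames: 32:08 elapsed at 2.66 audio seconds/s gives about 5128.
32:08<<35:57: the elapsed time (32 min 08 s) and the estimated time remaining (35 min 57 s); the remaining 5736 audio seconds at 2.66 audio seconds/s take about 35 min 56 s.
2.66 audioseconds/s: the processing speed, i.e. how many seconds of audio are processed per second.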
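A quick way to sanity-check how to read these fields is to parse the sample title and compare the numbers with each other. This is a minimal sketch that assumes the title always has the four `|`-separated fields of the sample above; `parse_progress_title` and `to_seconds` are hypothetical helpers, not part of the tool.

```python
def to_seconds(clock):
    """Convert 'MM:SS' or 'HH:MM:SS' into a number of seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def parse_progress_title(title):
    """Split a title like '47%|5128/10864|32:08<<35:57|2.66 audioseconds/s'
    into its fields."""
    percent, counts, times, speed = title.split("|")
    done, total = (int(x) for x in counts.split("/"))
    first_clock, second_clock = times.split("<<")
    return {
        "percent": int(percent.rstrip("%")),
        "done": done,
        "total": total,
        "first_clock_s": to_seconds(first_clock),
        "second_clock_s": to_seconds(second_clock),
        "speed": float(speed.split()[0]),
    }


if __name__ == "__main__":
    p = parse_progress_title("47%|5128/10864|32:08<<35:57|2.66 audioseconds/s")
    # If 'done' counts audio seconds, first clock x speed should be close to it.
    print("first clock x speed:", p["first_clock_s"] * p["speed"], "vs done:", p["done"])
    # If the second clock is the time remaining, it should be close to this.
    print("(total - done) / speed:", (p["total"] - p["done"]) / p["speed"],
          "vs second clock:", p["second_clock_s"])
```

For the sample title both comparisons agree to within a few seconds, which supports reading the counters as audio seconds and the second clock as the time remaining.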
2.2.5 Errors
CUDA Out of Memory :CUDA
# Error: CUDA out of memory
RuntimeError: CUDA failed with error out of memory
[01:51:06.920 --> 01:51:20.480] 那接下來呢,我們來看一下,除了這個migration,在v5的早期啊,它就只有快業物,在v5的后期呢,它就有了文件業物,一直現在呢,
Traceback (most recent call last):
  File "D:\whisper-fast\__main__.py", line 1600, in <module>
  File "D:\whisper-fast\__main__.py", line 1527, in cli
  File "faster_whisper\transcribe.py", line 1373, in restore_speech_timestamps
  File "faster_whisper\transcribe.py", line 722, in generate_segments
  File "faster_whisper\transcribe.py", line 1072, in generate_with_fallback
RuntimeError: CUDA failed with error out of memory
[18364] Failed to execute script '__main__' due to unhandled exception!
PS D:\Users\Desktop\字幕\Whisper-Faster>
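Fix:
- Lower the length of audio the Whisper model processes per chunk, in seconds [with the small model, 30 was the largest value that worked in my tests]
--chunk_length 20
- Plenty of memory (≥8GB): 30-60 seconds (balances speed and context continuity).
- Limited memory (≤4GB): 10-20 seconds (avoids CUDA Out of Memory).
- Audio with long sentences: 20-40 seconds (so that complete sentences are not cut off)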
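Lowering the chunk length by hand can also be scripted. The sketch below retries with smaller values until a run succeeds; it assumes that the tool exits with a non-zero code and prints "out of memory" when this error occurs (the log above suggests so, but that is an assumption), and `transcribe_with_fallback` and `make_runner` are hypothetical helpers.

```python
import subprocess


def transcribe_with_fallback(run_once, chunk_lengths=(30, 20, 10)):
    """Try each chunk length in order until run_once reports success.

    run_once(chunk_length) is supplied by the caller and must return
    (returncode, output_text). Returns the chunk length that worked.
    """
    for chunk_length in chunk_lengths:
        returncode, output = run_once(chunk_length)
        if returncode == 0:
            return chunk_length
        if "out of memory" not in output.lower():
            raise RuntimeError(f"failed for another reason:\n{output}")
    raise RuntimeError("still out of memory at the smallest chunk length")


def make_runner(exe, audio_file, extra_args=()):
    """Build a run_once function that calls the executable via subprocess."""
    def run_once(chunk_length):
        command = [exe, *extra_args, "--chunk_length", str(chunk_length), audio_file]
        done = subprocess.run(command, capture_output=True, text=True, errors="replace")
        return done.returncode, done.stdout + done.stderr
    return run_once
```

A call might look like `transcribe_with_fallback(make_runner(r".\whisper-faster.exe", "audio.wav", ["--model", "small", "-l", "zh"]))`, where the paths are placeholders. Each retry starts the whole file again, so on long recordings it is usually cheaper to pick a conservative value up front.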
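3 Hands-on demo
3.1 Windows-only demo
Full workflow
The full workflow:
- Install ffmpeg
- Download faster-whisper
- Download a faster-whisper model
- Use ffmpeg to extract the audio from the video
- Run faster-whisper with the chosen model to recognize the speech and generate subtitles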
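The last two steps of the list can be chained in a small script. The sketch below only reuses command-line options that appear in this article (`-vn`, `-acodec mp3`, `--model_dir`, `--model`, `-l`, `--output_format`, `--output_dir`); the function names are hypothetical, and it assumes that `ffmpeg` is on PATH and that the subtitle file is named after the audio file.

```python
import subprocess
from pathlib import Path


def build_extract_command(video, audio):
    """ffmpeg command that drops the video stream and encodes the audio as MP3."""
    return ["ffmpeg", "-i", str(video), "-vn", "-acodec", "mp3", str(audio)]


def build_transcribe_command(exe, model_dir, model, audio, output_dir, language="zh"):
    """whisper-faster command that writes an srt file into output_dir."""
    return [
        str(exe),
        "--model_dir", str(model_dir),
        "--model", model,
        "-l", language,
        "--output_format", "srt",
        "--output_dir", str(output_dir),
        str(audio),
    ]


def video_to_subtitles(video, exe, model_dir, model="small"):
    """Extract the audio next to the video, then transcribe it.

    Returns the path where the subtitle file is expected to appear.
    """
    video = Path(video)
    audio = video.with_suffix(".mp3")
    subprocess.run(build_extract_command(video, audio), check=True)
    subprocess.run(
        build_transcribe_command(exe, model_dir, model, audio, video.parent),
        check=True,
    )
    return video.with_suffix(".srt")
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces or brackets, such as the file names used in this demo.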
Test machine environment
# Test machine: CPU: I5 8250U, RAM: 16G, GPU: MX150 2G
# Video location: D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp4
# faster-whisper directory: E:\字幕識別工具\Whisper-Faster
# faster-whisper model directory: E:\字幕識別工具\模型存放目錄
# Directory layout
E:\字幕識別工具
├── Whisper-Faster
│   ├── cublas64_11.dll
│   ├── cublasLt64_11.dll
│   ├── cudnn_cnn_infer64_8.dll
│   ├── cudnn_ops_infer64_8.dll
│   ├── Wav47B5.tmp
│   ├── whisper-faster.exe
│   └── zlibwapi.dll
└── 模型存放目錄
    ├── faster-whisper-tiny
    ├── faster-whisper-large-v2
    └── faster-whisper-small
        ├── config.json
        ├── model.bin
        ├── preprocessor_config.json
        ├── tokenizer.json
        └── vocabulary.txt
Extracting the audio from the video with ffmpeg
ffmpeg -i "D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp4" -vn -acodec mp3 "D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp3"
-vn: disables the video stream so that only the audio is extracted.
-acodec mp3: sets the audio codec to MP3.
# Output
PS C:\Users\h1369> ffmpeg -i "D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp4" -vn -acodec mp3 "D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp3"
ffmpeg version 2025-05-12-git-8ce32a7cbb-full_build-www.gyan.dev Copyright (c) 2000-2025 the FFmpeg developers
  built with gcc 15.1.0 (Rev2, Built by MSYS2 project)
  configuration: --enable-gpl --enable-version3 --enable-static --disable-w32threads --disable-autodetect --enable-fontconfig --enable-iconv --enable-gnutls --enable-lcms2 --enable-libxml2 --enable-gmp --enable-bzlib --enable-lzma --enable-libsnappy --enable-zlib --enable-librist --enable-libsrt --enable-libssh --enable-libzmq --enable-avisynth --enable-libbluray --enable-libcaca --enable-libdvdnav --enable-libdvdread --enable-sdl2 --enable-libaribb24 --enable-libaribcaption --enable-libdav1d --enable-libdavs2 --enable-libopenjpeg --enable-libquirc --enable-libuavs3d --enable-libxevd --enable-libzvbi --enable-libqrencode --enable-librav1e --enable-libsvtav1 --enable-libvvenc --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxavs2 --enable-libxeve --enable-libxvid --enable-libaom --enable-libjxl --enable-libvpx --enable-mediafoundation --enable-libass --enable-frei0r --enable-libfreetype --enable-libfribidi --enable-libharfbuzz --enable-liblensfun --enable-libvidstab --enable-libvmaf --enable-libzimg --enable-amf --enable-cuda-llvm --enable-cuvid --enable-dxva2 --enable-d3d11va --enable-d3d12va --enable-ffnvcodec --enable-libvpl --enable-nvdec --enable-nvenc --enable-vaapi --enable-libshaderc --enable-vulkan --enable-libplacebo --enable-opencl --enable-libcdio --enable-libgme --enable-libmodplug --enable-libopenmpt --enable-libopencore-amrwb --enable-libmp3lame --enable-libshine --enable-libtheora --enable-libtwolame --enable-libvo-amrwbenc --enable-libcodec2 --enable-libilbc --enable-libgsm --enable-liblc3 --enable-libopencore-amrnb --enable-libopus --enable-libspeex --enable-libvorbis --enable-ladspa --enable-libbs2b --enable-libflite --enable-libmysofa --enable-librubberband --enable-libsoxr --enable-chromaprint
  libavutil 60. 2.100 / 60. 2.100
  libavcodec 62. 3.101 / 62. 3.101
  libavformat 62. 0.102 / 62. 0.102
  libavdevice 62. 0.100 / 62. 0.100
  libavfilter 11. 0.100 / 11. 0.100
  libswscale 9. 0.100 / 9. 0.100
  libswresample 6. 0.100 / 6. 0.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp4':
  Metadata:
    major_brand : mp42
    minor_version : 0
    compatible_brands: mp42isom
  Duration: 00:01:42.17, start: 0.000000, bitrate: 2680 kb/s
  Stream #0:0[0x1](und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 192 kb/s (default)
    Metadata:
      vendor_id : [0][0][0][0]
  Stream #0:1[0x2](und): Video: h264 (High) (avc1 / 0x31637661), yuv420p(progressive), 1916x1076 [SAR 1:1 DAR 479:269], 2485 kb/s, 29.93 fps, 30 tbr, 100k tbn (default)
    Metadata:
      vendor_id : [0][0][0][0]
      encoder : JVT/AVC Coding
Stream mapping:
  Stream #0:0 -> #0:0 (aac (native) -> mp3 (libmp3lame))
Press [q] to stop, [?] for help
Output #0, mp3, to 'D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp3':
  Metadata:
    major_brand : mp42
    minor_version : 0
    compatible_brands: mp42isom
    TSSE : Lavf62.0.102
  Stream #0:0(und): Audio: mp3, 48000 Hz, stereo, fltp (default)
    Metadata:
      encoder : Lavc62.3.101 libmp3lame
      vendor_id : [0][0][0][0]
[out#0/mp3 @ 0000024fc68ac040] video:0KiB audio:1598KiB subtitle:0KiB other streams:0KiB global headers:0KiB muxing overhead: 0.020112%
size= 1598KiB time=00:01:42.17 bitrate= 128.1kbits/s speed=86.6x elapsed=0:00:01.17
PS C:\Users\h1369>
Generating subtitles with faster-whisper
Run faster-whisper with the chosen model to recognize the speech and generate the subtitles:
E:\字幕識別工具\Whisper-Faster\whisper-faster.exe --standard_asia --model_dir "E:\字幕識別工具\模型存放目錄" --model small -l zh --chunk_length 20 --output_format srt --output_dir "D:\Users\Desktop\新建文件夾" "D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp3"
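# Parameters
--standard_asia # Preset for some Asian languages: 16 characters per line, at most 2 lines, higher comma threshold
--model_dir # Path of the model directory: E:\字幕識別工具\模型存放目錄
--model # Model to use: small
-l zh # Recognition language: Chinese
--chunk_length # Audio chunk length: 20 seconds
--output_format # Output subtitle format: srt
--output_dir # Where the subtitle file is saved: D:\Users\Desktop\新建文件夾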
# Output
PS C:\Users\h1369> E:\字幕識別工具\Whisper-Faster\whisper-faster.exe --standard_asia --model_dir "E:\字幕識別工具\模型存放目錄" --model small -l zh --chunk_length 20 --output_format srt --output_dir "D:\Users\Desktop\新建文件夾" "D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp3"
Standalone Faster-Whisper r192.3 running on: CUDA
Starting work on: D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.mp3
[00:00.000 --> 00:01.600] 接下來更多消息,我們來看一組簡訊。
[00:03.560 --> 00:08.460] 今年以來,消費品已就換新,有力帶動消費,持續回升向好。
[00:08.920 --> 00:11.820] 商務部數據顯示,截至5月31號,
[00:12.080 --> 00:17.200] 今年消費品已就換新,五大品類合計帶動銷售額1.1萬億元。
[00:17.200 --> 00:20.960] 發放直達消費者的補貼約1.75億份。
[00:21.940 --> 00:25.800] 5月份,山東青島港開閉三條全新航線,
[00:26.120 --> 00:31.500] 覆蓋了巴西、阿根廷、智利等南美主要經濟體以及中東疏扭港口,
[00:31.860 --> 00:36.620] 預計每年為青島港新增集裝箱吞吐量超過20萬飆箱。
[00:37.560 --> 00:40.040] 新加坡與毛球公開賽昨晚結束,
[00:40.340 --> 00:42.160] 中國隊一金兩銀收關。
[00:42.540 --> 00:46.360] 女單決賽中,陳宇飛戰勝隊友王子儀獲得了冠軍。
[00:46.700 --> 00:49.460] 南單決賽中,中國選手陸光祖0比2
[00:49.460 --> 00:52.180] 不敵泰國名將昆拉伍特收獲亞軍。
[00:53.200 --> 00:56.020] 1號,在2025年法國網球公開賽
[00:56.020 --> 01:00.340] 女單第四輪比賽中,賽會8號種子中國選手鄭青文
[01:00.340 --> 01:03.260] 經歷了2小時47分鐘的苦戰,
[01:03.540 --> 01:06.620] 2比1擊敗了俄羅斯選手薩蒙索諾娃,
[01:06.800 --> 01:09.080] 職業生涯首次近期罰網八強。
[01:09.440 --> 01:12.220] 鄭青文在四分之一決賽中的對手將會是
[01:12.220 --> 01:13.740] 頭號種子薩巴倫卡,
[01:13.740 --> 01:17.420] 后者執落兩盤,戰勝美國選手阿尼西莫娃。
[01:18.160 --> 01:23.000] 記者昨天從中國氣象局國家空間天氣監測預警中心獲悉,
[01:23.440 --> 01:25.820] 5月31號,太陽爆發藥班,
[01:26.180 --> 01:29.040] 地球可能連續三天發生地磁爆,
[01:29.560 --> 01:32.820] 衛星通信、航天氣運行等可能會受到干擾。
[01:32.820 --> 01:36.340] 我國北部有機會出現較為明顯的極光,
[01:36.500 --> 01:38.860] 但不會對人體健康有影響。
Transcription speed: 2.16 audio seconds/s
Subtitles are written to 'D:\Users\Desktop\新建文件夾' directory.
Operation finished in: 52 seconds
PS C:\Users\h1369>
PS C:\Users\h1369> ls D:\Users\Desktop\新建文件夾
目錄: D:\Users\Desktop\新建文件夾
Mode LastWriteTime Length Name
---- ------------- ------ ----
-a---- 2025/6/19 9:44 1636169 [新聞30分]國內簡訊-1.mp3
-a---- 2025/6/18 10:18 34232134 [新聞30分]國內簡訊-1.mp4
-a---- 2025/6/19 9:57 2233 [新聞30分]國內簡訊-1.srt
-a---- 2025/6/18 10:22 35120886 [新聞30分]國內簡訊-2.mp4
PS C:\Users\h1369>
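Once the `.srt` file exists, a few lines of Python are enough to inspect it, for example to confirm that no line exceeds the 16-character limit of `--standard_asia`. This is a minimal sketch of a parser for the common SRT layout (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text lines, blank line); `parse_srt` is a hypothetical helper, and the UTF-8 encoding used when reading the file is an assumption.

```python
import re

TIME_LINE = re.compile(
    r"(\d+):(\d{2}):(\d{2}),(\d{3}) --> (\d+):(\d{2}):(\d{2}),(\d{3})"
)


def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, text_lines) cues."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIME_LINE.search(line)
            if match:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = (int(g) for g in match.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the cue text.
                cues.append((start, end, lines[i + 1:]))
                break
    return cues


def lines_over_width(cues, max_width=16):
    """Return the text lines that are longer than max_width characters."""
    return [line for _, _, lines in cues for line in lines if len(line) > max_width]


if __name__ == "__main__":
    path = r"D:\Users\Desktop\新建文件夾\[新聞30分]國內簡訊-1.srt"
    with open(path, encoding="utf-8-sig") as handle:  # encoding is an assumption
        cues = parse_srt(handle.read())
    print(len(cues), "cues;", len(lines_over_width(cues)), "lines over 16 characters")
```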
3.2 Supplement
Batch-extracting audio with Python
import os
import subprocess
import multiprocessing

import ffmpeg  # third-party package: ffmpeg-python


# Convert a video to audio by calling the ffmpeg command line
def video_to_audio_cmd(_video_file, _audio_file):
    # Build the ffmpeg command
    command = [
        "ffmpeg",
        "-hwaccel", "cuda",
        "-i", _video_file,
        "-vn",
        "-acodec", "libmp3lame",
        _audio_file,
    ]
    # Run the command
    try:
        subprocess.run(command, check=True)
        print("ffmpeg command succeeded")
        print(f"Audio file saved to: {_audio_file}")
    except subprocess.CalledProcessError as e:
        print(f"ffmpeg command failed: {e}")


# Convert a video to audio through the ffmpeg-python bindings
def video_to_audio(_video_file, _audio_file):
    input_stream = ffmpeg.input(_video_file)
    # Output parameters: 44100 Hz sample rate, 2 channels
    output_stream = input_stream.output(_audio_file, ar='44100', ac='2')
    # Run the conversion
    output_stream.run()


if __name__ == "__main__":
    # Directory that holds the videos
    directory = r"D:\Users\Desktop\新建文件夾"
    # Directory for the extracted audio
    output_directory = r"D:\Users\Desktop\新建文件夾"
    # Make sure the output directory exists
    os.makedirs(output_directory, exist_ok=True)
    # Collect every video file in the directory (extension match is case-insensitive)
    video_files = [f for f in os.listdir(directory)
                   if f.lower().endswith(('.mp4', '.ts', '.avi', '.mov'))]
    # Process pool with at most 4 worker processes
    po = multiprocessing.Pool(processes=4)
    # Convert the videos in parallel
    for video_file in video_files:
        filepath = os.path.join(directory, video_file)
        audiopath = os.path.join(output_directory, os.path.splitext(video_file)[0] + '.mp3')
        print(f"Video path: {filepath} \nAudio path: {audiopath}")
        po.apply_async(video_to_audio, args=(filepath, audiopath))
    # Close the pool; after this it accepts no new tasks
    po.close()
    # Wait for the worker processes to finish
    po.join()
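4 Summary
Recognition quality of Faster-Whisper:
- small model: mainly what I run on a low-spec laptop
- large-v2 model: it still gets some things wrong, but it is noticeably more accurate and I am fairly satisfied with it;
The main benefit is that recognition works offline, which is convenient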