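Speech recognition has become an essential part of many applications. Whether for meeting notes, voice assistants, or content captioning, the ability to turn speech into text directly affects user experience and productivity. This article shows how to build a compact audio transcription system, focusing on file upload, cloud storage, and ASR (automatic speech recognition) integration, with an implementation based on the Volcano Engine ASR service.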
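System Architecture Overview
A compact audio transcription system needs the following key components:
- Frontend: the entry point where users upload audio files
- API service layer: handles requests and business logic
- Cloud storage service: stores the audio files securely
- ASR service: transcribes audio into text (this article uses the Volcano Engine ASR service)
The system flow is as follows:
User → upload audio → store in cloud storage → trigger ASR transcription → fetch the transcript → return it to the user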
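Technology Choices
Our minimal implementation is built on the following stack:
- Backend framework: FastAPI (Python)
- Cloud storage: S3-compatible object storage
- ASR service: Volcano Engine ASR
- Asynchronous processing: asyncio-based async request handling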
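Detailed Implementation
1. Audio File Upload Flow
There are two main ways to upload audio:
1.1 Presigned URL Upload
This approach suits large files and takes load off the server, because the client sends the file straight to storage: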
import os
from datetime import datetime

# storage_client is the application's storage wrapper (see the complete example)

async def create_upload_url(file_name, file_size, mime_type):
    """Create a presigned upload URL."""
    # Generate a unique file name
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    random_suffix = os.urandom(4).hex()
    file_ext = os.path.splitext(file_name)[1]
    filename = f"{timestamp}_{random_suffix}{file_ext}"

    # Build the storage path
    storage_path = f"audio/{filename}"

    # Get a presigned URL
    upload_url = storage_client.generate_presigned_url(
        storage_path,
        expiry=300,  # valid for 5 minutes
        http_method="PUT",
        content_length=file_size,
    )

    return {
        "upload_url": upload_url,
        "storage_path": storage_path,
    }
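To make the idea of a presigned URL concrete, here is a minimal sketch of the principle: the server signs the HTTP method, path, and expiry time with a secret, and the storage side recomputes the signature before accepting the request. This is a simplified illustration, not the S3 Signature Version 4 algorithm; the names `sign_url`, `verify_url`, and `SECRET`, and the host `storage.example.com`, are assumptions made for the example.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical shared secret; a real deployment uses the storage provider's credentials.
SECRET = b"demo-secret"


def _signature(method: str, path: str, expires_at: int) -> str:
    """HMAC-SHA256 over the method, path, and expiry timestamp."""
    message = f"{method}\n{path}\n{expires_at}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_url(path: str, method: str = "PUT", expiry: int = 300,
             now: Optional[float] = None) -> str:
    """Return a URL that is valid for `expiry` seconds and one HTTP method."""
    issued = time.time() if now is None else now
    expires_at = int(issued) + expiry
    query = urlencode({
        "expires": expires_at,
        "signature": _signature(method, path, expires_at),
    })
    return f"https://storage.example.com/{path}?{query}"


def verify_url(url: str, method: str, now: Optional[float] = None) -> bool:
    """Recompute the signature and reject expired or tampered URLs."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires_at = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if current > expires_at:
        return False
    expected = _signature(method, parsed.path.lstrip("/"), expires_at)
    # Constant-time comparison avoids leaking how many characters matched
    return hmac.compare_digest(signature.encode(), expected.encode())
```

Because the method and path are part of the signed message, a URL issued for `PUT audio/a.mp3` cannot be reused to read or overwrite a different object, and it stops working once the expiry time passes.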
Frontend call example:
// 1. Request an upload URL
const response = await fetch('/api/audio/upload-url', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    file_name: file.name,
    file_size: file.size,
    mime_type: file.type
  })
});
const { upload_url, storage_path } = await response.json();

// 2. Upload the file with the presigned URL
await fetch(upload_url, {
  method: 'PUT',
  body: file,
  headers: { 'Content-Type': file.type }
});

// 3. Trigger transcription
const transcriptResponse = await fetch('/api/audio/transcribe', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ storage_path })
});
const transcriptResult = await transcriptResponse.json();
1.2 Direct Upload
This approach suits smaller files, which are uploaded directly through the API:
import os
from datetime import datetime, timedelta

async def upload_audio(file):
    """Upload an audio file directly."""
    # Validate the file type
    if file.content_type not in ALLOWED_AUDIO_TYPES:
        raise ValueError("Unsupported file type")

    # Read the file contents
    contents = await file.read()
    if len(contents) == 0:
        raise ValueError("File is empty")

    # Generate a unique file name
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    random_suffix = os.urandom(4).hex()
    file_ext = os.path.splitext(file.filename)[1]
    filename = f"{timestamp}_{random_suffix}{file_ext}"

    # Storage path
    storage_path = f"audio/{filename}"

    # Upload to cloud storage
    storage_client.upload(storage_path, contents)

    # Generate an access URL
    access_url = storage_client.generate_presigned_url(
        storage_path,
        expiry=3600,  # valid for 1 hour
        http_method="GET",
    )

    return {
        "file_name": file.filename,
        "storage_path": storage_path,
        "file_size": len(contents),
        "mime_type": file.content_type,
        "access_url": access_url,
        "url_expires_at": datetime.now() + timedelta(hours=1),
    }
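Note that `file.content_type` is supplied by the client, so it can be wrong or forged. One possible hardening step is to inspect the first bytes of the upload as well. The sketch below recognizes the containers behind the MIME types this article accepts; the helper name `sniff_audio_type` is an assumption, and the checks are deliberately coarse (any ISO base media file with an `ftyp` box is reported as `audio/mp4`).

```python
from typing import Optional


def sniff_audio_type(data: bytes) -> Optional[str]:
    """Guess a MIME type from the leading bytes, or return None if unknown."""
    # WAV: a RIFF container whose form type is "WAVE"
    if len(data) >= 12 and data[0:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "audio/wav"
    # MP4 / M4A: the first box is normally "ftyp", with its type at bytes 4..8
    if len(data) >= 8 and data[4:8] == b"ftyp":
        return "audio/mp4"
    # MP3 that starts with an ID3v2 tag
    if data[0:3] == b"ID3":
        return "audio/mpeg"
    # MPEG audio frame sync: the header begins with 11 set bits
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "audio/mpeg"
    return None
```

In `upload_audio`, the result could be compared with `file.content_type` (treating `audio/x-m4a` and `audio/mp4` as the same container) before the file is stored.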
2. ASR Transcription
The ASR service can be invoked in two ways: from a storage path, or directly from a URL.
2.1 Transcription by Storage Path
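async def transcribe_audio_by_storage_path(storage_path):
    """Transcribe an audio file identified by its storage path."""
    # Generate a URL the ASR service can fetch
    access_url = storage_client.generate_presigned_url(
        storage_path,
        expiry=3600,
        http_method="GET",
    )

    # Call the ASR service
    transcript_result = await _call_asr_service(access_url)

    return {
        "storage_path": storage_path,
        "transcript": transcript_result.get("text", ""),
        # _call_asr_service returns the timed segments under "utterances"
        "segments": transcript_result.get("utterances", []),
        # None unless the ASR helper also returns a duration
        "duration": transcript_result.get("duration"),
    }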
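2.2 Transcription by URL
async def transcribe_audio_by_url(audio_url):
    """Transcribe audio from a URL."""
    # Call the ASR service
    transcript_result = await _call_asr_service(audio_url)

    return {
        "audio_url": audio_url,
        "transcript": transcript_result.get("text", ""),
        # _call_asr_service returns the timed segments under "utterances"
        "segments": transcript_result.get("utterances", []),
        # None unless the ASR helper also returns a duration
        "duration": transcript_result.get("duration"),
    }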
2.3 Upload and Transcribe Immediately
async def upload_and_transcribe(file):
    """Upload an audio file and transcribe it right away."""
    # Upload the file
    upload_result = await upload_audio(file)

    # Transcribe the audio
    transcript_result = await _call_asr_service(upload_result["access_url"])

    # Combine the results
    return {
        "file_name": upload_result["file_name"],
        "storage_path": upload_result["storage_path"],
        "file_size": upload_result["file_size"],
        "mime_type": upload_result["mime_type"],
        "access_url": upload_result["access_url"],
        "transcript": transcript_result.get("text", ""),
        # _call_asr_service returns the timed segments under "utterances"
        "segments": transcript_result.get("utterances", []),
        # None unless the ASR helper also returns a duration
        "duration": transcript_result.get("duration"),
    }
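3. Calling the Volcano Engine ASR Service
The following is a detailed implementation based on the Volcano Engine ASR service. It submits a task, then polls the query endpoint until the task finishes, fails, or the retry budget runs out: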
import asyncio
import json
import uuid

import aiohttp

async def _call_asr_service(audio_url):
    """Call the Volcano Engine ASR service to transcribe audio."""
    # Generate a unique task ID
    task_id = str(uuid.uuid4())

    # Volcano Engine ASR API endpoints
    submit_url = "https://openspeech.bytedance.com/api/v3/auc/bigmodel/submit"
    query_url = "https://openspeech.bytedance.com/api/v3/auc/bigmodel/query"

    # Request headers
    headers = {
        "Content-Type": "application/json",
        "X-Api-App-Key": APP_KEY,
        "X-Api-Access-Key": ACCESS_KEY,
        "X-Api-Resource-Id": "volc.bigasr.auc",
        "X-Api-Request-Id": task_id,
        "X-Api-Sequence": "-1",
    }

    # Request body
    payload = {"audio": {"url": audio_url}}

    # One session is reused for the submit call and every poll
    async with aiohttp.ClientSession() as session:
        # Submit the transcription task
        async with session.post(submit_url, headers=headers, data=json.dumps(payload)) as response:
            if response.status != 200:
                error_detail = await response.text()
                raise ValueError(f"Failed to submit ASR task: {error_detail}")

            status_code = response.headers.get("X-Api-Status-Code")
            log_id = response.headers.get("X-Tt-Logid", "")

            if status_code not in ["20000000", "20000001", "20000002"]:
                message = response.headers.get("X-Api-Message", "unknown error")
                raise ValueError(f"ASR task submission error: {message}")

        query_headers = {
            "Content-Type": "application/json",
            "X-Api-App-Key": APP_KEY,
            "X-Api-Access-Key": ACCESS_KEY,
            "X-Api-Resource-Id": "volc.bigasr.auc",
            "X-Api-Request-Id": task_id,
            "X-Tt-Logid": log_id,
        }

        # Poll for the result. With 10 tries at 0.5 s each this waits about
        # 5 seconds in total; raise either value for long recordings.
        max_retries = 10
        for _ in range(max_retries):
            # Wait before each query
            await asyncio.sleep(0.5)

            async with session.post(query_url, headers=query_headers, data=json.dumps({})) as response:
                if response.status != 200:
                    continue

                query_status_code = response.headers.get("X-Api-Status-Code")

                # Finished: return the result
                if query_status_code == "20000000":
                    try:
                        response_data = await response.json()
                    except Exception as e:
                        raise ValueError(f"Failed to parse ASR response: {e}") from e
                    result = response_data.get("result", {})
                    return {
                        "text": result.get("text", ""),
                        "utterances": result.get("utterances", []),
                    }

                # Still processing: keep waiting
                if query_status_code in ["20000001", "20000002"]:
                    continue

                error_message = response.headers.get("X-Api-Message", "unknown error")
                raise ValueError(f"ASR task query failed: {error_message}")

    # Exceeded the maximum number of retries
    raise ValueError("ASR transcription timed out, please try again later")
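The header handling above boils down to a three-way decision on `X-Api-Status-Code`. As a small sketch, that mapping can be isolated in one helper so the submit and query paths share it. The codes are the ones used in the code above; the enum `AsrState` and the function `interpret_status` are names introduced here for illustration.

```python
from enum import Enum
from typing import Optional


class AsrState(Enum):
    DONE = "done"
    PENDING = "pending"
    FAILED = "failed"


# Codes as used in this article's polling loop
DONE_CODE = "20000000"
PENDING_CODES = {"20000001", "20000002"}


def interpret_status(status_code: Optional[str]) -> AsrState:
    """Map an X-Api-Status-Code header value to a polling decision."""
    if status_code == DONE_CODE:
        return AsrState.DONE
    if status_code in PENDING_CODES:
        return AsrState.PENDING
    # A missing header or any other code is treated as a failure
    return AsrState.FAILED
```

With this helper, the polling loop reads as "return on `DONE`, continue on `PENDING`, raise on `FAILED`", and a missing header no longer needs a special case.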
4. API Design
The complete API surface, limited to the minimal feature set:
from fastapi import APIRouter, UploadFile

router = APIRouter()
# audio_service is the module that holds the service functions shown above

# 1. Get an upload URL
@router.post("/audio/upload-url")
async def create_upload_url(request: dict):
    return await audio_service.create_upload_url(
        request["file_name"], request["file_size"], request["mime_type"]
    )

# 2. Upload audio directly
@router.post("/audio/upload")
async def upload_audio(file: UploadFile):
    return await audio_service.upload_audio(file)

# 3. Transcribe audio (by storage path)
@router.post("/audio/transcribe")
async def transcribe_audio(request: dict):
    return await audio_service.transcribe_audio_by_storage_path(request["storage_path"])

# 4. Transcribe audio by URL
@router.post("/audio/transcribe-by-url")
async def transcribe_by_url(request: dict):
    return await audio_service.transcribe_audio_by_url(request["audio_url"])

# 5. Upload and transcribe audio
@router.post("/audio/upload-and-transcribe")
async def upload_and_transcribe(file: UploadFile):
    return await audio_service.upload_and_transcribe(file)
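Performance and Reliability
In a real production environment, the following points also deserve attention:
1. Large File Handling
For large audio files:
- Upload in chunks
- Support resumable uploads
- Enforce a file size limit
- Use presigned URLs so the file does not pass through the API server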
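As a rough sketch of the chunking and size-limit points above, the helper below rejects files over a limit and splits the rest into byte ranges that a client could upload one part at a time. The 100 MB limit, the 5 MB part size, and the names `plan_chunks`, `MAX_FILE_SIZE`, and `CHUNK_SIZE` are assumptions for illustration; the actual multipart protocol depends on the storage provider.

```python
from typing import List, Tuple

MAX_FILE_SIZE = 100 * 1024 * 1024  # assumed upper limit: 100 MB
CHUNK_SIZE = 5 * 1024 * 1024       # assumed part size: 5 MB


def plan_chunks(file_size: int, chunk_size: int = CHUNK_SIZE) -> List[Tuple[int, int]]:
    """Return (start, end) byte offsets, end exclusive, that cover the file."""
    if file_size <= 0:
        raise ValueError("file is empty")
    if file_size > MAX_FILE_SIZE:
        raise ValueError("file exceeds the size limit")
    return [
        (start, min(start + chunk_size, file_size))
        for start in range(0, file_size, chunk_size)
    ]
```

For resumable uploads, the client can record which ranges have been confirmed and resend only the missing ones after a failure.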
2. Error Handling and Retries
To make the system more stable:
- Retry with exponential backoff
- Log in detail
- Set timeouts
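The exponential backoff strategy listed above can be sketched as follows. This is one possible shape, not a prescribed implementation: the function name `retry_with_backoff` and its default values are assumptions, and catching every `Exception` is a simplification (a real service would retry only errors it knows to be transient).

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def retry_with_backoff(
    operation: Callable[[], Awaitable[T]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
) -> T:
    """Run `operation`, doubling the wait after each failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            delay = min(base_delay * 2 ** (attempt - 1), max_delay)
            # Random jitter keeps many clients from retrying in lockstep
            await asyncio.sleep(delay * random.uniform(0.5, 1.0))
    raise RuntimeError("max_attempts must be at least 1")
```

A call such as `await retry_with_backoff(lambda: _call_asr_service(audio_url))` would wrap the ASR helper from this article; combining it with `asyncio.wait_for` is one way to add an overall timeout.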
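3. Security Considerations
To protect user data:
- Enforce access control
- Give audio URLs a short validity period
- Consider a cleanup mechanism for temporary files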
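One concrete access-control check: the transcription endpoint accepts a `storage_path` from the client, so it is worth confirming that the path stays inside the `audio/` prefix used by the upload code before signing a URL for it. A minimal sketch, with the helper name `is_safe_storage_path` and the constant `ALLOWED_PREFIX` assumed for illustration:

```python
import posixpath

ALLOWED_PREFIX = "audio/"  # prefix used by the upload code in this article


def is_safe_storage_path(storage_path: str) -> bool:
    """Accept only normalized object keys that live under ALLOWED_PREFIX."""
    if not storage_path or "\\" in storage_path or "\x00" in storage_path:
        return False
    normalized = posixpath.normpath(storage_path)
    # normpath collapses "..", ".", and duplicate slashes; require no change
    if normalized != storage_path:
        return False
    return normalized.startswith(ALLOWED_PREFIX) and len(normalized) > len(ALLOWED_PREFIX)
```

This check only guards the path format. It does not tell you whether the caller owns the file, which still requires authentication and a record of who uploaded what.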
Complete Example: A Minimal Viable Implementation
Below is a minimal viable implementation of Volcano Engine ASR transcription built with FastAPI:
import os
import uuid
import json
import asyncio
import aiohttp
from datetime import datetime, timedelta
from fastapi import Body, FastAPI, File, UploadFile  # file uploads also need python-multipart

app = FastAPI()

# Configuration
ALLOWED_AUDIO_TYPES = ["audio/mpeg", "audio/wav", "audio/mp4", "audio/x-m4a"]
APP_KEY = os.getenv("VOLCANO_ASR_APP_ID")
ACCESS_KEY = os.getenv("VOLCANO_ASR_ACCESS_TOKEN")


# A simple stand-in for a storage client
class SimpleStorageClient:
    def upload(self, path, content):
        # A real project would connect to S3, OSS, or another cloud store
        print(f"Uploading {len(content)} bytes to {path}")
        return True

    def generate_presigned_url(self, path, expiry=3600, http_method="GET", **kwargs):
        # Simplified for the example; a real client returns a signed URL
        return f"https://storage-example.com/{path}?expires={expiry}&method={http_method}"


storage_client = SimpleStorageClient()


# API endpoints
# Body(...) makes FastAPI read these fields from the JSON body the frontend
# sends; bare str/int parameters would be treated as query parameters.
@app.post("/audio/upload-url")
async def create_upload_url(
    file_name: str = Body(...),
    file_size: int = Body(...),
    mime_type: str = Body(...),
):
    """Get an upload URL."""
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    random_suffix = os.urandom(4).hex()
    file_ext = os.path.splitext(file_name)[1]
    filename = f"{timestamp}_{random_suffix}{file_ext}"
    storage_path = f"audio/{filename}"

    upload_url = storage_client.generate_presigned_url(
        storage_path,
        expiry=300,
        http_method="PUT",
        content_length=file_size,
    )

    return {"upload_url": upload_url, "storage_path": storage_path}


@app.post("/audio/upload")
async def upload_audio(file: UploadFile = File(...)):
    """Upload an audio file directly."""
    if file.content_type not in ALLOWED_AUDIO_TYPES:
        return {"error": "Unsupported file type"}

    contents = await file.read()
    if len(contents) == 0:
        return {"error": "File is empty"}

    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    random_suffix = os.urandom(4).hex()
    file_ext = os.path.splitext(file.filename)[1]
    filename = f"{timestamp}_{random_suffix}{file_ext}"
    storage_path = f"audio/{filename}"

    storage_client.upload(storage_path, contents)
    access_url = storage_client.generate_presigned_url(
        storage_path,
        expiry=3600,
        http_method="GET",
    )

    return {
        "file_name": file.filename,
        "storage_path": storage_path,
        "file_size": len(contents),
        "mime_type": file.content_type,
        "access_url": access_url,
        "url_expires_at": (datetime.now() + timedelta(hours=1)).isoformat(),
    }


@app.post("/audio/transcribe")
async def transcribe_audio(storage_path: str = Body(..., embed=True)):
    """Transcribe audio by storage path."""
    access_url = storage_client.generate_presigned_url(
        storage_path,
        expiry=3600,
        http_method="GET",
    )
    transcript_result = await _call_volcano_asr(access_url)
    return {"storage_path": storage_path, "transcript": transcript_result}


@app.post("/audio/transcribe-by-url")
async def transcribe_by_url(audio_url: str = Body(..., embed=True)):
    """Transcribe audio by URL."""
    transcript_result = await _call_volcano_asr(audio_url)
    return {"audio_url": audio_url, "transcript": transcript_result}


@app.post("/audio/upload-and-transcribe")
async def upload_and_transcribe(file: UploadFile = File(...)):
    """Upload an audio file and transcribe it."""
    upload_result = await upload_audio(file)
    if "error" in upload_result:
        return upload_result

    transcript_result = await _call_volcano_asr(upload_result["access_url"])
    return {**upload_result, "transcript": transcript_result}


async def _call_volcano_asr(audio_url):
    """Call the Volcano Engine ASR service."""
    if not APP_KEY or not ACCESS_KEY:
        # Reported under "error" so it is not mistaken for a transcript
        return {"error": "Volcano Engine ASR configuration is missing; set the environment variables"}

    # Generate a task ID
    task_id = str(uuid.uuid4())

    # Volcano Engine ASR API endpoints
    submit_url = "https://openspeech.bytedance.com/api/v3/auc/bigmodel/submit"
    query_url = "https://openspeech.bytedance.com/api/v3/auc/bigmodel/query"

    # Submit headers
    headers = {
        "Content-Type": "application/json",
        "X-Api-App-Key": APP_KEY,
        "X-Api-Access-Key": ACCESS_KEY,
        "X-Api-Resource-Id": "volc.bigasr.auc",
        "X-Api-Request-Id": task_id,
        "X-Api-Sequence": "-1",
    }

    # Request body
    payload = {"audio": {"url": audio_url}}

    try:
        # One session is reused for the submit call and every poll
        async with aiohttp.ClientSession() as session:
            # Submit the task
            async with session.post(submit_url, headers=headers, data=json.dumps(payload)) as response:
                status_code = response.headers.get("X-Api-Status-Code")
                log_id = response.headers.get("X-Tt-Logid", "")

                if status_code not in ["20000000", "20000001", "20000002"]:
                    message = response.headers.get("X-Api-Message", "unknown error")
                    return {"error": f"Failed to submit transcription task: {message}"}

            # Query headers
            query_headers = {
                "Content-Type": "application/json",
                "X-Api-App-Key": APP_KEY,
                "X-Api-Access-Key": ACCESS_KEY,
                "X-Api-Resource-Id": "volc.bigasr.auc",
                "X-Api-Request-Id": task_id,
                "X-Tt-Logid": log_id,
            }

            # Poll for the result
            max_retries = 10
            for _ in range(max_retries):
                await asyncio.sleep(1)  # wait 1 second

                async with session.post(query_url, headers=query_headers, data="{}") as response:
                    query_status = response.headers.get("X-Api-Status-Code")

                    if query_status == "20000000":  # transcription finished
                        result = await response.json()
                        text = result.get("result", {}).get("text", "")
                        return {"text": text}
                    elif query_status in ["20000001", "20000002"]:  # still processing
                        continue
                    else:
                        message = response.headers.get("X-Api-Message", "unknown error")
                        return {"error": f"Failed to query transcription result: {message}"}

        return {"error": "Transcription timed out; please query the result later"}
    except Exception as e:
        return {"error": f"Error during transcription: {str(e)}"}


if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)
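Conclusion
A compact audio transcription system can work without a database and needs only three core functions: file upload, URL generation, and ASR transcription. Integrating the Volcano Engine ASR service gives us speech-to-text quickly, without building a speech recognition model ourselves.
The minimal viable implementation in this article relies on the Volcano Engine ASR API and covers the whole workflow: file upload, URL generation, and the transcription call. It is quick to build and leaves room to add features step by step.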
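Further Extensions
Once the minimal viable implementation is in place, consider the following extensions:
- Store transcription history in a database
- Add user authentication and authorization
- Support real-time transcription
- Support multiple languages
- Speaker diarization
- Sentiment analysis integration
- Keyword extraction and topic identification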