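[Backend] Building a Simple Audio Transcription System with Volcano Engine ASR

Speech recognition has become an essential part of many applications. Whether for meeting minutes, voice assistants, or video subtitles, turning speech into text is key to better user experience and productivity. This article shows how to build a simple audio transcription system that covers file upload, cloud storage, and ASR (automatic speech recognition) integration, using the Volcano Engine ASR service.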

System Architecture Overview

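A simple audio transcription system needs the following key components:

Frontend: lets users upload audio files
API service layer: handles requests and business logic
Cloud storage: stores audio files securely
ASR service: transcribes audio into text (this article uses the Volcano Engine ASR service)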

The overall flow is:

User → upload audio → store in cloud storage → trigger ASR transcription → fetch the transcription result → return it to the user

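Technology Choices

The minimal implementation uses the following stack:

Backend framework: FastAPI (Python)
Cloud storage: S3-compatible object storage
ASR service: Volcano Engine ASR
Asynchronous processing: asyncio-based async request handling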
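Because the stack is built on asyncio, many transcription requests can run concurrently, but an ASR account usually has a limit on simultaneous jobs. The sketch below shows one way to cap how many calls are in flight with `asyncio.Semaphore`; `run_bounded` and its `limit` parameter are illustrative assumptions, not part of the original system, and the right limit depends on your ASR quota.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_bounded(
    items: List[T],
    worker: Callable[[T], Awaitable[R]],
    limit: int = 3,
) -> List[R]:
    """Run worker(item) for every item, with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item: T) -> R:
        # Each call waits here until one of the `limit` slots is free.
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the input items.
    return await asyncio.gather(*(guarded(item) for item in items))
```

For example, `await run_bounded(audio_urls, _call_asr_service, limit=3)` would transcribe a batch of URLs while never submitting more than three jobs at once.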

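Implementation Details

1. Audio Upload Flow

There are two main ways to upload audio:

1.1 Upload via presigned URL

This approach suits large files and reduces server load, because the client uploads straight to storage instead of through the API server: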

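import os
from datetime import datetime

async def create_upload_url(file_name, file_size, mime_type):
    """Create a presigned upload URL (storage_client: an S3-compatible client)."""
    # Reject unsupported types before handing out an upload URL
    if mime_type not in ALLOWED_AUDIO_TYPES:
        raise ValueError("Unsupported file type")
    # Generate a unique file name
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    random_suffix = os.urandom(4).hex()
    file_ext = os.path.splitext(file_name)[1]
    filename = f"{timestamp}_{random_suffix}{file_ext}"
    # Generate the storage path
    storage_path = f"audio/{filename}"
    # Get a presigned URL the client can PUT the file to
    upload_url = storage_client.generate_presigned_url(
        storage_path,
        expiry=300,  # valid for 5 minutes
        http_method="PUT",
        content_length=file_size,
    )
    return {"upload_url": upload_url, "storage_path": storage_path}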
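The timestamp-plus-random-suffix naming logic is repeated in several functions in this article. Below is a minimal sketch of pulling it into one helper; the name `make_storage_path` and its `prefix` parameter are assumptions for illustration. Using UTC timestamps, `secrets.token_hex`, and stripping directory components from the client-supplied name are design choices of this sketch rather than the original code.

```python
import os
import secrets
from datetime import datetime, timezone


def make_storage_path(file_name: str, prefix: str = "audio") -> str:
    """Build a collision-resistant object key like audio/20240101120000_1a2b3c4d.mp3."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    random_suffix = secrets.token_hex(4)  # 4 random bytes -> 8 hex characters
    # Only the extension of the client-supplied name is kept; the rest is untrusted.
    file_ext = os.path.splitext(os.path.basename(file_name))[1].lower()
    return f"{prefix}/{timestamp}_{random_suffix}{file_ext}"
```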

Frontend usage example:

// 1. Get an upload URL
const response = await fetch('/api/audio/upload-url', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    file_name: file.name,
    file_size: file.size,
    mime_type: file.type
  })
});
const { upload_url, storage_path } = await response.json();

// 2. Upload the file with the presigned URL
const uploadResponse = await fetch(upload_url, {
  method: 'PUT',
  body: file,
  headers: { 'Content-Type': file.type }
});
// fetch does not reject on HTTP errors, so check the status explicitly
if (!uploadResponse.ok) {
  throw new Error(`Upload failed: ${uploadResponse.status}`);
}

// 3. Trigger transcription
const transcriptResponse = await fetch('/api/audio/transcribe', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ storage_path })
});
const transcriptResult = await transcriptResponse.json();

1.2 Direct upload

This suits smaller files, which are uploaded through the API itself:

import os
from datetime import datetime, timedelta

async def upload_audio(file):
    """Upload an audio file directly through the API."""
    # Validate the file type
    if file.content_type not in ALLOWED_AUDIO_TYPES:
        raise ValueError("Unsupported file type")
    # Read the file contents
    contents = await file.read()
    if len(contents) == 0:
        raise ValueError("File is empty")
    # Generate a unique file name
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    random_suffix = os.urandom(4).hex()
    file_ext = os.path.splitext(file.filename)[1]
    filename = f"{timestamp}_{random_suffix}{file_ext}"
    # Storage path
    storage_path = f"audio/{filename}"
    # Upload to cloud storage
    storage_client.upload(storage_path, contents)
    # Generate a temporary access URL
    access_url = storage_client.generate_presigned_url(
        storage_path,
        expiry=3600,  # valid for 1 hour
        http_method="GET",
    )
    return {
        "file_name": file.filename,
        "storage_path": storage_path,
        "file_size": len(contents),
        "mime_type": file.content_type,
        "access_url": access_url,
        "url_expires_at": datetime.now() + timedelta(hours=1),
    }
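`file.content_type` is supplied by the client and can be wrong or spoofed. As an extra check, the first bytes of the file can be compared with well-known container signatures. The helper below is a simplified sketch covering only the formats in `ALLOWED_AUDIO_TYPES` (WAV, MP3, MP4/M4A); `sniff_audio_type` is a hypothetical name, and real files may use variants it does not recognize.

```python
from typing import Optional


def sniff_audio_type(head: bytes) -> Optional[str]:
    """Guess an audio MIME type from the first bytes of a file, or return None."""
    # WAV: RIFF container whose form type at offset 8 is "WAVE".
    if len(head) >= 12 and head[:4] == b"RIFF" and head[8:12] == b"WAVE":
        return "audio/wav"
    # MP3 that starts with an ID3v2 tag.
    if head[:3] == b"ID3":
        return "audio/mpeg"
    # Untagged MP3: MPEG audio frame sync (11 set bits).
    # Note: raw AAC (ADTS) streams share this sync pattern, so this is coarse.
    if len(head) >= 2 and head[0] == 0xFF and (head[1] & 0xE0) == 0xE0:
        return "audio/mpeg"
    # MP4 / M4A: ISO base media file with an "ftyp" box at offset 4.
    if len(head) >= 8 and head[4:8] == b"ftyp":
        return "audio/mp4"
    return None
```

In `upload_audio`, this could be applied to `contents[:16]`, rejecting the upload when it returns `None`.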

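2. ASR Transcription

The ASR service can be called in two ways: by storage path, or directly by URL.

2.1 Transcription by storage path

async def transcribe_audio_by_storage_path(storage_path):
    """Transcribe an audio file identified by its storage path."""
    # Generate a URL the ASR service can download from
    access_url = storage_client.generate_presigned_url(
        storage_path, expiry=3600, http_method="GET"
    )
    # Call the ASR service
    transcript_result = await _call_asr_service(access_url)
    return {
        "storage_path": storage_path,
        "transcript": transcript_result.get("text", ""),
        # _call_asr_service returns the segments under the "utterances" key
        "segments": transcript_result.get("utterances", []),
        "duration": transcript_result.get("duration"),
    }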

2.2 Transcription by URL

async def transcribe_audio_by_url(audio_url):
    """Transcribe audio from a URL the ASR service can access."""
    # Call the ASR service
    transcript_result = await _call_asr_service(audio_url)
    return {
        "audio_url": audio_url,
        "transcript": transcript_result.get("text", ""),
        "segments": transcript_result.get("utterances", []),
        "duration": transcript_result.get("duration"),
    }

2.3 Upload and transcribe immediately

async def upload_and_transcribe(file):
    """Upload an audio file and transcribe it right away."""
    # Upload the file
    upload_result = await upload_audio(file)
    # Transcribe the audio
    transcript_result = await _call_asr_service(upload_result["access_url"])
    # Combine the results
    return {
        "file_name": upload_result["file_name"],
        "storage_path": upload_result["storage_path"],
        "file_size": upload_result["file_size"],
        "mime_type": upload_result["mime_type"],
        "access_url": upload_result["access_url"],
        "transcript": transcript_result.get("text", ""),
        "segments": transcript_result.get("utterances", []),
        "duration": transcript_result.get("duration"),
    }

3. Calling the Volcano Engine ASR Service

The implementation below calls the Volcano Engine ASR service: it submits a transcription job, then polls for the result.

import asyncio
import uuid

import aiohttp

# Volcano Engine ASR endpoints (big model, recorded-file transcription)
SUBMIT_URL = "https://openspeech.bytedance.com/api/v3/auc/bigmodel/submit"
QUERY_URL = "https://openspeech.bytedance.com/api/v3/auc/bigmodel/query"
SUCCESS_CODE = "20000000"
PENDING_CODES = {"20000001", "20000002"}  # job still processing / queued

async def _call_asr_service(audio_url, poll_interval=1.0, max_polls=120):
    """Submit audio_url to Volcano Engine ASR and poll until the result is ready."""
    # The request ID identifies the task for both submit and query
    task_id = str(uuid.uuid4())
    base_headers = {
        "Content-Type": "application/json",
        "X-Api-App-Key": APP_KEY,
        "X-Api-Access-Key": ACCESS_KEY,
        "X-Api-Resource-Id": "volc.bigasr.auc",
        "X-Api-Request-Id": task_id,
    }
    # Minimal request body; check the official docs for further fields (e.g. audio format)
    payload = {"audio": {"url": audio_url}}

    # Reuse one HTTP session for the submit request and all queries
    async with aiohttp.ClientSession() as session:
        # Submit the transcription job
        submit_headers = {**base_headers, "X-Api-Sequence": "-1"}
        async with session.post(SUBMIT_URL, headers=submit_headers, json=payload) as response:
            if response.status != 200:
                error_detail = await response.text()
                raise ValueError(f"Failed to submit ASR job: {error_detail}")
            status_code = response.headers.get("X-Api-Status-Code")
            log_id = response.headers.get("X-Tt-Logid", "")
            if status_code not in {SUCCESS_CODE, *PENDING_CODES}:
                message = response.headers.get("X-Api-Message", "unknown error")
                raise ValueError(f"ASR job submission error: {message}")

        # Poll for the result (one sleep per iteration; tune for your audio lengths)
        query_headers = {**base_headers, "X-Tt-Logid": log_id}
        for _ in range(max_polls):
            await asyncio.sleep(poll_interval)
            async with session.post(QUERY_URL, headers=query_headers, json={}) as response:
                if response.status != 200:
                    continue  # transient HTTP error: try again on the next poll
                query_status_code = response.headers.get("X-Api-Status-Code")
                if query_status_code == SUCCESS_CODE:
                    # Finished: parse and return the result
                    try:
                        response_data = await response.json()
                    except (aiohttp.ContentTypeError, ValueError) as e:
                        raise ValueError(f"Failed to parse ASR response: {e}") from e
                    result = response_data.get("result", {})
                    return {
                        "text": result.get("text", ""),
                        "utterances": result.get("utterances", []),
                        # audio_info.duration (ms) as documented for the v3 API; verify against current docs
                        "duration": response_data.get("audio_info", {}).get("duration"),
                    }
                if query_status_code in PENDING_CODES:
                    continue  # still processing: wait and query again
                error_message = response.headers.get("X-Api-Message", "unknown error")
                raise ValueError(f"ASR job query failed: {error_message}")

    # Exceeded the maximum number of polls
    raise ValueError("ASR transcription timed out, please try again later")

4. API Design

The complete API, limited to the minimal feature set:

from fastapi import APIRouter, UploadFile

# audio_service: the module containing the service functions from sections 1 to 3.
# Mount the router under /api (app.include_router(router, prefix="/api"))
# so the paths match the frontend example.
router = APIRouter()

# 1. Get an upload URL
@router.post("/audio/upload-url")
async def create_upload_url(request: dict):
    return await audio_service.create_upload_url(
        request["file_name"], request["file_size"], request["mime_type"]
    )

# 2. Upload audio directly
@router.post("/audio/upload")
async def upload_audio(file: UploadFile):
    return await audio_service.upload_audio(file)

# 3. Transcribe audio (by storage path)
@router.post("/audio/transcribe")
async def transcribe_audio(request: dict):
    return await audio_service.transcribe_audio_by_storage_path(request["storage_path"])

# 4. Transcribe audio by URL
@router.post("/audio/transcribe-by-url")
async def transcribe_by_url(request: dict):
    return await audio_service.transcribe_audio_by_url(request["audio_url"])

# 5. Upload and transcribe audio
@router.post("/audio/upload-and-transcribe")
async def upload_and_transcribe(file: UploadFile):
    return await audio_service.upload_and_transcribe(file)

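Performance and Reliability

In production, the following points also need attention:

1. Large file handling

For large audio files:

Use multipart (chunked) uploads
Support resumable uploads
Enforce a file size limit
Use presigned URLs so files do not pass through the API server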
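For chunked uploads to S3-compatible storage, the file is split into numbered parts. Amazon S3 requires every part except the last to be at least 5 MiB and allows at most 10,000 parts per upload; other S3-compatible services may use different limits, so treat these constants as assumptions. The sketch below only computes the part layout (part number, byte offset, length); uploading each part would go through your storage SDK's multipart API. `plan_parts` is a hypothetical name.

```python
from typing import List, Tuple

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 minimum for every part except the last
MAX_PARTS = 10_000               # S3 maximum number of parts per upload


def plan_parts(file_size: int, part_size: int = 8 * 1024 * 1024) -> List[Tuple[int, int, int]]:
    """Return (part_number, offset, length) tuples that cover the whole file."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    part_size = max(part_size, MIN_PART_SIZE)
    # Grow the part size if the file would otherwise need too many parts.
    if -(-file_size // part_size) > MAX_PARTS:  # ceiling division
        part_size = -(-file_size // MAX_PARTS)
    parts = []
    offset, number = 0, 1  # S3 part numbers start at 1
    while offset < file_size:
        length = min(part_size, file_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts
```

Recording which part numbers have completed is what makes resumable uploads possible: after an interruption, only the missing parts need to be sent again.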

2. Error handling and retries

To make the system more robust:

Retry with exponential backoff
Add detailed logging
Set timeouts
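The polling loop in `_call_asr_service` waits a fixed interval between queries. A minimal sketch of exponential backoff with jitter and an overall deadline follows; `poll_with_backoff` and its parameters are illustrative assumptions. `check` is any coroutine function that returns the result once the job has finished, or `None` while it is still running.

```python
import asyncio
import random
import time
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


async def poll_with_backoff(
    check: Callable[[], Awaitable[Optional[T]]],
    initial_delay: float = 0.5,
    factor: float = 2.0,
    max_delay: float = 8.0,
    timeout: float = 300.0,
) -> T:
    """Call check() until it returns a non-None value or the timeout expires."""
    deadline = time.monotonic() + timeout
    delay = initial_delay
    while True:
        result = await check()
        if result is not None:
            return result
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("job did not finish before the deadline")
        # "Full jitter": sleep a random time up to the current delay so that
        # many clients do not poll in lockstep; never sleep past the deadline.
        await asyncio.sleep(min(random.uniform(0, delay), remaining))
        delay = min(delay * factor, max_delay)
```

Compared with a fixed interval, backoff keeps short jobs responsive while sending far fewer queries for long recordings.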

3. Security

To protect user data:

Enforce access control
Give audio URLs a short expiry time
Add a cleanup mechanism for temporary files
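To make the idea of a short-lived URL concrete, here is a simplified illustration that signs the object path and expiry time with HMAC-SHA256. This is not the AWS Signature Version 4 scheme used by real S3 presigned URLs; in production, use your storage SDK's presigning. `sign_url`, `verify_url`, and `SECRET_KEY` are assumptions of this sketch, and the host is the placeholder from the mock storage client in the complete example.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical server-side secret
BASE_URL = "https://storage-example.com"    # placeholder host


def _signature(path: str, expires: int) -> str:
    message = f"{path}:{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_url(path: str, ttl_seconds: int = 300, now: Optional[float] = None) -> str:
    """Return a URL for `path` that is valid for ttl_seconds."""
    expires = int((time.time() if now is None else now) + ttl_seconds)
    query = urlencode({"expires": expires, "sig": _signature(path, expires)})
    return f"{BASE_URL}/{path}?{query}"


def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Check the signature and expiry of a URL produced by sign_url."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        sig = params["sig"][0]
    except (KeyError, ValueError, IndexError):
        return False
    if (time.time() if now is None else now) > expires:
        return False  # link has expired
    path = parsed.path.lstrip("/")
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(sig, _signature(path, expires))
```

Because the expiry is part of the signed message, a client cannot extend a link's lifetime by editing the `expires` parameter.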

Complete Example: A Minimal Viable Implementation

Below is a minimal viable implementation built with FastAPI on top of Volcano Engine ASR:

import os
import uuid
import asyncio
from datetime import datetime, timedelta

import aiohttp
from fastapi import APIRouter, FastAPI, File, HTTPException, UploadFile
from pydantic import BaseModel

app = FastAPI()
router = APIRouter(prefix="/api")  # matches the /api/... paths used by the frontend example

# Configuration
ALLOWED_AUDIO_TYPES = ["audio/mpeg", "audio/wav", "audio/mp4", "audio/x-m4a"]
APP_KEY = os.getenv("VOLCANO_ASR_APP_ID")
ACCESS_KEY = os.getenv("VOLCANO_ASR_ACCESS_TOKEN")
SUBMIT_URL = "https://openspeech.bytedance.com/api/v3/auc/bigmodel/submit"
QUERY_URL = "https://openspeech.bytedance.com/api/v3/auc/bigmodel/query"


# Simple mock storage client
class SimpleStorageClient:
    def upload(self, path, content):
        # In a real project, connect to S3, OSS, or another cloud storage service
        print(f"Uploading {len(content)} bytes to {path}")
        return True

    def generate_presigned_url(self, path, expiry=3600, http_method="GET", **kwargs):
        # Simplified: a real client must return a signed URL the ASR service can download
        return f"https://storage-example.com/{path}?expires={expiry}&method={http_method}"


storage_client = SimpleStorageClient()


# Request bodies: the frontend sends JSON, so use Pydantic models instead of query parameters
class UploadUrlRequest(BaseModel):
    file_name: str
    file_size: int
    mime_type: str


class TranscribeRequest(BaseModel):
    storage_path: str


class TranscribeByUrlRequest(BaseModel):
    audio_url: str


def _make_storage_path(file_name):
    """Unique object key: timestamp + random suffix + original extension."""
    timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
    random_suffix = os.urandom(4).hex()
    file_ext = os.path.splitext(file_name or "")[1]
    return f"audio/{timestamp}_{random_suffix}{file_ext}"


# API endpoints
@router.post("/audio/upload-url")
async def create_upload_url(req: UploadUrlRequest):
    """Get a presigned upload URL."""
    if req.mime_type not in ALLOWED_AUDIO_TYPES:
        raise HTTPException(status_code=400, detail="Unsupported file type")
    storage_path = _make_storage_path(req.file_name)
    upload_url = storage_client.generate_presigned_url(
        storage_path, expiry=300, http_method="PUT", content_length=req.file_size
    )
    return {"upload_url": upload_url, "storage_path": storage_path}


@router.post("/audio/upload")
async def upload_audio(file: UploadFile = File(...)):
    """Upload an audio file directly."""
    if file.content_type not in ALLOWED_AUDIO_TYPES:
        raise HTTPException(status_code=400, detail="Unsupported file type")
    contents = await file.read()
    if len(contents) == 0:
        raise HTTPException(status_code=400, detail="File is empty")
    storage_path = _make_storage_path(file.filename)
    storage_client.upload(storage_path, contents)
    access_url = storage_client.generate_presigned_url(storage_path, expiry=3600, http_method="GET")
    return {
        "file_name": file.filename,
        "storage_path": storage_path,
        "file_size": len(contents),
        "mime_type": file.content_type,
        "access_url": access_url,
        "url_expires_at": (datetime.now() + timedelta(hours=1)).isoformat(),
    }


@router.post("/audio/transcribe")
async def transcribe_audio(req: TranscribeRequest):
    """Transcribe audio by storage path."""
    access_url = storage_client.generate_presigned_url(req.storage_path, expiry=3600, http_method="GET")
    return {"storage_path": req.storage_path, "transcript": await _call_volcano_asr(access_url)}


@router.post("/audio/transcribe-by-url")
async def transcribe_by_url(req: TranscribeByUrlRequest):
    """Transcribe audio by URL."""
    return {"audio_url": req.audio_url, "transcript": await _call_volcano_asr(req.audio_url)}


@router.post("/audio/upload-and-transcribe")
async def upload_and_transcribe(file: UploadFile = File(...)):
    """Upload an audio file and transcribe it."""
    upload_result = await upload_audio(file)  # raises HTTPException on invalid input
    transcript = await _call_volcano_asr(upload_result["access_url"])
    return {**upload_result, "transcript": transcript}


async def _call_volcano_asr(audio_url, poll_interval=1.0, max_polls=120):
    """Submit a job to Volcano Engine ASR and poll until it finishes."""
    if not APP_KEY or not ACCESS_KEY:
        raise HTTPException(status_code=500, detail="Volcano Engine ASR is not configured; set the environment variables")
    headers = {
        "Content-Type": "application/json",
        "X-Api-App-Key": APP_KEY,
        "X-Api-Access-Key": ACCESS_KEY,
        "X-Api-Resource-Id": "volc.bigasr.auc",
        "X-Api-Request-Id": str(uuid.uuid4()),  # task ID shared by submit and query
    }
    payload = {"audio": {"url": audio_url}}
    try:
        async with aiohttp.ClientSession() as session:
            # Submit the job
            async with session.post(SUBMIT_URL, headers={**headers, "X-Api-Sequence": "-1"}, json=payload) as response:
                status = response.headers.get("X-Api-Status-Code")
                log_id = response.headers.get("X-Tt-Logid", "")
                if status not in ("20000000", "20000001", "20000002"):
                    message = response.headers.get("X-Api-Message", "unknown error")
                    raise HTTPException(status_code=502, detail=f"Failed to submit transcription job: {message}")
            # Poll for the result, reusing the same session
            query_headers = {**headers, "X-Tt-Logid": log_id}
            for _ in range(max_polls):
                await asyncio.sleep(poll_interval)
                async with session.post(QUERY_URL, headers=query_headers, json={}) as response:
                    status = response.headers.get("X-Api-Status-Code")
                    if status == "20000000":  # transcription finished
                        result = await response.json()
                        return {"text": result.get("result", {}).get("text", "")}
                    if status in ("20000001", "20000002"):  # still processing
                        continue
                    message = response.headers.get("X-Api-Message", "unknown error")
                    raise HTTPException(status_code=502, detail=f"Failed to query transcription result: {message}")
    except aiohttp.ClientError as e:
        raise HTTPException(status_code=502, detail=f"Error during transcription: {e}") from e
    raise HTTPException(status_code=504, detail="Transcription timed out, please query the result later")


# Register the routes after they are defined
app.include_router(router)

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=8000)