[EVI] A First Look at Hume AI

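Foreword

Hume AI has announced that it raised $50 million in a Series B funding round. The company was founded by Alan Cowen, a former Google DeepMind researcher, who also serves as CEO. Its AI model focuses on understanding human emotion, and the company has released a demo of its "Empathic Voice Interface" (EVI), which interacts through voice conversation. According to Hume AI's website, EVI can recognize and respond to 53 different emotions. This ability to discern emotion from the voice draws on comprehensive research that includes controlled experimental data from hundreds of thousands of people around the world. It is this detailed analysis of vocal and facial expressions across cultures that underpins EVI's emotion recognition.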

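After hearing the news, I skimmed the Hume AI documentation. Integration works much like integrating GPT did: everything goes through network requests
… in short, it is hard to sum up in a few words.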

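Introduction to Hume AI

Hume AI can be integrated into any application or research project that involves human data: audio, video, images, or text. Its models are accessed through an API and measure more than 50 dimensions of emotional expression in subtle facial and vocal behavior. They capture nuanced facial expressions such as boredom and desire, vocal expressions such as sighs and laughs, the ongoing emotional tone of speech, the emotion conveyed in text, and moment-to-moment multimodal estimates of emotional experience.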

EVI

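EVI (Empathic Voice Interface) is Hume's interface that understands and emulates tone of voice, word emphasis, and more, in order to optimize interaction between humans and AI.
In short, it is a voice AI assistant.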

Demo

A voice AI with the capacity for empathy.
Official online demo: https://demo.hume.ai


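Quick Start

This subsection is translated from the quickstart on the official website.

Getting an API key

Hume AI uses a pay-as-you-go pricing model.

To establish an authenticated connection, first instantiate the Hume client with your API key and client secret. Both can be obtained by logging in to the portal and visiting the API keys page.

In the sample code below, the API key and client secret are stored in environment variables. Avoid hard-coding these values in your project so that they are not leaked.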

import { Hume, HumeClient } from 'hume';

// instantiate the Hume client and authenticate
const client = new HumeClient({
  apiKey: import.meta.env.HUME_API_KEY,
  clientSecret: import.meta.env.HUME_CLIENT_SECRET,
});

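With the TypeScript SDK, once the Hume client has been instantiated with your credentials, the access token needed to establish an authenticated connection to EVI is fetched and applied in the background.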
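The advice about keeping credentials out of the source can be paired with a fail-fast check, so that a missing variable surfaces before any connection attempt. The following is a minimal sketch, not part of the Hume SDK: `requireEnv` and `loadHumeCredentials` are hypothetical helper names, and only the two variable names come from the sample above.

```typescript
// Hypothetical helpers (not part of the Hume SDK): read required settings
// from a plain key/value record and fail fast when one is missing.
type Env = Record<string, string | undefined>;

interface HumeCredentials {
  apiKey: string;
  clientSecret: string;
}

// Return the value stored under `name`, or throw if it is absent or blank.
function requireEnv(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Collect both credentials used by the sample above in one place.
function loadHumeCredentials(env: Env): HumeCredentials {
  return {
    apiKey: requireEnv(env, 'HUME_API_KEY'),
    clientSecret: requireEnv(env, 'HUME_CLIENT_SECRET'),
  };
}
```

The resulting object could then be passed to the client constructor in place of the two direct environment lookups. Which record to pass in (for example the bundler's environment object or the server process environment) depends on the runtime.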

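Connecting

With the Hume client instantiated using your credentials, we can now establish an authenticated WebSocket connection to EVI and define our WebSocket event handlers. For now we include placeholder event handlers, to be updated in later steps.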

import { Hume, HumeClient } from 'hume';

// instantiate the Hume client and authenticate
const client = new HumeClient({
  apiKey: import.meta.env.HUME_API_KEY,
  clientSecret: import.meta.env.HUME_CLIENT_SECRET,
});

// instantiates WebSocket and establishes an authenticated connection
const socket = await client.empathicVoice.chat.connect({
  onOpen: () => {
    console.log('WebSocket connection opened');
  },
  onMessage: (message) => {
    console.log(message);
  },
  onError: (error) => {
    console.error(error);
  },
  onClose: () => {
    console.log('WebSocket connection closed');
  }
});

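Sending audio

Capturing audio and sending it through the socket as audio input takes several steps:

Handle the user's permission to access the microphone.
Capture audio with the Media Stream API and record the captured audio with the MediaRecorder API.
Base64-encode each recorded audio Blob.
Send the encoded audio over the WebSocket with the sendAudioInput method.

Accepted audio formats include: mp3, wav, aac, ogg, flac, webm, avr, cdda, cvs/vms, mp2, mp4, ac3, avi, wmv, mpeg, ircam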

import {
  Hume,
  convertBlobToBase64,
  ensureSingleValidAudioTrack,
  getAudioStream,
} from 'hume';

// NOTE: `socket` is the connection opened in the previous step, and
// `mimeType` is defined in the playback snippet further down.

// the recorder responsible for recording the audio stream to be prepared as the audio input
let recorder: MediaRecorder | null = null;
// the stream of audio captured from the user's microphone
let audioStream: MediaStream | null = null;

// define function for capturing audio
async function captureAudio(): Promise<void> {
  // prompts user for permission to capture audio, obtains media stream upon approval
  audioStream = await getAudioStream();
  // ensure there is only one audio track in the stream
  ensureSingleValidAudioTrack(audioStream);
  // instantiate the media recorder
  recorder = new MediaRecorder(audioStream, { mimeType });
  // callback for when recorded chunk is available to be processed
  recorder.ondataavailable = async ({ data }) => {
    // IF size of data is smaller than 1 byte then do nothing
    if (data.size < 1) return;
    // base64 encode audio data
    const encodedAudioData = await convertBlobToBase64(data);
    // define the audio_input message JSON
    const audioInput: Omit<Hume.empathicVoice.AudioInput, 'type'> = {
      data: encodedAudioData,
    };
    // send audio_input message
    socket?.sendAudioInput(audioInput);
  };
  // capture audio input at a rate of 100ms (recommended)
  const timeSlice = 100;
  recorder.start(timeSlice);
}

// define a WebSocket open event handler to capture audio
async function handleWebSocketOpenEvent(): Promise<void> {
  // place logic here which you would like invoked when the socket opens
  console.log('Web socket connection opened');
  await captureAudio();
}
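To make the encode-and-send steps concrete without a browser or the SDK, here is a dependency-free sketch of what happens to one recorded chunk. It is illustrative only: `bytesToBase64` and `buildAudioInput` are hypothetical names, and the `type: 'audio_input'` field is an assumption inferred from the "audio_input message" comment in the snippet above, where the SDK fills that field in itself.

```typescript
const BASE64_ALPHABET =
  'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';

// Encode raw bytes as standard base64 with '=' padding.
function bytesToBase64(bytes: Uint8Array): string {
  let out = '';
  for (let i = 0; i < bytes.length; i += 3) {
    const b0 = bytes[i] ?? 0;
    const b1 = bytes[i + 1] ?? 0; // 0 when past the end of the input
    const b2 = bytes[i + 2] ?? 0;
    const triple = (b0 << 16) | (b1 << 8) | b2;
    out += BASE64_ALPHABET.charAt((triple >> 18) & 63);
    out += BASE64_ALPHABET.charAt((triple >> 12) & 63);
    out += i + 1 < bytes.length ? BASE64_ALPHABET.charAt((triple >> 6) & 63) : '=';
    out += i + 2 < bytes.length ? BASE64_ALPHABET.charAt(triple & 63) : '=';
  }
  return out;
}

// Assumed wire shape of one audio chunk (the `type` value is an assumption).
interface AudioInputMessage {
  type: 'audio_input';
  data: string;
}

// Build the message for one chunk, skipping empty chunks like the
// `data.size < 1` check in the recorder callback.
function buildAudioInput(chunk: Uint8Array): AudioInputMessage | null {
  if (chunk.length < 1) return null;
  return { type: 'audio_input', data: bytesToBase64(chunk) };
}
```

In the browser, the SDK's convertBlobToBase64 and sendAudioInput take care of both steps, so a hand-written encoder like this is only useful for understanding or testing the payload.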

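Responses

A response consists of several messages, detailed below:

user_message: Contains the transcription of the audio input, together with expression measurement predictions for the speaker's vocal prosody.
assistant_message: One AssistantMessage is sent for each sentence of the response. It carries the content of the response as well as predictions of the expressive qualities of the generated audio response.
audio_output: Each AssistantMessage is accompanied by an AudioOutput message, which contains the actual audio (binary) response corresponding to that AssistantMessage.
assistant_end: Marks the end of the response to the audio input. The AssistantEnd message is delivered as the final piece of the exchange.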
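The four message kinds listed above lend themselves to a discriminated union keyed on `type`. The sketch below uses simplified shapes that are assumptions for illustration: the field names `transcript`, `text`, and `prosodyScores` are hypothetical, and the real SDK types carry more (and differently named) fields.

```typescript
// Assumed, simplified shapes; field names other than `type` and `data`
// are hypothetical.
interface EmotionScores {
  [emotion: string]: number;
}

type EviMessage =
  | { type: 'user_message'; transcript: string; prosodyScores: EmotionScores }
  | { type: 'assistant_message'; text: string; prosodyScores: EmotionScores }
  | { type: 'audio_output'; data: string }
  | { type: 'assistant_end' };

// Names of the n highest-scoring emotions, strongest first.
function topEmotions(scores: EmotionScores, n: number): string[] {
  return Object.keys(scores)
    .sort((a, b) => (scores[b] ?? 0) - (scores[a] ?? 0))
    .slice(0, n);
}

// One readable line per message; the switch covers every message type,
// so the compiler can check that none is forgotten.
function describeMessage(message: EviMessage): string {
  switch (message.type) {
    case 'user_message':
      return `user said "${message.transcript}" (${topEmotions(message.prosodyScores, 2).join(', ')})`;
    case 'assistant_message':
      return `assistant says "${message.text}"`;
    case 'audio_output':
      return `audio chunk, ${message.data.length} base64 characters`;
    case 'assistant_end':
      return 'assistant finished responding';
  }
}
```

Switching on `type` this way narrows each branch to the fields that message actually has, which is the same pattern the message handler later in this section relies on.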

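Here we focus on playing the received audio output. To do so, we need logic that converts the received audio data into a Blob and creates an HTMLAudioElement to play it. We then update the client's WebSocket message event handler so that it invokes this playback logic whenever audio output arrives. To manage playback of incoming audio, we implement a queue and play the clips in order.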

import {
  Hume,
  MimeType,
  convertBase64ToBlob,
  getBrowserSupportedMimeType,
} from 'hume';

// audio playback queue
const audioQueue: Blob[] = [];
// flag which denotes whether audio is currently playing or not
let isPlaying = false;
// the current audio element to be played
let currentAudio: HTMLAudioElement | null = null;
// mime type supported by the browser the application is running in
const mimeType: MimeType = (() => {
  const result = getBrowserSupportedMimeType();
  return result.success ? result.mimeType : MimeType.WEBM;
})();

// play the audio within the playback queue, converting each Blob into playable HTMLAudioElements
function playAudio(): void {
  // IF there is nothing in the audioQueue OR audio is currently playing then do nothing
  if (!audioQueue.length || isPlaying) return;
  // pull next audio output from the queue
  const audioBlob = audioQueue.shift();
  // IF audioBlob is unexpectedly undefined then do nothing
  if (!audioBlob) return;
  // update isPlaying state (only once there is something to play)
  isPlaying = true;
  // converts Blob to AudioElement for playback
  const audioUrl = URL.createObjectURL(audioBlob);
  currentAudio = new Audio(audioUrl);
  // play audio
  currentAudio.play();
  // callback for when audio finishes playing
  currentAudio.onended = () => {
    // update isPlaying state
    isPlaying = false;
    // attempt to pull next audio output from queue
    if (audioQueue.length) playAudio();
  };
}

// define a WebSocket message event handler to play audio output
function handleWebSocketMessageEvent(
  message: Hume.empathicVoice.SubscribeEvent
): void {
  // place logic here which you would like to invoke when receiving a message through the socket
  switch (message.type) {
    // add received audio to the playback queue, and play next audio output
    case 'audio_output': {
      // convert base64 encoded audio to a Blob
      const audioOutput = message.data;
      const blob = convertBase64ToBlob(audioOutput, mimeType);
      // add audio Blob to audioQueue
      audioQueue.push(blob);
      // play the next audio output
      if (audioQueue.length === 1) playAudio();
      break;
    }
  }
}
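The queue-and-play-in-order idea can also be separated from the browser's audio element, which makes the ordering logic easy to test. The sketch below is one possible arrangement, not the quickstart's code: `PlaybackQueue` and `Player` are hypothetical names, and the actual playback is injected as a callback.

```typescript
// Starts playback of one item and calls onEnded when that item finishes.
type Player<T> = (item: T, onEnded: () => void) => void;

class PlaybackQueue<T> {
  private readonly pending: T[] = [];
  private playing = false;
  private readonly player: Player<T>;

  constructor(player: Player<T>) {
    this.player = player;
  }

  // Add an item; playback starts immediately if nothing is playing.
  enqueue(item: T): void {
    this.pending.push(item);
    this.playNext();
  }

  // Number of items still waiting (excludes the one currently playing).
  pendingCount(): number {
    return this.pending.length;
  }

  // Start the next item unless one is already playing or the queue is empty.
  private playNext(): void {
    if (this.playing) return;
    const next = this.pending.shift();
    if (next === undefined) return;
    this.playing = true;
    this.player(next, () => {
      this.playing = false;
      this.playNext();
    });
  }
}
```

In the browser, the injected player would create an audio element from the Blob, set its onended handler to the supplied callback, and call play(). In a test it can simply record the item and let the test decide when playback "ends".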

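Interruption

Interruptibility is a key feature of the Empathic Voice Interface. If audio input is sent over the WebSocket while response messages for a previous audio input are still being received, the response to the previous input stops. The interface then sends back a user_interruption message and begins responding to the new audio input.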

// function for stopping the audio and clearing the queue
function stopAudio(): void {
  // stop the audio playback
  currentAudio?.pause();
  currentAudio = null;
  // update audio playback state
  isPlaying = false;
  // clear the audioQueue
  audioQueue.length = 0;
}

// update WebSocket message event handler to handle interruption
function handleWebSocketMessageEvent(
  message: Hume.empathicVoice.SubscribeEvent
): void {
  // place logic here which you would like to invoke when receiving a message through the socket
  switch (message.type) {
    // add received audio to the playback queue, and play next audio output
    case 'audio_output': {
      // convert base64 encoded audio to a Blob
      const audioOutput = message.data;
      const blob = convertBase64ToBlob(audioOutput, mimeType);
      // add audio Blob to audioQueue
      audioQueue.push(blob);
      // play the next audio output
      if (audioQueue.length === 1) playAudio();
      break;
    }
    // stop audio playback, clear audio playback queue, and update audio playback state on interrupt
    case 'user_interruption':
      stopAudio();
      break;
  }
}
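The effect of an interruption on queued audio can be checked in isolation by folding a list of events into the chunks that would still be waiting to play. This is a simplified model under stated assumptions: `PlaybackEvent` and `pendingAfter` are hypothetical names, and the clip that is currently playing (which stopAudio pauses) is not modeled.

```typescript
// Only the two event kinds that affect the playback queue are modeled.
type PlaybackEvent =
  | { type: 'audio_output'; data: string }
  | { type: 'user_interruption' };

// Fold an ordered list of events into the chunks still waiting to be played.
function pendingAfter(events: PlaybackEvent[]): string[] {
  let queue: string[] = [];
  for (const event of events) {
    switch (event.type) {
      case 'audio_output':
        // a new chunk of the current response is queued
        queue.push(event.data);
        break;
      case 'user_interruption':
        // everything queued for the interrupted response is dropped
        queue = [];
        break;
    }
  }
  return queue;
}
```

For instance, two chunks followed by an interruption and then one more chunk leave only that last chunk pending, since it belongs to the response to the new input.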

API Reference

Official link: API Reference

Request URL:
https://api.hume.ai/v0/evi/tools?page_number=0&page_size=2

Sample code:

curl -G https://api.hume.ai/v0/evi/tools \
  -H "X-Hume-Api-Key: " \
  -d page_number=0 \
  -d page_size=2

TypeScript example:

// List tools (GET /tools)
const response = await fetch("https://api.hume.ai/v0/evi/tools?page_number=0&page_size=2", {
  method: "GET",
  headers: {
    "X-Hume-Api-Key": ""
  },
});
const body = await response.json();
console.log(body);
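When the page number and size vary, the query string is easier to build from values than to write by hand. A small sketch, assuming only the endpoint and the two parameter names shown above: `buildListToolsUrl` and `PageOptions` are hypothetical names, and the range checks (non-negative page number, page size of at least 1) are assumptions rather than documented limits.

```typescript
const HUME_EVI_BASE = 'https://api.hume.ai/v0/evi';

interface PageOptions {
  pageNumber: number;
  pageSize: number;
}

// Build the "list tools" URL for a given page, validating the inputs first.
function buildListToolsUrl({ pageNumber, pageSize }: PageOptions): string {
  if (!Number.isInteger(pageNumber) || pageNumber < 0) {
    throw new RangeError('pageNumber must be a non-negative integer');
  }
  if (!Number.isInteger(pageSize) || pageSize < 1) {
    throw new RangeError('pageSize must be a positive integer');
  }
  const params: Array<[string, string]> = [
    ['page_number', String(pageNumber)],
    ['page_size', String(pageSize)],
  ];
  const query = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join('&');
  return `${HUME_EVI_BASE}/tools?${query}`;
}
```

The returned string can be passed to fetch in place of the hard-coded URL, with the same X-Hume-Api-Key header as in the example above.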

Python example:

import requests

# List tools (GET /tools)
response = requests.get(
    "https://api.hume.ai/v0/evi/tools?page_number=0&page_size=2",
    headers={"X-Hume-Api-Key": ""},
)
print(response.json())