Azure Speech Service (2): A Guide to Speech-to-Text

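In the previous chapter we briefly introduced the Azure Speech service and the tools it provides: the Azure Speech CLI, the Azure Speech SDK (available for several programming languages), the Speech Devices SDK, Speech Studio, and the REST API. The Speech service also covers several scenarios. In this chapter we use a worked example to describe the essentials of developing speech-to-text with the Azure Speech service. The source code for this section can be found at: azure-demo/dotnet/cognitive-service/SpeechService/SpeechToText at main · hylinux/azure-demo (github.com)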

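Quickstart: speech-to-text with the .NET SDK

As discussed earlier, the Azure Speech service provides the Azure Speech CLI as well as SDKs for various languages, and each tool suits particular scenarios. If you need more customization and your team has enough coding ability, the SDK is the right choice. Below we use the .NET 5 SDK to work through this guide.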

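Create a new project

Use the following commands to create a new project:

The commands above create the project in the SpeechToText directory, then change into that directory and add the Speech package. Once that is done, open the project in an editor or IDE and add the following using directives to the source file: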

using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

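The SpeechConfig object

Let's first look at the SpeechConfig object. Every Speech service feature needs this configuration object: whether you are building speech recognition, speech synthesis, or speech translation, you must first create a SpeechConfig. Creating it is simple: you only need to pass in the subscription key and the region, that is, the region where your Speech resource is located.

Create the SpeechConfig object in the Main method.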

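async static Task Main(string[] args)
{
    // Replace the placeholders with your own subscription key and region.
    var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
}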
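Hard-coding the key in source makes it easy to leak. The following is a minimal sketch of one alternative, assuming the key and region are stored in environment variables. The variable names SPEECH_KEY and SPEECH_REGION and the helper name CreateSpeechConfig are hypothetical and chosen only for this example.

```csharp
using System;
using Microsoft.CognitiveServices.Speech;

static class SpeechConfigFactory
{
    // SPEECH_KEY and SPEECH_REGION are assumed names, not something the SDK requires.
    public static SpeechConfig CreateSpeechConfig()
    {
        string key = Environment.GetEnvironmentVariable("SPEECH_KEY");
        string region = Environment.GetEnvironmentVariable("SPEECH_REGION");

        if (string.IsNullOrWhiteSpace(key) || string.IsNullOrWhiteSpace(region))
        {
            throw new InvalidOperationException(
                "Set SPEECH_KEY and SPEECH_REGION before running.");
        }

        return SpeechConfig.FromSubscription(key, region);
    }
}
```

Main could then call SpeechConfigFactory.CreateSpeechConfig() instead of passing string literals.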

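Recognize text from a microphone

Speech recognition needs an AudioConfig in addition to the SpeechConfig object. In this example we want to recognize from the microphone, so we can use the following code: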

async static Task FromMic(SpeechConfig speechConfig)
{
    using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
    using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

    Console.WriteLine("Please speak into the microphone:");
    var result = await recognizer.RecognizeOnceAsync();
    Console.WriteLine($"Recognized: Text={result.Text}");
}

After defining this method, we update the Main method:

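async static Task Main(string[] args)
{
    // Replace the placeholders with your own subscription key and region.
    var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    await FromMic(speechConfig);
}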

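As the method above shows, to recognize from the microphone we call the FromDefaultMicrophoneInput() method of the AudioConfig class, which reads from the default microphone.

After making the change, run the project. The result looks like this: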

Recognize other languages

You will notice that only English is recognized. What if we need to recognize another language, for example Chinese?

speechConfig.SpeechRecognitionLanguage = "zh-CN";

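So once again we set a property on the SpeechConfig object. For the available language codes, see the documentation: Language support for speech-to-text.

We then change the Main method as follows: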

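async static Task Main(string[] args)
{
    // Replace the placeholders with your own subscription key and region.
    var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    speechConfig.SpeechRecognitionLanguage = "zh-CN";
    await FromMic(speechConfig);
}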

This time Chinese is recognized, as shown below:

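Recognize from a file

To test this step, we need an audio file in WAV format. Note that by default the Speech service only supports 16 kHz or 8 kHz, 16-bit, mono PCM files. To create such a file, you can use the tool from the previous chapter to save one, for example:

This gives us a file in the required format. For details on usage, see the document Speech Cli Intro.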
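Before sending a file to the service, it can help to confirm that it really is in the default format. The following is a rough sketch that inspects only the header, assuming a standard RIFF/WAVE layout and a seekable stream. The class and method names are hypothetical, and this check is not part of the Speech SDK.

```csharp
using System;
using System.IO;
using System.Text;

static class WavFormatCheck
{
    // Returns true if the header describes 8 kHz or 16 kHz, 16-bit, mono PCM.
    public static bool IsSupportedByDefault(Stream wav)
    {
        using var reader = new BinaryReader(wav, Encoding.ASCII, leaveOpen: true);
        try
        {
            if (ReadTag(reader) != "RIFF") return false;
            reader.ReadUInt32();                        // overall RIFF size, unused
            if (ReadTag(reader) != "WAVE") return false;

            // Walk the chunks until the "fmt " chunk is found.
            while (wav.Position + 8 <= wav.Length)
            {
                string chunkId = ReadTag(reader);
                uint chunkSize = reader.ReadUInt32();

                if (chunkId == "fmt ")
                {
                    ushort audioFormat = reader.ReadUInt16();   // 1 means PCM
                    ushort channels = reader.ReadUInt16();
                    uint sampleRate = reader.ReadUInt32();
                    reader.ReadUInt32();                        // byte rate, unused
                    reader.ReadUInt16();                        // block align, unused
                    ushort bitsPerSample = reader.ReadUInt16();

                    return audioFormat == 1
                        && channels == 1
                        && (sampleRate == 16000 || sampleRate == 8000)
                        && bitsPerSample == 16;
                }

                // Skip this chunk; chunks are padded to an even number of bytes.
                wav.Seek(chunkSize + (chunkSize & 1), SeekOrigin.Current);
            }
        }
        catch (EndOfStreamException)
        {
            // A truncated header is treated as unsupported.
        }
        return false;
    }

    private static string ReadTag(BinaryReader reader) =>
        Encoding.ASCII.GetString(reader.ReadBytes(4));
}
```

A call such as WavFormatCheck.IsSupportedByDefault(File.OpenRead("my-sample.wav")) would then tell you whether the file needs converting first.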

Now that we have a file for testing, let's define the following method:

async static Task FromFile(SpeechConfig speechConfig)
{
    using var audioConfig = AudioConfig.FromWavFileInput("my-sample.wav");
    using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

    var result = await recognizer.RecognizeOnceAsync();
    Console.WriteLine($"Recognized: Text={result.Text}");
}

To read from a WAV file, all you need is the AudioConfig.FromWavFileInput() method.

Change the Main method as follows:

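async static Task Main(string[] args)
{
    // Replace the placeholders with your own subscription key and region.
    var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "YourServiceRegion");
    speechConfig.SpeechRecognitionLanguage = "zh-CN";
    //await FromMic(speechConfig);
    await FromFile(speechConfig);
}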

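Then run the project to see the result.

Recognize input from an in-memory stream

If you already have audio loaded into memory as a byte[] array, you can feed it through the PushAudioInputStream class and have the Speech service recognize it. To keep the demonstration simple, this example uses a BinaryReader to read the audio file into memory and then recognizes it: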

async static Task FromStream(SpeechConfig speechConfig)
{
    using var reader = new BinaryReader(File.OpenRead("my-sample.wav"));
    using var audioInputStream = AudioInputStream.CreatePushStream();
    using var audioConfig = AudioConfig.FromStreamInput(audioInputStream);
    using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

    // Push the file into the stream in 1 KB chunks until nothing is left.
    byte[] readBytes;
    do
    {
        readBytes = reader.ReadBytes(1024);
        audioInputStream.Write(readBytes, readBytes.Length);
    } while (readBytes.Length > 0);

    var result = await recognizer.RecognizeOnceAsync();
    Console.WriteLine($"Recognized: Text={result.Text}");
}

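Note that a push stream created with CreatePushStream() by default expects 16 kHz or 8 kHz, 16-bit, mono PCM audio. If your audio is in a different PCM format, you can describe it with the AudioStreamFormat.GetWaveFormatPCM() method and pass the resulting AudioStreamFormat object to CreatePushStream(), after which the data can be read.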
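As a minimal sketch of passing an explicit format to the push stream, the helper below assumes raw 8 kHz, 16-bit, mono PCM input. The values are examples only, and the class and method names are hypothetical.

```csharp
using Microsoft.CognitiveServices.Speech.Audio;

static class PushStreamFactory
{
    // Example values: 8000 samples per second, 16 bits per sample, 1 channel.
    // Adjust them to match the raw PCM data you actually push.
    public static PushAudioInputStream CreateFor8kHzMono()
    {
        var format = AudioStreamFormat.GetWaveFormatPCM(8000, 16, 1);
        return AudioInputStream.CreatePushStream(format);
    }
}
```

The returned stream can be used in place of the parameterless CreatePushStream() call in FromStream.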

After that, we only need to change the Main function:

async static Task Main(string[] args)
{
    // Replace the key placeholder with your own subscription key; never publish a real key.
    var speechConfig = SpeechConfig.FromSubscription("YourSubscriptionKey", "chinaeast2");
    speechConfig.SpeechRecognitionLanguage = "zh-CN";
    //await FromMic(speechConfig);
    //await FromFile(speechConfig);
    await FromStream(speechConfig);
}

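Continuous speech recognition

In the earlier examples, each recognition handles a single utterance and stops as soon as that utterance ends. What counts as an utterance? As I understand it, it corresponds to finishing a sentence in everyday speech, to text that ends with a punctuation mark, or to a pause. Note that single-utterance recognition also has a hard limit: the audio must not exceed 15 seconds. Often, however, a file contains many utterances, or you keep talking into the microphone, and recognition has to continue. The key to continuous recognition is subscribing to the following events:

Event: Recognizing, raised while recognition is in progress.

Event: Recognized, raised when recognition of an utterance has finished.

Event: Canceled, raised when recognition is canceled.

Event: SessionStopped, raised when the Speech service session ends.

After subscribing to these events, we call StartContinuousRecognitionAsync() to start continuous recognition and StopContinuousRecognitionAsync() to stop it.

The main code is as follows: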

async static Task FromContinue(SpeechConfig speechConfig)
{
    using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
    using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

    // Task used to signal that recognition has finished
    var stopRecognition = new TaskCompletionSource();

    // Subscribe to events
    recognizer.Recognizing += (s, e) =>
    {
        Console.WriteLine($"Recognizing: Text={e.Result.Text}");
    };

    recognizer.Recognized += (s, e) =>
    {
        if (e.Result.Reason == ResultReason.RecognizedSpeech)
        {
            Console.WriteLine($"Recognized: Text={e.Result.Text}");
        }
        else if (e.Result.Reason == ResultReason.NoMatch)
        {
            Console.WriteLine("NOMATCH: Speech could not be recognized.");
        }
    };

    recognizer.Canceled += (s, e) =>
    {
        Console.WriteLine($"CANCELED: Reason={e.Reason}");

        if (e.Reason == CancellationReason.Error)
        {
            Console.WriteLine($"CANCELED: ErrorCode={e.ErrorCode}");
            Console.WriteLine($"CANCELED: ErrorDetails={e.ErrorDetails}");
            Console.WriteLine("CANCELED: Did you update the subscription info?");
        }

        stopRecognition.TrySetResult();
    };

    recognizer.SessionStopped += (s, e) =>
    {
        Console.WriteLine("\n Session stopped event.");
        stopRecognition.TrySetResult();
    };

    await recognizer.StartContinuousRecognitionAsync();

    // Wait until a Canceled or SessionStopped event signals completion
    Task.WaitAny(new[] { stopRecognition.Task });

    // Stop recognition
    await recognizer.StopContinuousRecognitionAsync();
}

Then change the Main function to call it:

await FromContinue(speechConfig);
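With microphone input the audio has no natural end, so in practice you may prefer to stop recognition explicitly. The following is a sketch of one such variant that stops on a key press; the class and method names are hypothetical, and it is an alternative to the event-based wait shown in FromContinue, not a replacement prescribed by the SDK.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;

static class ContinuousDemo
{
    public static async Task RecognizeUntilKeyPressAsync(SpeechConfig speechConfig)
    {
        using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
        using var recognizer = new SpeechRecognizer(speechConfig, audioConfig);

        // Print each finished utterance as it arrives.
        recognizer.Recognized += (s, e) =>
        {
            if (e.Result.Reason == ResultReason.RecognizedSpeech)
            {
                Console.WriteLine($"Recognized: Text={e.Result.Text}");
            }
        };

        await recognizer.StartContinuousRecognitionAsync();

        Console.WriteLine("Listening. Press any key to stop.");
        Console.ReadKey(intercept: true);   // blocks until the user presses a key

        await recognizer.StopContinuousRecognitionAsync();
    }
}
```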

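Enable dictation mode

As I understand it, dictation mode turns intonation into punctuation when speech is converted to text. For example, if you ask someone "你在家嗎" ("Are you home") with a questioning tone, it is recognized as "你在家嗎？" with the question mark included. Turning dictation mode on is also very simple: you only need to enable it on the SpeechConfig object.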
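A minimal sketch of enabling dictation mode, assuming the EnableDictation() method of the SDK's SpeechConfig class is the switch meant here; the wrapper class and method names are hypothetical.

```csharp
using Microsoft.CognitiveServices.Speech;

static class DictationSetup
{
    // Turns on dictation mode and returns the same config for convenient chaining.
    public static SpeechConfig WithDictation(SpeechConfig speechConfig)
    {
        speechConfig.EnableDictation();
        return speechConfig;
    }
}
```

In Main this amounts to calling speechConfig.EnableDictation() before the recognizer is created.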

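With dictation mode enabled, run the continuous recognition application above. The result is shown below; note the punctuation marks in the figure, such as periods and question marks.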

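Error handling

We have in fact already done this in the continuous recognition example: we inspect the Reason of the returned result and handle errors accordingly, as in the following code: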

switch (result.Reason)
{
    case ResultReason.RecognizedSpeech:
        Console.WriteLine($"RECOGNIZED: Text={result.Text}");
        break;
    case ResultReason.NoMatch:
        Console.WriteLine("NOMATCH: Speech could not be recognized.");
        break;
    case ResultReason.Canceled:
        var cancellation = CancellationDetails.FromResult(result);
        Console.WriteLine($"CANCELED: Reason={cancellation.Reason}");

        if (cancellation.Reason == CancellationReason.Error)
        {
            Console.WriteLine($"CANCELED: ErrorCode={cancellation.ErrorCode}");
            Console.WriteLine($"CANCELED: ErrorDetails={cancellation.ErrorDetails}");
            Console.WriteLine("CANCELED: Did you update the subscription info?");
        }
        break;
}

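Improve recognition accuracy with a phrase list

When speech is converted to text, homophones and words with several pronunciations can easily reduce accuracy. In English, for example, "Move to Ward" is easily recognized as "Move toward", and Chinese has even more homophones and polyphonic characters. To improve accuracy, we can use a phrase list, with code like the following: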

var phraseList = PhraseListGrammar.FromRecognizer(recognizer);
phraseList.AddPhrase("Supercalifragilisticexpialidocious");

A phrase list can contain single words as well as complete phrases. You can also call the Clear() method to empty the whole list.
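As a small sketch of how adding and clearing fit together, the helper below resets the list and then loads a set of phrases. The class and method names are hypothetical; only FromRecognizer(), Clear(), and AddPhrase() come from the SDK.

```csharp
using System.Collections.Generic;
using Microsoft.CognitiveServices.Speech;

static class PhraseListSetup
{
    // Replaces whatever phrases were registered before with the given ones.
    public static PhraseListGrammar ApplyPhrases(
        SpeechRecognizer recognizer, IEnumerable<string> phrases)
    {
        var phraseList = PhraseListGrammar.FromRecognizer(recognizer);
        phraseList.Clear();

        foreach (var phrase in phrases)
        {
            phraseList.AddPhrase(phrase);
        }

        return phraseList;
    }
}
```

For the example in the text, this could be called as PhraseListSetup.ApplyPhrases(recognizer, new[] { "Move to Ward" }) before recognition starts.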

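Recognize audio files in other formats

As mentioned earlier, by default only 16-bit, 16 kHz (or 8 kHz), mono PCM WAV files are supported. So how do we recognize audio files in other formats? For this we need a third-party library: GStreamer. You can follow this link to see how to install GStreamer on Windows. After downloading gstreamer-1.0-msvc-x86_64-1.18.4.msi, simply click through the installer. Once installation is complete, the GStreamer binaries need to be added to the PATH environment variable.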
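If recognition of compressed audio fails, a quick check is whether the GStreamer binaries are actually reachable through PATH. The following is a rough sketch of such a check; the helper is hypothetical, and the file name gstreamer-1.0-0.dll used in the usage note is an assumption about the Windows installer's layout that you should verify against your own installation.

```csharp
using System;
using System.IO;
using System.Linq;

static class PathCheck
{
    // Returns the first directory listed in pathValue that contains fileName,
    // or null if no listed directory contains it.
    public static string FindOnPath(string fileName, string pathValue)
    {
        return (pathValue ?? string.Empty)
            .Split(Path.PathSeparator, StringSplitOptions.RemoveEmptyEntries)
            .Select(dir => dir.Trim())
            .FirstOrDefault(dir => File.Exists(Path.Combine(dir, fileName)));
    }
}
```

For example, PathCheck.FindOnPath("gstreamer-1.0-0.dll", Environment.GetEnvironmentVariable("PATH")) returning null would suggest that PATH has not been updated yet.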

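For testing, I have placed a file, 1.flac, in the project directory.

We also need to read the file through a PullAudioInputStream. Note that this class needs a helper class, which must inherit from PullAudioInputStreamCallback and implement its Read() method.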

public sealed class BinaryAudioStreamReader : PullAudioInputStreamCallback
{
    private System.IO.BinaryReader _reader;
    private bool disposed = false;

    /// <summary>
    /// Creates and initializes an instance of BinaryAudioStreamReader.
    /// </summary>
    /// <param name="reader">The underlying stream to read the audio data from. Note: The stream contains the bare sample data, not the container (like wave header data, etc).</param>
    public BinaryAudioStreamReader(System.IO.BinaryReader reader)
    {
        _reader = reader;
    }

    /// <summary>
    /// Creates and initializes an instance of BinaryAudioStreamReader.
    /// </summary>
    /// <param name="stream">The underlying stream to read the audio data from. Note: The stream contains the bare sample data, not the container (like wave header data, etc).</param>
    public BinaryAudioStreamReader(System.IO.Stream stream)
        : this(new System.IO.BinaryReader(stream))
    {
    }

    /// <summary>
    /// Reads binary data from the stream.
    /// </summary>
    /// <param name="dataBuffer">The buffer to fill</param>
    /// <param name="size">The size of data in the buffer.</param>
    /// <returns>The number of bytes filled, or 0 in case the stream hits its end and there is no more data available.
    /// If there is no data immediate available, Read() blocks until the next data becomes available.</returns>
    public override int Read(byte[] dataBuffer, uint size)
    {
        return _reader.Read(dataBuffer, 0, (int)size);
    }

    /// <summary>
    /// This method performs cleanup of resources.
    /// The Boolean parameter <paramref name="disposing"/> indicates whether the method is called from <see cref="IDisposable.Dispose"/> (if <paramref name="disposing"/> is true) or from the finalizer (if <paramref name="disposing"/> is false).
    /// Derived classes should override this method to dispose resource if needed.
    /// </summary>
    /// <param name="disposing">Flag to request disposal.</param>
    protected override void Dispose(bool disposing)
    {
        if (disposed)
        {
            return;
        }

        if (disposing)
        {
            _reader.Dispose();
        }

        disposed = true;
        base.Dispose(disposing);
    }
}

Then, following the continuous recognition approach above, define the FromGStream method:

async static Task FromGStream(SpeechConfig speechConfig)
{
    // Pull stream over the compressed file; GStreamer decodes the FLAC container.
    var pullAudio = AudioInputStream.CreatePullStream(
        new BinaryAudioStreamReader(new BinaryReader(File.OpenRead(@".\1.flac"))),
        AudioStreamFormat.GetCompressedFormat(AudioStreamContainerFormat.FLAC)
    );

    using var audioConfig = AudioConfig.FromStreamInput(pullAudio);

    // Automatic language detection
    var autoDetectSourceLanguageConfig =
        AutoDetectSourceLanguageConfig.FromLanguages(
            new string[] { "en-US", "zh-CN" }
        );

    using var recognizer = new SpeechRecognizer(speechConfig,
        autoDetectSourceLanguageConfig,
        audioConfig);

    var stopRecognition = new TaskCompletionSource();

    recognizer.Recognizing += (s, e) =>
    {
        Console.WriteLine($"RECOGNIZING: Text={e.Result.Text}");
    };

    recognizer.Recognized += (s, e) =>
    {
        if (e.Result.Reason == ResultReason.RecognizedSpeech)
        {
            Console.WriteLine($"RECOGNIZED: Text={e.Result.Text}");
        }
        else if (e.Result.Reason == ResultReason.NoMatch)
        {
            Console.WriteLine("NOMATCH: Speech could not be recognized.");
        }
    };

    recognizer.Canceled += (s, e) =>
    {
        Console.WriteLine($"CANCELED: Reason={e.Reason}");

        if (e.Reason == CancellationReason.Error)
        {
            Console.WriteLine($"CANCELED: ErrorCode={e.ErrorCode}");
            Console.WriteLine($"CANCELED: ErrorDetails={e.ErrorDetails}");
            Console.WriteLine("CANCELED: Did you update the subscription info?");
        }

        stopRecognition.TrySetResult();
    };

    recognizer.SessionStopped += (s, e) =>
    {
        Console.WriteLine("\n Session stopped event.");
        stopRecognition.TrySetResult();
    };

    await recognizer.StartContinuousRecognitionAsync();

    // Waits for completion. Use Task.WaitAny to keep the task rooted.
    Task.WaitAny(new[] { stopRecognition.Task });

    // Stop recognition once the file has been processed or an error occurred.
    await recognizer.StopContinuousRecognitionAsync();
}

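Then call this method from Main. The result is shown below:

Automatic language detection

Sometimes we need the language of an audio file to be detected automatically. To enable this feature, you only need to configure an AutoDetectSourceLanguageConfig and pass it to the recognizer: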

using var audioConfig = AudioConfig.FromStreamInput(pullAudio);

// Automatic language detection
var autoDetectSourceLanguageConfig =
    AutoDetectSourceLanguageConfig.FromLanguages(
        new string[] { "en-US", "zh-CN" }
    );

using var recognizer = new SpeechRecognizer(speechConfig,
    autoDetectSourceLanguageConfig,
    audioConfig);
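Once detection is configured, it is often useful to see which language was actually chosen. The following is a sketch that assumes the SDK's AutoDetectSourceLanguageResult.FromResult() method and its Language property; the class and method names of the helper are hypothetical.

```csharp
using System;
using Microsoft.CognitiveServices.Speech;

static class LanguageDetectionLogging
{
    // Adds a Recognized handler that prints the detected language with each result.
    public static void LogDetectedLanguage(SpeechRecognizer recognizer)
    {
        recognizer.Recognized += (s, e) =>
        {
            if (e.Result.Reason == ResultReason.RecognizedSpeech)
            {
                var detected = AutoDetectSourceLanguageResult.FromResult(e.Result);
                Console.WriteLine($"[{detected.Language}] {e.Result.Text}");
            }
        };
    }
}
```

Calling LanguageDetectionLogging.LogDetectedLanguage(recognizer) before StartContinuousRecognitionAsync() would prefix each recognized line with the detected language.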
