Microsoft Azure Text to Speech

In Machine Learning Engineering


Bring your apps to life with natural-sounding voices. Create apps and services that speak naturally.
Editorial Committee Qualified.One,
Management

Highlight your brand with a customised, realistic voice generator, and access voices with different speaking styles and emotional tones to suit your use case, from text reading and communication tools to customer service chatbots.

Microsoft, Google and iFLYTEK have all released TTS tools. Microsoft exposes an API and SDK on Azure (https://azure.microsoft.com/zh-cn/services/cognitive-services/text-to-speech/); the following explains how to use the API. Google also offers a TTS product as part of Google Cloud (https://cloud.google.com/text-to-speech/, which is not reachable from mainland China without a VPN). To try it for free or to buy it you must register for Google Cloud, and registration requires linking a Visa or MasterCard credit card, which is very unfriendly to users in mainland China.

On Azure, the documentation for the speech service is quite detailed and covers many features: the TTS API, the TTS SDK and custom voice models for text-to-speech. But there is no general, outline-style introduction, so after reading the docs it is easy to still not know where to start. This article explains step by step how to use the Azure TTS API from scratch (SDK usage may be added later). The goal is simple: enter a piece of text, call the API and get back a wav audio clip that, when played, speaks the text that was entered.

Step 1: Register a Microsoft Azure account.

Go to https://azure.microsoft.com/en-us/services/cognitive-services/text-to-speech/ and click "Try Text to Speech" to try the service or to register for Azure.

Step 2: Get an endpoint and key

Once you have completed registration, go to https://azure.microsoft.com/en-us/try/cognitive-services/?api=speech-services and add the "Speech Services" feature; an endpoint and a key are then generated for you automatically.
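To see where this endpoint and key end up, here is a minimal sketch of the token request that the sample code in Step 3 performs. It assumes the westus region and uses a placeholder key; substitute your own region and key 1 or key 2 from this step.

# Minimal sketch: exchange the subscription key for a short-lived access token.
# "westus" and the placeholder key are assumptions; use your own region and key.
import http.client

apiKey = "Your api key goes here"            # key 1 or key 2 from Step 2
host = "westus.api.cognitive.microsoft.com"  # region assigned in Step 2

conn = http.client.HTTPSConnection(host)
conn.request("POST", "/sts/v1.0/issueToken", "",
             {"Ocp-Apim-Subscription-Key": apiKey})
response = conn.getresponse()
print(response.status, response.reason)      # 200 OK if the key and region match
token = response.read().decode("UTF-8")
conn.close()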

Step 3: Download the sample code

The API works as follows: the local program sends an HTTPS request (containing the text to convert) to the Microsoft server and, after authentication, the server returns the synthesized audio. The sample program can be downloaded from GitHub: https://github.com/Azure-Samples/Cognitive-Speech-TTS. The Samples-Http folder contains source code for several languages, such as Android, C#, Java, Node.js, PHP, Python and Ruby. The Python (Python 3) version is used as the example below.

The Python folder contains two files: TTSSample.py and a README. The .py file cannot be run as-is; it needs minor modifications (explained below; the modified source is attached at the end of this article). After making the changes, run the file: the program performs all of the operations described in the previous paragraph and stores the returned result in the data variable.

The TTSSample.py file needs to be modified as follows (see README):

  • (1) apiKey = "Your api key goes here": replace the contents of the quotes with key 1 or key 2 obtained in Step 2.
  • (2) Check that the endpoint in the code matches the endpoint assigned in Step 2. For example, the program contains AccessTokenHost = "westus.api.cognitive.microsoft.com", where "westus" is the region. Compare it with the region assigned to your account, e.g. "westus" in Step 2. If they match, go to the next step; if they differ, change it (only the "westus" part of the hostname needs to change, the rest stays the same).

Similarly "conn = http.client.HTTPSConnection (" westus.tts.speech.microsoft.com ")" made the same modification in the program. If your endpoint is not westus, replace westus in the program with the name of the endpoint.

There are three regional endpoints. If you are unsure which one applies to you, refer to https://docs.microsoft.com/zh-cn/azure/cognitive-services/speech-service/how-to-text-to-speech

  • (3) According to http://docs.microsoft.com/zh-cn/azure/cognitive-services/speech-service/rest-apis#authentication, set the X-Microsoft-OutputFormat header and the output voice (male or female) in the program. At the time of writing the output audio supports at most a 24 kHz sample rate.
  • (4) The string assigned to voice.text in the program is the text you want converted to audio; change it to suit your needs. The snippet after this list collects the lines from points (1)-(4) for quick reference.
  • (5) TTSSample.py can now be run. On success the program prints "200 OK"; if an error occurs, the corresponding status code is printed instead.
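For quick reference, these are the lines in TTSSample.py that points (1)-(4) refer to, shown with the sample's default values (substitute your own key, region, voice and text):

# Lines of TTSSample.py affected by points (1)-(4); values shown are the sample defaults.
apiKey = "Your api key goes here"                                       # (1) your key 1 or key 2
AccessTokenHost = "westus.api.cognitive.microsoft.com"                  # (2) match your region
conn = http.client.HTTPSConnection("westus.tts.speech.microsoft.com")   # (2) same region here
"X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",                # (3) output format, 24 kHz
voice.set('name', 'Microsoft Server Speech Text to Speech Voice (en-US, Guy24KRUS)')  # (3) voice
voice.text = 'This is a demo to call microsoft text to speech service in Python.'     # (4) text to convert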

Assuming the previous steps succeeded and the request returned 200, where is the converted audio? It is in the program's data variable: data holds the audio produced by the TTS conversion, and we only need to write it out in wav format to get the final sound. See the code at the end of the article for the details.

Attachment: the modified TTSSample.py file (remember to change apiKey to your own key, and check that your Python installation includes the "wave" module); output.wav is the converted audio.



#! /usr/bin/env python3
 
# -*- coding: utf-8 -*-
 
###
#Copyright (c) Microsoft Corporation
#All rights reserved. 
#MIT License
#Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the ""Software""), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
#The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
###
import http.client, urllib.parse, json
from xml.etree import ElementTree
import wave
# Note: new unified SpeechService API key and issue token uri is per region
# New unified SpeechService key
# Free: https://azure.microsoft.com/en-us/try/cognitive-services/?api=speech-services
# Paid: https://go.microsoft.com/fwlink/?LinkId=872236
apiKey = "Your api key goes here"
 
params = ""
headers = {"Ocp-Apim-Subscription-Key": apiKey}
 
#AccessTokenUri = "https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken";
AccessTokenHost = "westus.api.cognitive.microsoft.com"
path = "/sts/v1.0/issueToken"
 
# Connect to server to get the Access Token
print ("Connect to server to get the Access Token")
conn = http.client.HTTPSConnection(AccessTokenHost)
conn.request("POST", path, params, headers)
response = conn.getresponse()
print(response.status, response.reason)
 
data = response.read()
conn.close()
 
accesstoken = data.decode("UTF-8")
print ("Access Token: " + accesstoken)
 
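# Build the SSML request body: document language, voice language, gender and voice name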
body = ElementTree.Element('speak', version='1.0')
body.set('{http://www.w3.org/XML/1998/namespace}lang', 'en-us')
voice = ElementTree.SubElement(body, 'voice')
voice.set('{http://www.w3.org/XML/1998/namespace}lang', 'en-US')
voice.set('{http://www.w3.org/XML/1998/namespace}gender', 'Male')
voice.set('name', 'Microsoft Server Speech Text to Speech Voice (en-US, Guy24KRUS)')
voice.text = 'This is a demo to call microsoft text to speech service in Python.'
 
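# Request headers: SSML content type, desired audio output format and the bearer token obtained above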
headers = {"Content-type": "application/ssml+xml", 
			"X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
			"Authorization": "Bearer " + accesstoken, 
			"X-Search-AppId": "07D3234E49CE426DAA29772419F436CA", 
			"X-Search-ClientID": "1ECFAE91408841A480F00935DC390960", 
			"User-Agent": "TTSForPython"}
			
#Connect to server to synthesize the wave
print ("
Connect to server to synthesize the wave")
conn = http.client.HTTPSConnection("westus.tts.speech.microsoft.com")
conn.request("POST", "/cognitiveservices/v1", ElementTree.tostring(body), headers)
response = conn.getresponse()
print(response.status, response.reason)
 
data = response.read()
conn.close()
print("The synthesized wave length: %d" %(len(data)))
 
# Save the returned audio data to output.wav
f = wave.open(r"output.wav", "wb")
f.setnchannels(1)      # mono
f.setframerate(24000)  # sample rate
f.setsampwidth(2)      # sample width 2 bytes (16 bits)
f.writeframes(data)
f.close()