
Microsoft says it reviews each potential use case and has customers agree to a code of conduct before they can begin using Custom Neural Voice. “We require customers to make very clear it’s a synthetic voice,” Sarah Bird, responsible AI lead for Cognitive Services within Azure AI, said in a statement. “When it’s not immediately obvious in context, explicitly disclose it’s synthetic in a way that’s perceivable by users and not buried in terms.” Microsoft says it’s also working on a way to embed a digital watermark within a synthetic voice to indicate that the content was created with a Custom Neural Voice. The company is effectively going toe to toe with Google, which in 2019 debuted new AI-synthesized WaveNet voices and standard voices in its Cloud Text-to-Speech service.

Custom Neural Voice includes controls to help prevent misuse of the service, according to Microsoft. When a customer submits a recording, the voice actor makes a statement acknowledging that they (1) understand the technology and (2) are aware the customer is having a voice made. The recording is compared with the model training data using speaker verification to make sure the voices match before a customer can begin creating the voice. Microsoft also contractually requires customers to get consent from voice talent.
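Microsoft has not published how its speaker verification step works. As a rough illustration only, a check of this kind is often built by turning each recording into a fixed-length "voice embedding" and comparing embeddings by cosine similarity. The sketch below assumes the embeddings already exist; the function names, the 0.8 threshold, and the rule that every training clip must match are all hypothetical choices, not details from Microsoft.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def voices_match(
    consent_embedding: Sequence[float],
    training_embeddings: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical cutoff, not a Microsoft value
) -> bool:
    """Return True only if every training clip resembles the consent recording."""
    if not training_embeddings:
        return False  # nothing to verify against, so do not approve
    return all(
        cosine_similarity(consent_embedding, clip) >= threshold
        for clip in training_embeddings
    )


if __name__ == "__main__":
    consent = [1.0, 0.0]
    same_speaker = [[0.9, 0.1], [1.0, 0.05]]
    other_speaker = [[0.0, 1.0]]
    print(voices_match(consent, same_speaker))
    print(voices_match(consent, other_speaker))
```

In a real system the embeddings would come from a trained speaker model and the threshold would be tuned on labeled data. The point here is only the gate itself: voice creation proceeds only when the consent statement and the training audio appear to come from the same speaker.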

One set of models converts a script into an acoustic sequence, predicting prosody, while another set of models converts that acoustic sequence into speech. Microsoft claims that because the models can simultaneously predict the right prosody and synthesize a voice, Custom Neural Voice results in more natural-sounding voices.
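The two-stage flow described here can be pictured with a toy pipeline. The sketch below is not Microsoft's implementation: the "acoustic model" is a hand-written rule that assigns each phoneme a pitch and a duration, and the "vocoder" simply renders a sine tone for each frame. All names and numbers are made up for illustration; in a neural system both stages would be learned models.

```python
import math
from dataclasses import dataclass
from typing import List

VOWELS = set("aeiou")


@dataclass
class AcousticFrame:
    phoneme: str
    pitch_hz: float    # toy stand-in for "tone"
    duration_s: float  # toy stand-in for "duration"


def acoustic_model(phonemes: List[str]) -> List[AcousticFrame]:
    """Stage 1 (toy): script -> acoustic sequence with predicted prosody."""
    frames = []
    for i, p in enumerate(phonemes):
        pitch = 120.0 + 10.0 * (i % 3)             # arbitrary pitch contour
        duration = 0.10 if p in VOWELS else 0.05   # vowels held longer
        frames.append(AcousticFrame(p, pitch, duration))
    return frames


def vocoder(frames: List[AcousticFrame], sample_rate: int = 16000) -> List[float]:
    """Stage 2 (toy): acoustic sequence -> waveform samples in [-1, 1]."""
    samples: List[float] = []
    for frame in frames:
        n = round(frame.duration_s * sample_rate)
        for t in range(n):
            samples.append(math.sin(2 * math.pi * frame.pitch_hz * t / sample_rate))
    return samples


def synthesize(phonemes: List[str], sample_rate: int = 16000) -> List[float]:
    """Chain the two stages: phonemes -> frames -> audio samples."""
    return vocoder(acoustic_model(phonemes), sample_rate)


if __name__ == "__main__":
    audio = synthesize(["h", "e", "l", "o"])
    print(len(audio), "samples")
```

The useful part of the sketch is the interface between the stages: the first emits a sequence that carries prosody, and the second consumes only that sequence to produce audio.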

With Custom Neural Voice, prosody - the tone and duration of each phoneme, the unit of sound that distinguishes one word from another - is combined so machine learning models running in Azure can closely reproduce an actor’s voice or a wholly original voice.
