An Automated Method to Correct Artifacts in Neural Text-to-speech Models
Abstract
Recent advances in deep learning-based speech synthesis have yielded substantial progress in generating natural speech. Although these high-performing models are used in many domains, they still produce mis-synthesized speech that needs to be corrected. Previous speech correction methods are inefficient because they require manual error specification, model retraining, or additional data. This paper presents a novel approach that detects and corrects errors within the model itself, without additional resources or model retraining. Specifically, we propose a method that automatically identifies abnormal encoder vectors by examining the inherent limitations of the neural network encoders responsible for contextualizing input sentences. We also introduce a correction algorithm that reduces speech artifacts by eliminating the incorrect relationships among phonemes that produce abnormal encoder context vectors. Objective evaluation metrics, namely attention alignment error and Fréchet Wav2Vec Distance, along with subjective evaluation using the Comparative Mean Opinion Score, show significant improvements in the corrected speech. These findings underscore the need for technologies that can autonomously identify and correct flaws in speech synthesis models.
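The abstract names Fréchet Wav2Vec Distance as one of the objective metrics but does not spell out how it is computed. As a rough illustration only, the sketch below computes the standard Fréchet distance between two sets of feature vectors, each modelled as a Gaussian. It assumes the features are utterance-level embeddings already extracted with a Wav2Vec-style model. The function name `frechet_distance` and the array shapes are hypothetical, and the authors' exact procedure may differ.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_a, feats_b: arrays of shape (n_samples, dim) with dim >= 2.
    Assumption: rows are embeddings from a Wav2Vec-style model; the
    extraction step is not shown here.
    """
    mu_a = feats_a.mean(axis=0)
    mu_b = feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite covariances (up to numerical noise).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

A lower value means the two feature distributions are closer. In this setting one set would come from reference speech and the other from synthesized or corrected speech.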
Artificial speech correction with the proposed method
Each script is accompanied by four types of audio samples: synthesized speech, local reference (Ours), global reference (Truncation Trick), and random reference. We conducted experiments on three types of test data: LJSpeech, low PMI LibriSpeech, and high PMI LibriSpeech. The abnormal parts of each sentence are highlighted in red.
LJSpeech
"LJ019-0368":"The latter too was to be laid before the House of Commons."
Synthesized Speech
Local reference (Ours)
Global reference (Truncation Trick)
Random reference
"LJ050-0235":"It has also used other Federal law enforcement agents during Presidential visits to cities in which such agents are stationed"
Synthesized Speech
Local reference (Ours)
Global reference (Truncation Trick)
Random reference
"LJ050-0084":"or, quote, other high government officials in the nature of a complaint coupled with an expressed or implied determination to use a means"
Synthesized Speech
Local reference (Ours)
Global reference (Truncation Trick)
Random reference
"LJ048-0053":"It is the conclusion of the Commission that, even in the absence of Secret Service criteria"
Synthesized Speech
Local reference (Ours)
Global reference (Truncation Trick)
Random reference
"LJ043-0107":"Upon moving to New Orleans on April 24, 1963,"
Synthesized Speech
Local reference (Ours)
Global reference (Truncation Trick)
Random reference
Low PMI LibriSpeech
"4122-157669-0057":"CHURL UPON THY EYES I THROW ALL THE POWER THAT THIS CHARM DOTH OWE WHEN THOU WAKEST LET LOVE FORBID SLEEP HIS SEAT ON THY EYELID",
Synthesized Speech
Local reference (Ours)
Global reference (Truncation Trick)
Random reference
"4179-25937-0046":"I AM NOT SO SURE ABOUT SCHWARTZ I SAID THOUGHTFULLY",
Synthesized Speech
Local reference (Ours)
Global reference (Truncation Trick)
Random reference
"8272-279789-0044":"SNOWDROP SHALL DIE SHE CRIED IF IT COSTS MY OWN LIFE",
Synthesized Speech
Local reference (Ours)
Global reference (Truncation Trick)
Random reference
"4042-12369-0039":"SOFT GOAT CHEESE TOME DE SAVOIE FRANCE SOFT PASTE GOAT OR COW OTHERS IN THE SAME CATEGORY ARE",
Synthesized Speech
Local reference (Ours)
Global reference (Truncation Trick)
Random reference
"5983-39668-0033":"YES COLBERT LITTLE COLBERT MAZARIN'S FACTOTUM THE SAME WELL",
Synthesized Speech
Local reference (Ours)
Global reference (Truncation Trick)
Random reference
High PMI LibriSpeech
"1968-145732-0016":"AND CARRIED HIM HOME TO HIS HOUSE AND WAS EXCEEDINGLY KIND TO HIM HE GAVE HIM TO HIS WIFE",
Synthesized Speech
Local reference (Ours)
Global reference (Truncation Trick)
Random reference
"4546-16781-0038":"REGARD FOR A PERSON IS THE MENTAL VIEW OR FEELING THAT SPRINGS FROM A SENSE OF HIS VALUE",
Synthesized Speech
Local reference (Ours)
Global reference (Truncation Trick)
Random reference
"844-133697-0063":"SHE WAS REASSURED QUICKLY ENOUGH BY HER SENSE OF HIS GREAT GOOD MANNERS",
Synthesized Speech
Local reference (Ours)
Global reference (Truncation Trick)
Random reference
"5183-29124-0006":"AS HE WAS POPULARLY CALLED FOR HE HAD BEEN A CLERGYMAN IN HIS DAY",
Synthesized Speech
Local reference (Ours)
Global reference (Truncation Trick)
Random reference
"7307-91998-0017":"HAD HE COME BECAUSE HE HAD HEARD OF THE BETROTHALS HE ADMITTED THAT IT WAS SO",