Improved VITS-Based Multilingual AI Speech Synthesis Model with Domain Adaptors and Acoustic Feature Optimization

Xing Yang

Abstract


Speech synthesis technology plays an important role in global economic and cultural exchange, yet multilingual speech synthesis still cannot meet the current development needs of the global market. This study proposes an acoustic feature conversion method and a procedure for decoupling multilingual information, combined with domain adaptor modules, to improve the end-to-end VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) model and adapt it to multilingual speech synthesis. Evaluation with speech synthesis metrics showed that, after removing the similarity regularization term, the model's average selection score for similarity across languages was 4.93. Removing the domain adaptors reduced the naturalness of synthesized speech by 0.8 compared with the full multilingual model, indicating that domain adaptors contribute substantially to naturalness. In the cross-lingual evaluation, the proposed model achieved the best naturalness, with average selection scores of 4.26 for naturalness and 3.96 for similarity when transferring to English. In Chinese-to-Japanese synthesis with a data volume of 200, the model achieved the highest accuracy of 94.58%, 16.53% higher than traditional speech synthesis frameworks, with a synthesis time of 3.12 seconds. These results demonstrate the feasibility and superiority of the domain-adaptor-based multilingual speech synthesis model, which brings multilingual capability to speech synthesis applications in the field of artificial intelligence and promotes the industrial development and intelligent services of speech synthesis technology.
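
The abstract describes domain adaptor modules inserted into a VITS-style model to condition it on the target language. The sketch below is a minimal, hedged illustration of one common form of such an adaptor (a residual bottleneck conditioned on a language ID) written in PyTorch; the class name, dimensions, and placement after the text encoder are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch (not the paper's code): a bottleneck "domain adaptor"
# of the kind often inserted into a VITS text encoder to inject language
# identity. All names and hyperparameters below are assumptions.
import torch
import torch.nn as nn


class DomainAdaptor(nn.Module):
    """Per-language bottleneck adapter with a residual connection."""

    def __init__(self, hidden_dim: int, bottleneck_dim: int, num_languages: int):
        super().__init__()
        self.lang_embedding = nn.Embedding(num_languages, hidden_dim)
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim) encoder states; lang_id: (batch,)
        lang = self.lang_embedding(lang_id).unsqueeze(1)  # (batch, 1, hidden_dim)
        h = self.norm(x + lang)                           # inject language identity
        h = self.up(self.act(self.down(h)))               # bottleneck transform
        return x + h                                      # residual keeps backbone output


if __name__ == "__main__":
    adaptor = DomainAdaptor(hidden_dim=192, bottleneck_dim=32, num_languages=4)
    phoneme_states = torch.randn(2, 50, 192)  # dummy text-encoder output
    lang_id = torch.tensor([0, 2])            # e.g. two different target languages
    out = adaptor(phoneme_states, lang_id)
    print(out.shape)  # torch.Size([2, 50, 192])
```

Because the adaptor is residual and low-rank, it can in principle be trained per language while the shared backbone stays fixed, which is the usual motivation for adapter-style cross-lingual transfer.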


Full Text:

PDF


DOI: https://doi.org/10.31449/inf.v49i19.7622

Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.