1. Overview
GPT-SoVITS is an open source speech synthesis model that combines deep learning with acoustic modeling techniques to produce high-quality speech. Its distinguishing feature is zero-shot speech synthesis from a reference audio clip, which lets the model reproduce a similar speaking style even without direct training data for that voice. Users can further improve results by fine-tuning the model for specific application requirements.
2. Content
2.1 Introduction to GPT-SoVITS
This open source Text-to-Speech (TTS) project runs on Linux, macOS, and Windows, offering great flexibility and compatibility. Users can clone a specific voice by simply providing an audio sample of up to one minute. The project can convert Chinese, English, and Japanese text into speech in the cloned voice, which makes it useful in multilingual environments.
- Project Address: https://github.com/RVC-Boss/GPT-SoVITS
- Official Tutorial: GPT-SoVITS Manual
2.2 Speech synthesis
VITS is an end-to-end text-to-speech (TTS) model that combines adversarial learning with a conditional variational autoencoder to generate high-quality speech. In recent years, a variety of TTS models with single-stage training and parallel sampling have been proposed, but their sample quality often falls short of traditional two-stage systems. To address this problem, VITS adopts a parallel end-to-end approach capable of generating more natural and realistic audio.
The model significantly improves the expressive power of generative modeling through normalizing flows and variational inference augmented by adversarial training. In addition, VITS introduces a stochastic duration predictor that can synthesize speech with varying rhythms from the same input text. This design lets the model capture the uncertainty of the latent variables, forming a natural one-to-many relationship in which the same text can be spoken with different pitches and rhythms. This flexibility and high output quality give VITS broad application potential in speech synthesis.
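For reference, the conditional VAE component that VITS builds on maximizes the evidence lower bound (ELBO). The form below is only a sketch of that one term, with x the target speech, c the text condition, and z the latent variable; the full VITS objective additionally includes adversarial, feature-matching, and duration-prediction losses:

log p(x | c) ≥ E_{q(z|x)}[ log p(x | z) ] − KL( q(z | x) ‖ p(z | c) )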
- Paper Address: https://arxiv.org/pdf/2106.06103
- GitHub Address: https://github.com/jaywalnut310/vits
2.3 Whisper Speech Recognition
Whisper is an advanced automatic speech recognition (ASR) system developed by OpenAI, trained on a corpus of 680,000 hours of multilingual (covering 98 languages), multitask supervised data. OpenAI reports that this large and diverse dataset makes the system markedly more robust to varied accents, background noise, and technical terminology.
In addition to speech recognition, Whisper supports transcription and translation in multiple languages, and is capable of translating non-English languages directly into English. This versatility makes Whisper suitable not only for speech-to-text tasks, but also for international communication, content creation and education. With its outstanding accuracy and flexibility, Whisper provides users with a powerful tool that helps break down language barriers and facilitate communication and understanding.
- Paper Address: https://arxiv.org/pdf/2212.04356
- GitHub Address: https://github.com/openai/whisper
The fundamentals of Whisper are based on a Transformer sequence-to-sequence model designed to handle a wide range of speech tasks, including multilingual speech recognition, speech translation, spoken language recognition, and speech activity detection. By representing these tasks uniformly as a sequence of symbols requiring decoder prediction, Whisper can effectively replace multiple stages in a traditional speech processing pipeline, simplifying the processing flow.
The model is formatted for multi-task training, using a series of special symbols as task indicators or classification targets. This design not only enhances the model's flexibility, but also enables it to perform well when dealing with different types of speech input. For example, when confronted with multiple languages or different accents, Whisper is able to utilize the rich information in its training data to quickly adapt and improve recognition accuracy. With this innovative approach, Whisper demonstrates a strong capability in speech processing to meet diverse user needs.
The Whisper system offers five model sizes (tiny, base, small, medium, and large) that trade speed against accuracy, so users can choose the one that fits their application scenario. Their characteristics and approximate video memory requirements are roughly as follows:
- tiny (~1 GB): lowest memory requirements and fastest, suitable for real-time speech recognition, but may be somewhat less accurate in complex audio environments.
- base (~1 GB): better accuracy while remaining fast, adequate for most everyday applications.
- small (~2 GB): a clear step up in accuracy, suitable for scenarios that demand higher quality, such as transcribing medical records or legal documents, at a slightly slower relative speed.
- medium (~5 GB): strong recognition performance that copes with difficult accents and technical terminology, suited to professional domains, with higher memory requirements and slower speed.
- large (~10 GB): top accuracy, including in high-noise environments and multi-speaker dialogue, with the highest memory requirements and the slowest speed, best suited to offline processing where real-time output is not required.
With these different sizes of models, users can flexibly choose the most suitable option according to their hardware resources and application requirements to achieve the best speech recognition results.
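As a quick illustration of this trade-off, here is a minimal sketch using the openai-whisper Python package; the model size, file names, and options are placeholders to adapt to your own setup:

import whisper

# Load one of the five sizes: "tiny", "base", "small", "medium", or "large"
model = whisper.load_model("small")

# Transcribe an audio file; the language is auto-detected unless specified
result = model.transcribe("example.wav")
print(result["text"])

# Translate non-English speech directly into English
translated = model.transcribe("example_ja.wav", task="translate")
print(translated["text"])

Switching the model name is all it takes to trade speed for accuracy.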
3. GPT-SoVITS installation and deployment
3.1 Configuration requirements
1. Training
- Windows
- A CUDA-capable NVIDIA graphics card with at least 6 GB of video memory is required.
- Unsupported cards include all models older than the 10-series, the GTX 1060 and below, GTX 1660 and below, RTX 2060 and below, and 4 GB variants of the RTX 3050.
- The operating system must be Windows 10 or 11.
- Without a graphics card, training automatically falls back to the CPU, but it is very slow.
- macOS
- Requires running macOS 14 or later.
- The Xcode command line tool must be installed, which can be done by running xcode-select --install.
- Linux
- Familiarity with working in a Linux environment is assumed.
- A graphics card with at least 6GB of video memory is required.
- Again, without a graphics card, the system will automatically switch to CPU training, which is slower.
2. Inference
- Windows
- A CUDA-capable NVIDIA graphics card with at least 4 GB of video memory is required (not fully tested; 3 GB may not be enough for speech synthesis, so 4 GB is a safer baseline).
- The operating system needs to be Windows 10 or 11.
- If there is no graphics card, the system automatically recognizes and uses the CPU for inference.
- macOS
- Requires running macOS 14 or later.
- The Xcode command line tool must be installed, as above.
- Linux
- Familiarity with working in a Linux environment is assumed.
- A graphics card with at least 4GB of video memory is required.
- If there is no graphics card, the system will automatically recognize and use the CPU for inference.
Meeting these configuration requirements ensures that the system can train and run inference efficiently for the best performance.
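To check whether a machine meets these requirements, a minimal PyTorch sketch like the following reports whether CUDA is available and how much video memory the first GPU has (it assumes PyTorch is already installed):

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / (1024 ** 3)
    print(f"GPU: {props.name}, video memory: {vram_gb:.1f} GB")
else:
    print("No CUDA GPU detected; training and inference will fall back to the CPU.")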
3.2 Mac environment requirements
1. Software requirements
- Make sure you have installed the Xcode command line tools by running xcode-select --install.
- Install Homebrew in order to install the necessary software (e.g. git, ffmpeg).
2. Install conda (can be skipped if already installed)
Tested versions of Python and PyTorch:
- Python 3.9, PyTorch 2.2.1
You can check if it is installed by using the following command.
conda info
3. Install FFmpeg (can be skipped if already installed)
You can check whether it is installed, and whether the version is at least 6.1, with the following commands:
# Install
brew install ffmpeg
# Check the environment
ffmpeg -version
3.3 Project preparation
1. Download the project code
If Git is not installed, open a terminal and run
brew install git
brew install git-lfs
brew install rust
If Git is already installed, open a terminal and change to the directory where you want to store the project (the example below uses the desktop; adjust the paths to your actual setup, as all paths in this document assume this layout). Then clone the repository locally; ~ stands for the current user's home directory.
# Example
cd ~/desktop  # ~ refers to the current user's home directory
git clone --depth=1 https://github.com/RVC-Boss/GPT-SoVITS
2. Download the pretrained models (you can also follow the project README directly)
Download the pretrained models from GPT-SoVITS Models, unzip them, and replace the contents of ~/desktop/GPT-SoVITS/GPT_SoVITS/pretrained_models. For UVR5 (vocal/accompaniment separation and reverb removal), download the models from UVR5 Weights and place them in ~/desktop/GPT-SoVITS/tools/uvr5/uvr5_weights; if you use the UVR5 desktop client (recommended), you can skip this step. For Chinese automatic speech recognition, download the models from the Damo ASR Models, unzip them, and replace the contents of ~/desktop/GPT-SoVITS/tools/asr/models.
# One-step commands
cd ~/desktop/GPT-SoVITS/tools/asr/models
git lfs install
git clone https://www.modelscope.cn/iic/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch
git clone https://www.modelscope.cn/iic/punc_ct-transformer_zh-cn-common-vocab272727-pytorch
git clone https://www.modelscope.cn/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch
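As an alternative to manual downloads, the pretrained GPT-SoVITS weights are also mirrored on Hugging Face. The sketch below uses huggingface_hub and assumes the lj1995/GPT-SoVITS repository and the default local directory; check the project README for the authoritative links:

from huggingface_hub import snapshot_download

# Fetch the pretrained weights into the folder the project expects
snapshot_download(
    repo_id="lj1995/GPT-SoVITS",              # assumed repository name
    local_dir="GPT_SoVITS/pretrained_models",
)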
3.4 Environmental preparation
1. Creating the environment
Close the terminal, then open it and type
conda create -n GPTSoVits python=3.9
conda activate GPTSoVits
When prompted, type y and press Enter to confirm.
If you encounter an error such as cannot find conda, conda is probably not installed or not on your PATH. Type
conda -V
to check whether it is available; if it is installed but still not found, try reopening the terminal.
2. Installation of dependencies
In the terminal, enter:
cd ~/desktop/GPT-SoVITS
conda activate GPTSoVits
pip install -r requirements.txt  # optionally add -i <mirror-url>/simple to use a PyPI mirror
3. Running
conda activate GPTSoVits
cd ~/desktop/GPT-SoVITS
python webui.py zh_CN
4. Training models
GPT-SoVITS WebUI provides comprehensive features including dataset production, model fine-tuning training and speech clone inference. If you just want to experience the results, you can directly use the officially shared speech models. This design allows users to get started quickly without complex setup or in-depth technical knowledge.
4.1 Data set processing
1. Processing the original audio
If the original audio is already clean (for example, dry vocals extracted from a game), you can skip this step. Otherwise, click Open UVR5-WebUI; after a few moments, open your browser and visit http://localhost:9873.
2. Cutting Audio
Before cutting, it is recommended to import all audio files into an audio editor and adjust the volume so that the peaks sit between -9 dB and -6 dB; clips that are excessively loud should be deleted.
After opening WebUI, first enter the folder path of the original audio. Next, the following suggested parameters can be adjusted:
- min_length: adjust according to the size of video memory, the smaller the video memory, the smaller the value.
- min_interval: adjusted for the average interval of the audio. If the audio is too dense, the value can be lowered appropriately.
- max_sil_kept: this parameter affects the coherence of the sentence and needs to be adjusted differently for different audio. If you are not sure, it is recommended to keep the default value.
Click "Turn on voice cutting", the cutting process will be finished immediately, and the default output path is output/slicer_opt. In this way, you can get the processed audio clips quickly.
Open the sliced-audio folder and manually cut any clip whose length in seconds exceeds your video memory in GB. For example, with a 10 GB graphics card, cut clips longer than 10 seconds down to under 10 seconds, or simply delete them (clips only slightly over the limit can be left alone). Excessively long clips may exhaust video memory during training.
If slicing still produces a single file, the audio is too dense; try lowering the min_interval parameter to obtain a finer cut. This keeps each audio file within the memory limit.
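To spot clips that are still too long, a small sketch like the following (using the soundfile package; the folder path and threshold are placeholders) lists every clip in the slicer output that exceeds a given duration:

import os
import soundfile as sf

SLICER_DIR = "output/slicer_opt"   # default slicer output folder
MAX_SECONDS = 10                   # e.g. your video memory in GB

for name in sorted(os.listdir(SLICER_DIR)):
    if not name.lower().endswith(".wav"):
        continue
    info = sf.info(os.path.join(SLICER_DIR, name))
    if info.duration > MAX_SECONDS:
        print(f"{name}: {info.duration:.1f} s (re-slice or delete)")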
3. Audio Noise Reduction
If the original audio is already clean (for example, dry vocals extracted from a game), you can skip this step. Enter the path of the folder containing the sliced audio (output/slicer_opt by default) and click "Enable noise reduction". When the process finishes, the denoised audio is written to output/denoise_opt by default, giving you clean audio files with little effort.
4. Marking and proofreading
Enter the path of the sliced-audio folder from the previous step: output/denoise_opt if the audio was processed with noise reduction, or output/slicer_opt if it was not.
Next, select Damo ASR or Faster Whisper and click "Enable offline batch ASR". The default output path is output/asr_opt. Note that this step may take a while, as the system needs to download the corresponding models.
- Damo ASR: dedicated to Chinese recognition, with the best results for Chinese.
- Faster Whisper: supports 99 languages and is especially good at English and Japanese. The large-v3 model is recommended, with the language set to auto.
Note that the recognized text may not be fully accurate, so it is recommended to proofread the annotations manually (this step is time-consuming and can be skipped if you are not chasing the best possible result). Since this walkthrough only demonstrates the process, the step is skipped here.
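For reference, each line of the annotation file produced by this step follows the pipe-separated list format GPT-SoVITS expects: audio path|speaker name|language|text. The line below is a made-up example (file name, speaker field, and text are placeholders; the language tag is typically ZH, EN, or JA):

output/slicer_opt/vocal_0001.wav|slicer_opt|ZH|这是一个示例句子。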
4.2 Fine-tuning training
1. Formatting of data sets
In the 1-GPT-SOVITS-TTS tab, fill in the following information:
- Experiment/Model Name: Enter the name of the experiment, making sure not to use Chinese.
- Text annotation file: Select your annotation file.
- Training set audio file directory: specifies the folder path of the audio dataset.
Ensure that all paths and file names are correct for smooth follow-up.
After filling out the form, you can choose to click the following three buttons one by one and wait for each operation to finish before clicking the next one. If you encounter errors, please check the background log, some errors can be solved by just retrying.
Alternatively, you can directly use the "Enable One-Click Trifecta" button to complete the three steps in one click, saving time and effort.
2. Training the fine-tuned model
In the 1B-Fine-tuning training sub-tab, configure parameters such as batch_size, then click Open SoVITS training and Open GPT training. Note that these two training tasks cannot run at the same time (unless you have two graphics cards). If training is interrupted, simply click the training button again and the system will resume from the most recent save point.
For SoVITS training, it is recommended to set batch_size to no more than about half of your video memory in GB; setting it too high can fill the memory, and a larger value is not necessarily faster. The right value also depends on the dataset size, so it does not have to be exactly half the video memory. For example, with 10 GB of video memory and 10-second slices, a batch_size of around 5 is a reasonable starting point; with longer slices or a larger dataset, reduce it further.
Next, set the number of training epochs. The SoVITS model's epoch count can be set higher because it trains quickly. For the GPT model, it is usually recommended to use about 10 epochs and no more than 20, to balance training time against model quality.
4.3 Inference
1. Start the inference service
In the 1C-Inference sub-tab, configure the model paths (if your model is not listed, click the refresh button on the right to reload it). Then click Open TTS inference WebUI to launch the inference page, where you can enter text and generate speech to try out the model.
After a few moments, open your browser and visit http://localhost:9872.
2. Voice cloning inference
On the inference page, first select the desired model. Second, upload the reference audio and its corresponding text (a duration of 5 to 10 seconds is recommended; the reference audio is very important, as it determines how well the model captures the speed and tone of the voice, so choose it carefully). Third, enter the text you want synthesized in the cloned voice and start generating speech.
5. Summary
GPT-SoVITS is an open source speech synthesis framework that combines adversarial training and variational inference to achieve high-quality text-to-speech conversion. It supports a range of features, including model fine-tuning, voice cloning, and multi-language processing, and can be operated easily through a user-friendly web interface. GPT-SoVITS is particularly well suited to generating natural, fluent speech and is widely used in gaming, film and TV dubbing, and voice assistants.
6. Concluding remarks
That is all for this post. If you have any questions while studying this topic, feel free to join the discussion group or email me, and I will do my best to answer; let's learn together!
The blogger has also published a new book, Deeper Understanding of Hive, alongside Kafka Is Not Hard to Learn and Hadoop Big Data Mining: From Introduction to Advanced Practice, which can be read together with it. If you like them, you can purchase the books through the buy links on the bulletin board; thank you for your support. Follow the public account below and follow the prompts to receive free instructional videos for the books.