Wouldn't it be cool to have an AI that listens to you speak for a few seconds and can then say different things in your voice? Well, it exists, and before getting into the details, I'll show you exactly what it can do. Consider the original audio clip. This is how the speaker sounds. "The regional newspapers have outperformed the national titles." After listening to a few seconds of audio like this from a speaker, the AI can take some text input and generate new audio saying that text. Like this. "The large items are put into containers for disposal." So you can make an AI say anything you want, and in your voice. That's pretty slick, right? I'm AJ Halthor, and in this video, we're going to take a look at how exactly such a neural voice cloning system works. Stay notified about my videos by clicking that subscribe button and hitting that bell icon. Now let's get to it.

Last month, researchers at Baidu's Silicon Valley AI Lab developed a neural voice cloning system. This system requires only a few audio samples from a single speaker to generate speech in that speaker's voice. I stress the phrase "a few samples" because until now, neural networks have required copious amounts of data to train themselves to perform any kind of task. They need such large amounts of training data because of the thousands or even millions of parameters they have to estimate. The paper, Neural Voice Cloning with a Few Samples, proposes two different methods for voice cloning. The first is speaker adaptation. This involves tweaking or fine-tuning a pre-trained model, making it adapt to, or cater to, the current speaker. The second method is speaker encoding. There are no pre-trained models here; we train two models, the generative model and the speaker encoder model, simultaneously.

Before getting into the details of these processes, I'm going to have to explain a few concepts so that we're all on the same page. Let's start with something easy: voice cloning. Voice cloning involves reproducing the voice of an unseen speaker. "It's a beautiful day, isn't it?" "I'm going to conquer the world whether you like it or not." Okay. We perform voice cloning with only a few samples, and this is considered few-shot generative modeling of speech. In other words, we only require a few samples of speech in order to clone it. Few-shot generative modeling is challenging because it requires learning speaker characteristics from just a limited amount of data.

Let's now take a look at another term: generative models. These are distributions that can be sampled from, and such samples correspond to real data. For example, consider a generative model that models animal images. Sampling from it, I should be able to get the image of, say, a dog. The next time I sample from it, I may get the image of a cat. For the current problem of voice cloning, the generative model models speech. So sampling from this model gives some speech audio by some speaker, and every time I sample from it, I can get audio from a different speaker. Now that you know that generative models generate data, I think you can guess what needs to be done to create a voice cloning AI. We need to train a generative model so that we can sample speech audio from it. This model is trained on multiple speakers with multiple accents. If you want the details, this paper uses the LibriSpeech dataset, which consists of about 2,500 speakers and 820 hours of data.
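To make that concrete, here is a minimal PyTorch-style sketch of what "a generative model conditioned on text and speaker" means as an interface. All names and layer sizes here are my own illustrative choices, and the placeholder "decoder" is nothing like the paper's actual synthesizer; the point is only that the same text plus a different speaker embedding yields different audio.

```python
import torch
import torch.nn as nn

class MultiSpeakerGenerator(nn.Module):
    """Toy stand-in for the multi-speaker generative model f.

    Illustrates only the conditioning interface: output audio features
    depend on the input text *and* a learned per-speaker embedding.
    """

    def __init__(self, vocab_size=1000, num_speakers=2500, embed_dim=128, n_mels=80):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, embed_dim)
        # One learnable embedding per training speaker (the e_si below).
        self.speaker_embed = nn.Embedding(num_speakers, embed_dim)
        self.decode = nn.Linear(2 * embed_dim, n_mels)  # placeholder "decoder"

    def forward(self, text_tokens, speaker_id):
        t = self.text_embed(text_tokens)                        # (seq, embed_dim)
        e = self.speaker_embed(speaker_id).expand(t.shape[0], -1)
        return self.decode(torch.cat([t, e], dim=-1))           # (seq, n_mels)

# Same text, two speaker identities -> two different "voices".
gen = MultiSpeakerGenerator()
text = torch.tensor([5, 42, 7])
mel_a = gen(text, torch.tensor([0]))
mel_b = gen(text, torch.tensor([1]))
```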
To train our model, you might assume that text-audio pairs are enough. However, speakers with different accents say the same sentence in different ways, so multiple spectrograms will be mapped to the same text. This leads to a less accurate generative model. When the model is additionally given information about the speaker, such as dialect, accent, or gender, it is able to model the differences between the speech samples and hence improve performance. This information about the speaker is called a speaker embedding. And so, to train a generative model, instead of just text-audio pairs, we require triplets of text, audio, and speaker embedding for every sample. Formally defined, speaker embeddings are low-dimensional continuous representations of speaker characteristics.

Now that we've got the basic terms out of the way, let's talk about designing an objective function. Like I mentioned before, we train our generative model to generate audio. Call our generative model f. In such parametric models, training refers to learning the parameters of the model; let's call these parameters theta. Remember, we also want to learn an embedding that captures speaker characteristics like pitch, accent, and dialect. Learning is equivalent to estimating a set of parameters, so let's call the speaker embedding parameters for speaker s_i "e_si". What exactly do we give the trained model to produce an output? We provide two things. The first is the text to say; here, t_ij is the j-th text spoken by speaker s_i. The second is the identity of the speaker, s_i, which lets us pick out the parameters of that speaker's embedding. Sampling from f, we get some cloned audio of speaker s_i saying the text t_ij. For training, we have a dataset T_si for every speaker s_i, consisting of some text t and the corresponding actual audio a_ij of the speaker saying that text. The idea is thus to minimize the divergence between the cloned audio sampled from f and the actual audio in the dataset for the same speaker. That is the loss for just one sample from a single speaker, so we take the expected value of this loss over all speakers s_i and over all samples in that speaker's dataset T_si. This is done to learn the model parameters theta and the speaker embeddings e, and it gives us the general objective function; I'll write it out in full in a moment.

So here's a question: why do we take the expected value of the loss instead of computing the loss directly? Because we don't know how tractable, that is, how easy to compute, the loss function is. It can be, and usually is, a complex function, so it becomes more feasible to determine an approximate value of the loss, and in math, this approximation is given by the expected value.

Let us now take a look at the two methods for actually computing this loss and hence performing voice cloning. We start with speaker adaptation. Here's the idea: we have a pre-trained audio generator, and we just need to fine-tune it to produce the voice of some unseen speaker given some text. Within speaker adaptation, there are two approaches to fine-tuning. The first is embedding-only adaptation, and the second is whole-model adaptation. In the embedding-only approach, the only thing we need to do is further train the embedding to cater to the new speaker; we don't touch the speech generative model at all. So the new loss function can be obtained from the general one we derived: since the generative model is pre-trained, there is no theta estimation.
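Written out, the objectives we just built up look like this. This is my reconstruction from the prose, using the video's symbols: f is the generative model, theta its parameters, e_si the embedding of speaker s_i, and T_si that speaker's dataset of text-audio pairs. The second objective is the embedding-only adaptation we just arrived at, where the hat marks theta as fixed.

```latex
% General multi-speaker training objective:
\min_{\theta,\; e} \;
\mathbb{E}_{s_i \sim S} \,
\mathbb{E}_{(t_{i,j},\, a_{i,j}) \sim T_{s_i}}
\Big[ L\big( f(t_{i,j};\, \theta,\, e_{s_i}),\; a_{i,j} \big) \Big]

% Embedding-only adaptation for a new speaker s_k (theta is frozen):
\min_{e_{s_k}} \;
\mathbb{E}_{(t,\, a) \sim T_{s_k}}
\Big[ L\big( f(t;\, \hat{\theta},\, e_{s_k}),\; a \big) \Big]
```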
We only need some text and the corresponding audio samples spoken by the current speaker, s_k. Note that s_k is an unseen speaker that the speech generative model f hasn't encountered before. I put a hat on theta to indicate that it's fixed here. Since the embedding doesn't have nearly as many parameters as the speech model, we don't require the new speaker to talk very much; we simply don't need that much data to model his or her voice.

Let's take a look at some results of this approach. First, here's the original sample voice. "We also need a small plastic snake and a big toy frog for the kids." Now, using embedding-only adaptation, here's the synthesized voice. "Learn about setting up wireless network configuration." You can tell the voice is similar to the original speaker's. Let's try something similar, but with a male voice this time. Here's the original speech. "Some have accepted it as a miracle without physical explanation." And here is the synthesized voice using embedding-only adaptation. "Feedback must be timely and accurate throughout the project." Not bad, right? The voices are nearly the same.

Now let's take a look at the other speaker adaptation approach I mentioned, and that's whole-model adaptation. We again have a pre-trained model, but not only do we fine-tune the speaker embedding as in the embedding-only approach, we also fine-tune the generative model f itself. I'm certain you can imagine the cost function to minimize. If you can't, well, that's why I'm here. The cost for the embedding-only approach is given by the equation above. But now f is also being tweaked, so get rid of that hat over theta, as it is no longer fixed; we are estimating both the embedding and theta in the process. And that's it.

Let's take a look at this in action. Here's the original voice. "Ask her to bring these things with her from the store." And here's a synthesized voice when using whole-model adaptation. "Prosecutors have opened a massive investigation into allegations of fixing games and illegal betting." They sound pretty similar, right? Now we do the same for the male voice. Here's the original. "The Greeks used to imagine that it was a sign from the gods to foretell war or heavy rain." And here's the synthesized voice. "Instead of fixing it, they gave it a nickname." Comparing the two methods, we see that whole-model adaptation has more degrees of freedom, and hence more flexibility. However, it can easily overfit when given very little speaker data. So there's always a trade-off.
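To make the difference between the two fine-tuning modes concrete, here is a minimal PyTorch-style sketch. The function and variable names are mine, not the paper's; the point is just which parameters each mode updates.

```python
import torch

def make_adaptation_optimizer(model, e_sk, whole_model):
    """Choose which parameters speaker adaptation will fine-tune.

    model : the pre-trained multi-speaker generative model (theta)
    e_sk  : a fresh embedding for the new, unseen speaker s_k
    """
    if whole_model:
        # Whole-model adaptation: theta and e_sk are both updated.
        params = list(model.parameters()) + [e_sk]
    else:
        # Embedding-only adaptation: freeze theta (the hat over theta)
        # and learn only the new speaker's embedding.
        for p in model.parameters():
            p.requires_grad_(False)
        params = [e_sk]
    return torch.optim.Adam(params, lr=1e-4)

# Usage sketch: start the new speaker's embedding from scratch.
e_sk = torch.zeros(128, requires_grad=True)
# optimizer = make_adaptation_optimizer(pretrained_model, e_sk, whole_model=False)
```

Freezing theta is exactly what keeps embedding-only adaptation data-efficient: with only the embedding left to learn, a handful of cloning samples is enough, while the whole-model path has far more parameters to update and thus more room to overfit.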
Until now, we have just looked at the voice cloning step using speaker adaptation, and that's actually just one phase of the entire process, so now let's look at how we actually clone a voice from the first step to the last. In the training phase, the speaker identity is mapped to some embedding, and the text-audio-embedding triplet is fed to the model for training. Initially, both the model parameters theta and the speaker embeddings e are initialized randomly; with supervised training samples, these parameters are gradually learned. After this phase, we have trained the multi-speaker model and the multi-speaker embeddings. Phase two is cloning, which we discussed with speaker adaptation: this involves fine-tuning either only the embedding, or both the embedding and the model, using the cloning samples. These cloning samples are collected by sampling the speaker's voice. After phase two, we have two things. The first is the multi-speaker model: if whole-model adaptation was used, it has also been catered to the specific speaker, and if we used embedding-only adaptation, it's just the same as the output of the first phase. The second is the speaker embedding, which is now catered to the current speaker. The third phase is audio generation. Given an input piece of text, the generative model is able to synthesize speech in the voice of the specific speaker, with the help of the embedding, of course.

On to the next approach: speaker encoding. This method doesn't really involve fine-tuning any model or embedding per se. The speaker encoding function g takes a specific speaker's speech as input, that's A_sk, and outputs the corresponding speaker embedding, e_sk. Here, A_sk is the set of audio samples taken from the current speaker, that is, the voice to be cloned; this is represented as "cloned audios" in the figure. Let us now try to determine the loss function for the speaker encoding approach. We have the original loss function, but now we have a speaker encoder g to generate the speaker embeddings, so we just substitute it in place of e. We are thus able to train the generative model and the speaker encoder simultaneously. However, in practice, there are problems with training these generative models from scratch.

The first is the missing modes problem, or mode collapse: without enough training data, when sampling generative models, we may not be able to sample all classes very well. To give a concrete example, say you trained a generative model to output animals by showing it images of dogs, cats, and giraffes, but you don't have enough giraffe images. Then every time you sample the generator, you only end up with dog or cat images, and you can never sample giraffe images. One way to solve this problem is to get more training data, but if we did that, then what would be the point of this paper? We are trying to perform neural voice cloning with only a few samples, right? That was the objective.

So the idea is to use a pre-trained generative model. That way, we already have the model parameters, including the speaker embeddings learned for the multi-speaker model, and only the speaker encoder is trained from scratch to produce embeddings for custom voices. To train it, we first sample some speech from our pre-trained multi-speaker model f, which generates an audio sample of speaker s_i. This cloned audio sample is then used as the input to the speaker encoder, labeled as "cloned audios". Since we have the speaker embedding for this speaker from the pre-trained multi-speaker model, we keep it fixed, and hence it's indicated in blue. We compare this embedding to the one generated by the encoder and modify the encoder's parameters, theta_encoder. In other words, the speaker encoder is trained. So what is a good objective to minimize for this encoder training? A simple L1 loss seems to work best. Once again, the hats indicate fixed values: e_si with a hat indicates the speaker embedding created for the speaker by the pre-trained multi-speaker model, and g gives the speaker embedding predicted by the current speaker encoder. Eventually, the speaker encoder can generate appropriate speaker embeddings catered to the individual.
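Here is a minimal PyTorch-style sketch of that encoder training step. Names are hypothetical; the encoder's architecture is covered in a moment.

```python
import torch
import torch.nn.functional as F

def encoder_training_step(speaker_encoder, optimizer, cloned_audio, e_si_hat):
    """One update of the speaker encoder g (parameters theta_encoder).

    cloned_audio : audio sampled from the pre-trained multi-speaker
                   model for speaker s_i (the "cloned audios")
    e_si_hat     : that speaker's embedding from the pre-trained model,
                   kept fixed (the hatted quantity shown in blue)
    """
    optimizer.zero_grad()
    e_pred = speaker_encoder(cloned_audio)       # g(A_si; theta_encoder)
    loss = F.l1_loss(e_pred, e_si_hat.detach())  # the simple L1 loss
    loss.backward()                              # gradients flow only into g
    optimizer.step()
    return loss.item()
```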
Now, how do we synthesize speech with this approach? First, text and some cloning audio samples are input to the system. The speaker encoder turns the cloning samples into a speaker embedding, this embedding is passed to the generative model along with the text, the audio corresponding to these inputs is sampled, and we get the required audio.

Now that we've talked about the speaker encoder's role in voice cloning, it's on to the next topic: what exactly is the speaker encoder? Like, what does it consist of? Audio waves are converted into mel spectrograms. These are passed into a prenet, which consists of fully connected (FC) layers with an ELU activation; this is just for feature transformation. Next, the transformed features are passed through convolution blocks to extract temporal features. These conv blocks have residual connections, allowing deeper networks. Global average pooling then summarizes each utterance. If you want to know exactly how residual connections work, along with various other convolutional network architectures, check out the card linked above or the description down below. Different audio samples carry different amounts of information; some are valuable, others less so. A self-attention mechanism is used to determine weights for the audio samples and produce an aggregated embedding. This is kind of like soft attention, where we focus on the important parts. For more information on attention mechanisms and their types, I have a video for that too. The output is the predicted speaker embedding for the audio.
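The description above maps onto something like the following much-simplified sketch: one conv block instead of a stack, a single attention score per cloning sample instead of full multi-head self-attention, and made-up layer sizes. It's meant to show the data flow, not to reproduce the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Simplified speaker encoder: mel spectrograms -> prenet (FC + ELU)
    -> residual conv block -> pooling over time -> attention over the
    cloning samples -> one speaker embedding."""

    def __init__(self, n_mels=80, hidden=128, embed_dim=128):
        super().__init__()
        # Prenet: FC layers with ELU activation, for feature transformation.
        self.prenet = nn.Sequential(
            nn.Linear(n_mels, hidden), nn.ELU(),
            nn.Linear(hidden, hidden), nn.ELU(),
        )
        # Conv block with a residual connection, for temporal features.
        self.conv = nn.Conv1d(hidden, hidden, kernel_size=5, padding=2)
        # Scores each cloning sample for attention-weighted aggregation.
        self.attn_score = nn.Linear(hidden, 1)
        self.out = nn.Linear(hidden, embed_dim)

    def forward(self, mels):
        # mels: (n_samples, time, n_mels) -- a few cloning clips.
        h = self.prenet(mels)                                   # (n, t, hidden)
        c = torch.relu(self.conv(h.transpose(1, 2))).transpose(1, 2)
        h = h + c                             # residual connection
        h = h.mean(dim=1)                     # global average pooling over time
        w = torch.softmax(self.attn_score(h), dim=0)  # weight each sample
        return self.out((w * h).sum(dim=0))   # aggregated speaker embedding

# Usage: five 120-frame cloning clips -> one 128-dim speaker embedding.
emb = SpeakerEncoder()(torch.randn(5, 120, 80))
```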
Let's take a look at some results. Here's the original speaker's voice. "They had four children together." Now, after training a generative model and performing voice cloning using speaker encoding, here is a generated sample in the same voice. "Churches should not encourage it or make it look harmless." Let's listen to similar results for a male voice. First is the original speaker's voice. "It was even worse than at home." And here's the cloned voice using speaker encoding, saying something else. "Different telescope designs perform differently and have different strengths and weaknesses." That's pretty cool, if you ask me.

In this video, we took a look at a paper released by Baidu on neural voice cloning with a few samples. The idea is to clone an unseen speaker's voice with only a few sound clips. The entire speech synthesis process involves three steps. The first is training the multi-speaker generative model and speaker embeddings. The second is voice cloning. And the third is synthesizing voice given text. Phase two, that is, voice cloning, is carried out using one of two approaches. The first we discussed is speaker adaptation, where we either fine-tune only the embedding, or fine-tune both the generative model and the embedding, to cater to the speaker. The second approach was speaker encoding, where we trained a speaker encoder to accurately model speaker embeddings. I encourage you to read the paper yourself for the extra details; the link to it is down in the description below the video. I understand that many people are deterred by the complex math in these papers. However, I hope that my video helps make the paper more accessible and bridges the gap between the complex math and the concepts. There are fascinating works published every week on this topic, and I'm here to make them more accessible.

If you liked this content, hit that like button. If you want to watch similar content, hit that subscribe button and hit the bell icon too. I'm trying out a new setup with the camera and the microphone, so let me know how you like it in the comments down below. The links to the gear will also be in the description down below, so if you want to get yourself your own camera or your own microphone, it's all there. Still not satisfied? Click or tap one of the videos right there and it'll take you to another awesome video. And I will see you in the next one. Bye.