Dear Fellow Scholars, this is Two Minute Papers with Dr. Károly Zsolnai-Fehér.
Today I will show you a research paper that I can hardly believe exists. It is an amazing voice cloning work from Microsoft Research.
What does that mean? Well, voice cloning means that an AI listens to us speaking, and then, we write a piece of text, and it says it in our voice.
To see what that looks like, this is a previous work from NVIDIA that was able to do that. Let’s listen to Jamil recording these voice snippets. Okay, so how much of this do we need to train this earlier AI? Well, not the entire life recordings of the test subject, but much less, only 30 minutes of these voice samples. The technique asks us to say these sentences and analyzes the timbre, prosody and the rhythm of our voice, which is quite a task. And, what can it do afterwards?
Well, it creates an AI Jamil that can now say a scholarly message for you that I wrote.
Now, hold on to your papers, because Microsoft has a cracking new paper with an AI they call VALL-E that is able to do the same, but not in 30 minutes. Not even in 3 minutes. It can do it in 3 seconds. Yes, that’s right, all it needs is a three-second snippet of our voice, and it promises that it can clone it from that. And it gets crazier. Much crazier, I will show you how in a moment.
First, in goes a 3-second voice sample of us saying something. This does not need to match the text prompt here. This is going to be used for the learning. Now that it has learned this voice, here is what a previous technique could do in terms of cloning. And here is the new method.
Wow. The phrasing and timing are so much better. But actually, how good is it? How do we even know if this is really better? Well, easy, we can ask this person to read this prompt in their voice, hide it from the AI, and compare side by side. Listen. AI goes first. Real person second.
This is absolutely incredible. This truly feels like we are seeing history in the making.
Here is another example. A speaker prompt for learning. And the new technique. I love this, it even has a little personality in there. And the true human voice. I have to say this is even better, but the AI was also out of this world compared to what we were able to do before.
Here is another one and then, we proceed to my favorite part of the paper, three advanced features. So good.
First, variety. This new technique can generate several variants of speech for the same prompt, so we can listen to them and choose the one that we like best. Listen how it puts the emphasis on different words. Loving it.
Second, get this, it can also listen to the 3-second example of our voice, and preserve its emotions too. Here is an angry one. And here is a sleepy one.
I really don’t know what to say. This is absolutely incredible.
And third, it can also maintain not only the emotions, but the ambiance and acoustic environment our sample was recorded in. My favorite is this sample which sounds like an old crackly phone conversation. Like this. So, can it clone this kind of sample too? Let’s listen together. My goodness, I did not think that this was possible at all, let alone from a tiny tiny 3-second example.
And just imagine the possibilities. This could bring back people who are not with us anymore and have them read us books and bedtime stories. How amazing is that? Maybe we will have the incredible Isaac Asimov himself reading his own robot books to us soon. What a time to be alive!
And remember, NVIDIA’s previous technique needed 30 minutes for this, and now just one more paper down the line, and 600 times less information is enough to create voice samples of this quality. 600 times less! And just imagine what we will be able to do two more papers down the line. My goodness.
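For the Fellow Scholars who like to check the numbers, the "600 times less" claim follows directly from the two requirements mentioned in the video, 30 minutes of audio before versus 3 seconds now:

```python
# Sanity check on the "600 times less" claim from the narration:
# the earlier technique needed 30 minutes of audio, VALL-E needs 3 seconds.
previous_requirement_s = 30 * 60   # 30 minutes, in seconds
new_requirement_s = 3              # VALL-E's 3-second sample
ratio = previous_requirement_s / new_requirement_s
print(ratio)  # 600.0
```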
Now, of course, this is a great research paper, so what does that mean? Of course, that means that it has a thorough evaluation section. So let’s pop the hood and have a look. Whoa. That is really nice! I’ll try to explain this. We first compare against previous techniques in terms of the word error rate metric. The two variants of the new technique come out the absolute best. However, correctness by itself is not enough, we also wish to know if the new samples are not only correct, but also similar to the speaker in the input sample. So, is it? Wow. The new technique comes out the absolute best on the two things at the same time. However, wait. These two are some ancient techniques from long-long ago, right? Nope. YourTTS is from the same year as this paper. And so is AudioLM.
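For those curious about the word error rate metric mentioned above: it is the standard speech-recognition score, the word-level edit distance between a transcript of the generated audio and the reference text, divided by the number of words in the reference. Here is a minimal sketch of that definition (the example sentences are made up for illustration, not taken from the paper):

```python
# Word error rate (WER): (substitutions + insertions + deletions) / reference words,
# computed as a word-level Levenshtein distance via dynamic programming.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[-1][-1] / len(ref)

# One substituted word out of five: WER = 0.2
print(wer("hold on to your papers", "hold on to your gradients"))  # 0.2
```

A lower WER means the cloned speech was transcribed more accurately, which is why the two variants of the new technique coming out lowest is the good outcome here.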
This is so much progress in research in so little time. Now that truly feels like history in the making. So, welcome to Two Minute Papers, land of the Fellow Scholars where we look at tables and make happy noises.
So, what do you think? Who else would you like to have read to you? Morgan Freeman all the things? Or, maybe Károly all the things? Let me know in the comments below! And if you wish to see more papers like this, please consider subscribing and even better, hitting the bell icon.
Thanks for watching and for your generous support, and I’ll see you next time!