So I'm Jakub, I came from Poland, and I want to talk to you about WhisperSpeech: how we scaled its training to 80,000-plus hours of speech, and the things we learned along the way to actually make this work. This is a talk about the technical details, not so much about the applications. But let's start by giving you a feeling for what WhisperSpeech is. I just lost my notes, but I should be fine.

So, hi Whis.

"Hello, Jakub. I am WhisperSpeech, an open source text-to-speech model trained by Collabora and LAION on the JUWELS supercomputer."

So, hi, would you like to tell us more about what makes you special?

"Sure. I am a fully open source model trained on properly licensed speech data. This means anybody can safely use WhisperSpeech for commercial applications."

So that also means it changes the game for deployment of this model, right?

"Great point, Jakub. My weights and code are available under the MIT license. This means they can be adapted for even the most difficult deployment scenarios. Collabora can help people with that, but there is no vendor lock-in. You could do it internally or hire other AI specialists to do it for you."

Okay, thank you. So, as you may have noticed, this was the output of our model, and I think it sounds pretty good for an open source text-to-speech model. I think it's on par with what OpenAI offers, or other leading vendors in the market like Microsoft and Google. This was...

"Great point, Jakub. My weights and code are available under the MIT license. This means they can be adapted for even the most difficult deployment scenarios." I cannot skip this. "Collabora can help people with that, but there is no vendor lock-in. You could do it internally or hire other AI specialists to do it for you."

Okay. So, this was a scripted demo, as you may have noticed, but in fact we've built a solution around it. We have another project called WhisperLive that does live speech-to-text transcription; we put a Mistral LLM in the middle, and then we synthesize the result as speech again through WhisperSpeech. This was actually a proprietary model built for a customer, but we will release an open demo on GitHub that you can play with as well.

So, to put this work into a bit of context: the genesis of WhisperSpeech, how it came to be. I think in early March, Google released a paper on SPEAR-TTS: "Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision". We noticed that it was a really simple model, basically two transformer models attached end-to-end, and the results sounded really good, although we only got the cherry-picked examples they shared. The model was closed source. So we decided to try to recreate it and build an open source model that you could actually use. We had no previous experience with text-to-speech, with speech processing, or even with training big transformer models; this was a new thing for us, we mostly worked on computer vision before. So this is our journey, and I want to share it with you. We've come a pretty long way. Here is a short sample from late April, about a month and a half into the project.
"We choose to go to the moon in this decade and do the other things, not because they core easy, but because they are hard, because they'll will serve to organize..."

So, you get the idea. The quality has improved quite a lot since then; it made a lot of mistakes and garbled words back then. So we've come a long way.

One word about my sponsor, which is Collabora. I work on this at Collabora, and we are an open source consultancy. We do a lot of work with embedded Linux, but we also do work with AI. So if you need help developing custom AI models, or you need to integrate an AI model into an embedded Linux device, we can help you with that.

I said we were inspired by SPEAR-TTS, but in fact we didn't copy it exactly. Instead, we built on top of existing open source models. First of all, Whisper: that's why it's called WhisperSpeech, because we build on top of the Whisper speech-to-text model. We also use EnCodec from Meta, which is a neural audio codec, and the Vocos vocoder from Gemelo AI. And we read a lot of papers and implemented ideas from them, some of which I will mention later.

The rough overview of how WhisperSpeech works: we start with text, because it's a text-to-speech model. The first transformer model does the reading part; it converts the text into a learned phonetic representation. This first part basically handles prosody and emotion, because it converts variable-speed text (text doesn't tell you how fast you should read it) into a representation that is fixed in time, where the emotion and the tempo are encoded. Next, another model takes this phonetic representation, which is not specific to any speaker, it's very generic, adds a speaker embedding, and converts it into actual sound, which is still compressed speech; this is where EnCodec comes in. The EnCodec model gives us this compressed representation, and then we apply the vocoder to generate the actual audio waveform.

And these phonetic representations I mentioned, where do they come from? They actually come from Whisper. Whisper is a speech-to-text model: the input is a waveform, the output is text tokens, and the ratio between them is variable; you cannot predict how many characters you will get out of 30 seconds of speech. Whisper is built from two sub-models, an encoder and a decoder. The decoder outputs the text, composed of tokenized words, and the encoder processes the speech waveform and converts it into a form that is more useful for understanding what was said in the speech sample. We train a quantizer in the middle of it: we limit the amount of information that can pass from the encoder to the decoder, to force Whisper to come up with a representation that focuses only on the required phonetic information. This is how we arrive at the semantic tokens. A lot of people use HuBERT or wav2vec for this; those are the most popular models for extracting semantic tokens, because we are not the only model on the market with this rough architecture. But we are different in that we use Whisper for it. We get 25 tokens per second, which is what's required to get good performance on the speech-to-text and text-to-speech applications.
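To make that quantized bottleneck a bit more concrete, here is a toy PyTorch sketch of the general idea of snapping continuous encoder frames to a discrete codebook; the class, sizes, and training details are made up for illustration and this is not the actual WhisperSpeech quantizer.

```python
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    """Toy vector-quantization bottleneck: every continuous encoder frame is
    snapped to its nearest codebook vector, and the codebook index becomes
    the discrete "semantic token" for that frame."""

    def __init__(self, codebook_size=512, dim=384):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, frames):                        # frames: (batch, time, dim)
        b, t, d = frames.shape
        dists = torch.cdist(frames.reshape(b * t, d), self.codebook.weight)
        tokens = dists.argmin(dim=-1).reshape(b, t)   # discrete token ids
        quantized = self.codebook(tokens)             # back to vectors
        # straight-through estimator so gradients still reach the encoder
        quantized = frames + (quantized - frames).detach()
        return tokens, quantized

# At 25 tokens per second, a 30 s window gives 750 semantic tokens.
vq = VQBottleneck()
tokens, quantized = vq(torch.randn(2, 750, 384))
```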
So, our plan for how we were going to build it: we take a lot of speech samples. We have Whisper, so we don't need transcripts, which is very helpful, because there are not a lot of datasets that contain a lot of speech and a lot of transcriptions at the same time. We run the Whisper speech recognition model to get the text for each sample, and then we train a model to convert the text back into speech. That was the initial plan. What could go wrong? It turns out that everything could go wrong, because Murphy's law and all that.

So, let's start with the speech waveforms. What's the difficulty here? There is a dataset called Libri-Light that contains 60,000 hours of speech, released by Meta (Facebook AI Research back then, I think). The speech is in the public domain, which makes it very attractive for commercially usable models. The problem is that 33,000 hours of this speech is in a single 3.6 terabyte tar file. 3.6 terabytes may seem reasonable, but with a 200 megabyte per second connection it takes five hours just to download it from the internet, assuming you can saturate the 200 megabyte connection, which is not that easy. If you want to unpack it, you need a hard drive at least twice that size. It took me a few days to wrangle this somehow, and of course I made some mistakes on the first try, so I had to try again. It's really not easy to work with a dataset of this size.

There is another issue: during training, you probably want to randomly sample shorter segments of your data. The 3.6 terabyte archive has about 220,000 files in it, where each file is basically one chapter of an audiobook, so you get 16 megabytes per chapter on average. If you want to pull a random sample from a file like this, you have to read on average eight megabytes off the disk just to get to the point you want, take 30 seconds of audio, and throw away everything else. If you do the math, this gives you three iterations per second of training, assuming you can saturate a fast SSD at six gigabytes per second of read speed. In practice, what I saw was more like one gigabyte per second, because 220,000 files is a lot and the filesystem overhead kicks in, so I got about half an iteration per second with this simple way of loading the data. We had to fix this.

Fortunately, there is a project I highly recommend for handling big data for AI training, called WebDataset, from the GitHub user tmbdev. Instead of working with 220,000 individual files, or with a single unwieldy 3.6 terabyte file, we split the data into 623 tar shards of five gigabytes each. That's a lot easier to work with: you can copy shards from place to place, or download a single shard and look inside. It's very nice to work with. And then, during training, instead of loading individual samples at random, we open eight of these shards, read them sequentially, and randomize the ordering of the samples in an in-memory buffer, roughly the pipeline sketched below.
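Here is a minimal sketch of what such a loading pipeline can look like with the webdataset library; the shard names and the sample keys are assumptions for illustration, not the project's actual layout.

```python
import torch
import webdataset as wds

# Hypothetical shard names: Libri-Light repacked into ~5 GB tar shards.
shards = "librilight-large-{000000..000622}.tar"

dataset = (
    wds.WebDataset(shards, shardshuffle=True)  # visit shards in random order
    .shuffle(1000)                 # in-memory buffer that reorders samples
    .decode(wds.torch_audio)       # decode flac/wav entries into tensors
    .to_tuple("flac", "json")      # yield (waveform, metadata) pairs
)

# WebDataset is an IterableDataset; cropping random 30 s windows and batching
# would happen in further pipeline stages, this just streams raw samples.
loader = torch.utils.data.DataLoader(dataset, batch_size=None, num_workers=4)
```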
This makes it a lot faster to train the model, and if you do the math, you actually get 300 iterations per second, which is probably faster than what the GPUs can handle during training. So this removes the data loading bottleneck. The interesting thing is that you may not notice this problem until your dataset is bigger than your RAM, because if it's smaller, the caching in your operating system will fix it for you: it will pretend you're loading files from disk, but in fact everything is coming from RAM. Then, once you cross the 500 gigabyte threshold of your server's RAM, performance drops basically to zero, because suddenly you have to reload everything from disk every time. And you may not even know it, because it's not obvious why your model is not training as fast as it could; you have to actually dig into it and figure it out. So, that's the first problem.

Then we arrive at Whisper speech recognition, which should be an easy thing, right? Whisper is a great GPU model, it should be fast. Out of the box, Whisper medium is 50 times faster than real time in our tests. That sounds fast, but for 60,000 hours that's 1,200 GPU hours to transcribe the whole dataset, which is 50 days. That's not going to fly; I'm not going to wait more than a month for my transcriptions. So, to make it faster: we start at 50 times real time. We can process the audio in batches, the same as with LLMs; if you process 16 streams of audio simultaneously, you arrive at 256 times real time. That's a big improvement. You can switch to a faster implementation, in this case faster-whisper: compared to the original PyTorch implementation it's more optimized and lowers the overhead of scheduling CUDA calls. It's still the same model, it just makes the CUDA calls more efficiently. Now we are at 341 times real time. The last thing we can do, for English, is to switch to the small.en model, because for English the performance of the various Whisper model sizes doesn't differ that much; smaller models are pretty good for English. For other languages this is more problematic, but you won't have as much data there anyway, so it's less of a problem. So we arrive at 768 times faster than real time, which is 78 GPU hours. That's still kind of long. But again, WebDataset: remember we divided the dataset into shards. Each five gigabyte shard is roughly 100 hours of speech and transcribes in about 13 minutes, so we can distribute the shards across multiple GPUs and process the whole 60,000 hour dataset in less than an hour. It seems trivial, but if you do this, you will make mistakes and you will have to redo it multiple times; there's no way around it. So the faster it goes, the quicker you can fix your mistakes or improve your data generation pipeline, and the better the results you get. And this works not just for transcription: voice activity detection, EnCodec encoding, basically every kind of preprocessing improves tremendously if you split the data into shards and can do the processing in parallel, roughly like the sketch below.
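As an illustration, a per-shard transcription worker could look roughly like this. This is a hedged sketch using faster-whisper and webdataset with made-up shard names and sample keys; it transcribes one stream at a time and omits the batching across 16 streams mentioned above.

```python
import sys
import webdataset as wds
from faster_whisper import WhisperModel

shard = sys.argv[1]   # e.g. "librilight-000017.tar" (hypothetical shard name)
model = WhisperModel("small.en", device="cuda", compute_type="float16")

with wds.TarWriter(shard.replace(".tar", "-txt.tar")) as out:
    samples = wds.WebDataset(shard).decode(wds.torch_audio).to_tuple("__key__", "flac")
    for key, (waveform, sample_rate) in samples:
        # faster-whisper expects 16 kHz mono float32; Libri-Light audio is already 16 kHz
        audio = waveform.mean(dim=0).numpy()
        segments, _info = model.transcribe(audio, language="en")
        text = " ".join(seg.text.strip() for seg in segments)
        out.write({"__key__": key, "txt": text})
```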
With that out of the picture, we get the text, right? And the text should be good: OpenAI's Whisper is a state-of-the-art speech recognition model, so it should work perfectly. Well, there are two issues with Whisper.

The first is the timestamps. Supposedly, Whisper gives you timestamps, so it localizes the transcript in the recording. In practice, unfortunately, they are off by at least a couple of seconds. So if you believe the timestamps and cut your audio files using them, you will cut in the middle of a sentence or in the middle of a word. That's not going to be great for further training if your transcript says one thing but your audio is shifted, missing the first word and maybe containing extra words at the end that shouldn't be there. It's not going to work.

What was even more puzzling is that large parts of the transcripts were simply missing for some reason. To give an example, what I heard when I listened to this Libri-Light audiobook was: "Chapter five of Things in Our Garden by Arthur Ransome. This is a LibriVox recording. This LibriVox recording is in the public domain." What Whisper heard is this. And it was consistent: every time I tried to transcribe this piece of speech, it gave me the same wrong answer. It took me a while to figure out, and I think it's actually quite interesting how this happened, because I'm pretty sure I've worked it out. It's very likely that OpenAI used Libri-Light, this same dataset, for training, because it's 60,000 hours of English speech, roughly 10% of their original training set. They wouldn't skip it. But how can you use it for training speech recognition if it's just speech? It doesn't have text, remember? So they took Project Gutenberg ebooks and force-aligned them, using a tool that tries to find the best alignment between the ebook and the speech. The problem is, the ebooks don't have these lines about LibriVox recordings. They don't have "this is chapter five" at the beginning of every chapter, because that's different, right? You don't put those kinds of sentences in a book. Also, at the end of each chapter, the LibriVox recording says "end of chapter five", which is nowhere to be found in the actual books. And nobody noticed. The tiny model is pretty much okay with this (it makes other mistakes), but the bigger models were effectively trained to pretend they never heard this text. You can see it very systematically if you try to transcribe these audiobooks: you will get these mistakes. The models hear it, and then they pretend they don't hear it, to satisfy the training regime they were trained under.

Fortunately, the solution is simple: we just throw away the first and the last 40 seconds of every chapter, because the errors are mostly there. There are other errors too. Footnotes, for example, are kind of tricky: when you read a book aloud, you read the footnotes in some way so people can understand the whole context, but there's no way a simple alignment tool can easily line that up with the text. Fortunately, footnotes are not that common, so they're not such a big issue.

So then we get to the training. We have the dataset, we have the speech, we have the text, we train the model, right? That should be simple. First, let me start with a tip: what's the most important thing you can focus on when you want to do something difficult for the first time in your life? You need to focus on iteration speed.
Here's a quote: "Whether you're doing a Kaggle competition or a project at work or whatever, the most important two steps are to create a rapid feedback loop where you can iterate and test fast, and to have a test which is highly correlated with the final thing you're going to be doing." That's Jeremy Howard on the Gradient Dissent podcast, and I think it's a very important point. In his case, he was talking about diffusion models for images: he was trying to figure out how to improve diffusion models, but he was testing on the Fashion-MNIST dataset, because he could train the whole model from beginning to end and generate images in five minutes, I believe, instead of five months as it would take for something like Stable Diffusion trained from scratch.

In our case, our target dataset is 60,000 hours with thousands of speakers, and we can train on 96 A100 GPUs, but even with that many GPUs it takes six hours to train the model. So the number of experiments I can run per day when I work on this is about two: one in the morning, then a lunch break, another after lunch, and I check the results in the evening, twelve hours later. That's not a lot, and it's going to affect my speed. So instead, I can take a thousand hours of speech from a single speaker only, to make it easier to learn, and train on a single GPU at home in 15 minutes. This means I can do 48 experiments per day. As you can imagine, this lets me test a lot more hypotheses, and if you're anything like me, most of your ideas are rubbish, so you need to test them out, and when you do, you will discover the good ones. That's what you should do.

So we did this. We ran hundreds of experiments, we had a good little model trained on a small dataset, so now we just add more data, add more layers, adapt and train, and profit, right? That's easy, deep learning is scalable. So we're going to go like this. We started with a small model: the validation loss went down, the accuracies went up, it looks really nice, right? Okay, maybe it's not sounding that great, you heard the example, but it works. So let's bump it up a little: this was a tiny model, let's train a small one. It takes eight times longer. Oh, it not only takes eight times longer, it doesn't work at all. What's wrong here? Maybe I will step back a bit and go through a base model first, instead of jumping from tiny to small immediately. Okay, this works, but it's barely any better than the old one, so that's not good, and it also trained for nine hours, right? That's a long training time for a result that's barely any better. So we tweak the hyperparameters, we try to improve things, and then we try the small model again, hoping it will be better. It kind of is better, but it's not the improvement we were hoping for.

This was my first experience, a few months ago, with trying to scale up the models. I was really sure that once I got the data sorted out, got the model sorted out, did all the experiments, and figured out the optimal architecture, I would just scale it up and it would work. So let's try to figure out what happened. In his original tweet, Sam Altman singled out a single person who had a big influence on the GPT-4 pre-training process.
It was Jakub Pachocki, and he actually co-wrote a paper with some Microsoft and OpenAI researchers. It has a strange title for a paper, but somebody on Twitter recommended it, and I think it's a really underappreciated paper. What this paper explains is basically what we observed in our Weights & Biases plots. In this example, I highlighted 512: this is the model width, the number of floats per token that the model computes, the size of the internal representation. We are in the middle of the green circle: we got to an optimal point and we have a pretty low loss. What happens when we try to change the model size to 1,024, keeping everything else the same? We change the model, and our loss goes through the roof; the model died, it didn't converge at all. This is what happens with normal training. If you train your models from scratch with the standard Hugging Face or PyTorch defaults (I'm not talking about fine-tuning, fine-tuning is a little bit different), this is what happens.

What this paper promises you, and it actually works, is that you can change your model size while all your other hyperparameters stay the same, and you will consistently get better performance. This is what you want, right? You want to be able to experiment on small models and then scale them up and get better results. And it actually works across a lot of parameters: if you increase the training length (train for longer), increase the sequence length (train on longer texts or longer fragments of speech), increase the batch size, or increase the number of GPUs you use for training, which is also something that used to work or not, all of these will also improve your performance, once you understand how this actually works, how you have to initialize your network, and how you have to configure the training process. It's all explained in the paper. It's about 35 pages, but it's a really readable paper. Skip the math, just go for the intuitive explanations, which they have, and you will understand how to train big models.

The interesting thing is, when I started this, I was reading the research papers from Meta and from Google. They all specify all the hyperparameters they used; they pretend to tell you exactly how to reproduce the research. That doesn't work. I'm not sure why: maybe they have internal tools that already adjust the defaults for them so it works, or they have some secret sauce that they don't want to share. One interesting thing: if you look at the papers from all the big players and search for "initialization", as in network initialization, there's nothing. They don't mention the word at all in the whole paper. Meanwhile, the OpenAI and Microsoft paper tells you that initialization is one of the most important things you have to adjust when you scale your models. Yet everybody pretends it doesn't matter. I don't believe them, really. So go read this paper if you want to train a model from scratch. It really saves you a lot of time and effort.
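For reference, the paper's authors also published a `mup` package that implements this recipe. Here is a minimal sketch of how it is meant to be wired up, assuming the package's basic API (MuReadout, set_base_shapes, MuAdam) as described in its README; the toy model and the numbers are made up for illustration.

```python
import torch
import torch.nn as nn
from mup import MuReadout, set_base_shapes, MuAdam
from mup import init as mup_init

class TinyLM(nn.Module):
    def __init__(self, width):
        super().__init__()
        self.embed = nn.Embedding(1000, width)
        self.hidden = nn.Linear(width, width)
        self.readout = MuReadout(width, 1000)   # muP-aware output layer

    def forward(self, x):
        return self.readout(torch.relu(self.hidden(self.embed(x))))

base = TinyLM(width=256)     # the width where you tuned your hyperparameters
delta = TinyLM(width=512)    # tells mup which dimensions grow with width
model = TinyLM(width=1024)   # the bigger model you actually want to train
set_base_shapes(model, base, delta=delta)

# width-aware (re-)initialization; a full setup would cover every layer type
for m in model.modules():
    if isinstance(m, nn.Linear):
        mup_init.normal_(m.weight, mean=0.0, std=0.02)

# the learning rate tuned on the narrow model now transfers to the wide one
optimizer = MuAdam(model.parameters(), lr=3e-4)
```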
So, it's all good, but you will still need around a hundred GPUs to train WhisperSpeech to an acceptable level of quality. Where do you get them? I went through a couple of sources of GPUs. I started with the smaller cloud providers, DataCrunch and Lambda Labs, the ones that offer GPUs at sensible prices. But they are still pricey and they are mostly overbooked. You just cannot get a GPU there even if you pay for it; there's no way you will get more than a single GPU out of these companies unless you pay for a year in advance, but that defeats the point. You can go to Vast.ai, which is an interesting platform where you can rent consumer GPUs from other people. They are cheaper. The problem is that the bandwidth is sometimes not that good, because it's residential internet, not data center quality, and there is no network drive, so every time you rent a new machine you have to download all your data, push it around, and upload your models and everything. There is also one option I would really recommend, which I learned about somewhere in the middle of the process: if you're doing open source research on open source models, you can talk to the LAION community. They have connections to the JUWELS supercomputer and they can get you access to the cluster, which has around 3,600 A100 GPUs that you can use for open source research. That's a great way to get access to GPUs if you're doing open source work.

Okay, so, closing up. All our research and all our results are available on the GitHub repository. We share the dataset preprocessing tools, the training scripts, everything basically. It's not just a model upload with inference code, it's all there. You can join the LAION Discord server; we talk about our work there, and if you want to help or talk about audio processing, I invite you to go there. The links are also on the GitHub page. I want to thank the Gauss Centre for Supercomputing; they are the people who set up the JUWELS cluster, and they were very important for getting this work finished. And these are the links for both the WhisperLive project, which is the real-time transcription project, and the WhisperSpeech project, which is our text-to-speech model. One thing that I didn't mention is that currently WhisperSpeech works with English and Polish, so it supports two languages. We would love to support more languages, but this is also something you could help with: you need to be able to understand the language you want to train on, to figure out whether the data is correct and whether you can train on it. We are looking for people to help with that, and to find open, properly licensed datasets for other languages as well. As you've seen in the initial demo, the quality is pretty good. We can get really good results, but we need some help. And that's it, thank you very much for your attention. Do you have any questions?

Do you have any plans to do next versions of WhisperSpeech, more training, better quality?

Yeah, definitely. We are actually training two lines of models. The first one, as I mentioned, is trained only on properly licensed data, public domain or Creative Commons licensed. We also work with LAION to create a larger dataset, because everybody else is training on everything they can get hold of, so we want to showcase that the model can actually do better with more data. This is something we're going to do. And we are also working on adding more languages; we already got requests from people to add Dutch, to add Lithuanian, to add other languages. We're talking with customers who could potentially sponsor this work as well, because it's a lot of work.
And so yeah, we are definitely planning to release more models. One other thing we are working on is better emotion support. We are working together with the LAION community: they are gathering and labeling a dataset of emotional speech. We are also working on architecture improvements based on the most recent research; for example, Meta released a paper, I think a week ago, about how to handle emotion in speech that we want to reproduce, because it's a really interesting approach. So we will continue working on this. And we are also working on improving the inference. Right now it runs in PyTorch and it's not very fast. We are looking into making it work in C++, for example with the GGML framework, and through a CTranslate2 implementation as well, which is still CUDA based but just faster than PyTorch. So we are looking into this as well. The basic model architecture is really simple, it's a transformer model, so basically all the tools that you can use to optimize transformer models and large language models are applicable to WhisperSpeech as well.

When you were training this large model, did you have to deal with infrastructure failures, either GPUs, hosts, or network interconnect issues?

We looked into this a little bit back when we didn't have the cluster access, and there is an interesting project called Hivemind that allows you to train on unreliable hardware, where your GPUs can come and go during the training and the whole run still proceeds in an orderly fashion. But fortunately we didn't have to implement this; we got access to the cluster, and the cluster is really reliable. It sometimes fails, but not often, and most of the time it fails because I made a mistake in the code, not because the hardware is unreliable. So no, we didn't have to deal with that. The training runs were about six hours; then I could resume from a checkpoint to train for longer, so I didn't have problems with hardware issues.

What kind of performance are you seeing from WhisperSpeech?

Right now I don't like this question, because I've spent maybe two or three hours optimizing the inference pipeline, so it's far from optimal, far from being the fastest thing; I was mostly focusing on getting the quality up, which I hope I succeeded at. Right now, on an RTX 4090 GPU, I can get something like four times real-time performance. One good thing I can say about performance is that we can do it with low latency. The model is trained on 30-second chunks of audio, but it can actually start outputting speech a lot sooner: you can get the first hundred milliseconds or so of speech from just the first words you want to say, and then you can keep adding words and generating more speech. So we can support streaming applications. The thing I showed at the beginning works like this: when a language model starts streaming tokens out, you can start converting them to speech before the language model finishes the whole response, and you keep feeding in the new tokens as they arrive. Most of the time, the language model plus WhisperSpeech will be faster than real-time speech, so you will catch up and get the full result; roughly the pattern sketched below.
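Just to illustrate that pattern (this is not the actual WhisperSpeech API), a hypothetical incremental loop could look like the following, with `stream_llm_tokens` and `synthesize_new_text` as stand-in functions for a streaming LLM client and a TTS call.

```python
def stream_llm_tokens():
    """Stand-in for a streaming LLM client that yields words as they arrive."""
    for word in "hello this is a streamed answer from the language model".split():
        yield word + " "

def synthesize_new_text(full_text, already_spoken):
    """Stand-in for a TTS call that renders only the newly added text.
    Returns placeholder PCM samples; a real system would call the model here."""
    new_text = full_text[already_spoken:]
    return [0.0] * (len(new_text) * 320)

def speak_while_generating(min_new_chars=20):
    """Turn the LLM's token stream into audio chunks before generation ends."""
    text, spoken = "", 0
    for token in stream_llm_tokens():
        text += token
        if len(text) - spoken >= min_new_chars:   # enough new text buffered
            yield synthesize_new_text(text, spoken)
            spoken = len(text)
    if len(text) > spoken:                        # flush whatever is left
        yield synthesize_new_text(text, spoken)

for chunk in speak_while_generating():
    pass  # hand each audio chunk to the playback device as it is produced
```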
So this helps a lot in low-latency applications. A lot of the models these days that have good quality are based on diffusion, and they have to process a second or two of speech at once to get the quality they want; here we can process things from left to right, at least in English, and get lower latency thanks to that. So if there are no more questions, then thank you very much. If you have other questions, you can always catch me in the hallways, and I would be happy to talk about AI or TTS or anything else, really. Thank you very much.