Cool, yeah, so hi everyone, thanks for coming to my talk. My name's Elena Rastorgueva, I work on the NeMo Toolkit at NVIDIA, and that's what I'll be presenting today. Before I start: you may have heard about NVIDIA and NeMo in a couple of different contexts, and I just want to be clear that in this talk I'm talking specifically about the NeMo Toolkit, which is open source and has a GitHub repository.

So, what is the NeMo Toolkit? It's an open-source toolkit for conversational AI and LLMs. Like I mentioned, it lives in a GitHub repository, the code is released under an Apache 2.0 license, and it's a toolkit for training ML models for various conversational AI domains. The main current domains are automatic speech recognition (ASR), LLMs, and text-to-speech (TTS), and coming soon we have some multimodal code in the pipeline for things like text-to-image and text-to-3D. Today I'm going to cover some NeMo fundamentals and then go through each of the three main domains in order. You might notice that the ASR section is a little extra heavy, because ASR is the thing I have the most experience with.

Right, so let's get into it. Your first question might be: how do I install the NeMo Toolkit? The answer is you can either use pip install or you can use a Docker container, and we release a new Docker container with each NeMo release. You might also be wondering: do you have pre-trained models that I can play around with? And yes, we do. You can browse the models on NVIDIA NGC or on the Hugging Face Hub. You can use them out of the box, or you can take them and fine-tune them. And there's no need to download them explicitly: you can load them automatically in code by calling `from_pretrained` on the model class, and it will download the model from NGC or the Hugging Face Hub for you.

A note about the models: they live in model files with a `.nemo` extension, and all that really is is an archive file, like an ordinary tar file. You can see here I extracted one of our ASR models, and inside there's a `.ckpt` file containing the model weights, a YAML file with the model config, and some files needed for the model's tokenizer.

A couple more things. You might be wondering about the name: it originated from "neural modules". When NeMo was created, the team was really focused on making sure it's made up of composable building blocks, and that's something we've kept throughout each of the domains. Another advantage is that the ASR, LLM, and TTS domains have the same look and feel, and the upcoming multimodal domain will as well. Training is with PyTorch Lightning, which means it's really easy to scale to 1000+ GPUs. Additionally, if you're interested in LLMs, the LLM domain leverages Megatron Core, which means you can scale your models up to one trillion parameters.

This is maybe the densest slide: I thought I would cover a little about how you get started with NeMo, and you can see it's quite straightforward. This is a typical training script, in this case for training a GPT model from scratch, and I'll go through what's going on here. If you're training from scratch, you need to choose a NeMo model class; here we have a GPT model. You need to specify a config YAML file; by default it will use `conf/megatron_gpt_config.yaml`, or you can specify a different file, and it's really easy to override individual elements of the YAML file, so it's easy to run experiments tweaking just one hyperparameter or a subset of them. Then you instantiate a PyTorch Lightning trainer. The config file specifies what you want your trainer parameters, your model parameters, and your experiment manager parameters to be, so when we instantiate the PTL trainer, it just looks in the config to find those parameters. Then you call the experiment manager function; the experiment manager handles useful things like saving your model checkpoints as you go along and logging your experiment results, and we have integrations with loggers like TensorBoard and Weights & Biases. Then you instantiate your model, you call `trainer.fit`, and that's it.
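To make that concrete, here is a condensed sketch of what such a script looks like. It paraphrases the structure of NeMo's `megatron_gpt_pretraining.py` example; the exact imports and trainer setup vary between NeMo versions, so treat it as an illustration rather than a copy-paste recipe.

```python
# Condensed sketch of a typical NeMo training script (GPT pre-training from scratch).
# Structure follows the description above; details differ across NeMo versions.
import pytorch_lightning as pl

from nemo.collections.nlp.models.language_modeling.megatron_gpt_model import MegatronGPTModel
from nemo.core.config import hydra_runner
from nemo.utils.exp_manager import exp_manager


@hydra_runner(config_path="conf", config_name="megatron_gpt_config")
def main(cfg):
    # The YAML config has trainer, exp_manager and model sections.
    trainer = pl.Trainer(**cfg.trainer)    # the real script also adds Megatron-specific strategies/plugins
    exp_manager(trainer, cfg.exp_manager)  # checkpoint saving + TensorBoard / Weights & Biases logging
    model = MegatronGPTModel(cfg.model, trainer)
    trainer.fit(model)


if __name__ == "__main__":
    main()
```

And because the script is driven by Hydra, overriding a hyperparameter for an experiment is just a command-line flag, e.g. `python megatron_gpt_pretraining.py trainer.devices=8 model.num_layers=24`.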
Yeah, let me go a little bit into ASR now. Some ASR 101, for anyone not familiar: the way most ASR pipelines work is you take your speech signal and convert it to a mel spectrogram, which is basically a representation of the frequencies present in your speech over time. Then the heart of the matter is the ASR model itself, which typically takes in the mel spectrogram and outputs the transcription of what was said in the audio. There are different types of ASR models, if you want to read up on them: there's CTC; the transducer, sometimes called RNN-T or RNN transducer; and models we sometimes call AED, or attention-based encoder-decoder, models.

We have a bunch of pre-trained checkpoints that you can use, and it's really easy to get started with ASR: you just load a model, call `model.transcribe`, and pass in the audio files that you want. Like I mentioned, we have checkpoints in various different languages. I'd also bring your attention to the Hugging Face Open ASR Leaderboard, which lists the current best-performing ASR models, just for English I think, and you can see it's NVIDIA's and OpenAI's models at the top.

In case you're wondering what the relevant metrics are: there's an accuracy metric called word error rate (WER), which is basically the distance between the transcript your model predicted and the true transcript. You want that distance to be as small as possible, so you want the score to be as low as possible. There's also a speed metric called real-time factor, which is how long it takes to process an audio file divided by the duration of that audio file; hopefully the numerator is very small, so you want the real-time factor to be as low as possible too, and you can see they're all very low, on the order of 10^-3.

And something I find quite interesting about ASR: the typical research models, I would say, do offline speech recognition, where you process all of the audio you want to transcribe at once. But normally, if users are talking to some ASR system, they probably want to see the transcript appearing as they speak, and that's what online ASR is for: you take in chunks of audio at a time, transcribe those chunks, and output the transcription as you go.
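To illustrate the difference, here is a deliberately naive way to fake online behavior with an offline model: re-transcribing a growing buffer in 160 ms steps. This is not how NeMo's actual streaming models work (they use cache-aware encoders precisely to avoid redoing all this work), and the checkpoint name is just one of NeMo's public English models.

```python
# Toy offline-vs-online illustration: feed an *offline* ASR model a growing
# audio buffer in 160 ms steps and print the partial transcript each time.
# Quadratic in audio length, so only useful for building intuition.
import numpy as np
import soundfile as sf
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_small")

audio, sr = sf.read("speech.wav")  # most NeMo ASR models expect 16 kHz mono audio
chunk = int(0.160 * sr)            # 160 ms chunks, like the streaming demo coming up
buffer = np.zeros(0, dtype=audio.dtype)

for start in range(0, len(audio), chunk):
    buffer = np.concatenate([buffer, audio[start:start + chunk]])
    sf.write("buffer.wav", buffer, sr)              # transcribe() takes a list of file paths
    print(asr_model.transcribe(["buffer.wav"])[0])  # partial transcript so far
```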
And we have some models in NeMo that do online streaming too, so I thought I would show a demo of one of them. "This is a demo of streaming speech recognition with NeMo, using a model specially trained to transcribe audio in a streaming scenario. It takes in audio in sequential chunks of 160 milliseconds, transcribes the chunks, and outputs the transcription for each chunk as soon as it is ready." I just wanted to point out that the name of the model is at the top there.

We have a couple more ASR-adjacent tasks supported in NeMo: things like speaker diarization, which is working out who spoke when, and lets you add speaker labels to your transcript. We also support speech translation and speech enhancement, and we have various models for speech classification. And I thought I would share a simple but fun demo that I made using a voice activity detection (VAD) model and a keyword-spotting model: voice-controlled Snake. I use the VAD model to detect whether there's speech or not, and then I use keyword spotting to see if one of the keywords was spoken, the keywords being "up", "down", "left", or "right". Oh, and before I go on, I should say I made the UI using the Gradio package, which is really helpful for making UIs with just Python code. "Up, left, down, right, up." You can see I'm quite tense. "Down." I actually got pretty good at this; this was a good run, so I'll skip forwards. The snake got pretty long. Basically what's happening is that Gradio allows you to do streaming, and here it's in half-second chunks: every half a second I check what was said and update the UI appropriately. And in the end I wasn't quite fast enough, so I lost the game.

Moving on to LLMs. So, NeMo supports LLM training. We support various different architectures and various different training modes: pre-training, supervised fine-tuning, and parameter-efficient fine-tuning. We also support various types of alignment, like RLHF, which probably everyone here has heard of; DPO, which is a more recent one; and SteerLM, which is work developed by our team recently. The code for that uses the NeMo Toolkit, but it currently lives in a specialized repository for model alignment called NeMo-Aligner, which you can find here.

I also made another demo for LLMs. One of the strengths of NeMo is that you can use ASR, LLMs, and TTS from the same code base, all written with the same toolkit. So I made a demo with an ASR model; a small LLM that can do extractive QA, which I got by running one of our NeMo LoRA tutorials; and two TTS models (I'll explain later why it's two). So I ended up making a demo that uses the NeMo Toolkit to talk about the NeMo Toolkit. "What are NeMo Toolkit models trained with?" Oh, I forgot to mention: it's extractive QA, so it takes some context and then answers questions based on that context, but it's extractive, so the answers to the questions are located in the context. I also noticed this morning, when I did the recording, that this next bit might be a bit loud, because it came from my speakers, so just watch out. "PyTorch Lightning." Okay, it's fine, just a bit loud. "How many GPUs can we scale our training up to?" "Thousands." "What types of parallelism tricks do Megatron models use?" "Tensor and pipeline model parallelism."
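For a rough idea of how the demo's pieces fit together, here is a hedged sketch of the ASR to LLM to TTS chain. The ASR and TTS calls follow NeMo's documented inference APIs, but `qa_answer` is a stand-in for the LoRA-tuned extractive-QA model from the tutorial, not a real API.

```python
# Sketch of the demo pipeline: transcribe a spoken question, answer it with an
# LLM, then speak the answer with a two-model TTS pipeline (mel generator + vocoder).
import soundfile as sf
import nemo.collections.asr as nemo_asr
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

asr = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_large")
spec_gen = FastPitchModel.from_pretrained("tts_en_fastpitch")  # text -> mel spectrogram
vocoder = HifiGanModel.from_pretrained("tts_en_hifigan")       # mel spectrogram -> waveform


def qa_answer(context: str, question: str) -> str:
    # Placeholder for the LoRA-tuned extractive-QA LLM used in the demo.
    return "PyTorch Lightning."


question = asr.transcribe(["question.wav"])[0]
answer = qa_answer(context="NeMo models are trained with PyTorch Lightning...", question=question)

tokens = spec_gen.parse(answer)
spec = spec_gen.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spec)
sf.write("answer.wav", audio.to("cpu").detach().numpy()[0], 22050)  # tts_en_fastpitch is 22.05 kHz
```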
So there you go. And speaking of tensor and pipeline model parallelism: those are available in NeMo because we make use of Megatron Core, which is another piece of software developed by NVIDIA, also open source, and it has a bunch of tricks that make it easy to scale to larger models, including tensor and pipeline parallelism and also activation checkpointing. You can also do FP8 training on H100s, in case that's something you're interested in, which makes use of Transformer Engine, again another piece of open-source software by NVIDIA. And we also support FlashAttention, which I think is just a simple flag to enable.

I thought I would give a shout-out to some work by my colleagues that I think is quite cool. It's called SteerLM, and it's a simpler alternative to RLHF. They released a paper recently, which you can find here, and what's also interesting is that as a byproduct of the alignment process, you end up with a model whose outputs you can toggle: you can control how much of a given attribute the output contains, with attributes like quality, humor, toxicity, creativity, and helpfulness. And it's interesting to toggle these in case you want to. They did some evals, and I think they found the best results when they set toxicity, humor, and creativity to the minimum setting and everything else to high. And they released a model trained this way: they took the Llama 2 70B base model, fine-tuned it using SteerLM, and found that it actually outperformed the Llama 2 chat model on MT-Bench. And there are more publicly available LLMs, like the Nemotron-3 8B family of models, which is a base model, three chat models, and a question-answering model.

Let me also talk a little bit about TTS. Like I hinted, there are basically two types of TTS methods, and like I said earlier, I had to use two models for doing TTS in my demo. That's because I had one model turning the transcript into a mel spectrogram, and then another model turning the mel spectrogram into the speech signal. That's the more classical way, I would say, of doing TTS. But there are also end-to-end models that go straight from text to the speech signal, though I think that's a much trickier task. And we have a bunch of pre-trained checkpoints for TTS; these are the names of some of them.

So for the demo, you already heard some of the TTS voices, but I thought I would demonstrate an interesting facet of TTS today: speaker interpolation, which is basically synthesizing new synthetic voices by combining two or more existing speaker voices. What you can do, for example, if you have a model that accepts speaker embeddings as input, is generate new speaker embeddings for non-existent speakers and feed those to the TTS model. What I did in the demo is actually one of the tutorials that we have on our GitHub: you take two speaker embeddings, you make an interpolated speaker embedding somewhere between those two embeddings, and that gives you a new speaker whose voice is probably somewhere in the middle between the existing speakers.
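The tutorial implements this by operating directly on the model's speaker-embedding table. Here is a minimal sketch of the idea; the checkpoint names are public NGC models, but the `fastpitch.speaker_emb` attribute path and the speaker indices are assumptions based on the tutorial, so double-check them against your NeMo version.

```python
# Minimal speaker-interpolation sketch: blend two rows of a multi-speaker
# FastPitch model's speaker-embedding table, then synthesize with the blend.
import torch
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

spec_gen = FastPitchModel.from_pretrained("tts_en_fastpitch_multispeaker")
vocoder = HifiGanModel.from_pretrained("tts_en_hifitts_hifigan_ft_fastpitch")

spk_a, spk_b, alpha = 0, 1, 0.5  # illustrative speaker indices, 50/50 blend
with torch.no_grad():
    emb = spec_gen.fastpitch.speaker_emb.weight             # speaker embedding table (assumed path)
    emb[spk_a] = torch.lerp(emb[spk_a], emb[spk_b], alpha)  # overwrite speaker A with the blend

tokens = spec_gen.parse("Hello world, speaker interpolation is cool.")
spec = spec_gen.generate_spectrogram(tokens=tokens, speaker=spk_a)
audio = vocoder.convert_spectrogram_to_audio(spec=spec)
sf.write("interpolated.wav", audio.to("cpu").detach().numpy()[0], 44100)  # HiFi-TTS models are 44.1 kHz
```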
I'll play you some samples from the tutorial notebook that we have. "Hello world, speaker interpolation is cool." So that's the first speaker. "Hello world, speaker interpolation is cool." And that's the second speaker. "Hello world, speaker interpolation is cool." And that's the interpolated speaker: you can hear that, in terms of pitch, the voice is somewhere in between the two existing speakers. "Hello world, speaker in..."

And that's all I have for today, thank you very much for your attention. If you want to learn more, one of my colleagues is actually going to be doing a keynote tomorrow morning. Also, available 24/7, there's the GitHub repository and the docs. And if you have questions you can ask me now, or here's my email if you think of something later. So yeah, thanks very much.

Sorry, one sec. So my question is: if you're interested in ASR and TTS models, what kind of compute or GPUs do you need? And also, with the NeMo Toolkit, do you need an NVIDIA GPU to run it, or can it be something you run on CPU or another GPU?

Yeah, great question. For the second one: to train, you need NVIDIA GPUs, as far as I'm aware. If you want to do inference, you actually don't need an NVIDIA GPU. Actually, my streaming demo I ran on my Mac, because I was working from home and I couldn't get the audio working remotely. In terms of the number of GPUs, I can't say off the top of my head. I know that if you want to train a really good ASR model, you need quite a lot; TTS generally doesn't need as many GPUs. But one of the benefits is that we release these pre-trained models that are trained on a lot of GPUs, and if you want to fine-tune them instead, that's quite simple to do. You can do basic fine-tuning, or you can freeze some of the layers of the model, and you can also use adapters, for which we have some functionality; they're very effective for fine-tuning.
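As a rough sketch of the layer-freezing idea I mentioned, assuming a CTC checkpoint and NeMo's manifest-based data setup (the file names here are placeholders):

```python
# Sketch: fine-tune a pre-trained ASR model while keeping the encoder frozen.
import pytorch_lightning as pl
from omegaconf import open_dict
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("stt_en_conformer_ctc_small")
model.encoder.freeze()  # NeMo modules expose freeze(); only the decoder stays trainable

# Point the model at your own data; NeMo uses JSON-lines manifest files.
with open_dict(model.cfg):
    model.cfg.train_ds.manifest_filepath = "train_manifest.json"      # placeholder path
    model.cfg.validation_ds.manifest_filepath = "val_manifest.json"   # placeholder path
model.setup_training_data(model.cfg.train_ds)
model.setup_validation_data(model.cfg.validation_ds)

trainer = pl.Trainer(devices=1, accelerator="auto", max_epochs=5)
trainer.fit(model)
```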
Oh yeah, about inference: if you want to do something basic on device, you don't need a GPU. I think if you want to do something really hardcore, GPUs would help.

Oh yeah, sorry, just a quick one: where is Whisper, OpenAI's Whisper, on this scale? I don't think I saw it in your ranking.

Oh, did I miss it? So, there's the Open ASR Leaderboard... here you go. We were actually at the top until Whisper released v3, which is trained on a lot more data.

Oh, okay, I missed it. Great talk, thank you. Just real quick: do you have access to the demo examples that you put up there? Do you have code?

Oh yeah, I can show you. This streaming one I made very recently; it's actually a combination of our online ASR notebook and some code we have that does simulated streaming, and I kind of took them and made a mishmash. But I'm planning to release either that or some kind of Gradio demo soon.

I'll email you for that, for sure. And it seems like you're very close to a real-time assistant type of app there, so I'm just wondering what types of techniques or tools you would use to reduce some of the latency you're seeing with speech-to-text and then back to text-to-speech, because you're injecting that into some sort of vector database to get results back, and then doing speech-to-text or text-to-speech again. Just wondering what sort of approaches you might use to speed everything up.

Yeah, sorry, I'm not a good person to ask that question; I do more research-oriented things. But if you email me, I can definitely connect you; there are definitely people working on this.

Okay.

Yeah, I guess that's all. Thank you very much, everyone.