Hi, everyone. I'm very happy to be here today in Shanghai, giving this talk to all of you. My talk is yet another talk about AI. It's such a buzzword, no? But I don't think AI is just a buzzword, because we finally have models that actually work, models that can improve your productivity. And since most of us here are probably programmers, one family of these tools is code generation models. These are models that can complete your code, whether it's a function, a class, or a whole code block, just like GitHub Copilot, for example.

These models have been around for a while, but they only really took off in research when Codex was released. Codex is a code generation model by OpenAI, and it's the model behind Microsoft's GitHub Copilot extension for VS Code. Before Codex, there were some small code generation models, like CodeGPT and CodeBERT, but they were small models trained on little data, so they weren't very good at the task. When Codex was released, it showed that you can train a code model the same way you train a language model: you take a transformer that is large enough, you feed it a lot of code data, and it just learns how to program. Then DeepMind worked on AlphaCode, and AWS worked on CodeWhisperer, which was great. But these are all closed models, meaning you have to send your data to an external API, and you can't fine-tune them.

That's when the open-source community tried to step in and train good code generation models, for example CodeParrot from Hugging Face, InCoder from Meta, and CodeGen from Salesforce. However, there were still some open questions about these models. First, regarding performance: they weren't at the level of Codex yet, and the focus was on Python rather than other programming languages. There were also open questions about transparency, for example which data these models were trained on, and about evaluation and user experience. These are the questions we tried to address.

So what does Hugging Face have to do with all this? We have the Hub for sharing datasets and models, but we also train models. Last year, we released CodeParrot, which is an educational project for training these models: it shows you how to scrape data, how to train the models, and how to evaluate them. It was mainly for education, and the performance was not great. A few months ago, we released StarCoder, a strong code generation model trained on more than 80 programming languages, and it surpasses the performance of Codex on Python but also on other languages. The model is also open access. Today, I'm going to talk to you about the project behind it, and about the architecture and deployment of these models.

We decided to create a big collaboration called BigCode, with more than 500 participants from over 30 countries. The collaboration is led by Hugging Face and ServiceNow, but it is open for anyone to join, so if anyone in the audience would like to help, we're very happy to have you. Our goal is to train strong code generation models, but with an open and responsible approach. We started BigCode to address some of the bad practices of closed large language model development. It starts with not disclosing the training data and the sources that were used: there are a lot of great open-access models today, but we don't know which data they were trained on.
The other bad practice is not releasing the model weights at all. If you want to fine-tune these models, you can't, and if you want to deploy them on premise, you also can't, so you always have to send your data, which might be sensitive, to third parties. Needless to say, all of this makes results non-reproducible and doesn't foster research and open source.

What we think open development of these models should look like is exactly the opposite. First, you should make your training data public and provide inspection tools so that people can look into the data and see what's in there and what the models were trained on. You should also create opt-out tools in case people want to be removed, because as the author of some data or some code, you should have the right to say: no, I don't want to be included in these model trainings. Model weights should be public, so that people can fine-tune the models and deploy them on premise. And you should provide full documentation of the whole process, to make it easy to use and reproducible.

So let's see what it takes to train these models from scratch. It takes hundreds of thousands of GPU hours and terabytes of data, but that's not all, and I'll try to show you what goes on behind the scenes. Everything I'm going to show you is about StarCoder, the code generation model we released earlier this year (and for the Star Wars fans, this was definitely not named after anything from Star Wars and definitely not released on May the 4th).

Everything starts with good data. If you want to train a language model on, say, Chinese or English, you train it on Chinese or English text so it learns the language. If you want to teach a language model how to code, you need to train it on code, and the main source for that is GitHub. So we cloned all of the GitHub repositories, and then we did a lot of data cleaning to remove file extensions we're not interested in, and we also did license filtering. A code repository on GitHub can have a permissive license like MIT or Apache 2.0, a copyleft license like GPL, or no license at all. We only kept the first category, which allows us to use the code.

We also built a data inspection tool: you go to the tool, type your GitHub username, and it tells you whether any of your repositories are in The Stack. If you don't want to be included in our future trainings, you just fill in a simple Google form and we'll make sure not to use your data.

We also did a lot of data curation to make sure our data was clean and that the model could actually benefit from it. The first step was to select a smaller set of languages rather than keeping all 300, because some languages are no longer used or maintained. We also added other sources, for example GitHub issue conversations, Git commits, and Jupyter notebooks. The next cleaning step was deduplication, because studies have shown that training a model on many copies of the same files hurts performance, so it's important to keep only one copy of each file. We also did decontamination: if there are evaluation benchmarks you know you're going to evaluate on after training, you need to make sure they're not in your training set. The last cleaning step was removing personally identifiable information and secrets.
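Just to give a flavor of what that kind of filtering can look like, here is a tiny, purely illustrative sketch of regex-based secret detection in Python. To be clear, the patterns and the helper function below are assumptions I'm making for illustration; the actual BigCode pipeline for PII and secret removal was considerably more involved than a couple of regexes.

```python
import re

# Illustrative patterns only: one for AWS access key IDs, one for PEM private
# key headers. A real pipeline would cover many more credential formats.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}

def find_secrets(source: str):
    """Return (kind, matched_text) pairs for anything that looks like a secret."""
    hits = []
    for kind, pattern in SECRET_PATTERNS.items():
        hits.extend((kind, m.group(0)) for m in pattern.finditer(source))
    return hits

example_file = 'aws_key = "AKIA' + "X" * 16 + '"'
print(find_secrets(example_file))  # [('aws_access_key', 'AKIAXXXXXXXXXXXXXXXX')]
```

Files that get flagged this way can then be redacted or dropped before training.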
And this matters because, believe it or not, there are still API keys and SSH keys sitting in public repositories on GitHub, and we didn't want to train our models on them, so that at inference time people couldn't extract other people's secrets from the model.

If you're wondering how to run all of this processing on very large datasets, the Hugging Face Datasets library comes to the rescue. This is the library we used for all our cleaning. It has some very nice features, like applying filters and transformations with a method called map, which supports multiprocessing. My favorite feature is batched mapping, where map processes examples in batches instead of one by one, which makes the filtering go so much faster; I'll show a small sketch of this at the end of this part.

Okay, so we've seen how to prepare the data. Now, how do we train these models, leverage a lot of GPUs, and make sure the model is optimized? StarCoder is a decoder-only model with an architecture similar to the GPT models. It has 15 billion parameters, with some modifications to address what people actually want from a code generation model. For example, we used a technique called multi-query attention to reduce the memory footprint at inference time, especially with large batches. We also used an 8,000-token context length, which is long and lets you feed more context into your model. A decoder model usually only attends to the context on its left, but we used a technique called fill-in-the-middle, which lets the model also take into account the text on the right. So if you have some code and you want to edit it in the middle, you can do that with StarCoder.

As for the training setup, it took 512 GPUs from the Hugging Face cluster for 24 days. It was fairly smooth sailing: the loss kept going down and we only had a few automatic restarts. We used Megatron-LM from NVIDIA for the training. If these numbers scare you, don't worry: you don't need all of that for fine-tuning, because with libraries like PEFT from Hugging Face, which implements techniques like LoRA, you only need modest computational resources. The idea is that you take your very large transformer with its billions of parameters, freeze all of them, add a small number of trainable parameters, and train only those. You get performance that matches full fine-tuning, and it also lowers the storage cost because you only need to store the extra adapter weights. There's a small sketch of this below as well.

So let's see what the BigCode ecosystem looks like. It started as a small family tree, with The Stack dataset and the StarCoder models, and it became a whole ecosystem. Nowadays, almost everyone who wants to train a code model starts from The Stack; that was the case, for example, for Stability AI with StableCode, for Salesforce, and for Replit. There are also community fine-tunes, where people start from StarCoder as a base model and fine-tune it on instruction datasets, for example WizardCoder from the WizardLM team and PanGu-Coder from the Huawei team.
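If you want to try that kind of fine-tuning yourself, here is a minimal sketch of the freeze-everything-and-add-adapters idea using PEFT with LoRA. The checkpoint name, hyperparameters, and target module below are assumptions picked for illustration (StarCoder checkpoints are gated, so you need to accept the license on the Hub first); this is not the exact recipe behind any particular community fine-tune.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# A smaller StarCoder sibling so the sketch fits on a single modest GPU;
# swap in whichever causal LM checkpoint you actually want to fine-tune.
model = AutoModelForCausalLM.from_pretrained("bigcode/starcoderbase-1b")

# Freeze the base weights and attach small trainable LoRA adapters on the
# attention projection ("c_attn" is the projection name in this architecture).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

From there you train as usual (for example with the Transformers Trainer), and only the adapter weights get updated and saved, which is what keeps both the compute and the storage cost small.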
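And going back to the data-processing side for a moment, here is roughly what the batched map and filter pattern from the Datasets library looks like. The toy data and the length threshold are made up; the point is just the shape of the API, which is the same whether you have three files or terabytes of them.

```python
from datasets import Dataset

# Toy stand-in for a code dataset with a "content" column.
ds = Dataset.from_dict({
    "content": [
        "def add(a, b):\n    return a + b\n",
        "import math\nprint(math.pi)\n",
        "x=1;" * 500,  # one suspiciously long, minified-looking line
    ]
})

# Batched map: the function receives a dict of lists, so per-example Python
# overhead is amortized over the whole batch (add num_proc=... to parallelize).
def add_line_stats(batch):
    line_lists = [text.splitlines() for text in batch["content"]]
    batch["max_line_length"] = [
        max((len(line) for line in lines), default=0) for lines in line_lists
    ]
    return batch

ds = ds.map(add_line_stats, batched=True, batch_size=1000)

# Batched filter: drop files whose longest line suggests generated or minified code.
ds = ds.filter(lambda batch: [m < 1000 for m in batch["max_line_length"]], batched=True)
print(ds.num_rows)  # 2
```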
Okay, so let's say we now have a good code generation model that we trained. How can we turn it into an actual product, for example a VS Code extension that can serve hundreds of thousands of users? That's where deploying large language models for code becomes important. On the hosted side, we have Inference Endpoints: if you don't want to take care of the infrastructure and the MLOps behind these deployments, you can just use these endpoints and query them to generate text or code. Under the hood, these endpoints use Text Generation Inference (TGI). This is the inference library we use in production; it is available on GitHub, it has a lot of optimizations, and it serves most of the popular language models.

What is cool about TGI is that it is production-ready. It exposes a lot of metrics and has good tracing, so you can follow your latency, your throughput, and the requests you receive, and be notified when there's an anomaly. It has a warm-up system: we run the full pipeline before the server goes into production to make sure it doesn't crash. And it has a number of optimizations. TGI focuses on optimizing latency: some other inference libraries report throughput numbers, but that isn't necessarily what you want, because what you care about is serving as many users as possible, as fast as possible. It has continuous batching for handling concurrent requests, meaning the batch size varies based on the number of requests you receive. It also has token streaming: say you want your model to generate 200 tokens; instead of waiting for the whole generation to complete and then displaying the result, you can display it token by token. This reduces the perceived latency, because users see results earlier, and it makes your UI more interactive, because if users don't like a generation, they can stop it in the middle and ask for a new one. (I'll show a tiny sketch of streaming from the client side right before I wrap up.) These are some of the optimizations in TGI. Among its users are HuggingChat, our UI for interacting with chat models like Falcon and Llama 2 70B; HF Code Autocomplete, the VS Code extension, which serves StarCoder and other code models like Code Llama; Open Assistant; and nat.dev.

For my last slide: how can I be at KubeCon + CloudNativeCon and not talk about Kubernetes, when Kubernetes is what powers all of the Hugging Face infrastructure? We use it for all our services: the Hub, the API endpoints, the dataset server, and Spaces. We have eight production clusters with around 800 nodes. The clusters are very dense because, for example, when you build a Space, which is a demo on Hugging Face, we usually give each user two CPUs, but those CPUs are most of the time not used. So what we want is to fit a lot of pods on each node, and we can have up to 250 pods per node. We do that using the memory swap feature, where we use disk space as swap for RAM, and that lets us fit a lot of pods on one node.

Another thing our team did is recompile containerd to pull images faster. For example, if a user restarts their demo or Space, the container image has to be pulled again, so making that faster improves the user experience. Our engineers improved the checksum operation so it is about 30% faster. So that is an overview of how we use Kubernetes at Hugging Face.
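Before wrapping up, here is the small token-streaming sketch I promised: one way to query a running TGI server (or an Inference Endpoint) from Python and print tokens as they arrive. The URL and the prompt are placeholders; this assumes you already have TGI serving a code model somewhere.

```python
from huggingface_hub import InferenceClient

# Point the client at a running text-generation-inference server or endpoint.
client = InferenceClient("http://127.0.0.1:8080")

prompt = "def fibonacci(n):"

# stream=True yields tokens as they are generated instead of waiting for the
# full completion, which is what reduces the perceived latency in a UI.
for token in client.text_generation(prompt, max_new_tokens=64, stream=True):
    print(token, end="", flush=True)
```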
I hope this gave you an understanding of how we're able to go from raw source code sitting on GitHub to actual products like HuggingChat or the HF Code Autocomplete extension. Behind those tools, there's what I like to describe as an iceberg, where we only see the tip. The tip is a lot of GPUs, but that's not everything; there's a lot underneath the iceberg.

It starts with a lot of data curation: you need to spend time actually inspecting what's in your data and removing as much noise as you can. There's also a lot of data governance, and this was at the heart of the BigCode project. For example, for training we only selected permissively licensed code, and when we released the VS Code extension we implemented a code attribution tool: if the model generates code that was copied verbatim from the training data, it is highlighted in red and the tool points you to the original repository, in case you want to give attribution to the author, because a permissive license doesn't exempt you from attributing the author when the code is the same. We also worked a lot on inference, as you've seen, because a lot goes on under the hood; the VS Code extension now serves, I think, up to 12,000 users, and the same goes for the endpoints. And there's a lot that goes into evaluation, because you need to evaluate your model as much as possible to have a comprehensive understanding of it.

So that was my talk, and I hope you enjoyed it. Thank you, everyone.