Hello and welcome to this session at the Open AI and Data Forum at the Open Source Summit Latin America. Today we're going to talk about how we can train AI to understand code, using the largest code data set, called Project CodeNet. My name is Christian Kadner. I'm a software developer at IBM, and I've been focusing on open source projects around machine learning and the machine learning ecosystem for the last couple of years. My co-presenter today is Tommy Li, and he'll introduce himself. Hi, my name is Tommy. I'm a senior software developer at IBM focusing on open source, and my main focus is the AI lifecycle, contributing mainly to the Kubeflow projects. Awesome. We both work at the IBM Center for Open Source Data and AI Technologies, a group of data scientists and open source developers working from the San Francisco Bay Area, with team members around the world. The team here in California works at the IBM Silicon Valley Laboratory; that's an aerial shot of our facilities, nestled in the hills of southern San Jose with nothing but nature around. It's a great place to work.

Okay, let's jump right in. Today we're going to talk about AI for code, and AI for code really concerns itself with teaching machines the language of machines. It's a fairly new field in computer science, and it uses technologies like natural language processing and document understanding coupled with code analysis and compilation techniques. The goal of the field is to enable the automation of the software engineering process: help developers write code faster to increase their productivity, and perform, in an automated fashion, a myriad of tasks that normally take a lot of time, like code search, code completion, and code-to-code translation. In the bigger picture, the goal is to analyze and modernize legacy software systems and migrate monolithic applications into modern microservices ready for the enterprise.

Now, everything in machine learning is based on data. Of course you need compute power and algorithms, but without data you can't really train machine learning models. In the recent past we've seen that with ImageNet: around a decade ago now, there was a huge advance in image recognition. ImageNet, the data set, as you might be familiar with, has more than 14 million images spanning some 22,000 classes. Everything from spiders to cars to pictures of nature is in there, and it really fueled the machine learning revolution. All of this resulted in machines reaching near-human, and even exceeding human-level, performance and speed at certain very narrow tasks. The classification error for images was reduced so drastically that machines can do a better job than humans. You might remember that in 2011 IBM Watson competed at Jeopardy! and beat the all-time reigning champions. That was a huge advance in natural language understanding, all because there were gigabytes upon gigabytes of data available that Watson could work through, and that is what powered the language understanding algorithms. And more recently, in 2019, Project Debater really showcased that machine learning models can engage in conversations and make arguments nowadays.
It really almost passes the Turing test: if you don't know whether you're talking to a human or a machine, you might be fooled into believing you're debating an actual human. Now, we've seen that power of AI applied to human language, and there are tools you can have on your smartphone today: you can travel to Italy without being able to speak a single word of Italian and use your phone to get real-time translations. So AI and machine learning have really been instrumental in opening up the language of humans to machines.

Now, what we need for understanding code, and really for AI for code, is to apply similar concepts to the language of machines as were applied in natural language processing. And if you think about it, the language of machines is not very different from the language of humans. In human language you have German, French, Italian; for computers you might have languages like Java, Python, COBOL, C. Human language has structure, grammar, vocabulary, and syntax, all of which apply to machine languages as well. So what we need to enable AI for code is a breakthrough similar to what we've seen with ImageNet or with natural language processing. Ideally, AI for code can facilitate code translation, make searching for code much easier, let us use natural language to search for code, and find code similarities. And if things work out well, your code should become much better in terms of performance and memory footprint. Of course, code also needs to be classified: what does that code do? All of these are use cases, or applications, for machine learning algorithms and AI that understands code.

What we also need is data. In order to enable AI for code, we need a lot of code to look at, and that's where Project CodeNet comes in. Project CodeNet was contributed by IBM Research to open source about a year ago, and it's a high-quality code data set aimed at facilitating innovation and benchmarking in the field of AI for code. The scale of the data is just staggering: there are about 14 million code samples in the data set, covering roughly 4,000 coding problems in 55 programming languages and totaling more than half a billion lines of code. The problems are a really diverse set of coding problems, and all of the code here has been tested; tests are provided for each of those code samples.

Here you can see a polyglot view of the languages in CodeNet. As said, about 55 different languages are represented. The majority of the submissions are in C, C++, C#, Python, and Java, which is explained by the origin of the data set: CodeNet was curated from two sources, the AIZU Online Judge and AtCoder online judge websites. There is a total of about 4,000 problems that were posed to software developers, who could then submit their solutions to those problems. There were about 14 million submissions, and more than half of them were accepted: they could be compiled, they were executable, and they produced the results specified in the problem. About 30% were wrong answers, but that's not bad for the CodeNet data set, because you want to be able to distinguish right from wrong. Another roughly 17% were rejected for other causes. The data itself is split into the metadata and the actual source files.
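To make that split concrete, here is a minimal, hedged sketch of walking the data set on disk. It assumes the directory layout described in the Project CodeNet repository (a data/ tree of problems, each with per-language subdirectories of submission files, next to a metadata/ folder); the root path and problem ID are placeholders you would adjust to wherever you extracted the archive.

```python
# Hedged sketch: list a few submissions for one problem, assuming the layout
# documented in the Project CodeNet repo: data/<problem>/<language>/<submission>.<ext>
from pathlib import Path

CODENET_ROOT = Path("Project_CodeNet")      # placeholder: wherever you extracted the data set
problem_dir = CODENET_ROOT / "data" / "p00001"

for language_dir in sorted(problem_dir.iterdir()):
    if not language_dir.is_dir():
        continue
    submissions = sorted(language_dir.glob("*"))
    print(f"{language_dir.name}: {len(submissions)} submissions")
    if submissions:
        # Peek at the first solution file in this language
        print(submissions[0].read_text(errors="ignore")[:200])
```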
Each submission is a complete program in its own file, and each of those files solves one of the posed problems. Many of those problems are solved in slightly different ways and in many different languages. Here's a quick look at what the metadata looks like at the data set level: every problem that's been posed has an ID and a name, along with limits on the memory and time a solution may consume, and a complexity rating. Then, at the individual problem level, every submission has an ID as well and can be mapped to the original problem, the user who submitted it, the language it's written in, when it was submitted, and the status of the submission: was it a correct solution, in the sense that it compiled, executed, and produced the expected results? Interestingly, every solution also records the CPU time and the memory used to run that piece of code. And here's a quick look at the overall spread of accepted versus rejected solutions, along with the status codes and abbreviations used in the data set itself, in case you want to play with the data.

Project CodeNet, of course, is the data and the metadata, but it also comes with a set of tools. Those tools help derive statistics from the data, help access the data in the first place, aggregate data in certain ways, convert between different formats, and pre-process the files. There's AST generation, the source code can be tokenized, and data-flow graphs can be generated from it. In the Project CodeNet GitHub repo you can also find some model experiments: graph neural network experiments, a masked language model, and token-based similarity classification examples. There are also notebooks that can be run as is, so you don't need more than Conda or your Python virtual environment, and you should be able to run the two notebooks that are in that repository. One is the masked language model, and the other one is for language classification, both of which we'll see in more detail a little later.

Now, what are some real-world applications? The idea of Project CodeNet is not to provide all the solutions, but to provide a data set that can be used as a benchmark for machine learning models in the AI for code field. One use case that we're going to see in more detail is code classification. Code similarity search can be useful. There's source-to-source code translation, which is especially useful when you think of the gargantuan task of modernizing existing legacy software systems, often written in languages that are no longer taught in schools. There are about 220 billion lines of code in use in finance applications, and most of the developers have long since left their companies or retired entirely. So it's a big task to even understand what these monolithic software systems do, let alone translate those systems into modern microservice architectures. And, closer to my heart, it could really help developers write better code. So much of the time that developers spend is learning about code, reading through code, and finding code on the internet, on Stack Overflow and other websites. Ideally, all of that could be automated and really make developers more productive.
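As a small illustration of the per-problem metadata described above, here is a hedged sketch that loads one metadata CSV with pandas and summarizes submissions by language and status. The column names (language, status, cpu_time, memory) follow the fields mentioned above, but check the data set documentation for the exact schema of your download; the file path is a placeholder.

```python
# Hedged sketch: summarize one problem's submissions from the CodeNet metadata.
# Assumes metadata/<problem_id>.csv with columns like language, status, cpu_time, memory.
import pandas as pd

meta = pd.read_csv("Project_CodeNet/metadata/p00001.csv")

# How many submissions per language, and what status did they get?
by_language = (
    meta.groupby("language")["status"]
        .value_counts()
        .unstack(fill_value=0)
)
print(by_language)

# Average CPU time and memory of the accepted solutions
accepted = meta[meta["status"] == "Accepted"]
print(accepted[["cpu_time", "memory"]].mean())
```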
And there are some existing applications already. IBM's AI for Code stack makes use of Project CodeNet, and you might have heard of DeepMind's AlphaCode: they used several data sets for their training, but CodeNet was one of their major sources. Impressively, AlphaCode can really compete with human programmers, performing on par with the average competitive programmer, in the 50 to 60 percent range, which is really mind-boggling. With that, I want to hand over to Tommy to talk about the Machine Learning Exchange, which we're going to use to showcase some of Project CodeNet's examples today.

Thanks, Christian. Let me share my screen. As Christian has shown, the CodeNet data set is powerful and has many use cases. Now I'm going to introduce the Machine Learning Exchange project: how we can share some of the CodeNet data set assets, and how we can use and leverage them in an open source project that can train them, use them, and apply them in real-world use cases.

Some background on what the Machine Learning Exchange is. It's a catalog platform that stores data and AI assets, and the goal of the platform is to exchange those assets, so developers can take the data and AI assets they have and share them within their organization and with the open source community as well. Behind the scenes, the Machine Learning Exchange is mainly focused on uploading, registering, executing, and deploying pipelines, models, data sets, and notebooks. The back end of the Machine Learning Exchange is mostly powered by Kubeflow Pipelines on Tekton, which executes the workflows behind the pipelines, handles the assets, and runs notebooks. Furthermore, if you have, say, a model being trained as part of a workflow, you can also serve it using a serving engine called KServe, which used to be part of the Kubeflow project and has now graduated to the LF AI organization. All the data management, for handling data sets and making them usable, is done by Datashim, a data set management platform. And last but not least, the data sets and models mostly come from our team's DAX and MAX platforms, which stand for Data Asset eXchange and Model Asset eXchange. For all of these assets we standardize the asset metadata using MLSpec, which originated as one of Microsoft's projects and standardizes how ML metadata should be stored.

With this, let's go over a little background on each integration and how data sets, pipelines, and models are used within the Machine Learning Exchange. First, how data sets are used. When you have a very big data set, say hosted in the cloud in some blob storage, and you want to leverage it inside your Kubernetes platform, you need a way to bring it in. Datashim is a very good platform for that: it can take any blob data set you store in S3, NFS, or a similar backend and expose it through a CSI volume, all handled by its Dataset operator. The operator takes those files, creates the connection, and mounts them as a volume for your Kubernetes pod, so the pod can consume the blob storage without even knowing what storage is used underneath, because all of that plumbing is handled by Datashim.
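Below is a hedged sketch of what creating such a Datashim Dataset could look like from Python, using the official Kubernetes client. The API group (datashim.io/v1alpha1) and the COS/S3 field names reflect the Datashim documentation at the time of writing and may differ in your installed version; the endpoint, bucket, and credentials are obviously placeholders.

```python
# Hedged sketch: register an S3/COS bucket as a Datashim Dataset so pods can
# mount it as a PVC. CRD group and spec fields assume a recent Datashim release.
from kubernetes import client, config

config.load_kube_config()   # or config.load_incluster_config() inside a pod

dataset = {
    "apiVersion": "datashim.io/v1alpha1",
    "kind": "Dataset",
    "metadata": {"name": "codenet-langclass"},
    "spec": {
        "local": {
            "type": "COS",                          # S3-compatible object storage
            "endpoint": "https://s3.example.com",   # placeholder endpoint
            "bucket": "codenet-data",               # placeholder bucket
            "accessKeyID": "CHANGE_ME",
            "secretAccessKey": "CHANGE_ME",
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="datashim.io", version="v1alpha1",
    namespace="kubeflow", plural="datasets", body=dataset,
)
# Datashim then creates a PVC with the same name ("codenet-langclass") that pods can mount.
```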
Next, once you have models trained on top of those data sets, you want to serve them on a platform running on Kubernetes. The platform we use is called KServe, which is built on top of Kubernetes and leverages Istio for traffic routing. So you can not only serve models, but also do A/B testing, explain what your model is doing, and log information from your models to backing storage for further analysis.

And last but not least, to bring all of these components together into one workflow, we use Kubeflow Pipelines, which orchestrates how you take the data from your persistent volume, train on it, and serve it on KServe, all running on top of Kubeflow. We mainly leverage Kubeflow Pipelines on Tekton, because we also want to run Kubeflow Pipelines on top of OpenShift, and OpenShift is certified to run Tekton pipelines with the various security requirements. Kubeflow Pipelines on Tekton is the open source version of how you can run things on the Machine Learning Exchange, but we also have the same capabilities running on top of our product called Watson Studio Pipelines. It's powered by the same technology underneath, but with an easier drag-and-drop UI, so you can build your ideal workflow using Watson Studio Pipelines as well.

With all of these tools leveraged together on top of the Machine Learning Exchange, we're able to speed up the AI lifecycle. These tools help us process workflows faster across teams and avoid duplicated assets: models, pipelines, and data sets can be shared across teams. If one team has developed a data set, other teams don't have to recreate it; they can just reuse it. And when other teams need the same data set, or develop a pipeline or workflow around it, the original creators of that data set also benefit and get to understand what use cases their data set is being used for. It also helps with some of the challenges around tracking the connections and the traceability of all these assets: the Machine Learning Exchange records what a data set is certified for and under what license, say an Apache license, so you can actually use it without any risk. With all these capabilities and all this governance built on top of the platform, you can feel safe using any asset or data set that has been deployed and hosted by the Machine Learning Exchange.

So, with all of these tools, how are they actually used as a whole flow? As we know, when data scientists need to develop a model, they usually have to gather data first.
Once they have the data, they still have to analyze and process it, then go through the process of training either traditional machine learning models or deep learning models, and evaluate them based on feedback. Once it's ready for production, they have to deploy the model to the cloud and have some way to maintain it. All of these steps break down into many smaller steps, and each of those small steps can take a whole team to do. Take data preparation: even if you have the data hosted somewhere in your organization, you still need a whole team to clean that data, ingest it, analyze it, transform it, validate it, and split it into smaller chunks. Once you have the processed data, you build a model on top of it, optimize it, validate it, and see whether you can also train it at scale to make a very robust model. That whole training step can take multiple iterations, and even once the model is ready to deploy, deploying to the cloud and to the edge takes different kinds of operations. When you serve the model, how you take in incoming data for that model is also challenging. And monitoring, fine-tuning, and improving the model over time takes multiple iterations of this whole pipeline. So being able to reuse these pieces and run them as needed on top of the Machine Learning Exchange platform is very powerful.

With that, let me introduce some of the main capabilities of the Machine Learning Exchange and how you can leverage them in your own lifecycle. On the Machine Learning Exchange, anyone can view and download assets, and if you log in as a verified user you can actually execute them as well. Once you find a workflow you like and you have the data, you can just pick a pipeline, run it, and see the results. In this case, for example, you could run the AI pipeline that trains the model and verify whether the model is robust enough to serve in production, or fair enough to serve without any governance issues.

Next, as you build pipelines, you will probably want to find prebuilt components you can reuse, so you don't have to build the pipeline from scratch. Just like pipelines, you can view components, download them, and test them out by executing them as a single-component pipeline. As you can see in this example, say you want to test how this echo component puts out a string: you can simply test it by executing it, and MLX will wrap that component into a single-step pipeline. Here you can see the echo sample being executed, and, as expected, the component prints out the statement you passed in.
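As a hedged sketch of what such a single-component pipeline looks like in code, here is a minimal echo pipeline written with the Kubeflow Pipelines v1 SDK. The echo function, pipeline name, and base image are illustrative, not the actual MLX component; on a Tekton backend you would compile with kfp-tekton's TektonCompiler instead of the default compiler shown here.

```python
# Minimal single-component "echo" pipeline, assuming the KFP v1 SDK is installed.
import kfp
from kfp import dsl
from kfp.components import create_component_from_func

def echo(message: str) -> str:
    """Print and return the message, mirroring the echo component in the demo."""
    print(message)
    return message

# Wrap the Python function as a lightweight pipeline component
echo_op = create_component_from_func(echo, base_image="python:3.9")

@dsl.pipeline(name="echo-pipeline", description="Single-step echo pipeline")
def echo_pipeline(message: str = "Hello from MLX"):
    echo_op(message)

if __name__ == "__main__":
    # Compile to a workflow YAML that a Kubeflow Pipelines backend can run;
    # with kfp-tekton you would use kfp_tekton.compiler.TektonCompiler() instead.
    kfp.compiler.Compiler().compile(echo_pipeline, "echo_pipeline.yaml")
```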
Of course, once all these pipelines and workflows have been created and you run them, you will probably have generated some models. How do you actually store those models and share them? This is also handled by the Machine Learning Exchange, where you can register models once you prepackage them as either a container or binary files. The Machine Learning Exchange lets you deploy such a model as a standalone container or serve it on top of KServe, where you can also have multiple models inside a single container.

Last but not least, when it comes to data assets, it's very important for us to know where the data is coming from and to verify what license the data set is under. That is one of the main strengths of the Machine Learning Exchange. In this case you can see the CodeNet data set stored in the Machine Learning Exchange: it defines what kind of license applies and the metadata you need to understand where the data comes from and under what policies you can leverage it as you train models. Once you understand all of that, you can also deploy the data set, which means that behind the scenes Datashim, the platform we use, pulls the data set and caches it as needed. You don't have to pull the same data set from the internet every time; you can just have a cached volume, managed by Datashim, that can be reused across multiple training runs.

And on top of this, if you have a notebook that leverages whatever data set, model, or problem you have, you can also register it as an asset. Basically, a notebook is just a piece of code that you can mount onto any of the data sets we have created on the Machine Learning Exchange. Behind the scenes, the notebook in this case takes the data set, processes it, and creates a language classifier at the end. You can leverage the data set integration with the Machine Learning Exchange so the notebook can just take something from the internet and run with it; the whole integration is handled by MLX, and behind the scenes this execution of the notebook is powered by a Kubeflow pipeline.

With this, I just want to close with the kind of catalog the Machine Learning Exchange hosts right now: a list of pipelines, components, models, notebooks, and data sets that have been verified, with all of their licensing. Especially when it comes to the featured and public assets, we also verify them and make sure they are free to use in open source and can be leveraged without any governance issues. And with that, I will pass it back to Christian to show us a demo of how you can leverage the CodeNet data set on top of the Machine Learning Exchange.

Thank you, Tommy. So, jumping right in: this is how the Machine Learning Exchange looks when you open up the home screen. As Tommy just explained, there is a navigation menu where you can navigate through all the different asset types; data sets, models, and notebooks are what we're going to focus on in this quick demo. Under data sets, you see the variety of data sets that we have, and here you see Project CodeNet.
Since Project CodeNet is a very large data set, we have broken it down into smaller data sets that lend themselves better to particular tasks. In the Machine Learning Exchange, you can find a data set that's trimmed down for the language classifier and another one for the masked language model. Today I want to show the language classifier. Similar to the animated GIF that Tommy just showed in the presentation, you land on the description tab, where you find everything you need: where you can download the data set if you want to do your own experimentation, along with the license, a description, and a lot of links if you want to dive deeper. There's the YAML, which Tommy explained, and the related assets: there's a notebook that you can run with this data set.

If you want to create the PVC for this data set, you can launch it right from here. You provide the namespace you want the PVC to be created in and click Submit, and you will see the Kubeflow Pipelines run graph show up. The pipeline we started here has two steps: one generates the metadata that's required by Datashim, and then the persistent volume gets created. You can follow along by looking at the Logs tab here. In this case, I ran this just prior to the demo, so the data set has already been fetched. You can also check out the configuration and the run output, if there is any: here you can see the output that's been generated, the status and the name of the persistent volume in the kubeflow namespace. I'm going to copy that and go to our notebooks.

We have an accompanying notebook for each of the two data sets, the language classification one and the masked language model one. I will jump into the language classification notebook and take a quick look at the notebook preview. When you upload notebooks to the Machine Learning Exchange, we use the nbviewer plugin so that you get a rendered preview of the notebook. You can find out what the notebook does and what the output should be, and you can compare the output that was there when the notebook was created with the output you get when you run it. You don't have to download the notebook, though; you can also launch it right here from the Machine Learning Exchange. We're going to use that PVC with the cached data, mount it to a local directory in the pod, and click Submit. Now the pipeline gets launched and the notebook is run in a Kubernetes pod. We use the Papermill library and the Elyra notebook component. You get similar details here, and most interestingly you can follow along with the logs to see what is happening. Typically, the first thing a notebook needs to do is fetch all the required Python libraries, and downloading and installing them can take a while. So allow me to go to a previous run that I did just before the demo. The run typically takes about two to three minutes, and at the end you can see the logs for all of the individual cells: the input of each cell, the code block itself, and the output. For notebooks, we generate an output notebook that you can download directly from here and open up in a browser. Here you can see the notebook looks very similar to the nbviewer preview that we had earlier.
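Under the hood, that notebook run is driven by Papermill via the Elyra notebook component. For reference, here is a hedged sketch of what the equivalent standalone Papermill call looks like; the notebook filename and the DATA_DIR parameter are illustrative assumptions, not the exact MLX configuration.

```python
# Hedged sketch: execute a parameterized notebook with Papermill, roughly what
# MLX does behind the scenes. Filenames and parameters below are placeholders.
import papermill as pm

pm.execute_notebook(
    "language_classification.ipynb",         # input notebook (hypothetical filename)
    "language_classification_output.ipynb",  # output notebook with executed cells
    parameters={"DATA_DIR": "/mnt/codenet"}, # assumed mount point for the cached PVC
)
```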
The notebook takes the data, does some pre-processing, creates the training set and the test set, and does the training. You can see it goes through 20 epochs of training, and at the very end you can see the training and validation accuracy and the training loss. In the last couple of cells there is a test where we give the model we just trained some test data. In that last run that I did a few minutes ago, you can see that out of the 100 test samples across the 10 languages, most were classified correctly; only one C++ sample was misclassified.

Now, once we have that model trained, we can also serve it with MLX. You can go to the CodeNet language classification model; this is a containerized version of the language classification model we just trained. When you look at models here, as Tommy explained, you can see the description, which is a rendered GitHub-style README. You have the YAML file that is required to upload it to MLX, where you can find the actual container image and some other information, depending on the asset you're looking at, and you can see the code we use to run the pipeline. In this case, we want to serve it with Kubernetes, so we go straight to Launch and say we want to serve this model on the Kubernetes platform, leave the run name as it is, and click Submit. Same as before, you will see the Kubeflow Pipelines run graph come up. This time there are two tasks in the pipeline: one generates the configuration data we need, and the most important one is the actual deployment. Again, you can follow the logs if you want and look at the output, which will also tell you where the model has been deployed. Once it's been deployed, you can do port forwarding, open it in a browser, and see it's being served on localhost port 5000.

Each of the containerized models we have is Flask-based Python: there's a Flask-based API that shows you a Swagger-generated UI, and with that Swagger UI you can actually exercise the model's endpoints. For each of the models in the Machine Learning Exchange you can get some metadata, and you can even use it to make predictions. For this model, you can click Try it out and then provide some data. I did that before with some Haskell samples, so let's just pick one and click Execute. You can see the prediction output: the model is 99% sure it's Haskell, with a slight probability that it would have been Python or Java. Depending on how well your model is trained, and of course the size of your sample and various other factors, the prediction can be more or less accurate.
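If you prefer code over the Swagger UI, here is a hedged sketch of calling that port-forwarded prediction endpoint from Python. The /model/predict path follows the MAX-style model APIs these containers are modeled on, and the request field name is an assumption; check the Swagger page of the deployed model for the exact contract.

```python
# Hedged sketch: query the served language-classification model over its REST API.
# Endpoint path and payload shape are assumptions; verify them in the Swagger UI.
import requests

haskell_snippet = 'main :: IO ()\nmain = putStrLn "Hello"'

resp = requests.post(
    "http://localhost:5000/model/predict",   # port-forwarded service from the demo
    json={"code": haskell_snippet},          # field name is an assumption
    timeout=30,
)
print(resp.json())   # expect language probabilities, e.g. Haskell close to 0.99
```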
Back in MLX, you have seen that we used the Project CodeNet data set, ran our notebook, and then showed that we can even serve the trained model. Now, before I end, I want to show two GitHub repositories. One is for the Machine Learning Exchange, under github.com/machine-learning-exchange, where you can see all of the MLX source code and all of the sources for the assets in our catalog, including the notebooks that I just showed and even the notebook source code. The other, perhaps more interesting, project is the Project CodeNet repository, which you can find under github.com/IBM/Project_CodeNet. There you should find everything you need to know about the data set itself, the research papers, the benchmarks, and the tools you can use to process the data, and you should also be able to find the notebooks that I just showed you. With that, I want to end our presentation and say thank you. Tommy, I hope I didn't miss anything. Yeah, thank you very much, everyone. Thank you, and see you at the next conference.