Thanks everyone for joining. Welcome to today's session on training AI to code using the largest code dataset, Project CodeNet. I'm Tommy, a senior software engineer at IBM, mostly focused on open source and open source technologies. Today we're going to introduce this new CodeNet dataset, how to use it to train AI to code, and how it can help enhance your coding on a day-to-day basis.

We have seen that language is a very powerful tool, and NLP and many deep learning applications have made it much easier for humans to communicate with machines. We can apply the same kind of techniques to code, to human programming, as well: helping you complete lines of code, translate code between languages, and simplify or summarize the code you have written. The concept of AI for code is really about teaching the machine the language of machines, just as we use NLP to teach machines human languages. This field is still relatively new in computer science. It leverages a lot of traditional concepts such as NLP and document understanding, and more importantly it is able to draw on the code analysis and compilation techniques that have long been used for code. The goal we want to achieve with this dataset is to automate parts of the software engineering process and help developers be more productive; code-to-code translation could be one of the use cases. Use cases like that can be applied to business problems such as converting legacy software to modern code, so you can migrate old legacy software much more easily.

If we look at the machine learning field, back in 2009 when ImageNet was introduced, there were around three million images with 5,000 classes. By 2012 that had grown to around 14 million images and 22,000 classes. With that amount of data, people were able to develop new algorithms using deep learning, and over time, within a span of about five years, they could develop models able to perform human-level tasks. By 2015, with the new deep learning frameworks, models could actually beat humans on particular image classification benchmarks. And just as deep learning has helped image-based models, we want to apply it to code-based models as well.

A lot of old infrastructure, especially in government, still has a great deal of code written in COBOL, and there is a real need to migrate it onto modern technology and modern stacks. That is why we need to be able to build new models that can help translate that code into modern code. And if we take a step back and look at how NLP was done back when the first DNNs were introduced, we started applying them to more and more human natural language tasks.
Over a span of roughly five years, as we put in more compute power and more data, and with more research, researchers were able to develop better models that perform at a human level. What AI for code needs is an ImageNet-like dataset that can support code language translation, code search, code similarity, code performance improvement, code memory improvement, and finally code classification as well.

So now we introduce Project CodeNet. Project CodeNet is one of the very largest code datasets; we introduced it last year. It contains about 14 million code samples across 4,000-plus coding problems, covers 55 programming languages, and amounts to roughly half a billion lines of code. The code is very diverse and, more importantly, it also includes a lot of test cases, so you can actually use them to verify whether a piece of code runs correctly, and the dataset also records how much CPU and memory each submission used.

Legacy code matters here too. In application modernization, there is a lot of code that needs to be rewritten after decades of use, and rewriting decades of code usually takes years to finish. In one example of applying AI to code modernization, building a model that helps automate the process brought the migration of a decades-old codebase to a modern Java version down to about four weeks, versus roughly a year to do it otherwise.

Now that we have introduced Project CodeNet, we also need to dive into what AI systems we can apply around it, so data scientists can build new tools and create new pipelines to automate these models and help create new business value, for example converting legacy code into modern code and boosting developer productivity as well.

Let's dive into the different languages CodeNet provides. Looking at the distribution, most of the submissions are in C, C++, Python, and Java. It also includes very old legacy languages like COBOL, so you can actually experiment with legacy code migration, and there are multiple versions of Java among the submissions as well, so you could migrate from an older Java version to a newer Java version using this dataset. The dataset has 4,000-plus problems, and more than 80% of those problems have at least 100 solutions across languages. As you can see in the diagram, more than 2,000 of the problems have more than 500 solutions across those languages, so the solutions are very diverse, and you can use this dataset for a lot of different use cases.

We can also break down where the data actually comes from. It mainly comes from two online judge websites, AIZU Online Judge and AtCoder. There are about 4,000 problems with roughly 14 million submissions; about half of the submissions are categorized as accepted, which means the solution actually works for that particular problem, around 30% are wrong answers, and roughly 15% are rejected for other reasons.
It has 55 different languages, but more than 95% of the submissions are in the main six languages: C++, Python, Java, C, Ruby, and C#, with C++ having the most submissions, numbering in the millions. More importantly, if you compare CodeNet to the other datasets out there, like GCJ and POJ, CodeNet has significantly more problems and more languages, and it includes a lot of test cases: about half of the problems have fairly robust test cases. It also has measurements of memory, runtime, and what kind of error each submission produced, so you can actually use it for memory consumption estimation, runtime performance prediction, and error prediction as well.

In terms of metadata, at the top level there is metadata describing the dataset itself, such as its name. More importantly, each problem has its own metadata: the time limit, memory limit, and complexity for each problem. And when a user submits a solution to a problem, each submission is broken down further, so you can see which language was submitted, the date, the CPU time and memory used, the accuracy, and the code size of the submission. When a user submits code, there are various submission status values: not only can you see code where all the test cases passed, you can also see what kind of error occurred when a submission did not pass. With this information we not only know when something is coded incorrectly, we can also tell when something is coded inefficiently, costing a lot of runtime or memory, and we can use that information to build models that focus on optimizing code instead of just doing code translation and code generation.
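As a rough illustration of how you might explore that per-problem submission metadata with pandas — the file path and column names here are assumptions based on the fields just described, so check the Project CodeNet documentation for the exact schema:

```python
# Minimal sketch: exploring CodeNet-style submission metadata with pandas.
# The CSV path and column names are illustrative assumptions, not guaranteed
# to match the released dataset exactly.
import pandas as pd

# Hypothetical per-problem metadata file (one row per submission)
meta = pd.read_csv("Project_CodeNet/metadata/p00001.csv")

# Keep only accepted C++ submissions for this problem
accepted_cpp = meta[(meta["status"] == "Accepted") & (meta["language"] == "C++")]

# Compare resource usage to spot efficient vs. inefficient solutions
print(accepted_cpp[["submission_id", "cpu_time", "memory", "code_size"]]
      .sort_values("cpu_time")
      .head())
```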
In addition to that, the CodeNet project also provides a lot of tools to help you navigate this big dataset. Aside from the dataset itself, in the CodeNet GitHub repository we also provide statistics, so you can see breakdowns of how the code was submitted and how much memory is being used. We also provide ways for you to make selections: say you want to try out language classification or masking-based models, we have subsets of the dataset selected for you, so you can start exploring with those subsets and understand the data better before you jump into the whole multi-million-submission dataset. Furthermore, we provide code for basic NLP-style tasks, for example a tokenizer to help you tokenize the code, so you know how to mask it if you want to build a more traditional NLP-style language model. We also have AST generation to build a traditional abstract syntax tree, so you can understand how the code is structured, and control and data flow graph construction for code analysis, so you can relate things like CPU runtime to a more detailed breakdown of how the code is computed. And we also provide some examples if you just want to start by training a simple deep learning model.

We have a graph neural network experiment, so you can create simple GNN models. Alongside that, we provide a masked language model use case, where you mask a particular token of a submission and the model tries to predict that token. Say you mask a parenthesis; the model should try to autocomplete that parenthesis. That helps with things like code completion and code generation. We also have notebooks that show how to create this masked language model and how to do language classification as well.

In addition to that, we open sourced all of this so the world can create more potential use cases. The use cases we show are things like code classification, but we also want to see how this can expand to code similarity search and source-to-source code translation, so it can help translate between different languages or different versions of a language, and also optimize a language's runtime. You can already see some existing applications: the IBM AI for Code stack we provide actually uses the CodeNet dataset behind the scenes as its source, and if you look at the research paper for DeepMind's AlphaCode, they also cite CodeNet as one of their training data sources. So CodeNet is genuinely useful and provides useful test cases that you can leverage to build your own code-oriented NLP models.

For now I'm just going to show you one very simple demo of how you can leverage CodeNet to build a masked language model. Let me make it bigger; I think that should be good enough. Just a short introduction first: in a masked language model, you take a piece of code, mask out different kinds of tokens within it, and train the model to guess whether it can generate a token that matches the original. This is the foundation for code generation, code translation, and code completion: you need some kind of masking-based model to guess what the next code token should be, and this is one of the basic models that helps you do that.

We're just going to go over this notebook as a simple example of how to leverage the CodeNet dataset. In the first cell, we import TensorFlow and the other necessary libraries such as pandas and NumPy. Then we take a look at the dataset. In this case we focus on the C programming language within CodeNet, so we're taking just a subset of the CodeNet dataset to train on. The full CodeNet dataset is multi-terabyte data, but you can take a subset, one of the pre-made selections, to get started with. One of them is the masked language model selection; it is very small, containing thousands of examples, so you can start with it and create your first masked language model. Now let's break down this subset of the CodeNet dataset, which is focused on the C language.
First of all, when we prepare this dataset, one thing we do is sort the tokens into five token classes. In C you have keywords, functions, identifiers, punctuation, and C preprocessor symbols, so we put the tokens into those five classes so we know what category of syntax we want to generate. Next we create a TextVectorization layer in Keras so we can tokenize every single token in the code, because when you first get the code it is just a text file, so you want to tokenize it first. Later on we apply a mask to the tokens, so in the model you can mask different tokens and use them as training targets; that is the preprocessing that produces the masked inputs and labels.

Once we have preprocessed all the data — tokenized it and masked it — we build a very simple BERT-like model. It's a simplified version of BERT, just for the purpose of demoing a simple masked language model; you could take this to a more advanced use case and train it on GPUs as well. With this we create a very simple BERT-style layer and train it on about 50,000 examples with a batch size of 32 for 5 epochs. Trained on a typical CPU machine, probably just the one in your laptop, it takes around an hour. For this demo, once it has trained for 5 epochs the loss is acceptable, so we just stop there; for the purpose of this demo that gives us enough accuracy to play around with.

Once we have trained the model, we can evaluate its accuracy. With this subset, trained for 5 epochs on 50k examples and evaluated on another 5,000 test examples, we can look at the masked tokens and the top five predictions for each. Say something is missing — a void token, or just an empty token — these are the top five categories the model predicts. In terms of accuracy, the top-1 accuracy gets to about 92%, and if we count a prediction as right when the correct answer is within the top five guesses, it's close to 99%. So this is actually very useful, and it can help you do code completion if you extend this model to a more advanced use case.
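To make the shape of that pipeline concrete, here is a minimal, self-contained sketch of the same idea in Keras: TextVectorization for tokenizing, random masking, and a tiny transformer encoder standing in for the simplified BERT layer. The vocabulary size, masking rate, toy input, and hyperparameters are illustrative assumptions, not the values used in the actual CodeNet notebook:

```python
# Minimal masked-language-model sketch over code tokens with TensorFlow/Keras.
# Toy data and hyperparameters are illustrative; the CodeNet notebook uses its
# own preprocessing and a much larger corpus.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 5000   # token vocabulary (keywords, identifiers, punctuation, ...)
SEQ_LEN = 128       # tokens per code sample
EMBED_DIM = 64
MASK_ID = 1         # reuse the OOV index as a stand-in [MASK] id for this sketch

# Tokenize raw source strings into fixed-length integer sequences
vectorizer = layers.TextVectorization(
    max_tokens=VOCAB_SIZE,
    standardize=None,                 # keep punctuation tokens like '{' and ';'
    output_mode="int",
    output_sequence_length=SEQ_LEN)
code_samples = ["int main ( ) { return 0 ; }"]   # stand-in for CodeNet C files
vectorizer.adapt(code_samples)
token_ids = vectorizer(tf.constant(code_samples)).numpy()

# Randomly mask ~15% of the (non-padding) tokens; labels keep the original ids
rng = np.random.default_rng(0)
mask = (rng.random(token_ids.shape) < 0.15) & (token_ids != 0)
inputs = np.where(mask, MASK_ID, token_ids)
labels = token_ids

# A tiny transformer encoder in place of the simplified BERT layer
tok_in = layers.Input(shape=(SEQ_LEN,), dtype="int64")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(tok_in)
attn = layers.MultiHeadAttention(num_heads=2, key_dim=EMBED_DIM)(x, x)
x = layers.LayerNormalization()(x + attn)
x = layers.Dense(EMBED_DIM, activation="relu")(x)
out = layers.Dense(VOCAB_SIZE, activation="softmax")(x)  # per-position token prediction

model = tf.keras.Model(tok_in, out)
# For simplicity the loss covers every position; a real MLM would weight only
# the masked positions (e.g. via sample_weight).
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(inputs, labels, batch_size=32, epochs=1)
```

In the real notebook you would train on the tens of thousands of C files in the selection rather than a single toy string, and then evaluate top-1 and top-5 accuracy on held-out samples as described above.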
So with this, we know how data scientists can leverage these datasets to train models and develop notebooks. But how do data scientists share their use cases, their research, and their model development with other researchers, so they can combine and leverage multiple models to create new use cases? The next thing I want to introduce is a platform called Machine Learning Exchange. It's one of the incubated projects in LF AI. The concept of Machine Learning Exchange is to have a catalog of AI assets: it includes datasets, models, and pipelines.

Data scientists can create different kinds of models on it; when they have preprocessed datasets or created a new dataset selection, they can upload those as well. And at the very end, once a data scientist has created a pipeline that automates the whole process, that can also be uploaded to Machine Learning Exchange, so other data scientists can just take those pipelines and apply them in their own development. And if the business use case is relevant, the business side can also see that a model is interesting or that a pipeline helps create new business value, and just take it and move it into production.

As you can see, Machine Learning Exchange includes multiple kinds of assets, mainly pipelines, datasets, notebooks, and models. Behind the scenes, Machine Learning Exchange runs on Kubernetes and OpenShift in a microservice architecture. The main API layer hosts all the assets: the model metadata, dataset metadata, notebook metadata, and so on. And if someone wants to execute a pipeline and see how all these assets work together, behind the scenes we use Kubeflow Pipelines on Tekton. The reason we chose the Tekton version of Kubeflow Pipelines is that we also want to support OpenShift, which helps us run AI pipelines on multiple platforms. We also leverage a serving engine: when a model has been uploaded and you just want to try it out, we use the serving engine called KServe. It used to be a Kubeflow project and has now also moved to LF AI; it's a serverless model serving platform that runs on Kubernetes. So everything you serve and run on Machine Learning Exchange is actually running on Kubernetes. Furthermore, for dataset management we also have a project called Datashim that handles data management: whenever you leverage the CodeNet datasets on Kubernetes, behind the scenes we use Datashim to cache the dataset on the cluster and then do the processing on the Kubernetes cluster.

With this, let me introduce the list of catalogs we have on Machine Learning Exchange. By default, in the instance we host with LF AI, we have a list of pipelines, components, models, datasets, and notebooks that we have verified and that run with all the compliance checks complete. The highlight we want to show here is Project CodeNet: we have Project CodeNet as part of our dataset catalog, and we also have several examples, such as the masked language model and the code language classifier, in the Machine Learning Exchange catalogs.

So with this, let me show you how Machine Learning Exchange is hosted and how you can leverage it to host your assets and share them with your data scientists. If you go to the public hosting website, ml-exchange.org, it is hosted on the LF AI & Data infrastructure. On that website you can view all the catalogs we have available; for Project CodeNet, for example, you can see where the dataset comes from, view the whole set of dataset files, and download them from there.
You can also see the list of selections we have created, so if you just want a subset of CodeNet, these are the selections you can use, for example for Python benchmarking or Java benchmarking, and we also have two datasets just for doing masked language modeling and language classification.

With the models, as you can see, we have a list of models; for today let's focus on the CodeNet language classifier. For each model, the catalog also lists which dataset was used: in this case you can see it uses the Project CodeNet training data and the code classification application example. It also shows you how to run it on your local machine, with a way to deploy it on your local container runtime, and if you have a Kubernetes cluster you can apply it directly there as well.

The next asset type is pipelines. Behind the scenes we are leveraging Kubeflow Pipelines on Tekton to run these. When someone has created a pipeline, they can register it here and run it, if they want to experiment with how AI pipelines run on Kubernetes and how they automate their lifecycle. As you can see, it creates a DAG and shows you, for example, how it downloads two different assets and joins them together. To build these kinds of pipelines you need individual components that are combined together. We have three different components here, and if you want to drag in and use one of the pre-created components, there is a list of components you can use; in this case you can see one of the examples just echoes a code file. We also have other components to help you evaluate your models — if you want to check out other LF AI projects, for example for explainability, those components help automate that process — and we have uploaded them to Machine Learning Exchange as well.
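To give a rough idea of what defining one of these pipelines looks like with the Kubeflow Pipelines SDK and the Tekton compiler, here is a toy single-step pipeline in the spirit of that echo component. This is a hand-written approximation, not the actual MLX component code, and assumes the kfp v1-style SDK with the kfp-tekton add-on:

```python
# Toy one-step Kubeflow pipeline compiled for Tekton (an approximation, not the
# real MLX pipeline definition).
from kfp import dsl
from kfp_tekton.compiler import TektonCompiler  # kfp-tekton backend

@dsl.pipeline(name="echo-code-file", description="Toy pipeline that echoes a message")
def echo_pipeline(msg: str = "hello from CodeNet"):
    # Single container step; MLX components are registered from specs like this
    dsl.ContainerOp(name="echo", image="alpine:3.18", command=["echo"], arguments=[msg])

if __name__ == "__main__":
    # Emits a Tekton PipelineRun YAML that could be registered in the MLX catalog
    TektonCompiler().compile(echo_pipeline, "echo_pipeline.yaml")
```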
Lastly, if data scientists have some development work — say they developed a model and someone wants to see how that model was developed — they can upload their notebooks and show their development process. The notebook we have just uploaded, which you can check out, is the Project CodeNet masked language model. We also have a server that helps you visualize how the notebook is hosted, so you can see this is the exact same notebook I just ran and showed during the previous demo, and you can check it all out on ml-exchange.org. This is a hosted catalog website, so you cannot run anything at runtime there, because we cannot handle a lot of anonymous users. But you can also run Machine Learning Exchange on your private cloud or on your own infrastructure, with the execution runtime enabled, so you can actually run models and pipelines, serve your models, and test them out. I have an instance running on my own infrastructure, so in this case, if we go to the models with runtime access enabled, I can actually launch and serve this model on top of my Kubernetes cluster.

I simply submit, and behind the scenes it uses the pipeline system to automate locating the model image and serving it on top of my Kubernetes cluster, automating the whole process for me. First it gathers the model config and figures out where the model is hosted. Then, because this model is packaged as a container image, it takes that image, creates a Kubernetes Deployment, and hosts it on my Kubernetes cluster. After it creates the Deployment, it also gives me a deployment URL and creates a Service, so I can actually use it and test it out.

Once the model is deployed, we can see that it includes a Swagger API, so we use that Swagger API to test how the model works. This model is a language classification model; it covers ten different language categories. To try it, I can simply click "Try it out" and upload a coding file in any language. In this case I'll upload a simple Python file, a setup.py from one of my SDK projects. When I upload it and execute, you can see the model classify which language the file is: in this case it's about 72% confident that it is Python, about 26% Haskell, and the rest Java. In a use case like this, once you classify the code as Python, you could route it to Python-specific automation such as Python auto code generation, and so on. That is one of those use cases, and you can upload this kind of model to Machine Learning Exchange and let other data scientists know about it. If someone is interested, they can leverage it and expand it to more advanced business value; they can just check it out and apply it in their organization.
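Programmatically, hitting that served endpoint would look something like the sketch below. The URL, route, form field, and response shape are hypothetical placeholders; the real route is whatever the deployed model's Swagger UI documents:

```python
# Hypothetical call to the served language-classification model's REST API.
# Endpoint path, upload field name, and response format are assumptions for
# illustration only.
import requests

url = "http://<deployment-url>/predict"  # URL reported by MLX after serving
with open("setup.py", "rb") as f:
    resp = requests.post(url, files={"file": f})

print(resp.json())
# e.g. {"Python": 0.72, "Haskell": 0.26, "Java": 0.02, ...}
```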
With this, let me summarize what we have discussed today. The main thing we wanted to show today is Project CodeNet: we have open sourced high-quality code assets to help innovate and benchmark AI for code, and to do more for developers. We also introduced an open source AI systems stack, Machine Learning Exchange, to help data scientists and developers exchange the assets they develop around CodeNet, so they can improve them and create more use cases that make developers' jobs easier. These new use cases can also generate new business value: for legacy code, for example, when researchers apply this data to code language translation, the business side can see it as potential business value and use such a model to migrate legacy or outdated code into a more modernized version. That is the power of the CodeNet dataset, and having a shareable AI systems stack helps developers and data scientists enhance the ecosystem.

That is the end of the session. Are there any questions from the audience? Go ahead. "For example, I have a very simple use case as part of a CI/CD pipeline. What is this exactly? Is it more that I write my code and it completes it, or does it generate code, or do I give it simple pseudocode and it creates the code for me? How does it work?" I see.

So the question is how this dataset can help you in your day-to-day work, say in CI/CD, whether it can help you auto-generate code or help you optimize code. The main focus here is just the dataset, so there is a wide range of use cases you could build. Of course, a very common use case, like what I showed with the masked language model, is to help you complete your code: if you type one line of code, it tries to predict the next line you are going to write. That's code completion. And there are use cases like code optimization, where you have a whole function of code and it helps you measure the runtime and try to optimize it; that is another use case we see in research. As more users come along, we will see more use cases generated, especially with the new concept of foundation models: with a large dataset like this, you could create a single model that applies to multiple use cases like completion and summarization.

To be clear, we are not open sourcing a completion model; we are open sourcing the dataset, and we are just showing several use cases you could build with it. For a completion model, you could, for example, combine multiple deep learning models: one masking model that tokenizes the code and predicts the most likely next tokens, and then a generative model that uses that output to generate new code, complete code, or summarize code. Those could be separate models, or, in the newer research, foundation models have become very popular as a way to generalize all these NLP tasks, so that could be another approach a lot of researchers have taken. But as with a lot of research, some of those are still proofs of concept; that's why we provide this CodeNet dataset, so people can create more proofs of concept and demonstrate new ways of building these kinds of models. Yeah, go ahead.

"Have you guys given any thought to running this over Git repos versus CodeNet?" So we did look at the code-repository scenario. If you scrape code from code repos, you just don't have a good idea of how the test cases are run: when you pull code from GitHub, you have to guess whether a given test case actually verifies a given function, because you cannot really scan all the CI/CD to see how it ran behind the scenes. You can only get the code out of GitHub; it could be under-maintained, it could be untested. That is one of the challenges. "If you run them in a linear fashion from the beginning towards the end, you get the ability to find — especially if it's something like a kernel where they use Fixes tags and such — you can start mapping fixes and commits in the future to bugs in the past, and then go back and correlate what's buggy and what's good, and use that to help train. It would seem like that would be a good place to start." Right.
I mean, that is definitely a very good use case, tracking different commits like that. I think right now there is no good automated way to generate that kind of dataset. With CodeNet, the idea is that we collaborate with these coding challenge websites, so we know the environment the code ran in; it's a much more controlled environment, so it's relatively easy to collect all of this data, compared with what you would get from GitHub. With GitHub there are more challenges: you need developers to understand the different Git commits, be able to reason about them, and label them into useful datasets. I think that is one of the challenges of collecting this kind of data from GitHub. Are there any other questions? Yeah, if not — yeah, you could go ahead.

"Okay, so my next question is: instead of doing natural language processing, have you thought about doing language-specific pre-processing to break the code down into its actual elements, like the preprocessor elements an actual compiler would see, and then process it that way? Would that help, or would that work?" So that could be a very interesting use case, doing it at the pre-processing or compiler level. Right now, on the open source side, we have focused on developing and delivering the dataset itself, and we do want people like you to bring us more use cases and give us more ideas of how to apply this dataset. On the compiler side specifically, we don't have use cases for that yet, so that could be a very interesting one. Feel free to submit issues or let us know on GitHub and we will take a look into it.

Are there any other questions? Yes, go ahead. Oh, yeah — if you want to deploy Machine Learning Exchange, you can just go to the Machine Learning Exchange website; I can show you afterwards. I think it's at ml-exchange.org. The only thing you need is a Kubernetes cluster and access to the public network. Everything is open source and we host everything on a public registry, so you can just download it; we have a customizable installation and also a plain YAML installation, so you can just apply it and it will install in your cluster. Yeah, thank you.

Yeah, go ahead. Oh, yeah — did I say half of the problems have test cases, and are they robust test cases? Yes, and you could apply that use case as well. That's the kind of thing we want to draw out of users: you could also, for example, generate tests and verify them. We're hoping developers will give us more ideas of how to apply this dataset to various different use cases.

Are there any other questions? Yes. Yeah, I think I can show the list of assets we have published. For the officially published assets, we actually have to go over each dataset and make sure its licensing is compliant with the open source licenses. Behind the scenes we work with lawyers to make sure we can open source it without running into legal problems.
So that's why this is a very manual process: when you want to publish a model, or especially a dataset that involves potentially sensitive data, it has to go through a consent and approval process where it is approved, checked for compliance with the open source licenses, and run through all the different open source processes.

For your question about the existing licensing: I don't actually work with the licensing team directly. Initially we came up with a list of licenses, like Apache 2.0 and MIT, which are the common licenses used in open source, and we gave that to the licensing team to see which datasets we could open source under those licenses. That's the approach we took. Going the other way around, where you have a dataset and you want it certified under some particular license, I think that is more challenging; you might have to dig deeper into how that data is being used or processed, so I would have to follow up with the licensing team and get back to you.

Are there any other questions? If not, thank you very much for joining the session. Thanks, everyone. Yes. Thank you.