Thanks, everyone, for coming to the last session of the day. My name is Tommy, and I'm a senior software developer at IBM. Today we're going to talk about how to train AI to code using CodeNet, the largest code dataset, which we have released publicly.

In the past few decades we have focused a lot on training AI on natural languages. But there are other kinds of languages that we use on a day-to-day basis: for molecules, for example, we have several different representations, and we expect the same thinking applies to code. So today we want to focus on how to build AI for code.

As part of a broad initiative at IBM Research, we have a new program called AI for Code, where we want to teach machines the language of machines. It is a very new field within computer science, and it leverages many core concepts such as NLP, document understanding, code analysis, and compilation techniques. The goal of the whole initiative is to create new models that help automate the software engineering process and improve developer productivity. In addition, we want to create tools that help software developers with practical tasks such as code search, summarization, completion, and code-to-code translation. With all these tools, we hope to accelerate the effort to modernize legacy software and migrate it to new microservice architectures.

Looking at the past, when ImageNet was first announced, it had about three million images and 5,000 classes. Within just a few years it expanded to 14 million images and around 22,000 classes, so the dataset grew very quickly. And with a big dataset like ImageNet, developers and scientists went from error rates above 30% at the beginning to human-level accuracy within about five years of development and research. We want to do the same thing with AI for code.

We also see a lot of potential business use cases: governments, banks, and automobile companies are still running legacy code such as COBOL. There are more than 220 billion lines of legacy COBOL code, and as we're all aware, we don't have many COBOL developers left who can translate it into modern technologies. That is why we need to develop tools to accelerate this effort, so we can migrate ancient legacy technology onto new architectures.

And when we look back at how this played out for natural language processing: deep learning took off in the early 2010s, and within just five or six years, with deep learning and GPU acceleration, research reached human-level accuracy. At IBM we were able to build many NLP-based services and document-understanding products using these new AI technologies. So we are going to apply the same technologies to code. That lets us do NLP and ML on software artifacts, along with automated reasoning and decision-making. Furthermore, we want to be able to explain how a piece of code was generated, so we are building explainable AI as well.
With all this, we need to create a new dataset, similar to ImageNet, in open source, so developers can use it for code language translation, code search, code similarity, code performance and memory improvements, code classification, and similar algorithms, and then apply it to each developer's own specific use cases. This is why we announced Project CodeNet. We released the Project CodeNet dataset to open source last year. It is a high-quality code dataset, annotated with rich metadata and benchmarks. It is very large scale: there are more than 14 million code samples and more than 4,000 coding problems, and it covers 55 different languages, from ancient legacy languages all the way to modern C++, Java, and Python. All the code is well tested, and the dataset also provides various test cases, so you can train on top of them and understand how the code ran and how it failed.

One potential use case we have seen is modernizing legacy code. One of our automotive clients had a very old stack of legacy Java: a monolithic application across multiple versions, with more than 45,000 Java files and more than a million lines of code, and we wanted to migrate it to a modern architecture. When we initially estimated how many developers and how much effort we would need, it would have taken us at least a year to migrate all this code to modern Java. That's why we initiated the AI for Code project and built models that accelerate the work and help our architects find the code that needs to be refactored. Those models helped our developers reduce the time from a year down to about four weeks to modernize all this code into 25 microservices and 450 Java classes running on the latest Java version. Furthermore, the AI can comprehend how many runtime and data dependencies the code has, and it exposes dead code that is no longer used or no longer suitable for the new application.

By open-sourcing Project CodeNet, we want open source developers to bring new algorithms and new methods for training on this data, and new ways to compute over it using techniques like data parallelism and model parallelism. Once there is enough research, anyone can put it into their AI system stack, feed in their own specialized datasets to do transfer learning, build their own models, and plug them into their own business pipelines. And in the end, once all that work is done, we can use these newly created models to deliver business value, such as modernizing legacy code and boosting developer productivity.

So now I want to dive into what the CodeNet dataset actually contains. As we described, it covers 55 different languages; most of the samples are in modern languages like C++, Python, Java, C, and Ruby.
But we also have ancient legacy languages like COBOL, which you can use for code translation and for understanding how legacy code maps onto modern coding problems. If you break it down by language, about 80% of the problems have more than 100 solutions across these 55 languages. More than half of the submissions were accepted, so they are actually working examples. But you also have the other half, which gives you examples with mistakes: wrong answers, runtime errors, memory errors. You can use those to figure out, when a developer writes one kind of solution, how to optimize it and point them toward a correct version.

As for how we collected this data: we worked with two online judge sites, AIZU Online Judge and AtCoder, to collect the submissions. The dataset contains more than 4,000 problems and close to 14 million submissions. As described before, more than half are accepted solutions; only about 30% are wrong answers, and 16% were rejected for other reasons such as memory errors, runtime errors, and so on. Among the 55 languages, the main six are C++, Python, Java, C, Ruby, and C#, and they come in different versions as well, so there are different versions of the C++ and Java solutions. For C++ there are more than eight million submissions, of which more than four million were accepted, so C++ is the biggest subset we have in CodeNet.

Breaking down how the data is organized: each sample is a computer program in a particular programming language, each program is a single file, and each one attempts a particular programming task, or problem. Most problems have multiple solutions, with different runtimes and different approaches to the same problem, across all the different languages. Finally, once we collected all this data, we made sure it is all released under the CDLA-Permissive-2.0 license defined by the Linux Foundation, so every open source developer can safely use it for development and research on their own models.

Let's dive a little into the metadata it provides. For each problem, we define things like an ID, a name, the time limit, the memory limit, and the complexity required for the problem. Then at the submission level, each submission records how much CPU time and memory it used and what accuracy it achieved, because we have a ton of test cases against which we can calculate how well a solution does. And once a submission is judged, whether it passes or not, we output a status. That is how you can filter and categorize the problems: a submission isn't just a pass or a fail; you can also determine whether it exceeded the time limit or the memory limit, or simply produced the wrong output.
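To make the metadata concrete, here is a minimal sketch of exploring it with pandas. The path and column names are assumptions based on the published dataset layout (per-problem CSVs with status, language, cpu_time, and memory columns); verify them against your copy of the dataset.

```python
# A minimal sketch of exploring CodeNet submission metadata with pandas.
# The path and column names are assumptions based on the published
# dataset layout; verify them against the Project CodeNet documentation.
import pandas as pd

# Each problem has its own metadata CSV, e.g. metadata/p00001.csv
df = pd.read_csv("Project_CodeNet/metadata/p00001.csv")

# Keep only the submissions that passed all test cases
accepted = df[df["status"] == "Accepted"]
print(f"acceptance rate: {len(accepted) / len(df):.1%}")

# Which languages solved this problem, and the fastest accepted C++ solutions
print(accepted["language"].value_counts().head())
cpp = accepted[accepted["language"] == "C++"]
print(cpp.sort_values("cpu_time")[["submission_id", "cpu_time", "memory"]].head())
```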
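And because the problems come with sample input and expected output, you can re-judge a submission yourself and reproduce a simplified Accepted or Wrong Answer verdict. A rough sketch, assuming the illustrative file locations below; the solution filename is hypothetical, so adjust everything to where your copy of the dataset keeps its test data.

```python
# A rough sketch of re-judging one Python submission against a problem's
# sample input and expected output. The file locations are illustrative
# assumptions; adjust them to your copy of the dataset.
import subprocess

solution = "Project_CodeNet/data/p00001/Python/s000123456.py"  # hypothetical file
input_file = "Project_CodeNet/derived/input_output/data/p00001/input.txt"
expected_file = "Project_CodeNet/derived/input_output/data/p00001/output.txt"

with open(input_file) as stdin:
    result = subprocess.run(
        ["python3", solution],
        stdin=stdin,
        capture_output=True,
        text=True,
        timeout=10,  # a crude stand-in for the problem's time limit
    )

with open(expected_file) as f:
    expected = f.read()

# Reproduce a simplified judge verdict
print("Accepted" if result.stdout.strip() == expected.strip() else "Wrong Answer")
```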
Furthermore, along with all this information, we also provide tools and examples to help you get started with the dataset. There are tools to generate different statistics from the dataset, like the ones you saw earlier, and tools to extract subsets of it. You can also convert the dataset into different formats: by default it is just code files, but you can pack them into an archive or export them as text files as well. We also provide different kinds of preprocessing for the source files, because depending on which model you want to train, you may want different preprocessing. We provide simple tools like a tokenizer to turn code into a stream of tokens; if you want to build a traditional abstract syntax tree, we have AST generation as well; and if you want control and data flow graphs, we have code analysis tooling to help you do that.

We have also put out a few initial experiments that anyone in open source can try. There is a simple graph neural network experiment that people can train in their deep learning framework. We also created a simple masked language model, so you can start with masked language modeling and build on top of different BERT models, plus a token-based similarity classification. For two of these experiments, the masked language model and the token-based similarity classification, we also provide very simple notebooks that guide you step by step, from preprocessing all the way to producing a very small model and testing it out, just so you can see how it works.

With all this information and tooling, we are aiming to expose the potential use cases of the dataset, because we can see AI for code helping developers do code classification, to know what language a piece of code is in; code similarity search, so when you want to tackle a particular problem you can search for it by topic; and source-to-source code translation, so when you migrate from legacy code or an old version of a language, it can automate part of the process and developers spend less time migrating code manually. And lastly, and maybe most importantly, we want to help developers write better and faster code. One technique is using natural language to generate code: you type something like "loop from two to five," and it generates a function that does that. Another is improving the performance and memory footprint of existing code: it analyzes the code, tells you the runtime complexity, and suggests a better solution.
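As a quick flavor of the preprocessing tools mentioned above, here is a small sketch that uses Python's built-in tokenize and ast modules as stand-ins for the dataset's own tooling, which covers many languages, not just Python.

```python
# A small sketch of two common preprocessing steps, using Python's
# standard library as a stand-in for CodeNet's preprocessing tools.
import ast
import io
import tokenize

source = "def add(a, b):\n    return a + b\n"

# 1. Turn the source file into a stream of tokens
tokens = [
    tok.string
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
    if tok.string.strip()
]
print(tokens)  # ['def', 'add', '(', 'a', ',', 'b', ')', ':', 'return', 'a', '+', 'b']

# 2. Parse the same source into an abstract syntax tree
tree = ast.parse(source)
print(ast.dump(tree, indent=2))
```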
And then, of course, just as importantly, the key is to help us find errors, debug existing code, and generate code tests, to make the code more robust in the long term.

Out in the wild, multiple existing applications are already using this open source dataset. In the IBM AI for Code stack at IBM Research, we are using it for our core research. But we also saw DeepMind's AlphaCode use CodeNet as one of its training datasets, and those datasets helped it reach roughly the performance of the median human competitive programmer, around the 50th percentile.

So let's dive into one of the use cases we have worked out at IBM Research. Last week, IBM Research announced a collaboration with Red Hat to create a new project called Project Wisdom. The core concept of the system is generating Red Hat Ansible Playbooks from plain English: you provide a simple English description, and it creates automation and infrastructure as code. The main focus for us right now is developing the foundation model; while keeping the accuracy high, we want to figure out how to reduce the number of parameters, so the new model stays cheap to serve and can be trained in a reasonable amount of time. As you can see on this page, when someone types a simple instruction like installing the Nginx and Node.js 12 packages on Red Hat, it generates the script for you, the infrastructure as code in Ansible, that runs the package install. This is a very useful tool for anyone who just wants to do a simple task but has no knowledge of infrastructure or shell scripting, and it can play a pretty big role in letting people build automation without relying too much on an automation team.
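To make the idea concrete, a playbook generated from a prompt like "install the nginx and node.js 12 packages" might look roughly like the sketch below. This is my own illustration of plain-English-to-Ansible using standard Ansible modules, not actual Project Wisdom output.

```yaml
# Illustration only: roughly what a playbook generated from the prompt
# "install the nginx and node.js 12 packages on RHEL" might look like.
# This is not actual Project Wisdom output.
- name: Install nginx and Node.js 12
  hosts: all
  become: true
  tasks:
    - name: Enable the Node.js 12 module stream
      ansible.builtin.dnf:
        name: "@nodejs:12"
        state: present

    - name: Install nginx
      ansible.builtin.dnf:
        name: nginx
        state: present
```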
Now, with all this research and all these tools out in open source, how do we actually share our findings, within our teams and with the community? For that, last year we announced a project called the Machine Learning Exchange. We worked with LF AI & Data and proposed it as an LF AI & Data sandbox project, and it is now hosted on LF AI infrastructure as well. The key concept is to provide a data and AI asset catalog. On top of that, we also integrated some execution engines, so anyone who wants to try things on their own machine and see how these assets execute can do so on their own cluster.

At a high level, the Machine Learning Exchange's main purpose is sharing these catalogs. We have different kinds of assets: pipelines, components, models, datasets, and notebooks. Those are the core components every data scientist needs to build and share across teams. The next step beyond the catalog, as we see it, is for when a data scientist just wants to try something out with a simple experiment; for that we leverage execution engines from open source, so they can do a simple run with whichever tool they choose.

For the pipeline engine, we leverage Kubeflow Pipelines, and because we actually run on OpenShift, we use the Tekton version of Kubeflow Pipelines, that is, the Red Hat-approved Tekton runtime for OpenShift. For the serving engine, we chose KServe, and we also give you the option of a plain Kubernetes Deployment; KServe is a very popular project for deploying serverless models on top of Kubernetes. For datasets, we use a project called Datashim, which helps you host your datasets in your local cluster and use them from your cluster nodes. And lastly, on top of these execution engines, we refined our data and AI catalog metadata and rebuilt it around a spec called MLSpec, also in open source.

With that, let me show you the default catalog we have on the Machine Learning Exchange in open source. We have some sample pipelines and components you can get started with. More importantly, we also have different models, datasets, and notebooks, so you can see what kinds of data and models people have trained and how to get started with a model using a notebook. You can see that datasets like Project CodeNet and the IBM Debater datasets are on the Machine Learning Exchange as well.

So let me go to the Machine Learning Exchange demo. This is the public website hosted with LF AI: if you go to ml-exchange.org, you should see the list of catalogs hosted on the Machine Learning Exchange. This is only the catalog page, because on the public side we don't enable the runtimes; but you can see which datasets you can download and which models you can use. If we look at the list of datasets, you can see Project CodeNet and different subsets of it, like the language-classification subsets. If we click on, say, Project CodeNet itself, you get different selections of the dataset. The full dataset contains several gigabytes of files, and you can download it in one go. We also have other data selections, if you just want to view the metadata or only need the benchmarks for Python, Java, or C++; those selections are previewed right here. And if you want to see how the metadata is built, say because you want to upload and propose your own dataset in open source, there is a preview of how the metadata is stored as YAML.

Similarly with the models: we have different models that you can run in containers or just upload as model files. For example, we have this CodeNet language classifier; let's click into it. Here we record which dataset was used: this model uses CodeNet, which is under the CDLA-Permissive-2.0 license, but for the model weights and the model code we made sure everything is under the Apache 2.0 license. So when you use this model to test or to play around, you know it is under an open source license as well.
And similarly with the pipelines: we leverage Kubeflow Pipelines behind the scenes, and you can see how to build a simple pipeline by taking multiple tasks and joining them together. So when a data scientist builds a full-blown pipeline that does different kinds of data preprocessing and different kinds of training, including how the training process is distributed, it can all be built as a single pipeline, uploaded here, and shared with different data science teams. Each of the pipeline components can also be shared in the components category; that category shows how each component is built, and you can connect them into a whole pipeline.

And lastly, the notebooks are where you can look at examples and see how they execute. Say we pick the Project CodeNet masked language model: you can see where the notebook is hosted on GitHub, but you can also use our backend engine, which renders the notebook for preview right on this page if you would rather stay here. For this masked language model example, you can see how you take a subset of the data from CodeNet, prepare the data, do some tokenization, and create a bare-bones BERT model. At the very end, it can train within about an hour using just CPUs; for this example we made it very small, so we train for only five epochs. It then shows some evaluation: say you want to predict a masked word, it shows how to rank the top-five predictions using a simple masked language model. That is an example of how you can start learning with and using the dataset.

So with this, if you find a model or a dataset that you want to work with, and you happen to have tried it out on the Machine Learning Exchange, you can deploy it on your own cluster and run it there using our integrated runtimes. For example, I have one instance deployed on my development cluster, and I have imported the same data catalog there. If I go to the CodeNet language classification model in my own cluster, with the execution runtimes enabled, I am able to launch this model on my Kubernetes cluster and try it out. Say I just want to launch this model as a plain Deployment container and test it: I simply click launch. Behind the scenes, depending on how you trained the model or want to serve it, you could have a preset pipeline that helps you fetch the data and train; but for this example we just have a simple pipeline that takes the model information and deploys it on our cluster using the latest image we have registered. Once the pipeline finishes, it shows us whether the Deployment is available. For this pipeline, I configured it to deploy with a NodePort on Kubernetes, so with a node IP and the NodePort, we can just try it out. The NodePort here is 31174, so I'll go there. This way, behind the scenes, the pipeline has exposed the model as a REST API server.
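A quick aside, going back to the masked-language-model notebook from a moment ago: its final evaluation step, ranking the top-five candidates for a masked token, looks much like this sketch. The sketch uses an off-the-shelf BERT through the Hugging Face transformers fill-mask pipeline rather than the notebook's own small CodeNet-trained model, so treat it as an illustration of the evaluation step, not the notebook itself.

```python
# A minimal sketch of top-5 masked-token prediction, using an
# off-the-shelf BERT via Hugging Face transformers. The CodeNet
# notebook trains its own small model on code instead, but its
# final evaluation step looks much like this.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Rank the five most likely tokens for the [MASK] position
for candidate in fill_mask("for i in [MASK](10): print(i)", top_k=5):
    print(f"{candidate['token_str']!r}  score={candidate['score']:.3f}")
```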
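And back to the model we just launched: since it is ultimately just a REST server on that NodePort, you can also call it programmatically instead of through a UI. A minimal sketch follows, assuming a hypothetical /predict route, node IP, and form-field name for my cluster; take the real values from the Swagger page on yours.

```python
# A sketch of calling the deployed language classifier directly.
# The host, port, route, and form-field name here are assumptions
# for my own cluster; read the real ones off the Swagger UI.
import requests

URL = "http://192.168.1.10:31174/predict"  # hypothetical node IP + NodePort

with open("example.py", "rb") as f:
    resp = requests.post(URL, files={"file": f})

# Expect something like the top-3 most likely languages with scores
print(resp.json())  # e.g. [{"language": "Python", "score": 0.7}, ...]
```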
So you can leverage this Swagger UI. This model is extremely simple: you can see its API has only two methods, GET and POST, and you can try the model out just by sending it some files. For example, let's submit a Python file here, and the example produces the top-three predictions for the language classification. You can see Python is the top prediction, with a score of almost 70%. From there you can look at how this model was built, and hopefully it gives you some ideas on how you could leverage something like this to enhance your user experience and build tools that improve your development cycle: migrating code, building code, and completing code.

So with this, let me summarize what we have discussed today. The main point is that we have open-sourced Project CodeNet, the high-quality, very large dataset we created. We have put it out in open source, and we want to see the open source community leverage it, build on it, and give us feedback. Hopefully, within the next five years, we will have very good tools that help us migrate different kinds of legacy code, enhance its output, and let us code in new ways, instead of the current way, where we have to go to, say, Stack Overflow to find a solution. That is the goal we are aiming for. And once we see that making progress, we can take some of those open source models and put them into production AI system stacks, and with a production AI system stack we can deliver business value, such as using these future models to do automatic code translation and modernize legacy code with very minimal effort. With that, thank you very much, everyone, for attending. Does anyone have any questions? Yes, please.

Yes, so in terms of public announcements, I would say the newest one is Project Wisdom. We worked with the Red Hat team to contribute new tools for code generation for automation and infrastructure as code. We hope to help Red Hat users do automation just by writing plain English text: you describe your workload, and it builds an Ansible playbook for you. That is one of the use cases. Another customer use case we have done is helping modernize old Java code into new Java code. Those are the use cases we currently have, but we also envision more use cases coming out of open source; hopefully people give us feedback, and we can use that feedback to enhance and improve this dataset over time. Sure, yeah, go ahead.

Let me repeat the question. The question is about how we can approach institutions like banks about migrating old legacy code like COBOL to modern technologies. I would say that at this moment, for the research and open source teams, we just want to leverage the different use cases, and a lot of the work we have done so far is still research and development.
When it comes to clients, we do have a client engineering team that works with different customers, but we don't have any particular data we can expose on that. Maybe the person behind you, yeah.

Let me repeat that one too: the question is how we envision this dataset bringing a real value proposition to real customers. I think we are still at an initial stage. As we have seen on other AI roadmaps, when a new dataset and new methods are created, it usually takes several years to reach more real-world, more sustainable use cases. So right now our main commitment is to show what the use cases are and to open source the dataset to the community, so anyone can use it without any risk. From there, as a research and open source team, we want to take in feedback from the open source community. As for whether we have particular customer requirements: at this moment it is based on whichever companies we work with and their internal feedback. But on the open source side, we want the community to show us, through research papers and open source use cases posted on developer blogs, and we will use that feedback to improve and build new models that leverage this dataset. Yes. Any other questions? If not... ah, go ahead.

Right. Right now we have 55 different languages in the initial dataset. As you can see, even for COBOL, which we described earlier, we have at least 100 solutions for many of the problems. From our initial investigation, we can see that a lot of the problems can be solved quite simply. But as we discussed, this dataset is really there for people to explore different use cases, and COBOL is something we are still working on, to see what kind of performance we can get. We are already getting feedback from research teams, and they are already producing examples like that. So I would say that initially, at least for code generation, we are seeing some good results, and we want the community to take it to the next level, toward more advanced code translation and code completion. Any other questions? Yes, please.

I would say more or less similar, yes. As you mentioned, OpenAI has leveraged similar concepts. Because OpenAI's work is proprietary, we don't know what code or models they use behind the scenes, so we cannot really say, but the concept is similar. What we want to address here is that there was no good dataset in the open source that is tested and comes with good, verified test cases. You could also scrape code from GitHub, but then you would need someone to label it and test everything for you. This dataset is aimed at exactly that: think of it as a benchmark for whatever models you build.
So when you create a new use case, you can use this dataset as a benchmark to test whether the code your model generates actually matches the solutions in our dataset, and use that to improve your model's accuracy. Are there any other questions? If not, thank you very much for attending.