Hello everyone, thanks for being here. My name is Carlo Curino, and today I have the pleasure of presenting the Gray Systems Lab vision for Enterprise Grade Machine Learning. First of all, who we are: the Gray Systems Lab is an applied research group. We are embedded inside the product side of Microsoft, inside Azure Data, and we focus on research in the areas of systems, databases, and machine learning. The work we do ends up leading to improvements in Microsoft products, open source contributions, and academic publications.

The rest of this keynote is organized as follows. I'm going to spend about 14 minutes describing the motivation for Enterprise Grade Machine Learning and defining the problem. Next, I'm going to talk about our ongoing work in this area: work on improving data preparation, model optimization, building better model scoring infrastructure, and machine learning provenance.

What is the future of machine learning in the enterprise? To answer this question, let's go back to the past: why was machine learning so successful? The class of problems we tackled was a small number of very valuable ones, things like search ranking, recommender systems, and spam detection. The nice property of these problems was not only that they were really, really high value, but also that we only cared about accuracy on average. What I mean by this is: if a couple of links on your search page are out of order, it's annoying, but it's not a big deal, as long as your average experience using the search engine is a very good one. And what were the solutions for this kind of problem? Well, we had a huge amount of data, so we threw a huge amount of hardware at the problem, and literally thousands of PhDs and software engineers. And what did they do? They built custom big data infrastructure (big data didn't exist before these problems), they built custom machine learning models, and they built custom pipelines. And all of this was worth it, because each one of these problems was worth billions of dollars.

If you go one step further and look at today's machine learning, the class of problems we tackle is broader. This was made possible by great advances on the algorithmic side of machine learning, great advances in hardware, and a huge availability of data. So the class of problems is still really high value, but now includes things like computer vision, speech understanding, and so forth. The argument about accuracy being okay on average is not as true anymore. If Siri misunderstands you one time in ten, it's annoying, but it's not a problem. If a Tesla on autopilot jumps off a bridge one time in ten that you travel over it, it's quite a bit more than annoying. So what do the solutions look like today? We've seen a shift towards using lots of specialized hardware, and typically doing so in the cloud, for a very good reason: the GPT-3 model was trained on thousands of GPUs, and you don't necessarily want to purchase thousands of GPUs. It's really convenient to rent them for a short period of time during the training phase, after which the model will be used many, many times for scoring. Another advantage is that big data tools have been largely standardized over the last decade. We've spent a huge amount of effort making them easier to use and easier to understand, and providing services around them.
So fundamentally, the Hadoops, Sparks, and Kubernetes of the world are now a commodity, and everyone has access to them. We're seeing something similar happening in the world of machine learning libraries. Our team ran an analysis of all of the Python notebooks on GitHub, about nine million of them by now, and we are observing a semi-standardization: the number of libraries that people use keeps on growing, but the top few libraries are becoming more and more prominent, and this is a sign of standardization happening. As a consequence, we throw only hundreds of PhDs and software engineers at each problem. Custom pipelines are still a thing, but things are progressing in that space as well.

So what, finally, is our projection for the future of machine learning? We see machine learning being absolutely everywhere: millions of applications being built, with lower individual payouts. Obviously there cannot be millions of applications each worth billions, but each application will still be worth millions of dollars, so the overall footprint is going to be absolutely huge. We're also seeing machine learning penetrating more and more high-stakes domains: medical, military, and financial. As for the trends in the solutions: the move towards the cloud is going to continue, and we're going to use more hardware, and more specialized hardware, in the cloud. The standardization will also continue; our expectation is that within a few years we're going to have absolutely standardized big data systems, libraries, and pipeline mechanisms. And as a consequence, the teams that build these applications are going to be actually quite small, maybe one to ten analysts or software engineers, and they will typically have lower expertise.

Now, do you see a problem with this? Did you notice what is scary here? Well, we're saying that we're going to have millions of high-stakes applications being built by low-expertise teams. So the problem is: machine learning is too good not to use, but too dangerous to use as-is today. We have this great desire to leverage machine learning, because the wins are really super good, but it's very dangerous. Let me make this concrete with an example. This example comes from Rich Caruana, one of the world's experts in explainable machine learning, and it is in fact one of the early examples in his career that really motivated his passion for this problem. Rich at the time was looking at pneumonia risk (with coronavirus these days, this is actually a very poignant example). He was using machine learning, neural networks and other models, to predict whether a certain case of pneumonia is high risk, so we should keep the patient under observation in the hospital, or low risk, so we can send the patient home, maybe just with antibiotics. And the models he was able to build were really, really accurate: over 95% accuracy. This is fantastic, right? Any lower-expertise team would have simply deployed them, because these models do better than most doctors out there. But Rich is a great scientist, and he spent a lot of time trying to really understand what the model was picking up. Why was it so accurate? And the models were picking up some weird things. Things like: if you have asthma, you are typically correlated with a low-risk prediction at the end. But every medical doctor will tell you that's not true; asthma is actually a bad comorbidity for pneumonia.
And so, in fact, you tend to be high risk. But what the model was picking up was exactly this: as soon as you walk into an ER with pneumonia, if you have asthma, the nurses and the doctors get really scared. They run the absolute best tests, give you the strongest drugs, and keep you under observation. And as a consequence, you end up not dying of pneumonia. So this is actually quite problematic: if low-expertise teams build the same class of models that Rich was building, they will probably miss this, deploy it in production, and potentially kill people.

So we're going to argue that what we need here is enterprise grade machine learning. And what I mean by this is: we need machine learning to become much more of a mature engineering discipline, and we're quite far from that. So why is this not just software engineering and data management? We've done those before, and what is machine learning, after all? I grab some data and I massage it with some software. Well, the challenge is exactly that. A machine learning model, the product of machine learning training, is actually a piece of software, a function, that is made out of data. And this is the first time this has ever happened. Before, we had data that, through queries, produced more data, and software that was written by engineers, so we could do all the classical things. But now we have this weird mix of software made out of data.

So multiple communities need to come together to help us build enterprise grade machine learning. On one side, the software engineering community needs to help us understand the moral equivalent of a code review, because there is no software engineer who can reply to the comments I make on this kind of code. What is CI/CD and versioning? What is unit testing? What is debugging? What do packaging and deployment look like for software that is continuously evolving, because the data changes and I retrain it? As an example, I like FindBugs: a fantastic tool for Java that can detect probable mistakes. We don't have this for machine learning today, and yet there are probably very common mistakes that data scientists make every day. Next, the machine learning and systems communities will keep on delivering great algorithms, libraries, and scalable infrastructure, but they are also finally jumping in and really trying to tackle the problems of explainability, of fairness and bias, broadly speaking the responsible AI movement. And the database and knowledge-base communities have a huge responsibility: a large portion of the work of machine learning and data science, as I'm going to describe in a few minutes, is really data massaging and data transformation. Lots of data prep, lots of versioning, lots of understanding the semantics of data, lots of governance and provenance. And finally, we have examples of customers asking us for ACID transactions for deploying large collections of models. So there is a lot that the database and knowledge-base communities can do in this space.

In summary, we've been talking about enterprise grade machine learning, and the problem is that we have an industry-wide liability: we are moving towards a world where very low-expertise teams will build millions of machine-learning-enabled applications, many of which may end up being critical applications. The solution we propose is to build, and to do so in a hurry, the tools and practices that make machine learning enterprise grade.
Now that we've established that EGML is something desirable, what is our path towards it? We discussed this in our CIDR 2020 paper. First of all, we need to start from the users. What do users want? What is the dream for data science? What we hear all the time, when we talk with the data scientists in our team, or other data scientists around Microsoft or outside, is: "I'd like to work on my laptop and focus on modeling." This is a very common sentiment, and it's very reasonable. They are asking us to let them focus on the part of their job that is high value and really meaningful, the modeling aspects of the problem, and not all the fussing over finding data, interpreting and understanding the data, operationalizing models, and so on.

But let's see what the reality is. I'm going to describe one of the classical machine learning loops, or data science loops. It's going to be similar to many you have seen, but I'm going to try to highlight a few important aspects: the places where the database and knowledge-base communities can help, and where our team has invested. So let's use our pneumonia-risk example. The application logic here is probably a portal where nurses and doctors can read information about a patient or input new information about a patient, typically backed by some relational database. This data, together with some other reference data, is fed into a model development and training experience. This is where the data scientist really comes into play and spends, or would like to spend, most of their time: the component in the middle, the model training. But as you see, it is surrounded by everything else. Typically this is an iterative process where we featurize the data, we transform the data, we try to interpret and understand what the data means, and then we turn it into a model, and this keeps on looping until we get a good model. Once we have a model with pretty good accuracy for the problem, we need to optimize the model, to make it efficient to run, and this is a separate step. And this is not the only remaining part: once we have these models, we need to deploy them, typically on a different infrastructure that can serve the application live, online. Now, this offline/online split I'm describing is not necessary, but it's a reasonably common one, where development and training happen offline and model scoring happens online. So when new live data comes from the application, in this case someone walking into the ER with a suspected case of pneumonia, the doctors and nurses will take blood pressure and temperature measurements and so on, and input them into the system. We then need to featurize this data using the same featurization we developed during the model development and training steps, and run the new data against the model to get a prediction. This is what is referred to as model scoring. Now, the prediction typically takes the form of a floating-point number representing the probability of being part of one category or another. Around that, we need to wrap policies that translate this mathematical domain back into the real-world domain and make a decision.
In this case, advise the doctor or the nurse whether we should send the patient home with antibiotics or keep them in the hospital. And if I were to stop here, this would be functionally correct, but it would not be enterprise grade. What we need next are all the governance dimensions: access control, data catalogs, tracking of models and training data and the provenance of all of it, logs and telemetry. All of that is on the slide, around that one small component in the middle, and it's all necessary to build a good, enterprise grade data science experience. And we know this because about half of our team actually spends its time building data science experiences around our systems at Microsoft, so we know that we spend a lot of our time on things that are not just the model training. That's maybe the 10% of the problem.

So how do we make this better? In the remainder of this keynote, I will describe several projects that pick specific sub-problems appearing on this slide and try to make them better. Now, this is by no means an exhaustive list; there's a huge amount of work to be done here. We just selected a few problems where we had a good idea, or a particular hook, to try to make things better. This is basically a long list of possible things that can be done, and it's your opportunity to come and help build enterprise grade machine learning.

But let me start with model development and training, and in particular with the featurization step, the initial data transformation step. Remember, our data scientists would like to work on their laptops. So the typical approach is: they take whatever data is in a DBMS of any sort, dump it out to a CSV, and that is then read into pandas and transformed and operated upon in many different ways. Now, this works, but it's problematic in many different ways. First of all, in terms of performance, databases are really bad at outputting large amounts of data, and since we're going to filter and project and change the data later anyway, we end up dumping out very large tables, not just the small portions of a table that are the ones of interest. Next, there's a scalability issue: pandas is a main-memory-only system, limited by how much RAM it has, while a DBMS is very well equipped to operate on large amounts of data. Next, there's a set of security concerns: we are now creating copies of potentially very sensitive data that will float around different devices and different laptops and be stored in different ways. There are availability concerns as well, because a large, complicated set of tools comes into play, plus data freshness, reproducibility, and all sorts of other issues that come from this very manual set of processes. So ideally, what we would like is to have pandas APIs at the very top and push the computation down onto the big, beefy, enterprise grade engines that we have built over the many years. In this case, I'm describing the Azure data engines, but you can imagine in general a series of databases and big data solutions. To figure out how we do this, let me start by showing you a demo. This is Venkatesh Emani presenting the PyFroid project. I'm going to talk about the features of PyFroid, which is a Python library that can transparently translate computations in pandas to be run at the database.
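The demo script itself is not reproduced in this transcript; what follows is a rough reconstruction of the pandas pattern the demo walks through, with all table, column, and connection names invented for illustration.

```python
# A sketch (names invented) of the kind of script the demo describes: read
# from the database, then filter, join, project, and aggregate in pandas.
import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine("mssql+pyodbc://hospital_dsn")
patients = pd.read_sql("SELECT * FROM patients", engine)   # full table pull
tests = pd.read_sql("SELECT * FROM blood_tests", engine)

df = patients.merge(tests, on="patient_id")                # join
df = df[df["age"] > 35]                                    # filter
df = df[["ward", "length_of_stay"]]                        # projection
print(df.groupby("ward")["length_of_stay"].mean())         # group-by aggregate
```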
If you see the script here, it reads data from a database and then performs a few operations such as filter, join, projection, group-by aggregation, and so on. When this program is normally run, the data is fetched into the memory of the client, and then, at each step, the data frame is manipulated iteratively. Now let us enable PyFroid. This is as simple as changing a single line. Now when we run this program, instead of the data being fetched from the database, PyFroid has translated lines 11 through 30 into a single query: the merge operator has been translated into a database join, the filter condition has been translated into the WHERE clause, and so on. Not all pandas computations can be translated into SQL. For example, pandas has the interpolate function, which can fill in missing values based on other values in the data. In such cases, what PyFroid does is extract these untranslatable operations into a Python UDF that can run at the database. When this script is run, you can see that PyFroid has translated lines 12 and 13 into a query with a WHERE clause, and the interpolate function has been put inside a Python UDF that is then run at the database along with the query.

So what we've seen here is the pandas API running on top of an enterprise grade backend. How does this work? Fundamentally, we dynamically wrap pandas, so that we can fall back onto pandas when we don't know what to do, and we accumulate an expression throughout the execution. Only when it is strictly necessary to materialize do we actually push the query down, lazily evaluating it in the underlying engine. To do this, we leverage Ibis, which already supports pushing relational-style expressions down into a collection of different backends. This way we also inherit a very interesting property: we can take the same set of pandas operations and map them onto Spark, or Postgres, or SQL Server, and so on, multiple different backends. And here comes the fun research part: automating the choice of backend. This is current, ongoing work, and I'm only going to show you one result. If you take the same set of pandas operations and translate them down into Spark or SCOPE, which is Microsoft's internal big data engine, you get different performance profiles. At small scale, around one gigabyte, Spark actually runs faster, because it has a faster compilation step. At larger scales, 10 or 100 gigabytes, a terabyte, SCOPE is substantially more performant, and there is a point at which Spark simply doesn't scale, while SCOPE can easily run at the petabyte scale. This hints that both the set of backends and the choice among backends is an interesting research problem, and that while pandas alone would never be able to operate at multi-terabyte scale, it can easily push the work down to SCOPE or Spark and operate at that scale.
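PyFroid itself is an internal library, so as a stand-in, here is a minimal sketch of the same lazy push-down idea written directly against Ibis, the layer mentioned above; the connection, table, and column names are invented.

```python
# A minimal Ibis sketch of the lazy push-down idea. Nothing runs while the
# expression is built; .execute() compiles it to one query in the backend.
import ibis

con = ibis.duckdb.connect("hospital.duckdb")   # could be Spark, Postgres, ...
patients = con.table("patients")
tests = con.table("blood_tests")

joined = patients.join(tests, "patient_id")    # pandas merge -> SQL JOIN
expr = (
    joined
    .filter(joined.age > 35)                   # boolean mask -> WHERE
    .group_by("ward")
    .aggregate(avg_stay=joined.length_of_stay.mean())  # groupby -> GROUP BY
)

print(ibis.to_sql(expr))   # inspect the single generated query
df = expr.execute()        # only now does the engine run; returns a DataFrame
```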
Next, we look at model optimization. This is the process of taking an already trained model and optimizing it, improving its form, so that it is more efficient for the scoring and inference part. The opportunity here comes from the big split between the great popularity of classical machine learning models, both broadly and specifically on enterprise data, and the bulk of industry investment, which is going into neural networks. What's happening is that classical ML accounts for about 80% of the machine learning commonly used on GitHub, from our analysis of all the GitHub notebooks, and simple operators like featurizers and linear and tree models work really, really well on the tabular and text data that are common in the enterprise. But the bulk of the investment is happening on neural networks: a lot of investment in specialized inference chips, the Graphcore, SambaNova, AWS Inferentia, Intel Nervana of the world, the TPU, and a lot of great investment in runtimes and compiler techniques, like TVM, TensorRT, and ONNX Runtime, that make neural networks incredibly efficient on top of this new hardware. So the question is: can we take advantage of these investments in the world of classical ML? Or, putting the question differently: can we translate an already trained classical ML model into a tensor representation, so that we can then use the inference chips and compilers built for neural networks? It turns out the answer is yes. This is work we published at OSDI 2020, and we have open-sourced it; it is available on GitHub right now.

I don't have time to get into all of the details and describe all of the different techniques, how we transform all sorts of featurizers and models, so I'm going to make one example. I'll take a very simple decision tree like this one, just a few nodes operating on a six-dimensional input space, show you how we can turn it into a neural network, and show you some results. In this simple representation, we have six features as input, and each one of the nodes has been trained; we decided, for example, that if feature X3 is greater than 5.1, we should go down one branch of the tree rather than the other. For simplicity, I'm going to annotate the graph: I'll call the internal nodes of the graph N1 through N4 and the leaf nodes L1 through L5. Now, what is a good representation of this as a neural net? Well, we can automate this: take every feature and put it in the input layer, every internal node in a second layer, and every leaf in a third layer of the neural net. Now, how are we going to wire this together? Remember, this is just a translation; we don't want to retrain the neural network, we literally want to copy over what we have trained for the decision tree. So we put a weight of one between a feature and a node that uses that feature in its internal condition: for example, node N1 has a weight of one from feature X3. We also set biases corresponding to the threshold values that were used: if the condition was X3 greater than 5.1, we now have a minus 5.1 bias on node N1, and N1, you can imagine, then simply computes the greater-than operation, a pretty simple comparison operator. Then we wire the intermediate nodes to the leaf nodes, so that a leaf node L1 is connected to all of the nodes on the path from that leaf back to the root, N2 and N1 in this case, and the leaf basically computes an AND of all of them. Notice that, as a result, only one of the leaves will be active at any one time, so we can simply use the final values we had on the leaves as weights from each leaf to the final result; a small numeric sketch of this construction follows.
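Here is a toy numpy sketch of that construction (this is not the Hummingbird code itself). The tree shape, the thresholds, and every feature assignment other than N1's X3 > 5.1 test are invented for illustration.

```python
import numpy as np

# Toy tree-to-tensor translation: 6 features, internal nodes N1..N4 in one
# layer, leaves L1..L5 in the next, exactly as described above.
X = np.array([[1.0, 0.2, 3.0, 7.0, 0.0, 1.5]])   # one input row, 6 features

# Layer 1: weight 1 between each node and the feature it tests, learned
# thresholds as biases. N1 tests X3 > 5.1 (0-indexed column 3 here); the
# other nodes' features and thresholds are made up.
W1 = np.zeros((6, 4))
W1[3, 0] = 1.0   # N1 reads X3
W1[0, 1] = 1.0   # N2 reads X0  (assumption)
W1[2, 2] = 1.0   # N3 reads X2  (assumption)
W1[5, 3] = 1.0   # N4 reads X5  (assumption)
thresholds = np.array([5.1, 0.5, 2.0, 1.0])
node_true = (X @ W1 > thresholds).astype(float)  # each node's condition

# Layer 2: each leaf ANDs the decisions on its root-to-leaf path. Encode
# paths as +1 (node must be true), -1 (must be false), 0 (not on path);
# a leaf fires only when it matches its full path length.
P = np.array([[-1, -1,  0,  0],    # L1: N1 false, N2 false (paths invented)
              [-1,  1,  0,  0],    # L2: N1 false, N2 true
              [ 1,  0, -1,  0],    # L3: N1 true,  N3 false
              [ 1,  0,  1, -1],    # L4: N1 true,  N3 true, N4 false
              [ 1,  0,  1,  1]]).T # L5: N1 true,  N3 true, N4 true
path_len = np.abs(P).sum(axis=0)
leaf_active = (((2 * node_true - 1) @ P) == path_len).astype(float)

# Layer 3: the trained leaf values become the output weights.
leaf_values = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
print(leaf_active @ leaf_values)   # exactly one leaf is active: prints [0.5]
```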
And this produces exactly the result the decision tree would have reached, an equivalent result. But notice that there is something very peculiar here: we have exploded the computation. Instead of a very sparse traversal of a tree, which only evaluates the conditions that are needed, we are computing all of the nodes in the tree and all of the paths in the tree. But the big advantage is that we can do so in parallel: we can leverage SIMD, GPU, and FPGA parallelism to compute all of the conditions on all of the intermediate nodes in parallel, and all of the paths in parallel. And it turns out this produces really good results. Let me show you how this surfaces to a user through a demo. This is Karla Saur presenting Hummingbird.

For this demo, we're working with Hummingbird and scikit-learn's random forest classifier. We're going to use the breast cancer dataset, and we're going to create a random forest classifier with 1,000 trees. First, we'll time this using regular scikit-learn by calling predict on this data, and we can see that it takes around 114 milliseconds. Now, we import the Hummingbird library, which we've already installed with pip install, and we convert the model that we made with scikit-learn to PyTorch using Hummingbird. Now we can see how long PyTorch takes. At first we'll run in CPU mode, and it takes around 21.9 milliseconds. And now we switch from CPU to GPU mode using the PyTorch command, and we can see that the same model on the GPU runs much faster, in only 1.33 milliseconds, which is around a 100x speedup over scikit-learn.

So to show you more results, let's look at four different datasets, comparing Hummingbird against random forest, XGBoost, and LightGBM, some of the most popular machine learning models. In this case, we're comparing their CPU mode against the best GPU mode, also leveraging the TVM compiler, and we're seeing speedups in the range of 10 to 100x from using Hummingbird. So the idea is: yes, we can translate classical machine learning models into neural networks, into tensor computations, and we can get a great performance boost from operating in this form.
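Reconstructed from the demo narration, this is roughly what the notebook does; convert and .to("cuda") are Hummingbird's public API, while the exact timings will of course vary by machine (a CUDA GPU is needed for the last step).

```python
# A sketch of the demo: scikit-learn scoring vs. the Hummingbird translation.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from hummingbird.ml import convert

X, y = load_breast_cancer(return_X_y=True)
skl_model = RandomForestClassifier(n_estimators=1000).fit(X, y)
skl_model.predict(X)                      # baseline: ~114 ms in the demo

hb_model = convert(skl_model, "pytorch")  # translate trees to tensor ops
hb_model.predict(X)                       # PyTorch, CPU: ~21.9 ms in the demo

hb_model.to("cuda")                       # move the translated model to GPU
hb_model.predict(X)                       # ~1.33 ms in the demo, ~100x overall
```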
Next, we're going to talk about model scoring: the problem of taking an already trained model and serving its answers. The most common way of doing this is to build a structure that looks like this: a web server where our application logic runs. Let's jump back to our example of the ER web portal for nurses and doctors. When a patient comes in, the nurse operates the browser and sends in some information about this patient. We look up more data about the patient to construct a feature vector, which is then sent to a container where we have deployed our model. This is a very common way of deploying models: wrap the model up in a container and put a REST endpoint in front of it. The information is sent there, we get an answer back, our policies will probably look up some more information from the database, and finally we give the answer back to the nurse. And if we have multiple models deployed in our system, we simply deploy more containers.

Now, the problem with this is that we have no enterprise features. The security model is quite problematic: we have data and models living outside the DBMS, which is typically one of the good perimeters for security. We also have extra infrastructure: all these collections of containers need to be deployed, kept up and running, and updated for security, and so on and so forth. So overall, we have a higher total cost of ownership, because there is a more complex infrastructure in place, and we also lack good tools and good practices, because this is a new set of infrastructure that users need to get used to, learn to operate, monitor, and so on. Moreover, in terms of performance, we're taking data and moving it outside the database. There is a latency problem, because of the many round trips over the network, but even if we were performing a batch, throughput-oriented operation, pulling a lot of data out of the database is generally not good for throughput.

So what is the alternative? Well, one option is to move everything inside the DBMS: essentially, supporting the featurization and model scoring inside the DBMS itself. This is possible using something like ONNX Runtime, or PyTorch, or TensorFlow, embedded in the DBMS. What do we gain from this? Well, for sure, we get better enterprise features, because we now have the same security envelope and the same set of enterprise features as the database. For example, the high availability, disaster recovery, backups, and tamper-proof logs that are available for the database are now available for the models too, because the models are stored inside the database and scored inside the database. Second, we reuse a lot of the infrastructure, so we have a lower total cost of ownership, and we can also reuse a series of tools, best practices, and monitoring mechanisms that are available for data. Now, the question remains, though: is this good for performance? Can we match, or even beat, state-of-the-art performance in this scenario? And this is exactly where the Raven project comes into play; this is a CIDR paper we have presented. Let me walk you through an example.

First of all, doing this in the DBMS is possible today. Let's take a query finding pregnant patients who are expected to stay in the hospital for more than a week; this could be an administrator in the hospital trying to determine how to assign rooms. This query is valid SQL Server syntax. It looks big, but it's actually reasonably simple, and it has three sections. First, we extract a model: we have stored a model inside a table as a binary value, and now we assign it to this @model variable. Next, we define a variable, data, that is simply a query joining three different tables: patient information, blood tests, and prenatal tests. Then we use the PREDICT keyword to ask for the duration-of-stay model we selected to be applied against the data query, producing a new column, which is the length of stay. Then we can do the standard things we do in a query: we can, for example, filter only for pregnant patients, and then we can filter on the length of stay, which was the output of applying the model to the data. Now, how do we make this go fast? We take this query, and of course we could just execute it as is, but we want to do more than that.
First, we extract an internal representation, an IR, an intermediate representation, that can capture both the relational part of this query, for example the filter on pregnant patients and the joins between the three base tables, and the series of transformations embedded inside the model. We basically make the model a glass box instead of a black box: we can see inside it and observe the entire collection of transformation operations across both the model and the query. And once we have this, we can apply a series of transformations and produce a more optimized plan: a plan for this query that produces the same results but is now more efficient, by changing both the model and the query. I will show you some examples of how this is done. Finally, once we have a good plan, we deploy it with code generation at runtime, producing both a transformed ONNX Runtime model and SQL Server code. Now, this is obviously an example with SQL Server and ONNX, but the approach is not specific to them; in fact, we also support Spark. So there are two main key ideas here: one is cross-optimization between SQL and machine learning, which I'm going to talk about in more detail, and the other is hosting a high-performance machine learning inference engine directly inside the SQL Server process.

So let's look at the cross-optimizations. The first one we look at is predicate-based model pruning. At the very bottom of our query, we are selecting only pregnant patients. So we can now look inside the decision tree classifier and notice that some of the nodes might be splitting on exactly that, pregnancy as a feature of the model, and so portions of the model can be pruned, because they are never going to be used in the context of this query. Now, for the convenience of this explanation, this happens to be the very root of my decision tree, so the entire right branch of the tree is now unnecessary. Second, as a result of applying this first rule, or simply because the model learned that certain features are not that important, we might have several columns that are either unused by the model or, in a linear regression, perhaps multiplied by zero. In this case, we can push those down as projections and modify the model not to use them at all. This reduces the amount of data we need to move out of the relational portion of the system and into the model, and pushing down projections is generally always a good optimization. Next, we can do model splitting. We can observe the model and find that portions of it are reasonably simple while others are more complex, and that they apply to different sets of data. In particular, if the simple portion of the model applies to a large amount of the data, this can give us an advantage in terms of performance. In this case, we observe that, based on age, we can split the model: for patients younger than 35 a simpler model applies, and for patients older than 35 a more complex model needs to be applied. Then we can perform model inlining. This is a very interesting trick. If you're familiar with the Froid paper, this is the UDF inlining we have implemented in SQL Server, which takes UDFs and transforms them into pure SQL, and it is possible for a pretty broad set of UDFs. We can do the same for a simple decision tree classifier: if the model is small enough, we can inline it as a CASE expression, a set of if-then-elses, directly in SQL Server syntax, and this allows the optimizer to further optimize things below; a toy sketch of the idea follows.
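As a toy illustration of model inlining (this is not Raven's actual code generator), the sketch below walks a small scikit-learn decision tree and emits an equivalent SQL CASE expression; the f0, f1, ... column naming is an assumption.

```python
# Toy sketch: inline a small decision tree as a SQL CASE expression, so the
# database optimizer can reason about it like any other piece of SQL.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

def tree_to_sql(t, node=0):
    if t.children_left[node] == -1:            # leaf: emit the predicted class
        return str(int(t.value[node].argmax()))
    cond = f"f{t.feature[node]} <= {t.threshold[node]:.4f}"
    left = tree_to_sql(t, t.children_left[node])
    right = tree_to_sql(t, t.children_right[node])
    return f"CASE WHEN {cond} THEN {left} ELSE {right} END"

X, y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(tree_to_sql(clf.tree_))  # e.g. CASE WHEN f2 <= 2.4500 THEN 0 ELSE ... END
```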
In the example I'm showing, a couple of the branches can then be removed based on other information the system has underneath. Next, we can perform the neural network translation. This is the example I showed before with the Hummingbird project: we can take portions of this decision tree classifier and turn them into neural networks, so that we can use the powerful inference chips or GPUs to run them more effectively. And finally, we can apply standard database optimizations. In this example, the database has already applied some join pruning: the left branch of our split, age under 35, does not use data from the prenatal tests, so we do not need to join that table before performing the model operations, and so on and so forth. There are a lot of optimizations that can now kick in. And finally, there are compiler optimizations, like the ones TVM would apply, that make our neural network more efficient and give us further speedups. By combining all of this, we're going to get really good performance. Let's see this on a Spark-plus-ONNX implementation.

Here is Raven on Spark, running inside a modified HDInsight cluster with a Jupyter notebook. First, we'll show the data that we'll be working with. This hospital dataset has 24 columns, nine numeric features and 15 categorical features, and around 302,000 rows. Here's the ONNX model that we'll be working with. First, on the left, are the nine numeric features; notice that the zero in the dimension of the inputs indicates the batch dimension (in our implementation, we use batches of 10k inputs). The nine numeric features are pre-processed by a standard scaler operator, and on the right, the 15 categorical features are pre-processed by a one-hot encoder. These features are input into a tree ensemble classifier. Going back to our notebook, we'll start the Raven session, and for the first run, we'll show the query without any of our Raven optimizations. Notice that we have introduced a new PREDICT operator in Spark that allows us to execute ONNX models by invoking ONNX Runtime; it takes as input the model and the data that we'll be using for scoring. For all of our sessions, we use Arrow and batch inference to speed up scoring. Now, with a little bit of video speed-up for the sake of time, we can see that this runs in 11 seconds without any optimizations, and here you can see the output. Taking a look at our query plan in the Spark History Server, you can see how our model is invoked by passing it all 24 features. Now we will show you the model projection push-down Raven optimization in practice, using the same model, data, and query as before. With a bit of video speed-up, you see that this query runs in 8 seconds, which is 3 seconds faster than the version without optimizations, and here we show the data. If we go to the Spark History Server and look at the query plan, we can see that Raven is using the optimized model and that we pass only 16 features instead of all 24. Taking a look at the optimized model, you can see that only 7 categorical features are used, instead of 15 as before. Finally, we enable the ML-to-SQL optimization on top of the model projection push-down.
We don't need video speed-up now, as this runs in about 1 second. Again, we show the data to verify it's there. Now, back in the Spark History Server, we see very little except the SQL query, which runs entirely in Spark without any need for ONNX Runtime. So, in summary, cross-optimizations are really useful for improving performance. Here I'm showing two in particular: model projection push-down, where I'm showing a 5x speedup, and model inlining, including some form of pruning, where we get about a 24x speedup.

Next, we're going to look at governance, and in particular at the problem of tracking provenance for machine learning models. Why would you do that? Well, let's start with a very simple example. Recently, Twitter was in the news because its image-cropping algorithm was racist: given an image with both a white and a black man, it would always pick the white man, regardless of their positions in the image or the size of the picture. That night, some engineer at Twitter probably spent the evening panicking, trying to figure out where the problem was. Was it in the input data? In the training? In the featurization? Did the model optimization mess something up? Did we deploy the right model? Which model was deployed? Which data was it trained on? Were the policies maybe messing things up? Are the images being used in production the same kind of images the model was trained on? All sorts of questions. Being able to track provenance would have been fundamental to answering the questions: why is the model racist, and can we fix it fast? Now, Twitter probably made the right decision in deciding not to try to fix the model in a hurry: they provided a feature where users can simply do the cropping manually, while they presumably try to figure out the model as a first priority. But tracking the provenance of the model was definitely a fundamental step in answering these questions. And this is not just for the debugging scenarios I described: it's important for compliance, auditing, and security; it's important for monitoring and reacting to changes in the data; and it matters for anything that has to do with responsible AI, so explainability, reproducibility, fairness, bias, transparency, and so on. We ran a survey across several companies, and there is general appreciation of this problem. So we believe that tracking ML provenance is something really important.

And you may wonder: why don't you just do this by hand? MLflow, for example, allows you to track your models and track your inputs; you can simply declare what your dependencies are, and you're more or less done. Well, the problem, again, is that this is not easy for users to do, it's easy to make mistakes in doing it, and it definitely does not feel like "I'm operating on my laptop and I only focus on the model." So we want to automate this process, to guarantee that the tracking is actually faithful. So the question is: can we automate provenance tracking for data science? Let's take an example, and this is a toy example, a simple training script. We read a CSV; we drop a few of the columns that are irrelevant, or that might be PII, private information about the user; we split between training and testing; we mark one of the columns as the target, the label we're trying to learn; and finally, we train a CatBoost classifier. The system we have built, named Vamsa, is going to take this and perform derivation extraction; a toy sketch of this first step follows.
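To make that first step concrete, here is a toy sketch, far simpler than Vamsa itself, that parses a script like the one just described using Python's ast module and records which variables feed which: the edges of the dataflow graph. It assumes simple, single-target assignments.

```python
# Toy derivation extraction: walk the script's AST and collect def-use edges
# (who feeds whom), a tiny version of Vamsa's dataflow graph.
import ast

script = """
df = pd.read_csv('patients.csv')
X = df.drop(columns=['name', 'ssn'])    # drop irrelevant / PII columns
y = df['label']                         # the target we are trying to learn
model = CatBoostClassifier().fit(X, y)
"""

class DefUse(ast.NodeVisitor):
    def __init__(self):
        self.edges = []
    def visit_Assign(self, node):       # assumes simple `name = expr` form
        target = node.targets[0].id
        for n in ast.walk(node.value):
            if isinstance(n, ast.Name):
                self.edges.append((n.id, target))

v = DefUse()
v.visit(ast.parse(script))
print(v.edges)  # edges like ('df', 'X'), ('df', 'y'), ('X', 'model'), ...
```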
Fundamentally, derivation extraction means building a workflow of all of the dependencies inside the code. From this, we run an ML analyzer that uses a knowledge base to understand what each one of the steps of this code really means. What are the semantics? Is this a model being trained, or is this just a print-out? This produces an annotated workflow, which a provenance tracker then traverses to construct the final, very compact provenance. So let's see what the derivation extraction looks like. In this step, we take the code, extract an AST, and then convert this AST into a dataflow graph. As you can see here, at the very top we have the input CSV, and from there we go down, through all of the transformations, to the final fitted model at the very bottom. What we want to discover is which columns of that input CSV made it all the way through to the fitted model. Obviously this example is very small, but that might not be the case in large scenarios. Now, this step is a static analysis, and static analysis typically loses some precision due to for loops, if statements, and generally dynamic programming constructs. Fortunately, what we observe in our data science analysis of GitHub notebooks is that the vast majority of notebooks are pretty linear code, so we have a pretty good hope of being able to do this through static analysis, or at least get a fair way into the problem space.

Next, the ML analyzer walks this graph, and for every node it encounters, it does a lookup in the knowledge base to see whether we understand what the operation is. So for example, when we look at the fit function of the CatBoostClassifier object of the CatBoost library, we have an entry in the knowledge base that says: ah, I understand it, this is the fit operation of a model, and the inputs are, respectively, a set of features and the labels. So now we know that train_x is a set of features, and we can walk backwards through the graph, all the way to the beginning of the program, again understanding each one of the steps because we understand what each transformation means. We have now annotated the entire static analysis. And finally, once we have this annotated workflow, we can fundamentally walk the graph, identify what part of the input made it all the way to the output, and produce a reduced representation of the provenance.

Now, to verify that this makes sense, we took about 300 scripts from both Kaggle competitions and the GitHub notebooks I mentioned a few times, and we manually annotated every single step in each script, measuring how well we identify feature exclusion and inclusion and label inclusion, as well as the precision in identifying models and training datasets. And here you can see the results are in the 90%-plus range. And finally, to observe coverage, we took the entire Kaggle and notebook datasets, about 29,000 scripts that were complete enough for us to perform this analysis, and we ran all of the transformation steps; we can track full provenance for 75 to 80% of the scripts. And notice that this has been done with a very small knowledge base; to make the knowledge-base idea concrete, a toy sketch of what an entry might look like follows.
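The real Vamsa knowledge-base format is not shown in the talk, so this is purely an illustrative guess at its shape: a mapping from a library call to the semantics of the operation and the roles of its arguments.

```python
# Toy knowledge-base sketch: map a library call to its semantics, so the
# ML analyzer can annotate nodes of the dataflow graph.
KNOWLEDGE_BASE = {
    ("catboost", "CatBoostClassifier.fit"): {
        "operation": "train_model",
        "args": {0: "features", 1: "labels"},   # positional-argument roles
    },
    ("pandas", "read_csv"): {
        "operation": "load_dataset",
        "args": {0: "path"},
    },
}

def lookup(library, call, arg_position):
    """Return the semantic role of an argument, or None if unknown."""
    entry = KNOWLEDGE_BASE.get((library, call))
    return entry["args"].get(arg_position) if entry else None

print(lookup("catboost", "CatBoostClassifier.fit", 0))   # -> 'features'
```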
We spent only a limited amount of time building a knowledge base that covers maybe one ten-thousandth of all the libraries available in GitHub notebooks, and yet we were able to reach about 80% coverage with just that.

So far, I have described a few steps towards enterprise grade machine learning. However, we believe there is a very long way to go. We talked about some steps in data preparation and featurization, some steps in model optimization, in model scoring, and in provenance tracking. But as you can see, there is a lot more just on these slides, and probably many more problems I have not mentioned. I strongly believe that we need to get enterprise grade machine learning right; we cannot afford to get it wrong. So we try really hard to understand this problem well. As I mentioned, we have analyzed millions of notebooks, both on GitHub and internally at Microsoft. We have first-hand experience building lots of these systems; about half of our team works on applications of machine learning to systems tuning. We have been talking with enterprise customers, we have done market research, and we are trying to keep up with the machine learning literature. We are doing our level best to understand these problems and to tackle them as well as we can. But there is much more that needs to be done, and your role in this is to keep us honest and make sure that we are indeed building a form of machine learning in which we can all thrive. Thank you very much from the Gray Systems Lab, and if you're interested, come join us, so we can invent the future together.