So we're coming to the next talk. We have Victoria Fedotova. She is a leading software engineer at Intel, working on optimization of data analytics and machine learning in the Intel oneAPI Data Analytics Library, and I think we will hear more about that. And then we have Frank Schlimbach. He is a software architect at Intel with an HPC background. The title of the talk is "The painless route in Python to fast and scalable machine learning". So let's see if this is really painless.

So hello, my name is Victoria Fedotova. In this talk, my colleague Frank Schlimbach and I will tell you how to create fast machine learning solutions that can scale from multiple cores on a single machine to large clusters of workstations.

Python is a very popular language for data analytics and machine learning. As you know, it offers superior productivity, making it the preferred tool for prototyping. However, as an interpreted language, it has performance limitations that prevent its use in fields with high demands on compute performance, like production machine learning. If an organization wants to improve the performance of a machine learning solution, they often hire engineers to rewrite the existing Python code in more performant C++ or Java, or even to reimplement the solution in a distributed fashion using MPI or Apache Spark.

Also, many machine learning workloads are compute-bound, meaning that the performance bottleneck is in the compute part of the workload, not in the data loading. And for interpreted languages like Python, it is hard to achieve bare-metal performance on such workloads. As a result, a typical data scientist analyzes only the small portion of the data that they think has the most potential for bringing great insights. This means they may miss insights from the remaining, much larger portion of the data, insights that may be crucial for the business.

As a response to all of these challenges, Intel created the Intel Distribution for Python. The Intel Distribution for Python is a Python interpreter and a set of libraries for numerical and scientific computing. All of these packages are linked against high-performance libraries, which allows them to provide close-to-native-code performance. As you might already have noticed, one ingredient of the better performance is optimized libraries: packages such as NumPy and scikit-learn then spend a lot of time in native computations, which helps to significantly increase the performance of machine learning solutions. Another key ingredient of good performance is just-in-time compilation, which reduces the overheads introduced by the Python interpreter. Proper use of both of these ingredients helps you get Python code performance almost as good as that of native C++ code.

Data analytics and machine learning workloads usually consist of multiple steps. We usually start with data input, when we load the data from some data source, like a file, a database, or a stream of data. After that, we pass the data to a preprocessing stage where we prepare it for further use by a machine learning algorithm: here we can do some filtering, data transformations, feature extraction, and so on. After that, we train the model. And with this trained model, we usually do prediction, or inference, to get insights from new, previously unseen data. These stages can repeat over the course of the machine learning task.
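To make the stages concrete, here is a minimal, hypothetical sketch of such a pipeline with stock Python tooling. The file name, column names, and model choice are illustrative assumptions, not taken from the talk:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Data input: load raw data from a file (hypothetical data source).
df = pd.read_csv("transactions.csv")

# Preprocessing: filtering, transformations, feature extraction.
df = df.dropna()
X = df.drop(columns=["label"])
y = df["label"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Training: fit the model.
model = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

# Prediction / inference on previously unseen data.
print(model.score(X_test, y_test))
```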
But all these stages place different demands on modern hardware to perform efficiently. Another source of complexity is the hardware itself. Each year we get new machines with more threads, wider vector units, and so on; the hardware becomes more sophisticated every year, and even modern machine learning packages cannot use all those features of modern hardware effectively. So it is quite a complex task to have one solution that implements all the stages of these data analytics and machine learning pipelines efficiently and uses the hardware efficiently. In this talk, we will cover the Scalable Dataframe Compiler, the optimized scikit-learn, and the daal4py packages, which help to optimize these end-to-end machine learning pipelines and bring performance to Python workloads.

Let me now describe how Intel makes machine learning in Python faster with the Intel Data Analytics Acceleration Library. DAAL is an acronym for Intel Data Analytics Acceleration Library, and this library optimizes machine learning algorithmic building blocks by providing algorithms for all stages of machine learning workloads, from data input to decision making. The default conda installation of scikit-learn, as you may already know, comes built against the Intel Math Kernel Library, which speeds up the linear algebra operations within machine learning algorithms. But to get the best performance, it is not enough to optimize only the linear algebra, because some algorithms, like decision trees or gradient boosting, make little use of linear algebra. That is why we created DAAL. When data comes into DAAL, it is split into blocks, and those blocks are processed in parallel using the Intel Threading Building Blocks library. And of course we use the Math Kernel Library for the mathematical parts of the algorithms: linear algebra, random number generation, vector math, and so on. All of this leads to great performance.

Here you can see a comparison of stock scikit-learn performance to Intel scikit-learn performance, with both compared to native code performance, where 100% is the performance of the C++ DAAL algorithms. You can see that the Intel Python performance, shown as blue bars, is above 80% almost everywhere, whereas scikit-learn installed from pip rarely reaches 5% of native code performance.

So let me now describe what you need to do to get all those impressive speedups on your system. First, you install the daal4py package from conda or from pip. After that, you have two options. Either you use the -m daal4py command line option to enable the optimizations for all the scikit-learn algorithms available in daal4py, with no code changes needed, or you enable the monkey patching programmatically, algorithm by algorithm; for example, here we show this for the K-means algorithm. That's it.

On the left, you can see the list of algorithms that are currently supported in daal4py and are equivalent to the scikit-learn algorithms. This means those algorithms produce numerically the same results as the scikit-learn algorithms; they pass all the scikit-learn compatibility tests, so you can use them and get the same results as with scikit-learn, but with better performance. On the right, you can see the algorithms that have an API identical to the scikit-learn API, but that do not pass the compatibility tests.
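To make the two ways of enabling the optimizations concrete, here is a minimal sketch. The command line form is the one named in the talk; the programmatic form simply imports the accelerated estimator directly, and the exact module path is an assumption based on the daal4py documentation of that time, so treat it as such if your version differs:

```python
# Option 1: no code changes; run an existing scikit-learn script through daal4py:
#   python -m daal4py my_sklearn_script.py

# Option 2: use the accelerated estimator explicitly (module path is an assumption).
import numpy as np
from daal4py.sklearn.cluster import KMeans  # drop-in replacement for sklearn.cluster.KMeans

X = np.random.rand(10_000, 10)
model = KMeans(n_clusters=20, init="k-means++", max_iter=5).fit(X)
print(model.cluster_centers_.shape)
print(model.labels_[:10])
```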
This does not mean that those algorithms are incorrect; it is just hard for some algorithms to pass the compatibility tests. For example, for random forest it is hard to build trees of exactly the same depth as the trees built by scikit-learn, due to the random nature of the algorithm. But we constantly work on growing the list on the left, either by adding new algorithms or by moving algorithms over from the right by making them pass the compatibility tests, which is actually the trickiest part for us.

Now let's talk about how to scale a machine learning solution to multiple nodes, to a cluster. daal4py is a convenient Python API to Intel DAAL, and it contains a variety of algorithms with close-to-native-code performance. Part of these algorithms can work in a cluster environment; they have distributed implementations. For distributed execution we use communication-avoiding algorithms, which spend most of their time in computation on the nodes, not in data transfer. We try to design the algorithms this way, and it results in good scaling. For the transport layer we use an MPI library. Both the DAAL and daal4py packages are open source and available on GitHub, so you can inspect them or contribute to them.

So let's see what the daal4py API looks like. Here is the single-node API, in comparison with scikit-learn. For example, to implement K-means clustering in scikit-learn, you need to do some imports: import KMeans, and import pandas, which we use here for data loading. We read the data from a comma-separated file. After that, we create a KMeans algorithm object with 20 clusters, we use k-means++ as the initialization procedure, and we set the maximum number of iterations to five. After that, we do the actual clustering, and the results, the cluster centers and the labels, are available as attributes of the result object.

Here is the daal4py code that does exactly the same and produces the same results. Here we also need to do some imports, and we use pandas in the same way. The difference is that in daal4py, K-means is split into two algorithms. First we run the K-means init algorithm to compute the initial centers of the clusters; we use the k-means++ method here as well. After that, we create a K-means algorithm with 20 clusters and five iterations and set the assignment flag to true, which means we also compute the labels, or assignments, for our input samples. Then we run the compute method with our input data and the initial cluster centers from the initialization algorithm, and we get the same results as with scikit-learn in the attributes of the result object. You can see that the daal4py API is almost as simple as scikit-learn, maybe a little more verbose, but not overly so.

And now let's see how this code changes if you need to do multi-node computations. What will change in this case? Here is the comparison; I have highlighted in yellow the differences between the single-node and the multi-node code. You need some additional imports for the multi-node computations. First, you need to initialize the distributed environment. And in this particular example we use different file names on different nodes: on the first node we have the kmeans_dense_1.csv file, on the second node kmeans_dense_2.csv, et cetera, so we use the node ID to construct the name of the file on each node.
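Here is a minimal sketch of the single-node daal4py version just described. Parameter and attribute names follow the daal4py examples of that time, so treat them as assumptions if your version differs; the commented lines at the end preview the multi-node changes discussed next:

```python
import daal4py as d4p
import pandas as pd

data = pd.read_csv("kmeans_dense.csv")  # hypothetical input file

# Step 1: compute initial centroids with the k-means++ method.
init_result = d4p.kmeans_init(nClusters=20, method="plusPlusDense").compute(data)

# Step 2: run the clustering; assignFlag=True also returns labels for the input samples.
result = d4p.kmeans(nClusters=20, maxIterations=5, assignFlag=True).compute(
    data, init_result.centroids
)

centroids = result.centroids   # analogous to KMeans.cluster_centers_
labels = result.assignments    # analogous to KMeans.labels_

# Multi-node variant (described next): initialize the distributed environment with
# d4p.daalinit(), read a per-node file based on d4p.my_procid(), pass distributed=True
# to the algorithms, and launch the script with mpirun.
```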
And finally, we need to pass an additional parameter to all the algorithms, which says distributed equals true. And that's it; these are all the changes you need to make to implement the distributed machine learning algorithm. The command line to run this script changes as well: we use the mpirun utility to launch it on four cluster nodes, and then we just pass our script, and it will run the computations on multiple nodes in the cluster.

Now let's dive a bit deeper into the daal4py implementation. In the background you see C++ code that runs the distributed principal component analysis algorithm using the DAAL C++ API and MPI as the transport layer. This is one of the shortest examples; the K-means example is about twice as long. As you can see, we couldn't just port this API to Python, it is not very user friendly. So we use a semi-automatic process to generate the daal4py API, for both the distributed and the single-node algorithms. First we parse the DAAL C++ headers with our in-house parser to locate the different classes, functions, enumerations, and other objects, and we generate Cython code from this C++ API. After that we use Jinja2 to generate Python classes, which are very different from our C++ classes. This process has two main advantages. First, it simplifies the API, so we get a simple, not very verbose Python API. Second, when we add a new algorithm or a new feature to DAAL, we get the Python API for that feature for free, just from this generation process. It is a cool thing. And... (You're running out of time.)

Another thing that helps daal4py achieve great performance is efficient data transfer across language boundaries. Python data types store data differently. For example, for NumPy ndarrays we use what we call a homogeneous numeric table. In a pandas DataFrame, columns can have different data types, and usually each column is represented as a contiguous array in memory; this representation is called structure of arrays. daal4py supports various basic data types and a number of representation formats to avoid data copies and perform optimally with various data layouts.

And finally, to the performance. Here you can see the scaling of the K-means clustering algorithm from one up to 32 cluster nodes. Hard, or strong, scaling means that the size of the task is fixed while the number of nodes increases. The orange bars show hard scaling of the K-means algorithm for a 35-gigabyte dataset. You can see that on two nodes the computation runs twice as fast as on one node, on four nodes four times as fast, and so on: the time to process the data decreases as the number of nodes increases, which shows good hard scaling. Weak scaling measures the performance of a workload when the task size is fixed per node: on one node we process 35 gigabytes of data, on two nodes 70 gigabytes, and on 32 nodes more than one terabyte of data. And daal4py shows close to ideal weak scaling for this algorithm, which you can see in the yellow bars: the runtime stays the same as the size of the data increases. This is due to our communication-avoiding design of the algorithms.

So that's it about the compute part of machine learning workloads. I hand over to Frank.

Yeah, thanks Vika. So the machine learning part, the actual machine learning algorithm, is often only a small part of the entire data science application.
In many cases the preprocessing, that is, reading the data and doing data cleaning, data manipulation, transformation, et cetera, to make it ready for the actual algorithm, takes most of the time. That can easily happen. So it would be good to also improve the performance of these preprocessing steps. In most cases people use pandas. Pandas has a really powerful and nice API. The problem is that pandas is not necessarily optimized for speed: for example, it uses only a single core, so it leaves all the many cores and vector units that basically all computers have today unused. There is a lot of hardware sitting idle when you run pandas, even though it is readily available.

The other observation is that the preprocessing code is usually not just one or two lines of code; it is a large chunk of code. And if you have a large chunk of code, there is usually great potential for optimization if you can optimize across different instructions, for example loop fusion and things like that. And because pandas is inherently data parallel, that is something we want to utilize, and that's why we decided to write a compiler, and it is actually a just-in-time compiler. What we do is extend Numba. Numba is an existing just-in-time compiler provided by Anaconda. It is a domain-specific compiler: it focuses on NumPy and the numeric code around it. What we do is extend Numba so that it also understands pandas and the code around pandas, and by that we get very good performance.

The beauty of Numba and its approach is that you don't need to leave Python. It is a pure Python package, and all you need to do is annotate your function with a function decorator, the njit or jit decorator. So you don't need to leave Python, learn a different language, or compile ahead of time; everything stays within Python. If you have a function that you want to compile, you annotate it with the decorator. When you call your decorated function, the jit or njit wrapper provided by Numba decides whether it wants to call that function directly, or whether it enters the compilation pipeline, creates a native binary, and then calls that instead. And hopefully, if it compiles to native code, that will be much faster.

And here is a very, very simplified view of this compilation pipeline, really simplified, just to give you an idea of what needs to be done. There are two things that are very important to carry the performance from Python to native. The first one is type analysis. Python is a super flexible language; the typing is mostly not even explicit, and it is dynamic. But if you want to compile something to native code, you need to map everything to native types in order to use, for example, vector registers and all that. So we need to do this type analysis, and you can imagine that this can be pretty tedious and pretty challenging. And here is where the domain-specific approach comes in: because Numba natively understands NumPy, it knows all the NumPy data types and knows how to handle them.
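As a quick taste of what this looks like in plain Numba, independent of SDC, here is a minimal sketch with an illustrative function name:

```python
import numpy as np
from numba import njit

@njit  # compiled to native code on the first call; later calls reuse the compiled binary
def mean_of_squares(a):
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i] * a[i]
    return total / a.shape[0]

x = np.random.rand(1_000_000)
print(mean_of_squares(x))  # the first call pays the compilation cost
```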
Now we are adding pandas support: we are teaching Numba what pandas does and all the data types in there. So it is a domain-specific optimization. And the same is true for the parallel analysis. You know that pandas operations are usually data parallel. You could express that yourself, but because we know from the domain how things work, we can do it for you: we can automatically extract the parallelism, and that is part of our pipeline. We add the parallelism to the code that we generate in the efficient binary, so you don't need to do it manually. This is not claiming that parallelism can be extracted in a generic way; it is really domain-specific for NumPy and pandas.

Here I just want to show, as an example, two optimizations that we can do in the compiler that cannot be done easily with other packages, or if you have just a library. The example is really simple, almost silly. You see this function annotated with the njit decorator. All it does is read a file containing data about employees, with the employee's first name, for example, and the bonus, and maybe ten other attributes. Once that is read in, we compute the mean of the bonuses, sort the first names, and return the two series.

The two things I want to stress here are these. First, the read_csv. If you look at what the default pandas implementation does, it simply takes the entire file and reads it from start to end in a serial manner; there is no parallelism in it. Our implementation, generated by SDC, divides the entire file into multiple chunks, and each thread then does the reading on its chunk. So we have a fully parallel read, even for read_csv, and that of course brings quite a performance boost. And this parallelism is not only applied to read_csv but also to things like mean and sort_values; all these operations on data frames and series are parallelized by SDC.

The second interesting thing here is a memory optimization. In the comments you see the usecols argument that you could pass to read_csv, which means that the listed columns are the only ones you are interested in. If you provide it, read_csv will read in only these two columns. With SDC you don't need to do that, because SDC can analyze your function body and see that this function only uses the bonus and the first name, so it can do this for you automatically. It basically takes away some of the optimization burden from you, because it can do the analysis and, because it knows the domain, it knows what to do. So these are just two examples; of course there are many more optimizations that are applied, like loop fusion and things like that. But this gives you an impression of what is possible with a compiler compared to just using a library or a package that is optimized under the hood.

But of course there are also limitations to an approach like this. The most fundamental one is that we are doing a static compilation: even if it happens at runtime, at some point you have the code in some state and then you compile it, so it is static compilation in that sense. The code needs to be type stable.
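Before going into the type-stability limitation in detail, here is a hedged reconstruction of the employee example described above. The file name, column names, and exact calls are assumptions patterned after the SDC getting-started material; depending on the SDC version, an explicit import of the sdc package may be needed to register the pandas support in Numba:

```python
import pandas as pd
from numba import njit
# import sdc  # may be required to register the pandas extension, depending on the version

@njit
def get_analyzed_data():
    # SDC reads the CSV in parallel chunks and, from the analysis of this body,
    # can restrict the read to only the columns actually used
    # (equivalent to passing usecols=["Bonus %", "First Name"] by hand).
    df = pd.read_csv("employees.csv")
    mean_bonus = df["Bonus %"].mean()
    sorted_names = df["First Name"].sort_values()
    return mean_bonus, sorted_names

mean_bonus, sorted_names = get_analyzed_data()
print(mean_bonus)
```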
Type stable means that, given the set of input types, which are deduced from the input arguments, all the other types in your function, the variables, the called functions, et cetera, can be determined statically. That is what type stable means. Basically, it means you cannot write certain expressions that Python would otherwise allow. Those expressions typically look like the examples below, where you have a condition that is not a constant expression, and the if and the else branches result in different types. That applies to functions as well as to variables. And here, in our domain with data frames, it also applies to the dataframe schema: the column names need to be statically determinable as well. So that is a limitation on the columns. It is conceivable to do it differently, but requiring this to be static lets us do certain optimizations that we couldn't do otherwise. In most cases this is nothing that data applications actually do, but in some cases they do. In some of those cases you can work around it by just changing the code, but of course that is not always possible. So this is really a fundamental limitation, but hopefully in most cases you can work around it.

The other thing is, of course, that compilation takes time. It doesn't come for free, and you can't assume that the compilation takes less time than what you gain from it. The first time Numba kicks in, it takes the code, compiles it, and then runs it, so there is some overhead for the compilation. It will cache the result of the compilation, so the native function is cached, and the next time you call it, it will not compile again; it uses the cached binary and runs that. So before you annotate your function with a jit decorator, you should think a little about whether it actually makes sense. If you have a very small function with no data parallelism, no computation in it, and it is called only once, it doesn't make sense to jit it. But if you have a function that has parallelism in it, does a lot of computation, and is called, for example, in a loop, that is your candidate, and you will get very nice performance out of the box.

The other limitation, or the other thing to think about, which is not that obvious, is that currently the compile time of SDC depends on the number of columns. So if you have very wide data frames, SDC might take a very long time to compile. We have one example where SDC actually takes 10 minutes to do the compilation. In that particular case we are still able to reduce the total runtime from hours to tens of minutes, but that is of course nothing you can assume. So think before you apply the jit decorator, but if it's a good candidate, it basically gives you the performance for free.

Let's look very quickly at some performance data, just to prove that it actually works. Here is read_csv: we are just reading a CSV file. On the x-axis you see different numbers of threads, from 1 to 56, and on the y-axis the speedup over stock pandas, that is, the speedup of the SDC-compiled pandas over the stock, non-compiled pandas. You see that with one thread it is actually slower, and that is what I just mentioned.
We have the compile time, so that makes it slower; but as soon as you have more than one thread it is faster. With four threads it is already two and a half times faster, and it goes up to more than 10x. You cannot expect something like linear speedup here, of course, because we are doing disk I/O and so we are limited by the disk speed. So this is already a very, very good number.

This slide gives you two charts comparing operations on entire data frames, that is, a data frame with multiple series, multiple columns, in it. Again we show the speedup of our SDC-compiled operations compared to stock pandas, on multiple threads. On the left-hand side we operate on the entire data frame: we count the number of rows, we drop rows, and then we compute the max for each column. And you see the speedup is quite impressive. For the max, for example, which has more computation in it than the other two, we get more than 100x. On the right-hand side we do a similar thing, we compute the mean and the standard deviation, but this time not over the entire columns but over rolling windows. This is a problem that is much harder to parallelize, but you see the performance is still very nice, and it scales, in particular for operations that are a little more compute intensive, like the standard deviation.

This is the last performance slide. We do similar comparisons, but on a single series, not an entire data frame. Again, on the left-hand side we do that over the entire series; on the right-hand side with rolling windows. Again we see nice speedups and, in some cases, good scalability. Let me just highlight the apply case, the green one on the left-hand side. We get more than 400x speedup, and that is because when you do an apply with a lambda, a compiler can compile not only the apply into native code but also the lambda. A library or a package, even if its implementation is very efficient, always has to go back to Python, call that lambda, and come back. We can eliminate all of that Python overhead, and that is why we get this fantastic speedup of more than 400x with a simple apply function.

Here is just a quick summary of all the functionality that is supported right now in SDC.
It is an evolving project; it is of course not finished, and the pandas API is super wide, there are many, many functions in it, and in our initial version we can't support everything. We already support a good number of functions: for example, the statistical operations we have seen in the charts, but also relational operations like filter, group by, join, things like that, and we saw the rolling windows. And let me add something on the data side of things: we not only support the data containers, the data frames and the series, but we also added support for dates and strings, which is something that Numba did not support before. That is a major source of performance optimization: if you have code that uses strings or dates and times and you compile it with SDC, you get a super speedup; we had cases with 1000x or more, just because we optimize how strings and dates are handled.

This is a slide about the current status of SDC. It is an open source project, so it is not closed, it is free, you don't have to pay anything for it. You can get it from GitHub and compile it yourself, or we also provide binary packages: conda packages and pip wheels, whatever you like more. SDC will be part of our new oneAPI product that will be released at the end of this year. This will not change anything; it will still be free, it will still be open. It basically means that, if you want, you can get support for it. Until then it is in beta, so in product terms it is beta quality. Last but not least, you see the second link here, which is the documentation for SDC. The documentation not only lists all the things SDC can do but, probably more importantly, for the pandas functionality it supports, it lists what it does not support, so if there is some argument that is not supported yet, you will see that.

So what's next? We are working on a few things. Of course we are adding more pandas features, but more interestingly, we want to extend the parallelism to not only scale up within a node but also scale out across nodes, out to a cluster. Again, because we know what we are doing and we know what you are working on, because it is domain-specific, we can do all of this automatically for you: we can scale it out and do all the communication internally, we know how to do this. The user doesn't need to do anything, just run mpirun, and the rest of the code is identical.

The other thing we are currently actively working on is running things on a GPU, so that the JIT can generate code not only for the CPU but also for the GPU. If you have done something like that in Python already, you might know what you see on the left: that is what you usually do with CUDA, for example, this CUDA kernel-style programming, which we think is too low level for a Pythonista. We are aiming for much more: we want to apply the same style of jit programming that we have right now for the CPU also to the GPU. So there is no change in the function that you want to run on the GPU; only when you execute it, when you call it, you call it in a so-called device context. That is, in a nutshell, what we are trying to do. We are quite far along with that, but it is not yet ready. A couple more packages belong to this: one is a Numba equivalent for the GPU, and the other provides this device context. That is all in preparation, so maybe next year
we will be back and talk about that. That basically concludes our presentation; we hope it was interesting to you. If it is, here are some links, check them out. And one last word: I want to thank the organizers for the really incredibly good preparation of this conference, it's fantastic, thank you so much.

Thank you very much, Victoria and Frank. So you convinced me, it's really painless; it's really nice work, and I like it when things go faster. We do have some questions, and the first two I can combine. The question is: do you need any specific Intel hardware to use the Intel Data Analytics Acceleration Library? And the second part: are there any benefits achieved on AMD processors, or is it strictly Intel?

Good question. You can assume that everything also works on AMD. Of course everything is best optimized for, and will work best on, Intel; I mean, that's clear, we are Intel, so we optimize for our hardware. But you can also assume that it will perform really well on AMD. We are not supporting something like ARM, for example. But for the SDC part, SDC has different back ends and we are mostly working on the front end, so it should actually be fairly simple, for example for the GPU work we are doing, for NVIDIA or someone else to write a CUDA back end for the same thing, if that's what they want.

Okay, and a question from Artush.

I wanted to add, not about Python but about the C++ products: in addition to the classical DAAL product, we currently have the oneAPI oneDAL product, which can also run on GPUs, on integrated GPUs or any Intel GPUs, and I think going further we will bring this to Python as well.

That's great. I'm afraid we are running out of time, but the question from Artush: how closely does the Intel Python distribution follow the release cycles of pandas, scikit-learn, et cetera?

We have a very close collaboration with Inria, so with scikit-learn we are usually very, very close. When they have a new release, even during the release cycle, we check our patches and our optimizations against it, so we try to be as close as possible. Pandas is usually a little more behind, but you can expect that we are no more than maybe a quarter behind.

There are more questions, but we have to take them to the Discord chat, so it is best to continue there. You will find the chat by pressing Ctrl+K, and I think it is under "scalable machine learning"; I'm sure you will find it. So let's continue in Discord. We even have three questions here, but time is up now, it's time for the coffee break, the virtual coffee break, time to get some coffee. And thank you very much again, Victoria and Frank, it's really great work, please continue it. Thank you.