So today, I will go through some work that we are currently doing for scikit-learn. It's mainly about the introduction of new computational routines for the library. I will rapidly go through the context of the library, and then I will describe a pattern which is used in many algorithms, which we have in fact improved and which we would like to port to other architectures.

Before getting started, just a few words about me and about Inria. I'm a research engineer in the Soda team, which launched the project a bit more than 10 years ago. I've been working there for a bit more than a year, and I became a scikit-learn maintainer in October 2021. My work mainly focuses on improving the library's performance and on reviewing contributions, but there are many other aspects and we are looking for people to help, especially for reviewing pull requests. As for Inria, it is the French National Institute for Research in Digital Science and Technology. It's historically important for the country, and its activities range from academic research to technological startup creation. Many other open source libraries used for scientific computing are developed there, so if you are interested in knowing more about those, feel free to have a chat after the talk.

When it comes to scikit-learn, I don't know if we need to present the library once again, but in a few points: scikit-learn aims to be the reference for simple and efficient implementations of data analysis and machine learning algorithms. It brought the famous estimator fit/predict API and came with nice illustrated documentation. It is the reference for many products in industry and for work in academia. We also organize online and on-site sprints; there is a sprint this afternoon and some people from the team will be participating, so if you are new to open source or would just like to contribute to the library, feel free to come, and we will help you together with the other maintainers of the library.

These features made it one of the most popular packages for Python, yet there are things that can be improved, and it's mainly about improving what exists: we don't want shiny new features in the library, we want to keep it simple and efficient. In particular, we can improve it on many points such as documentation, community engagement, UX and usability, so that it's even more user friendly, and performance.

As for performance, there is a lot of ongoing work; I'm only going to focus on one pattern so that we can get into technical details. If we look at performance, in applications people have mainly been using gradient boosting methods because those are performant. This was largely boosted by libraries such as LightGBM, XGBoost and CatBoost. Scikit-learn also implemented histogram-based gradient boosting, and if we compare the performance of scikit-learn against those libraries, it is relatively similar. Here is a summary graph that we made, aggregating validation scores over many sets of hyper-parameters and random datasets; these benchmarks are available online and can be reproduced if you want. But the point here is that there are not just those kinds of methods, there are many other ones, starting from really simple ones such as the k-nearest neighbors classifier.
And if we compare our implementation to, for instance, Intel's oneAPI implementation, then as of scikit-learn 1.0 and previous versions, our implementations were not the most optimal ones, so there was a performance gap that we could work on. Actually, if we look at this classifier, most of the computations rely on two methods: the k-nearest neighbors search and the radius neighbors search. And it's not just the case for this algorithm; it's the case for many other ones in the library, such as these, and many others. So focusing on improving the performance of those two routines would improve the general performance of the library. The focus of this talk is how we can speed up those two things.

Just a bit of formalism for the nearest neighbors search; I'm not going to go through all the details. What you need is a distance metric, which is a function taking two vectors of a given dimension, with some other properties that I'm not going to cover. What you generally work with is a vector x, called a query, whose neighbors you would like to find in a dataset referred to here as Y, and you consider the distances d(x, y) for y in Y. The goal, as said, is to find the closest vectors to x in Y, and there are generally two formulations: the k-nearest neighbors search, where you find the k smallest values using what we call the argkmin operation on the distances, or the radius neighbors search, where you find all the neighbors within a ball of radius r.

There are generally two kinds of solutions to this problem. The first is to rely on binary tree data structures such as ball trees and k-d trees. Their complexity is theoretically provable, but the problem is that in many cases, for many reasons including backtracking, which I'm not going to cover, the complexity degrades towards that of brute force as the dimension p tends to infinity. So it's not even the best option for big datasets; it's only the best for small dimensions p, in the ranges shown here. In practice people use much bigger datasets, so we need to consider alternatives. The other alternative is to use brute force, which is supposedly worse in terms of complexity, but in practice can be way more performant because hardware-level optimizations can be used. Here I am only going to cover this case, which will give you an overview of the work that we are doing.

When it comes to the brute-force k-nearest neighbors search, one can see the problem like this: you consider your vector x and your matrix Y, you compute the distances, and in simple words, what you do to get the neighbors you are looking for is to perform what we call a reduction on these distances. In this case, a step of the reduction can be as follows: you partition the distances around a pivot, which we take to be the k-th smallest value of the array. This allows you to extract the k smallest values; you can then sort them, and sort the indices along with them, so that you have all the information about the neighbors of your vector, and then you're done: you have the neighborhood of your vector x. In Python this can be implemented easily using np.argpartition and np.argsort: using NumPy, you can get the result in a few lines.
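Written down, the two formulations mentioned above might look like this; a LaTeX reconstruction, the exact notation on the slides may differ:

```latex
% k-nearest neighbors: indices of the k smallest distances to the query x
\mathrm{knn}_k(x) = \operatorname*{argkmin}_{y \in Y} \, d(x, y)
% (argkmin returns the indices of the k smallest values)

% radius neighbors: all vectors of Y within the ball of radius r around x
\mathrm{rn}_r(x) = \{\, y \in Y \mid d(x, y) \le r \,\}
```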
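As a minimal sketch of those few lines with np.argpartition and np.argsort (function and variable names are mine, not scikit-learn's API):

```python
import numpy as np

def brute_force_knn(x, Y, k):
    """Indices and distances of the k nearest neighbors of the query x in Y."""
    dist = np.linalg.norm(Y - x, axis=1)  # distances d(x, y) for every y in Y
    # Partition around the k-th smallest value: the first k entries are the
    # k smallest distances, in no particular order (O(n) rather than O(n log n)).
    indices = np.argpartition(dist, k)[:k]
    # Sort only those k candidates to get neighbors in increasing distance.
    order = np.argsort(dist[indices])
    indices = indices[order]
    return indices, dist[indices]

# Example: the 5 nearest neighbors of a random query in a random dataset.
rng = np.random.default_rng(0)
Y = rng.normal(size=(10_000, 16))
x = rng.normal(size=16)
neighbors, distances = brute_force_knn(x, Y, k=5)
```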
Actually, sometimes we are not working with one vector but with many, so instead of a small x I will use a big X here to refer to the set of all query vectors, and what you compute is not just a vector but an entire pairwise distance matrix, which consists of all the distances between the vectors in X and the vectors in Y. If you try to perform those operations naively, it will just crush your RAM, because this matrix does not fit in memory: with a dataset of a few hundred thousand observations, it will just not fit. So what you do is chunk the matrix X and compute all the operations more or less independently for groups of vectors, so that you consider smaller distance matrices, and this can be done in parallel, for example using joblib. This was roughly the previous implementation of the k-neighbors search, the implementation as of scikit-learn 1.0.

The problem with this implementation is that it does not scale properly when you use big machines, machines with many threads. Here you have a simple graph with the number of threads on the x axis and the speed-up on the y axis, and what you see is that as you add threads, as you use a bigger machine, you do not get much speed-up: for several kinds of datasets you get at most a 2x speed-up.

So why is that the case? Using Python, there are a few reasons for sub-optimality. Python was not meant for scientific computing at first; it was only meant to be a more natural scripting language than Bash. The first problem is that the execution of this algorithm is bound to the CPython interpreter and to the GIL: you have a lot of costly instructions for simple operations like basic arithmetic. Moreover, execution is limited to one thread at a time due to the GIL, which is a mutex on the interpreter state (and which might be removed someday, we don't know), and this comes with performance degradation. Sometimes you also have a bigger dataset for Y, when you are training on a lot of examples and only have a few queries, so chunking on X is not adapted: you may want to chunk on Y instead. Another problem is that we use high-level operations on top of NumPy arrays, which is costly because internally each operation creates a new data structure, so buffers need to be allocated: this calls malloc and free under the hood, which block at the operating-system level, so you perform a lot of long, blocking operations, and in the context of parallelism this comes as an extra cost on top of the thread pool or process pool setup and teardown. Finally, Python is nice because it's a simple language, but you only have a few primitives, and you can't really come up with complex execution schemes.
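Going back to the chunking strategy described above, a simplified sketch of that scikit-learn 1.0-era approach might look like this; helper names and the chunk_size parameter are mine, this is not scikit-learn's actual code:

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.metrics import pairwise_distances

def _knn_on_chunk(X_chunk, Y, k):
    # Distance matrix for this chunk only: small enough to fit in memory.
    dist = pairwise_distances(X_chunk, Y)
    # k smallest distances per row, then sort those k candidates per row.
    candidates = np.argpartition(dist, k, axis=1)[:, :k]
    rows = np.arange(X_chunk.shape[0])[:, None]
    order = np.argsort(dist[rows, candidates], axis=1)
    return candidates[rows, order]

def chunked_knn(X, Y, k, chunk_size=256, n_jobs=-1):
    # Split the queries into chunks and process them independently in parallel.
    n_chunks = max(1, X.shape[0] // chunk_size)
    results = Parallel(n_jobs=n_jobs)(
        delayed(_knn_on_chunk)(chunk, Y, k)
        for chunk in np.array_split(X, n_chunks)
    )
    return np.vstack(results)
```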
So we can solve these problems with a few things. First, we can get out of the CPython interpreter and the GIL using a language called Cython. If you do not know Cython, I encourage you to have a look at it, it's a super nice project: you can get really good performance just by modifying your code with a few extra type annotations and other things, and you can make sure you are not using the GIL with the nogil clause, so you're sure that you're essentially writing C code and that it runs efficiently. You can adapt your chunking strategy so that you're not only chunking on X but also on Y, and you can choose whether you want to parallelize the computation on X or on Y. Instead of NumPy, you can use other data structures and algorithms underneath, and even call other libraries for things like sorting or linear algebra. And Cython allows using OpenMP, which gives you really low-level parallelism with basically no overhead, so it's even nicer. You have many more directives in OpenMP if you want to control the computation; they are not all exposed in Cython yet, it's possible with some tricks, but I'm not going to cover this.

Generally, the users of scikit-learn are not data centers or anything; they are mostly people with laptops such as this one. This machine consists of several things: as you can see, it's a simple machine with 50 gigabytes of RAM, you have four physical cores at the bottom, and in between you have those things called L3, L2 and L1 (L1d and L1i). Those are small amounts of memory called caches that are used to store intermediate results, so if you want a performant implementation, you need to take them into account, and this is what we've done. Firstly, we make sure that the data structures we use fit into the L3 cache and that they are allocated only once at the beginning, which is efficient; you can also make sure data structures are properly allocated so that they map correctly to the cores. You can use, as said, low-level parallelism with Cython and OpenMP, and adapted algorithms and data structures: in our case, instead of NumPy arrays, we use max-heaps, which give a better complexity, and technically you can just use C buffers and pointer arithmetic.

Lastly, when people use this kind of algorithm, they are mainly in the Euclidean distance case, and the squared Euclidean distance matrix can be decomposed into three terms: two of them can be computed once at the beginning and then stored, and the remaining cross term, the small rectangle shown here, can be computed efficiently using BLAS level 3 operations, which are the most efficient implementations of matrix multiplication that you can have.
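As a sketch of this decomposition in NumPy, using the usual expansion of the squared Euclidean distance; this is illustrative, not scikit-learn's internal code, which does this in Cython:

```python
import numpy as np

def squared_euclidean_distances(X, Y):
    # ||x - y||^2 = ||x||^2 - 2 <x, y> + ||y||^2, expanded term by term.
    XX = np.einsum("ij,ij->i", X, X)[:, np.newaxis]  # computed once, stored
    YY = np.einsum("ij,ij->i", Y, Y)[np.newaxis, :]  # computed once, stored
    # The cross term X @ Y.T is a single GEMM call (BLAS level 3): this is
    # where virtually all the time is spent, in vectorized SIMD kernels.
    D = XX - 2.0 * (X @ Y.T) + YY
    # Clip tiny negative values caused by floating-point cancellation.
    return np.maximum(D, 0.0)
```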
So, using those tricks, what you now get is essentially linear scalability, up to a plateau at the end: with bigger machines you get good performance. The plateau can be explained by several things: the dataset is relatively small for the machine that we used, so the parallel section of the algorithm becomes negligible compared to the sequential one, and in that regime you do not get speed-ups as you add more threads because there is nothing left to parallelize; there is also a small cost for synchronizing the threads with OpenMP, hence the small slowdown at the end. This is sometimes referred to as Amdahl's law, and you need to keep it in mind when you are implementing algorithms.

Now, is it possible to see what happens on the hardware, and to know whether this is actually efficient or whether we can speed it up even more? Coming back to this diagram, what we can look at are cache hits and cache misses on the L3 cache, to see if it is properly used: we want a high hit rate for the L3 cache. We also want to see whether there is a low overhead for OpenMP and CPython, and lastly we want to know whether the execution of the algorithm is mostly spent in vectorized instructions. You can look at all of this using a tool called perf, which is really simple to use and runs on many kinds of programs, so you can even run it on Python and see what happens under the hood. I won't go into details, but basically you just need to set up a few things: here we want to record the cache hits and misses and the CPU cycles, for example with something like perf record -e cache-references,cache-misses,cycles, and then perf report gives you a report which shows everything.

If you look at where the cycles of the machine are spent, many are spent in libopenblas, as you see here; OpenBLAS is an implementation of BLAS by a project of the same name, so this is where we spend the time computing the matrix multiplication for the distance matrix. Some time is spent pushing values onto the heaps, and the time spent in the CPython interpreter and in OpenMP is negligible. This is what we wanted: a super low overhead for CPython and OpenMP, and that's what we got. The execution is mainly CPU-bound; there is no problem with the memory. If you look at the time spent in the core of the computation, the small region at the top in green, what we see is that all the operations there are vectorized instructions, namely SIMD instructions: you can perform operations on several floats at a time, and this is what you get in terms of assembly code here. This shows that the implementation is efficient for this case, which is what we wanted. We haven't written OpenBLAS ourselves; OpenBLAS is just a super nice project, and if you do matrix multiplication, chances are you are using OpenBLAS under the hood without knowing about it, so have a look at it. Lastly, if you look at the L3 cache, with cache hits at the top and cache misses at the bottom, we see many more cache hits than cache misses, which shows that we have a high L3 cache hit rate: that is, we do not move data back and forth between the RAM and the L3 cache a lot, which is nice. If you wrap up everything, you can say that we have high confidence in quasi-optimal performance for this kind of algorithm.
Going back to this pattern, we can actually extend it to many other algorithms and many other situations. First, we can adapt the data structures and the operations used at the end of the reduction, as well as call into other libraries, like C++ libraries, to support other reductions and algorithms: you can implement the radius neighbors search like this, you can implement k-means, you can implement value thresholding, you can even implement kernel methods; you can do many things like this. You can also work with sparse datasets: sometimes one dataset is sparse and the other one dense, or any possible combination, and you may have custom distance metrics, things that are not the Euclidean distance, that you would like to support. So in our work we have moved to a modular class hierarchy to support many algorithms and back-ends. This is just a simple overview: we have the nearest neighbors search interface at the top, which depends on two other interfaces; Cython has some restrictions, but with some kind of templating it is possible to dispatch calls to specialized implementations. This is a partial overview of the design, but you can have all the main operations done in an abstract pairwise distances reduction class, and then have the reductions and the data structures specialized in subclasses. You can also factor some operations, for example the computation of the matrix multiplication term, into a dedicated component. There is support for all the distance metrics like this, for all the combinations of sparse and dense datasets, for float32 and float64, and everything, but it does not fit on a single slide, so I won't show all of it.

Part of this work is just the first step towards new computational routines. We now have a partnership with Intel to work on their technology and extend this pattern to their hardware. And this is not something we want to do for potentially one vendor: we would like to have a general design so that people can even implement their own back-ends if they want. This is just the start of a design project; I will add the references at the end of the slides, but basically these are ongoing discussions. This is just one part of the performance work; there are many other ways to speed up the implementations. And this is just the beginning: Intel is working on new hardware which they call XPU, the combination of a CPU and a GPU with unified shared memory, so you have a mostly shared memory space between your CPU and your GPU and you can work efficiently on the GPU for vector operations and on the CPU for simple operations.

And that's it. To conclude the presentation: improving scikit-learn's performance is one of the next steps for the library, and there are many others. For this work we focused on the core pattern of these algorithms, namely reductions over pairwise distances. This pattern used to be slow and had some performance problems, yet it is possible to mitigate this using technologies like Cython, OpenMP and C++, and the next part of the work is hardware-specific computational routines. So that's it for me; thank you for your attention, and if you have any questions, feel free to ask them.

Thank you, Julien, very impressive. I'm sure we have lots of questions, please go ahead.
Yeah, thanks for the talk. Is this on? Well, I just came out of the PyArrow talk, were you there by chance? They showed the ChunkedArray, which can be zero-copy, no-cost converted to NumPy arrays, and which already does this sort of chunking that you showed. Is there any chance that you might alleviate these problems with Python by basing your back-end a little bit on PyArrow for some of those operations that are parallelized?

Yeah, so that's a nice question. We have not considered PyArrow as of now. There are in fact many projects and technologies that exist to perform these kinds of computations; there's a project which is called KeOps. Oh, there's no access, is this connected to the internet? I can look at it later. Well, it's a project, basically, which, yeah, here, this one: this project is really dedicated to performing these kinds of computations on GPU, but for CPU we haven't found any such project. We haven't looked at PyArrow yet, but it may be possible to integrate it. It's just that we don't want to have too many dependencies; we might want to have a plugin system, so that if people want specialized implementations they can eventually do something like pip install scikit-learn[pyarrow], or [intel], or whatever. But for PyArrow we haven't looked into it yet; thanks for your suggestion.

Hey, thanks for the presentation, it was really cool, especially seeing how you improved k-nearest neighbors. I think that's one of the first things that people who start out with machine learning work on, and I guess scikit-learn is the one thing people go to to learn and get into machine learning, so really impressive. I'm wondering what other estimators are on your roadmap for improvement.

Okay, so thanks for your question. What I've covered here are low-level performance improvements, that is, we are mainly writing the routines ourselves, but there are other ways to improve the performance of algorithms and estimators in scikit-learn. One of them is trying to use the Array API standard, which is a standard for tensor libraries: this way we won't need to rewrite all the implementations of the algorithms, we would just have a high-level API for array operations on tensors. Actually, someone on the team, Thomas, worked on a proof of concept to speed up LinearDiscriminantAnalysis: you can get up to a 40x speed-up on a GPU with no change to the implementation of the algorithm. So this is another part of the performance work that is going on, but it's high-level, and it's a matter of getting to a design that works for all the tensor libraries, so it just takes time. People in the community are working on many things; if you want to join the adventure, feel free to. It's just that we need to make sure the designs are correctly set so that we do not make any mistakes: there are libraries using APIs that are a bit different, which creates some friction, but I think there's room for making sure we get the best performance, especially on GPU, because as of now scikit-learn has mainly been relying on NumPy, but people are moving towards GPUs, so we are working in this direction as well.

Cool stuff, thank you.
Thanks for your talk. I have the obvious question, namely: when is it going to come out? What is your estimate of when these changes are going to be implemented?

Which one exactly?

The pairwise distances computation.

Okay, so, yes, I should have been clearer, I guess. For this one, it's already in scikit-learn 1.1, though not entirely: it's the case for the main use case, that is, the k-nearest neighbors search using the Euclidean distance on dense float64 arrays. We are currently working on extending it to float32 and to the combinations of sparse and dense datasets; that will probably be in scikit-learn 1.2, or maybe 1.3, it depends. More than that, we are looking for people who are interested in Cython: the bottlenecks are mainly peer reviews, debugging and benchmarking, and Cython expertise is kind of rare; we are working on many things. So I would say something like scikit-learn 1.2 or 1.3 will have much more support for this kind of pattern.

Okay, yeah, thank you. Just one follow-up question: I had problems, especially with RAM, when I used my own distance matrices, ones that are non-Euclidean or something; with a lot of algorithms I had RAM issues. Are you working on that as well? I was just wondering about that; I do understand that you have a lot to do.

Okay, what was your case exactly? You had a custom distance matrix?

Yes, exactly, a custom distance matrix, and I constantly run into RAM issues when I do that.

Okay, I think this is something I'm interested in, so if you want, we can discuss it. There are potentially patterns in scikit-learn which can be improved, and not just for this case: there's a similar case for AgglomerativeClustering, which computes a distance matrix, or a dissimilarity matrix, something like this, and sometimes people just have their machine crash because of it. So if you have a problem with a pattern like this, let's discuss it at the end.

Yeah, I'm happy to, thank you.

Okay, I guess there are no more questions, so a round of applause for Julien again.