Okay, so hello everyone, welcome to our presentation. My name is Jakub Pavlik, I am director of engineering at Volterra, and I'm responsible for leading the SRE team.

Great, thanks Jakub. My name is Sandeep Pombra, I'm the head of data science, and I'm responsible for all the machine learning, data science, and data modeling at Volterra.

Okay, so today we are excited to bring our session to KubeCon. We would like to share with you our story and journey: how we got to scaling to a million machine learning models, how we use Kubernetes, Apache Spark, and Apache Arrow to work with all that data, and what great features we are able to build from them. A little bit about the agenda. First, I will take you through a quick overview of what Volterra is, because probably not everybody is familiar with what we are doing. Then Sandeep will explain our machine learning functions, our model explosion, and the problems we were facing. Then I will take you through our machine learning infrastructure evolution journey and how we continuously improved that infrastructure, and finally Sandeep will take you through the model scaling challenges.

Now, to begin with, I picked this slide. Don't worry, it is not too much vendor stuff, but I wanted to show you what is basically behind it. At Volterra, we build distributed cloud services wherever your apps and data need them: public cloud, private cloud, physical edge, nomadic edge, or our global backbone. We focus on providing distributed network services and distributed infrastructure for applications. That is how our normal slide looks; from our engineering side, what is actually running behind it is basically Kubernetes everywhere. All of those locations and sites are Kubernetes sites, and that also means a lot of logs, metrics, and data which we are able to pull and analyze, and from which we provide great features for our customers, and for ourselves, so we are able to operate.

Okay, thanks, Jakub. So I'm going to talk a little bit about our machine learning applications. As Jakub described, we have a very complex, distributed, microservices, multi-cloud environment, and this entails a lot of different machine learning functions. First of all, we need to provide a very sophisticated web application firewall function; typical rule-based WAFs are not sophisticated enough to handle zero-day attacks, so we need machine learning algorithms that allow us to handle those. Another important part of an application for us is understanding how the APIs within the application work, so we use machine learning to discover the APIs and basically compress the whole application into a set of API endpoints. Then we do several types of anomaly detection to detect bot attacks and different kinds of DDoS attacks: time series anomaly detection and per-request anomaly detection. We also do user behavior analysis to understand whether there are malicious users, as well as to understand the application better. So as you can see from this picture, there are a lot of different machine learning functions. They are divided between the learning core, which does the training, and the inference engines, which run on the edge. The learning core is a global learning area where we do all our training.
And as you will see in the next few slides, to do this we really have to manage a very massive scale. We'll talk about this more as we go forward.

Okay, so before we get to the actual models, let me explain the scale and how we collect metrics and logs in our infrastructure. We have three types of sites. First, the customer edges, which come in thousands; there can be thousands, even hundreds of thousands of these sites. Then we have what we call regional edges, which are our points of presence on our global backbone; those are in the tens today. And then we have a global controller, which is distributed across three regions today, and this is the place where we do the data analysis. The way it works is that we have a Prometheus in each and every site, which scrapes the local Kubernetes workloads as well as the nodes. From the connected regional edges we then do Prometheus federation with metric allowlists: we scrape only certain metrics which we are particularly interested in and want to work with, not basically everything. We use remote write to get that data into Cortex, our long-term storage for all those metrics. The regional edge Prometheus instances also produce alerts and send them to our Alertmanager. On the log side, today we use Fluent Bit to capture all the log messages from our services and from third-party services. They are forwarded to aggregating Fluentd and Fluent Bit daemons in the PoPs and the regional edges, and from there we write into two places today: Elasticsearch and AWS S3. Those APIs are then available for our metric data analysis, event data analysis, and the other services which Sandeep will explain on the following slides.

Yeah, so as Jakub just showed, we have a very complex architecture with a lot of different types of data ingestion. And the applications we deploy are typically very complex, and underlying these applications are a lot of different dimensions: application, virtual host, source sites, destination sites. One of the things we found when doing our machine learning modeling is that every application, every customer, every geography has its own characteristics, so it is very difficult to develop a universal model for all of them. To get the best performance in terms of machine learning accuracy, we have to develop models across each of the dimensions. And as you can see, that kind of combination, just by multiplicative logic, leads to a very large cardinality of models. In certain cases, for example for the time series models, we were getting into millions of time series. So basically we needed to figure out a way to scale these models. Initially, when we started this project, we were obviously focused on getting our machine learning algorithms working, so we ran these models on a single instance in a serialized manner. That took a very long time to run, and when training, inference, and scoring take several hours, the model itself becomes obsolete by the time it finishes. It was definitely not sustainable. So now Jakub is going to talk a little bit about how our infrastructure was also struggling to meet this need.
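A quick aside on the pipeline just described: since Cortex speaks the standard Prometheus HTTP API, the downstream analysis services can pull the allowlisted metrics with an ordinary range query. The sketch below is a minimal illustration in Python; the endpoint URL, metric name, and time window are hypothetical placeholders, not Volterra's actual values.

```python
import requests

# Hypothetical Cortex endpoint; Cortex serves the standard Prometheus
# HTTP API, so /api/v1/query_range behaves exactly as in Prometheus.
CORTEX_URL = "https://cortex.example.internal/prometheus/api/v1/query_range"

resp = requests.get(
    CORTEX_URL,
    params={
        # Illustrative query: per-namespace request rate over 5m windows.
        "query": 'sum(rate(http_requests_total[5m])) by (namespace)',
        "start": "2020-08-17T00:00:00Z",  # RFC 3339 timestamps
        "end": "2020-08-17T06:00:00Z",
        "step": "60s",
    },
    timeout=30,
)
resp.raise_for_status()

# Each result is one time series: a label set plus (timestamp, value) pairs.
for series in resp.json()["data"]["result"]:
    print(series["metric"], len(series["values"]), "samples")
```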
So when we started, and this is the infrastructure picture, now looking at the global controller I introduced on the previous slide: we actually started with more regions, but this picture covers one region, where we provide the Elasticsearch, Cortex, and AWS S3 APIs, and we ran everything inside an EKS cluster. It was basically a single cluster running continuous learning jobs. The issue was parallelism, and also inefficient CPU and RAM usage: we had to resize to bigger and bigger VM flavors, and it was not very efficient, because those jobs run for a few hours and then sit idle for many hours. So you need to find some kind of balance; it was not really cost efficient or resource efficient. Therefore we wanted to find a different way, to take a look from a different perspective at how we could handle these very large data sets, the ingestion, and the training.

Yeah, so as Jakub mentioned, we were running into a lot of bottlenecks in our infrastructure. From the machine learning side, the obvious approach to model scaling is to run several models in parallel. This can be done in various ways; we could use Dask or joblib or a variety of other Python tools, since most of our code was in Python, but we wanted something much easier to maintain, easier to scale, easier to manage. We wanted to combine the best of horizontal scaling with the ability to auto-scale, to scale to different customers, to automate with CI/CD as Jakub mentioned, and to minimize our infrastructure management. We also wanted a very universal way of doing data ingestion, so that it could be secure and seamless. For all these reasons, we decided to combine the best of Spark's scaling and parallelization with Kubernetes infrastructure. So we use Spark for horizontal scaling, and Kubernetes as a distributed data ingestion architecture with integrated CI/CD. And because we wanted a faster time to market, rather than operating our own open-source Spark engine, we decided to leverage the SaaS technology from Databricks.

Yes, so before we start, let me talk a little bit about how we actually integrated Databricks and what we had to do. If you look at the standard Databricks integration, it expects that you give Databricks access to a full AWS VPC, so they can provision the VMs and jobs as they need, and you do AWS VPC peering with your own VPC where you have your data; in our case, our global controller with the Cortex, Prometheus, and analytics APIs. The problem we found was that we didn't want to give them access to our VPC. So we created a dedicated instance in a dedicated account, but we still didn't want to use plain peering, because we wanted visibility into what is flowing, to set detailed firewall rules, and to make sure our core infrastructure cannot be breached and that they cannot use our CAs and our certificates. So we wanted to isolate them, and VPC peering was not good enough. We came up with the idea that we could actually leverage our own technology and make it better, and therefore we worked out this design.
So we left our EKS and our existing VPC running as is, and we just created a dedicated account for the Databricks learning jobs. Then we launched our ingress gateway, called VoltMesh, which basically allows us to provide connectivity for only the particular APIs we need; in this case Prometheus, Cortex, and Elasticsearch. We advertise only those APIs toward Databricks, with different certificates and a different certificate authority, which allows us to do really granular API filtering and service policies. It allows our learning services, such as API discovery, time series anomaly detection, per-request anomaly detection, request data analysis, and user behavior analysis, to run, consume, and produce metrics back to our infrastructure without any security breach or direct access to core Volterra services. And of course this also helped us improve our own technology.

Okay, great, thanks, Jakub. So for the rest of the talk, I will focus on how we use Spark to parallelize our models. The way Spark works is that it relies on running functions from the driver across a bunch of executors in parallel. Typically that is done by creating either DataFrames or RDDs, resilient distributed datasets, which are collections of data partitions that run on different executors. So the idea is that you take some huge data set and split it across the executors, and if your executors are multi-core, you can even split within them across cores. For example, with four executors of four cores each, we can parallelize by a factor of 16.

The first approach we took was for the kind of scaling where we ingest the data and access the models through the various interfaces that Jakub talked about. For that, we came up with a very simple scheme: we took our dimensions, like applications and namespaces, created a Pandas DataFrame out of them, and converted that DataFrame into an RDD. Then we can do a map, which applies a function to every element. The input and output here are mostly symbolic; what matters is that the function gets executed, and the actual core functionality inside can be very complicated. Looking at this code snippet, I can explain a little further how we did this. Spark has two kinds of operations: transformations, like map, which is the application of a function, and actions, like collect or count or other kinds of aggregations, which actually execute things. In this case, we define an outer function, which is a standard Python function. We take our Pandas frame, which has all the keys we are going to map over, and create a Spark DataFrame from it. The actual mapping function is embedded within the outer function, so we can use closure variables, like variable one and variable two, inside it, and these are automatically exported to each of the executors. So this function takes the DataFrame, converts it into an RDD, and maps the inner function, which applies it to every row of the RDD, that is, to every row of the original Pandas frame. Then we do a collect to make sure the function is actually executed, because Spark has lazy execution and will only run the transformations when you perform an action like collect.
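Here is a minimal, self-contained sketch of that first pattern, assuming a local PySpark session; the key columns (tenant, namespace) and the closure variables are illustrative stand-ins, and the inner function just returns a marker where the real model training or inference would run.

```python
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("per-key-models").getOrCreate()

def run_all_models(keys: pd.DataFrame, variable_one, variable_two):
    # The inner function closes over variable_one/variable_two; Spark pickles
    # the closure and ships it to every executor automatically.
    def run_one_model(row):
        # One key combination per task; the real (arbitrarily complicated)
        # model logic would run here, the return value being mostly symbolic.
        return (row["tenant"], row["namespace"],
                f"done-{variable_one}-{variable_two}")

    sdf = spark.createDataFrame(keys)    # keys only, no bulk data
    rdd = sdf.rdd.map(run_one_model)     # map is a lazy transformation
    return rdd.collect()                 # collect is the action that runs it

keys = pd.DataFrame({"tenant": ["t1", "t1", "t2"],
                     "namespace": ["ns1", "ns2", "ns1"]})
print(run_all_models(keys, variable_one="cfg", variable_two=42))
```

Each row becomes one Spark task, so the three key combinations above can run on up to three executor cores at once.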
But one thing I wanted to point out is that this model function, which I haven't really described here, can be a very complicated function, and we can pass a lot of different types of objects to it without any problem. This is all done seamlessly, and the function is automatically scaled out across the parallel components. So this is actually a very cool approach, and it works very well when we do everything within the function.

But there are other instances where we have a much more complicated DataFrame: the data set is already available as a complex DataFrame, not every single column in it is a key, there is a lot of actual data, and we also want to get much more sophisticated data back out of our functions. We cannot use the simple approach in that case, so we had to come up with a different one. We decided on an approach that uses Pandas UDFs in conjunction with Apache Arrow; I'll talk about Apache Arrow on the next slide. A UDF is a user-defined function. Typically, a plain Python UDF takes every row of the DataFrame and runs the function row by row, which is very inefficient. So we wanted to use Pandas UDFs, which work in a much more vectorized fashion and can give up to a 100x performance increase over plain Python UDFs. And since we run our models across a lot of different dimensions, we use what are called grouped map Pandas UDFs, which allow us to take a group-by, split-apply-combine approach and run these functions in a much more seamless fashion.

So let me talk a little bit more about Apache Arrow. When we go from Spark, which runs in a Java virtual machine, into the Python Pandas API, there is a lot of serialization involved if you don't use Apache Arrow, and that serialization can take a lot of time, because it works row by row. With Apache Arrow, the data is sent from Spark to the Python Pandas API in a columnar format, which is very efficient because it takes advantage of the SIMD architecture of modern CPUs. This gives a very efficient, vectorized way to get the data from the Spark DataFrame to the Pandas API, and it is essential to getting the performance and the cost reduction that we need.
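The effect of Arrow is easy to see on the Spark-to-Pandas hop itself. Below is a small sketch, assuming PySpark 3.x with pyarrow installed; on Spark 2.x the config key is spark.sql.execution.arrow.enabled, and the absolute timings are machine-dependent, but the Arrow path is typically many times faster.

```python
import time

import numpy as np
import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# A wide numeric frame, so the JVM <-> Python copy cost is clearly visible.
pdf = pd.DataFrame(np.random.rand(1_000_000, 4), columns=list("abcd"))
sdf = spark.createDataFrame(pdf)

def time_to_pandas(label):
    start = time.time()
    sdf.toPandas()  # forces the full Spark -> Pandas transfer
    print(label, round(time.time() - start, 2), "s")

# Row-by-row pickle serialization across the JVM/Python boundary.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
time_to_pandas("without Arrow:")

# Columnar Arrow batches, vectorized on both sides of the boundary.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
time_to_pandas("with Arrow:")
```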
Now a little more about the grouped map Pandas API. As I said, the idea is that we have an original DataFrame which contains all our data plus a set of keys. We use those keys to group the DataFrame into different groups, each group goes into a Pandas function, and the function is applied to each group separately. The result for each group comes back as a Pandas output, which gives us a very efficient way to run the function. So basically we are writing a plain Python function against the Pandas API, but the underlying technologies, Apache Arrow and Pandas UDFs, execute it in a very efficient and very parallel fashion.

So what I'll do is go through a code snippet that explains how we do this in a little more detail. This Python code is a very simple instance of running several scikit-learn random forest regressor models in parallel, which is a good example of some of the models we use. First we define a schema, which declares what kind of output we are going to produce. In this case it is very simple: our group ID, which is the key we parallelize by, and a model string, which here just carries the model file name. But this schema can be very complicated; we can return a complete Pandas DataFrame with many different types of columns in it. Then we use a decorator to instantiate the Pandas UDF as a grouped map UDF. Once we do that, everything else is pretty straightforward: it is a regular Python function that takes a Pandas DataFrame as input. The group ID is passed as part of that DataFrame, which allows us to identify the specific group. The DataFrame can have many columns; in this case it has three, the two features and the label. By extracting those columns, we can fit the random forest regressor on them, do a pickle dump of the model, and pass the model and the group ID back. This way, the whole function runs in parallel across several executors. And the way we invoke it is: first we enable Arrow, so the serialization is fast and vectorized; then we take our original Pandas DataFrame, which contains the actual data, convert it into a Spark DataFrame, and apply the group-by Pandas UDF to it. And since, as I mentioned before, Spark has lazy execution, we convert the result back to Pandas, and that is how we get our results. So it is a really simple way of doing parallelization with Spark using plain Python functions, without getting into too much of the nitty-gritty of Spark. This demonstrates the two ways we have parallelized and been able to scale our models to this level.
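Putting that walkthrough together, here is a runnable sketch of the grouped map pattern, using the Spark 2.4-style decorator the talk describes (Spark 3.x prefers df.groupby(...).applyInPandas(fn, schema), and on Spark 2.x the Arrow key is spark.sql.execution.arrow.enabled). The column names and toy data are illustrative, and instead of writing a model file we hex-encode the pickled model into the returned string column.

```python
import pickle

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StringType, StructField, StructType
from sklearn.ensemble import RandomForestRegressor

spark = SparkSession.builder.appName("grouped-map-training").getOrCreate()
# Enable Arrow so the Spark <-> Pandas hops are columnar and vectorized.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

# Output schema: the parallelization key plus a string carrying the model.
schema = StructType([
    StructField("group_id", StringType()),
    StructField("model", StringType()),
])

@pandas_udf(schema, PandasUDFType.GROUPED_MAP)
def train_one_group(pdf):
    # pdf holds all rows of one group; the key rides along as a column.
    group_id = pdf["group_id"].iloc[0]
    model = RandomForestRegressor(n_estimators=50)
    model.fit(pdf[["feature1", "feature2"]], pdf["label"])
    blob = pickle.dumps(model).hex()  # pickle dump, returned inline as hex
    return pd.DataFrame({"group_id": [group_id], "model": [blob]})

# Toy input: two groups, two feature columns, and a label column.
pdf = pd.DataFrame({
    "group_id": ["a"] * 10 + ["b"] * 10,
    "feature1": list(range(20)),
    "feature2": list(range(20, 40)),
    "label": list(range(40, 60)),
})
sdf = spark.createDataFrame(pdf)

# Lazy until an action: toPandas() triggers one task per group, in parallel.
results = sdf.groupby("group_id").apply(train_one_group).toPandas()
print(results)
```

A trained model can be restored with pickle.loads(bytes.fromhex(blob)); in practice, as in the talk, one would write the model to storage and return only its file name.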
So that actually concludes our presentation, and we have a few concluding remarks. Basically, we presented a way to take the best of Kubernetes and the best of Spark to scale with end-to-end automation, security, and CI/CD, and to parallelize our models and scale them to a very high cardinality. We used a very unique architecture, as Jakub described; it is unique because we embed Spark into a microservice within Kubernetes, which I think is very novel but very simple. And obviously, one advantage we have with Spark, as Jakub mentioned, is that we run these models on clusters that we create and terminate once the actual job or training is over, which saves a lot of resource cost. And as we evolve, with more models and more applications, it is very easy for us to integrate them into our current infrastructure. In terms of scaling, we are also looking at other dimensions of scaling within this architecture, beyond just the number of models. When an application is very big and very complex, we can scale within the application, with the data itself. And we are also looking at some very sophisticated deep learning models, which tend to be very complex, and at how we can leverage this architecture to scale and parallelize within those models as well. So those are some of the things we are going to be doing in the future. Great. Thank you. Okay. Thanks, Sandeep. Thanks. Thank you. That's all from our side.