One, two... I think it's 9 a.m., so I don't know if we should start, but we are streaming, so everyone can watch the recording afterwards. Welcome to the big crowd, the audience; thank you very much for getting up so early on a Saturday. I think there's another presentation happening at the same time, so I understand that everyone is there.

My name is Vašek Pavlín, I've worked for Red Hat for six years, and I would like to tell you something about what we are doing in the AI Center of Excellence with JupyterHub on OpenShift. When I saw my talk on the schedule I had called it "Data exploration with JupyterHub on OpenShift", and then I realized I'm not going to do any data exploration, so I fixed it: I will call the talk "Enabling data exploration with JupyterHub on OpenShift". What I would like to talk about is how we deploy JupyterHub, what the components are (the technicalities of the actual platform and tooling), and how we integrated it with other parts of the platform.

So why do we explore data, and why do we want an AI Center of Excellence at all? A smart person, Clive Humby from the UK, said that data is the new oil. It is valuable if you collect a lot of it, but it needs to be refined, the same way oil needs to be refined into plastics and gas and chemicals and whatever. Without cleaning and transformation it's just a pile of bits which you can't really make sense of in most cases, because it's too much and it needs to be cleaned up. The same gentleman also came up with the Tesco Clubcard, which might not sound like a lot, but it is a great source of information about customers, what they buy and how they behave. So he probably knows his stuff.

To be able to work with JupyterHub on OpenShift there are some prerequisites, so let me quickly go through them. One of them is OpenShift: you need to have OpenShift running, which is kind of obvious if you want to deploy there. What is OpenShift? Who has never heard about OpenShift, or doesn't know anything about it? Good. But quickly, for the big audience on YouTube: it's an enterprise distribution of Kubernetes. It is built on top of Kubernetes, so it's a scalable container orchestrator. If you have anything to do with containers and you want to run them in production, you want to use something like Kubernetes or OpenShift. It has all the basic concepts of Kubernetes, things like pods, services, deployments and persistent volumes, but it adds more: the development workflow with builds and image streams and things like that. You can go to okd.io, which is the new place to find information about the upstream version of OpenShift, which was called OpenShift Origin in the past.

The second thing we need for the work we are doing with JupyterHub is some object storage. You are probably familiar with AWS S3; we use Ceph, and Ceph implements the S3 API, so you can use the same libraries, boto3 in Python or the Hadoop S3 library in Spark, to access your data in Ceph through the S3 API.

Luckily, I didn't have to set up either OpenShift or Ceph. Everything I will use here is deployed on the MOC, the Massachusetts Open Cloud. If you saw Steven or Sherard or whoever was talking yesterday about the MOC: we are working with them on something called Open Data Hub.
So this is part of the Open Data Hub, and we are deployed there.

What are the tools that we will be using for the data exploration that we are not going to do? First, and really the core part, is Jupyter. You can run a Jupyter server and Jupyter notebooks on your laptop, and it looks basically like this: the notebook is split into cells, and a cell can be markdown, code, or the output of the code. You as a user type something into your web browser (it's a web application with a backend), the code is sent to a Jupyter kernel, and the kernel sends output back to the Jupyter UI so you see it in your browser. There are plenty of kernels, you can find them on GitHub: there is Python, Scala, I even saw C#, and R is pretty popular.

The good thing is that the actual notebook file is just JSON, so if you want to do something fancy with the content you can take the JSON, parse it and work with it, or you can view it in the Jupyter notebook UI and actually run the code there.
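As a small aside on that last point, here is a minimal sketch of what "the notebook is just JSON" means in practice. This is only an illustration: the file name is a placeholder, and the fields shown (cells, cell_type, source) are the ones defined by the standard notebook format.

```python
import json

# An .ipynb file is plain JSON; the path here is just an example.
with open("example-notebook.ipynb") as f:
    notebook = json.load(f)

# Each cell records its type (markdown or code) and its source text.
for cell in notebook["cells"]:
    print(cell["cell_type"], "".join(cell["source"])[:60])
```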
What builds on top of that is something called JupyterHub. The difference is that Jupyter notebooks themselves are single-user: you run them on your laptop, you write the code and you see the results. But if you want to provide that capability in a distributed way, to a team of people in your company or at a school (I think for universities, or even high schools where you want to start teaching Python, that might be super interesting), you can provide these notebooks and JupyterHub to the class. When users log in, JupyterHub automatically spawns a Jupyter notebook server for them and they can work with it. It spawns and manages those notebook servers so that a user always gets back to their own persistent version of the notebooks they were working with.

The last part of the system is Apache Spark. I assume you are all at least slightly familiar with it. As the website says, it's a unified analytics engine for large-scale data processing. That means if you want to process some data, do some cleaning, train some models, you would use Spark. It provides APIs and libraries like Spark SQL and MLlib, the machine learning library, which has plenty of algorithms implemented, and it works in a cluster mode: you have a master and workers, the workers do the work and the master orchestrates them. We will use the Jupyter notebook to connect to Spark and do the processing there, so that the notebook, the Jupyter server, doesn't have to be that beefy and work with all that data itself.

So I have a quick demo. It's nothing fancy, basically just a walkthrough of how JupyterHub works. Here I have OpenShift and I have JupyterHub deployed. I go to the URL that OpenShift generated for me and I sign in with my OpenShift credentials. That's quite important, because I don't want to remember yet another set of credentials; I want to use something I already know. I'll get back later to how that is solved in JupyterHub. Now I select from the list of images. I want to use Spark, so I will select this Spark image. These images are basically what you would get if you installed a Jupyter notebook server on your local machine: they represent your laptop with the dependencies installed. If you wanted to use Spark and PySpark you would need some configuration, PySpark and Java installed on your laptop, and it works the same way with these notebooks: the image contains all the dependencies that are needed.

So I have these two notebooks. One I called "boto" because I use the boto3 library, which is a library that implements the S3 API. I have my credentials in environment variables (I'll show you later how I got them there) and I connect to some endpoint. So I can run this: it installs the dependency if it's missing, connects, and then I can list buckets. If you saw the Open Data Hub presentation yesterday, you saw Steven create his bucket; we are on the same endpoint, so I see his bucket, and I created mine here, so he could see my data there.

Then I have this other one, which I actually just downloaded from the internet. I searched for "pyspark Jupyter notebook" and got to a repository that someone created; I don't know the gentleman. He has a couple, so I just took the last one, because I figured it was probably going to be the coolest. I had to fix some stuff because he wrote it for Python 2 and we are running Python 3, but it was mostly just syntax fixes. What it does is: it again connects to S3 and connects to Spark (see, I have this Spark cluster URL in my environment), then it downloads some data from the object storage, which I pre-uploaded there, and it does some decision tree training. It uses a data set from the KDD Cup, which was a network intrusion detection competition, so it builds a classifier for network intrusion detection. Let me run that, run all below, and it validates whether the decision tree was a good one.
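To make those two notebooks a bit more concrete, here are two minimal sketches of what such cells typically look like. These are illustrations, not the exact notebooks from the demo: the environment variable names, endpoint, bucket and file names are assumptions, and the Spark one assumes the notebook image ships the Hadoop S3A connector.

```python
# Notebook 1: list buckets on the Ceph endpoint with boto3.
import os
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["S3_ENDPOINT_URL"],           # assumed variable name
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
)
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```

```python
# Notebook 2: connect to the ephemeral Spark cluster and read the KDD Cup data.
import os
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master(os.environ["SPARK_CLUSTER"])                   # e.g. spark://<service>:7077
    .appName("kddcup-decision-tree")
    .config("spark.hadoop.fs.s3a.endpoint", os.environ["S3_ENDPOINT_URL"])
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    .getOrCreate()
)

# Bucket and object name are placeholders for the pre-uploaded data set.
raw = spark.sparkContext.textFile("s3a://my-bucket/kddcup.data.csv")
print(raw.count())
```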
So it's now connecting to Spark. For the Spark side, we can look here in OpenShift: it's running as part of my namespace, as part of the JupyterHub namespace, and I have two workers. I also created a route so that we can look into the... we can't look into that, because it doesn't have... let me fix that quickly. Okay, let's not fix that. What you would see here... let me try a different thing, let me start Firefox. What happened?

I'm sorry. So what we see here is that we have two worker nodes. Each executor gets 20 gigabytes of RAM and eight cores; well, eight cores together, so each executor gets four. That's where the notebook code is actually running: it downloaded the data and now it's processing it, splitting the CSV, and then it will be training the decision tree. What do we have in OpenShift? As I mentioned, we have JupyterHub, we now have my own Jupyter server, which JupyterHub is routing to, and we have the Spark cluster. I'll let it run; the training takes some time, like eight minutes, so I'll go back to the presentation and we can revisit it.

So, about the architecture of JupyterHub. The entry point where you come in as a user is the Jupyter proxy, which routes either to the JupyterHub API or to the server that you started. Then there is something called a spawner, which takes care of spawning those notebook servers per user. We use KubeSpawner; as the name suggests, it is a spawner that works with Kubernetes, and it generates the pod definition and submits it to OpenShift. There is also a database, which is just there for tracking users, started notebooks, proxy routes and things like that, so they don't disappear when there is a restart or something.

How it works: as you saw, a user comes to JupyterHub and is redirected to some authentication. There are multiple implementations of authentication for JupyterHub: you can have GitHub authentication, you can have Kerberos, you can have a pre-generated set of users and passwords. When we were doing some workshops we just generated 20 users and passwords and handed them to the attendees. Then, when the user requests a server to be spawned, JupyterHub generates the artifacts for OpenShift, and if it finds that I want to start a Spark notebook, it also generates a config map for the spark operator.
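Before getting to the spark operator, here is a rough sketch of what the JupyterHub side of that setup can look like in jupyterhub_config.py. This is not the exact configuration from the talk: image names and limits are placeholders, and option names vary a bit between KubeSpawner releases, but it shows the pieces just described, OpenShift login plus a Kubernetes-backed spawner.

```python
# jupyterhub_config.py: illustrative fragment only

# Spawn each user's notebook server as a pod on OpenShift/Kubernetes.
c.JupyterHub.spawner_class = 'kubespawner.KubeSpawner'

# Let users log in with their existing OpenShift credentials (OAuth).
c.JupyterHub.authenticator_class = 'oauthenticator.openshift.OpenShiftOAuthenticator'

# Defaults for the spawned pods; values below are placeholders.
c.KubeSpawner.image = 'quay.io/example/jupyter-notebook:latest'
c.KubeSpawner.cpu_limit = 2
c.KubeSpawner.mem_limit = '4G'
c.KubeSpawner.storage_pvc_ensure = True   # give each user a persistent volume
```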
I will explain what the spark operator is later, but basically it takes care of Spark clusters. OpenShift starts my Jupyter server and notifies the operator about the requested Spark cluster, the Spark cluster is started, and then the user accesses their notebook and connects to Spark. When the user stops the server, it also kills the cluster.

We use an APB, an Ansible Playbook Bundle. You can learn about those from the OpenShift documentation, but basically it's a set of artifacts for OpenShift describing how to deploy each service and how they should work together. You can have that in a catalog in OpenShift and deploy it nicely with three clicks or so. The APB source code can be found under opendatahub.io, and we have the image built in Quay under the Open Data Hub organization, so you can go there, download the image, deploy it to your OpenShift and try it yourself.

So what is special about our JupyterHub? Basically, I built my work on top of the work of Graham Dumpleton, which I have linked at the end; it's "JupyterHub on OpenShift", or the JupyterHub quickstarts, something like that. I took that and built something on top of it, and the differences are mainly these four things: image auto-discovery, single-user profiles, ephemeral clusters, and publish and share.

What does that mean? You saw that select box for the images; it is automatically generated from the notebook images that are built in OpenShift. It is not very nice right now, the user experience is not very good, but I'm planning on improving it with descriptions, the installed dependencies and so on, so it provides more information to the user. Even so, it's already helpful: you don't have to remember anything, you just pick from the select box.

The single-user profiles: quite quickly after we started to use JupyterHub, we realized that every sub-team in our team, and every image, needs a different configuration. If you are working with some Parquet file that you download from object storage and you don't use Spark, because you just want to process it directly in the notebook, you might need more memory for the notebook. If you are working with Spark, you don't need that much memory, but you need Spark deployed. If you are working with some specific object storage endpoint or bucket, you might need that in your environment variables. So we built this library, which right now is configured with a config map in OpenShift, and you can mix and match images and users: what should happen when a user selects a given image, and how it should be configured.

You also saw that I had that Spark cluster inside my namespace. That works through the spark operator. Operators are a concept in Kubernetes and OpenShift: there is a service, the operator, which listens for events, and if it finds a specific event it reacts to it. So the user comes and says, "operator, can I get a Spark cluster?", the operator says "sure, you can", and it deploys Spark based on the configuration. And when the user leaves and says "I don't need Spark anymore", it removes the config map, or the custom resource, and deletes the Spark cluster again.
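To illustrate the single-user profiles idea described above, mapping a chosen image to memory limits, environment variables and so on, here is a rough analogy using KubeSpawner's own profile_list option. The actual library from the talk is a separate project configured through an OpenShift config map, so take this only as a sketch of the concept; the image names, limits and variables are made up.

```python
# Conceptual analogue of "single user profiles" via KubeSpawner's profile_list.
c.KubeSpawner.profile_list = [
    {
        'display_name': 'Plain notebook (more memory, no Spark)',
        'kubespawner_override': {
            'image': 'quay.io/example/scipy-notebook:latest',     # placeholder
            'mem_limit': '8G',
        },
    },
    {
        'display_name': 'PySpark notebook (ephemeral Spark cluster)',
        'kubespawner_override': {
            'image': 'quay.io/example/pyspark-notebook:latest',   # placeholder
            'mem_limit': '2G',
            'environment': {
                'S3_ENDPOINT_URL': 'https://ceph-rgw.example.com',  # placeholder
            },
        },
    },
]
```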
So we have that wired into the profiles: we say that if you select the Spark image, we want to instruct the spark operator with the configuration for that image and say, please deploy two workers and one master with these resource limits for us.

The last bit is something we also hit quite early: the workflow for sharing your notebooks. With plain JupyterHub, if I want to share my notebook with a colleague, I have to download it, send it over email or push it to git, and then he needs to download it and upload it to his JupyterHub. That is not very nice if I just want to show him a simple change, like "on line 24 I changed this letter and now it works". So I built a plugin for JupyterHub where you click a button, you give it some name, you hit publish, and you get a URL you can access, rendered by nbviewer. nbviewer is a tool that lets you view notebooks without being able to execute anything; it's read-only, but it renders the notebook basically the same way as JupyterHub. This URL you can share: it's public, it's not behind the authentication, so you can share it with anyone, and they can view it and also download the notebook. I think that helped us a lot to speed up the process.

So how is our training going? We can see that the decision tree classifier got trained. This is the decision tree, so there are a lot of if and else statements, and now it's doing something else. I didn't really dig deep into this notebook; I just wanted to show that with our deployment we can directly use Spark and the ML libraries without having to make many changes to a notebook I found randomly on the internet. So the integration is really good.

I also wanted to go over a couple of ideas I have about next steps for this JupyterHub. We currently have the spark operator integrated, but there is also Dask. I don't know if you've heard about Dask; it's a Python-based distributed analytics engine, or whatever you would call it. "Provides advanced parallelism for analytics, enabling performance at scale for the tools you love", so it basically looks like Spark implemented in Python, supporting Python better than Spark does. Maybe. We are thinking about adding that next to the spark operator: having a Dask operator, which would then spawn a Dask cluster if the user wants that.
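For comparison with the PySpark cell above, here is a minimal sketch of what the notebook side of that Dask idea could look like. Nothing like this exists in the deployment described in the talk yet; the scheduler address, credentials and bucket are hypothetical, and reading from S3 this way assumes s3fs is installed in the image.

```python
from dask.distributed import Client
import dask.dataframe as dd

# A Dask operator would expose a scheduler service, analogous to the Spark master URL.
client = Client("tcp://dask-scheduler.my-project.svc:8786")   # hypothetical address

# Read CSVs straight from the Ceph S3 endpoint; all names below are placeholders.
df = dd.read_csv(
    "s3://my-bucket/kddcup-*.csv",
    storage_options={
        "key": "ACCESS_KEY",
        "secret": "SECRET_KEY",
        "client_kwargs": {"endpoint_url": "https://ceph-rgw.example.com"},
    },
)
print(len(df))   # triggers the computation on the cluster
```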
If you noticed, in my notebook I have these environment variables with credentials. They are not there automatically, but I would like them to be automatically populated for users based on some secrets somewhere. Right now I have to add them in the single-user profiles config map, so I would like an automated way to get those credentials from some source of truth and push them into the server automatically, so the user doesn't have to care about that.

I would also like to work on Git and GitHub integration, so you could have a button, same as the publish one: "push this to my git repo", or "create a git repo for this notebook", something like that. I've seen some attempts on the internet where people were doing that, but it never really worked in a user-friendly way. You saw that select box, which was pretty ugly, so I'd like to make it fancier and more useful for users. JupyterHub also exposes metrics, how many requests, how many users and things like that, and maybe we could build some alerting on top of that, like "my cluster is getting full because too many users are using JupyterHub at the same time". I think they are enabled, but I'm not sure exactly what is in there, and we don't have a Prometheus watching it, so we need to set that up for the JupyterHub APB, and probably extend the metrics, because as we start using Spark and the connection between JupyterHub and Spark, we need to be able to map them together in the metrics.

And that's basically everything I had. These are some useful links: this is the APB, this is the link for the single-user profiles, which is a quite simple library just for that one use case, and here is the OpenShift configuration for JupyterHub, which is then used in the APB. For the spark operator, a colleague from the radanalytics.io team in Red Hat was working on it, so I just used it and it worked perfectly. And this is where we came from: the JupyterHub on OpenShift work that Graham Dumpleton put together. You can go there and try JupyterHub without all the Spark and other things, just on OpenShift in its simplest form.

So that's basically it. Any questions? Yes, sure. So the question is whether, with the spark operator, we have one shared cluster or a cluster per user. I didn't mention that, so: we were basically deciding whether we should deploy one big beefy Spark cluster and let everyone connect to it. But that has its issues and limitations, like having to reserve that capacity on your OpenShift cluster. If you want a hundred users, and you want to allow all of them to go to that cluster at the same time, you need hundreds of gigabytes of RAM reserved for those workers. Or you can have ephemeral clusters: when the user comes, logs in and starts the server, it starts a Spark cluster for them, and when they go away, it kills their Spark cluster. We are doing the ephemeral clusters right now.
So when the user comes, they get their own fresh, clean Spark cluster with some resource limits, which are obviously tighter than if it were one big cluster and you were the only one there. But we need to do some performance testing and get more data about how that actually works and whether it's useful for people.

Startup time for what? The Spark cluster? It's quite fast. Basically, I can probably show you: I'll kill my cluster, I will stop my Jupyter server... I think it's basically a couple of seconds, or maybe a couple of tens of seconds. Why can't I close that? Great. So once the Jupyter notebook server disappears, which should be any second now (it has to wait for the timeout, because there are no shutdown scripts in that image; that's also one thing we need to fix), when it goes down, the Spark cluster will disappear as well, and then I can start it again. In the meantime we can take another question or two, if there are any.

Yes: where is your data kept? "I have an HDFS cluster." So, we have the Ceph cluster backing the OpenShift, and the Ceph cluster is basically where we push and pull data from. "Okay, so I could just connect my Hadoop cluster?" Yeah, it doesn't really matter what technology you choose for that.

So it's gone. I'll just go here, I click "start my server", I pick the Spark image, I go back here, and you'll see that my Jupyter notebook is starting, and basically immediately I got the two workers running. The master node takes a bit of time because it needs to connect to the workers and figure things out. It depends: if you are starting it for the first time, there is some time the container images take to download on the node, but on a second start (and the images are the same for every user, so once they are downloaded on the node) it is basically an instantaneous start. I don't know why the master takes so long now, but I think it's running fine, it just didn't update the UI. So yeah, it was basically an instantaneous start.

Yes: the notebook spawner, does it scale up to multiple nodes, or do you have to configure that on JupyterHub? So, the spawner is not doing anything smart; it just generates the pod definition and pushes it to OpenShift, and OpenShift schedules the pod. That basically means it's up to the OpenShift scheduler, so it would distribute them across the cluster, it wouldn't put them all on a single node, depending on the size and load of the cluster. I don't know the implementation details of the OpenShift scheduler, but yes, it is based on OpenShift scheduling, so it would be distributed. The same goes for Spark. We could configure it with some affinity, like "put my Jupyter server close to the Spark workers", but it doesn't really bring anything, because we are pulling the data from Ceph over S3, not sending it from the notebook server.

I think there was some other question. How mature is the spark operator? Yeah, it is quite new.
I think it's a couple of weeks old. Honestly, there are things missing, and I still miss some; I've already filed a couple of feature requests. I was missing some configuration options: at the beginning there was no way to set resource limits for the workers and the master, for example. That's there now, and I have a couple more feature requests in the queue, like being able to force-update the images and to configure certain values. But generally, the core job of spawning and killing clusters works very well; I haven't had an issue with that. The guy who works on the spark operator actually built a library in Java, I think he calls it JVM operators, which is a library you could use to build other operators. He's trying to get that very stable, and then the spark operator will benefit from it.

Okay, I think we are out of time anyway; we have one more minute. So if there is no other question, let's hear it for our speaker. Thank you very much.

Okay everyone, for the next session we have Andy Gospodarek, software architect at Broadcom, who will be talking to us about improving network latency and throughput with dynamic interrupt moderation.

All right, yeah, as expected, this is a talk with wide appeal to many conference attendees. Something that sometimes feels dry, like the kernel, or doesn't contain a container buzzword, might be problematic, but I'm really proud of this work we did and really glad that I could come here today and share a little bit of it with you. In particular, I think this is a surprisingly good fit for this track. So rather than calling it what it is (which I can't even read: improving network latency and throughput with dynamic interrupt moderation), we're going to talk about auto-tuning your network. And I thought I had a little picture of our favorite auto-tuning artist there.

So what is dynamic interrupt moderation? For those wondering: people probably aren't familiar with how packets in the Linux kernel actually make their way from the physical hardware into the kernel stack itself. The main idea here is that we're going to tune the time between when the first frame arrives on the line and when an interrupt pops. There are a variety of reasons to do this, and we'll go into those in a little bit, but this is the flow: an interrupt pops, we schedule a polling event, and that polling event ultimately reads the receive ring of the NIC. In this beautiful picture we have frame zero to frame n, and a head and a tail; these are essentially the frames that have not yet been read and pulled out of hardware and marked as complete.

So, a typical workflow: the arrow going from left to right indicates time moving on, each upward-facing arrow signifies an interrupt, and the stack of five rectangles indicates five frames read out of the ring buffer. With a fairly consistent flow of incoming traffic, this interrupt period (as I'm waving my hands from this first arrow to the next one) represents the interrupt timing we would have. We get a few frames in, service them, pop another interrupt, more frames, etc., etc., in a steady state.
This looks pretty good. So if we have a short interrupt time, that means we have a really small number of frames processed in each polling event. That can be good if your concern is latency; it can be bad if your concern is throughput, because an interrupt is pretty expensive. If we think about doubling the interrupt period with the same traffic flow, we have a situation like this, where instead of receiving five frames at each polling event we would now receive ten. That's a great description of a workload where you want high throughput, and the downside is also high latency.

As you might (or might not) be surprised to find out, this is not a particularly new problem; it's something people have been dealing with for a long time. Administrators have been dealing with it for literally decades. I think I first came across an issue like this easily in the aughts, if we're calling the previous decade that, and I regularly had to talk to customers when I was at Red Hat to try to figure out how they should tune their devices. The first attempt at really dealing with this was in one of Intel's 10/1 gig adapters: they had a hardware feature called AIM, adaptive interrupt moderation. This was actually the source of what ultimately became a fairly long investigation into why someone was having a particular problem, because they were primarily concerned with low latency, not with throughput, and unfortunately at the time, when we were only dealing with a single receive queue, most people were concerned with throughput; that was one of the big tests that was done.

One of the things about AIM is that it was liked by some and disabled by many, like many of the hardware features that have existed in the past. There's always a little bit of angst: a hardware designer builds it, rolls out some software to configure it, maybe it doesn't work exactly as everybody expects, and then there's significant frustration about "why does this feature break my network?"
So, kind of the same story over and over: it works for a lot of folks, but the lack of flexibility that existed in hardware was not good enough for some. People always seem to default to thinking that software is more flexible and better, and in many cases it is. At the time, one of the interesting things is that we sat around the office and postulated whether it would be good to have a user-space daemon that controlled this interrupt timing. This happened when we were first starting to see the tuned profiles come out on various Linux distributions, most of them Red Hat and Fedora, and you could tune your workstation or your laptop for whether you were most concerned with high performance or with better battery life; I think at the time there were even some networking configurations available. Many of these things twiddled bits in Intel's power management capabilities of the time. So we thought: what if we did a parallel thing, where the administrator could, at the beginning of time, say "this is a workstation where latency is the most important thing, so let's tune for that", or "this is a file server where we care most about moving bulk traffic on a regular basis". We sat around the office and pondered whether that would be a good idea, and I think, looking back on it now, we've come to the realization that it was completely and totally the wrong strategy. We could say it was blind luck that we didn't implement it, but realistically it was probably more about laziness than anything else.

So let's fast-forward a few years and think about where we are now. Machine learning, AI everywhere. I'm amazed, sadly amazed, by how much of it is on our phones: things automatically presuming the time of day you want to do something based on where you're physically located. It's funny to me how impressed I am by tiny, simple things that you're probably never taught in any CS or computer engineering program anywhere. One of the first things that came to mind is a talk Tom Herbert gave at Netdev in Montreal last year. He talked a little bit in his keynote about the impact of artificial intelligence; you can see I've got a screen grab of his video on YouTube with the link here, all very clickable for everybody right now. He talks about machine learning (he's got a new company that I think machine learning will play into), and one of the things he asked is: will the latest congestion control algorithm, TCP BBR, be the last human-written congestion control algorithm that exists? It struck me when I was thinking about this how interesting it would be for that to be the last one written by hand, and how, through machine learning, we could come up with better ones automatically. It's a little bit Skynet-y, a little bit scary, but at the same time I think the power we have is massive compute power and software's ability to do the same thing over and over effectively. (That mouse moving around is... not good. The mouse is not good.)
So, coincidentally, Mellanox added support for what we're now calling DIM in their main 25/50/100 gigabit driver in 2016. (Bless you.) The fact is, I was trawling around in their driver and wondered: now, what is this operation here? It doesn't really make sense. They're doing something on receive, a little bit of data gathering it looks like, and it seems like they're using it to make a decision later. And that's exactly what they were doing: they were calculating how many bytes were coming in, counting the number of times an interrupt popped, and using that data to come up with what they felt was an optimal setting for their receive interrupt timer.

So if we go back here a second, remember our two pictures: this one is a pretty steady state, regular interrupts servicing a small chunk of packets at a time; this one has longer interrupt periods serving more bulk traffic. It looked like they were trying to figure out which timing was best based on the traffic that came in. Not pictured in either of these slides is the fact that each of these packets, each of these frames, could be a different size, which also plays into it. It's easy to think, when we receive a packet, that it's all the same, that a 64-byte packet and a jumbo 8K or 9K frame are the same, but realistically they all take a different amount of time on the wire, because there are discrete bit times required to handle them.

So I thought that was pretty interesting, that Mellanox had that, and we started looking at it. This is basically how it works (in this slide, credit to Tal Gilboa from Mellanox, who gave a talk on this earlier this year): take a sample, compare that sample to previous iterations, and then decide whether or not you want to make a change. When we dug into it, it seemed pretty good.

The other cool thing, and as a kernel developer I'm okay with this: one of the things we hear at a lot of talks is escaping the constraints of the kernel. People feel the kernel limits them, and DPDK or some other thing is better for their individual applications; I would 100% believe that. One of the other things running this in a driver allowed us to do is escape the lock-in of the global ethtool API that's used for configuring these interrupt timers. In the past, and still today, because they keep ethtool pretty static, if you configure interrupt timing, it applies across all queues.
We of course now live in a networking world where it isn't just a single queue receiving all this traffic: multiple cores are tasked with servicing it, which is how we can get to 100 and pretty soon 200 gig Ethernet on a server. So this allows us to escape some of that kernel lock-in. What we really found is that, because it can operate independently, we can also have different types of traffic handled by different cores. This is especially useful in a virtualization case, where you might have an application that needs to be low latency running in one VM, and another application that's ultimately serving as a storage destination. Now, all of a sudden, we can have the best of both worlds: we could run a netperf stream test and get pretty much maximum throughput on one core, and run a TCP_RR test with netperf at the same time and see low latency, because they end up being serviced by different CPUs. That was super cool.

So we'll talk a little bit about the algorithm. It's not super amazing, and the great thing is that the algorithm I'm talking about today is actually open source; it's in the kernel, you can look at it. So spending a lot of time explaining how it works probably isn't super valuable, because I know everyone loves reading kernel code. I know it's what helps them sleep at night, as well as the first thing they read in the morning.

In a typical case we have five profiles right now. What's critical, at the top, are the different timer settings. Down here on the far left is the low-latency case, where we want the timer to pop really quickly, and on the far right, the timer of 256 microseconds would be the high-throughput case. The reference to "left" and "right" is something that's baked into this algorithm, and everything starts down at the low-latency end. I think it makes a lot of sense to start there rather than in the middle, because typically low-latency traffic is going to be small: quick sessions, a small number of bytes. So we default to that, and the rate at which we sample and make changes quickly moves us down the line to the right.

The decision tree is really pretty simple. We have our previous decision, either right or left; we compare the samples we've collected (on every single packet we receive and every single interrupt we process) and then we make a decision: is this better, worse, or the same as before? If it's the same, we park it (an analogy that probably applies to everybody that drives). If it's worse, we turn around: we go left in the case where we were previously going right. And if it's better, we keep going the same way, further to the right. The "compare samples" piece can also be tuned a little bit depending on your workload or your speed, but one of the coolest things is that I've tested this across fast processors, slow processors and super fast processors, if we want three examples, and it holds up across all of them. It works on a small system, maybe even a 32-bit ARM case, and it works well on the latest Intel devices.
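Purely as an illustration of that left/right decision logic, and emphatically not the kernel implementation (which is in C and also weighs byte and interrupt counts per sample), here is a small sketch in Python. The profile values are placeholders except for the 256-microsecond top end mentioned above.

```python
# Illustrative sketch of the decision logic described above; not kernel code.
PROFILES_USEC = [2, 8, 32, 128, 256]   # left = low latency, right = high throughput

def next_step(index, direction, prev_sample, cur_sample):
    """Compare two samples (e.g. measured throughput) and pick the next profile."""
    if cur_sample == prev_sample:
        return index, direction                      # same as before: "park it"
    step = direction if cur_sample > prev_sample else -direction
    new_index = min(max(index + step, 0), len(PROFILES_USEC) - 1)
    return new_index, step

# Start at the far left (low latency) and walk as the measured samples change.
idx, direction, prev = 0, +1, 0.0
for sample in [1.0, 2.0, 3.5, 3.5, 3.0]:             # made-up throughput samples
    idx, direction = next_step(idx, direction, prev, sample)
    prev = sample
    print(PROFILES_USEC[idx], "usec interrupt timer")
```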
All right, so I mentioned Intel and Mellanox, but what about Broadcom? I mean, they're the ones paying for me to come here and talk about this. Of course, the big reason I'm here is that we found this interesting, we ported it to our driver, and we really liked what we saw; in fact, other people confirmed they really liked what they saw too. So here are some super fun graphs.

In the case of default and adaptive coalescing, in the first picture on the left, you can see that the throughput was basically unaffected by the number of streams. This was a 25 gig NIC, that's why we're up there at the top: we can almost fully utilize it with just one core, and certainly once we hit two cores, or two streams, we're utilizing it a hundred percent. The point of this graph is to show that even with the small hit that comes with collecting this information, we were pretty much right on in throughput; there's no hit there.

The graph on the right is a little more complicated to understand, so I'll explain it. The x-axis represents the number of streams in use and the y-axis represents the total CPU utilization. Unsurprisingly, with one stream we're utilizing one core completely. The graph is kind of funny in that there's a two-and-a-half-core example there; that wasn't really what we did, there's a dot at two, and it would have been nice if whatever spreadsheet technology we used to graph this had chosen to put the lines at two, but anyway. At two streams, with the default coalescing settings in our driver, we saw much higher CPU utilization, because there wasn't the ability to adapt and move the interrupt timer way out. On the graph on the right, lower is better, so adaptive is clearly winning as we scale up towards eight cores being used for receive traffic. You see there's a 7.0 there: we're barely utilizing two and a half cores in total when you add it all up, versus probably close to four and a half or four and three quarters with the default settings.
So we feel like this is going to be a huge win in the throughput case. The other thing we did was some TCP_RR performance testing. I hesitate to show raw numbers, because every time I get a new system with a new processor these numbers all change, but we put them in anyway. With our original static coalescing, at the best rate we could do about 20,000 transactions per second; with adaptive, a little bit less. I'll talk about why there's a 4% reduction, but honestly we were really happy with this. The fact that we're paying attention to every interrupt and every byte that came in, and doing computation on that (not on every packet, but statistically, within a certain number of packets, analyzing whether we need to make a change), and that this only cost us a total 4% hit in this single-stream test, we knew was going to be a real positive, at least for the people we were going after with this. And they were quite pleased.

We also confirmed that one receive ring can be optimized for low latency and another for high throughput. This was really the case I found most interesting: this flexibility just doesn't exist today in the Linux kernel, so by adding this feature we were able to provide something that, other than Mellanox, no one else could do. So I was really happy.

What we decided to do, rather than just take Mellanox's code and add it wholesale to our driver (which seemed really weird in some ways), was that I worked with Tal Gilboa at Mellanox and we actually made a generic layer and library. Now, yesterday, if you sat through one of the late-afternoon talks, there was a panel that said AI is not just about adding a library and thinking everything magically works. I won't necessarily refute that, but I will say that in this case, that's one of the cool things: you can just add the library, you add the right probe points within your driver, you add a function call that can set this value in your hardware, and you can just use it. In fact, after posting my first patch upstream, I got several off-list emails from people who were interested. One of them does happen to work for Broadcom, but not in my division, so I didn't know he was interested, and the BCMGENET driver used this right away; in fact they also adapted it and wanted to use it for transmit as well. This is a great example, in my view, of the power of this, because that's a driver for an ARM SoC that's typically embedded in set-top boxes. If you have used pretty much any of the triple-play offerings from ISPs, where you can plug a phone in, plug some Ethernet in, and it has Wi-Fi built in, that's the type of application this has. In their case they've got a wide array of traffic patterns: you might have home-use traffic that is streaming video, so you're going to have large frames and you want to be optimized for that, but you're going to have other flows that are very small or very short-lived. As soon as this came out (Florian Fainelli is the one that did this work)
he was excited about it, because they'd been pondering the fact that they saw such a huge difference in the way their systems performed when they used different values. So the fact that this could tune it for them without doing anything, they were pretty stoked about. So, more drivers to follow? I don't know. I've talked to folks at Intel; they have a little something in their driver that does something similar, and they also have some hardware with some fun features. We additionally have, actually amazingly, lots of hardware IP blocks to try to handle this situation, more than just a basic interrupt timer. We've got several things that we don't completely expose because there's no API, and part of the challenge here was Michael Chan and I working out how we should handle this, and how we can best give customers, and more importantly administrators, the opportunity to run this. The theory should be that when this is working, and it's in the distro and it's everywhere, there should be zero support calls. Again, that's the theory: zero support calls from anybody who says "my network's not performing in this low-latency case" or "it's not performing in this bulk-transfer case". This should be done; it should eliminate those calls. We have outsourced this work to the machines.

I want to share just a couple of observations, some surprising, some not. For me, this was a fun thing to work on, which, at this stage in the game, having worked on the kernel as long as I have, is sometimes a little bit rare. One of the first things we came across is that programming hardware can be expensive. When I say expensive, we're still talking about milliseconds or microseconds, but it can be, and this is a common case across multiple hardware vendors. In fact, we spent a lot of time tuning and understanding the ideal point to sample, and the ideal point to decide whether or not we should make a new decision, because you can do it so frequently that you see a much greater than 4% reduction in your low-latency tests. And this expense, when running on the same CPU that's receiving the traffic, is going to cause a small interruption in traffic. We talked about scheduling it on other CPUs, and decided that was an experiment we could look at another time. But it's another thing to think about: the cost of doing these operations to hardware is never free. A good thing to remember.

The other thing we found is that a few benefits appeared sort of unexpectedly. When we were doing some testing, a typical test case where you have a whopping two devices involved and you're doing some transmit from one to another, as with almost anything you have an experimental group and a control group. What we started with was adaptive interrupt moderation on our test server running an upstream kernel, and another system just running an upstream kernel with our normal driver, and we slammed traffic at it and watched what happened. One of the things we found is that we were not getting the throughput we expected. We were sort of scratching our heads a little bit, saying, well, we can see that it's moving up to this higher profile.
We added some debugfs support so we could see this in real time, and it just wasn't happening as efficiently as we thought it could. Some of that was because I'd previously tested two systems back to back, and now we were doing this control group. The performance wasn't worse, it just wasn't as good as I thought it could be, and what I realized is that a sending system, despite not having any transmit interrupt moderation features enabled, basically sees ACKs classified as low-latency traffic. ACKs are small, they're coming all the time, and the speed at which you receive an ACK definitely determines how quickly you're going to send out traffic again. So we actually did some tuning: we emulated what we thought the algorithm would have done and, on the sender of bulk traffic, moved it to a low-latency profile, and we actually saw improvements in CPU utilization. That was kind of fun. And I think this is one of the examples we can point to where, had I, or the other folks working on this, spent a lot of time thinking about it ahead of time, we probably would have come to this conclusion. Maybe, maybe not; you never know, we might give ourselves too much credit. But the difference was that just trying this enabled something newer, and maybe more fun, than we thought, and it was an improvement. So for me this is something I'm going to continue to think of as a big win for AI: showing us something we didn't think we could do before.

The other big takeaway for me is that the kernel has a ton of configuration knobs, a ton, and many of the folks that worked on them are no longer working on the kernel; they're doing the next most interesting thing they can find, or they're too busy working on the next version of hardware or whatever. I think there's a lot of low-hanging fruit out there for us to really examine different kernel config options. Take, for example, the discussion I had about the tuned profiles that exist: why do any of those need to exist? What would it take for us to figure out automatically what the ideal settings are for highest performance, or for low battery life? Or take data-plane technologies that are of interest to me right now, whether they be BPF and XDP or even DPDK: why do we have to guess at the proper number of packets to batch? Any time we're doing reception we can improve packet performance by batching, so let's figure out how many that is automatically. Let's not spend four days with a person recompiling and testing over and over again to try to figure it out. So that's my encouragement for all eleven of you that are here: go forward and think about whether the areas you work in can be tuned automatically.

I also want to leave a little time for questions, but I want to make sure to give a shout-out to Tal Gilboa and his colleagues at Mellanox, who came up with the initial implementation and design and pushed it into their driver, to Rob Rice, Lee Reed and Michael Chan from Broadcom, and of course to the copyright holders of the images used in the presentation.
So that's all I've got. Questions? Please say no... Given more time, I would have loved to have auto-tuned the entire presentation. Cool, well, thank you. Oh, outside the lab? Yes, absolutely. So the question was: outside of a lab environment, what sort of other applications have been tested? For me, that's all I have done, because a lot of this was motivated by a specific requirement from a potential customer. They had some workloads that they weren't able to emulate with netperf very effectively, so they came up with this recipe and said: okay, if you can do this, and this, and this, and this, without touching anything, you have a chance at winning. A lot of what I do in my job now is figuring out what it takes to do that. So we looked around at different things, and they came to us with the TCP_RR test with netperf, some of the TCP stream tests, and some other specific things.

Aside from just a system-to-system test, the other thing this has been tested on pretty heavily, out of our own interest for another reason, was actually sinking the traffic into a VM: a regular host pounding a VM with traffic, with both the TCP_RR and the TCP stream tests, where the VM was the sink for the traffic. And that's one of the interesting points to me: that's where this really shines, whether you're using Broadcom hardware or Mellanox hardware, because the VM (which might be running virtio) may have zero control over what's happening. So now you've got a way where it doesn't matter what workload is being run on those VMs; you've given them a chance to both be successful, because typically those are separate IP addresses, separate streams, and they're going to hash to separate CPUs, unless you have really bad luck, and then they're going to be received at different rates. That, I think, is a big strength. So I look forward to when these upstream kernel changes roll down into the main distros and are used in virtualized environments like that, whether it's OpenStack or other places. I think that's going to be key.

Yeah, so it landed in January in Dave Miller's tree, so that probably means 4.16. So yes, it's freely available in everything shipping past that point.

Did we ever... oh yeah, absolutely. So, yes, I can definitely say... okay, I shouldn't say definitely. I haven't been asked to. Broadcom maintains our version and collaborates on an ESX driver, and I haven't been asked by anybody that maintains it. The question was related to virtualization environments, ESX or KVM and so on. I can't say for sure; no one on the ESX driver team has asked me anything about this, which is typically a sign that it hasn't been implemented. Not always, but no one's asked. In a KVM environment, if you're running a new enough kernel, this is available. So if the base kernel on your hypervisor is... I'm just going to go ahead and make a blanket statement and say 4.17, although I think really 4.16 or 4.15 is probably right. Like I said, it landed in Dave Miller's tree earlier this year, and his tree is a development tree, so it's always one version ahead.
So if I do something like a git describe, I always have to add one to whatever is there, because he keeps Linus' tags. I probably should have done that homework. But yeah, if you go out and run Fedora with this right now, or even, I guess, 18.04, it's probably got a new enough kernel that it's going to be there.

It keeps one: it knows the last state, and that's it. Yeah, very low overhead, and that's one of the things we really liked about it. It was kind of crazy how simple it was, how low overhead it was, and how small an impact it had. The impactful part is actually the small delay hit you take from writing to the hardware if you have to make a change, and even that is less than you'd think. The computation itself we never really had to worry about tuning; you're talking about one or two instructions that, with a good compiler, are probably going to slide right in with some other delay you already have in the network stack. The cost is really about how frequently we write to hardware, and I can tune that and watch it change. If I write to hardware every hundred packets, throughput and latency suffer heavily, because you're spending so much time writing out. Obviously if you do it every million packets it's less useful, especially since most flows aren't that long. But yeah, it's very lightweight; I was shocked at how well it worked. And that's the thing too: it doesn't have to be complicated. This layer is created in such a way that if you wanted to do a much more complicated stateful inspection, keep track of things, preemptively decide based on something that's coming in that you should go one way or the other, you could do it. But this is such a great, easy way to start, with minimal hit. Anything else?

Thank you, Andy. We have a coffee break from now until 11:20, and we will be resuming the session then.

Okay folks, for the first session after our break we have Sherard Griffin, senior principal engineer at Red Hat's AI Center of Excellence, who will be talking to us about building AI with Ceph and OpenShift. Thank you.
Thanks, everyone. So, Saturday morning: thanks for not watching cartoons or Netflix or anything, and coming out here to see us speak today. I'm going to talk to you about what we've been doing in the AI center of excellence at Red Hat. You've seen a number of us talk before about our strategies with AI and machine learning; you've seen Vashek and Steven talk about Jupyter Hub, Ceph and Spark as well. I'm going to expand on that and introduce OpenWhisk, where we have serverless actions, so you'll see a little bit about how we chain all of those technologies together to get a nice machine learning pipeline.

First off, to do some level setting and ground rules: everyone is familiar with the concept of machine learning and AI, so this is just to show an example of a typical workflow. It starts off, of course, as Daniel mentioned before, with the data; in this case we have a library of data sets. From there we want to start to develop a model, and that's where a lot of the brains and the intelligence come in: analyzing the data and understanding what you want to do, trial and error. From developing the model you then go into the exercise of tuning the model and training it, using the data that you have. Once you do that, the big excitement is after you've trained it and developed your model: you get to deploy it and actually see some data coming in. So for today's session I'll walk through an example of all of those steps you would go through in a typical process, so you can see how we can use tools in an OpenShift world to achieve that workflow.

The first part of this, as we mentioned before, is the OpenShift framework: as we start out, how do we do this in a containerized world? In this case we're using OpenShift. OpenShift, as you all know by now, is a container platform; it uses certified, enterprise Kubernetes, and it also lets you do more hybrid-cloud types of things. You may have some of your infrastructure in Google, AWS, or Azure; you may also have it on-prem. OpenShift gives you the ability to seamlessly manage all of that in one ecosystem. Once we have our containerized world, we need a place to store the data; for this example we'll be using Ceph.
You can also use other technologies. The reasoning behind Ceph is really the growing need to separate your data from where your compute actually happens. That's one of the architectures that came from Amazon; a lot of you may be familiar with Amazon's infrastructure, where you had a lot of the EMR scenarios, Elastic MapReduce and a Hadoop ecosystem. That environment lets you spin up a Hadoop cluster, process your data really quickly, and then tear down the cluster and not have to worry about maintaining it. Obviously, if you tear down the cluster, you have to make sure the storage is still there, and object stores like S3 and Ceph came about. We use Ceph because it has S3 capabilities and we're able to leverage a lot of the technologies that are already built on top of S3. It also has a RESTful gateway, which is nice to integrate with, and it's a distributed system.

Some of the other software that I'll be using in this demonstration is Spark; you've heard a lot about Spark in the AI talks and the container talks as well. Spark is a great engine for processing data: it allows you to do batch and streaming, but it also runs on Kubernetes, and in this case we are using the radanalytics.io work that moves Spark into a Kubernetes framework, with their Oshinko and radanalytics Spark engines. Jupyter Hub, which we've seen several examples of, gives us a multi-user console to manage Jupyter notebooks: users can have many different notebooks, you can have many different users, and that lets you do your data science work. It's designed for data science and research, a great tool for that, and it'll be running on Kubernetes and OpenShift as well. And then the last little piece here, which may be a little new to you: once you actually do the data-model work, you want somewhere to deploy it. In this case we'll be using OpenWhisk. OpenWhisk is a serverless action framework; if you've ever used Amazon's AWS Lambda, it's a similar concept, where it allows you to focus more on the delivery of the real code and not worry about the architecture. So I'll show a quick example of deploying OpenWhisk, and then taking that model and the code that uses it and deploying that to OpenWhisk as well, so that you have a nice REST API wrapper around execution of that model.

To start off, one of the first things you want to do is make sure you're collecting your data. This is just a little diagram of a typical workflow you may have to ingest your data. Internally at Red Hat we have a lot of different systems that send us data. These are some examples: we retrieve data from git, we have build logs and CI logs, all that information coming to us, but we also have a warehouse that houses IT data, customer data, customer feedback, and support tickets. We leverage all of that, send that data into Ceph, and then use machine learning on top of it to process it. We have a bunch of different mechanisms for getting that data into Ceph; some of it is through Jenkins.
Some of it is through other types of workflow managers, and it just seamlessly moves that data into the object store; I'll show some examples of that as well. To start out with Ceph, the first thing you'll do is configure it, from an object storage perspective, to be ready to ingest your data. I'm not going to go through the exercise of installing Ceph, I think some of you have seen that before, but I do have a link on the slides if you want more information about installing it. I'll just step into the part about setting up a user, and then, once you have your user set up, getting your data loaded into the system. So once you have Ceph object storage installed, you have to make sure the object gateway is installed, and then you want to set up one of your users with S3 access. In this case it's just a quick command that creates your user in the Ceph S3 environment, and what you get back is an access key and a secret key, just as if you were working in any other S3 environment, AWS or anything else. That access key and secret key are the important parts that let you actually write and read data in the Ceph environment.

So from there, what I'm going to show really quickly... let's do a clear here, and hopefully you'll be able to see this. Okay, I'll try it. Yeah, it looks pretty good. All right. So in here... what's that? Oh, whoa, that's weird, it seriously cleared it out; it's doing all kinds of funky stuff here. All right, can you see it? Okay, good. What I'm going to show really quickly is a bucket I have, sorry about that, in the Massachusetts Open Cloud; we just called it opendatahub. Since I'm using Ceph instead of AWS, what you'll see here is that I'm actually using the AWS command line interface, and you would go through the typical steps of setting up and configuring the AWS CLI. Actually, I'll show you a step real quick, and Steven touched on this before: you would typically do an aws configure, where you take that secret key and that access key I showed you in the previous step and populate that information. I'm not going to show my secret keys here, but you get the idea. Yeah, I know, it has awesome data in it though, trust me.

Once I have that all set up, I do an aws s3 ls, and you'll see that my training subdirectory actually has no data in it. So what I want to do really quickly is upload some training data that I have here, and I'm going to run this command. It's going to upload a tab-separated-value table called training data; this is some data that we're using for sentiment analysis, and you'll see that as I work through the example. However, there are all kinds of formats of data; I'm just using tab-separated values here, but you could use Snappy compression with Parquet, it can be JSON, it can be CSV files, anything typically supported by the Hadoop ecosystem. Now that I've uploaded that, you'll see I actually have my training data there. Great, awesome, woo, yeah, uploaded data. Now what? Well, here's where you start to analyze that data, and, similar to what Vashek and Steven have shown before, you then start to use Jupyter Hub. Jupyter Hub is cool.
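For reference, the same upload and listing can also be done from Python with boto3 pointed at the Ceph RADOS Gateway. This is a minimal sketch; the endpoint URL, credentials, and object keys below are placeholders rather than the exact values from the demo:

```python
import boto3

# Point boto3 at the Ceph RADOS Gateway instead of AWS.
# Endpoint, credentials, and bucket name are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# Upload the tab-separated training data into the bucket.
s3.upload_file("training_data.tsv", "opendatahub", "training/training_data.tsv")

# List what is under the training/ prefix to confirm the upload.
resp = s3.list_objects_v2(Bucket="opendatahub", Prefix="training/")
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])
```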
It allows us to integrate with Ceph, query the data, and use tools like Spark, TensorFlow, scikit-learn, and all the other frameworks to process the data. For this example I'm going to stick with Spark, and you'll see I'm accessing the data in Ceph by using the s3n libraries and jars to get access to it. So, a very simple concept. What I'm going to do is show you really quickly what we've done as part of the work for the MOC deployment.

Here I'm logged into an OpenShift instance, and this wouldn't necessarily be a data science role; it'd be more of the DevOps, systems-engineer type of role. We've created an APB that lets us go through the service catalog and find Jupyter Hub down here. So what I'm going to do is create a quick project, and I'm just going to name it jupyterhub; you'll see that project there. Then I'm going to click on this service broker, select my project, and it will start to deploy Jupyter Hub. There are a couple of options here: the database memory, the Jupyter Hub memory, the notebook memory. I'm going to increase the notebook memory to 2 GB and then start to create this. Cool, now it's actually starting to deploy my Jupyter Hub. You'll see it's pending, and then you'll start to see some really interesting things happen once the pods start kicking off here in a second. So we have... let's see... there we go.

The pod will start to initialize. With Jupyter Hub, if you saw Vashek's example earlier today, you'll see that there were a number of different images you could have: we have a TensorFlow image, a scikit-learn image, a Spark image. This is actually building all of those images behind the scenes, preparing Jupyter Hub for you, and also creating a Spark operator; once we start a notebook that has Spark in it, each individual user will have their own Spark cluster that spins up behind the scenes. As that's running, I'm going to shoot over to a different instance of OpenShift that I have, since this one is still starting up; I want to show you exactly what it looks like when you get into Jupyter Hub, and I'll come back here in a second. Once Jupyter Hub starts up, it gives you the option to select whichever image; I'm going to skip that step because, for the sake of time, I've already selected a Spark image. In this Spark image I've created a notebook called sentiment analysis training. I would love to take the credit for all the intelligence in here, but it was actually a data scientist on our team who helped us create this. What this code is doing, I have no idea.
All I know is the end, where it actually trains a model. So I'm going to run this really quickly and just step through the code to show you a little of the magic behind the scenes. The first thing it's doing: you do have the ability with Jupyter Hub to add a few more libraries that may not be on the image you're using. In this case we do have Spark and scikit-learn installed, but we do not have TensorFlow and Keras and some of the others, so what I've done here is go through and install those on the system. Then it goes down a little farther and starts to import some of those libraries.

Here's where a lot of the magic happens with Ceph. You'll see I'm instantiating a PySpark instance, and it's leveraging the Spark that's already attached to my Jupyter Hub. If I go to my Jupyter Hub instance, you will see that there is a Spark cluster for shgriffi; that's me. So I've got my Spark cluster and I'm submitting my job to it. You'll also see some other people, shuels, that's Steven Huels, and a couple of other people in the system here. So what it's done here, and sorry, I said s3n before, but it's actually s3a, the newer one: I've used s3a to connect to Ceph. In this case I'm reading a CSV file, and it's doing some basic printing and validation, showing what the CSV file looks like. Farther down I'm using a little bit of pandas, and it shows you what some of the data looks like; that data is stored in Ceph, but now I'm reading it and you see some of its contents. Then it does all the data-science-y stuff and shows some graphs. I like graphs, so I said, oh, that looks cool; I don't know what it tells me, but it looks cool. And it continues down; we've got some more graphs here.

The interesting thing is that when we're doing the sentiment analysis training, we have to understand the accuracy: is this a good model that we're working towards, or are we having trouble with the model and do we need to refine it? In the example I'm running right here we don't have much data in the system, so I won't get too much into it, but this shows you some of the cool things we can do with the data. We're showing a little word cloud where we're detecting what's being talked about in the training data. As it goes a little farther down, you'll see it starts to build out the models, and that happens somewhere around here, where it's building and then running the model. This is where, going back to that diagram, you would do some tuning and manipulation to make sure the model is accurate. And at the bottom here, once it's done, it's almost there, it's at step 83, it's finished, you'll see the accuracy printed out: an accuracy of 69. Maybe that's good, maybe that's not; I would say it's probably not good, but I'm not a data scientist. That tells you that as you get more data in there and train the model, you can do some cool things. So now I've got a model and I need to do something with it. What's the next step? Well, in this environment, since we already have Ceph, we're going to store the model in Ceph as well, and that gives us access to the model from any number of places.
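Stepping back to that s3a connection for a moment, here's a minimal sketch of what configuring PySpark to read the training CSV from a Ceph RGW endpoint can look like. The endpoint, credentials, and paths are placeholders, not the exact cells from the notebook:

```python
from pyspark.sql import SparkSession

# Placeholder endpoint and credentials for the Ceph RADOS Gateway.
spark = (
    SparkSession.builder.appName("sentiment-training")
    .config("spark.hadoop.fs.s3a.endpoint", "https://rgw.example.com")
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the tab-separated training data straight out of the object store.
df = (
    spark.read.option("sep", "\t")
    .option("header", "true")
    .csv("s3a://opendatahub/training/training_data.tsv")
)
df.show(5)  # quick sanity check of the contents
```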
To consume it, we can use OpenWhisk, basic Python code, Java code, whatever we want; it's agnostic at that point. So, storing the data and the model in Ceph: what we're doing here is pickling the tokenizers and the models and storing them there. And the outcome of this, if I go back to my Amazon CLI and do an ls on the model folder... oh yeah, sorry about that, guys, let's just shrink this and see if you can see a little better. Does that work better? Is that good? Okay, awesome. All right, now my screen's all jacked up, let's see here. You still see that? Okay, cool. So what I'm going to do here is a list, and you'll see I've got a number of folders: data sets, metrics, models, and training. Now I'm going to look at my models folder, and I see that I have some sentiment data there. Then I'm going to go to the sentiment folder, sorry, I left out a slash, and now you see, okay, great: I have my model, I have my dimensions, I have my tokenizer.

Okay, so what do we do with that? We go back to here. Now we want to talk about how we deploy the model and make it useful. In this use case I'm using OpenWhisk, which is a serverless action framework, as I mentioned earlier, but again, you can use any number of things: if you wanted to use Argo or NiFi or any of those technologies, as long as you can get access to the data, that's great. Sometimes you may even want to cache the model, maybe put it into some kind of caching layer, and you can do that as well. So the first thing I'm going to do with OpenWhisk is show you a quick and easy way to deploy it. I'm going to go back to my OpenShift here, and I will create a new project called shgriffy-openwhisk; you can see that project has been created right there. Then I'll go back to the command line, copy this command really quickly, and explain what it's going to do: I'm just going to use the OpenShift command line interface to deploy this. Recently the OpenWhisk-on-OpenShift project has moved into the incubator at Apache, so if you search for OpenWhisk with OpenShift on Apache, it'll come up. What I'm doing right here is taking that master template and deploying OpenWhisk on top of OpenShift. Let's see... what you will see here is that it instantly started creating a lot of different deployments: you have nginx, Strimzi (if you're not familiar with Strimzi, that's the Kubernetes Kafka that's also been worked on by a lot of the Red Hat folks), CouchDB and a number of other things. In the background it starts to spin up a bunch of different pods, but for the sake of time I won't wait for this to finish; I have another instance where this is already up and running. Once I have the OpenWhisk environment all set up, you then use the OpenWhisk command line interface to see exactly how to deploy to the system.
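Circling back to the model-persistence step for a second, here's a minimal sketch of pickling a trained tokenizer and model and pushing them into the same Ceph bucket with boto3. The bucket, object names, and the tokenizer/model variables are placeholders standing in for whatever the notebook actually produced:

```python
import pickle
import boto3

# Placeholder client configuration for the Ceph RADOS Gateway, as before.
s3 = boto3.client(
    "s3",
    endpoint_url="https://rgw.example.com",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

# `tokenizer` and `model` stand in for the objects built by the training notebook.
for name, obj in [("tokenizer.pkl", tokenizer), ("model.pkl", model)]:
    s3.put_object(
        Bucket="opendatahub",
        Key=f"models/sentiment/{name}",
        Body=pickle.dumps(obj),  # serialize and store alongside the data sets
    )
```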
So what I have here is I have some code That is actually going to consume that model and actually run a Sentiment analysis on top of that using the model Up here, you'll see some code where it says analyze sentiment This is actually where it will load the model and Start to feed in the text that I pass it and then I have some additional information here where it's just taking some command line arguments nothing fancy just a quick example to show you In order to deploy the open whisk The open whisk Python code that I just showed you you just run a couple of commands here. So I'm actually going to Create an action and I'll kind of step through exactly what this is doing In open whisk everything is just an action. So I showed you the Python code very simple code I don't want to have to worry about spinning up my own engine X my own web interface all of the REST API that has to go along with it So I can take that Python code and you'll see here what I'm doing is I have the name of the action called sentiment and Service and then I have main dot Python and I actually already have a docker image That has some of the model libraries that has TensorFlow has Keras already installed on that docker image So I'm going to do a quick deployment of that Boom, there you go. It's already deployed. So Once I want to actually test it because open whisk comes with a REST API I am going to use postman. Hopefully you guys can see that. Okay, man. That's a really big postman So I have two different postman commands here and All I'm doing in postman is a post With a to a REST endpoint in this REST endpoint. You see I'm now pointing to the sentiment service and I'm giving it some credentials. These are my open whisk credentials and the body of it is just some text that I want to Pass in that I want to send them an analysis on I'll send that along the way And down here I have an activation ID now if you saw you may not have seen it, but really quickly It's starting to build up if I go over here I'm going to go up to the open whisk project It spins up a pod automatically and there's my sentiment service running in open whisk and it's actually doing some Analysis right here. It's loading up some mod loading up the models You see it loaded the model and it's actually Using TensorFlow as well to start to process the data right as That's running. I can then pull and see well what's going on with the action So I'm going to take that activation ID. It's basically like an execution ID and replace that with Just the one that I copied here And I'm going to do a get when I do a get to open whisk to see what's going on with that service It'll all it says it's not exist until it's actually done here So I'll just keep sending it until it completes There we got a 200 there and now what you'll see is the result of that action There's a little bit of metadata that open whisk provides But then at the end of the day you see here my results sentiment equals positive Now the way that we've used this in red hat is we have a sentiment analysis service That sits out there in open whisk and as people are submitting their request They're actually calling to our rest API in open whisk to get results back We do a lot more than just this sentiment analysis that I've shown here We do entity detection as well so that we understand for the given text What are people talking about is there a good feeling about red hat great feeling about open shift? What do people think about devconf and it gives you a great way to analyze the data? 
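The same POST-then-poll flow shown in Postman can be scripted. This is a minimal sketch against the stock OpenWhisk REST endpoints; the API host, namespace, credentials, and action name here are placeholders:

```python
import time
import requests

# Placeholder API host, namespace, and credentials for the OpenWhisk deployment.
API = "https://openwhisk.example.com/api/v1/namespaces/_"
AUTH = ("AUTH_USER", "AUTH_KEY")  # the two halves of the OpenWhisk auth key

# Fire the action with the text to score; a non-blocking call returns an activation ID.
resp = requests.post(
    f"{API}/actions/sentiment-service",
    auth=AUTH,
    json={"text": "The Open Data Hub talks at DevConf were great"},
    verify=False,  # demo clusters often use self-signed certificates
)
activation_id = resp.json()["activationId"]

# Poll the activation until the result is available (it 404s until the run finishes).
while True:
    result = requests.get(f"{API}/activations/{activation_id}", auth=AUTH, verify=False)
    if result.status_code == 200:
        print(result.json()["response"]["result"])  # e.g. {"sentiment": "positive"}
        break
    time.sleep(1)
```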
But to bring it all back home: now we have an entire process where we've uploaded the data, built the model, and deployed it. Going back to the original slide, it gives you the pipeline. What we've done here: for storing the data and the models we're using Ceph; to develop, tune and train the models we're using a combination of Spark and Jupyter Hub; and to deploy the model into the rest of the environment and actually make it usable, it's OpenWhisk. Obviously, for this use case there are a number of different users you might have in the system: your engineers and data engineers may be more focused on the Ceph side, your data scientists will be on the Spark and Jupyter Hub side, and the DevOps folks and data engineers will also be involved on the OpenWhisk side. So you'll have a number of different people involved, but that's teamwork, right?

So that's all I have. If you want more information, we are starting to publish a lot of what we're doing inside the data hub to the Open Data Hub project, and just to let you know, a lot of what I walked through here is part of the Open Data Hub project. Stephen mentioned it before with the MOC, and we've seen a lot of it over the past couple of days, so you'll see more and more information and more code being submitted into that repo; of course, if anyone wants to contribute, feel free to join us. We also have a lot of information about Spark on Kubernetes and OpenShift; you can see that on the radanalytics.io page. You can also take a look at the now-incubating OpenWhisk on OpenShift project and contribute that way. At the end of the day, you can always just contact me: shgriffy at redhat.com. Any questions?

That was awesome; a live demo on the Mass Open Cloud is the coolest thing ever. Of course, I did that for you. So yeah, we're running the data hub on the Mass Open Cloud. Sherard and I know, and many of you probably don't, that the infrastructure we're using right now on the Mass Open Cloud isn't necessarily ideal for machine learning, training models, that kind of thing. If you could have whatever you wanted up there, within reason, what kind of architecture would you like to see? Is it storage? Is it vector processing? Where do we need to go now? Yeah, I think that's an interesting one, and the reason I'm so excited about the MOC is that I actually would like to see the users tell us that. We'll certainly be doing some performance evaluation on the workloads that are happening on the MOC. I have a feeling we're going to explore more about Ceph and the storage on Ceph.
There is going to be some knowledge sharing that has to happen, where we have to understand the best way to store this data. I mentioned before that you could have a tab-separated-value table or a CSV file, but there are other technologies that let you process the data better, more columnar-formatted data like Snappy compression with Parquet, things like that. From the physical hardware perspective, we certainly want to move more into GPU enablement and FPGA-type enablement of the system, so it would be great to start leveraging some of those technologies as well. That will let everyone work faster, and I think the nice blend will be having the right storage for your data but also the right compute horsepower to make it happen. Any other questions?

My question is mostly around data acquisition. In your example we are using training data, so what kind of data normalization or cleansing components can be used in the OpenShift environment? Good question. The simplest answer: Spark is great for ETL. We've used it internally at Red Hat for a number of our products, and I think in this use case it's great as well; the nice thing being, again, that you've separated the storage from the compute, so you can have Spark do your cleansing and your manipulation of the data, and then what you want on top of that is more of a workflow manager to manage those Spark jobs as they flow through the system. Some of the other technologies we're looking into include the ability to do Hive on Spark, so you can take a Hive job in Kubernetes but use Spark as the framework to manipulate the data; a lot of people are just more comfortable using Hive as opposed to Spark. And there are some other technologies we're looking at, but it gives you a good baseline to go off of.

So you mentioned OpenWhisk on OpenShift as kind of an alpha project. What's missing there? What major problems am I going to run into if I try to use it for something I depend on? The major issue I've seen, honestly, is how fast it's being developed. I'm actually excited that it's moved to the incubator, because before then, if I deployed it today versus next week, things might break and there might be some inconsistencies. I think the incubator will give us a better chance to version things off. It has a lot of support for multiple languages: Java, Python, Node.js, some of the others. The missing thing may be just putting it through the wringer of a real use case and seeing how it scales, and we're also working through the exercise of hooking Prometheus up to a lot of these pieces to make sure we can monitor the system. So I think it's finally reaching a more stable state; the development has been going on for a year now, so it's starting to level off and you get a little more stability. Okay, thank you. Sure. Amazing, thank you.

Thank you. Next up we have Sage Weil, principal architect at Red Hat, and Yaarit Hatuka, a software developer working with Ceph, and they will be talking about adding smart disk failure prediction to Ceph. Take it away. All right, hello everyone, my name is Sage Weil. Hey, I'm Yaarit. Okay, yep, so today we're going to talk about adding SMART failure prediction to Ceph. We're going to start with a little bit of background: what Ceph is, and why we're adding SMART failure prediction.
We'll talk about the journey we took to arrive at the architecture that we did, the current status of the project and some next steps, and then we'll talk a little bit about Outreachy, the internship program that led to this whole project. Maybe one of the most exciting things about this project was how it brought communities together: the Outreachy internship, of course the Ceph open source community, and we gained a new industry participant in this project, ProphetStor, who are really excited about it. Yeah, and this is how open source does its magic, which is not trivial. Is this better this way?

So probably all of you here know what Ceph is, but just a quick recap: Ceph is an object, block and file storage system, all in a single cluster. It is designed so that all components scale horizontally with no single point of failure. It is software defined, so it can run on commodity hardware, and it is self-managing wherever possible, which is really relevant for this project, because we wanted to make it even more self-healing. And maybe the most important part is that it is completely free and open source: you can use it, you can see all of the code, which is awesome. Thanks.

So the concept was to teach Ceph how to collect health metrics from its devices and then pass them to a pre-trained model that can predict whether a device is about to fail, or when it's about to fail. It was important for us to keep the design modular, so we can either use an in-the-box model or send all the collected data out to a service in the cloud, have it make the prediction, and then get the prediction back to the cluster. Then, with this estimate of whether or when a device is about to fail, we want to teach Ceph how to respond to an imminent failure before it actually happens, which makes the data even more safe.

That brings us to reliability. It's a hard fact, but devices eventually fail. Has anybody here ever experienced data loss? Terrible losses of photos, documents; they still hurt on a personal level. Yeah, it's not fun to lose data, and for businesses it can be devastating. So we all use redundancy to avoid data loss: we replicate the data using RAID or erasure coding. It's playing the numbers game, deciding how much we can invest in it, because replicating data with lots of replicas can be really expensive. Whenever a device fails, the statistics of the redundancy change: the redundancy gets worse while the data is re-replicated, and a window opens with an elevated risk of data loss. Larger systems mean more devices, which eventually means more failures, and this is a sentence often used about Ceph itself: failure becomes the norm rather than the exception. Again, a hard truth, but a true fact. So if we can predict when a failure is about to happen, we can act ahead of time and preemptively recover that device, which makes the cluster much more reliable. Yeah. And that brings us to the other part, performance. It is natural that cluster usage has a certain pattern of peak and off-peak periods during the day. When we have a failure, we have to respond with a recovery, and the priority is very high.
So it might happen at a peak hour, which is not ideal whatsoever. If we recover preemptively, we can schedule it for an off-peak time, which is fantastic. Yeah, and with Ceph in particular, recovery can have a significant impact on performance; Sage, do you want to say a couple of words about that? In any system you're going to have an extra cost that you pay in order to do the recovery. It's a problem we've struggled with in Ceph, making that as small as possible, but it's still something you ideally want to do in off-hours if you can.

So when we started this project, we were sort of a blank slate. Our goal was just to make everything work in Ceph out of the box, so that when you install Ceph as a standard user, it would gather the metrics to do the prediction and do everything without having to install extra tools or external dependencies. We would build in the data collection, we would build a simple prediction model; the expectation was that we would start with something really simple, it would be open source, and we would hope that the community would develop something better and it would improve over time. But it was sort of a monolithic approach to the problem.

About halfway through the project, a company came along called ProphetStor that specializes in AI-enabled data center operations, and in particular disk failure prediction. They had a SaaS-based product that collects metrics from customers and runs a very sophisticated prediction model, with something like a claimed 97% accuracy, plus a very sophisticated dashboard, and they were very interested in integrating with Ceph so that Ceph storage clusters could take advantage of their service and reap the rewards. They wanted to work with the community to figure out how to integrate this with us. Their goal, basically, was to provide a free service to any Ceph user, so that they could share their data with ProphetStor and get predictions back, or, if they became a paying customer, for example because they had a large cluster and wanted more accurate predictions, they could use that as a fee-for-service as well. And they wanted to have both an on-premise option that would work in your data center and the ability to use their SaaS service. As this conversation with ProphetStor evolved, we realized that two completely different models and ways of approaching it were colliding, and we needed to figure out how to approach the problem. What we eventually ended up with was a modular approach that allows you to swap out different components of the overall pipeline, to accommodate both models: self-contained, and using a commercial service. It basically breaks down into three pieces: you have to collect the metrics on the devices, you have to run some prediction on them, and then, once you know what the life expectancy of the devices is, you want Ceph to automatically respond and take some mitigation action. And so we built all these components.
We want to build all these components into Ceph so that the whole thing works out of the box: it can do the collection, it can do the prediction, and it can do the mitigation. Or, if you want to use external tools because you're already scraping device metrics with some other infrastructure, or if you want to use an external service like ProphetStor, you can do that as well. By breaking it down into these three phases, we allow you to swap in different components and use it in whatever way makes sense.

So the first part is gathering device metrics. What is SMART? SMART stands for Self-Monitoring, Analysis and Reporting Technology. It started as an attempt to give access to a device's health parameters, and it supplies a very simple prediction of whether the device's health is okay or not. It defines several attributes, for example the device's temperature, the number of hours the device has been powered on, the number of bad sectors, and so on, and each manufacturer defines their own thresholds for these attributes: if an attribute crosses a certain threshold, the device is considered to be failing. This is a nice intention, but in practice it doesn't work: many devices fail without crossing those thresholds, or a device can be expected to fail without the simple SMART prediction showing it. So we decided to collect the health metrics ourselves and analyze them.

That came with a few challenges. For example, the interfaces: SATA, SAS and NVMe present different health metrics. And of course there was the vendor issue again: vendors have their own attributes and don't have to include all of the SMART attributes, so a Samsung device will not necessarily have exactly the same attributes as an IBM device or whatever, and that adds overhead to normalizing the data. They also use different scales: one vendor will decide on a scale from zero to a hundred and another on a scale from zero to 255, so it's not really standardized. We used smartmontools, specifically smartctl, to collect the data, but if you've had the chance to work with it, you know the output is not ideal for machines. You can see how it looks. First of all, it's really important to say that the smartmontools community is awesome: they are super robust, they support all the devices out there, they react quickly, they're doing a really great job. However, the output is aimed at humans, so there are all sorts of pretty tables in here, but if you want to take this data and feed it to some sort of model, it's pretty challenging. There are various wrappers out there that will take this data, process it, and turn it into some sort of JSON, but they don't always handle all the cases; what happens when there's a new device out there? It's not ideal. So we decided to do the right thing, which was of course Sage's idea, and contact smartmontools about a built-in JSON output format. We prepared a patch; that was part of Outreachy's application process, which is amazing because it just throws you in the water and says, okay, get your hands dirty and make a patch for that specific project. So we contacted the maintainer of smartmontools, Christian Franke, who does a great job of responding and helping us.
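To give a rough sense of why machine-readable output matters here, this is a minimal sketch of consuming smartctl's JSON output from Python, assuming a smartmontools build that already ships the --json flag; the fields accessed are typical examples, and exact keys vary by device and interface:

```python
import json
import subprocess

# Ask smartctl for all SMART data as JSON (requires a smartmontools build with --json).
out = subprocess.run(
    ["smartctl", "--json", "-a", "/dev/sda"],
    capture_output=True, text=True, check=False,
)
data = json.loads(out.stdout)

# Pick out a few typical fields; key names differ between ATA and NVMe devices.
print(data.get("model_name"), data.get("serial_number"))
print("power-on hours:", data.get("power_on_time", {}).get("hours"))
for attr in data.get("ata_smart_attributes", {}).get("table", []):
    print(attr["id"], attr["name"], attr["raw"]["value"])
```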
We submitted a patch. As you can guess, this was a long-standing feature request for smartmontools, smartctl specifically, and the good news is that it's expected to be released by the end of the year, which is just in time for the upcoming Ceph Nautilus release in February 2019.

The second piece of this was that, the way Ceph had been designed and built up until now, it can run on any hardware, but it really didn't deal with the details of the underlying devices. We would pay attention to where all the OSDs are and what hosts there are, but Ceph didn't have the internal tracking to map that to physical devices. So we had to add a bunch of infrastructure into Ceph to maintain that metadata, and a place to store it. The first challenge was figuring out how to actually identify a physical device, and it turns out that vendor, model, serial is a somewhat standard way to do that; it's what the udev and blkid libraries use, so we adopted it. We added this tracking into the Ceph cluster manager so that all the daemons report which devices they're mapped to; we have a many-to-many mapping between devices and daemons, so we know which devices depend on which, and so on. We also added the ability to store a life expectancy property with those devices, so that once we had a prediction, we could tell the cluster what the life expectancy was, and it could respond accordingly. Then we adapted the initial prototypes from the beginning of the project to add a new module to Ceph that would, first of all, implement a command on the object storage daemons, which are actually storing the data, to scrape the SMART metrics with smartctl and pass them back to the central cluster. There's a background operation that scrapes those on a daily basis and stores them in a RADOS pool, so that we have a recent history of all these metrics for all the devices, and we have a self-contained metrics scraping and collection framework.

The question we frequently got on this project was: why don't we rely on other tools? There are Prometheus plugins, for example, that scrape metrics, and all sorts of external tools to do this. The balance that we're trying to strike is making sure that every Ceph user can have something that works out of the box, without having to install extra separate stuff, while, because we adopted this modular approach, still leaving the door open: if you already have an external scraping framework, you can use that and have some other infrastructure do the scraping and prediction, and it can still feed back into Ceph to tell it what the life expectancy of those devices is. Then the same Ceph automated management logic can say, oh, I know this device is going to fail, and do the right thing as a result.

Which brings us to the second phase of the architecture, the failure prediction. Today we have two approaches to this problem. ProphetStor contributed a pre-trained prediction model to the upstream open source project; it's a bunch of scikit-learn model files or something.
I actually don't understand the data science at all. It's a comparatively simple model, but it works, and it runs inside the Ceph manager daemon, so you can have an out-of-the-box cluster analyzing the metrics and making predictions based on that. There's also the ability to enable a feature where it calls out to an external SaaS API, a service hosted either in your data center or in the cloud: it feeds the metrics to that external service, gets a prediction result back, and stores it in the cluster. So you have both the external cloud model and the built-in one. And again, ProphetStor's goal is to offer a free service that you can use, or you can pay them to get their very accurate predictions. Because we built this around a SaaS API, there's an opportunity for other people to implement that same API, or one similar, so ProphetStor wouldn't be your only choice; you could implement your own external prediction service as well.

The natural question is: how can we build a better model? We have the initial model that ProphetStor donated, which is relatively simple, as I said, but the goal in all of this is to build the most accurate prediction model we can, so we can have the best data reliability we can, and there are two key pieces to this. The first is that we really need disk failure data. A lot of academic papers have been published about disk failures and predicting them, and they tend to rely on private data sets that the researchers got from Yahoo or Google or whoever, from their big data centers; they do their analysis and publish a paper, but the data isn't public. Backblaze is a cloud backup company that is very generous in that they publish all of their failure data, and they have a huge fleet of hard drives, so that's really the only public data set out there. The challenge with both of these is that the breadth of device models is limited to what those particular cloud vendors, or Backblaze, happen to buy. If you look at the enterprise world, where companies like EMC or NetApp deploy their devices, they're of course gathering all the metrics for the devices they buy for their customers, but again, that data set is private, so although those particular vendors might build in failure prediction, there's nothing really for the rest of us. The bottom line is that we need failure data, and more data, to build a better model.

The other interesting thing is that there's an opportunity to use metrics that aren't necessarily from the device to enhance the quality of the prediction. ProphetStor's model, for example, looks not just at the device metrics they get from SMART, but also at things like the server load, the network traffic, how many processes are in the system, all this other stuff they scrape about the cluster and the systems that are actually consuming the devices, and they use that to generate a more accurate prediction of when things are going to fail. So there's a question of which metrics are the important ones, and whether there are other opportunities we haven't thought about. And this led us to the concept that what we really want is an open source, public data set of disk failure data, so that researchers and the open source community can
build a more accurate prediction model. The question is what we can do to help make this happen. The concept we came up with is a SaaS-like service, not unlike what ProphetStor is doing, but one that's run for the community in an open and transparent way, where systems share their device health metrics with this public data set service: they publish their SMART metrics, and in response they get a disk failure prediction. So it's using an accurate, hopefully, failure prediction as a carrot to motivate people to share their data.

People are obviously and naturally skeptical of any situation where you're sharing data about your internal systems, so it's very important to make the system transparent and protect privacy: for example, uniquely identifying devices and hosts with randomly generated IDs, without logging any identifying information like IP addresses, and so forth. There's also a question around whether you would want to share your serial number, which is really the only identifying information within the device metadata. It's a trade-off: if you do have it, you can identify things like bad batches of devices coming off the manufacturing line that might correlate with failures, but people might be more paranoid about letting it be known that they might have that particular batch. Hopefully the goal is to motivate people to share as much information as possible, because they get a more accurate prediction in return. The result would be accumulating this big database, generating this data set, and then sharing it with academic researchers, the open source community, and people who are trying to build better failure models. So you bypass the problem you have currently, where there just is no good data to train these things against.

One of the key challenges with this that we've identified is identifying the failure events, because when you're training a model, you need to know the signal you're actually trying to predict: when did the device fail, after all these health metrics? And part of that is a definitional problem, arriving at what actually counts as a device failure, because that might vary between users. Is it when the device is completely offline and won't even spin up? Is it when you have too many read errors and you finally decide you're not going to use it anymore? Is it when you get a single read error and decide not to use it? Different environments have different thresholds for saying, I'm going to stop using that device, it's no longer acceptable. The other thing is that when a device fails in the real world, in the wild, somebody might be running any system, whether it's Ceph or something else; a device fails, and there's some action they're going to take: their RAID array might use a spare, they might replace it, they might just leave it failed in place. Lots of different software stacks and human interventions might be involved in that, and I think it's unrealistic to require those users to take an additional step of notifying the cloud service they're sharing their data with: oh, by the way, I decided this device failed. Because if it's not automatic, then they're not going to share that information.
So it's hard for this public service to get that signal and know when the device actually failed. The other thing you have to be careful with is that if your failure prediction is working really well, then the devices won't actually fail, right? You'll take them out of service before they crash and burn, and you have to make sure that's not polluting the model as it continues to be refined and trained.

One idea for how to infer failures is to associate the device with the host that contains it. Typically you have a server with multiple hard disks in it, and as long as you have a unique identifier for the server, the service can see that some number of devices are associated with a particular host. Over time you're going to receive a stream of metric updates for all those devices, but after a failure, presumably for just that one device, you're going to stop getting reports. So the idea is to infer that a device failed if you continue receiving reports for all the other devices in the system but that one particular device stopped reporting. And perhaps you only do that if you saw signs that the device was likely to fail before it went away; then you can assume it went away because it actually failed, rather than because it was taken out of service. I think the real question here is a data science question: is there sufficient signal, using that kind of inference, to train an accurate failure prediction model? Probably the more specific task would be to validate that approach by inferring those data points from an existing data set, for example the big Backblaze data set, which actually does have failure events because they took the time to record that a device failed. You would ignore that, try to infer failures using this method, and then see if you can still train an accurate model, to validate whether this would in fact work. So that's a question for a data scientist that hopefully someone will pick up.

That brings us to the third phase. Okay, thanks; yeah, I don't know what happened there. So we had the metrics collection, we had the prediction model, and now we have the response phase. The question is: what do we do when a disk is about to fail? It depends on how much time we have left. If we have enough time, maybe we can just let the Ceph operator know that one of the devices is about to fail, and they can do whatever they decide. But if we don't have enough time, we would like the cluster to automatically try to self-heal; like we mentioned, Ceph is self-managing and can be self-healing as well. So what we do is mark these OSDs out of the cluster and migrate the data to other devices. Yeah, we also divert the workload away from these devices, so we're not causing more harm. We define thresholds for these actions, and they're all configurable. I think the default now is that if a device is predicted to fail less than two weeks from now, we automatically take action, but otherwise, if the life expectancy is greater than two weeks, the Ceph operator can decide what to do.
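Going back to the failure-inference idea from a moment ago, here's a minimal sketch of that heuristic in Python; it's purely illustrative, not the project's actual code, and the report structure and threshold are hypothetical:

```python
from collections import defaultdict

def infer_failed_devices(reports, quiet_days=14):
    """Flag devices that stopped reporting while siblings on the same host kept going.

    `reports` is a hypothetical list of dicts like
    {"host": "host-a", "device": "dev-1", "day": 41}, where `day` is an
    increasing day counter for when a metrics report arrived.
    """
    last_seen = {}                 # (host, device) -> last day a report arrived
    host_last = defaultdict(int)   # host -> most recent day any of its devices reported
    for r in reports:
        key = (r["host"], r["device"])
        last_seen[key] = max(last_seen.get(key, 0), r["day"])
        host_last[r["host"]] = max(host_last[r["host"]], r["day"])

    failed = []
    for (host, device), day in last_seen.items():
        # The host is clearly still alive (other devices report), but this one went quiet.
        if host_last[host] - day >= quiet_days:
            failed.append((host, device))
    return failed

# Example: dev-2 stops reporting on host-a while dev-1 keeps reporting.
sample = [{"host": "host-a", "device": "dev-1", "day": d} for d in range(1, 40)]
sample += [{"host": "host-a", "device": "dev-2", "day": d} for d in range(1, 20)]
print(infer_failed_devices(sample))  # [('host-a', 'dev-2')]
```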
Yeah, and then there's a question: after the device is successfully off-loaded, what do we do with it? Should we drive it to failure, to prove the model right or wrong? Some open questions.

Currently we have merged the portions for device management, metrics collection, and the automated response. We still have the huge pull request from ProphetStor under review; hopefully it will be merged soon, and we're targeting this feature for February 2019, the next release of Ceph. In the future we wish to see an open SaaS service and an open data set, just like Sage mentioned, and to have an improved, free and open source prediction model. So this is a call to academia, to professionals, to everyone who is interested in this project: everybody can help.

This project happened in collaboration with the Outreachy organization, as we mentioned earlier, which offers paid internships and promotes diversity in free and open source software. There are many communities participating; this is just a partial list, as you can see. The way it works is that applicants look at the participating projects, pick one they're passionate about, contact the mentors, and see which contributions can be made to that project. Again, this is hands-on, so it's not just studying theoretical stuff; it's real. Many times there is a certain, well, maybe threshold is not the right word, a barrier that prevents people from contributing to open source, because it can be intimidating, so with this organization you get the support of mentors, which is fantastic. The contributions don't have to come only from developers: you can contribute documentation, bug fixes, marketing; there are a lot of ways to contribute. Yeah, and then you fill out an application form. I think it has made many projects happen, many contributions to open source, and it's a win-win for everybody, because the interns get experience and they get support, which is not trivial. And to potential mentors: if you have a project that you think would be good for newcomers to open source, I encourage you to mentor it, because eventually it makes the project happen. The internships run twice a year; the next one starts in December, and applications open September 10th. You can see the website; I really encourage you to go check it out and tell everyone about it.

On a personal note, I'll share my experience with the project. So, picking a project I was passionate about, I was making the contribution with Sage's guidance, and then, once it starts, a very common syndrome for interns is impostor syndrome. You start to realize, oh my god, maybe I'm not the right person to do this; the code base is huge, where do I write the next line of code? You kind of freak out, you have some fear and doubt. But then, we all know this but tend to forget it: nobody knows everything, we all learn all the time, and it's good to remember that. And Ceph's community is great: they're sincerely happy to help, they don't criticize you, they know that it's okay.
Nobody knows everything, and not everybody knows the whole project. At the very first stages, Sage told me that there is no such thing as a silly question. Just think about it: sometimes we're afraid to ask. So, first of all, I would like to thank the Outreachy organizers, Sage Sharp, Marina and Karen, and all the team, who were very attentive, responded to everything super quickly, and really wanted to help; all of the mentors who took part in this project; and of course Sage: a huge thank you for all your patience and your willingness to help. Thank you.

So the challenges, just a quick recap for the project: we still have the smartctl change upstream, and we hope to see that released around the end of the year. We still have some changes to the architecture, so we're not done. Yeah, and we still don't have the data to build our own model; hopefully we'll have some open data soon. The outcomes we had from the project: we do have a modular approach for the metrics collection, the prediction model and the response, and we have a new participant in the Ceph community, ProphetStor, who are very thrilled about this project. Yeah. Sorry. And that's it. Thanks to the smartmontools upstream, Christian Franke, who has been great; to ProphetStor for contributing; and to Outreachy, of course, for setting up the internship. We're kind of out of time, but if you have any questions, please find us and ask.

So, the last time I did do this in production was a couple of years ago, but one of the things that I found was that the data supplied by solid-state storage was both completely inconsistent between different vendors and often extremely sparse. Has that improved at all in the last couple of years? It's quite complex. I mean, we're focusing on collecting the data; we're collecting it from potentially all the devices, but at the beginning we were focusing on SATA hard drives, so, sorry, I'm not sure if it has improved. Yeah, I think mostly we're relying on smartctl to scrape everything, because they've been pretty thorough, and as far as what data we're actually getting out, and the ability to predict on it, that's sort of the self-contained prediction problem, at least that's the way I've been thinking about it. So we're starting with something, and the hope is that the situation will improve over time.

I was wondering how you evaluate the accuracy of the ProphetStor predictions. The truth is that we haven't yet. They've provided a model and we haven't had time to evaluate it, so we're focusing just on the system integration problem first. I guess I had a similar question about the model: it says it's 95 or 97 percent accurate, but that doesn't tell you a lot; a better metric would be the false positive rate and the false negative rate, because if it's a false positive then there's a bunch of work you do to make sure there are no problems, right? So how good is the model in those terms, and how much do you care about false positives? Yeah, that's exactly right. There was a talk I saw at Vault last year that defined it; I'm forgetting the technical term, but it's a two-dimensional matrix where essentially you have the probabilities of positives and negatives, false negatives. That one accuracy number is what they've given us, and again, we haven't analyzed it from that data science perspective, so: unclear. Okay.
Thank you. Yeah, we have one more question. So basically, if I were to try something like this out, would I need to set up an agent to collect these logs and push them somewhere? — You're talking about the open data set? — Yeah. — So there are sort of two goals. One is to make it work for Ceph out of the box, and if we have a public data set target, you could just turn it on. But the expectation is that not all your storage is Ceph — obviously there's lots of other storage in your data center — so we want an agent that you can just install on any host, turn on, and point at the upstream target, and it would share that data. — Thank you. — Thank you, Sage. We have lunch served in the lounge.

...a machine learning application. So if we want to develop an application, we need some programming language, and if we look, for example, at the developer survey results published each year by Stack Overflow, we can see that Python is becoming a more and more popular programming language. To me that's not an accident: Python has a wide range of libraries for data processing, and as was stated in the previous talks, data is becoming more interesting than the actual code, so Python is really nice for analyzing your data and coming up with some outcomes. Even if we look at predictions, Python's popularity is expected to keep rising, and as I said, there is a wide range of libraries that can be used.

So we will use Python as the programming language. We will use TensorFlow as the machine learning library, Flask as an API server — and this API server will run on top of a WSGI server, gunicorn — and we will use pandas for data frames. These are all the requirements that we have, and in the Python world you capture them in a file called requirements.txt. (Okay, sorry for that.) In this requirements.txt file you state all the requirements for your application — in our case TensorFlow, pandas, Flask, and gunicorn. You take some time, you develop your application, and you have a Git repository with your machine learning application.

The next step is to deploy the application. For that we use some highly scalable platform — for example OpenShift — and we point it directly at the Git repository, use the feature called source-to-image, and deploy the application directly to OpenShift. Now we are running our application, and as time passes it becomes more and more popular; after eight months we have a community and our application is serving user requests.

After eight months you decide you would like to make some improvements, so you code and you open a pull request, and in this pull request you change some behavior of your application — and suddenly your CI fails. So what do you do? You spend some time debugging, and after a while you realize there was a change in a library you are using, so the error is not directly caused by the changes you introduced in your application but by the library you depend on. So what is this behavior change?
So previously you used pandas in version 0.21.0, and in that version you had some data that you analyzed and you expected missing values in a data frame to be summed this way: if you have 10 and None, 10 plus None is 10; if you have None and None, it's None; and 30 plus None is 30 — that's how pandas behaved in this version. Then you did something like a mean, and the mean was computed as (10 + 30) / 2, because there were two values and None was omitted — it's not a number. So the result was 20. With the new version of pandas you suddenly get different behavior: you have exactly the same data, but now when you sum it, None plus None is no longer None but zero, and that also affects your mean — now it's (10 + 0 + 30) / 3.

So basically, by a change introduced in a new version of pandas, your application is suddenly not behaving as you would expect — and you were lucky, because your tests uncovered this misbehavior. The problem is that when you developed your application you didn't state exactly which libraries, in which versions, you expected, and these libraries change over time. Even though your application isn't changing, the libraries are, and you never said "these are the versions I'm compatible with."

So now let's do it differently. Instead of a requirements.txt file where we state only the names of libraries, we can have a requirements.in file where we state the names, and then we generate the requirements.txt file from it. Now we have a generated requirements.txt with all the libraries in specific versions — as you can see, these versions are pinned down — and it contains not only our direct dependencies (TensorFlow, pandas, Flask, and gunicorn) but also the dependencies of those dependencies. That is good, because even if there is a change in these transitive dependencies, we are still pinned to specific versions and we no longer rely on a resolution that is time-dependent. This approach is kind of old school — it uses pip-tools and its pip-compile command. The recommended way to pin down dependencies in a Python application nowadays is Pipenv, and Pipenv does exactly the same thing: it locks down all your dependencies, including the transitive ones.

Okay, so we fixed our application, we pinned the dependencies to specific versions, and we pushed our code to the GitHub repository — and suddenly we have another warning: "We found a potential security vulnerability in your dependencies." As you can see, by pinning dependencies down we know, just from the versions stated in the requirements.txt file, exactly what will be installed — it is no longer time-dependent at installation — and just by checking which versions are installed we can directly say what's wrong.
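To make the pandas behavior change described above concrete, here is a minimal sketch; the column names and values are invented for illustration, and which of the two results you see depends on which pandas version happens to be installed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "c", "c"],
    "value": [10, np.nan, np.nan, np.nan, np.nan, 30],
})

per_group = df.groupby("group")["value"].sum()
print(per_group)
# pandas 0.21: group "b" sums to NaN, so the mean below is (10 + 30) / 2 = 20
# pandas >= 0.22: group "b" sums to 0, so the mean becomes (10 + 0 + 30) / 3 ~ 13.3
print(per_group.mean())
```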
Okay, so we exclude the given vulnerable version from our requirements.txt file, we push again, and suddenly our CI is green again. It's probably not that easy to find the right software stack, though. In some cases you get messages like "you should not use this," but there is nothing that tells you what you should use instead — and that's something Thoth is trying to address.

So if we have these dependencies stated in a requirements.in file and pinned down in a requirements.txt file, we are probably losing some information. Let's take a look at what information that is. If we check our direct dependencies, we have TensorFlow, pandas, Flask, and gunicorn, but these dependencies have dependencies of their own. If we plot it as a dependency graph, we see something like this: TensorFlow depends on gast, astor, and so on, and then tensorboard — which is a dependency of TensorFlow — relies on numpy, Markdown, and other libraries. So this is our dependency graph as resolved at a given point in time.

Now let's say that for some reason we would like to change the version of TensorFlow. What does that mean? By changing TensorFlow we are potentially changing all of its transitive dependencies — so we are possibly changing tensorboard as well, or tensorboard might stop being a requirement of TensorFlow at all, which was the case in previous TensorFlow versions. If we look closer, we can see that, for example, numpy and Werkzeug are also affected by the change of TensorFlow: they can resolve to different versions, given TensorFlow's version-range specification. That also means pandas and Flask are affected, because they use numpy or Werkzeug in specific versions. So by changing TensorFlow we can indirectly change pandas, Flask, and other transitive dependencies of these libraries — and suddenly we are changing most of the libraries in our application stack.

This is an example of another stack that was resolved when we changed TensorFlow from version 1.10 to a different one. By changing this one version we introduced two or three new libraries into the application stack, some libraries were removed, and there can also be changes in versions. So it's not that straightforward to just change one version or swap one library in your application stack.

Now let's suppose we would like to change something that we don't use directly in our application — say numpy. If we change numpy, we are again in the same scenario: we affect, directly or indirectly, other libraries and their dependencies. This is what Thoth is trying to address. Thoth tries to resolve these dependencies on the server side, and besides the resolution algorithm it also tries to incorporate other information — based on experience, on observations, on monitored repositories — and with that information guide the resolution not to the latest version of a particular library but to the best, or the optimal, one.
So, based on the knowledge we have and aggregate: let's say we again have the requirements for our application — TensorFlow, pandas, Flask, gunicorn — and we explicitly state that we want TensorFlow at least in version 1.5 because of some feature. Thoth will now resolve a software stack and give you information about what is wrong and why you should use particular libraries in particular versions. The design of Thoth is that it acts as a team member: we have bots that communicate with the backend and monitor your repositories, and if there is something wrong — if they learn something from observations — they proactively open pull requests against your repositories.

Okay, so how do we find the best software stack? Let's suppose we have an application that uses one library, this library has no transitive dependencies, and we would like to come up with the best possible software stack. To talk about a "best" software stack we need a scoring function — some mechanism for saying how good a given software stack is. Suppose we have this function and we plot its results for specific versions of our library — call it simplelib — which scores, say, 0.86 in version 1.0 and other values in other versions. We can interpolate these values polynomially and see a trend in the data. By the way, if we have enough historical data, we are basically doing machine learning, because we can predict the future based on patterns found in the past — for example, if some library ships a major release too early and then fixes all the bugs in subsequent patch or minor releases, and keeps repeating that pattern, you can see it in the data.

Now let's suppose we have another application, and this one uses two libraries: simplelib and anotherlib. Now we are in a 3D space — on one axis we have simplelib, on another axis anotherlib, and then the actual score. If we interpolate these values, what we get is a surface. Here we can see, for example, that anotherlib scored really poorly in three particular versions, so it's probably best not to install anotherlib in those specific versions; on the other hand, simplelib in version 3.5 together with anotherlib in version 2.5 gave the best score. We can project the surface just to make sure the score is the highest: by projecting onto the score and the simplelib axis, you can see that the highest value is at version 3.5.

Okay, let's try to generalize. Suppose we have a scoring function that takes n inputs, where n is the number of packages that can be installed when resolving the given software requirements. This scoring function can take into account things like how a given library performs, how reliable it is, or how well libraries work together — so when you install, for example, TensorFlow in one version and numpy in another, you can say that this combination performs better than other version combinations. If you want to generalize, you basically have two types of scores. One is solely on the package level.
So you score the given package — does it have a CVE, for example — and then you can have a cross-library or cross-package score, which takes into account, for example, the TensorFlow case I mentioned before.

Okay, so how do we find the best software stack? Well, the answer is that it's nearly impossible for bigger applications. Resolving TensorFlow — just having TensorFlow in your requirements — can lead to some 68 billion possible software stacks, and each new requirement multiplies that by the number of its possible versions. However, we can find a good-enough software stack; we can approximate it.

So Thoth is a recommendation engine. We resolve stacks, we compute candidates that are scored by the scoring function, and we try to provide recommendations — "you should use this software stack" — with reasoning, so we can say, for example, "don't use this library in this version, because we know another version performs better." We try to be as close to the source code as possible, so we have bots that automatically open pull requests against your repositories. We try to learn from user behavior: if you close a pull request and say "I don't want the software stack that was suggested," we try to learn from that and come up with a better recommendation next time. We try to be proactive: once Thoth learns that something is wrong with a particular package, it can, by monitoring the other repositories, proactively open new pull requests and fix issues even before they show up in production. And besides that, we know what you are installing — so if you are installing TensorFlow, we can also suggest how to deploy your TensorFlow into production.

This is the vision of Thoth. As of now, we provide automatic updates of dependencies, and we cooperate internally at Red Hat with other teams to gather information about updates — whether a new release of a given library broke, for example, CI builds or things like that. We have implemented server-side resolution, so we can exclude a given package version from a software stack and resolve to another stack that does not contain the version that might be broken. We have the bots that operate on the source code, and we have checks of provenance, which Python itself does not provide: you can have multiple package indexes deployed — for example, one package index we run provides optimized TensorFlow builds, besides the public PyPI index, which has all the rest of the libraries you need for your application.
So in Thoth we know the provenance — the origin — of the artifacts you are installing, and we can say: you are probably not using the optimized TensorFlow build you would like to use. We can also check the runtime environments of applications: we analyze images, and we can say which packages are present there and in which versions. This feeds information into our database — our knowledge about packages.

The next thing we will do is aggregate more information. As I said, we are cooperating with other teams at Red Hat, and we are trying to gather information like "these software stacks are not good" or "this breaks that," and come up with cross-library scoring — so given TensorFlow in this version and numpy in that version, score that combination. One possible approach is Dependency Monkey, where we generate different stacks, run test suites on those stacks, and measure, for example, performance — how well TensorFlow performs in a particular combination.

So that's it from me. Do you have any questions?

You mentioned Dependency Monkey, but Thoth is mostly focusing on software. Is there any thought to how package resolution or package recommendations may depend on certain hardware? — Yes. Basically, in the backend we record that an observation holds for a given runtime environment, and the runtime environment can carry information like "this is a GPU-enabled node in the cluster." In that case, if you are using TensorFlow in a particular version, we can measure and keep information about that as well.

You mentioned that a lot of how Thoth is integrated right now is through bots in your workflow. Are those bot configurations generally available for people to integrate into their own repositories as well? — We have the configuration available in our repository — I can probably show it. One of the bots is Kebechet; Kebechet monitors repositories and aggregates information about them. It's deployed internally, and this is the configuration for the different repositories we monitor — we monitor Thoth itself and other repositories of our team — and if somebody wants to deploy Thoth on their own, this is the configuration a user can follow and adapt for their deployment.

Okay, thank you for everything. — Okay, thank you.

Is it started? All right. My name is Will Benton. I work on and with distributed systems and machine learning at Red Hat. I've been at Red Hat for about 10 years, and I currently lead a team of data scientists and engineers focusing on machine learning on OpenShift; I'll talk a little more about that at the end of the talk. Thanks so much for joining me for this session today. We're going to talk about what I think are some really cool data structures for scalable computing — basically, things that let you get approximate answers to interesting questions, that you can run on streams or in parallel, and that take only a constant amount of space. But first I want to read the room a bit and see where people are coming from; I'm always interested in why people get interested in things. Early last year,
I was talking to a colleague about problems that we realized were interesting. You know, you think a problem isn't interesting and then you realize that it is — anyone have this experience before? Maybe you're just really good at identifying interesting problems right away and you don't have this problem. I had this problem.

If you think about scalable data processing, of course it's a hot topic today — if I give a talk today saying I'm going to teach you how to do scalable computing, well, everyone has big data or wants to have big data; everyone needs scalable processing. But it wasn't always obvious that stream processing and parallel processing were necessary for a lot of real-world problems. Some people have data that requires this and some don't, and you could imagine, twenty years ago, thinking about problems that you could solve a lot faster with parallelism, or with a lot less space if you could do them in a streaming fashion — but you didn't necessarily think about problems that you couldn't solve at all unless you used parallelism or streaming.

So I was always convinced of the theoretical benefits — the potential speed-ups — of running a lot of these algorithms in parallel, but the first practical problem where I found them totally necessary was looking at mean and variance estimates, which is pretty basic. You probably already know the easiest way to calculate mean and variance estimates: you pass over the data once and you sum the samples, then you divide the sum by the number of samples to get the arithmetic mean, which I'm representing in this figure as the height of the shaded box. Then you go over the data set again in a second pass and calculate the difference between each sample and the arithmetic mean — here the positive differences are red bars and the negative differences are blue bars (it's a little hard to see on the projector because it's dim), though it winds up not mattering, because we're going to square them all anyway and they'll all be positive — then we add them up, divide by one less than the sample count, and we get the variance. You learn this in school.
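As a reference point, here is a minimal sketch of that textbook two-pass calculation (sample variance with the n − 1 divisor mentioned above); the input values are invented for illustration:

```python
def two_pass_mean_variance(samples):
    n = len(samples)
    # First pass: sum the samples and divide by the count to get the mean.
    mean = sum(samples) / n
    # Second pass: sum the squared differences from the mean, divide by n - 1.
    variance = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean, variance

print(two_pass_mean_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]))
```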
This is a simple technique, and it works for any data set you can keep in memory and pass over a few times. But for some large data sets that second pass is not going to be feasible — it might not even be possible. I first ran into this problem while running architectural simulations. If you're a computer architect and you want to evaluate the impact of a new cache replacement policy on memory access latency, or if you're a static analysis designer, as I was, and you want to evaluate the impact of a particular compiler optimization on branch predictor performance, you're going to be running a simulation. In the stone ages, when I was running these simulations, it took about two weeks to simulate a few seconds of wall-clock time for an interesting program. So you'd be running a bunch of different configurations, using up a whole bunch of people's computers, and producing a stream of values for every memory access indicating how many cycles it took to reach that memory. You can imagine that if an experiment takes days or a week or more to run, the prospect of running it again just so you can calculate the variance is not something you want to face. So you want a way to do this in a single pass.

Beyond understanding how the textbook method worked, I hadn't spent much time thinking about this problem at all, and the first time someone showed me an algorithm to do it in a single pass it seemed like magic — like a superpower. So let's take a look, and maybe it'll seem a little less like magic after we see it in action. Instead of thinking about our population as a set of values we can replay as many times as we want, we need to think of it as a stream: I can only see each value once. We'll examine one sample at a time, and after we examine a sample we'll update our estimate. So when we've seen one thing, the mean is that one thing we've seen, right?
That's the easy, trivial case — we like to look for those where we can get them. Our variance is undefined, because we only have one sample, and to calculate the variance we'd have to divide by zero. When we look at the second sample, we look at its difference from our running estimate of the mean — which I'm representing here as the blue top of the bar — and we update our estimate of the mean to reflect the contribution of the sample we're looking at: we split the difference between what we have and what we've seen, and so on. We can also get an initial estimate of the variance. Because variances are typically larger quantities than means, I'm representing it as an area rather than just a line, but there's nothing inherently two-dimensional about it.

When we examine another sample, we update the mean estimate again — this winds up being like a weighted average: since we've now seen three things, we update by one-third of the difference between our mean estimate and what we're seeing — and then we also update our variance estimate. In this figure you can see how the mean estimate gets updated over time as we examine each sample: each sample has a colored bar over it indicating where the mean estimate was, whether we raised or lowered it, and by how much we updated it after examining that sample. The actual arithmetic mean for the whole stream is represented by the height of the light gray box, and we can see that after the last step we've landed on the arithmetic mean we'd have calculated with the traditional two-pass algorithm. On the bottom we have the variance estimate after each sample, which I've scaled down so it fits on the slide, and the height of the dark gray box represents the variance estimate we'd get from actually running the two-pass algorithm. If you have really good eyesight you can notice that it's not exactly the same — but it's close, and we were able to do it in one pass, so maybe that loss of precision is acceptable.

The cool thing about the summary you get from this online algorithm is that you can combine it with others. For example, let's say I have two streams — maybe I'm running an experiment in parallel, running different parts of the binary on different machines — and I want to combine the mean latency from each of them. Here's one stream with its actual arithmetic mean and the mean and variance estimates we'd get for it, and here's another stream with its own mean and variance estimates. We can combine the means from the two estimates by treating them as weighted averages: if we've seen a dozen things in the first one and ten things in the second, we combine them by taking twelve times the first mean plus ten times the second and dividing by 22 — pretty straightforward. And we can combine the variances in a similar way, taking advantage of the fact that the mean and the variance are independent: we don't need to know anything about the mean to say something meaningful about the variance. This means we can process streams or very large data sets in parallel and produce reasonably accurate estimates of their mean and variance just by combining the estimates of partial streams or subsets.
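A minimal sketch of the single-pass estimates and the merge step just described; this follows the standard Welford/Chan formulation, which may differ in detail from the speaker's slides, and the class and variable names are invented:

```python
class RunningStats:
    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared differences from the running mean

    def update(self, x):
        # Incremental: each sample is examined exactly once.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.count - 1) if self.count > 1 else float("nan")

    def merge(self, other):
        # Parallel: combine the summaries of two partial streams.
        merged = RunningStats()
        merged.count = self.count + other.count
        delta = other.mean - self.mean
        merged.mean = self.mean + delta * other.count / merged.count
        merged.m2 = (self.m2 + other.m2
                     + delta ** 2 * self.count * other.count / merged.count)
        return merged

left, right = RunningStats(), RunningStats()
for x in [1.0, 2.0, 3.0]:
    left.update(x)
for x in [10.0, 20.0]:
    right.update(x)
combined = left.merge(right)
print(combined.mean, combined.variance())  # same as computing over all five samples
```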
So — how many people in here read abstracts before they go to talks? Good, I commend that. Some of you may have an objection at this point, which is that this isn't actually a probabilistic structure — it's more of an approximate structure: the answer we're getting is close to accurate, but it's not exact. I also said that there would be some code in the talk, and I haven't shown any code yet, so let me address those objections in order.

The first thing I want to say is that this is just a motivating example — this is where I discovered, a long time ago, that streaming algorithms were interesting — but it has a few things in common with every other technique we're going to look at today. The first is that it's incremental: we can update our summary by looking at each new sample once, so we can process a large data set in a single pass and never need to keep observations around to replay later. The second property is that it's parallel: if I have a summary for a subset or a partial stream, I can combine it with a summary for another subset or partial stream and get a summary of their union. This means I can divide the problem into pieces that are reasonable to process on a single machine and combine the results; it also means we can scale the processing out by processing subsets of the input on separate machines, and thus benefit from elastic scale-out in the cloud, where you have a lot of relatively small compute resources that are cheap and may come and go depending on spot pricing or other availability. The last thing I want to point out is that this technique is scalable: whether you process one sample or one trillion, your summary uses the same amount of space.

When you think about something being scalable — how many people think a linear-time or linear-space algorithm is scalable? Raise your hand for linear being scalable. How about logarithmic — raise your hand if logarithmic is scalable. Sounds good, right? I always think of those things as scalable myself, but when we're dealing with truly large amounts of data, any growth in the summary size at all is ultimately going to lead to problems: if the summary gets bigger when we look at a trillion things, we may not be able to deal with it when we get to ten trillion. So we need something that takes a fixed amount of space no matter how many things we look at, and we need a way to reason about how much precision we're giving up for a particular fixed amount of space.

All of the techniques we'll look at in the rest of this talk have these three properties: they're incremental, they're parallel, and they're scalable. And just as I was empowered by being able to compute statistics in a single pass, I hope that some of these techniques will seem like magic to you if you haven't encountered them before — and if you have, I hope you'll at least be inspired to think of new things you can do with them. The second objection was that I didn't have any code. The code for streaming mean and variance estimates fits on a slide; it's not particularly interesting,
so I'm not going to talk about it, but we'll look at some more code later in the talk, and it will all be code that illustrates something and is actually useful if we step through it.

The first problem I want us to look at is set membership for very large sets. Set membership is interesting because it's fundamental to a lot of data processing problems, but we'll start with a motivating example. If you're running a web cache and you default to caching everything that anyone ever requests, you'll get improved performance the second time someone requests something, but as requests keep coming, the size of your cache keeps growing. And there's a long tail to web requests: some things will only ever be requested once. Something that's requested more than once is likely to be requested a third time, and maybe more, but most of the things you see in a web proxy are only ever going to be requested once, and no one will ever ask for them again. Those things end up sitting in your cache, filling it up and hurting its performance. You can eventually evict them when they're never asked for again, but in the meantime you're spending resources — in what is probably a very high-performance, expensive-to-run system — that you could profitably apply to caching things that might actually be requested in the future, which are only a subset of what you would cache if you cached everything.

So it would be nice to cache things only the second time someone requests them, so that the things requested only once never land in the cache and never take up space. The problem is that we have to keep track of what we've seen in order to know whether we've seen it — and if we're keeping track of everything we've seen, we're basically caching it anyway. So what we want is an incremental, parallel, and scalable way to model set membership. We'll maintain an approximate set of things we've been asked for. When we get a request for something, we check whether it's in the cache. If it's not in the cache, it could be either the first or the second time someone has asked for it. If it's also not in the approximate set we maintain alongside the cache — which is much smaller than the cache and much cheaper to update — it's the first request, so we put it in the set but not in the cache. If it is in the set of things we've seen once, we put it in the cache, and it stays there until we evict it.

Because we want our proxy to be scalable, we need this set-membership approximation to be a lot smaller than the actual cache — something that takes a fixed amount of space no matter how many random web pages get requested. And this isn't a hypothetical example: Akamai did something like this about 20 years ago and saved a lot of money and wasted effort in their content distribution network.

I want to talk about the approximate technique we can use to solve this problem, but first let's quickly review how we would solve it precisely, so that we can think about the trade-offs we make when we go from a precise solution to an approximate one. The first thing we could do to represent a
set — especially a small, immutable set — is just an array, and if it has fewer than about 20 items you probably don't even need to sort it, because keeping it sorted would be more effort than scanning through to see if the thing you want is in the set. If we have data that are comparable and admit some natural ordering, we can make the set a tree, where we store the values and can quickly search for something to see whether it's in the set or not. And if we have data we can hash, we can use a hash table: the keys are the set elements and the value can be anything — if the key is in the hash table, the thing is in the set, so the value can be true, or one, or whatever.

Recall that when you put something into a hash table, you compute the hash value of the thing you want to insert and use it to identify a bucket, which is just an element of an array, and then you store a linked list of things in that bucket. (There are a couple of ways to do this; we're choosing to store a linked list.) So I'm putting "foo" into my hash table; it's not there, so I construct a new linked list with "foo" as the first element and nothing after it. We use that list because we need some way to handle hash collisions, where two different values map to the same bucket. Because we're hashing arbitrary values into a relatively small number of buckets — you can take any value in the universe and map it onto, say, 10,000 or 20,000 buckets — eventually two things are going to land in the same bucket, so we keep the list around so that when we ultimately get a collision we can handle it. Let's say "bar" hashes to the same bucket as "foo" — which is probably a good sign that you need a new hash function — what we do in that case is append it to the linked list. So when we have a collision in a hash table, we pay a time penalty and a space penalty, and it's easy to see both in action if you think about a hash table with only one bucket: a hash table with one bucket is essentially a linked list — searching it is a linear scan through everything in the list, and you have no memory locality — so it's pretty awful all around.

For scalable computing we don't want to pay either of those time or space penalties. So we're going to look at the Bloom filter, which is a very old, hash-based data structure that models set membership, uses a fixed amount of space and constant time, and returns an approximate answer to the question "is this thing in the set?" The answer is approximate because it isn't "yes" or "no"; it's "no," meaning definitely no, or "maybe," meaning we can't be sure.

So let's see how it works. We want to handle a large amount of data in a fixed amount of space with a constant amount of time per update, so we'll have a fixed number of buckets: we're never going to expand our hash table, and we're not going to use linked lists to extend it. The values we care about storing are either true, indicating that something is in the set, or false, indicating that it isn't — so we can just use a bit vector for the buckets, which saves us space right away. We don't know what we've inserted into the Bloom filter; we just know whether something with a particular hash value has been put into the Bloom
filter. If we're querying for something that hashes to the same bucket as something we've already put into the Bloom filter, we'll have to return true even though it might be a collision, and as the number of elements in the Bloom filter grows, the probability that we'll see a hash collision and return a false positive increases.

The Bloom filter mitigates this to some extent by using several independent hash functions for insertion and lookup. So in this case we're inserting "foo", we're using three different hash functions to identify buckets, and we set all of those buckets to true. If we then insert "bar", we use the same hash functions, but at least some of them are likely to return different values. In this case the third hash function has collided for "foo" and "bar", but the other two have not — and again, if your actual hashes are behaving like this, seek out a better library.

When we look up a value in the Bloom filter, we return true only if all of the buckets it hashes to are set. In the case of "foo", all of the buckets that "foo" hashes to with the three hash functions are set to true, so we return true; likewise for "bar", all three buckets are set, so we return true. If we look up something that's not in the filter, we may have hash collisions for one or more of the hash functions, but it's really unlikely that we'll have collisions for all of them. In any case, if the bucket for any of the hash functions is unset, we know absolutely that we never put that thing into the filter — this is why our "no" means definitely no. But in the event that every hash function leads us to a set bucket, the lookup could return true even for an element we never inserted; in this case the hashes for "blah" happen to collide with the hashes for "foo" and "bar" together, so we return true — which means maybe — but it's a false positive. We can actually calculate the probability of false positives from some properties of the filter, and we'll talk about that a little more in a second.

A really cool thing about the Bloom filter is that, as you might imagine, the implementation fits on a slide. You have a bunch of buckets — a bit vector. When you insert, you take each of your hash functions and set the corresponding bit to true; when you look something up, you return false if any of the buckets you check is unset, and true otherwise. You can update this one element at a time — you never need to see your elements more than once — so it's incremental, and we'll see how it's parallel as well. Another really cool thing is that you can combine the filter buckets with bitwise OR: if we have two filters, one of which we've inserted "foo" into and one of which we've inserted "bar" into, and we combine them using bitwise OR, we actually get a filter of the union of the two sets.
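A minimal sketch of the filter just described — insert, lookup, and the bitwise-OR union — with invented parameters, deriving the k bucket indices from two base hashes (a common trick, not necessarily what the speaker's slide showed):

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=16384, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _buckets(self, item):
        data = str(item).encode("utf-8")
        h1 = int.from_bytes(hashlib.sha256(data).digest()[:8], "big")
        h2 = int.from_bytes(hashlib.md5(data).digest()[:8], "big")
        # Derive k bucket indices from two independent base hashes.
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for b in self._buckets(item):
            self.bits[b] = True

    def __contains__(self, item):
        # "False" is definite; "True" means "maybe" (possible false positive).
        return all(self.bits[b] for b in self._buckets(item))

    def union(self, other):
        # Bitwise OR of the buckets approximates the union of the two sets.
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = [a or b for a, b in zip(self.bits, other.bits)]
        return merged

f = BloomFilter()
f.add("foo")
f.add("bar")
print("foo" in f, "baz" in f)  # True, (almost certainly) False
```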
So this is where we get the parallelism from: we have a very simple operation to combine two of these filters, and we can estimate whether something is in the union of two sets even if we can't keep both sets in memory at once. The other nice thing is that this is exactly the same filter we would get if we took all of those elements and put them into a single filter. It also means we can scale this out: if we process the data one partition at a time, we can combine the filters for each partition and get a filter for the whole data set.

If you're algebraically inclined, you might be thinking: if I can get something interesting with bitwise OR, I can probably also get something interesting with bitwise AND. What is to AND as union is to OR? Intersection, right — and in fact you can get a Bloom filter approximating the intersection of two sets by taking the bitwise AND of their buckets. The union of two Bloom filters is precisely equivalent to the Bloom filter you'd get by constructing one from scratch out of all those elements, but the intersection of two Bloom filters may have a higher false positive rate than the filter you'd get by first taking the intersection of the two sets and then building a filter from it. Then again, this may be the only way you have to calculate that intersection in the first place — the point is that we can't always calculate the intersection precisely, so it's useful to have a scalable and parallel approximation.

An extension to Bloom filters partitions the buckets. The idea is that each hash function gets its own partition of the bucket space, so instead of three hash functions that could write anywhere in your bit vector, the first part of the bit vector belongs to the first hash function, the second part to the second, and so on. You can also think of this as a matrix of buckets with one row per hash function. The interesting thing about the partitioned Bloom filter is that it has a lower false positive rate under intersection than the classic Bloom filter, and if any of the partitions is empty, you know for sure that the two filters don't intersect. The partitioned Bloom filter is also the basis of the second structure we'll look at today, and we'll get to that after we talk about some applications of Bloom filters.

So what are some cases where we might want to answer set membership queries without keeping explicit sets around?
Well, a motivating example from Bloom's original paper was a hyphenation dictionary. If you've ever thought about hyphenation — especially if you speak more than a couple of languages — you know there are enough corner cases in hyphenating natural-language text to be extremely frustrating. In Bloom's case, the example was a hyphenation program that could use simple heuristics 90 percent of the time but had to consult a dictionary of rules for the remaining 10 percent, and because this was almost 50 years ago, neither the dictionary nor the set of words that needed the dictionary would fit in memory. That's hard for me to imagine at this point in history, but maybe it's easier for others. The problem is that the thing that tells you whether you need to consult the dictionary only fits on disk, and we really don't want to hit the disk — the disk is super slow, and having to hit the disk to decide whether or not we need to hit the disk turns out to be a really bad deal. But if we maintain a Bloom filter in memory to tell us whether we need to look something up on disk, we can reduce the unnecessary disk accesses to the false positive rate of our Bloom filter. More on that in a minute.

The second application is the Bloom join, which comes up in distributed databases. If you have a distributed database with two relations that live on separate machines, you'd like to get the pairs of tuples that have some value in common. Implemented naively, there's a whole lot of communication and unnecessary work: take all the values of x from A and send them over to the machine that has B, so you can figure out which tuples might be involved in the join, and vice versa. Instead, we can construct a Bloom filter for the values of x in A and a Bloom filter for the values of x in B, ship those across the network, and use them to filter out records that aren't going to be involved in the join before we do any real communication. If we have a relatively low false positive rate, we can compute the final join result we care about without a lot of unnecessary communication.

The last application I want to talk about is enabled by the fact that Bloom filters are so simple: you can not only implement them in Python on a single slide, you can implement them in hardware. From 30,000 feet, essentially every innovation in computer architecture in the last 50 years has involved exploiting implicit parallelism, and this parallelism comes in two forms.
The first is instruction-level parallelism, where we identify instructions that aren't going to interfere with each other and execute them in parallel, or where we divide instruction execution into many stages so that we can run the stages in parallel. This requires support for speculative execution, so that we can start executing an instruction before we know whether it will interfere with something that's already executing, and roll it back if necessary. Hardware speculation is something that, if you haven't thought about it, you may not realize how important it is — but if you've paid attention to how fast your computer is before and after installing security updates at any point in the last ten or eleven months, you may have noticed that hardware speculation is actually pretty important, because it's been the source of some really critical security bugs in the last year, and turning those features off really impacted performance.

The second kind of parallelism we talk about in hardware is thread-level parallelism. If we have some way to speculatively execute functions or loop bodies in separate hardware threads, we can really speed up a lot of workloads, and this functionality is especially nice because we can speed up things that a compiler wouldn't touch. Consider a C function that increments every element in an array. If we call this function twice, as in the second snippet, we might think we'd really love to run the two calls in parallel — but no C compiler is going to reorder this code at all. Why is that? Anyone know? You can't model the side effects: we have unrestricted pointers to blocks of memory, and the array that v1 points to might overlap with the array that v2 points to — in fact, they might even be the same array. If they're the same array it's not necessarily a problem, but if they merely overlap it could be a big problem. In general, the compiler is not going to be able to prove that these two calls commute.

A way to get around this with hardware support is something called thread-level speculation. The basic idea is that you have special support for tracking which memory addresses a program or sub-program writes to, and you can roll things back if it turns out they weren't safe to execute. In practice, this means we want to detect the case where the orange invocation writes something that the blue invocation reads, or vice versa; if they don't touch the same memory at all, we can run them out of order, in parallel, because we know they commute. We can use a Bloom filter in hardware to represent the sets of memory addresses that each speculative thread touches during execution and then check whether there's any intersection between them; if there is, we roll back one or both of the threads and execute them in order.

So all of these applications can benefit a lot from approximate sets, but they all depend on having a reasonable trade-off between the filter size and the false positive rate, so we need to choose a good trade-off. Fortunately, this isn't very hard: the Bloom filter has properties that let us calculate this with a formula, so we don't need to run experiments.
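For reference, the false-positive-rate approximation that the next paragraph walks through is commonly written as follows, using the talk's naming (k hash functions, an n-bit filter, and m inserted elements):

$$ p \approx \left(1 - e^{-km/n}\right)^{k} $$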
We can just plug numbers in and approximate it, and this is what the formula looks like. I promise this is my only slide with LaTeX math mode — we are at a university, so I hope you'll let it slide that there's one equation in the talk. Let's see what's going on here: we're taking one minus e to this power, and raising that whole thing to the k-th power. k is the number of hashes in our Bloom filter, and it winds up having a big impact on the overall false positive rate, because the fewer hashes you have, the more likely you are to have a collision for all of them. n is the size of our filter in bits, and m is the actual number of elements in the set we're approximating; as m grows, we get more false positives.

So we have a formula, and if you're like me, you look at it and think: I have some idea how this works, but I don't have a really clear picture of what it means for an actual application I might care about. What the formula gives us is something we can plug numbers into for a problem we care about, plot, and figure out where things break down — so that's what we'll do. This plot shows the false positive rate for a 16,384-bit Bloom filter with three hashes as we add unique elements to it. The x-axis is a log scale of the number of unique elements we've added, and the y-axis is the false positive rate. The underlying implementation is a bit vector that takes up 2,048 bytes — that's all the space we're using. If you think about it in those terms, the false positive rate is actually really good throughout, because even when you get close to having more unique elements than bits in the vector, you still aren't at a hundred percent. When you have 2,048 unique elements — that's one element per byte of the filter — you have a false positive rate of about three percent. Imagine what it would take to build a precise hash table of any interesting element type, and how much space that would take, versus 2,048 bytes to hold 2,048 items with a 3 percent false positive rate. I think that's really amazing; the Bloom filter is a totally remarkable data structure that really punches above its weight.

But sometimes just knowing whether you've seen something before isn't enough — you want to know how many times you've seen it — and that's what we'll look at next. Let's say you want to identify what's popular on a social network — have you ever looked at Twitter or Instagram? These networks see an enormous amount of activity, a lot of posts; how would you keep track of the most popular hashtags on one of these networks?
You wouldn't be able to do it You wouldn't have enough space you wouldn't have enough time But a lot of these questions don't really depend on having a precise answer either So we can take tolerate some imprecision in exchange for getting a useful answer quickly with a limited amount of space So let's look again as we did with the bloom filter at some precise structures that we could use to support these kinds of queries and It's not surprising that because this is a similar problem We can use structures that look like the structures we use for the bloom filter If we have just a few elements we can use an array of pairs Where the first element is the event type the second element is how many times we've seen that event? If we have more elements and the event types have some kind of ordering We can use a tree of pairs ordered by the key which is the event type and we can update that every time we see a new one and So on with a we can have a hash table of counts instead of a hash table of truth values So again, these are essentially the same structures We'd use to represent set membership precisely and we really want to wait to sort of generalize the bloom filters that we can handle things Other than membership, right? We want to hold counts rather than true or false Just as we can generalize arrays trees or hash tables to hold arbitrary values and not just truth values So the approximate structure we're going to look at next does just that the count men sketch Uses hashing, but it generalizes the bloom filter to hold counts Instead of truth to our false values and here's what it looks like So we start with a partition boom filter, but the buckets are going to contain counters instead of bits When we insert something into the sketch we hash it with several functions and increment the counter for each Just as we do with the bloom filter So this looks pretty familiar so far, right? We've gone from zero to one because we started with an empty filter And it looks a lot like the bloom filter But once we have a structure that's populated with many counts. We might want to look up how many times we've seen a particular kind of event So as with the bloom filter, we're going to use those hash functions to find the appropriate buckets and And also with as with the bloom filter We're going to return the minimum value we get from these lookups as our approximate answer But unlike the bloom filter our minimum is more interesting because these are integers rather than Boolean So instead of just saying if we have any false is our minimum is false We say what's the smallest count we've seen for any of these buckets that we hashed to for this thing So with the bloom filter we over approximate set membership, right? We either return definitively false or maybe true and Hash collisions lead to false positives But with a count min sketch we over approximate event counts So the count we return is not less than the number of times we've seen something But hash collisions can lead to it being more than the number of times we've seen something Again like the bloom filter this basically fits on a slide. 
It looks a lot like the Bloom filter, and also like the Bloom filter it's an incremental, scalable, and parallel abstraction for the problem — and it has some very cool properties. The parallelism comes because you can just do an element-wise add of two count-min sketches to get the sketch of their union. A really cool thing you can do with count-min sketches that you can't do with Bloom filters is take their inner product and get an estimate of the size of the join between two streams of events, and I'll show you what that's useful for. We just sum the result of multiplying the counts in corresponding buckets, and the result is an estimate of the size of the join. This turns out to be useful if we hash the same kind of event into two different sketches. If I have infrastructure logs and I'm hashing log records into one count-min sketch by subsystem and into another by severity, I can ask questions like "how many times did etcd have an error?" — because the things that are in both will be the error log records that etcd produced. That's a really nice way to get an efficient answer to that sort of question.

We can also use the count-min sketch as a building block for an interesting problem like top-k queries, which you can think of as trending topics on Twitter, the most popular Wikipedia pages, or the most popular search terms you've seen in a search engine. We support these top-k queries by combining a count-min sketch with a priority queue. When we insert an element into the sketch, we also look up its estimated count and insert that value into the priority queue; if the key is already in the queue, we update its priority and move it if necessary. After we do this a few times — looking up values and keeping track of the top few we've seen — we have an estimate of the top (in this case five) elements. And because the count-min sketch sticks around and has counts for anything we can hash into it, even if we lose the priority queue we can probably reconstruct it, as long as the distribution of popular things doesn't change too much in the future. So the queue fills up and we get an estimate of the top k things.
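A minimal sketch of the top-k idea just described. It reuses the CountMinSketch class from the previous sketch (an assumption of this example, not a library API), and for brevity it tracks the current leaders in a small dictionary and only sorts them on demand, rather than maintaining a true priority queue:

```python
import heapq

class TopK:
    def __init__(self, k, sketch):
        self.k = k
        self.sketch = sketch  # a CountMinSketch as sketched above (assumed)
        self.leaders = {}     # item -> current estimated count

    def add(self, item):
        self.sketch.add(item)
        self.leaders[item] = self.sketch.estimate(item)
        if len(self.leaders) > self.k:
            # Evict the item with the smallest estimated count.
            smallest = min(self.leaders, key=self.leaders.get)
            del self.leaders[smallest]

    def top(self):
        return heapq.nlargest(self.k, self.leaders.items(), key=lambda kv: kv[1])
```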
For a concrete example, say we wanted to track the trending topics in a social network. This is actually a really big problem. A lot of people are no longer using Twitter because it's frustrating since they turned off support for third-party clients, but if you are still using Twitter you know that it gets more than half a billion tweets per day in scores of different languages. So it would be really expensive to keep explicit counts of all the topics, or even just the hashtags, you observed in a given day. And if you want to understand the sentiment people have about various topics, you probably want to get even finer-grained than that: you want trending topics that you can drill down into by day or by geographic region. So you need a way to represent summaries of trending topics that is small enough that you could keep one for every hour, one for every day, and one for every geographic area you wanted to track separately. In this case, if we wanted to find the most popular topics over the weekend (Saturday and Sunday in this figure, which is a little hard to see on the projector), you could support that by having a count-min sketch for every day and then simply adding the sketches for Saturday and Sunday.

In a similar vein, if you're running a video streaming site and you want to identify the most popular videos based on views in a geographic region like the Americas, you could do this by building a count-min sketch and a top-k structure for each of the 35 countries in the Americas and then taking the union of all of them. Now, the Americas aren't monolithic, and you might wonder whether videos that are popular in Boston are also popular in Tulsa, or whether videos that are popular among Francophones in Quebec are also popular among Lusophones in Sao Paulo. You can get approximate answers for all of these queries by looking at the inner product of the count-min sketches for each region and dividing it by the total number of likes you've observed in both of those regions. You can even do more interesting things, like asking whether videos that are popular in winter are also popular in the summer: you could make a controlled experiment by looking at count-min sketches for the Northern Hemisphere and the Southern Hemisphere separately and comparing those.

So another problem I want to look at is the so-called count-distinct problem: if you have a stream of observations and you want to turn it into a set, what's the cardinality of that set? If you just want to count the number of things you've seen, that's trivial, but if you want to count the number of distinct things you've seen, that's more interesting. As before, we can use precise approaches, but none of the precise approaches we've looked at will scale to a set of interesting size. It's very easy, though, if we have a bloom filter: we can use one of a few different techniques for estimating the cardinality of the set it approximates, and most of these are a function of the observable properties of the filter, whether that's the number of buckets, the number of hash functions, or the number of set bits. Here's a simple implementation on our partitioned bloom filter. This technique works really well for filters that aren't too full, but as the false positive rate increases it begins to dramatically overestimate the counts, which is a problem if we want to estimate the cardinality of a really large set.
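As an illustration of the set-bits technique mentioned here, this is a small hedged sketch of one standard way to estimate how many distinct items went into a bloom filter from its observable properties; it is not necessarily the exact formula on the slide.

    import math

    def bloom_cardinality_estimate(num_bits, num_hashes, bits_set):
        # estimate n from m bits, k hash functions, and X set bits:
        # n ~ -(m / k) * ln(1 - X / m)
        m, k, x = num_bits, num_hashes, bits_set
        return -(m / k) * math.log(1 - x / m)

As the talk notes, this behaves well while the filter is sparse; as it fills up (x approaches m), the estimate blows up, which is why a different structure is needed for very large sets.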
So the technique I want to talk about to address this is called HyperLogLog, and it counts the distinct elements in very large collections. I really want to dive into the intuition behind this, because it's a little trickier and, I think, a little less obvious than the first two things we looked at. How many people feel like the bloom filter and the count-min sketch make sense? I'm going to ask the same question about HyperLogLog afterwards; I hope it will make sense after this, but it's a little trickier, and my hope is that you leave this part of the talk with a sense of why it works, how it works, and a definite idea of what it's useful for.

So let's say you flip a fair coin and it lands with tails facing up. This isn't surprising; it doesn't surprise you at all. If you continue flipping the coin and you get four tails in a row, that might surprise you, and it would probably surprise me, but then I'd think about it and say, well, I have a one-in-sixteen chance of this happening. It's unusual, but it's not that surprising, and if I'm spending enough time flipping coins, four tails in a row is going to come up eventually. Now, if I keep flipping the coin and I get 64 tails in a row, I may start to suspect that my fair coin is not actually a fair coin, because there's a vanishingly small probability of getting this result by chance: one in two to the 64th, one in 18 quintillion. I don't even have a way to picture how small that is, but I do know there's not much chance of getting 64 tails in a row.

So what do coin tosses have to do with set cardinalities? Well, if we have a source of uniformly distributed n-bit numbers, we can think of those numbers as sequences of n coin flips. Every bit in a uniformly distributed number is independent of every other one (again, assuming you have a good hash function) and is equally likely to be one or zero. So the probability of seeing a number that begins with n zeros is one in two to the n, just like the probability of flipping a coin n times and getting n tails in a row is one in two to the n. That means if we see a number with n leading zeros, we can estimate that we've probably seen about two to the n numbers, because the chance of seeing that many leading zeros is one in two to the n. So in this case, for the first uniformly distributed n-bit number, we have zero leading zeros.
We can assume that we've seen one number. If the next one comes up with more leading zeros, we'd assume we'd seen more (32, say), and so on: every additional leading zero halves the fraction of numbers that could look like that, so it doubles our estimate of how many numbers we've seen. If you think about binary, this maybe makes intuitive sense, but let's look at the cumulative distribution. Here I've taken 4096 uniformly distributed 32-bit numbers and plotted the cumulative distribution of the number of leading zeros. As you might imagine, half of them have no leading zeros, 75 percent have at most one leading zero, 87 and a half percent have at most two, and so on; as we move to the right, each bin takes up about half of the remaining population. Every time we add another leading zero, we cut the number of things we could possibly see in half, so the chance of having more than a few leading zeros becomes very, very small.

So we have a technique for estimating the number of distinct uniformly distributed random numbers we've seen by tracking the largest number of leading zeros. If we didn't have a hash, this would just be a pretty cool trick: "I can tell you how many uniformly distributed random numbers you've seen" makes me a hit at parties, right? But since we have a hash, we have a way to turn an arbitrary object into a uniformly distributed random number, so we can use this technique to count distinct IP addresses or search query terms or any arbitrary objects, just by hashing them. Hashes take arbitrary objects and map them to fixed-size integers; a hash function always returns the same value for a given input; and if you have a good hash function, the results are uniformly distributed. So we can use our insight about how likely it is to get a certain number of tails in a row to estimate how many unique objects we've seen based on the results of that hash function. If the largest number of leading zeros we've seen is one, we can estimate that we've seen two objects.

Now, the problem with simply counting zeros is that it leads to really high variance. I don't want my estimate to always be a power of two, because those grow pretty quickly, and I don't want my estimates to be that coarse. But there's a technique we can use to smooth out that high variance, and that's what HyperLogLog does. Instead of tracking one count of the largest number of leading zeros we've seen, we divide our hash values into several subsets and track the largest number of leading zeros separately for each subset. In this example, which is just small enough to fit on a slide and is not a structure you'd use in the real world for anything, we're using an eight-bit hash and eight counters. For a real-world application you'd typically use a 32- or 64-bit hash and between 16 and about 65,000 small counters, that is, between two to the fourth and two to the sixteenth. The largest of these structures will answer queries really accurately while using only a few kilobytes of memory. We decide which counter to update by taking the first few bits of the hash. So in this case we've identified the hash we care about, we pick that bucket, and then we look at the number of leading zeros in the rest of the hash. Here we have two, so we record that the largest number of leading zeros we've seen for that bucket is two, taking the maximum of zero and two. For the next object we select counter six (we're counting from the right here), and we
have no leading zeros, so we won't update that counter. After adding a few entries to the registers, we can estimate a count, and we're going to do this by taking an average of all of the registers. So what kind of average do you think we should take here? The arithmetic mean of all of these values just doesn't sound right, does it? If you think about a probability as a rate ("I'm going to see this one in two-to-the-n times"), what's the right way to average a bunch of rates? Anyone know? A harmonic mean is actually the way to average a bunch of rates. So what HyperLogLog does is take two to the value of each of these registers and compute the harmonic mean of those, and for this example it estimates that, for these register values, we've seen 16 elements in our set. Now, if you only had 16 elements in the set you wouldn't need a probabilistic structure, but again, it fits on the slide. As with the other structures, you can add HyperLogLogs together; they're composable via a commutative and associative binary operation, where you just take the maximum of each pair of registers at each index. And as with the others, the code fits on a slide: you can read it, understand it, and it makes sense. Actually using these in production, there are some subtleties you need to get right, as you might imagine, and you can get much better performance by paying attention to a lot of details. I'm not going to talk about those details, but I will show you where to go to learn more: there's a great paper from Google, published at EDBT 2013, basically detailing everything some engineers at Google did to make this a much better technique than the stock implementation.
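Here is a stripped-down HyperLogLog in Python, just to show the register update, the harmonic-mean estimate, and the merge-by-max operation described above. Real implementations, including the Google paper just mentioned, add bias corrections and sparse representations that are deliberately omitted here, so treat this as an intuition-building sketch rather than a production structure.

    import hashlib

    class HLL:
        def __init__(self, p=10):                  # 2**p registers
            self.p, self.m = p, 1 << p
            self.registers = [0] * self.m

        def _hash(self, item):
            digest = hashlib.sha256(repr(item).encode()).digest()
            return int.from_bytes(digest[:8], "big")   # 64 uniformly distributed bits

        def add(self, item):
            h = self._hash(item)
            idx = h >> (64 - self.p)                   # first p bits pick the register
            rest = h & ((1 << (64 - self.p)) - 1)
            # rank = number of leading zeros in the remaining bits, plus one
            rank = (64 - self.p) - rest.bit_length() + 1
            self.registers[idx] = max(self.registers[idx], rank)

        def estimate(self):
            # harmonic mean of 2**register, scaled by a standard correction factor
            alpha = 0.7213 / (1 + 1.079 / self.m)
            harmonic = sum(2.0 ** -r for r in self.registers)
            return alpha * self.m * self.m / harmonic

        def merge(self, other):
            # union of two streams: take the maximum register at each index
            self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]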
Before we wrap up, I want to take a brief look at two problems we won't talk about in detail today. We can think of these almost as advertisements, because I'm going to invite you to do some additional reading to learn more about this arena. The first problem is finding similar documents. Let's say you're a literary agent reviewing a manuscript, and it starts out with a promising passage: it seems like it might be kind of fun, funny, a little clunky, but it sounds sort of familiar, and you start to wonder, have I heard this somewhere before? If you're a literary agent you probably would figure out where you'd heard it: it's a lightly edited copy of the beginning of Jane Austen's Pride and Prejudice. But if you didn't already know that, you'd probably have trouble figuring it out; maybe you'd search the internet for some phrases and see if they came up anywhere. So plagiarism detection is a really interesting problem. Those of you who've ever taught classes know that plagiarism is a really serious deal, that it happens more often than anyone would like to admit, and that it's an interesting problem both for human-language prose and for computer-language programs. But even if you aren't a literary agent or an instructor, there are a lot of interesting natural-language problems related to this, like filtering web search results by grouping similar pages together, or presenting a view of news stories and identifying stories that are related. If the same wire story gets republished in a bunch of different newspapers, you don't want to return all of those as separate search results; you want to group them together.

So let's see how we'd solve this problem precisely, and then I'll introduce the technique and show you where to go to learn more about it. The first thing we do is get a set representation of the document. In this case we represent the document as a set of words; there are other representations we could use as well, like the set of all substrings of a document of a certain length, say all of the five-character substrings. Once we have that representation, we can see how similar it is to the set representation of another document. There's a simple way to measure how similar two sets are called the Jaccard index: we take the size of the intersection of the two sets and divide it by the size of their union. So if we consider these two sets, which I've represented here as bitmaps, they have three elements in common and ten elements in their union, so the Jaccard index of these two is three tenths, indicating that they aren't particularly similar.

As you might imagine, we can use the Jaccard index with larger sets, like sets representing documents, to identify similar documents. The problem is this: yes, you can compute set intersections and unions in linear time, and the Jaccard index isn't particularly expensive, but is linear time really scalable? Who thinks linear time is really scalable? Who thinks linear time is not scalable enough? Who is not willing to raise their hand? Linear time is not scalable enough for the kinds of problems we want to solve here: linear time over the set of all words in a large document, or the set of all five-character substrings in a large document, adds up really fast, and the constant factors get really big. And then if we need to do the pairwise comparison of every document we'd see, that is really where this problem kills you, because it grows quadratically. If we have 10 million documents, we have to calculate and sort 50 trillion Jaccard indexes just to find the most similar pairs, and I don't want to wait around for that, and neither do you. There's a technique that solves both of these problems. It's called MinHash, and it's very cool. If you combine MinHash with something called locality-sensitive hashing, you can quickly identify things that are likely to be similar and then run an expensive operation on a very few things to identify the things that really are similar. I'd recommend you try it; I'm going to have a link to a notebook at the end of the talk where you can learn more about this.
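For flavor, here is the exact Jaccard computation plus a tiny MinHash signature in Python. This is a hedged sketch, not the notebook's code: a real MinHash and locality-sensitive-hashing pipeline would use many more hash functions plus banding, and it assumes the word sets are non-empty.

    import hashlib

    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b)

    def minhash_signature(words, num_hashes=64):
        def h(i, w):
            return int.from_bytes(hashlib.sha256(f"{i}:{w}".encode()).digest()[:8], "big")
        # for each "hash function" i, keep the minimum hash value over all words
        return [min(h(i, w) for w in words) for i in range(num_hashes)]

    def estimated_jaccard(sig_a, sig_b):
        # the fraction of matching signature positions estimates the Jaccard index
        return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

The point is that two short fixed-size signatures stand in for the full word sets, so the expensive pairwise comparison only has to run on candidates that already look similar.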
The last thing I want to talk about is sketching distributions. We learned about mean and variance estimates, but sometimes you don't just want a mean and variance estimate. If you're talking about inequality and someone tells you the mean income in a given community, that's pretty much meaningless: you want the median income, because the mean is going to be skewed by outliers. So if we want to know the median, or the 90th or 99th percentile, or if we have distributions that might have similar parameters, like the uniform distribution, whose cumulative distribution looks like this, versus a normal distribution, which looks more like this, we really want a way to say: what does the cumulative distribution of the data I've seen look like? There is a way to calculate these that's incremental, scalable, and parallel, and my colleague Eric Robinson is going to be talking about it later this afternoon in this room, at 4:30. The technique is called the t-digest; it's very cool, and Eric has a bunch of cool applications for it too.

Like I mentioned at the beginning of the talk, my daily work involves machine learning, and you may be thinking, what does this have to do with machine learning? Well, a lot of these techniques are actually useful in the preliminary stages of the machine learning pipeline: if you want to understand your data, if you want to visualize it, if you're doing feature engineering, these are techniques that should be in your toolbox. But I also want to encourage everyone to take a broader view of machine learning. What is a machine learning model, if not a compact, fixed-size summary of data that can support queries and predictions and tell you something useful about something you may or may not have seen? By that definition, what we've been talking about today is machine learning. And these structures are also things you can use as building blocks for techniques you might think of as more traditionally machine learning.

Here's what we've looked at. The first thing we saw was a technique for calculating mean and variance estimates for a stream of observations, and we introduced the ideas of incremental, parallel, and scalable algorithms. The next technique was the bloom filter, which is totally cool because it introduces the concept of hashing to make these structures scalable; it's a very simple technique that does very useful things and uses very little memory to do so. The bloom filter is generalized by the count-min sketch, which tracks event counts rather than set membership. We saw how the count-min sketch is scalable, incremental, and parallel just like the bloom filter, and that we can also use it to support top-k queries, trending topics, and approximating the size of joins between different subsets of our data. The last thing we saw was HyperLogLog, a structure that exploits the fact that hashed values are uniformly distributed to support counting the number of distinct elements in very large streams; you can really get precise counts of hundreds of billions of distinct elements with only a few kilobytes of counters, which is also really cool.

So I hope you've learned a few new tools in today's talk and are inspired to tackle some new problems. Here's how you can keep in touch with me: I'm willb on Twitter and GitHub, you can reach me via email at willb at redhat dot com, and I have a weblog that I update occasionally at chapeau.freevariable.com. If you're interested in developing modern applications that use scalable data processing or machine learning, you should also check out radanalytics.io, which is a Red Hat-supported community for building intelligent applications on OpenShift. And as I promised, there will be code: if you scan this QR code you'll go to an interactive notebook that doesn't need anything installed; it will just let you start running and experimenting with these techniques in your browser, covering everything we've talked about in the talk except the t-digest. So it's great to be at DevConf in the US.
Thank you so much for your time and attention in this extra-long session, and I'll take any questions people have now. Yes?

Q: For the count-min sketch, is there a bound on how much error there is in your estimate versus the actual value?

A: Yeah, I don't have the formula for that in the slides, but let's take it offline and we'll find it. Sure.

Q: A quick question on the document example you were showing: would the bloom filter contain enough slots for all the dictionary words? Is that the size of the filter?

A: In the hyphenation example, the bloom filter contains just the set of things for which you need to consult the dictionary. If you can say, "I have a heuristic that hyphenates words correctly 90 percent of the time, and I have a big dictionary that has the complicated special-case rules for the other 10 percent," then the bloom filter is basically an index for the dictionary. You look up a given word in the bloom filter; if the word is not in the bloom filter, you use the heuristic, because that's fine, and if the word is in the bloom filter, then you hit the disk and load the rules to see whether the word is actually in the dictionary or not.

Q: In that case, I was wondering if you'd heard of any cases where people are using that last algorithm you discussed, the text plagiarism one, to detect bad literature?

A: Well, it's just similarity; it's not making value judgments. But this is a fun problem, and we should talk offline, because I think there are probably some other NLP techniques you could use to find bad literature.

Q: I was just wondering, how does MinHash perform compared to something like a word2vec model for document search?

A: It's sort of apples and oranges. What word2vec does is train a vector representation so that the vectors have semantic meaning for the contexts words are used in: two words that map to similar vectors will be used in similar contexts in a document. There are also extensions to word2vec that you can use with sentences or paragraphs or whole documents, so you get a distributed vector representation of documents as well and can ask how similar two things are, or cluster in that space. I don't know of any work comparing MinHash to word2vec or doc2vec, but it's sort of a different problem: MinHash is a simpler technique that would be easier to set up, and it doesn't depend on building up a corpus and pre-training a model either, so it's something you could do in a single pass.

Q: My main concern was that if you're plagiarizing something, you're probably not going to just change a few words; you're going to take synonyms of the words, right? So MinHash would probably not be able to catch that, or would it?

A: It depends on where you set the threshold. If someone has totally rewritten something, MinHash won't detect that they've just stolen the idea; the thing you're looking for is whether they've actually taken the prose and substantially reused it. In the set-of-words model that might not be ideal.
You might want to use the set of substrings instead, but depending on where you set the threshold, and given that there are companies that use MinHash to provide plagiarism detection services, it works well enough, I guess.

Q: Thanks.

A: Sure. I have a question myself. You said that the bloom filter is a fixed size; how do you decide the size of the bloom filter?

Let me pull that up here. I said I was only going to have one slide with math on it, but I think it doesn't count if I show it again, right? Really, it's just optimizing a trade-off between how much space you're going to use and what false positive rate you can tolerate. For the dictionary-on-disk example, that was in the early 70s, so the gulf between accessing something in memory and accessing something on disk was not as dramatic as it is now. But if you ask, "how many false positives can I tolerate for the expensive operation?", you can analytically model how expensive it is to hit the disk and how many spurious hits you can take before the performance of your application becomes unacceptable, and then use that in conjunction with this formula to pick a filter size that is likely to give you less than that false positive rate.
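As a hedged illustration of that sizing trade-off, these are the textbook bloom filter sizing formulas, which may or may not be exactly what was on the slide: for n expected items and a target false positive rate p, the number of bits is m = -n ln p / (ln 2)^2 and the number of hash functions is k = (m/n) ln 2.

    import math

    def bloom_parameters(expected_items, target_false_positive_rate):
        n, p = expected_items, target_false_positive_rate
        m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))   # bits
        k = max(1, round((m / n) * math.log(2)))                # hash functions
        return m, k

    # e.g. bloom_parameters(500_000, 0.01) gives roughly 4.8 million bits
    # (about 600 KB) and 7 hash functions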
Q: So earlier you had a very big cache problem, but later, after using the bloom filter and multiple hashing algorithms, don't you still have a big cache? You're still storing a lot of data.

A: In the web-caching situation, the Akamai case, you're maintaining a bloom filter in parallel to the cache. Does that answer your question? The bloom filter itself is not a cache.

Q: Okay, but is the bloom filter itself stored in the cache?

A: It's in memory on the servers that are serving the cache.

Any more questions? Thank you, Will. Thank you so much.

Talk real loud. You want to start? Okay. So thanks for joining me today. My name is Michael McEwen, I work for Red Hat, and I'm going to be talking about building Apache Spark cloud services in Python. To start with, let's talk about what a cloud service is, or at the very least what I mean by a cloud service and what I want to talk about today. I'm relying a lot on definitions that have been put out by the Cloud Native Computing Foundation, an organization that acts as a governance body helping to govern projects like Kubernetes and Prometheus. What they say is that a cloud-native application should be something that's containerized: we're talking about building for containers and deploying to Kubernetes, so this is obviously a strong part of the story. They also say that a cloud-native application or service should be dynamically orchestrated, which means you can use the container with a container platform to migrate between instances; it should be generically useful in those situations. And they also say it should be microservice-oriented, and this is a really nebulous topic. It's a little more philosophical or ideological, but the way I take it is to look back to the Unix philosophy of old: a microservice is an application that does one thing and does it well. So really you're talking about building purpose-built applications that you're going to put into a container and deploy to the cloud. Here's the link to their FAQ; they've got some great language about what these platforms mean and what these applications are, so I highly recommend checking that out.

Now, since we're talking about cloud native, the platform I'm using is OpenShift, which is Kubernetes. How many people here are familiar with Kubernetes or OpenShift? Okay, a pretty good audience. This is the generic diagram I'm sure many people have seen, with your container network and everything, but what I'm most interested in is this part over here, because I want to use this as a developer, and I'm interested in how it connects to my source code repositories and how I can get automated build features. All of this other part is really interesting from an infrastructure point of view, but this part is what I want to talk about today.

Uh-oh, looks like we've got an image that's missing here; I'm going to switch browsers quickly because apparently I have an issue. Sorry about this. So what I want to talk about is Apache Spark. How many people here are familiar with Apache Spark? Okay, again a good number of people. This diagram might look familiar to you; it's a general outline of what an Apache Spark application looks like. We have what we call the driver process, which is where all your user code lives, and then there are a number of executors that help perform the distributed work, and underneath all this is a cluster manager that controls how these pieces interact with each other. I'll go through some of this a little quickly, because it seems like a lot of you already know it, but the fundamental abstraction at the base of Spark is something called the resilient distributed dataset, which we refer to as the RDD. This is the primary, base-level abstraction for all the data, and it is a partitioned, lazy, and immutable homogeneous collection. So what does that mean for the resiliency and the distributed nature of these things? Well, they're partitioned, meaning that when the data comes in, Spark makes partitions in your dataset, which makes it easy to distribute that data. Second, they're lazy, meaning that any calculations that need to be performed on those datasets don't happen until they absolutely need to return a result. And finally, they're immutable, meaning that you can't change the dataset; you can make a new one, but you can't modify the one that's being distributed. These pieces together help to build resiliency into the system: if one piece falls out, it's very easy to recalculate that partition and send it back to do work, and this really helps Spark be a stable platform for these kinds of calculations.

So what does this look like in action? Let's say you have an array of numbers and you want to do a calculation to say how many of these numbers are even.
The first thing I would do is tell Spark to parallelize this dataset, which means it's going to distribute these things and create the dataset for me; you can see this is our RDD with each number in its own partition. Then I say: perform this filter operation on my dataset. This gives me a new dataset, and in this case I'm just looking for any number with modulo 2 equal to 0, meaning it's even; I want those in my new dataset, so I end up with 2 and 4, and then I count it. Now I know how many even numbers are in my dataset. This is a really simplified look at the operations I might do, but this is how these RDDs are built up and how the work might be distributed.

When we take OpenShift and the model of Spark, this is what it might start to look like at the infrastructure layer. We have our nodes, the physical nodes where the kubelets live, and inside those nodes we have our container pods. You can see that over here we have a Python application, and maybe it's got some Spark containers, maybe these are the executors and this is our master; maybe we have more applications, a MongoDB database. So you can start to see how we can use the platform to distribute the different processes that are occurring for us.

So what is a Spark application? This is the way I generally reduce these things: you have source data that comes in, you perform some sort of processing on it, and you return results. Now, source data and results can be very nebulous in many ways: source data might be coming from a database, it might be coming from a file on a file system, it might be coming from a stream, it might even be coming from an API call where data is pushed in, and likewise results could mean the same things, so some of these are abstractions you'll have to deal with. What we're looking at here is a very simple Python Spark application, and it's going to do what we just looked at earlier: it takes in a dataset, parallelizes that data, runs our little evens-counting function on it, counts it, and returns that.

If I run this on my desktop, this is maybe what it would look like: I have the spark-submit command, which is a tool that comes with Spark, and I tell it to take my application, in this case saying "for all the numbers from zero to a thousand, tell me how many of those are even." It's a very simple process; it starts to run, the JVM goes, and it spits out a bunch of logs. Now, as an application, this is probably not that useful to me, because what just happened? What do I have as output, how many numbers were even? Well, you can see way up here it spat out a little print that said: all right, 500 of those are even.
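Here is a minimal, runnable sketch of that little evens application, assuming a local PySpark install; the application name is arbitrary and not from the demo.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "count-evens")
    numbers = sc.parallelize(range(1000))            # distribute the data as an RDD
    evens = numbers.filter(lambda n: n % 2 == 0)     # lazily derive a new RDD of evens
    print(evens.count())                             # action: triggers the work, prints 500

You could run this either directly with a local Python interpreter or through spark-submit; either way, nothing is computed until the count() action forces the work.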
Okay, so how do I use this in a cloud situation? How do I go from using this locally to taking it to the cloud? Again, I start to think about how I'm going to design my application in a way where it can become a microservice. I go back to this pattern of ingesting data, processing data, and then publishing it. Ingest and publish can be very nebulous, and they'll depend on the systems you're designing for: you may have a database that you read from to ingest data, and publishing might mean sending a call to an API that another service exposes. Likewise, that could be inverted: you might have a service that calls your microservice on an API and publishes data to a database. These will change depending on what you're doing, but in general you'll be operating on this type of model.

As you think about building these applications, you need to consider the structural needs of what you're building. Depending on what cloud you're using, it may be Kubernetes, it may be Mesos, maybe something else. You have to know: how will I deploy my application? How will I command that application from the outside, and how will I control it? What tools does the platform provide for that? And likewise, where is my input and output data going to come from and go to? These will be dictated by the systems you're working in; there's really no good way to cover all of them generically except to say that you're going to have to consider these pieces.

So what I'd like to talk about is three common architectures that I've come across, and I have a feeling that many of you will come across these as well as you build applications: on-demand batch processing, continuous batch processing, and stream processing. First, on-demand batch processing. Generally what this means is that I have some event that occurs which kicks off processing, and then I take some data in that I want to process and I create results. This happens every time it's triggered, and the trigger could be a cron job, it could be something that comes in on an API call, maybe a user visits a website and that produces the action; but this is one pattern you're going to see. When might you want to use this pattern? In general, these are the top things that came up for me as I was thinking about it: when you have non-deterministic request windows.
Think about a user who visits a website like Amazon. They're going to be clicking through products, just looking for something they might like to buy, and every time they click on a product there's going to be some sort of rating returned, some number of stars, maybe "I suggest this product for you" or "I don't suggest this product for you." This is the type of area where an on-demand call may be what you want: you may not have these things pre-calculated, you may have some model that contains this data, but you'll need to filter it at the time the user actually hits it, and you won't necessarily know ahead of time that the user is going to do that. Likewise, when you have quick results to calculate, something that can be returned very quickly: if you don't know when these things are going to happen, or you have a user visiting a website, they're not going to want to wait a minute for some long processing cycle before results come back, so this is another situation where you might think about this technique. And then likewise when you have a lot of situational dependencies. Again, think of the example of a user visiting a website: if your processing depends on a user entering information before the request can be performed, that's a situational dependency, something you can't pre-evaluate, or that's very difficult to pre-evaluate, and that's another scenario where on-demand might be what you want.

So what does this look like? I'm using an example here that I call the "hello world" of Spark, because it's been in the Spark code base forever; I looked just the other day at something like Spark 0.1 alpha and this was, I think, one of the only examples that existed in there. It's a Monte Carlo method for estimating pi: you're talking about throwing darts at a dartboard, counting how many land inside a circle versus how many land outside, and that ratio approximates pi for you. This is just a function that does that, and you can see that in this part here we're doing a similar type of operation to what we did before: we're parallelizing a range of numbers, in this case the number of random points I'd like to use to calculate pi, then we're mapping a function onto it and reducing that, and this is the technique for returning pi. So now I've got a function that can give me that answer for pi every time I call it, and what I might do is embed it into an HTTP server. If you're familiar with the framework called Flask for Python, an HTTP framework, this is what it might look like if I embed this function into a response. Now I've got an HTTP service, a REST-based service, that can basically take a request and give back pi whenever I need it. This is just one way to look at it; you might use gRPC, you might use some other kind of remote procedure call mechanism, but this is just one way to look at it.
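Here is a sketch of that "hello world of Spark" estimator wrapped in a Flask endpoint, along the lines just described. It is not the exact demo code; the route, app name, and sample count are assumptions.

    import random
    from flask import Flask
    from pyspark.sql import SparkSession

    app = Flask(__name__)
    spark = SparkSession.builder.appName("spark-pi").getOrCreate()

    def estimate_pi(samples=100_000):
        # Monte Carlo: fraction of random points that land inside the unit circle
        def inside(_):
            x, y = random.random(), random.random()
            return 1 if x * x + y * y <= 1.0 else 0
        hits = spark.sparkContext.parallelize(range(samples)) \
                    .map(inside) \
                    .reduce(lambda a, b: a + b)
        return 4.0 * hits / samples

    @app.route("/pi")
    def pi():
        # every request triggers a fresh on-demand Spark job
        return str(estimate_pi())

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)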
So let's move on from that and talk about continuous batch processing. I look at this as a scenario where you have data sitting in some sort of data source, your processing is always occurring, there's no need to trigger it with an event because you want this process to always be running, and it is always producing results to wherever your store is. When might you use this pattern? If you have data that updates very frequently: let's say you have a database of users who are always giving ratings, and you want to be creating recommendations off those user ratings continually, because that information is always arriving; there's probably never a time I don't want that to be running. One way you might use this is when you're creating machine learning models for evaluation. If we think about a recommendation engine where users across all your products, like Amazon for example, are always rating what they like and what they don't like, you want to continuously be creating models so you can evaluate how those models perform against what you have in production: as users add more data, are my models getting better? Maybe they're getting fresher; in a recommendation system you probably always want the model to be fresh, because users keep adding information. Another situation you might use it in is what I'm calling lifecycle management processes. A lot of the work that gets done in distributed computing is data engineering: transforming one schema into another, or taking one format and turning it into another. This is something you might just want continually running; say you've got users putting in data non-stop, maybe text data, and you want to make sure no one is using words that are on your ban list or something, so you want this always running, always producing results.

Now, what might this look like? This is a piece of code from a recommendation engine that one of my colleagues, who's actually sitting in the back there, helped to write. It's part of a model generation service, and what it does is this: you can see at the top there's a while True loop, and it does a database selection here to pull the ratings out of our database, and then there's a bunch of calculation going on here. What I really want to highlight is this part: we take the new ratings from our database, that is, all the ratings added since the last time I created a model, I create an RDD out of that, and then I do some processing where I create a model from it. So this is just always running, and if it sees changes to the source data, it creates a new model and puts that model into my database, where I'm storing these models for later usage.
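Here is a bare-bones version of that continuous-batch loop: poll the source, retrain when there is new data, store the model, repeat. The database helpers, the training step, and the polling interval are all placeholders, not the actual recommendation-engine code.

    import time

    def continuous_model_builder(db, spark, interval_seconds=60):
        last_seen = None
        while True:
            new_ratings = db.ratings_since(last_seen)        # hypothetical data-access helper
            if new_ratings:
                rdd = spark.sparkContext.parallelize(new_ratings)
                model = train_model(rdd)                     # hypothetical training step
                db.store_model(model)                        # hypothetical persistence helper
                last_seen = max(r.timestamp for r in new_ratings)
            time.sleep(interval_seconds)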
The last pattern I want to talk about is the stream processing pattern, and this is something I find very intriguing; I think it's a really cool way to work with data. This is where you have a stream of information that's always occurring, so think about Kafka or AMQ or those kinds of message-bus applications: there's data always coming in, and my processing, a little differently from the continuous batch, reacts every time data comes in on that bus, within some sort of window, and then those results are stored somewhere else. You would probably want to use this in a situation where you have real-time event processing. Think about an IoT situation where you have sensors on, say, public transportation: you want to follow where all the buses in your public transportation system are going, whether any of them are running late, those kinds of things, so you have a continuous stream of information coming in and you always want to be updating on what's happening. Another is when you're working with systems that are built on a broadcast messaging system. I like to think about the Fedora message bus; how many Fedora users do we have in the audience here? Okay, a couple. Since it's a large community system, they have the Fedora message bus, a federated message bus system with messages coming in from build systems, maybe for all the different packages being created, messages coming in from mailing lists, all sorts of information being aggregated. In that situation I would obviously want to build against the message bus, because that's the architecture of the system I'm working on.

Another level to this is what people are calling Kappa-style architectures. This is a way to look at stream processing where you have a stream as your input data, you have processing happening, and your output always goes to another stream. So you're creating these message-bus scenarios where one topic might be your clean data, and then you have a process that runs, maybe changes the schema or pulls out some specific information, and relays it to another stream; this lets you build up really complex hierarchies of applications that don't necessarily need to depend on each other, only on the message bus.

Spark has a really cool API called the Structured Streaming API, and this is roughly what it looks like. There are multiple ways to do stream processing in Spark, but I think Structured Streaming is really interesting. What you do is build up this set of instructions: you tell Spark where the information will come from, which up here is a Kafka broker, and then you tell it what to do. In this case it's real simple: it takes every value that comes in on the stream and casts it to a string, and then all those strings get grouped by their value and counted. So what's happening here is that this application does a word count on a stream: it looks at every string that comes in, groups equal strings together, and counts them up. Imagine you had a stream of words going by; this is just going to count them. Then at the bottom you can see I'm telling it where to put the output of the stream, and then I tell it to start. In this case I'm just storing it in Spark's in-memory storage, and I'm giving it a query name so that I can query it. So at this point it's doing a bunch of work, but how do I get the results out of it?
What I might have is a function that lets me look at Spark's internal in-memory SQL representation and run a query on it. In this case, the query I'm running asks for the top 10 entries of all the things it has counted; I want to know the 10 most frequent things that are happening, and I can call this function on demand when I want to get information out of it. That brings me to the next way you could do this, and this is an example of what the Kappa-style architecture would look like: you have a stream coming in, you have processing happening, and then you have a stream it writes out to. If we look at a similar function to the one we just saw, the top part is all the same, it's doing the same work, but at the bottom, where it writes the output of the stream, I've now told it to put the results onto another Kafka topic. What this means is that I don't have to worry about having a routine that pulls the information out with an SQL query; I could have another microservice that just listens on that second topic, and it would automatically receive all those counts, and I could do whatever I needed to there: aggregate them, or do something different at that point.
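Putting those pieces together, here is a hedged sketch of what the Structured Streaming pipeline and the two output variants just described might look like in PySpark. The broker address, topic names, and checkpoint path are placeholders, not the demo's values.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stream-counts").getOrCreate()

    # read from Kafka and cast each message value to a string
    values = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "kafka:9092")
              .option("subscribe", "words")
              .load()
              .selectExpr("CAST(value AS STRING) AS value"))

    counts = values.groupBy("value").count()     # running word count over the stream

    # variant 1: keep the running counts in Spark's in-memory sink, query on demand
    memory_query = (counts.writeStream
                    .outputMode("complete")
                    .format("memory")
                    .queryName("counts")
                    .start())

    def top_ten():
        return spark.sql(
            "SELECT value, `count` FROM counts ORDER BY `count` DESC LIMIT 10").collect()

    # variant 2 (Kappa-style): relay the counts to another Kafka topic instead
    kafka_query = (counts.selectExpr("CAST(value AS STRING) AS key",
                                     "CAST(`count` AS STRING) AS value")
                   .writeStream
                   .format("kafka")
                   .option("kafka.bootstrap.servers", "kafka:9092")
                   .option("topic", "word-counts")
                   .option("checkpointLocation", "/tmp/word-counts-checkpoint")
                   .outputMode("complete")
                   .start())

In the second variant, any other microservice that subscribes to the word-counts topic picks up the running results without ever touching this application directly.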
Okay, so we've talked about some different patterns you might use. How do I take these from that desktop example and bring them into the cloud? I want to take it from source code, turn it into a container, and then push it into my orchestration platform, and at the same time I need a way for the user, or for myself, to still get in and out of it. The group I work with at Red Hat has a community project called radanalytics.io, and we've created some tooling, a project we call Oshinko, that lets us use the source-to-image workflows in OpenShift to say: I'm going to take my source code from a git repository, use source-to-image to build it into an image that can run, and then when it gets deployed, a Spark cluster will go with it and be bound to my application's lifecycle. In this way I don't even have to manage Spark anymore; I can use the workflow I'm used to in OpenShift, going straight from my code, pushing changes, maybe through a CI testing framework, and when it's successful it gets deployed onto OpenShift and a Spark cluster appears and is bound to it.

So, assuming nothing else goes wrong, I'm going to try to demonstrate this quickly. I've got a small GitHub repository here, and this is a tutorial you can find on radanalytics.io: it's a web microservice that's going to compute that Spark pi estimate for us, an HTTP service I can query on demand to get a Spark calculation. My repository is pretty simple: I've got a README, I've got my app file, which is not overly long, just a little Flask application, and then I've got the requirements file like any Python application might have.

What we're looking at here is OpenShift. This is my project, and I've already taken the liberty of loading in the radanalytics templates I'm going to use. I select from my project the template I'd like to use; I'd like to launch Apache Spark Python, so I select that template and click through the description. Now it's asking me for a bunch of information: maybe I'll call my service spark-pi, and what it wants is the URL of my GitHub repository, so I'll just copy that from here. There are a bunch of other options I could use, for instance if I wanted to build from a branch, or there was a subdirectory I was building from; these options help you control how the application gets deployed, but my application is written in a very simple manner so I don't need to fill in most of them. Likewise, at the bottom, I could adjust how the Spark cluster gets deployed and change the options that go to it. So I click Create, and what we see now is that it's being built on OpenShift; if I look at the logs, you can see this is a standard Python build process, and then it pushes the image to the internal registry. Now my pod is up and running, but what's also happening is that the Spark cluster is being deployed with my pod, automatically bound to my application, and it's been given a sort of random name. The last thing I need to do is expose a route to it so that I can reach it.

Now, if I click on this, hopefully it'll work. Okay, so I hit the root endpoint and it tells me the Python Flask spark-pi server is running, and that I need to add an extra path to get more information. If I add that extra route, you see there's this wait going on, which goes back to the quick-results side of this: all I can see is a little spinner, but eventually it comes back and gives me a really bad estimate of pi. So don't go to the moon with this or anything, but it's fun to play with. And something I want to point out here, which I talked about before: now that my application is linked to this git repository, you can see I've got this thing called a build here. I don't have my webhook set up, but if I did, I could push a change directly to my git repository, the webhook would hit OpenShift, and this would rebuild automatically for me. In this case I could hit Start Build: if I made a change to the repository, I could hit Start Build and it would run, deploy again, and attach to the Spark cluster again. As a developer this is really nice, because I can very easily test my changes out, and I can do this even in a private project.

So let me switch back here. Okay. You saw that when I made that web request to do the work, it took a little while to come back, and this is one of the problems you're going to run into when designing these types of services: the synchronicity issue. I make a request for pi, and now that service is off doing something.
And if I tell it to use too large a data set, this could take minutes to come back, and you don't want that result arriving later, with the user having just walked away from the terminal. So to mitigate this in our designs, what we like to do is start to separate the API concerns from the actual processing concerns. A very common way to look at this is: in the main process, the API, whenever I make a request for a new pi estimate, maybe instead of giving me back the pi estimate right away, it gives me back an ID, and then I can use that ID to query for the results of what's happening. Or perhaps the application on the other end uses a WebSocket: I make the request, and the main processing loop can push the information back to me when it's ready, and in the meantime my application can display some message saying work is happening. Depending on what this API is, you'll have different ways to mitigate this, but in general what you want to start doing with microservices is pulling apart these concerns, to make it easy to deal with the other ends of them and to address issues like this synchronicity issue.

So what might this look like in Python? This is the main process: we've got some code here, we set up queues for doing the inter-process communication, and we start off another process and set everything going. This is pretty compact and I don't want to go into every line of it, but what I want you to take away is the top line, which says "import multiprocessing". The Python multiprocessing package is really powerful, and I would say that if you're going to start doing these types of things, read the docs on that package, because the primitives in there are very easy to use. And if this is the main-process side, this is maybe what our processing loop looks like. It's coming from a service that responds to incoming requests for recommendations: you have a user who wants to get a recommendation, and this will do it for you. It's a big piece of code, and I've taken pieces out of it, but the main things to look at are these areas that are probably hard to see because they're in red and there's a lot of light washing it out: these request and response primitives are the queues I use to communicate back with the main process. Whatever I'm doing here, I'm using these primitives, and again, they come from the multiprocessing library; I really recommend checking it out.
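Here is a bare-bones version of that request-ID pattern using the multiprocessing primitives just mentioned: the API process hands work to a background process over queues and can poll for results by ID. All of the names are illustrative, and estimate_pi stands in for whatever Spark-backed function does the real work.

    import multiprocessing as mp
    import uuid

    def worker(requests, responses):
        while True:
            job_id, samples = requests.get()      # block until the API hands us work
            result = estimate_pi(samples)         # hypothetical Spark-backed function
            responses.put((job_id, result))       # main process collects this by ID

    def submit(requests, samples=100_000):
        job_id = str(uuid.uuid4())
        requests.put((job_id, samples))
        return job_id                             # return immediately; the client polls later

    if __name__ == "__main__":
        requests, responses = mp.Queue(), mp.Queue()
        mp.Process(target=worker, args=(requests, responses), daemon=True).start()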
Another big thing you're going to run into, especially if you're doing Python programming for Spark, is dependency management. Right now Spark has some really good features for JVM languages that need to distribute dependencies: with the spark-submit command there's an option called --packages, and you can give it Maven coordinates and it will pull all those packages in and send them out to the entire cluster so your applications can use them. With Python, the tooling is a little behind the times. So let's say we've got this application, and the filter, let's say we're building another service to tell us how many evens exist in the data set, our classic hard problem. At the bottom here is where I'm actually doing the work: I'm telling it to parallelize the data, filter it, and count it, and this filter-evens function is doing some sort of database action. What this means is that the filter-evens function is going to be distributed to the Spark cluster, and so each executor is going to need PyMongo in place, because what happens is something like this: my main application wants to talk to Mongo, but now I've distributed code that also talks to Mongo, so that Mongo library needs to be on every one of the executor nodes. This is a situation where you might have to manage those dependencies yourself, until the Spark community catches up with this and gives us better tooling for doing it.
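As one hedged example of managing this yourself, the SparkContext.addPyFile API will ship a .py file or a zipped package from the driver to every executor; note that this only helps with pure-Python code, and dependencies with compiled extensions (like a database driver) still generally need to be installed on the executor images themselves.

    from pyspark import SparkContext

    sc = SparkContext("local[*]", "deps-example")
    sc.addPyFile("my_helpers.zip")   # placeholder path; the archive is shipped to every executor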
So let's recap a little bit. We talked about this design pattern that I really like to use, the ingest, publish, process pattern; that's the general pattern I like to get into. We talked about some different types of architecture patterns you might get into when you're designing your applications: the on-demand batch, the continuous batch, and the stream processing. And we also talked about the Oshinko source-to-image project.

This QR code up here is a link to this slide deck; please download it. It's open source, it's just a reveal.js project, and you're welcome to use what's there. Here's my email and my blog, and please check out radanalytics.io: we've got a bunch of tutorials there, and lots of the material you've seen here is on that site. So at this point I'll take any questions, and thank you for your time.

Any questions? Barma, don't hit me up too hard here. Okay.

[Audience] You're talking about the first pattern, where the client calls a REST service and the response comes back immediately, and the client continuously polls; and then the second function you showed us, the one using the multiprocessing package. Does the client need to stay connected to the same invocation of that request for the processing to happen, or can there be a separate status function, independent of the first one, that goes back and gets the data directly from Spark?

So, the first example I showed is a very simple one: you're hitting it, it's trying to make the request, and if you double up on requests at that point you're going to break it, because that application is really simple. But once we start to separate the API, the first way I think about it for a REST application is: I make a request to the REST server to start doing work, and the server gives me back an ID, the ID of the work that's being done. My processing loop, the second process, may have a queue of work requests that have come in, and it knows about the IDs. As it does each bit of work it can take that ID and update its status, and the main process can always see what the statuses are for those IDs. So my client makes the first request to do work, gets an ID back, and then just keeps querying that ID, and the main process can tell me: no, it's not ready yet; not ready yet; okay, now it's ready, here are the results. That's really what I mean about separating the concerns. Your architecture will dictate how you want that processing loop to queue up information; maybe you have several processing loops, maybe the work gets distributed in a better way. But what you want to do is hide those concerns from what the user is seeing, so they can keep querying what's happening without overloading the Spark work that's being done. Does that kind of make sense? Okay, cool.

Any other questions?

[Audience] Can you tell us more about the Oshinko source-to-image builder process?

Sure. What you saw in the demonstration was me exercising the Oshinko source-to-image tooling. The tooling first pulls the source code from your Git repository, then it creates a container inside OpenShift that does the build process; that's the log I showed, it was pulling the code and starting to build it. Once it builds the code successfully, it deploys the built container to OpenShift. The Oshinko tooling is actually inside your built container, and what it does is ask: did you request to use a Spark cluster that already exists? Because you can do that; it's one of the options I went by. If you didn't request that, it will spawn a Spark cluster for you, with whatever configuration you specified. Then your application runs, and when it exits, the Oshinko tooling catches that exit and deletes the cluster that goes with it, and your application may go away too. That's a general look at the steps that happen in that deployment. It also associates a service with it and exposes ports, the same behavior you'd expect from the other source-to-image tooling, whether you're using JavaScript or Java or Python or something like that. Okay.

[Audience] So you described Oshinko a whole bunch without mentioning CI/CD, continuous integration or deployment, or anything like that. (Can you speak up a little bit?) You described Oshinko without mentioning CI/CD, but the way you describe it, and I don't know that much about it, it's a very similar workflow. We do something similar in a Jenkins stack that uses Helm charts to deploy things to Kubernetes clusters. Why didn't you work that into a more classic CI/CD platform? How is it better?
So the reason I didn't work it in is that I'm not really diving into the pipeline features that exist within OpenShift; what you've seen here is the lowest level of application development. The next level, if I were taking this into a more production situation: first I would put testing in between, so when my application gets checked in to Git there's a test running, and if it gets rejected the commit doesn't get merged and doesn't kick off a new build. Another way to look at it is that I could use the pipeline functionality that exists within OpenShift to create a pipeline that says: first check the code out and build it here, then run the tests here in OpenShift, and if all that works, deploy it, maybe to another project.

Part of the reason I didn't get into it today is that those primitives exist at a higher level, but once you've created your applications in this manner they become very easy to mix and match. If you're using Jenkins with Helm and Kubernetes, you're doing something very similar. What OpenShift adds is a UX around building the pipeline for you, so if you're not a Jenkins expert, or even not familiar with Jenkins at all, the pipeline tooling lets you specify those things in a language that's a little easier than diving into a Jenkins configuration or a Travis configuration. So it really depends on what you're doing with your application design, but it exists at a higher level than just creating the applications. And the Oshinko tooling helps, because when we're automating these things we don't need to automate the Spark cluster creation; we just let the tooling take care of it for us. When it runs the tests it spawns a cluster dynamically, runs the full integration tests, reports on whether they succeeded, and then deletes the Spark cluster after it's all done, without me having to build that into my Jenkinsfile or something like that.

I think we're running low on time, so if you've got more questions come bug me afterwards; I'll be around for questions after. Thank you.

Thank you, Michael.

And you can see that he's curious: he's wondering why we would even care about sketching, which is a good question. We like to work with sketches because they can represent our data in structures that are very, very much smaller than the actual data, and because they're smaller they're also frequently much, much faster to operate on. But even with both of those efficiencies, they still preserve all the essential features of the original data.

Oops. I actually was going to do a quick vote: how many of you have worked with data sketching previously? Oh, hardly anybody is raising their hand.
I'm going to show you that all you people who did not raise your hands are actually being too hard on yourselves. If you've ever taken the mean or variance of some numbers, you've done data sketching: you've preserved the mean and the variance of the data you were actually working with. If you've ever clustered data, the centers of your clusters are a sketch of your raw data. And if you've trained a machine learning model: any learning model is actually a compression of the training data you fed it, so learning models are all data sketches too. Data sketching is all around us, and we do it all the time even if we might not realize it.

The particular sketch I'm going to talk about today is called the t-digest, and it was introduced in this paper here, "Computing Extremely Accurate Quantiles Using t-Digests" by Ted Dunning and Otmar Ertl. It now has implementations in most popular languages, which you can see here, including Scala, and there's a library of integrations for Spark and PySpark which I'll be demoing later in the talk.

So what is it exactly that a t-digest is sketching? You give it a series of numbers, just raw numeric data, and it takes that input and maintains an estimate of the cumulative distribution function. What is that? It's been a while since you've had statistics. Over here on the left you can see a density, which represents the distribution of the data coming out of your system. If you take a look at that value x down in the lower left, everything to the left of x has a certain mass in the distribution, and if you take that mass, that area, and plot it, like over on the right, what you're plotting is the cumulative distribution. As x moves further and further, more and more of the distribution is to the left of x, and eventually you've seen basically all of your distribution and the cumulative distribution maxes out at one, having seen all the data.

At this point you might be wondering: it's fine to have a cumulative distribution, and if I've sketched that, great, but does that actually allow me to say anything at all about the real distribution, the non-cumulative one? The good news is that if you have one of these cumulative distributions, or a t-digest sketch, you can take its first derivative and get back the original distribution. Anybody who's an enormous calculus geek like me will recognize this as the fundamental theorem of calculus, which you might have been exposed to in college; if you haven't, it doesn't really matter. The main point is to rest assured that if you have a t-digest sketch, you can get back the original distribution whenever you want to.
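(For reference, the relationship between the density and the cumulative distribution that the speaker describes, written out; this notation is not on the slides.)

```latex
% The density f and the cumulative distribution F that the t-digest estimates:
F(x) \;=\; P(X \le x) \;=\; \int_{-\infty}^{x} f(t)\,dt,
\qquad
f(x) \;=\; \frac{d}{dx}\,F(x).
% F is non-decreasing, with F(-\infty)=0 and F(+\infty)=1; recovering f from F
% is exactly the fundamental theorem of calculus mentioned above.
```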
So the t-digest has a lot of excellent properties which are common to a lot of data sketching structures. If you have a t-digest and you add another data point into it, you just get an updated digest; it has not increased in size, so its size remains constant as you feed it data. That's a great property, because if you have a very large data source throwing data points at you, either too large to work with conveniently in one chunk or possibly simply not bounded because it's streaming data, this updatability lets you maintain a constant-size running sketch of your data, no matter how much of it there is or how long you keep sketching.

So what can we do with this? Suppose I'm running some kind of REST service, we have users making queries against it, and we want to keep track of our query latencies, so we've kept a sketch of their cumulative distribution. We can do visualization: what does this distribution actually look like, what's its shape? We can ask questions about quantiles: are 90 percent of these latencies under a second or not? So you can answer service-level-agreement kinds of questions with it. Another property, which is less appreciated, is that you can take these sketches and use them to simulate more data: if you want to simulate data that looks like the data you've seen before, these sketches can do that too.

And there's even more payoff. Back when dinosaurs ruled the earth and I was young, data was simple: it lived in a single file, you did operations on it, and that was easy to reason about. Today, of course, data frequently resides across multiple machines, and the machines may be running in clusters like Kubernetes; here you can see I've diagrammed an artisanal, small-batch Kubernetes cluster running on Apple IIs. If your data is spread across multiple machines you can still obviously do operations on the individual data partitions, but that opens the question: well, what now? I'll be talking more about that shortly.

So let me return to the t-digest sketch and talk about how it actually represents the approximation of a cumulative distribution. All it really does is maintain a bunch of clusters, and a cluster is nothing but a location and a mass; the mass is just however much of the data it has seen that landed on that cluster. Clusters with large mass correspond to areas where the cumulative distribution has a steep slope. At a high level it's really just that simple. Clusters are of course updated as data comes in, and the update logic is also very simple: you have a new piece of data, with a value and a mass of its own (usually the mass is one); you find the nearest cluster, add the mass, and update the location. In a sense, that's all there is to it. A fun fact: if the clusters could not update their locations, what you'd really have here would be essentially a histogram.

Clusters also have quantiles: if you have a cluster at some location, its quantile is really nothing but the sum of the mass to the left of it over the sum of all the masses. So it's real simple.
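(A toy illustration of the update step just described: find the nearest cluster, fold the new point into it with a weighted centroid, and start a new cluster when the nearest one is "full". This is not the real t-digest algorithm; the fixed MAX_MASS cap is a crude stand-in for the q(1-q)-shaped bound discussed next.)

```python
# Toy cluster update: clusters are [location, mass] pairs kept in value order.
MAX_MASS = 2.0
clusters = []

def update(clusters, x, w=1.0):
    nearest = min(clusters, key=lambda c: abs(c[0] - x), default=None)
    if nearest is None or nearest[1] + w > MAX_MASS:
        clusters.append([x, w])             # bound exceeded: start a new cluster
        clusters.sort(key=lambda c: c[0])   # keep spatial order
        return
    loc, mass = nearest
    nearest[0] = (loc * mass + x * w) / (mass + w)   # shift the centroid
    nearest[1] = mass + w                             # absorb the mass

def quantile_of(clusters, cluster):
    """Quantile of a cluster: mass to its left over total mass, as described above."""
    total = sum(m for _, m in clusters)
    left = sum(m for loc, m in clusters if loc < cluster[0])
    return left / total

for value in [0.1, 0.12, 0.5, 0.55, 0.9]:
    update(clusters, value)
print(clusters)                         # a handful of [location, mass] clusters
print(quantile_of(clusters, clusters[-1]))
```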
One bit of added complexity is that clusters are only allowed to grow so large: they have bounds on their mass, and the function that bounds their mass is this little expression here in the center. Like my colleague Will, I try to avoid math in my slides, but I'll be taking that equation apart and showing you that it's actually not that hard either.

The bounds force new clusters. You have some new data coming in, and it would like to merge with the cluster it found closest to it, but that cluster is close to its bound: I'm sorry, I can't take all of you, because I would blow past my allowed bound on my mass. So what happens? It takes as much of the mass as it can, up to its bound, and the remainder is used to create a new cluster.

Going back to the equation: the scariest-looking part is really just this expression, q times one minus q, and its only purpose is to be small at the endpoints of your data and thick in the middle. As you get closer to quantile zero or quantile one, which are basically the minimum and the maximum of your data, this function is small. The reason you like that is that it forces you to have more small clusters at the endpoints of your data, and if you've played with distributions you know these are usually your distribution tails, which is where you ask a lot of your interesting questions about your data. So it's essentially nothing but a heuristic that gives you better resolution where you're asking the most questions.

The other term, big M, is nothing but the total mass: all the data you've seen so far. And recalling that this is a sketch, if we didn't allow the bounds to gradually increase as the amount of mass increases, it would start forcing us to have more and more clusters, and our sketch would start to grow linearly with the data, which is of course not a sketch at all. The nice thing is that the more data we see, the higher this bound curve gets, so our clusters grow larger instead of growing more numerous.

The last term is one that you, the user, actually get to play with: the compression factor. It works a little like the mass term, except it never changes. The idea is that a small compression value keeps the bound curve low and forces more clusters, so you're allocating more clusters for higher resolution at the expense of keeping around a little more size in your sketch; if you increase it, there can be fewer, larger clusters, so you're basically increasing the compression rate of your sketch.

T-digests are also mergeable. If you have two sketches, like this orange one and this green one up here, you can take their clusters, union them together, and merge them into a final result, and that final result is a new sketch representing the data union of what the first two had seen. Why does that matter? Flash back to the slide with data residing across multiple partitions: you can take each partition and sketch it, like you see here in the middle, and then merge all of those into a final result. That lets you not only operate on data that has been partitioned across multiple machines, but do it with scale-out parallelism: you can sketch each partition in parallel, for arbitrary speedup, and sketch the entire dataset as if it were a single file.
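(A runnable toy of that partition-sketch-and-merge pattern. ToyDigest is not a real t-digest; it just reuses the crude update from the earlier snippet and adds the update/merge shape you need, and in practice you'd swap in a real t-digest library.)

```python
import random
from pyspark import SparkContext

class ToyDigest(object):
    def __init__(self, clusters=None):
        self.clusters = clusters or []          # [location, mass] pairs in value order
    def update(self, x, w=1.0):
        near = min(self.clusters, key=lambda c: abs(c[0] - x), default=None)
        if near is None or near[1] + w > 10.0:  # crude fixed bound, not the real one
            self.clusters.append([x, w])
            self.clusters.sort(key=lambda c: c[0])
        else:
            near[0] = (near[0] * near[1] + x * w) / (near[1] + w)
            near[1] += w
        return self
    def merge(self, other):
        merged = ToyDigest([list(c) for c in self.clusters])
        for loc, mass in sorted(other.clusters):   # merge in spatial order
            merged.update(loc, mass)
        return merged

sc = SparkContext(appName="sketch-and-merge")
latencies = sc.parallelize([random.gammavariate(2.0, 0.25) for _ in range(10000)], 8)

# Each partition folds its points into its own sketch; Spark then merges the
# per-partition sketches into one result, and those steps can run in parallel.
global_sketch = latencies.aggregate(
    ToyDigest(),
    lambda td, x: td.update(x),     # seqOp: add one point to a partition's sketch
    lambda a, b: a.merge(b))        # combOp: merge two partition sketches
print(len(global_sketch.clusters))
```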
There are different ways you can do that merging. You can take those clusters and merge them in randomized order; the very early drafts of the t-digest paper basically proposed this as the merge algorithm. I implemented it early on and discovered something a little distressing: I was looking at the error between my sketch and the truth (I was running tests, so I actually knew what the truth was), and as I merged more and more of these clusters, the result diverged further and further from the real distribution. That's not very good behavior for a sketch. I went back to Ted Dunning and engaged him on this, and he said, oh yeah, we actually have a different algorithm now: what they did was order the clusters in spatial order and merge them that way. Meanwhile, not knowing they had done this, I had created yet a third method of merging, where I just took the clusters from largest to smallest and merged in that order. If you plot all of these, you can see the random order is, as I showed before, bad, and either of the other two, going large to small or doing the spatial merge order, are great: they converge closer and closer to the true distribution the more of them you merge, which is exactly the kind of behavior you'd like. The main moral of the story is that if you're designing these sketches, how you define your operations does matter, and it's worthy of experimentation.

There are also some interesting algorithmic considerations if you're trying to implement these things. First, you have to maintain the clusters in order of the values they represent. You have to be able to query the nearest cluster for incoming data, to insert and update clusters, and to compute cluster quantiles; another term for that, which you might see in the algorithm books, is maintaining prefix sums. Another constraint I sort of imposed on myself, because I'm a functional programming geek, was that I wanted these operations to be immutable, so that if you merge two clusters you get the merged result but you haven't lost the original two: you're not losing the arguments to your functions. And not only that, it all has to be fast, so operations are ideally sub-linear; everything I just described can be done in log time or better. When I turned the crank on all that, I came up with essentially a balanced immutable tree. Here I'm showing clusters being stored in spatial order using a balanced tree (it's hard to draw), maintaining the spatial order along with the querying and the prefix sums. If you're interested in that kind of thing, follow the link on the right: there's a blog post where I unpack the last couple of slides and talk about the actual data structures I created. Otherwise, feel content to just use the tools.

Lastly, there's a trick you can do with cumulative distribution functions called inverse transform sampling, which is a fancy-sounding term but is actually pretty easy. You start by randomly sampling from the uniform distribution on zero to one, like that orange point up there, and then you find the point x on your cumulative distribution such that its cumulative value is the q you just sampled; that's the inversion part. That value x represents a sample from the actual distribution that you've sketched.
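(A small sketch of that inverse-transform trick, using numpy and an empirical CDF built from raw samples as a stand-in for the t-digest's inverse CDF; the gamma-distributed "latencies" are made up.)

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.gamma(shape=2.0, scale=0.25, size=10_000)   # pretend these are query latencies

# Empirical CDF: sorted values vs. their cumulative probabilities
xs = np.sort(data)
qs = np.arange(1, len(xs) + 1) / len(xs)

def sample_synthetic(n):
    u = rng.uniform(0.0, 1.0, size=n)     # sample q uniformly on (0, 1)
    return np.interp(u, qs, xs)           # invert the CDF: find x with F(x) = q

print(sample_synthetic(5))               # synthetic points shaped like the original data
```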
So essentially these sketches allow you to do generative synthetic sampling of data which has the same distribution as your original data: you can take your raw data, sketch it, and then invert that process, taking your sketch and sampling data that looks like the original.

Now I'm going to do a brief demo of sketching distributions. Of course, to make this work you have to import all your tooling, so I'm importing my PySpark functions, my t-digest package, and some plotting so I can show results. I'm going to use that query-latency idea from earlier in the talk, so I'll just simulate some query latencies. Here are the first few values; they look like floating point numbers, so that's what I want. I can look at the shape of this using Seaborn's distribution plotting, just to see what I've generated: the data has a peak somewhere to the left of 0.5 seconds and tails off pretty rapidly from there, and if I plot the cumulative distribution it looks like what you see on the right.

Doing the merging and the scale-out parallelism in Spark is done using a tool called aggregators, and I created a bunch of these aggregators so you can do this with t-digests on data in Spark easily. So I'll start by creating one of those for my data. It seems to be making the little hamsters in my laptop work surprisingly hard, but you can see that it returns a data set with a single column and a single row, and in that row is the actual t-digest sketch of the data I just showed you. I have to get that out of the aggregator, so I'll do that, and you can see its type is a Python t-digest type.

So now I have a t-digest; let's see what we can do with it. The first thing, of course, is visualization: I can look at what this thing is actually saying about the shape of my data. The blue line is what the sketch thought about the shape of my data, plotted against the actual raw cumulative distribution, and you can see the fidelity is really excellent; it's tracking surprisingly well. What this means is that if you didn't have the truth like I do here, if you just had the blue line, you'd still have some faith that these things perform very well as a high-fidelity sketch.

And as I talked about earlier, you can use these to answer questions about quantiles. The first row here is the median, the 0.5 quantile, which is apparently about 0.35 seconds, and my 90th-percentile latency is 0.95 seconds. So if my service level agreement is that 90 percent of my queries have to be under a second, I can tell from this sketch that I've met that requirement; and if I start missing it, I can see that too as I watch the sketch evolve and keep asking questions.

I'll also show some synthesized data. Here I synthesize 10,000 points; they still look like floating point numbers, which is important. If I plot that, here on the far left is the original raw data and in the center is the data I just synthesized, and if I overlay them you can see it actually does a good job reproducing the shape of the original data.
So they're really an excellent tool for synthesizing data, if you want to work from sketches of the raw data.

Last, I wanted to show something interesting. As I said, if you take the first derivative of one of these things, you get back the original density. So here's a little function that estimates the derivative, delta y over delta x. Does that work? Yes, it does. Here I plot that against the original distribution, and what you can see, chiefly, is that my estimated distribution from the sketch is kind of noisy; it's not as smooth as you'd like it to be. This has to do with estimating the density between clusters: the shape of the cumulative distribution, as you saw, was excellent, but if you start taking its gradient it starts to look a little noisy. This is an area I've been researching, I'm close to being able to provide a solution, and I'll probably publish a blog post at some point.

Looks like I have to go back into present mode. Anyway, if you find any of these topics interesting, the packages I used to demo all these things are available via Maven Central; this first link goes to the actual GitHub repositories and links to the packages. The next two links are talks I gave at a couple of previous Spark Summits that cover different applications of t-digests, so if you're interested in applications aside from the ones I just showed you, go check those out. If you didn't see it a couple of slots ago, my friend Will Benton gave the talk on probabilistic structures for scalable computing, which is essentially four or five more super useful data sketches. And lastly, there's a link here at the bottom to the demo notebook I just showed you, and if you're into instant gratification, you can follow this link to get a copy of the deck I just presented.

So: data sketching is actually useful, and there are things you already do that you didn't know were sketches. We talked about what a t-digest sketch is, which is of course a sketch of the cumulative distribution function, and what those actually are; we learned about how a t-digest actually represents the estimate of these distributions; and we demoed the applications. So that's my talk, and if you have any questions, I'd be happy to take them.

This platform is really useful for us, to take this platform and... I'm sorry, I'm lost; I'm sorry, I'm nervous. So, the reason to be here, for us, is to use this platform of developers and open source, so that we can capture you, attract you into quantum computing, so that we can have more contributors in this field. It's really just to make you aware of this field.

So: quantum computing. Let's start with Moore's law over here. Before I go anywhere, I just need to know how many of you are already aware of quantum computing. How many of you have heard about it, or know about it? Would anyone like to have some discussion on it, any interaction? No? Okay. Well, I'll start off with Moore's law. Based on that, we use transistors in classical computing.
Okay. So a transistor is just a switch: when it's on it represents the value one, and when it's off it represents the value zero. Well, it's not exactly on and off; it's a high voltage of five volts for the value one and a low voltage of zero volts for the value zero. And as per Moore's law, every two years our computational power should double. That's what was happening ever since, I would say, the 1950s, but by around 2013 we started hitting a limit.

So what happens: you have a processor, and inside that processor you have billions of transistors, each one a kind of switch. When you turn it on, it allows the electrons to pass through the gate and you have the value one; if you stop them, it's the value zero. By now, Intel's eighth-generation processors have 14-nanometer transistors. That's very small, smaller than the cells you have in your body; it's so small that fewer than a hundred atoms would fit across it. So we've reached processors with 14-nanometer transistors, but we're hitting a limit on how small we can make a transistor: when they try to make transistors smaller than about five nanometers, they hit a quantum mechanics phenomenon called quantum tunneling. We'll come back to that, but before we do, I'll show you how a transistor looks.

These two diagrams are examples of two recent generations of transistors in your processors: the left one is a MOSFET and the right one is a FinFET, which are just different technologies used to make those transistors. The left one is a 28-nanometer transistor and the right one is a 14-nanometer transistor. In a transistor it's simple: electrons move from a source toward the drain, and between them there is a gate that works like a switch, on and off. That's all it is. The gate has something called a gate dielectric, which helps stop the electrons from passing between the source and the drain. And how small is that gate? In a 14-nanometer transistor it's fewer than 50 atoms wide; you can only fit at most about 50 atoms across it. That makes it pretty difficult to stop electrons from moving from the source to the drain, so they apply the gate dielectric as the solution; you have phosphorus and boron in there, and the boron works there as the gate dielectric. It also helps stop something called quantum tunneling.
So what is quantum tunneling? It's a phenomenon of quantum mechanics where electrons can just disappear from one side of a barrier and appear on the other side. Okay, so this is a 14-nanometer transistor. But when they try to go to an even smaller scale, to fit in more and more transistors (currently our processors fit, I believe, more than two billion transistors in the iPad you're carrying right now, in about an inch of processor or even less), and the transistor goes below about five nanometers, the electrons don't stop at the gate: they just tunnel themselves to the other side, whether there's a barrier or not, because of quantum tunneling. The electrons disappear from one side and reappear on the other, which defeats the purpose of a transistor: you no longer have reliable zeros and ones, and thus you no longer have a working transistor.

So, quantum computing: what is it? I was talking about quantum mechanics, and I won't be able to go too deep into the theory of it, but what I can tell you is that in quantum computing we make use of quantum mechanical phenomena, superposition and entanglement, to get a different kind of computing. The problem above was quantum tunneling, but here, if you want to process the values zero and one, it doesn't have to be done with a transistor; it can be done using superposition and quantum entanglement.

So how do you process the information, how do you get zeros and ones? While in classical computing you have classical bits, in quantum computing you have qubits, quantum bits. A qubit can carry two classical bits of information while transmitting one qubit (there's a small typo on the slide there, a stray comma after "transmitting"). So in one qubit you can have two values, while a classical bit is either one or zero, on or off like a switch. A qubit can have one, zero, or both at the same time, but only until you observe it.

There's some weird science behind quantum mechanics, so spooky and weird that even Einstein debated it with other scientists and called it "spooky action at a distance." The phenomenon was that entangled qubits appeared to break the laws of relativity: as per Einstein, nothing can be faster than the speed of light, but when you have two entangled qubits (entanglement meaning you generate two photons, or two particles, together and entangle them), they will share information with each other no matter how far apart they are. They can be galaxies away, light years apart, and the correlation still shows up immediately in the other particle. So, to represent a qubit, essentially any particle can be used. It can be an atom,
it can be subatomic particles like electrons or photons, or even a nucleus. Each of them has different properties: a nucleus is more reliable, it can stay coherent for a longer time, but the simplest approach right now is using electrons as qubits. So what's in an electron? An electron has a spin; it can be spin up or spin down, and that can be manipulated with superconducting magnets. Scientists are building these qubits using electrons and manipulating their values, zero or one. Earlier we had transistors: turned off it represents the value zero, turned on it represents the value one. With qubits, they try to read out the spin. Electrons are tiny magnetic dipoles, so if it's spinning up it's a one and if it's spinning down it's a zero.

Okay, so let's go further. Here we'll try to explain where we are now with quantum computing, what the universities are doing with it, and what the future applications are. For that, Shadr will continue.

Hello. So what I'll be speaking about is where we are right now. Quantum computing has given us lots of promises, that it will be able to do this and that, so I'll try to explain where we are right now, where we'll probably go in maybe three or four years, and what solutions it will enable that are just not possible using classical computing with normal bits, zeroes and ones.

Some of the highlights: Harvard and MIT have teamed up and built a 51-qubit quantum computer, which is really impressive, and a whole lot of other universities are doing research on quantum computing. The University of Maryland trapped five qubits in a kind of pentagon-shaped arrangement and used them to implement different quantum gates. The University of Sussex is also trying to do a modular design; it's not quite similar, but it's also a quantum computer. And there are different universities actually trying to build a quantum computer in a silicon chip. The quantum computer the Harvard and MIT folks built is a pretty big machine: you basically have to pull the environment down to a temperature near absolute zero to maintain the superposition and the states of the atoms. But some universities are trying to make it in a transistor chip itself.

And what are the corporations doing right now? NASA and Google teamed up on a 512-qubit quantum computer, and that's huge, right?
But it's not quite what it was supposed to be: it cannot do things like Shor's algorithm or Grover's algorithm (we'll come to what those are and how they're helpful). It's not really a general-purpose quantum computer; it's more of a kind of simulation. Still, Google was able to use that machine to do some search optimization. Now Google is coming up with its own chip, called Bristlecone; it's a 72-qubit system, and what they're trying to achieve with it is quantum supremacy. You might have heard that term if you've watched some YouTube videos or some sci-fi movies: quantum supremacy is the ability to solve some mathematical problem that is just not possible using classical computers. Google promises to do that; they already have the 72-qubit system, and they promise to reach quantum supremacy by this year.

IBM has a 50-qubit machine that is offline, but they also have a 20-qubit machine that is online: you can just sign up, register, and actually use a quantum computer for free. They're allowing you to run quantum algorithms on it, and more than 20 research papers per year are getting submitted by people developing new quantum algorithms using IBM's Q quantum computer. Microsoft is also developing something called a topological quantum computer; basically, instead of working in 3D they're trying to use two-dimensional particles, a kind of phase-in, phase-out thing, and if it's built correctly it can probably do more calculation and have a stable superposition.

So we've talked about superposition, the zeros and ones, but it's very difficult to keep a system in a superposition state. (The laptop has locked itself; while we unlock it, I'll discuss what superposition is.) Superposition is a quantum mechanical phenomenon where a bit has both values, zero and one, at the same time. How is that possible? Well, researchers aren't entirely sure how it holds both values at once; one assumption researchers describe is that the bit is taking a quantum leap, that there is effectively more than one universe existing in parallel for that one bit. There is Schrödinger's equation, a wave function, which describes how you find the actual value, zero or one. If you've heard about Schrödinger's cat, that was a thought experiment where a cat is inside a box with poison, and it would be dead or alive.
So the value is either true or false, zero or one, and they could both exist at the same time: one in this universe and the other in another universe, in that picture. As soon as you check, as soon as you try to observe the value of the qubit, you get only one. That's superposition: it has both values until you observe it, and once you observe it, it has only one of the values.

There's another phenomenon, quantum entanglement. Whenever you generate qubits they can be generated in pairs, and the phenomenon is that they're entangled with each other: if one qubit's electron is spinning up, the other will be spinning down, and at the same time both qubits hold both values, zero and one. Because of that, with n qubits you effectively have two to the power n values stored; while in classical computing a bit stores either zero or one, a qubit can hold two values at the same time. And since they're entangled, it doesn't matter which one you observe: if the value of this qubit is up, that is one, you immediately know the value of the other is zero.

But here's where it gets spooky. Scientists recently, I think it was in China, generated a pair of entangled qubits and sent one of them to another location via satellite. They then forced the value of the first qubit to one, and the value of the other qubit was zero. Now, both values are random; it could be zero, it could be one. But as soon as you observe one of them, or force it to a specific value, zero or one, before observing, the other one will be its exact opposite. So it's kind of spooky.
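(In standard notation, which the speakers don't write out, the superposition and the anti-correlated entangled pair they describe look like this.)

```latex
% One qubit in superposition; alpha and beta are complex amplitudes:
|\psi\rangle \;=\; \alpha\,|0\rangle + \beta\,|1\rangle,
\qquad |\alpha|^2 + |\beta|^2 = 1 .
% Measuring yields 0 with probability |\alpha|^2 and 1 with probability |\beta|^2.

% An entangled pair where one qubit is "up" exactly when the other is "down"
% (a Bell state): measuring one immediately fixes the outcome of the other.
|\Psi^{+}\rangle \;=\; \tfrac{1}{\sqrt{2}}\bigl(|01\rangle + |10\rangle\bigr).

% n qubits span 2^n basis states, which is where the "two to the power n" count comes from.
```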
Let's have Shadr continue. Yeah, sorry, the power went down on the laptop. So, I was talking about actual quantum computers that you can log in and use, like IBM Q. Microsoft is also developing a topological system, and they already have a quantum development kit you can use; I'll talk about it in a minute. One of the other larger players is Rigetti with Forest: they have a working 19-qubit computer, and you can use a 26-qubit machine in the cloud, but that one is a simulation, not the real one; the 19-qubit processor is the one they've actually built. There are other companies too, like Quantum Benchmark, a spin-off from research at the University of Waterloo, which provides benchmarking and error correction for quantum computers.

So, since we're at an open source conference, how is open source doing when it comes to quantum computing? You'd be happy to know that most of the code being written for quantum computing is actually open source. IBM has developed Qiskit, an open-source quantum computing framework you can use for doing research and writing quantum algorithms, and you can run those algorithms on IBM Q, their quantum computer; that's the actual machine they have. There's something called ProjectQ, where you can write high-level quantum algorithms; it's Python-based, so if you use pip you can just pip install it and download it, and the good thing is you can write a program in ProjectQ and compile it to run on either an actual quantum computer or a simulator, whatever you're using. So go ahead and try it out. Apart from that, Microsoft has the Quantum Development Kit, and their language is called Q#, because, yeah, why not; it supports Linux (this is the new Microsoft that loves Linux), and it's open source. Rigetti has developed its own API, called Forest, with its own language. All of this is open source: you can check the source code, you can run your own simulation software on your own laptop and simulate a handful of qubits. But I was talking about quantum supremacy: that is the point where we cannot actually simulate a quantum computer on classical computers anymore. Past around 50 qubits or so, you just cannot do it; there simply isn't enough computational resource in classical computers.
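(As a concrete taste of that open-source tooling, here's a minimal Qiskit sketch building the two-qubit entangling circuit discussed earlier: a Hadamard plus a CNOT gives a Bell state. Circuit construction looks like this in most Qiskit versions; how you execute it, on a simulator or on a real IBM Q backend, varies by version, so that part is omitted.)

```python
from qiskit import QuantumCircuit

qc = QuantumCircuit(2, 2)
qc.h(0)                     # put qubit 0 into superposition
qc.cx(0, 1)                 # entangle qubit 0 with qubit 1
qc.measure([0, 1], [0, 1])  # outcomes collapse to 00 or 11, each about half the time
print(qc.draw())
```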
So now we'll talk about some of the future applications of quantum computing: why it's so exciting and, for cryptography, why it's so terrifying. Madhur will start with how it can help machine learning.

Yes. So, as I was saying before about superposition, a qubit can hold two values at the same time, which in theory makes a quantum computer exponentially faster than your classical computers. So what's the catch? The thing is, for linear operations, classical computers are faster at each individual operation. For example, say you want to factor a number with hundreds of digits: it's never going to happen classically. You can factor a 30-digit number, sure, but a huge 400-digit number you're just not going to manage. So if you're trying to break a private key by factoring a 400-digit number, you can throw all the best classical resources at it and it will never be decrypted in time; it would take more than the age of the universe, because there are just too many steps to perform linearly. But in quantum computing, with enough qubits and because of superposition, the machine takes that quantum leap and works through those possibilities in parallel, in different universes as the researchers would put it, so it can come back very quickly. However, we have yet to reach quantum supremacy on that: it's still under research and quantum computers aren't able to do it yet. But as soon as they are, quantum computers will be able to handle complicated calculations, any research that involves a lot of variables, for example weather science, where you have too many variables to look at, calculate, and combine. All those areas will benefit from quantum computing.

So what are those areas? The first one is AI and machine learning: there is now a specific field called quantum machine learning, or quantum AI. There, they're using artificial intelligence that works on a feedback basis, learning through actions and through the feedback on whether an action was good or not; machine learning figures out which method is best, and quantum computing speeds that up. For example, it can be helpful for molecular design in the medical industry: when they want to simulate the molecular design of a medicine, quantum computing plus AI can help prepare a solution that has fewer side effects and works better. Other than that, as I said, weather forecasting. The equation is simple: wherever you have too many variables, wherever things are too complicated for classical computing, quantum computing can help, because of that quantum leap it can evaluate all the possibilities in parallel at the same time. It's not exactly like classical parallel computing, where you have multiple classical computers trying to solve it; it's different, and it will be exponentially faster than a classical computer at resolving those complicated problems.

One more application is cryptography. As Madhur was saying, all of our public-key cryptography depends on how difficult it is for a classical computer to find the factors of huge numbers. In 1994 a researcher wrote an algorithm called Shor's algorithm: it's a quantum algorithm, and it dramatically speeds up finding the factors of a huge number. Like I told you, if you try to find the factors of a 400-digit number classically, it would take longer than the time since the universe was born; it's just not possible using classical computers. But using Shor's algorithm on a quantum computer, you should be able to find them very quickly.
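(For scale, the standard textbook complexity figures for factoring an n-bit number; these numbers are not quoted in the talk, they're the usual estimates.)

```latex
\text{best known classical (general number field sieve):}\quad
\exp\!\Bigl(O\bigl(n^{1/3}(\log n)^{2/3}\bigr)\Bigr)
\qquad\text{vs.}\qquad
\text{Shor's algorithm:}\quad O\!\bigl(n^{3}\bigr)\ \text{(polynomial time)}.
```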
So that is one of the use cases: you could basically go and recover a private key. You already share your public key, and using that public key you could find the private key, and then all the security you have today is gone. All your email, everything, depends on public-key cryptography. So there you go: if you have a system of around a thousand qubits that works properly and can run Shor's algorithm, the security we use today is not going to work anymore. It has already been demonstrated on a small scale: using just a two-qubit quantum computer, they were able to run Shor's algorithm and find the factors of a two-digit number. That small, yes, for now, but it's going to change very soon.

Apart from that, if anyone here has Bitcoin (no? okay), it's going to affect that too, because Shor's algorithm can be modified to attack elliptic-curve cryptography. That's going to basically affect Bitcoin: you could download the blockchain, find the private keys, and spend all of it. It's not going to happen tomorrow, but it's definitely going to happen, and when it does, the cryptocurrency you hold is in danger.

Apart from that, what can quantum computing do better? Searches. For example, if you have a hundred numbers, on average you have to do about n over two lookups to find a specific one; that's the time complexity. But after Shor's algorithm, another quantum algorithm was developed, called Grover's algorithm, and it reduces the search time to roughly the square root of n. That's a huge speedup, and the applications are huge: databases, anything that involves searching, which is basically everything, so it's going to have a huge impact.

If you're using DES, for example: due to the size of the key, a brute-force attack takes on the order of two to the power 55 trials, which is an enormous amount of work classically. With a quantum computer running Grover's algorithm it comes down to roughly 185 million searches, and that's not a difficult thing to do, so you'd basically be able to decrypt DES. I'm not saying DES is secure today, but it's a real-life example of why quantum computing comes into the picture and how it can make the security measures we have obsolete.
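(The arithmetic behind that DES example, written out; the figures are approximate.)

```latex
\text{classical brute force:}\quad \approx 2^{55}\ \text{key trials on average},
\qquad
\text{Grover search:}\quad \approx \sqrt{2^{55}} \;\approx\; 1.9\times 10^{8}
\ \text{(roughly the ``185 million'' quoted in the talk)}.
```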
So the NSA has published a list of algorithms that are considered quantum-computing resistant, things that can be used in what they're calling the post-quantum era. What are the other solutions? There's something called Open Quantum Safe; I have the link in the references. They're developing libraries that you can just plug in to be quantum-safe. They're even trying to integrate with OpenSSL: everyone knows SSL and everyone uses SSL, right? It's still in the research phase, but when it's done you should be able to plug liboqs into OpenSSL and be quantum-computing safe.

What are the other implications? One of them is optimization problems: optimization problems are hard, as Madhur said, because there are just too many variables, and a quantum computer should be able to do them much faster. Another is molecular modeling. For example, say you're trying to figure out the exact configuration of an atom participating in a chemical reaction: you can simulate those things on classical computers, but only for simpler molecules, something like H2O; you're not able to simulate complex molecules using classical computing. With quantum computing you get a speedup and you can analyze things like chemical reactions. Google was actually able to simulate a single hydrogen molecule using an actual quantum computer, so there you go; it's just hydrogen, but we'll get there soon. The implications of molecular modeling are huge: it can help you make better fertilizer, better solar cells, better products generally; it's going to revolutionize the whole thing. Another area it's going to revolutionize is the pharmaceutical industry, where you're trying to make good medicines with fewer side effects: if you can simulate the whole molecule using quantum computing, that's going to revolutionize the medical industry as well. Another use would be particle physics; basically, why not: we'd be using quantum computing, which runs on quantum physics, to simulate particle physics. Right now that's very difficult on classical computers, and a quantum computer can help get better results.

So here's the thing: thirty minutes is not enough to explain or go through every single detail of this, but there are many references available. Our motive with this talk was to attract you to the quantum computing research field, where more contributions can be made using the open source tools that are available; these are the ones listed, and the references are here too. We tried to explain a little bit of the basics, but it's just not sufficient; it will require you to spend more time on it. Our purpose was simply to make you aware of this field and attract you to it, because there's huge potential: from AI to many other fields, there are just so many possibilities in it.
So here are the references, which you can follow, and if you want, have a look at the other slides. Just so you know, if you want to really try out a quantum computer for free, the first link is where you can go and use IBM's 20-qubit quantum computer at no cost; you can use Qiskit to map your qubits to gates and do all of that. I've also added one link just to show you how many startups and companies are interested in quantum computing: you'll find hundreds of them in that list, and they're doing great research in this field. And the last one is the one I told you about, the Open Quantum Safe project, which is developing the libraries to be quantum-safe; you can visit their website as well. If you have any questions?

[Audience] Is there anything available like a quantum prototyping board for IoT, internet of things devices? Is there any prototyping board available?

No, none that I know of. But you can go to IBM's website; they have lots of research papers on different things, though I've not come across anything related to IoT. Thanks.

[Audience] Is a quantum computer simply a lot more powerful than traditional computing, or are there trade-offs where traditional computing might be more applicable?

So the thing is, quantum computers are not going to replace classical computers. They're just better at some types of calculations, ones where you have, let's say, too many variables. For example, say you want a website that helps a traveler: the traveler wants to go to Europe, that's one variable; he wants to go in the evening, that's the second variable; third variable, he prefers this or that flight, maybe Jet Airways or something; and you can keep adding more and more variables. With too many variables it becomes very difficult for a classical computer to calculate, and there will come a point where you're trying to calculate something like two to the power 54, or whatever the limit was. What Madhur is describing is the optimization problem, and quantum computing would be better at solving it due to the nature of the phenomena it uses, one of them being superposition. As Madhur said, a qubit can be zero and one in the same place, so it just exponentially increases your computational power: with 50 qubits it's like two to the power 50, and the more qubits you add, the exponentially faster it gets. But you're not going to use a quantum computer to watch videos or something like that; you'd use a classical computer for that. You'd use quantum computers in situations where you need, for example, lots and lots of steps to reach a calculation.
That's where you want quantum computers. But each individual step would actually be slower on a quantum computer than on your classical computer, so based on your workload, based on your application, you would use either a quantum computer or a classical computer. As a simple example, say you're trying to crack a private key. A classical computer is going to try all the combinations one at a time, a brute-force attack: it will try all the possible keys and try to decrypt. If you have a really, really large private key, your computer is never going to be able to do that, whereas a quantum computer will be able to do it in a fraction of the time, because it does the calculation across all its qubits at the same time, in parallel. So in scenarios where you have too many variables, where you would have to perform a linear calculation that simply isn't possible with classical computers, that's where it will be useful. In all the rest, quantum computers will not replace classical computers. It will be useful in weather science, it will be useful in AI, and in the rest of the fields we mentioned on that slide. Any other questions? Thank you, Madhur. Thank you.

So, for all of you sticking around, we have a party for you, starting at seven in the Ziskin Lounge. The Ziskin Lounge is where lunch was served, so you definitely know that place. See you at the party!