My name is Chris, I work for Twitter. This is a very long presentation, but I'll just show you the second half of it, because that's the interesting part. I'm assuming everyone in the room knows what Bayesian optimization is. Very good, so I don't have to explain it, and I'm going to skip a lot of slides here. You know what Twitter is: we run on microservices, and we have a lot of them. This talk is about how we're using Graal to run our mostly Scala-written services, and how by using Graal we save a lot of CPU.

What we also have, I'm not sure if that will come up... yeah, we have something called autotune. It's basically a framework that uses Bayesian optimization as the machine learning part to tune JVM parameters. We pass parameters in to autotune, we say "tune this parameter for me," and autotune talks to the Bayesian optimization part, which is Whetlab. There's also an open-source version of it called Spearmint, if anyone has ever heard of it. Spearmint or Whetlab figures out the next value of the parameter to try, to explore the space and then find the optimal configuration. So autotune is the driver that runs these experiments. People know what Graal is, so I'll skip that; you can watch this on YouTube if you want.

These are the parameters I tuned. There's one called TrivialInliningSize; by default it's 10, and if the compiler graph of an inlinee is smaller than 10 nodes, Graal just inlines it without looking at any other data. Then there's MaximumInliningSize, which is the other end: if the graph is bigger than 300, it doesn't inline it. And then there's something called SmallCompiledLowLevelGraphSize, which is similar to the second one but for the low-level graph, so don't worry too much about what it does. These three parameters are the ones that affect inlining the most.

I did some previous work; I have another talk where I explained how cool Graal is and how much CPU we're saving. These are two slides from that talk, so look at blue and orange, basically this one and the one down here. Just by using Graal instead of C2 to run our stuff, we can reduce parallel GC cycles by about 4.2 percent. When I ran this a year ago, I also manually tried to change the three parameters you just saw, to figure out if we could get better performance out of it, and by sitting down for an afternoon and trying this for two or three hours I could reduce them by another 1.5 percent. I did the same for CPU time: by just running Graal we can reduce CPU utilization by 13 percent, which is a lot and saves us a lot of money, and I could squeeze out another two percent by manually fiddling around with these parameters. But you don't want to do this manually for every service; you want a machine learning framework to do it for you, right? That's exactly what this is.

This is the configuration; it's a JSON file that you pass into autotune. It's basically the parameter, and then you tell it the range over which it should explore the space. You don't really have to specify that; I've run experiments after this where I just said one to one thousand, because it doesn't matter, the framework will figure out what the right parameters are. I just used a range here because I wanted it to work for the talk, but I'll probably rerun the experiments, set it to one to one thousand, and let it figure it out itself.
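The JSON file itself isn't reproduced in this transcript, so here is a minimal sketch of the shape being described, written as Scala data just to keep all examples in one language. The encoding and field names are hypothetical; only the Graal option names, their defaults (10, 300, 300), and the "one to one thousand" range come from the talk.

```scala
// Hypothetical Scala rendering of the search space the JSON configuration describes:
// each Graal option autotune should explore, and the range to explore over.
object SearchSpaceSketch {
  final case class ParamRange(name: String, min: Int, max: Int, default: Int)

  val searchSpace: Seq[ParamRange] = Seq(
    ParamRange("TrivialInliningSize",            min = 1, max = 1000, default = 10),
    ParamRange("MaximumInliningSize",            min = 1, max = 1000, default = 300),
    ParamRange("SmallCompiledLowLevelGraphSize", min = 1, max = 1000, default = 300)
  )
}
```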
The test setup: I have dedicated machines, and there's nothing else running on them, because crosstalk is a big issue when I do these performance evaluations. All instances receive the exact same requests. That's important: it's not the same number of requests, it's the exact same requests, because a tweet could be one character or 280 characters long, and that would affect memory allocation a lot and change the outcome. We're running with this version of Graal and the default tiered setup, C1 and Graal, so there's nothing we change here.

The first experiment is the tweet service. I have two experiments; I'm not sure if I can show you the second one too, because of time. The tweet service is basically reading and writing tweets. It's built on Finagle, an open-source framework that we developed, and you can get it on GitHub if you want. It's "an extensible RPC system for the JVM, used to construct high-concurrency servers," blah blah blah, I have no idea what it is, but the most important part is that it's 92% written in Scala. Graal can handle Scala very well, because Scala allocates a lot of temporary objects, and Graal's inlining and escape analysis are just better than what C2 has. That's why we can reduce the memory allocation rate, reduce GC cycles, reduce CPU utilization, and so on.

You have to pass in an objective; in this case it's user CPU time, and since the autotune framework looks for a maximum, we invert it to find the configuration that uses the least CPU. Then you can specify some constraints. We run on Aurora on Mesos, and there's a thing where you get throttled because you're using too much CPU, and then it kills you. That's basically our constraint, because I noticed when I was tuning manually that sometimes you specify values where the service doesn't even come up, because they're just too wild. So we have to put in the constraint so that we know when we went too far.

This is 24 hours of running the experiment. One evaluation is only 30 minutes long; I picked 30 minutes because it's long enough for the tweet service to actually reach a steady state, and I wanted to have a lot of evaluations so that we can see how autotune really works. As you can see, this is just requests per second, and it's the same for the two instances. This is user CPU time; the experiment is blue, and the control, which doesn't change, is orange. If the blue line is below the orange one, that means we see an improvement; if it's above, it's worse. I have the same graph in a different representation here that's a little easier to see: every time it's below, it's better, and when it's above, it's worse.

The result when this was done looks like this. It's a web page; it shows you all the experiments, and this one's the best one. You see the objective: it's 1.0838, which means we could improve CPU utilization by over eight percent. And these are the parameters; remember, I said 10 to 25 in the configuration. This one was 10 by default, this was 300, and this was 300. So if you use these parameters, you get 8% less CPU.
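To make the experiment loop concrete, here is a hedged sketch of what a driver like autotune does with that objective and constraint. Everything here is hypothetical, since the real autotune is an internal Twitter tool whose API isn't shown in the talk: the `Optimizer` trait stands in for the Whetlab/Spearmint side, `runEvaluation` stands in for one 30-minute evaluation with the suggested Graal option values (the `-Dgraal.<Name>=<value>` spelling is the usual way Graal options are passed, though the talk doesn't show the exact flags), and dividing the control's CPU by the experiment's CPU is one plausible reading of "we invert it" so that maximizing means using less CPU.

```scala
// Hypothetical sketch of an autotune-style driver loop; none of these names are real APIs.
object DriverSketch {

  // Stand-in for the Whetlab/Spearmint service: propose the next point, learn from results.
  trait Optimizer {
    def suggest(): Map[String, Int]                     // e.g. Map("TrivialInliningSize" -> 42, ...)
    def report(point: Map[String, Int], objective: Double, violatedConstraint: Boolean): Unit
  }

  // Stand-in for one 30-minute evaluation: start the service with the suggested values,
  // let it reach steady state, and return measured user CPU seconds, or None if the
  // instance was throttled and killed, i.e. the constraint from the talk was violated.
  def runEvaluation(point: Map[String, Int]): Option[Double] = {
    val flags = point.map { case (name, value) => s"-Dgraal.$name=$value" }.toSeq
    sys.error(s"placeholder: launch the service with $flags and collect user CPU time")
  }

  // The optimizer maximizes, so report an inverted objective: larger means less CPU
  // than the fixed control instance (1.0838 would read as "over eight percent better").
  def run(optimizer: Optimizer, controlCpuSeconds: Double, evaluations: Int = 40): Unit =
    for (_ <- 1 to evaluations) {
      val point = optimizer.suggest()
      runEvaluation(point) match {
        case Some(cpu) => optimizer.report(point, controlCpuSeconds / cpu, violatedConstraint = false)
        case None      => optimizer.report(point, 0.0, violatedConstraint = true)
      }
    }
}
```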
The bottom of the table looks like this: we have three that violated the constraint, we have one that was still in progress when I shut down the experiment, and as you can see there are some that were worse, like this one's almost 5% worse and this one about 3% worse.

These are charts of the three parameters. I'm showing them, but it's not perfect, because we're exploring a three-dimensional space, since we have three parameters; it would be n-dimensional with however many parameters you have. But it can give you a picture. Every data point in here also depends on the two other parameters, so keep that in mind. If you squint a little bit you can see that there's actually a trend going up: if you increase TrivialInliningSize you get a little faster, but at some point it's too much and it comes back down. This is MaximumInliningSize, and it's kind of flat. And for this one you don't really have to squint to see what's going on: with the default of 300 we would be in that area, and we can improve by that much if we increase the value. I didn't look at the time; how much time do I have left with all the stuff in the beginning? Okay.

What I did then to verify the result: I took the top parameters and ran a 24-hour experiment, basically a verification experiment. I just ran the tweet service for 24 hours with C2, with Graal, and then with Graal plus the autotune parameters in red. This is again PS Scavenge cycles, because the tweet service is using the parallel GC, and as you've seen earlier, in this particular run it was 3.4 percent fewer GCs. With the autotune parameters we could increase that by another 3.5, so in total we can reduce GC cycles by roughly 7%. The funny thing here is that autotune actually squeezed more out of it than Graal did by default.

This is basically the same graph as the one before; I'm just showing it because it's allocated bytes per tweet, and it's very flat over 24 hours. You see obviously the same improvement here: 3.4, 3.5, roughly 7%.

This is user CPU time. As you've seen in the beginning, that's about 12 percent-ish in that particular run, but it varies a little bit, and with autotune we can bring it down another 6.2 percent, which gets us to 18 percent less CPU. We have our own data centers, we own our own machines, but even in the cloud, if you can run your business with 18 percent fewer machines, that's a lot of money you don't have to spend. You also save electricity, you save on cooling, all that stuff, so we're actually trying to save the world here.

Then this is the p99 latency for tweets. You can see Graal is certainly better, but it's a little hard to tell how much. Autotune looks like this: certainly better, but again hard to tell by how much. So what I did was integrate those curves over the 24 hours, and that's the graph of that: we can reduce p99 latencies by 19% by just using Graal, and then another 8% by using the autotune parameters. So 28% means you get your tweet 28% faster, and you should tweet that.
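The "integrate over the 24 hours" step is worth making concrete, since it's how the hard-to-eyeball latency curves get turned into a single comparable number. The talk doesn't show how the integration was done, so this is just one straightforward way to do it over sampled data; the sample representation and the trapezoidal rule are assumptions.

```scala
// Hedged sketch: turn a sampled p99-latency curve into one number by integrating it
// over the measurement window; the ratio of two such integrals gives a relative
// improvement like the 19% / 28% figures quoted in the talk.
object LatencyIntegrationSketch {
  // samples: (timestampSeconds, p99Millis), assumed sorted by timestamp
  def integrate(samples: Seq[(Double, Double)]): Double =
    samples.sliding(2).collect { case Seq((t0, v0), (t1, v1)) =>
      (t1 - t0) * (v0 + v1) / 2.0        // trapezoid between consecutive samples
    }.sum

  def relativeImprovement(baseline: Seq[(Double, Double)],
                          candidate: Seq[(Double, Double)]): Double =
    1.0 - integrate(candidate) / integrate(baseline)   // 0.19 would mean 19% lower
}
```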
I think I have five minutes left, so I'll skip this one; it's basically the same thing with a different service. Let's just go through it. You've seen all of that, and I would say the same things: 7.6 percent, and these graphs, you've seen them before, they look similar. In this run it's 1.6 and 3.5, and that's interesting: just Graal reduces the GC cycles by only 1.6%, but then autotune gets another 3.5% out of it.

Yes, it did, it did. Here: 23398 and 646. I can't remember the ones from before, but yeah, they were different. Scroll, scroll, scroll. We've seen this: GC cycles, CPU time. That service is also built on top of Finagle, but it's certainly not as CPU-intensive, I'm not sure what to say, but we could only reduce CPU by 5.5% by just using Graal, compared to 12 for the tweet service, and with autotune we could reduce it by another 7.8%. I think the reason this is actually higher than just Graal is the same as with the GC graph: we can reduce GC cycles by more, which means we don't have to allocate as much memory and we don't have to GC it, and that reduces CPU. Okay, 13 percent.

Questions? Yes, there's the one question that everyone has. Correct, yeah, of course I did; I couldn't come up here and not have done that. So I picked these three; they're very similar to the ones Graal has. There's MaxInlineLevel, which Graal doesn't have: it's the depth of inlining at which C2 stops, 9, so if your call chain is more than nine methods deep it's just not inlining anymore. MaxInlineSize is 35, basically the same as the Graal parameter; the only difference is that Graal is looking at nodes in the compiler graph while this 35 is bytecode size. There's a funny story about this: if you have a switch statement, it's actually counted in that 35, so it's stupid, but we never fixed it. The JSON configuration is kind of the same, 5 to 20, because I wanted to see if a lower inlining level actually changes things.

The experiment has the same kind of outcome; same graphs, and that's the result. The best we could do is 5%, and I think that's kind of an outlier, because we see here 3.8, 3.5, 3.3; I think that's more the range, and if I ran the verification experiment, which I didn't do yet, we would see a rough 3.5 percent improvement that autotune can get out of C2. Nice, right? So autotune does a very good job, but compared to Graal it's just nothing, because this is the tweet service, and we had, what, an 18% improvement by using Graal and autotune, and the most we can get here is roughly, let's say because I'm nice, 4%. So no, it's not, it's not, because this is Scala and C2 is not tuned for Scala. That's an interesting chart, the MaxInlineLevel: you can see it goes up. 9 is the default, right? We're running here; it should probably be 17, to be honest. And then these graphs are flat, flat, so they don't change a lot.

Yeah, that was it. My summary is always very simple, and I always just ask people: please try Graal. As you saw, especially when you run Scala code, you should certainly try Graal. It can reduce the cost of whatever business you're running, and I want people to try it so that we find more bugs and can make Graal a better compiler. So if you try it, run your pet project, or, I don't know, go to work on Monday and put it in production; that would be cool, we do it. If you get a crash, file a bug, that would be nice. If something doesn't work as expected, or it's slower than C2, yeah, file a bug. If it's better, tweet about it and mention me; I would love to hear it. So that was it. Thank you very much. Any questions?

Yeah. So, did you run these experiments only for 24 hours?
Yes.

Okay, because if you look at the parameter space, with the parameters you showed it could be, if I'm right, over three million configurations.

Yeah, possibly. So the problem is that you didn't see the first half of the presentation.

Yeah, but I know what Bayesian optimization is.

Yes, and if you know what it is, then you know how it works.

Well, it's also sometimes very fragile, so not really. I mean, you saw... yeah, but there are a lot of other things out there, like how do you batch this, can you batch it? There's a post by Facebook where they use the same technique for their Hack compiler, I guess it was recently posted, and they show that quite some innovation is needed. So I was wondering, is it feasible for everyone besides Twitter, if you have production workloads, to try this?

Oh, absolutely. I did 40 iterations, and that's a very good size, I'd have to say. If you look at the results table, you'll see that at the top you've explored the space enough that you have a good result. We're not there yet, but the goal is to have this always on for every service, so that the services are tuning themselves automatically all the time, and then you can run 30-day experiments or something, right? You don't have to tune it every day. The code is changing, yes, everyone is deploying multiple times a week, but it's not that much; if you tune it once a month, that's still a hundred times more often than you would tune it manually, right? You only tune it manually when someone gets upset, and then they tune it. Yeah, I had this at Twitter: I got there and I asked, hey, when was the last time you tuned the parameters for the tweet service? They said three years ago. That's basically what happens. So we want this always on, and then we can run, I don't know, 30-day experiments, and run a day for one evaluation.

Is there any intention to make this autotune framework open source?

Yes, there is. The problem is the Whetlab Bayesian optimization part: that we probably can't open source, because we bought that company and, you know, it's complicated. But there's the Spearmint framework, which is open source, and autotune we wrote ourselves, our team, so yeah, we can open source it. It's just that at this point it's not very user friendly: I have to curl a JSON file to a URL, and you can only kill all the experiments, you cannot kill one of them. So it's, you know, "working properly." But yeah, we want to open source it. Microphone?

There exist about 1,000 -XX parameters. How many can you tune at once, and how do they interact with each other?

So, how much time do you have? Right, you can do all of them if you want. Autotune was written to tune GC parameters; that's why it's a few at a time, like three. But again, no, you can do as many as you want. Okay, yeah, the space will be a little bit bigger, but you can do as many as you want. Wrapping up? Okay, that was it. Ask me later. Thank you.