So welcome everyone, thank you so much for being here. This is not going to be a sales presentation about anything, okay, just to be clear. My name is Victor Rodriguez, I'm from Mexico, and I work at Intel. Unfortunately I'm not an OpenStack developer. I would like to be, but no, this is not going to be an OpenStack developer presentation. This is the point of view of how a compiler and OS developer tries to help improve the performance of the cloud. So what we are going to learn, what I'm trying to show in this presentation, is how the latest compiler technology can improve the performance of many of the applications that are running in the cloud. Okay, there is a lot of background history about how difficult it is to move from one compiler version to another in the operating systems that run on the servers we use for OpenStack applications. However, there are really good benefits in changing from one compiler to another despite the effort that you need to put in, so I will try to encourage you to move to the latest version of the compiler. So this is going to be the agenda. Unused resources: how many resources do we have, and how are we not using those resources in our servers?
AVX technology is one of the great examples. Function multi-versioning is a way to solve those kinds of problems, and profiling is another way to solve the problem of the unused resources that we have in our data centers. Unused resources: nobody wants to be there, okay? If you're in a traffic jam and there is a free lane that you can take, you might be wondering: why am I not taking that free lane if I'm stuck in traffic? If there is a free lane, I might go faster, or this traffic could be avoided. Why am I still here? Why does nobody change path and make things much more efficient, like in the picture that we can see? Unused resources are really typical in the server area that we have. Intel architecture technology has been there for years, and one of the greatest examples that we have is the Advanced Vector Extensions (AVX) technology. The history of computing power: this is a picture that I took from one of my favorite books, Hennessy and Patterson's Computer Architecture: A Quantitative Approach. The history of computing power was amazing in the '80s and '90s; it was impressive. We had a 52% increase in the performance of our data centers per year, okay, and that was amazing. What happened in 2004? Somebody might remember. I mean, we were just increasing the speed of the CPU clock and it was amazing, but suddenly we realized that the power consumption was not that amazing at that moment, you know. And that was when the birth of parallel computing came. Some of you were actually in some of the meetings where we were wondering:
hey, what is going to happen with Intel when they cannot support the increase of the clock frequency anymore? Well, the birth of parallel computing was a change in history that created a new paradigm for how to build computers, how to build data centers, and how to create the software to manage them, from operating systems to compilers to applications. So what happened next, what happened with the growth of computing since then? The growth has been amazing thanks to the capability that we have to increase the number of processors in a data center. That is amazing, but the question is: are we really using that power in our data centers? Are we using the new instruction sets that the Intel architecture, or any other architecture, is providing to us? And this is a really good example, the Intel floating-point roadmap. We started with roughly 4 floating-point operations per cycle around 2008 to 2010; it doubled to 8 around 2011 to 2012, and it was amazing. You know, we had much more floating-point capability, and that was amazing for the HPC systems and for all the analytics that we do, for calculating the weather forecast or whatever you want to do. That was a huge advantage. What changed later? Well, in 2013 to 2014 we doubled again to 16, and last year, with the 512-bit registers, it went to 32. Okay. The question here is why Intel is so eager to keep increasing the size of those registers. Why is it providing us an Intel architecture with much bigger registers that can handle many more floating-point elements at once? So here is a simple example. Take a simple piece of code that just adds two arrays in C and does that many times; I need to repeat it because a single pass is really fast. So the question is, what is happening behind that? It is just adding one array to another and putting the results into a third array.
It's pretty simple. Well, the birth of vectorization came with that simple problem. Imagine that you have the same code as on the previous slide. Without vectorization, or before the vectorization technology arrived, you just did the addition of one element to another and put the result into another register of the CPU, and that was fine. But what happened with the other, free part of the register? You weren't using it, and that is what I mean by unused resources: you're not using the full capability of this instruction set. Okay, so what happens if I take the array from our previous example, grab the first element of the array, the second, the third, and the fourth, do the additions in parallel in one clock cycle, and put the results back into the register in my CPU? That will cut the execution time by four. Instead of doing one addition at a time for every element, as on the previous slide, I do four additions in one clock cycle, so I save clock cycles every time I do this addition, and the execution time is reduced by a factor of four. So that is amazing. How do I enable vectorization in GCC? This is something that has been there for years, and it's really simple: you add -O2 together with -ftree-vectorize, or you just use -O3, which already enables it.
I know that OpenStack is built in Python, but in many cases the libraries that we are using underneath might be written in C or compiled from C. Many of the libraries used under the OpenStack applications might be built in C, and we will see some nice examples. So, taking the previous code that we saw: without vectorization, this execution took 600 milliseconds, just that simple code we saw before. When I enable vectorization, it reduces the time to 38 milliseconds. Okay, and that is a gain just for using another flag. So you might be wondering: okay, that is cool, I can reduce the execution time of my application, which means the workload that I put in the cloud might run a little bit faster. This is something that the HPC guys know very well, and it's one of the things that they use every day. So, what about AVX2? Coming back to the graph of the floating-point roadmap, we had 4 first, then 8, then 16, and now we have 32. The last step is AVX-512, the 512-bit-register Intel architecture feature that we have had since last year in the latest servers. AVX-512 registers span bits 0 to 511. AVX2 can work with registers from bit 0 to 255, and the regular one that we showed before, with just simple vectorization, goes with registers from bit 0 to 127. What does it mean that the registers are that long?
So what happens with AVX2? I simply have more space, so in the example that we saw before I can do the addition of the elements of the array with more parallelism. The only thing that you have to add is the -mavx2 flag, and that's all; that's all you have to add to enable the AVX2 technology. Before, without that flag, you were not using the resources that were in the Intel architecture. So imagine that you are a CPU, and the compiler generates a binary that is not using the full capability of the Intel architecture, of the server. The CPU will say: okay, I can run this thing, but if you just add that flag I can run it faster. How much faster? Okay: without any kind of vectorization it was around 600 milliseconds; with just simple vectorization it was reduced to 38 milliseconds; with AVX2 technology it was reduced to 26 milliseconds. Okay, so that is a huge reduction in execution time for just one compiler flag. I know that it's part of what the operating system should be doing, but the question is: does the operating system that is running your OpenStack have that enabled in the libraries that you're using? Because as an OpenStack user, and that's something that happens with my team, I'm in the compiler and power/performance team, and there is another team that is in charge of OpenStack, and they don't care how I do this as long as they see the OpenStack workload run faster. So what kind of workloads go faster? Real case example number one. There is a programming language, well, not that new, that came out to the world a few years ago. Can somebody please raise their hand if you know what the R programming language does? Great. So what is it used for? We use it for analytics, we use it for
big data analysis, for statistics. The way that it handles graphs and plots is amazing, and doing a linear regression with it is really simple. So it's something that the cloud uses, right? I mean, something that cloud workloads or data center workloads might be using. So this is a benchmark. It wasn't done by Intel; it was done by an external third-party magazine named Phoronix, and that magazine usually takes the operating systems that they choose, runs benchmarks on top of them, measures the performance of each operating system, and gives the reasons why one result is better than another. So they said: we want to analyze how fast your Clear Linux is. Clear Linux is the project where we work, and this is the project where we are enabling all these patches. Oh, it's not my fault, right? I mean, can you see the screen now? Yeah, there you are, thank you. So, Clear Linux, coming back to the presentation, is the operating system where we work. Okay, it's an operating system made by Intel to highlight many of the Intel architecture features that are not being used by other operating systems. We're not trying to compete with Red Hat or Fedora or Ubuntu; on the contrary.
We're trying to provide the operating system partners with patches so that they can use this technology, so that they can use AVX in their operating systems and any user of the other operating systems can benefit from that. So that is one thing that I want to make clear. So they took Clear Linux and compared it to other operating systems: Ubuntu, SUSE, and Fedora. As far as I can see, lower is better, and the R benchmark runs three times faster on Clear Linux thanks to the use of AVX2. Okay, the patches are available; I put the link in a QR code so that you can go there, and that's the link for the actual Phoronix article, where they explain and link to the source change that makes that benchmark run faster. Okay, it's nothing really complicated. It's just one single patch that you need to apply to one file, and that thing will make Ubuntu or Red Hat or Fedora run at the same speed, and we have tested it. So if AVX is so cool, and we can see from the results that it is really powerful for me as a developer of cloud technology, because I don't care what happens underneath as long as the benchmark runs faster, why does nobody use it? I mean, it has been there for years. Why has nobody used that technology all this time? Well: how many binaries do I have to deploy? Because if I compile with the AVX2 flag I have one binary; if I compile for AVX-512 I have another binary; and if I compile for SSE I have yet another binary. So now I have at least three binaries. Imagine the paths you would have in /usr/bin and /usr/lib64 for AVX2, AVX-512, and SSE; it would be really hard to manage, and a nightmare for the OS developer to manage and deploy all the binaries, having to compile three binaries every time they do a new release. So that is a nightmare. So why does nobody use that?
Well, it's because of that, because it's a nightmare to use. GCC 6, which is going to be released next week, and that is truly a commercial, I'm part of the GCC team and that is truly a commercial, finally sees the light next week and releases this new technology: function multi-versioning for C. Before, it was just for C++; now it's available for C. You just need to specify, on the same function that we had before, the targets that you want this function to be optimized for. Okay, so that is the only change that you have to make inside your code. You just specify: okay, I want this code to be optimized for Atom (Silvermont) and for AVX2, for example. At the end you will have one binary with three different kinds of instruction sets in the assembly. Sorry, is it broken? No, it's fine, it's fine, don't worry. So we end up with one binary with three different sets of assembly instructions; when you run objdump and open the binary, you will see the instruction sets optimized for SSE, for AVX, and for AVX2. And that is amazing, because at runtime your binary will detect: what hardware am I on?
Oh, am I on the latest Xeon server that this company runs? Then I can run with AVX2 or AVX-512. Okay, the overhead. Somebody might be wondering what the overhead of this thing is: the overhead is just in the size of the binary, which increased about 30%. We're on our way to write an article with many more details about all of this; however, you can find all the information on the blogs at clearlinux.org, and that link has all the information and the detailed patches of what happened. So that's the first thing: use the unused resources that you have. The second thing, and I put the picture about traffic because I truly hate traffic and I try to avoid it as much as anyone else, I'm a big fan of working from home by the way, is: help the compiler find the most efficient path. If you see the picture there, you're new in the city. It's like when you go to LA, right? You try to take the highway into LA, and the first time I arrived there with my car it was like, okay, I'm lost; 50 minutes later I had no idea where I was. So, help the compiler find the right path, because who knows better than you what your code should do? Nobody. Okay, I was at a Linux conference a few weeks ago and there was a discussion about the compiler not being smart enough, and at the end somebody made the point: the compiler is just a tool that we use to generate a binary from the source code that we have, but the smart one in the middle is us. Profiling is something that is not new; profiling has been there for years. There are two kinds of profiling, and that is what is new in the latest version of GCC. Profiling in the old times meant invasive profiling. Okay, there is an advantage: it's very precise. And I'm going to try to give an analogy, a really good analogy that I found weeks ago. There's a disadvantage: it has a high overhead. Yes.
It has a high overhead. Non-invasive profiling has a small overhead, but the disadvantage is that it's not as accurate as the other one. Here comes the analogy. Imagine that you want to track the performance of one player in a soccer match. Okay, let's say we have two options. Option one: we put a big sensor on the back of the soccer player, and that smart sensor is going to send information to the server, with OpenStack, about where the player has been running during the match, on the field. Okay, and at the end we will see a nice animation of where the player has been running for the whole match. That is amazing; it's really accurate. You know exactly where the guy has been running the whole game. The problem is that he has a big sensor on his back, right? That's the only problem. That is the thing with invasive profiling: you have to put something inside the code that can track where the code has been running during the execution of the binary. What happens with non-invasive profiling? Same example, but instead of putting something on the player's back, you put sensors all over the field and you start to measure how many times the player activates those sensors. At the end you have counters, and with the counters you can work out which parts of the field the player has been touching more, which is cool, and you have the capability to more or less guess which parts he ran through more, and at what time, so you can guess a little bit more about it. So those are the two differences between invasive profiling and non-invasive profiling; both are excellent, each is great for some cases and not for others. This is an example with invasive profiling.
Okay, with invasive profiling we did FDO, feedback-directed optimization, for MariaDB, and what we did was: first we took MariaDB and compiled it instrumented; then we ran the benchmark, and after the benchmark we gathered all the information about where MariaDB had been executed, which paths were exercised. With that information, we went to the compiler and said: okay, compiler, this is the information about which paths, which branches, are most executed by the MariaDB benchmark; and this is a regular benchmark that many people in the cloud and the data center use. The compiler takes that information and says: okay, now I know what you care about more, now I can predict the branches in a better way, now I can manage the memory in a better way and decide where in the heap to put the variables, so that the ones you use more can be accessed faster, which is nice. So the compiler now has much more information, and as you can see, the performance improvement is much better with the profiling, around 20 percent. Another example. Oh, by the way, what we did later after that is we put MariaDB in an OpenStack environment and started to run the Rally benchmark. This is one of the Rally benchmarks, the transactions-per-second one. I hope, I think, that I'm not the only one who was not aware of Rally; Rally is the benchmark test suite that OpenStack provides to measure the performance of OpenStack. This is transactions per second, and the previous one, sorry, is the response time; and yes, it was much better with the use of profiling, and MariaDB was running faster at the end. So what happened in the end?
Yeah, these are the results at the end: the average time to create 100 users, running 1,000 times, with FDO is around 600, I think this is milliseconds, while with the baseline it could be 7,000 milliseconds. So yes, in the end, when you ask me as an OpenStack user or administrator, what is the gain, what is in it for me? It's a reduction in the average time in which you can create 100 users or 1,000 users, which is nice because at the end you can create a user in less time. What about the non-invasive profiling technology? We tested non-invasive profiling on AWK. Okay, AWK, yes, it's old, it has been there since Unix, but it's still being used by many people in the industry, and believe it or not, I was in shock when I realized how much it is used. So what we did was to run non-invasive profiling on AWK, and in the end we found that AWK can be improved by 80% just by using non-invasive profiling. And in the end your operating system can run all your AWK scripts much faster; you can find the pattern that you're looking for in your huge text file in a faster way. What is the goal of all this? And this is one of the things I want to close the presentation with: make the cloud faster, and not only from the point of view of an OpenStack developer. I like this picture because it reminds me of a picture I once saw on a pier in San Francisco when it was completely foggy, and it was really hard to walk through. You knew there was a path on the bridge that you could just start walking on, but you walked slowly. If you turn the lights on, it's much faster for you to walk through the fog. Okay, so it's exactly the same here: request that your operating system provider enable the latest technology, so that your applications can run faster. It's not difficult; it's not really rocket science.
It's very simple. So at the end, as an OpenStack developer, you can request that your operating system provider enable that technology and make your applications run faster through the cloud. That's all, thank you. So yeah, questions. I guess it's an implementation detail question about the multi-targeted architecture, the function versioning. So, I know that macOS has for a very long time had multi-architecture binaries; Linux, in my understanding, really hasn't dealt well, as you explained, with multi-architecture binaries. Is this part of the FatELF project you're talking about here? No. Is there something different? How is it going to know what architecture to run if you have this fat binary with multiple optimizations in it? Yeah, the function multi-versioning. I was co-working with the team in Russia on the GCC side that did that part. Okay, here is how the patch works: they detect the target clones, say three target clones, and they generate the assembly instructions for all three of them and place each one in the binary. At the end, at load time, they use the CPUID instruction, and based on the CPUID result they can detect: okay, if it matches one of the versions I have, I can go there. They usually go with the CPUID check, and that's pretty fast. Before this patch, the way function multi-versioning worked in C++ was kind of a priority list: I have AVX-512, AVX2, and AVX; I put AVX-512 at the highest priority; if I don't have AVX-512 then I go with AVX2; and they were checking each time, so the switching between those was really, really slow. That was the way it was, but now, with the resolved function call, it's much faster. Cool, yeah, thank you. More than welcome, sure. Thank you, it's a good presentation.
So I have one question. If we want to optimize, we want to make our programs run faster, specifically the OpenStack programs and cloud computing systems, shall we use algorithms specific to some problems, must we use them to optimize our systems? That's a great question, and just let me give you an example from when I was starting with this. When I was starting with this, I joined the Python IRC channel on Freenode and I asked the guys: hey, do you know how I can make my Python script run faster? And they said: you change the algorithm. And I said: yeah, but if I have already optimized my algorithm as much as possible, how can I make it run even faster? I mean, and they said: there is no way. Then I sent them back the results of a piece of code that was really optimized; it was an old benchmark, and I showed similar numbers. You know, for example, MariaDB is something that is really optimized, and it's checked by millions of eyes on every release; they say this is the fastest MariaDB that you can have. Yeah, but if you change the flags that you compile with, or use profiling to teach the compiler which paths it needs to pay more attention to, you can run your binary much faster. So yes, it's possible to adapt it to a cloud or to bare metal running whatever you want. Thank you. You're more than welcome. Okay, so, sure. So when you say profiling, do you mean using profiling to find the hotspot and then changing the code, or does profiling by itself help improve the performance? No. What I mean by profiling is: when you do invasive profiling, let's put it in this example, you have source code, and you tell the compiler: I'm going to build this source code instrumented. Inside the code it puts counters, and the counters increase every time you go through that line. It's like when you do code coverage, yeah.
Yeah, I mean, when you use profiling to find the hotspot, then you change the code to optimize it, optimizing the certain part you found, and then improve from there? No, I don't change the code; I change the final binary. I allow the compiler to change the binary, but I don't change any source code, any line of source code. Sorry, so I still cannot understand. So how can the binary... who changes the binary? The compiler. Okay, let me give you an example. Imagine that you have a piece of code with multiple variables, different variables. The compiler cannot detect on its own whether in the heap it should put x or y or z closer to your instruction. In assembly you can have jumps to get there, but with the profile it can place the one that you use more, let's say a variable you use 10,000 times, closer to the execution line where it's needed, so the jumps at execution time are faster. So this is all automatically handled by the compiler? Yes, sir. Okay, yes. That's the good thing: you don't have to change a line of code. You mentioned Python, and in your example you used R and vectorization. Are there other things that apply more to interpreted languages, that kind of speedup that might be more typical of what we're doing in OpenStack? Yes, yes. We did that experiment for AWK; we are doing the experiments for Perl. We also enabled invasive profiling for PHP, and we gained 15 percent of performance with the PHP benchmark. So we actually enabled many of the LAMP stack things; we enabled PHP.
We enabled MariaDB, and we improved the performance of Apache by 35%, something like that; with non-invasive profiling you can improve the performance of the Apache server by 35 to 40 percent. We're missing Python; Python is a little bit tricky because the way it is compiled is not as friendly for compiler guys as I thought, but yeah, we're working on the profiling. There are some examples we have where, with invasive profiling, you can enable it in Python itself when you build Python by hand, not when you do sudo apt-get, but when you build it by hand. There is an option, the profile-opt build target, and that builds Python instrumented, runs an internal benchmark, and then gives you the final Python after that optimization, and that gives around 12 to 20 percent performance improvement. So, do most of the results from the benchmarking or the performance counting result in code changes? No, and that's a good question, because I'm trying to differentiate, you know, the GCC 6 optimizations versus profiling an existing application, collecting data, and fixing optimizations that could have been done without GCC 6, or taking advantage of AVX, or the other. Yeah, that's a great question. Before GCC 6 we just had the capability to do profiling in an invasive way. After, actually after the latest version of GCC 5, but I'm going to make the commercial for GCC 6, now we have the capability to do non-invasive profiling. Okay, maybe I was not clear. Before the latest version of GCC, yes, you can do profiling, even for Python, even with GCC 4.8.
Okay, that's fine, but the question is, it's a nightmare. You know, it's like the example I gave with the big sensor on the back of the player. It's really painful, because imagine that you're building an operating system, or even that you're building the Python for your own OpenStack: you have to build your Python first instrumented, then run the benchmark, and then compile again, and then deploy the thing. With the other way you forget about it: you just use some specific counters, you gather once, and once you have the information you send it to the compiler, and the compiler tries to guess which are the paths that the benchmark executes more, and then you just deploy it. So you have to compile one time less. Okay, that is one of the advantages; it came in GCC 5.0-something and will be in 6. Yeah. So, continuing on that question: you have to compile once, save that data, and pass it to the compiler, and then the second time you compile you get the final image, right? But if you have a different program, then you have to repeat that process again. Yes; otherwise the compiler would reuse the old data from the other program. And that, for invasive profiling, is what is a nightmare. In my case, when I receive a mail that says, oh, guess what, Victor, there is a new version of Python or a new version of PHP or a new version of MariaDB, I say, oh my god, I have to go compile it instrumented, run the benchmark, gather the information, and recompile again.
With the non-invasive thing that we did for AWK, I mean, AWK is not changing that much anyway, but the good thing is that with that other technology you forget about it. You just compile as usual, but you pass the compiler a flag that says: oh, by the way, use these counters and try to guess where the hot paths are, and it doesn't matter if the source code changes. But if the source code changes, the counters should change too, right? Not really; let me explain. Okay, to put names on things: AutoFDO is the non-invasive one; FDO is the one that is invasive. How do these counters work, not in an analogy but in technical terms? Okay, the Intel architecture has a feature, hardware counters, and the kernel also has software counters. How many of you have used the tool perf? Okay, what does perf do internally? It tracks the performance counters of how the application is behaving inside, right, and it uses hardware and kernel counters to give you that information. Those are exactly the same counters that AutoFDO uses. Okay, there is a really amazing tutorial on how AutoFDO works, here, an amazing tutorial, and it will show you how perf is run, how those counters are gathered and passed to the compiler, and that is why it doesn't matter if the source code changes. It's a really great question.
It's a really good question. The answer is: the counters are not associated with a line of code. Okay, the counters are just activated because the benchmark was running on the hardware and in the kernel, so there is no direct association between a counter and a line of code. It doesn't matter. The other way, when you instrument the binary, you have a direct association between the results of the path that you are tracking and a line of code. It's like when you do code coverage: you have two binaries, the one that you are executing and another thing that is tracking the lines that are executed, and the two are correlated. With the counters they are not correlated; the trade-off is that the compiler has to guess. They call it a heuristic; I call it guessing. But yeah, that's the other thing. Yep. Great question. Any other questions? Great questions. Yeah, thank you so much. Sure. No, okay. Thank you.