Mei-Chin: My name is Mei-Chin. My team owns the .NET languages and runtime, including the C# and VB languages and compilers, and also the runtime team. Yes, all the runtime services that Miguel just mentioned are underneath my team.

Sergei: Hi, I'm Sergei. My team is part of Mei-Chin's organization, and I lead the team that owns various low-level parts of the .NET runtime, including ahead-of-time compilation.

Mei-Chin: Sergei is my performance architect. Okay, then let's start. Do you mind going to the next slide? So, what is a runtime? I think it's probably quite obvious to you; I prefer this slide because this talk has been given once already and I'm not sure about the audience. I think about a runtime as a translator. If you think about it, you write your C# code once, you compile it once, and it actually runs on all the different devices and platforms. That is because the runtime is doing the translation for you. We have two well-known and really popular platforms, Windows and Linux, on different architectures: x86, x64, ARM, ARM64. If you look at the matrix, that is eight different "languages" the runtime has to speak, and on Linux it even depends on the distro; there is platform-dependent work the runtime has to do for each flavor.

This talk is going to be in three parts. In the first part, we want to talk about how we tune the runtime for startup and throughput, so that the out-of-the-box installation you have of .NET Core or the desktop framework performs well. In the second part, Sergei will walk us through a startup-time case study, and in the third part we will wrap up with some takeaways.

Before I jump into the journey, I want to break down three different services in the runtime that are relevant to today's talk. There are more than these three services in the runtime, but these three matter today. I think the JIT and the GC are probably the most well-known components to you, because these are the two parts that contribute most to non-determinism and latency,
and they manifest themselves most visibly in the performance people experience. The first one, though, is the type system. We don't talk about it a lot, but the type system is the center of the universe. I would like to believe so because my first lead job was the type system; it is actually the one that holds the universe together. If you look at this kind of skeleton code, the type system is the one that decides, when you allocate MyBase or MyClass, how big that object instance should be; when you have v-tables, what the layout looks like; and when you do casts, whether you are doing the right thing or not. So the JIT has to consult the type system in order to generate code, and the GC consults the type system as well when it is walking the object graph. Next slide, please.

So here comes my first question. Many of you are probably very familiar with this hello-world web API. The application is quite simple, just a few lines: it pops a hello world onto a web page, waits for you to type a string in, and displays it on the page. Do you want to guess how many methods need to be jitted in order to run this code?
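The slide's code is not reproduced in the transcript, but a minimal hello-world web API of the kind being measured might look roughly like this (a sketch assuming ASP.NET Core 2.x; the endpoint logic is illustrative, not the exact code from the slide):

```csharp
// Sketch of a minimal hello-world web API (assumed ASP.NET Core 2.x style;
// not the exact code from the slide).
using Microsoft.AspNetCore;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;

public class Program
{
    public static void Main(string[] args)
    {
        WebHost.CreateDefaultBuilder(args)
            .Configure(app =>
            {
                // A single endpoint: greet the name typed in, or "world".
                app.Run(async context =>
                {
                    string name = context.Request.Query["name"];
                    await context.Response.WriteAsync(
                        string.IsNullOrEmpty(name) ? "Hello, world!" : $"Hello, {name}!");
                });
            })
            .Build()
            .Run();
    }
}
```

Even an app this small pulls in the hosting, HTTP, and configuration layers of the framework, which is where the surprisingly large number of jitted methods comes from.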
Mei-Chin: Miguel, since nobody volunteers... yep. Okay, Miguel is very pessimistic, and I must clarify this is on .NET Core, because desktop and .NET Core are quite different. Next slide, please. It's only four thousand. Not too many, but certainly greater than what you expected; Miguel is pessimistic. But if you look at the time, the hot startup is one point three eight seconds, and this is measured on a machine with an Intel i7 at three point four gigahertz, essentially quite a beefy machine. So think about that: if you go to low-end devices or less capable machines, how long will it take? And we actually spend one point four nine seconds in the JIT, which may surprise you: how can JIT time be greater than the startup time? Because the machine is multi-core, and a lot of that jitting happens concurrently.

So we have a problem here. Hello world takes one point three eight seconds to start up. This is not okay, because a real-world application is going to use a lot more functions than what you just saw, right? So who is liable for this? It's actually split between the JIT and the type system. We believe these are the ones that take about 60 percent of the startup time, doing all the work to prepare those four thousand methods to be jitted.

Interestingly enough, my first job was actually leading the JIT; I had three devs. I went to the JIT team and said, "JIT, we are slow, and we are the reason .NET code doesn't run fast." The JIT said, "Nah, it's not our fault. It's the type system." In my second job I was the type system lead. I went to the type system team and said, "Hey David, we are slow. What should we do here?" And David told me, "Nah, it's the JIT." That doesn't make sense, right? Somebody must be at fault.
It turned out that the type system only has about 33 percent, one-third of it, and the JIT has one-third of it. So neither of them owns the majority of the fault, and the remaining one-third lands in the JIT-to-type-system interface. Because, remember, earlier we talked about how, when the JIT generates code, it consults the type system; it asks many, many questions of the type system. So the type system told the JIT team, "If you asked fewer questions, we would not be so slow," and the JIT came back to the type system and said, "If you answered our questions faster, we wouldn't be that slow." Regardless, somebody has to solve the problem, right? So I'm going to ask Sergei to solve this problem for me. Otherwise we cannot ship.

Sergei: All right, thank you. I'll try. Okay, as Mei-Chin noted, the main problem here that I need to solve is the time we spend jitting the code during application startup. If we step back and think about it, we jit the same code every single time we run the app. We load the same libraries, we parse some config files, we jit, and the result, the machine code the JIT produces, is exactly the same on every single invocation. Seems wasteful. So how about we just cache it and reuse it later?

NGEN is a tool that we built for the .NET Framework to solve this problem. What it does is pretty much load an application or a library, compile every single method that is there, and store the result in a system-wide cache. Next time, when the application launches or loads one of its libraries, the runtime finds the pre-compiled code and loads it instead of jitting it again. So let's look at the results. As you can see, the startup time has changed drastically, right? We improved it by roughly 2.3x; it's now a fraction of what it was before. Looks impressive, right? Okay, we are done here, right?
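The compile-once-and-reuse idea can be modeled abstractly. This toy sketch (hypothetical class and method names, not NGEN's actual implementation) keys the cache by assembly name and version, which is also why a version bump naturally misses the cache:

```csharp
using System;
using System.Collections.Generic;

// Toy model of NGEN's "compile once, store in a system-wide cache, reuse on
// the next launch" idea. Hypothetical names; not the real NGEN implementation.
public class NativeImageCache
{
    private readonly Dictionary<(string Name, Version Version), byte[]> _images =
        new Dictionary<(string, Version), byte[]>();

    public byte[] GetOrCompile(string assemblyName, Version version, Func<byte[]> compileAllMethods)
    {
        var key = (assemblyName, version);
        if (!_images.TryGetValue(key, out byte[] code))
        {
            code = compileAllMethods();   // expensive: compile every method up front
            _images[key] = code;          // store the result in the cache
        }
        return code;                      // later launches skip compilation entirely
    }
}
```

The first launch pays the full compilation cost; every later launch for the same assembly version is a cache hit.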
Sergei: Right. One thing I want to talk about with NGEN and its characteristics is fragility. The code it generates is really just jitted code, and it contains a lot of assumptions that make it very fragile. For example, type layout: the JIT makes assumptions about the offset at which each field lives, and when it makes a virtual call it needs to know the location in the v-table where to find the actual address of the method to invoke. All of this makes it very fragile. And even the data structures that live inside the CLR can change from version to version, and they influence the jitted code.

So what does that mean? Well, you pre-compile your code, and then you decide to deploy a new version of your application, or a new version of a library the application uses, or maybe Windows Update just installs a new version of the framework. What happens then? The cache you had is pretty much invalidated. Just to be clear, there is not going to be any correctness issue, because the runtime itself knows about this and has means built into the cache to detect those situations. What will happen is that we just throw away the pre-jitted code and fall back to the JIT again. So yes, the code will run slower, but it will still be correct. Well, yes, that's a problem, but how often does that happen? How often does Windows Update, for example, update the .NET Framework, right? Anyway, in this case, when the cache gets invalidated, the NGEN service kicks in: it recompiles the apps, updates the cache, and the app starts running fast again, as before. So it seems like it's just an engineering detail, and everything is good. I can tell my boss: I can keep my job.

Mei-Chin: So Sergei was happy for a while. Remember, NGEN was delivered back in .NET 1.0. Do you remember that at that time software still came to you on disks? It was only later that Windows Update came along. Can you do the next slide?
Mei-Chin: He was happy for a while, but his happiness did not last forever. The world was actually changing on him, and he just didn't know. We were asked about devices that had to be sensitive to battery life. At that time it was Windows Phone, then wearables, and actually even laptops. Say, fifteen years ago, a laptop's battery life was maybe two hours. Today nobody is going to buy a laptop whose battery lasts two hours; the expectation has changed. And wearables: think about HoloLens spending thirty minutes ngen'ing code. That is simply not acceptable.

The second thing is that the workload is also changing. Think about servers: when you provision a server image, you build it once and you want to deploy it to millions of servers, and you expect those server instances to be responsive immediately, rather than each server spending ten or thirty minutes generating NGEN images. That is not okay.

The third one is also a killer: security. We generate images on the device, so how do we know they have not been tampered with? And our NGEN service is an elevated service, so how do we know it won't be hijacked? In fact, Windows actually requires executable images to be signed; otherwise they cannot be trusted. And the last one: as we move to .NET Core, we go to Linux. How are we going to run an elevated service on a Linux operating system? It is just not going to work.

So here I am going to tell Sergei: sorry, your solution did not last too long.

Sergei: Well, I guess that engineering detail was a real problem after all. So what do we need? We need a new code-generation strategy. We need to scale back on those optimizations in the JIT that make the really fragile assumptions. The good thing is that those assumptions are really just used for optimizations, like I said; they are not fundamental to ahead-of-time code generation. We can live without them.
Sergei: So, for example, instead of hard-coding offsets into the v-table, we can just ask the runtime to give us the data and then call through that. Crossgen is a new tool that we wrote that generates code which is version resilient, and it compiles just one library at a time; it doesn't bake in anything from outside of it. So you can replace libraries as much as you want, and the code in other places will not get invalidated. As I said, it is going to generate less performant code, because not all optimizations are allowed in this mode. The other nice thing about this tool is that it can run anywhere: it can run on the target machine where the application will run, or it can run in the build lab where the library or the application is produced in the first place. What this allows is that companies can now sign their binaries on their build servers and deploy DLLs that have been verified, and that can be verified, right?

So let's take an example and look at the generated code in both cases. As we said, we are talking about virtual calls, and ToString is a good example. On the right side you can see code produced by NGEN, and you can see the fragility I was talking about: the offsets into v-tables. This is the v-table; this is the pointer to the v-table chunk that this v-table uses. Everything is hard-coded in the image. What this means is that if you add or remove a virtual method from that class or one of its base classes, this code becomes invalid and will crash. Now, on the other side is code generated by crossgen. There are no more hard-coded offsets. What we do instead is call into the runtime to get an indirection cell, and then we invoke a virtual stub dispatch on it. It will run a little slower, but it will do the right thing, and it doesn't matter what happens to that object, how many methods we add to it, or how much we change it: the code will still work.
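As a concrete illustration of that fragility, consider a small class hierarchy (hypothetical types, sketched to mirror the ToString example on the slide):

```csharp
using System;

// Why hard-coded v-table offsets are fragile (hypothetical types).
// NGEN-style code bakes "this method lives at slot N" into the image; if a
// later version of a base class adds or removes a virtual method, slot N
// shifts and the pre-compiled call site becomes invalid. Crossgen instead
// asks the runtime for an indirection cell and dispatches through a stub.
public class Shape
{
    // Adding another virtual method here in v2 of the library would shift
    // the v-table layout of every derived class.
    public virtual double Area() => 0.0;
    public override string ToString() => $"shape with area {Area()}";
}

public class Circle : Shape
{
    public double Radius { get; set; }
    public override double Area() => Math.PI * Radius * Radius;
}

public static class Demo
{
    public static void Main()
    {
        Shape s = new Circle { Radius = 1.0 };
        // NGEN: a load from a fixed v-table offset (fast, version fragile).
        // Crossgen: a virtual stub dispatch resolved via the runtime
        // (slightly slower, but version resilient).
        Console.WriteLine(s.ToString());
    }
}
```

The source code is identical in both cases; only the machine code the two tools emit for the virtual call differs.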
Sergei: Let's measure again. One thing I want to mention about this slide is that it contains data for two different runtimes, the .NET Framework and CoreCLR. The reason is that NGEN by itself works well on the .NET Framework, but it doesn't exist for CoreCLR; crossgen, on the other hand, was built for CoreCLR. So we need to compare highs and lows for each runtime separately. This is the data you saw before, for the .NET Framework, and this is for CoreCLR. What each row means: the first row is every single method jitted at runtime. The next row is when we pre-compile only our core library, System.Private.CoreLib, using the fragile style of compilation to get better performance. Then this is actually our shipping configuration: all our framework libraries, the CoreFX libraries, are crossgen'ed today to improve startup. And in the last row I ran crossgen on every single DLL in the application. This is the end result: as you can see, we got a two-thirds reduction with NGEN, and now with crossgen we have even better data. I don't know what my boss thinks about this data, but to me it looks great. I think we are done.

Mei-Chin: He's always optimistic. If we go back a slide, this slide actually contains many different stories. When we first shipped CoreCLR, .NET Core 1.0, we, the performance team, were actually very unhappy, because .NET Core 1.0 was slower than the desktop framework: there was no NGEN, and we only pre-compiled the core library. Crossgen for the framework was enabled in 2.0; that is why some of the performance gains you saw in the previous presentation manifested partly in startup as well. So, as I said, the engineer is always optimistic. Didn't he just give up optimizations? Didn't he just switch to stub helpers? Startup is not impacted, but...
Mei-Chin: Is startup the only metric that we measure? Why don't you go back and check all the metrics that we measure?

Sergei: Yes, well, right. Startup is not the only metric; there is another thing called throughput. So let's look at the throughput data. Here we have data from a JSON serialization benchmark; this is actually a benchmark used by the ASP.NET team and by TechEmpower. What we can see here is, again, requests per second for each configuration. The first line is all-jitted: all code is generated with every possible optimization the JIT supports. This is our shipping configuration, and this is if I crossgen everything in the app. Well, yeah, we do have a problem here. We are about seven percent slower than the all-jitted configuration. So, yeah, I guess we just pushed the problem somewhere else, from startup to throughput. So we need to think about it.

Mei-Chin: All right, okay, I'm going to think about it.

Sergei: So let's take a look at the code-generation technologies we actually have today. Ahead-of-time, crossgen, we talked a lot about it today: it gets code running fast for fast startup, but the quality of the code is not optimal. Then we have the JIT, which actually has two code-generation modes: minimum optimizations and full optimizations. The first mode, minimum optimizations, is the mode used when you hit F5 in VS for debugging scenarios. Its purpose, again, is to jit the code as fast as possible with minimum optimizations, which also makes your code very debuggable and provides great diagnostics capabilities in VS. And full optimizations is what we have in release, right? Everyone knows this. An interpreter? Well, we have three prototypes of that, but none of them actually works well, and it would require a lot of work on the diagnostics stack to support it. So none of these technologies quite works by itself.
Sergei: I guess what we can do is combine some of them together. This is what we've done, working on it for the last couple of years, and this is what Steve mentioned in his talk: tiered compilation, which shipped in 2.1 as a preview feature and is supposed to be on by default in .NET Core 3.0. Before we had tiering available to us, at jit time we could compile an IL method only once, and only once could we decide what to do with it: do we optimize for fast jitting for startup, for throughput, or maybe for portability? Now we can actually do it multiple times, or at least twice. So on startup we use min-opts, or we can reuse crossgen code, to get the code running fast; and then, once we reach a steady state, we can recompile the same code with all available optimizations enabled. The other part is about crossgen: what we can do here is embed in our images some hints for the JIT or for the tiering system, so it can do better optimizations or generate code more efficiently.

One thing we are still experimenting with is the heuristic for tiering. As the slide says, steady state versus startup is kind of a gray area; every single application is different. We don't have an API called "start steady state"; we don't know when it begins, so we kind of have to guess, and there are various ways to do that. A simple one is to just count how many times a method has been called, a hit count, and then at some point start recompiling it. We could use more advanced techniques, like sample profiling or performance counters, but for now we decided to start with a very simple approach: as soon as a method has been called 30 times, we invoke the tiering system and it replaces the method with better code.

So we measure again, and what do we see? Yay, we are back. The throughput is good, and the startup is good as well. We haven't regressed it, because the runtime actually picks the crossgen code the first time.
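The hit-count heuristic can be sketched in miniature. This is purely illustrative (the real tier-up happens inside the CLR by swapping native code under running call sites), but it shows the shape of the policy, with the threshold of 30 mentioned above as the default:

```csharp
using System;

// Toy model of tiered compilation's hit-count policy (illustrative only):
// start with a cheap "tier 0" implementation and, once the call count
// reaches a threshold, swap in the optimized "tier 1" version.
public static class TieredMethod
{
    public static Func<T, TResult> Create<T, TResult>(
        Func<T, TResult> tier0, Func<T, TResult> tier1, int threshold = 30)
    {
        int calls = 0;
        Func<T, TResult> current = tier0;
        return arg =>
        {
            // Once the hit count reaches the threshold, "recompile" by
            // replacing the code that future calls will use.
            if (current == tier0 && ++calls >= threshold)
                current = tier1;
            return current(arg);
        };
    }
}
```

A method wrapped this way runs the quick version during startup and the optimized one at steady state; the runtime's real policy also has to worry about in-flight frames and backpatching call sites, which this toy ignores.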
Mei-Chin: So, all good. If you think about it, this is very similar to Java's HotSpot mechanism, right? If you are not sure whether your code is going to be executed many, many times, instead of spending time optimizing all the code, we take an adaptive approach: generate code as fast as possible, and for the methods identified as executing frequently, optimize those. That actually gives you a very good blend of startup and throughput.

So this is our code-generation journey. We started with pure JIT, and we never shipped the JIT-only configuration because it is just not fast enough. We built NGEN for the desktop, and that did not carry us across to the cloud scenarios, nor to Linux; that is why we got to crossgen. And then, to repair the crossgen throughput degradation, we built tiered jitting. What tiered jitting really gives the JIT is freedom. The JIT never had freedom: it had to do its work while using as few resources as possible, and yet it had to generate the best code for you, and those two demands just don't gel together. Tiered jitting opens up room for more optimization, and for optimizing only the code that matters. So this is just the start of the journey, not really the end, and crossgen and tiered jitting are the things that are going to ship in .NET Core 3.0.

So this goes back to Sergei, right?

Sergei: So, we talked a lot about what we've done so far. Let's spend a couple of minutes talking about what we will do in the near future.
Sergei: Everyone knows that Docker containers are super popular right now. In general, .NET Core applications run just fine in those environments, but we did get a few reports from customers, including in the presentation from Steve a few hours ago, that if a container is configured with low memory, applications can run into some issues. So we are working on addressing that: we are going to change some GC heuristics to consume less memory, and we are going to add more configuration options for tuning the runtime in those environments if our automatic heuristics don't kick in.

The other part that is important for Docker containers is the size of the base images. When we pre-compile assemblies with crossgen, we store the pre-compiled code in the assemblies themselves, which means a bigger size for those libraries, right? So what we want to do for 3.0 is look at using partial compilation to reduce the size of the CoreFX framework that we ship, and at the same time not regress startup times. Then, as you probably saw in the .NET Core 3 preview blog post, we are going to ship a new UI stack, WinForms and WPF, and what do UI apps require? Fast startup and quick response times, right? So we will spend some time optimizing those libraries and pre-compiling them. And last but not least, we want to make our crossgen compiler publicly available, so every developer can optimize their applications.

I want to quickly show you, in one minute, when is the right time to use AOT. As I just said, there is always a trade-off, right? So how do you know whether crossgen is right for you? PerfView, the performance tool, is your friend. There are a lot of tutorials and help on GitHub, and this tool is super powerful.
Sergei: It can show you everything your application is doing. We are going to use NuGet Package Explorer; it's a real-world application that has been ported to 3.0 and uses the brand-new UI stack. So what do we do? We start PerfView, obviously. We use the collect option to start data collection, we launch our application, and once the application has started we stop collection, and PerfView generates an ETL file. As you can see, PerfView can do lots of stuff; it collects lots of different data, sampling profiles, memory and GC information. But what we are interested in right now are just the JIT stats. If you click on that, it opens a new window that contains information about every single method that has been jitted, but I will just show you the summary. What you see is that we spent two seconds, or 60 percent of CPU time, jitting the code. Not good. Again, these numbers are from a point in time; when we ship the final .NET Core 3.0 bits, these numbers are going to be much lower. Next slide: I pre-compiled the app and ran it again. You can see the JIT time has pretty much disappeared. It used to be two seconds; now it's 200 to 300 milliseconds, and startup of the application is about one second, right? So that's it; here is our takeaway slide.

Mei-Chin: Thank you. These slides can now speak for themselves. If you have any questions, please send us emails, talk to us. Thank you very much.