RavenDB has been around for about 10 years. In 2015, we ran into a big issue: we effectively ran out of meaningful performance optimizations. A lot of that was related to the kind of architecture we were using, but we also had some hard requirements that we wanted to sort out, which were really hard to do. The primary one was that we wanted to also support Linux and non-Windows systems. And we wanted to reduce our support costs. As you can imagine, a database needs to have 24/7 support, and we ran into issues. A lot of them were our fault, but a lot of them were things that were hard or impossible to fix. So when we looked at our situation, we had two options. One of them was basically to write everything in C. That's the natural option for databases in most cases, and it's something you see quite often. The disadvantage of writing something in C is obviously that the cost and complexity are much higher, and you lose a lot of features and capabilities. At the time, the CoreCLR was still called DNX. It was quite raw, but it was already able to run on Linux, and there was already starting to be some focus on performance, which we really liked. One of the reasons we decided to go with DNX was that we had a backup strategy: if this ended up being another Silverlight, then at least we would be no worse off than we were at the beginning, only being able to run on the full .NET Framework. And to validate running on the CoreCLR platform, we had already run some spike tests, and even DNX at the time was much, much better than trying to run on Mono. So that was a pretty good sign from our point of view. Something that I can't emphasize enough is the open nature of the CoreCLR and the ability to submit patches. It used to be the case that you found a bug, you submitted it to Microsoft Connect, and it would take anywhere between two years and never to get a fix. Even from the get-go, we had amazing interactions with the team, both "here is an issue, can you fix it?" and they said yes, and "here is an issue, here is the fix," and the team worked with us on what needed to be done in order to get it accepted, and what other considerations we needed to handle. The experience has been much better. Another very important consideration from our perspective about using the CoreCLR is the ability to embed the runtime environment inside our deployment zip file, basically. The idea here is that instead of having to rely on the administrator to install a system-wide update, we can package the CoreCLR along with our software, so we can do a couple of important things. One, we can upgrade to the next version, or stay with a particular version of the CoreCLR for as long as we want. And if necessary, we can apply our own patched version without having a global impact on the system. OK, I mentioned performance, and I want to talk about what "fast" actually means. You can see some numbers here: requests per second, how many requests that actually means you're handling per day, and, most importantly, how much of a time budget this gives you per request. Now, this is a single server with eight cores, and that can handle a ridiculous amount of traffic; most users would be somewhere within the first three or four rows. So you may get as much as 18 milliseconds to process a request in the common scenario.
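To make that budget concrete, here is a back-of-the-envelope calculation. The original slide's figures aren't reproduced in the transcript, so the numbers below are illustrative assumptions:

```csharp
// Rough request-budget math. Handled sequentially, R requests/second
// leaves 1000 / R milliseconds per request; with N cores working in
// parallel, the budget grows to N * 1000 / R.
static double BudgetMs(int cores, double requestsPerSecond) =>
    cores * 1000.0 / requestsPerSecond;

Console.WriteLine(BudgetMs(1, 1_000));   // 1 ms per request, sequentially
Console.WriteLine(BudgetMs(8, 1_000));   // 8 ms with eight cores
Console.WriteLine(BudgetMs(8, 100_000)); // 0.08 ms = 80 microseconds
```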
By the way, I'm assuming here that we're doing things sequentially, but obviously we're going to handle things in parallel. Still, that's a huge amount of time to process a request; 18 milliseconds is forever for a modern CPU. It gets interesting when you start wanting to move to 10,000 requests a second, or 100,000 requests a second, because at that point the budget you have is much smaller. And we set a goal: I want to be 10 times faster than the previous version. When I set this goal, I got lots of funny looks from my team, because that's a ridiculous goal. How can we be that much faster? Remember, at that time we were three and a half releases in, and every single release we had made so far had some focus on performance. So the first thing we looked at was: how can we make it faster? Right now, with .NET Core 3.0 and its performance optimizations, you can probably just recompile and run, and you will get some speedup purely from the platform improvements. The problem is that there is a fairly low ceiling on the level of improvement you can get just from recompiling on a better platform. The big thing we wanted was to get much higher than that. So I want to talk about one of the most common issues we had with the previous version of RavenDB, and that is the GC. The GC is something we had been fighting for a very long time. You can see here some GC pauses and how they work: you operate, you operate, you operate, and suddenly you have a huge spike, and your 99th percentile basically goes through the roof. And it is not something that you can usually predict or visualize very easily. It gets worse if you are under memory pressure: suddenly your performance tanks, because you're now spending 90% of your time in GC, all the time. Now, it's easy to know how to fix that: you need to control your allocations, maybe move to manual memory management. But it's almost impossible to actually do that on an existing system, because managed allocations are everywhere. And in a managed language like C#, you tend not to care very much about memory ownership: who owns what, how long an allocation lives, and so on, especially if the code hasn't been built specifically with that in mind. I want to give you a good example of that. RavenDB is a document database; it stores data in JSON format. In the previous version, we stored the data on disk as JSON. That means that whenever we needed to read a document, we had to read it from disk, parse the JSON into an in-memory data structure, and only then could we work on it. As you can imagine, I/O and parsing costs are non-trivial; especially if you're running on cloud machines, on HDDs, or anything of that sort, it can be extremely expensive. So the natural thing to do is to use a cache. Instead of going to disk each and every time, we just have a dictionary somewhere, or a memory cache, or whatever, that caches the parsed document objects. As it turns out, this turned out to be one of the key issues that we had.
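Schematically, that "obvious" caching design looks something like this. This is a minimal sketch with invented names, using System.Text.Json as a modern stand-in for whatever parser was in use at the time, not RavenDB's actual code:

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Text.Json;

// The "obvious" design: parse each document from disk once and keep the
// parsed object graph around. Under memory pressure this is exactly the
// pattern that fights the generational GC, as explained next.
public class DocumentCache
{
    private readonly ConcurrentDictionary<string, JsonDocument> _cache = new();

    public JsonDocument Get(string id) =>
        _cache.GetOrAdd(id, key =>
        {
            byte[] bytes = File.ReadAllBytes(PathFor(key)); // disk read
            return JsonDocument.Parse(bytes);               // parse cost
        });

    private static string PathFor(string id) => $"data/{id}.json";
}
```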
And the performance optimization that we applied was to remove the cache. Why remove the cache? I want to explain what's going on. If the amount of RAM that you typically use is bigger than the amount of RAM that you actually have, then the following happens. You read an object from disk, parse it, put it in the cache, which is great. The cache then holds that object, which means that the GC can't get rid of it. The problem is that when the GC notices that your objects have been held for a while, they get pushed into the next generation. And the way the generational GC in the CoreCLR works, you have gen 0, gen 1, gen 2, and gen 1 and gen 2 are collected a lot less frequently, because collecting older generations is much more expensive. So the cache effectively pushes all of those objects into a later generation. When you actually hit memory pressure, the cache says: oh, OK, I know what I'm going to do, I'm going to stop holding onto these objects. But that doesn't actually free them; you have to wait another GC cycle or two to actually free that memory. In the meantime, the cache says: oh, I'm empty, I can get more data, and it starts holding more and more documents. And because of the memory pressure we have a lot more GCs, the new objects end up in gen 2, and the cycle continues. We effectively spent greater than 90% of the time, in some cases, just in GC under this specific scenario. And that was something that was very hard or impossible to solve in a meaningful fashion. So that's something you have to take into account, and we didn't really know how to solve it, especially because our entire project, which was something like a million lines of code, worked in this fashion. So the first thing we needed to do, before moving to the CoreCLR or moving to anything else, was to figure out how we could change the most basic assumption that we had. What we did was create a binary format, and you can see the difference between a JSON document and the binary format. The idea here is that a JSON document, JSON text, needs to be parsed before it can be used. The binary format, which we called blittable, is meant to be directly usable. With JSON, if I want to know the first name in this document, I need to read this, read this, read this, first name, now I've got it. With the binary format, I can say: I can see that the first name is here, jump straight to it, and read it. So it's readily usable; there is no parsing required. And most importantly, from our perspective, it's just bytes; there are no managed objects involved whatsoever. One of the interesting things about the cost of GC is that it is actually proportional to the number of objects that you have. So if you have a few big arrays, that's much cheaper than having the same information spread across many different objects, because the GC doesn't have to scan through all of them. But we actually took it further and said that the format is going to be a zero-copy structure, which means that I can look at a piece of memory and immediately start using it. This is important because we can now use memory-mapped files. Instead of building our own cache in managed code, we do something like this: there is already a cache at the operating system layer, and there is already a way for the operating system to avoid going to disk when possible. So if we store our data, our actual physical data, in memory-mapped files, then when we access the data we don't need to do any sort of allocation. And this is the typical way of working with a request inside the RavenDB codebase: we have a request, we have some code that gets a position in a memory-mapped file, and then we can just write it out to the client without doing any allocations along the way. That means that obviously we save a lot of work, and it also means that we get a lot of interesting benefits. The operating system itself handles caching, the GC is not involved because there are no managed objects being allocated, and that reduced the overall cost dramatically.
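To make the zero-copy, no-parsing idea concrete, here is a toy version. The layout is invented for illustration; the real blittable format is considerably more sophisticated:

```csharp
using System.Buffers.Binary;

// Toy illustration of the zero-copy idea. Suppose a document starts with an
// offset table of two int32s per field (start, length), followed by the field
// bytes. Reading a field is then just offset arithmetic over the raw buffer:
// no parsing, no object graph, and it works directly on a memory-mapped region.
static ReadOnlySpan<byte> ReadField(ReadOnlySpan<byte> document, int fieldIndex)
{
    int entry  = fieldIndex * 8;
    int start  = BinaryPrimitives.ReadInt32LittleEndian(document.Slice(entry, 4));
    int length = BinaryPrimitives.ReadInt32LittleEndian(document.Slice(entry + 4, 4));
    return document.Slice(start, length);   // a view over the original bytes
}
```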
We also did a whole bunch of things wrong, intentionally. We use the workstation GC, and we set all sorts of parameters to make sure that the GC happens very frequently and is as expensive as possible. To give you some idea: when we switch our benchmarks from workstation to server GC, we gain 30,000 requests per second just from that change. But making sure that the GC cost was highly visible to us was very important, so that during development we would pay attention to it. Because it's very easy to think: oh, I'm just doing a new, it costs nothing. And you're wrong. It costs nothing right now; at some point in the future, we're going to pay for it, and unfortunately, sometimes with interest. So, in order to control managed memory usage, there are a bunch of options. Object pooling is the most common one; array pools are the most obvious. But we also have pooling for common-use objects. I have a class that I use to say where a particular piece of data is on disk, and I keep a whole list of them around and reuse them over and over. They end up in gen 2, and the GC doesn't really touch them most of the time. There is also lots of use of structs, of the lower-level features of C#, in order to really control what's going on. The way we moved RavenDB from the .NET Framework to .NET Core was to basically start a new solution and move one item at a time. Effectively, that was one major refactor, because every time we moved a feature, we applied all of these considerations to it. OK, we are writing a document to the network: what allocations are we generating? We would say: OK, you made the test run; now run it under a profiler, check the allocations, the hot paths, all of that. And that allowed us, in a very granular fashion, to optimize specific things. Memory is still an issue, though. So what we have done, we decided that for the most part, I don't want to use managed memory for the most common things I'm doing. I'm going to move that into native memory, and I'm going to manage it myself. The idea here is that this gives me a lot more opportunities for optimization. For example, when I process a request, I allocate some memory up front, and anything that I need, I allocate from that arena; here we're talking about document objects and byte buffers and stuff like that. At the end of the request, I free this memory by just setting the point where I should start allocating next back to the origin of the buffer. The idea is that now, instead of having to do any sort of GC, I just effectively clear the arena and start from scratch. The key, from our perspective, is that we still get the benefits of a GC language. All of those buffers in unmanaged memory are actually being held by an instance of a managed class, which has a finalizer. That frees us from having to worry about some edge cases. I can say: OK, I'm allocating this memory, I'm going to work with it, and in the worst-case scenario, I missed some edge case, an exception was thrown, I didn't dispose it properly, or something like that. Then that piece of memory is going to be leaked temporarily, but because it is being held by a managed object that has a finalizer, it is eventually going to be freed. And that means we have a really nice cushion; I don't have to worry about every little bit.
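Here is a minimal sketch of that arena idea, assuming one arena per request, and using the modern NativeMemory API for brevity (at the time this would have been Marshal.AllocHGlobal; the real allocator is far more elaborate, and this requires compiling with AllowUnsafeBlocks):

```csharp
using System.Runtime.InteropServices;

// Minimal per-request arena: bump-pointer allocation over native memory,
// with a finalizer as the safety cushion described above.
public sealed unsafe class Arena : IDisposable
{
    private byte* _start;
    private int _used;
    private readonly int _size;

    public Arena(int size)
    {
        _size = size;
        _start = (byte*)NativeMemory.Alloc((nuint)size);
    }

    public byte* Allocate(int bytes)
    {
        if (_used + bytes > _size) throw new OutOfMemoryException();
        byte* p = _start + _used; // just advance a cursor, no GC involved
        _used += bytes;
        return p;
    }

    // "Freeing" everything is resetting the cursor back to the origin.
    public void Reset() => _used = 0;

    public void Dispose()
    {
        if (_start != null) { NativeMemory.Free(_start); _start = null; }
        GC.SuppressFinalize(this);
    }

    // The cushion: if Dispose is missed on some edge case, the finalizer
    // still returns the native memory to the OS eventually.
    ~Arena()
    {
        if (_start != null) NativeMemory.Free(_start);
    }
}
```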
Now, from the point of view of the architecture, we have an arena allocator and we have a context: every request has a context it uses, and once the request is done, we can reuse that memory. From the point of view of the coding, we effectively have a budget, and we measure allocations: every time I run this code, how much more are we allocating? How much managed memory? What is the retention of those objects? All of that. It's funny, because we did that seven years ago, and you can see some of these things happening now, with arena-style allocators coming to .NET Core and with spans now being very commonly used everywhere in the APIs, which is something that we found amazingly good for us. Now, in terms of performance, something that I really want to emphasize: we effectively started everything from scratch, but one of the things we decided was that we don't want to do everything ourselves. Instead, I'm going to look at what the platform and the operating system are doing, and I'm going to base my system on their heuristics. For example, with managed memory, we know that there are two types of good allocations. Either you have a very short-lived allocation that is gone very quickly, which stays in gen 0 at relatively low cost, or you have a very long-lived allocation that is held for minutes, hours, days, and so on (minutes is not that long, by the way; think hours or days for long allocations), and those are good because the GC can mostly just ignore that piece of memory. Anything in between is a bad idea. So if you structure your code explicitly to be modeled around that, you get a lot of performance advantages just because you're matching the expectations of the underlying platform. And by using, for example, the page cache of the OS, we're able to have no caching code of our own, and we get decades of experience of how to manage a cache, how to balance load, of course, across the entire system, not just my particular process, and so on. OK, another example of how we can use those underlying assumptions: let's talk about memory hierarchies. In your system, you typically have the hard disk, then main memory, and then the L3, L2, and L1 caches in the CPU, and those end up actually mattering quite a lot. So we built a routing system, and that routing system was built specifically to fit into the processor cache, and that ended up being a major performance boost. I want to show you some of those numbers. This is the MVC routing, and this is a fairly old implementation, but you can see that it does things fairly well: we are processing 2.5 million requests at under a millisecond per request. Now, this is not our code, and again, this is 2015-era code, but it is not code that has been written for performance's sake. Next, let's look at Nancy's routing. Nancy's routing uses a data structure called a trie, and you can see that this is much, much better: it handled the same number of calls in, I don't know, 2% of the time.
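The rewrite described next is trie-based, so roughly speaking, a flat, allocation-free routing trie looks like this. This is an illustrative sketch under the assumption of ASCII-only routes, not RavenDB's actual router:

```csharp
// All nodes live in two contiguous arrays, so a lookup walks a small,
// predictable region of memory that tends to stay in the L1/L2 caches,
// and matching a route allocates nothing.
public class RouteTrie
{
    private const int MaxNodes = 1024;           // fixed capacity, sketch only
    private readonly int[,] _children = new int[MaxNodes, 128]; // ASCII fan-out
    private readonly int[] _routeId = new int[MaxNodes];
    private int _nodeCount = 1;                  // node 0 is the root

    public void Add(string route, int id)        // id must be > 0; ASCII routes
    {
        int node = 0;
        foreach (char c in route)
        {
            int next = _children[node, c];
            if (next == 0)
                _children[node, c] = next = _nodeCount++;
            node = next;
        }
        _routeId[node] = id;
    }

    // Walks the path character by character: no allocations, no virtual dispatch.
    public int Match(ReadOnlySpan<char> path)
    {
        int node = 0;
        foreach (char c in path)
        {
            if (c >= 128) return -1;             // sketch handles ASCII only
            node = _children[node, c];
            if (node == 0) return -1;            // no such route
        }
        return _routeId[node] == 0 ? -1 : _routeId[node];
    }
}
```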
Now, we wrote the whole system using the same algorithm as Nancy, using a trie, but we did it with zero allocations, and with an eye to the memory structures of the hardware. So in this case, we know that we are jumping into a well-known cache line inside the trie, and that means it's probably going to be in the L1 or L2 cache. That means we get a major speed boost for the overall system, and you can see that we spend less than one microsecond per routing call. And just to be clear, this is measured under the profiler, so we are actually paying quite a lot for that, and we are still being very, very efficient. Now, one of our requirements was case-insensitive routing: it doesn't matter what casing you use, I want to route you to the right location. Case-insensitive string matching is really expensive. It's expensive if all you're handling is ASCII, but if you want to do it on the whole Unicode range, it's ridiculously costly. But we made an observation: for the most part, the actual casing that is being used is always the same. So it's actually cheaper to do a case-sensitive search first, and only if we don't find anything, do a case-insensitive match and add the newly seen casing to the case-sensitive list. We effectively do memoization, and very quickly we learn the actual casings of the routes that are being used, and we get major benefits out of that. I mentioned that things are expensive, and they are expensive for a bunch of reasons. One of them is allocation, but there are also certain simple operations that are actually quite costly. Changing an EndsWith call to an explicit character check gave us 2,000 requests per second, just that. And we had multiple rounds like that, where we basically go and look at the profiler results: what's expensive? Let's fix that. What's expensive? Let's fix that. I think that in this case, what happened was that EndsWith needs to handle a string of any length, so there is a whole bunch of code that runs there. The explicit check was basically just inlined, the bounds check was probably elided, and we had a one-byte or one-character comparison, and that was it. And the savings, as you can see, were substantial. Strings are also really, really expensive, because of allocations. So what we ended up doing is effectively creating our own string type. The underlying backing store for the string is in unmanaged memory, inside our arena allocator, which means that we can allocate a string very, very cheaply (there is no GC), and we have a whole bunch of other smarts going on to reduce the cost even further. In our old system, if we opened a memory dump, we would typically see that the primary source of memory utilization in our system was strings. And there have been some reports from Stack Overflow where about 70% of the memory cost was strings, and we were able to cut all of that out. Another benefit for us is that our string type is UTF-8 instead of UTF-16, so that also reduced some of the memory usage, in the end. We had a whole bunch of issues like this, where we worked with the CoreCLR team to optimize how the system works. Here is an example of that work: OK, let's recognize this pattern (it's very common in hashing; people keep writing code like this) and compile it down to the best underlying assembly instruction.
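The slide with the pattern isn't reproduced in the transcript, but a classic example of this kind of JIT pattern-matching, and an assumption about which pattern is meant here, is the rotate, which shows up in virtually every hash function:

```csharp
// Hash functions rotate bits constantly. Written naively, a rotate is two
// shifts and an or; the JIT learned to recognize this exact shape and emit
// a single rol instruction instead.
static uint RotateLeft(uint value, int count) =>
    (value << count) | (value >> (32 - count));

// Since .NET Core 3.0 there is also a direct intrinsic for it:
// uint rotated = System.Numerics.BitOperations.RotateLeft(value, count);
```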
It's funny, because sometimes, for very specific hotspots, you would see us writing some code in C#, then going and looking at the assembly, and then playing with the C# code to see what the JIT generates and how it behaves. And we had several cases like this where we actually talked to the CoreCLR team to get better results and better codegen for the whole platform. Something else, if you care about performance: go read the Roslyn coding guidelines. It's really interesting, because you get to see some of the more common issues. And just changing something like a LINQ query to a for loop gives you 3,000 requests per second. Now, again, RavenDB is very demanding, very high performance; we are absolutely willing to accept a high degree of complexity in order to get better performance. The Roslyn guidelines also aim for high performance, but don't really go that far to get it, so they might be something that you want to apply in a more general sense. I want to give you one example. There's a whole bunch of things you can do to deal with low-level optimizations. Here is one: take a moment to look at this code. What we do here, we are basically doing a string comparison, to find whether a particular string equals or does not equal a particular value. This code is utterly unreadable, but it allows me to do the comparison on a whole string in three operations, three CPU instructions. And that means that I'm able to reduce specific hotspots significantly. I don't recommend doing this in a general way, but when you see a hotspot, it might be worth it, because the benefit can be quite amazing.

RavenDB is a database, and one of the key issues we have to deal with is I/O. I/O is very slow, and that's critical for performance. So one of the things we have done is to effectively not pay for every single I/O, but pay in bulk: take ten writes to the disk and send them in one shot. It's interesting, because the cost of writing 16 kilobytes or 128 kilobytes is roughly the same, which is great to take advantage of.

Hey, Oren, I just wanted to let you know we are right on time.

That's great. So, one last thing: I love that the CoreCLR is doing a lot of internal optimizations that are small, but they aggregate to a lot. And the funny thing that I want to say is that a lot of what we achieved wasn't so much the stuff that we did, but understanding the platform, and actually getting things right for the assumptions that the platform is making, which gives us much better performance. And that's it.

Oh, wow, that was awesome. Thank you so much. Yeah, so we were looking at the chat and, again, there were no questions, because everybody was just learning. I mean, we were just discussing your EndsWith optimization and we were like, wow. Just something as simple as that, and the perf you gain, it's pretty impressive.

Yeah, it's funny, because when you look at the profiler, you know that you're going to see things you don't expect. Are you actually telling me that this single line of code hides that much complexity? The lambda example is even worse than that, because you don't notice it. I had a case where you do items.Where(x => otherItems.Contains(x)). And you pay for the delegate allocation, you pay for an O(N) search, and sometimes something like this can hide N-squared costs.

Right. And then you're like, oh yeah, it's ridiculous.
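Spelled out, the lambda example from the discussion looks like this (a sketch; the variable names and the hash-set fix are illustrative):

```csharp
using System.Collections.Generic;
using System.Linq;

var items = new List<int> { 1, 2, 3 };
var otherItems = new List<int> { 2, 3, 4 };

// Looks like one innocent line, but it allocates a delegate and an iterator,
// and for every element of items it does a linear scan of otherItems:
// O(N * M) hiding in plain sight.
var both = items.Where(x => otherItems.Contains(x)).ToList();

// One way out: pay O(M) once to build a hash set, then each probe is O(1).
var lookup = new HashSet<int>(otherItems);
var bothFast = new List<int>();
foreach (var x in items)
    if (lookup.Contains(x))
        bothFast.Add(x);
```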
But yeah, something like that, 2,000 requests per second, just like that. Yeah, wow. Yeah. And one thing to note: if you do stuff like that, be aware, it's addictive. Oh, sure. Yeah, I mean, it's just like you said, it's a slippery slope. All right, we had one question here, but I was wondering if the person who asked it can ask via Twitter, since we have to go to the other speaker, because we're trying to stay on time. So, Oren, thank you so much for taking the time to speak. This was amazing. Like always, we love to hear you talk and share your knowledge. So, thank you so much. Everybody, we're continuing with the program: we're gonna get Halil up here and going. So again, Oren, thank you so much, and we'll be right back. Thank you very much. Thank you. Goodbye. Yep, thanks.