OK, good evening. My name is Nida. I lead the technical team of a company called ActiveViam. I have worked for this company for the past 10 years, and I have been based in Singapore for the past seven. I will walk you through this presentation and tell you the tale of a collaboration between my company and Oracle, specifically the team in charge of the JVM, and the effort we made to make Java 9 better on the memory side, in order to manage a very large heap.

We build our own product, which is called ActivePivot. ActivePivot is an in-memory aggregation engine; you can also see it as an in-memory OLAP engine. I think it is the fastest OLAP engine on Earth as of today, since we hold everything in memory, and we have done so since the beginning. I will give you some use cases later on.

But back to our collaboration with Oracle. Our partnership started two years ago, when one of our clients asked us: are you able to handle a project which requires 16 terabytes of data to be held in memory? That was a serious challenge, and the hardware vendor on the other side was also Oracle, so we formed a joint team to tackle the issue. Before that challenge, we used to play with at most 1 terabyte of RAM, on servers with 1 terabyte of RAM. Then we moved to 16 terabytes. One may ask: what type of machine is that? This type of machine is called the Oracle M6. You can find equivalents from Huawei, IBM, Silicon Graphics, et cetera; nowadays every hardware vendor offers servers with such a huge amount of RAM, many terabytes. So we took that challenge in 2015, we succeeded, and we were selected to present it at the JavaOne conference in 2015. At that time, we were playing with JDK 8. Then we moved to another challenge. From that first challenge, we learned a lot.
And then we said: let's improve the performance metrics we have, and let's do it on Java 9, since the JVM engineers were working on it. They gave us an early release, and we started playing with JDK 9 in 2016. We felt blessed, because we were in close collaboration with the few people who actually write the code of the JVM. The credit today goes to Thomas Schatzl, who is based in Austria and is part of the JVM team. He was the team leader, and what I am going to show you here is part of his work.

In-memory computing. Some people could say: OK, everything has been in-memory since the beginning; we have been loading data into memory for the past 20 years. However, if you look at the price of RAM, between the '80s and today it dropped by a factor of about one million. So today you have the opportunity to hold the whole data set in memory, whereas before we were holding only chunks of data, because memory was so expensive. I won't detail all the products you see there, but some of them, like Spark, which is quite popular nowadays, are fast precisely because of memory: Spark gives you the ability to store the RDDs, the intermediate results, in memory to speed up the computation. SAP HANA does the same for analytics. We do the same: we hold the whole data set in memory, so that when you fire a query at our engine, we can give you the response in sub-second to a few seconds.

For our work, we focused on a use case for the banking industry, because the first challenge came from a French bank. We built a use case on the investment side called credit risk. For those who are in the banking industry: credit risk requires a lot of data, because you consume a lot of Monte Carlo simulations, and then, for every time point, you have to simulate the counterparty default and reprice all your positions.
And you do this not over a few days; you do it over a couple of years, for every single day. So you end up with a few terabytes of data. What we did is aggregate all of that, and then fire queries against that use case. That project was our baseline, which we gave to the JVM engineers, allowing them to play with it and tweak their code to reach the best, optimal performance.

In terms of use cases for in-memory computation: today, having an in-memory engine gives you an advantage over the competition. If you are in the financial industry and you give such reporting to your end users, a trader can run a simulation before booking a trade, and compute all the analytics on the spot instead of waiting for the end of day or for tomorrow's batch. On the e-commerce side, any website with an online store can listen to the prices of the competition, and every time a price changes, recompute and change its strategy. A market maker can change strategy and try to get rid of the stock, or go flat. This is quite crucial for your business. In the supply chain area, for instance, one of the interesting use cases we did was for a company that has to move cars from the producer to the resellers. As a mover, you have an SLA: you are transporting cars, not paper, and there are clients at the end waiting for their car. If you are late, you have to pay for it. With such an engine you can decide on the spot: what if there is a strike in this country? Then I bypass this port and deliver somewhere else, et cetera. This gives you the ability to decide on the spot.

So, to aggregate a huge amount of data, you definitely need a language and a technology that give you the ability to implement complex calculations and are versatile enough.
And you're looking for a huge community that will push the solution forward. This is why, when we started our product, we directly chose Java. That was 12 years ago, and I think we made the right choice. At that time, we were hesitating between C++ and Java. But you are looking for a safe language, one where a crash doesn't take down the whole server, and for something flexible enough. This is why Java was the good choice at that time, and still is. However, the problem with Java when we started 12 years ago was garbage collection, because its scalability is not limitless. That gave the language a somewhat bad reputation, especially among people who are not using Java and kept the image of Java 1.1 or 1.3: Java is very slow, applets suck, et cetera. Some people are still in that mindset. However, since then a lot has evolved, and Java has become really competitive.

In two minutes, I will summarize all the effort we did in R&D, like 10 years of effort, so some concepts may look a bit weird or you may never have heard about them; do not hesitate to ask. I will summarize in two slides a talk that we gave a year and a half ago at a meetup. Of course, you have to rely on the evolution of Java, but you have to make some effort in your own software as well, and this is what we did.

First of all, we took control of the memory allocation. Maybe you have heard about off-heap memory in Java, the idea of not keeping everything in the heap. For instance, in the use case I mentioned, we had 3 terabytes in the heap and 8 or 9 terabytes off-heap. We decided that all the structures of our in-memory database would live off-heap. For that, we implemented our own memory allocator, our own malloc. If you want to look into it, you can check sun.misc.Unsafe, and you can check the NIO direct buffers.
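As a minimal sketch of the off-heap idea using the NIO direct buffers just mentioned (this is an illustration, not our actual allocator, which is custom): the JDK lets you reserve memory outside the garbage-collected heap and read and write primitives into it yourself.

```java
import java.nio.ByteBuffer;

public class OffHeapDemo {
    public static void main(String[] args) {
        // Reserve 1 MiB outside the Java heap; the GC never moves or scans
        // this memory itself, only the small ByteBuffer wrapper object.
        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);

        // Store 8-byte longs at fixed offsets, like a tiny column of a table.
        for (int i = 0; i < 100; i++) {
            buf.putLong(i * Long.BYTES, i * 10L);
        }

        // Read them back: no Java object is created per value.
        long sum = 0;
        for (int i = 0; i < 100; i++) {
            sum += buf.getLong(i * Long.BYTES);
        }
        System.out.println(sum); // 10 * (0 + 1 + ... + 99) = 49500
    }
}
```

The point of this layout is that the values never become individual Java objects, so they put no pressure on the garbage collector.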
All those packages, which are part of the JDK, give you the ability to allocate memory yourself. But you have to do it wisely, and you have to know what you are doing, because otherwise you can crash the whole server.

We also took advantage of everything the java.util.concurrent package provides. We used the fork-join pools, we used work-stealing techniques, and most of our structures are lock-free: we use compare-and-swap a lot. That way we minimized contention and maximized parallel work. We also partitioned the data, the way we store it, so that we give enough work to all the cores; we did the partitioning among all the available CPUs, which maximizes parallelism as well.

Even more than that, we leveraged the NUMA architecture, non-uniform memory access. The huge new servers don't have a single memory bus anymore; they have what we call NUMA nodes. You can see it this way: you have a memory chip associated with a CPU. Here, for instance, you can see an example with four NUMA nodes. We used several techniques to keep the data within a NUMA node, and we jailed, we pinned, all the threads working on that data to that node. For that, we made calls to native libraries like libnuma and libpthread, through something called Java Native Access (JNA), which is much more flexible than JNI: you don't have to write C, you just write Java code.

That was for the software you write, if you want to leverage the language as much as you can. But the language also has to help you a bit. Garbage collection did not evolve as applications and hardware evolved over the past years, and at a given stage it became a bottleneck. This is why you will see changes in JDK 9. By the way, how many people are already using Java 9? Java 8? Java 7? Still?
OK, so Java 8 mostly. Some of the garbage collection algorithms are deprecated in JDK 9, and the default garbage collector is now G1GC. I will describe it later on. The idea is to have a garbage collection algorithm whose pause time is not proportional to the heap size. Because if you start a JVM on a heap of a few terabytes with pause times proportional to the heap, you will suffer a lot: your users won't be happy, you will have long stop-the-world pauses all the time, and you will have to restart your application. This is why G1GC is now the default garbage collector, and we will see why, and the effort that has been made.

G1GC deals with regions. You no longer have one fixed block for young, two blocks for survivors, and one for the old generation. Instead, the memory is split into regions, and every region can play a different role: a region can be young, survivor, or old. What you see in purple is what we call a humongous region: you may have an allocation so large that it takes up one or more entire regions. Good to know if you have some tuning to do. Of course, you can change the size of a region. This is the garbage collector we played with in 2016.

So let's look at the mechanics of G1GC, how it is supposed to work, before I tell you about the effort made by the Oracle JVM engineers. When you start your application, you start allocating objects. The short-lived objects are allocated, as you know, in the young, or Eden, regions. From time to time the Eden regions get full, and then you have to do a young collection. Every time a young collection happens, you get the pause that you see here marked STW, which means stop-the-world: you are pausing the application.
That is what G1GC does: it pauses the application to do a young collection. And when we talk about a collection, we mean evacuation: we take all the live objects, we evacuate them somewhere else, and we leave the garbage behind. Most of the time, in a normal application, one without a big cache of long-lived objects like ours, what you keep in terms of live objects is a small fraction: you use an object, then you throw it away. So you leave a lot of garbage behind, and this is why the young collection threads keep cleaning and evacuating your objects from one region to another. When a live object is evacuated from Eden, it is moved to what we call a survivor region; technically a survivor region still belongs to the young generation. And so on and so forth: your application keeps running, and you keep moving objects from Eden to survivor regions. At a given stage, the long-lived objects end up in the old generation, the tenured space, and they stay there. The young collection does not care about those long-lived objects that ended up in old regions; we leave them there until the IHOP threshold is reached. IHOP stands for InitiatingHeapOccupancyPercent. It is a parameter you can play with, 45% by default, and it means that once heap occupancy reaches that threshold, the funny part starts.

Now we pause, and in that pause we do what we call the initial mark: we start looking at all the live objects in the old generation. So there is a pause here, and then we start marking all those objects. The interesting thing is that this marking happens concurrently with your application threads: here we do not pause the application. And as you can see, from time to time, while the marking threads are working, you will still do young collections.
Young collections will still happen: small pauses, and young collections. Once the concurrent marking stops, you do what we call the remark, where you check all the live objects that you marked, including the new objects created between the initial mark and the remark. Because the marking happens concurrently with your application threads, you potentially create new objects in the meantime. After the remark, we collect statistics about the old regions and start choosing the regions we are going to clean, prioritizing the regions with a lot of garbage. Maybe some of you know it already: G1 is short for Garbage First. The regions with the most garbage get cleaned first. The cleanup phase is where we decide which regions are going to be collected. Then, just to free as much as possible, there is one last young collection, followed by what we call mixed collections, which collect all those regions that were selected from the old generation. And then back to square one.

So this is the whole cycle of G1GC operation: we start with normal young collections, then the marking, and at the end the mixed collections. The very important thing here is that we never brutally stop the application to do a full GC. A full GC on a heap of a few terabytes is devastating, believe me, we had that, and you have to restart the server. As long as you only have these mixed collections, you are fine; if you get a full GC, it freezes everything.

OK, before I detail this: the effort made with the Oracle engineers was first on the young collection, because we noticed with this benchmark that the young GC was not as good as we expected.
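One simple way to watch this cycle from inside an application (my own observation sketch, not something from the talk) is the standard GarbageCollectorMXBean API. On a JVM running G1 it typically exposes one bean per generation, with names like "G1 Young Generation" and "G1 Old Generation"; the exact names depend on the JVM and the collector selected.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcWatch {
    public static void main(String[] args) {
        // Allocate a lot of short-lived garbage to provoke young collections.
        for (int i = 0; i < 1_000_000; i++) {
            byte[] junk = new byte[128];
        }
        // Each bean reports a cumulative pause count and total pause time
        // for one collector (e.g. young vs old/mixed under G1).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + ": count=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
    }
}
```

This is coarse compared to GC logs, but it is enough to chart pause counts and times per generation, which is essentially what the benchmarks below measure.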
Actually, when we first started playing with G1GC on our use case with a few terabytes, some parts of the G1 algorithm scaled pretty well, but not all of them. This first benchmark I am sharing with you was made on a static data set. By static, I mean that I loaded ActivePivot, which is the name of our database, with a fixed data set, and then started firing queries: short queries, queries with some calculation, and queries doing what we call a full scan, which touch terabytes of data. What you see here are the garbage collection pauses. For the short queries it was quite OK, five seconds on average. But when you run something complex, you get pauses of 30 seconds, and here I am talking only about young collections, not full GCs. Thirty seconds is too much for an interactive application, and interactivity is what we claim for our product: everything is in memory, and, remember the use cases I mentioned earlier, we want you to take a decision on the spot. If you have to wait 30 seconds to take a decision, it is not interactive anymore. So we started collaborating on making those metrics better. Since this happened on a static data set, where we load everything and then let the queries run, almost every garbage collection was for the short-lived objects, the transient memory: you fire a query, it fills the heap with intermediate results, and then you collect that.

So let's see the improvement made on that part, the young collection. First of all, how do the GC threads work? They start from the root objects and resolve all the outgoing references. Every time you find a reference, you cannot immediately evacuate the live object you found; you cannot evacuate everything at once.
Instead, you have to enqueue. The GC threads keep enqueuing the references they discover into buffers: each thread has a private buffer, and there are public buffers. Why both public and private? Mostly to get a smoother work balance. Here is what used to happen. Say we have two GC threads working: thread one has its public buffer full and its private buffer full, while thread two has finished with both its private and its public buffer. This mechanism already existed before we asked the Oracle engineers for improvements: they were using work stealing, so a thread that is done with its own work does not stay idle and starts helping the other threads. But the work stealing happened only on the public buffers, and you could get flooded in the private buffer. Maybe you have seen this in the past: all the threads idle, a single GC thread doing the job, and all the others just watching it, like, OK, do the job now. That was not good. And in particular, when you hit huge arrays, all their references were copied into the private buffer, and that was part of the cause of the flooding. So what they did is change that: they now process those huge arrays chunk by chunk, and they added a way of refilling the public buffers from the private buffer, so work can be handed to the other threads that want to help. With that effort, the same benchmark that you saw in red for 2015 now gives something more predictable, and we have fewer pauses, especially in the second and the third case. So this was really interesting for us, just from the effort made on the young collection.

However, for our use case and our product, we do not deal only with static data. What our customers want is intraday data coming in, transactions happening, someone booking a trade, or people buying on an e-commerce website.
And then you process the order. So there is a mixed workload: you have reads and writes. The test we did at that time was read-only: you load everything, and then you run queries. In 2015, when we played with read-and-write, nothing worked. So we decided to improve that by enhancing the parallel marking phase. We noticed that when you have intraday activity, reads and writes happening in your database, a mixed workload, the concurrent marking phase, if you recall, this part here, was taking ages. Meanwhile there is activity, young collections trying to happen: you fill the young generation, you fill the tenured space, and at a given stage everything gets stuck. You end up with a full GC, and everything crashes. That pushed us to put some effort into the collection of the old generation, and especially into the parallel marking.

Just like the garbage collection threads, we have marking threads. The marking threads do a similar job, but they do not evacuate objects from one region to another. Here we are zooming in on one marking thread looking at a heap region, which is divided into what we call cards. Every marking thread looks at the different objects, and when it finds a live object, it has to record somewhere: in this region, I found a live object. The way we remember those live objects is a bitmap: every 512-byte chunk, a card, is down-sampled to one bit in the bitmap, so if I find a live object in a card, I turn on the corresponding bit. With one marking thread, or a few, that is OK. But on a huge deployment, that bitmap becomes a bottleneck, because every marking thread had its own copy of the bitmap.
This is why G1GC was, until not so long ago, not well sized for a lot of threads, many cores, and a huge memory on the server. To tackle that, the Oracle engineers moved to one single bitmap shared across all the marking threads. Because if every thread has its own copy of the bitmap, at the end of the day you have to merge them, and those per-thread bitmaps themselves cost memory, which is exactly what you are trying to collect; the merge was itself a bottleneck. So they changed the design completely: only one bitmap, shared across all the threads, with a lock-free mechanism to access it, to maintain it, and to remember all the live objects. That was the first improvement to the parallel marking.

The second improvement: like the GC threads, the marking threads have a private buffer and a public buffer to improve the work balance. What we noticed is that the public buffer was guarded by a mutex, a very simple structure that allows only one thread in at a time. Just by replacing that with a lock-free kind of structure, and I'm not saying lock-free means contention-free, there is still some contention, we got a 50-times improvement in the parallel marking. As you can see here, we were then able to run our mixed workload, reading and writing at the same time. On the right, you can see, for short queries and for full scans, read-only with static data versus read-and-write, where we were changing one terabyte of data every minute to simulate very heavy activity; we were a bit aggressive, to see how it behaves. And the read-and-write case is within a factor of two of read-only in the number of garbage collections, even though resources were also being used to reshuffle and refresh the data at the same time.
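To give a feel for what a lock-free shared mark bitmap can look like (my own illustrative sketch, not the actual HotSpot code), here is a bitmap where many marking threads can set bits concurrently with compare-and-swap, retrying only when another thread races on the same 64-bit word:

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Illustrative shared mark bitmap: one bit per "card".
// Threads mark cards concurrently without any mutex.
public class MarkBitmap {
    private final AtomicLongArray words;

    public MarkBitmap(int cards) {
        this.words = new AtomicLongArray((cards + 63) / 64);
    }

    /** Sets the bit for a card; returns true if this call flipped it. */
    public boolean mark(int card) {
        int word = card >>> 6;          // which 64-bit word
        long bit = 1L << (card & 63);   // which bit inside it
        while (true) {
            long old = words.get(word);
            if ((old & bit) != 0) {
                return false;           // already marked by another thread
            }
            if (words.compareAndSet(word, old, old | bit)) {
                return true;            // our CAS won the race
            }
            // CAS failed: someone else updated the word; retry.
        }
    }

    public boolean isMarked(int card) {
        return (words.get(card >>> 6) & (1L << (card & 63))) != 0;
    }
}
```

The design point is the same as in the talk: a failed CAS just retries on one word, instead of every thread queuing behind a single mutex for the whole structure.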
So, all that effort: actually, our effort was to build the application and to ask the Oracle engineers for help, and the job they did allows us today to build applications that do not fear using a few terabytes of RAM. Today, if you want a machine in five minutes, you can get one with at least two terabytes of RAM on AWS: the X1 is a machine with two terabytes of RAM. It is just as easy on Azure; on all the cloud providers, you can get a virtual machine with a few terabytes. Of course, it has to address a use case where you are looking for speed. If you can do it with SQL, and your client or the person sponsoring your project doesn't care about taking the decision right now, then don't go down this route. But if you are looking for performance, Java can do it.

And at the end of November, there is, let me read this, sorry, your stuff is blocking me, anyway: there is a new garbage collector made by the Oracle engineers that will probably be open-sourced, named the Z Garbage Collector, ZGC. It claims pauses of around 10 milliseconds on huge heaps; their mindset now is to target huge heaps. Red Hat as well has their own garbage collector, Shenandoah, which will compete with this ZGC.

That's it for me. Thank you for listening. Any questions? I know it's a heavy topic, but feel free; if you have any question, don't be shy, I'll try to answer it. Yes, please. Can you comment on Azul Zing? Excuse me. Yes, we played with the Azul VM, actually I played with it a few years ago. Their JVM is called Zing, and their garbage collector is called C4. It is pauseless. However, most of the time you use a lot of CPU to do the garbage collection behind the scenes.
And on our product, when we ran a benchmark comparing the Oracle JDK that you can download for free against Azul Zing, which you have to pay for, we noticed that the pause times were really good on Azul, but the end-to-end response times for our queries were better on the Oracle JVM. Of course, they are famous for that, and I am pretty sure they are racing as well against ZGC and the other garbage collectors to make Java better. In the end, what counts is what you will have here, in OpenJDK. Competition is always good. Yes, please.

What were the GC pauses like before you moved to G1GC for your product? Good question. When we started dealing with big data sets, and for a long time before that, I was working with the Parallel Old garbage collector. We were seeing maybe tens of seconds, but the data set wasn't big then. By the time our clients started challenging us on the size of the data we could handle in memory, the JVM had already improved. Today, my clients use JDK 8 by default; most of them, at least in Singapore, are in the banking industry, and as you know, banks take their time to move from one version to another. It is easy for me to convince them to turn G1GC on. By the way, G1GC is good on JDK 8, even better on JDK 9; on JDK 7 it was so-so. It started with JDK 7, I think as an experimental collector. So if you are on JDK 8 and you have issues with your garbage collection, give G1GC a try. By default, the documentation will tell you: just turn G1GC on and do not tune it, it will work. But believe me, you do have to make some effort when you start tuning, and you have to understand what is happening there.
There are some interesting flags: you can set a pause-time target, the maximum pause you are willing to accept; you can bound the number of mixed collections used to reach everything you have to collect; or you can resize the regions. For example, if you have humongous objects spanning two regions, you can make your regions bigger, and then you don't have humongous issues anymore. But the best thing you can do when you start playing with G1GC is to turn all the logging on. It is actually quite explicit; of course, you have to understand the whole mechanics I was describing, but the logging is better than what you were probably used to with CMS and the older collectors. OK. Thank you. Thank you.
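For reference, the kinds of flags mentioned in that last answer look roughly like this; the values are illustrative examples, not recommendations, and the logging flag shown is the JDK 9 unified form:

```shell
# Illustrative G1 tuning:
#  - MaxGCPauseMillis: pause-time goal G1 tries to meet (a target, not a guarantee)
#  - InitiatingHeapOccupancyPercent: occupancy (%) that starts the concurrent marking cycle (IHOP)
#  - G1HeapRegionSize: bigger regions mean fewer humongous allocations (power of two, up to 32m)
#  - G1MixedGCCountTarget: target number of mixed collections after a marking cycle
#  - -Xlog:gc* is JDK 9 unified GC logging; on JDK 8 use -XX:+PrintGCDetails instead
java -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=200 \
     -XX:InitiatingHeapOccupancyPercent=45 \
     -XX:G1HeapRegionSize=32m \
     -XX:G1MixedGCCountTarget=8 \
     -Xlog:gc* \
     -jar app.jar
```

Here `app.jar` is a placeholder for your own application.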