So, hi everyone, and thank you for coming and thank you for staying late. Last year one of our customers came to us with quite a challenge: they wanted to run our software, our Java software, on more than 10 terabytes of memory. And we met that challenge. So if you're here tonight, I expect that you like this kind of challenge yourself. First, you're in the right place, and secondly, yes, this is possible. Tonight we'll tell you all the steps we went through to do it.

But first, let us introduce ourselves. I'm Antoine Chambille, head of R&D at Quartet FS. With my teams in Paris and New York, we write ActivePivot. ActivePivot is the fusion between, I would say, an in-memory database and a calculation engine, and it's all written in Java.

My name is Anita Bouzid. I'm leading the APAC team for Quartet FS. We have offices in Singapore, Hong Kong and a small presence in Australia. That's all.

So here is our agenda tonight. We will describe the application that we are testing. Then we will talk about hardware: what does a server with 16 terabytes of memory look like? And then we will jump into the heart of the matter: how to operate a JVM on 16 terabytes of memory.

We worked on a financial application, a credit risk application, so give me just a few minutes to set the business context. Credit risk estimates the money that a bank may lose because of counterparties that default. Of course, there is no formula to tell you when a counterparty will default, so credit risk is estimated from probabilities and Monte Carlo simulations. A credit risk engine will simulate the market conditions in the future thousands of times, and for each time point along each of those simulated paths, it will calculate the value of all the positions in the bank. This gives you hundreds of billions of values, so terabytes of data, but when you aggregate them the right way they give you an estimate of the credit risk.

In the past, those kinds of systems were run as nightly batch processes, producing pre-aggregated reports in the morning. But this approach is now showing its limits. First, the financial regulators say that pre-aggregation and pre-canned reports destroy information. They say that risk analysts today should be able to freely select and filter the granular data and aggregate on the fly along any attribute they want. And in parallel to those regulatory requirements, the banks themselves want to do more than just static reporting with their systems. Their goal is to help their traders and their sales people and let them use the tool interactively, while they are talking with a customer, maybe. So it means transforming a batch system into an interactive system. And of course, when you aim for that level on terabytes of data, it becomes a problem of very fast aggregation of huge volumes of data. That's precisely the kind of problem that the ActivePivot platform solves.

So what is ActivePivot really? Very quickly, it's an in-memory database with all of the high-performance elements that you could expect from an analytical database: for instance, column stores, multi-dimensional aggregations accelerated by bitmap indexes, and of course multi-core calculations.
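To make the kind of on-the-fly aggregation we are talking about concrete, here is a toy sketch in plain Java (not ActivePivot code; the sizes and the exposure measure are purely illustrative) of how simulated values across Monte Carlo scenarios can be aggregated per future time point:

```java
import java.util.concurrent.ThreadLocalRandom;

/** Toy illustration only: aggregate simulated exposures per time point. */
public class ExposureAggregation {

    public static void main(String[] args) {
        int scenarios = 10_000;   // Monte Carlo paths (hypothetical size)
        int timePoints = 100;     // future dates along each path

        // values[s][t] = simulated portfolio value for scenario s at time point t
        double[][] values = new double[scenarios][timePoints];
        ThreadLocalRandom rnd = ThreadLocalRandom.current();
        for (double[] path : values)
            for (int t = 0; t < timePoints; t++)
                path[t] = rnd.nextGaussian();

        // Aggregate on the fly: average positive exposure per time point
        double[] expectedExposure = new double[timePoints];
        for (double[] path : values)
            for (int t = 0; t < timePoints; t++)
                expectedExposure[t] += Math.max(path[t], 0.0) / scenarios;

        System.out.println("Expected exposure at first time point: " + expectedExposure[0]);
    }
}
```

In a real risk engine the matrix has billions of cells, which is exactly why the aggregation has to be columnar, indexed and parallel.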
But what's more relevant today is that it's all written in Java. We made the choice of Java very early for our system, to do high-performance in-memory computing. And the reason for that choice is that Java is fast, or at least reasonably fast. But more importantly, it's also safe: I mean that Java will not corrupt your application when you make a mistake or an error. And also, Java is very accessible, with more than 20 million Java developers in the world, that's what I heard. And it's easy to extend. That's what made Java the right choice for the vision we had for ActivePivot, which is to bring the business logic close to the data and be able to do, on the fly and interactively, those calculations that used to take hours.

So this was our vision for ActivePivot in the beginning, but I can tell you that we never expected that we would be running Java on that kind of hardware. Here is a selection of large-memory servers that we have certified or that we are testing with ActivePivot, and each of those servers has a little backstory. For instance, the big Oracle SPARC server: that's the one everything started with. Last year, when Oracle heard about the challenge that our customer gave us, they joined forces with us and they helped us do it for the first time. They built a team of system engineers and JVM engineers, and they worked with us to make this possible. So it all started on the SPARC server.

Another of my favourites is the first one, the Bull bullion server, because it is made in France by Atos Bull. It's an interesting server because you can see from its look that it is modular: you can start with one module and a couple of terabytes of memory, then add modules when your application scales, or remove them and scale down if you need to. And Atos is an important partner for us: our R&D teams have been close for two years now, and Atos now has an appliance, an ActivePivot appliance, that they sell based on the bullion of course.

Then if you're looking at the maximum amount of memory, I believe the SGI UV is the current leader, because it can go up to 48 terabytes of memory per server. And the truth is that in a few months you will be able to multiply all of those memory quantities by two, when the new memory chips with 128 gigs per chip become available. So you know it's not going to slow down any time soon. And of course the big players, HP and IBM, also have their big-memory servers. Maybe that's the important information to take away: all the vendors now have a big-memory server. So there is some competition, the prices fall, and it means that those large-scale in-memory applications are becoming mainstream.

How many ways do those servers go? They are all about the same: IBM, Oracle, they use boards with four or eight sockets. So they're all interconnected, right? Yes, there is a special interconnect to combine several motherboards inside the server. But it's really one server each time: one operating system, Linux, one server, one shared memory address space. No cheating.

So when you look at those hardware configurations, there is something disproportionate about them, because here we're talking about terabytes of memory. If you ask Java developers what they think is the maximum amount of memory Java can handle, and you can find this discussion on some blogs, people will answer you maybe 100 gigabytes of RAM. Here we're talking about terabytes.
You also have to understand that ActivePivot is already deployed on production servers having one, two or three terabytes of RAM. That is really possible thanks to many efforts made by Antoine's team on the ActivePivot code itself. Let me walk you through this journey.

First of all, we had to start by minimizing the Java heap and managing the memory ourselves. How to do that? We relied on direct buffers. They have existed since JDK 1.4, and they allow you to allocate what we call off-heap memory, no longer in the Java heap. If you've never heard of them, in short a direct buffer is not much more than a primitive array. Let's have a look at a classic piece of code that allocates in the Java heap: I allocate an array of one million doubles and then do some operations on it. The same code can be written using buffers: I allocate some memory, but this memory is allocated off heap, and I do the same kind of operations. Those direct buffers are normally used for memory-mapped files and high-performance network streams, but in the case of ActivePivot what we want is just to use them as a malloc, for those who remember their C courses or who still practice C. We just want to use them to allocate memory.

Then what we discovered is that by using those direct buffers we had some overhead, because of the indirections, the bounds checks, the method calls. So we had to reverse engineer those buffers and keep only the minimum. In short, we ended up using sun.misc.Unsafe. Who has used that already? Who has never heard of it? Don't be afraid just because the name is Unsafe. A few years ago, if you said, hey, I'm using sun.misc.Unsafe, people would say, oh, this is not very orthodox. Nowadays it has become more common: just Google it and you will see that Oracle tried to remove sun.misc.Unsafe from JDK 9, and that this would have broken many of your favourite frameworks, like Netty and many others. I found a page listing all those popular frameworks using sun.misc.Unsafe. It's probably hidden from you, but they are using it behind the scenes. That tells you how popular sun.misc.Unsafe is in the Java community.

However, there is nothing magical here: you cannot put everything off heap. ActivePivot being an in-memory database, what we need to store is mostly primitives. This works really efficiently for columns, hash tables, indexes, but it doesn't work well, or doesn't work at all, for objects that would need to be serialized. ActivePivot also comes with its own API, and we didn't want to expose this low-level memory management to our users, to our customers, so we tried to find the right balance. We kept all the raw data used by the database off heap, and all the calculations and all the queries that are fired at the database run on heap. By doing that, we got the right balance: off heap for the raw data and the performance, on heap for everything else. And on one use case, the CVA use case just described, we used three terabytes for the heap and twelve terabytes off heap. That's right, a one-to-four ratio.
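The code on the slides is not in this transcript, so here is a small sketch of the idea, assuming only the standard JDK 8: the same million doubles allocated on heap, off heap through a direct ByteBuffer, and off heap through sun.misc.Unsafe used as a plain malloc.

```java
import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import sun.misc.Unsafe;

public class OffHeapSketch {

    static final int COUNT = 1_000_000;

    public static void main(String[] args) throws Exception {
        // 1. Classic on-heap allocation: one million doubles in the Java heap.
        double[] onHeap = new double[COUNT];
        onHeap[0] = 42.0;

        // 2. Same data off heap, through a direct buffer (outside the Java heap,
        //    but still bounds-checked and going through method calls on every access).
        ByteBuffer direct = ByteBuffer.allocateDirect(COUNT * Double.BYTES);
        direct.putDouble(0, 42.0);
        double first = direct.getDouble(0);

        // 3. Same data off heap through sun.misc.Unsafe, used as a plain malloc:
        //    no bounds checks, and freeing the memory is our responsibility.
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long address = unsafe.allocateMemory((long) COUNT * Double.BYTES);
        unsafe.putDouble(address, 42.0);
        double sameValue = unsafe.getDouble(address);
        unsafe.freeMemory(address);

        System.out.println(first + " " + sameValue);
    }
}
```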
Let's get back to it. Yes? Sorry to interrupt, can you tell us a little bit about why this was done in one JVM instead of scaling out? Yeah. There are some workloads that work well in scale-out: workloads you can split into little tasks that run independently, and then you merge the results together. Embarrassingly parallel problems work like that; counting clicks or doing a sum works like that. But anything a bit more sophisticated, a calculation with non-linear aggregations that requires mixing data from both sides, collapses very fast when you start to scale out. And for those problems, where what you want is to run in seconds what used to take hours, including complex calculations, quantiles and statistics, you really lose all the advantages of in-memory computing when you start scaling out. That's why some use cases that require speed on large volumes work so much better on a scale-up system. In short.

So this memory that you manage yourselves, is that the off-heap concept? Yes, off-heap memory is a piece of memory that we manage ourselves from within Java. Exactly, outside the JVM heap: the memory is allocated outside of the heap, it's not the JVM that manages it, it's us, but from within Java. And you keep pointers on it? Yeah, we manage pointers from within Java. As we have said, it sounds dangerous and exotic, but in fact you're all doing that without knowing it: many, many common libraries and frameworks today use off-heap memory under the hood to offload the garbage collector. Just to add something that will be detailed later on: when you are off heap, you suffer less from the garbage collection, because the garbage collector won't come and look at what's happening off heap. That is the point of off heap. But we will come to the garbage collection story later on. Certainly.

So if we return to our use case, our objective was not just to fill terabytes of memory. Our objective was to deliver interactive calculations that aggregate terabytes in seconds. It means that parallel computing is the other half of the equation: we need to use those hundreds of cores, not just the memory, to solve our problem. And when you run on hundreds of cores, it is not really a multi-core issue anymore; it's a new problem that is sometimes called many-core parallelism. Over the years, I can tell you that we've tried all of the usual paradigms to improve parallelism inside ActivePivot. For instance, we were one of the first pieces of software to really use the fork/join pool, way before it was made available in Java 7. You may have heard of the fork/join pool: it's a special thread pool that can divide work recursively into smaller and smaller tasks and run them on as many threads as possible, so it maximizes the usage of the cores. The fork/join pool can also do work stealing: when a worker thread has spare cycles, it can go and steal tasks from another, busier thread, again to maximize the usage of the cores. I can tell you that about five years ago, using the fork/join pool really helped us push the limits, and thanks, Doug Lea, for this great piece of software.
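As an illustration of the recursive splitting just described (a generic sketch, not ActivePivot's actual tasks), here is how a sum over a large double array can be expressed with the fork/join pool:

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

/** Recursively splits a sum over a double array so that all cores stay busy. */
public class ParallelSum extends RecursiveTask<Double> {

    private static final int THRESHOLD = 100_000; // below this, sum sequentially

    private final double[] data;
    private final int from, to;

    ParallelSum(double[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Double compute() {
        if (to - from <= THRESHOLD) {
            double sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) >>> 1;
        ParallelSum left = new ParallelSum(data, from, mid);
        ParallelSum right = new ParallelSum(data, mid, to);
        left.fork();                          // run the left half asynchronously (may be stolen)
        return right.compute() + left.join(); // compute the right half here, then join
    }

    public static void main(String[] args) {
        double[] data = new double[10_000_000];
        java.util.Arrays.fill(data, 1.0);
        double total = ForkJoinPool.commonPool().invoke(new ParallelSum(data, 0, data.length));
        System.out.println(total); // 1.0E7
    }
}
```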
Something else that brought a lot of improvement was to move to lock-free algorithms, because nothing wastes your parallelism as badly as a mutex under contention. So over time we have rewritten our dictionaries, our queues, our indexes in ActivePivot around lock-free data structures, and each time we unleashed a bit more parallel power, if you want. In the end, we also removed the locks from our transactional engine, which led us to multi-version concurrency control, MVCC. There is no synchronization anymore in ActivePivot between the threads that write updates to the data and the threads that read the data for the calculations. It's a bit as if there were several versions of the data at the same time. I could keep talking about everything we did for the rest of the night, but let's not stray too far from the core subject, because all of those optimizations are not enough anyway. Of course, they helped us follow the hardware trend: we started on processors with two cores, then four, eight, sixteen cores. But when we were profiling the largest application of our customer, we could already see that this design would not carry us much further. We knew it would not be good enough for a server with hundreds of cores, which at that time we thought was far away in the future.

That's why three years ago we totally redesigned our software to move from the multi-core to the many-core era. If you look inside a modern server today, you will see CPUs, each associated with its own memory chips, and those CPUs and memory chips are organized in a somewhat distributed fashion. So what we had to do is partition our ActivePivot database, and the idea is to keep every partition accessible by only one CPU. Only threads running on that CPU will allocate memory for that partition, and when we do a calculation or run a query, we segregate the work within that partition: we don't go and see what's happening somewhere else. We literally mimic the hardware topology in the software. That is quite interesting, because what you do here is move from multi-threaded code to mono-threaded code per partition. You really simplify your code, and you avoid managing synchronization on shared resources and the contention that comes with it. We can move to the next one.

So this screenshot, let me just tell you the story behind it. We were firing a query dealing with 12 terabytes of data, and this screenshot was taken on the SPARC server we were working on. It is just the state of all the threads of that server, and I can tell you that the Oracle engineers were quite amazed when we reached this state. One may ask, OK, are all those threads doing effective work? Let me give you the proof. You can see here that we did a few benchmarks, three benchmarks. For the first one, we used a quarter of the machine in terms of CPU and we were dealing with three terabytes of data. Then we moved to six terabytes of data, then to 12 terabytes of data, and from one test to the next we were adding a few thousand threads every time. The screenshot was actually taken during the last test, where the query response time was 24 seconds. So yes, real work was being achieved by the server and we got the response. When you look at this benchmark, the interesting thing to focus on is that we kept adding thousands and thousands of threads and we kept getting results. If you are familiar with multi-threaded programming, maybe you have run a test with ten threads and it worked fine, and the day you ran it on more threads you reached a plateau, or the performance even collapsed. Here it's not the case, and that's quite important.
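Stepping back to the partition-per-CPU design described a moment ago, here is a hypothetical toy sketch (simplified, not ActivePivot's engine): one single-threaded executor per NUMA node, so each partition of the data is only ever touched by the thread that created it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy model: one partition and one dedicated thread per NUMA node. */
public class PartitionedStore {

    static final int NODES = 4; // hypothetical number of NUMA nodes

    // One single-threaded executor per node: all accesses to partition i
    // happen on thread i, so no synchronization is needed inside a partition.
    final ExecutorService[] nodeThreads = new ExecutorService[NODES];
    final double[][] partitions = new double[NODES][];

    PartitionedStore(int rowsPerPartition) throws Exception {
        List<Future<?>> loads = new ArrayList<>();
        for (int node = 0; node < NODES; node++) {
            nodeThreads[node] = Executors.newSingleThreadExecutor();
            final int n = node;
            // The owning thread allocates and fills its own partition
            // (in the real system that thread is also pinned to a NUMA node).
            loads.add(nodeThreads[n].submit(() -> partitions[n] = new double[rowsPerPartition]));
        }
        for (Future<?> f : loads) f.get();
    }

    /** Query = sum over all partitions, each partial sum computed by its owner thread. */
    double sum() throws Exception {
        List<Future<Double>> partials = new ArrayList<>();
        for (int node = 0; node < NODES; node++) {
            final int n = node;
            partials.add(nodeThreads[n].submit(() -> {
                double s = 0;
                for (double v : partitions[n]) s += v;
                return s;
            }));
        }
        double total = 0;
        for (Future<Double> f : partials) total += f.get();
        return total;
    }

    public static void main(String[] args) throws Exception {
        PartitionedStore store = new PartitionedStore(1_000_000);
        System.out.println(store.sum()); // 0.0 for freshly allocated arrays
        for (ExecutorService e : store.nodeThreads) e.shutdown();
    }
}
```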
This is the SPARC one. There are 32 processors, each with eight cores or so, and each core can run eight hardware threads; in all that's about 3,000 hardware threads, something like that.

So if I understand correctly, it is not perfectly scalable? You're absolutely right, this is not perfectly scalable. In a perfect world, yes, you would keep multiplying the resources and always get something proportional. But the good thing is that from the first test, if you look at the green one or the purple one, we multiplied the data by four and the response time is only multiplied by two. So it's quite impressive for this large-scale type of test.

You told us earlier that every core accesses a particular area of the memory. Doesn't that mean that if my test case only touches one particular area, one of my processors will be really, really busy while the others do nothing, and everything becomes slow? Are you talking about NUMA? No, I'm talking about what you said, that one core accesses one piece of memory; let's assume my test case only accesses that particular area. Sorry, can you rephrase it a bit? So, one core, one CPU: even if I have 24 cores, only one of them will work, because all my data lives in that particular area. Well, all the cores work at the same time, and all the cores can read from memory at the same time. But two slides before, you said that you partition the data and that one core accesses one allocated piece of memory; so my test would always hit that core and that particular memory, even if I have 20 or 40 other processors that are not running. Yes, but this partitioning is dynamic: the partitioning is done by looking at the topology of the server to maximize concurrent, parallel access. Do you want to take it offline after the talk? I would like to talk with you about this.

In fact, I want to share something with you: we don't just split the data into partitions, we actually care about where in physical memory the partitions are placed. You know that all of the big servers above one terabyte of memory follow a non-uniform memory architecture, called NUMA, where the memory chips are in fact distributed among the processors. If a processor reads data from its local memory chip, the performance is optimal; but if a processor gets data from a remote chip, the performance is degraded. So NUMA in this regard is both an issue and an opportunity. And I have learned the hard way that if you don't care about NUMA, you can get a very expensive server to run slower than a laptop, and it's actually very easy to do that. But if you leverage NUMA properly, then you can aggregate the memory bandwidth of all the sockets and scale the memory throughput together with the number of processors. Maybe those considerations seem overstated to you, because it's true that operating systems manage to smooth the impact of NUMA on standard workloads. But for an in-memory database, memory bandwidth really is the main bottleneck, so NUMA is a key point, as important as the power of the processors for instance. And that takes us to the real problem: how to support NUMA in Java. Well, I was just showing you what a 32-processor server looks like with respect to NUMA nodes, right?
So, you know that when a process allocates some memory, calling malloc or mmap, it's in fact the operating system that decides where the memory will physically be allocated. On Linux or Solaris, the default policy is to put the memory on the same NUMA node as the processor running the thread that touches it first. That's the default policy, and the idea behind it is that the thread will probably re-access this memory quickly, and in that case the performance will be optimal. You will see that we have based our entire NUMA design on this key principle.

OK, first of all we have to detect the NUMA topology, and to do so we of course have to make some system calls. What we did is rely on JNA, Java Native Access, which is much easier to use than JNI, the Java Native Interface. With Java Native Access, if you want to make a library call, all you have to do is implement an interface that has the same method signatures as the library you want to call, and then JNA does the binding for you. We did this for libnuma on Linux, and we did the same for libpthread. This allows us, for a running thread, to know the NUMA node on which that thread is running. We are also able to read the NUMA topology, how many NUMA nodes there are, et cetera. So we had all the tools we needed to take a running thread, pin it to a NUMA node, keep it there, and have it access only the data allocated on that NUMA node. By doing that, we minimize the communication between NUMA nodes and we maximize the aggregate memory bandwidth of the sockets. If we look at the benchmark we did earlier, just the last one, we reach more than half a terabyte per second when performing that query; without the NUMA-aware implementation, this is almost impossible.

So in other words, you were calling the C library to do the first-touch memory placement? Yes, that's right, we are using the Linux system libraries from Java. Of course there is no magic, there is no magic recipe: somehow you have to interact with the low-level system libraries to understand what kind of topology you are running on. The good surprise was how easy it was to do in the end. So you can see that we are not that far from processing one terabyte of data per second; we cannot do that yet, but maybe next year.
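As a rough sketch of the JNA approach just described (the interface below is hypothetical and only covers a couple of libnuma calls; it assumes JNA 5.x and a Linux machine with libnuma installed):

```java
import com.sun.jna.Library;
import com.sun.jna.Native;

public class NumaSketch {

    /** Minimal JNA binding: method names and signatures mirror libnuma's C API. */
    public interface NumaLib extends Library {
        NumaLib INSTANCE = Native.load("numa", NumaLib.class);

        int numa_available();           // negative if NUMA is not supported here
        int numa_max_node();            // highest NUMA node number
        int numa_run_on_node(int node); // restrict the calling thread to one node
    }

    public static void main(String[] args) {
        NumaLib numa = NumaLib.INSTANCE;
        if (numa.numa_available() < 0) {
            System.out.println("No NUMA support on this system");
            return;
        }
        System.out.println("NUMA nodes: " + (numa.numa_max_node() + 1));

        // Pin the current thread to node 0: memory it touches first will,
        // with the default Linux first-touch policy, land on that node.
        numa.numa_run_on_node(0);
    }
}
```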
At this stage, you're probably asking yourself: what about garbage collection, right? And we're getting there. Sure, thanks to our off-heap memory we only need a 3-terabyte heap, but of course 3 terabytes is still well beyond the comfort zone in Java, and I don't think we could have done it just by ourselves. We felt the need to get closer to the people who make Java, to talk to the engineers at Oracle, and at IBM as well, who actually write the JVMs. We made technology partnerships with them, and in the beginning they helped us find the best set of JVM parameters to run such a large JVM.

Time to take some photos! Without the help of the Oracle engineers we would not, of course, have been able to reach the perfect parameters, the perfect arguments for our JVM. The idea here is quite simple, and we'll give you the rules of the game if you want. The session is being recorded, so you will be able to find all of this again later, but go ahead and take a photo; you can take pictures of us too. Okay.

There is really only one main rule when playing with the garbage collector: never have a full GC. Globally, the garbage collections behave, they scale, until you hit a full GC. When you hit a full GC it's over: you have to restart the application, then keep testing your arguments and keep tuning your JVM. Yes, a full GC is a full stop-the-world collection, cleaning both the young and the old generation. It basically freezes the whole application while it cleans all the garbage.

So I will just detail a few of the arguments we found useful. We were working with JDK 8 at that time, and we had to find a garbage collection algorithm able to continuously clean the garbage. For that we focused on Garbage First, G1 GC, which will be the default in JDK 9. We also had to use a big young generation. Remember when I talked about off heap: all the raw data of our in-memory database is off heap; those are the long-living objects, we don't want them to vanish, we want them to stay there. The transient memory, on the other hand, which is related to the queries and the calculations, we wanted it to go into the young generation and remain there, where it is easy to collect. So basically we use the Java heap as one big young generation; that was our approach. The fact that we disabled the adaptive size policy is just to allow us to size the young generation ourselves.

Then the other problem we faced: while the young generation is being collected, our fear was that the transient objects created at the same time would be promoted directly into the old generation. So the idea was to use a big survivor space, so that those transient objects remain there and we can collect them easily. Setting the survivor ratio to one means the survivor space takes half of the young generation.

The last interesting point was to continuously do what G1 calls marking of the different regions. As soon as we reach 10% of heap occupancy we start marking the objects, so we are continuously marking, and as soon as a region reaches 20% of garbage we include it in the garbage collection. We also hoped the max GC pause target would help, but it didn't: MaxGCPauseMillis is a kind of threshold, a pause-time target you give to the JVM, but it wasn't effective for the memory size we were playing with.

To be sure that the arguments we set were good, we let the application run overnight, simulating long and short queries to stress it, and the day after we came and looked at it. Of course we didn't have any full GC, so we were happy, and then we noticed that the pauses we had were around 10 seconds, which we found really amazing. With such a result, we felt we had found the magic cocktail of arguments for the JVM.
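Putting the options just described together, a command line in the spirit of what was discussed might look like this; the exact values are illustrative guesses, not the production settings from the talk.

```
# Illustrative only: sizes and percentages are guesses, not the real configuration.
java \
  -Xms3072g -Xmx3072g \
  -Xmn2048g \
  -XX:+UseG1GC \
  -XX:-UseAdaptiveSizePolicy \
  -XX:SurvivorRatio=1 \
  -XX:InitiatingHeapOccupancyPercent=10 \
  -XX:MaxDirectMemorySize=12288g \
  -jar application.jar
```

Here -Xmn together with the disabled adaptive size policy fixes a large young generation, SurvivorRatio=1 gives the big survivor space mentioned above, InitiatingHeapOccupancyPercent=10 starts the concurrent marking early, and MaxDirectMemorySize caps the off-heap direct buffers; MaxGCPauseMillis was tried but, as said, was not effective at this scale.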
Do you want to add something? Not a lot, I think, almost nothing. So let's talk about the other JVM. Of course, so far when I said "the JVM" I was talking about the one from Oracle, what you call HotSpot. We tested another JVM, made by a company called Azul, and that JVM is named Zing. This JVM claims to be pauseless. It's a fork of HotSpot, but the garbage collection algorithm is their own, it's called C4, and they invest a lot in it. We ran the same stress scenario, and if you remember the earlier chart, we get almost exactly the same shape, but the axis that was in seconds for HotSpot is in milliseconds here. So it really is pauseless. However, there is no magic: what you notice is that something like 10% of the CPU time is spent on garbage collection activity. So if you have 3,000 threads, roughly 10% of them are working just on the GC. And here we have, side by side on some queries, the response times we get with HotSpot and the response times we get with Zing from Azul. As you can see, even if this JVM is pauseless, HotSpot can sometimes be two times faster than Zing. So there is no silver bullet here; it all depends on how latency-critical your application is. And of course it is not free of charge, you have to pay for it. You have to choose between short pause times and maximum performance.

But you should know that this is only the beginning. I told you before that we have partnerships with the big Java players, and for instance, for one year now, Oracle has allocated a special team of JVM engineers who are trying to optimize G1, the G1 garbage collector, using our in-memory application as a reference benchmark. So the engineers who write ActivePivot work together with the engineers who write the JVM, and together we try new ideas to improve G1 on this kind of real-life, big-size workload. I can tell you that in the current builds of Java 9, in the Java 9 branch, some parts of G1 already run ten times faster, which is quite impressive. I cannot disclose too much, because this is going to be announced at JavaOne in September: I'm going to join Bernard Traversat, the head of Java engineering, and together we will present our results and the improvements that this work has brought to Java 9. But I think we can still give a quick preview of the results.

So we are showing you results from three months ago; we have already improved on them since. Those were the results we had in Java 9 three months ago. As you can see, if you focus on the GC activity and the GC pauses, we moved from something like 3 or 4 seconds on average to something close to 1 second, which is quite impressive compared to the pauses we had before. And if we focus on one of the biggest queries we have to deal with, you can see that we moved from 33 to 18 seconds. So there is a lot of hope of really improving things thanks to this partnership; being close to the people who write the GC algorithms really pays off. And maybe the most important aspect for real applications: you know, those outliers, from time to time there was a very long pause, 300 seconds; those are completely gone in the new Java 9. That may be even more important than the reduction of the average pause.

OK, so all of that means that we can all go home and do the same on our own 16-terabyte servers? If you had asked me this question last year, I would have answered no, like everyone else. Now I consider that the border has been crossed. So here comes a new generation of
applications that Java can take over: operational applications that require continuous, interactive calculations on large amounts of data that change in real time. This is going to be quite an adventure, of course, and if you want to be part of it, I think there are many ways we can work together, all over the world. Thank you very much for your attention tonight. I think we have time for a few questions.

Are you making use of any version of the JVM that would require a separate license from Oracle? No, this is the OpenJDK branch, the one that is free for everyone. It's the HotSpot you can download, the OpenJDK JVM, not the Oracle JVM; but the Oracle JVM itself is OpenJDK plus a few improvements. The work we are doing is in OpenJDK, it's for everyone. So when you get Java 9 next year and you find that garbage collection is so much better, remember where it came from.

Do you think other languages can benefit from the same improvements? Any language running on this JVM will benefit from the better G1; Scala or Clojure, for instance, will benefit from it for certain.

I have a question the other way around: did you also try to optimize your own classes and objects in your code to save space, or did you only focus on tuning the JVM? Oh yes, we played on every lever: bigger objects, reused objects, anything that can minimize the pressure on the GC we have considered. But really, the best design we found in the end is to keep every long-lived object off heap, and to keep every transient piece of data used by queries and calculations in the heap, where even if you create a lot of them they are cleaned very easily; the young generation is a brutally parallel piece of garbage collection that works well even with many objects. The old generation, yes, the old generation is very small, because every big structure of our database, hash tables, columns of numbers, vectors of simulations, all of that can be put off heap. So we really keep a small old generation, not even one third of the heap. And Garbage First, which was maybe not that efficient in JDK 7, now gives really amazing results, even on servers with something like 500 gigabytes of RAM; moving from the parallel GC to G1 you really see an improvement, and you don't fall into that stop-the-world type of behavior.

Do you still hit it? Less often, really less often.

What is the biggest amount of memory you have run on, beyond the 16 terabytes? That's our record so far. So bring it on: I'm sure we will do better in the future, but we only do what customers actually ask us to do.

So what is your next target, then? The target is not the size; again, we don't try to beat records, we want to be able to do what customers actually find useful, and today that's about what some of our customers want to do. What we really want to achieve is smooth operation, very smooth operation. We work with Oracle so that they improve their garbage collector, and we work with Azul so that they improve their raw performance while keeping their very good garbage collection. That's our goal. Thank you.