So we're really happy to have Edward today from QuasarDB. QuasarDB is another financial time series database, focusing on tick data coming from stock markets and trading installations. So we're really happy to have Edward here, and go for it.

Okay, so I'm Edouard Alligand, the founder and CEO of the company. It's going to be a very technical presentation, I hope. If you have any questions, you can ask me at any point in time and I will answer to the best of my capabilities. Before I go into the details: the goal of this presentation is to discuss what you have to do to make your software go fast in this context, the context being a database. Because in the use case we have, there's really a lot of data coming in, nonstop, and you don't have the liberty to be unreliable, because you also have to be reliable with your updates. So there's a tension between being precise and accurate, and being fast.

Before we start, let's have a discussion about benchmarks, because if there is only one takeaway, it's to keep a critical mind about benchmarks and about the way you evaluate the speed of software, any kind of software in general. So, to take a different context than databases: let's get away from databases and talk about cars, right? If I show you two cars, just in a picture like this, and I ask you which one is the fastest, and let's assume you don't know anything about cars, you just look at them. I think most of you would maybe say the Porsche is faster, right? No one has an idea? No one cares? We've seen the YouTube video; we know the Tesla is faster. Actually, off the line like this, yes, it's faster, because it's an electric motor with immediate torque, compared to a combustion engine with torque only higher in the revs. But the true story is that the Porsche is actually significantly faster than the Tesla: if you go to the Nürburgring, the Tesla can't finish the lap at full power, because it overheats, that kind of stuff. So it's always interesting: I'm the CEO of a company, I will show up and give you numbers and say, well, my database is faster. But you should always be critical, because what you don't see here is the weight of the car, the handling in the turns, that kind of stuff.

And for databases it's exactly the same thing. So if I show you numbers like "I can aggregate billions of lines per second", what does that mean for you? If, for example, you're a trading firm and you want to compute the weighted averages of your stocks, that's what you care about. And if you want to do monitoring, you care about being able to ingest everything you have. So, staying in the field of databases: for one customer, we did a benchmark, pure read/write benchmarks using the low-level API, where we write blobs to the disk and read blobs from the disk. The setup is on the left, and it's bounded to 10 gigabytes. And as you can see, the open-source SQL database (I'm just not naming it because I'm not interested in bashing anyone) writes more than twice, almost three times, faster than us. So the conclusion could be: well, we suck. Yeah, maybe. But there's just this little detail that they write faster than the disk. So when you see numbers like this, what you're actually reading is that they don't really write to disk immediately: they buffer in memory and then write to disk, which is fine in general.
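To make that "faster than the disk" observation concrete, here is a minimal sketch (my illustration, with a hypothetical file path and plain POSIX calls) of why a benchmark can show such numbers: a plain write() usually lands in the OS page cache, and durability is only paid for when you fsync(). Timing the first loop measures page-cache speed, not disk speed.

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstring>

// Minimal sketch: the same 4 KB write, with and without durability.
// Timing the first loop usually shows "faster than the disk" numbers,
// because the data only reaches the OS page cache.
int main() {
    char page[4096];
    std::memset(page, 'x', sizeof(page));

    int fd = ::open("/tmp/bench.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return 1;

    for (int i = 0; i < 1000; ++i) {
        ::write(fd, page, sizeof(page));      // buffered: page cache only
    }

    for (int i = 0; i < 1000; ++i) {
        ::write(fd, page, sizeof(page));
        ::fsync(fd);                          // durable: wait for the device
    }

    ::close(fd);
    return 0;
}
```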
And I'm not saying one or the other database is better, but in your context that may be a huge problem, because your data may be arriving all the time, and at some point what's in memory has to be written to the disk, right? So what I just want to say to start this talk is: don't read too much into benchmarks, because you can craft the benchmark you need, you can make the modifications you want in the database to pass a benchmark with flying colors. What you really care about, if at some point you are going to evaluate a product, whether it's a database or any other product, is to make sure that you go beyond the marketing, really understand what's going on, and keep a critical mind.

OK, so very quickly now, let's talk about what we do and where the database comes from, in a couple of minutes, and then we'll get into the details of what we did to deliver the performance we can deliver. Basically, like Andy said, we were born in market finance. To be precise, the idea appeared during the financial crisis in 2008, when I decided to change my career. I said, well, I'm going to work in market finance, and I picked the best year for it, 2008; I don't know if you remember it, but that's basically when the markets crashed. Quasar is a time series database; whatever that means, we'll see later what the definition of a time series is. In market finance, time series are really a great way to represent what's going on in the markets, because you're really interested in two things: the timestamp and the value of whatever instrument you're looking at, or in this case a currency cross, the EUR/USD cross, and then maybe other instruments. So time series are great, because then you can ask questions such as: can you give me the values between two points in time? Or: can you compute some sort of aggregation, like the average value, or maybe the median, or maybe the standard deviation, that kind of stuff. And in a time series database like ours, which is column-oriented (we'll get more into what that means exactly), this is very easy to do and very fast.

Just to be clear: below a certain amount of data, any kind of database is fine for your workload. Really, if the amount of data is reasonable, you should probably first try a generic database, I don't know, like PostgreSQL, because it's probably going to be able to handle your workload. What happens is that when you either have a lot of data or a lot of requests, then you should start considering: well, maybe I need something a little bit more specific. In finance, the amount of data you can get is really insane, and there are two infinities, if you will. The first is that the amount of data is crazy. That's 10 million messages per second, the whole day from 9 to 5, just for a couple of stock exchanges, and you probably have more data to get, and one message is obviously not just one point. A lot of the activity is what we call bid and ask: hey, how much would you be willing to pay for my stock? Hey, I have this stock, how much would you give me for this amount? So it's really a lot of data. And the other problem is that you have to be very accurate; you cannot cut corners. By that I mean: if you want to get a lot of performance and you can forgo a little bit of reliability, then what you can do is, like we saw a little bit earlier, buffer writes to disk, and it's fine if the computer crashes and you lose a little bit of data; or you can allow the data to be inconsistent if you have a distributed database.
Okay, so this value is not exactly what I expected, but it's fine, because in my context I don't care. For example, if you display an ad on a website and you display the wrong ad, it's fine, right? But if you do a deal at the wrong value, it can have financial consequences. You're going to find the same thing in other areas, such as predictive maintenance, where you really want to have accurate values.

So again, back to the story: you have all this data, and let's say you're working at a hedge fund or an investment bank, and your job is to make sense of all this market data and to build an infrastructure. You have to be naive in the way you approach the problem; you should not have an opinion about what you should do before you really study the question. What kind of data do I have? In this case, it's obviously time series. How much data do I have per day? And don't get caught up in "well, it's millions": yeah, but one million points is just a megabyte. What level of reliability do I need? If you're a bank, there's new regulation coming up, for example, where you have to keep the data for seven years and be able to answer questions from regulatory entities, who would be very happy to find a reason to fine you, that kind of stuff. But maybe you're in a different business and you just need to keep one year of data, or it's just for your own statistical needs that you keep a history of data. And how much money am I going to spend on my system, right?

Then, just to give you an idea (I'm not going to give you a crash course on market finance; maybe some of you don't really care about market finance in general), here are a couple of the problems we solve for the customers we have. We have three kinds of customers: stock exchanges, investment banks, and hedge funds. Pretty focused, right? They usually have a lot of data, and they have found a solution for the data of the day: I can ask any question about my data of today, because it's a reasonable amount of data, it fits in memory, and that's fine. But if I'm an analyst, I'm really interested in what was going on 10 years ago with that stock, because maybe I'm going to find patterns, maybe I'm going to get ideas, maybe I'm going to find correlations. Sometimes correlations can be very weird: you can find that two companies that do totally different things actually have correlated stocks, and maybe you're willing to bet on this. So what we came up with is a solution where they can store everything they want and ask any question, at any time, of the database. We're not the only one to do that in the market; you had a lecture from a company doing a similar thing. The product is a bit different in the sense that our solution is scaled out, and we solve the very difficult problem of distributed transactions, that kind of stuff. But the thing is, now we have a scaled-out database and we can ingest a lot of data. And what we've also done, when we designed this, is really go back to the problem and interface with the tools people have. For example, machine learning, which is emerging: we have connectors for Apache Spark, that kind of stuff. It's really about delivering a lot of data in real time. And yeah, that's the end of the quick overview of what we do, and that's the kind of performance we deliver with just one thread. That maybe sounds like a lot, but it's a requirement. For example:
We had one customer, one of the oldest we have. They said: OK, I need to do 8 million updates per second in your database, and do that reliably. Can you do it? And if you can't, we don't have a deal. So currently, yeah, we can aggregate 1 billion lines per second when it's in memory, and ingest millions of points per second. But again, that looks impressive because I'm doing marketing here, right? Just multiply it by the size of a double and you see it's not that much in terms of megabytes: your machines can handle gigabytes per second. Anyway.

So, looking at some of those numbers, what are your bottlenecks? Are you memory-bandwidth limited? Are you disk limited? Let me repeat the question: the question is, currently, what is our bottleneck? Today I would say it's the persistence layer, I think, because with 100 gigabit networking you generally go faster than the persistence layer. We get around that by distribution, by clustering. Actually, we work with Intel, and with a company named Levyx for persistence, and we are very eager to test the new Optane technology. But I would say it's really persistence. Then for certain kinds of aggregations, well, when we're memory bound, I think it's a success; it really means we've done our job well. At some point it's the software: the data structures, the computation, the indexes. And it's interesting, because today the network is no longer really a bottleneck in terms of bandwidth. Maybe latency, but we'll see that later.

For latency, what we've done is work with companies like Mellanox to bring latency down as much as we can. There are two shortcuts we take here (and actually, now we also write directly to the storage). The first one: when I get data from the network, instead of letting the system copy it into a buffer, then reading the buffer, and reading it again and again and again, and maybe my own software making copies on top, we go to the network card and just read the data in place. That saves you a lot of copies and gains you a lot of latency. The other thing is that modern network cards are also able to offload a little bit of the computation required by the TCP/IP protocol. And for the storage, it's the same thing. As a database, you can either rely on the file system, that is, you create files, and within the files you have your own way of finding your data; or you can write directly to the device. In that case you gain a little bit more speed at the expense of more engineering for us. But from the point of view of the user, you don't care, because the heavy lifting has to be done only one time, hopefully.

So, to go back to the architecture in terms of clustering. Everything within the database is distributed, and distributed transactions are first-class citizens, in the sense that everything you do can be a transaction, and everything can be verified transactionally. For the distribution, we are based on an algorithm named Chord; you can look it up, it's from MIT. We did a couple of modifications to it. Basically, it's consistent hashing. Consistent hashing means that you have automatic sharding of the data across your cluster, and every node is then responsible for a range of data. The range a piece of data falls into is computed from a given key, for example the name of your time series, so the data is actually going to be split over the cluster.
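Just to illustrate the idea (a minimal sketch, not QuasarDB's actual code or hash function): consistent hashing places node IDs on a ring and assigns each key to the first node at or after the key's hash, and folding a time bucket into the key is one way to spread a single series over the cluster, which is what the next paragraph describes.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Minimal consistent-hashing sketch (illustrative only).
// Nodes are hashed onto a ring; a key is owned by the first node whose
// hash is >= the key's hash, wrapping around at the end of the ring.
class Ring {
    std::map<std::uint64_t, std::string> ring_; // hash -> node id
    static std::uint64_t hash(const std::string& s) {
        return std::hash<std::string>{}(s);
    }
public:
    void add_node(const std::string& node) { ring_[hash(node)] = node; }

    const std::string& owner(const std::string& key) const {
        auto it = ring_.lower_bound(hash(key));
        if (it == ring_.end()) it = ring_.begin(); // wrap around the ring
        return it->second;
    }
};

// Folding a time bucket into the key spreads one large time series
// across the cluster instead of pinning it all to a single node.
std::string shard_key(const std::string& series, std::uint64_t ts_ns,
                      std::uint64_t bucket_ns) {
    return series + "#" + std::to_string(ts_ns / bucket_ns);
}
```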
If your time series, for example, is very large, like there is really a lot of data, obviously you want this data to be spread across the whole cluster, right? If you have 1 terabyte of time series and you have four nodes, you ideally want 256 gigabytes on each of the four nodes. So to do that, we transparently shard the time series across time (that's what the time bucket in the sketch above stands for), and we do our best to have locality of different time series for a given time. Again, as you can see, we do direct access to the block device and the network card on every node, and every node works independently of every other node. That is, as soon as it's decided that a node is responsible for your data, it doesn't need to communicate with the other nodes, unless you're doing a distributed transaction, which we're going to see later. The advantage of this architecture (we've been working a lot with Cisco on the network, performance, and benchmarking) is that scale-out really works. If you have an architecture which is based more on consensus algorithms and quorums, that kind of stuff, the problem you can have is that when you do updates, you need to wait for the consensus to finish. That's not our case: when you have your shard, you just work on your shard.

Is that benchmark just ingesting? Yeah, it's really just using the primitives. So the question was: what is this benchmark? In this case, you use the native primitive API to write blobs to the database and see if those writes scale out, and the same for the reads. When you do a time series, it's going to be a little bit different, because you're going to have intermediate processing: serializing the time series, figuring out what it is. But the goal there was really to see: do we have any bottleneck, can we really scale out? If not, there is something wrong in the algorithm. Yeah, you had a question?

You're saying each shard works independently; is there no redundancy of the data, or is it a single copy? OK, so that's a really good question. The question was: every node works independently, but what about replication, what about synchronization? It's true that every node works independently, except for replication, where we are synchronous. Again, go back to the requirements: we need to be very reliable in our updates. We don't want to accept data, say "yeah, yeah, we did the replication", and then we lose a node, the replication did not actually happen, and the data is lost. So we wait for the replication, meaning a node is going to elect another node as a replica slave. It's really a hybrid peer-to-peer master/slave architecture: every node is going to have a slave to which it does its replication, and if the node goes down, the slave is automatically promoted to master for this region of the ring, and it will then in turn start replicating. The replication is configured at the cluster level.

And that's the work we did with Mellanox, which is very interesting. So VMA is really a kernel-bypass thing, in the same family as DPDK, and actually I think now they have new stuff where we can go even faster. That's the gain you get from bypassing the operating system, more or less, if you allow me to take shortcuts in the explanation. It's almost a few times faster, and it's doing exactly the same thing, right? It's sending packets on the network, exactly the same packets; there is no compromise on integrity. The only thing that happened is we bypassed the kernel and we write directly to the network card.
And we used this specific network card for this.

So this seems really slow. What does it compare to, actually? Yeah. What's the bottleneck there? I think it's VMA. Is this just one-byte requests? Yeah. I mean, that number could be 30 million. Yeah, it could be, and that's a good question: I have no idea why it's so different. Remember that the whole protocol of the database is running; it's not just a dumb server sending one byte back and forth. So this is requests at the database level? Yeah. But you're right in the sense that it should be much higher, and for the time being we haven't clearly identified where the bottleneck is. This was a NUMA machine, and I think there may be surprises related to our software and its interaction with NUMA, like: does the request land on the right core when I get it? That kind of stuff. I'd be very happy if you can help me increase this value; just come talk to me.

So then, OK, that was, roughly and very quickly, in the time I have, an idea of what the database looks like and what we do. Now I want to spend some time giving you some concepts in terms of software engineering, very quickly, because we're going to see what we've done; for example, we are pretty good at scaling up, except on NUMA machines, which are a nightmare for databases. So I'll spend some time on the low-level coding.

One of the first things a database has to do is memory management. For memory management you can say: well, I'm going to use a garbage-collected language and my problem is solved. Not really. The real problem underlying memory management is the lifetime of your values in memory: when can I delete this? The garbage collector is going to give you an answer based on various algorithms, or you can decide manually, in your code, when it's safe to delete a value. And of course it gets harder when you have a multi-threaded application, and asynchronous I/O is even worse, because it's very, very hard, when you code, to know when it's safe to release memory back to the system. Asynchronous I/O: I think every database today does asynchronous I/O, like we do; we do asynchronous I/O for the network in particular. It's really about telling the kernel: please do this for me, and when you're done, tell me. That's the high-level explanation. This is very powerful, because while the kernel is working, probably waiting for the hardware to receive the packet from the other machine, you can actually do stuff, because the hardware is just waiting.
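As a minimal illustration of that "do this for me and tell me when you're done" pattern, here is a sketch using Boost.Asio (an assumption on my part; the talk doesn't name the library QuasarDB's network layer uses). Note the shared_ptrs keeping the socket and buffer alive until the callback has run: that is exactly the "when is it safe to release memory" problem just mentioned.

```cpp
#include <boost/asio.hpp>
#include <array>
#include <cstddef>
#include <iostream>
#include <memory>

using boost::asio::ip::tcp;

// "Please do this for me, and tell me when you're done": the kernel fills
// the buffer while this thread does other work; the lambda runs only once
// data has actually arrived.
void start_read(std::shared_ptr<tcp::socket> socket,
                std::shared_ptr<std::array<char, 4096>> buffer) {
    socket->async_read_some(
        boost::asio::buffer(*buffer),
        [socket, buffer](const boost::system::error_code& ec, std::size_t n) {
            if (ec) return;                    // connection closed or error
            std::cout << "got " << n << " bytes\n";
            start_read(socket, buffer);        // re-arm for the next packet
        });
}

int main() {
    boost::asio::io_context io;
    tcp::acceptor acceptor(io, tcp::endpoint(tcp::v4(), 9999));
    auto socket = std::make_shared<tcp::socket>(io);
    acceptor.async_accept(*socket, [socket](const boost::system::error_code& ec) {
        if (!ec) start_read(socket, std::make_shared<std::array<char, 4096>>());
    });
    io.run(); // event loop: work happens only when the hardware is ready
}
```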
Then, to speak about multi-threading and scalability: what's interesting about multi-threaded performance is that if you want to scale up, that is, if you have a multi-core machine and you really want to use all the cores of your machine, it's actually very hard. It's not just a question of "oh, I'm going to add threads". At some point you're going to want some sort of synchronization on the values you have, right? Because let's say a user updates a value in memory, and the value is more than just an integer, it's a whole blob. If someone reads the value, you want them to get at least either the whole old value or the whole new value, but not half of each, right? So what's interesting to know about multi-threading is that reading is basically easy to scale up. If you just want to share read access to the same value, you don't have to do anything special; the hardware will be very clever about it and just let the threads read the data. What is difficult is when you want to share writing to the value. That's very hard, and actually, the more threads you have, the worse the performance is going to be. That's why it's very hard to design a database: unless you have a read-only database, and I'm not sure what the use of that would be, you're going to struggle to balance the write loads and the read loads.

And just to go back to memory management: you can make the decision, when you design your database, to use a garbage-collected language. That's a decision that has been made by some database makers. Personally, I think it's outsourcing something which is strategic for a database. In your courses, very often, I think your teachers told you: you don't know better than the operating system what you're going to do; you don't know better than the garbage collector what you're going to do. And in most cases, it's true. But when you build a database, which is really a massive piece of software, very often you actually have a better idea than the OS of what should be done, and a better idea than the garbage collector of what should be done. And that's where it can be interesting to see if manual memory management is better. That's the path we chose. It was more expensive in terms of engineering resources, but it really paid off, because we don't have the problem you get with a garbage collector going crazy because it has to collect and it freezes everything. Although there's a lot of progress being made in that area. OK, and just so you know, I'm a C++ guy; I wrote a book about C++, so I may be a little bit biased.

OK, so let's scale up. Why do we want to scale up? When I started doing software engineering, the only thing I had to do to make my software go faster was wait for Intel to give me a new processor. I would take the same program, exactly the same, and it would just go faster. Herb Sutter called it, I think, the end of the free lunch. Today, if you really want to leverage the next generation of processors, you have to make sure that your application is able to run on multiple cores, that is, to be multi-threaded. The database we have is very recent; we started from scratch, we had no technical debt, we could do whatever we wanted. And early on it was pretty clear that we had to be multi-core native, that we had to be multi-threaded. We could have taken the decision to be a single-threaded database, but in that case we would probably have paid for it very dearly in the future. Being single-threaded makes everything easier, because you don't have to care about contention, locks, or anything, but then you don't benefit from the latest processors and architectures. And yeah, like I said, it's actually very hard to scale up software in general.

I'll give you just one example. Do you know what this is? Yeah, you have an idea. It shows how hard it can get if you don't know about it, the first time you run into it. You say: I have no mutex in my application, that is, I did not specify any locking primitive. And yet you run your benchmark, and you see that you do not scale with the number of threads. And you say: well, what did I do wrong? Maybe there is a lock somewhere; maybe I missed something. And that's because you have to remember how a processor works: when you're doing updates in memory, the architecture has ways to make sure that you don't do stupid things like update the same value at the same time. It has to be consistent as well, because if you wrote four to memory, you really want to read back four. So here's what you need to know: I have one thread writing to a first variable, and another thread writing to another variable, and that doesn't scale. Independent variables, no lock, no nothing. Why doesn't it scale? Because in memory, the two variables end up on the same cache line, and updates to the same cache line are locked by the processor. This is what's called false sharing. That's the kind of thing you can run into, and it forces you to go very deep down into what you're doing and really understand the hardware when you want to scale up. It can be a very bad surprise.
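Here is a minimal sketch of the effect (my example, not the speaker's slide): two threads increment two independent counters; in one layout the counters share a cache line, in the other alignas(64) pads them onto separate lines (64 bytes being the usual x86 cache-line size), which typically restores scaling.

```cpp
#include <atomic>
#include <chrono>
#include <cstdio>
#include <functional>
#include <thread>

// Two "independent" counters. In Shared, they sit on the same cache line
// and the two threads invalidate each other's cache constantly.
// In Padded, alignas(64) puts each counter on its own cache line.
struct Shared { std::atomic<long> a{0}, b{0}; };
struct Padded {
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename T>
long long run() {
    T c;
    auto work = [](std::atomic<long>& x) {
        for (long i = 0; i < 50'000'000; ++i)
            x.fetch_add(1, std::memory_order_relaxed);
    };
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1(work, std::ref(c.a)), t2(work, std::ref(c.b));
    t1.join(); t2.join();
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::printf("same cache line: %lld ms\n", run<Shared>());
    std::printf("padded:          %lld ms\n", run<Padded>());
}
```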
So, another interesting thing. Do you know about reference counting, maybe? It's a memory management strategy that is heavily used in C++, basically: every time you make a copy of an object, you increase a reference counter, and every time you stop using it, you decrease the counter. When it reaches zero, it means no one is using the object and you can destroy it safely. It's heavily used, and the structure is named a shared pointer; you can also hear about smart pointers, the same thing. To do that, you have an atomic counter, that is, a counter whose updates are atomic, because you really don't want to miss an update when you have multiple threads updating the value. What you need to know is that an atomic increment or decrement, compared to a classic increment or decrement of an integer, is 20 to 100 times slower. So even with really not a lot of objects, like 10,000 objects (what is that? it's nothing), as you can see, the cost of keeping the references can become very significant, and again, you can lose your multi-threading scalability because of it. The raw pointer here is just the regular pointer, nothing fancy, the data in memory.

So what is this raw pointer line? Are you just reading the pointer, or what are you doing? Oh, that's a good question. So it's a pointer, and the question is: what do we do, do we just read the pointer? It's pointer-intensive manipulation, like passing the pointer to a function. Just this: I give you the pointer. In theory it's the most inexpensive operation you can do, because it's really just reading from a register, right? But when it becomes a smart pointer, it becomes an expensive operation, because when I do that, I have to atomically increment the reference counter, and maybe decrement it later as well. And because it's atomic, in a multi-threaded context it's going to be very, very significant. There are eight threads in this benchmark.

So what is the standard library shared pointer doing that the Boost shared pointer doesn't do, or does better? Why is the Boost one better? To be honest, I have no idea, especially because I don't remember with which compiler we tested this. Maybe you'd get a different result with another compiler. Maybe it was memory layout and cache locality, that kind of stuff; depending on how you create your smart pointer, you may want the counter to be in the same memory area as your object. It could also be that the compiler saw an optimization opportunity.
At this amount of data, I would say it's within the same range. Maybe, yeah.

So your message to the students here is: don't use smart pointers? No, no, no, that's not my message. My message is that sometimes the bottleneck can be in trivial things you would miss, because the smart pointer is hidden in, for example, a type definition. My message would be: benchmark your code, audit it, and have no preconceived idea about where the bottleneck is, and you will be surprised by what can come up as a bottleneck. So don't be surprised if at some point you see: well, it's a pointer operation, how can that be?

Is this something that you actually came across? So, we have this for session objects. Session objects are what we use when someone connects. We use raw pointers coming from a pool, and we knew before doing the benchmark that it was going to be better. But sometimes it's good to run the experiment to convince yourself that it's worth it. Yeah, question?

Did you have any memory leaks because you used raw pointers, that you had to debug and eliminate, that you wouldn't have had had you used the safer, slower pointers? Okay, so the question was: did we have any memory leaks because of that? And the answer is: we don't have memory leaks. No, not really, because the lifetime of the object is the lifetime of the database. The session objects are pre-allocated at instantiation, and we destroy them when the database shuts down. In that context, there is no way we can leak, because it's a fixed number of objects; it's something like 20,000 concurrent sessions by default. Otherwise, we do use smart pointers, especially for asynchronous operations, because yes, there is a cost, but good luck finding out when you can destroy an object used by an asynchronous operation. The typical trap, when things don't go well, is that your answer is being delayed by the system and you get a new request, but somewhere there's still this old request with your old buffer going on. And you say: well, I have a new request, so it's safe to deallocate the old one, right? And then it crashes. So we really tried; to answer your question, in some areas we try to get away without smart pointers. I think it's possible in theory to do it everywhere, but the cost in terms of engineering would be very high, and the benefit in performance is not going to be 10-fold, right? So yeah, we do use smart pointers a lot, but we're just very careful about how we use them. Yeah, another question? No? Okay.

So, I showed you the problems; let's see what we can do to find solutions to all these problems. Of course, this is really a subset of everything you have to do to build a database; I'm really talking about the low-level problems of building a database. Then you get into the more time-series-specific problems, like: how can you write an aggregation engine that is as fast as possible? Hint: you make it a columnar database and you leverage single-instruction, multiple-data (SIMD) instructions, that kind of stuff (there's a small sketch of the idea below). And you care about allocations and efficient memory allocation and all this stuff.
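To make that hint concrete, here is a minimal sketch of a columnar time-series aggregation (my illustration with a hypothetical layout, not QuasarDB's engine): the sorted timestamp column is binary-searched for the range, and the values column is summed in a tight loop over contiguous doubles, the kind of loop compilers readily auto-vectorize with SIMD.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Columnar layout: timestamps and values live in separate contiguous arrays,
// so a range-aggregate touches exactly the bytes it needs, in order.
struct Column {
    std::vector<long long> ts;   // nanoseconds, sorted ascending
    std::vector<double> value;   // same length as ts
};

// Average of values with t0 <= ts < t1: binary-search the range on the
// sorted timestamp column, then sum a contiguous slice of doubles.
double avg_between(const Column& c, long long t0, long long t1) {
    auto lo = std::lower_bound(c.ts.begin(), c.ts.end(), t0) - c.ts.begin();
    auto hi = std::lower_bound(c.ts.begin(), c.ts.end(), t1) - c.ts.begin();
    double sum = 0.0;
    for (auto i = lo; i < hi; ++i) sum += c.value[i]; // auto-vectorizable
    return hi > lo ? sum / static_cast<double>(hi - lo) : 0.0;
}
```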
Do you use TCMalloc, or what do you use? So the question is which memory allocator we use. We currently use the memory allocator from TBB, plus our own memory pools. What the benchmarks show is that at some point, the most performance you can get comes from being smart about memory allocation: allocate once when you can, and do it from a pool. We may want to switch to jemalloc, but the reason would be that it has a lot of interesting statistics inside, and that's really great; we could show them back to the customer and say: hey, you've done so many allocations, maybe you should change your request. But the TBB allocator is satisfactory in terms of performance for us. I don't know if that answers your question.

Do you use TBB for other things as well? Oh, OK, so the question is: do we use TBB for other things? The answer is yes; we even wrote a white paper with Intel about it. We do love the lock-free containers they have; they are very, very nice. They also have very interesting locking primitives; for example, we like that they have speculative mutexes using transactional memory. That's really cool. And why would we want to write that ourselves?

Do you use it for the threading stuff as well? That's actually a great question: do we use the TBB scheduler? The answer is no. The first version of the database used the scheduler from TBB intensively, and the performance was very bad. It's not related to TBB, it's not related to us; it's because it was creating context switches. You have this thread, the I/O thread coming from the user: you do the accept, then you go to the select, and in this thread you process the query from the user, you deserialize the packet, and then you would push that to TBB, which would run it in a different thread, and you lose your locality. And it's really significant; it makes a huge difference in performance, and we are not compute-intensive enough to justify the cost of the switching. That may change in the future, I don't know.

So a single thread handles the connection, the data ingestion, whatever processing you want to do, the commit, and then the response? Yeah, exactly. The design is: basically, the database is sharded by thread. When you instantiate the database, you decide how many worker threads you're going to have, and a thread has the responsibility for everything from beginning to end, and that gives really very good performance (a rough sketch of the idea follows below). That approach only works if the processing is short enough; if we had much more expensive analytics going on, I think we would need to switch to a different strategy. That's it.

So, the basic toolbox. We don't do coroutines anymore, for various reasons. Coroutines are also called lightweight threads, or fibers; it's basically when you do the scheduling of the tasks yourself, whereas with threads, it's the operating system that decides when you run and when you don't. Coroutines can be very, very powerful, in the sense that if you have a good idea of what you're going to do in the next cycles, you can be smarter than the operating system. In our case, we didn't find any benefit, but it's a good tool to know exists. Then, data sharding: like I said, you can scale reads very well, but you can't scale writes; you can cheat, though, by making copies. If you're willing to pay the extra memory, you can have the data in several threads at the same time, and they can all do their stuff on the data without interfering with each other. We did that a lot in the past. Now we do it less and less, because the memory usage can become very high.
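Here is a rough sketch of that shard-per-thread design (my illustration, with hypothetical names): each worker thread owns a shard and a task queue; requests are routed by key hash; and because only the owning thread ever touches a shard, the shard's data itself needs no locks.

```cpp
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Shard-per-thread sketch: each worker owns one shard outright, so the
// shard's data needs no locking; only the handoff queue is synchronized.
class Engine {
public:
    using Shard = std::unordered_map<std::string, std::string>;
    using Task = std::function<void(Shard&)>;

    explicit Engine(std::size_t n) : workers_(n) {
        for (auto& w : workers_)
            w.thread = std::thread([&w] { w.run(); });
    }

    // Route a request to the worker that owns the key's shard.
    void submit(const std::string& key, Task task) {
        auto& w = workers_[std::hash<std::string>{}(key) % workers_.size()];
        { std::lock_guard<std::mutex> lk(w.m); w.tasks.push(std::move(task)); }
        w.cv.notify_one();
    }

    ~Engine() {
        for (auto& w : workers_) {
            { std::lock_guard<std::mutex> lk(w.m); w.stop = true; }
            w.cv.notify_one();
            w.thread.join();
        }
    }

private:
    struct Worker {
        Shard shard;                   // touched only by the owning thread
        std::queue<Task> tasks;
        std::mutex m;
        std::condition_variable cv;
        std::thread thread;
        bool stop = false;

        void run() {
            for (;;) {
                std::unique_lock<std::mutex> lk(m);
                cv.wait(lk, [this] { return stop || !tasks.empty(); });
                if (stop && tasks.empty()) return;
                Task t = std::move(tasks.front());
                tasks.pop();
                lk.unlock();
                t(shard);              // beginning-to-end on one thread
            }
        }
    };
    std::vector<Worker> workers_;
};
```

A request like engine.submit("EURUSD", [](Engine::Shard& s) { s["EURUSD"] = "1.0842"; }); then runs entirely on the thread that owns that key's shard, which is the locality property the speaker describes.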
So, techniques to get around locking, that kind of stuff. One I like a lot personally: read/write locks, when they're efficiently implemented. They pay off if you have really read-heavy workloads, and in time series that's often the case: I write my time series once, and then I'm ready to ask a lot of questions about it. We have micro-locks everywhere, using a lot of the latest locks from TBB, and we use them to manage access to the same entry in a safe way. And again, like I said, optimistic access using transactional memory, which is available in most processors now.

When you say micro-lock, do you mean something like a latch? Is it a logical lock or a physical lock on the data structure? Okay, so the question is: is it a physical lock or a logical lock? We mainly use spin mutexes to access the data structures, and for reads we use spin mutexes that are not too expensive to acquire. Basically, a mutex can be very expensive to acquire for write, especially if it's a busy mutex like a spin mutex, because you have to spin the CPU to wait for the mutex to become available, right? And when you do that, well, you burn CPU power, and the latency can be very high. So we use mutexes that are inexpensive to acquire for read; and for writes, well, if you have to wait for the writer to finish, you have to wait for the writer to finish. That's also why optimistic spin mutexes are very interesting: using transactional memory, what an optimistic spin mutex does is basically run a memory transaction, and if it's successful, it means no one was holding the mutex, because in real life, conflicts are not that frequent. That is, you're not going to touch the same value at exactly the same time very often, right? So you can act as if you were not going to acquire the mutex, and only actually acquire it if you have to. That's the principle of transactional memory.

Another technique you can use is thread-local storage. Thread-local storage is much slower than registers, but it's not shared, and there are a couple of things you can do with it; tbb::combinable, for example, is very specific to this. A typical use is what we call, I think, spread counters. Let's say you want to count the number of entries in your database. The naive approach would be to have an atomic counter that every thread increments. But as we saw, that's not going to scale, because at some point all the threads are trying to access the same atomic value, right? And it's not because it's atomic that it's going to be fast. So what you can do instead is have every thread increment a counter in its own thread-local storage, and only when you want to know the total do you sum all of them (a small sketch follows). So, yeah. Any other questions at this point? No?
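A minimal sketch of such a per-thread counter with tbb::combinable, which the talk mentions (assuming the classic TBB headers; the same idea can be hand-rolled with thread_local variables):

```cpp
#include <tbb/combinable.h>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

// Per-thread counters: each thread bumps its own thread-local slot
// (no shared cache line, no atomic contention); the total is only
// computed on demand by summing all the slots.
int main() {
    tbb::combinable<long> entries;

    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t)
        threads.emplace_back([&entries] {
            for (int i = 0; i < 1'000'000; ++i)
                entries.local() += 1;          // touches this thread's slot only
        });
    for (auto& th : threads) th.join();

    long total = entries.combine(std::plus<long>()); // sum only when asked
    std::printf("total entries: %ld\n", total);
}
```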
Another way to manage access to a shared object is hazard pointers. Basically, you keep a list of the values currently being accessed, and you use that list to know whether it's safe to reclaim a value. There's actually a patent, I think, on this technique; that's what my notes say: be careful about what you use.

And this one is more interesting; it's actually a good transition, because then I can talk about multi-version concurrency control. Basically: we can't all write to the same value, right? I can't, because I will get contention. But what I can do is, instead of writing to the same value, create a new value and write to that. And at some point in time, when I decide that I no longer need the old value, I can erase it. And it's invisible to the others, because if, for example, a thread is accessing the old value and doesn't want to see the update yet, it's as if no update happened. It's interesting because you will find this concept in what's called multi-version concurrency control. In a database, you always have this problem: how am I going to handle concurrent updates to the same values? You can't prevent it; someone is going to want to write to your value at the same time. One way to handle it is to lock, right? Just wait for the other person to finish. The other way is multi-version concurrency control, where everyone sees a value according to a timestamp in time. If I come after the value in time, I create a new value and I see the future, that is, my new value; but a thread running in the past, with a timestamp in the past, will see the old value. The challenge is then: when can I remove the old values? That's called trimming. We use multi-version concurrency control in Quasar for distributed transactions.
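A tiny sketch of that visibility rule (my illustration, far simpler than a real engine; it assumes timestamps are handed out monotonically, and it uses a plain mutex around the version list, since the point here is the versioning, not lock-freedom):

```cpp
#include <algorithm>
#include <cstdint>
#include <mutex>
#include <optional>
#include <string>
#include <vector>

// Minimal MVCC sketch: writers never overwrite, they append a new version
// with a timestamp; a reader sees the latest version at or before its own
// snapshot timestamp, so readers in the past keep seeing old values.
class VersionedValue {
    struct Version { std::uint64_t ts; std::string value; };
    std::vector<Version> versions_; // sorted by ts: assumes monotonic writes
    mutable std::mutex m_;          // guards the version list only

public:
    void write(std::uint64_t ts, std::string v) {
        std::lock_guard<std::mutex> lk(m_);
        versions_.push_back({ts, std::move(v)});
    }

    // Read as of snapshot ts: latest version with version.ts <= ts.
    std::optional<std::string> read(std::uint64_t ts) const {
        std::lock_guard<std::mutex> lk(m_);
        for (auto it = versions_.rbegin(); it != versions_.rend(); ++it)
            if (it->ts <= ts) return it->value;
        return std::nullopt; // value did not exist yet at this snapshot
    }

    // "Trimming": drop versions no snapshot can see anymore, i.e. all but
    // the newest version at or before the oldest active reader.
    void trim(std::uint64_t oldest_active_snapshot) {
        std::lock_guard<std::mutex> lk(m_);
        auto it = std::upper_bound(
            versions_.begin(), versions_.end(), oldest_active_snapshot,
            [](std::uint64_t ts, const Version& v) { return ts < v.ts; });
        if (it != versions_.begin()) --it;   // keep the still-visible one
        versions_.erase(versions_.begin(), it);
    }
};
```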
Okay. So, in one hour I obviously can't go into the details of what we did over several years to make the database fast. But I hope that what I showed you gives you an idea that it can be a huge undertaking: what we just saw is only what you have to think about to scale up in a multi-threaded environment, right? We haven't even written to disk or read from the network card; we haven't even done serialization; we haven't even asked how the client and the server are going to communicate. But I hope you can see that there's a lot of thinking to be done if you really want a huge level of performance.

One thing I'm very careful about, really a favorite of mine, is memory copies, and especially the ones you don't see. It's easy to say: well, it's inexpensive because of memory bandwidth, yada yada yada. But actually this can be super expensive, and we know it for a fact, because we have this low-level key-value API that we expose with the database; you can use it. And we have two APIs just to get a value. There is one which allocates memory for you and puts the value in that buffer, so you don't have to do anything. And there is a zero-copy one, where we write directly into the buffer that you give us. When the data becomes large, the difference in performance is really huge, and we're just talking about one copy. When I say huge, it's like 20, 30 percent, and that's for something like 50 megabytes, that kind of size. The thing is, when you use libraries, and depending on the language you write in, you can have a lot of hidden copies, and it's not immediately obvious what's going on with your data in memory. The hidden copies can be a huge performance bottleneck, for several reasons. First, they increase the pressure on your allocator, and there's a threshold where, past a certain number of allocations, the allocations start creating memory fragmentation; then the longer your database runs, the more performance problems you have. Second, you are making the life of the cache a little bit harder, because you're accessing different values all the time. And if you can access the same value all the time, it's really better, not only for the cache, but also for the memory paging system. We often forget that the memory we access actually goes through a system called paging: the addresses we use are not physical memory, but pages that map to physical memory. If you access memory within the same page, it's much faster than accessing memory across different pages. So by avoiding copies, that kind of thing, you can gain a lot of performance. Ideally, my dream would be to take the buffer from the network card and write it directly to the disk when I have an update, without the processor in between. That would probably require dedicated hardware, but I think it's the future of databases: really write directly to the disk, without any processing, and don't do any copy.

Is there any tool that you found, or how do you make sure you don't have extra copies? Is it just coding discipline, or is there a tool that does a pretty good job of identifying problems? So the question is: how do we make sure we don't have copies? There is, of course, discipline in what you write. There is constant benchmarking of what you do. And in C++, you can just prevent copies, or catch them in copy constructors, that kind of stuff, so you know that a copy happened and why. You have coding techniques in C++ to say: I prohibit copies of this object, so I know for a fact that it's not going to be copied. And when we use third-party libraries, we always check in the documentation whether there's a way to avoid copies, whether they provide a copyless or zero-copy API.

This passion I have for zero copy actually comes from when I was working on file systems; it was really about having a zero-copy cache for the file system, bypassing the kernel. So again, that's a bit insane, because why would you want to bypass such an evolved piece of software? Why would you know better than the file system, right? Well, for a database, that's actually pretty obvious. The file system is between you and the disk, and in the case of a database, you don't really need the file system if you have a pretty good idea of your data structures. In our case, we use a technology coming from the company Levyx, which is neither an LSM tree nor a B-tree. But let's say you have a B-tree and you want to write it directly to disk: that's more efficient than going through files. The other advantage of bypassing the file system is that you don't have any surprises in terms of write caching, if you want to be sure that the data is actually written to disk, which is a guarantee we have to provide. So in our case, we have a small buffer of one page, which...

So you're using Levyx's stuff on a raw block device? Yeah, so the question is: do we use Levyx's stuff on a raw device? And the answer is yes. We mount an NVMe card directly and write directly to it. But you're still going through the kernel for block writes? So yes, the question is: do we still go through the kernel for the block writes? And the answer is yes. Do you believe it would be worth saving that kernel interposition with something like SPDK? So the question is: do I think SPDK is interesting? And the answer is: maybe, yes. I know the Levyx people, and I wonder if I should suggest it to them.
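As an aside, here is what "no file system, but still through the kernel for block writes" can look like in its simplest form: a hedged sketch using POSIX O_DIRECT on a raw device node (illustrative only, the talk doesn't describe Levyx's actual mechanism; the device path is hypothetical, O_DIRECT requires aligned buffers, and obviously don't run this against a device holding data you care about).

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <cstdlib>
#include <cstring>

// Sketch: write one 4 KB page straight to a block device, bypassing the
// file system and the page cache (O_DIRECT), but still via the kernel.
// O_DIRECT requires the buffer, offset, and size to be block-aligned.
int main() {
    int fd = ::open("/dev/nvme0n1", O_WRONLY | O_DIRECT); // needs privileges
    if (fd < 0) return 1;

    void* page = nullptr;
    if (posix_memalign(&page, 4096, 4096) != 0) return 1; // aligned buffer
    std::memset(page, 0, 4096);

    ::pwrite(fd, page, 4096, 0);   // block 0: our own layout, no files
    ::fsync(fd);                   // make sure it reached the device

    std::free(page);
    ::close(fd);
    return 0;
}
```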
I think you probably could also... Yeah, yeah; well, that's a discussion we can have. For me, every intermediary we can remove is interesting, but maybe we would be surprised by the numbers, and maybe it's not worth it because the device is actually the bottleneck. I don't have the answer; I don't know. For Optane, that's really interesting. For NVMe, the fastest NVMe you can get is, what, two gigabytes per second in write? So yeah.

Are you still using the file system for metadata, like naming and access control and all that stuff, or do you not use the file system for that at all? So the question is: do we use a file system at all? And the answer is: when we go through the Levyx layer, we don't need a file system. There are different ways you can use us. We have a RocksDB persistence layer that you can use, which relies on the file system; the advantage is that you can then use the RocksDB open-source tools, if you want, to access the data, et cetera. Or you can use the Levyx persistence layer, in which case you can either mount a device directly or mount a file as a device. The Levyx layer is faster than RocksDB in most cases, and it's not just because it mounts the device directly, but that helps.

You only have five people working on the database; you don't want to build all these things yourselves? Yes, the answer is yes. When you suppress sleep, you gain a lot of time. And just to be fair: the low-level persistence layer, we didn't write it, right? It's from the Levyx guys. Again, we didn't want to build everything ourselves; we've been smart about not writing anything that would not make sense. For example, the trap, if you have a big team, is to say: well, we're going to write our own logs, we're going to write our own lock-free maps, we're going to do all that. But we were resource-constrained, which forced us, because we didn't raise money in the first place, so we had nothing, right? And when you are really constrained and you have a customer who wants a product, then you can be very smart and go very fast.

You said you had one-kilobyte pages, right? When you go to the raw store, do you write one-kilobyte pages? So the question was about the pages. What I was saying, and maybe I was unclear, is that we have a buffer between us and the disk. We are not 100% synchronous, because that would be problematic in terms of performance, and the size of the buffer is one page, so it's four kilobytes. It's a couple of microseconds of exposure, yeah. It's acceptable; there are trade-offs. Right.

Yeah, so this one is interesting. If you send a request to the database, whatever request it is, say: hey, here's my bunch of points that I want to insert into my time series, what you have to do is serialize this data and send it to the server, right? And maybe you are running on Windows XP; that's a real-life scenario, right? A 32-bit Windows XP client writing to a 64-bit Linux server: you can't just dump the memory on the wire, send it, and read it on the other side. So you have to do what's called serialization; I think you've heard about it. We spent a lot of time being very efficient about serialization, because it's really something that showed up when we benchmarked the software: when you look at how much time you spend handling requests, serializing the requests actually uses a lot of time. So what we did (let me check the next slide, oh yeah): we leveraged C++ template metaprogramming techniques to decide at compile time the amount of memory you need to serialize a structure. I'll give you the simplest example. Let's say you want to send a 64-bit integer: you know at compile time exactly the amount of memory you need to send it, it's eight bytes, right? Assuming you don't do a variable-length encoding. And by composition, if the structure is not just a single integer, you can compute at compile time the size of the whole structure. So you can use the stack instead of dynamic memory allocation, and that can really, really give a lot of performance, because a memory allocation is very expensive compared to just moving the stack pointer.
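A minimal sketch of that compile-time-size idea (my reconstruction with hypothetical names, not QuasarDB's actual serialization code):

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

// The wire size of a fixed-width field is known at compile time, and the
// size of a struct is the sum of its fields' sizes, by composition.
template <typename T>
struct wire_size; // no generic definition: only serializable types compile

template <> struct wire_size<std::uint64_t>
    : std::integral_constant<std::size_t, 8> {};
template <> struct wire_size<std::uint32_t>
    : std::integral_constant<std::size_t, 4> {};
template <> struct wire_size<double>
    : std::integral_constant<std::size_t, 8> {};

struct Point { std::uint64_t ts; double value; };

template <> struct wire_size<Point>
    : std::integral_constant<std::size_t,
          wire_size<std::uint64_t>::value + wire_size<double>::value> {};

inline void serialize(const Point& p, unsigned char* out) {
    std::memcpy(out, &p.ts, 8);           // real code would fix byte order
    std::memcpy(out + 8, &p.value, 8);
}

// Because the size is a compile-time constant, the encoding buffer can
// live on the stack: no dynamic memory allocation at all.
void send(const Point& p) {
    unsigned char buf[wire_size<Point>::value];
    serialize(p, buf);
    // ... hand buf to the network layer ...
}
```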
You're talking about the client driver? You're compiling it to be hard-coded for your protocol? So, I'm talking about the client and also the server input. So you want to make it faster? Yeah. Sorry, how much more do you have? Well, it's almost finished. Okay, perfect, keep going.

Okay, so that's basically what we've done. I'm sorry to finish with some C++, but basically we have code that can detect at compile time whether you're going to need a memory allocation; if not, we just use a buffer on the stack. Then, just to finish with some funny stuff: for our distributed transactions, we need highly accurate timestamps, and it's interesting to know how this differs between operating systems. To be fair to Microsoft, since Windows 8 it's no longer the case. But yeah, you can be surprised that some system calls cost just 20 nanoseconds, basically nothing, while this one is more expensive because we have to do some math to get the right time.

So, just to finish. Like I said, we like to say we are a 100x database: when we go to a customer, the gains have to be in that area, because otherwise, just use PostgreSQL, just use SQL Server, just use Oracle. And what I want to do is 10,000x. How? Well, I have a couple of ideas, but it's a secret. Okay, that's it.

Awesome. We have time for one or two questions. Yes? What about putting shared-access data blocks into something called a hazard pointer list? Sorry, yes: hazard pointer lists. We don't use them. You don't use them? No, it was just an example. Ah, I thought the opposite. Yeah, no. About hazard pointer lists: the real reason why we don't use them is because of the patent. Yeah, I don't think it would hold. Yeah, I also don't think so, but I'm not willing to bet the company on it. One more question. Have you looked at code generation for queries, like using LLVM to turn the query into machine code? So, for queries using LLVM: for the moment, the queries we have are very simple, very time-series related. As we grow, I think we will have to get smarter and smarter about that. All right, let's thank Edward again. Thank you.