Thank you, Marisa, for that introduction. So let's begin. First, a little bit about myself. About 15 years ago, and for a few years after that, I was the creator and maintainer of the Kernel-based Virtual Machine (KVM), and so I participated in many Linux Foundation events. Later, I was the creator of the Seastar I/O framework, and now I'm working on ScyllaDB, where I'm the CTO and co-founder. For those who are wondering, the virtual background behind me is a screenshot from the movie The Black Hole from 1979. If you recognize it, good for you, but it also means that you're old.

So let's begin with the presentation. A little bit about the structure: the presentation is generic in that it doesn't apply only to ScyllaDB; it's something that you can apply in your own applications as well. But I will also explain how the principles within it apply to ScyllaDB. For those who are not familiar, ScyllaDB is a high-throughput distributed NoSQL database. You can scale out by adding more nodes as necessary if you need more storage or more compute. It is compatible with three different protocols: the Cassandra protocol, the first protocol we supported, and also Redis and Amazon DynamoDB. Of these, only Cassandra and DynamoDB are really supported for production use; Redis is not a mature implementation, but Cassandra and DynamoDB are mature and have been used in production for many years. ScyllaDB uses a thread-per-core architecture in order to get high throughput; I will elaborate more on that later. It is self-tuning: we try to avoid all the fiddling with configuration variables, which can be very hard in distributed environments, because every configuration item affects every other configuration item, and tuning can become an endless loop. The main use case is online transaction processing, that is, real-time event processing. But we also support analytical processing on the same data, so you can fire up Apache Spark on that data and perform your business intelligence queries while the database is serving real-time requests from users.

So let's compare throughput and latency and see why they're at odds with each other. They're not really compatible: you can usually achieve one or the other, but it's hard to achieve both. When we're doing throughput computing, we want to maximize utilization of our resources. We want the CPUs to be at 100%, we want the disks to be busy, and we want to keep a lot of requests in flight, so that if there is latency from the network or from the disk, the CPU isn't idle while it's waiting for responses. You want to generate a lot of concurrency and make sure that there is always something more to do. And it's the total time for the job that matters: if you're running a query that takes several minutes or maybe even several hours, what's important is to minimize the total amount of time that the query takes, and it's okay to reorder or serialize individual operations. You want the entire batch to complete as soon as possible, and you're willing to trade off the latency of individual operations.

Let's contrast that with latency computing, which is often denoted by OLTP, online transaction processing. There you want to leave some free cycles, so that if you have a traffic burst, you have those cycles to absorb it. You cannot predict the data accesses, so you cannot do read-ahead. And when you write, you often cannot buffer the data; you must write it immediately in order to ensure durability.
You want to be able to promise your users that if you respond to a write, then that write has reached disk and they are sure to read it there. That increases latency. It's the individual operation time that's important, not the total runtime of the entire job, because if you have live humans on the other end of the application, they have clicked something and they want the web page to appear promptly. They don't care about all of the other users; they care about their own operation, and as the operator of the website, so do you. And you have lots of operations executing concurrently: multiple users clicking at the same time, or an application without humans behind it that still generates a lot of concurrency. So those are quite different, and perhaps the biggest difference is that for throughput computing you want to saturate your CPUs, while for latency computing you want to leave free cycles, to make sure that when a request comes in, there is a CPU available to process it.

So why mix them at all? There are two answers to that question. First, you might have two different workloads that run on the same data. This is often called hybrid transaction and analytics processing. Here you have your users who are clicking their way through your website, but you also want to run analytics on the same data. You could spin up a separate cluster for analytics and run them there, but that's more resources, and you also need to transfer the data and keep it always synchronized. It's a lot of hassle, and it's better if you can do it on the same cluster.

The second reason is that you have some internal processes within the database that look like an analytics query. One example is garbage collection in the Java virtual machine. It's a throughput-oriented operation: you're running through the entire memory, looking for memory that you can throw away, while at the same time your application is running and processing real-time events. The same thing happens with a database. In a database, garbage collection is a process where you look for records that you can expire, or records that became obsolete because they were overwritten, and remove them. So you have these garbage collection operations, and in a database this is often done via a log-structured merge tree, which is a very common structure these days for maintaining data. And you have cluster maintenance operations: if you're adding a node, then while the nodes are serving those real-time operations, you also need to stream data to new nodes that are being added, or stream data from nodes that are being decommissioned. You're performing backups, or you're scrubbing nodes, which is looking for errors and correcting them. All of those operations, from the point of view of the system, look like an analytics query: they're throughput-oriented and they run in parallel with an OLTP workload. So we have a lot of motivation to make sure that we can run those two kinds of operations in parallel, without the throughput operations interfering with the transactional workload.

So how are we going to approach this? The general plan is to isolate the different tasks that make up the workload. We want to be able to say, about every piece of code and every I/O that goes to the disk or to the network, whether it belongs to an OLTP or real-time operation, or whether it belongs to an analytics or maintenance operation.
So the first step is to identify those tasks and tag them in a way that allows us to sort them out later. And the second step is to have a scheduler that picks which tasks to execute and when to execute them. If we're able to do this, then we're able to prevent the analytics operations from interfering with the real-time queries. So let's see how we do that. But before that, we'll show how to achieve the throughput that we want, and later how to adapt this to mixed latency and throughput workloads.

Modern machines have a large number of cores; with the large machines today, you can see 128 cores on the same server. And there is a large penalty for trying to coordinate work across that many cores: atomic operations or locks, cache line bouncing, and even just selecting which core will be the one to execute a task. All of that creates a large penalty, so we would like to avoid it. The way to avoid it is to recognize that a distributed database already spreads the data across nodes. We're already able to divide the data across the cluster into individual nodes, so we're just going to extend this one more level and also distribute the data among shards, that is, among cores. Each core becomes the owner of a particular subset of the data, and we route the work relating to a particular bit of data to the core that owns it. So compute and storage are co-located.

We've built the Seastar framework, an open-source C++ framework that helps with doing that. It's in use by several projects: ScyllaDB, of course, and I can also mention Vectorized.io, which builds a streaming platform compatible with Kafka, and the Ceph distributed file system by Red Hat, and a few others. So it is built exactly for this sort of clustered operation.

The idea behind shard-per-core is that the individual cores don't share data with each other, and so they don't block each other. They don't need locks when accessing the data, because there is just one core that can access it, and with single-threaded operation there can be no contention. It's similar to the image on the right, of each puppy eating from its own bowl. By the way, I didn't mention the questions: you can put your questions in the Q&A box if you have any. So far I haven't seen any, but I'm happy to stop and answer questions as they pop up.

Here we also see a comparison between a traditional threaded application and the Seastar sharded stack. In a traditional application, you have many threads, and the threads are multiplexed onto cores by the scheduler. This immediately presents the problem of how to do that multiplexing. If you have a large number of threads, that's good for throughput, because it means that every core will have at least one thread to run, and so we can saturate the machine. But it also means that you will not be able to guarantee that there are free cycles left for your latency-sensitive computing. And if you have a small number of threads, that's great for latency, but your throughput will be limited, because you will not be able to utilize all of your cores. So it's hard to find the perfect fit, and of course it always varies with the workload itself. With Seastar, we have exactly one thread per core, and we run the same database stack on each core. Those stacks communicate with each other with point-to-point queues.
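To make the routing concrete, here is a minimal sketch using Seastar's public cross-shard call, seastar::smp::submit_to(). The per-shard map and the modulo-based ownership rule are simplifications for illustration, not ScyllaDB's actual partitioning scheme:

```cpp
#include <seastar/core/smp.hh>
#include <seastar/core/future.hh>
#include <cstdint>
#include <unordered_map>

// One map per shard: thread_local gives each reactor thread (one per
// core) its own private instance, so the data itself needs no locks.
static thread_local std::unordered_map<uint64_t, int> local_data;

// Route a lookup to the core that owns the key. The cross-core call
// travels over Seastar's point-to-point queues, not shared memory.
seastar::future<int> read_key(uint64_t key_hash) {
    unsigned owner = key_hash % seastar::smp::count;  // toy ownership rule
    return seastar::smp::submit_to(owner, [key_hash] {
        auto it = local_data.find(key_hash);
        return it == local_data.end() ? -1 : it->second;
    });
}
```

The important property is that each map is only ever touched by its owning core, so the data path needs no locks or atomic operations.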
So instead of taking a lock in order to access shared memory, the individual cores send messages to each other via queues, and they can proceed with other work until the response comes back via the response queue. This is a cooperative model where, instead of taking locks, the individual tasks wait for each other at predefined points in order to check for completion.

So let's see how we make latency and throughput behave nicely with each other for CPU workloads. This is the easier part; the second part will be I/O, which will be a little more difficult. Remember, we wanted to isolate the tasks and then schedule them. In the traditional model, you isolate tasks in threads: every task, say a web request, becomes a thread, often borrowed from a thread pool, and you let the kernel decide which thread runs first. You can try to influence the kernel's choices by raising or lowering thread priorities. The advantages of that are that it's well understood and it has a huge ecosystem. And although you might think from the size of the disadvantage list that I'm a little bit biased, and I am, those advantages are really very important and well worth considering. Unless you have a very specialized application, it's really good to go down the well-trodden path and suffer the disadvantages. But if you are writing a high-throughput application, then it's also worth considering the disadvantages. Context switches that arise from mapping operations to threads are pretty expensive, and they're actually getting more expensive because of things like the Spectre and Meltdown mitigations. Communicating priorities to the operating system is not that easy: the priority levels are not meaningful, so you cannot predict what a particular priority level will result in in terms of response time. Locking, when you have a large number of threads, becomes very complicated and very expensive, so locking eats into your compute budget. It's possible to have a priority inversion, where a low-priority task holds a lock that is needed by a high-priority task and blocks its progress. And the kernel does its scheduling at its own granularity, and it may induce latency in this way.

What we chose instead is task isolation at the application level. Here, instead of a thread, every operation is just a normal object with a function pointer that tells us what operation needs to be done, and the operations are multiplexed onto our thread-per-core architecture. So if you have a machine with, say, 30 cores, then you will have 30 threads, each executing one task at a time. But you can alternate between throughput and latency tasks, and this is the key: you have the choice whether to execute a latency-sensitive task, and after you've run out of latency-sensitive tasks, you can use the remaining spare cycles to run throughput-oriented tasks. And you have the concurrency framework, which is Seastar in our case, that does this assignment of tasks to threads and, crucially, controls the order.

So what are the advantages and disadvantages here? One advantage is that you have full control, and that means that if you identify any problem, you can go ahead and fix it. But you'll notice I also listed it as a disadvantage: full control also means that you are responsible for everything, and there are many more bugs to chase down and strange behaviors to explain.
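To make the application-level task model concrete, here is a minimal sketch of how work can be tagged using Seastar's scheduling-group API. The group names, share values, and the two stand-in work functions are invented for the example:

```cpp
#include <seastar/core/scheduling.hh>
#include <seastar/core/future.hh>
#include <seastar/core/when_all.hh>
#include <seastar/core/sleep.hh>
#include <seastar/core/coroutine.hh>
#include <chrono>

// Hypothetical stand-ins for real query and compaction work.
seastar::future<> handle_client_request() {
    co_await seastar::sleep(std::chrono::microseconds(50));
}
seastar::future<> run_compaction_step() {
    co_await seastar::sleep(std::chrono::milliseconds(5));
}

seastar::future<> run_tagged_work() {
    // Two priority classes: shares are relative weights, so when both
    // queues are busy, "query" gets 4x the CPU time of "maintenance".
    auto query_sg = co_await seastar::create_scheduling_group("query", 800);
    auto maint_sg = co_await seastar::create_scheduling_group("maintenance", 200);

    // Everything run under with_scheduling_group() is accounted to
    // that group's task queue.
    co_await seastar::when_all_succeed(
        seastar::with_scheduling_group(query_sg, handle_client_request),
        seastar::with_scheduling_group(maint_sg, run_compaction_step))
        .discard_result();
}
```

Tagging work this way is what gives the scheduler the information it needs later to decide which queue runs next.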
Another consequence of all this control is that it takes more time to mature such a system compared to the traditional one. On the other hand, it has very low overhead: the scheduling is cooperative, so you never need to take any locks or perform atomic operations. The CPU affinity is wonderful, so you never have cases where cache lines bounce around and give you poor performance on a large machine. And the kernel is less involved, so the kernel, which is a big and complicated beast, will give your application fewer surprises.

So this is how it looks. You have a bunch of tasks, represented by the small rectangles, and they're placed into task queues. Remember, we've identified every task and tagged it in a way that identifies whether it is a latency-sensitive task or a throughput-oriented task, so we can sort them into individual queues, and each queue has its own priority class. The scheduler can therefore tell what kinds of tasks are waiting to be scheduled and, for each task queue, what its priority is. This is the key to the whole thing. The way execution goes is that the scheduler picks a task queue, then picks tasks from that queue and keeps executing them until one of two things happens. One is that the scheduler runs out of tasks in that task queue. The other is that a timer tick happens, and our timer tick is every half a millisecond. In this way we ensure good locality: it's important to execute similar tasks back to back, so that CPU hardware mechanisms like the instruction cache and branch prediction work well, but also not to let one task queue monopolize the CPU, by preempting it when needed. So the scheduler alternates between task queues; you can have a bunch of small tasks running within a time slice, or one large task occupying the entire time slice.

So when do we switch queues? One option is when the queue is exhausted, and this is common for your real-time operations: a bunch of queries appeared on the network, you start processing them, and when you're done, you switch over and do something else. The throughput-oriented jobs are large jobs that take many seconds, and therefore they get preempted: you start a task, you see that it took more than half a millisecond, so you stop it and move on to the next task queue. Every time you switch between task queues, you also poll for I/O: you check if you have new tasks by looking for network events or for I/O completions from the disk. At that time, you make a scheduling decision: what is the next queue from which I will pick my tasks? The goal is to keep the runtime equal across the queues, but weighted by the number of shares that we assign to each queue. If a queue has a large number of shares, it will get proportionately more runtime than a queue with a small number of shares. So it's not plain round-robin. This allows us to give higher priority to our real-time tasks and lower priority to our throughput-oriented tasks.

So how do we preempt those tasks? One option is to read the clock at the head of every loop and compare it to the end of the time slice. That's incredibly expensive, so we don't do that. Another option is to ask the kernel to send a signal every half millisecond. This works; it's a little bit problematic on large machines because of the way that signal delivery works, but it does work.
The way that we actually do it is we use a kernel timer to write to a memory location, and then we just poll that memory location to see if it changed. There are nice techniques involving io_uring that allow you to do that. It's a little bit tricky, but also very efficient. When you do something like that, you're completely in control, and that also gives you the ability to create problems for yourself: for example, a complex computation that never checks the preemption mechanism. So we have a stall detector, a signal-based mechanism that logs an entry whenever it detects that a task was not preempted in time, and we use that to detect problem areas in the code and fix them.

So that was the part about the CPU. Let's look at I/O, which is in some ways similar: the goals are similar, but the details are quite different. Again, we want to isolate the tasks, to see which I/Os are part of latency-sensitive operations and which I/Os are part of our throughput-oriented computations, and we want to schedule those I/Os at the time that we choose. I/Os are similarly easy to identify: every time you issue a disk read or disk write, you know that you're doing it, and you can easily tag that I/O with a marker that tells you whether it's part of a throughput-oriented operation or a real-time operation. And there are just four types: random reads, random writes, and sequential reads and writes. But in practical terms, it is much more complicated. First, disks are physical devices; they are not like cores, which are just abstract machines that execute instructions. Their behavior changes with time. If you're talking about hard disks, they have heads that move, and you need to take into account the time it takes the head to move. And even solid-state drives, which appear to be more like RAM, also have strange behavior: if you perform random-access writes, after a while their write performance starts to deteriorate, because they need to perform internal wear leveling and garbage collection. So they're physical beings with more complicated behavior, and that behavior keeps changing, both with time and also depending on your deployment. Different disks can behave very differently.

So let's compare CPU to I/O. On a CPU, tasks are homogeneous: you basically have a pointer to a function, and you execute that function. But with I/O, the tasks are wildly different: they can be small reads that require disk seeks, or they can be large sequential writes, and disks have very different limits for different types of I/O. With a CPU, you can have each core schedule its own tasks, so there is no coordination penalty. But with a disk, you cannot do that: you usually have fewer disks than you have cores, so you need to share each disk among cores, and that means you need to do coordination. The capacity of a core you know in advance: a core can give you one CPU second every second. But disk performance depends on the workload, and of course you're running multiple workloads at the same time, so it's hard to predict. And a core runs one task at a time; even if you're running threads, at any particular moment a core is running just one thread. But disks need multiple requests in flight, otherwise you will not be able to utilize them. Disks are internally parallel machines, and so they require concurrency in order to operate at their full capacity.
One challenge, the cross-core coordination, we can solve using leases. We can have each disk virtually lease out some of its capacity. Let's say a disk can do 100,000 operations per second, and we have 10 cores. We can lease out a few operations at a time to every core, and this way we can share the disk capacity among the cores without static partitioning, which is where every core gets exactly one tenth of the capacity and some capacity goes unused if not every core claims its share. So leasing gives us dynamic sharing without oversubscribing the disk.

One thing we need to do is decide how many requests we can feed the disk. Let's limit the discussion to SSDs, because that's what most people use these days. Usually you have four different parameters that describe a disk: the read throughput and the write throughput in megabytes per second, usually two different numbers, and the read operations per second and write operations per second, again two very different numbers. So overall, four parameters, and of course the larger those parameters, the better your disk. But the story doesn't end there; it isn't so simple. This chart demonstrates how the disk behaves when you mix and match different workloads. On the x-axis, we have a write workload that starts from zero and goes all the way up to the maximum capacity of the disk, and on the y-axis, we have a read workload that starts from zero IOPS and goes all the way to the maximum IOPS that the disk can support. In the matrix, you can see how the disk behaves with each combination of workloads: on the bottom left, we're pushing low workloads on both the write and the read side, and on the top right, we're pushing a high read workload in parallel with a high write workload. And you can see that the disk is not full-duplex: it cannot process both workloads concurrently at full speed. There is a region in which it will simply not be able to sustain the workload, there is a region where it works well, and there is a gray area where it sort of works, but the latency starts to shoot up. Our goal is to stay out of this gray area, and that's what our scheduler should do: it should schedule just enough requests to stay within the good zone.

So let's see how this is implemented in ScyllaDB, or actually in Seastar, where most of the code lives. This slide is a quick reminder of what ScyllaDB is. The scheduler has two jobs. The first is to decide whether it has room to admit a new task. For the case of I/O: if the scheduler admits a new I/O, will that cause the disk to go over the limit and start incurring high latency, or does the disk have sufficient capacity to accept the new I/O without a problem? And once we've answered the first question, the second one is: of all the I/Os that we have pending, which should I admit? The same goes for CPU. We have a bunch of CPU tasks that are waiting; we first need to decide whether we want to admit a new task to the CPU. Here it's easy: if the CPU is idle, then we should admit a new task. And the second question is the same: which of the waiting tasks should I admit? This is the basic structure of the scheduler. For each queue of tasks, we assign shares, so every workload is characterized by the number of shares that it has. A higher number of shares means that we will accept more tasks from that queue.
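Here is a small sketch of what this share-weighted selection can look like. It follows the "equal runtime, weighted by shares" rule described earlier, but it is a simplification for illustration, not Seastar's actual scheduler code:

```cpp
#include <vector>

// A task queue accumulates "virtual runtime": actual runtime divided
// by its shares. The scheduler always runs the queue that is furthest
// behind, so CPU time divides proportionally to shares over time.
struct task_queue {
    float shares;         // e.g. 800 for queries, 200 for compaction
    int pending = 0;      // number of queued tasks (toy stand-in)
    double vruntime = 0;  // runtime consumed, scaled by 1/shares
};

// Pick the runnable queue with the smallest virtual runtime.
task_queue* pick_next(std::vector<task_queue>& queues) {
    task_queue* best = nullptr;
    for (auto& q : queues) {
        if (q.pending > 0 && (!best || q.vruntime < best->vruntime)) {
            best = &q;
        }
    }
    return best;  // nullptr means nothing to run: go poll for I/O
}

// After running tasks from a queue, until it emptied or the 0.5 ms
// tick fired, charge the consumed time against its shares.
void account(task_queue& q, double ran_microseconds) {
    q.vruntime += ran_microseconds / q.shares;
}
```

The queue that has consumed the least runtime relative to its shares runs next, so a queue with 800 shares ends up with four times the runtime of a queue with 200 shares when both stay busy.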
So the scheduler alternates between picking a task, dispatching it to the resource that is being controlled, which can be the disk or the CPU, getting the results back, and moving on to the next task. And this is the key to limiting the impact of throughput-oriented tasks on latency: we simply don't admit a throughput task if there would be an impact on latency.

So how do we assign those shares? There are two ways to do that. The first is to have an internal controller that selects the number of shares dynamically. This is mostly for internal maintenance tasks, where the user does not have the ability, or the desire, to tune everything; they want to spend their time elsewhere. So you want the system to automatically determine the correct number of shares needed to accomplish the maintenance task, say streaming data to a new node, without affecting the user's workloads. The way this works, we have the various queues. In our case, we have the commit log and query queues; the query queue represents the real-time queries that come from users, and the commit log represents the writes. And we have various other queues which are used for maintenance tasks, and the scheduler balances between them. Compaction is one of the important maintenance tasks; it works to keep reads and writes in balance. In order to decide on the right number of shares for the compaction task, we monitor the backlog that compaction has. If compaction is falling behind, we increase its shares, and so we increase the amount of resources it takes from the CPU and disk, which allows it to reduce its backlog. On the other hand, if we see that the backlog is falling, it means that we are making progress, and we can reduce the number of shares again and allow the user tasks, serving queries and writing data, to have a larger share and see lower latency.

We do the same with memory. ScyllaDB accumulates writes in memory, and whenever memory gets full, it starts flushing those writes to disk. This is a very efficient way of performing writes, because you only need to access the disk sequentially, as opposed to randomly, which is the way it happens with traditional B-tree storage formats. But if you run out of free memory, then you end up not being able to serve any writes at all. So we monitor the amount of free memory that we have, and if we see that we're starting to run low, we initiate a flush and assign it a number of shares that allows us to keep accepting new writes. We ensure that we flush memory to disk at the same rate that writes come in, dynamically balancing the flush rate against the incoming rate, so that we're not flushing too fast, which would be a waste of resources, or too slow, which would make us run out of memory and lose the ability to accept new writes.

We can also do the same thing with different user workloads, and we do that by letting the user assign shares to different query queues. Say you have a real-time workload, let's call that query one, and an analytics workload, let's call that query two. They run in parallel, but you would like the analytics workload not to interfere with the real-time workload. So you can assign different priorities to query one and query two.
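To illustrate the shape of the internal controller described above, here is a toy proportional controller for the compaction group's shares; the real controllers in ScyllaDB are more involved, and the constants here are invented:

```cpp
#include <algorithm>

// Toy proportional controller: shares grow with the compaction
// backlog, so a growing backlog pulls in more disk and CPU, and a
// shrinking one returns capacity to latency-sensitive queries.
float compaction_shares(double backlog_bytes) {
    constexpr float min_shares = 50;    // always make some progress
    constexpr float max_shares = 1000;  // cap the impact on queries
    constexpr double gain = 1e-8;       // shares per byte of backlog
    float s = static_cast<float>(gain * backlog_bytes);
    return std::clamp(s, min_shares, max_shares);
}
```

The memtable flush controller has the same feedback shape; the measured backlog there is dirty memory rather than pending compaction work.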
Coming back to the two user workloads: by giving a higher number of shares to the real-time workload, you ensure that the analytics workload doesn't interfere with it. It can still make progress, but it will not cause the latency to increase when it starts using up resources; it will basically only use spare resources left over from the real-time workload. And we have CQL commands that allow you to adjust the number of shares for different workloads. It works at the session level: every session that connects to the database is really using a separate queue at the back end.

And that's it from me; I'm happy to take questions if there are any. I think I see a question: how is this being used in real-world applications? One example that we have, I think I mentioned it, is Spark versus real-time workloads. Many companies like to use Spark for analytics, but Spark can gobble up a large amount of resources, and if you aren't careful and you don't use something like workload prioritization, you end up having the Spark workload dominate your real-time workload, and that can cause a disruption. With this system, you can run the Spark workload on just the spare capacity of the system.

I see another question, so I'll read it: does the scheduler operate on supervised machine learning in a feedback loop? It's not exactly supervised machine learning; we use a really simple proportional-controller algorithm. I think it would make a good study to see if machine learning can improve on it. I'm not a machine learning expert, so we didn't really try machine learning. But certainly, with machine learning you could learn the characteristics of the disk in real time, and so avoid having to measure the parameters of the disk ahead of time. It would make a very interesting project. We didn't do that yet, but I agree there's room for investigation here. I would love to have more knowledge of machine learning in order to know how to do that.

Okay, I see another question: how many different workloads can you reasonably prioritize? The number is a handful; we don't really see lots of cases that demand more than that. I guess microservice applications could benefit from having a very large number of workloads operating concurrently on the same database, but we have not reached that level yet. So if you have five or eight workloads, the system can deal with them. But if you have many more, then with so many queues to select between, you cannot reach a good decision at all times. So five to eight is what you can reasonably expect to work.

Do we have time for more? Let me check the time. I see a question about whether we plan to switch from Linux AIO to io_uring. I mentioned io_uring because it's a hot topic; in reality, we're still using Linux AIO. io_uring performs just as well, or maybe a little bit less well, but for our purposes AIO is fine. We will eventually switch to io_uring, since it's such a popular API.

One more question: is there a need for separate inter-node versus intra-node (shard or replica) level configuration? I ask because I think you mentioned the intent to avoid configuration as much as possible. So, the intra-node configuration is completely automatic: when the system is installed, it detects the number of shards and divides the data equally among them. It is possible to override some of the decisions, but there's really very little that you need to do there; it's mostly automatic.
You don't really have intra-node configuration, so all of the configuration is at the node level and above. Usually all of the node-level configuration is identical across nodes, and you manage it with a configuration management tool like Ansible or Kubernetes. So all of the configuration is handled at the higher level.

Okay, I think we've exhausted our time. So thanks, everyone, for attending. You can always ask questions on our mailing list or on Twitter; follow me on Twitter. And thanks again. Bye-bye.

Thank you so much, Avi, for your time today, and thank you everyone for joining us. Just a quick reminder that this recording will be posted to the Linux Foundation's YouTube page later today. All right, have a good one, everyone. Thank you.