The Carnegie Mellon Quarantine Database Talks are made possible by the Stephen Moy Foundation for Keeping It Real and by contributions from viewers like you. Thank you. So we're happy today to have Avi Kivity, the CTO and co-founder of ScyllaDB. Prior to starting Scylla, he worked on KVM at Red Hat for a while, and then they started ScyllaDB in, I think, 2013 — is that correct? 2012? Yeah, around 2013. Okay, and so we appreciate him being up with us. He's currently in Israel, so it's late at night for him, so we appreciate him spending time with us. Again, we're sponsored by the Stephen Moy Foundation for Keeping It Real, and we thank them for their support for these quarantine database talks, and we'll do the same thing we do every week: we want this to be as interactive as possible, so if you have a question for Avi, just unmute yourself, say who you are and where you're coming from, and then ask your question. Okay, Avi, the floor is yours. Thanks for doing this.

Okay, thanks, Andy. It's indeed almost midnight here, so don't be alarmed if I turn into a pumpkin at the top of the hour — it's completely normal. So I'll talk about ScyllaDB: no-compromise performance. This will be a rather low-level talk compared to the other talks — I've sampled a few of them — so this will be rather low level. I hope you'll enjoy it. So let's start. What is ScyllaDB? There are about 8 million databases around, so let's try to place it among all of the other options. It's a distributed NoSQL database. The main use case is online transaction processing; it does support analytics, with things like Spark or Presto, but the main use cases are online transaction processing. It's compatible with several other databases. First is Apache Cassandra, which was the inspiration, and this is more or less the model that we followed, with the Cassandra query language and protocol and also the Thrift protocol, which is the other protocol used by Cassandra. It's also compatible with DynamoDB, so we implement the same API, which is JSON over HTTP — and Cassandra and DynamoDB are also related in a way, because both are based on an older database called Dynamo, also by Amazon. And it also has support for some of the verbs supported by Redis — only a small number, because this is in its infancy, but it does show that we support multiple models, and we plan to increase our compatibility along the way and add more interesting protocols. We have very good performance — up to about 10 times, on the same hardware with the same characteristics. What most users opt for is not to exercise the full 10x but instead to spend some of the headroom on reducing latency, so typically you achieve about 5x performance but with better latency. You get very good latency, and most of this talk will focus on performance and latency. It's implemented in C++ and it's open source, so you can find it on GitHub.

Quickly — I don't want to focus too much on this — but it's C++20, which obviously didn't exist when you started in 2012, so how did you decide? Did you start at 11 and go to 14, 17, 20? Are you guys rolling this forward? So the company was founded around 2013, but the database was actually started a bit later, so we started with C++14 and then updated to C++17 and C++20 when they were released. We're really excited about C++20 because it brings us support for coroutines, which are really helpful for the model that we're implementing. I don't think we'll be able to touch on that, but it is very interesting.
We are very interested in coroutines as well, but keep going. Yeah, all right. Now, we support multiple consistency models. The first is eventual consistency, using conflict-free replicated data types — an acronym (CRDT) that I guess many people just remember without the actual words behind it — and eventual consistency gives you very good performance, especially with multiple data centers, but it's not suitable for every kind of use case. Some use cases like Internet of Things can use eventual consistency, but other use cases need stronger consistency, so for that we have lightweight transactions based on Paxos, and we're also working on Raft support in order to improve the performance of our lightweight transactions. And you can also have a mixed consistency model, where within a data center you have strong consistency but the replication to other data centers is asynchronous, so you get kind of a mix between the two models. It's aimed at multi-terabyte or even petabyte workloads, so it's big data, and also high performance: even the smallest deployments have tens of thousands of ops per second, and the largest grow to millions of ops per second. We have demos of simple workloads that do a million ops per second per node; more complex workloads usually have lower performance, but it does show that the raw throughput is very high.

How big is a typical node, Avi? So a very typical node is the Amazon i3 or i3en, and the i3.metal has half a terabyte of RAM, 36 cores or 72 vCPUs, and 16 terabytes of disk. And that is actually part of this slide: we also aim for a high disk-to-RAM ratio. A lot of modern databases — most of the open source databases — really rely on caching, so you need most of the data to be cached in memory, otherwise the performance starts to tank, and this doesn't work well when you have a large amount of data. If you look at the price of RAM, you want to keep a high ratio of disk to RAM because RAM prices are very high: if you have something like 100 terabytes, that computes to a million dollars just for the RAM in the cluster, so keeping a high disk-to-RAM ratio keeps the cost low. And we're also self-tuning — we'll have an example of that later on. The problem with those databases is that they're complex, and also the workloads are not constant, they change with time, and even if you manage to tune your system to a workload, over time it changes and you have to keep retuning it, and of course everything is in production, so this is quite an effort.
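As a rough back-of-envelope behind that RAM figure, assuming server RAM on the order of \$10 per GB — an assumed price, not one from the talk:

$$100\ \text{TB} \approx 100{,}000\ \text{GB} \times \$10/\text{GB} \approx \$1{,}000{,}000$$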
Okay, so we have a symmetric architecture, and this is more or less the same as Cassandra. In this example we have three nodes, and each node is divided into two layers: the coordination layer and the storage layer, or replica layer. There is a full connection mesh — the fact that we don't have arrows connecting every client to every coordinator and every coordinator to every storage layer is just because I was lazy. The two layers are in the same process, so when a coordinator talks to its own storage it doesn't use RPC or TCP, it just uses a local procedure call. And in fact we exploit this symmetry even more by using a thread-per-core architecture, so each node is internally divided into identical shards, each of which contains its own coordination and storage layer, which act more or less independently, but they do share all of the resources in the node — the memory and the CPU and the disk. So we try to share the resources in order to maximize utilization, but we also have to isolate them so that one of them does not dominate over the other, and we have some mechanisms for that.

We use a log-structured merge tree. I assume that you've studied this and you know what it is, so I'll just give a small overview. Instead of having a data structure that is always sorted on disk, like a B-tree, we keep the data as it is written in memory, and we keep it sorted in memory using a balanced tree structure. When memory fills up, we write it to disk in a sorted way, and that's called an SSTable — a sorted string table. As time passes, we get more of these SSTables being deposited on disk, and these SSTables can contain overwrites or duplicate keys, and these duplications are a performance problem, because you need to read and merge all of those SSTables, and they also occupy duplicate space. So you need some process that merges those SSTables using a sequential merge scan, and that process is called compaction. And while this compaction is happening, we're depositing more SSTables from the foreground process, which is the writes. So this gives us a foreground job, which is the regular writes, and a background job, which tries to compact those SSTables. And there are actually multiple levels of compaction, because when you have some SSTables written to disk — say you compact SSTables four and five and six together — you get another, larger SSTable, and then you will need to compact those in turn. The number of levels is proportional to the logarithm of the disk-to-memory ratio, more or less — it depends on a few more factors, but that's sort of the guide — so you usually have maybe four or five levels, or tiers. This process causes a problem because it is a background job, and those background jobs compete with the foreground jobs, and you need to balance them. If the background jobs consume too many resources — disk bandwidth and CPU compute power — then your foreground processes suffer and you get a reduction in throughput. But on the other hand, if they don't get enough bandwidth, then the data on disk becomes fragmented, and reads that miss the cache have to read many SSTables, and then your read performance suffers. So you have to close the loop and make sure that these background jobs get exactly the correct amount of resources — the correct amount of bandwidth — in order to do their job.
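To make the compaction step concrete, here is a minimal sketch (not ScyllaDB's actual code) of the sequential merge scan described above: several sorted runs are merged into one, and for duplicate keys only the newest version survives. The `Entry`/`SSTable` types are simplified in-memory stand-ins for what are really sorted files on disk.

```cpp
#include <cstdint>
#include <queue>
#include <string>
#include <vector>

// Simplified stand-in for an SSTable: entries sorted by key.
struct Entry { std::string key; uint64_t ts; std::string value; };
using SSTable = std::vector<Entry>;

// k-way sequential merge of sorted runs: duplicate keys are collapsed,
// keeping only the entry with the newest timestamp (the later write wins).
SSTable compact(const std::vector<SSTable>& runs) {
    struct Cursor { size_t run; size_t pos; };
    // Min-heap on key; for equal keys, the newer timestamp pops first.
    auto order = [&](const Cursor& a, const Cursor& b) {
        const Entry& ea = runs[a.run][a.pos];
        const Entry& eb = runs[b.run][b.pos];
        if (ea.key != eb.key) return ea.key > eb.key;
        return ea.ts < eb.ts;
    };
    std::priority_queue<Cursor, std::vector<Cursor>, decltype(order)> heap(order);
    for (size_t i = 0; i < runs.size(); ++i) {
        if (!runs[i].empty()) heap.push({i, 0});
    }
    SSTable out;
    while (!heap.empty()) {
        Cursor c = heap.top(); heap.pop();
        const Entry& e = runs[c.run][c.pos];
        if (out.empty() || out.back().key != e.key) {
            out.push_back(e);        // first (newest) version of this key
        }                            // older duplicates are simply dropped
        if (++c.pos < runs[c.run].size()) heap.push(c);
    }
    return out;                      // output is again sorted by key
}
```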
So what are our goals? First, we want to be efficient. We're given a number of CPU cycles from the hardware — a number of cores — and we want to make sure that those cycles go towards useful work and not coordination or locking or things like that. This is one reason to use C++ and not a language like Java; it is a little bit faster, although that's not the main reason. Another goal is to have good utilization. It's common to see on large machines that only a small number of the cores are busy and the rest are waiting on locks, and the larger the machine, the harder it is to utilize all of it. The same is true for I/O resources: disks are now very fast — you can get millions of I/O operations per second — but it's rare to see a program that can actually squeeze all of those IOPS from the disk. So we want to make sure that we are able to use all of the CPU, and with Scylla it's common to see the system running at 100% — or, if you count each vCPU, at 7,200% — and even more on larger machines, and what would be setting off alarms in other databases is actually an indication that the system is working as designed. And the last thing that we want to achieve is control: to spend the cycles on the things that we want to do, because we have multiple competing processes running on the system — I mentioned compaction a slide ago — so we want to make sure that we can direct the system to spend the number of cycles we want on compaction, and the number of cycles we want on serving requests, or on doing other things like maintaining nodes and performing repairs.

A database that does OLTP is characterized by having a large number of very small operations; many queries can be just a few hundred bytes, so we want to make coordination cheap. We don't want to spend a lot of effort taking and releasing locks or exercising the cache coherency of the CPU; we want to make everything very cheap. And there is also a lot of communication. The communication can happen within the machine — local procedure calls or inter-thread calls — or with the disk: you read and write data from disk, and because we have a high disk-to-RAM ratio we have to work with a significant percentage of cache misses, we can't assume that everything is cached, so we have a large amount of disk traffic. And of course, being a distributed database, there is a large amount of communication with other machines; usually, with a replication factor of three, writes require three round trips and reads require two round trips for a quorum. The solution to having lots of communication is to make everything asynchronous. And since you're in a class, let's have a quick quiz: does anyone recognize the object on the slide? This is a SATA-based SSD? You're very brave, but this was a trick question — of course it would be a trick question — this is a network device. Those gold-plated connectors are actually a network terminal, and although we are used to treating a disk as a synchronous device, because the APIs are synchronous — you issue a read and the system call blocks until the data is there — what happens under the covers is that the kernel prepares a message to the disk, the same way that it would send a message to another computer, sends it over a kind of point-to-point link, goes and does something else, switches to a different process, and then the disk fetches the data, sends it over the link, and the kernel wakes up the thread that was asleep, and we go back to doing what we like. But this is not an efficient way — we wouldn't dream of doing that with a remote node, having a thread-per-connection model.
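Since the point is that a disk read is really an asynchronous message exchange, here is a small, hedged sketch of that submit/complete pattern using Linux's io_uring through liburing — one possible interface, not necessarily what ScyllaDB uses (the Q&A later mentions they use linux-aio in preference). The file path is a placeholder.

```cpp
// Submit a disk read asynchronously, do other work, reap the completion later.
#include <fcntl.h>
#include <liburing.h>
#include <cstdio>

int main() {
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);                 // small submission/completion ring

    int fd = open("/tmp/data.bin", O_RDONLY);         // placeholder file
    char buf[4096];

    struct io_uring_sqe* sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0); // "send the message to the disk"
    io_uring_submit(&ring);                           // no blocking read() call here

    // ... a reactor would go run other tasks at this point ...

    struct io_uring_cqe* cqe;
    io_uring_wait_cqe(&ring, &cqe);                   // reap the completion
    // (a real reactor would poll with io_uring_peek_cqe instead of blocking here)
    std::printf("read returned %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```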
Everyone does networking asynchronously, and we also do disk I/O asynchronously, so we treat the disk exactly as we would a remote computer. When we started, this was relatively rare, at least among open source databases, but now, with io_uring and with the Go programming language, I think it's becoming more common. So let's have another quiz — what is this? This time it's not a trick question. Is it an Intel Xeon processor E5 v4 product family? Yes, it is — very good. But what does HCC stand for? Yeah — high core count. If you had said that it was a network device, then I would say no, it's an Intel Xeon processor E5 or whatever, but actually a processor really is a network of cores. And instead of treating it the way that most programs do — which is to ignore that fact and just have shared data structures that are protected with locks — we try to treat each core as a separate node and have explicit communication between those nodes. We send messages between the nodes, and this minimizes the amount of traffic on those internal networks and increases the scalability that we have within the machine, and this is just as important as scalability across the cluster, because it allows us to have a smaller cluster, which is easier to manage.

So how do we do that? We have one thread per core, and that means that we must never block, because if a thread blocks, then that core will not have anything to run. Whenever we do something that does I/O — any network call, or any disk operation, or any operation between cores — we must be prepared to continue doing something else, so we have a queue of things that we need to do, and we never block on any operation. Everything is asynchronous. Networking, of course — everyone does asynchronous networking. File I/O is rarer, and we do that: instead of having a read call, we have an operation to initiate a read from a file — an asynchronous read — and then we reap a completion later on when it becomes ready, and meanwhile we go and do other things. And also asynchronous multicore: if we need to access data that is owned by another core, then we send a message to the other core, go and do something else, and that other core picks up the message, performs the operation, places the result on a point-to-point queue, and we pick up the result and continue. This is very different from general SMP programming, but it's actually very similar to normal network programming — distributed programming — except that you can't have a failure. So you always send messages between cores in the same way that you send messages between nodes, and the fact that the database as a whole is distributed actually makes it easier, because we use the same partitioning strategy that we use for nodes — we use the same strategy for cores within the node.
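As an illustration of this "send a message to the owning core" style, here is a hedged sketch shaped like Seastar's API (the framework underneath ScyllaDB). The names `smp::submit_to`, `smp::count`, and `this_shard_id` are Seastar's as far as I know; `owning_shard` and `local_read` are made-up stand-ins. The design point is the one from the talk: data is partitioned across cores exactly as it is across nodes, so cross-core access is an explicit message rather than a locked shared structure.

```cpp
#include <functional>
#include <string>
#include <seastar/core/future.hh>
#include <seastar/core/smp.hh>

// Stand-in for a lookup in this shard's own memtable/cache/SSTables.
seastar::future<std::string> local_read(const std::string& key) {
    return seastar::make_ready_future<std::string>("value-for-" + key);
}

// Hypothetical ownership rule: each shard (core) owns a slice of the key space.
unsigned owning_shard(const std::string& key) {
    return std::hash<std::string>{}(key) % seastar::smp::count;
}

// Read a key by sending a message to the core that owns it, instead of
// touching shared state under a lock. The lambda runs on the owning shard's
// reactor; the result comes back over a point-to-point queue and resolves
// the future on the calling shard.
seastar::future<std::string> read_key(std::string key) {
    unsigned shard = owning_shard(key);
    if (shard == seastar::this_shard_id()) {
        return local_read(key);                          // same core: a plain call
    }
    return seastar::smp::submit_to(shard, [key = std::move(key)] {
        return local_read(key);                          // runs on the owning core
    });
}
```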
So let's look at how the programming stack looks. Today there are actually intermediate stacks — for Go and Erlang you have intermediate models where the runtime does some of this work for you — but let's look at the traditional stack. You have a number of threads, usually a large number of threads so that you can utilize all of the cores; some of those threads are running and some are sleeping, and the kernel multiplexes them. But the problem is that if you have very lightweight operations — which is the premise of the whole thing, queries that return a few hundred bytes — then you end up with a large number of context switches and a large effort to coordinate all of the threads. You also end up with problems where either you have too few threads to run, in which case you don't utilize the machine, or too many threads to run, in which case your latency increases, because you have a lot of contention and the kernel might not pick the thread that is most important for you to run — it might pick some other thread. Our stack is a thread per core: each thread has its own internal scheduler, and it does run-to-completion of small tasks. Those tasks typically take a few microseconds to run, and we can run around a million tasks per second per core. The cores communicate via point-to-point queues — every pair of cores has a queue for sending a message requesting that core to do something, and a queue for results — and there is no sharing, so the network that lies within the processor has a lot less work to do, and everything goes much faster.

So let's talk a little bit about concurrency. I realize this is a computer science class and not a math class, but I hope you'll forgive me. There is Little's law, which says that concurrency is the product of throughput and latency, and that makes sense: if you increase the throughput while keeping latency the same, then you need to do more things in parallel in order to compensate for that increased throughput, and if you increase the latency while keeping the throughput the same, again you will need to increase the concurrency, because each thing takes more time. Now let's do some math and look at how the throughput changes with the concurrency. You can see this is the same equation, just transformed a little: in order to get higher throughput, if your latency is the same, you need higher concurrency. And let's transform the equation again and look at latency as a function of concurrency: here we see that latency is concurrency divided by throughput, so if you want low latency you need low concurrency. So the dilemma is that you need high concurrency in order to achieve high throughput, and you also need low concurrency in order to achieve low latency. Was there a question? No, it's just people leaving — keep going, you're good. Another problem with concurrency is that there are some lower bounds. Disks want a minimum I/O depth in order to give their full throughput — think about a RAID of rotating disks: there is a number of heads, and let's say you have a RAID of five disks; if the concurrency is less than five, you are going to have one head that is idle, so you need at least five requests running in parallel, and maybe even more in order to avoid collisions — so you need some kind of minimum I/O depth for disks. The same goes for remote nodes: there is network latency which you want to hide, so you want to do some operations concurrently, and also those remote nodes have their own minimum concurrency that they want, so you have to supply that. And for compute, you need at least one operation running concurrently for each core that you have, or you will have idle cores. So let's summarize: we want high concurrency, we also want low concurrency, and we also want to be able to supply some minimum amount of concurrency in order to fully utilize the system. So how do we solve this conflict? The answer is scheduling.
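For reference, the three rearrangements of Little's law used above, with $C$ for concurrency, $X$ for throughput, and $L$ for latency:

$$C = X \cdot L, \qquad X = \frac{C}{L}, \qquad L = \frac{C}{X}$$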
But that's actually not the right slide, so I'll wait with that. So what are the sources of concurrency? The first source is users: they are clicking on their web pages and generating traffic, and you're not always in control — if it's a web application, then this source of concurrency is a given, it's not something we can change. We can add nodes, as operators of the database, and this reduces the per-node concurrency — you divide the same overall concurrency over a larger number of nodes — but this is something that you might want to avoid. We also have the case where we have multiple workloads: you have, say, web users clicking their way through an application, and also an analytics workload running concurrently and scanning the database, so you have multiple kinds of requests running in parallel throughout the system. And there are also internal processes that generate concurrency — I mentioned compaction before: if you have a compaction process running, it generates reads and writes, and if you do read-ahead, a single compaction process can generate multiple requests concurrently. So the trick is to have the internal processes generate a lot of concurrency in order to be able to utilize the system, but also to schedule those requests — choose which requests get to run when — in order to limit the impact of concurrency on latency.

And this is how we do it. Let's say that we decided that our storage can handle eight requests concurrently — later I'll show how we figure that out, but you can imagine that we have a RAID array with eight disks, each of which has one movable head — and we have a bunch of inputs: users reading from the database with a concurrency of, say, 30, and writing with a concurrency of 12, and internal sources of concurrency, compaction and streaming. Streaming refers to the process of starting another node: you need to move data from the existing nodes to the new node, so it also generates requests. The idea is to push to the storage only as many requests as it can handle without starting to back up and having congestion, and this allows the scheduler to pick the right request to send to the storage: as soon as the storage completes one request, we pick another request that is waiting on one of the queues and send it off. This allows us, on the one hand, to have low latency, because the storage is only handling the number of requests that it can handle without trouble, and on the other hand, to avoid idle time in the storage, because as soon as we have a slot free, we feed it a request from one of the internal processes. How sophisticated is the scheduler, in terms of what metadata you're exposing about the things in the queue — you know, "I'm reading this file at this offset and I have this priority" — do you go any deeper than that, or is it just...? So every request is tagged by the originating process — the internal process that originated it — so for every request we know whether it's a user read or a user write or part of a compaction process, and when we have multiple workloads we know which workload is doing it: we know this came from Spark or this came from the web application, and we know that Spark should get lower priority than the web application, and those queues have an amount of shares that are assigned to them.
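A minimal sketch (not Scylla's actual I/O scheduler) of the idea just described: each originating process gets its own queue with a share weight, requests are tagged with their class, and whenever the device has a free slot the scheduler dispatches from the queue that is furthest behind its entitlement.

```cpp
#include <cstdint>
#include <deque>
#include <optional>
#include <string>
#include <vector>

struct Request { std::string tag; uint64_t cost; };   // cost: e.g. bytes, or weighted size

// One queue per originating process (user reads, user writes, compaction, ...).
struct SchedClass {
    std::string name;
    double shares;                 // relative entitlement
    double consumed = 0;           // accumulated cost, normalized by shares
    std::deque<Request> pending;
};

class IoScheduler {
    std::vector<SchedClass> classes_;
    unsigned in_flight_ = 0;
    unsigned max_in_flight_;       // the disk's "sweet spot" concurrency
public:
    explicit IoScheduler(unsigned max_in_flight) : max_in_flight_(max_in_flight) {}

    // Register a scheduling class; returns its index for enqueueing.
    size_t add_class(std::string name, double shares) {
        classes_.push_back({std::move(name), shares});
        return classes_.size() - 1;
    }

    void enqueue(size_t cls, Request r) { classes_[cls].pending.push_back(std::move(r)); }

    // Called when the device has capacity: pick the non-empty class with the
    // smallest normalized consumption, i.e. the one furthest behind its shares.
    std::optional<Request> dispatch() {
        if (in_flight_ >= max_in_flight_) return std::nullopt;   // don't congest the disk
        SchedClass* best = nullptr;
        for (auto& c : classes_) {
            if (!c.pending.empty() && (!best || c.consumed < best->consumed)) best = &c;
        }
        if (!best) return std::nullopt;
        Request r = best->pending.front();
        best->pending.pop_front();
        best->consumed += r.cost / best->shares;                 // charge the class
        ++in_flight_;
        return r;
    }

    void complete() { --in_flight_; }   // a request finished; a slot is free again
};
```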
By the way, I'm talking about I/O, but the same thing holds for CPU: every task that we run — and recall that a task can be something that runs in a microsecond — is tagged with the originating process, and we account for all of those tasks, so for each queue we know exactly its execution history, and we know how many shares we assigned to it, and we can then select the next task so that the constraints on the shares are satisfied. I hope that answers the question. Yes.

Okay, so why don't we use the Linux I/O schedulers? Linux has a very capable set of I/O schedulers, but they have limitations. The first is that you can only communicate priority through the originating thread — you can assign a priority to different threads — but in our case we have one thread doing everything, so the same thread is multiplexing compaction and user reads and writes and the Spark reads and everything else, so everything gets mixed. It will also do reordering and merging: it likes to send the requests in the order that it thinks is optimal for getting throughput, but that's not good if it takes the request that is latency sensitive and puts it at the back because it thinks that's more optimal. So what we do is disable merging and reordering, and this gives us the control that we need in order to have a good latency response. Previously I said that each disk kind of has its favorite concurrency; it's actually a little bit more complicated. Oh, and by the way, you asked about what we track: we also track the size of the request and the direction — whether it's a read or a write — because disks respond differently to reads and writes. So this graph shows the response of a disk to concurrency. On the x-axis we have increasing concurrency — and by the way, you shouldn't take high concurrency as better; actually, a better disk would be one that achieves high throughput with low concurrency. In blue we have the throughput: you can see that as the concurrency increases, the throughput rises more or less linearly, then it begins to plateau, and then it stops increasing — what happens is that the disk starts to queue the requests instead of serving them in parallel. In red you have the latency, and if you squint you will see that the response is the opposite: when the concurrency is low, the latency should be more or less constant, because when you feed in a new request it starts executing in parallel, but as soon as you saturate the disk — at around 100 concurrent requests — the latency starts to increase. It's hard to see because of the error bars and because it's a noisy graph, but the theory says that latency should be more or less constant at the beginning of the graph and more or less linear after we enter the congestion area. So what we do is run a benchmark when we install the system, and we figure out what this graph looks like, and that informs the scheduler: the scheduler figures out where the sweet spot is, and it only lets the disk process the number of requests that it requires for good throughput — it depends on the size and the type of the requests as well — and everything else it holds in user space. The fact that we keep those requests in user space allows us to select the next request that we run, so this gives us the control and allows us to respond rapidly to user reads, which want low latency, and delay the batch operations — scans for Spark, or compactions, or streaming, or any of the other things that the database needs to do to maintain itself.
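A hedged sketch of how such an install-time benchmark can locate the sweet spot: keep raising the I/O depth until extra concurrency stops buying meaningful throughput, then let the scheduler cap in-flight requests at that depth. The measurement function below is a synthetic stand-in, not a real disk benchmark, and all the constants are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstdio>

struct Sample { double ops_per_sec; double latency_ms; };

// Synthetic stand-in for a real measurement: models a device that serves up
// to 'device_parallelism' requests in parallel and queues the rest.
Sample measure_at_depth(unsigned depth) {
    const double service_ms = 0.5;           // assumed per-request service time
    const unsigned device_parallelism = 64;  // assumed internal parallelism
    double served = std::min(depth, device_parallelism);
    double ops = served / (service_ms / 1000.0);
    double latency = service_ms * std::max(1.0, double(depth) / device_parallelism);
    return {ops, latency};
}

// Find the smallest I/O depth beyond which throughput stops improving much.
// Past that point, extra concurrency only queues inside the device and adds
// latency, so the scheduler should hold further requests in user space.
unsigned find_sweet_spot(unsigned max_depth = 512) {
    double prev = 0;
    for (unsigned depth = 1; depth <= max_depth; depth *= 2) {
        Sample s = measure_at_depth(depth);
        std::printf("depth=%u  %.0f ops/s  %.2f ms\n",
                    depth, s.ops_per_sec, s.latency_ms);
        if (prev > 0 && s.ops_per_sec < prev * 1.05) {
            return depth / 2;                // <5% gain: the previous depth was enough
        }
        prev = s.ops_per_sec;
    }
    return max_depth;
}

int main() {
    std::printf("sweet spot: io depth %u\n", find_sweet_spot());
    return 0;
}
```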
Okay, I talked about self-tuning, and this kind of fell out of all of the schedulers that we have. We have a huge amount of control over what we do, and a natural way to use this is to apply feedback in order to change the controls. The way it works is that we have a bunch of queues — internal processes that feed queues of operations — and the scheduler decides which queue to consume from. It can say: we have a request on the query queue, so the scheduler will pick a request from the query queue and let the disk process it, and then it might pick a request from the compaction queue, and so forth, and it selects from those queues on the basis of the shares we assign to them. But here comes the trick: we have a feedback loop. How do we decide how many shares to assign to compaction? We look at the amount of work remaining to do in compaction — we have a backlog monitor, which keeps track of all of the work we still have to do, and it adjusts the priority in order to try to keep that backlog stable. What this means is that if your backlog starts to increase — which means you're not doing enough compaction — the number of shares assigned to compaction will increase, and it will consume more resources from the system in order to become stable; on the other hand, if the compaction backlog becomes lower, then the shares compaction gets will drop in turn, and we will free those resources for other parts of the system. We have the same thing for other components; I'll skip over that because it's more of the same. So this is an example of this in action. In green we have requests served. We started a write workload, and it starts at a high rate and immediately begins to slow down — this is because when you start the workload it only talks to memory, but after a while it has to flush to disk, so it becomes more intensive. In yellow you have the compaction shares — it's actually the CPU time — and you can see that in the beginning there is very little backlog, because there is no data on disk, so the scheduler assigns very little CPU time to compaction, but after a while it stabilizes at some level and keeps it stable. In the middle we change the workload, and now we're using a different request size, which requires a different amount of processing, and what happens is that the backlog monitor notices that the backlog increases, and it starts assigning more CPU to compaction.
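A toy sketch of that feedback loop — a simple proportional controller, not ScyllaDB's actual backlog controller: the measured compaction backlog is compared against a target, and the compaction queue's shares are nudged up or down to keep the backlog stable.

```cpp
#include <algorithm>

// Toy controller: adjust the shares of the compaction scheduling class so
// that the measured backlog (e.g. bytes of SSTables awaiting compaction)
// stays near a target level. All constants are illustrative assumptions.
class BacklogController {
    double target_backlog_;   // backlog level we try to hold
    double gain_;             // how aggressively shares react to the error
    double shares_ = 100;     // current shares for the compaction queue
public:
    BacklogController(double target_backlog, double gain)
        : target_backlog_(target_backlog), gain_(gain) {}

    // Called periodically (each controller tick) with the current backlog.
    double update(double measured_backlog) {
        double error = measured_backlog - target_backlog_;  // > 0: falling behind
        shares_ += gain_ * error;                           // more backlog -> more shares
        shares_ = std::clamp(shares_, 1.0, 1000.0);         // keep within sane bounds
        return shares_;   // fed back into the scheduler as this queue's weight
    }
};
```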
Yeah, I think we're out of time. If you want to continue, I'm happy to — I don't know if people have time — and if not, we can just move to questions, although that was the really interesting bit. Yeah, I mean, it is 12:30, so it's super late for you, so I don't want to keep you up all night. If you email me your slides, I'll be able to post them on the website. Yeah — but I think some people have some basic questions; there are some questions we want to go through first. Okay, so go ahead with questions, and I'll try to email the slides in the background.

Yes, I have a question — I don't know if there's an ordering. No, it's chaos, go for it. All right, yeah, I'm Nick, I'm calling in from the Netherlands — it's actually 11:30 here in the evening — but thank you, Avi, for the talk. I have a question, actually not about what you talked about, but I've been following Seastar, and it's clear that you also have your own TCP/IP stack on top of DPDK, and I read some papers that compare kernel-bypass techniques, and those papers also mention Seastar, and they mention as a drawback that it's a maintenance burden to maintain such a written-from-scratch TCP/IP stack, compared to just using, say, the Linux stack or the host stack. My question is: is that actually true? Is it a real burden? Is TCP/IP still changing, and is it a lot of work to keep up with that, or is it more that you write it once, and once in a while you still have to maintain it, but it's okay? Okay, so I suppose that if we invested effort in maintaining it, then it would be a maintenance burden, because even though TCP doesn't change, you still have to fix bugs. But the truth is that we don't really use the native stack. In the end it turns out to be a deployment problem: you have to detach the network card from Linux and assign it to the database, and that is work that is not easy to do. Maybe today in cloud environments it's easier, because everything is a lot more homogeneous — you have just a few types of systems — but for enterprise deployments it turned out to be too difficult to use, and we really did not invest much in it. It's a pity, because I really like it, and you can get better performance and better distribution of compute across the cores, so I'm sad that we are not using it more extensively, but the truth is that it's not a burden, because we're not really using it as much as we should. I see, yeah.

I had a question — I'm an undergrad senior. I wanted to ask about how you use C++20 coroutines, because we tried to implement that in our system, but the problem we came across was that you can't suspend from a nested stack frame, right? So how do you go about making the execution model support stackless coroutines? Sorry, what did I miss — what? You can't suspend from a nested stack frame, so you can only suspend from the top level of the function. So, no — with stackless coroutines you can suspend from any level, so long as you return futures. Our basic primitive is a future, and all of our coroutines return a future, so every time you call a function that returns a future, that is a suspension point. I'm not sure exactly where you saw the problem, because for us it just fits naturally — the coroutine model works very naturally with our futures and promises. A coroutine is just a function that returns a future, and every time you call another coroutine you usually call it with co_await, which can be a suspension point if that function returns a future that is not ready.
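A hedged sketch of what that looks like with Seastar-style futures (the row-access functions are illustrative stand-ins, not ScyllaDB code): any call that returns a future can be co_awaited, and the coroutine suspends there whenever the future isn't ready.

```cpp
#include <string>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Stand-ins for lower-level steps; each returns a future, so each co_await
// below is a potential suspension point.
seastar::future<std::string> read_row(std::string key) {
    // Placeholder for a real lookup; returns an already-resolved future.
    return seastar::make_ready_future<std::string>("value-of-" + key);
}
seastar::future<> write_row(std::string key, std::string value) {
    // Placeholder for a real write.
    return seastar::make_ready_future<>();
}

// A coroutine is just a function that returns a future and uses co_await /
// co_return in its body. While an awaited future is unresolved, the shard's
// reactor runs other tasks; nothing blocks.
seastar::future<> copy_row(std::string from, std::string to) {
    std::string value = co_await read_row(from);   // may suspend here
    co_await write_row(to, value);                 // and here
    co_return;
}
```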
Okay, I'm not sure that I understood the problem, so maybe my answer wasn't very good. So the problem we saw was that you can only return back to the thing that called it — you can't suspend from one function and go to a completely different coroutine, right? We have multiple threads executing different things; we execute one and then we want to switch over to something that's completely unrelated. Were you able to achieve that with stackless coroutines? Yes, and it's completely natural. We had a system that used continuations — future.then() — and we adapted it to use coroutines with very minimal changes, so it's the same scheduler and the same underlying model. For us it worked very naturally, maybe because the system was already based on continuations, and every time you call a coroutine you already have a suspension point, so it's not like you're nested multiple levels deep in the stack. I would say — ScyllaDB is open source, so it might make sense to go look at their code and understand what they're doing. Okay, yeah. And feel free to ask questions on the Seastar mailing list about how it works; we'll be happy to answer. Awesome, thank you.

My name is Jordi, and I also have a question. This is more regarding stuff you might have done in the past: I know you use io_uring now for a lot of I/O-related asynchronous stuff, but at any point did you experiment with using epoll and more synchronous APIs, and if so, did you notice any impact on latency? So, in the beginning we used epoll for networking and linux-aio for block I/O, and later we changed the linux-aio path to also support networking, and this is what we use in preference. We don't yet support io_uring, although it's an excellent match for the system — we prototyped it, but in the end it didn't show any significant improvement. In a way, the linux-aio support for polling which we implemented is really the forerunner of io_uring. The problem with epoll is not that the interface is bad; it's just that it requires a lot of system calls to maintain, and it's not integrated, so you have different APIs to do different things, whereas with io_uring, or with linux-aio, you have one ring through which you send the requests to listen on file descriptors or start the network requests, and one ring on which you get the responses, so you save on a large number of system calls and it's all amortized — with one system call you batch up a number of file descriptors and a number of disk I/Os. And did you see reducing the system calls impacting latency, or was that just a throughput side effect? No, the system calls affect throughput; they don't affect latency — they're all non-blocking. Of course, if you exhaust the throughput capabilities of the system, then you end up with a latency impact, but the direct impact is on throughput. Okay, thank you. So as long as the system is not overloaded you get very similar latency; as soon as it gets overloaded, then of course queues build up and you get high latency, but the model of talking to the kernel doesn't directly impact latency. Thank you so much.

My name is Ling, I'm a PhD student here — thanks for the very interesting talk. My question is: I'm wondering how you decide the boundaries of those small tasks that you mentioned? Does it just flow naturally, or are there certain principles or guidelines you are following? And also, how do you ensure that a task is as small as you expected? Because I heard you mention it could be just microseconds, or some milliseconds, right? Yeah, so that's a good question and also a difficult one, and we're struggling with it every day. Some tasks really are very simple — you launch an I/O operation or you launch a cross-core message; those are the coordination tasks — but the compute tasks are the ones that are difficult, because they can take very long. For that we have a kind of user-space tick,
and we use that by having another linux-aio ring that is running on a timer. Whenever the timer fires, the kernel queues the completion, and when it queues the completion it modifies the indexes in the ring, and we compare the indexes of the ring to note that we need to preempt. And in all of our loops — whenever we have a loop — we also check for preemption. It's not every loop: if we loop over a small number of iterations, then we don't really care, but when you have a loop that is unbounded, or can have a large number of iterations, then we check for preemption and we break the loop, and you will see that in a large number of places in our code. Of course, we don't directly check for preemption — we have primitives that do it for us — but this is a source of latency: whenever you have a computation that doesn't have this preemption check. So indeed it's a serious problem, and we continuously find places where we have these unbounded loops and fix them. We have a latency detector that sets a timer, and when it expires it sends a signal, and with the signal we get a backtrace of the loop that is consuming CPU time without the latency check. Our QA team sends us lots of nice bug reports with those backtraces, and we figure out the source, and then it's usually an easy task to add the preemption check. Usually it's some data structure that we thought would not grow too large but actually does, and edge cases — like someone wanting a database with thousands of tables, whereas the usual number is a few dozen — so in those edge cases we get high latency. And because this is a cooperative system, whenever that happens, everyone feels it, so the impact is pretty strong. This is one of the things you have to worry about with this kind of cooperative architecture: on one hand you win in terms of performance, because the coordination is cheap, but on the other hand everyone has to be really cooperative and friendly, or you get latency spikes. Yeah, thanks for the answer — just a quick follow-up: when you do a preemption and break up the loop, do you just package up the remaining work of the loop and then re-queue that? Exactly, so we repackage it. Usually it means an allocation in order to save the state and package it into a task structure, which gets queued in the task queue. This sounds like a huge amount of manual work, but actually it's automatic if you're working with coroutines, where it's even simpler. We also have user-level threads, in which case the data is just on the stack — all of the automatic variables stay on the stack and the thread gets resumed — and we can have lots of those threads running. Of course, we prefer not to have threads, because they're expensive in terms of memory footprint, but we do use them in places where it's simpler and where we don't have a large amount of concurrency.
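A hedged sketch of such a preemption check inside an otherwise unbounded loop. Seastar exposes primitives along these lines (`coroutine::maybe_yield()`, as far as I know); the loop itself is an illustrative example, not ScyllaDB code.

```cpp
#include <cstdint>
#include <vector>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/coroutine/maybe_yield.hh>

// Sum a potentially huge vector without hogging the shard. The loop is
// unbounded from the scheduler's point of view, so every iteration offers a
// preemption point: if the reactor's tick has fired, maybe_yield() suspends
// this coroutine, lets other tasks run, and resumes it later.
seastar::future<uint64_t> checked_sum(std::vector<uint64_t> values) {
    uint64_t total = 0;
    for (uint64_t v : values) {
        total += v;
        co_await seastar::coroutine::maybe_yield();  // cheap if no preemption needed
    }
    co_return total;
}
```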
Okay, that was awesome. Any last questions from anybody else? Yeah, I already went, so — go for it, Nick, go for another one. Yeah, so I have a question about the internode communication, for which in Seastar you have this RPC class, and you use the TCP protocol. Did you ever consider using UDP with your own reliability on top, or do you never suffer from, say, the possible drawbacks of TCP in terms of latency, especially if you have some packet loss? So, I consider it almost once a week. I have this routine where I think, okay, this sucks, it really should be using UDP, and then I start thinking about all the mechanisms that we would have to implement in order to have reliability, and to package many small messages into one packet, and also to fragment a large message into many smaller packets — because we have all kinds of messages — and to do the reliability, and in the end it's just too much work for too little gain. So TCP does have problems, but it isn't worth all of the trouble. Of course, the next week I forget all of that and start thinking about it again, but it's always the same result. I'll alert you if I ever come up with a different answer, but it is very tempting. I did implement RPC based on UDP in the past, and it does have advantages, but right now it's not worth it for us. Maybe we should move to RDMA, because it's now becoming more available — one of the advantages of the clouds is that they make the hardware more homogeneous, and there is a sort of minimal baseline that you can expect — so maybe we will jump past UDP, straight to RDMA.

Okay, awesome. Ling, we'll let you ask a quick question? Yeah, a very quick one, sorry. So you mentioned that you are monitoring CPU resources and performance as well — I'm wondering how you measure the CPU time or resources. Basically, are you using the perf library, or are you directly reading the performance counters from the CPU? How do you do that? No — if you look at something like the RDTSC instruction, you see it takes about 20-odd cycles, and for me that's too much. So what we do is, instead of measuring every individual task, which would have way too much overhead, we batch all of the tasks from the same queue, and we have a preemption timer every half millisecond, and this way we do the accounting at a granularity of half a millisecond. We let the queue run, and then, when it either completes or preempts, we measure it with nanosecond granularity — but we don't measure each individual task. So the overall sharing is fair between the queues — or fair according to the number of shares — but not for every individual task. It's actually the same way that you do normal thread scheduling: you don't schedule every microsecond, you schedule on some kind of time slice; for us the time slice is half a millisecond, and we just batch all of the tasks that have the same tag — we keep those tasks in separate queues, and we process the queue. I see, okay, that's super.

All right, I realize it's 1 a.m. where you are, but I have to ask you this question because I'm asking everyone this: how stupid are your users? Like, how often are you surprised at them trying to use ScyllaDB in a way that, you know, you never intended and it shouldn't be used? Or do you find that the users coming to you have maybe already been burned by Cassandra and are looking for something better, and therefore they're a bit more sophisticated? So first, our users are very smart for picking us, so that's already — after that, they really can do no wrong. But, well, you're right that many of our users do come from existing large deployments where they suffered latency problems — and not just Cassandra, also MongoDB — so they've already been burned, they know where the problems are. It takes some adjustments, so
even though we're completely compatible with Cassandra, there are still some things that are different — like, you need more connections, because you're talking to more shards. We can also connect directly to the shards: we have a specialized driver that knows to send the query directly to the shard — to the CPU core that will process it — and it saves an internal hop and also improves the load balancing. We do see mistakes: we see people not using prepared statements, or using data models that don't have enough partitions, so they end up overloading a node and overloading a shard. And actually our architecture is more vulnerable to this kind of mistake, because although we usually have many fewer nodes, we have a much larger number of cores doing the processing, and each core owns a subset of the data, so you have to be even more careful about the data distribution. So people do make mistakes. We also have some complete newcomers, and they really have problems with basic data modeling, which you would expect — it's a specialized data model; in order to be able to provide this kind of throughput you need to make some trade-offs, and without understanding those trade-offs it's hard to design the application. So people do make mistakes, and it's common. We have a university course that is designed to ease them into it, but of course people do sometimes keep on making mistakes, and we try to help them. Okay, awesome. I immensely appreciate you staying up so late with us, even given the other technical difficulties. I will applaud on behalf of everyone else. Again, in my opinion, this is exactly — this is extremely interesting, and you're touching on a lot of things that, as you saw, a lot of my students had questions about, so I appreciate you spending your time and letting us pick your brain on this.