Good morning, everyone, and welcome to our webinar. Before we begin, we are pushing a quick poll question: where are you in your NoSQL adoption? The options are: I currently use SiloDB; I currently use Apache Cassandra; I currently use another NoSQL database; I'm currently evaluating NoSQL; I'm interested in learning more about SiloDB; or none of the above.

While you're voting, I'll introduce myself. My name is Pavel, and I'm from SiloDB. Before becoming a database guy, I was a kernel guy for some years, playing with containers and with what you nowadays know as Docker. At SiloDB, I recently worked on disk I/O and the way Silo keeps its data on storage. One of the outcomes of this work is the idea of phantom jams that I'm presenting today.

For those of you who are not familiar with SiloDB yet: it's a database built for game changers, created by the founders of the KVM hypervisor. SiloDB was conceived with key design characteristics to power this next cycle and to resolve many of the challenges posed by operating distributed systems at scale. In particular, SiloDB is a high-throughput, low-latency distributed NoSQL database. Increasing database throughput, improving P99 latency, and reducing total cost are the principal reasons teams like yours select SiloDB. In 2020, SiloDB received InfoWorld's prestigious Technology of the Year award, and it was truly an honor to be among fellow recipients like Tableau, Databricks, and Snowflake. A couple of weeks ago we launched SiloDB 5 with several new innovative features; in fact, yesterday we had a webinar covering what's new, which I highly encourage you to watch. Along with the SiloDB 5 launch, the adoption of our database technology has grown to over 400 key players worldwide. Many of you will recognize some of the companies pictured here, such as Starbucks, who leverage SiloDB for inventory management; Zillow, for real-time property listings and updates; and Comcast Xfinity, who power all DVR scheduling with SiloDB. If you are interested in how we can help you, feel free to engage with us. To summarize: if you care about low latencies combined with high throughput for your application, we are certain that SiloDB is a good fit for you.

Back to the technical part. Making SiloDB this fast is possible thanks to several design decisions. First, SiloDB is written in C++ with great attention to every possible optimization of hardware usage. Second, and probably even more important, is the C-star library that sits between SiloDB and the OS kernel, which in turn rests on two pillars: the shared-nothing approach, in which CPU cores do not explicitly synchronize with each other, and the futures/promises computation model with non-blocking I/O, in which code never blocks the running thread and thus never wastes CPU or networking time in vain (I'll show a tiny sketch of this style in a moment).

Last but not least, a sort of disclaimer: "phantom jam" is in no way a technical, well-settled, or common term for the effect I will demonstrate. I just named it after traffic phantom jams, which were studied and described quite a while ago. I also took the word "phantom" from that work to emphasize that, like on roads, jams in a data flow appear out of nowhere, and it's not immediately obvious why they happened at all.
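Since the futures/promises model will come up again at the end, here is a tiny single-threaded sketch of that style — toy code of mine with hypothetical names, not the actual C-star API — showing how work is expressed as continuations so that no step ever blocks the running thread:

```cpp
#include <functional>
#include <iostream>
#include <queue>
#include <string>

// Toy event loop: a queue of pending continuations.
std::queue<std::function<void()>> event_loop;

// "Asynchronous" disk read (simulated): the callback runs later,
// from the event loop, never blocking the caller.
void read_block(int id, std::function<void(std::string)> on_done) {
    event_loop.push([id, on_done] { on_done("block-" + std::to_string(id)); });
}

int main() {
    // Chain two reads and a print; no step blocks the thread.
    read_block(1, [](std::string a) {
        read_block(2, [a](std::string b) {
            std::cout << a << " then " << b << "\n";
        });
    });
    // Drain the loop: continuations run here, one after another.
    while (!event_loop.empty()) {
        auto next = std::move(event_loop.front());
        event_loop.pop();
        next();
    }
    return 0;
}
```

A real implementation chains futures off actual I/O completion events, of course; the point here is only the shape of the model: work queues up as continuations, and the thread never sleeps while holding the CPU.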
Okay. So, when you look at a system that works at some speed — data throughput is shown to be some megabytes-per-second value, or operations per second are observed to fluctuate around some point — the question "why is this number not larger?" often appears. When examining the system, several conclusions can be drawn: it can be the hardware, the system can limit itself, or there can be a non-obvious reason, one of which is the phantom jam.

The hardware is usually the final and, unfortunately, unappealable verdict. Say you are seeing a network flow of one gigabyte per second, but your network adapter is documented to be exactly that fast, so there is no way to squeeze more from it. Sad but true. And of course it's not only the network adapter to blame; it can just as well be the disk or the CPU. Sometimes the culprit sounds like "it's RAM — the system has too little RAM," but digging further typically shows that when the system runs out of memory, it starts compensating for the shortage by loading the disk and CPU. So it again ends up being CPU or I/O — but still the hardware.

If it's not the hardware, the problem can be found in the software itself. It's not that rare that a system is explicitly programmed or configured to limit itself before it hits a natural hardware limit. In Linux, there is a mature and flexible set of cgroup controllers that can be used to cap the usage of pretty much any hardware resource out there: CPUs, disks, network. This artificial throttling can be applied for many reasons. One of them, for example, is the observation that limiting a resource is one of the ways to guarantee that resource to other potential consumers. Another reason can be an attempt to chase good latencies on some hardware. Some time ago our CTO Avi Kivity and I gave a webinar — here is the link on this slide — hosted here at the Linux Foundation, describing how rate-limiting disk I/O can be used to achieve the best I/O latencies. The same work, by the way, is also available in writing on our company's blog, so you're welcome to go and read it too.

And finally, if it's not the hardware and not explicit throttling, there can be a non-obvious reason for the slowdown. Let's proceed and look at one of them. Why "one of"? Because there may well be more that haven't been discovered yet.

So let's take a look at the pretty common and generic producer-consumer programming model. There is a subsystem that generates some data at a rate of P messages per second, and another subsystem that consumes this data at a rate of at most C messages per second. For the pipeline to operate, the consumption rate must not be less than the generation rate, so from now on I'll always assume that's the case. Next, what can a producer be? Pretty much anything — for example, a thread or process that does I/O or sends packets over the network. And the same for the consumer: it's pretty much anything that consumes what the producer generates. For simplicity, from now on let's imagine that the producer is a Linux process sending requests to the disk, and the consumer is the disk itself. But as I said, the conclusions apply to any kind of producers and consumers.

To make this model more realistic and actually see the phantom jam, we need to add an interposer into this chain, which I call the dispatcher. It acts as a consumer and a producer at the same time: it gets the data from the producer, it can queue it internally, and then it forwards the data to the real consumer, doing so at a maximum rate of D wakeups per second and at most M messages per wakeup.
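To make the rest of the discussion easier to follow, here is a minimal sketch of this pipeline in code — my own reconstruction for illustration, not the actual simulator from the experiments; the rates, tick length, and names are all made up:

```cpp
#include <algorithm>
#include <cstdint>
#include <deque>
#include <iostream>

int main() {
    const double tick_sec = 0.0005;   // dispatcher wakeup period (0.5 ms)
    const double P = 150000.0;        // producer rate, messages/sec
    const double C = 200000.0;        // consumer max rate, messages/sec
    const double k = 1.5;             // dispatcher overload factor (+50%)
    const std::uint64_t per_wakeup =
        static_cast<std::uint64_t>(k * C * tick_sec);  // msgs per wakeup

    std::deque<std::uint64_t> queue;  // dispatcher's internal queue
    double produced = 0.0;            // fractional carry for the producer

    for (std::uint64_t tick = 0; tick < 20000; ++tick) {  // ~10 s of model time
        // Producer: emits P * tick_sec messages per tick, perfectly paced.
        produced += P * tick_sec;
        while (produced >= 1.0) { queue.push_back(tick); produced -= 1.0; }

        // Dispatcher wakeup: forward at most per_wakeup messages onward.
        std::uint64_t n = std::min<std::uint64_t>(per_wakeup, queue.size());
        for (std::uint64_t i = 0; i < n; ++i) queue.pop_front();
        // Consumer is left implicit here: with no jitter, n per tick never
        // exceeds what it can drain (n <= C * tick_sec in steady state).
    }
    std::cout << "dispatcher queue at end: " << queue.size() << "\n";
    return 0;
}
```

With ideal, perfectly paced timing like this, the dispatcher queue stays near zero whenever P <= C — the "perfect and clean model" I'll come back to in a moment.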
This dispatcher component I've introduced is in fact not as artificial as it might seem. Dispatchers are everywhere. There is an I/O scheduler in the Linux kernel that works exactly as described, interposing the I/O flow from a process to the disk. There is the same thing in the networking stack, called the traffic shaper. And if you think of it, the whole TCP outgoing path is a dispatcher: when you send data into a socket from your program, it doesn't immediately hit the wire. The TCP code in the kernel may — and actually does — collect packets internally, merge them or split them, and then send them out the way it prefers. Generally speaking, adding a dispatcher to the producer-consumer model is always justified by the need to provide something: fair scheduling, resource control, access-policy enforcement, buffering, routing, et cetera.

Of course, the dispatcher can become a bottleneck itself if it works too slowly — by limiting itself, waking up too rarely, or sending too few messages down the pipeline. To get a real phantom jam, from now on I will also assume that the dispatcher doesn't slow itself down, and that it passes at least as many messages per wakeup down the pipeline as it would have passed if it were fully synchronous, just forwarding everything one-to-one.

To conduct the experiment, I started by loading SiloDB with a stress workload, but the effect was blurred and hard to demonstrate. So I started dropping unrelated pieces from the experimental stand. Eventually I was left with just a C-star application doing simple network-to-disk forwarding, and later I patched it further to exclude the real hardware completely. I ended up with a simulator with three bare software components: the producer, generating messages at the configured rate; the dispatcher, waking up every half a millisecond — I say half a millisecond just to have some scale, but it can be any timing — and, so that it would not throttle intentionally, forwarding 50% more messages per wakeup than the consumer's maximum rate required; and the consumer, consuming messages at the fixed rate of 200,000 messages per second.

With that perfect and clean model, no jams were observed; it just worked. So I made another change to the simulator: I added artificial jitter to each of the components to simulate some real-life behavior. In ideal circumstances, when messages are generated, dispatched, and consumed at precisely the given rates, no jams were seen, as I said. The jitter injected a randomly distributed disturbance into the generation, dispatching, and consumption delays. To be more specific, I tried to simulate real-life-ish non-uniformity and used Poisson-distributed delays, with the average delay being the configured one.
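For reference, here is one way such jitter can be injected — my sketch, assuming (as described above) Poisson-distributed behavior, which for a stream of events means exponentially distributed gaps between them with the configured average:

```cpp
#include <random>

// Jittered delay generator: events form a Poisson process, so the
// inter-event delays are exponentially distributed around the mean.
struct Jitter {
    std::mt19937_64 rng{42};
    std::exponential_distribution<double> gap;
    explicit Jitter(double mean_delay_sec) : gap(1.0 / mean_delay_sec) {}
    double next_delay() { return gap(rng); }  // seconds until next event
};

// Usage: replace the fixed producer/dispatcher/consumer periods in the
// simulator sketch above with draws from Jitter. For a producer emitting
// P messages per second: Jitter producer_gap(1.0 / P); then wait (in model
// time) producer_gap.next_delay() between consecutive messages.
```

The average rate stays exactly the same; only the spacing becomes irregular — and that alone is enough to change the plots dramatically.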
And with jitter, here's what I got. First, I added jitter to the producer, so the producer started generating messages not one every fixed fraction of a second, but with more randomized delays. The x-axis on the plot represents the message generation speed, and the y-axis is the message pass-through time through the whole pipeline. More exactly, there are three lines: for the maximum, the P95, and the P99. Also pay attention that the y-axis scale is logarithmic. So, first of all, messages take different times to get through. That's okay — it's not yet a jam, just notice it. Since messages are generated with random delays, they may pick up some small random delays in the dispatcher.

The second thing is that when messages are generated at a speed close to their maximum possible consumption speed — the right part of the plot — the time to process a message increases about 200 times. It looks like bad news, but it's still not the phantom jam I'm talking about. This 200-fold increase can be explained too, but that's not the most interesting part.

Now, if we add jitter to the consumer, a similar thing happens: messages get processed at different times, and when the generation rate is close to the consumption rate, the time to process grows approximately 200 times larger. That's really sad, but again, it's not the worst thing that could happen.

The worst thing happens when the jitter is added to the dispatcher. First of all, note that the plot looks completely different. Second, note that the scale of the y-axis is 100 times larger — it now goes up to seconds. And the third thing is that the problem happens much earlier, well before reaching the maximum generation speed.

To get a better idea of what's going on, let's put all three plots together and keep only the P99 lines. The red and green lines are for the producer and consumer, and the blue one is for the dispatcher. So what's new here? First, even when the generation rate is relatively small, jitter in the dispatcher makes the processing time 20 times worse, which is already bad news. Previously, if the generation rate was 10-20% of the maximum consumption rate, the time to process was small; now it's 20 times worse. Second — and that's exactly where the phantom jam is — starting from some point, in this experiment from 160,000 messages per second, the plot goes sharply up. And it doesn't just go up: looking at the experiment in more detail, it becomes clear that at that point the dispatcher effectively stopped sustaining the incoming flow, and its internal queue just grew infinitely until the end of the experiment. That is exactly the phantom jam. Remember: the producer didn't produce more messages than the consumer could consume, and the dispatcher was neither programmed nor configured to limit itself or the pipeline — on the contrary, it was over-feeding the consumer with messages.

The effect we've seen on the previous slide gives rise to what we later called the effective dispatch rate. This is the maximum generation rate at which the dispatcher can still pass messages through itself. On the previous slide it was about 20% below the maximum, and only the dispatcher's jitter caused that. In fact, the dispatcher may overload the consumer even more than it did in that experiment, and it will still get jammed. This is how I checked it. In this plot, the x-axis is the overload coefficient: the dispatcher would send from one and a half to four and a half times more messages per wakeup than it should according to plain consumer expectations. The y-axis is the effective maximum throughput in the sense I've described above — the point at which the dispatcher stops maintaining the queue and just grows it infinitely, without any chance to drain it if the rate persists. There are three lines, for three different wakeup ticks, but they don't change anything; the only thing that matters is the x-axis overload factor. You can see that overloading indeed helps, but it doesn't completely solve the problem.
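Before moving on, let me put that arithmetic into symbols — my own back-of-the-envelope notation, not something from the slides. Let the dispatcher wake up $D$ times per second and forward at most $M = k \cdot C / D$ messages per wakeup, where $k$ is the overload coefficient on the x-axis ($k = 1.5$ in the earlier experiment). Naive average-rate accounting then says the pipeline is stable whenever

$$
P \;\le\; \min(C,\; D \cdot M) \;=\; \min(C,\; k\,C) \;=\; C \qquad \text{for } k \ge 1,
$$

that is, any generation rate up to the consumer's limit should pass. What the experiment shows instead is an effective dispatch rate $P^{*}$ strictly below $C$ — roughly $0.8\,C$ (160,000 out of 200,000 messages per second) at $k = 1.5$: once $P$ exceeds $P^{*}$, the dispatcher's queue grows without bound even though the averages still satisfy the inequality above. Raising $k$ pushes $P^{*}$ upward but, as the plot shows, never all the way to $C$.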
Overloading the consumer too much increases the effective dispatch rate, but it has a negative side effect: it leaves the dispatcher with much less control over the flow. In the extreme case, when the dispatcher just doesn't queue anything, it has an effective rate of 100% — it just forwards everything and nothing gets stuck — but then the whole point of the dispatcher is lost. It loses the ability to dispatch anything.

Some extra bad news is that I don't have a full, correct theory of the effect — probably I don't have it yet. But eventually the understanding of this thing was reduced to three ingredients that need to come together to get a phantom jam. The first is the interposer component, the dispatcher. The second is the internal dispatcher queue that it uses for whatever kind of activity it was created for — which means that dispatchers without queues will not produce this effect. And the most important ingredient is so-called cooperative preemption. This last part is actually the main reason why it's very hard to observe this on plain Linux, fortunately. Cooperative preemption is what makes the dispatcher starve for CPU time when it needs to dispatch its queue down the pipeline. In simple words, in a cooperatively preempted system, code that's executing on a CPU cannot be forcibly preempted; instead, it yields the CPU to some other component on its own. In Linux, by contrast, preemption is enforced by the scheduler and has been for a long time already. If there is a dispatcher that needs the CPU at some point in time, the chances that it will get it are extremely high: Linux will most likely wake it up and give it a CPU. It would take significant effort to put the system into a state where components get unreasonable delays — and even if you do, you will provoke real jams, not phantom ones. Having said that, the phantom jam's friendly environments are those with non-enforced CPU yielding: for example, coroutines — one of the better-known examples being goroutines in Golang; some unikernel implementations that do not implement forced preemption; and, unfortunately, the C-star library, which I mentioned at the beginning of the webinar, on top of which SiloDB is built.

A few words as a wrap-up. First, the reasons for a system working slower than you expected might not necessarily sit in the hardware: weird, non-obvious effects may happen during the components' interaction. And the way to pin them down — which is how we came across this in SiloDB — is definitely a well-designed and well-maintained monitoring system, a good understanding of what its metrics mean, and the ability to read them and draw good conclusions. This webinar also has a written version available on the SiloDB company blog; it has a little more information than what I've managed to describe here, so you're welcome to go and read it. Thank you for watching.

And now we have another quick audience poll, for a sense of scale. We'd like to understand how much data you have under management in your own transactional database system: below one terabyte, one to 50 terabytes, 50 to 100 terabytes, or above 100 terabytes? Please pick the answer that best matches your current dataset. Thank you for watching. Now I'm passing the word back to the Linux Foundation.

Thank you so much, Pavel, for your time today, and thank you everyone for joining us. As a reminder, this recording will be on the Linux Foundation's YouTube page later today.
We hope you'll join us for future webinars, and have a wonderful day.

Okay, now I guess we have some time for questions. I don't see anything in the Q&A yet, so everyone, if you have questions, please drop them in the Q&A. We definitely have enough time to answer as many questions as you may have.

Okay, there is a question: does this effect happen only with disks, or can it be seen on other hardware? That's a very good question. The answer is yes, it can be seen on any other hardware — you can observe it with network adapters, for example. In fact, you can observe it not only with hardware: the consumer part can be anything, any other piece of software. It can even be a high-level piece of software, for example a remote service. You still have a chance of getting it.

One more question: does it help if the components are executed on SMP machines, on different CPU cores? Frankly speaking, I didn't conduct such an experiment, because C-star pins everything to individual cores, and my goal was to chase the jam happening on a single core. But from our understanding of it, on SMP you can still get it, though it may be harder to trigger — just as it's hard to trigger in Linux with its enforced preemption.

Let's give everyone another minute or so for more questions. And Pavel, in case people have questions afterwards, is there a good place for them to follow up with you? Unfortunately, I'm not very active on social media. If you want to contact me — I thought I had my contact on the slides; let me check... unfortunately, I don't. Can I put my email into the chat? Will it be available? Yeah, if you want to put your email, just make sure it's sent to everyone, and then all the attendees will see it.

Oh, and it looks like we did get one more question in, if you want to look at the Q&A. Oh yes, we have a question: "Did you estimate the delay based on data? You should check out the WAMP and BART protocols, because they are used to estimate available bandwidth in the network." If I understood the question correctly, I didn't measure the delay — I actually generated the delay. The goal was to find out, or sort of prove, that components can limit themselves in speed, or throttle themselves, just because of the way they are programmed. So this thing didn't depend on the data size or on the specific latencies of the hardware. It just happened that, in the way these three components interacted with each other, there was this sort of self-throttling that we didn't intentionally want to have.

All right, if we don't have any other questions at this time, we can go ahead and wrap up. Thank you again so much for all your time today, and thank you, everyone, for joining us. The recording will be on the Linux Foundation's YouTube page later today, and we hope you'll join us for future webinars. Have a wonderful day. Thank you.