Welcome, everyone. We're going to go ahead and get started because we don't have a lot of time, so if you need translation earpieces, please put them in now. My name is Josh Berkus and I work for Red Hat. These days what I work on all the time is Kubernetes: I'm in SIG Release, I do a bunch of contributor-experience work, and I spend a bunch of time working on storage stuff. But how I got into Kubernetes in the first place was that I was looking for a way to host and automate databases for fully automated HA application stacks, and one of the questions that came up originally was: can you run databases in a Kubernetes, cloud native environment? And there were people who said no. There's a fairly famous tweet from a conversation I had with Kelsey Hightower, where he said no, I don't run databases on Kubernetes, for a variety of reasons. We discussed some of this at a previous conference and talked about why he would say something like that.

There are a number of reasons. Number one: setting aside implementing HA, running a database on Kubernetes is no easier than running a database on bare metal, and therefore you need the same skills. Just putting it on Kubernetes does not make DBA headaches go away. The second thing is that setting up storage for Kubernetes can be complicated and hard to figure out. And the third reason, potentially, is performance. This is one of the things people worry about: if I put my database on Kubernetes, what is performance going to be like? Now, I'm involved with another project, Rook, that's dealing with the storage-setup complexity issue, trying to make it easy to set up cloud native storage for Kubernetes and OpenShift. And I don't have answers for you on the database-management issue: I was literally troubleshooting a thing this morning for a project
I'm involved in, where one of the developers noticed an error from the database and decided to try to fix it, which meant they then needed to call me up to fix what the developer did. You still need to actually understand how the database works, even if it's operating in a cloud native environment.

So what this talk is actually about is the third component, performance: if I throw my databases onto Kubernetes, onto cloud native storage, how do they perform? Is it good enough for my applications? That's what I've been spending a few months on, working with a particular cluster.

The reason performance is critical is a couple of things. One is that if you talk to any group of database geeks, or any group of infrastructure-admin geeks, they care about performance; it's one of the first questions they ask, and therefore you have to have an answer for them. The second thing is that for any platform there's a trade-off between ease of use, ease of management, and speed. Just as an example: low-level programming languages like C and assembler execute faster than higher-level languages that are often easier to use and easier to learn, and the same is true of infrastructure platforms. And a third is that if your application was designed to expect a certain level of performance from the database, and you can't get that level of performance on a cloud native platform, that is going to be a blocker to migrating to Kubernetes and to cloud native technology in general; you don't want that to be a blocker. The other reason is that I've been doing database performance work since my hair was still blond, so I'm not going to stop; I actually find it kind of fun, because I'm weird that way.

And the main way you actually deal with database performance is through benchmarking. Now, we've got a bunch of people here from PingCAP;
they know what database benchmarking is. But for everybody else, let's talk a little about what benchmarking is and isn't. When we're talking about benchmarking, we're actually talking about comparing. We want to compare two things that are equal in all but one respect, so we can find out what effect that difference has: between two different types of storage, from one release to another, from one configuration change to another, or against a requirement we already have on spec, like "must return X responses per second." So any time you're looking at benchmarking, you should be asking: what two things, or three things, or ten things am I going to be comparing?

For this talk, what I am actually comparing is types of storage. Part of how I got started on this is that I work at Red Hat, where I collaborate with the Ceph and Rook teams, and one of the things we wanted to know is: if you have pure cloud native storage, like a Rook/Ceph stack, is performance on that acceptable for a database workload? Nobody had a good answer that I saw before I started doing this. So over time I've been comparing four basic types of storage: bare metal; Kubernetes using node-local storage, either a local directory or a local PV; network storage, which is something like EBS or other cloud-provider storage; and then cloud native distributed storage, something like a Rook/Ceph combo or other distributed file systems. Now, I already did a run of this, and you can find my talk from KubeCon Seattle, where I did comparisons on AWS and actually checked performance for cloud native storage, for cloud-provider storage, and for network storage.
So I'm not going to be providing a comparison for network storage today, because this set of tests is all about bare metal.

Beyond that, we need to talk about the types of I/O performance you get in a database workload. Database workloads basically do four things, and they often do them concurrently with each other: random reads, reading one small fact at a time; random writes, writing one small fact at a time; sequential reads, where you read big blocks of data; and sequential writes, where you write big blocks of data. Databases will do all of these things concurrently or consecutively, so we care about the performance of all of them. Generally, in a synthetic benchmark we can combine the random reads and the random writes, which we'll be doing in the benchmarks I show you, because most workloads that do random reads also do random writes, and you can test them together without distorting the numbers.

Then we care about two different classes of metrics for those. We care about latency: if we make a request, how long does it take that request to return? That affects application performance. And we also care about throughput: how many requests per second, or how many megabytes of data per second, can I read from or write to the database?
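To make the relationship between these two metrics concrete, here's a back-of-the-envelope helper. This is not from the talk; the function and the numbers are my own illustration. In a closed-loop benchmark, each client waits for a request to finish before issuing the next one, so average latency puts a hard cap on throughput:

```python
def max_throughput(clients: int, avg_latency_ms: float) -> float:
    """Closed-loop upper bound: each client completes at most
    1000/avg_latency_ms requests per second, so N clients together
    cannot exceed N * 1000 / latency."""
    return clients * 1000.0 / avg_latency_ms

# Hypothetical: 32 clients each seeing 2.8 ms average latency
# top out around 11,400 requests per second.
print(round(max_throughput(32, 2.8)))  # prints 11429
```

This is why latency and throughput numbers in the charts below tend to move together: once latency doubles, a closed-loop benchmark's throughput roughly halves unless you add clients.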
So we care about both of these. What we have here, then, is three different storage platforms, three different types of I/O, and two different classes of metrics we care about. That's a lot of different tests across the different environments. Fortunately, the database industry has been obsessed with benchmarking for a long time, so there are tons of benchmarking and measurement tools already designed to generate these kinds of traffic and record this kind of information that you can just pick up and use, and I'll be naming a few of them here.

In particular, since this is not my full-time job (most of my job is managing Kubernetes community stuff), what I ran is a series of what are known as micro-benchmarks. On one hand you have the big benchmark suites, things like the Transaction Processing Performance Council benchmarks and SPEC, which run an enormous battery of different tests and are audited and so on; that's a Benchmark with a capital B. And then you have micro-benchmarks, which are smaller, easy-to-run workloads. If you're running benchmarks personally, for yourself and not for publication, and you just want to know how your hardware performs, micro-benchmarks are what you're looking at. I actually ran three different micro-benchmarks: sysbench, Postgres's pgbench, and the CockroachDB workloads. I am going to show you the CockroachDB workloads and the results I got, but unfortunately those are no longer open source, so this is the last time you will see them from me; they changed the license, and they're not open source anymore.

So, let's talk about sysbench.
Sysbench is a nice toolkit, and honestly the first thing you should try, because it's a sort of omnibus micro-benchmark created by the MySQL folks years ago. It can test a whole bunch of different kinds of system performance: CPU, memory, database tests, I/O tests, that sort of thing. In these examples I'm just using it to test some of the I/O operations directly. Postgres's pgbench is a super simple database benchmark that ships with Postgres. It runs a database micro-benchmark and measures basically two things: random transactional reads and writes, and load-and-index times, which simulate a data-loading analytics workload. And then the CockroachDB people created a really nice suite of light benchmarks that they use, originally published as open source, including bank, which is a lot like pgbench in its operation, and TPC-C, which is a much more complex write-heavy workload with a lot of locking and lock conflicts in it, which is a common problem in databases. Of those two, I found that bank was really good for measuring throughput, and TPC-C was really good for measuring latency for complex operations that need to do transactions.

So now let me give you some tips, because part of my goal in this talk is not necessarily to show you the numbers I have. The numbers I have are not your numbers: your hardware is not my hardware, your application is not my application, and your stack is not my stack. What I want you to get out of this is that you can do this yourself, and you should do this yourself, particularly before you deploy a new platform in production, and it's not that hard. So, a few tips.

One: if you're doing micro-benchmarking, you need to do a bunch of runs. Don't do one run, record that number, and say that's how it is, because there's a certain amount of randomness in all of these benchmarks, and if you depend on a single run, the randomness may be what you get instead of a real result. You also need to do long runs. A lot of people make the mistake of saying "I did a 30-second run and I got this," but you have memory-cache effects and CPU-cache effects and a lot of other things that will give you artificially inflated performance on really short runs, and you really need to see how the system behaves under a more sustained load. Ideally you want to test multiple database and file sizes; in particular, you want to measure both things that fit in memory and things that don't. You want a concurrent workload, because your real production workload is going to be concurrent: multiple users accessing the database or the file storage at the same time. And you want to use bare metal.

Now, a lot of people ask: why would I want to use bare metal? Here's the problem. I've also done a lot of benchmarking on cloud providers, and the problem with benchmarking on cloud providers is that a lot of your performance effects have more to do with who else is on the cloud with you than with any changes in your platform. If you look at the stuff I presented in Seattle, at benchmark runs on AWS, I consider the minimum number of runs for a single AWS workload at a specific size to be something like 25, and for each of those I actually recreate the instances, because you never know what the effect of a bad instance or a noisy neighbor is going to be. Also, frankly, a large cloud instance is a pretty small bare metal machine, so you're not really going to be testing things for really large workloads. If you have a choice of platforms, bare metal is going to give you more useful results.

That said, let's look at some numbers, because that's one of the things we care about here, right? I do want to add some caution first: please do not compare the numbers between different benchmarks and different databases. These numbers are not meant to be comparable.
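The "do a bunch of runs" advice is easy to automate. Here's a minimal sketch (my own illustration, with made-up throughput figures) that reduces repeated runs to a median plus a coefficient of variation, so a single outlier run can't masquerade as your real result:

```python
import statistics

def summarize_runs(tps_per_run: list) -> dict:
    """Reduce repeated benchmark runs to a robust summary.
    The median resists outlier runs; the coefficient of variation
    (stdev / mean) tells you whether the runs agree well enough
    to trust, or whether you need more of them."""
    median = statistics.median(tps_per_run)
    cv = statistics.stdev(tps_per_run) / statistics.mean(tps_per_run)
    return {"median_tps": median, "cv": round(cv, 3)}

# Hypothetical transactions-per-second results from five runs,
# one of which hit a noisy neighbor:
runs = [11020.0, 10980.0, 11100.0, 9800.0, 11050.0]
print(summarize_runs(runs))
```

A rule of thumb: if the coefficient of variation is more than a few percent, do more runs (or longer ones) before drawing any conclusions from the comparison.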
So the TPC-C benchmark for CockroachDB does not perform the same activity as the pgbench one for Postgres, and those numbers are not comparable. Also, the databases were minimally tuned, on purpose: basically, I did the pro forma performance tuning you could do in ten minutes for any of these databases, because that's about what most people do, and I didn't want to make this an exercise in performance-tuning the database; my goal was to test the cloud native platforms. And again, like I said, my software and my hardware are going to be different from yours. You need to test yours and not just read my numbers.

So here was the bare metal platform I was doing this on. I'm continuing to run tests on it because I still have the cluster. It's a six-blade cluster; each of the blades has 20 cores, 128 gigabytes of RAM, and two SSDs with 200 gigabytes of storage each, on a shared network. This is in a lab we have at Red Hat; we had it for testing something else in the past, and I repurposed it for this.

Now, there is one caveat here that limited the kinds of tests I could run. One of the things you want to test is things that fit in memory and things that don't fit in memory. The problem is that when you have a drive that's 200 gigabytes and you have 128 gigabytes of RAM, it's very hard to test the out-of-memory case, because you'll run out of disk space. So you won't see that in my numbers: my numbers are all fits-in-memory cases, because I don't have the storage to do out-of-memory cases.
I'm waiting on some new SSDs so I can actually test that. And that does mean that the primary I/O we're actually measuring is file sync time, the time to actually write and commit stuff to disk, rather than raw throughput from the storage; that's going to dominate our numbers. The other thing is that I have a shared network, and that had an effect on one of the later tests that I will show you, where it eclipsed the effect I actually wanted to compare.

So, the first set of tests. You want to start out with a baseline: this is what I would get on a plain platform, before I do anything cloud native. So we have a host file system, a host install with no Kubernetes, to get reference numbers, just using XFS and LVM and hosting the databases and file storage on that. First we start with sysbench, which just checks the direct I/O numbers; it runs a bunch of I/O operations and records the results. The numbers we got from the SSDs here: random reads per second is 10,000-something, a high number, because we're getting these out of memory (again, I can't use a file size bigger than memory), so that should be really fast, and it is. But random writes are also really fast: 7,000 random writes per second. Nice fast SSDs, and I'm able to use two of them there. And I can read 22 gigabytes per second of data.
So those are our throughput numbers; and I can write 88 megabytes per second of data. Here you see what's going to be a typical pattern for SSDs: SSDs are much faster than old spinning drives for random writes, but they're not necessarily all that much faster in throughput, because that's writing big blocks of data.

Now, a slightly more complicated chart here: the set of database benchmarks. We've got pgbench, where we measure throughput in terms of database load time, simulating loading large amounts of data into the database; transactions per second, as another measure of throughput; and then average latency, to measure latency. For bank on CockroachDB it's similar: transactions per second, and 95th-percentile latency. And for TPC-C, when you run that benchmark competitively the standard metric is new-order transactions per minute; I'm reporting per second, so what I have there is whether or not I'm meeting an arbitrary target of how many new orders I can process, plus 90th-percentile latency.

So on bare metal, for pgbench: 404 seconds for the bulk load (again, lower is better here) and 11,000 transactions per second.
So I'm actually doing slightly better than the raw file-system numbers would suggest; there we see Postgres's batched writes kicking in. And an average latency of 2.8 milliseconds.

Now, one of the things I actually discovered through this is that it's a lot easier to install and configure CockroachDB on Kubernetes than it is on bare metal. As a matter of fact, I was getting performance figures that were so bad that I was sure I had misconfigured something in the bare metal install, and all of the CockroachDB instructions and advice on how to configure CockroachDB for performance are oriented at a Kubernetes environment. So I'm not going to show you those numbers, because I don't think they're realistic; and after that, CockroachDB changed the license, so I'm not going to rerun any of it.

The next configuration is local volumes. If you're hosting a database in Kubernetes and you're concerned about database performance, this is actually what you're probably going to be doing, provided that your database can manage replication and failover itself, without being dependent on Kubernetes persistent volumes to do it for it: you use local storage, either with hostPath or with the new local PVs. And the thing is, this is local storage in a container, so performance should be almost identical to bare metal, right? All we have is some cgroups overhead and whatever the Kubernetes networking overhead is; otherwise we really are running on bare metal. So let's look at that.
So, sysbench here again: a minuscule difference on the sysbench tests, almost immeasurably small compared with running on straight bare metal. The database tests are a bit more different. Bulk load is a little slower, about 10% slower; transactional throughput is about 15% lower; and latency is higher. Those two go together: higher latency and lower throughput on a random-write workload makes a lot of sense. And here we have our first numbers from CockroachDB, which we're going to compare later on with cloud native storage.

But let's talk about what's going on here, because that was a much higher penalty than I was expecting for what is basically a container wrapper. Remember I mentioned that we're on a shared network for this set of blades? That's actually what's going on. For the bare metal test I had to run the pgbench client on bare metal, which means that, to make things a fair comparison, I'm still running the pgbench client on bare metal and using a NodePort to route it to Postgres running on Kubernetes. The problem is that this adds a couple of extra network hops for every request, and in a workload like pgbench, with a lot of requests whose entire round-trip time is less than a millisecond, those extra network hops really count. As a result, and particularly on a shared network where we don't have a dedicated network for Kubernetes networking, you can really see that in the increased latency and the dropped throughput.

So, the next storage configuration. This is what I really cared about comparing; this is what I'm evaluating here. We've got Rook storage: a five-node Rook plus Ceph cluster, with only two replicas per data block, because it was such a small cluster; I didn't want to do the standard 3x replication when we're only talking about five Rook nodes. Plus some default performance tweaks off of the Rook documentation.

Also, by the way, here's an important thing, since I'm going to be showing you a CockroachDB result. There are two different ways you can do CockroachDB with Rook. The standard thing is to have Rook manage CockroachDB for you, which is honestly the easiest way to install CockroachDB, if you still want to install it at this point. But my goal was to test Ceph performance, so I actually installed CockroachDB on top of Ceph rather than installing CockroachDB with Rook. This is probably not something you would actually do, because CockroachDB and Ceph provide similar levels of redundancy, so I have double redundancy here, which is probably not necessary for most workloads; but it does help me measure performance.

So let's look at sysbench here. Just on raw file I/O, I was really pleasantly surprised: that is a minor loss of performance. Mind you, keep in mind this is files that fit in memory. But even on the writes, Rook is just not imposing much of a penalty on my write speed. And considering that I basically have an automated backup in this, right?
Those files all exist in two places; that's actually really nice. The other weird thing is that my sequential reads, my large-block reads, are actually faster. I talked to some of the Ceph people about why that would be faster than bare metal for sequential reads, and it turns out that if you're reading a bunch of contiguous blocks, Ceph will try to pull them from multiple replicas so you can read them faster, and that has a real-life benefit in the sequential-read case.

Now let's actually talk a little about the database benchmarks here. Bulk load got slower; I expected that, but it's only about a third slower. The biggest thing is that my latency doubled, which is not a big surprise for redundant cloud native storage. What is pleasant is that it only about doubled. How many people here have worked with redundant clustered file systems before? I've been doing this for a number of years, and doubled is actually a very good number.
With some of the older ones, like MooseFS and OrangeFS and such, we would be looking at more like quadrupling or quintupling the latency, so double latency is actually pretty good, because we are, after all, writing everything twice. Same thing with the CockroachDB benchmarks: we're getting about double the latency and half the throughput, which are directly related to each other. Now, TPC-C is a target-based benchmark, so that 1,290 is how many transactions we completed within the target window, not the maximum number of transactions we could complete; and we could still make the threshold, because that doubled latency was still below our threshold, and Rook/Ceph didn't add any additional overhead beyond it.

Some things I still want to do to tinker with this: obviously, I want to run it on something without a shared network, so I can eliminate the shared-network effects from my figures; I want to get bigger SSDs so that I can do bigger workloads; I won't be doing CockroachDB anymore, so I'll be looking at another cloud native database to run performance tests on; and maybe do some additional test tuning.

But let me give you some conclusions, some takeaways from this. First of all, you can benchmark your own hardware and your own cloud native stack with simple database benchmarks, to test the performance of that stack and whether or not you need to change it. Local-volume performance should be roughly equivalent to bare metal. Rook/Ceph has good throughput but about double the latency compared to running on bare metal; which, again, if the redundancy of cloud native storage is valuable to you, double latency is good, and if it's not valuable to you, you should be aware that it exists. More importantly, beware of secondary issues that look like performance differences, like my shared-network problems. And one more important thing: go back and look at my presentation from KubeCon Seattle, and you will see that on public cloud, cloud latency effects mask a lot of performance differences;
you can't actually see a lot of these performance differences, because the effect of the cloud itself is so large.

So: questions, and contact info. If you have questions about Rook, a couple of the Rook developers are here at the conference, and there are actually two Rook talks later today, one in the next time slot and one at 6 p.m., so you can find out more there. My contact information is up, and we have about four or five minutes for questions. Go for it; we'll pass the mic back.

Q: Thank you for the talk. I have some questions. The first is: what kind of SSD did you use in the testing?

A: I don't remember the model right off; I'll get back to you on that. We built these systems a while ago, and I just don't remember what the model was.

Q: Is this SATA or NVMe?

A: No, it's PCI bus.

Q: Okay. The second is: in the sysbench results for the bare metal environment, I see the sequential write is quite a small number, no more than 100 megabytes per second. Why is it so slow?

A: That's probably the SSDs' write path. Ultimately I'm writing to a single SSD, and with a large write, because it's writing several gigabytes of data, both the bus and the RAM cache on the SSD are going to fill up pretty quickly, and at that point you're seeing its raw performance, which apparently isn't that great. I didn't actually care that much about that number except as a reference, but it's not that unusual for large writes: all of an SSD's benefits are in random writes and small writes; for large writes, you're operating at the speed of whatever the slowest component in the whole bus is, which apparently was not that fast in this case.
These are blades, after all, and not full servers.

Q: Okay, another question: in your Ceph environment, did you use the same configuration, for example the two SSDs?

A: So, the configuration for Ceph: yes, but Ceph gets the raw devices, so it used each SSD separately. Each SSD is a separate device, and Ceph handles bundling them together.

Q: So maybe for one SSD you have multiple OSDs?

A: No, I think each one is one OSD. Yeah, each one is one OSD.

Q: Okay, thank you.

A: More questions? Surely our database crew here has some questions... No? Have you folks tested anything on Ceph, Rook, Minio, any of the cloud native storage things? Okay. Well, thank you.