Good morning, everybody. Bonjour. My name is Jay Pipes, I work at Mirantis, and this is Peter Boros, who works at Percona. Today we're going to be talking about MySQL in its relation to OpenStack: what works, and what doesn't work very well. We're going to show you some analysis of the MySQL query load that is generated by the various OpenStack services during some runs of the Rally benchmarking program.

We're going to talk a little bit about our setup. We used an Amazon Web Services virtual private cloud for a whole bunch of nodes to do the testing. We'll cover our initial setup, then what we actually ended up using based on some problems we ran into with the first setup, and then how we analyzed the MySQL query loads that were generated from the Rally tool. We'll show you output from Rally and pt-query-digest, and Peter is going to talk a little bit about how Galera Cluster works compared to a standard MySQL master-master or master-slave setup.

So what did we plan on doing for this talk? We really did want to isolate the interactions between the various OpenStack components (Nova, Neutron, Cinder, all that kind of stuff) and the back-end database server. We were not interested in whether Xen is faster than KVM, or any of the stuff that goes on at the virt layer. We are strictly trying to examine the database communication between the various nodes within the OpenStack services layer and the database, and to identify the types of bottlenecks that occur on the database side of things under realistic workloads.

Now, there's a little bit of a trade-off there, because we're using the fake virt driver in Nova. It's not actually creating a VM and plugging in networking and all that kind of stuff, so we're able to execute a Rally boot-and-delete scenario of tens of thousands of VMs. Of course, that doesn't represent real-world timing, because those VMs are going to be created very quickly in comparison, but it does allow us to isolate the database layer and look at the types of queries that are generated by the Nova, Neutron, Cinder, and Glance services for that type of payload.

So we used multiple Amazon Web Services instances installed with OpenStack services. We used unmodified Icehouse packages, and unmodified Percona packages for MySQL Galera Cluster. A lot of what we talk about today, by the way, will be about Galera Cluster and Percona XtraDB Cluster, which is Percona's enhanced version of the upstream MySQL Galera Cluster. But a lot of it is completely applicable to anyone running a standard MySQL master-slave setup as well: the queries that are generated are exactly the same regardless of whether you're using a clustered database setup like PXC or a MySQL master-slave setup.

So, of all the people here, how many people use standard MySQL master-slave replication? Okay, about 15 to 20 percent. How many people use MySQL Galera Cluster? How many people use a combination of both?
Okay, anyone use Postgres or another database, DB2? Okay, we have one Postgres person. Cool.

So yeah, we tried as much as possible to stick with unmodified packages of everything, because that's really what people are using. And we used a set of Ansible playbooks. Anyone love Ansible like I do? Down with Chef! We had a lot of fun working together on this set of Ansible playbooks, and you can go to GitHub, fork the code, and investigate how we set up the Percona XtraDB Cluster nodes, with the bootstrap node and the other Percona nodes, and all the OpenStack services, the network node, and all that kind of stuff, if you're interested.

The things we were interested in, like I said, are at the database layer. Things we didn't need to worry about: whether, for instance, tenant VM-to-VM communication worked. We're not interested in that, because we're using the fake virt driver and there's no actual VM that gets created. We didn't have to worry about actually starting a VM, or waiting and polling for libvirt or KVM to actually start the VM instance. So there were a lot of things we didn't have to worry about.

We did run into some scaling bottlenecks right away, though. One of those was the controller, and Peter's going to talk a little bit about the first set of problems we ran into when trying to actually put load on the database.

So at first glance we were using seven AWS instances: one controller node, one compute, one network, a separated-out Galera cluster (XtraDB Cluster), and one Rally benchmarking node. These were pretty beefy machines, c3.8xlarge instances. This was just writing the playbook based on the install guide, so installing OpenStack per the install guide. And the database issues we discovered... well, this is actually not a title slide: the database workload was so light it was barely measurable. If I gave you a book with the data, you would literally be able to cope with the query workload manually.

So we started investigating what the bottleneck was. The first one we ran into is that by default neutron-server is a single thread, so it tops out at one CPU core. So we set api_workers and rpc_workers. The next bottleneck we ran into (this is actually different in Juno; in Juno you can have multiple Keystone processes) is that Keystone in Icehouse, if you are not running it in Apache, is a single daemon, so it was again a single-core bottleneck. When running it in Apache we ran into that bug, so that's the only modification we made to Keystone, so that it works.

We also put the queue on a non-controller node. The network node's load was really low, so we figured we could put the queue there. First we actually put the queue on a separate node, but the network node needs to be able to reach it (the DHCP agent of Neutron needs access to the queue), so it's good if it's there.

What we ended up having is one controller node (we want more in the future, and you will see why), ten compute worker nodes, one network-and-queue node, three XtraDB Cluster nodes, and one node for Rally benchmarking. The computes were not as beefy as the others, but the bottleneck wasn't there.
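In config terms, those worker fixes look roughly like this; a minimal sketch, assuming Icehouse-era option names, so double-check your release's sample neutron.conf:

```ini
# /etc/neutron/neutron.conf -- let neutron-server use more than one core
[DEFAULT]
api_workers = 8
rpc_workers = 8
```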
The reason we didn't need or want to use c3.8xlarge instances for the compute nodes is that the nova-compute daemon is a single-process, single-threaded thing, so putting it on a 24-core virtual machine makes absolutely no difference whatsoever. But having ten of them, as opposed to putting ten thousand instances on a single compute node, does make an impact on the query generation. So remember, the objective here is to stress the database out, and we tried as much as possible to stress the database out.

One other thing we had to configure, which you wouldn't usually do in a production environment: despite running with the fake Nova virt driver, Nova still counted quotas, so we weren't able to start that many virtual machines. We had to set the CPU and memory overcommit rates to something like one million. There are the cpu_allocation_ratio and ram_allocation_ratio settings in nova.conf, and they control, when you schedule an instance onto a nova-compute worker, the way it checks its resources and determines whether it has capacity to run that VM. Each worker has its set of physical RAM and cores, and then there's an overcommit ratio, called an allocation ratio, and we adjusted that by a factor of a million for each nova-compute worker, because obviously they're not actually using resources. That got us around the scheduler thinking a compute node was full when in fact it's not at all.

Okay, so we enabled the fake virt driver, and we put in ten compute nodes to have a slightly more realistic setup, so that it wasn't a single record in the compute_nodes table: we had more than one record in the compute_nodes table and more than one record in the services table. If you're familiar with how Nova works internally, there is a servicegroup API, and the services table actually has a foreign key relationship to the compute_nodes table. For each compute worker that gets spun up, there's a record in the compute_nodes table and then two records in the services table, one for the network topic and one for the compute topic. And there's a set of periodic interval tasks that run and update the scheduler's, or rather the database's, idea of the resources consumed on each compute instance. So we upped the number of compute nodes to make that a little more realistic than a single provider of a million VMs.

We set the quotas to negative one. We actually ran a number of benchmarks to see whether setting the no-op quota driver, or setting quotas to something like a hundred thousand per tenant, would make any difference, and it does not. So we just set quotas to negative one, which means unlimited, and which takes a few short circuits within the quota DB driver for the negative-one, unlimited case. We found it didn't really make much of a difference whether we set it to a hundred thousand, where it would actually be doing quota calculations, or negative one, where it just short-circuits them.

Then, originally, we had PKI tokens set up, and we kept running into issues with that. The first issue was that the Keystone auth_token middleware kept not closing file descriptors, so we'd have a bunch of zombied, defunct processes trying to do SSL verification for the tokens. Then, when we tried to do multiple controllers, we ran into more issues with PKI.
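Concretely, the knobs we're describing all sit in nova.conf. This is a sketch from memory (the fake driver path and the quota option names varied a bit across releases, so treat it as illustrative):

```ini
# /etc/nova/nova.conf -- fake virt driver plus absurd overcommit for benchmarking
[DEFAULT]
compute_driver = fake.FakeDriver
cpu_allocation_ratio = 1000000.0
ram_allocation_ratio = 1000000.0
# -1 means unlimited, and short-circuits most of the quota math
quota_instances = -1
quota_cores = -1
quota_ram = -1
```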
So then I just said, okay, let's just put it on UUID tokens, because that's going to stress the database even more anyway. So we set the token backend to UUID in keystone.conf, which does generate a little more traffic, well, quite a bit more traffic, to the Keystone endpoint.

Okay, so what did we test? We used Rally. In the playbook we have the Rally installation as well; it installs from the git repository, and you point it at a deployment. Rally can actually deploy an OpenStack cluster for you too; it has a pluggable deployment mechanism. What we actually want to do in the future is plug our Ansible playbooks, which create this AWS environment, into Rally, so we'd be running a single Rally command to create the environment, benchmark it, and then tear it down.

How many people are familiar with Rally, by the way? Raise your hand if you've used it. Raise your hand if you're familiar with it. Okay. It's an amazing tool, it's really good, I really like it. I do a lot of database benchmarking, not necessarily OpenStack, and I really liked it.

So you create a deployment configuration; we created one for an existing cloud. You can run rally deployment check to see whether everything is available, and then you can create task descriptions. For example, this is the boot-and-delete server test. You give it the flavor, the image name, and this is the net ID of a shared network in Neutron. Rally right now actually needs this, because it creates the users for the benchmark itself: it creates its own users, and then deletes them at the cleanup stage. And it doesn't yet have a facility, although there's a patch in review right now, to create a separate non-shared network for each tenant during the pre-deployment phase. That's the reason we have one shared network for all the tenants. Since we're not benchmarking networking itself, we didn't care too much. I have a feeling the database traffic on the Neutron side will be slightly different with non-shared networks, because the code paths for determining which network to put a VM into are a little different, but for these benchmarks we used the shared network.

With that configuration we run a thousand times with a concurrency of 24, so it creates a thousand VMs and then deletes them, and it creates 50 tenants and 10 users per tenant. And Rally, because it's great, since it's creating its own users, has a facility to specify quotas for those users: you can just set minus one, and you'll be able to create as many VMs as you want.

This is the command you use to start the Rally task. After it ends, you can use rally task list to check on it (even if it hasn't ended yet), and rally task report to generate a report. In my reports you will see that the Rally task didn't end cleanly; there were some issues with the cleanup stage, but I talked to Boris yesterday. By the way, if you are using Rally, join the OpenStack Rally IRC channel. Boris's help is just amazing, and the tool is really good as well. We'll show you the reports in just a bit.

But before showing the reports, let's talk about Galera and how it works a bit.
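For reference, the task description we're talking about looked roughly like this. It's a sketch: Rally's scenario argument names changed between versions, and the network UUID here is just a placeholder, so don't copy it verbatim:

```json
{
    "NovaServers.boot_and_delete_server": [
        {
            "args": {
                "flavor": {"name": "m1.nano"},
                "image": {"name": "cirros-0.3.2-x86_64-uec"},
                "nics": [{"net-id": "<shared-network-uuid>"}]
            },
            "runner": {"type": "constant", "times": 1000, "concurrency": 24},
            "context": {
                "users": {"tenants": 50, "users_per_tenant": 10},
                "quotas": {"nova": {"instances": -1, "cores": -1, "ram": -1}}
            }
        }
    ]
}
```

And the driving commands are the usual ones:

```sh
rally task start boot-and-delete.json
rally task list
rally task report --out report.html
```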
So, based on what I hear talking to people, I think this gives you the big picture, but not too much detail; I could go on and talk about the details for two days or so.

The cluster can be seen as a meeting. A cluster has a cluster UUID, and I like this meeting analogy from Alexey Yurchenko of Codership: when somebody leaves the meeting, there is still a meeting going on. That's one cluster node stopped, or failed, or something. Then when he comes back, he joins the meeting again; the meeting is still going on. There are fewer and fewer people, fewer and fewer nodes in the meeting, and when the last node leaves, there is no cluster UUID anymore. When a previously failed or stopped node comes back, it will start a new meeting, and the others will join this new meeting. This new meeting is the bootstrapping process of XtraDB Cluster: if you run service mysql bootstrap-pxc, that's the "start a new meeting" thing. It starts up a new cluster, and the other nodes join it.

Okay, I'd like to talk a bit about how replication works, especially because of the SELECT ... FOR UPDATE issue. We discussed it at the last summit in the database panel discussion, and I have a blog post on percona.com/blog which explains this SELECT ... FOR UPDATE issue in depth: what exactly is causing it, how you can work around it, why it's there, and stuff like that.

In a nutshell, Galera replicates write sets. A write set is practically a row-based MySQL binary log event plus some metadata. The metadata is good for two things: you can compare two write sets and tell whether they are conflicting, and you can tell whether a write set is applicable to the database.

Whenever a node takes a write, the write goes through a process called certification. Certification involves two steps. First, the write set is compared to every other write set in the node's queue; because we can write on any node, the queue can have writes from multiple other nodes. If it doesn't conflict with anything, the node checks whether it's applicable to the database. What do I mean by applicable to the database? For example, if you have an update, a row image which has a before image and an after image, then you have to have the key that it updates. If you don't have that key, then that row-based event is not applicable to MySQL, and replication would fail in this case.

After certification is successful on the node, it sends the write set in parallel to all the other nodes in the cluster.
So this is the replicate part, and when it gets back the acknowledgement from the other nodes that they received the write set, the commit is finalized. The remote nodes will certify the write set similarly, but since GTIDs are assigned based on the established group communication, the certification process is deterministic, so the first node doesn't have to wait for the results of the certification on the remote nodes. It's similar with a full transaction: this was the autocommit case, a single-statement transaction, and with a full, multi-statement transaction the case is similar.

What's really interesting here is where the mechanics just described kick in. They kick in at the commit phase, and this is really important, because that's the reason SELECT ... FOR UPDATE doesn't do what you expect it to do: until you commit, the record-level locks are not replicated across the cluster. They only lock on the local node. And this is why SELECT ... FOR UPDATE kind of works when you only have a single writer node, because then InnoDB's standard pessimistic locking mechanisms kick in.

So much for Galera. If you want to talk about it in more detail, you can come to our booths, or hit me up any time you see me.

Okay, so the workload. We ran the Rally boot-server and boot-and-delete-server scenarios. With each test we first booted 10k VMs, then ran a boot-and-delete with 5k. So we booted 10,000, left those alone, and then ran a boot-and-delete for 5,000. Boot-and-delete is a more realistic workload for production database interactions, because it's a good mix of read and write queries, and it stresses the database with lock-heavy queries. But we wanted to have some data in there already, hence the initial boots. All tenants in the Rally scenario used the same shared network; this is what Jay talked about earlier, and this is the patch which will allow Rally to create its own networks.

Okay, so for analyzing the MySQL load we were using pt-query-digest. This is the description of pt-query-digest from our website, but pt-query-digest is practically an aggregator for MySQL slow query logs. During the benchmarks we set long_query_time to zero, meaning it logs every single query as a slow query, and we turned on Percona's enhancements to the slow query log, so we have a number of additional metrics in the slow query log that pt-query-digest can work with.

This is a slow query log event: if you issue this UPDATE query, this is what you get in the slow query log, for example. The InnoDB stuff here is Percona-specific, like logging the transaction ID, the various InnoDB metrics, and things like that.

And this is an example of how to kind of abuse pt-query-digest. It's really just a slow log aggregator, but it can be used for more than just generating the list of queries that consume the most time. Here we are using a filter, through which we can inject actual Perl code, dumping each slow log event into JSON format. So yes, it's Perl. Sorry.
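Roughly, the invocations behind the reports we walk through below look like this. This is a sketch from memory: Rows_affected, InnoDB_trx_id, and InnoDB_rec_lock_wait are attribute names from Percona's extended slow log, and the JSON filter needs the JSON Perl module, so check the pt-query-digest documentation before copying any of it:

```sh
# Log every query, with Percona Server's extended slow-log metrics
mysql -e "SET GLOBAL long_query_time = 0; SET GLOBAL log_slow_verbosity = 'full'"

# The regular report: top queries by total time
pt-query-digest mysql-slow.log > digest.txt

# Large transactions: group events by InnoDB transaction ID
pt-query-digest --group-by InnoDB_trx_id --order-by Rows_affected:sum mysql-slow.log

# Writes only, ordered by rows affected
pt-query-digest --filter '$event->{arg} =~ m/^(insert|update|delete|replace)/i' \
    --order-by Rows_affected:sum mysql-slow.log

# Queries spending the most time waiting on row locks (SELECT ... FOR UPDATE suspects)
pt-query-digest --order-by InnoDB_rec_lock_wait:sum mysql-slow.log

# The "abuse": inject Perl through --filter to dump each event as JSON
pt-query-digest --no-report \
    --filter 'require JSON; print JSON::encode_json($event), "\n"; 0' mysql-slow.log
```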
Yeah, it's Perl. It's a Percona talk, it's all Perl.

Okay. From the workload we generated the regular pt-query-digest report, which is just the top queries. But because some of you are not using Galera, and people have been running into issues with Galera, I'd like to share how you can analyze whether your workload is, let's say, Galera-compliant.

So, apart from the top queries, we generated a digest of the transactions with the most rows affected. Why is this important? Because these are the large transactions: if many rows are affected, the transaction is large. If you think about the mechanics we talked about, when a large transaction hits replication, it gets certified and it gets into the queue of each node. Galera has parallel replication, because thanks to certification we know the write sets in the queue are not conflicting. But what happens if I have a long-running transaction in the queue among many otherwise short transactions? Say I have a queue of ten transactions, nine of them taking 50 milliseconds and one of them taking two seconds: my replication will be stalled for two seconds. So long-running transactions can cause replication stalls. In the regular MySQL master-slave world, these long-running transactions, these long-running writes, typically manifest as slave lag.

Just to make it clear: all of these pt-query-digest commands are applicable to both Galera and standard MySQL master-slave replication. It's just that interpreting the results of these reports is slightly different depending on whether you're using Galera Cluster or standard master-slave replication. The commands work identically against either.

Okay, another way of getting at the large transactions is to examine the transactions with the highest statement count; we'll show an example of that. We also get the largest statements by rows affected, the queries that write the most, by filtering out everything that is not a write.

And for the SELECT ... FOR UPDATE issue, for determining, before going to Galera or before going to multiple writer nodes, whether it will bite you, we group by the time spent in record lock waits. Record lock waits are what can potentially turn SELECT ... FOR UPDATE into the deadlock issue with Galera. In the single-node case, if two transactions update the same record, the second update will wait on the first transaction's record lock, while in Galera's case, if this happens on two nodes, one of them will win and the other will get a certification failure. So with this command you can determine whether, or how much, SELECT ... FOR UPDATE on multiple writer nodes will affect you.

Just one second before we go on, about what Peter just said. If you have multiple writers going to multiple Galera nodes, and two separate writers land on two separate nodes and update the same record, for instance in the compute_nodes table, or a usage record, reservations, quotas, that kind of thing where you have a highly contentious row: it's not like the nodes are then going to have different representations of the data, which would be data corruption. It's just that the first one will succeed, and the other will get a certification error. And what that manifests as is an InnoDB deadlock error, which is why we say it's a deadlock.
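Here's what that conflict looks like in SQL, with an invented table and values, so you can see what each side sees (session A is on node1, session B on node2):

```sql
-- Session A, on node1
BEGIN;
SELECT in_use FROM quota_usages WHERE project_id = 'demo' FOR UPDATE;
-- The row lock above exists on node1 only; it is not replicated.

-- Session B, on node2 (not blocked, because the lock never reached node2)
BEGIN;
SELECT in_use FROM quota_usages WHERE project_id = 'demo' FOR UPDATE;
UPDATE quota_usages SET in_use = in_use + 1 WHERE project_id = 'demo';
COMMIT;  -- certifies first, wins

-- Session A again, on node1
UPDATE quota_usages SET in_use = in_use + 1 WHERE project_id = 'demo';
COMMIT;  -- loses certification: ERROR 1213 (40001) Deadlock found
```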
It's not really a deadlock, it's a certification failure, but the error code that gets popped up through the various SQLAlchemy layers within the OpenStack services is actually an InnoDB deadlock, which is why a lot of people confuse the two things.

Okay, so this is the HTML output of Rally, and this is booting 10,000 servers. The performance was really consistent here; this is running against one database node. If we run against multiple nodes, the average response time increases a bit, and we see some errors kicking in, and those errors are the timeouts, the certification timeouts.

With the boot-and-delete scenario (I'll try to go fast because we're almost out of time), writing to several nodes actually significantly improves performance, despite the fact that we hit the deadlocks. If the application gets an error from MySQL, because the transaction got rolled back for whatever reason, it should know how to handle that and it should know how to retry. And some of the code in some of the services does know how to retry successfully. There is a decorator within Nova called retry_on_deadlock which does exactly that: it retries when it sees an InnoDB deadlock. Sometimes that works just fine, sometimes it doesn't work so fine, but the errors you see in the success column here are all related to that InnoDB deadlock, which is really the certification timeout manifested as an InnoDB deadlock.

The important thing to point out here, though, is that we've got 14 percent or so of the 5,000 total requests to boot and delete a server resulting in this. Do you see this if you run Tempest against a production cloud? Yes, with Tempest you will hit the InnoDB deadlock issue, because it runs sequentially and does boot, delete, reboot, terminate, stop, pause, all in rapid succession, all within the same tenant. Do you see it a lot in production? No. I mean, you can forcefully produce the deadlock issues, but here what we're doing is just database stuff. There's really nothing else going on; we're just hammering the database, so we're able to reproduce this issue successfully and consistently. But in real production environments, with real workloads, where you're doing block copies for images and snapshots, and the virt layer itself is actually spinning up VMs, it's really not common. At AT&T, I can count on one hand the number of times it happened in real production workloads. So just make a note of that.

So yeah, at the end, thanks. At the end we were using the c3.8xlarge instances with 32 cores. The controller node's cores were completely saturated, and the database node was using like one and a half cores. That's how far we could load it. In order to really load the database we would need tens of controller nodes, which we plan on doing in some more benchmarks, and we'll publish those on our blogs. But the key here is that even in a manufactured benchmark, where we're trying to stress the database, we weren't bottlenecked on the database at all. We were doing, I don't know, 2,600 transactions a second on the MySQL node, which is nothing. I mean, like we said, on a 32-core box it was using like a core and a half.
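To show the shape of that retry pattern, here's a minimal sketch. This is not Nova's actual code (the real retry_on_deadlock lives in Nova's DB API layer and matches SQLAlchemy-wrapped deadlock exceptions); it's just the idea, with a stand-in exception type:

```python
import functools
import time

class DBDeadlock(Exception):
    """Stand-in for the deadlock error the OpenStack DB layer raises."""

def retry_on_deadlock(func, max_retries=5, delay=0.05):
    """Re-run a DB API call when an InnoDB deadlock (which, on Galera,
    may really be a certification failure) bubbles up."""
    @functools.wraps(func)
    def wrapped(*args, **kwargs):
        attempt = 0
        while True:
            try:
                return func(*args, **kwargs)
            except DBDeadlock:
                attempt += 1
                if attempt > max_retries:
                    raise
                time.sleep(delay)  # brief pause, then retry the whole transaction
    return wrapped

@retry_on_deadlock
def reserve_quota(project_id, delta):
    ...  # one complete DB transaction: the unit of work that gets retried
```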
That's it. But the Python controller services, the API services, conductor, all those kinds of things, were maxing out their CPU. So we were hitting CPU issues on the controllers way before the database, and issues with the message queue as well, though that may have been because of the sort of awful networking in the AWS VPC. The point being: the database was not the bottleneck.

And now some digests. Let's begin with the regular digest output. This is the pt-query-digest output for the first case, where we write to only one database node, creating 10k virtual machines. The fields you saw in the slow log you see aggregated here. For example, the total wall-clock time spent in the database is almost half an hour. You see an aggregation for minimum, maximum, average, 95th percentile, standard deviation, and median for each slow log metric. All the other things you saw in the slow log are aggregated like this too: how many InnoDB pages did we read, how many bytes, how many InnoDB operations, and things like that.

After that you find a profile. The profile means that, for example, the first query, with this fingerprint, which is COMMIT, takes up most of the time spent in the database, and the second query, with this fingerprint, which belongs to Neutron, takes nine percent of it. So of all the queries run during the Rally benchmark, not counting the commits, nine percent went to getting port bindings in Neutron. Of all the queries: nine percent. And note that's the boot-server scenario, not boot-and-delete: just booting 10,000 servers.

Okay, so let's examine it. This is the query we are talking about, the Neutron port binding one. Delicious, isn't it? And what's the issue with it? This seems to be an easy fix, because we can see it's unindexed, and you can check that it's really unindexed by comparing the rows examined and rows sent here. It examines ten and a half thousand rows at the maximum; on average, let's use the average, it examines five thousand rows to return 0.75 rows. So yeah, there's no index on the fields that are used in the WHERE or GROUP BY condition in that particular monstrous query, and that's the reason there's a full scan of the table.

Full scans tend to, you know, blow up over time. So let's see what happens in the next test, where we boot and delete after this. With boot-and-delete, the same query takes almost half, 50 percent, of the total time this time. So yeah, it's proportional to the amount of data we have in there. This will kind of blow up as your database gets larger and larger. Now, I'm not sure who has a database with 15,000 VMs running... but the table also includes deleted instances, right?
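To illustrate that rows-examined-versus-rows-sent check with a made-up table (the real Neutron query and schema are messier than this; the table and index names here are invented):

```sql
-- Illustrative only: a WHERE column with no index forces a full scan
EXPLAIN SELECT * FROM port_bindings WHERE host = 'compute-07'\G
--   type: ALL      <- full table scan
--   rows: ~10500   <- rows examined, to return less than one row on average

-- The shape of the fix: index the columns used in the WHERE / GROUP BY
ALTER TABLE port_bindings ADD INDEX ix_port_bindings_host (host);
```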
I mean, for instance, the upstream CI system at any given minute is running 500 to 900 nodes that are running OpenStack tests. So yeah, there are a lot of deleted instances out there.

Also note that the second query in the boot-and-delete profile is the select on instance metadata. That's a table in Nova, it's called instance_system_metadata. Whenever you boot an instance, we grab a bunch of information from the flavor, the instance type, like the number of CPUs and the amount of memory used, and we archive that information in the instance_system_metadata table. So for every instance that you start up, there are between 10 and maybe 45 records on average in that instance_system_metadata table. So it's a fast-growing table, and unfortunately the column types in that table are VARCHAR(255) and TEXT or something, so not exactly the most efficient storage. It blows up significantly very quickly, and as you can see, 13 percent or so of all the queries executed are against that table.

For this one you see a full join as well, but that's kind of misleading: it's a full join because the query has a table subquery. This is the whole query; again, juicy. I'm highlighting the part from SELECT, and this inner select returns only one record. So this query's result will generate a temporary table, a one-row temporary table, and yes, it will be a full join against it, but it's a table with a single row. And how can you tell? Again, you can examine the rows examined and the rows sent. Let's use the maximum here, though the average is pretty close; the maximum just has rounder numbers. It examines 15 rows to return 16. That's not that bad. It means you can't help this much with an index. It's what Jay talked about: it's the structure, the schema, that we would have to help it with.

Okay, so let's examine something else. We're getting close to time as well... okay, I'm probably just rambling too much. The other thing we want to point out is the rollbacks. So let's go on to the record lock waits. Here we group by time spent waiting for InnoDB record locks. If you check here: record lock waits. We spent a total of 45 seconds during the test waiting on record locks, but the maximum is like 300 milliseconds, right? And it's this query, the one with the FOR UPDATE; this is the report grouped by the record lock waits. Let's check the second query: oh, look what's there. Let's check the third one: okay, this is an INSERT, probably waiting on a SELECT ... FOR UPDATE. Well, it's an insert into the reservations table, which is highly contentious, because there are updates, inserts, and deletes going on at the same time. So most of the record lock waits here are coming from these SELECT ... FOR UPDATEs. This way, with this report, you can determine how much those conflicting writes will affect you when moving to Galera.

And the final one: let's pull up the digest of transactions with the most statements. Here we are grouping by InnoDB transaction ID, so this is not a query, this is an InnoDB transaction. And let's check: what's the top thing here?
A transaction with a thousand queries! You can grep the slow log like this to get all the events belonging to a transaction, and if you grep out the COMMITs and SETs, you will see the actual SQL of the given transaction. And you can see... that I forgot to reset the slow log after deploying the cluster. That's the issue here. The next transaction, the one with the most rows affected, affects only 30 rows, so we don't have particularly large transactions, only at the time of initialization.

I think we need to wrap this up, because there's another session coming up. So thank you very much, we appreciate you being here, and please come and talk to either of us. We can talk about this all day long.