Hi, I'm Stewart. I work for IBM, and have done for about a year; before that I was at Percona. My current job title is OPAL Architect — or kernel developer, or something like that. I work in firmware. So if you want to know more about OpenPOWER firmware, go and see Jeremy's talk — that's in the future. If you want to know more about PowerKVM, go and invent a time machine and watch Paul's talk, or you can watch the video, which is also great. I also do a bit of work on MySQL on POWER, due to having accidentally fallen into being a MySQL internals expert for about ten years. I've worked on MySQL Cluster, MySQL itself, Drizzle, and various Percona Server and XtraBackup things. So I've been inside this database world for a while and have somehow escaped into firmware; we'll see how that goes. I'm going to talk about getting towards one million SQL queries per second. Oddly enough, this is going to be about MySQL-related things. I'd love to see a Postgres benchmark that approaches this — that's a subtle hint towards the Postgres people in the room. For the record, Josh is giggling. I'm not the Postgres expert, though, so I'm going to look at MySQL. MySQL is still the most widely used open source database; rumours of its death are greatly exaggerated. It is still used in a lot of places. Even people with forks of MySQL and an agenda behind them have something that is more similar to MySQL than to anything else — it's still fundamentally the same architecture. The fashion of today seems to be NoSQL stores. You can usually get some really high key-lookup numbers out of those, maybe because you have less overhead from parsing things and the like. But SQL is a nice general-purpose query language. And in case you were wondering, NoSQL is not new — what is a file system if not a hierarchical key-value store with a non-SQL interface? And MySQL also has some NoSQL features.
Recently we've had things like a memcached interface to it, and people are always doing something web-related to have queries come through HTTP. But I'm not going to talk about the NoSQL ways in, because SQL is obviously cool, and doing general-purpose queries fast is an interesting thing indeed, instead of sort of going in through the side of the interface. There's this thing called MySQL Cluster, which I used to work on — a shared-nothing distributed database. It's a different storage engine from the standard InnoDB; you require a number of machines to run it, and it's in-memory. Back in 2001, on a 72-CPU SPARC system, what would become MySQL Cluster could get about a million reads per second. This was using a C++ API to do batched asynchronous operations to many nodes running on that single 72-CPU system. The 72-CPU system was, of course, a loaner, because not many people can afford to buy one just for a benchmark. But it did sort of foreshadow a few things to come. Hardware trends over the years have been towards wider — more CPUs — rather than faster cores. So just like people, hardware seems to be getting wider rather than faster. And this means, of course, that we have to parallelize to get higher performance; we can't just do the same things even quicker. So first, let's have a look at some history of MySQL and concurrency. Obviously, if we're going to get towards a million queries per second, we're going to have to run on some multi-core box. An easy way would be to get a database server that does one query per second, buy a million boxes, and add all the sums together — but one computer is way more manageable, it's kind of cool, and besides, benchmarks are fun. MySQL started around the 1990s. There is prehistory to that; I gave a talk last year on the weird history of MySQL, so you can totally geek out on weird MySQL history there.
But back in the 1990s when MySQL first came around, remember it was also the days of Windows NT, which, if you recall, came with a one-to-two-CPU license. These days that wouldn't even work on my phone, because it has more than two cores. So we've grown a lot since MySQL started. At some point Linux got SMP support with the big kernel lock, and things sort of got better from there. And for a while — anyone remember the hardware? — a dual Celeron 400 was about the cheapest possible way to get an SMP box. So multiple CPUs were hard to get, sort of expensive, and nothing really worked on them too well. And MySQL was a really small company in those days, so actually acquiring the hardware was hard. But there was this odd kind of accident: MySQL on threads has sort of been there since day one. We have a one-thread-per-connection model: each database connection spawns a thread, or grabs one from the cache, because caching is brilliant and more config options make things better for sysadmins, no doubt. But we have one thread per connection, and this is kind of a lucky accident. A bunch of the code in MySQL came from a preexisting project, a preexisting engine, and that was very single-thread oriented. So you just sort of had multiple threads doing the easy, independent things, and synchronized down at table access. Threads were an easy way of doing this concurrently, rather than event loops — or that's what the textbooks said was cool. Because remember, this was the 1990s: threads were cool then, right? Maybe they're still cool now, but threads were really cool in the 90s. So there was this idea that threads were multi-core magic: you've got one thread per connection, so you just get like 100 CPUs and everything magically works faster. That's not quite the case. It's a little bit wrong.
Threads are not multi-core magic. But it turns out that this model was actually fairly good for MySQL in the late 90s and early 2000s. You didn't have many CPUs, and the people deploying MySQL were doing so on cheap commodity x86 boxes, which maybe had one or two. So you really didn't have to scale beyond one or two CPUs anyway. It was okay, it worked well enough, and after all, a billion-dollar business was built on this, so you can't really sneeze at it. But we had a problem here: we had a bunch of big locks. What's the great thing about big locks that lock large sections of code and data structures? They're much easier than many small ones. Many small ones is hard — locking is really hard, and it's really hard to do threaded programming. So if you just have one big lock, or very few big locks, it's easier to program. MyISAM, the default storage engine of the time, had table-level locks, so you could only have one concurrent writer to a table. Quite nicely, it was a read-write lock, so you could have many concurrent readers. And it turns out that the ratio of reads to writes is generally the right way around, so you got a lot of concurrency from being able to do concurrent reads. If you used transactional tables at the time, InnoDB had what they called the InnoDB kernel mutex, and it had a bunch more besides. We also had the key cache lock for MyISAM if you were doing key operations, and there were a hundred other different giant big locks in there: the query cache lock, the authentication lock — everything had a lock. In fact, everyone gets a lock! Each one a big lock over a big data structure and blocks of code. We once counted, and it was at least a dozen locks you'd take just to do a SELECT 1 query that did nothing. So there were lots of big locks, and this worked really well on one to two CPUs, because you didn't have to wait too long and things worked well enough.
It worked well, it scaled well, and when you couldn't buy more than two CPUs for less than the price of a house, it worked brilliantly. And other features were a priority. You could go and scale to something with 16 CPUs, but that would cost more money than MySQL the business had, and no one was going to buy MySQL at the time to run on a 16-CPU box. So other features were the priority: things like subqueries, making the optimizer work better, making replication easier to use and slightly more reliable. But roughly around the mid-2000s, this SMP thing seemed to catch on a bit more in commodity hardware, and four- and eight-core servers became a bit more common. In fact, it became hard to buy things in server form factors with fewer than multiple cores. And MySQL's scaling on these systems didn't look great — so much so that when I was trawling through the internet attempting to find a graph to steal, it turned out they were really quite rare, because the graph looked a bit like this, except it's missing the down arrow. You'd sort of peak at number-of-threads-equals-number-of-cores when the number of cores was about two, and then performance plummeted right down to zero. Well, not zero, but close enough to it. This was the time when eight cores was not necessarily better than four. In fact, you could add cores and get the same performance as if you'd reduced the number of cores. So people often disabled cores on their systems to make their database faster — giant locks bouncing around just didn't help. There were other things going on at the time that didn't help matters either: MySQL had a big quality problem. Not only did it not scale to many CPUs, the new versions didn't actually work on one, let alone two. Being arguably the fastest database engine to crash is probably not the best claim to fame, right? So MySQL itself lagged behind the hardware people were buying.
And there was this odd thing where MySQL kind of killed its contributor community by hiring everyone. If you submitted a patch, you got a job — which sounds great, except then you realize everyone's in one company, and interacting with the developer community gets weird because you no longer have one outside the company. But you had companies like Google and Percona driving SMP performance with a bunch of patches that got into various different builds of MySQL, to make things less disastrous on SMP. And — in a few graphs stolen from Google's wiki page — we actually got to the point where things were not plummeting when you added threads or cores; they were kind of leveling off and then going down, which is a bit better than plummeting. So adding cores became something that could improve performance, or at least not dramatically hurt it. At the time, we counted that a SELECT 1 — which basically returns the number one in a single column — took about 12 mutexes. If you have a number of CPUs running that very quickly, you can pretty much imagine what happens: you spend a lot of time waiting for mutexes and not much executing queries, which is disappointing. And these were big global mutexes. In December of 2010, MySQL 5.5 went GA — stable — and it was much better than any previous release, and it actually improved some scalability along the way. At this point we had what they referred to as an adaptive mutex inside InnoDB, which, instead of just calling pthread_mutex_lock, spun for a while attempting to get the mutex and then went to sleep. It also whacked the x86 pause instruction in there to attempt to be better with hyperthreading. And this did see some improvements.
So doing adaptive mutexes in user space did actually see some improvement. This was largely papering over the problem — too many large, heavy locks being contended by too many threads at once — rather than fixing the core problem, but it was an improvement. Then in MySQL 5.6, in early 2013, InnoDB finally split its big kernel mutex; so InnoDB's equivalent of the big kernel lock got split in 2013. In the space of a few years, we went from a GA MySQL version only handling about four to eight cores to one handling about 32: from "more than four or eight cores is probably a bad idea" to "32 is something you should always see improvement on, and after that it's still kind of okay". Here's a comparison of relative performance between MySQL 5.5 and 5.6 across numbers of cores. There were a lot of improvements in 5.6, so you actually saw gains when you added cores, rather than slowly tapering off. With 5.5, at 32 cores — or 64 when you turn on hyperthreading — you got the same performance as if you had eight, for certain benchmarks, which didn't exactly sell big machines well; with MySQL 5.6 you actually got improvements, which is brilliant. The same can be said for more sophisticated benchmarks, and for when you run many concurrent threads. Instead of simply one connection per core — say you had 1024 connections, so 1024 threads running and accessing the database — you actually got some improvement when you added CPUs. It's not ideal, but it's better. With 5.6 we got to the point where you can add CPUs and it at least doesn't hurt; things pretty much always get better, which is a big improvement over the earlier days. So, Sun bought MySQL in 2008, and as MySQL engineers we suddenly got to care about SPARC.
If you remember that era on SPARC, it was all about CoolThreads and the T1000 and T2000 machines. What was the great property of them? They had threads. How many? Lots. And this was back in a time when MySQL 5.1 wasn't even released, so all that magical scaling was years off. When SPARC came around and we had the Niagara chips — the T1 with eight cores and four threads per core, or the T2, which was eight cores with eight threads per core — I found a really odd mapping between MySQL releases and end-of-life dates for SPARC machines. It turns out that by the time a stable MySQL version was released that could actually possibly scale to the number of hardware threads in a given SPARC CPU, those systems were end-of-life. But luckily Intel chips had fewer cores, and therefore it wasn't as disastrous as it could have been. I have no idea of the current state of SPARC, but this was a humorous thing to watch from the inside: people urgently trying to get high performance numbers on SPARC when it was just, you're never going to get that, which is disappointing. But PowerPC was also a thing. Apple at some point stopped selling PowerPC machines, so there was a bit of a lull for a while, and MySQL wasn't really maintained on POWER. The disappointing thing is that a bunch of these improvements — things like the adaptive mutex and the like — were really only written to work on x86, with SPARC as an afterthought. So it kind of worked on SPARC, and it was definitely tested on x86. Anything else, like POWER or ARM or MIPS, either worked by accident, or was too slow to really hit a bunch of the concurrency problems, or people didn't hammer it hard enough to hit them. So it was kind of an odd situation for architectures other than x86 over the years. The concurrency improvements continued, with people testing on x86 and, to a much lesser extent, SPARC; other platforms kind of went away.
And from the hint about my employer at the start, you can possibly guess what happened next after I joined IBM and was working on POWER8 and OpenPOWER stuff: someone discovered I knew something about MySQL, and they had this fast processor, and they wanted to make the two work together. So: MySQL on POWER. It turns out that, yes, people — like possible customers — started asking questions. And it turns out it's an odd idea in business, when you sell things, to do the things that customers want and give them nice performance. It also turns out that the last stable version people were really playing with was MySQL 5.0, which, as I may have mentioned, scaled to a few cores — and POWER8 has a lot more than a few cores. We have SMT8, so eight threads per core, and we have four, six, eight or twelve cores per socket, and we have dual-socket machines. Add that up and you can easily get 192 hardware threads per box, which obviously does not work well on old MySQL at all, but this modern stuff we keep hearing about might actually put up some numbers. So, let's get it working on POWER8 again. Step one: it compiles. So obviously ship it, it works perfectly, right? MySQL — the latest 5.6, and 5.7, the development stuff — compiled, so ship it. It turns out it wasn't hard to compile at all; it did actually compile out of the box. Step two, however — have it actually work — was a little bit harder. And this was because of a critical thing you need to do: you need to make mutexes work. It turns out these are kind of critical when you're doing multi-threaded programming, and MySQL had its own adaptive mutex things that kind of improved matters. But the make-it-faster trick of spin-and-check, spin-and-check, spin-and-check was missing one crucial thing on POWER, and that is memory barriers. POWER requires you to do memory barriers properly.
It's not like x86, where everything is just magically consistent across processors — which works when you have smaller systems, but less so on larger ones. The custom mutex code in MySQL was missing this, and occasionally you'd see some hint of "do magic here" in a comment. Or you had the wonderful thing where it would spin on a stale cache line instead of actually checking the value, or the bits where it could actually race with itself and think it had acquired a lock when it hadn't — later on hitting an assert and leading to corrupted data on disk. But only if you ran many things at once. So it turns out the MySQL test suite would pass with mutexes that were distinctly non-functional. And it turns out you could run applications against it, as long as you didn't have more than one or two concurrent connections doing work, and you were probably okay. But then you threw a whole bunch of stuff at it and it exploded horrifically. So if you looked at the history of MySQL on POWER, it appeared to be: at some point, break mutexes, then have the test suite still pass — and then, I assume, profit at some point. It was a bit odd if you were just looking at it from the POWER side. But obviously, corrupting everything and falling in a heap is the worst failure mode ever. So I kind of had to work to fix this and get it all upstream. So: how to quickly test mutexes? Because of course, unit tests are for, I don't know, useful testing or something — so there were none of those for the custom mutex code. Okay, how do I quickly hammer this to see if I fixed it? And I thought, well, obviously I could just run a benchmark, because that could hammer it quickly, and I know where all the mutex contention is. I'll just construct a benchmark with highly contended mutexes, and if it falls in a heap, then it's probably incorrect. And Stewart's wisdom on benchmarks is that all benchmarks are probably a result of bovine digestion. Yep, tick.
So there's this thing called sysbench, which is a highly flawed benchmark for OLTP and general SQL databases. But it's really well known, it's very easy to run, and it doesn't involve Java, because I like sanity. It usually uses prepared statements — prepare once, then execute — which is great, and isn't considered cheating to get SQL performance, because you're actually running SQL queries. At least, it's not cheating by my rules, and the universe runs according to my rules. It has options to change what it runs, so you can run different types of queries. Standard sysbench is pretty good, but it doesn't always hit all the scalability problems. There's this mode called point select, which is basically SELECT value FROM table WHERE primary_key = constant. So it's just doing key lookups, but with SQL. And it turns out this is a really simple and fast query to execute. So if I executed a lot of these in a very concurrent manner, I'd be able to test whether my mutex implementation was remotely on the right track, and I could at least bring it up in perf and see what's going on. And of course, read-only benchmarks are brilliant, because they don't actually test your disk subsystem — they just test how fast you execute in memory. It turns out that at various stages of POWER8 development, we had many systems without much disk. At some point, I believe I was on a system that had about an order of magnitude more memory than disk, so you'd just compile everything in a RAM disk, because that was the only place you could fit the source trees. So this was great during development for getting valid numbers to check progress: I was doing read-only, so it didn't hit disk, and I was testing the actual speed of the machine — and of my mutex implementation — and not just the speed of my disk.
And of course, when you do this kind of query really concurrently — and I said before, there are a dozen mutexes just to execute a query that does nothing — you end up hitting a lot of these mutexes very, very often and very, very quickly, so you really do test whether things work together. This is the odd benchmark that's used in the MySQL world, because it's basically asking: what is the maximum number of SQL queries we could possibly execute per second? What's the theoretical maximum we could get, assuming all queries are simple? At some point I thought, well, this POWER8 thing is meant to be fast — so what was the previous record? Could I actually get a record out of this? Not only get working mutexes and MySQL working on POWER8, but actually get a world-record benchmark number at the same time? I was feeling greedy or something. So what was the record? On MySQL 5.6, it was 275,000 queries per second. This was in 2013, on a 16-core system with hyperthreading, and it was state of the art for MySQL 5.6, which is still the current stable release. So, 275,000 key lookups per second from SQL — which sounds like a lot or a little depending on what you're used to, but this is SQL, this is real stuff. This is pretty fast. So I had a target: could I beat 275,000 queries per second? Having the fastest benchmark number in the world for anything is a pretty good way to show off, and I don't mind doing that occasionally. And it was going to get people interested in MySQL on POWER again, so it scratched two itches. Brilliant. The first thing I had to do was fix the InnoDB mutex, of course. So I did that. My first experiment was a compile-time option to just use pthread mutexes, which actually did work — so I knew it was the custom mutex code that was the problem. And of course it had some performance issues due to the excessive locking.
And the great thing about this was that I had to add memory barrier magic around this lock word in the InnoDB mutex implementation. The first problem was that there were no barriers around setting it, so it wasn't really a mutex — which led, of course, to the aforementioned corruption. There was also a loop that would spin and check: is it free, can I acquire it? It turns out that without a barrier there, all you do is sit on your local CPU core checking a stale cache line, going "can I acquire the mutex? can I acquire the mutex?" — while none of the other CPUs hold it anymore. So having a barrier in there actually reduces spinning that does no work. So, patch one: add barriers. It turns out the crashing magically stopped. My initial patch was sort of incomplete and very hackish, but luckily one of the guys at Oracle had also made a patch that was nearly there. So I added a few things, modified mine, and we ended up with something that worked fairly well. With my improvements, I managed to go from something that merely worked to getting 25% more single-threaded performance and 50% more multi-threaded. So my naive, tiny, just-get-it-to-work patch was pretty good, but there was still further to go. At this point — after just the initial work of getting mutexes working — I had 264,000 queries per second, which, you may notice, is a little bit off the world record, but close. And the thing that told me I was close was that there was lots of idle CPU time: a lot of the machine was spending its time doing nothing at all. I also noticed that SMT in this case didn't actually help much — it added about 10% performance. And I thought, well, we've got all these hardware threads; really, SMT2, 4 or 8 should be better than SMT1, one thread per core. So I had a way to go, but I was close to the record, and I was excited.
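The handoff that lock word was supposed to provide can be sketched with C++11 atomics. This is an illustration of the acquire/release pattern, not InnoDB's actual code:

```cpp
#include <atomic>
#include <thread>

// On x86, plain stores happen to become visible in order, so this
// pattern can "work" by accident. On POWER, without the release store
// and acquire load, the consumer can observe ready == true while still
// reading a stale payload -- the class of bug the InnoDB lock word had.
int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 42;                                  // write the protected data
    ready.store(true, std::memory_order_release);  // publish: orders the store above
}

int consumer() {
    // acquire: orders the payload read below after the flag load, so it
    // can't be satisfied from a stale cache line
    while (!ready.load(std::memory_order_acquire)) { }
    return payload;
}
```

With relaxed ordering (or a plain `volatile` flag) the same code is a data race; the barrier placement is the whole fix.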
There's also this other big lock inside MySQL 5.6, around creating read views. Because you have MVCC inside the database, you need to create a consistent read view when you start a transaction, and it turns out there's a mutex there. And it turns out you can cheat a bit if you're creating a read-only transaction — for example, when you don't run an explicit BEGIN or COMMIT for the statement and you're just doing a simple query. In MySQL 5.7 they fixed this, but in 5.6 we actually spent a lot of time merely creating the read view in order to do a single-key select. This was a problem I could see, and I thought, okay, that's probably going to be my bottleneck, because it was fairly high in the mutex waits. And it's a well-known bottleneck on x86 as well, so at least it was nothing new or POWER-specific. Patch two: I knew where to look in the source code for data structures that relied on cache line size. It turns out a really simple patch helped a little bit — the cache line size on POWER is different. That was easy. Then I went: okay, this SMT thing should help, not hurt. The pause instruction on x86 says "I'm in a spin loop", a hint to yield to another hardware thread. On POWER we have something called thread priorities: you can lower or raise the priority of your currently executing thread relative to the others running on the core. So I put that in, and it did actually help. Spin locks and SMT didn't love each other, because it turns out that when you're spinning, it looks an awful lot like work, and so the core executes it. Basically, with SMT on, the thread that was spinning waiting for the mutex was getting all the CPU, and the other threads were not. So SMT didn't actually help — but once I put in the priority stuff, it helped a bunch. I found this using perf, of course, because what else do you use for analyzing anything? Perf.
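The spin-wait hint can be sketched along the lines of the Linux kernel's cpu_relax(): pause on x86, and on 64-bit POWER a priority drop-and-restore around the wait. The helper names here are illustrative:

```cpp
#include <atomic>
#include <thread>

// Spin-wait hint: on x86, `pause` tells the core we're busy-waiting; on
// POWER we briefly lower the hardware thread's SMT priority so a
// spinning thread stops hogging issue slots from its SMT siblings.
// Elsewhere, fall back to a plain compiler barrier.
static inline void cpu_relax() {
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();
#elif defined(__powerpc64__)
    __asm__ volatile("or 1,1,1");  // HMT_low: lower SMT thread priority
    __asm__ volatile("or 2,2,2");  // HMT_medium: restore normal priority
#else
    __asm__ volatile("" ::: "memory");
#endif
}

// A spin-wait that uses the hint instead of hammering the pipeline.
void spin_until(const std::atomic<bool>& flag) {
    while (!flag.load(std::memory_order_acquire))
        cpu_relax();
}
```

On POWER the special forms of `or` (e.g. `or 1,1,1`) are architecturally no-ops that encode priority hints, which is why this is safe to execute unconditionally in the loop.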
Does anyone have an alternative? Nope, good — that covers the how-I-found-things stage. So, patch three: thread priority makes SMT help rather than hinder, which is great. And I got up to 289,000 queries per second at SMT4 — and that was a world record for MySQL 5.6. So I could claim I had the fastest MySQL 5.6 in the world. I like the clapping. Also, because I'm not done yet. I also noticed that POWER8 is a NUMA machine. Much has been written on the topic of NUMA and MySQL, all of which is depressing. It doesn't help that, for many years, I don't believe any MySQL engineer really got any time on NUMA machines, so it was hard to optimize for. POWER8 has two sockets but four NUMA nodes when you get a dual-socket machine. That sounds odd; it's due to the structure of the silicon — there are dual-chip modules, essentially two chips per socket, so you have four NUMA nodes on a dual-socket POWER8. And of course, if I'm going to do benchmarks to get big numbers, what am I going to do? I'm going to find the bigger machine, not the smaller one, because the bigger one should have bigger numbers, and they're more impressive, right? So, node distances. This basically means that when you're on CPU zero and you access memory attached to node three, it is way more expensive than accessing your local memory: you have lower latency to memory attached to your local chip than to remote memory. And in MySQL you have a number of fun things. For example, you have a global cache of database pages, as well as thread-local data — so there's some stuff that's going to be shared by every connection, and some stuff that's going to be connection-local. I have a patch that does memory allocation per node in a kind of hackish way, and it does help a little bit as well. But we also had this issue: I had machines with nodes without memory.
Because it turns out you don't always populate all of your development machines with memory on every single socket — there were some machines with only a couple of DIMMs here and there. And it turns out this hurt a lot in my benchmarking. In fact, I could see it right there in perf: all this time being spent on load instructions. And I thought, well, that's really annoying, because we shouldn't be spending a lot of time there. I found that when I ran the benchmark on fewer CPUs, I got the same performance. Why? Because then all the memory accesses were node-local, rather than going across to another NUMA node. So nodes without memory hurt in this use case. Nodes without memory may be fine in other workloads, but obviously I care about this specific database benchmark, because I want a world record, and therefore all other use cases are invalid. Solution: bind to a single NUMA node. That was the other idea. We were seeing a lot of mutex contention — what do I do? Well, if we're only using some of the CPUs anyway, why not bind to a single NUMA node? So I bound the MySQL process to one NUMA node and bound the benchmark to another. This is a brilliant, brilliant hack — at least for some workloads. My MySQL NUMA patch is currently in discussion: basically it interleaves the buffer pool across all of the nodes and then does node-local allocation for each connection, which seems to work fairly well. Even on x86 this seems to get a few percent performance improvement, which is kind of nice — when I improve performance for x86 as well, that's a win. At this point, I had 324,988 queries per second, simply by using fewer CPUs. And then I also modified my MySQL server configuration, because it turns out I was silly and had missed a few options that help with multi-threaded performance. So this number includes not only the NUMA binding but several other config changes.
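In practice you'd do the binding from outside with `numactl`, but a hypothetical in-process equivalent — pinning to the first few CPUs, on the assumption (machine-specific, and one a real program should read from /sys/devices/system/node rather than hard-code) that those CPUs belong to node zero — might look like this on Linux:

```cpp
#include <sched.h>
#include <cassert>

// Pin the calling process to CPUs [0, ncpus), roughly what
// `numactl --cpunodebind=0` achieves when node 0 owns CPUs 0..n-1.
// Linux-specific; "first ncpus == node zero" is an assumption here.
bool bind_to_first_cpus(int ncpus) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c = 0; c < ncpus; ++c)
        CPU_SET(c, &set);
    // pid 0 means "this process"
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

Note this only binds CPUs; binding memory as well (`numactl --membind`) needs libnuma's allocation calls, which is why numactl is the convenient tool for the whole job.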
So this was CPU- and memory-bound to node zero — a single POWER8 node, six cores, eight threads per core — and node one had sysbench running on it. The mutex contention meant that if I added cores, I didn't get any more performance. So really, six POWER8 cores gets you the world record for MySQL 5.6 point-select queries per second, which is great. There was also some mutex contention inside the memory allocator. MySQL has all these fancy tricks to try to allocate memory less often than you otherwise would, because historically malloc and free have taken a lot of CPU time and not been very concurrent. So there are a bunch of those tricks. There's also jemalloc, an LD_PRELOAD library that replaces malloc with something better for multicore. That magic got me slightly higher, so again I had another world record: 335,756 queries per second. This is averaged over a couple of runs and everything, but it's not bad. So a single POWER8 socket was the fastest at MySQL 5.6 point selects in the world. Brilliant. But I still had mutex contention. The place I found it this time was around information about the database table: there's a bunch of shared state about a table inside the database server. Again, I found this using perf, because you could tell exactly which mutex everything was waiting on and spending time on. So that was annoying — I'd found the contention point. But luckily there was a solution for that, for benchmarking, and I wasn't the one to come up with it. If you have one table you're benchmarking against, and you have mutex contention on it, what might you do? Create eight tables instead, run against eight separate tables, and you've essentially partitioned the mutex.
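The eight-tables trick is really just lock striping: split one hot mutex into N independent ones selected by a hash of the key. A toy sketch — the names and stripe count are illustrative, not MySQL code:

```cpp
#include <array>
#include <mutex>
#include <cstddef>

// One hot counter behind one mutex becomes N counters behind N mutexes.
// Operations on different keys mostly take different locks, just as
// eight sysbench tables mostly contend on eight different table mutexes.
template <std::size_t N>
class StripedCounter {
    struct alignas(64) Stripe {         // pad to a cache line: avoid false sharing
        std::mutex mu;
        long count = 0;
    };
    std::array<Stripe, N> stripes_;
public:
    void add(std::size_t key, long v) {
        Stripe& s = stripes_[key % N];  // pick a stripe by key
        std::lock_guard<std::mutex> g(s.mu);
        s.count += v;
    }
    long total() {                      // slow path: visits every stripe
        long t = 0;
        for (auto& s : stripes_) {
            std::lock_guard<std::mutex> g(s.mu);
            t += s.count;
        }
        return t;
    }
};
```

The trade-off is the usual one: per-key operations scale with the stripe count, while whole-structure operations (like `total()`) get more expensive.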
So Dimitri at Oracle, who does a bunch of benchmarking and performance analysis, had this idea and published some numbers; this is how he got his x86 world record. So finally I did, you know, eight sysbench processes against eight tables and summed the results together. And this alleviated some mutex contention. I also decided to look at this configuration option, table_open_cache_instances, which partitions another mutex I was hitting a lot — turns out there was a config option for that and I just didn't remember. So I modified that and got up to 344,000 queries per second on eight one-million-row tables on MySQL 5.6. Every improvement I was doing seemed to add about 10,000 queries per second, which is an odd kind of incremental thing. And this is still on a quarter of a dual-socket POWER8 machine. So theoretically, if I did the math and said, well, what if we solved some of these mutex contention problems properly: 344 times four is 1.376 million. How close could I get? Well, it turns out there was a development version of MySQL, MySQL 5.7, and on a 32-core Intel system they were getting half a million queries per second — there are a lot of concurrency improvements in MySQL 5.7. This was hopeful. For one, there was a shortcut for getting read views, so I would not have that mutex problem. On the same hardware that got 350,000 on MySQL 5.6, they were getting half a million. And so I thought this could be a good way to get bigger numbers, because bigger numbers are better. And I set myself a goal: could I get one million queries per second? Why one million? It's a really nice number. If you email someone saying you got a million queries per second, they're going to pay attention to you. So I was obviously going to switch to the dev release and port my patches across to support POWER properly.
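The table_open_cache_instances option mentioned above is a real MySQL 5.6+ server variable; the values below are illustrative only, not the configuration used for these runs:

```ini
# Illustrative my.cnf fragment — values are examples, tune for your box.
[mysqld]
table_open_cache             = 4000
table_open_cache_instances   = 16    # partitions the table-cache mutex
innodb_buffer_pool_instances = 8     # same trick for buffer pool mutexes
```

Both instance options apply the same partitioning idea as the eight-tables trick, just inside the server.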
So I switched to MySQL 5.7 as quickly and dirtily as possible, and I didn't really finish some of the patches — because who needs reliability when you're trying to get a benchmark number? The typical problem we have with some software is code not being endian-safe (that's a paddling). Luckily, we didn't really have that problem, because MySQL has historically worked on big-endian platforms, and we can also just run little-endian on POWER. So it wasn't really an issue, which is brilliant, because endian problems are annoying. The other thing I found in 5.7 is that volatile is considered harmful. It turns out that using volatile on variables is not memory-barrier magic that makes concurrency work. And this was used in a bunch of places in MySQL 5.7. So I have a long series of patches that I've sent up to the Oracle guys saying, you need to apply these, because volatile doesn't actually do what you think it does: it forces a load and a store instruction, it doesn't force barriers to happen. That was annoying to see. I stared at perf a whole lot more, and I found this really odd thing. Looking at perf, I said: why am I spending 2% of total execution time in this benchmark on this one instruction? And I scratched my head for a while, until Anton mentioned that perhaps it's the load instruction five lines up, and I should look higher. And I went, oh yes, let's have a look at that. And it turns out we had a load instruction that was doing a whole bunch of awful, awful things. And when you rehearse the talk and everything shows up perfectly in the window, and then you give it and now the patch is cut off, you just want to swear at software. So it turns out that C bit fields are considered harmful.
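The volatile point above deserves a concrete shape. volatile only forces the compiler to emit the load and store; it gives no atomicity and no ordering between threads. A minimal sketch of the correct replacement (C11 atomics — this is the general fix, not the literal MySQL patch) is two threads bumping a shared counter; with a plain `volatile long` some of these increments could be lost:

```c
/* volatile gives you a load and a store, not atomicity or barriers.
 * The C11 way: an atomic counter two threads can bump safely.
 * With `volatile long counter; counter++;` increments can be lost,
 * because ++ is a non-atomic read-modify-write. */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_long counter;

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    return NULL;
}

long run_two_bumpers(void)
{
    pthread_t a, b;
    atomic_store(&counter, 0);
    pthread_create(&a, NULL, bump, NULL);
    pthread_create(&b, NULL, bump, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return atomic_load(&counter);  /* exactly 200000, every run */
}
```

On a weakly ordered architecture like POWER the difference is not theoretical: without real atomics or barriers, the lost updates and reorderings actually happen.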
And if you could see the minus line above this — which is magically hidden because of the magic of things failing during a talk — you'd see that when you remove the C bit fields and just use an unsigned integer, it gained 10,000 queries per second. So simply switching away from C bit fields did it. And this was the before and after assembler: you may notice that we basically moved from a 64-bit load to a 32-bit load. I was using GCC 4.8, and it turns out this was a problem there; with GCC 4.9 it's not, but still: C bit fields considered harmful. At this point, on 5.7, I had 411,000 queries per second — another record. And this sounded good, because it was on a single table instead of eight. So how did I get to a million? Well, remember how I said mutex contention on information about a single table was a problem? So I ran two tables and got 494,000 queries per second. Then I went to four tables and got 588,000 queries per second, which is good, because that's more than the previous world record. At this point I noticed that the reference counting on database pages in the buffer pool — to make sure they're not evicted while you're using them — actually started to show up in profiles. So that became an issue. I had a world record, but there was still more. So I did eight tables, got 590,000, and that was pretty good. I stopped at eight because that was a big number, but it's still a little off a million. So I looked at CPU usage: did I have any free CPU? Yeah — it turns out I had 42% of the machine sitting there doing nothing. So I had a world record number and 42% idle CPU, which turns out to be not good. And again, it came down to mutex contention, but in a new place. It turns out there's a read-write lock in checking your permission for a table — as in, can this user who's logged into the MySQL server actually access this table? And this turns out to be a read-mostly data structure.
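To make the bit-field change concrete, here is a hypothetical before/after in the spirit of the patch — these struct and field names are illustrative, not InnoDB's actual layout. A bit field packs the counter into a shared word, so on some compilers (GCC 4.8 era, as above) every access becomes a wider load plus mask/shift; the plain integer costs a few bytes but is a simple 32-bit load:

```c
/* Hypothetical before/after. The bit-field version packs counters
 * into one word; older compilers could emit a 64-bit load plus
 * mask/shift work for every access. The plain version trades a
 * few bytes of padding for a straightforward 32-bit load/store. */
#include <stdint.h>

struct page_bitfield {              /* "before" */
    uint64_t id;
    unsigned fix_count : 25;        /* packed: read-modify-write */
    unsigned io_fix    : 2;
};

struct page_plain {                 /* "after" */
    uint64_t id;
    uint32_t fix_count;             /* plain 32-bit load/store */
    uint8_t  io_fix;
};

uint32_t bump_fix_count(struct page_plain *p)
{
    return ++p->fix_count;
}
```

Checking the emitted assembler (as in the talk) is the only way to know what a given compiler actually does with a bit field; the language standard leaves it almost entirely up to the implementation.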
The number of times you run GRANT and REVOKE is pretty minimal, but you do need to check permissions before returning data. In fact, this is kind of an interesting problem. So I did something that was really, really interesting and is naturally never going to ship. I used a wonderful facility given by the compiler to magically make mutex overhead go away. It's called forward-slash forward-slash: you comment out the mutex. Because using RCU is hard, and that would be, you know, the correct solution. Doing RCU in userspace is the obviously correct answer, right? It's a read-mostly data structure, it's basically never written, so it makes complete sense. But instead I commented out the mutex, because that would be fine for a benchmark. And I got 896,000 queries per second on SMT4, which is big, and also a world record, and really close to a million — I just had to buy, borrow, beg, steal or, you know, change the laws of the universe to get to a million. And it turned out at this point SMT8 was also helping, so I got to 915,000 QPS. This was leaving about 10 to 15% idle time on a full POWER8 system — dual socket — so there was still room to move, because that's still idle time. So instead of using just eight tables I used nine tables, I commented out the read-write lock, and then I used a slightly older version of sysbench that uses less CPU — which is how they got the half-a-million number, so I felt completely legitimate doing that as well. Because it turns out a lot of the system is actually spent simply running the benchmark program; my next mission is to optimize the benchmark. And so it was over a million queries per second, which not only beats the goal and is a world record, but means that today we can run a million SQL queries per second on something that, in 2001, needed a special 72-processor SPARC machine, an asynchronous C++ API and essentially a cluster database to achieve.
So now we can do over a million queries per second from SQL — as long as you don't run GRANT or REVOKE at runtime, it works fine. So it worked, I got my benchmark number, and everyone was duly impressed. And basically I could file this wonderful bug report: this mutex essentially hurts performance by 400,000 queries per second; you might want to look into that. There are more bottlenecks, of course. It turns out atomic variables hurt for a number of reasons — the InnoDB rows-read counter is fairly heavy, and a bunch of the other counters hurt too. And the ACLs are this huge bottleneck. But 1.3 million is possible, maybe, if you multiply it out on a bigger machine — I'd love to see that. But 1.3 million is much less exciting than the first million, so there's slightly less motivation. Full sysbench numbers? We're not going to get those to a million yet, but maybe in 10 years we will. There's a bunch of interesting stuff on sysbench performance on POWER8. The MariaDB guys have a really good white paper on more realistic workloads than key lookups, and needless to say POWER8 does pretty well in those, and it does pretty well on my fake benchmarks. Read-write workloads are probably interesting as well, and these in fact found some other bottlenecks that we have for more general-purpose stuff. Checksums are one of the fancy things we can do in hardware. It turns out that I've forgotten a lot of the math behind CRC32. So we all got to scratch our heads and go: didn't I know this? I was cleaning out the garage, procrastinating preparing my talk for LCA, and I found a bunch of papers from university about this exact math — and then realized I'd forgotten it all. So that was a great way to make myself feel stupid. An optimized CRC32 for POWER is at some point going to help us with write performance as well. As for future work: most of my patches are upstream, and there's a bunch of work still continuing to have it all work out of the box.
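For reference, the CRC32 math in question reduces to a few lines. Below is a bit-at-a-time CRC-32C (the Castagnoli polynomial, reflected form 0x82F63B78 — the variant InnoDB's crc32 page checksum and common CRC hardware instructions use); real implementations use lookup tables, slice-by-8, or vector/CRC instructions, and this sketch is just the underlying math:

```c
/* Bit-at-a-time reference CRC-32C (Castagnoli, reflected polynomial
 * 0x82F63B78). Slow but tiny: table-driven and hardware versions
 * must produce identical results, so this doubles as a checker. */
#include <stddef.h>
#include <stdint.h>

uint32_t crc32c(const void *buf, size_t len)
{
    const uint8_t *p = buf;
    uint32_t crc = 0xFFFFFFFFu;   /* standard initial value */
    while (len--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++)
            crc = (crc & 1) ? (crc >> 1) ^ 0x82F63B78u
                            : (crc >> 1);
    }
    return ~crc;                  /* standard final XOR */
}
```

The standard check value makes verification easy: `crc32c("123456789", 9)` must come out as 0xE3069283 for any correct CRC-32C implementation, hardware-accelerated or not.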
But as a byproduct of this, the MariaDB folks merged them all and made a few other changes, so at least MariaDB works out of the box on POWER8. And Oracle has someone assigned to work on the mutex contention for the ACLs, which is great — it's a known issue for other benchmarks on x86 too. So we're in a pretty exciting place. I currently have 21 open bugs on the MySQL server with various patches, some of which are just fixing things. One of them is a spelling error, which is taking so long to merge — a spelling error in a comment! But it was irritating me. Further reading I will put up elsewhere; the historical reading is kind of interesting as well, so go for it. And we possibly have negative one minute for questions. But let's do it — questions? Thank you.