My talk today is "Optimizing MySQL without touching my.cnf", and there is a reason I call it that. A bit of information about me: I've been a software engineer for the last 10 years, and I joined Dropbox as an SRE in 2015. But despite being a member of the storage team, working on metadata storage and touching MySQL a lot, I had no DBA experience before that. So I had no idea how to optimize or speed up MySQL by modifying the MySQL configuration.

Let's look at what Dropbox is today. Dropbox today is 500 million users, 200,000 business users, and those users upload 1.2 billion files every day; it's quite a big scale. We have tens of thousands of servers, we're located in multiple data centers across multiple countries and continents, and we store petabytes of metadata and exabytes of raw data. That leads to a quite obvious problem: we need to optimize for performance, because we want to save on power, we want to save on the amount of hardware we buy, and we want to reduce latency as well.

So this talk covers optimizing MySQL mostly by looking at how we build MySQL, and at just rebuilding it: does it make sense to build MySQL in your own infrastructure, or is it enough to use vendor packages?

For every performance improvement, for every performance change, it's reasonable to use benchmarks, and for benchmarks we use sysbench. Here is an example of the configuration we use for running sysbench. We run several tests with a varying number of threads; in this particular test the number of threads goes from 1 to 64, and each test runs for five minutes, which is enough when you're testing compute. We use the standard OLTP read-write test from sysbench, and for this particular test we use the master branch, a quite fresh checkout of sysbench; as far as I understand there will be a sysbench talk later today. We create 16 tables, each containing one million rows.

These are the test subjects, the systems we test on. We have historically run on Ubuntu 12.04 for quite a long time, and we're still in the process of migrating to Ubuntu 16.04. The two systems have different kernels: we currently run 3.16 on 12.04 and 4.8 on 16.04. There's a different set of compilers available on each platform: on 12.04 it's GCC 4.6, GCC 4.9 and Clang 3.0, which are pretty old by now; on 16.04 it's GCC 4.9 (backported for backwards compatibility), GCC 5.4 and Clang 3.8. Both machines run the same specific CPU and are completely identical in terms of hardware.

Speaking a little about build infrastructure at Dropbox: we use Bazel as our build system, and Bazel is tightly integrated into our infrastructure, but MySQL uses CMake as its build system, and Bazel doesn't interoperate with other build systems. If you're trying to build something with Bazel and integrate it into your Bazel build, whether it's autotools-based or CMake-based or anything else, you're literally stuck and you need to do what we did: we wrote a CMake-to-Bazel converter tool, which helps us integrate MySQL into the Bazel build, and maybe we'll open source it. It's actually not that hard to write; we did it in one night.
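For reference, a sysbench setup along the lines just described would look roughly like this. This is a sketch, not our exact configuration, and the MySQL connection options are placeholders:

    # Prepare 16 tables with 1M rows each (connection options are placeholders)
    sysbench oltp_read_write --tables=16 --table-size=1000000 \
        --mysql-user=sbtest --mysql-password=sbtest prepare

    # Standard OLTP read-write test, 5 minutes per run, 1 to 64 threads
    for t in 1 2 4 8 16 32 64; do
        sysbench oltp_read_write --tables=16 --table-size=1000000 \
            --threads="$t" --time=300 \
            --mysql-user=sbtest --mysql-password=sbtest run
    done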
One important thing before running any benchmarks, and I actually made this mistake on one of these machines: thanks to the Ubuntu developers, starting from Ubuntu 16.04 the CPU frequency scaling governor is set to powersave, so your CPU just sits in power-saving mode. Don't forget to set the governor to performance.

And let's start. What we're going to optimize is MySQL itself, just by rebuilding it. One of the advanced compiler techniques is profile-guided optimization (PGO). What is profile-guided optimization? You make a special instrumented build of your binary, which collects training data. Then, when you have enough training data, you rebuild the binary, and the compiler optimizes branches for better branch prediction, relocates code paths, and inlines better, because it actually has statistics on how the code is used. So it usually requires a rebuild, but not in every case: with Clang, as far as I understand thanks to Google, you can use perf data, so you don't need an instrumented build at all; you just collect perf data and build using that, which is a bit easier.

A common mistake is collecting profile data from unit tests. Don't do that. Unit tests usually check corner cases, and you don't want to optimize for corner cases; your code would literally be fast on some corner case, like raising an exception, but not in any other case.

Let's see how we build with GCC. For a profile-guided build with GCC we have two flags. The first enables profile data collection: -fprofile-generate. There's a variant of this flag that lets you set the directory where the profile data should be written. The second flag is -fprofile-use: when you rebuild your binary, it uses the profile data. And in the case of MySQL, that doesn't work. It doesn't work because MySQL is a multi-threaded application and the profile data ends up with inconsistent counters. To fix that, we need to add -fprofile-correction, which tells GCC to just try to fix the counters for us. Bad news for Ubuntu 12.04 users: this doesn't work with GCC 4.6 or 4.9; the flags are accepted, but PGO doesn't work for MySQL. That's it about GCC.

The other compiler we have in the distribution is Clang. Clang is set up a bit differently, with a slightly different set of flags. It's -fprofile-generate as well, but before they introduced command-line compatibility with GCC it was called something different: -fprofile-instr-generate. And there's an interesting flag, -fprofile-sample-use, which is exactly the flag I mentioned for using perf data. Bad news again for Ubuntu 12.04 users: no PGO support in Clang 3.0 at all.

So we rebuilt MySQL with PGO and ran the sysbench tests, doing several passes so we don't get fluctuations in the results. Here are the results for 12.04. I don't know how the colors look on this screen, but I can tell you what's there: Clang 3.0 is the winner, it produces the binary with the highest throughput, about 10% more than the GCC builds. At the same time, both the GCC and the Clang here are pretty old. But if you're stuck on 12.04 (and you shouldn't be, because 12.04 LTS support ends this April), it could be a good choice to just rebuild MySQL with Clang 3.0 and get a 10% performance improvement.
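For reference, switching the governor I mentioned at the start of this section looks something like this; a minimal sketch, assuming the cpupower utility from linux-tools is installed:

    # Check the current frequency scaling governor (powersave by default on 16.04)
    cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

    # Switch all CPUs to the performance governor
    sudo cpupower frequency-set -g performance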
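And to make the GCC flag sequence concrete, here is a minimal sketch of the two-stage PGO build; the source path and profile directory are made up for illustration:

    # Stage 1: instrumented build; profile data will be written under /tmp/mysql-pgo
    cmake /path/to/mysql-server \
        -DCMAKE_C_FLAGS="-fprofile-generate=/tmp/mysql-pgo" \
        -DCMAKE_CXX_FLAGS="-fprofile-generate=/tmp/mysql-pgo"
    make -j"$(nproc)"

    # ... run a representative training workload against this mysqld ...

    # Stage 2: rebuild using the collected profile; -fprofile-correction smooths
    # the inconsistent counters a multi-threaded server like mysqld produces
    cmake /path/to/mysql-server \
        -DCMAKE_C_FLAGS="-fprofile-use=/tmp/mysql-pgo -fprofile-correction" \
        -DCMAKE_CXX_FLAGS="-fprofile-use=/tmp/mysql-pgo -fprofile-correction"
    make -j"$(nproc)"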
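The Clang side, again as a sketch: the instrumented route goes through llvm-profdata, and the sample-based route uses perf plus the create_llvm_prof converter from the AutoFDO project (file names here are illustrative):

    # Instrumentation-based PGO with Clang
    clang++ -O2 -fprofile-instr-generate ...            # instrumented build
    # ... run the training workload, which writes default.profraw ...
    llvm-profdata merge -output=mysqld.profdata default.profraw
    clang++ -O2 -fprofile-instr-use=mysqld.profdata ... # optimized rebuild

    # Sample-based PGO: no instrumented build, just perf with branch records
    perf record -b -- ./bin/mysqld --defaults-file=my.cnf
    create_llvm_prof --binary=./bin/mysqld --profile=perf.data --out=mysqld.prof
    clang++ -O2 -fprofile-sample-use=mysqld.prof ...    # optimized rebuild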
And here is how response time looks: it's the same story. Lower is better, and Clang produces lower-latency binaries as well, which is good.

This slide is more interesting. Here we have the sysbench results for the 16.04 builds, and on 16.04 we're actually able to use PGO. One odd thing, and we haven't found out why: plain GCC 5.4 produces really bad results. Maybe something is broken with our compiler build, and it's not a problem with GCC itself but with our build infrastructure, but we decided to include it anyway, to be honest. GCC 5.4 with PGO, on the other hand, produces pretty good results: it's roughly 20% faster than plain GCC 5.4, and you can achieve that just by rebuilding your MySQL binary with PGO. Every other compiler is a little slower, but just by using PGO you can get a 20% performance improvement. And here are the latencies: the lowest line is the GCC 5.4 PGO build. For this test we didn't try Clang PGO builds, so we don't have Clang data here yet; it would be interesting to compare a Clang PGO build against a GCC PGO build. But latencies are also roughly 20% lower for the PGO builds. So this is one of those interesting optimization techniques where you don't touch the MySQL configuration at all and you still speed up your MySQL.

Another interesting advanced compiler optimization technique is link-time optimization (LTO), also known as whole-program optimization. It's supported by both modern GCC and modern Clang, but the MySQL code base is not LTO-compatible at all. A little bit about what LTO is: with LTO the compiler, instead of producing object code per compilation unit, produces an intermediate representation, and the linker does the actual compilation and optimization. So you postpone the actual optimization to the link stage, and since the linker sees all the object files, it can inline code from other object files, which is not possible without LTO. It's actually easy to fix in the code base, and we're going to start working on that pretty soon, so we'll be sending patches. It will be interesting to see LTO results on other projects too. We saw something like a 5% performance improvement from it, which is also interesting.

So that's basically what we've accomplished so far with these optimizations: about a 20% performance improvement just from rebuilding MySQL. And we have more ideas about what to do next. Because we were using sysbench as the training data for PGO and then testing with sysbench again, we want to start using real-world training data. For that we've started contributing to Percona Playback, and we're going to replay real production traffic in order to optimize our builds for production. That leads to an interesting question: we don't have just one database, we have many databases, and their workload profiles are different. We haven't decided what to do here, because it would be nice to have some generic training data we could use for every production database.

Another interesting option is building with -march=native (I mistyped it on the slide). It allows the compiler to use the latest instruction set extensions, like AVX and AVX2, where normally it uses a generic CPU instruction set. With that you can also get around a 3-4% performance boost.
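For a CMake-driven build like MySQL's, enabling LTO is mostly a matter of flags; here's a rough sketch (it assumes the code base links cleanly with LTO, which, as I said, MySQL currently doesn't):

    # GCC: emit intermediate representation and optimize at link time;
    # gcc-ar/gcc-ranlib keep the IR intact inside static archives
    cmake /path/to/mysql-server \
        -DCMAKE_C_FLAGS="-flto" -DCMAKE_CXX_FLAGS="-flto" \
        -DCMAKE_AR=gcc-ar -DCMAKE_RANLIB=gcc-ranlib
    make -j"$(nproc)"

    # Clang (recent versions): ThinLTO is the cheaper variant
    cmake /path/to/mysql-server -DCMAKE_CXX_FLAGS="-flto=thin" ...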
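And -march=native is just one more flag on top of whatever build you're doing, with the caveat that the resulting binary is only safe to run on CPUs at least as capable as the build host:

    # Let the compiler use every extension the build host's CPU supports
    # (AVX, AVX2, ...) instead of the generic instruction set
    cmake /path/to/mysql-server \
        -DCMAKE_C_FLAGS="-O2 -march=native" \
        -DCMAKE_CXX_FLAGS="-O2 -march=native"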
Another interesting thing: when we run MySQL in production, we usually run one single multi-threaded instance per machine. But we've looked at perf data and seen traffic on the QPI interface. On multi-core machines you usually have two sockets with a CPU in each, and if you access memory attached to the other CPU's socket, the access goes over the QPI link, which is slower than using local memory. Not dramatically slower, but we still see something like 3-4% here. So another option is that we would run two MySQL instances, two shards, on a single hardware machine, each bound with numactl to its own socket and its own local memory.

So that's it. Questions? Could you repeat the question? Do we have a public Debian package with this? So, as I said, we don't use Debian packages in our build infrastructure, we use Bazel, so unfortunately we're not building Debian packages. So the next question is how PGO works. A PGO build collects data on branch usage and, in the case of GCC, writes it into .gcda files; using those statistics the compiler can understand which branches are taken and which are not. Yes, it's exactly about that. We're collecting basic events; I don't recall exactly the size of the data we collect, but we also run atop on the machines. Okay, thank you very much. Thank you. Will you be around? So Maxime will be around, and if you have questions, just grab him, go outside, ask him as many questions as you want, buy him a beer and he'll talk for a full day, right?
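For completeness, the NUMA pinning idea mentioned above would look roughly like this; a sketch, assuming two mysqld instances with their own hypothetical config files:

    # Pin one instance (shard) per CPU socket, memory included, so neither
    # instance pays the QPI penalty for remote memory accesses
    numactl --cpunodebind=0 --membind=0 mysqld --defaults-file=/etc/mysql/shard0.cnf &
    numactl --cpunodebind=1 --membind=1 mysqld --defaults-file=/etc/mysql/shard1.cnf &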