 Now, we have Mateon Kovit speaking about database performance secrets of the stars. Matt, you have the words. Thank you very much, Svetlana. Hello, everybody. My name is Matt Yankevit. I'm here to talk to you about the issues that could be destroying your application performance that may be caused by the database. So a little bit about me. Again, Matt Yankevit. My official title is Head of Open Source Strategy or The Haas at Percona. I've been around the open source space for over 15 years, specifically in the database space for all that time. And if you don't know who Percona is, we do MySQL, MongoDB, and Postgres software services and tooling. So we also have a podcast that I host as well, specifically on open source software. Now, today's challenge, and this is a challenge for all of us, is I'm going to try and fit as much knowledge as I can into this small bite-sized chunk. And so there's going to be a lot of detail. It is not going to be a ton of detailed specifics, but I'm going to give you enough to help make your application perform better and ensure that the database is up and running and give you pointers on where you can look. Now, you've probably built many an application, and maybe your application is beautiful and speedy and fast and it looks sleek like this Formula One car. I'm sure that you put a lot of work and effort into making it look and operate in a positive way, but sometimes you end up with performance that looks like this is the engine. Of course, nobody wants a hamster wheel engine, and a lot of times that awesome application that you have is slowed down by the database or the infrastructure that you've put in place. And a lot of times it is not just about the database you've decided to use. It's about how you designed your application, built the database internals, as well as the database that you've chosen and how you've set it up. Now, a consistent theme that we're going to talk about when we talk about infrastructure and that foundation is we're going to start with who's managing it and is it managed? Okay, this is a question that a lot of people don't know the answer to. As developers, we tend to think of the database as one of the last pieces that we want to concentrate on. There's a lot of more interesting code. There's a lot of more interesting features. So the database tends to be something that we don't want to spend a ton of time on. So when you have a database, can you answer who is in charge of it? Who's responsible for making sure that it's working properly? Or are you outsourcing it? And as we talk about that infrastructure, the words trust but verify come to mind. And this is an important concept as we talk about that infrastructure because the trend right now is a lot of people are taking their database and they're trying to run it as a service in the cloud. They're trying to have someone else help manage it. And the cloud is a great place in databases as a service such as RDS or Azure SQL, Google SQL, or the myriad of other databases as a services that are out there, they provide some excellent operational management. However, they are not perfect. And this is something that is very important because if you have chosen to run your database in that infrastructure, there are some things you need to be aware of. If you don't manage it, your performance is only going to scale by spending more money. You go to the next instance sides, you get a more memory, more disk, more CPU usage, you increase your spend. And this is where that trust but verify comes in because in the cloud space, it's important to understand that if you have outsourced or given fully managed providers, your database fully managed isn't always the same thing that you think it is. So for instance, most cloud providers employ a shared responsibility model. And from a database perspective, you might be running your MySQL, your Postgres, your Mongo somewhere else, but your app needs to understand that the database is architected for it, it's designed for it, it's configured to an optimized and troubleshoot it. These are all things that are your responsibility in the cloud. So even if you are using a cloud provider's database as a service, you still need to understand that there are things that could severely impact the scalability of your application within the database itself. Now you might ask, what about some of these new databases? And so one of the things that has come up quite frequently in recent times is these new SQL databases, if you will, cloud native databases, right? They are awesome. They help you scale your systems. And I've seen a lot of people look at issues with databases as a service, they've tried MySQL, they've tried Postgres, now they're looking at these cloud native that can scale up and down. These provide a lot of opportunities to solve some of those scalability challenges. However, while they show great promise, you have to understand that they are not all designed for every workload. Many of these require you to change the workload of your application, change how your application runs or what sort of data you're retrieving and what frequency in order to boost performance. If you do a copy and paste from database, let's say Postgres to Postgres compatible database, you're not necessarily going to see a performance boost at all, you might see degradation depending on that workload. So trust but verify if you decide to go down the route of trying some of the new databases that are out there in order to solve some of the scalability challenges you have. Similarly, you might see systems like VTES or Citus that provide sharding for your systems, which is also another way to scale and boost the performance. And again, with a sharded workload, some queries, some system, some application, some workload, they're not necessarily going to perform faster they could actually be worse. So you have to understand that not every technology that's in the database ecosystem, even if they do say it's to boost performance, we'll actually boost it unless you make application changes. If you do make application changes or your workload does match, these can be awesome tools that can really improve performance. Sometimes multiple magnitudes of performance improvement. So that's something to keep in mind. Now, that being said, so whether you've chosen to run on the cloud, you're running in a Kubernetes environment, you're running your databases on-prem and your own hardware, there are some basic things from a development perspective if you're in charge of choosing these and you're setting up a new application, you should consider. Now, the first thing about choosing hardware and instant selection is this. If anybody knows what this is, this is a RAM. Yes, I'm going to give one of those little vignettes of memory is good. Of course it is because from a database perspective, and this is foundational, the amount of data that you are selecting and retrieving is one of the biggest limiters to your application performance. And the more of that data that is readily available in its fastest form is going to improve the performance of your application a hundredfold. So if you have all of your hot data in memory, in RAM, it is better than if it's on a solid state disk. And if you have all of your data in solid state disk, it's better than rotational memory or rotational disk. So it is important that you understand all of your hot data should be able to fit into memory. And all databases are universal, whether you're running Mongo, Postgres, InnoDB or MySQL, MariaDB, any of the new SQL options, the more memory and the more of your hot data that's in memory, the faster performance you're going to get. And the more data you put into your database, that means that there is a risk that it's going to slow it down. And that's why there's a big push on scalability. How do we scale? How do we get more out of our systems? And unfortunately, today, we are data hoarders. I would challenge anyone in this room, anyone watching later on, have you ever had a manager who has said, we don't need to keep more data? We need less. And the answer is no. In most cases, people want to hoard data just in case. And that makes it very difficult. And if you look at these two sides, think of this as your database. It's easier to get around the after picture than the before picture of this hoarder's setup because there's less clutter. And the less clutter you have in your database, the better off it is. So keep that in mind. The more that fits into memory, the better, the less clutter in your system. Now, of course, one of the other things when you're choosing a instance or hardware is going to be what about concurrency? What about those CPUs? In most of the time, concurrency matters more than raw speed in your CPU selection. So you want more cores as opposed to more megahertz per second. And that's gonna be something that is important because as your number of users grow and the number of queries grow, you want them to run concurrently. So some general rules on hardware selection and optimizing your system and making sure you've chosen the right infrastructure. Make sure all of your hot data is regularly in memory. If you do have a lot of data that you need, make sure it's on faster disk. Make sure you use guaranteed IO if it is a heavy IO system. See if you can get rid of some of that data or reduce it and make sure that if you're going to have a lot of concurrent users that you are using more cores. Now, another opportunity to reduce the database side of things is to put caching in front of it. So if you wanna use Redis, Memcache, something like that, that is another option. But really you want to select and use the smallest amount of data possible, hitting the database and being retrieved at any one time. Now that we've kind of talked about the infrastructure, we built this awesome car and we've kind of made sure that the engine component is there, let's get the sports car onto the road. And if you were to take that Formula One car and put it on this road, the performance would not be as good as you might think. This is not a road designed for Formula One sports cars. This is an off road. And this is a design issue. If you design the sports car and this is the road you need to go on, that's a problem. And so I wanna talk a little bit about design because for most applications from a database perspective, the scale and performance of that application is heavily dependent on the design. The design matters most because let's be honest, if you get the hardware wrong, it's generally fairly straightforward to upgrade to the next level of hardware. It's generally easier to upgrade, to add additional nodes to a replica, things like that. If you need to redesign the entire application from the ground up, that's a very expensive prospect. Now, the number one issue we see when we're talking about database performance issues and scalability issues when we run into these is design issues. And this is an over and over again problem and it's really hard to get right. So when you look at this, you see number one, things like bad data type selections, right? Using larger data types than you need. Whether you're going to use an int versus a big int versus a numeric versus something else, poor schema design, over reliance and ORMs. Many of us might use SQL Alchemy, you might use Ruby on Rails, might use Hibernate, all kinds of different ORMs. ORMs are great to simplify your code, but a lot of times they'll choose the wrong type of access path for a query. They might choose more data than they need. And you need to understand that just because something's there to make it easy doesn't necessarily mean it will make it fast. Similarly, when we get back to that hardware, you design your system, you might over or under engineer the hardware side. And so it's important to understand how these flow together. Now, you might be thinking, okay, I don't need to worry about schema because I am schema-less. Some of you might be running MongoDB or might be running other systems that have flexible schemas. I am here to tell you that schema-less does not really exist. It is a marketing tool, okay? Now, yes, there are things that are designed to be flexible and to make you not worry about the schema. But the truth is this is really more about do you structure your data or not and where do you do the validation for your schema? Schema-less designs tend to push most of the logic to the application side where you need to validate in code, which may be okay for you. But that flexibility can come at a price. And so sometimes the databases themselves, even these schema-less databases need to still manage a validation scheme behind the scenes. And those aren't always as efficient. So when you employ one of these unstructured data structures, you go schema-less, you just kind of throw whatever into a database, you're going to trade off performance scalability for that use of flexibility. Even in the most advanced databases, even in systems that are designed to be quote unquote schema-less, this is something that we see happen quite a bit. Because when you start to throw iteration upon iteration of your schema and of your JSON documents, oftentimes how you're selecting the data out slows down, bogs down, because there are things that are there, are there indexes don't quite work the same way. So it's important to understand that if you are going to choose one of these designs that you're still going to have to think through that. And the more design thought you put in upfront, the better off you're going to be. Now, that being said, I've harped a little bit on data types. And I want to point out from a data type perspective why data types matter so much. And I'm going to give you a real-world example here. This is something very easy. It's nothing that's groundbreaking. But let's say you wanted to store the string one, two, three, four, five, six. Now that could be a number or it could be a string. If you take a look here, we've stored it as a VARCHAR32 with a UTF-8. We've stored it as a VARCHAR32 with a Latin. This is, these are all minus QL for this example. And we've stored it as an int in a big int. You can see that each of these fields will take 21 bytes, seven bytes, four bytes, eight bytes. But you can see the difference, right? If you were to store this as a VARCHAR32 in a UTF-8, it's almost two gigs worth of data, whereas the int is 381. Now remember what I said, the more hot data can fit into memory, the more scalable the application and the database is going to be. Well, in this case, if you've got 16 gigs of memory available, one eighth is going to be taken by the first one, whereas a very small percentage is going to be taken by the int. So it's about trying to optimize what's being stored in memory. And that's an important thing. Now at one point, we worked with a large social media company. If you were to look at the top five social media companies in the world, it's one of them without naming their names. And they were running into scalability issues when they were going through massive growth about eight, nine years ago. And they had keyed everything off of their email address. And an email address being something very long, it could be up to 100 characters. This is something that for most people, it just makes sense. That, you know, hey, we want to store this and everything will be keyed off of this. Now in MySQL, which is what they happen to use, the primary key is also stored in every other index that you build. So that means that that 100 bytes that is being used for the email is also in every subsequent index as a lookup, which means that they had 10x the amount of data they needed to store. Now what we ended up doing was converting their email primary key to a auto increment. So a sequence, a number, one, two, three, four, five, six. And we also put a hash, CRC32 hash of the email address and we made those two combined their primary key. That meant it was a maximum of eight bytes as opposed to the hundred. So there was a 10x improvement not only in storage, but there was a massive almost 20x improvement in performance because less data is in memory, less CPU cycles are spent trying to compare the Varchar values and it was a huge win for them. So many of you might use UUIDs and many ORMs you use UUIDs internally. So a 32 byte representative of a unique ID. That can actually kill performance. And so that's something to keep in mind. Now I mentioned that MySQL, the indexes all contain the primary key. Indexes are probably your number one friend when it comes to troubleshooting queries and query performance. And so you've probably heard this numerous times all over the world. So I'm not gonna go too much into this, but I wanna leave you with a couple of things. Too much of a good thing can be a bad thing. Too much of a good thing can be a bad thing. I've seen it many, many a times where people will index almost every column that is not good. Okay, you should limit the number of indexes because every index you add adds additional overhead to maintain that index. So every column doesn't need an index just the most frequently queried or filtered columns do. Every database has their own set of indexes and there's different use cases for individual indexes. Now we've done talks before, happy to point you to some resources. If you're interested in which individual indexes do what, we do have that detail there. Now I wanna also say, use what you need, okay? Don't keep everything just in case, archive what you don't need and be mindful that a lot of ORMs and a lot of database applications do a select star to select all the data from the systems all at once. This is something that can kill performance, okay? Matt, I'm sorry to interrupt you. We need the time limit for talk. So we should slowly move to the question answers and questions. So please finish the idea and slowly move to the questions. I think I only have like two more slides. So one minute. So be mindful of that select star. And I do have some configuration options in case you are interested, MySQL configuration options, which ones you should potentially look at and Postgres configuration options. But keep in mind, you have to tune your system for the workload and don't over engineer things if possible. You can follow us if you wanna get more on this topic. We have a vibrant YouTube channel. We also have a blog with tons of 15 years worth of database content all in the open source space. So I'm here for questions now, but keep in mind there are a few more slides you might wanna dig into. Thank you Matt for a great presentation. Now everyone who have any question, please put it into the question answer step. We will wait for a bit if there is any question. Anyway, Matt will be available at work adventure site. So feel free to visit him there to talk about the talk or anything else. And I'm also on Discord as well. So if you wanna ping me on Discord directly I'm available there as well. Excellent. Well, I don't see any questions. I guess no questions. That's excellent. Okay. Again, thank you very much for great presentation and thank you everyone for joining us for this talk. See you soon and enjoy the rest of the DevCon.