Should I put you guys all on my Insta story? You guys want to be on my Insta story really quick? OK, so to get started, I need two people that aren't really tired to give me a hand with something. It's really basic, non-computing, but I do need two volunteers. OK, I got one here. One more volunteer just to come up and help me out. Come on up. Appreciate it.

OK, so we're going to do a little competition. I have two books here. I just need each of you to pick a book. OK, stand on each side of me, please. Now, can you look up every instance of the word "war" in each of these books? And the first one to get it, just raise your hand. We have time. Go ahead. Just every instance of the word "war." Please don't use your cell phone. I've done this and someone used their cell phone. I was like, come on. You have the right idea there, I think. Oh, you got it. OK, so you found it really quick. What are you doing over here? How many have you found? None. That's not that bad. Let's just, what page are you on? OK, thanks. I think you get the point here. Thank you very much for helping me out with that.

Yeah, so the talk is Dropping Into B-Trees. That guy's name is Jaws. I'm not kidding. Hopefully this drop-in isn't as vertical as that is. Hopefully it's like a nice 18-degree descent. My name is David McDonald. I'm a developer at Weedmaps. Although this is the title of the talk and we will dive into B-trees here, we're going to talk more generally about indexes. So bear with me a little bit as I set the stage.

Yeah, as developers, we spend a lot of our time learning. That comes with the territory. By my estimate, I spend 25% of my time learning, and 12.5% of my time learning about the new JavaScript whatever, and a lot of time watching YouTube videos, specifically cat videos, because they're hilarious. And I feel like I want more time.
I feel like a lot of people I talk to in the industry just want more of that time, but it's not coming, especially if you work for somebody else, as most of us do. We don't have an endless amount of time to learn about the things that we would like to learn about, and we have to prioritize. We end up not getting as much time as we might like to learn about even the tools we use every day.

So that's kind of where this came from. There have been tons of instances in my career where I'd run into that just-in-time learning. Something's broken, I've got two hours to fix it, the sprint deadline is at 3 PM or something, and now I'm diving in. Well, that just isn't sufficient sometimes. And I think when it comes to database issues, specifically indexes, that's always the case. It's kind of a vast topic: 50 years of computer science going into just one thing, like a B-tree, for example. As Rails developers, we've likely spent more time Googling how to use options for a select in a Slim template, or just, how do I get Sprockets to cooperate with my Angular 1.x thing? And our databases just work. They're time-tested and reliable, so for the most part it's kind of plug and play. For the average developer, our databases are just casualties of this time crunch.

One of the ideas I hope you take away from this today, which is a little bit in contrast with what David spoke about this morning, but not too much, is that it's worth diving deeper into certain topics that can yield a lot of fruit, right? And I feel like indexes is definitely one of those topics. There are a lot of spaces in our stack like this, middleware, for example, that require expertise and attention, and they're very marketable things, right? They're things that will help you on any team you go to for the rest of your career, in perpetuity. Now, I've frankly spent most of my career treating databases the way I would treat my smartphone.
Like, I know how to use the thing, but I don't really know exactly what it's doing, right? That's mostly how I look at a database. And I think your average person walks around with their cell phone just using it. They don't really think about radio waves, they don't really know the difference between Swift and Kotlin, they don't care about that stuff. So I've gotten by, like most of us do, with gems and heuristics and online searches, but I'm a professional. I use these things every day. I should be able to go a little bit deeper when I need to. So that inspired me to do just that.

Databases are a vast topic, so focusing on indexes, I feel, gives us the most bang for our buck, okay? Understanding indexes is key to demystifying database performance and functionality, and it will have a huge impact on your ability to plan, contribute, and troubleshoot as you progress in your career. If I'm talking too quickly, someone signal me, because I drink a lot of coffee.

So this is a bold statement. What are indexes anyway, right? We use them all the time. Do we ever really stop to think about what they are? Yeah, they're those things that speed up selects, right? They speed up SQL queries, especially ones that are starting to run slowly. They can help you look things up faster, and depending on the size and structure of your stored data, much faster. All of this may be true, and we know what they do, but how do they speed up selects? What is this data structure? How does it work? We kind of know what they do, but we don't really know what they are. At least, I didn't.

The standard definition you'll find is that an index is a data structure stored on disk that organizes a reference to your data. It is both a copy of the data you wish to index combined with a reference to the actual data location.
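To make that "copy plus pointer" definition concrete, here's a toy model, not anything a real database does internally: a "table" is just an array of rows, and the "index" is a separate structure holding a sorted copy of one column's values, each paired with the position of its row. All names and data here are invented for illustration.

```ruby
# A tiny "table": rows stored in whatever order they were inserted.
table = [
  { id: 1, email: "carol@example.com" },
  { id: 2, email: "alice@example.com" },
  { id: 3, email: "bob@example.com" },
]

# A toy "index" on email: a sorted list of [copied_value, row_pointer].
# Note it duplicates the email values -- the copy lives alongside the table.
email_index = table.each_with_index
                   .map { |row, pos| [row[:email], pos] }
                   .sort_by(&:first)

# A lookup binary-searches the small sorted copy, then follows the
# pointer back to the real row -- no full table scan required.
def index_lookup(index, table, value)
  entry = index.bsearch { |key, _| value <=> key }
  entry && table[entry.last]
end

row = index_lookup(email_index, table, "bob@example.com")
```

Notice that if you insert a new row, the table alone isn't enough: you'd also have to insert a copy into `email_index` and keep it sorted, which is exactly the double-write cost discussed later.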
So pictures are helpful, right? For completeness, this would be the non-clustered form here. You have this separate place stored on disk where you're actually making copies of the data you're trying to index, and you have pointers pointing back to it. That really is the best image I could come up with. These data structures provide the basis for random lookups and efficient access to ordered records, ordered meaning stored in tables, in our case. Let's just dwell on that image for a second and really let it sink in: how it's this separate thing, how it makes this copy. You can start to deduce a lot just from knowing that, right? But let's go a little bit deeper.

If you take our book example again, imagine you need to add a paragraph, okay, and it's going to add the word "war" in there. Well, what are you doing when this book already has an index? You can't just add the word "war" and expect your index to be untouched, right? You're actually inserting two "war"s: you've got to go to the index, update that, and point to where the word lives in your book. And our databases work the exact same way.

At this point, you likely have more questions than you came in with, so let's just add a couple more. When and how do I use this index? Why is an index on an integer faster than on a string? Why do inserts bring a performance penalty? How can you drop an index and not hurt the data? Another question I have for you: how can you open a file, say on Linux, mess around with it, destroy the file, and, since the file's still open, just go on editing it? On top of all these questions, we all know, just as working developers, that as our data grows and the number of queries increases, the number of rows being scanned grows. And as these requirements shift, you will need to go deeper and understand the issues that arise. So some examples should help us out with these questions.
Let's keep these questions in the back of our minds as we go forward. A brief disclaimer: I'm going to be talking about these examples in the context of Postgres. There are two reasons for that. Number one is that that's what I work with every day. Number two, that's pretty much it. I work with it every day, so. If you use another flavor of relational database, don't fret, because the principles we'll talk about generally apply.

So, keep those questions in the back of your head. We're going to jump into Postgres with some examples and then revisit them in a bit. I did a simple sampling of data, and I threw it up in a gist if you want to follow along or check it out later. Just in Ruby, I created 50,000 rows in a CSV file using the Faker gem. If you're not familiar with that, it's great. I created my table: we have an ID, which we'll make our primary key, then name, email, city, all strings. And then we just copy that in, specifying our delimiter and where we put that file. So now we have 50,000 records, okay?

Why 50,000? I found that that's where I started to see a performance difference between the certain things we're going to talk about today. It's a tiny amount of data, though. That's important to keep in mind. At Weedmaps, we have several tables with millions and millions of rows, right? And that's not even a lot of data. I think that kind of tells you a lot, though: we're going to see performance differences on small tables, like this one with 50,000 rows. And hopefully you can see the value in diving deeper into this topic.

Okay, so that yields a table like this, right? That's just what you see in psql. Notice we have an index here on the primary key, a B-tree on the ID. We all know that looking up rows by that primary key will be fast and painless, but there's a catch.
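As an aside, the sampling step described above might look something like this. This is a stand-in for the gist, not the talk's exact script: it skips the Faker gem and uses only the standard library, and the row count and file path are placeholders.

```ruby
require "csv"

ROWS = 1_000  # the talk used 50_000; smaller here just for speed
path = "users.csv"

# Write one fake user per row: id, name, email, city.
CSV.open(path, "w") do |csv|
  (1..ROWS).each do |id|
    name  = "User #{id}"
    email = "user#{id}@example.com"
    city  = %w[Irvine Denver Austin Portland].sample
    csv << [id, name, email, city]
  end
end

# Loading it into Postgres then mirrors the steps above, roughly:
#   CREATE TABLE users (id integer PRIMARY KEY, name text, email text, city text);
#   \copy users FROM 'users.csv' WITH (FORMAT csv);
```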
So remember, indexes are stored on disk, and that means this table has stored a copy of and a pointer to each of the indexed fields, in addition to the actual data in the table. Moreover, every insert and update that we make to those columns, like we talked about with the book, forces the database to make two writes: one write to the actual table for the desired field, the intersection of the column and your row there, and another to make a copy of the data, included with a reference and some metadata, in the index. So there is a small cost. Obviously, this trade-off is well worth it on columns that are used frequently for lookups, and most databases do more reads than writes, but just keep in mind that there is a cost.

What does all of this mean for us? Well, it doesn't seem like such an obvious win to just be throwing indexes on everything willy-nilly, and that's exactly right. This is why it's best practice to only add indexes as you need them. You need them, obviously, on primary keys, foreign keys, and integer columns that are used for joins or otherwise queried often. But do you need them on string fields? Well, sometimes you do, right? It's a case-by-case basis, but how do you go about making these decisions? You can't just Google it. You really can't; everyone's use case is different. So we need to understand what these data structures are and how they act.

Let's take a little look at what we've got in that table. This is just a screen grab, sorry. We do a little EXPLAIN here on a full table scan. If you're not familiar with Postgres's EXPLAIN, this is a quick little crash course. We see this seq scan. That just means a full table scan, meaning it's going to look at every single row, right? And then we get these weird metrics. Here we have a cost. Now this is just an arbitrary metric, like most metrics are, that's kind of meaningless on its own. You need some kind of reference point.
So your reference point is what it looks like compared to the costs of other queries, right? Other EXPLAINs. There are two numbers you've got to look at. The first one here is your startup time, your ramp-up time, you could call it. So you notice we got a zero for this full table scan of all 50,000 records, and we got this 1022, whatever that means. We have 50,000 rows, and we have 52 bytes wide, meaning the width in bytes of what you're getting back. Okay, so not too crazy, right? And if you were to run this locally, it just happens like that.

Let's do another one. We do an EXPLAIN on our primary key, okay? So WHERE id = 666, because, you know. Well, we get a whole different set of output here. This time we did an index scan using our primary key on users, and look at our cost compared to our full table scan. The first thing you'll notice is a huge decrease, right? Eight compared to 1022, a huge win already. Now we're only getting one row, but it's worth noting we've got this ramp-up time here, which is completely different. You've got to wonder, okay, why is there a ramp-up time when we're looking at one single row? We got one row, same width on that guy, and here's our index condition. So EXPLAIN is very helpful in determining these kinds of things, and going forward, this is the way to do it. When performance matters and you need to make that decision and it comes down to you, I don't know how else to go about it without having this broader understanding, actually looking at what you're doing, and running some basic benchmarks.

Let's do a select on a string column, okay? So we've got our email as a string, right? Yeah, a varchar. We do an EXPLAIN on users where email is artlemke at low.co. Art's a real person, he's a friend of mine. We have a sequential scan, the same thing we did up here earlier. But wait, we're only looking for one row. Before, we were looking at 50,000 rows. Why are we doing a seq scan for this one row? Look at our cost.
So there's no ramp-up time again, but we actually took more time to look up one row than it took us to scan all 50,000. The rest of this is the same. Look at this filter here we used. Okay, so no index. That's obvious. So what do we do in this case? We need to use the email for lookups; maybe we're building some authentication system and we need to use this email for lookups all the time. It is crucial that you know how to examine your DB and what you're doing, okay?

But these questions remain. Why are we starting to see these things slow down? Someone might be thinking, well, it's an index on a string, and those are slower, okay? Well, we'll add this index here and do this EXPLAIN. Notice the cost compared to our int column. Where's that? Right here. So 0.29 versus 0.41: a little bit more ramp-up time, a little bit more lookup time. Same deal, and it's just one row. So, well, it's obvious, right? We're doing a string index, so that's slower. We just know that; that's a heuristic we live by. String comparisons are more expensive, even though we are dealing with the same number of records, right? This is where knowing what the actual data structure is will help you out.

So indexes come in many flavors. Postgres has more than this, but these are the main ones we talk about. And you know what, in 98% of cases, your indexes are just B-trees. We have B-tree, hash, GiST, SP-GiST, GIN, and BRIN indexes in Postgres. I think hash doesn't exist in a lot of other relational databases, but a lot of these do exist elsewhere as well. And our indexes, both of those, were B-trees. So a B-tree kind of looks like this if you were to conceptualize it. It's just a simple tree data structure. Note that the root node has two children, right? B-trees speed up your searches because they sort the values, known as keys, inside of each node.
Depending on what is actually stored in these nodes, it will sort them differently. For example, characters would be sorted alphabetically, integers in numerical order. Although they can house potentially thousands of keys per node, B-trees try to evenly distribute the keys among the nodes. That way, the number of levels that you need to traverse is actually quite small, okay? So this is a gross simplification of your typical B-tree, but it's more or less this. This minimizes the time that you're ever going to have to spend crawling this thing. Each pointer leads to a child node that contains more or less the same number of keys as each other child does. Storing and looking up data in the B-tree follows really simple rules.

So let's take an example. Let's take a character B-tree. Notice how all the data is laid out here, and let's say we want to look up V, okay? With your human eyes, that's like instant. Like, yeah, there's the V down there, okay? But obviously computers are pretty dumb. They just do what you tell them to do, and they don't know how to find this V. So let's say we issue a query looking for V, and we arrive at the root node, M. The rule to follow throughout is to obey alphabetical order. V comes after M, so we follow the right pointer down to the next node, okay? If we had something like B, we would have gone to the left. It's that simple. As you're starting to see, this is all just basic control flow, and this is stuff that we do every day, right?

Next you'll arrive at an intermediary node. Again, you can just kind of see where you need to go, but how should the algorithm proceed? We arrive here with these three values, we want V, and we've got P, T, X. So this is a really perfect spot for binary search, right?
We land in the middle and compare to whatever's there. If we are greater than it, we check the next value over, and so forth. And when we arrive at the spot where we should be, we slide down that pointer to the next node, okay? Now, we could have just done a sequential scan, right? I mean, there are only three keys there. With a cardinality of three, a sequential scan is super quick. But these nodes could potentially house thousands of keys, so we do a binary search. And conveniently, these are sorted alphabetically every time, so you're not going to run into any issues. If you had to sort them yourself first, it wouldn't be a good candidate for binary search.

So back to our intermediary node. We start right in the middle and compare: V comes after T, so we don't need to worry about anything to the left. There's only one value remaining to the right, and that's X. V is before X, so we take the pointer down in between, we land in the middle of the leaf, we look to our left and compare, and voila, there's our V.

Okay, so I just walked you through a B-tree. Just for completeness, Postgres uses B+ trees. Up here on the upper right, you have an abstraction of what that looks like, but it's more or less the same idea, right? With B+ trees, the root node and inner nodes contain keys and pointers to lower-level leaf nodes, and those leaf nodes contain pointers to the heap, okay? And the heap is where your tables are stored.

Okay, so in defining what this index is, we mentioned making a copy of our data. Whenever we execute the CREATE INDEX command, Postgres saved all of our existing IDs or emails into a B-tree, and they became the keys of our tree. So it would be just like this one here, right? Additionally, whenever we perform an insert, the database will add to the structure. And when you add to the structure, what happens? Well, we can't do our binary search if things are out of order, right?
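The lookup walk above can be sketched in a few lines of Ruby. This is a toy, not Postgres's implementation: real B-trees pack thousands of keys per node, inner keys act as separators rather than stored values, and leaves point at heap tuples. But the control flow is the same: binary-search the sorted keys within a node, then slide down the matching child pointer.

```ruby
# A node holds a sorted list of keys; inner nodes also hold one more
# child pointer than they have keys. Leaves have no children.
Node = Struct.new(:keys, :children) do
  def leaf?
    children.nil?
  end
end

def btree_include?(node, key)
  # Binary-search this node's sorted keys for the first key >= target.
  i = node.keys.bsearch_index { |k| k >= key } || node.keys.size
  return true  if i < node.keys.size && node.keys[i] == key
  return false if node.leaf?
  # keys[i - 1] < target < keys[i], so child i holds the right subtree.
  btree_include?(node.children[i], key)
end

# Roughly the character tree from the walkthrough: root M, then an
# intermediary P / T / X node, with sorted leaves underneath.
tree = Node.new(["M"], [
  Node.new(%w[D H], [Node.new(%w[A B C]), Node.new(%w[E F G]), Node.new(%w[I J K])]),
  Node.new(%w[P T X], [Node.new(%w[N O]), Node.new(%w[Q R S]), Node.new(%w[U V W]), Node.new(%w[Y Z])]),
])

btree_include?(tree, "V")  # descends M -> P T X -> the U V W leaf
```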
So we have this balancing act that we have to do. Every time we add something, we may have to rebalance, right? Sometimes it's as simple as, oh, well, this is a 10, it comes after this eight and nine, I can just slip it on the back here and we're good. But there is some rebalancing that has to happen. In fact, B-tree stands for balanced tree, right? As mentioned before, you won't get a B-tree that has wildly differing numbers of nodes on the same level, and all of this has a cost and has to be managed, something we have to remember for larger data sets. If you go back to our example of doing this with strings, you can instantly see why this is a little bit more expensive. Not wildly expensive, you know, but it is more expensive. If our users table had 10 million rows in it, we'd have a pretty massive B-tree, and every write could potentially cause some rebalancing.

We've mentioned a lot of the costs of B-trees, but do keep in mind that these things are super efficient. The algorithm that Postgres uses, from Lehman and Yao, has been around for decades and just works. B-tree indexes are really great for equality and range queries, they can operate against all data types, and they're also great for retrieving null values. So there's a lot of value there.

Okay, so now that we have a better understanding of what these indexes are, specifically B-trees, let's rethink some of those questions from earlier. Why is an index on an integer faster than on a string? Well, comparing two ints is no problem, right? A string is a whole different mess, I think you can see pretty quickly. We're also talking about the size in bytes. Now that we know what this thing looks like and how it operates, you don't even need to wonder about these questions anymore. This is kind of the beauty of diving in here. Why do inserts bring a performance penalty? Well, you have to do two writes every time, right?
I've worked at a company where we did probably 80% writes and 20% reads, and this became a nightmare really quick. That was the just-in-time moment for me, where I was like, I need to learn what's going on here. How can you drop an index and not hurt the data? Why can you do that? We have this image here. What happens if we just destroy this guy? Well, we still have the tables, right? Everything's still living on the heap, so we haven't actually broken anything. If this is a new concept to you, it's kind of like, oh yeah, of course.

I'll leave a couple of other questions for you to think about. Maybe we can talk about them after the talk. How does this work with multicolumn indexes? Consider indexing a string field: every lookup on the index now has to compare strings at every node inside the B-tree until it finds the desired result, which is an expensive computation compared with ints. And only then can we follow the pointer to the actual row it needs in the table, okay?

So my hope in doing this whole 200-foot view down to the nuts and bolts is that you can take away some quick bullet points. You should be able to answer the previous questions; we just walked through them. Hopefully you can walk away with that. I hope that you're just more aware of what you are and what you're not indexing. That is a huge win if you can walk away with that. If you can think twice, like, well, let's just throw an index on that string field. Well, you should probably benchmark it. You should probably think about your use case. You should probably think about whether there's a way to get around using that string field. Or, you know, if it's not a big deal, maybe you don't do lookups that often. Maybe it's an admin page that's kind of obscure, and maybe that's not a big deal. Or maybe you need to build a whole read API, in Elasticsearch or something. I don't know. You've got to figure that out. Know how to analyze your queries.
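On the benchmarking point, here's a quick, unscientific sketch of the int-versus-string cost difference using Ruby's standard Benchmark library. Sorting is comparison-bound, so it's a rough proxy for the pile of key comparisons a B-tree lookup or rebalance performs. The data is invented, and exact numbers will vary by machine; the shape of the gap is the takeaway.

```ruby
require "benchmark"

# 50,000 keys, matching the table size from the examples: once as
# plain integers, once as email-like strings built from them.
ints    = (1..50_000).to_a.shuffle
strings = ints.map { |i| "user#{i}@example.com" }

# Same number of keys, same ordering work -- only the comparison
# operator differs. The string sort comes out measurably slower.
Benchmark.bm(12) do |x|
  x.report("int keys")    { ints.sort }
  x.report("string keys") { strings.sort }
end
```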
So this is how you figure it out. Just use EXPLAIN. Rails has great benchmarking, and Ruby has great benchmarking tools as well. You can just do that in pry or whatever, and you get the same results. Understand the basics of how a B-tree index works for reads and writes; hopefully this talk has helped you arrive there. If you walk away with any of these things today, I'll be happy. Hopefully we all have a better understanding of one of the pillars of database performance and functionality. And thanks for attending.

I've left some resources for you. One of the main inspirations for this talk was this Pat Shaughnessy blog post on B-trees from 2014, an excellent read. Some general documentation: Heroku's documentation on indexes is excellent, and obviously Postgres's as well. I left a little mock data for you if you want to mess around with it. We've run into plenty of migration woes at a company like Weedmaps that has 80 engineers, so we've all rallied around this gem called Strong Migrations and put some checks on our PRs, so that you don't add indexes in ways that lock up the tables and our millions of users can no longer access them. There are some really great bullet points to live by in the README, and the gem is super helpful. And you can find me at davidpm.io, just a little GitHub page where I leave a link to my Instagram, I think. I don't know. Thank you.