I think it's time, so we'll go ahead and get rolling. Hi everyone, I'm Craig. I work at Citus Data; if you're not familiar, we turn Postgres into a sharded, distributed database. I won't talk too much about Citus in this talk, but we have a booth in the expo hall, so if you're curious I'm happy to answer questions about Citus afterwards. Generally I'm going to talk about sharding and data modeling. There are a lot of mixed opinions on sharding, I've found. It's pretty well proven that it scales, but how you model your data up front is probably the single biggest factor in whether you succeed or fail with it.

First, a little background on who I am. I curate Postgres Weekly, so if you're a Postgres fan... is anyone here not using Postgres, or not a fan? A couple of hands. Okay, I'll convince you afterwards. Postgres Weekly is similar to Ruby Weekly and other weekly newsletters that go out, targeted more towards app devs than DBAs, so hopefully you'll find something good in there. I run our database as a service at Citus, and previously I was at Heroku running Heroku Postgres for about five years, so I'm happy to answer questions on that as well, having a good bit of experience there.

So what is sharding? There's been a good bit of talk about it this week already. DHH was kind of proud of how they didn't shard, and yet they split things into smaller pieces by using different databases. In Rails 6 we're getting some support for additional databases, but it's not really sharding; it's "hey, there are multiple databases that Rails is aware of now, and you get to figure out what to do with them." Sharding, by definition according to Wikipedia, and it's a pretty clear definition, is just the practice of breaking your database into smaller parts. Usually this is for performance. It could be write performance, it could be read performance, take your pick, but it's usually because of performance. The nice thing about sharding is that by breaking your database up from one huge database into a bunch of smaller parts, everything is small again. Everything is faster: things like vacuum and indexing are all faster. So by sharding, everything is small and easy to work with again. And it's very well proven at scale. Google, Instagram, Salesforce, take your pick: any large company at that scale has gone down this path. You don't necessarily have to be at their scale to see the benefits, either. I've seen it start as early as 10 gigs of data; more commonly it's a hundred gigs to a terabyte, that level.

So first, what is a shard? I think this is the most common misconception I see when talking to people about this. We'll get into the data modeling in a little bit, but first, what actually is a shard, and how should you approach it? There's the physical VM, right, an instance on Amazon, a Postgres cluster. A lot of times people associate "hey, I've got two different VMs" with having two shards. Actually, no. The idea is that you're going to have multiple shards existing within one VM, so there's a separation between the nodes and what's running on them; you might even run multiple Postgres instances on one node. If you've got a Postgres database with multiple tables, schemas, or even multiple Postgres databases themselves,
all of those are technically shards. This is pretty important early on, so that you don't tie yourself to the exact hardware you have. The idea with sharding is that you want to create a bunch of smaller chunks that you can easily move around: between different nodes, scaling the hardware up, scaling it down, adding nodes, and moving the tables across. This makes your life way easier. So, starting from that, defining the right shard count up front is pretty key.

To visualize it: if I've got two different physical instances and I want to create 32 shards, under the covers I might create a table like events, and call the shards events_1, events_2, events_3, events_4, and so on. I'm going to create a bunch of those shards per node, so if I'm using tables, events_1, events_3, events_5 go on the first node, and the even ones go over on the other node. Really simple, but what's nice is that when I go from two nodes to four nodes, I just take half of those shards and move them. I copy them over, which you can do using things like logical replication, or by creating a read replica, copying them across, and then updating the metadata, that sort of thing. You do need some utility or mechanism for the actual process of moving them, but overall it's much more manageable than having one shard per node, where splitting things up inside the same table is a lot harder. So up front, you're going to create a larger number of shards.
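To make that a bit more concrete, here is a minimal sketch of what the shards and the placement metadata could look like. It's a hand-rolled illustration with made-up names (events_N, shard_placements), not any particular tool's catalog.

```sql
-- Template for the logical table; in this scheme it holds no data itself.
CREATE TABLE events (
    event_id   bigserial,
    user_id    bigint NOT NULL,
    payload    jsonb,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Each shard is just an ordinary table. With 32 shards over two nodes, the odd ones
-- might live on node A and the even ones on node B (shown in one script for brevity).
CREATE TABLE events_1 (LIKE events INCLUDING ALL);
CREATE TABLE events_2 (LIKE events INCLUDING ALL);
CREATE TABLE events_3 (LIKE events INCLUDING ALL);
-- ... and so on up to events_32

-- A small metadata table, in the app or on a coordinator, records where each shard lives.
CREATE TABLE shard_placements (
    shard_id   int  PRIMARY KEY,
    table_name text NOT NULL,   -- e.g. 'events_1'
    node       text NOT NULL    -- e.g. 'node-a.internal:5432'
);

-- Going from two nodes to four is then "copy the shard over, update its placement row":
UPDATE shard_placements
SET    node = 'node-c.internal:5432'
WHERE  shard_id IN (1, 3, 5, 7);
```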
All right, so on to the five data models. We're going to walk through each of these with a bit of "is this your data model," here are the things to think about, here are the tips and ways to rationalize through it, and here are the ways to rule it out. The number one thing I try to do is actually rule out a data model, saying "this won't work," and find the reasons why, instead of forcing it in. When you force it in, that's usually when you've had a bad sharding experience. If you've ever tried sharding and it just failed horribly, you probably tried to force a square peg into a round hole.

Okay, but first a bit on the approaches. I think there are two different approaches, and one of them is much more common at scale. The other is one that commonly gets talked about, and that I see implemented more, but it isn't necessarily sharding in the same sense. It is, in that you're breaking things up into smaller parts, but it's not the same scalable approach that a Salesforce, a Google, someone like that would use.

So the first one: hash. What you're going to do is, as soon as something comes in, hash the ID that you're sharding on. You're going to define a range of shards, so in our example from before we've got 32 different shards, and you split the resulting hash values up into 32 different buckets, and keep some metadata table saying "these resulting values go into this shard." And then we're done. Any questions on that?

All right, so it's a little bit more than that, but it's actually not too much more at a basic level. You're going to hash on some ID. If you're using Postgres, there's an internal hashing function you can use, and it exists for all the data types: if you've got text it can hash it, if you've got an integer it can hash it. You can roll your own too. You don't have to get crazy on this; the hashing function you use will not determine your success or failure.

You're going to define your shards up front, and this is really important: you want to go larger than the number of nodes you expect to reach. If you think that at your heyday, when you're at Google scale, you're going to run a thousand servers, then something like a thousand or two thousand shards is probably an okay number to go with. You also probably don't want to go with two; you'll outgrow two nodes really, really fast. Some medium number based on how large you think you're going to grow is a good start. You can go and split up shards later, but that's a lot more work. Once you go through the sharding process, you want to go ahead and buy yourself two years, three years, five years, ten years, and not have to go back and redo a bunch of work. And as I mentioned, two is bad, but two million is also bad: now you've got tables that are so small they're just sitting there empty, and running a migration across all of them in production is costly. I tend to like numbers like 128 or 256. You also don't have to grow in factors of two, but it's usually really nice and easy and works pretty well.

A common misconception is that people just route the raw values. You have user ID 1 and you say, "I'm going to take users 1 through 10 and put them in shard number one, then shard number two gets the next range," and so on. That's a huge problem. What you find is that your oldest customers are the ones with the most data, since they've been around the longest, and now they're all saturated together. By hashing things up front you naturally distribute your data pretty evenly. Yes, you may still have a hotspot, but it's a much better way of naturally distributing the data up front. So you take user ID 1 and take the hash value of it; in Postgres it comes out to, say, 46,000, and the hash value of 2, as you can see, is around 27,000. Then I'm going to have a table that splits up the full range of hash values and says that, for example, everything between 26,000 and 28,000 is shard number 13. I keep a little lookup table, either in my application or in metadata in my database. If you're using Citus you don't have to build this yourself, it's in there for you, but it's the same idea: all the resulting hash values in this range live in this shard. You've naturally evened out the skew of your data, and as things grow and get older, new ones keep getting spread around, so it works out really nicely.
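Here is a minimal sketch of that hash-and-lookup flow, again with made-up names and boundaries; the only real Postgres pieces are the built-in hash functions.

```sql
-- Lookup table mapping ranges of hash values to shards. A real setup would have one
-- row per shard (32, 128, ...); these two rows just cover the whole int4 range.
CREATE TABLE shard_map (
    shard_id int PRIMARY KEY,
    hash_min int NOT NULL,
    hash_max int NOT NULL
);

INSERT INTO shard_map VALUES
    (12, -2147483648, -1),
    (13,  0,          2147483647);

-- Postgres ships internal hash functions (hashtext for text, hashint8 for bigint, ...),
-- so routing a key is just "hash it, then find the range it falls into":
SELECT shard_id
FROM   shard_map
WHERE  hashtext('user-1') BETWEEN hash_min AND hash_max;
```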
All right, so on to range. I see a lot more people implement this than true hash-based sharding. It's really common in time series. What you've got to do is ensure you always have a new bucket. Say you're doing range partitioning and you want to keep each week of data by itself; maybe it's an ad network and you want to count impressions by week. I don't need to join across other weeks, I can do aggregations differently, that sort of thing. So you're going to create a new table for every week of data that comes in. It is pretty straightforward: you create the new range, you route to the right bucket, and you make sure you keep creating the new ranges. I'll talk a bit more about that in the time series data model, because it's an important part, and then you're done. So range is a little more straightforward; if you've done partitioning before, you've probably done it this way. It works just like you'd expect. All right, questions?

All right, so there's actually quite a bit more. Those are the basic approaches, and those two apply pretty broadly to sharding. Now the more interesting part is the data model. If you don't get the data model right, you're going to have a lot of problems: you're going to be trying to join data that lives across different nodes, and moving data across the wire is really slow. We haven't fixed that in databases. So it's a matter of: how can you co-locate data based on your application?

So first, geography. The big question here is whether there's a really clear geographical boundary, and this is, like, 99% of the time you don't want to do this. I won't say 100% of the time, because there may not be a 100%. One time I was getting picked up at SFO airport in an Uber, heading home, and the driver was talking about how he had recently picked someone up at SFO. It was the day before Thanksgiving, and the rider said, "I can't go see those people. Can you drive me back to LA?" Not kidding: he went from San Francisco to LA in an Uber because the guy didn't want to see his family. Now, normally I think of Uber as having confined region boundaries. Most Uber drivers are not going to drive from San Francisco to LA, so a trip is essentially always within San Francisco, and that's usually a pretty good regional boundary. There are a lot of services that are good examples of this: things that are by state, income by state, healthcare has some defined geographic boundaries, grocery delivery services like Instacart, those sorts of things. If there's a really clear boundary that you don't go out of, this can work pretty well.

There are some bad examples, though. That Uber ride from San Francisco to LA is the exception case; most of their data works just fine, and you can deal with those exceptions as they come. The bad examples are when one side of the data has one geography and the other side has the exact opposite geography. Like text messages that go from one area code to another. I grew up in an age where, once I moved, I didn't get a landline, I just kept my cell phone. So I still have an Alabama phone number out in California, and I don't actually pay attention to area codes; it's not localized anymore. When your data always spans geographic boundaries, that's usually a problem.

Do you join across geographic boundaries? The key word here is join, not aggregate. You can aggregate pretty easily across geographic boundaries, and roll-ups work pretty well that way, but when you're joining data in Florida to data in California, you're going to have to move it over the wire. So what you need to figure out is how much you're joining across those geographic boundaries, and whether they're really clear. Does data easily move from one place to another? For something like Instacart, where I have my home grocery store, it's okay if I move and they change my geography. That's a one-time operation.
It's not happening constantly. Well, at large scale, yes, it is happening constantly, but it's not like every user is doing it constantly. If something like that happens less than, say, a few hundred thousand times per day, it's fine for data to move from one spot to the other.

So, a few more specifics. You want to find the right mix here, especially on granular versus broad. I mentioned state is kind of an okay example for some things, but now you've only got 50 states, right? And to distribute that... what happens when you've got California or Texas, which are huge, and a bunch of other states that are really small? Doing something like zip code might actually be better if you can divide things that way, because your data skew is a little bit better.

So again, some common use cases here. If you're Instacart, Shipt, Uber, Lyft, these use cases work really well; you should know if you're in a go-to-market where you start in one geography, then expand to another, then another, then another. This model could work really well for you. Now, in the real world I see this happen a bit with range sharding: saying, "oh, California gets its own database, Texas gets its own database," that sort of thing. This is a problem because of that data distribution. Now you've got 50 different databases to manage, and you haven't done the proper work of hashing things as they come in. What you still want to do is hash those IDs. Give each state an ID, or you could actually hash the text value: you could hash the text value of "California" and do the same thing. If it's states, again, it's probably a bad example; if it's zip codes, I can create 32 buckets and distribute all the zip codes across those 32 buckets pretty easily. So you still want to go through that hash-based sharding to actually make this work. Otherwise you're going to run into pains as you start to scale. So that's one.

Multi-tenant. This is my favorite, probably because it's the easiest, and I won't say sharding is ever completely easy, but it's the one that's pretty straightforward. If you have a SaaS, B2B kind of business where each customer's data is their own, like salesforce.com, where I can't look at my competitors' leads and I only get to see my own leads and opportunities, then sharding by customer and isolating each customer's data is a pretty straightforward process.

What is the data distribution? This is the number one thing that will probably rule out multi-tenant. If you have a SaaS business and one tenant is half of your data, well, obviously this is going to help for the other half, but it's not going to help for that really large one. I don't usually get worried when one customer is five percent or ten percent of the data; there's still some good headroom there, and there are some options. But looking at your data distribution is really key to knowing whether you're going to be successful or not.
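One quick way to get a read on that distribution, as a minimal sketch assuming a hypothetical events table with a customer_id column, is to look at each tenant's share of the rows:

```sql
-- Rough tenant-skew check: what fraction of the rows does each customer own?
SELECT customer_id,
       count(*) AS row_count,
       round(100.0 * count(*) / sum(count(*)) OVER (), 2) AS pct_of_total
FROM events
GROUP BY customer_id
ORDER BY row_count DESC
LIMIT 20;
```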
So again, common use cases here: SaaS, B2B. I see a bunch of CRMs, marketing automation, any kind of SaaS where each customer's data is their own. Even some cases like banks make sense here; even though it's a B2C business, usually when I'm looking at my finances I'm not sharing that with anyone, and I hope they're not either. So the question is: is the data isolated to that customer? In that case it's a pretty clear line.

So, a few guidelines for this, and you can start really, really early, when you're just starting out with your first Rails model, or your second. Put your tenant ID, or customer ID, on every table. It's going to make things like joins, foreign keys, all of that massively easier, and you're not going to have to go back and backfill it later. So yes, you're denormalizing, which in theory is bad; in reality it's going to make things much, much easier when you do need to distribute things.

Ensure your primary keys and foreign keys are composite ones. Rails support for this is okay; there are a few ways to work around it, it doesn't drop in out of the box, but it works pretty well and I've seen it work at scale. Basically you're saying that my customer ID, plus the typical primary key that's a serial on some table, together form my composite primary key. You get the same referential integrity, and it will also give you some nice gains in terms of performance: usually, by indexing that way, the index is more targeted and Postgres can take a different plan.

And then enforce the tenant ID on all queries. By going with this model, if you make sure you've got a where clause of "customer_id equals foo" on every single query, you know there's no data leakage. There are actually a couple of gems for this; the one I most recommend is activerecord-multi-tenant, which has a little decorator that you add, and then it applies the tenant ID on every single query, on every model. It takes care of it for you. So if you're going down this path in Rails, I would recommend it.

A quick look at a simplified example of a Salesforce-style schema. You've got your leads table, your accounts table, and your opportunities table. Here you're going to change it just a little bit and add org_id to every single one of these, on every table that you have. And then, for your keys, you're going to have a primary key that consists of both of those. If you do this up front, early on, it will save you a lot of headache later. When you try to come back and add these columns to a 10 terabyte table, it's going to take a long time and it's not going to be very fun. Also, you're going to find that you probably have anomalies in your data. A lot of people say, "oh no, no, that's not possible with our data," and then when they start to backfill this stuff it's, "well, I'm pretty sure you have two records for this customer. Which one is right?" And you never know, and you just have to randomly delete one. True story. It really happens.
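Here is a minimal sketch of what that looks like in the schema. The table and column names are illustrative, not Salesforce's actual schema:

```sql
-- Every table carries the tenant column (org_id here), and the keys are composite.
CREATE TABLE accounts (
    org_id bigint    NOT NULL,
    id     bigserial NOT NULL,
    name   text,
    PRIMARY KEY (org_id, id)
);

CREATE TABLE leads (
    org_id     bigint    NOT NULL,
    id         bigserial NOT NULL,
    account_id bigint,
    email      text,
    PRIMARY KEY (org_id, id),
    FOREIGN KEY (org_id, account_id) REFERENCES accounts (org_id, id)
);

-- Queries then always filter on the tenant, which is what keeps them routable to a
-- single shard (and is the where clause something like activerecord-multi-tenant
-- adds for you):
SELECT * FROM leads WHERE org_id = 42 AND email LIKE '%@example.com';
```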
All right, so a few warnings. If you're using schemas, and I see a lot of people do one schema per customer or one database per customer, be careful. This works great when you're at 10 customers. It works okay at a hundred customers. But when you're running a Rails migration for 10,000 customers, and you have to run it across 10,000 different schemas, that thing that used to take a few seconds now takes a few hours or a few days. I've actually heard of cases where schema migrations did take days to run. And migrations never fail, right? So you never have to go back and clean anything up. You also never have to worry about connection limits, because you're grabbing a new connection, or setting the schema on the search path, each time... Now, if you've got 10,000 customers, you've got a database with 10,000 connections, and guess what, Postgres usually doesn't like that many connections. So I would say, in general, think twice about going one schema or one database per customer, unless you know you're only ever going to have something like 10 customers. If you have 10 customers and you're never going to scale beyond that, absolutely go that approach. And that's valid for some businesses, if you have a really, really high price point for certain high-end customers.

All right, so entity ID. This one is interesting and kind of a hard one to describe from a textbook definition, but it's sharding on something more granular than that default tenant-style hierarchy. Here what you want to do is actually optimize for parallelizing your workload. You don't want one query to run against a single shard; you want a query to run against every shard. What you're trying to do is optimize for the performance of that parallelization. This is more common in use cases like ad networks and a lot of log monitoring tools, that sort of thing, where you've got a massive amount of data. You do want to join where you can, but you want to ask: where can my join be pushed down to these tables and spread out? So again, you're optimizing for parallelism, less for keeping one customer's data in memory, less for your standard, typical SaaS OLTP app.

A good example is web analytics. Here it's really common to shard by visitor ID or session ID. Both of these work really well, because when I'm looking at things, I'm either looking at an aggregation of sessions or visitors, or I'm looking at one specific user. I'm not joining "give me user A that did something against user B." If you're joining across those, that's where it gets complicated. Most web analytics tools are just doing aggregations, and those roll up pretty well.

Here are some things to think about. Your SQL is going to be more limited. You can't have the full set of window functions and CTEs as easily, because to do those sorts of things you've got to bring data back to a single node and re-sort and re-order it, which is really expensive. If you're bringing back a terabyte of data over the wire, that's going to be slow, and nothing's going to change that. You really want to think in a Hadoop, MapReduce style: can you actually split this job up into a bunch of smaller parts, run them locally, do pre-aggregation, and bring the results back to a central point? If you can express your application and workload in that way, this will work really, really well for you, and it actually scales really beautifully. I've seen cases of 100x performance over single-node Postgres in these situations, if it's what you need.

So a few examples. Count star: if I've got 32 shards, I can do 32 smaller counts, bring those back, and do the final aggregation. Average is also really easy: I can do a sum and a count on each shard, bring those back, and do the final math.
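As a minimal sketch of that pushdown, reusing the hypothetical events_N shard tables from earlier and assuming a numeric duration column just for the example:

```sql
-- Per-shard partial aggregates (each inner query can run on its own node, in parallel),
-- combined with a little math at the end:
--   overall count(*)  = sum of the per-shard counts
--   overall avg       = sum of the per-shard sums / sum of the per-shard counts
SELECT sum(cnt)                       AS total_count,
       sum(total)::numeric / sum(cnt) AS overall_avg
FROM (
    SELECT count(*) AS cnt, sum(duration) AS total FROM events_1
    UNION ALL
    SELECT count(*) AS cnt, sum(duration) AS total FROM events_2
    -- ... one branch per shard, or however your coordinator gathers the partial rows
) AS per_shard;
```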
Median, that's a little bit harder. Now this is where it gets cool, though. Medians, distinct counts, window functions, all these things are very cool and very powerful. For count distinct there's an algorithm called HyperLogLog. Go read the paper, I think it's out of Google, it's really, really fun. I also love just saying "HyperLogLog." It's a probabilistic, approximate distinct count that gets really, really close, and you can do things like union them, to say "give me the users I saw on Monday but didn't see on Wednesday," based on that data type. It's really compact and efficient in terms of storage as well. For ordered lists there's top-n, or top-k; it's really common in Elasticsearch. We actually just released a TopN extension at Citus, I want to say about a month ago, so if you need top-n for your Postgres database, give it a look. For medians, there are things like t-digest and HDR histograms. So there are probabilistic algorithms that get you really, really close. They're not perfect, so if you need exact answers they're not going to work, but if you do go down this path and need some of these things, give any of these a look.

All right, so graph databases. With this one you'll really know if you need a graph database. The most common use cases are social networks and, actually, fraud detection, where you're looking for a high number of connections. It's a very, very different approach. Craig posted a photo; Craig and the photo might be objects, the vertices, in my database. Daniel liked it, and Will posted a comment on it; those are the associations. So graphs are composed essentially of objects and associations, and you can think of it two ways: there are things that can occur just once, and there are things that can happen over and over and over, like comments. I can only like a thing one time, so those are the edges that can only occur once, and the others can recur in different ways. So you can think of "liked" as a true or false kind of association.

For sharding within a graph database, the most common approach is that you're going to write the data twice. You could ask, "do I shard on the objects, or do I shard on the associations?" and the answer is yes. What you're typically going to be doing is querying based on different things: for a newsfeed you might query on the associations, and for other pieces you'll query the details about the objects, or vice versa, depending on what you're doing. That's the high level of it. If you're doing a graph database, or using a graph database: there's a newer one on Postgres, AgensGraph. I don't know anything about it, I haven't tried it yet, but I've looked at it and it looks promising. The most common one out there is Neo4j. But I would actually recommend reading this paper: probably the most sophisticated graph database that is heavily sharded is Facebook's. It's not open source, it's not public, it's called TAO, and the paper walks through a lot of it, including some of the sharding principles. It gets pretty deep pretty fast, and the applications are very, very different from a relational database or even a text search database. The short version is that you're going to shard it both ways, based on your associations and your objects, and I believe Neo4j has some built-in tools for that, so probably just leverage those.
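To make the objects-and-associations shape concrete, here is a minimal relational sketch of the "write it twice" idea, loosely in the spirit of what the TAO paper describes; the table and column names are made up for illustration.

```sql
-- Objects (the vertices): users, photos, comments, ... in a single ID space.
CREATE TABLE objects (
    object_id bigint PRIMARY KEY,
    type      text   NOT NULL,   -- 'user', 'photo', 'comment', ...
    data      jsonb
);

-- Associations (the edges): liked, posted, commented-on, ...
-- "Write the data twice" here means storing each edge under both endpoints, so one copy
-- can be sharded by source_id and the other by target_id, and a query from either
-- direction stays on a single shard.
CREATE TABLE assocs_by_source (
    source_id  bigint NOT NULL,
    assoc_type text   NOT NULL,
    target_id  bigint NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (source_id, assoc_type, target_id)
);

CREATE TABLE assocs_by_target (
    target_id  bigint NOT NULL,
    assoc_type text   NOT NULL,
    source_id  bigint NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (target_id, assoc_type, source_id)
);

-- e.g. "who liked photo 123?" only needs the by-target copy:
SELECT source_id
FROM   assocs_by_target
WHERE  target_id = 123 AND assoc_type = 'liked';
```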
All right, time series. This is actually real data from Hacker News. I won't say Hacker News is a perfect source for anything, but if you do read it, there's good content there at times. This is data from their "Who's Hiring" threads, screen-scraped going back years. What I love is that Postgres is crushing everything; it's clearly the better database. Also interesting is that most things there are relational. As cool as NoSQL and everything else is, the primary load of most applications is still a relational database, and I don't think that's going away anytime soon.

All right, so, time series. A lot of people think, "hey, I have time on my data, I have a created_at on all of these tables, so let's just shard by created_at." Or, "I have analytics data, so time series it is," because it has a time-based dimension. You can always do that, but it really depends on how you're accessing the data. Are you frequently joining things that span across time dimensions? Are you partitioning, and are you pruning off old data? One of the biggest benefits of time-based sharding is getting rid of the old data. It's not just making the recent data fast, though that is one piece; it's also getting rid of the old data.

So: are you always querying by time? Are you querying a subset of the data? And do you remove old data? If you do all three of these, then time series is actually a really good fit for you. If you do one of these, it's a little more questionable. If you don't do any of these, then you don't have a time series problem at all. If you're always querying over long ranges, say you're always querying two years of data, it's not going to help you that much. Same if you're not removing the old data. And if you're not querying on time at all, hopefully you haven't even thought about this approach. So this one really varies. You can make it work, but I've seen a lot of people go down this path and run into various pains along the way.

With time series, you're going to do more range partitioning than hash-based sharding. As data comes in, you're going to create a new range for that new bucket. It could be daily, weekly, monthly, yearly, you get to pick, and you're basically going to split the data up that way. The key steps are to define your ranges and make sure you set up enough of them in advance. The number of times I've seen people run a script once and say "I'm going to create all of my buckets right now," and they create two years' worth of buckets, and they think they'll set a reminder and totally remember before they run out... two years later production is down and they're like, "well, crap, we should have automated this when we started." Make sure you're also deleting old data. Dropping the old partitions keeps things snappy; otherwise it's still just growing and growing and growing. And you can do all of this, with some nice indexing, without a lot of extra elaborate work.

I don't know if any of you were in the Postgres 10 talk, but Postgres 10 got native partitioning, time-based partitioning, and it's still got a few rough edges. Walking through it: you'd come in, create a table, and say PARTITION BY RANGE on the column you want to partition your data by. That's the initial setup. Then you're going to come in and create tables for each of the partitions, saying each one is for values from this value to this value, and you're going to do that for every single range. You usually want to schedule this as a job that just keeps running, because if you don't have the table for the data that's coming in, you're just going to lose that data. You're also going to want to create your indexes up front. If you come in later and just create an index on that initial measurement table, it's not going to do anything for you, so you've got to do it on each specific partition table. As you add new indexes, you need to go back and index the old partitions, or you need to automate that in some way. And then you're going to create a trigger, which basically runs automatically when you insert data into that measurement table. It says, let me do some analysis: if my value is between this and this, go here; if it's between that and that, go there. So, I am very excited about this being in Postgres 10, but I wouldn't personally call it native partitioning yet. I think it's still got a little way to go in terms of user experience.
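Here is a minimal sketch of that Postgres 10 setup, with an illustrative measurements table and weekly ranges. With the declarative syntax the row routing is built in, so the trigger really belongs to the older, inheritance-style way of partitioning.

```sql
-- Parent table: holds no data of its own, just the partitioning rule.
CREATE TABLE measurements (
    device_id bigint NOT NULL,
    reading   double precision,
    logdate   date   NOT NULL
) PARTITION BY RANGE (logdate);

-- One child table per range; these have to exist before that week's data arrives,
-- which is the part you schedule as a recurring job (or hand to pg_partman).
CREATE TABLE measurements_2018_w01 PARTITION OF measurements
    FOR VALUES FROM ('2018-01-01') TO ('2018-01-08');
CREATE TABLE measurements_2018_w02 PARTITION OF measurements
    FOR VALUES FROM ('2018-01-08') TO ('2018-01-15');

-- In Postgres 10, indexes have to be created on each partition individually;
-- an index on the parent does not cascade down.
CREATE INDEX ON measurements_2018_w01 (logdate);
CREATE INDEX ON measurements_2018_w02 (logdate);

-- Inserts route to the right partition automatically:
INSERT INTO measurements VALUES (42, 21.5, '2018-01-03');

-- Retiring old data is just dropping (or detaching) a partition:
DROP TABLE measurements_2018_w01;
```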
So, a few tips on that. There's a Postgres extension that smooths out all of those rough edges called pg_partman. pg_partman will automatically create new partitions; as you run a create index, it'll create it on all of them; and you can schedule it to automatically drop the old ones. If you're doing Postgres 10 partitioning, I would not use it without pg_partman. Postgres 11 should be much better and start to remove some of that need, but I would highly recommend using pg_partman.

So, that was really fast and a little high level, but hopefully with a bit of detail and some tips as we went through each model. My biggest piece of feedback would be that sharding is not always easy, but if you need sharding, doing it earlier and getting the best practices in place, like adding your tenant ID to every table if you're in that multi-tenant model, is going to save you a whole lot of time later. Starting to do the groundwork now makes it a good bit easier. Identifying the right approach is really key, and I would say try to rule out approaches. Don't try to force it, saying "oh, we have time series data, so let's go with that path." Try to see where it will break first. It's not going to be a 100% fit; it's going to fit like 98%, and if you can get to that 98%, you're in a good spot. I think sharding was really, really rough a number of years ago, and there are now a lot of tools, both Rails libraries, extensions to make it easier, and native things that have come into Postgres. So it's gotten a lot easier than it was years ago. It used to be a lot more painful. I wouldn't call it fun, it's not my idea of a Friday night, but it will scale once you do it, and then you essentially don't have to worry about scaling for years.

All right, cool. I think that's it, and I might actually have a couple of minutes for questions.