We're going to talk about... I'm Dan, by the way, I'm the CTO at Heap. Today we're going to talk about the problem Heap solves and why it's interesting, the process we went through to design the right schema for it, and the various attempts we've made at getting that right over the years. Just to give you a little bit of background, I joined as Heap's first hire in July 2013. Before that I was at Palantir, where I did all kinds of back-end distributed-systems-y type things, and before that I was at Stanford, where I studied math and computer science. I did an honors thesis on deep learning if that kind of thing is interesting to you, but I don't know anything about that anymore.

So to start I'll give you an overview of what Heap is and why our particular take on the problem of web and iOS and user analytics in general makes it such a difficult problem. Then I'll discuss some of the approaches we've taken toward solving this over the years, a couple of different schemas that we've tried, and the benefits and trade-offs between them. One thing I hope you take away from this, and this is probably not a surprise to most of you, is that the most important choice you can make when you're designing any kind of data system is the schema you use to actually represent that data. That's the foundational choice that determines the trade-offs you can make later: what kinds of write throughput are possible for you, what kinds of reads you can make fast, what kinds of data inconsistencies are possible. Everything else is downstream of that.

So first, a little background on what we are. Heap is a user analytics tool for web and iOS, and soon also Android. The purpose of the product is to help you understand your users. That's everything from what features are being used, to which ad campaigns are actually driving users, to deeper behavioral questions like: if a user sets a profile picture within the first week of using my product, does that make them more likely to be active three months later? Or what kinds of things immediately precede inviting a teammate to their team, and how much does that matter? That kind of stuff.

In order to understand why Heap works the way it does, I'll start by walking you through how you might answer these kinds of questions with a more traditional analytics tool. There are about a dozen tools you can use for this, and they all work basically the same way: they give you an API and you log events against that API. So let's say we have a totally hypothetical website for booking hotel stays and we have some kind of question like: what percentage of our users actually book a hotel? That's a totally fundamental question that you really need to understand the answer to if you're building this kind of product.
So if you've ever tried to answer this kind of question, you've probably seen code that looks like this: you add an event listener to your "book hotel" button that fires a "booked hotel" event against whatever analytics tool you're using. And you might go through other parts of your app, all the critical user touch points, and you'll want to track all those too: viewing a listing, logging in, signing up, entering your credit card, all the stuff you want to track. You pepper your app with all these logging statements, and then when you have some kind of question you can run a query and get your answer. So let's say we want to understand the basic booking-a-hotel funnel: we can figure out what percentage of users who sign in later view a listing, and what percentage book a hotel once they've viewed a listing. Great, we have our answer.

And as soon as we get our answer, we'll notice something weird, like 60% of people who view a listing end up booking, but only 4% of people who sign in view a listing, which seems like an incredibly high drop-off. What are people doing on this website if they don't view any listings? So immediately this lends itself to a dozen follow-up questions, like: are people getting to the search results page and then not clicking through to the listings? That might suggest, for example, that the search results are bad, or that there's no clear call to action on that page telling you that you can see more about a listing. Whereas if people aren't even getting to the search results page, then you might have some more fundamental problem upstream. These are basic questions you need to answer in order to make intelligent product decisions or prioritization decisions.

So you immediately have some kind of follow-up question as soon as you get an answer, and then you have to go through this whole rigmarole of getting someone to write that new logging code, waiting for them to do it, waiting for that code to ship, and if you're on iOS or Android you have to go through the app store again and wait for your users to upgrade. And even if you could just wave a magic wand and have the new code live, you still have to wait for a statistically meaningful amount of data to accumulate. Most products have some sort of intra-week variation, so at the very least you're probably going to want to wait a full week or two just to get a clear picture of your users. So I think at bare minimum you're talking about weeks, and at any sort of large institution or large company you're talking about months to get this kind of stuff logged. And as soon as you get a result you'll have a million follow-ups. Maybe the number comes back weird and you want to track a different thing, or you change the logging code because you got it wrong the first time, or you get an answer to your question and you want to drill down and add 10 more steps to the funnel. The basic point is that analytics is a fundamentally iterative process. You don't know what your unknown unknowns are, and that's usually where all the interesting stuff is. You don't know what you're going to be interested in.
The usual industry approach to solving this problem is to have you sit down with a consultant who comes up with a tracking plan, which you can think of as like a Soviet five-year analytics plan: we come up with every possible thing that we're going to want to have tracked, and then we get every single one of them logged. And then of course the interesting stuff is in the parts you didn't think to log. People are clicking this thing because they think it's a button, and you didn't even intend for it to be a button. People are leaving this page; you tracked your happy flow, but people are leaving through something you didn't even imagine they would leave through. All the interesting stuff is the hidden part. So it's a fundamentally iterative process, which means that the iteration speed, the speed at which you can iterate on these analyses, is the most important thing.

The Heap approach to this problem flows pretty naturally from that. The Heap idea is to just capture everything that happens on a website and then give you the tools to analyze it all retroactively. So we give you a snippet you can include on web; you include this in your site, and we will capture every user event that happens there. Every click, every page view, every form submission, every text field change, all the good stuff. And then there's an equivalent on iOS where we capture every swipe and tap and pinch and all the UI events, and soon there will be an Android equivalent as well.

We'll capture everything, but now you have this second problem, which is that all of that stuff is in a raw form. We're not capturing checkouts or signups; we're capturing a click on a div with this CSS on it and this page view path and this session referrer and all this other stuff. So to make this data actually useful, especially to non-technical people but even to engineers, Heap has a concept of an event definition, which you can think of as a semantic mapping between the raw stuff that Heap is capturing and the stuff that is of interest to the humans doing analysis later. We give you a tool we call the visualizer, which is an overlay on your website, sort of like the Chrome debugger or Firebug if you've used any of those. It's an overlay on your website where you can perform an event and then give it a name, and we'll extract from what you did what the actual definition for that event is, if that makes sense.

To give you a concrete example, you might want to understand signups. So we give you this tool where you mouse over the signup button and then click it, and we give you some features you can toggle, like whether you want to limit this to the current page, and other stuff you can twiddle to make sure you get exactly the thing you care about highlighted on the screen. Then you can give this a name like "signup" and we'll extract from that the relevant CSS properties that specify this button, or the target text if you want to filter for buttons that have the text "sign up" in them, or the page view path, or anything else you want to filter on. And then we can evaluate this definition over all of your historical events, going back to when you first installed Heap. So you go through this flow of actually performing an event that you're interested in, and then we can tell you how often it's been done historically, going back to when you first installed Heap. There's no waiting at all; that's the basic idea. And then once you have these definitions, you can use them for all the sorts of analysis you would want.
So for example, you can have a simple graph. The simplest example would be graphing instances of something over time, or maybe you want to add some filtering, like button clicks filtered for users who are in Canada or something like that. This brings up the first thing that's really difficult about Heap, which is that instead of capturing just the specific things that you whitelist as interesting on your website, or the things that you log, we're capturing everything that happens: all the clicks, all the page views. That's between 10 and 100 times as much data, depending on the customer, depending on the kind of website, really. And 95% of it you'll never care about. 95% of these events will never match any of those definitions; they'll never be relevant to your analysis. It's just noise. But we can't ever throw any of it away. We have to keep it in reasonably warm storage, because you never know when it will become relevant to a new event that someone defines. You never know when someone will be interested in clicks on some obscure button on an FAQ page or something like that. You have to retain all of it.

So anyway, once you have these event definitions you can do all kinds of other analyses with them. You might want to graph something grouped by some other thing, your standard grouped analysis, like signups grouped by which plan the customer signed up for. Conversion funnels, as I mentioned before, all work the same basic way. We give you a retention analysis tool where you can answer questions like: if someone came in from these different ad campaigns, how does that affect the likelihood that they'll still be active weeks later? Or: did the people who joined during various time cohorts have different fall-off rates? All sorts of retention-type things. We also give you segmenting, user behavioral cohorting type features. So we give you, for example, a way to define a segment of users that you might call "active users": people who signed in more than four times in the last week and also either sent a friend request or accepted one, any kind of behavioral definition like this. And then you can graph membership in this segment over time, or filter your other analyses based on it. You might want some retention analysis filtered for people in this segment, or some conversion funnel filtered for people who are not in this segment, that kind of thing.

The point I'm trying to get across here is that the set of analyses we support is really broad and really ad hoc in nature. We can't really pre-aggregate any of your analysis results, because the whole point of the product is that it's iterative and you can drill into things whenever you want. If we only had to support graphs, for example, this would be a much simpler problem, but we have to support this huge range of different analyses. And finally, we want the data to be within a few minutes of real time, ideally within a few seconds of real time. This unlocks a whole new set of use cases that make the product a lot more powerful. To give you a concrete, intuitive one: if the data shows up in your dashboard in five seconds, then it's a lot easier to build trust in the tool, because you can go to your website, do some stuff, and then see it show up, and you can see that it matches the event definitions you set up.
So there's a lot of value in making the data real time, which adds a whole other constraint to the problem of how you actually build this. This is the basic data model that we present to customers. You have a series of users. Each user does a series of sessions. A session is a sequence of activity in which there is no 30-minute pause, if that makes sense. So if you leave for 30 minutes and then come back, you have started a new session. This is just the Google Analytics definition. A session will have many page views, which are what you think they are, and then a page view will have some number of events. An event is all the lower-level user interaction stuff: clicks, page views, scrolls if you're capturing those, text field changes, form submissions, all that stuff.

For the remainder of this talk, I'm going to walk through the different approaches we've taken toward solving this problem over the years, where they fared well and where they didn't. I should also mention that our dataset is impractically large for this product to be served by a single Postgres instance. There are hundreds of terabytes of data here. How you actually distribute any of these schemas is a whole separate topic of conversation; for the purposes of this talk, I'm just going to wave my hands and say "CitusDB" and leave it at that. In practice for us it turns out to be a little more complicated than that because of some of the details of how we've built this product, and the distribution piece adds a whole extra wrinkle to the schema question. There are certainly features of a schema that will make it distribute well or poorly, and that's a whole separate thing to dig into that I'd love to talk about offline if you're interested.

So in trying to design a schema for this, we have a couple of observations we can start with and build around. One is that the data is mostly write-once, never-update. There are a couple of contexts in which we do need to change events, and I'll touch on one of those towards the end of the talk, but for the most part this is an append-only dataset. The second is that the queries map really nicely to a relational model. We're doing a lot of filtering, grouping, and joining here. For example, if you want to compute some conversion funnel and then filter it for users who came in via a particular ad campaign, that maps really directly to computing the conversion funnel per user, looking up the ad campaign each user came in from, joining those, and filtering. In a fundamental sense, I think this is a relational problem. The third is that the events have a natural ordering, which is time, and it's mostly monotonic. All of the queries that come in have some kind of time range they're filtering on. So there will probably be some amount of time indexing in whatever solution we come up with here; maybe we'll have some kind of table partitioning by time or something like that. There's a natural time component, so we'll definitely want to incorporate that if we can. And finally, all the queries we're doing are going to be in terms of those defined events. No one is going to want, out of the blue, clicks on divs that match some particular regex. They're going to ask in terms of those defined events.
They might want clicks on a button on the /checkout page where the button text says "Confirm Order", but that's because it maps to the definition of an event; it's not going to come out of nowhere. So the first way we might try to build this product, based on these observations, is the vanilla Boy Scout schema. This is your textbook third-normal-form, totally normalized schema. To be concrete, we'd have a users table where every user has a customer ID and a user ID and some bag of properties about them. A user has many sessions; a session has a customer ID and a user ID as well, plus a session ID, a time that it happened, and some set of properties. A session has many page views, and it's the same story there. And a page view has many events, and it's the same story there. And we did it, we're done, we built this product, it's awesome. I'm feeling like a real database solutions architect over here. I think we nailed it. That was really easy, we're all done.

Jokes aside, there's actually a lot to like about this kind of approach. The biggest thing is that it's really simple and easy to understand, and I think that counts for a lot. It means that your data model under the hood maps to the data model you're presenting to users, which means it's easier for users to model the behavior of the system and understand it. It also means that, for example, you won't get into situations where a totally plausible feature request is something you inexplicably can never support because under the hood things are very different and complicated compared to what the user is seeing. So having the mapping between what you show to the user and what you store in the database be clean and simple has a ton of value. Another thing that's really great is that all of the SQL tooling and the innards of the database are designed around this kind of schema. You can write plain old SQL and it will be familiar and simple, and you won't have to learn some weird task-specific DSL. This will play nicely with any kind of ORM. Stuff just works out of the box. This is particularly valuable if you're trying to get a product working by Y Combinator demo day, or you're just trying to get something live before you have massive scale to deal with.

Another thing that's really nice is that there's not a lot of surface area for data inconsistencies. There's not a lot of data stored in multiple places that can get out of sync. You have a lot of ability to set constraints: you can set primary keys for all those tables, you can set foreign keys to map between them, and so on. One interesting note is that for Heap in particular we could never have those foreign keys, because at the time we couldn't control what order the sessions, page views, and events would hit our servers in, so we actually did need to tolerate inconsistency there. But in general this kind of schema gives you a lot of tools for keeping your dataset solid, which is really important. And if it wasn't clear, and I put this in bold just to make it super clear: you should basically always start here. You should basically always start with a simple, totally normalized, traditional SQL schema. This is idiomatic; your database is designed to work this way, and the tool chain is designed to work this way.
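To make that concrete, a normalized layout along those lines might look roughly like this. This is a sketch of the shape of the thing, not our exact DDL; the names and types are illustrative.

    CREATE TABLE users (
      customer_id bigint NOT NULL,
      user_id     bigint NOT NULL,
      properties  jsonb,
      PRIMARY KEY (customer_id, user_id)
    );

    CREATE TABLE sessions (
      customer_id bigint      NOT NULL,
      user_id     bigint      NOT NULL,
      session_id  bigint      NOT NULL,
      time        timestamptz NOT NULL,
      properties  jsonb,
      PRIMARY KEY (customer_id, session_id)
    );

    CREATE TABLE pageviews (
      customer_id bigint      NOT NULL,
      session_id  bigint      NOT NULL,
      pageview_id bigint      NOT NULL,
      time        timestamptz NOT NULL,
      properties  jsonb,
      PRIMARY KEY (customer_id, pageview_id)
    );

    CREATE TABLE events (
      customer_id bigint      NOT NULL,
      pageview_id bigint      NOT NULL,
      event_id    bigint      NOT NULL,
      time        timestamptz NOT NULL,
      properties  jsonb,
      PRIMARY KEY (customer_id, event_id)
    );

    -- No foreign keys between these: as mentioned above, sessions, page views,
    -- and events could hit our servers in any order, so we had to tolerate
    -- temporary inconsistency instead of enforcing referential integrity.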
You'd also be very surprised at how far this will get you for most products. You would be shocked at how many years into scaling a lot of companies are still using a very simple schema like this, and it still works fine. In our case this got us to demo day, a proof of concept, and some basic small customers. The main con is that for any of the analyses we're doing, there are just too many joins to make things fast from a practical point of view. Even for a customer at small scale, you're doing an enormous amount of joining to answer questions. One thing I should clarify, if it was ambiguous, is that those event definitions I mentioned before can include properties from the event or the page view or the session. For example, you might filter an event for clicks that were on a certain page path, which is a property of the page view, or clicks that came from a certain ad campaign referrer, which is a property of the session. So even evaluating simple event definitions requires joining up all this data. And if you want to do something more complicated, you're pretty hosed. Say you want to define a segment called "evangelists", users who have done the "refer a friend" event three times ever, and you want to take some conversion funnel and filter it for users who are not evangelists. This is not a pathological example; this is exactly the kind of question you would want this product to excel at answering, and it was just inordinately slow. No matter what we indexed or what we tried, we couldn't make it fast enough. So this fell over pretty badly for our larger customers. Just to give you a ballpark reference point, a larger customer at that point in time meant on the order of a hundred thousand sessions a month. So not the biggest websites in the world either; some of you probably have blogs that get that order of scale.

At this point we can make another important observation that might help us make this product work better on the read side of things, which is that all of the aggregations we're talking about partition cleanly at the user level. What I mean by that is that all the aggregations we're doing can be computed independently for disjoint sets of your users and then re-aggregated later, and it works out cleanly. A simple case would be the incidence rate of an event over time: you can compute that for different sets of users and then sum. A more interesting one might be a conversion funnel. If you want to know what percentage of users who add an item to their cart later check out, you can compute this independently for each user, maybe emit a one if they did each step and a zero if they didn't, and then re-sum across your users. So a lot of these analyses require assembling a lot of a user's record in one place, but none of them require any cross-user joining except for aggregation.

This leads to our second attempt at getting the right schema: why don't we just denormalize everything onto the user row? We had too many joins, so why don't we just get rid of all the joins? You can't have slow joins if you don't have any joins. Let's try it. To be a little more concrete, that looks something like this: you have one table called user_events where there's one row for every user that we've ever seen.
Each user has a customer ID and a user ID like before, a time they were first seen (which will come in handy later), a bag of properties, which is the same sort of key-value stuff you can attach to a user that we had before, and one big array of JSONB blobs with all the events that this user has ever done. And then to do analysis on this kind of schema, working with the Citus team, we deployed a bunch of C extensions that make it possible to do analysis on it. For example, we had a count_events UDF that chugs through one of these arrays of events with some event definition, that's what the pattern is, and just returns the number of events in that array that match. So your pattern here might be: type is click, and page view path is /checkout, and CSS hierarchy matches this glob. This will chug through that events array, evaluating the pattern on each element and counting the number of hits.

The more interesting one is for conversion funnels: we had a UDF called funnel_events which took an array of patterns, one for each of the definitions in that funnel. For example, if you're doing a funnel from "added item to cart" to "checked out", that pattern array would include two things: the first is the definition of add-to-cart and the second is the definition of checkout. It'll chug through the events array looking for the first instance of the add-to-cart thing, and then once it finds it, it'll go from there looking for the first instance of checkout, and then it'll emit an array of ones and zeros for that user, with ones for the steps the user did and zeros for the steps the user didn't get to. And then we can aggregate on top of that.

I know that's a lot of words really fast; I think it might be simpler if I just show you what a query might look like. Say we have an array of three raw events at the top, and two of those definitions at the bottom: the definition of the first step in the funnel is "did something where the foo field is set to abc", and the definition of the second one is "has a city property that ends in 'ancisco'". Then funnel_events would chug through that array and find that the second event in the top array matches the first event definition, and the third event matches the second one. So the user did both steps in the funnel and we'd emit a one and a one. Whereas if we flip those definitions at the bottom, we'd chug through the array, find that the third event matches the first definition, and then there's nothing left to be matched after the third event, so in that case we'd emit a one and a zero, because the user didn't get to the second step. It's worth thinking about what this would look like in vanilla SQL: you'd need a whole bunch of sub-selects and maybe some lateral joins, and it's a harder query to write manually and certainly a harder query to build up if you're doing it in vanilla SQL. But the trade-off is that you have to learn this tool-specific query language: someone at your company has to learn how funnel_events works, how to use it, and how to make changes to it. So there's a trade-off there as well.
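Very roughly, the table and the kinds of queries looked something like this. The UDF signatures and the pattern format here are illustrative; I'm reconstructing the shape from the description, not quoting our production code.

    CREATE TABLE user_events (
      customer_id     bigint NOT NULL,
      user_id         bigint NOT NULL,
      time_first_seen timestamptz,
      properties      jsonb,
      events          jsonb[]     -- every event this user has ever done
    );

    -- Count a defined event: scan each user's array for matching events, then sum.
    SELECT sum(count_events(events,
               '{"type": "click", "path": "/checkout"}'::jsonb))
    FROM user_events
    WHERE customer_id = 42;

    -- Two-step funnel: per user, emit 1/0 for each step reached, then re-aggregate.
    SELECT sum(steps[1]) AS viewed_signup, sum(steps[2]) AS submitted_form
    FROM (
      SELECT funnel_events(events,
                           ARRAY['{"type": "pageview", "path": "/signup"}'::jsonb,
                                 '{"type": "submit",   "path": "/signup"}'::jsonb]) AS steps
      FROM user_events
      WHERE customer_id = 42
    ) per_user;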
And then a more realistic query that we might run would look something like this. For each user row we have that events array with every event they've ever done, and if we want to compute a funnel for that user, we pass in the two event definitions: "viewed the sign-up page" is the first one, and the second one is "actually did the form submission". It goes through that array and figures out, for each user, which steps they got through, and then we re-sum. Somewhere in there we're filtering for the customer as well, because all of these queries are specific to a particular customer, and we'll emit something like: 110 users did the first step and 20 did the second.

So this was a considerably more viable schema for us than the previous one. This got us to the point where I think we had real customers paying us real money. The fact that you have no joins and are just doing aggregations meant we could serve considerably larger customers. We're talking on the order of maybe a million sessions a month, maybe a little more than that, let's say two million sessions a month. If you're keeping track, that's about 20x larger of a customer than we could serve before. Another thing that's really nice about this is you can run pretty sophisticated analyses via these extensions like funnel_events. Any kind of analysis you want to build, you can assume you have an array of every event the user has done and write some code that goes through it and emits the thing you want. As long as your analysis shards cleanly by user, which ours do, this all works pretty well. Maybe another way to think about it is that there isn't much of a sophistication penalty. If you want to run a seven-step funnel filtered for "has never done some unrelated thing", that's going to take about the same amount of time as a two-step funnel with no filter, because it's all just one scan over an array. Another thing that's nice about this schema is it's very easy to distribute, for reasons that I hope are clear: there's just one row per user, and you just have to make sure there's a well-defined mapping for which user lives where. Another thing that's really nice, and might not be obvious, is that those event arrays are all going to be TOASTed. Postgres has this system called TOAST, The Oversized-Attribute Storage Technique: if a column value is larger than a couple of kilobytes, Postgres will store it off-table, keep a pointer to its position in another lookup table, and compress it for you. And this data compresses very well; it's a bunch of repetitive JSON blobs with a bunch of properties shared between them. We got about 2x to 2.5x compression on this dataset, which saves you a ton of space and a ton of IO. So that kind of thing works pretty nicely.

There are obviously some limitations here. The most painful one is that you can't index for any of those event definitions, or even for the event fields. An array of JSON blobs in Postgres is not an array of pointers to JSON blobs; it's a series of JSON blobs concatenated. You can't even access subfields of the array cheaply. For example, if you want to get the 10th element in the array, you have to load up the whole thing, detoast the whole thing, and then iterate from the beginning of the array, parsing things until you hit that 10th event.
It's not even well defined where the 10th blob will be, because it's a variable-width type. So you can't really index for those event definitions, or the presence of fields, or event times, in any meaningful sense. You can't read anything less than the whole event array for a user, which is quite painful. And another thing that's really painful is that those arrays just keep growing and growing, and you're doing a lot of operations that are linear in the size of the array. So you get this really bad performance curve over time, where someone uses your product for two weeks in a trial and everything's speedy and working, and then they pay for it, and stuff gets slower over time, and then nine months later when they're considering renewing, it doesn't work at all. This is not a happy place to be. People feel like they got bait-and-switched. This does not scale gracefully with a customer's use of your product.

So we tried a whole bunch of hacks. One thing we tried was storing a time_last_seen for each user, which should be the time of the last event in that events array. Another thing we did was store a sub-array that's just the last week of events, because a lot of your analyses are going to be specific to the last week of data. We don't have any way to access meaningful sub-portions of that events array, but we can at least store a separate array that's only the stuff we want for the large percentage of queries that are specific to the last week. And I just want to note that I'm not ashamed of this. If you had asked me about this maybe four years ago, when I started working here, I would have told you that's gross and non-idiomatic and wrong, and it's not how you're supposed to do SQL or build products. But this bought us months of runway and viability of the product, and I would absolutely do it again; in fact I'd probably try to find more hacks.

To give you an example of how we might use these things: if we're in an optimistic scenario where your query is specific to the last week of data, then instead of using that full events array we can use the events_last_week array, which is hopefully much smaller, so you're reading a lot fewer events. And we can't filter for users who were active in the time range, but we can at least filter out users who haven't been seen since before the time range started, or who were first seen after the time range ended. So we can at least filter out a lot of users that way. Your optimistic scenario here is a SaaS product where users are very active and coming back every day, in which case the events_last_week optimization buys you a lot, because there are a lot fewer events in the last week for a user than over all time. At the other end of the spectrum, if you have a very bouncy website where people come in once and never come back, like a content product or just a bad website, then the time_last_seen and time_first_seen filters will filter out most of your users, because most of them won't even be seen across multiple time ranges. So this certainly bought us a lot, and this kind of thing is always worth iterating on as well.
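Using those hacks, a last-week query could look roughly like this; again, the column names are my reconstruction of the idea, not the exact production schema.

    -- Checkout clicks over the last 7 days: read the much smaller last-week
    -- sub-array, and skip users whose last event predates the range entirely.
    SELECT sum(count_events(events_last_week,
               '{"type": "click", "path": "/checkout"}'::jsonb))
    FROM user_events
    WHERE customer_id = 42
      AND time_last_seen >= now() - interval '7 days';
    -- For ranges that ended in the past, a "time_first_seen <= range end"
    -- filter rules out users in the same way.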
So there's one other limitation here that is very severe, which is that your write path is very, very painful. How many people here know what MVCC is? Okay, looks like most of you, that's good. And how many know how that works in Postgres? Can anyone tell me why this is a really painful schema on the write path? Right, exactly. You're rewriting that array every time. There's no appending in place in Postgres: when you append an event to that array, you have to rewrite the whole user row. You rewrite the whole row every time you modify it. So this schema bloats like crazy, because every time you append to that array you're rewriting a whole row. If a user does hundreds of events in a row, you have hundreds of full copies of that user's data replicated; it's horrible. So you have horrible bloat, you have to crank your autovacuum up to 11 and spend all your IO there, and simple maintenance gets very expensive. If you want to keep that array sorted and run a periodic job that sorts it, that requires rewriting huge swaths of data. One thing we noticed around this time was that our write-ahead log backups were about 100 times the size of the actual database, which is incredible: we were rewriting hundreds of times the size of the database per week. I actually took a picture of this; this is from a weekday morning in the fall of 2014. It looks like a tire fire, but that's actually the graph. This is the kind of bloat we were talking about: about 500 gigs of bloat per weekday, and then it levels off at night when the autovacuum can keep up because the data volume is lower, and then during the day it starts bloating again. And then on Friday nights we'd have a VACUUM FULL party where we'd just VACUUM FULL huge swaths of the database and try to reclaim it all. This is not a fun part of my life, or a fun schema to be supporting.

Another observation we can make around this time, and this is more of a tweak on one of the previous ones, is that not only are those analyses expressed in terms of defined events, the event definitions themselves are very sparse. The percentage of clicks on your website that are on the checkout button is going to be really small, and Heap keeps capturing this huge diversity of stuff. So if you're doing analyses in terms of event definitions that are 1,000x or 1,000,000x selective, you really want some way to index for that. That should be your first-order concern, so we want to optimize around it.

This brings us to attempt number three, which is: keep the events denormalized, but split them out from the users so that we can index something on the events themselves. To be a little more concrete, that looks like a users table with a customer ID, a user ID, and a properties blob just like before, and an events table with every event that each user has done. So there's a customer ID, a user ID, an event ID, a time (because every event happens at a time), and a data blob, which is the full denormalized blob for that event. That means all the session properties and page view properties will be on there, replicated across multiple rows, but it's only one event per row. So this lets us do all kinds of creative indexing.
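Concretely, that split looks roughly like this (names and types illustrative, as before):

    CREATE TABLE users (
      customer_id bigint NOT NULL,
      user_id     bigint NOT NULL,
      properties  jsonb,
      PRIMARY KEY (customer_id, user_id)
    );

    CREATE TABLE event (
      customer_id bigint      NOT NULL,
      user_id     bigint      NOT NULL,
      event_id    bigint      NOT NULL,
      time        timestamptz NOT NULL,
      -- the full denormalized blob: event fields plus the page view and
      -- session properties, repeated on every event row
      data        jsonb       NOT NULL,
      PRIMARY KEY (customer_id, event_id)
    );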
For example, if we're going to be doing a lot of repeated analysis on checkouts, which is something we know because you defined the checkout event, then we can create a partial index. If you've never used a partial index in Postgres, you can think of it as an index with a predicate on it. Your DDL command looks something like this: you're creating an index, and you have a WHERE clause on it, and that WHERE clause maps directly to the event definition. If you define a checkout as a click on this path where the CSS hierarchy matches this glob and the target text says "Confirm Order", there's a direct mapping there. The rule is that the index contains exactly the rows for which that predicate is true, and then we can use it when we're doing analysis later. For example, if we want to graph checkouts over time, we'll have some query like the bottom one, where you're filtering for checkouts and also for some time range and some customer ID. The planner is smart enough to realize that the predicate on that bottom query implies the predicate on the index, so any row we could possibly want in this result set has to be in that index. If the planner can figure out that the definition is very sparse, it can just use that index as a whitelist for the specific events we care about. So we have a massively reduced amount of stuff that we're reading; we're only reading the rows that actually match these definitions. And if that's somewhere between one-in-1,000 and one-in-1,000,000 selectivity, that's a huge win.

So the general strategy here is: you have to structure the event table such that every event definition can be expressed as a row-level predicate. You can't do cross-table joins in these definitions, and you wouldn't want to anyway. And then under the hood we maintain a partial index for each of those predicates. Because the variety of stuff Heap captures is so massive, any individual event definition is very selective. And this fits really cleanly into our retroactive analytics framework, because we already have the raw data: you just create the index, and Postgres handles computing it over your historical data and maintaining it after that.

Your general read-path strategy then looks something like this. You have analyses that shard cleanly by customer and user, as before, but instead of having one big array of every event the user did, we dynamically build up an array of just the relevant events for each user. For example, for a conversion funnel from add-to-cart to checkout, we build up the array of only the add-to-carts and checkouts, and then pass that into the same UDFs we had before, because we still have those. And then you can do all your arbitrary SQL after that, for the joining and filtering and drill-down stuff you want to do. So there are a lot of pros here. The biggest one is the read performance: we can get excellent read performance out of this. There are a couple of caveats and edge cases, and I might get into those if there's time, but in general we can make reads tens or hundreds of times faster here.
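To sketch what that partial-index strategy looks like end to end; the specific predicates here are illustrative, but the shape is the important part: the index's WHERE clause is the event definition, and the query repeats it so the planner can prove the implication.

    -- One partial index per event definition.
    CREATE INDEX event_checkout_idx ON event (customer_id, time)
    WHERE data->>'type' = 'click'
      AND data->>'path' = '/checkout'
      AND data->>'hierarchy' LIKE '%#confirm_order%'
      AND data->>'target_text' = 'Confirm Order';

    -- Graph checkouts over time. The query's predicate implies the index
    -- predicate, so Postgres only reads the tiny set of matching rows.
    SELECT date_trunc('day', time) AS day, count(*)
    FROM event
    WHERE customer_id = 42
      AND time >= '2016-01-01' AND time < '2016-02-01'
      AND data->>'type' = 'click'
      AND data->>'path' = '/checkout'
      AND data->>'hierarchy' LIKE '%#confirm_order%'
      AND data->>'target_text' = 'Confirm Order'
    GROUP BY 1
    ORDER BY 1;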
Another thing that's really nice is that this schema permits a lot of creative indexing and querying strategies. We're not limited to exactly the stuff we can express in a UDF that has to load the full array; we can use partial indexes and weirder, more exotic join techniques, and there are a lot of tools at our disposal to make analyses fast. There's much, much less write-time IO cost, because instead of rewriting these whole arrays, we're just writing one row to a table. We do have to maintain all these indexes, but because those definitions are so selective, we're not writing to them very often. A typical event that comes in has to be checked against all those predicates, but it will probably only match one or zero of them, so the write-time IO here is very cheap. And finally, Postgres is managing a lot of the complexity for us. We get, for example, graceful degradation in terms of the correctness of your analysis: if one of those indexes is missing, the Postgres planner will just fall back to one of the built-in indexes or do a table scan, but it will never lie to you. We don't have to track which indexes are valid at query time; Postgres just manages all that for us.

But there are certainly some limitations. The first is that you do have to maintain all these indexes. Our biggest customers have maybe 2,000 of those event definitions, and evaluating all those predicates for every new event costs a lot of CPU. Every event that comes in, you have to evaluate possibly thousands of definitions, and a lot of them contain a LIKE glob, so there's a pretty meaningful amount of CPU that has to go there. The details are actually really cool and I'm happy to talk about them offline if you're interested. Another thing that's really tough about this schema is that you don't have any meaningful statistics, because everything lives in the JSON blob. There aren't meaningful stats on JSON, so Postgres doesn't know which keys are common, which values are common, what the histograms look like, anything like that. It's making wild guesses about how common these events are. In extreme cases we had to just disable nested loop joins across the database, which is usually a huge no-no, but we had no way to keep it from doing explosively bad nested loop joins because we didn't have the stats. And finally, a lot of those assumptions are a bit restrictive and don't degrade gracefully: if those indexes are missing, you might get a bad performance cliff, because the fallback to a base index might be really bad.

So I'm going to amend one of those observations again: not only are these analyses in terms of defined events, which are very sparse, the definitions themselves are pretty predictable. You can define an event on anything, but most of them, maybe two thirds, only use a small set of properties: the page view path, the CSS hierarchy, the target text, a couple of other things. I think there's a list of seven that we use. So we can take advantage of that, and this brings us to attempt number four, which is: keep the one-row-per-event table, but pull the common fields out into their own columns. So instead of an event row with a data blob that contains everything in it, we pop all of the common fields out into their own columns.
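A sketch of that, with my guess at the pulled-out columns based on the list above:

    CREATE TABLE event (
      customer_id bigint      NOT NULL,
      user_id     bigint      NOT NULL,
      event_id    bigint      NOT NULL,
      time        timestamptz NOT NULL,
      type        text,        -- 'click', 'pageview', 'submit', ...
      path        text,        -- page view path
      hierarchy   text[],      -- CSS hierarchy components
      target_text text,
      data        jsonb,       -- everything else stays in the blob
      PRIMARY KEY (customer_id, event_id)
    );

    -- A small set of general-purpose indexes (the kind described next) can now
    -- replace many of the per-definition partial indexes, because the planner
    -- has real statistics on these columns.
    CREATE INDEX event_type_idx      ON event (customer_id, type, time);
    CREATE INDEX event_hierarchy_idx ON event USING gin (hierarchy);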
And there's a big win you get here. The first part is that your dataset is about 30% smaller, because instead of storing the keys for those properties, the strings "type" and "hierarchy" and "target text", in every JSON blob in every row of your table, that information is just in the schema; you don't have to store it anymore. And the planner has stats now. Because it has statistics on these columns, on the distributions of these values, you can use it in much more ambitious ways. For example, we can have separate indexes on each of these fields that are specific to the type, like a GIN index on the CSS hierarchy components and a B-tree on the event type, and because the planner has stats we can reliably get good plans now. We can get bitmap scans that combine these indexes. So we can throw away maybe 60% of those partial indexes, replace them with a much smaller set of simpler indexes, and trust the planner to produce good plans. The upshot is we save about 50% of our write-time CPU, because we can get rid of tons of those predicate evaluations. We pay about 50% more write-time IO, because we have to maintain these full-column indexes instead of all the really sparse ones. But this eliminates a lot of the edge cases too; it degrades a lot more gracefully. For most of your events you don't have to create a partial index anymore, so an event definition is immediately fast when you want to use it, and if a partial index is missing, you can fall back to a base indexing strategy that generally works a lot better than nothing. So there are a lot of wins from this kind of thing too.

What I hope is clear from all this is that if you build software on Postgres, or really any data system, you'll find that there are often a lot more ways to represent your problem in terms of schemas than might be obvious initially, and there are tons of trade-offs and considerations that go into picking the right one. It determines everything about what is possible to index and what data inconsistencies are possible, and you'll also probably find, if you're like us, that you learn more about the problem over time and figure out what the correct trade-offs for you actually are.

I'll give you one last thing to think through, just another interesting problem. We have an API where you can tag a user with an identity, and if two users have the same identity, we'll combine them into one user record so you can do cross-device tracking type things. How you represent that merge when users get combined is interesting. There are tons of different ways you can do it, with lots of different trade-offs. For example, you can do an update where you set the user ID to the other user ID, but now you have to make sure all the events are there before you do that update. You can do an update, delete the old user record, and leave a tombstone behind that points to the new user. You can keep a join table and resolve them at read time, but now your distribution strategy might be a lot more complicated. There are tons of different ways to do it, even for something this simple. So that's something to think through; ask me about it offline if you're interested.

I think I'm out of time. I might have a minute for questions, or should I just take it offline? Yeah, okay, I have about two minutes for questions, and you're also welcome to bother me online or just come grab me after if you have anything. Sure. The question was: can you give me some kind of timeline on this?
So I joined in July 2013, when we were on schema number one. We were on schema number two by early 2014, and then we were on schema number three by early 2016. It took 16 months to build out a lot of the pieces there, which was very painful, but some of those hacks came along the way. So that's going from maybe 100,000 sessions a month, to two million sessions a month, to 75 million sessions a month. A massive increase in the size of customers, and we were also serving those large customers a lot better than we had ever served large customers before. And then we started doing research on number four in early 2016, and we're still rolling out pieces of it now. Yeah. Do you migrate data between each? The question was: do you migrate data between each version? Anytime you're dealing with hundreds of terabytes of data and you can't ever be down, any kind of schema change like this is another whole axis of optimization that I didn't touch on at all. Different schemas are easier to change later or give you more flexibility, which I think is the whole appeal of the NoSQL thing, or of Mongo at least, or of JSON: it gives you all that flexibility. But yeah, even figuring out how to migrate between these was a whole thing. Sure, I've got time for one more. Cool. Thanks guys.