Hello, everyone. I'm Stephanie, and I'm a full-stack developer at Hotjar. As was already explained, Hotjar is an all-in-one tool for user analytics and feedback, so we get loads of user data. But is it really big data? How big is big data? At Hotjar we have over 400 billion recordings, we have 4.5 billion records in a single table, and our recordings come in at thousands per minute. We also have an overall request rate of 750,000 requests per minute. But is this really big data? What defines big data is basically that we start outgrowing the normal relational tools and need to move to a new set of tools which are more suited to bigger data. At Hotjar we're nearing this limit. Sometimes we play a few games with Postgres as well, because we're nearing the limit of either storage or performance for certain tables. So this talk today is about the good decisions we made with storing our data in Postgres, which I would like to share with you, and also the things we had to figure out in production, which were quite surprising and adventurous at times, and the tools we used to handle those situations. We're going on a journey together: we're going to imagine that we're building our database from scratch, as though we have nothing. I will first go through the good decisions we made, and then I will explain the tooling which helps once we deploy live. Ready? Okay, let's go. The first thing we need to consider is normalization. Textbook data normalization is a three-step process by which we eliminate fields from a table which are not directly dependent on the primary key. However, in real life, we need to strike a balance between the consistency of the data in our data store and its performance, so we need to make decisions in a way that makes sense.
Postgres offers us the best of both worlds, because we can have schema-less data inside a JSON field, and we can also have relational tables. What makes us choose one or the other is the data itself. At Hotjar, we have data relating to our users, our subscriptions, and our accounts, and in this case it's very important that the data remains consistent. So we use the referential constraints of the database, and we have very normalized tables there. However, the user data we gather from surveys or polls is a little more dynamic, and it's not so well defined. In that case, we can afford to put it in a JSON field and do the validation at the API level using JSON Schema, or at the model level using Marshmallow schemas. This gives us the best of both worlds, because the storage doesn't need to verify the contents every time.

Okay. So, now that we're more or less happy with how it's going to work, the product owner needs to make a few changes. Where are we going to add this new field? Though it's very nice to know beforehand how the product is planned to evolve so we can plan future changes, it's not always possible. The idea is that as the product evolves, we need to keep our models healthy, to achieve code maintainability and readability, and to understand the journey of our models, because our database might still hold data from previous definitions. The way we do this at Hotjar is by versioning our models: when they are in normalized tables, we can add a version column, and when they are nested inside JSON fields, we can add a version field inside the JSON. This way we can have new fields, and we can also identify the behavior of the different versions of our data at the API layer. Okay. So, what's next? Now we are ready to use our tables from the code, because we're pretty well defined with what we want to do.
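The model-versioning idea above can be sketched like this. It's a minimal sketch assuming a hypothetical survey payload; the field names ("version", "answers", "responses") are illustrative, not Hotjar's actual schema:

```python
# Version-aware handling of schema-less survey data stored in a JSON column.
# The API layer dispatches on the "version" field nested inside the JSON,
# so rows written under older definitions can still be read correctly.

def normalize_survey(payload: dict) -> list:
    """Return a uniform list of answers regardless of payload version."""
    version = payload.get("version", 1)
    if version == 1:
        # v1 stored answers as a plain list of strings
        return [{"text": a} for a in payload["answers"]]
    if version == 2:
        # v2 stores richer answer objects under a different key
        return payload["responses"]
    raise ValueError(f"unknown payload version: {version}")

old = {"version": 1, "answers": ["yes", "no"]}
new = {"version": 2, "responses": [{"text": "yes", "score": 5}]}

print(normalize_survey(old))  # → [{'text': 'yes'}, {'text': 'no'}]
print(normalize_survey(new))  # → [{'text': 'yes', 'score': 5}]
```

The same dispatch works for a version column on a normalized table: the code keeps one canonical shape, and only the per-version readers know about history.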
So should we use ORMs? ORMs give us a few benefits, one of which is that we think at a higher level of abstraction. We don't need to think about this table being joined with that table; we think at a higher level, more about the business logic, and how to make that readable in code. ORMs also give us sane defaults for table creation. ORMs and frameworks together are also a very nice benefit, because some frameworks are very tightly knit to their ORM. For example, the Django admin interface is very heavily based on the Django ORM, and it gives us nice goodies for free, so that we can view the data from an admin interface without having to write all the creation, updating, and deleting code. When it comes to handling transactions in a web application, there's, for example, Flask-SQLAlchemy, which will manage the transactions for us within a web request and ensure they're committed at the end. Again, this is heavily based on the SQLAlchemy ORM.

There are some parts of ORMs which are advertised as benefits but may not be, so I want to put a little note here to be careful. One of the advertised benefits is that you can switch between database backends, so it's tempting to use Postgres as your production database and SQLite as your test database, because the ORM allows it. However, this may lead to discrepancies, and you might get surprises when you deploy your code to production, so it's best to keep your test environment as close to your production environment as possible. Another one is automatic migrations. Certain ORMs support automatic migrations; however, if you have a high-throughput server which is receiving lots of requests, the SQL generated by the ORM for the migration might not be optimal, in the sense that it might take locks on tables, and this will cause request times to spike.
So you might need to rewrite the SQL to make it more performant: do parts at a time, or build indexes concurrently. Don't just trust the migration code generated by the ORM; check what your needs are in this regard.

Okay. So, when we speak about ORMs, what about performance? ORMs do give us a performance hit, so how do we handle that? Should we use stored procedures? Well, at Hotjar we use both ORMs and stored procedures. Whenever we require better performance, we drop into stored procedures. However, there are of course cons to this. First of all, we spread the business logic from a single module to two different places: half is deployed on the database inside the stored procedure, and half in the web server in the API, so it makes things less readable. Also, stored procedures run within a single transaction, so if you have lots of things going on in them, they might increase your chances of deadlocks. The good practices are: if you're going to use stored procedures, keep them small and modular, and version them as well, so that if something goes wrong with a new version of a stored procedure, you'll be able to roll back your code easily and rely on the previous version.

Okay. So, I think now we're ready to go live, right? We made some performance considerations, and we're happy with our model. Now that we're running live, things are different, because we cannot predict how our requests are going to use the database. Once we're running live, we identify that we have slow queries. How are we going to find out which queries are slow? I hope for your sake that you have monitoring set up; in that case, you will be able to see which web requests are not performing as well as you'd like, but this doesn't necessarily point us to the query, because the web request might be doing many things.
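To illustrate the earlier point about migrations: a hand-tuned rewrite of an ORM-generated index migration might look like this sketch (the table and index names are illustrative, not from the talk):

```python
# Hand-tuning an ORM-generated index migration for a busy table.

# What a migration tool typically generates: the plain CREATE INDEX takes a
# lock that blocks writes to the table for the duration of the build.
generated_sql = "CREATE INDEX ix_recordings_site_id ON recordings (site_id);"

# The rewritten version: CONCURRENTLY builds the index without blocking
# writes, at the cost of a slower build. Note it cannot run inside a
# transaction block, so the migration must commit before executing it.
rewritten_sql = (
    "CREATE INDEX CONCURRENTLY ix_recordings_site_id "
    "ON recordings (site_id);"
)

print(rewritten_sql)
```

The same reviewing habit applies to generated `ALTER TABLE` statements: anything that rewrites or exclusively locks a large table deserves a second look before it ships.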
In this case, we pull pg_stat_statements out of our toolbox, and this will give us the call frequency and the timing of the queries being run, to help us identify which one is slow. Once we know the culprit, what do we do? EXPLAIN (ANALYZE, BUFFERS) is the next tool we would like to use. When we run EXPLAIN, we get back the query plan which Postgres will use to actually give us our results. When we run EXPLAIN ANALYZE, Postgres will actually run the query and report the exact timing it takes. When we add the BUFFERS option, it will also tell us how many pages Postgres is reading to give us the data back. There are two numbers given to us for buffers: if there is a high hit count in the shared pages, it means that at least the results are coming from cache. However, if there is a high number of pages in the read buffers, the data is coming from disk, which is also slower. As a general rule, the number of buffers touched should never be too high, because in that case you might be going over too much data, and you might need to optimize.

Other things to look out for are nested loops and full table scans; these might require indexes to improve them. Also take a look at how you're querying your data. If you're always filtering on two specific fields, you might benefit from a composite index, where the index takes both fields into consideration. And if you're doing some computation on a field before querying for it, you might benefit from a computed (expression) index, which stores the computed value within the index itself; if you're transforming the data in the query, consider putting that transformation into the index you create. Now we're a bit smarter with our queries, but what if Postgres is not choosing the optimal plan?
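The diagnostics and index types just described might look like this in SQL; a sketch with a hypothetical events table and columns, not Hotjar's schema:

```python
# SQL sketches for the tuning steps above, carried as strings.

# Run the query for real and report timings plus the pages (buffers) read.
explain_sql = (
    "EXPLAIN (ANALYZE, BUFFERS) "
    "SELECT * FROM events "
    "WHERE site_id = 42 AND created_at > now() - interval '1 day';"
)

# Composite index: helps queries that always filter on both columns together.
composite_sql = (
    "CREATE INDEX ix_events_site_created ON events (site_id, created_at);"
)

# Computed (expression) index: stores lower(email), so a filter like
#   WHERE lower(email) = 'x@example.com'
# can use the index instead of recomputing lower() for every row.
expression_sql = "CREATE INDEX ix_users_lower_email ON users (lower(email));"
```

In the EXPLAIN output, "shared hit" pages came from cache and "read" pages came from disk, which is the distinction made above.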
If the query plan is still not performing well enough, is it Postgres's fault? Let's try educating the query planner, but at this stage I have a big disclaimer to make: he's a very smart guy, so he's usually right. We need to be careful here. Let me go through how the query planner actually makes its decisions. When we run ANALYZE on a table (not EXPLAIN ANALYZE now, but the ANALYZE you run with VACUUM), the query planner's statistics get updated: the planner takes a sample of rows from your table and bases its query plans on the resulting statistics. However, if you have a table in which columns have a very irregular data distribution, you might benefit from giving the query planner more statistics to base its decisions on, so that it can come up with a better plan. In this case, we can use ALTER TABLE ... SET STATISTICS to increase the per-column statistics target, so that the query planner can make better decisions. But other than that, he's usually right.

So now we've made quite a lot of improvements to our query plans. However, some table joins are still very expensive when running web requests, so what can we do about that? Materialized views and materialized tables can help us in this direction. A materialized view is created from a query: the result of the query is computed once and stored in the view, and whenever we need to update it, we call a refresh to get fresh data. This makes a lot of sense whenever you need to generate reports daily or hourly, because you can take a bit of time to refresh the data. However, when the data needs to be updated continually, it can be expensive to refresh the materialized view, especially if it's a big query. In that case, we can get around this by building our own materialized tables, which are tables that we populate at the same time as we populate the normalized tables. Let's take an example.
So if I'm entering subscription details and I need to populate the account table, the customer table, and the subscription table, I would also, in the same transaction, insert data into the materialized table, which combines the data from these three tables, so that when I read from the materialized table, I don't have to join. And this gives me the performance benefit. Another way of doing this is by having a trigger on one of the normalized tables, which will then populate the materialized table itself.

So we've learned quite a few tricks so far. However, aggregations deserve a special mention. Would you run a SELECT COUNT(*) in a web request? Well, if you know the results returned are always going to be of limited size, it's fine, but usually you might get surprises, because people tend to use your APIs in ways you cannot predict, so it might give you a performance hit on certain queries. The idea is that though it's hard to cache aggregations, because we always want an updated value, we might not need a value that's up to date down to the second. In this case, we can cache the result of the aggregate for five minutes and refresh every five minutes. At Hotjar, we do this by having web requests selectively update the cache: a web request takes a look and says, listen, I'm in charge of updating this value now, because I can see it's going to expire soon. That way we avoid the cache stampede problem, where everyone tries to update an expired value at the same time. Another way to do it is by having a background worker take care of this, so it's not done in the web request itself.
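A minimal in-process sketch of the selective-refresh idea, assuming a plain dict as the cache store (in production this would be Redis or Memcached) and a made-up 10% chance of a request volunteering to refresh once expiry is near:

```python
import random
import time

CACHE = {}
TTL = 300          # cache aggregates for five minutes
EARLY_WINDOW = 30  # start volunteering 30 seconds before expiry

def get_aggregate(key, compute):
    """Return a cached aggregate, refreshing it early with small probability.

    As an entry nears expiry, each reader has a small random chance of taking
    charge of recomputing it, so when the TTL actually lapses the value has
    usually been refreshed already and no stampede of recomputations occurs.
    """
    entry = CACHE.get(key)
    now = time.time()
    if entry is not None:
        age = now - entry["stored_at"]
        nearing_expiry = age > TTL - EARLY_WINDOW
        if age < TTL and not (nearing_expiry and random.random() < 0.1):
            return entry["value"]  # still fresh; let someone else refresh
    value = compute()              # stand-in for the expensive aggregate query
    CACHE[key] = {"value": value, "stored_at": now}
    return value

print(get_aggregate("signups", lambda: 12345))  # → 12345
```

With a shared store like Redis, the "volunteer" step would typically be guarded by something like SETNX so only one request actually recomputes.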
So far we've spoken about aggregations when we need to read them, but when we're continuously updating counts by actually incrementing fields inside the database, it also makes sense, if losing five minutes of data is acceptable, to cache those increments first and sync to the database at regular intervals, to avoid creating too many updates in Postgres. Because if you issue updates at a very high rate, Postgres won't be very happy.

Okay. So there's something still lurking in our database which might cause us trouble intermittently, and which is a bit worse: welcome to deadlocks. I explain deadlocks this way. Imagine you have two kids, and they each have a toy. Of course, each wants the other kid's toy without giving up its own. This is what happens when we have a deadlock: there are two transactions going on, and each holds a resource which is required by the other transaction, but neither of them wants to let go of its current resource. In the case of the kids, a parent or carer comes and takes one of the toys and gives it to the loudest kid. In the case of deadlocks, Postgres does this mediator job: one of the queries will fail, and one of your users will get a failed request. Due to the nature of deadlocks, since they're not very predictable, repeating the request will probably make it work, so it's not too bad for your end user. However, the symptom is that requests quickly slow down when we have deadlocks, and if the throughput really increases, it can really bog things down, so we need to take care of them. One way to take care of this is ensuring that we're using a recent Postgres version, because new releases sometimes prevent certain classes of deadlocks, but mostly we need to take care about how we write our code.
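One way to see who is blocking whom is a query over pg_locks and pg_stat_activity. The speaker's exact query isn't visible in the transcript; the following is a simplified reconstruction adapted from the well-known PostgreSQL wiki lock-monitoring query (it matches fewer lock columns than the full version, so treat it as a sketch):

```python
# Pairs each backend waiting on a lock with a backend holding a
# conflicting lock on the same object, showing both queries.
LOCK_MONITOR_SQL = """
SELECT blocked_locks.pid        AS blocked_pid,
       blocked_activity.query   AS blocked_query,
       blocking_locks.pid       AS blocking_pid,
       blocking_activity.query  AS blocking_query
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked_activity
     ON blocked_activity.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
     ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
    AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
    AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_stat_activity blocking_activity
     ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
"""
```

Rows returned are backends waiting on a lock, paired with the backend holding it; entries that stay there for a long time are the ones worth investigating.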
So, over here (you probably can't see this query, but I'll make the slides available later) is a very useful query for identifying deadlocks, and it uses the pg_stat_activity view and pg_locks. These are two tools provided to us by Postgres to help us identify which transactions are holding which locks, and if we see that locks are being held for too long, we can possibly identify that there's a problem.

Okay. At this stage, we need to speak about table and index bloat. I mentioned before that Postgres isn't too happy with updates. What happens when we have bloat, specifically table bloat, is that when searching for results, Postgres has to jump over a lot of rows which have been deleted or updated but whose space has not been returned to Postgres yet. Bloat happens due to how Postgres manages transactions internally, and it can be fixed by running a VACUUM FULL. However, VACUUM FULL holds an exclusive lock on the table, so it might not be advisable in production environments. There are a number of extensions which can be used to improve the situation, remove the dead rows, and give the free space back to Postgres. One of the symptoms of table bloat is that the storage space in your database goes down really fast, faster than you would expect given the rate of data coming into your database. So, just a quick run-through: pg_squeeze is something you set up once, and it takes care of bloat continually, because you configure the level at which the bloat needs to be controlled. pgcompacttable is advised for TOAST tables, which are tables that have blobs inside them and defer their storage to other tables under the hood; however, it needs to be run with a parameter which specifies the number of pages it's going to clean up.
And pg_repack can just be run once, whenever the need arises. Okay? And we use pgstattuple to actually identify whether we have index or table bloat.

So, we have swept quite a few problems under the carpet by asking you to shove them into a background worker, so let's speak a bit, at the end, about how to keep these background workers healthy. Ideally, background workers pick their jobs from a queue, so that if one of the background workers dies, the job can be rescheduled. Also, we should keep the duration of each job predictable and do very little inside the background worker, so that we can distribute the tasks between multiple servers. When we do this, there isn't one machine under load for a long task, and we can monitor and profile the background workers easily. Another tip is to use server-side cursors when we're obtaining a large amount of data. This incurs a performance hit on the database, but it will protect memory within the background worker. So the advice is: don't use them in a web request, but do use them inside the background worker.

This is a cheat slide, because it's not about Postgres, but I've mentioned caching, and caching can be done in two different ways. We can either cache small tables within the web application, though we need to be careful with memory in that case, or we can use caching stores like Redis or Memcached to keep our cached data for us.

The last thing I want to mention is how to handle deletes. We haven't spoken about deletes at all, until one day we realize that our database is getting full and we say, listen, we have loads of data which we haven't used in ages; can we delete it? First of all, deleting it all at once will not make Postgres too happy, and second, your users might not have been informed that this data might be removed.
So, planning a data retention policy up front helps in this regard: as soon as data falls outside the retention policy, even if it's only one day past it, you can start deleting it, and in that way you spread the deletions out by doing them daily instead of having to do them all at once.

Okay, that was all from my end. I'm not sure whether we have a little time for questions, but if not, feel free to drop by the booth. There are four devs from Hotjar here, and we're mostly there during the breaks, because there are many interesting sessions we want to attend as well, but feel free to grab us either after this talk or as you see us around.

We do have time for questions. One at the back.

Hi, thank you for the interesting talk. Your abstract mentions sharding in PostgreSQL; I was wondering what your experience with that is.

Yep, okay. I left it out because of time, but I can speak a bit about that as well. When we came to think about how to expand our Postgres installation, one of the decisions we could have made was sharding, because our data is organized by site: all our features are organized that way, so heatmaps are specific to one site, recordings are specific to one site. So our site ID is a very good thing to shard by. However, when we actually came to do the migration, we ran into performance problems, because transferring the data and changing its format was too much for our current Postgres installation, since we needed to make some changes. So the tip is: if your data is suited to sharding, you need to think about it early, because the migration will bite you.

You've mentioned using JSON structures in Postgres.
Are you using them a lot, and haven't you seen any problems with performance? Or, to put it another way, did you ever feel you needed something more oriented to schema-less data, like Mongo, or can you just disregard Mongo and use Postgres with schema-less data in JSON structures?

Postgres has been suiting us fine, so from a performance perspective it's not a problem. However, I also need to say that we don't query the nested structures much: the JSON we're interested in is requested by the API as the entire field; we don't search by what's within it. So the answer I'm giving you is within that context. However, if you look at benchmarks, you'll see that Postgres and Mongo are both very performant in this regard. So no, we did not have the need to use Mongo for this.

Okay, thank you. We have time for two more questions.

Hello. I think it works now. Thank you for the talk. So beyond sharding, which as you said is very hard to do after the data grows, what tips do you have for while your data is growing, while your product gets more successful and your database struggles to keep up? I know the usual saying is that you should throw in more memory and more space, but what other tips do you have?

First of all, it's a good problem to have. That's what we say internally to console ourselves whenever we need to stay up to take care of this stuff. Monitor and keep tabs on what's going on, both in the database and in your web requests. And when you see things getting incrementally slower, even just little by little, don't ignore it, because it may grow too much at some point. You should start profiling early, to at least keep tabs on the bottleneck, because the more you know about your bottleneck, the more you are in control of how to fix it, and it doesn't take you by surprise.

Last question? None. Then let's thank Stephanie once again.