You ready? Yeah, sure, I'm ready. I can of course use a few extra minutes to talk about a few extra details, so if we get started early, that's good. Cool. All right. So here's Chris. He's been working on, is it just Postgres? PostgreSQL, I'll say. Postgres? Okay, I didn't want to say it the wrong way. Since 1999. Wow, that is a long time. So he must know a lot about this. He's going to be talking about real-time analytics with over 400 terabytes of high-velocity data and how to do it. Yeah. So take it away.

All right. So with this talk, I went back and forth between two different titles: PostgreSQL at 20 terabytes and beyond, and PostgreSQL at 400 terabytes and beyond. The difference: 20 terabytes is about the maximum size of one of our databases, and 400 terabytes was the overall size of the analytics network when I first did the slides. We're now at about 500 terabytes.

So let me talk a little bit about what we do at Adjust and why we have all this data. Basically, we do mobile advertisement attribution. When you click on an advertisement on your cell phone and you go and make a purchase or download a new app, we referee which advertisement activity gets credited with the installation or the revenue activity. We connect all those endpoints. This means we are being hit by traffic from cell phones 24 hours a day, seven days a week. We focus on fairness. We fight a lot of fraud — you would not believe how much fraud there is in the advertisement world. You probably would believe it, actually. And we focus on trying to have all of this work in a way that's fair for everybody. So that's what we do. As you can imagine, this generates a very large amount of traffic that we have to process in real time, and then people want analytics on how their advertising campaigns are going.

What we'll be covering today is general stuff about our environment. I was going to talk about autovacuum and data modeling and backup and operations, but in a 20-minute talk we'll probably have to take that out to the hallway track. And then we'll talk basically about the strategy, how all the pieces fit together. One thing that sometimes happens when people see this talk is they go, oh, this looks like a cool way I could solve this problem. Chances are, unless you have multiple terabytes of data, you're better off optimizing a single instance. When you're starting to deal with high-velocity, high-volume problems, then maybe this approach works better. But there are many cases where you actually do have big data, where you can work with Postgres as a big data platform, and that's what this talk is about. So this is more about using Postgres as an alternative to Spark than about how to solve geolocation for IP addresses at a scale of 100,000 a day or something.

Some basic facts about us. We are a PostgreSQL, Kafka, Go, and C shop. We have about 200 employees worldwide. We link advertisements to installs, and we do very close to real-time analytics on user traffic data. Right now, most of the time there's about a five-minute window between when an event comes into our servers and when it shows up on the dashboard. This is despite the fact that we have 100,000 to 200,000 requests a second coming in — around 100,000 during quiet times, 300,000 to 400,000 during high-volume peaks. So our data is big data. Just how big is big?
I mentioned the numbers are already slightly out of date. Well, it is over 100,000 requests a second — it's quite a bit over that. 100,000 requests a second is about the minimum we see in almost any given second of the day. Closer to peak times we see 300,000 sometimes. We have seen 300,000 to 400,000 when a customer makes a mistake and accidentally turns the cell network into a botnet. This happens. It happened while I was giving a talk at PGConf Russia, actually.

So here's our general architecture. Requests come in from the internet. They get routed to the back ends via HAProxy in a round-robin way — HAProxy just directs each request to the next available back end. Then, in a tight loop, we have a program that runs inside PostgreSQL on the back end — it's actually written in PL/pgSQL — which takes that data and sends it out to all of the analytics shards. So here we have no locatability of any data: we don't know where a given request is going to end up, and we can't locate it afterwards except that it gets tagged — in this process we know that it was processed by back end 7, for example. Currently we have 25 of these back ends. We go through this materialization — again, an internal set of stored procedures written in PL/pgSQL — and it goes to these shards, where we do have that sort of locatability. This is basically a MapReduce job, but it's a many-to-many MapReduce, because it goes from many servers to many other servers.

Common elements of our infrastructure: it's all bare metal running stripped-down Gentoo. Our ops team does a huge amount of A/B testing — what do we get more performance out of when we choose kernel schedulers, things like that — and we estimate we get roughly 50% more throughput than we would with a stock Linux setup and stock compilations of Postgres. And then this is standard PostgreSQL plus extensions, some of which are just on PGXN and some of which we've written ourselves. We don't use any storage extensions, by the way — no Citus, nothing else.

So now let's talk about the roles of our back-end servers in all of this. A request comes in from HAProxy and gets routed to a back-end server, which processes it. I'm not going to talk right now about our operational data environment, primarily because that's running in Aerospike, which is not open source. Once this gets done, it gets saved into Redis, and we have a process that reads from Redis and stores it in Postgres. This means that if we restart Postgres, we don't interrupt the back end to do that. We have right now 24 back-end servers. Typically you accumulate around 20 terabytes over time on each back end. Back ends 20 to 24 are smaller because they're new. When we get close to that limit, we ship the data to what are basically archive servers, which still allow us to backfill and do some other things if we need to.

So let's talk quickly about this materializer process, which is written in PL/pgSQL. It aggregates new events and copies them to the shards. It runs every few minutes, and it runs on new data only. So this is an incremental MapReduce job, basically: it runs in Postgres, grabs new records only, digests them down, and sends out basically the diffs to these other servers, which then store the little pieces. Does that make sense? Right now it still does this only via the foreign data wrapper, though we are changing that, because the foreign data wrapper has some serious performance limitations.
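To make the materializer pattern concrete, here is a minimal sketch of an incremental roll-up in PL/pgSQL. The table and column names (raw_events, materializer_state, and the postgres_fdw foreign table shard_deltas) are illustrative assumptions, not Adjust's actual schema, and the real materializer is far more involved:

```sql
-- Minimal sketch of one incremental "materializer" pass (illustrative only).
-- Assumes: raw_events(id bigserial, tracker_id int, happened_at timestamptz,
-- revenue numeric), a bookkeeping table materializer_state(last_id bigint),
-- and a postgres_fdw foreign table shard_deltas on the target analytics shard.
CREATE OR REPLACE FUNCTION materialize_new_events() RETURNS void AS $$
DECLARE
    from_id bigint;
    to_id   bigint;
BEGIN
    -- Lock the bookkeeping row so concurrent passes don't double-ship data.
    SELECT last_id INTO from_id FROM materializer_state FOR UPDATE;
    SELECT coalesce(max(id), from_id) INTO to_id FROM raw_events;

    -- Aggregate only the new slice and ship the diff to the shard.
    INSERT INTO shard_deltas (tracker_id, hour, events, revenue)
    SELECT tracker_id,
           date_trunc('hour', happened_at),
           count(*),
           sum(revenue)
      FROM raw_events
     WHERE id > from_id AND id <= to_id
     GROUP BY 1, 2;

    UPDATE materializer_state SET last_id = to_id;
END;
$$ LANGUAGE plpgsql;
```

Run from a scheduler every few minutes, a pass like this only ever touches rows that arrived since the previous pass, which is what keeps the job incremental.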
How many of you here, just a show of hands, are familiar generally with the idea of MapReduce? Okay, good, I probably don't need to explain it. Basically, what we're doing here is a MapReduce from our back-end servers to our analytics shards. And then we have another MapReduce job, effectively, done by a service written in Go, when it pulls the data out of the analytics shards and hands it back to the client or to the dashboard. Breaking it up in this way means we can keep this process very close to real time. As I say, under normal circumstances it's within five minutes. When we're doing maintenance, it may be up to an hour, and over an hour, ops starts getting involved and calling us, asking why the data isn't getting there — because they actually monitor that.

As far as the analytics shards go, these numbers are a little out of date. We have right now three and a half terabytes on each one, and there are 16 shards, so this is approaching 60, 64 terabytes. We have our own custom software that goes through and does another MapReduce to generate the data that it then sends to the client over our KPI service, or sends to the dashboard. Once in a while clients abuse this: they ask for all kinds of data, the KPI service grows, and it restarts automatically. Then they ask why that happened, and we say, you asked for five years of data — you can't have that much at once, we can't process it that fast.

So, a lot of challenges. Obviously, one of the things about these conferences is that when you're working heavily in open source, everybody's always hiring — I'm sure that's the case here. I'm putting these slides in, though, because I'm willing to bet there are at least some people in the audience who are also hiring and struggling with these problems. When you're working at this scale, we can't hire junior people. It's an extremely demanding environment, it takes a lot of time to get up to speed on, and you need people who are deeply grounded in both theory and experience. So this is a really big problem, and it's a problem particularly in Postgres, because the database has become popular much faster than the DBA community has grown to fill the need. This is actually bigger than any of our operational problems. The way we've decided to handle it is to look for people with enough experience that they can start conversing right away. They don't have to be able to contribute right away, but they have to be able to converse, they have to show they can contribute to the foundation, and they have to be self-learners enough to get into it. I suspect this is applicable to people hiring in other projects too: if you're in an environment where the project has grown this fast, you're going to have these sorts of challenges, and this is how we've decided to handle them. It seems to work well, but of course we still struggle.

So I'm going to skip the throughput challenges due to time — I actually opened the slides I'd done for a longer presentation, which is fine — and I'm going to skip autovacuum; we can talk about those in the hall afterwards if people want. The data modeling, however, I want to talk about. It is really painful to change your data model at this scale.
I'll give you a real-world example. Two months ago, we were told we were running out of 32-bit integers for advertisement trackers. The problem is, these are used everywhere in this 400-terabyte environment, and we had two months to solve it. Two months sounds like a lot of time, doesn't it? I see some people laughing; they know where this is going. It took us two months to solve it. This isn't something you just do, because when you try to change a column to another data type, Postgres will rewrite the table, so you can't just run an ALTER TABLE. You have to come up with other strategies for updating the data, and executing them takes time — making sure they don't knock the system offline is another challenge. And this is all under heavy load; it is a real pain to change. I'm happy to say that we did change it: we managed to change every piece without any downtime, with only a few seconds of table locks in the whole process. But it was a really heavy task. So getting this right at first is a really important piece.

We do a lot of our own custom data types. We're sometimes known for it — we show up at conferences and people go, hey, wait, you're the guys who write all those funny C data types. Yeah, we are. We have a custom one-byte enum: you have a map of value to, say, country name, and then you can represent all countries in a single byte. This helps with alignment issues. It may only seem like it saves 8 or 10 bytes a row, but with relatively short rows that adds up over billions and billions of rows. So this may seem like a micro-optimization, and in many cases it is, but byte counting is something we pay a lot of attention to, and it pays off in this environment.

We also have a custom data type we call istore, which is not related to the Apple iStore. Those of you who work with Postgres may know hstore; this is like hstore, but integer-to-integer. Basically, you can think of it as a way of modeling sparse integer arrays. This gives us many of the benefits we would get with columnar storage, but still in a row-oriented format, so we can model time series and the like by effectively putting this variable-length blob of data in the row and working on it there. It supports arithmetic operations: if you take one istore and add it to another, it will assume that missing keys are zeros and add the values together key by key, giving you the sum of the two. It also supports GIN indexing, for those of you who know what that is. How many here know the term GIN indexing? Just two or three — I'll mention it briefly. Postgres has a number of index access methods, and GIN is one of them. It stands for Generalized Inverted Index, and it's a way to index membership in an object. Basically, it allows you to use an index to answer, say, which rows have istore columns containing a key of five.

So, to recap: what we do here is an incremental MapReduce from one set of servers to another set of servers, and then a second-stage aggregation in our KPI service, which, by the way, is written in Go. When we do the map on the main back ends, we push that data out to the shards one at a time, and the shards do this sort of reduction. Then you have the second-stage aggregation, which happens afterwards.
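To picture that read-side fan-out — the KPI service asking every shard and combining the results — here is a minimal SQL sketch. The foreign table names shard01_stats and shard02_stats are hypothetical, and the real service does this concurrently in Go rather than through a single SQL statement:

```sql
-- Illustrative second-stage aggregation across shards. Assumes postgres_fdw
-- foreign tables shard01_stats(tracker_id, day, events, revenue), etc.
SELECT tracker_id,
       sum(events)  AS events,
       sum(revenue) AS revenue
FROM (
    SELECT tracker_id, events, revenue
      FROM shard01_stats WHERE day >= date '2019-01-01'
    UNION ALL
    SELECT tracker_id, events, revenue
      FROM shard02_stats WHERE day >= date '2019-01-01'
    -- one UNION ALL branch per shard, through shard16_stats
) AS all_shards
GROUP BY tracker_id;
```

Because the shards already hold pre-reduced aggregates, this second stage only has to sum small partial results rather than rescan raw events.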
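And as a taste of the istore type itself, here is a hedged sketch. The addition semantics (missing keys treated as zero) and GIN support are as described in the talk; the literal syntax and the ? key-existence operator are assumptions borrowed from hstore, so treat the details as illustrative rather than as the extension's documented API:

```sql
-- Illustrative istore usage (syntax assumed hstore-like; see the istore
-- extension on PGXN for the documented operators).
CREATE EXTENSION istore;

-- A sparse integer->integer map, e.g. hour-of-install => install count.
CREATE TABLE daily_installs (
    tracker_id bigint,
    day        date,
    installs   istore
);

-- Adding two istores sums values per key, treating missing keys as zero:
-- '1=>2, 3=>1' + '1=>3, 5=>4'  =>  '1=>5, 3=>1, 5=>4'
SELECT '1=>2, 3=>1'::istore + '1=>3, 5=>4'::istore;

-- GIN indexing lets you ask "which rows have a value for key 5?"
-- (assumes the extension ships a default GIN operator class).
CREATE INDEX ON daily_installs USING gin (installs);
SELECT * FROM daily_installs WHERE installs ? 5;
```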
Given the amount of time, I think I'm actually going to go back and mention autovacuum. How many of you know what autovacuum is in Postgres? How many of you are generally familiar with the problems of garbage collection? Almost everybody. So we have this issue of garbage collection. The basic problem is that in relational databases, we like to have every query go from one consistent state to another consistent state. This means that all running queries have to be able to access all the data that has been altered since they started running, right? Different databases have different approaches to this: SQL Server has one approach, Postgres has a different one. In Postgres, we just write new rows and mark the old ones as deleted — actually, we mark them with when they were deleted. Then we have this other process that comes through and makes the space from those deleted rows available for future rows. So our tables are unordered. This makes certain things very nice and causes some problems elsewhere, but for the most part it works fairly well.

The problem is, by default, autovacuum doesn't kick in until you have 50 dead rows plus 20% of your table. That works great when you have 100 rows. When you have 100 million rows, it does not keep up: autovacuum will sleep most of the time, then try to do too much work at once, and you'll have problems. Our first attempt was to change that 20% to 8%, which didn't work so well. So I introduced another approach, basically going to 150,000 rows plus 0%, which actually works very well. However, there's a very major problem with this: you switch it on, and guess what? Autovacuum says, wait, now I have 2,000 tables to vacuum, and I can do 5 at a time. 2,000 divided by 5 is 400, so let's create 5 queues of 400 tables, hand them off to the workers, and when they're done we'll look at what else needs vacuuming. Four hours later, the system falls over, and ops calls me and goes, what are you doing? This is why, when you make a change in this environment, four hours later ops calls you. So we had to find ways of rolling this out very gradually. What I actually did was create a batch script that did more or less what autovacuum would have done under this configuration, let it run for two days, and then tuned autovacuum — and everything worked. So that worked well.

On to conclusions, and then we'll actually have a couple of minutes for questions. In this environment, at this scale, you have to pay a huge amount of very careful attention to details like data alignment, because if you're wasting 10% of your space on padding that C is putting in there for you, that's going to come back and bite you pretty badly. 10% of 20 terabytes is 2 terabytes. Keep that in mind: it might only be 10%, but you're still talking about terabytes. 1% of 400 terabytes is still 4 terabytes. So small gains add up in big ways over large amounts of data — and similarly, small errors add up in big ways too. It takes tremendous attention to detail. At the end of the day, we are in fact using Postgres in a way many people would use, for example, Hadoop: we're effectively using it as a distributed data storage engine where we can MapReduce data and put it somewhere else, and maybe even MapReduce that and put it somewhere else.
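The threshold change described above corresponds to two standard autovacuum knobs, which can be set globally in postgresql.conf or per table. A minimal sketch (events is a hypothetical table name):

```sql
-- Default behavior: vacuum after 50 dead rows + 20% of the table
-- (autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2).
-- The approach from the talk: a flat 150,000 dead rows, no percentage part.
ALTER TABLE events SET (
    autovacuum_vacuum_threshold    = 150000,
    autovacuum_vacuum_scale_factor = 0.0
);

-- Or globally, in postgresql.conf:
--   autovacuum_vacuum_threshold = 150000
--   autovacuum_vacuum_scale_factor = 0
```

The rollout caveat from the talk applies: flipping this on a live system suddenly makes thousands of tables eligible for vacuuming at once, hence the batch script that worked through the backlog before the new settings went live.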
One of the major takeaways, at least in our experience, is that there are three or four major pieces of infrastructure our team is really comfortable with: one is Postgres, one is Kafka, one is HAProxy. A lot of the rest we can live without, but those are the three things we've figured out how to scale, and scale well. Of course, a large part of that is that we have the in-house experience necessary to do it.

So that's the main section of the talk. It looks like I have about two minutes for questions and answers, if there are any immediate questions. Yeah. Thank you, Christopher. I have multiple questions, but maybe I can ask them. Sure. I would start with: if you are not using Citus Data, what sharding solution do you employ? Is it a manual, home-grown process, or what?

We have a sharding algorithm we've built that we end up having to deploy in the materializer. We originally used the same thing in the KPI service that retrieves the data from the shards, but what we found is that since people were typically asking for things less granular than what we were using to locate the data, it was easier to just ask all shards on that side. So we shard the data on the sending side, over to the KPI shards, with our own algorithm. Basically it's aimed at incremental resharding: the idea is that when you reshard, you have to add at least two shards at a time, because they have replication between them, and then all the other shards hand some keys over to those. So it's a home-grown solution, and the only thing you need to know to locate data based on the key is how many shards you're dealing with.
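That answer implies a routing function whose only inputs are the key and the shard count. As a hedged illustration only — the talk doesn't spell out Adjust's actual algorithm — a modulus-style router in SQL would look like this:

```sql
-- Hypothetical shard router: given a key and the number of shards,
-- deterministically pick a shard. Illustrative only.
CREATE OR REPLACE FUNCTION shard_for(key bigint, num_shards int)
RETURNS int AS $$
    -- hashint8 is a built-in Postgres hash; mask off the sign bit so the
    -- modulus is always non-negative.
    SELECT 1 + ((hashint8(key) & 2147483647) % num_shards);
$$ LANGUAGE sql IMMUTABLE;

SELECT shard_for(1234567, 16);  -- => a shard number between 1 and 16
```

A plain modulus like this would not give you the incremental resharding described in the answer — for that you'd want something closer to consistent hashing, where new shards take over slices of keys — but it shows the shape of the contract: key plus shard count in, shard number out.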
What are the characteristics or features of Postgres that make it suitable for what you do? The single biggest one is actually the programmability of Postgres. The fact that we can build our own custom data types is a really big win. In some other databases you can do that too — I know you can in Oracle or SQL Server and some of the others — but in Postgres it's pretty easy to write your own data types, and that's a big one. And of course we can also do things like connect it to Kafka: actually create a foreign data wrapper so we can pull stuff in directly from Kafka, things like that. The second thing is that the buffer management system of Postgres has been really well optimized over a long period of time. And the third thing is that the quality of the Postgres source code is really, really good. When we're writing these extensions and have a question, we go and look at the source, and we can find what we're looking for quickly and easily. It's not just about having the source; it's that the source code is written for humans to read rather than for tools to read. It's a beautiful work of art, and I would strongly recommend it.

So I think I'm at the end of the 20 minutes. If there are any more questions, I'd be happy to take them on the hallway track outside. You have a couple? Okay. All right, so go ahead.

Oh, okay, perfect. You mentioned that you managed to roll out a change over a lot of data, so there's a lot of context. Can you give an overview of the strategy you used? Okay, so we had four 32-bit ints we wanted to change to 64-bit ints. So we created extra columns and put a trigger on the table to set the new columns to the value of the old columns. Then we ran some jobs that incrementally updated everything, and then we ran some jobs that flipped the columns around. Now, this actually became problematic, because we have cases where we have multiple inheritance, and multiple inheritance doesn't work very well with this. In those cases you actually have to create an inheritance diamond to make it work. That's a corner case we ended up having to figure out. And then you get rid of the inheritance diamond when you're done with it, because we don't like inheritance diamonds, do we?

Any more? Go ahead. This is obviously a solution that you've settled on after a long time. Yeah. Was there a point in time where you decided, this is the entire set of tools? So Adjust started in 2012, and it wasn't until, I think, 2013 or 2014 that we started having to worry about this sort of sharding and scaling. A lot of what went into this was heavily discussed at that point. You know, the funny thing is that as you make these decisions, you end up with certain kinds of vendor lock-in. With our level of custom data types, we can't just shove stuff into PrestoDB, or even use PrestoDB to process it, because it won't understand the formats we're talking about. So each decision ends up being made in part based upon all the decisions that came before. I wouldn't say it was entirely sat down and engineered from the start in one go: there was a lot of focus on alignment and things like that, and then the sharding system was built separately, and so on. I know we looked at Citus at a couple of points, but by that time we'd already built pretty heavily on vanilla Postgres.

You had one question? Yeah, one more, if I may. I'm sure you're all aware that, for application developers, queries often behave very differently on production databases than in a testing environment. So do you have some sort of staging deployment? How do you test? How do you work against your production? Okay, so there are two sides to this. There's the materializer side, which we have a QA environment for. Then there's the KPI service side, which is largely maintained by the same team that maintains the databases — and in those cases it's actually maintained by the same people who are watching the logs and that sort of thing. And we can go and — I hate to say run the queries on production — but we can ask, how is this actually going to perform? We can sit down and think about that beforehand. It's very hard to come up with a full staging scenario for this, and for read-only queries, we wouldn't bother. Okay, thank you. Thanks.
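To make the column-widening strategy from that answer concrete, here is a minimal sketch of the trigger-plus-backfill approach. Table and column names (trackers, tracker_id) are hypothetical, and the real rollout also had to handle inheritance, batching schedules, and lock timing:

```sql
-- 1. Add the wide column. With no default and nullable, this is a
--    metadata-only change: no table rewrite.
ALTER TABLE trackers ADD COLUMN tracker_id_big bigint;

-- 2. Keep new and updated rows in sync via a trigger.
CREATE OR REPLACE FUNCTION sync_tracker_id() RETURNS trigger AS $$
BEGIN
    NEW.tracker_id_big := NEW.tracker_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trackers_sync_id
BEFORE INSERT OR UPDATE ON trackers
FOR EACH ROW EXECUTE PROCEDURE sync_tracker_id();

-- 3. Backfill old rows in small batches (repeat until no rows remain)
--    so no single statement holds locks or bloats the table for long.
UPDATE trackers
   SET tracker_id_big = tracker_id
 WHERE ctid IN (SELECT ctid FROM trackers
                 WHERE tracker_id_big IS NULL LIMIT 10000);

-- 4. Once fully backfilled, swap the columns in one short transaction.
--    This is where the "few seconds of table locks" happens.
BEGIN;
ALTER TABLE trackers DROP COLUMN tracker_id;
ALTER TABLE trackers RENAME COLUMN tracker_id_big TO tracker_id;
COMMIT;
```

Everything before step 4 runs against live traffic; only the final swap needs a brief exclusive lock, which matches the near-zero-downtime result described in the talk.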