So thank you all for coming. This is a talk entitled "PostgreSQL as a Big Data Platform." Where I work, at Adjust, we put a lot of data in Postgres. It's one of the most scalable pieces of our infrastructure, and it's something we put a lot of trust into. We'll be spending some time talking about what is big data and what is not big data, because there are a lot of misconceptions about what big data is and the sorts of problems that big data platforms are supposed to solve.

First, I want to discuss the context of the data, because data without context is just ones and zeros, right? The company I work for, Adjust, does mobile advertising analytics and attribution. A lot of advertising on mobile devices is paid for on a contract which says: you get X amount of money for every user you bring in, or you get X percentage of the purchases of the users you bring in. For that to work, you have to know that a user came in on a particular advertising campaign, and you have to be able to do other analytics, like revenue, on a campaign-by-campaign basis. So we gather all kinds of data that developers send us relating to their applications, and we track it by the origin of the users on those applications. That's what we do. And we focus primarily on making sure that when an advertiser gets credit for something, it was in fact a user they brought in, and not, say, spoofed SDK traffic or things like that, which happens way more often than I think any of us would like.

I head the Postgres team at Adjust; we call it the database team. We are primarily a research and development team: we don't typically act as first-level on-call at night, so by the time Ops calls us, they've run out of ideas. We develop and provide escalation support for database environments supporting diverse products. These include analytics products and some other ones we'll talk about.
And our deployments are typically at the petabyte scale. So it's a lot of data, and the team I run does some incredible work in these areas.

So let's talk about why we use Postgres. To understand why we use Postgres, we have to understand what big data is and what big data is not, and where people run into a lot of problems thinking they can just throw big data solutions at something, when in fact that often doesn't work. Big data is about V3 problems. Anybody here not familiar with the term V3? V3 stands for volume, velocity, and variety. The idea of the big data shift is that we're moving from the kind of data that an accounting system runs on, where everything has a defined structure and you don't have a massive amount of it — maybe a few terabytes if you're a decently large company — to a situation where you're trying to derive intelligence from a wide variety of things, coming in at a high rate, possibly taking terabytes or petabytes of space.

Big data is not a set of products. There's no such thing as a big data product. Okay? People will say, what about Hadoop? Hadoop is a tool which you can use when you're addressing some big data problems; it's not necessarily the best solution for all problems. Big data is a set of techniques that we use to solve V3 problems. It's a set of ideas and ways in which we can manage data that may be spread across many different servers, in a way in which we can actually make sense of it. It's not a set of technologies: you can take the same set of techniques and use a different set of technologies with a little work. And most importantly, making big data systems work requires tremendous attention to detail.
And the reason is that typically, once you're dealing with velocity problems, you're putting stress on the I/O systems, you're putting stress on memory bandwidth, you're putting stress on the processors, and if you don't get those things right, the systems will fall over. So you can't just follow recipes. You have to actually pay attention to detail, and you have to use techniques where they work. Having said all this, there are certain trade-offs that come from using something like Hadoop, which is built around solving certain big data problems, versus applying some of those same techniques to Postgres. The big one is that Hadoop thinks in terms of big blocks and large files, while Postgres thinks in terms of what we call a tuple, or a database row. So if your data fits into database rows nicely, applying those same techniques in Postgres can often be a lot more performant and a lot less of a headache than doing the same thing in Hadoop.

So I want to take you on a tour of the three main environments we cover. First is our KPI service pipeline. The deployment is probably around a petabyte at this point. We deliver near-real-time analytics on user behavior, split up by advertising campaign. So you can find out, for a particular campaign, how users are interacting with the application compared to those who came in organically. And typically there are just a few minutes between when a request hits our backend services and when it's available on our KPI service or dashboard. We have roughly 100,000 to 300,000 requests a second coming in through this pipeline. These are all independent HTTP requests, with short payloads and rapid processing, and the results are delivered both on a dashboard and via external API access. As you can see, this is a significant volume and velocity of data, but for the most part the stuff we're interested in here is not going to have high variety.
So we approach this as a data pipeline. The points of entry, where the actual requests come in, are highly redundant. Our goal here is to be able to survive a data center outage — not just one server dying, but, say, power getting cut to a data center, or the air conditioning going off, which actually happened once. We expect to survive that and at least continue core processing, so that people's ads don't redirect into empty space, and we can still ingest the data that they expect us to ingest. The customer-facing analytics charts are only modestly redundant. They're paired with replication; the idea is that we can lose a single node and still serve requests. If the whole data center is down, the dashboard is not going to work, but we can send an email out to customers saying the dashboard is down while we're still processing advertisements. If instead we had to send them an email saying we're not processing advertisements right now, please pause your advertising campaigns — they don't tend to be as happy with that.

So, our architecture on this one. Basically, we have entry systems where a program does all the ingestion and throws the data into Redis, which we use as a queue; from Redis it goes into Postgres. Redis, by the way, is a nice rate limiter: it runs as a single-threaded event loop, so it only goes one speed. Then we basically map-reduce this data into the customer-facing, KPI-service-facing systems, and then we map-reduce that again on query. This allows us to take what is right now 45 to 50 point-of-entry servers and funnel that data into a set of 20 databases, each of which is much smaller. And for those of you who were here last year, or want to look it up, the talk on Postgres queues at 20 terabytes and beyond goes into that in significant detail.

So, on to our second product. This is what we call Bagger. It's named after a Bagger — I think the 261, or something — the largest bucket excavator built.
And that's where the tool's name comes from. This was our Elasticsearch replacement, because Elasticsearch, for us, fell over at about a petabyte. We have up to a million data points a second, sometimes more, that this ingests, and it's basically a debug log. By volume, we're talking about 10 petabytes before deduplication and compression. And because this is a debug log, the information is extremely free-form. So we have massive variety, we have massive velocity, and we have massive volume. But we only retain it for a limited time, so you can basically think of the conveyor belt as dumping it off the end of the system. The use case: a customer asks, why are my numbers like this? And we can then pull a limited range of time of what actually hit our system.

Our engineering approach was to optimize for bulk storage and linear writes, so everything is optimized for writing data as fast as we can. Everything is stored as JSONB, which is basically a stricter subset of JSON with a nice binary serialization. Do we support duplicate keys? Oh, well, those are evil — no, we don't support them in this part of the system. Plain JSON, by the way, does support that. Data is partitioned by hour, service, and tag, and then we drop the oldest partitions when the disk is getting full. Sharding is client-side, so we actually query master servers for who has the data we might be interested in, and then we ask those servers. It looks a lot like MongoDB sharding, except we did it with Postgres and we did it right.

For the architecture: data arrives in Kafka, and it's partitioned there. Basically, we use the Kafka partitions as partitions for the database. So right now, you can think of it as one Kafka partition going to one database: you write to a topic, it comes out in partitions, you read from partitions. Data is partitioned in the database by query pattern and hour — as I said: service, tag, and hour. We track those partitions in a sort of metadata database.
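The retention bookkeeping just described — partitions keyed by service, tag, and hour, with the oldest dropped as the disk fills — can be sketched in plain Python. All names and the retention budget here are illustrative stand-ins, not Adjust's actual code:

```python
from datetime import datetime

def partition_name(service, tag, hour):
    """Name a partition by its query pattern: service, tag, and hour bucket."""
    return f"log_{service}_{tag}_{hour:%Y%m%d%H}"

def partitions_to_drop(partitions, keep):
    """Return the oldest partitions beyond the retention budget.

    `partitions` maps partition name -> hour bucket; `keep` is how many
    hour buckets fit before the disk gets full. In Postgres each returned
    name would correspond to a DROP TABLE on that partition.
    """
    by_age = sorted(partitions, key=partitions.get)  # oldest first
    return by_age[: max(0, len(by_age) - keep)]

# Four hour-buckets of a hypothetical "sdk"/"click" stream.
parts = {
    partition_name("sdk", "click", datetime(2019, 5, 1, h)): datetime(2019, 5, 1, h)
    for h in range(4)
}
drop = partitions_to_drop(parts, keep=2)  # the two oldest hours go
```

The conveyor-belt effect falls out of this: as each new hour's partition lands, the oldest one past the budget is dropped, so disk usage stays roughly flat.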
This is why it kind of looks like MongoDB: MongoDB has this same concept of a metadata server that you use for sharding. And we have a client, written in Perl, that basically queries the master database, sends the queries out to everything else, and then merges the data. One thing that Perl client does is run an EXPLAIN first, and it will refuse to run the query if the plan does a sequential scan.

So, on to the third: Audience Builder. The purpose of Audience Builder is retargeting campaigns. You want to know, say, which users of my application have spent 20 euros or more in the past, but have not logged in in the last two weeks — because maybe they've dropped off and they're no longer playing this game, no longer purchasing, right? So we want to hand a device list out to retargeting providers, who can then display ads and try to bring these people back into the game. Now, right now, this is only 12 terabytes, okay? But this is effectively a service that we only did a 1.0 version of, and now we're trying to support a lot of new features, and we expect those features to push things into the petabyte range. Also, this is not like a normal data warehousing environment. In a data warehousing environment, you typically want to ask something like: for all users in Asia, what do my sales figures look like per month? This would be more like: for all users in Asia, which months have I had more than a certain amount of purchase volume? A lot of data warehousing solutions don't handle that query pattern well.

One thing about Audience Builder, compared to the others, is that I actually know something about the evaluation of other technologies, and we didn't just settle on Postgres for our new version of this. We looked at Apache Drill. We looked at a bunch of other big data frameworks along the way, and we ended up coming back to Postgres.
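The seq-scan guard the Perl client runs, described a moment ago, amounts to a very small check: get the EXPLAIN output first, and refuse to fan the query out to the shards if the planner would read a whole partition. A sketch in Python, with canned plan strings standing in for what the real client gets back from Postgres:

```python
def refuses_seq_scan(plan_text):
    """Reject a query whose EXPLAIN output contains a sequential scan.

    On a multi-terabyte partition a seq scan means reading the whole
    table, so the client refuses to send such a query to the shards.
    """
    return "Seq Scan" in plan_text

# Canned EXPLAIN outputs (illustrative; the real ones come from the server).
indexed_plan = "Index Scan using logs_service_tag_idx on logs_sdk_click"
bad_plan = "Seq Scan on logs_sdk_click  (cost=0.00..1234567.00 rows=9999999)"

assert not refuses_seq_scan(indexed_plan)  # allowed to run
assert refuses_seq_scan(bad_plan)          # refused
```

The point of the guard is that on a client-side sharded system there is no global planner: each shard plans independently, so a cheap per-shard EXPLAIN is the only place to catch a query that would melt the cluster.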
So with the new engineering effort we're pushing forward, we're actually continuing to use Postgres as the main query engine. Our engineering approach for the new version is a separation of storage and query. The reason for this is that we want to be able to scale up storage independently from scaling up query power. This is not a typical use case for Postgres, but it is something that, in this particular case, we need.

We settled on Parquet as the storage format. Anybody here familiar with Parquet? Parquet is a columnar format, usually used in Spark environments and Hadoop environments and things like that. Typically, people who run it on Hadoop run MapReduce jobs that pull data back, and then do further processing on that. Now, the Parquet format has the row group information in the last block of the file, so if we tried to access this over HDFS, that would be a problem. I'll talk a little bit in a few minutes about what we did specifically to address that. But as we evaluated these options, we came to the conclusion that there were two things we really needed: extensibility, so that we could do development in the database and support new things if we needed to, and predictability of query performance. When we looked at Apache Drill, those were the two things we couldn't really make work. When we looked at ClickHouse, the fact that it wasn't extensible for our purposes basically killed the idea for us.

So we put the Parquet files on Ceph. The reason for Ceph is that it has seek support, so we can just look at the last block and then figure out which pieces of the file we have to ask for over the network. PostgreSQL basically acts as the query engine, and we write the files through a pipeline that gets data from Kafka, writes Parquet files, and then registers them in the Postgres system using the foreign server / foreign table framework. We wrote a Parquet foreign data wrapper for Postgres.
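That footer-at-the-end layout is exactly why seek support matters: a Parquet file ends with its metadata, followed by a 4-byte little-endian footer length and the magic bytes `PAR1`, so a reader can jump straight to the tail instead of streaming the whole file. A minimal sketch of that read path, exercised on a hand-built stand-in blob rather than a real Parquet writer:

```python
import io
import struct

MAGIC = b"PAR1"

def read_footer(f):
    """Seek to the tail of a Parquet-layout file and pull out its footer.

    Layout: PAR1 | row groups... | footer bytes | 4-byte LE footer length | PAR1
    """
    f.seek(-8, io.SEEK_END)
    length_bytes, magic = f.read(4), f.read(4)
    if magic != MAGIC:
        raise ValueError("not a Parquet file")
    (footer_len,) = struct.unpack("<I", length_bytes)
    f.seek(-(8 + footer_len), io.SEEK_END)
    return f.read(footer_len)

# Hand-built stand-in: header magic, fake row-group data, fake footer, tail.
footer = b"fake-thrift-metadata"
blob = MAGIC + b"row-group-bytes" + footer + struct.pack("<I", len(footer)) + MAGIC
assert read_footer(io.BytesIO(blob)) == footer
```

With Ceph, those two seeks plus targeted reads of just the row groups a query needs are cheap; over plain HDFS streaming access, reaching the footer first is what becomes awkward.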
If you search for the Parquet foreign data wrapper on GitHub, you will find it. And with tuning for our use case, we got it as fast as Postgres on a native file system with local storage. What do we give up? We give up writes: all the data is non-writable from inside Postgres. But as a query engine, this works very well. And you can see the URL there for the Parquet foreign data wrapper.

So those are the three environments. You'll see that they're all very different, right? We have one that's more like a standard data warehousing environment, but with very high velocity. We have one with very unstructured data coming in at high volume, but with limited retention. And then we have another with relatively well-defined data, so there's not a variety problem, but as we grow it, we're going to have volume and velocity problems. So the key takeaway here is that all these systems are different. Once you get to big data scale, your problems are very specific to the systems you're working in, right? Big data is entirely about technique. It's not about technology. And you can take technologies that were designed for smaller data environments, apply the same techniques, and get very good results at big data scale. Careful attention to requirements and detail is extremely important, and every system is different.

Just as a note: we use a lot of open-source software. We prefer open-source software because when it breaks, we can fix it, and at our scale, that matters. Our VP of engineering has contributed fixes to Postgres, but we can't fix proprietary software ourselves. So we definitely prefer open source. So, thank you for your time.

[Moderator:] Thank you, Paul. [Audience question:] Paul, if it's possible, can you say a couple of words about other big data solutions that you've evaluated?

Okay, so we looked at a few of them on the Audience Builder side. For our initial analytics side, we looked at Mongo a long time ago and concluded it wouldn't scale to our needs.
On the Audience Builder side, we looked at Apache Drill on Parquet files. We looked at Greenplum. We looked at a few others. The big problem we had with Greenplum is that I don't think there's an easy way of adding new shards and dynamically resharding. And we couldn't get Drill to behave predictably for us.

[Audience question:] Do you use any other storage solutions? So on Bagger, we do local storage: hard disks and ZFS. The other systems are all pretty basic file systems on local storage. We do use NVMe quite a bit.