Okay, so the title of my talk is Event-Driven Systems with PGQ. The focus is mostly on PGQ and kind of how we use it at Braintree to get things done. So first, I guess, a quick show of hands: has anybody here used PGQ, with or without Londiste, before? Okay, so a few people. Fortunately, my coworkers say that they have used it. That's good. So yeah, I hope you learn a little bit about PGQ and how we use it, and then I'm just gonna share some general thoughts about things we could potentially do in the future. So briefly about me: I've been a Postgres user since 2007. We installed 8.2 at a media monitoring nonprofit in Washington, D.C. The original website was built with an interesting Perl-based XML database called XML::Comma. And if you know what that is, please come talk to me, because I bet we have some crazy stories to share. So I originally wanted to use MySQL because it happened to be what I knew at the time, but a really great engineer and friend I was working with, who now works with Keith at OmniTI, advocated for Postgres, and so here I am today. So I work at Braintree in Chicago. We are a payments company; we make it easy to accept payments, mobile, online, et cetera. If any of you have GitHub subscriptions, and I imagine many do, if you've ever stayed in an Airbnb, bought any sort of deal on LivingSocial, or taken a ride in an Uber, we have processed your payments. So we work with a lot of really cool, interesting companies. We were acquired by PayPal in December of 2013. That's actually been a really amazing thing; it's really enabled us to grow and do some amazing stuff. To give you a rough idea of scale, we processed over $22 billion in payment volume in 2014. When I started, I think our annualized rate was something on the order of three billion, so it's quite a bit larger now. So, Postgres at Braintree: we switched from MySQL to Postgres in very early 2011.
The impetus for the switch was things like being able to do DDL without having to rewrite the entire table, concurrent index creation, the kind of stuff that we rely on extremely heavily today. At the time, we had two production database servers running 8.4. I think we might have had two 146-gig 15K spindles per server, and we replicated with DRBD. If anybody is familiar with that setup, it works fairly well up to a point, and then sometimes it's not so great. Now we have 20 production database servers, some of which have up to two terabytes of memory. They're just unbelievably beastly machines. We run 9.1. If you're interested in how we got from 8.4 to 9.1, we did use Londiste and we did it without taking any downtime; we'd be happy to talk about that later. We have several terabytes of data across all of our Postgres databases. So we're not at petabyte scale, but we have enough data that we have to be pretty thoughtful about how we do things. So the specific problem that we wanted to solve with the toolchain I'm talking about here was really the reliable movement of data between systems. Also, a popular topic in the hallway track at this conference has been auditing. Magnus talked about auditing in his talk yesterday, and some of the stuff that we're doing here could also be used for that purpose. So our solution to this reliable-movement-of-data problem was to use PGQ at the base, then an extension we wrote called pgq_audit, and we also use Kafka. There are some good reasons why we chose that last piece, which I will go through, but you certainly do not have to use it in order to take advantage of PGQ and pgq_audit. So, because the title of the talk is event-driven systems, let me quickly outline what people far smarter than I am say about event-driven systems.
The best example I could come up with is Martin Fowler, from, I think, the extended patterns of enterprise application architecture. He talks about event sourcing, and he says event sourcing ensures that all changes to application state are stored as a sequence of events. This is reasonable, and why this is cool and useful is that not only can we query these events, we can also use the event log to reconstruct past states and as a foundation to automatically adjust the state to cope with retroactive changes. Again, if you saw Magnus's talk, he implemented something very similar to this using triggers and his history schema. So, events at Braintree. We have two different kinds of events that we think about. The first is what we call our data stream. Our internal terminology is data stream, but really what that means to us is an event that is triggered from the database: an insert happened, this is the data that was inserted; an update happened, here's the old data, here's the new data; a delete happened; a truncate happened; et cetera. And then there's the event stream. The event stream is events that are triggered by applications. We use the same tooling, you know, PGQ, to handle both of them, but the event stream is not trigger-based. Both of these are great, and they're subtly different, or very different, depending on how philosophical you are. So the database might say: this credit card charge's status has changed from authorized to captured. This is a statement of some state changing. There are a few obvious things that we can do with this. First, we could take this change and apply it to another system. We could log the change in some logging framework, whatever. Or we could react to the change in another system. But in anything we do there, we're observing a state change and then deciding to take some action off of that.
Now compare that to an event stream message, which is much more direct: your application is going to tell you this credit card charge has captured. This is a declaration of some specific domain event that has meaning to us. With the database event before, we infer something because we've observed the state change, whereas this is a statement of an event as opposed to a statement of a change that you are reacting to. So event sourcing kind of implies that you're doing everything, but everything is a lot, and we all have lots to do. We obviously don't have a log for every single event, for every single change, in all of our systems. And this talk is gonna be scoped mostly to the data events, because it's a Postgres conference and those come from Postgres. Okay, enter PGQ. So PGQ is part of SkyTools. SkyTools came out of Skype. It has a few nice things in it; one of them, and the most famous, is Londiste, which is a trigger-based replication solution to get your data from some sort of primary to some sort of standby. It predates streaming replication, and I don't know when it was released in relation to Slony, but I think you would be accurate to call them competitors, or just alternative solutions. So PGQ is the foundation for this Londiste trigger-based replication. It has durable, transactional queuing semantics. It waits until messages have been consumed before it purges them, and it is all very nicely stored within your Postgres database. This is great. So here's just a quick little example of what this may look like. Does this look good enough? It's big enough to read? Excellent. After you've installed the extension — I've eliminated some of the output here because it's not terribly meaningful and it takes up space on a very small slide — the first thing you might do is create your queue. In this case, we'll say the queue is called events. Please ignore the smart quotes. Thank you, Keynote.
Then you're gonna register some sort of consumer. In this case, we have a consumer that's simply called consumer. And now we are positioned to insert our event. So we can simply say insert event, something happened, amazing. The signature for this function is the queue name, the event type, and then the event data. So now we can pick up our event. I'm hand-waving over one thing here that is important to note: if you are running PGQ, you run an independent daemon called the ticker. The ticker can essentially be thought of as a maintenance daemon. This is the thing that looks over all the messages — and my Mac is, oh God, you know, come on, there we go — it looks over kind of the state of your messages. It looks at what transactions are in flight and what transactions have been committed, and then it creates a batch. And when you call pgq.next_batch, PGQ goes in and looks at what batches have been created for you by the ticker. And then you select the batch. Once you call pgq.finish_batch, your consumer is considered to have seen all the messages in that batch and you can move on. This is great in the sense that if you crash after you've called get_batch_events but before you've called finish_batch, you can replay all the messages without issue. Again, this is at-least-once, ordered delivery. This is a very good thing for you. So looking at this message real quick: at the top you have an ID, that's a sequence that they use. ev_time is the current timestamp, so that's the transaction timestamp. You get the transaction ID, then ev_type and ev_data, as I mentioned before. Then you get these four extra fields that you can really do whatever you want with. There's an extended pgq.insert_event signature that you can use to insert all of those. So just looking at the PGQ schema real quick, there's one particularly interesting thing here.
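Put together, the flow just described looks roughly like this — a minimal sketch against a database with the pgq extension installed, using the queue and consumer names from the example (the `<batch_id>` placeholder stands in for the id returned by next_batch):

```sql
-- Create the queue and register a consumer on it.
SELECT pgq.create_queue('events');
SELECT pgq.register_consumer('events', 'consumer');

-- Producer side: insert an event (queue name, event type, event data).
SELECT pgq.insert_event('events', 'something', 'happened');

-- Consumer side: once the ticker has cut a batch, claim and process it.
SELECT pgq.next_batch('events', 'consumer');     -- returns a batch id, or NULL if none yet
SELECT * FROM pgq.get_batch_events(<batch_id>);  -- read the events in the batch
SELECT pgq.finish_batch(<batch_id>);             -- mark the batch as consumed
```

If the consumer crashes between get_batch_events and finish_batch, calling next_batch again returns the same batch, which is where the at-least-once guarantee comes from.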
Just to go over a couple of the other tables: there's a queue table, which stores the queue that you created. You can go into this table and do some configuration. Configuration may be something like how many child tables you want — I'll talk about that in a second — or how long you want to wait before rotating the child tables. So if you look in the PGQ queue table, you will see that the queue we just created, events, corresponds to the event_1 table, and it has those three children. It does this — and this is a little bit clever, and one of the reasons why I really like PGQ — because they've thought about what would happen if you had a forever-growing table that you were treating as your queue. You delete messages from the front; okay, now you might fill those in after you vacuum, but your indexes are gonna bloat forever and it's just gonna be a maintenance nightmare. So what they do is they have these three tables that they rotate between. So when you, the consumer, have consumed every message out of event_1_0 — this is, again, totally transparent to you, you just call next_batch — that signals to the ticker that after some configured timeout, I think the default is two hours, it's safe to change the active table so that new events will get written into event_1_1 or event_1_2, as the case may be. I encourage you to read the source and figure out how all this stuff works. It's fairly understandable and it's pretty cool. So that's PGQ. So pgq_audit is something that we wrote on top of PGQ. It's a soon-to-be-open-sourced extension; the bottleneck is unfortunately time, as with most things. We've had it running in production for almost a year now, and it writes data changes to PGQ via triggers. There are some additional possibilities. We add some customized extra data. You could add the current database user.
You could do all kinds of interesting things. We actually based it off of 2ndQuadrant's audit trigger, so whenever we get it up to GitHub, you'll see that some of the source is fairly similar in terms of the trigger generation. But of course it pushes into PGQ instead of into a table of your own. So again, there's really not much to the trigger. We want to open source it just because we have it nicely packaged, but you can pretty easily write your own trigger to do this, and then you may have extra data that you care about. Also, ours was developed against 9.1. There's no good reason it can't run against 9.2 or later, but as I recall, the pg_regress test suite fails for some, you know, silly reason. So I need to fix that, but it's not fixed. So okay, what does this look like? This is almost exactly what we do at Braintree. We'll say let's create a queue; we'll call this queue the data stream, as I referenced earlier. Now we register our consumer to the data stream, and now we're gonna call the pgq_audit audit-table function. We give it the table that we wish to audit and the queue that we wish to audit to. That creates something kind of sweet like this. You don't have to create the triggers by yourself; we have this nice little function that does it all for you. As you can see, we have an after insert or delete or update, for each row, execute procedure. So we have this trigger that runs whenever data is added or changed or anything on this table. So now, when we insert into this table and our ticker runs and we get our next batch and we select from our batch, you get something that looks kind of like this. You will notice, yes, it is hstore. It is 9.1. We do not have the backported JSON extension for 9.1, even though we would like to; we just haven't done that. So, by convention, we use the ev_type field for the table name and we use the ev_data field for the data.
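As a sketch, that setup might look like the following. Since pgq_audit isn't open-sourced yet, the audit-table function name here is an assumption based on the talk; the pgq calls themselves are the real SkyTools API:

```sql
-- Create the audit queue and register a consumer on it.
SELECT pgq.create_queue('data_stream');
SELECT pgq.register_consumer('data_stream', 'consumer');

-- Hypothetical pgq_audit call: audit this table into that queue.
-- Behind the scenes this generates something like:
--   CREATE TRIGGER ... AFTER INSERT OR UPDATE OR DELETE ON my_table
--   FOR EACH ROW EXECUTE PROCEDURE ...
-- which writes each row change into the data_stream queue.
SELECT pgq_audit.audit_table('my_table', 'data_stream');
```

Events then come out with the table name in ev_type and the row data, as hstore, in ev_data.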
Then ev_extra2 happens to be where we put the statement type. Now we have a bunch of extra fields here, so we could do a lot of other stuff. That's pretty cool. Like I said, we could store the current user. Perhaps we care about the statement timestamp or the clock timestamp, something like that. You could even get super crazy — and I hadn't really thought about this until the talk yesterday — you could set up some sort of custom GUC stuff and set it on a per-session or per-statement basis. So let's say that you were a Rails shop: you could actually set the application line calling the code, along with the git SHA that was currently deployed, and see in your audit event exactly where your database change was coming from. That would be really cool. Or at least the application name. There's any number of different things; anything that you have access to from within your trigger, you can dump in here. You're really only limited by how much data you care to write. One key difference between this and Londiste — because at a superficial level they are fairly similar — is that Londiste has a notion of what your primary key is. Because Londiste is a replication solution, it needs to know how to copy data from point A to point B, and if it's an update statement, it says, well, I need to know the primary key that I'm gonna apply this change to. If you look at the trigger definitions created by Londiste, you will see the argument to the trigger is this big, long, crazy thing — let's say that you have a primary key as your first attribute and then a bunch of values, it's like K, V, V, V, V, V, where the K is the primary key and the Vs are the values. This makes some other things annoying: if you need to run DDL, you kinda need to deal with Londiste. But we don't care about that. We just want the data.
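The custom-GUC idea can be sketched in two lines; the setting name is an illustrative assumption, not something pgq_audit defines:

```sql
-- Application side: stamp the session with deployment metadata
-- (on 9.1 the 'myapp' class would need to appear in custom_variable_classes;
--  9.2+ allows arbitrary dotted names).
SELECT set_config('myapp.git_sha', 'abc1234', false);  -- false = session-scoped

-- Trigger side: read the setting back and write it into one of the
-- ev_extra fields of the audit event.
SELECT current_setting('myapp.git_sha');
```

Any session state the application sets this way is visible from inside the trigger, which is what makes the "which deploy made this change" trick possible.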
So just for example's sake, if you were to then run, say, an update — "no, it's not important anymore"; "it's very important" — on this first row here, the data stream then gets a second event, and we use, as I said, ev_data for the old data and ev_extra1 for the new data. We thought a little bit about whether it is strange to do this, and whether insert should put its data in ev_extra1, because really that's the new data, and ev_data would in that case be null for an insert, but this is the way that we did it and it's worked well for us. So why PGQ and pgq_audit, again? We need really fast response times. We like being entirely within our database. We thought about pg_amqp — there was a fantastic talk last year about using pg_amqp to do kind of this — but we don't want to trust that RabbitMQ isn't gonna have some sort of crazy problem, and we don't wanna try and manage some sort of two-phase commit between RabbitMQ and Postgres. We like everything just being natural inside of our database, and like I said, it's really fast. We push thousands of messages per second through this thing per database, and it doesn't add meaningful latency to our requests. We're an API, right? So if, to get these pgq_audit messages, we add even, say, five milliseconds making some sort of external call from Postgres to something like pg_amqp, and that happens 10 times during an API request, this can start to add up, and it really sucks, because then we are making our customers' customers wait to process their payments. So, not acceptable. We need to be very, very conscious of how long things are taking. Yeah, right — so because it's trigger-based, it is in the primary path.
Whenever you take some action, any insert or update statement issued by the application is going to generate an audit record, and what we've found is that that is not meaningfully performance-impacting. I'll get to where Kafka comes in soon; that's where we say, well, for dealing with the messages, we get this stuff out of the OLTP system as quickly as we can. Does that answer your question? Yeah, and I'm sorry. That's cool. Okay, cool. So also, we can't lose messages. Payments — many messages may not be terribly important; it could be some little notification that we are interested in, but it could be very important. It could be a system saying, oh, I need to make sure that I mark that this transaction has been successfully funded, so that a merchant can then see that this transaction has been successfully funded. So losing messages, for us, is pretty bad. It's not just, oh, sorry, we dropped a password-reset email. And like I said, we generate a lot of messages. If we have big batch jobs running, we let them generate the messages as they go, and it holds up just fine. So another reason why we really like PGQ — and this is not a mistake — is that in-database queues are hard to get right. I don't know about people here; I have definitely thought before that I am smart enough to solve this problem really well. I am not. And it's not that it's not solvable, because of course PGQ has solved it. It's that there's a lot of stuff to think about. Like I mentioned before: deletions — now you have tons of dead tuples. How do you know which messages you've seen? Now you have to keep state somewhere else, like, oh, well, here's my offset or something. But then, depending on how you use some counter, you could get into a weird position where some old transaction gets committed after you're already beyond that point. Weird stuff can happen, and PGQ handles all of that really nicely for you.
So again, if you have the concerns that we do, and you want the nice transactional nature of PGQ — of Postgres — with your message queue, then use PGQ and don't worry about it. Like I said, pure Postgres means transactions. Guaranteed at-least-once delivery. Guaranteed delivery in commit order. This is not difficult to think about; you can reason about it in meaningful ways and build systems that react to it well. So, perhaps unsurprisingly, PGQ is not a silver bullet. There are many people for whom this is probably not a good solution. One: if your application needs to query your events in some sort of more real-time way, PGQ is probably not great, unless you are shuffling your events to another system and then querying them there. Two: if your database can't handle the extra writes. Like I said, we can, but it may be that you can't. Perhaps you're generating so much WAL that the overhead of adding anything extra is not acceptable. Like I said, we have not found that to be a problem, and we push quite a few messages through this, but I know that there are much larger Postgres installations out there than what we run. It's also not good if you have very large columns. We do not currently support the notion of excluding columns from the audit message. So let's say that you're doing something like, oh, I don't know, constructing some very large PDF file and storing it inside of Postgres as binary, or something. Insert your use case for 500-megabyte columns here. If you do that, yes, you're writing a ton of data, but now you're also auditing all of that data. So you would either need to come up with some sort of mechanism for skipping that column, or, probably more appropriately, not put that in Postgres in the first place. It's also not great — and I would say this is probably the biggest reason why this might not be a good fit for you —
if you need ordered events across databases. The at-least-once, in-order delivery is only within your specific database, because that's the level at which the triggers operate. There may be other things that you could do here — perhaps you use a foreign data wrapper around some other system and your trigger writes over there — but now we're talking two-phase commit between systems, and we get into a painful place again. For us, this is fine. We have a sharded architecture, and we really don't generally care whether something on shard A happens before or after something on shard B; so long as everything within a shard is ordered and consistent, we are happy. But if you do care about this, then it might not be a fantastic solution for you. Okay, so we have events. Now what are some of the possibilities? Like I said, we do use Kafka. We really, really, really don't like to do more on our production databases than we absolutely have to, because they're very, very busy. So we get out of the OLTP database as quickly as we can. We have a little piece of Clojure, actually — yay — which reads everything from all of our PGQ queues and dumps it all into Kafka. Is anybody familiar with Kafka here? A few people? Okay, so Kafka is a LinkedIn project — I'm sorry, it's an Apache project now. Really fantastic pub/sub system, and they model it as a distributed commit log. You can almost conceptualize it as WAL of sorts. That's kind of how we think about it: we write all these events, and then if something goes wrong, well, we rewind, we can roll forward, and our downstream systems can just sort of achieve consistency. A gotcha, if you do look at Kafka, is that you need one Kafka partition per database, because Kafka has the same guarantees that PGQ does — at-least-once delivery, in order, which is good — but that is only guaranteed within a partition.
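The partition-per-database routing can be sketched in a few lines. This is a toy illustration of the idea only — the shard names and the in-memory dict standing in for a Kafka topic are made up, not Braintree's actual configuration:

```python
# Each source database gets its own dedicated partition in the
# "data stream" topic, so per-database ordering is preserved even
# though ordering across databases is not guaranteed.
DB_PARTITIONS = {"shard_a": 0, "shard_b": 1, "shard_c": 2}

def partition_for(source_db: str) -> int:
    """Route an event to its source database's dedicated partition."""
    return DB_PARTITIONS[source_db]

def publish(topic: dict, source_db: str, event: dict) -> None:
    """Append the event to the partition for its source database."""
    topic.setdefault(partition_for(source_db), []).append(event)

if __name__ == "__main__":
    topic = {}
    publish(topic, "shard_a", {"ev_id": 1, "ev_type": "my_table"})
    publish(topic, "shard_b", {"ev_id": 1, "ev_type": "my_table"})
    publish(topic, "shard_a", {"ev_id": 2, "ev_type": "my_table"})
    # Events from shard_a stay in commit order within partition 0.
    assert [e["ev_id"] for e in topic[0]] == [1, 2]
```

A fixed mapping rather than Kafka's default key-hashing makes the per-database ordering guarantee explicit, at the cost of the write-scalability a spread of partitions would give you — which, as noted above, doesn't matter at this load.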
So we might say we have a topic called data stream, all of our databases write to this topic, and they each have their own dedicated partition. This does kind of defeat some of the scalability that Kafka affords you, but we don't challenge Kafka at all. I don't think load on that box ever goes above 0.1 or something ridiculous. It's really, really good at dealing with linear writes. So all of our applications consume out of Kafka. So now that we're here, what can we do? Perhaps search indexes — this is something that we do. You could update some Solr, Lucene, Elasticsearch, whatever you use: you could read off of your data stream and then update your more appropriate search system to the latest version of a domain object. You could even — let's say that you have one core root model or table with lots of other things hanging off of it that have foreign keys pointing to it — you can say, oh, well, any time I get an update on any of these other tables, I will simply update the searchable object whose ID is my foreign key. So you don't really necessarily need to do any joins. There are of course data models or schemas that do not permit this, but for us it generally works. Data warehousing would be another cool use case. You could take whatever the messages are, perform some sort of denormalization, and add the latest version to your warehouse. Or — your data warehouse likely doesn't support individual inserts very well — so you grab a batch of 10,000, 100,000, a million, whatever you want, transform all that into one big batch, apply the change set, and then your data warehouse is up to date, without having to do some sort of crazy, okay, well, let's select from all these tables where my updated timestamp is greater than whatever. You naturally get it out of the data stream.
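The "reindex the root object when any child row changes" idea can be sketched like this — a toy in-memory index, where the `root_id` foreign-key field name is an assumption for illustration:

```python
# Toy search index keyed by the root object's id. Any data-stream
# event on a child table carries a foreign key back to the root,
# so we just refresh that root object's document instead of joining.
def apply_event(index: dict, event: dict) -> None:
    root_id = event["root_id"]  # assumed FK column pointing at the root table
    doc = index.setdefault(root_id, {})
    doc[event["table"]] = event["data"]  # latest version of that child wins

if __name__ == "__main__":
    index = {}
    apply_event(index, {"root_id": 7, "table": "addresses", "data": {"city": "Chicago"}})
    apply_event(index, {"root_id": 7, "table": "addresses", "data": {"city": "Austin"}})
    # The later event overwrote the earlier one, so the index holds
    # the latest version of the domain object.
    assert index[7]["addresses"]["city"] == "Austin"
```

Because events within a database arrive in commit order, "last write wins" on the root document is enough to converge the index to the latest state.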
Some gotchas there, of course: you need to be sensitive to inserts. You know, if I insert something and I see ID one, but I've already inserted ID one, you have to handle that — but in my opinion that's not really much of a blocker. So, auditing. Auditing, like I said, is something I feel like I've heard a lot about in the hallway track at this conference. You could just stream all the messages into files and compress them, encrypt them, archive them, do whatever you want — ship them to HDFS, ship them to S3 — and then when you say, oh wait, who made this change back here?, you could literally go grab that file, and you would see exactly what happened, in a potentially significantly more meaningful way than WAL could ever show you. You could also have notifications, where some downstream system says, oh, you successfully signed up, I'm going to send you a password-reset email, to come back to that example from before. So, a couple of thoughts for the future. First, this sounds a lot like logical replication. That was not available when we started building this, but there is stuff that you can do here that you cannot really do with logical replication, namely all the extra stuff that you wanna throw into those ev_extra fields. So, like I said: current user, any sort of GUC, any custom session variables that you wanna set up — I'm really excited about that now. But you do pay for that; you pay for it with lots of extra writes. So you are responsible for deciding, hey, is this gonna be sufficient? Can my application afford to have the database writing all this extra data? I've also been really interested in taking the concept of a permanent historical record of all of our data and applying that to some sort of data warehousing possibility. So literally every version of every row could be written somewhere.
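The "I've already inserted ID one" gotcha is just at-least-once delivery showing up downstream, and the standard answer is to make the apply idempotent. A toy sketch — in a real warehouse this would be an upsert keyed on the row's primary key:

```python
# At-least-once delivery means the same batch can arrive twice, e.g.
# after a consumer crashes between get_batch_events and finish_batch.
# Keying the apply on the event id makes a replay harmless.
def apply_batch(store: dict, batch: list) -> None:
    for event in batch:
        store[event["ev_id"]] = event["ev_data"]  # replay overwrites with same data

if __name__ == "__main__":
    store = {}
    batch = [{"ev_id": 1, "ev_data": "a"}, {"ev_id": 2, "ev_data": "b"}]
    apply_batch(store, batch)
    apply_batch(store, batch)  # simulated redelivery after a crash
    assert store == {1: "a", 2: "b"}  # replaying left the store unchanged
```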
Be that yet another Postgres, some CitusDB maybe — I don't really know, I haven't played with it very much — or Cassandra; you could key off of your primary key and then, I don't know, statement timestamp or something like that. So some pretty interesting possibilities there: it could mean that data warehousing becomes more straightforward, and that a lot of these painful loads that some of us have to deal with become less painful. Another thing that I think about a fair amount is that Kafka is really, really great, but if you don't run ZooKeeper, now you have to run ZooKeeper. This isn't that bad, but it does mean it's not just stand up three Kafka servers; you've gotta stand up three ZooKeeper servers, preferably five, and then you have your three Kafka servers. The complexity here is not zero. If you're on AWS, maybe it's pretty easy for you to just spin this stuff up, but if you're on physical hardware, you've gotta find room to put it. We have this stuff, so it's not that big of a deal for us, but maybe Amazon Kinesis would be an interesting alternative. I think they only keep data for 24 hours, whereas with Kafka it's fairly common to keep it for up to a week, but yeah. I mean, it sucks that you have all these other systems that you now have to learn to administer, but when you think about what we're doing, maybe that's worth it. Again, that is something that you can only really evaluate based on your application. So anyway, in conclusion: there are lots of ways to approach this problem, but for us, PGQ, pgq_audit triggers, and Kafka give us a very reliable and actionable change log for our data. Okay, that's all I've got. Thanks. If you're interested in hearing about when we push anything to GitHub, please ping me on email, Twitter, whatever. Any questions? Yes. Yeah.
Yeah, I think that's a very, very good point, and it doesn't necessarily even have to be ZooKeeper, right? There are alternatives — etcd, Consul, et cetera — other things that provide some of the same sort of features. But yeah, having ZooKeeper — so take Elasticsearch, for example: at least what used to be the case was that if you used the Elasticsearch ZooKeeper plugin, you'd get much better coordination than if you relied on Elasticsearch's built-in thing. So having ZooKeeper, or maybe a ZooKeeper option, could provide very interesting possibilities for managing, like, multi-master — who's the master — or kind of a floating master. Yeah, yeah. Any other questions? Yeah. Oh yeah, nice. Awesome, that's super cool. So is that something where — well, we — yeah, cool, well, I have interest, so I'll talk to you about it afterwards. Any other questions? Nobody else? Okay, cool, thanks very much. Yeah.