So my name's Alex Henderson, I'm here from Pushpay, and I'm going to talk about No Payment Left Behind, which is a short journey through improving our payment resiliency over time as we've grown very fast as a company. I'm a Principal Engineer at Pushpay. We're a technology company working in the church sector, building easy-to-use software that helps churches grow their ministry through payments and engagement solutions. We have around 350 staff and we serve over 7,000 churches. We're originally based out of Auckland, New Zealand, but we actually have more staff now in the US, based out of Seattle.

In the beginning, when I joined Pushpay in 2014, I was engineering hire number five; our VP of Engineering, and in fact another engineer in the audience, Carl, joined about two months before then. Things were pretty simple: we inherited a very basic monolith, hosted on a single machine, with a mobile API and a website on there and a small database sitting to the side. We were dealing with multiple gateways. Some of them we talked to via Spreedly, but for the most part we were talking directly to the gateways, and we only did card payments. In fact, we were really simplistic: a user could have a single payment method at a time. There wasn't a wallet where they could hold different payment methods; they would have to delete one to add another. And we were processing about $1 million a year, so payment volume was very infrequent. We were really in a hobby stage at that point.

Where we are today: we're now processing $3 billion a year, so a lot more volume. We use Spreedly for all our gateway communications with credit cards. We also support a number of other payment methods, so we support ACH, credit card, and New Zealand bank payments, and we process checks via check scanning, that kind of thing. We scale to about 30 to 40 nodes, and we're hosted in AWS now, whereas originally we were in a colo.

We're also a little bit different. Unlike many payments companies that are able to process funds, aggregate them, and then distribute them as necessary, we have thousands of merchant facilities: every single one of our customers ends up with their own merchant facility. This presents quite a lot of interesting challenges from an engineering perspective, but it does mean we're quite well suited to what Spreedly does, with gateway configurations being registered as we go. When we started growing in 2014, this actually presented a bunch of interesting challenges, because by having a merchant facility per customer we were wearing a cost: we were trying to grow really fast, but it took a really long time to onboard our customers. It would take four to six weeks to actually get through underwriting, and in some cases prospects would lose interest by the time they completed that process, if they got through it at all, and we needed to fix it. So one of the things we did at the time was become a registered ISO, so that we could market merchant facilities direct to our customers, and that allowed us to get that onboarding and underwriting time down to one to two days.

And then we're in the generosity sector, and generosity is also just a little bit different. We're primarily processing gifts.
We're not exchanging money for physical goods or tickets, for the most part, or anything else that might be traded. We're dealing with tithes and donations. If you're not familiar with tithes, this is where you give a percentage of your income to the church, traditionally 10%, but it varies depending on faith and how wealthy the area is in which the church operates. Payments for us are a combination of one-time and recurring. We have people processing one-time payments, giving in the moment: if they're in church, or when they're just feeling generous, or they hear about a cause they're interested in, they can go and give straight away. We also have people who are tithing or giving to the church on a regular basis, where they'll set up a recurring payment, often aligned with when they're getting their salary, so maybe on the first and the 15th a payment for a fixed amount will come out every time.

And because we're dealing with churches, our traffic pattern is a little bit different to everyone else: a US Sunday is when we see a strong uptick in volume, as people are sitting in church and getting out their phones to give. The other thing about churches is that they're a community, and trust is the currency of a community. The trust between the community members, which is the churchgoers, and the church is really important. We see that the trust between those community members and the church is also really easy to erode, and we have to be very careful. When a church picks a technology provider such as Pushpay, you might think that if Pushpay is having a bad day, a gateway is down, and we're unable to process payments, that's going to put people off using Pushpay, and the churchgoers will say "that Pushpay, we need to get rid of that, that's no good." That's not where it really happens. They actually start losing trust in their church's ability to select appropriate technology, and they may just stop giving to the church completely; they're put off by the experience. That's not a great outcome, and it can be irreparable in some ways: with a really poor choice of technology provider, you will see those people either stop giving to the church or the amount they give goes down and doesn't come back up again. So we need to be very careful with that trust.

So it's 2014. It's a US Sunday. We're a fresh new team, we're all excited. The gateway is down, or a gateway we're using goes down. You can imagine us leaping into our on-call response, calling everybody, getting our gateway on the line: get this stuff back up, this is terrible, we're all going to die, it's down. Well, that sounds great, but in reality we just didn't even know we were down. In 2014 we were really immature, and unfortunately the way we'd normally first find out we were down was when a customer would call us, or even worse, when an investor would call us to let us know we were down. That is not a place you want to be; that is something you need to fix. We could do a lot better. In our case that tied in really nicely with being a young engineering team and wanting to introduce a bit of additional rigor, and we did that through post-mortems, in fact blameless post-mortems, which we adopted from Etsy's process.
So the process of a post-mortem is really that after an incident, or during an incident, you collect a timeline, you try to see what went well and what did not go well about your incident response, and then you try to take some learnings from that, and possibly even identify some mitigations so you can stop it happening again, or reduce the impact so that it hurts less the next time it happens. For us, we did something which was both intentional and not necessarily intentional, which was we decided: we're a payments company, we need to do post-mortems when we are failing to serve our customers. So if we failed to process a payment for our customers because of something we did from an engineering perspective, that was a reason to initiate a post-mortem. This also applied to gateway outages. Even though those were outside our control in many ways, at least at the time, we would still instigate a post-mortem, and it had some really good outcomes for us. It meant we kept looking at how our incident response process was going, we would go "maybe we need to deal with our gateways a bit better," and we would also identify opportunities to actually mitigate some of those potential losses.

Early on in our journey we weren't processing a lot of payments; as I mentioned, we were doing about a million dollars a year, and that meant there could be hours without payments overnight, which made things tricky. To deal with that, we initially started looking at connectivity as our first piece of low-hanging fruit for becoming more aware of what was going on with our gateways. Our approach was really simple: we had Pingdom, just some kind of ping tool, pointing at our gateways externally so that we could see whether they were up or not, and at the same time we introduced a health check endpoint internally within our network that, when hit, would run a simple set of tests to see whether we could reach back out to that gateway from within our network, so that we could test things like firewall configuration, proxy configuration, that kind of thing. The combination of those two things allowed us to narrow in quickly, when we had a connectivity issue, on where it was happening: was it something we'd done internally, which might mean we need to roll back some kind of infrastructure change, or was it something external, like the gateway being down, in which case we need to start escalating.
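To make that concrete, here's a minimal sketch in Python of what that internal connectivity health check could look like. This is purely illustrative, not Pushpay's actual code: the gateway hostnames, port, and the choice of a simple TCP reachability test are all assumptions.

```python
# Illustrative sketch only -- gateway hostnames and port are assumptions,
# not Pushpay's real configuration.
import socket
from http.server import BaseHTTPRequestHandler, HTTPServer

GATEWAY_ENDPOINTS = {
    "spreedly": ("core.spreedly.com", 443),
    "gateway_a": ("api.example-gateway.com", 443),  # hypothetical gateway
}

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection from inside our network succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

class ConnectivityHealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Run the outbound reachability tests. This exercises firewall and
        # proxy configuration from inside the network, complementing the
        # external Pingdom-style checks pointed at the gateways directly.
        results = {name: can_reach(host, port)
                   for name, (host, port) in GATEWAY_ENDPOINTS.items()}
        status = 200 if all(results.values()) else 503
        body = "\n".join(f"{name}: {'ok' if ok else 'unreachable'}"
                         for name, ok in results.items()).encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ConnectivityHealthHandler).serve_forever()
```

Hitting that endpoint when an alert fires tells you straight away whether the problem is inside your network or out at the gateway.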
We don't really rely on connectivity tracking or testing anymore; we have more than enough volume 24/7 that the payments themselves become the indicator of failure now. But there were some really nice side effects of establishing this process early on. For instance, around the middle of last year, I think it was July, we had an issue with First Data where they messed up some DNS records for us. Because we had really good connectivity tracking and monitoring at that point, when that happened we were able to jump in, identify that it was a connectivity issue, that it was a DNS issue, look back over our records of what IP address that service endpoint used to resolve to, and then reach out to Spreedly and go, "hey, would you mind temporarily creating a hosts entry just to work around this issue, because First Data has kind of not got it together right now." Thankfully that relationship allowed us to save quite a lot of volume over that incident.

Beyond connectivity tracking, the next thing we needed to do was start dealing with the more nuanced side of classifying responses. There's a lot of opportunity to identify failures that aren't classic connectivity issues but are more like something going wrong with the gateway or the gateway's behaviour. We classify our responses into a few buckets. First, we have user-recoverable errors: responses or response codes that we know we can build some user experience around. We can take people through a flow for, say, a bank decline or insufficient funds, provide an appropriate message, and they can try the payment again. Then there's the bucket we're really most interested in for incident response, where we have three classifications. We have "gateway incorrectly configured": if you recall, I mentioned we have lots and lots of merchant facilities, and it's possible that one of those merchant facilities has, say, an expired password or an incorrect login. Where we can identify that in the response and classify it appropriately, it's really only going to affect one customer at a time, so it's not necessarily the kind of thing we want to wake our whole on-call team up for, depending on the size of the customer. Then we have two other classifications: communication error and provider error. Communication error is what it sounds like: an issue with us being able to talk to Spreedly, talk through Spreedly to the gateway, or even potentially some kind of upstream issue with, say, a bank or a card brand that then looks like a communication error presented by the gateway. We also have provider failure, which is really the counter-example to user-recoverable: a non-user-recoverable error where we need to escalate with the gateway and go, "hey, what's going on, there's something weird happening with the behaviour of the gateway, we're seeing persistent failures that are not our users' fault." We've discovered those over time as we go. And last of all we have unknown error: those are responses we can't classify because we've never seen anything like them before, and when we get those we have a task of investigating them and figuring out how to classify them.

When we think about classification of responses, gateways return structured responses: you're getting JSON, you're getting XML, and we're talking to quite a lot of different gateways.
Unfortunately, none of these gateways have really agreed on a good, common way to represent these things. You'll get an error code, you'll get some error messaging, but every gateway decides to group the error codes independently, in different ways, and unfortunately it's a really leaky abstraction. In some cases gateways will group things that we would consider a provider error, something that's not user-recoverable, alongside something that is recoverable, like a CVV being incorrectly entered, and it can be quite challenging to tease those apart if you're relying on the codes. We tried to look at this in different ways over time, and we eventually came to the conclusion: no, you can't do it; just treat the response as text and match on it as text. This sounds like it shouldn't work. It works really well. The other thing is it actually gives you some nice benefits, in that you can use the same approach for classifying weird responses you get back from gateways, like when a gateway starts giving you HTML back, for instance. I won't name any names.

We like using declarative rules for these kinds of things, the kind of thing we can present back to the business so they can also check our working, check our thinking. A rule is pretty simple: we've got some upstream error code or text we're looking for, and we map that to, say, a decline classified as a comms error, with the rule only applying to the Pin gateway through Spreedly. When we first started doing this, we fell into the pattern of adding these declarative rules without thinking too much about how we were evaluating them: we stored them in a list and we'd keep adding new ones to the bottom of the list. It worked OK for a while, until we started realising how fragile this was becoming, because no one felt safe adding new rules. They had to decide very carefully: do I add it at the bottom, do I put it somewhere in the middle, at the top? There was a risk of occluding an existing rule and causing its own post-mortem just by adding a new rule. As we sat and thought about it, it occurred to us that it's really quite simple: what we need to do is sort those rules from most specific to least specific. So we sort them by the length of the text we're searching for, and then by how specific the rule is. A rule that applies to all gateways, say one that's just looking for evidence that we got an HTML response back, is less specific than a rule scoped to a single gateway or a single card brand. And if we don't match any rules, then the response is classified as an unknown error, and we trigger a process of investigating how to map it later.
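As an illustrative sketch of that rule evaluation, not our actual code: the rule shape, gateway names, and classification labels below are assumptions, but the "sort by specificity, fall back to unknown" idea is the one described above.

```python
# Illustrative sketch only -- rule shapes and names are assumptions.
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class Classification(Enum):
    USER_RECOVERABLE = auto()
    GATEWAY_INCORRECTLY_CONFIGURED = auto()
    COMMUNICATION_ERROR = auto()
    PROVIDER_FAILURE = auto()
    UNKNOWN = auto()

@dataclass(frozen=True)
class Rule:
    match_text: str                 # substring we look for in the raw response text
    classification: Classification
    gateway: Optional[str] = None   # None means the rule applies to all gateways

    def specificity(self) -> tuple:
        # Longer match text wins; a gateway-scoped rule beats an all-gateway rule.
        return (len(self.match_text), self.gateway is not None)

RULES = [
    Rule("<html", Classification.COMMUNICATION_ERROR),      # any gateway handing back HTML
    Rule("connection timed out", Classification.COMMUNICATION_ERROR),
    Rule("invalid api key", Classification.GATEWAY_INCORRECTLY_CONFIGURED, gateway="pin"),
    Rule("insufficient funds", Classification.USER_RECOVERABLE),
]

def classify(gateway: str, raw_response_text: str) -> Classification:
    """Treat the response as text and apply the most specific matching rule."""
    text = raw_response_text.lower()
    for rule in sorted(RULES, key=lambda r: r.specificity(), reverse=True):
        if rule.gateway not in (None, gateway):
            continue
        if rule.match_text in text:
            return rule.classification
    # No rule matched: mark it unknown and investigate how to map it later.
    return Classification.UNKNOWN
```

Because the evaluation order is derived from the rules themselves, adding a new rule can't silently occlude an existing, more specific one.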
Once we had these provider errors and communication errors, it was a good opportunity to start putting some alerting in place for when they had elevated rates, which is great. You're getting this alerting, you've got a lot more idea of what's down, you might know what's down before your customers do, but we were still stuck with the problem of: what do we do? We're going to log in, we're going to triage what the issue is, is it a gateway down, we'll start escalating with the gateway, but meanwhile things are on fire and we're just seeing payments failing. That's not a great outcome.

One of the first things we came to was that it's really simple to just turn off our scheduled payment engine and stop processing scheduled payments, at least during a gateway outage. We started with a manual command for doing this, but as we went we realised there were some operational concerns we needed to address, and we really leant on chat ops. We wanted to know who executed the command, and we really wanted to know: how do I know it's off? If I'm on call, how is the system going to wake me up to let me know? The way we did this was pretty simple. We have a manual command for turning off the scheduled payments engine, and when we turn it off we end up with a Slack message in the channel where we do our incident response, indicating what the state was before and is now (in this case, yep, the engine is off) and who did it (in this case, I turned the scheduled payment engine off). The other thing we do is have a health check endpoint, again, which checks whether the scheduled payment engine is on or off. If the scheduled payment engine is off, we count that as a failure, and the health check endpoint starts returning a 400 status code. We point Pingdom at that endpoint and wire it up to PagerDuty for our alerts, so we then get a PagerDuty alert to say the health check is down. This covers off both of the concerns we had: it's going to wake people up when the scheduled payment engine gets turned off, and even better, it's going to keep bugging us every 15 minutes until we turn the scheduled payment engine back on. Because if we forget to turn it back on, that can have real-world consequences for our customers and especially their payers. If we allow scheduled payments to run into the next day, people could be expecting that the money has already come out, go and spend money that we later take out, and miss a rent payment or something like that. That's not cool.

So we had this manual command and we were pretty happy; we could turn the scheduled payment engine on and off. But there was a little bit of a problem with this: it's two in the morning, you get paged, you get up, you get your laptop out, you then have to go to a payments list, sort through all the payments, try to figure out which payments failed, try to figure out which gateway it is or what the problem is, and it's only at that point that you probably go, "oh, I should turn the scheduled payment engine off," at which point we might have lost 10 to 15 minutes' worth of scheduled payments that we've been ploughing through, just throwing them into the ground. We're losing a bunch of payments; we're really not happy about this. So we needed to start triggering automatically. But on what? In our case it was pretty simple: we had these provider errors and communication errors that definitely needed an incident response, so they made a really good case for turning our scheduled payment engine off. We also considered unknown errors, given that those could potentially be provider or communication errors we've yet to classify, but inevitably, week on week, we get new unknown errors, and 99% of them are not provider or communication errors. They're actually things like unusual AVS errors we've not seen before, really more user-recoverable errors. So if you love your on-call team, don't wake them up for unknown errors.
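Here's a rough sketch of that pairing of a chat-ops command with a health check that "fails" while the engine is off. Again this is illustrative only: the Slack webhook, the in-memory flag, and the function names are assumptions, not our implementation.

```python
# Illustrative sketch only -- the Slack webhook, flag storage, and function
# names are assumptions about the shape of the idea.
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE"  # hypothetical
scheduled_engine_enabled = True  # would really live in shared storage, not a module global

def set_scheduled_engine(enabled: bool, actor: str, reason: str = "manual") -> None:
    """Manual (or automated) command: flip the engine and announce it in Slack."""
    global scheduled_engine_enabled
    previous = scheduled_engine_enabled
    scheduled_engine_enabled = enabled
    message = (f"Scheduled payment engine: {'ON' if previous else 'OFF'} -> "
               f"{'ON' if enabled else 'OFF'} (by {actor}, reason: {reason})")
    payload = json.dumps({"text": message}).encode()
    urllib.request.urlopen(urllib.request.Request(
        SLACK_WEBHOOK_URL, data=payload,
        headers={"Content-Type": "application/json"}))

def health_check() -> tuple[int, str]:
    """Engine off counts as a failing check (HTTP 400), so the external
    monitor (Pingdom -> PagerDuty) keeps paging until it's turned back on."""
    if scheduled_engine_enabled:
        return 200, "scheduled payment engine: on"
    return 400, "scheduled payment engine: off"
```

The important property is that the alert is driven off the actual state of the flag, so whoever (or whatever) turned the engine off, the on-call person still gets woken up and keeps getting nagged until it's back on.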
But we still needed a process in place to at least identify overall elevated error rates and wake people up for that. Much like the rules we use for classifying a response, we do the same thing for triggers. Our rule for triggering looks like this: we're looking for three comms or provider errors within a five-minute period, and in that case we raise what we call a payment failure reaction. We tried to separate identifying when to trigger from what the different parts of our system do about it, so turning the scheduled payment engine off is just one thing that's watching for a payment failure reaction of type comms or provider error and then doing something appropriate. Once we introduced this, we needed to revisit what our on-call story looked like again, because now we're waking somebody up and they don't know who turned the engine off or why. So we did a couple of things here. We needed to record why it was turned off, so in this case it was turned off automatically, say because of a communication failure. And then, more importantly, we keep track of all of the payments that contributed to that trigger and include links to them in our messages. This means that when people get woken up, they can very quickly jump in, click on a link, have a look at, say, the transcript in Spreedly, and within two or three seconds they're keyed in as to what the actual problem is and can start deciding whether to escalate with the gateway or not.

So that kind of did it for scheduled payments. But for the first couple of years we were really resigned to the fact that one-off payments, people just getting their phone out and giving in church in the moment, were just going to fall on the ground, and we were just waiting there, watching the clock, beating our heads against the wall, waiting for our gateway to sort the issue out so that we could get back to processing payments. It wasn't a great feeling; we felt like we were out of control. I mentioned that initially we were processing both through Spreedly and directly with gateways, but over time we've moved all our communication over to using Spreedly, and through post-mortems, as we looked at possible mitigations, we realised: actually, if we're already tokenizing all of our payment methods through Spreedly, then when we get to the point where we're about to process a payment, we could just not process it and come back to it later. We're not exchanging anything for goods; there's nothing we really need to hold; we don't need to process these payments immediately. That's where delayed payments, as we call it, came into our minds.

What that process looks like is this. First of all we'll pick a gateway, or really a gateway type; we have a lot of gateway configurations, but they're each for a gateway type, like First Data or NMI, different kinds of gateways. We'll pick one of those gateway types and say: let's set that to delayed. The next thing that happens is, at the point where we're about to do the communications to enact a payment with the gateway, we check: is that gateway delayed? If it is, we just choose not to do the gateway communications and put that payment aside into a queue for later.
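A minimal sketch of that trigger-and-react separation is below. The three-errors-in-five-minutes rule and the "delayed gateway" check are the ones described above; everything else (names, in-memory state, the stubbed engine command) is an assumption for illustration.

```python
# Illustrative sketch only -- structure and names are assumptions; the
# three-errors-in-five-minutes rule is the one described in the talk.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 5 * 60
ERROR_THRESHOLD = 3
TRIGGERING = {"COMMUNICATION_ERROR", "PROVIDER_FAILURE"}  # not UNKNOWN

recent_errors = defaultdict(deque)   # gateway type -> (timestamp, payment id)
delayed_gateways = set()             # gateway types currently set to delayed

def set_scheduled_engine(enabled: bool, actor: str, reason: str) -> None:
    """Stub standing in for the scheduled-engine command sketched earlier."""
    print(f"scheduled engine -> {'ON' if enabled else 'OFF'} ({actor}: {reason})")

def record_payment_result(gateway_type: str, classification: str, payment_id: str) -> None:
    """Called after each payment attempt; raises a payment failure reaction
    when triggering errors exceed the threshold within the window."""
    if classification not in TRIGGERING:
        return
    now = time.time()
    window = recent_errors[gateway_type]
    window.append((now, payment_id))
    while window and window[0][0] < now - WINDOW_SECONDS:
        window.popleft()
    if len(window) >= ERROR_THRESHOLD:
        contributing = [pid for _, pid in window]
        raise_payment_failure_reaction(gateway_type, classification, contributing)

def raise_payment_failure_reaction(gateway_type: str, classification: str,
                                   payment_ids: list) -> None:
    """Identifying when to trigger is separate from what reacts; these are
    two example reactions."""
    set_scheduled_engine(False, actor="auto", reason=f"{classification} on {gateway_type}")
    delayed_gateways.add(gateway_type)
    # The on-call message would record why it fired and link to the
    # contributing payments (payment_ids) so responders can key in quickly.

def should_delay(gateway_type: str) -> bool:
    """Checked just before gateway communications: queue the payment instead
    of processing it when its gateway type is delayed."""
    return gateway_type in delayed_gateways
```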
The next thing we do is, once the gateway issues are over and we're happy that, yep, they've sorted that issue out, we turn the gateway delay off. That means any new payments coming in against that gateway type will start getting processed, so we can observe how things are going. And then, last of all, we come back and release those delayed payments for processing once we're really happy everything's stable.

To do this we needed to change a few things about our payment experience. This is our success messaging for payments: the top message is what we normally tell users when things are going well. If we've got delayed payments turned on, then we change the messaging up a little bit: we use the language "authorized" instead of "success," and we say we'll tell you when it's been processed via email, rather than "we're going to send you a receipt." We were a little bit worried when we first introduced it that this language would be confusing, but it's turned out to be incredibly not confusing: we've had basically no support tickets related to this change in experience when we're delaying a gateway, which has been really pleasing. The other thing is we had to think about the on-call story. We now have a new queue: instead of turning the scheduled payments engine off, we've got these delayed payments sitting there that we need to process within a certain amount of time, otherwise we start running into issues. We have a bot that takes care of this for us: it observes the delayed payments queue and posts a message on a periodic basis with how many payments we've got in the queue for each gateway, because unfortunately we could have multiple gateways having issues at the same time, overlapping, where we may be trying to process the queue for one while accumulating payments for the other because its issue hasn't been resolved. This is working really well for us now.

Processing the delayed payments is interesting. Because we put this feature in manually, started using it, and then automated it, we weren't quite ready for the somewhat obvious realisation that once you're queuing all these payments up, you don't really know when it's safe to turn things back on. With a gateway outage you're relying on seeing actual payments succeed to trust that the gateway has fixed the issue; you're not going to rely on the gateway saying "yeah, it's all good," because it never is. It's always "yeah, in about half an hour it'll even out." So we needed to come up with a process for this. It's pretty manual for us, but that's not really an issue. We'll go and find one of the churches impacted by the outage, make a $1 gift to them, and then go in, find that payment within the queue, and release just that one payment. We'll monitor to see if that succeeds. If that's looking good, the next thing we do is un-delay that gateway, so that new payments come in and we can monitor those and see how they're going. Once we've seen a period of stability (we used to be very cautious about this, it was about an hour; we've got a lot more confidence in the way things work now, so it's normally about five to ten minutes of stability), we'll start releasing the payments we've got queued up. But again, we're really, really risk-averse, so we'll always sort those payments by dollar value first and pick a small batch of low-dollar-value payments, really protecting our customers from any additional risk if that gateway is actually still flapping. We release those first, and then, if we're happy there, we'll release the rest. We're also fairly conservative about pace: we're not trying to rush these out the door, we trickle them out one at a time. We just want to eventually catch up; we're not going to put ourselves at risk by processing them really quickly and having another failure.
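As a rough sketch of that release step: the "lowest dollar value first, small probe batch, then trickle" ordering is the process described above, while the batch size, pacing, and function names are assumptions for illustration.

```python
# Illustrative sketch only -- names, batch size, and pacing are assumptions.
import time
from dataclasses import dataclass

@dataclass
class DelayedPayment:
    payment_id: str
    gateway_type: str
    amount_cents: int

def process_payment(payment: DelayedPayment) -> None:
    """Stub standing in for the real gateway call through Spreedly."""
    print(f"processing {payment.payment_id} (${payment.amount_cents / 100:.2f})")

def all_succeeded(payments: list[DelayedPayment]) -> bool:
    """Stub: would check the recorded results of the probe batch."""
    return True

def release_delayed_payments(queue: list[DelayedPayment],
                             gateway_type: str,
                             probe_batch_size: int = 5,
                             pause_seconds: float = 2.0) -> None:
    """Release queued payments for a recovered gateway: lowest dollar value
    first, a small probe batch, then trickle the rest out one at a time."""
    pending = sorted((p for p in queue if p.gateway_type == gateway_type),
                     key=lambda p: p.amount_cents)

    # Probe with a small batch of low-value payments to limit exposure if
    # the gateway is still flapping.
    probe, rest = pending[:probe_batch_size], pending[probe_batch_size:]
    for payment in probe:
        process_payment(payment)
    if not all_succeeded(probe):
        return  # stop here; a human decides whether to re-delay the gateway

    # Trickle the remainder out rather than blasting through the backlog.
    for payment in rest:
        process_payment(payment)
        time.sleep(pause_seconds)
```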
So that brings us to the elephant in the room: what if Spreedly's down? Well, for starters, if it was, I imagine we'd see some people very quietly exiting the door. Yeah, yeah, that's fine, you don't know, it's all good. But in all seriousness, we're again very risk-averse, and we have thought this through. If Spreedly's down, unfortunately that means we're not going to be processing with any of our gateways; that's the one risk of picking a single provider to funnel all your payments through. At the same time, we won't be able to tokenize new payment methods. So it initially sounds like game over: if Spreedly's down, we've put all our eggs in the Spreedly basket and we're not going to be able to proceed. But that's not really true, because our audience is churchgoers, and they're coming to church every week. Most of them are going to end up with an account, and they've got a payment method captured from the first time they made a payment, and those people can just proceed through our experience without needing to do tokenization. So we're still able to save those payments. The other thing we do is that we don't mark Spreedly as being down overall and then just stop trying to tokenize. We keep trying to tokenize with Spreedly in the hope that it may succeed. If it fails, say we get a communication error with Spreedly, what we do at that point is just change our user experience. We go, "hey, you could use one of your existing payment methods," or "why don't you try this groovy thing called ACH for this payment," and we steer them towards a different payment method type entirely so that we can still complete the transaction. And this is working really well. To be honest, Spreedly doesn't really go down, so it hasn't been a problem.

So I've been calling this "delayed payments" all the way through. That's definitely not what we call it; do not quote me on this. We actually call it assured payments. This originally came out as a feature we identified on the back of a mitigation. We feel a lot of pain as an on-call team; we pride ourselves on how quickly we respond to these payment issues, and it hurts us deep in our hearts when payments don't succeed for our customers. It's not good, and more importantly the impact analysis is horrific: you do not want to be doing impact analysis on these incidents at two in the morning, you just want to go back to sleep. As we worked through this feature we called it delayed payments, because we're developers: well, we're delaying the payment, so it's delayed payments. That's just what you call it. When we were getting close to releasing it, our VP of Engineering and some other people said, "actually, you know what, there's an appetite for this."
Our customers really, really want this. It plays into this theme of trust: they don't want us failing their givers and that becoming an erosion of trust with the church. So why don't we start marketing it to our customers, and at the same time, why don't we think about renaming it a little, because we don't really want a negative term. And so we pivoted and decided to call it assured payments, really naming it after the duty it fulfils for the customer. It taught us a really valuable lesson, I think, which is that we hadn't really thought about marketing these internal engineering features around payment resilience to customers. It turns out it's what we do and what we're good at, so why shouldn't we be marketing it? And today, in the last quarter, we've saved about four million dollars' worth of payments, and somewhat ironically, or awesomely, during the last talk we actually had assured payments kick in for First Data and we saved a few thousand dollars' worth of payments just then. How many? 25k. There we go. So the system works.

So just in wrapping up, I've got a few takeaways. When you're starting out especially, it seems like payment processing issues are really out of your control. They're just not, and that ties in closely with establishing a process to learn from your payment outages. Once you have that process, broaden its reach: when you're conducting a post-mortem, it's an opportunity to share with the wider engineering team. People who were not involved at all in the incident response will come by and go, "hey, I've got some ideas about how we could actually make this better, how we could save some of those payments." And on top of that, when you start thinking about these mitigations, definitely employ them manually and learn how to operate them safely by hand before you automate them. This has two really good outcomes. You end up with a much safer command to operate manually, and you get a better outcome from a chat ops perspective. But it also means you can still employ them manually, because what we find is that triggering covers the scenarios you know about, but inevitably gateways are going to throw up situations you aren't anticipating, where your triggering isn't going to capture it, and being able to just manually turn those mitigations on or off for a period of time is really valuable. And it helps build your roadmap for what you're going to build into your triggering functionality in the future. That was all I had to say. Thanks for listening, and if you have any questions, find me during the break. Thank you.