Ready to start? Can everyone hear me? Yeah, cool. All right, so thanks very much for coming to this. Just to give you an introduction to what we're going to talk about, we've called it Death and Taxes. Because we're going to talk a bit about why you should always plan for the worst in your infrastructure, but also because actually, we only just adopted an OpenStack IaaS provider. And I think if we hadn't, we would have been totally screwed about three months ago. So that's kind of the story: a bit about why we've gone to OpenStack, and a bit of a story about how it saved us from a lot of trouble. So my name's Tim Britton. I'm the product owner for the digital platform at HMRC. And this is Phil. He's one of the DevOps consultants who helped us deliver this project in our web operations team. So for those of you who don't know who HMRC are, I don't normally have to introduce my organization, because people kind of boo at this point. But we're basically the British version of the IRS. For a private-sector analogy, I always think of ourselves as a big insurance company, because we take money off people and don't give them anything back, and then occasionally have to pay them something. Now, one thing to bear in mind ahead of this talk is that we have incredibly seasonal traffic, driven by the tax legislation in the UK. So on the 31st of January, every single person in the UK who is self-employed has to file and then pay their self-assessment tax return. And I think around 90% of those people do it online. So I'm going to go way back to give you an introduction to this story. HMRC has previously had a pretty bad reputation for delivering digital services, and IT generally. So back in 2010, across government, there was a move to try and find out why we weren't producing the digital services which were seen in the private sector.
And the British government commissioned a lady called Martha Lane Fox to go away and write an open letter back to government about what we should do to change this. She went away for six months, had a look around, and she came back with this idea that we needed revolution, not evolution. In HMRC especially, we were pretty much obsessed with waterfall deliveries. We had a huge amount of governance, and we had a massive amount of outsourcing. So everything in HMRC was outsourced. Now, we were probably the worst of these departments in some respects, because we had previously had a bit of a data breach, which meant that our governance made it impossible to make any changes. We had six-month release cycles, and we would have huge amounts of change on one or two weekends of the year. And if you did something wrong, or you had a bit of content wrong on a page, you wouldn't be able to get that in until six months later. So that's the picture that we were looking at in 2010. The one thing that she came back with was an idea to create a new department, the Government Digital Service, which had a single mandate to revolutionize digital services. This was a new department connected to the Cabinet Office, which is probably the most politically powerful office in government under the Treasury. And they came in, they were basically hired not from within the civil service but externally from the private sector, and they came up with a really, really good set of principles about how we should deliver digital services in government. So there was an emphasis on multidisciplinary teams, agile principles, and a massive focus on user needs.
And one of the key things they came up with, something you can read about online, is that whenever you want to put something online in government, you have to go to GDS and be assessed against their design principles before you're allowed to go into production. So they started off this program called the 25 Exemplars. This was basically 25 pilot projects which would be delivered with these sorts of principles. And three of those were in HMRC. Now, the reason there was such a high proportion in HMRC is because we are actually responsible for 50% of all transactions with the British government. So faced with these deliveries, at the time we had this very, very archaic setup for delivering services. We realized that actually the only way we were going to do this was if we built a new department within HMRC which was not within any of the confines of the current organization. So we didn't use any of the corporate networks. We went out, we bought MacBooks, we brought in people from the outside, and we started quite a small skunkworks to basically deliver these three services. And at the time, this was absolutely awesome, because we had unlocked laptops we were running around with, we got to talk to users, stuff that we'd never done before. And we were delivering services in-house. And we got really, really excited. We got really excited about working in an agile way. And I think at the time, we got a little bit too excited about all the functionality we were building. We kind of forgot about infrastructure. There were some other things involved as well, but infrastructure certainly took a back seat, because the business doesn't generally tend to get excited about infrastructure when they're seeing the demos of the new screens coming through. So we did the typical thing. This is about 2013, just to give you an idea of the timeline.
So we did the typical thing, which is we kind of forgot about it for a while, and then we suddenly had to scramble for an IaaS supplier. At the time, government had a very particular view about what suppliers you can use. We can't just get out a credit card and go on to AWS. The British government are pretty stringent about what you can do with the data of the British public. So we were restricted to trying to find a cloud supplier who could provide us with what we needed and the availability we needed, but also wasn't US-based. And actually, one company filled the niche in that respect, a VMware IaaS supplier, and we basically put all of our chips on them. In 2013, that was the only place we could build infrastructure. So we started off, and we were in a position where we didn't really have very much time. We had three services. These are the three services on the outside. This is a look at how we were shaped organizationally. We were so rushed to build this infrastructure. We had all this work that was shared components all three of these services would use, and we needed a name to call that work. And we just thought, we'll call it the tax platform. We didn't really know what that meant at the time. And we didn't think of it as a PaaS or anything. We just had to build stuff. We had to be efficient with our time, because we couldn't afford to diversify in terms of services. So we built a shared infrastructure. We'd already chosen to have a microservice architecture running in Docker containers, but we didn't really, at this point, think this was going to get that much bigger. So anyway, we managed to scramble some infrastructure together and get these services live. And they were really, really successful, because they were the first digital services we put out which actually went in front of a user before we put them into production. So the business is like, this is awesome.
This is the first time we've actually had people saying good things about our services. That's a bit harsh, but you know. And so they typically put loads and loads of money behind it. So then we have people phoning us up going, yeah, by the way, we're going to set up a delivery center in Newcastle that's going to have 20 agile teams. So you're going from three, went to five, and then suddenly, bam, we went to 20. Then someone phones up again and goes, by the way, we're setting up another delivery center in Telford, which is in the Midlands of the UK. They're going to have another 20 teams. So at this point, we're totally, desperately trying to scale for this massive number of teams, this massive amount of functionality. And we realize we're kind of forced into a position where we have to build a platform which allows teams to very quickly stand up and very quickly deliver good digital services. So we managed to do that, but remember, we're still on one IaaS supplier, and we struggle, but it's kind of a different presentation to talk about how we got through those struggles. We managed to build a platform which provided us with the things that we needed. We used pretty much predominantly open-source technologies, and Phil will talk about that in a second, and we also tried to open-source as much of our code as possible. So if you go to github.com/hmrc, you'll find pretty much all of our code. And we got to this position where, instead of being this kind of skunkworks in a small building, we were slowly becoming the main people in HMRC who had to deliver and operationally run services for the UK government's tax authority. At the start, we could go down and it wouldn't really matter, because those services weren't totally essential to the UK tax authority.
Now, when we get to this point in 2015, we start to realize that actually it's all getting a bit serious now, and if we fall over, the country starts to lose money, which is really worrying. And it's only really in October 2015 that the business really said, OK, you guys have been successful. We're going to put all payments on it. So every single online method by which you could pay HMRC, so the UK government, for your tax was moved onto the tax platform. Also, the only way in which you could file your self-assessment was moved onto the tax platform, it previously being hosted by an incumbent supplier. So we slowly, gradually, incrementally bring these services on, and then we got to a kind of critical mass in October 2015, and we were like, this is kind of scary, but it's good. But we're still running on one supplier, and we're not getting the availability we need. And to give you an idea of what that means, we're getting some outages, which previously weren't that worrying, because they weren't critical services. But we knew that in January, we'd take something like £150 million in a day at the self-assessment peak. And if you fail, if you get downtime at that point, things start to happen which are pretty scary. If you have to delay the tax deadline, the Prime Minister and the Chancellor have to meet and sign that off. The Treasury have to start borrowing money to cover the loss that they've had in the interest. So we're in October, and we start to go, this is kind of worrying. We've got to go looking for another supplier. And basically, at this conference you hear a lot of people saying, we made a really strategic decision to go with OpenStack. And I wish I could say the same, that I had the forethought to take this really strategic decision. But really, we were like, oh my god, we need to find another supplier really, really quickly.
They need to be UK-based. So we went out to the market, and we were like, who is around? And we found DataCentred. And we were like, right, we know that the OpenStack API is really, really versatile. These guys look good. This is our best bet. So we started to try and build out a multi-active, or an active-active, architecture at that point. So I'm just going to hand over to Phil so he can talk through what the architecture looks like and what that build was like. OK, thanks, Tim. So October 2015, HMRC is pretty nervous, coincidentally exactly when I joined HMRC. So that's pretty much all I knew, this level of fear. But, you know, that's good. If you're building any kind of infrastructure, if you're doing any kind of engineering project, it doesn't matter if it's big or small, it's actually really good to be afraid. It's no good being positive and saying, well, you know, maybe everything's going to be fine. Maybe none of the disks are going to fail. Maybe everything's going to be great. It's not like that. The disks are going to fail. The glass is half empty, and you've got to be pessimistic. So what does that actually mean? It means you've got to plan for failure. You've got to be resilient against any kind of failure in your system. So it's quite well known that Netflix have got their Chaos Monkey. If anyone doesn't know about that, it's essentially a bit of software that runs in their infrastructure that just randomly kills stuff. And it happens only during their working hours, which is probably quite sensible, so you don't have to have a call-out if it goes wrong. But it essentially means that their engineers get to plan for failure, because they absolutely know it's going to happen. It's going to happen all the time. We don't do that, but we build resilience into every part of our stack. Every tier is resilient. We have good stateless microservices. They're running as Heroku-type 12-factor apps, with resilient Mongo clusters.
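The Chaos Monkey idea Phil describes is simple enough to sketch. This is purely illustrative Python, not Netflix's actual tool (which is a service that terminates cloud instances); the instance names, the 9-to-5 window, and the `pick_victim` helper are all invented for the example.

```python
import random
from datetime import datetime

# Illustrative sketch of a Chaos-Monkey-style killer: pick a random
# instance to terminate, but only during working hours, so engineers
# are at their desks when the failure they planned for actually happens.
WORKING_HOURS = range(9, 17)  # 09:00-16:59, an assumed office day

def pick_victim(instances, now=None, rng=random):
    """Return a random instance to kill, or None outside working hours."""
    now = now or datetime.now()
    if now.hour not in WORKING_HOURS or not instances:
        return None
    return rng.choice(instances)

# The caller would then terminate the chosen instance through the
# cloud provider's own API.
```

The point of the exercise is that every service has to be able to survive `pick_victim` choosing it, at any time.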
But most of all, we run virtualized in the cloud, with no hardware that can possibly fail, because the cloud is 100% resilient. OK, so some people spotted that that is not true. So the cloud, that's interesting. In one of the keynotes, I forget who it was, but someone said it's difficult to explain to non-technical people. I actually don't think that's true. It's difficult to explain to technical people, because there is no such thing as the cloud. It's just a data center. It is a marketing term. What you're actually talking about is API-driven data centers. That's what they are, with a good advertising team. They're also virtualized. So: API-driven, virtualized data centers. But they're still running software, still running hardware, and they've still got humans running them. And it's not to blame anyone, but all of these things fail. So you've got to plan for it. September last year, 2015, completely unrelated to HMRC, AWS had a six-to-eight-hour outage. It took out parts of Amazon's own infrastructure, took out Tinder, took out IMDb. And despite the Chaos Monkey, it took out Netflix. AWS's SLA is 99.95%. That's about 20 minutes of outage a month that they can have without having breached their SLA. And for the slide that Tim showed us earlier, that wouldn't fly. If we had a 20-minute outage during our peak, that would be really bad. If we had a six-to-eight-hour outage, well, people would lose their jobs. It's as simple as that. So we don't just need resilience within the DC. We need resilience between suppliers. Our plan to go multi-vendor protects us against the failure of an entire data center. It protects us against the failure of a whole supplier. It could be infrastructure bugs. It could be human error. It could be zero-day vulnerabilities. But we should be protected. There's also business value to going multi-vendor, and Tim's going to tell you a little bit about that.
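That "20 minutes a month" figure is easy to sanity-check. A quick sketch, assuming a 30-day month, which is why the exact number comes out a touch over 20 minutes:

```python
# Allowed downtime per month at a given SLA, assuming a 30-day month.
def allowed_downtime_minutes(sla, days=30):
    """Minutes of outage permitted over `days` days while meeting `sla`."""
    return (1 - sla) * days * 24 * 60

print(allowed_downtime_minutes(0.9995))  # ~21.6 minutes a month at 99.95%
```

A six-to-eight-hour outage (360 to 480 minutes) is therefore well over an order of magnitude beyond what the SLA already permits.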
Yeah, I just think, from my perspective, right now we run across VMware and we run across OpenStack. And I know that, well, we're tied in, but we know that either can die. That's probably not what you're supposed to say at an OpenStack conference. But if OpenStack as a technology dies, we have some insurance against that. If VMware as a technology dies, we have some insurance against that. And as we go into the future, we plan to have three different technologies underlying our infrastructure. So if AWS comes to the UK and we build out there, then we spread those bets evenly. OK, so I'm going to talk a little bit about what the infrastructure looks like. Essentially, what we're doing is running a web gateway for people to interface with government for their tax affairs. Requests first of all go to a CDN, Akamai, before being farmed out to the edge of each provider. We have a public-facing zone. This is full of networks, full of microservices and Mongo database clusters. Then we have a layer of proxies between that and a protected zone, which is also full of microservices and Mongo clusters. Finally, on the Skyscape side only, that's the VMware provider, we have a private layer. It runs more secure processes, and it actually has nothing to do with the customer layer, so we can lose that without any interruption to the customer journey. And finally, behind that, we don't actually store any data permanently in the infrastructure. So there are a couple of secure data centers, which are physical data centers, which sit behind there. We link them all up by VPNs, so the Mongo clusters are distributed across... oh, sorry, I've got ahead of myself. And you can see the traffic flowing through that. Sorry to interrupt, but there's an interesting bit about this. There's quite a lot of stuff in the keynote talks about this kind of bimodal way of delivering stuff. And really, we built...
We weren't gonna revolutionize the main tax systems, because they're very varied and they're very old. The idea is that we build a new infrastructure on top of them, present it as a digital layer, build a bunch of APIs from those tax systems, and gradually strangle them. So you don't try and change everything all at once, but you build an API layer which allows you to pull the information you need out bit by bit. And soon you get to the point where you actually have APIs into all of that data and it's much easier to use. But all the time, you're incrementally adding customer value. So, I forgot to mention at the beginning of this: we've got a bit of a demo, but it runs all the way through the talk. What Phil's done is kick it off at the beginning. It's in production, and it's gonna change the routing of our traffic between these two providers and what proportion we're sending to each. And then we're gonna show you the impact at the end. Yeah. OK. So the majority of work on this was done by a small team, four engineers, with a little bit of weekend work. We used vCloud Tools, which the Government Digital Service, the GDS that Tim mentioned earlier, developed, and Terraform. And because of OpenStack's open APIs, we were able to integrate very quickly and build this out in just four months. And Tim's gonna tell you whether it worked or not. Bit of a rubbish presentation if it didn't work. But let me just, bear with me two seconds. Right, so I'm just gonna give you an idea of the timeline. We talked about how we started this in October. Normally we'd push infrastructure changes out by going through dev, then QA and staging, and then into production. But we didn't have time, so we started building out a staging environment. A new staging environment which we'd actually have to use for functional testing as well as performance testing.
And before we'd even finished it, given the timelines that we had, we had to start building out the production version already. So we had two teams working in tandem, one still finishing off the previous environment. Then we started building production in both Skyscape and DataCentred. And luckily, we got it functionally tested in November in staging, so we didn't have to abandon the production build. And then on Christmas Eve, which is actually an awesome day to have a full outage if you're the tax authority, because no one does their tax on Christmas Eve, we had the start of a 48-hour outage of our current production, which was running on one supplier. And so at that point, on sort of Christmas Day and Boxing Day, we had our CIO phoning us up being like, when can we turn this thing on? Because we're clearly not gonna get through this massive business event on January the 31st. And then I think on the 12th or 13th of January, we did something I wouldn't recommend to anyone, which was a massive big-bang switchover. So we turned off all of the tax systems, we replicated all of the data which was currently in the Mongos, populated the new Mongo clusters, and then we switched everything over and tried to test 40 services. And we had every QA in HMRC screaming down the phone and Slack being like, I don't know if it works. And it was awful. But thankfully, eventually everything kind of woke up and we got it working over about an hour. And luckily it worked. And so we're in a position where, on the 13th of January, with the business peak on the 31st, we are running across two different technologies, two different data centers, and our CIO phones us up and goes, right, let's turn off Skyscape for the first time in two and a half years. We managed to turn off one of the suppliers and rely solely on our new supplier, DataCentred. And then he's like, right, switch back.
And it's as easy as that now, actually: you can just switch between the suppliers in terms of how much traffic you're routing through each. And we generally run traffic through both all the time, just to keep things lively. So, just to talk about our 31st of January, because big business peaks are kind of our live ops. I feel like the biggest success stories are always anticlimaxes, because the 31st of January was really, really boring for us. We didn't have to do anything. We knew that we were resilient across data centers. We spent most of the time eating pizza and having LAN tournaments. And that's the kind of attitude, the kind of thing you want to see on the 31st of January, which is something that's never really happened before, because normally we would be absolutely bricking it. We have a lot of self-healing containerization, so it does kind of look after itself at the moment. And this gives you an idea. I always look at Twitter when we do the January peak. I've done two of them now. And Twitter is great, because if you look at previous years, people are like, I can't log in, I can't log in. This year, they're just angry that they have to pay the tax authority, which is awesome. I mean, that's what you want to see: people not complaining that they can't pay their tax, just really annoyed that they have to. And in fact, there are some we couldn't put up there. Yeah, 90% of them you can't put up, but yeah. So thanks very much. And if there are any questions, fire away. Yeah. Yes, so we have dedicated infrastructure waiting to be utilized. At the moment, we basically run double the infrastructure that the traffic actually needs. Now the plan is to, sorry, I completely forgot the demo. Basically, this is the Kibana graph showing the traffic between the two data centers.
This is the live view of all HTTP requests by data center for the digital tax platform, which now hosts the vast majority of HMRC's digital services. We flipped it over. When did you do it, Phil? Probably, yeah. So one of the problems is that Akamai takes 10 minutes to actually push out the changes. So there's a slight delay. So as we talk, we'll keep watching. Yeah, we'll keep watching. It's gonna look better. But as I say, yeah. So we have double the infrastructure footprint that we need. One of the things we want to do is introduce a third supplier, and then we'll reduce the footprint, because we're assuming only one supplier of the three will fail. Yeah, we'll each be running 50% of the required amount. Yep. So we run a single cluster with one master. Generally the master's in the data center which is receiving the most traffic. It doesn't automatically fail over. So the assumption is that if we have a data center failure, then you'll also get an election, if you need one. Yeah, I think the missing piece of that puzzle is that there is actually an AWS rollout as well, which has the arbiter node. So the Mongo election's triggered, and it should actually be seamless. If there was a complete data center outage, it would be a seamless transition. The issue is that generally data centers don't tend to just disappear. Yeah. Yeah, we should say that one of the suppliers is basically the front row here. So yeah. So yes, we've actually moved. Yeah, we do. We have a lot of queries set up to look at stuff like whether or not Mongo queries are slowing down. So, you know, looking at databases, and also if we hear stuff on the government grapevine, sometimes you hear, oh, something's happening over here, and we'll just move across for safety. So we've done that before. But yeah, basically I have a load of stuff set up in Sensu, and we're using PagerDuty.
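The arbiter arrangement being described comes down to a bit of voting arithmetic. A minimal sketch, assuming two data-bearing members per supplier plus one arbiter in a third site; the member names and layout here are invented for illustration, not HMRC's real topology:

```python
# Five voting members across three sites: if any single site goes dark,
# a strict majority of voters (3 of 5) survives, so a Mongo primary
# election can still succeed. Names and layout are illustrative only.
members = {
    "supplier-a-1": "dc-a", "supplier-a-2": "dc-a",
    "supplier-b-1": "dc-b", "supplier-b-2": "dc-b",
    "arbiter": "dc-c",  # votes in elections but holds no data
}

def survives_loss_of(site, members):
    """True if a strict voting majority remains after `site` goes dark."""
    survivors = sum(1 for s in members.values() if s != site)
    return survivors > len(members) // 2

for site in ("dc-a", "dc-b", "dc-c"):
    print(site, survives_loss_of(site, members))  # True for each single-site loss
```

Without the third-site arbiter, losing the site that held the majority of voters would leave the survivors unable to elect a primary, which is exactly the gap the AWS arbiter closes.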
So as soon as we see any trouble. I'm sorry. Yeah. Any other questions? Yeah. So I'll answer your second question first. I think we have to go by the principle that we should open-source everything that we can, because our attitude is that the British people have paid for it, so they should get to use it. In terms of us adopting other people's code that's been open-sourced, it's the same argument as everyone else's, the same advantages. Your first question, in terms of adopting OpenStack: we have requirements at different levels of our architecture, right? So right at the bottom of our architecture, where we have all of the information about every single person in the UK, that's stuff that we want to really, really protect. But at the front of our architecture, we have really transient data, like someone's name, someone's email address, which doesn't exist for very long, and maybe their session token. We don't need to protect that very much, and there are different requirements for different levels of security protection. Some data we won't even have in a data center which isn't wholly UK-based and UK-owned, where every single sysadmin is security cleared up to one of our highest clearances. And that's the huge mainframes, because we don't want people to interrogate the public's data. But realistically, if you're at the other end of that spectrum, you can look at companies which, you know, could be interrogated by foreign nations, because it doesn't really matter, because what are they gonna do with that information anyway? So if someone like AWS comes to the UK, we'll probably use them, because there's no reason not to. And I think that's the reason OpenStack is a great thing: we're now using a really good technology with a supplier which we know is UK-based and UK-owned, and that wouldn't have happened otherwise.
Like, the experience we've had, and kind of this story, is about that not happening until this came along. And we could jump into public clouds like AWS and Microsoft, but, you know, we're at the whim of the British public in that, and they very, very quickly adjust their understanding of data sovereignty. People didn't really understand what the USA Freedom Act was about, they didn't understand what Safe Harbor was about five years ago, but now people have a much better understanding. And, you know, is putting your data in an AWS database, an AWS cloud, going to be safe even if it's UK-based? That's an interesting question, and it's interesting when you ask American companies who have UK-based data centers. I think the public's understanding of what that means for their data gets better and better all the time. So we need to hedge our bets in terms of which providers and which cloud technologies we can go with as we go through the coming years. Well, actually, there are three other UK departments currently hosted on the tax platform. So we currently host Civil Service Resourcing and the Valuation Office Agency, and we've got a pilot with DWP. So yes, I think we're probably the furthest along in terms of providing a platform. GDS are now building Government as a Platform, so that'll be a healthy competitor to us, because I don't think it's a good idea to put every single government service on the same platform. But I think other departments will either follow our lead, go to GDS to help them provide their platforms, or build one of their own. It depends on the requirements, I guess. I can also add a little bit to that. The Home Office had one of the exemplar projects as well; I used to work on that. GDS have driven this across all government services, and there's a lot of interesting stuff happening.
I mean, I'm not gonna lie, we still have some big monolithic horror shows, but, like I said, this is the best bit of HMRC to work in, I guess. Well, I think a lot of it stems from GDS, and the culture that they've built is one where you have a lot more autonomy. So when we started this, we were given a lot more autonomy to pick which suppliers we wanted and which tools we wanted. And with that autonomy, it's a great place to work. In HMRC previously, on some projects where you had to go through several layers of governance to get anything delivered, it's not particularly exciting. And when it's not particularly exciting, you don't really want to take ownership and you don't want to be accountable for things. But if you give people autonomy in an organization and say, look, guys, you can do what you want, but it's on you if it fails, they're like, yeah, but I get to do what I want, and actually I'm confident in my abilities to deliver something. So maybe I'll pick a smaller supplier who's better, because I actually believe it will work, rather than trying to buy myself a safety net by outsourcing all the risk to a massive supplier. Yeah, and that is actually a large part of it. We are trying as a government to use our purchasing power to grow the economy through SMEs, by investing in decent SMEs. Do you know what, I feel like I'm quite blinkered in terms of what goes on in the rest of government, because I've got this thing to look after. But I think it happens a lot in the digital area. I think the attitudes that have been nurtured there could perhaps be encouraged lower down the stack, when we're doing things with larger-scale infrastructure problems. But yeah, I think it's a good start. Any other questions? Cool, thanks very much. Thank you.