With great pleasure, I welcome everyone to the session by Ben and Gerald today about how HMRC secures digital services at scale. We are just glad that both of them could make the session today and share their experiences. Hello and welcome to how HMRC Digital secures services at scale. Gerald and myself will introduce ourselves and then we're going to split this talk into two sections. So I'm going to go first. I'm Ben Conrad, head of product at MDTP, which is the tax platform at HMRC. I'm listed in the Agile India programme as working for Her Majesty's Revenue and Customs, but like our national anthem, we've already changed the words to His Majesty's Revenue and Customs. HMRC are the UK tax authority, broadly equivalent to the Internal Revenue Service in the USA, or I believe the Department of Revenue in India. And I'm an AppSec snooper. So I'm going to start by giving some background to the platform: why we care about the security of the applications we host, and what we wanted to do to improve it. And then Gerald is going to talk about what we've been able to do and where we go from here. I'll give a really quick overview. HMRC are responsible for ensuring that people and organisations in the United Kingdom pay the right amount of tax and duties. The customs side of the organisation has needed to expand in recent years as a result of the decision for the UK to leave the European Union. Long before that, in an effort to reduce our postage costs, HMRC have been building digital services in line with the approach defined by the UK Government Digital Service, GDS, and that is broadly digital by default. To make all of this easier, HMRC have a multi-channel digital tax platform, MDTP, or just the tax platform. I'll probably switch between the names throughout this talk. The platform exists to make building and hosting digital services as easy as possible.
MDTP is a platform as a service where the infrastructure, logging, metrics, alerting, CI/CD, testing, prototyping, templates, everything you need to build and develop a digital service, is provided out of the box, and nearly all of it is self-service. MDTP itself has been in existence for the last eight or nine years, so we're really quite mature and we host nearly all of HMRC's customer-facing digital services. As a platform, we provide a set of infrastructure and tools to allow people to build, test, and deploy services written in Scala with the Play framework, and we think that we're pretty good at it. The platform is hosted in AWS, but the platform abstracts AWS services so that developers writing services to run on MDTP do not need any AWS credentials. We could in the future move MDTP to a different cloud provider, and although that would be a lot of work, the services running on MDTP would hopefully not need to make any changes. We often talk about MDTP being an opinionated platform. The opinions we hold define the paved road, the golden path, the bowling alley of success that we provide to our users, and the intention is that if people follow that paved road, it will allow teams to build services quickly and efficiently. The headlines of these opinions relate to technology, so use Scala for writing services, use Mongo for persistence, but they're also just as much about practices. We have opinions about using continuous integration, deploying frequent small changes through automated pipelines, and having really good test coverage. The payoff of this, why we do it, is that if you follow our opinions, stay within those guard rails, keep to the paved road, then you can focus on solving business problems and deliver value really quickly. The other result of this is that the services on MDTP are built using common technologies, where possible common components are reused, and there are common patterns.
And as we'll talk about later, there's a huge advantage that stems from that degree of consistency in the applications that we host. So what is AppSec? For someone like myself with a mild speech impediment, there's a risk that it will sound much ruder than it normally is. It's application security, which we differentiate from our platform security team. We've always taken responsibility for the security of the platform itself, the infrastructure, the features that we build. However, it should be clear that just concentrating on the infrastructure is only one side of the coin. The platform itself can be hardened and held to be relatively secure, but that doesn't really count for a lot if the applications we're hosting are riddled with vulnerabilities that are easy to exploit. We may be able to build really strong walls that withstand all sorts of attacks, but if the windows and doors are wide open, it's not going to provide a high level of protection. The services hosted on MDTP have always had, and I certainly hope always will have, responsibility for their own security. But because of the consistency that I've mentioned, we're able to effectively look for vulnerabilities, not just in a single service, but in hundreds of services at a time. We can also provide tooling that enables the teams themselves to proactively check for known vulnerabilities in their code, and all of that is part of an automated CI/CD pipeline. Yeah, this is what we're trying to avoid. Whether it's our platform security team or the application security team, the goal is essentially the same. We want to prevent security incidents. A security incident for HMRC can be categorised in a large number of different ways. There are reams of text on what defines a security incident for us. The point I'd like to make is that we're a large target.
We process payments of hundreds of billions of pounds every year, and we legitimately pay out billions, even in years without a global viral pandemic, and during COVID we built services that paid out an awful lot more. The applications on the platform process the data for around 45 million individual UK taxpayers and around 5 million different companies. And that data in itself is really valuable, and the UK government has legal responsibilities to protect it. The AppSec team are focused on looking at the security of those applications as a whole. There are three aspects of AppSec defined here: finding vulnerabilities, fixing those vulnerabilities, and preventing them from being a problem. If you can find and fix them before anyone else is able to exploit them, then you'll likely prevent them from being a problem. Indeed, there are other steps you can take to prevent even widely known vulnerabilities from being exploitable if you've got the right protections in place. As an example of this, there have been recent attacks where an application might have been susceptible to the initial vulnerability; however, something like a proxy blocking egress back to the internet for everything other than a list of allowed URLs would likely be an effective mitigation in blocking that exploitation. That is defence in depth. But rather than focus on a single service, we have a slightly bigger challenge: we have a microservice architecture. And another of the rules that we have is that we ask services to reuse functionality that is already offered by the platform, so there are some microservices that are used by nearly all the others. The numbers on this slide fluctuate a bit. There have been around 170 new microservices created on MDTP so far this year, but not all of those are running in production. Sometimes we get to decommission old services if they get replaced or aren't needed anymore.
And how you count teams is quite difficult, because whereas sometimes there's a one-to-one relationship between a team and the single service it looks after, we also have live service teams who may look after 50 or more services, and there are plenty that fall between those two extremes. The teams can vary in size as well. The number that isn't on here is that there are over 250 digital services hosted on the platform, different journeys. And that's because the UK is really inventive at coming up with new taxes. So the point here that I've been labouring, I realise, is that we're operating at significant scale. We have our own in-house catalogue for these services and it's a really important tool for operating them at this scale. As I mentioned, one of our opinions is that the services on MDTP are built by agile teams. We ask and expect all teams to use continuous integration. There should be no waterfall development happening here. To enable this, we provide CI/CD tooling and, as you can see in the screenshot, that currently means Jenkins. Each commit on GitHub can trigger a pipeline of actions to build your software and generate a slug; that's an artefact that is stored in Artifactory. It will then be scanned in Artifactory, deployed to the first pre-production environment and tested, then deployed to the next environment where different tests are run, depending on what the team have configured. The testing will vary depending on the purpose of the microservice, so a front-end will likely include things like browser compatibility testing that obviously is no use for a back-end or an API. As a platform, we provide tools to allow automated security testing as part of those pipelines as well, which at the moment means ZAP. And again, I think this just highlights the scale. There are quite a lot of changes to code happening here. On this graph, you can see Christmas and, I think, Easter to a lesser extent.
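As a rough sketch of the gating behaviour described here, a pipeline can be modelled as an ordered set of stages where the artefact stops progressing at the first failure. All stage names and functions below are illustrative, not MDTP's actual tooling:

```python
# Illustrative sketch of a gated build pipeline: stages run in order,
# and a failure stops the artefact from progressing any further.
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class PipelineResult:
    passed_stages: list = field(default_factory=list)
    failed_stage: str | None = None


def run_pipeline(stages: dict[str, callable]) -> PipelineResult:
    """Run stages in order; stop at the first failure so a bad
    artefact never reaches the next environment."""
    result = PipelineResult()
    for name, stage in stages.items():
        if stage():
            result.passed_stages.append(name)
        else:
            result.failed_stage = name
            break
    return result


# Example: the security scan fails, so the next deployment never runs.
stages = {
    "build_slug": lambda: True,
    "artifactory_scan": lambda: True,
    "deploy_env1": lambda: True,
    "zap_scan": lambda: False,   # the security test fails here
    "deploy_env2": lambda: True, # never reached
}
result = run_pipeline(stages)
```

The deploy-to-production step would simply not appear in this automated chain, since that remains a manual decision.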
Each of those lines of code could be built into a new artefact and then be tested through the pipelines I've just mentioned. If a test fails, then the pipeline fails and the artefact won't progress any further. Deploying to production, by the way, remains a manual step. And all of this does create a challenge for HMRC, because things are constantly changing on the platform and we want to know that we're not introducing security holes with those changes. Again, there are lots and lots of changes, lots and lots of deployments. I don't think it ever drops below a thousand deployments to production in any given month, and in pre-production environments it's generally well over 4,000. There are a lot of changes being made across the platform, and a lot of these changes will be proactively scanned with tooling such as ZAP, but that in itself is not going to be enough to prevent vulnerabilities from being introduced. I just want to be clear here, because it's easy to get the wrong impression. The number of changes in itself is not a security problem. Indeed, it's very much the opposite. If we implemented a change freeze and set the whole platform in aspic for the next year, we would become more vulnerable to security incidents, not less. A sizable portion of these deployments will be upgrades to remove older versions of code with known security risks. All of these changes are improvements to services that HMRC make available, and higher numbers are better. I've already briefly alluded to the catalogue. This is an internal tool that we've developed in-house, but it is now possible to use more generic alternatives if you wish. As a tool, it's something of a Swiss Army knife. It holds a vast trove of information about the applications, and nearly all of it is automatically generated, so there are no manual updates required to anything here. What's shown on this slide is one of the most basic tools: it's what's running where.
It's a really, really simple thing, but it's really useful when you're running at this sort of scale to have an at-a-glance view of which version of any given application is running in which environment. There's also information about every repository, every deployment, and we've got other features that allow you to view the microservices in many different ways. I think that's quite enough context, so I'll get onto the security stuff. I hope I haven't bored you all too much. As I've mentioned, HMRC is a target for all sorts of people. The motivation of such people might just be disruption. We're a department of the UK government, and not everyone around the world loves us and what we do. Not everyone in this country loves us and what we do. Some of them might be motivated to attempt something like a denial of service attack. There are also 60,000 civil servants at HMRC, so this point is about unauthorised access. All of those people will have been security cleared to varying degrees, but it's generally not very high. There are also thousands of people working at HMRC as contractors, and just ensuring things work on the leavers process is a huge challenge. We are big supporters of coding in the open, but that does come with a bit of a risk that sensitive information will leak into GitHub repositories. We also have a lot of personal, if not sensitive, information running through the platform all the time, and some of it will be logged. We try to ensure that we don't log anything which contains personally identifiable information, but it does happen from time to time. And then, as you can see on the slide, there are also those people who are always portrayed as hoodie-wearing types with a fear of direct sunlight. Some of those will be looking to gain access to the platform and the services hosted on it, whether they are motivated by a dislike of the soft-drinks industry levy or financial gain, or perhaps because they are being paid by a foreign state.
And then that last point, vulnerable dependencies: it's something that's been a growing concern for a while now. There have been a large number of high-profile instances. It's important to be able to scan those dependencies, which is something that Gerald's going to talk about more in a moment. It's also really important, before you're even scanning them, to know what dependencies each service has. And that's a hugely valuable feature of the catalogue. We have recognised for some time that our success, our scale, has created a problem. Our platform security team was our first attempt to make security a first-class citizen on MDTP. We wanted to start looking at the security of those things that we had direct ownership of. And in a way, that was the easy part. With application security, there are other challenges. Services on the platform are regularly reviewed from a security perspective, but not as often as we make changes to them. As the platform teams, we decided that we could do more. And that's where the idea of a platform-based application security team came from. The first remit was really just to go and lift some rocks, pull on some threads, and see what problems there are. And then secondly, to investigate what we can do to fix those problems, and preferably do that at a platform level, so that we can protect all the services running on MDTP without requiring individual changes across thousands of repositories. And at this point, I'm going to hand over to Gerald. He's going to delve into some of the details about this stuff. Thanks, Ben. So I'd like to start by talking about the relationship between those service teams and the AppSec team. I've tried to borrow from the idea of team topologies. We've got service teams that are stream-aligned teams that build the actual thing, and we've got AppSec as an enabling team. Now, the first version of this slide looked a lot different. I had a hub-and-spoke design with AppSec at the centre.
And then I thought, no, that's sending completely the wrong message. The service teams shouldn't be marginalised when we're talking about security. They shouldn't be pushed out to the edge. Every security issue depends on context, and the only people who are going to know the context are the people who are working on the service themselves. So this is my idea of using this double ouroboros of the two snakes eating each other: the AppSec team can't function without the service teams, and the service teams can't do all the security themselves. Right at the beginning, I said that my title is AppSec Snooper. What that means is that because of the opinionated nature of MDTP, because we've got that paved road, it allows me to look at services and kind of go, is there anything wrong? That wouldn't be possible if we had 15 different languages and four different platforms. So the important thing is that as an AppSec team, we're looking at things in general, trying to find issues. And when we find them, it's not about blame; it's about collaborating and making things better. And the catalogue makes this easy, because if we find an issue in one of the thousand services, we can look up the Slack channel that allows us to talk to the team. And this collaboration works both ways. When a service team finds an issue themselves, they can feed it back to AppSec. They can ask about best practices; they can ask for advice on how to fix it. Now, for this collaboration to work, we need tools, and Ben has already mentioned the catalogue. I'd just like to take a moment to look through some of the AppSec-related tooling. The first that I'd like to mention is our leak detection tool, which essentially keeps tabs on all the GitHub commits. If it finds something which looks sensitive, like something that looks like a password or an AWS secret, it will send out Slack alerts.
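In outline, a leak detection tool like the one described is a set of patterns run over every commit, with an alert routed to the owning team's channel. The patterns and helper names below are a minimal sketch, not HMRC's actual rules:

```python
# Illustrative sketch of a commit leak detector. The two example
# patterns (AWS access key IDs and password assignments) and the
# alert format are assumptions for illustration only.
import re

PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "password_assignment": re.compile(r"(?i)password\s*[:=]\s*\S+"),
}


def scan_commit(diff: str) -> list:
    """Return the names of every pattern that matches the commit diff."""
    return [name for name, rx in PATTERNS.items() if rx.search(diff)]


def alert(channel: str, findings: list) -> str:
    """Format a Slack-style alert for the owning team's channel."""
    return f"[{channel}] possible leak: {', '.join(findings)}"


findings = scan_commit('aws_key = "AKIAABCDEFGHIJKLMNOP"\npassword = hunter2')
```

Real-world scanners need far more patterns, plus entropy checks and allow-lists to keep false positives manageable.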
But importantly, the security teams can look at the bigger picture rather than just individually having a huge list of alerts. The next tool is probably one of my favourites: the dependency explorer. This allows us to search through all of the dependencies, by which I mean libraries, of all of the services. Now, this screen is specifically why the Log4Shell vulnerability, which happened around Christmas 2021, was scary for us for only about 10 minutes. For anyone not aware, the Log4j vulnerability was that you could send a specifically crafted piece of text which is then logged, and that logging could trigger information leakage or remote code execution, which is very nasty in itself. But in the dependency explorer we typed in the specific dependency, and it showed that, oh, we're not actually using it in any of our services. Great. No problem. An additional tool that we have is something that we've called Bobby. It's used as part of a build, and essentially it fails a build if it's got a dependency in it that we don't like. Now, there could be multiple reasons why we don't like something. It might be that we've upgraded some functionality and we want to make sure that everyone uses the same library, or we've found a security issue and we don't want people to use it. Importantly, we also have this idea of deprecation warnings, so that failures don't come as a shock to teams. The problem with lots of these warnings is that we don't want to overwhelm teams, because at the scale that we're operating at, if lots of teams get lots of warnings, they start ignoring them. And the important thing about this tool as well is that it can be bypassed. So if there is a team that has a critical update that they need to make today, and they don't have the time to test all the dependency updates, we can get past it.
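The behaviour described for Bobby, fail on a disallowed dependency, warn before the rule is enforced, and allow a bypass for critical fixes, can be sketched roughly as below. The rule shape, field names, and the example dependency are hypothetical, not the real tool's format:

```python
# Illustrative sketch of a Bobby-style dependency rule check.
from __future__ import annotations
from dataclasses import dataclass
from datetime import date


@dataclass
class Rule:
    dependency: str
    bad_below: str       # versions below this are disallowed
    enforced_from: date  # before this date, only warn (deprecation notice)
    reason: str


def parse(version: str) -> tuple:
    return tuple(int(p) for p in version.split("."))


def check(dep: str, version: str, rules: list, today: date,
          bypass: bool = False) -> str:
    """Return 'ok', 'warn' (deprecation notice), or 'fail' (break the build)."""
    for rule in rules:
        if rule.dependency == dep and parse(version) < parse(rule.bad_below):
            if bypass:
                return "warn"  # critical fixes can get past the gate
            return "fail" if today >= rule.enforced_from else "warn"
    return "ok"


# Hypothetical rule: versions of "http-verbs" below 14.0.0 are banned
# from 1 January 2023.
rules = [Rule("http-verbs", "14.0.0", date(2023, 1, 1), "security issue")]
```

The warning window before `enforced_from` is what stops the eventual build failure from coming as a shock to teams.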
Now, next, I'd like to do a bit of a deeper dive into one of our latest tools. One of the issues when looking at vulnerabilities is that there are just so many of them. Each of the services, as I've mentioned before, has dependencies. Typically, they are open source libraries built by somebody else. Sometimes they're supported by large companies or large communities, and sometimes it's just Jeff from Nebraska who's doing it as a hobby. Now, in MDTP, all of the dependencies get pulled into Artifactory. Artifactory also hosts all the build artefacts, and they're scanned by Xray. The first issue with that scanning is the scale. When I first looked at the outputs of Xray, I had a report with 800,000 rows. Now, the scale, unfortunately, is not the main issue that we've had with Xray. I mean, look at this user interface. Maybe if you have a bank of 42-inch monitors, you could stretch the columns to be able to read it, but not with a laptop. And do I really want to look through 140,000 pages of the report? I don't think so. I think that's impossible. So just to go over the numbers of what we found, and again, these are very rough numbers: for a thousand microservices, we were looking at 100,000 reports covering 180 different dependencies and up to about 600 different CVEs. Now, CVE stands for Common Vulnerabilities and Exposures, and it's what's used in the security world to refer to any issue that has been uncovered. Again, I cannot overemphasise that the paved road here is very useful, because we only use one particular language, which means that the number of different vulnerabilities is restricted. I'd like to add a quick note on those vulnerabilities. Each vulnerability has a severity score, the CVSS score. And very often, I find that in the tooling that's out there, the severity score is used as a risk score. So you've got a severity between zero and 10, and people tend to use 10 as saying, oh, we need to fix this.
But it isn't a risk score. We actually went back to all the CVEs that we found, all 600 of them, and investigated them. Some of the worst issues didn't have the worst scores, and some of the worst scores just weren't an issue; it really depends on the context. You can have an issue that would be really bad if it were to happen, but because of the way that you use a library, it will never happen. And this information isn't reflected in the scores. So if there's one thing that you take away from this talk: please don't define policies like "anything less than eight can go through because it's okay, and anything above eight needs to be fixed". So we've already established that Xray's user interface is no good, but what it does have is a really good API. So what we did, in an agile fashion, was look at how we could take those vulnerabilities and summarise them. We came up with a couple of scripts that eventually, essentially, created a simple data pipeline. And for that pipeline, we actually just used Google spreadsheets. Spreadsheets are one of my favourite tools, because I think they're very underestimated in the kinds of things that they can do. So we went from 100,000 reports, aggregated them, used pivot tables, and we were able to draw first conclusions. However, that still wasn't really a great way of operating at scale. I can't expect 100 teams to all look at the same spreadsheet and all understand it. So what we did, in true agile fashion, was say, okay, let's plan out some way that we can visualise these things using the catalogue. I used this very simple three-by-three board to look at the different goals of what we want to get out of the vulnerability data and ask: what is a minimum viable product? What could we do in the next iteration? What could we do in the iteration after that? Really, really high-level planning.
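The aggregation step described here is essentially a group-by over the raw scanner rows, collapsing one row per service-and-CVE into one row per CVE, the way a pivot table would. The field names and data below are made up for illustration; the real pipeline was scripts feeding Google spreadsheets:

```python
# Illustrative sketch of summarising raw vulnerability reports.
from collections import defaultdict

# Hypothetical raw scanner output: one row per service/CVE pair.
reports = [
    {"service": "svc-a", "dependency": "lib-x", "cve": "CVE-2021-0001", "cvss": 9.8},
    {"service": "svc-b", "dependency": "lib-x", "cve": "CVE-2021-0001", "cvss": 9.8},
    {"service": "svc-a", "dependency": "lib-y", "cve": "CVE-2021-0002", "cvss": 5.3},
]


def summarise(reports: list) -> dict:
    """Collapse the rows into one entry per (dependency, CVE), listing
    the affected services. The 'assessment' field is left empty for the
    AppSec team's contextual judgement, since the CVSS severity alone
    is not a risk score."""
    summary = defaultdict(lambda: {"services": set(), "cvss": 0.0, "assessment": None})
    for r in reports:
        entry = summary[(r["dependency"], r["cve"])]
        entry["services"].add(r["service"])
        entry["cvss"] = r["cvss"]
    return dict(summary)


summary = summarise(reports)
```

The per-CVE summary is what makes it feasible to attach a single AppSec assessment that every affected service team can then see, rather than each team re-triaging the same CVE.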
This board was essentially as much as it took us to plan it, because we didn't know where it was going to go. So in less than a month, we had an MVP. And that's what it looked like. Instead of a fairly awful spreadsheet with columns that you can't read, we created something that can be used by the service teams to check how secure their services are, and importantly, it contains the assessment by the AppSec team. This means that service teams don't need to look at every single vulnerability themselves. They can see what needs to be investigated by them, based on the experience of the AppSec team. So the whole point is not to overload the teams. Now, I want to talk about the fact that application security is hard. It's very easy to leak sensitive data, even if it's just into your own logs. It's very easy to miss something, and that could have a serious impact. I'd like to give an example by looking at an own goal that we scored. During the development of the catalogue, we wanted to increase observability and started to record the HTTP request payloads. Separately, we also forced everyone to log in because we wanted to see who was using which part of the catalogue. And oops, we'd just logged everyone's passwords. Now, something bad happened, but it's not the bad thing happening that's important; it's how we reacted. We found it. We calmly fixed it. We identified the users that were affected. We rotated all the passwords, and then we had a post-mortem. We learned from it. Another thing that I'd like to point out: the vulnerabilities aren't the whole picture. To help keep services safe, we need to be proactive. Scanning for vulnerabilities is good, but we need to spot the problems in our own code as well.
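One simple way to spot such problems proactively is to scan every codebase for patterns that mark risky areas and record where they occur. The patterns, service names, and code snippets below are hypothetical, purely to show the shape of the idea:

```python
# Illustrative sketch of building a list of risky code locations by
# pattern-matching over service sources. Patterns are examples only.
import re

RISKY_PATTERNS = {
    "xml_parsing": re.compile(r"XML\.loadString|DocumentBuilderFactory"),
    "file_write": re.compile(r"new FileOutputStream|Files\.write"),
}


def build_ledger(sources: dict) -> list:
    """Return sorted (service, risk-category) pairs for every match,
    giving the AppSec team a list of places worth reviewing."""
    ledger = []
    for service, code in sources.items():
        for category, rx in RISKY_PATTERNS.items():
            if rx.search(code):
                ledger.append((service, category))
    return sorted(ledger)


# Hypothetical sources: svc-a parses XML, svc-b does not.
sources = {
    "svc-a": "val doc = XML.loadString(body)",
    "svc-b": 'logger.info("hello")',
}
ledger = build_ledger(sources)
```

A human can keep such a list in their head for 20 microservices; for a thousand, it has to be generated.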
And what we did here is we created a risk ledger, which is essentially just a set of spreadsheets that identify areas where risky code could be, for example, XML parsing. So we looked at all the code and developed a way of finding every place where XML is being processed, and then put it on the list. Now, this, again, is an issue of scale. It's possible for some developers to remember this for 20 microservices, but not when you've got a thousand. And I'd like to, again, go back to the really important point: what made this possible and feasible in MDTP is having that paved road, having that opinionated platform that allowed us to have a certain amount of centralisation. But it's also important to stress that exactly the same things benefit developing services quickly. Ben mentioned the COVID support services. Some of those were developed in four weeks. They went from a design on the back of an envelope to being deployed and used by millions of users in just four weeks. And it's important to state that no corners were cut. Because we have that application security, we can provide a platform that delivers speed and security. So I'm slowly coming to a close here. The conclusions for MDTP: again, the paved road is really important. The tooling is really important because it allows the centralised team to reach out to service teams, but it also allows the service teams to self-serve in looking at what vulnerabilities have been found. And it means that we spend our days doing better things than chasing down individuals. Having lots of really good engineers helps as well. There are just a couple more things that I'd like to mention. I was listening to Jeff's keynote this morning, and I was struck by the fact that he was talking about being radically transparent to increase the buy-in to agile methodologies. And I think security is agile. It must be agile, because you can't predict all the threats that are going to happen.
And the way to deal with security is to work together, to communicate, to collaborate. And by being open about it, that gets easier. Now, when I say open, it doesn't mean that you put every vulnerability that you have on Twitter. It doesn't mean that you give everyone your passwords. But to effectively secure a system at scale, it's really important to communicate effectively. And when you can be open, when you can be transparent, that makes it possible. Now, here's an example of what we try to do to encourage that openness. We've created the AppSec amnesty. It's a lighthearted attempt to encourage people to share the skeletons in their closets. Now, in closing, securing a complex system is hard. But you don't have to do everything all at once. You can start small. I would recommend you start collecting this threat intelligence: find out which service writes to files or talks to a particularly sensitive backend system, make a list, use a spreadsheet, and then take it from there. Aggregate the data, script it, automate it, scale it. In a word: be agile about it. And thank you very much. Okay. Thanks, Gerald and Ben. I think it was a very informative session.