All right, thank you, Amir. So yes, hi, I'm Matt Bichkovsky, and I'm a security engineer at Square. Today I'd like to talk about the process of implementing SPIFFE at Square. I'll touch on service identity at Square, why we decided to use SPIFFE, and why SPIRE. I'd like to briefly get into the migration process and how it looked for us, and finally the learnings we took away from the whole project.

Square might not be well known outside the US, but a lot of people recognize us as the little white reader for taking payments. Square has seen tremendous growth over the years. Like many other companies, we started from a monolith, and these days we're getting close to a thousand services running in production, across different regions around the world. Once we had more than one service, we needed to make service-to-service calls secure, and a long time ago someone decided to use mutual TLS.

In the beginning there were only a few apps: you have a monolith, you start building new services, maybe you break off a little bit of that functionality or build something entirely on the side. With just a few apps, you can manually issue client certificates and distribute them by hand to the hosts you have; your infrastructure isn't massive yet, so you can mostly manage it manually or with some scripts. But as we grew to more services and more hosts, that approach became brittle, and we needed a better way to manage sensitive material. So we built Keywhiz, a project we later open sourced. Keywhiz stores and distributes secrets to all the hosts.

Once we had that, we also had to look into authorization. At some point someone wrote authorization code and added it to a common framework, and that common framework, or library, would be used by the other applications and services at Square.
The way it was done was that we encoded the application name in part of the certificate. For that we specifically picked the organizational unit part of the subject field; everyone just calls it the OU at Square. The common framework would take the client certificate, parse it, extract the identity, and perform checks based on it. We kept iterating on that, and eventually the scripts that were still left to orchestrate certificate issuance, pushing to Keywhiz, and so on were consolidated into a more cohesive service with an API, sort of an orchestrator, if you like.

If you think about it, we had the CA, we had Keywhiz, we had some other services, and it started to look a little bit like what you would get with SPIRE as an identity-issuing system. So the question is: why would we want SPIFFE if we had something that seemed to be working well for us? The big place where our service identity system came up short is that it was tied to Keywhiz and to our on-prem architecture. We don't use Kubernetes; we have our own data centers, bare metal, and so on. If we wanted to take this, say, to the cloud or anywhere else, Keywhiz would have to come along, and we would have to rebuild all those different pieces in order to bring service identity with us. And one of the big developments of the past couple of years was the growing adoption of cloud services at Square. It wasn't just that we wanted to use the cloud; we had to, to keep up with our growth. When a situation like that comes up, it's an opportunity to really start questioning the status quo: is the system we've built really the best one?
Does it still serve us well these days, or is there something else we could do, looking to the future? Once we settled on SPIFFE, the next question was how to implement it. Of course we could roll our own, but we had already done that once and didn't want to build our service identity system all over again. So we looked at SPIRE. We started evaluating it probably around two years ago, so SPIRE was still very early on as a project, but it already had a lot of the features we cared about. One of the biggest things for us was its pluggable architecture: it came with lots of plugins, and for anything custom that we'd have to build, we could extend it. And of course Envoy support; it meshes really well with Envoy, which we are rolling out as a sidecar at Square, so that was another thing that would work out of the box.

Okay, let's have a look at the migration process. First we deployed SPIRE servers to our on-prem environment, and once we had that, we ran one SPIRE agent on each host, kind of like a DaemonSet situation. This was the stage where we could iron out small issues here and there; we did run into some problems, and some missing features that you don't even know you need until you get things up and running. A lot of this was early on, in the SPIRE 0.6/0.7 era, at least a year or a year and a half ago, and SPIRE has come a long way since; the features we might not have had then, we have now. Once we had it running in this dormant state, the next step was to start registering entries in SPIRE so we could begin issuing identities. For that we had to build a workload registrar, because that's really a custom thing, especially if you have a custom stack.
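A registration entry binds a workload's selectors to a SPIFFE ID. As a hedged illustration of what a registrar ultimately produces, here is the kind of entry you can create with the stock `spire-server` CLI; the IDs and the selector below are placeholders, and a custom registrar like the one described here would drive this through SPIRE's registration API rather than shelling out:

```
spire-server entry create \
    -parentID spiffe://example.org/my-agent \
    -spiffeID spiffe://example.org/payments \
    -selector unix:uid:1001
```

Any process attested by the agent `my-agent` and matching the `unix:uid:1001` selector would then be issued an SVID for `spiffe://example.org/payments`.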
The next thing we did was populate that registry with the actual applications, the actual identities we had in our system, slowly trickling them in until we got to 100%, with all of our production applications and services registered in SPIRE. At that point we had over a hundred thousand entries, and we could rotate certificates, play with configs, and see how SPIRE performs the rotations and everything else. None of those identities were used by production workloads yet, but we could already learn a great deal from it.

Once we had that, we wanted to make sure SPIRE could be hooked into Envoy, that Envoy could call out to the SPIRE agent. Since we have a custom control plane, we had to make some modifications to the config, including a change to the initial config that ships with the sidecar, but nothing too drastic, especially since the SDS API was already implemented by the SPIRE agent. Then we were getting close, but there was another catch: how do you make sure that all the services and applications in production can understand these new identities? As I mentioned before, our authorization logic, for historical reasons, lived in the common frameworks, in a library, in each supported language. So we had to go through Ruby and Java and figure out how to make those applications understand SPIFFE IDs; we had our internal way of looking at identities, and we had to wrangle the code a little to get some backwards compatibility in there.
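For reference, the Envoy-to-SPIRE wiring mentioned above, with the agent serving certificates over the SDS API, looks roughly like this in a v3 Envoy config. This is a trimmed sketch based on SPIRE's documented Envoy integration, not Square's actual config; the socket path, SPIFFE IDs, and cluster name are placeholders:

```yaml
# Static cluster pointing Envoy at the local SPIRE agent's SDS socket.
clusters:
  - name: spire_agent
    connect_timeout: 1s
    http2_protocol_options: {}
    load_assignment:
      cluster_name: spire_agent
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  pipe:
                    path: /run/spire/sockets/agent.sock

# In the listener, fetch the workload's SVID and trust bundle via SDS.
transport_socket:
  name: envoy.transport_sockets.tls
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext
    require_client_certificate: true
    common_tls_context:
      tls_certificate_sds_secret_configs:
        - name: "spiffe://example.org/payments"   # the workload's SVID
          sds_config:
            api_config_source:
              api_type: GRPC
              grpc_services:
                - envoy_grpc:
                    cluster_name: spire_agent
      validation_context_sds_secret_config:
        name: "spiffe://example.org"              # the trust bundle
        sds_config:
          api_config_source:
            api_type: GRPC
            grpc_services:
              - envoy_grpc:
                  cluster_name: spire_agent
```

With this in place, Envoy never touches key material on disk; the agent rotates SVIDs and pushes updates over the socket.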
I think that's especially important because we had ACLs, access control lists, in the code base of each application, and they referred to callers by application name; now you get a SPIFFE ID instead. So we had to figure out how to handle that without requiring lots of changes or making it hard on app owners, so that they could actually adopt SPIFFE. The final step, once all of that was done and updated, was to really start using SPIFFE IDs. For that we built feature flags into the Envoy control plane, which gave us a really fast way of switching to SPIFFE and back: any new connection to a particular Envoy sidecar could use SPIFFE certificates, and you could just flip the feature flag to go back to the homegrown certificates.

That was quite a lot to digest, probably, so here are some of the things we learned from this project. We planned for a solid year of work, but it ended up taking even longer for various reasons, a big one being that we have a lot of custom implementation; we don't use Kubernetes and that sort of thing, and that definitely slowed things down for us. We also started early on with SPIRE, again a year and a half or so back, so you're still trying to figure out how to map this new system onto something you know really well but that is very specific to you. So if anything, I'd say: manage your expectations, staff your project accordingly, and make sure you have people dedicated to it who won't get pulled away by other day-to-day things. The days might be long, but they're shorter when you get help.
On that note, I cannot say enough good things about the SPIFFE Slack: super helpful, lots of smart people, a great place to bounce ideas around and brainstorm with people who have done this before at other companies, so you're not alone. And with SPIRE being open source, and this is a great point about open source, you don't have to build everything yourself. You can chip in features, others can make them better, and you can share the burden of maintenance, because maintenance is a big part of keeping a product stable; it takes a lot of work after you actually implement the features. So having this robust community is a big, big plus for SPIRE.

Another thing we learned: you do want to learn early and often, but you want to keep your risk acceptable, and that's what I mean when I say take principled risks. You don't want to end up in a situation where you go to production with a small subset of your platform and then find out you can't scale to all your nodes, which, by the way, was not the case with SPIRE; it's been working really well for us in that sense. Similarly, you don't want to switch only low-throughput services over to SPIFFE and then find out later that it doesn't work for your high-throughput, highly critical applications. By taking this approach we did uncover a couple of things. For example, we ran into a database performance issue once we enabled it for all our workloads. We were running in shadow mode, but because we had telemetry and internal DB metrics, we could share them with the SPIRE core team, and we actually got help on that: they managed to cut the load at least in half, if not more, which was amazing. And it's been working well for us since.
But again, you don't want to just YOLO things; take risks, but pace yourself. And if you want adoption, and I think some people have already mentioned this, you need backwards compatibility; you can't just greenfield this and hope for the best. For product teams there are migrations happening all the time: infrastructure changes, someone brings in new logging, there's a new service mesh, and so on. Product teams just don't have time for all of that; they need to focus on developing features and don't want to be bogged down by yet another infrastructure migration. So you want to make the process smooth.

What that meant for us is that, for example, we added DNS names to the SPIFFE certificates, because our homegrown certificates used the DNS name for validation. If the SPIFFE certificates carry the same set of DNS names, you can start mixing certificates on a connection: a client still on a homegrown certificate talking to a server on a SPIFFE certificate, with a clean TLS handshake in between. That lets you opt into SPIFFE gradually rather than migrating in lockstep, which gets really complicated once you have lots of upstream and downstream dependencies; the actual mesh of your services can get really complex once you have a lot of them in production. And of course we had to share the trust bundles, so we did some work to set up a kind of fake federation between SPIRE and our homegrown system, as a way to spread SPIFFE IDs everywhere and, hopefully, start removing the homegrown identities as we go.
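The dual-SAN trick described above can be sketched in a few lines of Go: issue a certificate that carries both a SPIFFE URI SAN and the legacy DNS SAN, so a client that still validates by DNS name completes its check unchanged. This is a minimal, self-signed illustration; the trust domain and names are placeholders, not Square's actual values.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"net/url"
	"time"
)

// newDualSANCert self-signs a certificate carrying both a SPIFFE URI SAN and
// the legacy DNS SAN used by homegrown validation code.
func newDualSANCert(spiffeID, dnsName string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	id, err := url.Parse(spiffeID)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: dnsName},
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(24 * time.Hour),
		URIs:         []*url.URL{id},        // SPIFFE-aware peers read this
		DNSNames:     []string{dnsName},     // legacy peers validate this
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	cert, err := newDualSANCert("spiffe://example.org/payments", "payments.example.com")
	if err != nil {
		panic(err)
	}
	// A legacy client that only checks DNS names still accepts this cert...
	fmt.Println(cert.VerifyHostname("payments.example.com") == nil) // true
	// ...while a SPIFFE-aware peer reads the URI SAN.
	fmt.Println(cert.URIs[0].String()) // spiffe://example.org/payments
}
```

Either side of a connection can then be migrated independently, since both identity schemes are present in every certificate during the transition.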
When you build this backwards compatibility, you also want to define limits on it, because at some point you hit the law of diminishing returns: migrations have a long tail, and some apps will get decommissioned before you ever get to them, so maybe you shouldn't even bother building support for those. Cutting scope is true of any project. What it meant for us, for example, was this: our Envoy service mesh was taking off, with maybe 50 or 100 apps using it at some point. (When I say apps I mean services, but we keep calling them apps at Square.) So we said, okay, why don't we start with the Envoy service mesh first and ignore everything else? Any direct TLS connections between services we would just keep as-is and not break them. Hopefully not break them. So while we were working on our SPIRE deployment, going to production and going through the various motions, the Envoy service mesh had attractive features for engineers and they kept adopting it, so we got more and more apps that we could opt into SPIFFE without even looking at the long tail. Once the time comes we'll look into it, but until then we can ignore that problem and focus on the top priority.

Now, talking about how you expand SPIFFE: how do you convince folks that they should use this new and better system? Sadly, security is not necessarily the best selling point. We can tell people, do this because it's secure, and people may not upgrade; but ship some new emojis and all of a sudden you have a great adoption rate. So if you can bundle your security features with something else, I think that's really helpful.
Unfortunately, we missed the Envoy service mesh train. If we had been able to say from day one, this new service mesh comes with SPIFFE, that would have been great for us, but we missed that bundle, that release train, and then we had to do it ourselves. Which brings me to the next point: either you get others to do the migration for you, or you have to do it yourself. This goes back a little to bundling: if engineers have a good reason to upgrade, they will do it and the work gets done. But for that to happen you need documentation in place; you have to make it really easy for someone to do it themselves. We had one example of this when we were launching a Kubernetes cluster in the cloud. There were features people wanted, and people wanted to deploy to the cloud for various reasons, but they also had to call back to on-prem. The new Kubernetes cluster was using SPIFFE from day one, because it was the new, shiny thing on the latest tech stack. But you still had to somehow call back to the data center applications, and for that your app had to be SPIFFE compatible. Because people genuinely needed the cloud, they had no problem opting into SPIFFE. We designed the whole system around that: we built a CLI tool where you can check, hey, am I SPIFFE compatible? Yes, you are. Okay, here's a feature flag, add your app name, and off you go. And we've actually seen pretty good adoption of that, even though it's still early days for the cloud cluster.

So: infrastructure is in a constant, slow state of degradation, and it takes work to keep it flexible. What was true yesterday might not be true today.
Here are some of the things we've seen, and this probably keeps popping up in many systems. Our homegrown identities, the certificates, were exposed as files. They're available to your app, so maybe you run a new process in your pod or on your host and think, oh, I can use those certificates, great. Now they're being used outside the framework for things we don't even know about: the identity belongs to this app, but some side process is using it to call another app, and we may not even know that use case exists. Then, frameworks parse certificates. This happened early on: as I mentioned, the client certificate is exposed to the framework, to the library, and there's some parsing code someone wrote in Java. Then someone brings in a new language, and in some corner there's a need to do things in, say, Python, and people ask, how do I make this work with other services? They look at the Java code and think, I'll just copy this and roughly translate it into the other language. But there are small differences, and those differences get you into trouble. Then we introduced the Envoy service mesh, which started terminating TLS, so you have a plaintext connection over a Unix socket to the application; but because so much code still relies on the client certificate, now you're forwarding the client certificate to the app rather than just passing, say, a SPIFFE ID in a header. It takes time to clean all this up, and you have to consciously make the effort, otherwise you end up in that slow degradation.
People get used to the way things are, and you need to challenge those assumptions if you want to change anything. I already mentioned how folks who parse certificates later want to parse SPIFFE IDs, and some of the trust domains are no longer intuitive. We have those ACLs, and an ACL only has the application name. As you can hopefully see on the slide, it's either the plain application name, my-app, or this SPIFFE construct, this URI, where it's not obvious what the trust domains mean. That's confusing. Another place people have a mental model is, oh, we have staging and we have production: if I allow app A to call app B, that covers staging and production. But that's not really true once you have more trust domains, and with SPIFFE IDs we have more trust domains: staging and production on-prem; dev, staging, and production in the cloud; separate ones for Lambda; and the list keeps growing. It's not the same anymore. So you definitely need to educate and promote within the company; that's true of the project in general and in the broader tech community, but definitely true internally as well.

Then: short-lived certificates are great, until they're not. The nightmare scenario you want to avoid is all of your certificates expiring when you don't have much time to react. Compare that with how Square used to do it: our homegrown certificates had a lifetime of over a year, and we had different monitoring systems, so even if you tripped all those different tripwires, in the worst case you still had a few days to respond; you could say, okay, I'll get to it by Friday.
But if your SVIDs have a 24-hour TTL and you refresh them at half their lifetime, you may end up with as little as 12 hours to fix an issue. Of course there are different failure scenarios, and we actually ran some exercises on this. I should also say the 24-hour TTL is purely what Square chose, and it's definitely been beneficial for us in many ways. And SPIRE has been really, really good about the rotations; even a year ago we had essentially no issues with that logic, so kudos to the design and implementation, because it's been working really well.

I know I said I had ten lessons, but I couldn't resist turning it up to eleven: solving production identity problems can definitely be challenging at times, but it's much easier when you're facing them as a team. So many thanks to my team, Security Infrastructure, who made this project a breeze, and of course to the broader SPIFFE and SPIRE community. If any of this sounds interesting, you should come talk to me. Thank you, and I'm happy to answer any questions.

Matt, you glossed over an important yet often overlooked point around staffing appropriately. What's your take on the right size of crew for a surgical team to be effective around SPIRE? I imagine a good pizza-sized crew would be great.

I guess it depends on your needs. One thing we did at Square that was helpful was having three people working on this at a time, and that was really good. The problem was that we ended up in swim lanes: one person was looking at AWS cloud, I was looking at on-prem, and someone else was looking at AWS-native things; we were looking at building mTLS into Lambda. So we ended up very much separated.
In hindsight, I think it would have been better to build more focus, tackling one problem at a time rather than parallelizing, and to keep that shared knowledge, because of what happened later: my team is more like six or seven people, and we own other things besides our homegrown service identity, so later you have to go and educate the rest of the team, because they've been busy doing other stuff. You end up in a situation where you should definitely have involved more people. I'd say three people sounds about right to me, depending on the size of your project.

That's great guidance, thank you. A quick question from the chat; I think this has a straightforward answer. Your topology around multi-cluster and federation, correct me if I'm wrong, deals with a single trust domain for all the infrastructure? Is that correct? You did not need federation?

Right, so we're not using federation yet, and this is where it gets a little tricky; we want to build federation next year. The way we do it is that we have different data centers. Staging is one trust domain: the different data centers in staging form one trust domain. We have, say, three production on-prem data centers, and they're one trust domain. So we have separate clusters, but they're considered one trust domain; production is actually backed by one database, which is why they share the same trust bundle. And then we also have cloud, which is configured as one trust domain but with separate registrations for each cluster, so they're somewhat disconnected. The way we've solved this is that we essentially ended up sharing roots manually and putting them in a trust bundle, which we eventually want to move to federation. This was a short-term fix until we get to the point where we can build out full federation.
I don't know if that answers the question.

I believe it does. Thank you for a great presentation; that was a wealth of knowledge packed into one talk. You also won best plushies in your room setup. I'll have to change my virtual background to reciprocate.