Announcer: theCUBE presents KubeCon + CloudNativeCon Europe 2022, brought to you by Red Hat, the Cloud Native Computing Foundation, and its ecosystem partners.

Keith: Welcome to Valencia, Spain, and KubeCon + CloudNativeCon Europe 2022. I'm Keith Townsend, with my co-host, Enrico Signoretti, Senior IT Analyst at GigaOm.

Enrico: Exactly.

Keith: 7,500 people, I'm told, Enrico. What's the flavor of the show so far?

Enrico: It's a fantastic mood. I find a lot of people wanting to talk about what they're doing with Kubernetes and to share their stories, some real-world stories that are a bit tough, and this is where you learn, actually. We had a lot of Zoom calls and webinars and such, but it's when you talk to people, "oh, I did it this way, and it didn't work out very well," and you start a conversation like that, that it's really different from learning over Zoom, where everybody talks about the things that worked well, that they did right. No, it's here that you learn from other experiences.

Keith: So we're talking to amazing people all week about those experiences here on theCUBE. Fresh on theCUBE for the first time: Chris Voss, Senior Software Engineer at Microsoft Xbox. Chris, welcome to theCUBE.

Chris: Thank you so much for having me.

Keith: So, first off, give us a high-level picture of the environment that you're running at Microsoft.

Chris: Yeah, we've got 20, well, probably close to 30 clusters at this point, around the globe, with roughly 700 to 1,000 pods per cluster. So, about 22,000 pods total. It's a pretty sizable footprint, and we've been running on Kubernetes since 2018, well, actually, it might be 2017.

Keith: With all of that, let's talk about the basics, which is security across multiple, I'm assuming, containers, microservices, et cetera. Why did you and the team settle on Linkerd?

Chris: Previously, we had our own kind of solution for managing TLS certs and things like that, and we found it to be pretty painful pretty quickly. We knew we wanted something that was a little more abstracted away from the developers, something that allowed us to move quickly, so we began investigating solutions. A few of our colleagues went to KubeCon + CloudNativeCon in San Diego in 2019, and they just sponged it all up. Funny enough, my old manager was one of the people who was there, and he went to the Linkerd booth, where they had a thing going: "hey, get set up with mTLS in five minutes." He thought, this is something we want to do, so why not check this out? And he was able to do it, and that put it on our radar. We investigated several others, and Linkerd just perfectly fit exactly what we needed.
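For readers curious what that five-minute mTLS demo looks like in practice, here is a rough sketch using the stock Linkerd CLI. This is an illustration, not Xbox's actual setup; `my-app` is a placeholder deployment name, and exact flags vary a bit between Linkerd versions.

```
# Install the Linkerd CLI (official install script)
curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install | sh

# Verify the cluster is ready, then install the control plane
# (newer releases also require `linkerd install --crds` first)
linkerd check --pre
linkerd install | kubectl apply -f -
linkerd check

# Mesh an existing workload: inject the sidecar proxy, which
# transparently handles mTLS between meshed pods
kubectl get deploy my-app -o yaml | linkerd inject - | kubectl apply -f -

# Confirm that connections between meshed workloads are secured
linkerd viz install | kubectl apply -f -
linkerd viz edges deployment
```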
Enrico: So in the end, we're talking about security at scale, how you manage security at scale, and also flexibility, right? You told us about the five minutes to start using it, but again, we're talking about real-world stories. So what kind of challenges did you find at the beginning, when you started adopting this technology?

Chris: The biggest ones were around getting up and running with a new service, especially in the beginning. We were adding a new service almost every day, it felt like, and basically it took someone going through a whole bunch of different repos, getting approvals from everyone to get the certs minted, all that fun stuff, and getting them put into the right environments and the right clusters to make sure that everybody was talking appropriately. The amount of work that took alone was a huge headache and a huge barrier to entry for us to quickly grow the number of services we have.

Keith: I'm trying to wrap my head around the scale of the challenge. When I think about certificate management, I have to do it on a small scale, and every now and again, when a certificate expires, it's just a troubleshooting pain.

Chris: Yes.

Keith: And it's not just certificates across 22,000 pods, it's certificates across 22,000 pods in multiple applications. How were you doing that before Linkerd? What were the pain points? What happened when a certificate failed, or expired and wasn't updated?

Chris: To be completely honest, the biggest thing was that we were unable to make calls out or in, depending on what was failing. We would see an uptick in failures around a certain service, and pretty quickly we got used to the fact that it was probably a cert expiration issue. We tried a few things to make that a little more automated, but we never came to a solution that didn't require every engineer on the team to know quite a bit about this just to get into it, which was a huge issue.

Keith: So talk about day two, after you'd deployed Linkerd. How did this relieve your software engineers, and what were the benefits of having this automated way of managing certs?

Chris: The biggest thing is that there is no touch from developers. There are a lot of people on our team who are familiar with security and certs and all of that stuff, but no one has to know it; it's not a requirement. For instance, I knew nothing about it when I joined the team, and even when I was setting up our newer clusters I knew very little about it, and I was still able to really quickly set up Linkerd, which was really nice. We've essentially been able to set it and not think about it too much. Obviously there are parts you have to think about, and we monitor it and all that fun stuff, but it's been pretty painless almost from day one. It took a long time for developers to trust it, though. Any time there was a failure, it was, "oh, could this be Linkerd?" But after a while, we no longer had that immediate assumption, because people had built up that trust.
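As a minimal sketch of what "no touch from developers" can look like, assuming the standard Linkerd proxy-injection flow (an illustration, not Xbox's actual configuration; `my-namespace` is a placeholder): once a namespace is annotated, every pod created in it gets the sidecar automatically, and the proxies obtain and rotate their own mTLS certificates with no application changes.

```
# Opt a whole namespace into automatic proxy injection
kubectl annotate namespace my-namespace linkerd.io/inject=enabled

# Recreate the pods so the injector webhook can add the sidecar
kubectl rollout restart deployment -n my-namespace

# Verify the proxies are healthy and identities were issued
linkerd check --proxy -n my-namespace
```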
Enrico: But you also have this massive infrastructure, I mean, 30 clusters. I guess it's quite different to manage a single cluster and 30. So what considerations do you have to make to install this software on 30 different clusters, manage different versions, probably, et cetera, et cetera?

Chris: I guess just to clarify, are you asking specifically about Linkerd, or more in general?

Enrico: You can take the question in two ways. So yes, Linkerd in particular, but the 30 clusters are also quite interesting.

Chris: More generally, for how we manage our clusters, we have a CLI tool that we use to change context very quickly and communicate with whatever cluster we're trying to connect to, whether we're debugging or getting logs or whatever. And then with Linkerd, it's nice because we aren't having to worry about how a cert is being inserted into the right node, or not the right node but the right cluster, and things like that. When we spin up our clusters, we get the root certificate and everything like that packaged up and passed along to Linkerd on installation, and then there's not much we have to do after that.

Keith: So talk to me about your upcoming session here at KubeCon. What are the high-level talking points? What will attendees learn?

Chris: Yeah, it's a journey. Those are the sorts of talks that I find useful, having not been, you know, a deep Kubernetes expert with decades of experience.

Enrico: I think nobody is.

Chris: That's true. That's another story.

Keith: That's a job posting: decades of Kubernetes experience required.

Chris: Of course, yeah. So it's a journey. It's really just: what made us decide on a service mesh in the first place? What made us choose Linkerd? What are the ways in which we use Linkerd, including some of the extra plugins and things like that? And then finally, a little bit about what we're going to do in the future.

Keith: Let's talk about the future, not necessarily two or three days or two or three years from now, but the period after you'd solved the immediate low-level problems with Linkerd. What were some of the surprises? Because Linkerd, and service meshes in general, have side benefits. Did you experience any of those?

Chris: It's funny. Writing the blog post, I hadn't really looked at a lot of the data in years, since we did our investigations, and we had seen very low latency and low CPU utilization. Looking at some of that, I found we were actually saving time off of requests, and I couldn't really think of why that was, so I was talking with someone else about it. Unfortunately, all that source data is gone now, so I can't go back and verify this, but it makes sense: there's the availability-zone routing that Linkerd supports, and I think that's what's doing it. Essentially, if a node is closer to another node, traffic is routed to those instances, so when one service is talking to another and maybe they're on the same node, it short-circuits that and lets us gain some time. It's not huge, but it adds up after 10, 20 calls down the line.

Enrico: In general, you're saying that it smooths operations and simplifies your life.

Chris: And again, we didn't have to really do anything for that. It handled it for us.

Enrico: It was there, yeah.

Chris: Yeah, exactly.

Keith: So we know one thing: when I do it on my laptop, it works fine. When I do it across 22,000 pods, that's a different experience. What were some of the lessons learned coming out of KubeCon 2019 in San Diego? I was there; I wish I would have run to the microphone. What were some of the hard lessons learned scaling Linkerd across those 22,000 pods?

Chris: The first one, and this seems pretty obvious but was just not something I knew about, was the high-availability mode of Linkerd. Obviously it makes sense that you would want that in a large-scale environment, but it's one of the big lessons we didn't know right away. One of the mistakes we made in one of our pre-production clusters was not turning it on, and we were kind of surprised: all of these pods were spinning up, but they were having issues actually getting injected and things like that. And we found, okay, you need to actually give it some more resources. But it's still very lightweight; even in high-availability mode, it's just a few instances.
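The high-availability mode Chris describes is a standard install-time option in Linkerd; the sketch below assumes a stock CLI rather than Xbox's actual rollout. With `--ha`, the control-plane components (including the proxy injector) run as multiple replicas with production resource requests and anti-affinity, so injection doesn't hinge on a single pod.

```
# Install the control plane in high-availability mode
linkerd install --ha | kubectl apply -f -

# Confirm the control-plane deployments are scaled out
kubectl -n linkerd get deploy
```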
Keith: So even from a binary perspective, running Linkerd, how much overhead is it?

Chris: That is a great question. I don't remember the numbers off the top of my head, but it's very lightweight. We evaluated a few different service meshes, and it was the lightest weight we encountered at that point.

Keith: And from a resourcing perspective, is it a team of Linkerd people? Is it a couple of people?

Chris: To be completely honest, for a long time it was one person, Abraham, who is actually the person who proposed this talk. He couldn't make it to Valencia, but he did probably 95% of the work to get it into production, and this was before we even had a team dedicated to our infrastructure. Now we have a dedicated team, and we're all kind of Linkerd folks, if not Linkerd experts; we can at least troubleshoot it, basically. So it's a group of six people on our team, plus various people who've had experience with it.

Keith: But not dedicated just to that.

Chris: No, not dedicated just to it. It's pretty light touch once it's up and running. It took a very long time for us to really understand it, not to get started, but to get to where we really felt comfortable letting it go in production. But once it was there, it's very, very light touch.

Keith: Well, I really appreciate you stopping by, Chris. It's been an amazing conversation to hear how Microsoft is using an open source project at scale.

Enrico: Exactly, at scale. Just a few years ago, if you had heard Microsoft and open source together, it was, "oh, that's just, you know..." But Microsoft has changed a lot in the last few years. Now they are huge contributors, and if you go to Azure, it's full of open source stuff.

Keith: Everywhere, yeah. Wow, KubeCon 2022, how the world has changed in so many ways. From Valencia, Spain, I'm Keith Townsend, along with Enrico Signoretti. You're watching theCUBE, the leader in high-tech coverage.