So okay, let's go ahead and get started. This is a slightly interesting title for this particular presentation: we didn't really use Linkerd to schedule the tests, but Linkerd greatly facilitated our ability to schedule them, 68,000 COVID tests in a very short period of time, and we're going to talk you through how it helped us troubleshoot some problems and get over some humps.

So, introductions first. Dom, you want to say hi? Sure. I'm Dom De Pascuali, the DevOps Architect at Penn State University in the Department of Software Engineering. I do all things Kubernetes and pipelines and all that fun stuff. And I'm Sean Smith, the Director of Software Engineering, and we build software for Penn State University. Next, please.

A little bit of background. Last March, like everybody else, we were all affected by the COVID outbreak, and since we work in higher education, that meant we had to find a way to send all of our students home very quickly while trying to keep them engaged, and come up with a plan to bring them back safely in the fall. We had a bunch of vendors we were dealing with for testing, we had on-site testing, and we had no way to tie all of these things together. So we quickly built a system to pull all the pieces together, going all the way from testing to test resulting to contact tracing. Fortunately, we're built on top of a microservice infrastructure, and Dom has done a really great job of terraforming a lot of our actual back-end infrastructure, so we were able to turn things around very quickly.

We changed directions in the spring semester of 2021. The university decided that all students would have to be tested 72 hours before they could come back, and then again within 10 days of returning. For those of you who aren't familiar with Penn State, we're a pretty large institution, so for those returning to campus that equated to about 68,000 scheduled tests in a very short period of time. Next, please.

So here's what we were doing. We have a bunch of Commonwealth campuses plus the main campus, and we were sending out test requests, invitations for tests, a thousand at a time. And somebody who Dom was kind enough not to mention by name, so he called him "someone," but he knows who he is, Chris, wanted to see just how quickly we could push the system and what it could take. Again, this is something we didn't really plan for at large scale. We had to build it very quickly, so we weren't entirely sure how it was going to behave. Half of the infrastructure was on-premise and half was in the cloud, so we had to figure out what the heck was going to happen when we did this. Next slide.

So why Linkerd? We had tried other service meshes previously and found some challenges in their configuration, and we struggled with some of the tools we wanted. Then I was at KubeCon a couple of years back and went to a presentation on Linkerd, and I saw how easily it installed and how smoothly things went, and I immediately texted Dom. The person aforementioned on the previous slide was also in that presentation and also texted Dom. What we recognized is that with Linkerd we got a lot of capability with far less complexity.
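To give a sense of how low that barrier to entry was: a stock install really is just a couple of CLI commands. Here is a minimal sketch, assuming the standard Linkerd CLI is on your PATH; the namespace and deployment name are hypothetical stand-ins, not our real services:

```sh
# Sanity-check the cluster, then do a completely default install.
linkerd check --pre
linkerd install | kubectl apply -f -
linkerd check

# Mesh an existing workload (hypothetical name/namespace) by injecting
# the Linkerd proxy sidecar into its pod spec.
kubectl get deploy demo-service-x -n demo -o yaml \
  | linkerd inject - \
  | kubectl apply -f -
```

The demo that follows runs on exactly this kind of default install.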
Mutual TLS is great, I mean, who doesn't like security? Free retries are great too, because we got tired of building that into our code. But the real bang we got from Linkerd is the observability. The observability really gave us the opportunity to go in and visualize things at a whole new level.

Okay, let's get started with the demo. On my laptop I have two minikube clusters running, an east and a west cluster, and I'll show what that means in the next slide. They're running Linkerd 2.8, since at the time of the event we were on Linkerd 2.8 and I didn't want to change anything. The load test will be run via k6, and I'll be quickly stepping up to 200 virtual users to really drive load on my laptop; we're just doing simple GETs against one endpoint. I'll mention quickly that since we're running both clusters and the load test tool on my laptop, there will be some resource contention, and there's a good chance that some of the performance issues we see aren't induced by latent services at all, just system resource contention.

But the way the east and west clusters are laid out on my laptop is similar to what our environment looked like during the real production outage, well, partial outage. East represents what we had running in AWS, and west represents what was running on-prem at Penn State. The unhappy stick figure here has a browser launched from the invite, taking you to the scheduling system. The scheduling app in the browser calls the top-level back-end service, demo service X. Demo service X depends on the three on-prem services, demo services A, B, and C; so X depends on A, B, and C. You'll also notice that A and B depend on C, and C in turn depends on two simple httpbins and demo service D. Now, demo service C is actually standing in for our RBAC service, which is why everything depends on it. The things demo service C depends on, which we're simulating here as three random little services, would in reality have been authentication and authorization databases or services out of our control, out of our software engineering group's control.

So we're going to jump over now to some terminals so I can show you how this is all set up. First, there's nothing magic here, just a quick script to start the minikube clusters. I start an east and a west cluster, with port ranges specified so we don't overlap port ranges on my laptop, and then we just do a basic install of Linkerd on both of those clusters. You can see, in my current context, which is the west cluster, we have Linkerd installed in west, Linkerd installed and running in east, and we have the applications from the diagram: a simple service definition and a deployment definition, and the application is wired up via configuration, environment variables, to the three services running in the west cluster. The address ending in .4 is the IP for the west cluster's ingress on my laptop; .3 is for the east cluster. I won't spend too much time looking at all the definitions in the west cluster, since there are a lot, but demo service C, which back in the diagram is our RBAC stand-in, depends on D and the two simple httpbins, and here's that set up. Down here, the deployment definition for demo service D has one replica and an injected delay of 200 milliseconds. I also wrote another quick little script, again nothing fancy, just to make sure I apply the right configuration to the right cluster.
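Those scripts were genuinely tiny. Here's a sketch of roughly what they amount to; the profile names, NodePort ranges, and manifest paths are assumptions for illustration, not our exact files:

```sh
#!/usr/bin/env bash
set -euo pipefail

# Start two local clusters with non-overlapping NodePort ranges so their
# ingresses don't collide on one laptop (ranges here are made up).
minikube start -p west \
  --extra-config=apiserver.service-node-port-range=30000-31000
minikube start -p east \
  --extra-config=apiserver.service-node-port-range=31001-32000

# Basic, default Linkerd install on both clusters.
for ctx in west east; do
  linkerd --context "$ctx" install | kubectl --context "$ctx" apply -f -
  linkerd --context "$ctx" check
done

# Apply each cluster's manifests to the right context (paths hypothetical).
kubectl --context west apply -f manifests/west/   # demo services A, B, C, D
kubectl --context east apply -f manifests/east/   # demo service X
```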
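And since the load test comes up next: the k6 side can be sketched just as briefly. The target IP and stage timings here are assumptions; the real script just did simple GETs against demo service X and stepped up quickly to 200 virtual users:

```sh
# Minimal k6 script: simple GETs against the east cluster's ingress
# (this IP is a placeholder for the ".3" address mentioned above).
cat > loadtest.js <<'EOF'
import http from 'k6/http';

export default function () {
  http.get('http://192.168.99.3/');
}
EOF

# Ramp quickly to 200 virtual users, hold, then ramp back down
# (stage durations are made up).
k6 run --stage 1m:200 --stage 5m:200 --stage 30s:0 loadtest.js
```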
So I just run that, and it deploys the pods as needed. In the west cluster we have all of its components, and in east we have its components. Let me switch tabs one more time for a quick test here, just to make sure everything's still running. Yes: when I call the .3 address I'm calling demo service X, which then makes calls to those three services, and demo service C calls its three services; that's the way the traffic flows.

The load-generation script, like I mentioned earlier, is just going to ramp up quickly to 200 virtual users calling demo service X. I'll start that right now, and while it's launching, in another terminal over here I'll start my monitoring. I'm just doing quick, simple port-forwards to each cluster for the Linkerd web portal, the dashboard. In this browser I'll launch those: localhost 8080, and let me increase the font size, I don't want it to be too small, and localhost 8081. So here, in our deployments, we have demo services A, B, C, and D in the west cluster and demo service X in the east cluster. Already we're seeing p95 latencies of 28 seconds, so it's going bad already, and this is what it was like once we sent out a thousand invites and all of a sudden those thousand-or-so people decided to click the link and start scheduling their tests.

All right, what I'd like to do right now is show you what we really used to see what was going on. We were big on the Grafana dashboards that day, watching the performance of everything, so let me launch these two dashboards. Here we have demo service X, and we can see that we don't have a very high request-per-second rate, but our latency is just terrible. Over here in the success-rate panel we seem to be okay; in reality, when we had our problems, the success rate was not 100% the whole way across and the latency was terrible, so we had a mix, the best of both worlds as far as failure goes.

An interesting thing, and the reason we really wanted to share what we went through: we could see that demo service X's outbound traffic had high latency. That helped us troubleshoot, like, okay, we have this terrible outbound latency, but there's nothing down here in this dashboard telling us what we're connected to. That's because our dependencies were in another cluster, and we're connected to that other cluster via a simple ingress; it's not a Linkerd multi-cluster setup. Even then, a few of us in the Linkerd community were chatting, and currently, or at least the last time I checked, there wasn't a way to aggregate metrics across multiple clusters to render those other outbound deployment dependencies. In other words, going back to my diagram, to have the metrics of service X linked with the metrics of services A, B, and C in the separate cluster, all in one dashboard. That's something we would love to see in the future, and maybe we'll try to figure it out another day.

So this is what gave us our first indication that the problem must be happening on-prem: whatever we were talking to in our other cluster must be the problem. We jumped into the other cluster's Linkerd dashboard, just like here, where we have a separate Linkerd dashboard and separate Grafana dashboards to dig into, and we started poking around, looking at all the different services that our service X depended on, and we could see they were all failing miserably, latency really high.
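If you'd rather poke around from a terminal than from the dashboard, the same golden metrics are available through the Linkerd CLI. A sketch, assuming my demo's context names and the default namespace:

```sh
# Per-deployment success rate, request rate, and latency percentiles
# in the on-prem (west) cluster.
linkerd --context west stat deploy -n default

# Which meshed workloads are actually talking to each other.
linkerd --context west edges deploy -n default

# Live per-path traffic for the suspect service.
linkerd --context west top deploy/demo-service-c -n default
```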
Then of course we check demo service C, because it's our RBAC system, so we usually check it first, and we saw that it had terrible latency. Down here we would have seen the dependent services for service C: this service call was okay, that service call was okay, but it was really this one, to demo service D, that was the problem. And if we go to demo service D, we see that this thing has no outbound traffic and is super latent.

So of course the first thing we did is just scale that guy up. I'll have to restart my load test, but, wrong terminal, we're going to go to demo service D and scale him up, because maybe he's a single-threaded app and just needs some more replicas. We apply that to the west cluster, it's starting up right now, and we restart the load test, so it's going to ramp up pretty hard here. We'll watch demo service X to see what kind of picture we get. I'm going to change the refresh rate to 30 seconds on these, and apologies for the pop-ups, it seems like you can't actually stop everything. We're at 54 simulated users and it's going and going. We can see here that our latencies pegged at 50 seconds, but in reality, if we go back to the load test screen, well, it's not letting me scroll up right now, let me scroll up, we had failed requests coming from the load test tool, where we were just having timeouts, and those were probably exactly what the students were feeling when they were trying to schedule their tests.

Let's go back here and refresh one more time. We have a 20-second p90, or was that p95? Yes, p99 and p95 are both at 20 seconds right now, so, so far it's better. We can see over here that demo service D's latency is better, but 10 seconds is still not ideal whatsoever, since everything depends on service C, and service C depends on service D, which has the latency problem.

While this was happening, while we were trying to scale out components and staring at these graphs trying to understand what was going on, we had one, or maybe a few, of our other teammates looking at the code of service C to determine if there was any inefficient logic in the application. It turned out there was: we were checking demo service D on every single request that came from the user, all the way through, and it turns out we didn't need to do that check. I won't go into the details of why the check was there and why it's no longer important, but the moral of the story is that this helped us understand we had extra code in demo service C checking D for no good reason, and demo service D ended up being a single-threaded service that was never meant to handle this kind of load, and it was also out of our control.

Our load test is currently scaling down, I believe, yes it is, so let's see what our picture looks like from that last run. It still crept up to a 40-second response time; however, we didn't have any failures this time. So from a user's point of view, you waited 40 seconds, and that's totally unacceptable, but at least we didn't have any timeouts. So I'm going to make one more change: we modified the code to not depend on demo service D anymore, and I'll turn D back down to one replica, because we don't need it. Apply that, check the pods, and I apologize, I've been using aliases this whole time, so "kgpo" is kubectl get pods.
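For reference, the two mitigations from this part of the story boil down to a couple of commands; the replica count and namespace are arbitrary choices for the sketch:

```sh
# First attempt: scale out the suspected single-threaded service.
kubectl --context west scale deploy demo-service-d --replicas=4

# Watch whether p95/p99 latency actually improves.
linkerd --context west stat deploy -n default

# After the code fix removed the needless call to demo service D,
# the extra replicas buy nothing, so scale back down.
kubectl --context west scale deploy demo-service-d --replicas=1
```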
Once the redeployed pod is ready, there we go. All right, we're going to run this load test one last time while I tell one last little story about how this went. The way this happened in real life, there weren't breaks like there are with my load testing here; this was constant load, constant users, and a team of us frantically trying to figure out what was happening, with lots of stress and worry and all that kind of good stuff.

If we didn't have these pictures, these dashboards that Linkerd has pre-made for us, and this is all out of the can; as I showed a little bit ago, the installation of Linkerd was all default, no custom configuration at all. Without the metrics the Linkerd system gives us, we would have been in trouble for a much longer period of time. I'm sure we would have figured it out eventually by looking at metrics coming out of ingress logs or something like that, but because we had Linkerd and what it gives us out of the box, we were able to troubleshoot fairly quickly where the bottleneck was, because the latency graph and the outbound traffic latency told us, hey, this is upstream, or downstream, depending on how you tell your stories.

We'll see here that demo service D now has no traffic, because I turned off the call to it. Here's demo service C again, our RBAC service; its latency is super low now, because it's no longer dependent on that problem service. This is the same kind of experience we had that day: we got rid of that extra check, and all of a sudden everything just started zooming right along, and we were able to get through the rest of our invites to schedule the testing in a reasonable amount of time.

This load test is almost done, so why don't we wait and watch it complete. Here we go. This is demo service X, our top-level service that the user's browser connects to directly. We can see our latency is now much, much lower and in a much more reasonable range; in fact, our p99 is two seconds, which, compared to 60 seconds, is super good. I'm looking at my load test app here in the background and seeing that we're still scaling up to 200 users, so why don't we let it go the whole way before we end the demo part of the presentation; plus, it's always fun to see what happens if you let it run long enough, maybe my laptop will run out of resources. Now we're scaling down. Just to zoom in on this time frame, we can see that our p99 was the two seconds, and p95 was even lower, so this is much more acceptable as far as the real-time feel for the human trying to schedule their testing.

All right, let's move on. In summary: without the visibility Linkerd was giving us, we would still be, well, not literally, but we would have been troubleshooting that problem for hours, trying to dig down to the real performance bottleneck. And as I mentioned in the demo, if there were the ability to do multi-cluster performance metrics and visualization, that would have been even better and faster. Because what I did here in 15 or 20 minutes, in reality we spent a long time just figuring out where to dig, and I just kind of zoomed through the solution in the demo. With that, I'd like to thank everybody for watching. Thanks, folks.