Hi everyone, welcome to the CNCF End User Lounge, where we explore how cloud native technologies are adopted by end user organizations across different industries and sectors. The CNCF end user community is made up of more than 160 vendor-neutral companies that use open source software to deliver their products. I am Abubakar Siddiq, I'm a CNCF ambassador, and today with me I have Dom DePasquale as a guest speaker. In these live streams we bring in end user members to showcase how their organizations use the cloud native ecosystem to build and deliver their services and products. Join us every fourth Thursday at 9 a.m. PT.

This is an official live stream of the CNCF and as such is subject to the CNCF code of conduct. Please do not add anything to the chat or questions that would be in violation of that code of conduct. Basically, please be respectful to all of your fellow participants and presenters. If you have any questions for us, we will be monitoring them in the chat throughout this stream, so make sure to ask your questions in the live stream chat.

This week we have Dom DePasquale here with us to talk about how Linkerd enabled the software engineering team at Penn State University to quickly troubleshoot performance issues when sending 68,000 COVID test invitations to students, faculty, and staff members earlier this year. Before we dive into the questions, Dom, could you please introduce yourself and the organization you work for?

Sure. So like you mentioned, I work for Penn State University. I'm the DevOps architect for the software engineering department within the university, so part of central IT.

Yeah, awesome. So can you tell us more about the infrastructure setup at Penn State?

How about if I just describe the one that I work with, because the university is very large. Sure, so we utilize Kubernetes on-prem on VMware and Kubernetes in AWS via EKS.
And other than that, we're big on Postgres, ActiveMQ, RDS from AWS, S3, SQS, all that kind of good stuff.

Oh, awesome. So when did you start your cloud native journey, and why?

It was a couple of years ago, fewer than five, but I don't remember exactly when. As for the cloud native journey, we wanted to get into containerization, and at the time I'm not even sure we were thinking about Kubernetes yet, but we were getting close. The idea of breaking our monolith out into microservices is what we wanted to do. Eventually we fell into Kubernetes, and this was early on, when we didn't have a lot of great packages for deploying it, and managed Kubernetes systems like EKS, AKS, and GKE weren't available yet. So we were young in the process when it first came out, and yeah, we wanted to get there because we wanted to build that microservice architecture out.

Oh, awesome. Now, can you explain more about some of the technologies and companies that make up your stack?

Sure. So like I mentioned, Kubernetes; we dockerize everything now, where I believe our apps are 100% containerized at this point. And I mentioned ActiveMQ and Postgres, but we also use a lot of the tools that are kind of native to that Kubernetes landscape, which include Prometheus and Alertmanager, the Cluster Autoscaler for EKS, and other Kubernetes-native things. I use Prometheus, I use an operator for Jaeger, an operator for Elasticsearch, and a whole bunch of good things there.

Yeah, awesome. Did you get a chance to attend the KubeCon events? We had two this year, the US and Europe editions — which ones were you able to attend, in person or virtually?

So technically I attended ServiceMeshCon Europe in the spring, because I presented this topic, this COVID topic, there. But I didn't actually attend any of the cons synchronously.
I ended up just cherry-picking videos after they get posted on YouTube, and I usually focus on security-related things, like the best practices of not just RBAC but pod security. So security contexts, and getting into admission controllers with OPA and Gatekeeper — things like that are currently my interest. The other big interest that I'll focus on in videos is CVE scanning of live pods, or at least the pod definitions, so you can get a good idea of where you stand, especially when it relates to things like Log4j, maybe.

Yeah, yeah, that's a very popular discussion now. Almost everyone is busy trying to figure it out.

Yeah, that's all we've been doing.

So going back to your work at Penn State during COVID: when COVID hit in 2020, your team was asked to figure out how to bring back students and staff safely. How did cloud native technologies help you achieve that?

Well, luckily enough, we had all the infrastructure as code in place. Using Terraform — we Terraformed this entire environment — and utilizing EKS, we could build that platform within a matter of hours or days. So we had Kubernetes and everything we needed up and running in no time at all. Because we have that cloud native platform, that Kubernetes platform and everything that goes around with it, we were able to deploy apps very quickly. I mean, the only thing that held us up was changing requirements from the health staff at the university as we decided what needed to be done to be able to test and schedule students and faculty and staff.

Yeah, it's awesome to be able to achieve that. And in your recent ServiceMeshCon talk, you said Linkerd played a special role in all of that?

Well, we chose Linkerd a little while ago and it's just our default now. There are only exceptions to using Linkerd; otherwise it's there. So when we built this new COVID environment, one of the first things that got installed was Linkerd.
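As a sketch of what "Linkerd installed by default" looks like in practice — this is the generic Linkerd pattern, not Penn State's actual manifests, and the namespace name is hypothetical — annotating a namespace is enough to have every pod deployed into it picked up by the mesh:

```yaml
# Generic Linkerd pattern (illustrative namespace name): any workload
# created in this namespace gets the Linkerd proxy sidecar injected
# automatically by the mesh's admission webhook.
apiVersion: v1
kind: Namespace
metadata:
  name: covid-apps
  annotations:
    linkerd.io/inject: enabled
```

With the proxy in place, those workloads pick up mTLS, retries, and the golden-signal metrics behind Linkerd's dashboards without any application changes.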
So automatically all the new apps were getting automatic retries, mTLS, all that kind of good stuff. But the observability was probably the big win, and we could talk about that some more too.

Yeah, awesome. So I guess you've spoken about why you chose Linkerd — what options did you consider while you were exploring it?

Originally we were on Istio, and this was, I can't remember how many years ago, three or four years ago. My team went to KubeCon in the fall, I forget what year it was, and they got to see the presentation by William from Buoyant, and they fell in love with the idea that it was everything you needed out of a service mesh without the complexity. And that meant a lot to us. So when they came back — I think that was the fall — that January we started ripping out Istio and replacing it with Linkerd, and we've been happy with it since.

Yeah. And recently Linkerd graduated. Has that meant anything for you?

It helps sell its importance to the bureaucrats of the university. It lets them know that this is not just an open source project that we're pulling in like a library — this is a big deal to many, many organizations. So it was a feather in the cap for them, them being Linkerd, but it was important for us as users too.

Okay. So are there any other CNCF projects that you explored, and what was your experience with them?

Sure. Like I mentioned, we use Prometheus and Alertmanager, and we're starting to look at KubeMQ — I think that's possibly part of the CNCF landscape. But the way we pick tools and projects is we go to the landscape first to decide whether there is a project or product out there that solves our problem, before we go either building it ourselves or just scouring the internet for something to help us out. So we start with the CNCF landscape every time. Okay.
So I can imagine the internal usage growth that using some of these technologies brings comes with some challenges. How did you handle cluster growth and adoption of these technologies?

Internal usage — I'm trying to think of how to tell a story around that. I don't think I have a good answer now; I was thinking about it. But for cluster growth and adoption of these tools, the challenge is just training of staff. That's when it's internal. So when we bring in new products — even if it's Prometheus, and it's kind of already, I don't want to say built into Kubernetes, but it's there, everybody knows it's going to be there — having folks know that they can get metrics about their deployments and things like that is a learning curve. It's just yet another tool to learn when they were used to coming from older systems, or maybe no metrics available at all. So it's really staff training with all the new tools, and as our team picks new projects or tools, we have to make sure the entire staff is on board so they know how to use them too.

Yeah, awesome. Now, do you rely on multi-tenant cluster deployments to distribute your workloads, and what challenges does that bring if you do?

Well, when we say multi-tenant, do we mean multiple customers in the same cluster type of thing? Yeah. All right, so we are currently alone in our clusters. Okay. And we split our clusters up by whether it's production or not production, and then we might have some namespace splits after that based off of project or whatever. But in the future we are going to a multi-tenant deployment, and we are going to be worried about things like sharing resources and network policies and RBAC security, all that kind of good stuff, to share the resources as well as we can and also keep projects secure from each other, even though that sounds mean. Yeah.
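The tenant isolation Dom anticipates can be sketched with a standard Kubernetes NetworkPolicy — the namespace and policy names here are hypothetical:

```yaml
# Restrict ingress for every pod in tenant-a to traffic from pods in
# the same namespace; cross-tenant traffic is dropped (assuming a CNI
# plugin that enforces NetworkPolicy).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: tenant-a
spec:
  podSelector: {}          # selects all pods in tenant-a
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}  # only pods within tenant-a
```

A default-deny policy like this per tenant namespace, plus namespaced RBAC roles and resource quotas, is one common starting point for the concerns he lists.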
So how do you manage cluster automation then, when it comes to things like upgrades, patching, testing, rollout of new features, and so on?

Yeah, so everything in EKS is done with Terraform — that's the quick answer. Even the control plane version upgrades: I can just Terraform the change there and wait the 45 minutes or whatever it is for the upgrade, and it's all versioned and taken care of there. As far as testing those changes, like I mentioned, we have a non-production cluster that gets all the changes first. On prem, we use Kubespray currently to manage cluster upgrades, so there's more human involvement there than just a terraform apply and kicking your feet up. But eventually we'd like to do more EKS and less Kubespray. OK. Yeah, that's the gist. And it's the same cluster setup on prem, where there's non-prod and prod. OK.

Now, where is Penn State today, and what's next in terms of cloud native technologies?

So as I mentioned, we have a big presence on prem on top of VMware for our clusters, and our plan for 2022 is to migrate to EKS for all of our Kubernetes clusters. At least, we haven't found a situation where we have to stay on prem yet — there's always that possibility, but we haven't found it. So more managed services, still Kubernetes. And we're bringing more groups within the university on board. For example, right now the university libraries are using Kubernetes, so we're going to try to work with them to come up with a common platform that we all share.

Awesome. You mentioned training developers as one of the main challenges that you encountered. Can you tell us more about what role developer experience has played in the evolution of your clusters?

I think, in the evolution of the clusters, it's beginning to realize that when you make things so complicated that people can't get their job done, maybe things are too locked down or too stiff to use.
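The "Terraform the change and wait" upgrade flow described above can be sketched roughly like this — the module source is the community EKS module, and the names and sizes are assumptions for illustration, not Penn State's actual configuration:

```hcl
# Illustrative sketch using the community terraform-aws-modules/eks
# module; cluster name, version, and node sizes are assumptions.
module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "~> 18.0"

  cluster_name    = "nonprod"   # hypothetical; changes land here first
  cluster_version = "1.21"      # bump this and `terraform apply` to
                                # trigger the ~45-minute control plane upgrade
  vpc_id          = var.vpc_id
  subnet_ids      = var.private_subnet_ids

  eks_managed_node_groups = {
    default = {
      min_size       = 2
      max_size       = 6
      instance_types = ["m5.large"]
    }
  }
}
```

Because the desired version lives in versioned code, the same change can be applied to the non-production cluster first and promoted to production later, which matches the testing flow Dom describes.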
And what we did to begin with, we really tried to train everybody to use kubectl for all their work. So if they wanted to see the status of their pods, or see their logs, we expected them to do everything via the command line. As we matured, we created a lot of Grafana dashboards so you can see the status of your deployments that way, and we shared Splunk searches so you can just use Splunk for your log searches. As far as the evolution of the clusters goes, I think we learned that a lot of people still prefer not to use the command line for everything. I've been using Linux for dozens of years now, so I want to use nothing but the command line. And I just have to remember that when we design the interaction with Kubernetes clusters, it's not always going to be that way for everybody.

Yeah, that's good. Like you mentioned, you've been creating instrumentation to make it easier for your team to interact with the clusters. You've already said a lot of them shy away from the CLI — what other ways have you provided to enable them to easily interact with the clusters? And what's the typical life cycle of application deployment, maintenance, and troubleshooting for your team?

All right. To actually make changes to the cluster, we follow strict GitOps methodologies in our department, and we use Flux for that. I'll just tell the story of how you get an application out to production — I think that'd be easier. So I make changes to a program on my laptop and it's good to go, and I've tested it in minikube, because we wrote a tool that will automatically compile, dockerize, and deploy into minikube so you can test it locally. From there, the git push into our GitLab repositories will trigger pipelines to do that same work again.
The pipeline also does a bunch of CVE scans of the docker container, runs SonarQube against the code, all that kind of good stuff. Eventually it creates a docker image in a docker registry. From there, we have Flux watching those registries for updates, so we do automatic deployments of non-production services. Hopefully at that point, with everything being good, the non-production deployments are automatically updated and running, and either a developer or an external user can start testing that within a reasonable number of minutes. For production, we do merge-request-only changes. So if somebody wants to release version 1.1.2 of a service, they would go in and update that YAML and create a merge request for a team leader or a manager to approve, and that would get it out to production. And we do that all day long.

Yeah. You mentioned checking the images and YAML files for CVEs. What tools do you use — something like KICS, or which other tools?

We're currently using Trivy to scan our docker containers. We use it both for library scanning and OS scanning, and I believe we use it for a little bit of file system scanning when we don't dockerize — some of our libraries don't get dockerized, so it gets used there as well. OK, awesome. And we just started using Starboard by Aqua Security this week. I bet you'll never guess why.

Awesome. So what has your experience been as an end user in the CNCF community? There are a lot of vendors and other things out there, and sometimes it can feel like drowning to be an end user.

I think it's been great, actually. At the time, I was at the Seattle KubeCon, and I really enjoyed my time there meeting vendors, and having the opportunity to go to maintainer sessions is always wonderful. And of course, just going to regular sessions where you can learn about something new — like when our folks went to the Linkerd session, that's how I learned about that. It's all been very positive. I also really enjoy the landscape.
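The "Flux watching those registries" step can be sketched with Flux v2's image automation objects — the image name and semver policy here are assumptions for illustration:

```yaml
# Flux polls the registry; when a tag matching the policy appears, the
# image automation controller can commit the new tag back to Git, which
# is how non-production deployments update automatically.
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImageRepository
metadata:
  name: scheduler
spec:
  image: registry.example.edu/covid/scheduler   # hypothetical image
  interval: 1m0s
---
apiVersion: image.toolkit.fluxcd.io/v1beta1
kind: ImagePolicy
metadata:
  name: scheduler
spec:
  imageRepositoryRef:
    name: scheduler
  policy:
    semver:
      range: ">=1.0.0"
```

For production, the same mechanism stops at the Git commit: a human-approved merge request bumping the version in YAML is what actually rolls the release out, matching the flow Dom describes.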
It could be overwhelming, but the CNCF landscape map is super helpful. I don't know if that's everybody's experience, but to have that one-stop shop for trying to find a tool to help solve a problem — that's really convenient.

Yeah, it is. Now, in your organization — I think you've probably touched on this already — regarding the future of cloud native at your organization, are there any technologies or projects that you are interested in using in the future?

Yeah, I think everybody's answer for that is different, even within my department, so this is my two cents. It's again around security and vulnerability scanning. So we're going to be looking more at that — I know Gatekeeper is on our list of things to implement, and like I said, I just started using Starboard from Aqua Security this week. I tend to keep going down those paths of security-related tooling for cloud native. Now, I know there are also parts of my team that are looking at switching from ActiveMQ to more of a cloud native messaging platform as well. I think I mentioned KubeMQ, but there are others out there too.

Yeah, awesome. And the way almost every part of the internet now needs patches and updates to fix some vulnerability, security becomes much, much more of a concern.

Yeah, yeah. So we rewrote our pipelines to make sure we're scanning specifically for these Log4j vulnerabilities and blocking the pipeline until they get fixed — important things like that. Plus, the Starboard project really helped with the reporting of which pods are currently vulnerable. So it was important to approach it from both ends.

Yeah, OK. Awesome. We don't have any questions yet. Viewers, if you have any questions, you can drop them in the chat. Before that, is there anything else that you want to dive deeper into or expand on that we've not covered yet?
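The pipeline gate Dom describes — scan and block until the vulnerability is gone — might look something like this in GitLab CI with Trivy (the job name and image variables are illustrative):

```yaml
# Fail the pipeline if Trivy finds CRITICAL vulnerabilities in the
# built image (Log4Shell, CVE-2021-44228, would surface here).
container_scan:
  stage: test
  image:
    name: aquasec/trivy:latest
    entrypoint: [""]
  script:
    - trivy image --exit-code 1 --severity CRITICAL "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"
```

The `--exit-code 1` flag is what makes the scan blocking: a non-zero exit fails the job, so a vulnerable image never reaches the registry Flux is watching.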
No — I mean, I could quickly touch on why Linkerd was so important for us in the COVID project, not just because of the free benefits that come out of the box.

Yeah, sure, definitely.

The short version of the story is we had a performance problem. We were trying to get 68,000 students to hurry up, get logged into the system, and schedule their testing, and we were doing it in batches — maybe 1,000 per so many minutes, or per hour, or whatever. We needed to get it all done in one day, I think, was the goal. And what was happening is that every time we sent out a new batch of notifications, students would log in and bring us down. So, long story short, we had an on-prem resource that was very much single-threaded and couldn't keep up with the requests. But the way we found it was because of all the built-in dashboards that Linkerd provides. We were able to go into the Grafana dashboards for Linkerd and find: all right, Service A depends on B, C, and D, and according to the dashboards, D is really slow — extremely. So then we looked at Service D and found out what it was calling. That took us down to — this was all in AWS, but all the services in AWS depended on our authorization and authentication systems on-prem. So then we jumped down to on-prem and realized, oh, look, we also have Linkerd down there. And we started digging into the same types of dashboards for the on-prem services and found out that this one service in particular was causing us lots of problems, to the point that everything was backing up everywhere. So we had to make a code change to the on-prem service to remove that single-threaded dependency, and the floodgates opened at that point. But without that ability to dig into the microservices and find out who's calling what and which of the upstream calls is being slow, we would have been lost.
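The diagnostic pattern in that story — compare latency across a service's upstreams and chase the outlier — can be illustrated with a toy script. The service names and latency numbers below are invented for illustration, not Penn State's data; Linkerd's dashboards surface the same per-upstream percentiles automatically.

```python
# Toy illustration of "which upstream is slow?": compute a p95 latency
# per upstream from sample observations and pick the worst one.
from statistics import quantiles

# Invented per-upstream latency samples, in milliseconds.
latency_ms = {
    "service-b": [12, 15, 14, 18, 13],
    "service-c": [30, 28, 35, 33, 29],
    "service-d": [900, 1500, 2200, 1800, 1200],  # the single-threaded culprit
}

def p95(samples):
    # 95th percentile: the 19th of the 20-quantile cut points.
    return quantiles(samples, n=20)[18]

slowest = max(latency_ms, key=lambda svc: p95(latency_ms[svc]))
print(f"slowest upstream: {slowest}")
```

The point of percentile-based comparison, as in the story, is that one pathological upstream dominates tail latency even when its average calls look survivable, which is exactly what the dashboards made visible.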
There's no way — I think they fixed it in a matter of an hour, or I might be exaggerating, maybe it was a couple of hours. Without that, we would have been stuck big time.

That was awesome, and it's nice that you shared it. I think most times companies share success stories, and it's rare to hear some of the struggles and challenges that go into getting some of these things running. Thank you very much for sharing that. Yeah, we still don't have any questions yet — if you're still on the chat, you can ask any questions.

Now, we have the CFP for KubeCon EU, which is probably closing tomorrow, Friday. Are you looking toward maybe speaking at the next EU event, or maybe the next KubeCon?

I don't have any plans to do so now, so I don't have any papers or submissions ready. But I would be open to doing it again in the future if I could think of a topic, yeah.

OK, awesome, thank you very much. Since we have no questions, I think that brings us to the end of this session. Thank you very much, everyone, for joining the latest episode of the Cloud Native End User Lounge. It was great to have Dom talking about Penn State's usage of cloud native technologies, and we really appreciate the wealth of experience he shared. We bring you the latest cloud native end user stories on the fourth Thursday of the month at 9 a.m. PT. Don't forget to join us for KubeCon + CloudNativeCon EU, from May 17th to 20th, 2022, to hear the latest from the cloud native community. Also, if you would like to showcase your usage of cloud native tools as an end user, join the end user community — more details at cncf.io/enduser. And remember, the CFP for KubeCon EU is ending tomorrow, so get your talks submitted to share your stories. Thank you very much for joining us today, and see you next time. Thanks.