 All right, folks, I think we can go ahead and get it started. So hey, everybody. My name is Jed Johnson. I'm a software engineer with Defense Unicorns. And today, I want to talk with you all about this new term in our field called Platform Engineering. So quick overview of the presentation, we're going to talk about what platform engineering is and how it's an outcome of the idea of DevOps. We're going to build a Kubernetes-based platform based entirely on free and open-source software. And finally, we're going to talk about some of the issues with platform engineering. And I'll speak from some of my own experience in the form of a case study. And we'll talk about why your org may or may not want to adopt this practice. All right, so before we can talk about platform engineering, we have to take a brief look at this idea of DevOps. So the term DevOps is ubiquitous in our industry, just like the word agile. Unlike agile, though, you'll hear DevOps used in a litany of different contexts. So you hear of DevOps engineers. You hear of DevOps as a culture. And some even use DevOps as a synonym for agile. But for our purposes today, I want to talk about DevOps as it relates to culture and team structure. So for DevOps as a culture, what we're getting at is this idea of shared ownership across the entire stack. So instead of having, say, an app team and an ops team off working in silos and the app team throws their code over the wall to the ops team, you would have these true full-stack engineers who are not only writing application code, but they're also responsible for deploying and operating that code in a production environment. So in many ways, this is a very good system. Adopting a DevOps culture means that the team has full ownership over the entire software lifecycle. So if there's ever an issue with the app, there's none of this like, hey, let me have my manager call your manager and we'll figure out who's responsible for fixing this outage. 
Because the team itself is proficient across the entire stack, they can own and they can solve whatever problem pops up. This also enables teams to be opinionated about their tech, meaning that teams are free to choose whatever tech and whatever deployment methodology makes the most sense to them and their particular context. Here's the rub, though. Although this gives app teams a ton of freedom and flexibility, depending on the size of your org, this model can become inefficient. If you have disparate app teams that have adopted a DevOps culture and they're owning their own processes from end to end, those teams have a tendency to inadvertently silo themselves. And if this happens, you can bet that multiple app teams are going to be solving the same problems over and over again. And maybe not necessarily at the application code level, but for example, if you're an org that uses AWS, a very popular method for deploying on AWS is with Elastic Container Service, or ECS. And if you have multiple app teams using ECS and those app teams are siloed away, you can bet that they've all solved the problem of deploying their app on ECS in their own way. So some app teams might take an infrastructure-as-code type of approach and write some Terraform. Others might just click-ops their way through the AWS console. And those teams are probably also using load balancers in front of their apps. So what type of load balancer did they choose and why? So even in just one AWS service, there's a multitude of different ways to do things. And that's just ECS; across AWS as a whole, there are hundreds of ways to deploy an app. And without any org-level guidance and opinionation, you end up with what I'll call ops sprawl. And this is where you have a cloud environment full of different resources, and it's difficult to know why those resources were spun up in the first place and who is using them.
And this is also likely very expensive for your org because you probably have orphaned resources lying all over your cloud accounts. And lastly, security and compliance can become especially difficult if each app team has their own snowflake of a deployment process. That means your security and compliance engineers have to reaccredit each of these individual snowflake applications. All right, so at this point, I think we've identified that sometimes in organizations, implementation of DevOps can have unintended consequences and can lead to inefficiencies. And so now we're at a great spot to introduce platform engineering. So the goal of platform engineering is to standardize the process for deploying and operating applications as well as their underlying infrastructure. So in other words, we want to take all of these disparate tools and processes that app teams are using and form an opinion on the best, most robust, and most context-relevant methods for deploying an application and operating its infrastructure. Then we want to make it easy for app teams to norm around that prescribed methodology. And it's important to note that I'm not talking about written policy and documentation. I'm talking about creating, enforcing, and automating a golden path to production for your app teams, very specifically. I mean things like creating reusable CI pipelines that enforce static code analysis and dependency scanning. I mean locking down your AWS accounts, being very intentional about the purpose of your IAM users and roles, and scoping applications to use the minimum set of permissions necessary to operate. So now we're getting to the heart of platform engineering, and when I start saying things like standardization and opinionation, some folks rightfully get nervous.
For many orgs in a DevOps world, individual app teams had complete control over their own deployment processes and it can sound like platform engineering is here to take that freedom away and go back to a world where other teams who don't have an appropriate level of context are dictating how an app team does their job. In reality, the crux of platform engineering is balancing this level of freedom you give developers with the underlying opinionation of the platform. So in other words, yes, we want developers to have the freedom to own their own processes, but fundamentally, we want those processes to be intentional and responsible in the context of the entire org. Okay, so we've talked quite a bit about the motivation for platform engineering and the problems it's trying to solve, but let's take a couple steps back here and talk about what exactly a platform is and what it takes to run an application in a production environment. So in its simplest form, a platform is just the underlying set of services and capabilities that an application requires to run effectively in a production environment. So as an example, we'll take a look at a simple application made up of two microservices, a front end and a back end. When this app gets deployed, engineers are going to want to know if it actually got deployed, is it healthy, is it reachable, and also how is this app performing? So we need some mechanism to check health endpoints and monitor connectivity and have some quantitative insight into how the app is performing from like a CPU and networking perspective. So to do that, we need a monitoring stack, which includes a way to scrape and calculate those metrics and then persist them to a database. Also, this application is going to emit logs and engineers are going to want to view those logs in real time as well as query historical logs. So to do that, we need a logging stack. 
The app is going to emit logs to a particular directory, and we'll need a log scraper to grab logs from that directory and then forward them on to a log database. So now we have these two data sources, logging and monitoring, and we need some sort of dashboard that engineers can log into to query them. So now we have network requests flowing through the system. We probably want to encrypt them with TLS. So there's a couple of ways to do this. Either we have each application implement its own TLS, point them all at some centralized certificate authority, or CA server, and manually pass certificates around, or we can use a service mesh, which will give us that exact same capability in the platform without needing to touch any application code. Next, we need to make the app accessible to its users. So we need some sort of load balancer along with DNS records. We'll also need a mechanism for internal app devs to deploy new versions of this app and then access those dashboards for logs and metrics. Ideally, we have some sort of developer portal that devs can log into and easily discover those dashboards and get a top-level assessment of the health of the platform and the application running on it. All right, almost there. As more users interact with the application and with the platform, we'll probably want some mechanism to scale these various resources up and down so we're making efficient use of the underlying compute that this is all running on. And last, but certainly not least: at some point, either the app or one of these platform components will have a zero-day critical security vulnerability, and this system will get hacked. When that happens, we need a runtime security component to detect the intrusion and alert us that we have been hacked. All right, so if this thing looks complicated, it's because it is. Building platforms is non-trivial, right?
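To make the service-mesh option above concrete: in Istio, for example, mesh-wide mutual TLS can be enforced with a single resource and zero application changes. A minimal sketch, illustrative rather than a production config:

```yaml
# Platform-level TLS via the service mesh: applying a
# PeerAuthentication in the Istio root namespace (istio-system)
# makes strict mutual TLS the default for every sidecar in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT   # reject plaintext traffic between meshed workloads
```

Doing the same thing per-application would mean certificate handling in every codebase; here it's one piece of platform config.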
There's basically an endless number of moving parts and tech decisions that you have to make to build something that your app devs actually want to use and that is easy to operate. And that's why, for a long time, it's been a popular decision to outsource these platform components to a cloud provider like AWS. And outsourcing that way may be absolutely fine for your org. But there are situations, especially if you deal with highly regulated or egress-limited environments, where that's simply not feasible. So a very popular method for rolling your own platform and implementing what we've talked about here is with Kubernetes. And this presentation isn't necessarily about Kubernetes, but we have to talk about it because it is one of those cornerstone pieces of tech in platform engineering. And at its core, all Kubernetes does is provide a really robust and extensible API for managing containers. And this API is so popular that Kubernetes has become the de facto standard for deploying containers, both on-prem and in the cloud. I want to be very clear, though, that when you choose to adopt Kubernetes, many of those platform components that your cloud provider was providing you, you now own, and you now have to operate and maintain and update. And so because of this, if you're going to go this route, I recommend you have some dedicated ops engineers. And lastly, because it's open source, you can hedge against vendor lock-in if that sort of thing is a concern for your org. And using Kubernetes as a base, we can build this entire system using exclusively free and open-source software. So let's see what that looks like. Let's build a platform. Okay, so the core tech that our platform is going to use is called Big Bang. And Big Bang is an open-source, declarative baseline of applications and configurations used to create a secure Kubernetes-based platform. It was created by a United States Air Force organization called Platform One.
And the idea was to take all of these widely used and widely adopted open-source platform components and bundle them all together into a single Helm chart so that you can deploy them all in one go. And then you have this platform-in-a-box type of user experience. So on this slide here, you can see the open-source platform components that we're going to use to build this platform. And by using this single dependency, Big Bang, we can install all of them at once. So we'll quickly go through this open-source tech. Our logging stack is going to use Promtail as the log scraper, which will forward logs on to Loki, our log database. We'll use Prometheus for monitoring and metrics. We'll view those logs and metrics using Grafana as our dashboarding tool. Istio is going to implement our service mesh and secure traffic between all of these various platform components. And NeuVector is our runtime security tool that's going to automatically be monitoring the cluster for anomalies and intrusions. All right, so let's see what this looks like in code. Wonderful. Okay, so here we have an AWS EKS cluster. It is mostly empty except for a couple of things. I've installed the EBS CSI driver add-on. And this just means that whenever a pod requests storage, it'll receive it in the form of an AWS Elastic Block Store volume. I've also installed a tool called Flux. And Flux is the continuous delivery tool that we're going to use to deploy this entire platform. So Flux has this concept of a HelmRelease. And a HelmRelease takes a reference to a Helm chart along with the values that you would pass into that Helm chart. And essentially, Flux is going to do the helm upgrade or the helm install for us inside of the cluster. So what this means is that we can define all of our configuration for this platform as YAML and use Flux to deploy it in a pull-based GitOps fashion. So Big Bang itself is a Helm chart made up of Flux HelmReleases for all of these various platform components.
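As a rough illustration of that Flux pattern: a HelmRepository tells Flux where charts live, and a HelmRelease tells it which chart to install with which values. The repository, chart, and names here are hypothetical, and API versions vary by Flux release:

```yaml
# Flux reconciles this pair by running the equivalent of
# `helm upgrade --install` inside the cluster.
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: HelmRepository
metadata:
  name: example-charts             # hypothetical repository name
  namespace: flux-system
spec:
  interval: 10m
  url: https://charts.example.com  # placeholder URL
---
apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: example-app                # hypothetical release name
  namespace: flux-system
spec:
  interval: 10m
  chart:
    spec:
      chart: example-app
      version: "1.0.0"
      sourceRef:
        kind: HelmRepository
        name: example-charts
  values:                          # same values you'd pass via `helm install -f`
    replicaCount: 2
```

Because all of this is plain YAML in Git, Flux can continuously pull and apply it, which is what makes the pull-based GitOps flow work.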
So this means the next step is to fill out the values for this top-level Big Bang Helm chart. And looking at the values here on the right, you can see it's actually pretty sparse. It's like 55 lines, right? And that's because under the hood, Big Bang is providing us a bunch of default configuration that's automatically wiring together many of these platform components. So for example, our logging stack starts with Promtail as the log scraper. Promtail is already configured to scrape logs from the /var/log/containers directory on the underlying Kubernetes nodes, where the pods are emitting their logs to. It's also already configured to forward its logs on to Loki, our log database. So the only real config of note here is strategy: monolith for Loki. And this just means that, hey, instead of deploying a bunch of read and write pods, let's keep it real simple for demo purposes and just deploy it as a monolith. Our monitoring stack, I know it says monitoring here, under the hood it's Prometheus. Prometheus ships with a DaemonSet of pods called exporters. And these exporters, they run on each node and they're calculating the metrics for that node, CPU and memory utilization, as well as for the pods running on that node. We'll use Grafana as our dashboard, and this is automatically configured to use Prometheus and Loki as its data sources. We'll use Istio as the service mesh, and Istio is going to inject sidecars into our pods, and that's going to facilitate TLS in the cluster. So under the hood, Big Bang is labeling the various namespaces of the platform components, and Istio sees that label and says, okay, you want me to inject my sidecars into all the pods running in this particular namespace. We're also going to use Istio for ingress to the cluster. So here we're creating an Istio ingress gateway, and under the hood, this is a Kubernetes service of type LoadBalancer.
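Pieced together, the values file I'm describing looks roughly like this. The exact keys change between Big Bang releases, so treat this as an illustrative sketch rather than a copy-paste config:

```yaml
# Approximate shape of a minimal Big Bang values file for this demo.
domain: bigbang.dev            # suffix for the generated hostnames
istio:
  enabled: true                # service mesh + sidecar injection
  ingressGateways:
    public-ingressgateway:
      type: LoadBalancer       # AWS provisions a public load balancer
promtail:
  enabled: true                # scrapes /var/log/containers on each node
loki:
  enabled: true
  strategy: monolith           # single pod instead of split read/write paths
monitoring:
  enabled: true                # Prometheus plus node exporters
grafana:
  enabled: true                # dashboards wired to Prometheus and Loki
neuvector:
  enabled: true
  values:
    containerd:
      enabled: true            # tell NeuVector the cluster's container runtime
```

Everything not listed here is Big Bang's default wiring, which is why the real file stays so short.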
And so what this means is that whenever this service spawns in the cluster, AWS will create a load balancer in front of the cluster, and that's what we're going to use to access all of these underlying platform components. And in this case, that load balancer is going to be public facing. NeuVector is our runtime security tool. So we let it know which container runtime our cluster is going to use, in this case, containerd. And then these configs are just saying, hey, NeuVector, as soon as you start up, go ahead and start scanning the cluster and generating reports and detecting anomalies for us. All right, so we are going to install this thing. I have a little script; under the hood, it's just doing a helm upgrade -i. I put it in this script because this is a lot to type live. I didn't feel like doing it. So we're just going to run install big bang, and we're reaching all the way to us-west-2 right now. So give us a moment here. I'm sure our internet here in Bilbao is glorious. So a couple more seconds. And under the hood, know that this is grabbing all of those different Helm charts for the various platform components. So it is kind of a large package. I should have given this script some logging. It's fine, perfect. I knew it would work. All right, so at this point in the cluster, we have these Flux HelmReleases, right? And they're going to start rolling out their deployments and their pods. You can see the Istio operator has already come up. And the thing is, this thing takes like 15, 20 minutes to install all the pods. So instead of sitting here and watching pods spin up, I went ahead and pre-provisioned another cluster that already has all of this installed for us. Perfect, okay. So this is what a full deployment of the platform looks like. So just under 50 pods, and you can see they all look good, they're all running. So the next thing we want to do is verify connectivity. So let's take a look at these virtual services. And this is an Istio construct.
And say for Grafana, we have a hostname and a gateway. So let's take a look at that underlying gateway. So it's a service of type LoadBalancer. Perfect. So here we have an IP. And I know this is not actually an IP, but it's what K9s tells us. So we'll go with it. And if we want to access those underlying services, what we can do is grab the actual IP of this load balancer. And then we can edit our /etc/hosts file and point those hostnames at the load balancer's IP. Perfect. And so now whenever we go to grafana.bigbang.dev, we end up in Grafana inside of the cluster, as soon as it loads. Perfect. Sometimes you just gotta hit refresh. All right, so let's make sure that our monitoring and log data looks correct. So we'll check out these dashboards. We'll look at Istio performance. That's a good one. Wonderful. Okay, so you can see these dashboards and these metrics showing CPU utilization and memory utilization. And you have all these fancy charts to show your boss. And all I'm checking here is that, okay, metrics data is definitely being calculated and collected and we can query it. So now for logs, what we can do is go to this Explore tab in Grafana. We can go to our label browser and let's check out that EBS CSI controller. Why not? Perfect. Okay, so here we have a log volume showing all the logs for that EBS controller. You can see the most recent logs, and then we can use these mechanisms in Grafana to query historical logs. All right, let's take a look at NeuVector. Okay, very good. So more pretty dashboards, and NeuVector is already scanning the cluster for anomalies, and it even gives us ways to improve our security posture. It gives us some recommendations here. So NeuVector has this interesting thing called discover mode, and in discover mode, NeuVector is basically analyzing all the pods in this cluster and then watching them for the various syscalls that are made to the operating system and looking at their ingress and egress network traffic.
And so once you've had NeuVector in discover mode for a bit, you can switch it over to protect mode, and it will use what it learned in discover mode as heuristics for detecting various anomalies and intrusions, and you can even set it up to automatically neutralize certain behaviors. So at this point, this is all I really need to see to know that the platform was deployed successfully. Honestly, I spend most of my time here in K9s just watching the pods spin up, but it's always good to actually verify that your services are reachable. Like ingress is a whole other animal that you have to figure out, right? Okay, so hopefully this gives you a decent idea of what a platform looks like. Next, we're gonna move on to the final part of the presentation, and I'll talk through a case study and lessons learned for rolling out platform engineering in a large organization. So this time last year, I was building a greenfield platform based on Kubernetes and Big Bang, which you just saw. And I was building this in a large healthcare organization, and this org had literally hundreds of app teams, some on-prem, some in the cloud in AWS, and this platform was meant to address two main problems. So number one, this org had one of the largest AWS bills in the country, in the United States, and it was the exact situation I talked about before, where there's loads of ops sprawl across hundreds of accounts, and app teams are owning their own processes, but they're also doing whatever they want, right? The second problem is that these applications were operating in regulated environments and handling sensitive healthcare data. And because of this amount of ops sprawl, it took security teams an egregious amount of time to accredit these apps to run in the various regulated environments.
So there was this idea that we could take Big Bang, which documents, in the form of OSCAL, which NIST 800-53 controls it satisfies, and then use that as the secure baseline infrastructure for these hundreds of app teams. And this is one of those things where it's a great idea on paper, but when you actually execute, it's tougher than it looks, right? We actually created and deployed this entire platform, from dev to prod, in less than six months, but it took many more months after that for us to actually get some adoption, which brings me to the first lesson learned. It's the first rule of public speaking, and it should be the first rule of platform engineering: know your audience. Even though your platform may not have a UI at first, this is still very much a design problem, and the same rigorous user-centered design that you would apply to any web app, we should be applying to the platform as well. Platform engineering should make developers' lives easier. It should enhance workflows, and if the platform doesn't do that, then fundamentally we have missed the mark. So at this org, as we rolled out this platform, what we quickly realized is that although many app teams were already in the cloud, running on AWS, very few were actually containerized, right? Which is a prerequisite to running in Kubernetes. And in fact, many of these apps were either running in a VM or they were completely serverless using Lambda. So now it's not just, hey, come join our platform. It's, hey, come join our platform, but first I need you to dedicate some engineering cycles to containerization. And for a team running a serverless app, for a team running on Lambda, why would you do that? Why would you make that swap? So it was a tough sell, and in the beginning we were marketing this platform more towards security and compliance engineers and really highlighting that value prop of automatically satisfying all of these security controls.
So there was less emphasis on the actual app devs. In order to get buy-in from those app devs, we open sourced all of our work internally to the org and then invited them to come build with us. Encouraging that community adoption and being transparent about what we were building was a vital step towards getting traction. Another issue that I personally had was with the operational overhead of running Kubernetes and Big Bang. During the demo, you saw those 50 pods, you saw all the different platform components, and all of these things have to be managed. And when I say managed, I mean things like checking for image updates and refactoring whenever the upstream Helm chart has breaking changes in its values, and that happens way more often than you'd think, right? And it's a ton to manage. Oftentimes I found myself refactoring the logging stack for the 10th time, and I would think, man, AWS has 20 different ways to give us logging. Why in the world am I rolling my own logging stack right now? So all that to say, depending on how fast the team moves and the amount of changes being made, the operational overhead of running a homegrown Kubernetes-based platform can get pretty absurd. So the lesson learned there is, when you're architecting a platform for your org, don't forget to weigh that operational overhead. And I'll take that a step further; my very practical advice is: if you can, don't start with Kubernetes. In our case, starting with Kubernetes and with Big Bang did address the core business problems we were trying to solve, but it's certainly not appropriate for every single platform and every single context out there. So if you're going to go with Kubernetes, just like any architectural decision, have a data-driven, identified need for it. So looking back, we spent probably a year building this platform, refining it, and attempting to onboard app teams. Like, a year, that's a super long runway, right?
Like that's a long time dedicated to building this thing. And I think because we started with Kubernetes, it became difficult to back out of that tech decision. And we did a great job creating an automated way to satisfy security controls. And this is very valuable for greenfield app teams. But for app teams who were already in prod, it was difficult for them to justify migrating from their tried-and-true deployment processes to Kubernetes. So as a platform engineering team, we kind of hit a wall, and it's nothing that can't be worked around, but we had to expend a lot of effort going back and doing more user research. So the final lesson learned here is, to the greatest extent possible, optimize for being wrong. And to quote Bryan Finster, we are probably wrong. We need to optimize our processes for being minimally wrong so that we can quickly adjust to becoming less wrong. And what I think this looks like is, first of all, taking a good hard look at your app teams and gathering as much data as possible about their existing deployment methodologies. And second, building smaller in an effort to fail faster. In our case, I think, you know, what if instead of starting with Kubernetes and Big Bang and everything that comes with that, what if we had just started with vanilla Kubernetes and maybe solved for ingress, and then tested our hypothesis that this was something that app teams in this org could actually use? We could have built that vanilla solution way faster and found out sooner that containerized Kubernetes was not a viable option for many of the app teams in this org. All right, so in conclusion, I think platform engineering, despite the recent popularity of the term, has been with us for a long time, and it'll continue to be a mainstay of our field, just like DevOps and agile and these sorts of things.
The core problems that all of these buzzwords are trying to get at are like, how do we make software development suck less and how do we build better products? And the final piece of advice I'd like to offer you all is focus on the problem, not the tech. Despite what a bunch of blog posts are telling us, just because a big company uses a piece of tech doesn't mean the rest of the industry should follow. And also focus on your users, focus on your business problems and ensure there's a clear value stream between what your engineers are working on and what the strategic objectives are for your org. All right, folks, I can't thank you enough for listening to me talk about platform engineering. I think we have a few minutes left in our time slot, so now I'd like to open it up for any questions.