Hello and welcome to First Principles of Cloud Native Technology. My name is Ron Petty, and I work at RXM. RXM provides instruction, consulting, and advisory services in the cloud native space. Some of the things we've worked on include the creation of the Kubernetes certifications: the CKAD, CKA, and CKS. Before cloud native technologies, I worked in energy and finance on high performance computing systems. So why are we here? We're here to talk about using first principles to understand cloud native systems. To do that, let's set up a scenario: you've inherited a system that you did not create. More to the point, this system has some aspects of cloud native computing, say, Kubernetes running in the cloud, and we'll leave it at that for now. Unless somebody hands you a wealth of information and experience, it can be hard to understand what the system is and what it does. So here we have the dreaded "SSH into a machine, and that's all I know." Is there something better we can do to more quickly come to an understanding of what this system does? Now, this talk isn't about judging. We shouldn't hate the code that came before. It's about understanding what we have, how to find it, how to confirm it, and how to fill in some holes in our knowledge. This talk comes out of experience teaching courses and consulting around this technology. At this point we've taught thousands of people around the world, of various skill levels, and there are some pretty common trends in the holes that need to be filled. They will vary per person, but the idea is that in 20 or so minutes this talk will give you some ideas for constructing your own plan, using first principles, to understand the system you have inherited. So how might we, more tactically, start this process?
One way, of course, is to set some goals. As an engineer, goals typically look like fixing bugs: we want to do something here and now. It could be something a little more involved, like creating a feature, which is often, though not always, more complicated. Beyond that is a bigger time commitment: maintaining. That's a different level of activity, spanning many features and many bugs. Again, we're talking in generalities about an entire SaaS-type system that you've deployed for your business. Then, once we know what we're doing and we're maintaining it, how do we improve upon it? So we've got this system on day one. We're being a little naive here and assuming there's not much to go on. What are some things we can do to get a better understanding? That's basically the goal of this talk. One approach is through first principles. Here we have a bunch of pithy quotes that explain, or at least start to talk about, what first principles are. Let's just start with the bottom one: reassemble from the ground up, and question assumptions. For us, we've inherited the system, and we want to know what the components are. How far back do you go? In a perfect scenario we'd go down to some elemental level, but we don't have unlimited time. So we have to balance two competing worldviews: full understanding versus getting to a point where we can start to build our understanding of the overall system back up. The idea here is to lay out some of those areas to look at and how they relate, so you can find the path that works for you. First, a couple of terms, so we're on the same page. A system is a collection of things working together; for us, it's groups of processes, whether we're building them or using some kind of hosted service.
A less discussed definition is the social one, the prevailing political or social order, which brings in the users. That's a system too. So there's a system on top of a system. We've got this SaaS-type deployment, and we have our internal users, such as customer care, finance, engineering, and data analytics, and then we have our end users. It's not all engineers all the time. So while we're here filling in our mechanical knowledge, when we combine the components we're going to need those other worldviews. We're going to want to get up out of our chairs, go talk to these people, and see whether what we think the world is matches what they are actually participating in. Beyond this ill-defined cloud native system we've inherited, there's cloud native development itself: why are we using these tools? As a reminder, there's "speed to market," the marketing-level description, which is a good high-level summary. The Red Hat definition here puts a little more meat on it: responsive, scalable, fault-tolerant apps anywhere in public, private, or hybrid clouds. To make it a little more concrete, we're assuming we're in some kind of public cloud. For any substantial system, some pieces are things we wrote. Some things are created by others, but we use them, like Kubernetes. And then there are services we use that are hosted, like a hosted database. The overall system has all these components, so how might we start to understand why they're there, or, even more simply, what is there, and how do we even find it? Then the reality check happens. On day one, again, the scenario is that we don't have much to go on, so what can we do?
If there is some existing documentation, that can be good; that's the upper left quadrant here. It doesn't matter what the diagrams are; the point is that there are two of them. If you look hard enough, you'll probably find more than one piece of documentation, and they can conflict. Maybe it's version one and version two, nice and easy like that. Other times it's two different interpretations of the same thing, which can be confusing. But we look for them anyway; they save us some initial work. Bottom left, there's issue tracking. Assuming we have a SaaS with many disparate processes, we may have many projects and many repositories. Even so, we can find trends: which repos are busy, which ones have the most issues, what types of issues they are, and what the issues are now versus six months ago. An hour or two of surfing through these can really give you some context on what's going on. On the top right is a lesser-known technique, which I think is a very good one when you don't know what's going on: look at the bill. Even just last month's bill, again assuming a public cloud scenario, will have an itemized list of all the services we're using. It may not give us the data flows, and it may not tell us every process running on a VM, but it does let us know that each service exists. That doesn't mean we're actually using it. In the worst case, the bill covers things that have nothing to do with our SaaS. In the best case, it covers only our walled-off production environment: these are the things in it and nothing more. Even then, that doesn't mean we're using them all. Now, this isn't about critiquing whether it's good or bad, although a bill can highlight, say, a really expensive database or unnecessary scaling.
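To make "look at the bill" a bit more concrete, here is a minimal Python sketch that totals an exported bill by service, most expensive first. It assumes a hypothetical CSV export with columns named service and cost; real provider exports use their own column names and formats, so treat this as an illustration rather than a parser for any particular cloud.

```python
import csv
import io
from collections import defaultdict


def cost_by_service(csv_text):
    """Sum line-item costs per service and sort them, most expensive first.

    Assumes hypothetical "service" and "cost" columns; adapt the names
    to whatever your provider's export actually uses.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["service"]] += float(row["cost"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical excerpt of an itemized monthly bill.
sample = """service,cost
Compute VM,120.50
Managed Database,310.00
Compute VM,80.25
Object Storage,4.10
"""

for service, total in cost_by_service(sample):
    print(f"{service:20s} {total:10.2f}")
```

The sorted list is a first inventory of what exists, not proof that any of it is actually in use.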
But that's not the point; we're in the scenario where we don't know what all this is about. So how can we get a good, quick view? A bill is a very good way to do that. Of course, you may have services in another cloud provider and not even know to ask for that bill, so don't forget that next step. You may have known the main part was in Amazon, but maybe you have parts in Google. Talk to finance. Talk to a manager who had been involved. See if there are other parts. There are other ways to look at it, such as network connections, but at a very high level this is an easy, powerful way to do it. Then the bottom right is shared context. Again, we're assuming we're relatively new to cloud native technology, and even the projects in the cloud native space are continually developing. Even though we're using particular technologies, there may be better white papers and scenarios for a related project than for the project we're actually using. So once we know we're using something, we can go look at its documentation. But that may not be enough, so also look at something in the same space to see how people are using it. Again, the idea isn't to critique or to replace; it's to get an understanding of the power of these things, assuming we're newer to this technology. With this idea in mind, I talked to some engineering managers of some very large-scale systems you may know, including some of the original managers at Twitter, as well as managers from other projects. I wanted their answers ad hoc, because I wanted the experiential view: what would you do on day one, without a ton of preparation, in this scenario? You've inherited a system, more specifically a cloud native type of system, though they started off talking about systems in general.
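Once network connections are visible, even a crude list of who talks to whom can be turned into a graph. The sketch below assumes you have already collected (source, destination) pairs somewhere, for example from connection listings on the hosts, cloud flow logs, or a service mesh; the service names are hypothetical. Nodes that call others but are never called are candidate entry points worth confirming with users.

```python
from collections import defaultdict


def build_call_graph(connections):
    """Turn observed (source, destination) pairs into an adjacency map."""
    graph = defaultdict(set)
    for src, dst in connections:
        graph[src].add(dst)
    return graph


def entry_points(graph):
    """Nodes with outgoing calls but no incoming ones: likely ingress points."""
    called = {dst for targets in graph.values() for dst in targets}
    return sorted(set(graph) - called)


# Hypothetical observations gathered during discovery.
observed = [
    ("web-portal", "api"),
    ("api", "postgres"),
    ("api", "queue"),
    ("report-worker", "queue"),
    ("report-worker", "postgres"),
]

graph = build_call_graph(observed)
for src in sorted(graph):
    print(f"{src} -> {', '.join(sorted(graph[src]))}")
print("entry points:", entry_points(graph))
```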
The three questions I asked were: How do you go about understanding the system you inherited? How do you know if it's doing what it's supposed to be doing? And how do you share that knowledge with others? Let's go through a couple of these managers' feedback. Here's the first one. The first goal in figuring out what the thing you just inherited is doing is to find out what it's supposed to be doing: talk to users. That's the real world; it's what they're doing right now. You'll learn about web portals that may not be documented, or some reporting engine behind the scenes. We'll start to see the entry points pretty quickly. There's also the dream of what they want and what frustrates them, but don't do that on day one; that's further down the road, once you know what you've got. Then, of course, there's verification. At a lower level, if we have documentation for these components' API-level communication, that's great. Schemas on the database side tend to be a little easier: there are tools to show the schema and to generate it. APIs have them as well, but they're not as often deployed. A combination of those things can give us a pretty decent view of what the conversation looks like, not necessarily who calls whom, but what it may look like. Then, tool selection matters if we see particular tools being used. If there's a queuing system in there, whether hosted or not, we can make some assumptions, like "there must be a need for a buffer, or a retry queue." So maybe that's a part of the system where performance really matters, or maybe it doesn't, and the queue is there because we work through something slowly. Either way, it gets us thinking in those terms.
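As an example of letting the database describe itself, here is a minimal sketch using Python's built-in sqlite3 module. It reads table names from sqlite_master and column names from PRAGMA table_info. The customers and orders tables are hypothetical stand-ins; other databases expose the same idea through their own catalogs (for example, information_schema in many SQL databases), with different queries.

```python
import sqlite3


def describe_schema(conn):
    """Map each table name to its column names, read from the database itself."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    schema = {}
    for (table,) in tables:
        # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
        rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        schema[table] = [row[1] for row in rows]
    return schema


# Hypothetical schema standing in for an inherited database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    total REAL
);
""")
print(describe_schema(conn))
```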
The next question is whether it's doing what it's supposed to be doing, and acceptance tests are mentioned here. This is the end-user experience: is it doing the features I care about? There are many terms in the testing world, and we can slice and dice this, but if you're frantically learning a new system, acceptance tests are a great tool, and if they don't exist, create them. Acceptance tests tend to focus on the actual features we want. A related idea is the characterization test: with this input, I get this output. I don't care whether it's good or bad; that's just what it is. So it's good to consider writing characterization tests as well. That technique is used very often with legacy systems, because no one remembers how they were built. So if the cloud native world is new to you, it doesn't hurt to look at what was done in the past with legacy systems; those techniques can also help you learn. Finally, there's measuring. Ongoing metrics tell us whether it keeps doing what it's doing. From a single snapshot I can't tell whether it's right or wrong, but with metrics we can look for consistency in behavior. And the bottom question: how do you empower others? That's sharing the information. Don't hide it. Engineers tend to measure and not necessarily share. It's not nefarious; it just very often happens that way. So consider sharing it, because that brings other eyes and other conversations. Now the next engineer. It's a different view, but with some overlap. How do you learn a system you just inherited? Here it's data flow analysis: again, the protocols and messaging. We're going to see there are tools that help us with this. Second, graphing that communication flow to see the relationships; again, we'll see tools later that help with this.
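A characterization test can be as simple as recording what the code returns today and flagging when that changes. The sketch below assumes a hypothetical legacy_price function standing in for inherited code; in practice the recorded baseline would be saved to a file and kept in the repository rather than held in memory.

```python
import json


def legacy_price(quantity, tier):
    """Hypothetical stand-in for inherited code whose intent nobody remembers."""
    total = quantity * 10
    if tier == "gold":
        total *= 0.9
    if quantity > 100:
        total -= 5
    return round(total, 2)


def record(func, inputs):
    """Capture what the code does today, without judging whether it's right."""
    return {json.dumps(list(args)): func(*args) for args in inputs}


def changed_cases(func, baseline):
    """Return the recorded inputs whose output no longer matches the baseline."""
    return [key for key, expected in baseline.items()
            if func(*json.loads(key)) != expected]


inputs = [(1, "basic"), (1, "gold"), (150, "basic"), (150, "gold")]
baseline = record(legacy_price, inputs)
print(baseline)
print("changed:", changed_cases(legacy_price, baseline))
```

Any refactoring that makes changed_cases return a non-empty list has altered observable behavior, whether or not the old behavior was "correct."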
How do you know if the system is doing what it's supposed to be doing? Here, metrics matter. What's more, we want to see trends; a snapshot is just a point in time. Ultimately we need acceptance tests to prove it does what it's supposed to do. But it's a system with presumably a lot of activity, so we want to start with the generalities and then dive into the details, which is, again, the first principles approach. And at the bottom, how do you empower others? Again, it's back to sharing: you tell people what you see, you listen to what they say, and you try to bring to light the parts that are missing. So how do we go about understanding? What we want to say here is: you've inherited the system, so you need to do some reflection. Of the technologies you see on the bill or hear about by talking to people, which do you know, and which do you not know? That gives us another opportunity to quickly figure out what we need to do, whether that's learning it ourselves or finding someone who knows what we don't and bringing them in to fill the holes in our knowledge. That requires self-reflection. Very often, if you're just fighting bugs, you're very myopic and focused only on that; you don't have a bigger plan. But the idea is that we've inherited the system and we're going to maintain it for the long haul. So how can we start to do that? This echoes what was said earlier: reflect against others and share the information. When they tell it back to you, do you come to the same conclusions or not, and what does that mean? Think about it as a group, and not just engineers: customer care, other managers, finance, and other groups have their own tools and insights into these systems. Okay. Now, there's a continuum here, from a technology point of view, that is less esoteric and more into the tactical stuff.
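To illustrate why a trend says more than a snapshot, here is a minimal sketch that fits a least-squares slope to evenly spaced samples. The latency numbers are made up for illustration. The last value alone (151 ms) says nothing about direction; the slope shows it has been climbing by roughly 5.7 ms per day.

```python
def slope(samples):
    """Least-squares slope of evenly spaced samples, in units per interval."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to see a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


# Hypothetical daily p95 latency in milliseconds for one service.
p95_ms = [120, 118, 125, 130, 138, 145, 151]
print("snapshot:", p95_ms[-1], "ms")
print(f"trend: {slope(p95_ms):+.1f} ms/day")
```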
On the right-hand side of this continuum is the cloud, which is, generically, all the larger encompassing technology. Kubernetes is our orchestrator, running pods; pods are collections of containers; containers are one or more processes. There are technologies at each of these levels, and for us in cloud native land, this is the spectrum. So, starting on the right, how do we drill through these? We've mentioned some concepts, like accessing bills and accessing the hosted dashboards. Take a VM: maybe in those initial scenarios we deployed something, and it's on a VM. That VM probably has performance metrics. They won't necessarily break down to any particular process, but they give us something to start with. Same thing with logs: if there's the ability to do log aggregation, take advantage of it. If it already exists, there are probably IAM-type access controls. Without knowing all the security settings, we can see what is used for access, how many accounts there are, and how often they change. We start to get a feel for how these things are put together. Then we get into the pods and, at a lower level, the individual machines. Very often when teaching this technology, this is where people tend to be weakest. Ironically, distributed computing is a big, complicated subject, yet it's the one people can talk about easily, because of horizontal scaling and things like that. Knowing the individual lower-level technologies is one of the harder areas. So how can we start to gain knowledge here? Well, if we can run a pod in the production cluster, we can run it locally. If it fails because of dependencies, that tells us something. We didn't need to have known there were dependencies; we can just experience it.
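When an image is run locally and fails because of missing dependencies, a quick reachability check can turn the error messages into a list. This minimal sketch only tests whether a TCP connection to each host and port succeeds. The dependency names and ports are hypothetical placeholders (5432 and 5672 are the usual defaults for PostgreSQL and RabbitMQ, used here only as examples), and a successful connection does not prove the service behind it is healthy.

```python
import socket


def reachable(host, port, timeout=1.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


# Hypothetical dependency list; in practice it comes from config, environment
# variables, or the errors a locally run container prints when it cannot start.
dependencies = {
    "database": ("localhost", 5432),
    "queue": ("localhost", 5672),
}

for name, (host, port) in dependencies.items():
    status = "ok" if reachable(host, port) else "UNREACHABLE"
    print(f"{name:10s} {host}:{port} {status}")
```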
A lot of people don't consider just running the thing locally, but you can do that. It's just an image. If there are multiple images in the pod, maybe we can run the individual images. Working back, we go all the way down to the process level and the related technologies: namespaces for sharing and isolating resources, and the security mechanisms they provide. Here's a brief example. We're running a pod with nginx, and we tail the log; that's at the higher, Kubernetes level. Then we jump all the way to the left, to the process level, and get the PID. What can we do with it? By knowing some of the concepts on the left-hand side of the continuum, we can echo output into the output stream of the pod, in this case the nginx container in that pod. When we tail the log again, we see our message. It is a cool example, but it's not meant to be a compelling "this is something you should do" example. It's meant to show that there are high-level concepts and low-level concepts, and they can meet, as long as we know where to look. Those are the gaps we want to fill in. So what are some tools available to do this? Again, on the right-hand side is the cloud. We don't talk much about it, because that's whatever the hosting provider provides, but there are things there to learn. In the middle, on the Kubernetes side, let's not forget Kubernetes itself has the ability to produce logs and metrics, and there are things we can deploy to fill in some of the gaps. Say we have basic metrics but no way to analyze them or do anything fancy. That's where something like Prometheus comes in, aggregating all the stats from our software, but that takes work.
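The nginx example works because, with a typical container runtime on Linux, a container's processes are ordinary processes on the node, and /proc/<pid>/fd/1 refers to that process's standard output. Here is a minimal Python sketch of the same idea, assuming you are on the node with enough privileges (typically root) to access the target process: find PIDs by name from /proc/<pid>/comm, then append a line to a process's stdout. Note that comm holds at most 15 characters of the process name, and that writing into another process's output is a learning exercise, not something to do in production.

```python
import os


def pids_by_name(name):
    """Return PIDs whose /proc/<pid>/comm equals name (Linux only)."""
    matches = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    matches.append(int(entry))
        except OSError:
            # The process exited while we were scanning, or we lack access.
            continue
    return sorted(matches)


def write_to_stdout_of(pid, message):
    """Append a line to a process's stdout, similar to: echo msg >> /proc/<pid>/fd/1."""
    fd = os.open(f"/proc/{pid}/fd/1", os.O_WRONLY | os.O_APPEND)
    try:
        os.write(fd, (message + "\n").encode())
    finally:
        os.close(fd)


if __name__ == "__main__":
    for pid in pids_by_name("nginx"):
        print("nginx pid:", pid)
```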
We may have to open up ports and deploy code; there's an effort there. So let's not forget there are some built-in tools as well. Then we can go down to the communication level between things, and that's where we have service mesh tools like Istio or Linkerd. On the left-hand side are things we don't necessarily have to install, which is why we like to highlight them. At the CLI, terminal, or shell level, there are things we need to know, like what return codes mean in a shell context. Same thing at the process level: what exec and forking are, at a high level, and what signals do; those are the methods of control. How do we isolate and share resources? All those things. Then there are tools to manage that: the CRI, Docker and company. There are CNI plugins that help at the network level, and the combination of these things. So what are their core technologies, how do they work, and how do they log? Just because you know how to get logs through Kubernetes, don't forget we can potentially go onto the node and look at the individual machine to help debug issues and understand things. So there are the nodes themselves. In a lot of cases, systemd comes into play as the init system. Falco, Sysdig, and kernel-level system call monitoring in general are lower-level tools that apply to any process. With them, I may not have any idea what your code does, but I can inspect it as it runs to see what it does. A comment on security: this is more about understanding what it's doing, good or bad. So here are some questions: What's easiest? What's most powerful? What's easiest to share? In my opinion, based on what I've seen in general, it's the left-hand side: knowing these lower-level tools.
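As a small, self-contained illustration of return codes and signals, here is a Python sketch. On POSIX systems, subprocess starts children using fork/exec (or posix_spawn), and a negative returncode means the child was terminated by that signal number. A shell would report the same two situations as $? equal to 3 and to 143 (128 plus SIGTERM's number, 15).

```python
import signal
import subprocess
import sys

# A child that exits with a specific status, like `exit 3` in a shell script.
done = subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"])
print("exit status:", done.returncode)

# A child that would run for a minute, stopped early with SIGTERM.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
child.send_signal(signal.SIGTERM)
child.wait()

# On POSIX, a negative returncode means "terminated by this signal number".
print("terminated by signal:", signal.Signals(-child.returncode).name)
```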
That's the first principles part: break these tools down to the first principles they're built on. Docker is built on namespaces and resource control (cgroups). Kubernetes schedules this across multiple machines. The cloud moves even beyond that. So it's the combination of all these things, but from a tactical point of view, going back to self-reflection, whatever on the left is missing will usually be one of the areas that helps you most if you can start to fill it in. There are other things in the system we've inherited. There are pipelines, which are another great view of the world. They're probably the closest thing to reality: if you push code and it shows up in production, that's what the system is. So studying those steps is a great way to start to see some critical features. But remember, it's not everything. Just because you have a pipeline doesn't mean everything was automated; there could be systems that are not, so try to fill in those gaps. Again, you're not going to do this overnight. You want to take a very systematic approach to what you see and verify it; that's really the game we're playing here. Beyond that, there's security, there's optimization, and so on. The goal here isn't to give you advice on how to optimize things. The data you get from all this will help you do those other things: be more secure, be more optimal, and get to the improvement phase, not just maintenance. But even getting to maintenance mode is difficult if you're new to the system. In the cloud native world we have lots of tools to fill in these holes with metrics and controls, but without knowing what the individual pieces are, a lot of people miss things: they hear the high-level terms and just want to go for it.
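To see the namespaces that Docker builds on, you can read them straight from /proc. This minimal sketch (Linux only) lists the namespace links for a process; two processes that show the same identifier for, say, net share that network namespace. Comparing a containerized process's entries with those of a host process shows which namespaces the container runtime created. Reading another process's entries typically requires root.

```python
import os


def namespace_ids(pid="self"):
    """Map namespace type to its identifier, e.g. 'net' -> 'net:[<inode>]'."""
    ns_dir = f"/proc/{pid}/ns"
    return {name: os.readlink(os.path.join(ns_dir, name))
            for name in sorted(os.listdir(ns_dir))}


for name, ident in namespace_ids().items():
    print(f"{name:18s} {ident}")
```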
But then they hand it off to someone, and that can leave a mess. So what we're recommending is a risk-averse approach: build up a plan, study these things over time, and they will continue to pay off in the future. And be sure to share this information. You don't need to tell everything to everyone, but get that reflection against other people. Okay. For our more junior friends out there, here are some more specific things you can look at, and the same goes for more senior folks who may have forgotten some of the lower-level technologies, which can happen. On the left-hand side, here are some things to start considering. Most of this is very tactical, like processes in user space and the proc file system, but some of it is more about models. The OSI model is a good one to reflect against; that doesn't mean we follow it literally, but we reflect against it. Microservices: why do we have these hosting platforms and containers, and why is it styled this way? I'm not going to go into all of these in this talk, but again, take that next step and do a little self-study to fill in the blanks. Some of the tools we mentioned, like Prometheus and service meshes, are highlighted here. Take advantage of that landscape and go see what's out there; a lot of these projects have great documentation. We only have 20 or 30 minutes here, so we can't cover everything, but there's a lot of good material out there. So please take that next step, do some fundamental research, and take advantage of first principles. There's a lot of cool technology, a lot to learn, and it's a lot of fun. All right, thank you for attending my talk. I hope you have a great conference. If you have any questions about any of these things, whether about teaching and consulting or just a technical question, if I can help you, I will. So please don't hesitate to contact me. Have a good day.