So this is going to be kind of a whirlwind tour through some of the management tooling around what I like to call day two operations.

A little bit about my background: I've been working in open source communities for a pretty long time. I started doing Linux user groups and other things about 15 years ago and contributed to open source projects here and there. I finally got a sysadmin position about 10 years ago, in 2006, which started with all the grunt work of racking servers and doing onsite installs from a CD (remember those?). We were going into data centers doing lots of that fun stuff, which is still fun to do today, because I don't get to do it anymore as part of my day-to-day job.

These days I'm a developer advocate at Mesosphere. Mesosphere employs most of the developers of Apache Mesos, and we also have a product called DC/OS, which helps with the management of Mesos and the Marathon scheduler. I'll touch on that briefly in this talk, because I use it in some examples later. I also work on a website called opensourceinfra.org. That's a group of open source projects that have gotten together around the fact that they have open sourced their entire infrastructures. OpenStack, which I used to work on, is one of them: the entire CI system, all of the wikis and bots and everything we ran inside the OpenStack community is open source. A bunch of other communities are doing this too, and I see it as the next evolution of open sourcing infrastructure. I'm not talking about that today, but it's something I'm really excited about, and we held an open source infrastructure day at the Southern California Linux Expo a couple of months ago.

I've also worked on a couple of books. Ubuntu was one of the open source projects I got involved with really early on, so I'm one of the co-authors of The Official Ubuntu Book. And I wrote an OpenStack book that came out last September, which means it's already about three years out of date. But yes, I wrote a book, and in my day job now a lot of my work is speaking at conferences like this one and talking about infrastructure topics.

So today I want to talk to you about day two operations. The premise is that I've worked with a lot of deployment and infrastructure tools over the years, and I have learned that anyone can write a deployment tool; there are tons of them out there. That's what I call day one. On the first day you have green fields. You deploy this new piece of software and it's awesome, right? Everything works great. Everything's highly available. Everything is running well. Your applications aren't dying. Everything is up to date because you just deployed it: your servers aren't stale, your containers are all nice and shiny, everything is stateless and that actually works. So deployment is easy. But when you're evaluating a tool, you obviously can't just look at how easy deployment is. A lot of these companies, and I know even the sales guys at my own company do this, go out and show off the really shiny UIs: the nice web interface, the documented API if they have one, the really great tools for deployment and getting things going.
But you really need to dive deeper when you're looking at bigger cloud native systems. The sales guys will come along and say it'll be really easy to run your apps, because the platform handles things like proxies, high availability, and all the other stuff your microservices need under the hood. That stuff does come easily these days, but it comes at the expense of a much more complicated picture under the hood, because you have more layers, which I'll talk about. You need to learn how to handle those layers and make sure the tooling you're using supports that.

When I talk about cloud native systems, I mean environments where you have all these layers of things. You have your network and hardware at the most basic level. On top of that you have some sort of resource abstraction. Hardware, network, and resource abstraction together might just mean you're using AWS: you tell AWS how many resources you need, and it gives you a machine with that much. Or, if you're running your own data center, you might be using something like Mesos or Kubernetes to abstract out the hardware; you tell it what you need and it gives you what you need. Either way, you have to pay attention to that layer. In the case of AWS, you need to make sure things are spread out between regions. In the case of your own hardware and network, you actually have to go into your data center, make allowances for failures, and keep an eye on your systems.

On top of that you might have a scheduler handling your resource abstraction: making sure your applications stay running, that they launch properly, and that when one dies it brings up another. Maybe your application is running in a container, so then you need container management, and you need to make sure the containers stay updated. In addition to your physical network, a lot of these systems also have virtual networks: your containers are talking to each other internally on a virtual network, while all of your external tooling, proxies, and public-facing stuff runs on a whole separate network, which may be a physical network or a VLAN inside your data center, and all of that needs to be maintained. And then of course you've got your application, so when it segfaults you need to know about it and know where to look.

One of the things we found working on DC/OS, which again is a management layer on top of Mesos, is that this all gets out of hand really quickly. You've got your physical servers, your scheduler, and all of this other stuff going on, when all you want to do is launch an app. So you want some unification of your resources, so that when you launch your app you can confirm that the scheduler got the resources, that the container was created properly, and that your networking is working. You want something like a consolidated ID throughout your stack for tracking that application. That's something we added fairly late; historically it's been really hard to trace applications through DC/OS. A rough sketch of the idea follows below.
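To make that concrete, here is a minimal sketch of my own of what a consolidated ID buys you. This is illustrative only, not DC/OS code; the layer names and events are made up. The point is that every layer tags its log lines with the same application ID, so troubleshooting becomes one query instead of four log systems.

```python
# Minimal sketch of a consolidated ID flowing through the stack layers.
# Illustrative only: the layers and events here are hypothetical, not the
# actual DC/OS implementation.
import json
import logging
import sys
import uuid

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
log = logging.getLogger("stack")

def log_event(app_id: str, layer: str, message: str) -> None:
    """Emit a structured log line tagged with the shared application ID."""
    log.info(json.dumps({"app_id": app_id, "layer": layer, "msg": message}))

def launch_app(name: str) -> str:
    # One ID is minted at launch and passed to every layer that touches the app.
    app_id = str(uuid.uuid4())
    log_event(app_id, "scheduler", f"resource offer accepted for {name}")
    log_event(app_id, "container", "image pulled, container started")
    log_event(app_id, "network", "virtual network interface attached")
    log_event(app_id, "application", "service healthy on port 8080")
    return app_id

app_id = launch_app("my-service")
# Later, tracing a failure is a single grep for app_id across the
# consolidated stream, rather than a tour of per-layer log servers.
```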
With the latest release we now bake that in, because we want people to be able to trace what's happening with their app; it doesn't always work perfectly, and sometimes things fail.

You also have the problem that if you have teams that aren't working well together, they might each want to run their own logging agents and their own monitoring inside each level of the stack. That gets really complicated: the people running your hardware are running monitoring on the hardware, the people running applications want to launch a monitoring agent alongside each application, and someone else is watching the scheduler. You end up with all these extra resources being consumed. You're logging things multiple times and running resource checks for metrics and monitoring multiple times, and that's not really necessary. You're also making people duplicate a whole lot of work, because they're all setting up their own monitoring. And this impacts all of your resources: it takes away memory, processor, network, storage, all kinds of things. It also makes troubleshooting hard, because your logs are all over the place. Your application fails to load, you have no idea where it failed, and now you have to start going through logs on lots of different servers, because everything's all awesome and stateless now and scattered everywhere. Having a consolidated view for everyone is helpful, because you want to know where the problem is and who to call when something goes wrong. I can tell you that when an application goes down and you don't know where the fault is, you don't want to have to call everyone on the list to find the problem.

So I've divided the things you need to consider, once you have everything launched, into a few categories. First is metrics and monitoring, which I'll say are two very different things. I put them together on this slide, but it was hard for me to do. I worked on OpenStack before this, and there was a project that did metrics, monitoring, and alerting all in one, and it was a huge disaster, because those are very different things with very different goals; you don't want to do that. But I put them together here because there is some overlap. So you have the collection of metrics. Metrics can be used for monitoring, but you also want metrics to provide to customers, to directors and other people in your company who are making decisions and want to know about your infrastructure, and to your own teams who are making decisions about where to allocate resources. From there, once you've collected these metrics, you do processing on them. That can take the form of alerting, if you want to send them to your DevOps team;
dashboards, which are really cool to put up around the office (people love those, and executives love them too, because they get to see how things are going, and as humans we just like pretty graphs); and then storage of your metrics, because you want to be able to look back and find out what changed a month ago and what it triggered, because you're having the same problem again and you never fully figured it out the first time. So you want some sort of storage for your metrics.

That moves us into logging. As I mentioned, you've got these different layers of your stack, like hardware and containers and applications, so you want scope-level logging, but again you want a unified view, so you want to make sure all the logs are going into the same place. You have a few options here: you can build your own centralized logging server, which I'll talk about, or you can build your own local one, or you can farm it out to integrated solutions, often proprietary services from companies that you pay for.

You also have security concerns with logging, and this is something I see get missed all the time. A friend of mine tweeted recently about not putting secrets in log files, because there are several applications out there that dump passwords and the like into syslog or the application log, and I don't need to tell you how bad that is. You need to make sure those security considerations are addressed, because there are a few projects that are not very careful about their logs. This is something we learned in our open infrastructure work too: we wanted to just dump our syslogs and make them open as well, but it turned out there were bad things in there, so we couldn't make everything in our open source infrastructure perfectly open. There's a small sketch of scrubbing secrets out of a log stream just below.
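Here is that sketch: a minimal, illustrative filter that masks values that look like secrets before a line ever leaves the host for the central logging service. The patterns are examples only and would need tuning to what your applications actually leak; a filter like this is a safety net, not a substitute for not logging secrets in the first place.

```python
# Illustrative log scrubber: mask values that look like secrets before
# forwarding lines to a centralized logging service.
# The patterns below are examples only; tune them to your applications.
import re
import sys

SECRET_RE = re.compile(
    r"((?:password|passwd|secret|api[_-]?key|token)\s*[=:]\s*)\S+",
    re.IGNORECASE,
)

def scrub(line: str) -> str:
    """Keep the key name for debugging, but redact the value."""
    return SECRET_RE.sub(r"\1[REDACTED]", line)

if __name__ == "__main__":
    # Use as a pipeline filter: tail -F app.log | python scrub.py | forwarder
    for line in sys.stdin:
        sys.stdout.write(scrub(line))
```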
I'll skip maintenance for a moment, because that's the huge one. Troubleshooting: again, when you first launch things, everything works great, but then things wither over time, and you want to know how to debug them. When you're looking at a full-stack solution, you want to make sure there are debugging tools at every layer, so you can debug services and your base systems, you can do tracing through the whole network, and when you run your little Chaos Monkey there aren't crazy dependencies in your stack that cause much more havoc than you intended. You want these things to be somewhat isolated.

Maintenance I would almost put into a day three category, because while it's very important, it's very complex, and a lot of solutions out there don't do a very good job of solving it. I'm talking about things like updating your cluster: you've got a whole fleet of servers in your data center that you want to start upgrading, and these are the machines running the most basic services. Maybe no one is accessing them directly, but they still need to be up to date. There's also doing resizes and, in general, capacity planning, because resizes can be hard, so you want to make sure you made some smart decisions early on.

One of the things I see a lot of people struggle with is user management. I mean, this is why we have LDAP, right? And it's terrible, but it's the only thing we have. User management can get hard, so you want a plan early on for who has access to what; if you don't, you're going to have people copying profiles around and giving people too many permissions. Being really attentive to your user permissions when you're starting out is important. It's very similar with package management: making sure you have a way to deploy your applications so they stay up to date and don't have dependency conflicts with each other. You've also got things like your networking policies, which can be really hard to change later on, so you want a pretty good view early on of how networking is going to work in your stack. And then there are things like auditing, backups, and disaster recovery. All of these need to be taken into account.

So let's go a bit deeper into some of these. Here's an example of how your stack might work. You have a host, which may be physical hardware or something in the cloud; you've got a container running on top of that; and you've got a service running in the container. One thing you can do is run collectd on the local system to gather everything, and then feed that into a consolidated event router. You're still running an agent on each of these, but you're putting everything into one consolidated event router, and from there you can farm it out to everything else you want to do with these metrics: that's where you store the logs and metrics, create your dashboards, and drive your alerting and monitoring. Then it's all one really nice consolidated solution. That's one of the things I'm really excited about in what we've been working on lately: we're finally consolidating a lot of these metrics, and it's working pretty well, because we now have a much better view of everything that's going on.

I used collectd in that example; there are others out there, like cAdvisor. As far as the event router sitting in the middle, there are a few open source options (I'll share these slides later; they have links to all this stuff): Fluentd, Kafka, Logstash, all kinds of things. There are also proprietary options for your event router. But essentially it's just a data stream; you just need some streaming software that can collect all of that from your different scraping mechanisms. Then you want to store it somewhere. That could be Elasticsearch, it could be InfluxDB, you could send it out to Graphite or something similar, or, if it's more logging-type data, you could just dump it on HDFS or Ceph or something.

Then there are a few dashboard options out there. I'm a big fan of Grafana; I've been using it a lot lately, because you can define your Grafana graphs as text files based on your data, and you can put those in revision control, which is super nice because you can track how they've changed over time and collaborate on them. And Grafana is really shiny; if you haven't seen it, I suggest checking it out, and it's open source. From there you've got alerting. There are different options out there; I'm sure you're familiar with PagerDuty, and there are a few others doing these alerts.
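To make the collect-and-fan-out shape concrete, here is a toy stand-in for the event router, written with nothing but the standard library. In real life the router would be something like Kafka or Fluentd and the sinks would be Elasticsearch, a Grafana data source, and your alerting service; the names and thresholds here are hypothetical.

```python
# Toy event router: one metrics stream in, several sinks out.
# In production the router is Kafka/Fluentd and the sinks are real services;
# these callables just illustrate the fan-out pattern.
import json
import time
from typing import Any, Callable, Dict, List

Event = Dict[str, Any]
Sink = Callable[[Event], None]

class EventRouter:
    def __init__(self) -> None:
        self.sinks: List[Sink] = []

    def subscribe(self, sink: Sink) -> None:
        self.sinks.append(sink)

    def publish(self, event: Event) -> None:
        # Every sink sees every event: storage, dashboards, and alerting
        # all work from the same consolidated stream.
        for sink in self.sinks:
            sink(event)

def store(event: Event) -> None:       # stand-in for Elasticsearch/InfluxDB
    print("store:", json.dumps(event))

def dashboard(event: Event) -> None:   # stand-in for a Grafana data source
    print("graph:", event["metric"], event["value"])

def alert(event: Event) -> None:       # stand-in for PagerDuty-style paging
    if event["metric"] == "mem_free_mb" and event["value"] < 256:
        print("ALERT: low memory on", event["host"])

router = EventRouter()
for sink in (store, dashboard, alert):
    router.subscribe(sink)

# An agent like collectd would feed events like this from every host.
router.publish({"host": "node-1", "metric": "mem_free_mb", "value": 128,
                "time": time.time()})
```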
Whichever you choose, you send the data up to them: you send it to the dashboard, you send it to your storage system, and you send it to your alerting system. But again, it's all the data you collect from all levels of your application. There are also a bunch of integrated tools. I liked going through the individual pieces first, because it's now possible to build a whole nice toolchain where everything works together, but here's just a giant list of other monitoring and metrics tools out there. I still use Nagios at home, but there are lots of more modern ones doing really interesting things, and the right choice definitely depends on your environment.

So, logging. Again, you've got your host, your container, and your service, and we call these scopes in your logging system. In this example, in DC/OS we have a consolidated logging view, just like we're doing with metrics. Docker has its own logging drivers, and at the most basic level, systemd has the journal, with journalctl. You can use all of these together, and they can all go into a consolidated event stream, so you can collect your data and then farm it out to other services to handle your logging. It's just like metrics: you collect from all the layers, put everything in one spot, and farm it out to whatever tooling you're using.

For troubleshooting, I include some DC/OS examples here. What you want when you're troubleshooting one of these big stacks with lots of moving parts is a high-level view of what has occurred. Preferably that's a unified view, where everything is on the same screen or you can query the data in some way, so it's all in one place and you don't have to log into a bunch of different systems to find the data. You want tooling that can trace an error through the stack, whether that means a consolidated application ID that goes all the way down when you're launching a new service, or when a service is trying to restart, or whatnot. There are tools out there that can do this. I'm also a big fan of communication on your team: making sure that, once you figure out what the problem is, you delegate responsibility for the solution responsibly. When you have a problem somewhere in the stack, you only want to call the specific people responsible.

In DC/OS we have a few things like a task log, and you can also log into specific nodes if you don't want to look at the consolidated view, which I have a picture of here. Again, I'm not telling you to use DC/OS, but as an example, this is one of the screens from one of our latest releases, showing resource offers. This is when an application says, "this is how much resource I need," and for some reason Apache Mesos and Marathon say, "yeah, no, that doesn't exist." There's a screen down here that gives you the host name and a row of check marks, showing what went right, or rather X's about what went wrong; check marks are good. In this example, there wasn't enough memory anywhere on the cluster to run the application. It checked all the other boxes, but it didn't get that one.
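That per-dimension check is easy to mimic. Here's a rough sketch, not the actual Mesos/Marathon logic, of matching an app's resource request against each node's offer and reporting exactly which dimension failed, which is what that screen's check marks and X's boil down to.

```python
# Sketch of resource-offer matching with per-dimension diagnostics.
# Not the real Mesos/Marathon algorithm; just the shape of the check that
# turns "it won't start" into "no node had 2048 MB of memory free".
from typing import Dict

def match_offer(request: Dict[str, float], offer: Dict[str, float]) -> Dict[str, bool]:
    """Return a pass/fail mark for each requested resource dimension."""
    return {dim: offer.get(dim, 0.0) >= needed for dim, needed in request.items()}

def diagnose(request: Dict[str, float], offers: Dict[str, Dict[str, float]]) -> None:
    for host, offer in offers.items():
        marks = match_offer(request, offer)
        summary = ", ".join(f"{dim}={'OK' if ok else 'X'}" for dim, ok in marks.items())
        print(f"{host}: {summary}")

request = {"cpus": 2.0, "mem_mb": 2048.0, "disk_mb": 500.0}
offers = {
    "node-1": {"cpus": 4.0, "mem_mb": 1024.0, "disk_mb": 10000.0},
    "node-2": {"cpus": 8.0, "mem_mb": 512.0, "disk_mb": 20000.0},
}
diagnose(request, offers)
# node-1: cpus=OK, mem_mb=X, disk_mb=OK
# node-2: cpus=OK, mem_mb=X, disk_mb=OK
```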
Having this consolidated view, instead of the system just dying without telling you, or making you dig through a bunch of logs to figure out whether it was the application that died or a resource offer that failed or any number of other things that can fail when you're launching a simple application, is the kind of thing we shipped in our latest release. You really want to look for that in anything you're evaluating, or build something like it yourself, so you can track what's going on.

You also want to be able to run diagnostics on your systems. I think there's a talk tomorrow, around this same time, about diagnostic tools for containers, which sounds really interesting, so I'll leave those tools to that talk. And again, you just want these dashboard-type things.

I talked a little about tracing; here I mean making sure you have tooling that can trace things like network latency and performance down through your stack. If you're looking at a networking issue, you want to know whether to trace on the public network or a private one: is it a problem with communication between containers, or a problem with external tooling? You can use pretty much all the off-the-shelf tools we've always used for network debugging; tcpdump is still a really great tool, and you can still use these in this new cloud native world. It's just gotten more complicated, because you now need to know about network namespaces and virtual networks as well as physical networks. And it is hard. I was working on a project last year where writing a bit about networking turned into a lot about networking; at the time I was working on OpenStack, and it's just really complicated. Only a few people really, truly understand this stuff, but with the right tooling, and by accepting that it's complicated, you can drill down to root causes when you're tracing through things.

And then, we all love chaos engineering. We totally broke our cluster at work the other day. It's only used internally by developers, but the Chaos Monkey ran and caused chaos, and that was a lot of fun; Marathon was unwell for a bit. But you want to make sure you're testing your stuff. Even in a cloud native world where salespeople are telling you everything's going to be awesome and highly available, you still need to test all of it, because there are always edge cases you just won't know about until you hit them.

So I'll take a few minutes to talk about more of the maintenance stuff I touched on earlier. One of the first things I asked when I started at Mesosphere, which was only in January, was: how do we upgrade our containers? And my colleagues said, what do you mean, upgrade your containers? You just launch a new one every time. And I said, sure, but how do you know if your application has been running in the same container for four months? What if you launched it and it just never had a problem, so it's still sitting up there? A newly launched one is fully up to date, with all the patches and all the updates, but what if the old one is just running fine? And they were like, stop asking hard questions. But these are things you have to think about.
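If you're wondering right now whether you have any four-month-old containers, even a crude check helps. Here's a small sketch that shells out to the Docker CLI; it assumes `docker` is on the PATH and simply flags running containers older than a cutoff you pick.

```python
# Crude freshness audit: flag containers that have been running longer than
# a cutoff. Shells out to the Docker CLI; assumes `docker` is on PATH.
import subprocess
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)  # pick a cutoff that matches your patch policy

def running_container_ids() -> list:
    out = subprocess.check_output(["docker", "ps", "-q"], text=True)
    return out.split()

def created_at(container_id: str) -> datetime:
    # .Created is RFC 3339 with nanoseconds, e.g. 2017-03-01T12:00:00.123456789Z;
    # we only need second precision, so trim before parsing.
    out = subprocess.check_output(
        ["docker", "inspect", "--format", "{{.Created}}", container_id], text=True
    ).strip()
    return datetime.strptime(out[:19], "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)

for cid in running_container_ids():
    age = datetime.now(timezone.utc) - created_at(cid)
    if age > MAX_AGE:
        print(f"{cid} is {age.days} days old; consider recycling it")
```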
You have to think about when you're going to do upgrades on your underlying system, how you're going to keep your containers up to date, and how you're going to track when these things were last restarted or refreshed. It's something a lot of people forget, because they launch their application and just let it go. There was a talk this morning about doing security audits on your containers, and this is a really big deal, because people really do just launch these things, let them loose, and never upgrade them. So you want to make sure you're upgrading them, or at least killing them off at a reasonable interval; when there's a security problem, you kill off the affected containers. In our product, for example, you can do a cycle-through: you make a change in the config, and it kills off a batch of containers and cycles through them, making sure enough stay up that the whole thing doesn't fall over while you get your upgrades in. There's a rough sketch of that cycling pattern just below. So you want to make sure you're keeping track of that.

And again, user and package management: who has access to install things, who has access to make changes and debug things. Obviously you don't want to give too much to everyone. I'm a believer in open source infrastructure, so my instinct is to just give everyone all the permissions, but bosses hate that. You want to know who has access to install things and how packages are managed.

Then there's how you want to scale your services. This has to be determined pretty early on, because you make serious architectural decisions based on how you're going to scale your system and how you think it's going to be used. I touched on networking briefly, but again, you want to have a plan, with the appropriate things running on the public network and the appropriate things running on the private networks. I spend a lot of time making sure that everything that should be local is local and that nothing is public that shouldn't be, not just from a security perspective, but because internal network communication, especially between containers, is often a lot cheaper than putting traffic out on the public network. So you want that separation for security, and you want it because you want to reserve your public bandwidth for things that are actually public and being accessed.

You also want to have an audit trail and be able to produce it. That's really hard with these cloud native systems, because all these pieces are very different from each other and all built on top of each other, but you want to know how your services are talking to each other, that it's all working well, and that it's secure. Also: who is accessing what?
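Back to that cycle-through pattern for upgrades from a moment ago: here is a minimal sketch of the idea. The `restart_instance` and `is_healthy` helpers are hypothetical stand-ins for whatever your scheduler's real API provides; schedulers like Marathon do this kind of rolling restart for you when you change an app definition, so this is just the shape of the logic.

```python
# Sketch of a batched rolling restart: recycle a few instances at a time and
# wait for health before moving on, so capacity never drops below a floor.
# restart_instance() and is_healthy() are hypothetical stand-ins for your
# scheduler's actual API.
import time
from typing import Callable, List

def rolling_restart(
    instances: List[str],
    restart_instance: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_size: int = 2,
    health_timeout_s: int = 120,
) -> None:
    for i in range(0, len(instances), batch_size):
        batch = instances[i : i + batch_size]
        for inst in batch:
            restart_instance(inst)  # kill off a small batch, not the fleet
        deadline = time.time() + health_timeout_s
        # The rest of the fleet keeps serving while this batch comes back.
        while not all(is_healthy(inst) for inst in batch):
            if time.time() > deadline:
                raise RuntimeError(f"batch {batch} never came back; stopping roll")
            time.sleep(5)
```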
Knowing who is accessing what is, again, logging and metrics territory. And then there's everyone's favorite, disaster recovery: making sure you have a plan, that it actually works, and that you don't bankrupt your company while recovering from a disaster. One of the things I think we're told in this new world of cloud nativeness is that we all have automatic backups, everything is HA, it's easy to launch a service, and it's going to be replicated everywhere. But replication is not the same as backups and disaster recovery, so you want an old-school plan for how to make sure these things stay up.

Let's see how much time we have. Oh, not that much time. I wanted to take a couple of comments if anyone had any, but essentially what I'm saying is: ask the right questions when you're building one of these deployments. Ask about logging and metrics. Ask about consolidated views and how you can get details out of your systems, because it's not as easy as just running an app in a container; there's a lot more to it than that. Have this checklist of considerations and make sure all of it is built into the time you budget, because I see way too often that this day two ops stuff gets overlooked, and then it ends up being complicated and really hard to manage. And again: unify as much as you can. So, what did I miss? All right, thank you, everybody.