Good afternoon. I'm Nathan Schimek, Vice President of Client Solutions at New Context. We are a San Francisco-based systems integrator that specializes in container implementations. I'm joined today by Dinesh Isharani, Senior Software Engineer at Portworx. And today, we're here to talk to you about what building multiple scalable DCOS deployments has taught us about running stateful services on DCOS. I'd like to take a moment and thank the Linux Foundation for hosting the conference, Mesosphere for developing a great product for us to build on top of, the sponsors, and all of you for showing up at 5 o'clock on a Friday. So without further ado, let's dive right in. The containerization space as it exists today has a myriad of challenges. First off, it's relatively new, so the teams that are today being tasked to build and maintain platforms often just don't have a huge amount of experience. Similar to the adoption of cloud technologies, there's a real ramp-up that comes with learning and successfully building all of these things. As such, sufficient skills and experience are one of the things that you should really look for as you go forward. There are areas where traditional skills don't necessarily directly translate, but need to be built upon. For example, in the networking arena, the recent addition of CNI, SDN, network overlays, et cetera, further complicates an already complex picture. So if your expectation is that a small team without domain experience is going to go from zero to production in a couple of months, it's probably going to be pretty challenging. That said, there is hope. Things are improving very rapidly, the patterns for success in the space are quickly emerging, and the community is doing a lot to bring those forward. So today, we'll talk about four high-level areas and then dive a little bit deeper. First, we'll look at platform availability overall.
And some of the key design decisions you should be thinking about to ensure your DCOS implementation is resilient to failure. Next, we will look at some of the sticking points we've both experienced within and outside the cluster. And finally, we'll review how organizations respond to these challenges and what has enabled them to find success running stateful services in DCOS. So let's take a look at platform availability. You'll see that there's a huge list of things that can be considered failure domains. Don't consider these specific to containerization or DCOS by any means. These are failure domains that you've probably seen in an Amazon environment, maybe a virtualized infrastructure, and certainly could be possible in your bare-metal infrastructure. In our experience, these are scenarios that, given sufficient time and number of users, you're likely to see at some point in your environment. So failures happen all the time. It's how we design around those and mitigate those risks that matters. At the end of the day, it's our job to mitigate the impact of these outages and recover from them. When we get things wrong, and we do get things wrong, it can be dangerous and costly. That said, don't lose hope. These are certainly not insurmountable challenges, and it's something we've really focused on improving over the last couple of years. When you do have an issue, get in the habit of holding a blameless postmortem, and be sure to include how one could have identified the service interruption through monitoring and metrics as part of your discussion. Then actually set up and test the appropriate monitors and ensure that they behave as expected. If you're unfamiliar with the concept of blameless postmortems, let's have a quick conversation after this. Actually diving into how to build a resilient platform, we're of the opinion that you should design for production quality from the start.
That doesn't mean that you have a production-level implementation during your POC days, but keeping that goal in sight is going to be really important. It's our experience that the difference in effort is comparatively small when you look at the challenges companies face when a POC implementation gets traction and suddenly you're hosting revenue-generating systems on top of unstable or essentially just small-scale infrastructure. Invariably, when that happens, platform stability issues emerge, as the platform is just being asked to do things it's not really designed to do, and users have bad experiences. Additionally, when you design from the start with production-scale infrastructure in mind, you inform decisions that you'll make at a later time, as you have a specific lens to work with. Your automation and tool choices are very heavily impacted by the design, and over the medium term, you should actually be able to get further ahead, as your investment in automation from the start, around the ability to build and rebuild clusters easily with minimal impact, will greatly reduce your upgrade times and allow you to rapidly iterate on your DCOS infrastructure. It's again our experience that on a small-scale implementation you can spend more toil and time on cluster rebuilds than on a much larger one, due to the typical approach of manual intervention in a POC or small-scale environment versus a heavy focus on automation when you go to scale. Some key points to think about for your automation efforts: are your operators able to safely terminate at least one node without any measurable impact? If yes, that's great. What happens if three nodes go down? And what if you answered no? What happens when you lose a node and you're sitting at a talk at MesosCon at 17:06 on a Friday?
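The "can you lose one node, can you lose three" questions above can be turned into an automated check rather than a thought experiment. A minimal sketch, assuming illustrative replica counts and quorum sizes for a few common stateful services; none of the names here is a real DCOS API:

```python
# Hypothetical game-day check: given each service's replica count and
# quorum requirement, how many simultaneous node losses can the
# cluster absorb? Service names and numbers are illustrative.

def tolerable_node_losses(replicas: int, quorum: int) -> int:
    """Nodes that can fail while the service still reaches quorum."""
    return max(replicas - quorum, 0)

def cluster_survives(services: dict, nodes_lost: int) -> bool:
    """True if every service still has quorum after losing `nodes_lost`
    nodes (worst case: each loss takes one replica of every service)."""
    return all(
        tolerable_node_losses(replicas, quorum) >= nodes_lost
        for replicas, quorum in services.values()
    )

services = {
    "zookeeper": (5, 3),   # 5 replicas, quorum of 3 -> tolerates 2 losses
    "kafka":     (3, 2),   # tolerates 1 loss
    "cassandra": (3, 2),   # tolerates 1 loss
}
```

With these example numbers, losing one node is fine, but a second simultaneous loss would break quorum for the three-replica services; that is exactly the kind of answer worth knowing before Friday at 17:06.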
Well, did your monitoring and metrics collection pick it up, automatically resolve it, and just open a ticket to let you know something happened? Or did a developer who relies on services provided on top of DCOS have to open a ticket internally? Or, even worse, did some end user wait 25 minutes experiencing an outage, open a ticket with your company, and then you have SLA impacts? All of these things can by and large be mitigated with a proper design and implementation from the start. Continuing down that scenario: so now we've got some nodes down, a customer has called in after 25 minutes, you've been paged out, and now you've got to open up your laptop, connect remotely, and take a look at what's going on. Do you have the ability to bring back the failed nodes with a single, easily executed command, or do you have to actually dig in and do some manual intervention? Again, now we're adding time. All of these things are relatively easily addressed, especially if you have the skill sets required to really take on the challenges associated with containers and stateful services within them. Now I'll hand it over to Dinesh to talk about stateful services and storage. Thanks, Nate. So in this new age of DevOps, typically everything needs to be automated, because no one's really got the time to log in and manually recover from failures. Also, this is not really possible at large scale, because you don't want one of your DevOps folks to be up at, like Nate said, five PM on a Friday trying to bring up a thousand nodes that went down and recover your data. You want to make sure that the storage solution you choose has good integration with schedulers, and if you're using multiple schedulers, you want to make sure that it works across all of them so you don't need to use multiple solutions.
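The "detect, auto-remediate, then open a ticket" flow described above can be sketched in a few lines. This is a hypothetical outline only: `check_agent`, `rebuild_agent`, and `open_ticket` stand in for whatever monitoring, provisioning, and ticketing interfaces you actually run; none of them is a real DCOS API.

```python
# Hypothetical auto-remediation loop: unhealthy agents get rebuilt by
# the "single command" recovery path, and a ticket records what
# happened, so the on-call engineer reads a summary instead of being
# paged by an end user 25 minutes into an outage.
import time

def remediate(agents, check_agent, rebuild_agent, open_ticket):
    """Rebuild every unhealthy agent; return the list of agents rebuilt."""
    rebuilt = []
    for agent in agents:
        if not check_agent(agent):              # health check failed
            rebuild_agent(agent)                # automated node rebuild
            open_ticket(f"auto-rebuilt {agent} at {time.time():.0f}")
            rebuilt.append(agent)
    return rebuilt
```

In practice the health check would be driven by your metrics pipeline and the rebuild by your provisioning automation; the point is that the human is notified after recovery, not asked to perform it.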
For example, you also want to make sure that you are able to efficiently schedule pods to be co-located with your data, so that you get good performance for your pods or containers and don't spend a lot of network bandwidth just sending data across your nodes. At large scale, you also don't want to manually provision volumes every time a customer, whether internal or external, needs to spin up new services, because that is just adding another layer of manual intervention, which is just not acceptable in this day of automation. You also want to make sure that you test various failure scenarios, and how schedulers deal with them with regard to storage, in order to avoid nasty surprises in production. We at Portworx are actually working towards an open-source framework called Torpedo, which will help you validate these various failure scenarios to avoid just that. The next thing that you should look at is how easily you are able to add or replace storage nodes and perform maintenance operations, because these are the kinds of operations that could result in downtime for your services. So you want to make sure that any storage solution you choose minimizes or eliminates this kind of downtime. For example, if you're using auto scaling groups with Amazon, you need to figure out how that would affect you. Would the storage from your old nodes automatically be available to your new nodes? And if you wanted to add capacity to your storage solution, are you able to scale up your current nodes, or would you have to add new nodes to scale out your cluster? Another thing to keep in mind is how your services would work in hybrid cloud deployments, because you don't want to be building tools and automation for each different type of environment that you have. You want to have one way of doing things across multiple environments.
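One failure scenario of the kind Torpedo is meant to validate can be sketched in miniature: kill the node running a stateful task and check that the scheduler can restart the task on a surviving node that holds a replica of its volume. The in-memory `MiniCluster` below is a toy stand-in, not the Torpedo API.

```python
# Hypothetical, Torpedo-style failure drill in miniature. A task runs
# co-located with one replica of its volume; killing that node should
# not strand the service as long as another replica survives.

class MiniCluster:
    def __init__(self, replica_nodes):
        self.replicas = set(replica_nodes)      # nodes holding volume data
        self.task_node = sorted(self.replicas)[0]

    def kill_node(self, node):
        self.replicas.discard(node)
        if self.task_node == node:
            self.task_node = None               # task died with the node

    def reschedule(self):
        """Restart the task on a node that still has a replica."""
        if self.task_node is None and self.replicas:
            self.task_node = sorted(self.replicas)[0]
        return self.task_node

def drill(replica_nodes, node_to_kill):
    """One failure scenario: kill a node, then verify the task recovers."""
    cluster = MiniCluster(replica_nodes)
    cluster.kill_node(node_to_kill)
    return cluster.reschedule() is not None
```

A drill against a single-replica volume fails by construction, which is the "nasty surprise" you want to discover in a test environment rather than in production.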
So for that, you basically want to use a cloud-native storage solution like Portworx to make sure that it's easy to manage and deploy your storage in one way, so you don't have to have multiple automation frameworks and tools to manage different deployments, as Nathan pointed out. Also, you want to aim for highly available data, as Nathan pointed out earlier, because you don't want to get into production and then, when you lose a node, figure out that you're not able to bring the same services up because you failed to replicate your data. Another thing you want to make sure of is that your storage solution is automatically able to place replicas across failure domains, so that you're always able to bring up your service even if an entire rack goes down. This will actually require your storage solution to be intelligent enough to figure out where its nodes are located and automatically place data in different availability zones when volumes are provisioned. Finally, you want to make sure that when the time does come to upgrade your storage solution, you don't have to bring down your entire cluster. You want to make sure that there is a way to perform in-place rolling upgrades to minimize disruptions. Again, this sometimes requires integration with schedulers, to let them know that your storage is going to be down on a particular node so that they don't schedule any containers onto that node while the upgrade is in process. So I'm going to hand it back to Nathan now to talk about testing for the failure scenarios that I alluded to. Great, thanks, Dinesh. Testing is key in our world. Today, there are a number of companies, like Portworx with Torpedo and Netflix with Chaos Monkey, that are building and open-sourcing tools which allow you to simulate real-world outages for various services.
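The zone-aware replica placement described above can be sketched as a simple round-robin across failure domains, so no two replicas share a domain until every domain already holds one. The zone labels and node names are illustrative; a real solution would also weigh capacity and load.

```python
# Hypothetical sketch of zone-aware replica placement: spread a
# volume's replicas across failure domains so an entire rack or
# availability zone can go down without losing the data.

def place_replicas(nodes_by_zone: dict, count: int) -> list:
    """Pick `count` nodes, round-robin across zones."""
    placement = []
    zones = sorted(nodes_by_zone)
    pools = {z: list(nodes_by_zone[z]) for z in zones}
    while len(placement) < count:
        progressed = False
        for z in zones:
            if pools[z] and len(placement) < count:
                placement.append(pools[z].pop(0))
                progressed = True
        if not progressed:
            raise ValueError("not enough nodes for requested replicas")
    return placement

nodes = {"zone-a": ["n1", "n2"], "zone-b": ["n3"], "zone-c": ["n4"]}
```

With three replicas, each lands in a distinct zone; a fourth replica only then doubles up in the zone that has spare nodes.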
Ideally, you would eventually mature that to actually running in your production environment, but on the path there, I would suggest building a production-like environment, so a minimal-scale implementation that follows the same clustering topology, network topology, et cetera, as your actual production infrastructure, and run it there. Doing so will likely expose gaps in monitoring, response times, and any number of areas that you are going to hit in production. This is just a less costly way to find them and patch them up. So again, develop metrics and monitoring that align with the failure scenarios that you see most commonly and that are most impactful. In the world today, it's incredibly easy to implement a tool, check a bunch of checkboxes, and just get totally inundated with the data that's delivered to you, to the point where it becomes unactionable. Really focus on what impacts you and how to respond to that. Additionally, when things break, and they will, it's really important to limit the blast radius. The last thing you want to have happen is a cascading failure, which takes significant downtime and effort to recover from. If we started with an HA design and implementation and have focused on automation, we've already taken significant steps to reduce the impact of single-zone outages, and we can further contain the likelihood of that happening by isolating user applications from each other, platform services from users, and platform services from each other. For example, if a platform service needs ZooKeeper, then the ZooKeeper instance that service is linked to should not be accessible to platform users. Isolating platform services from the user space will help ensure platform resiliency in the face of application issues. Additionally, sandboxing platform services will help avoid everything from noisy-neighbor problems and resource consumption issues to cascading failures.
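The ZooKeeper example above amounts to an allowlist: a platform service's private dependencies are reachable only from that service, never from user applications. A minimal sketch with a hypothetical policy table; real enforcement would sit in your network layer, not in application code.

```python
# Hypothetical sketch of the isolation rule: user apps may reach the
# platform service itself, but not the service's private backing
# store (e.g. its ZooKeeper). The pairs below are illustrative.

ALLOWED = {
    ("kafka", "kafka-zookeeper"),   # platform service -> its own ZK
    ("user-app", "kafka"),          # users consume the service itself
}

def may_connect(src: str, dst: str) -> bool:
    """Default-deny: a connection is legal only if explicitly listed."""
    return (src, dst) in ALLOWED
```

The useful property is the default-deny stance: anything not explicitly allowed, like a user app poking the platform's ZooKeeper, is refused, which keeps an application-side problem from turning into a platform-wide one.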
At the end of the day, infrastructure is multi-disciplinary and cross-functional, and DCOS is no different. You really need expertise in security, compliance, containers specifically, compute, storage, networking, automation, CI, on and on and on. We're not yet at the point where those skill sets have fully converged, so find people with experience in the space and bring them in. The days of having a compute team and a network team and a storage team don't really align with the model of DCOS, nor do they align with modern operational models in general. DevOps has pretty fundamentally changed the space, and you should look to a lot of the learnings from there. So now let's take a look at what's happening within the cluster. With that, I'll hand it back to Dinesh. So once you have your cluster up and running, you will realize that your needs will change over time, either because the apps that you use will change, the scale that you run them at will change, or it's just the ever-evolving tech that you're involved with. In such scenarios, you don't want to tear down your volumes or cluster and reinstall everything to be able to deal with your new requirements. For example, say you provisioned a 100 GB volume for an application, but the demand and use for that application far exceeded your expectations and you now need to allocate more space to it. Do you want to provision another volume and move data over from the old one? You don't. The ideal way to do this is in real time, without any downtime for your services. And you will eventually hit a point where you need to add more storage to your storage solution. Again, you want to make sure that the solution you have chosen allows you to do this seamlessly, by either adding disks to nodes or adding new nodes, as I mentioned earlier. You also want to make sure that you understand your customers' needs with regard to backing up and archiving data.
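The 100 GB example above is the classic case for online expansion: grow the volume in place rather than provisioning a second volume and copying data across. The `Volume` class below is a toy model, not a real API; a solution like Portworx would grow the backing device and filesystem while the volume stays attached.

```python
# Hypothetical sketch of in-place volume expansion: the volume remains
# attached throughout, and shrinking is refused because it risks data
# loss. Sizes and names are illustrative.

class Volume:
    def __init__(self, name: str, size_gb: int):
        self.name = name
        self.size_gb = size_gb
        self.attached = True        # never detached during the resize

    def expand(self, new_size_gb: int) -> int:
        """Grow the volume online; only growth is permitted."""
        if new_size_gb <= self.size_gb:
            raise ValueError("online resize can only grow a volume")
        self.size_gb = new_size_gb
        return self.size_gb
```

The contract worth copying is the one-way resize: growing is safe to do live, while shrinking would require data movement and is better treated as a separate, deliberate migration.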
For example, you want to set up schedules to take regular snapshots automatically, and also archive your data outside your cluster so that you can recover in case of disaster. With Portworx, you can do this by setting up snapshot schedules at container-level granularity, and also by taking cloud snaps, which can back up your data to either S3, Azure Blob, or Google Cloud Storage. So in case of disaster, all you would need to do is restore from that cloud snap and reconfigure your apps, and you would be up and running with your service. You also need to understand your security needs based on the service you are running. For example, how is your data stored at rest, as well as in transit? Depending on the industry you are in, there might be regulations, and you want to make sure that you can enable encryption for both of these cases. Lastly, you also want to make sure that you can monitor the health of your storage solution and receive alerts in case of impending doom, so that you can proactively take measures to avoid downtime. And today, with tools like Prometheus and Grafana, there is really no excuse for storage solutions not to provide such integrations. So I'm going to hand it back over to Nathan to talk about some of the platform security topics. Security within the containerization realm, and security in general, is a much broader and deeper topic than we'll have time to really go into today, but I figured we'd just do a couple of quick hits. Patterns for both attacking and defending containers are evolving rapidly. There are several open-source software initiatives creating patterns that attempt to address this space, but the bulk of the progress really has been made in the enterprise software realm.
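The two-tier schedule described above, frequent local snapshots plus less frequent off-cluster cloud snaps, can be sketched as a simple interval check. The intervals are illustrative assumptions, not Portworx defaults.

```python
# Hypothetical snapshot scheduler: local snapshots every few hours for
# quick rollback, plus a daily "cloud snap" shipped to object storage
# (S3, Azure Blob, GCS) for disaster recovery. Intervals are
# illustrative.

def due_actions(hours_since_epoch: int,
                local_every: int = 4,
                cloud_every: int = 24) -> list:
    """Return which snapshot actions are due at this hour."""
    actions = []
    if hours_since_epoch % local_every == 0:
        actions.append("local-snapshot")
    if hours_since_epoch % cloud_every == 0:
        actions.append("cloud-snap")    # off-cluster copy for DR
    return actions
```

At the 24-hour mark both fire, so the daily off-cluster copy is taken from the same moment as a local snapshot; recovery from total cluster loss then means restoring the latest cloud snap and reconfiguring the apps.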
There are a few things that you can do today, probably at relatively low cost, by either deploying new tools or tweaking existing tools to take advantage of some of the security improvements. Most people here are probably aware of the CIS Docker benchmark. I think it provides value; it's one of the things that we integrate as part of CI on a very regular basis. Additionally, you can look at container image signing, build-time vulnerability scanning, and compliance control enforcement and monitoring through something like InSpec and test-driven development. Again, I'm happy to talk about all these things over beers afterwards, but each one of these topics probably warrants a multi-day track, so we'll just skip through them a little bit quickly. Now to the fun part: really operationalizing things. At the end of the day, you will always need to maintain what you build. Maintenance encompasses version-to-version upgrades, major upgrades, accommodating breaking changes, et cetera. Cluster maintenance and upgrades have become significantly easier in the DCOS world. If you've been using it for 18 months or more, you know this, and based off of my rough understanding of the product roadmap for DCOS, there are some significant improvements coming in 1.11. I'm sure that the people out at the Mesosphere booth would be happy to run you through the product roadmap on that front. Even with good controls and training, users will still find a way to break things, lock up resources, and otherwise just cause havoc in the cluster. You know, occasional frozen jobs, runaway ops, and orphaned tasks just happen. That's the name of the game. Planning ahead for these issues will really make your life much easier. Now let's take a look at how we handle some of the challenges with externalizing services which are built and running in DCOS.
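Running benchmark-style checks in CI, as described above, boils down to comparing a daemon configuration against a table of expected settings. A minimal sketch using a small, illustrative subset of Docker `daemon.json` keys; the actual CIS Docker benchmark is far larger, and tools like InSpec implement this properly.

```python
# Hypothetical CI audit in the spirit of the CIS Docker benchmark:
# compare a parsed daemon.json dict against expected hardening
# settings. The three checks below are an illustrative subset only.

CHECKS = {
    "icc": False,                # inter-container traffic disabled
    "live-restore": True,        # containers survive daemon restarts
    "userns-remap": "default",   # user namespace remapping enabled
}

def audit_daemon_config(config: dict) -> list:
    """Return the keys whose values fail the expected settings."""
    return [key for key, expected in CHECKS.items()
            if config.get(key) != expected]
```

Wiring a function like this into CI means a misconfigured daemon fails the build instead of shipping; a missing key counts as a failure, which is the safe default for hardening checks.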
As I kind of alluded to earlier, networking is one of the more tricky areas, given the additional complexities added by network overlays, CNIs, et cetera. Clusters today are really well designed for internal traffic. Apps talking to apps in the cluster is highly reliable, well understood, and overall pretty trivial. The real challenges, in our experience, come when you need to wire into existing infrastructure and externalize a service. For example, do you have an IPAM tool in place today in your company? Does it provide an API that is easy to automate against? If not, do you have to adopt a new tool for your company to do IP address management, or do you carve out some subset just for your containerization environment? When you start to add in things like IP-per-container, this conversation becomes much more complex. In this realm, as well as with service discovery and load balancing, what you have in place today is going to largely inform what you do with your DCOS implementation. I'm sure we all have opinions on what we would want to do in a greenfield environment. When it comes to load balancing, service discovery, et cetera, it's my experience as a consultant that I haven't been particularly lucky with the ability to just start from greenfield. If you have those projects going, that's really cool; I'd love to talk to you about those and hear what your thoughts are. But again, for me, the name of the game here is really how do we integrate into existing environments, and then move stateful services that are today running on either bare metal or in a virtualized environment into containers. So last, but certainly not least, let's talk about organizational challenges. As I've alluded to, I actually believe that this is kind of the chief indicator of whether or not you're going to be successful in your DCOS implementation, and I would say containerization in general.
The team who leads the internal container initiative really will define its success or lack thereof. It's our experience that they need to bring themselves, their peers, and the internal developer community up to speed on all of these new technologies, patterns, et cetera, pretty much in parallel, and that's a pretty difficult task. As such, one of the ways to ease the burden here is to engage people early, probably your developer community first off, and really get to understand what their requirements are. At the end of the day, you're building a platform for services, and if you're not providing services which are consumable or of interest to the people building software on top of that, what are you doing it for? So I'm a strong proponent of thinking of this as a software project more so than a traditional infrastructure project. I always handle this in an agile fashion: do some requirements gathering, and work very rapidly and iteratively to provide value as quickly as possible. That way, assuming they have a good experience, assuming the platform is available and resilient and provides services that people are interested in, they're typically going to use it. And then once you have adoption, hopefully you can turn those people into evangelists to bring other people in your community onto the platform you're now providing. Too often I see a small-scale implementation that doesn't really look at who they're trying to serve, and then they don't get adoption and wonder why. You know, it's not one of those things where if you build it, they will come. If you make it easy to use and attractive, they might use it; but certainly, if you just make it difficult to use or are not providing any value, they're not going to engage with you. As Dinesh touched on, there are a number of guardrails that do need to be built, especially when it comes to reasoning about data services and guardrails in general.
At the end of the day, there are data sovereignty laws, which can be as granular as the local level, but certainly exist at the state and national level. As Dinesh pointed out, that can be something as simple as encryption. Or, if you're in a multi-region implementation and accidentally, or purposefully, decide to replicate personally identifiable information out of the European Union or the United States, for example, or vice versa, you have now really gone off the rails, and your internal controls and compliance organization is not going to be happy with you. Unfortunately, it's trivially easy to do that from a technology perspective, and it can have huge ramifications for your company from a legal perspective. That's not a conversation anyone wants to have with their CIO or internal general counsel. So look at internal controls, and engage with your internal controls group to see what you can do and what you should be doing. Additionally, be mindful of any industry-specific controls, right? In the United States, we have HIPAA for healthcare, we have SOC and a number of other things, and they reason about what we need to do and what our responsibilities are; regardless of the platform that we're delivering our services upon, it is our responsibility to meet them. So, at the end of the day, have some conversations, stay in compliance, and make everybody's lives easier. On the skill-set front, there are some ways to engender growth and adoption, right? First and foremost, even if you have to do some external recruiting, find some experienced engineers who have run through this or worked on the platform. They will be a great asset to you. If you don't have that, not a problem. There's a wide community, right? There are Slack channels and GitHub and a million places, conferences like the one you're at now, to really find people to engage with and learn from.
Additionally, especially early on, I'm a big fan of creating an operational playground, both for the platform engineering side as well as the developer side. It's my experience that I need to be able to figure out how to break clusters, break clusters unintentionally, and rapidly iterate on automation to rebuild things, and if I'm doing that while developers are attempting to learn how to use the platform, I'm negatively impacting their experience. So, if you have enough compute resources available, give yourself your own playground and give Dev their own playground. Eventually you can probably get to the point where you're mature enough to consolidate those, but out of the gate, I would definitely start there. Additionally, just general notes from Agile, right? Fail fast and fail often; it's totally okay. This is a learning experience for many of us. And finally, if you want to make this more attractive and drive some internal adoption, set up a hack day; figure out what makes sense and what problems you're trying to solve. Another great way to help on all these fronts is to focus on training, after you've found some internal advocates and evangelists who are familiar with the platform and can drive excitement within your organization. Once you have a base level of engagement, coupled with providing developers and engineers with environments where they can learn, you should be able to rapidly iterate and experiment to drive adoption. Somewhere around this point, I would suggest investing in formalized training, and then use that experience to build training that actually matters to you. At the end of the day, stateful services is a rapidly expanding field, so what makes sense for a company looking at implementing Cassandra versus something else might not be the same. Really find how to bring up the skill sets that are applicable to your organization.
At the end of the day, running stateful services in containers is not trivial. DCOS and Portworx are making it significantly easier, but it still requires expertise in a wide range of areas to do successfully, and finding experienced advocates and evangelists in your organization will really help. So by pulling all these things together, you should find that you're fostering the right skills to make your platform attractive and available, and eventually, or hopefully soon, getting to production-level services. So with that, we're pretty much done. Any questions? Okay, thank you very much.