All right. Good afternoon, everyone. Thanks for coming to our talk. This is Windows Server Containers for Cloud Foundry. Hopefully, this is where you want to be. Promise it'll be interesting. So my name is Sanjay Bhatia. I'm an engineer at Pivotal, working on the Garden Windows team. And I'm Matt Horn, software engineer at Pivotal, also working on the Garden Windows team. So what have we been working on? Part of our mission here as the Garden Windows team is maintaining the Windows 2012 R2 stack. So if you want to push your .NET applications to a Windows 2012 VM, we maintain the Garden implementation for that stack. What have we implemented in the last year, since our last appearance at the summit? Buildpacks, thanks to our friends at HP for the initial work on that. We added a configurable HTTP/TCP health check to match Linux, instead of the default mode that was just HTTP. We've improved on the bind mount implementation, enabled the use of instance identity credentials on Windows 2012, made various security enhancements, and we now have a cf-deployment ops file that you can use. In addition, we've implemented the Windows 2016 stack. So you can see a screenshot of our lovely CF stacks there. Now we have two times more Windows stacks than Linux. Not bad for years of work. So why have we been doing this? We've envisioned a first-class .NET developer experience on CF. We want to enable .NET developers in just the same way that Java developers are enabled with the Linux stack. We want to reduce the burden of leveraging Windows on CF. Part of that was helped by the BOSH Windows team: we can BOSH-deploy Windows cells. But containers on Windows are a really important part of that. So Windows 2012 gets us part of the way. We implemented that stack, and it enables a lot of legacy applications, a lot of .NET developers.
But some of the things that the platform enables for Linux, like container-to-container networking, pushing OCI images, OCI buildpacks, volume services, those things are not really feasible on Windows 2012. And we want to enable those for .NET developers. Cool. So there are some shortcomings with that 2012 stack that we've had out there for a number of years now. Basically, those containers are pieced together using job objects, the Windows firewall, and file system ACLs. All these things are relatively limited. Job objects weren't really intended to be used for containers. Job objects were meant to be similar to Linux process groups: you could say these processes are related to each other in some way, and then you can kill them all at once. So job objects were a nice way to approximate containers, and we could apply resource limits to some degree with them. But they really didn't deliver everything that you'd expect from containerization, especially compared with Linux. The Windows firewall is relatively limited. Well, it's actually quite extensive what you can program it to do, but there are some serious limitations, like not being able to firewall localhost. And so a process running in a container can talk to processes running on the host over localhost. And so if you're trying to protect your server from unauthorized use, maybe you're deploying Cloud Foundry and you have components like the rep or Metron or Consul, you don't want containers to be able to talk to those, but you can't stop that from happening over TCP on Windows. And then file system ACLs, while they're great, you still have a global file system. File system ACLs work pretty well, but they're not as good as a container-isolated file system. There's a real lack of true isolation and resource limiting. There's a shared system registry. And so if you have a legacy application that's writing to something like HKEY_LOCAL_MACHINE, or you're writing to a user registry, you're really going to have a bad time.
It's just not good. That shared file system: if you accidentally write some files to a temp directory, it's global. And so unless you've set the TEMP environment variable correctly, you might end up writing files to the same place as some other container. And that can really lead to some problems. In fact, we saw issues with that in early iterations of the project. There are shared network interfaces. And so you can't firewall localhost, and you can't set network bandwidth limits. So if you have a really bad actor, and maybe you're running a multi-tenant public cloud, Windows is really not a good fit for that deployment. We also didn't implement CPU limits in the 2012 R2 stack. Job objects do have the concept of applying a CPU limit. However, it's percentage based, and when you're trying to deploy N containers, trying to divide up that share-based load via percentages becomes really hard. We also saw problems when turning on CPU limits where they just weren't sufficient, in that containers couldn't even get themselves started if we applied CPU limits. You just have so little CPU allocated to you via that percentage that we ended up shutting off CPU limits in the 2012 offering. And the isolation primitives are, well, primitive. There's a known exploit involving job objects and the console host, conhost. This was documented by the Google Project Zero team in some work that they were doing for the Chrome browser. Chrome actually uses job objects to apply all the limits that are applied to Chrome on Windows. And so this conhost exploit allows you to start processes outside of the job object that you started in. So when we're using job objects to apply memory and other limits to your container, if you can just start a process outside of that, that kind of defeats the whole purpose. Now, what you can do is monitor the process tree, look for processes that start up outside of a job, and then put them back in the job. That's exactly what we do.
We have a process called the guard. But that's reactive. And so when you're building a system that's meant to scale, you don't really want to have a bunch of reactive processes putting things back in job objects. So how do we improve the experience? We want to improve the experience for app developers, give them more isolation. We also want, and this is a major point here, to improve the experience for CF component teams. The Windows 2012 implementation is built on top of IronFrame, which is a .NET library. It requires Visual Studio, MSBuild, lots of workstation setup, and onboarding becomes hard for new members of the Garden Windows team and other teams that want to interact with the Garden Windows implementation. So we want to improve that experience as we go forward as well. I think we went like three months without a workstation that even had Visual Studio installed on it. And whenever we had to make a change to IronFrame, it was really painful. And so how do we improve the experience? We want to leverage the Windows Server 2016 stack. Microsoft has been working to implement Windows Server containers. Containers are now a native concept to Windows, in Windows 10 and Windows Server 2016. In concept, they're similar to Linux containers. There's a version with a shared kernel, which is what we use. And there's also the, quote, Hyper-V containers that are suitable for multi-tenant workloads, according to Microsoft. And we want to bring these to Cloud Foundry. In addition, we want to adopt existing Cloud Foundry development patterns. We want to have all our components in Go. We want to integrate with the Garden team more deeply and use the garden-runc release and Guardian, so there's no longer a separate Garden Windows BOSH release. And we want to try to adopt industry standards around containers: take advantage of the Open Container Initiative, the runtime spec, the image spec, and emulate the Linux implementation of the runtime spec, which is called runc.
And we implemented our own, called WinC. So what were our design goals when building the 2016 stack? We focused initially on Windows Server 2012 parity. So we weren't trying to bring new features to the platform; we just wanted to deliver everything that we had before. So initially, we're only supporting the buildpack app lifecycle. There's no Docker app lifecycle yet. We support application security groups, just like we do in 2012. These are currently implemented via Windows firewall rules, but more on that later. We have resource limits, just like you had in 2012, for memory and disk. And we're really targeting that same class of applications to be deployed. So we're not initially supporting the Nano Server image. Microsoft has two root file systems for 2016: there's Nano Server and there's Windows Server Core. Nano Server is great, but it's more like, maybe, Alpine Linux, right? You can have this tiny base image now. Tiny to Microsoft is 200 megabytes. So tiny, tiny base image, versus Windows Server Core, which is a much larger image. But we didn't want to require app developers to rewrite their apps to take advantage of this new stack. Most applications that are currently targeting the Windows .NET desktop CLR runtime need things that are in that Server Core image. And so our initial support will just be Server Core, and maybe we'll have Nano Server some other day. We wanted an improved experience for .NET developers. So, that isolation, right? The isolation primitives in 2012 are really primitive, so we wanted better isolation. We also wanted to bring cf ssh to the Windows stack. Right now it's really difficult to debug Windows applications when they're running on the 2012 stack. And we know from our experience with Java and Linux that developers really like being able to SSH into their container and do whatever they need to in there to figure out what's going on with their apps. So we wanted to bring that to the platform.
We also wanted to enable more of the existing platform features, like container networking and Diego persistence. With 2012, it's been basically impossible to support either of these, because we didn't have isolated container networks and we didn't have isolated disk. And so with 2016, we can bring these features to the platform. We also wanted to set ourselves up for future platform opportunities, things like sidecar containers, which you might have heard about in some of the initiatives that are being talked about this week, and OCI buildpacks. So what does the 2016 server implementation give us? We have complete file system isolation. Each container runs in a virtual disk volume that is booted as a sandbox from the container image. The root of the container's file system is the root of this volume. It does not get to see anything outside of the container volume unless it's bind-mounted in. And bind mounts are read-only by default. And with this, we have a container root, unlike on Windows 2012, where we were just on the host. It follows the Linux patterns for security updates and deployments. We have a rootfs that's packaged in a BOSH release. If there's a security vulnerability in .NET or some other component of the rootfs, we can build a new BOSH release, ship it, and you can redeploy and have your patch in as long as it takes you to bosh deploy yourself. It simplifies the BOSH stemcell as well. No longer do we have to have the application dependencies, like the .NET framework, installed in the host stemcell; we can now have them in the container rootfs. And there's more about this on the next slide. So here we have a lovely diagram from our PM, William Martin. You can see the bottom layer of this rootfs is the Windows Server Container image that Microsoft provides. We install some .NET features, some Windows-specific stuff, .NET modules, the IIS URL Rewrite module, for example, and some utilities.
And you can see in the orange there an example of where your application droplet will live, with the buildpack and application all compiling together to become your droplet. Yeah. So one of the things that we kind of heard from a lot of users of the 2012 stack was, where do I put my DB2 module? And it was never fun to say, well, you build a special stemcell and you put it in there and then you deploy that. It's not a great experience, especially compared to Linux. So now we can say, hey, you put it in your rootfs or you put it in your buildpack. So, some more improvements we have with the Windows 2016 stack. We now have CPU limits. Microsoft has implemented them directly in the Host Compute Service with shares. So you're able to do the sort of CPU limiting that applications need to start up, and that application CPU sharing needs, like on Linux. Users are unique to each container. Before, containers were implemented with a user on the host in Windows 2012. Now users are unique to each application container. And you also have registry isolation, which is very important for these legacy apps that people will be pushing to the platform. Each container has its own copy of the registry, and each layer of a container file system actually has a diff between itself and the previous layer's registry. We also have network compartments, which are pretty much akin to Linux namespaces. Processes no longer listen on the host IP. They have their own loopback interface for the container. And container processes cannot communicate with the host unless we explicitly allow them over the network with a firewall rule. Cool. So we have a couple of architecture diagrams to talk through. They were drawn by our PM and literal architect, William Martin. So here's a high-level diagram of what actually runs on a Windows cell. Ultimately, everything that's on a cell is deployed via BOSH, and it gets there via the BOSH agent.
We have these components on the left of the diagram: the rep, the Metron agent, the Consul client, and the route emitter. All these components are written in Go. We are currently using all of those components in the 2012 stack as well as the 2016 stack. So for the most part, all the components that make up the 2016 stack are tried and true. We already know these things work well on Windows. We have a deployment mechanism for them. And we've seen them working in production very well over the last two, two and a half years. On the right is the new bit of the Windows 2016 stack, which is Guardian; it comes from the garden-runc release. And Guardian is the bit of the platform that actually runs the Windows containers. You can see it consists of the Garden server, a container plugin, a network plugin, and a rootfs plugin. We'll dive in a little bit to what those mean. You have the Gardener server, which implements the Garden API. The Diego component talks to the Garden API and says, hey, I need you to run these containers. So that's our standard API. That's exactly what we're using in 2012 and in 2016. Instead of writing yet another Garden release (let's see, we had the garden-linux release, the garden-runc release, the garden-windows release; we didn't want a garden-windows-2016 release), we went back to the Garden team and we said, hey, is there a better abstraction we can build here? And we found a place to push that down to a lower level. Sanjay mentioned this around OCI and adopting standards. And so with the garden-runc release, we have this server, the Gardener, that implements the Garden API. And that server has three sub-components: the Containerizer, the External Networker, and the Image Plugin. The Containerizer is where we saw an opportunity to leverage a lower-level abstraction to implement Windows Server containers. We saw an opportunity to implement the OCI specification on Windows. This is the Open Container Initiative.
It's a standard that defines how containers should be created, their lifecycle maintained, how they're destroyed, et cetera. So we wrote a CLI implementing the OCI spec for Windows, called WinC. WinC talks directly to the Windows Host Compute Service. And this allows spinning up and spinning down of containers, putting stuff in containers, that whole lifecycle. For networking, we wrote a networking plugin, winc-network, which talks to the Host Network Service, which sits alongside the Host Compute Service in Windows land. And for images, this is that rootfs: right now, we just have a dummy winc-image plugin, which gives you access to a root file system that's installed on the host. All right, now time for a demo. So here we can see our ops file. We have deployed Cloud Foundry and some Windows 2016 cells using this ops file. This is on master of cf-deployment, on the latest releases, so you can take a look at this and use it if you want to. And here, if we look at our stacks, we can see our Windows 2016 stack right there. And we've pre-pushed a couple of applications. We have a 2012 app, Nora 2012. Nora is our Windows equivalent to the Linux test app, Dora. And we have a Nora Windows 2016 app. You're missing a D, or an N. All right, we can see it's running on the Windows 2016 stack with the HWC buildpack. And let's do the most interesting thing and SSH to this app. So this should put us in a command shell session up in GCP. This is actually running on a server in GCP. And so what do we want to do here? We can look around our directory. Oh, look, there's our application directory. There's vcap. No subfolders exist. Let's go to PowerShell. Takes a little while. But it takes a little while on Windows normally. So there are our application files. And let's try to install a Windows feature. Let's see if we can run containers in containers. This will hopefully autocomplete for us. Nice little progress bar.
Oh, we're not admin, so we cannot install Windows features here, which is pretty good. Let's see. Let's try to open Notepad. Oh, it doesn't open Notepad. We can't see that. But if you look at the process list, you can see a Notepad up there; it's running somewhere. We can't see it, but it's running. We can see our PowerShell session. And if you notice, let's see if we can scroll a little bit. How do we scroll here? So, we can't see any of the BOSH service processes, which is good. We're showing that we're just running inside of a container. We can't see the system host processes. What else? We can see our, oh, no subfolders exist. Let's go look at the files. You can see our, whoa. I think you need not the dot slash. Not the dot slash. PowerShell, no. Well, I don't know. Well, you can look around your container file system. You can get instance identity credentials. You can look at all the processes inside of your container. And we're not admin, which is pretty good. Cool. Awesome. So I want to talk a little bit about the Windows Server semi-annual release channel. The Microsoft team is actually moving away from these gigantic monolithic releases once every four years; they're moving towards the semi-annual release channel, every six months. You've seen this in Windows 10 with the Anniversary Update and the Creators Update. So they're targeting basically every six months to do these major releases. 1709, you might think, oh, that's 2017 September. It's October. Where's that release? Any day now, promise. We've been actively working with the Windows Server containers team to deliver new functionality in Windows. We've seen major networking and performance enhancements in Windows Server containers in the semi-annual release channel. There's improved process isolation. That conhost breakout that we mentioned is mostly mitigated now. It's still sort of possible, but you can't really contain it. There's also improved CPU sharing.
We see a smaller rootfs, just 2.2 gigabytes. To give you an idea, it's currently five gigabytes. And sidecar containers: you can have two containers that are stood up next to each other, sharing the same networking, so they can communicate with each other, but are still separate containers. So, some areas that still need improvement in our Windows 2016 implementation. Memory limits, as in 2012, still do not constrain memory-mapped files. So you could potentially take up all the memory with a large memory-mapped file. That's something we're working on with Microsoft. There are also no process limits, so you can run as many processes as you want in a container. That's on the roadmap for the Windows Server containers team. And containers are semi-privileged. So if you elevate from the default vcap user, which is a non-admin user inside of your container, if you elevate to administrator, you can get around disk limits, which are currently implemented as a quota on your disk volume. And you can also get around the network access restrictions. The networking part will be fixed as part of the 1709 release, as the networking team has done a lot of work and now has network access control lists for network endpoints. Future roadmap items: we're working with the GrootFS team on OCI and Docker image push support. We're also thinking about true upstream support in Concourse for Windows workers. Right now there's a separate BOSH release that stands up a Houdini-based worker, but we want to have real Windows Server container workers. And we're also thinking about Nano Server image support. But first, we need multi-rootfs support; see the point about working with the GrootFS team. In terms of isolation, as Sanjay mentioned earlier, right now we just have shared-kernel isolation implemented. And this is similar in principle to 2012 R2 and Linux. In the future we might think about adding Hyper-V isolation. It's not implemented today.
Microsoft says Hyper-V isolation is intended for hostile multi-tenant workloads. What they're actually using this for is the Azure Container Service, isolating processes. However, Hyper-V isolation requires nested virtualization support, which is not supported by all IaaSes. Although I think it was GCP that just announced last week that they have this option. Not for Windows yet. Not for Windows. OK. But these containers are very heavyweight. Standing up one container took three gigabytes of memory. And so in talking with our operators and seeing what they're used to from the Linux experience and 2012 R2, Hyper-V isolation is kind of crazy. But as I mentioned, Microsoft is working on making these things better, and so I expect this will improve over time. We've actually seen improvements in the semi-annual release already; it's down to two gigabytes. So, call to action. What can you do? We are hiring at Pivotal and Cloud Foundry. So give us a call if you want to work on Windows, work on Windows containers. We love pull requests. All of our WinC and Guardian code, of course, is open source. If you take a look and see some bugs, give us a call. And definitely start using WinC, Guardian, and the garden-runc release on Windows. And take a look at our cf-deployment ops file. Give it a whirl. See if it works for you. More pull requests are welcome there as well. Cool. Any questions? I'd like to ask how the licensing topic is covered. If I want to have a Windows stemcell, do I have to bring my own license, and how do I integrate it? Great question. So licensing is one of the tough points here, of course, with Windows Server. For the BOSH deployment of Cloud Foundry, you need to figure out your own licensing. You need to bring that along. If you're using something like Azure, GCP, or AWS, that licensing is included in the cost of the VM if you stand up a Windows VM. Now, for your own IaaS, like vSphere or OpenStack, you've got to figure that out on your own.
Now, for the containers, it's my understanding that you can create as many of these Windows Server containers, the lightweight containers, as you'd like. So if you have a Windows Server license, you just bring it up. Those aren't charged on that per-CPU or per-socket licensing basis. And so you just start up as many containers as you want on a server once you have it running, which is basically the same licensing concept Microsoft has if you have, say, an ESX host and you're running Windows VMs on there: you license the physical host, and then you run all the VMs on there that you want. And the public IaaS stemcells are all available on bosh.io, so you don't have to build them yourself. You can just get the download link, and those will be valid stemcells that you can use. And coming soon to cf-deployment will be the option to use either an offline or an online BOSH release to get the rootfs. By default, there will be an online release, which will download the rootfs from the internet for you. And if you have constraints on having internet access in your environment, then you can use the offline ops file, which requires a couple of manual steps to build the release. We already have scripts and code in the release source, and instructions for you to build the BOSH release and upload it yourself. And then after that, it's just a usable BOSH release, and it deploys as normal. Cool. I think we're at lunchtime now, so I don't want to keep you from that. But go ahead and find us after this. If you have any more questions, we'll be around.