Excellent. Hello, everyone. My name is Aram Price, and this is Jim, and I'm standing right in the middle of the slides. That's dumb. Just a second here. So, we're going to talk about BPM and Cloud Foundry, one year in. We started this project about a year and a half ago. We did a presentation last year at CF Summit in Basel, and we're going to talk about what we've learned from trying to get most of the CF application runtime running under BPM. This is James Myers, software engineer at Pivotal Software. My name is Aram Price. I'm an engineering lead with Pivotal Software. We've both worked on various runtime and BOSH components over the years.
So, just a quick overview. I'll talk a little bit about BPM, a little bit about why you might want to use it, and what we learned from this last year of getting people to adopt it, and then Jim will go through some product updates and the roadmap to BPM v2.
So, what is BPM? I think, given the size of this crowd, we'll probably fly through this pretty quickly. BPM is the BOSH Process Manager. It provides declarative configuration for running the processes in your BOSH jobs as a BOSH release author. It should provide a consistent execution environment for those processes. That is to say, it starts up a little container, keeping your job isolated from the rest of the file system. It has an opinionated enforcement of BOSH conventions, which are sort of common but not necessarily enforced across the application runtime. It also provides a nascent operator interface for inspecting what's going on with BOSH jobs. So, bpm list is one of those examples, and we'll cover some more of them later.
This is what using BPM looks like in your BOSH release. So, the BPM YAML contains one or more named processes. Each one contains a path to an executable, which could be a script or could be a binary, any arguments you want to pass, environment variables you want to set, and some limits you can set for memory and open files.
You let the system know whether you want to be able to write to an ephemeral disk or to a persistent disk location. And then there's this unsafe section, which we'll talk a little bit more about. That's kind of an escape valve for those releases that don't conform to normal BOSH conventions. And then we involve Monit, but only as a way to invoke BPM, using start and stop conventions and a well-known PID file location.
So, why should you use BPM? It removes a lot of boilerplate that developers have been either cutting their teeth on or copying and pasting from other releases to do things like PID file management, log redirection, signal handling, process execution and termination, directory creation, permission setting, and privilege de-escalation. So, if you look at some of the older versions of releases like Diego that had been using Bash previously, there are probably hundreds, if not more, lines of Bash that have been removed by adding BPM. It provides a production-ready replacement for these things that is well tested in a Golang code base, and it also allows us to address bugs and security issues in these particular cases in a single place, rather than having to replicate those fixes across tens of BOSH releases. I think that's one of the things being presented in the security talk this afternoon: some of the lessons learned from CF security problems over the past year.
So, it provides isolation and security by default. At the file system level, the processes can only see the resources that they should have access to, so Job A can't read the configuration file for Job B even though they're on the same virtual machine. This provides credential isolation to some degree. Processes have a consistent execution environment. They have a very scoped view of the stemcell that's consistent whether they're co-located with one or 100 jobs.
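As a rough illustration of the BPM YAML described above, a job's config might look something like this sketch. The key names follow the BPM configuration format as best I understand it and should be treated as assumptions; the job name, paths, arguments, and values are hypothetical placeholders, not taken from any real release.

```yaml
# Hypothetical BPM YAML for an example job called "server".
# All names, paths, and values here are placeholders for illustration.
processes:
- name: server                # each process is named
  # path to the executable: a script or a binary
  executable: /var/vcap/packages/server/bin/server
  args:                       # any arguments to pass
  - --config
  - /var/vcap/jobs/server/config/server.yml
  env:                        # environment variables to set
    LOG_LEVEL: info
  limits:                     # optional resource limits
    memory: 1G
    open_files: 1024
  ephemeral_disk: true        # opt in to a writable ephemeral data directory
  persistent_disk: false      # no writable persistent disk location
```

Anything not opted into here stays off limits to the process, which is the point of keeping the configuration declarative.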
And as we saw before, you can opt in to be able to write to ephemeral or persistent data directories, but otherwise the rest of the stemcell is off limits for writing. At the process level, each process starts its execution in its own process namespace. If you're inside that process namespace, you only see yourself running. And it's secure by default. So, it prevents tampering with co-located software, either reading another job's config or writing to the stemcell. It automatically drops root privileges, and we apply a conservative but common seccomp profile to all the jobs running.
So, it also provides some operational standards. bpm start expects a process to start non-daemonized. It writes the PID file directly to allow it to know the process is up. bpm stop has a conventional way of shutting down a process: it sends SIGTERM, SIGQUIT, SIGKILL, and then removes the PID file. For operators, we have some tools like bpm logs, which allows you to see the standard out and standard error of a process; bpm trace, which provides an opinionated strace of a running process; bpm pid, to get the PID from the PID file; and bpm shell, which allows you to execute Bash inside the job's container if you need to do debugging or understand what's happening with the job.
All right. Deep breath. Some things we ran into and learned from while getting the CF application runtime to adopt BPM for most of the jobs running in that BOSH deployment. File system dependencies are super hard. So, sharing volumes right now is implicit and based on the path in the BPM file. We're not super happy about this, but it's a good first step. What that means is that if you want to share a directory between Job A and Job B, both of them have to include that as a volume that they mount inside their BPM YAML.
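To make the implicit, path-based sharing concrete, here is a minimal sketch of two jobs that both list the same directory as a volume. The additional_volumes key and its fields are my understanding of the BPM configuration format and should be read as assumptions; the job names and the shared path are hypothetical.

```yaml
# Hypothetical BPM YAML for job-a: mounts the shared directory read-write.
processes:
- name: job-a
  executable: /var/vcap/packages/job-a/bin/job-a
  additional_volumes:
  - path: /var/vcap/data/shared-example   # same path in both jobs
    writable: true
---
# Hypothetical BPM YAML for job-b: mounts the same path.
# Nothing states that this is a shared volume; the matching path is
# the only link between the two files.
processes:
- name: job-b
  executable: /var/vcap/packages/job-b/bin/job-b
  additional_volumes:
  - path: /var/vcap/data/shared-example
```

The two documents live in separate jobs, so nothing checks that the paths agree, which is part of why the sharing is described as implicit.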
Because Monit doesn't start things in a deterministic way, this might run into some weird mount visibility issues if the jobs in question are doing interesting things with the file system, like Diego and the rep, for example. The file system restrictions that we have, allowing jobs to write only to /var/vcap/data or /var/vcap/store, also uncovered some odd behavior in various releases. Some of them changed their behavior. For some of them, we had to provide this unsafe volumes escape hatch so that they could continue to operate as expected.
Process management: Monit, as many of you know, is relatively brittle. Fun fact: who knows that Monit communicates with the agent via SMTP? Welcome to the club. So Monit often has undefined or hard-to-understand behavior. Execution of a job and termination of a job must occur within 30 seconds, or Monit breaks and starts to do weird things. Most people don't know this. It's hard to discover. PID files have to be cleared to prevent a false healthy status. So if a job dies, the PID file still has a PID in it. Something else starts up and picks up that PID. Monit gets super confused. One of the things you should do is not use your PID files for health checks. Health checks probably should happen in BOSH's post-start, but that's just another thing we ran into. And another sort of hiccup with BPM is that it expects that processes will not daemonize. It expects logs to be written to standard out and standard error. This required some tweaks in a few releases.
Some cgroups things we ran into. We initially treated cgroups as something that BPM kind of owned and maintained wholly for itself to manage these processes. It was a bad assumption, in particular with Diego. They need to have access to a standard cgroup setup. The cgroups need to be located in an expected location. They need to be consistently configured, which means being aware of swap on the VM and a couple of other options.
Also, pre-start doesn't run on reboot. Another little gotcha for people starting out with BOSH.
Jim. Cool. Cover the product updates.
All right. Let's go over a little product update and talk about what we've been working on in the last year since we were here last time. So one of the biggest things we've been working on is actually providing some increased flexibility. And the reason for this is that when we initially dreamed up BPM, we knew that its strict opinions would only support about 90% of software in BOSH jobs. And so we wanted to bring BPM to a wider range of software and provide BPM's execution and operational guarantees so that it could take advantage of a lot of these benefits. And so the way we've done that is, one, by relaxing some of our stricter opinions. A perfect example of this is what we did with Linux capabilities. We started with this dream that we would have this approved list of capabilities that people could request. And then we would have conversations as people needed new capabilities and determine whether or not this was good for the product or if we should add it to our approved list. It turns out this doesn't scale at all, and it was really hard to manage. And we had a lot of conversations. So we eventually ended up removing this approved list and kind of let users specify exactly what they need.
The other one, which Aram has mentioned a couple of times, is that we kind of evolved with release authors as they adopted it. And this evolved through the unsafe escape hatch that we built in. And so the two things that you can really do with this unsafe configuration are, one, you can request privileged containers. So you can still run your jobs as root if you really need to. This is still important for things that are really tied to some of the host actions, like modifying network interfaces and stuff like that. The other thing is that it provides volume configuration that can escape the limits of /var/vcap/data and /var/vcap/store.
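A minimal sketch of what the relaxed options just described might look like in a BPM YAML follows. The capabilities, unsafe, privileged, and unrestricted_volumes keys and the capability naming are my recollection of the BPM configuration format and should be treated as assumptions; the process names and paths are hypothetical.

```yaml
# Hypothetical BPM YAML showing the opt-in relaxations.
# Names and paths are placeholders for illustration.
processes:
# A process that asks only for the specific Linux capability it needs.
- name: edge-proxy
  executable: /var/vcap/packages/edge-proxy/bin/edge-proxy
  capabilities:
  - NET_BIND_SERVICE          # for example, to bind a port below 1024

# A process that uses the unsafe escape hatch.
- name: network-helper
  executable: /var/vcap/packages/network-helper/bin/network-helper
  unsafe:
    privileged: true          # keep running as root in a privileged container
    unrestricted_volumes:     # paths outside /var/vcap/data and /var/vcap/store
    - path: /etc/network-helper
```

Because every relaxation has to be spelled out under a clearly named key, a reviewer can spot the unusual jobs by reading the config alone.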
And so the two major use cases of that are, first, software that doesn't conform to BOSH's standards, so you can't always force it into those two directories, but also, we found out, software that needs to execute other jobs. So people need to mount in /var/vcap/jobs so that they can call out to binaries that are configured there. And so while we've implemented all this flexibility and kind of taken a step back from some of the security guarantees we have, we think this is okay, because the declarative configuration gives the organization and operators the ability to audit what a job is going to do before they deploy it to their production systems or their development systems or something like that. And so we think that's still just a really big win. And as we go on, we can take the time to identify where these anomalies occur and either fix them or just be aware of them going forward.
Another place that we've tackled in the past year is errand support. So we really wanted to have a mapping between BPM and bosh run-errand. And what we came up with is bpm run. What bpm run does is just execute a one-off task to completion and then return its exit code. It's still declarative configuration in the BPM YAML, so it's very similar to bpm start. But what we think it provides is that it still removes that boilerplate. You still get log redirection, and you still get a lot of the file system setup. And you do still get the isolation. So you can be confident that your errands, when they're co-located, because that's now a very common pattern, are not conflicting with other things on the system. One of the really interesting things that we found out implementing errands is that you don't actually know all of the configuration you need at deployment or template time. And so we started exposing volumes and environment variables at runtime so that these jobs could take advantage of some of these aspects.
The really common case for this is an errand that wants to tweak what it exposes depending on which jobs it's co-located with. And so that's kind of a difficult thing to determine in templates, but really easy to determine at runtime.
And then really the biggest thing that we've been focusing on over the past year is trying to force adoption, or drive adoption, but also making sure that BPM is really production ready so that we can get it out into production systems and have it running. And so what we've seen so far is that the Cloud Foundry application runtime has actually mostly converted to using BPM at this point. I would say 90% of the releases are using it. I think the only exceptions right now are Garden and a couple of others. So that means Diego, CAPI, Container Networking, et cetera, are all using it. Some of them are behind feature flags, but at this point I think we're at the confidence level with BPM to say that most releases, if they're going to use BPM, should just go for it. It seems pretty good. We've also seen other products in the Cloud Foundry network, like the CFCR team and BOSH, using these things, and I've heard of many community releases starting to take advantage of BPM. I know Dr. Nic talks about it in bosh-gen and that kind of stuff. We've really started to see adoption increase, and we've also become confident in BPM's production readiness. And so we've been identifying bugs and solving them, and I think we have it running in some production systems today and hopefully even more tomorrow.
And so one of the things that's really going to come with that is, I would say, BPM v1 is right around the corner. We are pretty ready to cut it. Every release that we're making currently is, to me, kind of a release candidate for v1. We expect the interface to be very stable going forward, and pretty much all the functionality is there. So expect that probably within the next month.
And because we just talked about v1, the next thing we're going to talk about is what this means for v2. And so I think this is where things get exciting. So this is kind of the current structure of BPM on the left. What that is: BPM by itself is basically an isolation and execution tool that is still wrapped by Monit and still used in that kind of BOSH framework. But where we envision BPM moving is a two-part system. One, we imagine BPM as a process supervisor, but then we also imagine it as the process isolation layer as well. And so what do I mean by that?
Let's first look at process supervision and execution. We imagine BPM as a tool that can manage process life cycles and execution. So it's going to start, stop, et cetera, all these things. And it's still going to bring these generic release authorship improvements to releases that decide to opt into it. So that means PID files, file system dependencies, et cetera. Also, what we're going to see is that we're slowly, hopefully, going to start phasing out responsibilities from Monit. And I think it's eventually a long-term Monit replacement. And this is pretty cool for a couple of reasons. I think the first one is that it gives BOSH the ability to really unify its lifecycle. As you saw on an earlier slide, and we even ran into this ourselves, pre-start and post-start have this one gotcha in that they don't execute on every process start and they also don't execute on reboot. And so people often put core business logic in these scripts and then don't realize that until something breaks, and then they have to move it somewhere else. And so with BPM as a process supervisor, we can actually provide that guarantee that we will run these pre-starts and post-starts on every single process start that you have.
Another thing is really unifying this across operating systems. So if you've looked at a Windows BOSH release right now, you'll see that its Monit file is actually a JSON file that looks very similar to a BPM config.
And then on Linux we make use of Monit. So there are just two distinct ways that BOSH and BOSH Windows execute today. And we think that since BPM is just a Go binary, we can actually drop this in and have a consistent experience across both Windows and Linux, which I think would be really cool.
And then the other thing is just replacing Monit. So I don't know if any of you know, but we actually haven't updated Monit in a really long time because of licensing concerns. And so we're really stuck on an old Monit. And then Monit has all of these gotchas that Aram talked about earlier. And what we can do with BPM is really simplify that experience so that release authors don't have to know all these nuances. They just have to use a still hopefully powerful, but much simpler, interface. I think one other cool, exciting thing is that this opens room for more feature work in this area of BOSH. This has been a pretty static area for a really long time, as we haven't changed Monit. And so by replacing Monit, we actually get to drive feature development in this area and start imagining what this future looks like.
The other one was process isolation. And so this is kind of imagining more in terms of flexibility and what that means for jobs. We talked about how we tried to address the 90% case and all that. But what we really imagine with splitting out isolation is that we can provide multiple levels of isolation to jobs. This can be no isolation at all for some jobs that still need to see the whole stemcell for various reasons. But this also could just mean using the current BPM isolation model. And so it gives you that flexibility to run more software. I think another really cool thing about this that we've talked about is that, with this isolation layer, we can imagine running container images on BOSH stemcells.
And I think what we imagine that to be right now is that we provide the container image as a BOSH package and run it on top of the stemcell. And so it's starting to bring BOSH and other platforms closer together. So you can actually imagine moving software from one platform to BOSH and vice versa in a much easier way. And then lastly, I think Windows support once again is just another big win for us. If you look at Windows releases, they have the same problem as current or old Linux releases. They have all this boilerplate in PowerShell to manage standards. But with BPM, we could just bring that in, remove all of this boilerplate, and standardize execution across operating systems, which is a really big win.
Another thing we really want to focus on with BPM v2 is comprehensive volume support. And so, as Aram touched on, it really feels like volumes are kind of a hack right now. The way you specify just a path on disk is super weird. And it's not clear that you're trying to share this directory with another job. And so what we're imagining is making volumes a first-class concept in BPM and having users define them outside of a process. What this means is that multiple processes can actually reference the same volume, and volume life cycles can be independent of processes. This is pretty huge, because it turns out the volume life cycle is actually much longer than the process life cycle: a volume's life cycle often matches the VM life cycle, whereas processes can be restarted multiple times. So by doing this, we get more flexibility and we can handle volumes much more efficiently. This is one area where we definitely want to gather some input from users, and we're going to reach out and get some feedback. We want to know what configuration would work for our users. So feel free to tell us your thoughts.
The last one is that we're trying to tackle unique users in BPM v2.
And so right now, I'm sure all of you are aware that pretty much every BOSH process runs as this magic vcap user. But what we would like to imagine is a world in which BPM can actually provide a unique user for each process that executes, and this then leverages Linux user permissions for additional security and isolation. One of the biggest complexities with this, and why we haven't tackled it so far, is that we actually didn't know how this would work with volumes. So if we use a new user for every process, then we have to somehow make permissions match so that volumes can be shared. And that's something that we didn't know how to do. But with what we just saw, volumes as a first-class concept, we really can start tackling this with maybe group permissions or something else. It really helps us get to that world.
And so that's pretty much what we're imagining for BPM v2. I'd love to hear your thoughts and whether there's any functionality or features that you'd like to see. But that's pretty much it. And that's our talk. You can always reach us at #bpm in the Cloud Foundry Slack. And we really appreciate issues and pull requests in the GitHub repo. So please contribute. Does anyone have any questions?
So the question was: I am a lazy release author and I don't know what capabilities my software needs. Is there an easy way to figure this out or have configuration that just works? I would say no at the moment. I think it's worth understanding what capabilities your software actually needs. And as a release author, that's probably a beneficial process. But no, we have not tackled allowing that behavior. Any other questions?
Yeah. So the question is, what happens if we replace Monit? And I'm not sure of the mechanics, but one of the notions is that maybe this becomes an opt-in thing that people could set on their VMs.
Maybe initially not baking this into the stemcell but enhancing the agent so that it knows how to go get a certain thing from the director that would allow it to operate without Monit, or to fetch Monit and install it on the stemcell. That's one possibility. I think there are other ways we could go about it. We could start baking that piece into the stemcell. But I think, given the release lifecycle of stemcells, it might be more interesting to figure out whether we could develop a way to iterate on that more quickly, rather than having to bake it into the stemcell and wait for that feedback cycle. I think one of the other interesting things with this opportunity is that we have a chance to define a clear API for what the BOSH agent actually uses to determine process lifecycle and health. And then you can imagine an infinite future of many implementations of process managers plugging in there, which would be pretty cool and wouldn't necessarily tie us to BPM going forward, whatever the future holds.
So the question is, are we actively looking to move BOSH jobs to other executors, say Kubernetes, or vice versa? A lot of the inspiration for how we set up the BPM YAML was drawn from the pod spec. We aren't actively working on making some sort of a transformer at this point. But I think it's a general North Star that we've kept in our minds as we've been developing this, at least insofar as we didn't want to do anything that would intentionally preclude somebody from doing that, or us from doing that, in the future. So I think it's there. It's not actively a pursuit, but it's something we're keeping in mind as a door we want to keep open. I think also, just from my perspective, the best way to increase usage of BOSH is to make that transition in either direction easier and have less overhead for people writing software to understand how they want to run their software.
And so by moving towards this declarative configuration style, with maybe even container images at some point, we can envision that world where you can take a K8s service and move it to BOSH, you can take a BOSH service and move it to K8s, and we can exist in this world together, which seems pretty cool.
So I think when we would solve that problem is most definitely when we tackle volumes as a first-class concept. Oh, sorry. The question was: we mentioned timing issues with volumes as they exist today, and mount propagation and various things like that, and what is our timeline for solving that? And I think, in my mind, our timeline for solving that is BPM v2 with first-class volumes, just because it's difficult with the current interface to extract them and deal with them at a higher level. And so that's probably when we're going to tackle it. The timeline for BPM v2 is probably within the next year; that is what we're aiming for. But it's a little ambitious. So we'll see how that goes. Yeah. We might try to address some of the shortcomings, or maybe some edge cases, in v1 if possible. Yeah, definitely drop an issue or ping us in Slack or on GitHub if there's something that is a particular thorn for you at this point.
Cool. Anyone else? Awesome. Thank you all for coming. Yeah, thank you.