Hi everybody. Welcome to this talk on Container Canary. My name is Jacob Tomlinson. I'm a senior software engineer at NVIDIA, where I work on infrastructure-type tooling, and I contribute to and maintain a few open source libraries, many in the data science Python space. I work on RAPIDS, which is a suite of open source GPU-accelerated data science tools, and on Dask, which is a parallelism framework for Python that lets you scale your work out to many machines.

As part of working on those projects, I also help maintain a bunch of Docker images. Dask and RAPIDS each have official Docker images that we make available to people. These images have a variety of use cases: they let people pull things down and try out different versions of the libraries, but they're also meant to be used as building blocks, dropped into larger systems or into platforms that let you bring your own containers.

Now, when you maintain anything like this, occasionally you're going to end up with regressions. So I want to start with a little story about an issue I had a while ago, where I was trying to update the RAPIDS Docker image to work with Kubeflow. Kubeflow is a machine learning platform on top of Kubernetes. It has an interactive notebook service that allows you to bring your own container, and a pipeline service that lets you build up large computational pipelines, and each of these lets you bring your own container images and drop them straight in. So I wanted to make some changes to the RAPIDS images so that they would work with this platform. But in making those changes, I ended up breaking compatibility with a different platform. And I thought, well, if I was writing code for this, I would be writing tests to make sure I wasn't introducing regressions. So I started looking into how I could write that kind of test for my container images, and there really didn't seem to be much on offer.

So I went off and built a new tool. It's open source, you can find it on GitHub, and it's called Container Canary. At its core, Container Canary is a specification for writing tests. You write them in YAML, the same way you would write resources or manifests for Kubernetes or other platforms. It actually borrows the Probes API from Kubernetes: each test is implemented the same way you would implement a liveness probe or a readiness probe, so if you have experience with those already, a lot of this will feel familiar. Then there's a command line utility that takes a manifest containing your list of tests, your specification, and validates a container against it, giving you a pass/fail status at the end.

It's really important that this can be written down as a YAML spec, because then we can commit it into version control and keep it either with our documentation, with our container images, or with the platform, wherever it makes most sense. This is particularly useful in situations where a platform allows you to bring your own containers. Kubeflow is one example where we want to write containers targeted directly at that platform.
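To make that concrete, here's a rough sketch of what a minimal Canary manifest with a single check might look like. The field layout is approximated from the examples in the project's repo, and the check itself is purely illustrative, so treat the details as an assumption and refer to the repo for the authoritative schema.

```yaml
# Rough sketch of a minimal Container Canary manifest.
# Field names are approximated from the project's examples; the check
# itself is purely illustrative.
apiVersion: container-canary.nvidia.com/v1
kind: Validator
name: example
description: A single illustrative check
checks:
  - name: non-root-user
    description: Container does not run as root
    probe:
      # Same shape as a Kubernetes exec liveness/readiness probe
      exec:
        command:
          - /bin/sh
          - -c
          - '[ "$(id -u)" != "0" ]'
```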
But that platform is going to have assumptions about what services are running, what user ID you're using, where your home directory is, what's available on the path, and so on. Often these things are documented in a human-readable way but not enforced in any technical way. Container Canary lets you come along and codify all of those things and then check them: you can run your tests locally, and you can run them in CI. That was the initial use case we built this for, but it expands well beyond that, to writing tests for any container image. If you're going to drop an image into a system, you're going to make assumptions about what the interface of that image looks like. What services does it start? What ports does it listen on? All of that you can test with Container Canary.

So let's go back to Kubeflow and talk through that specific example. If you go to the Kubeflow documentation, there is a page on container images. At the top it says: here are some base images that we provide, with these libraries installed. The path of least resistance is to take those images, install your additional dependencies on top, set them up however you want, and use those, because they've been tested and are known to work with Kubeflow. I see a lot of projects doing that, pushing their official images: use them as a base and you'll have a good time. But there are a bunch of reasons why you might not want to do that. Your security team may not want you using arbitrary images from the internet maintained by somebody else. Maybe your organization has a vetted base image that you have to build everything on. Maybe you're building your own images for your own tools, like we are for RAPIDS; we're already building these images, we just want to bring them along to Kubeflow rather than build a different one. There are lots of reasons why you might already have an image that you want to use.

So you'll often also find in these documentation pages a section on: if you really want to build a custom image, here are our assumptions. In Kubeflow's case it's written out as a bullet-point list, but you see this across a lot of platforms, trying to communicate to you, the maintainer of the container image, the things you need to check for. Kubeflow expects you to expose an HTTP interface on port 8888; that's probably going to be Jupyter or VS Code Server or something along those lines. It expects you to honor some environment variables that set the URL prefix the interface is served under. It expects certain headers to be set, because it's going to iframe these web applications into another page. And it makes some assumptions about how data is going to be mounted in: it expects the username to be jovyan, the home directory to be /home/jovyan, the user ID to be 1000, and a volume to be mounted into that home directory with those permissions. So it lists out what the expectations are, as shown in the sketch below.

But this is quite a brittle way of handling things, because we have our own container images. I can go through and apply these assumptions, which is what I did, but that might change something, and other systems might have other assumptions that I could break in the process.
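As an illustration, here's roughly how a couple of those bullet points might translate into Canary checks. The port and header name come from Kubeflow's documentation, but the env var value and the exact probe fields, in particular the response-header check, are assumptions based on the example manifests in the Canary repo, so double-check them against the real schema.

```yaml
# Hypothetical fragment translating two Kubeflow requirements into checks.
# The top-level env/ports describe how Canary should run the container while
# validating it; exact field names should be verified against the repo.
env:
  - name: NB_PREFIX            # URL prefix Kubeflow serves the UI under
    value: /notebook/example/  # placeholder value
ports:
  - port: 8888
    protocol: TCP
checks:
  - name: http
    description: Serves an HTTP interface on port 8888
    probe:
      httpGet:
        path: /
        port: 8888
      initialDelaySeconds: 10
  - name: allow-origin-all
    description: "Sets 'Access-Control-Allow-Origin: *' so the UI can be iframed"
    probe:
      httpGet:
        path: /
        port: 8888
        responseHttpHeaders:   # assumed Canary extension for checking response headers
          - name: Access-Control-Allow-Origin
            value: "*"
      initialDelaySeconds: 10
```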
But also, what happens in the future if I want to support multiple platforms, or if I just want to make a change to the image because there's something wrong with it? I don't want to keep going back to written documentation like this to make sure I'm still complying with it.

With Container Canary, we can write all of this down in a YAML manifest. Each check in the list corresponds to one of those bullet points, and we test it programmatically using a probe from the Kubernetes API to make sure we're meeting the specification. So we're testing the username, we're testing the user ID, and so on. It supports the exec probe, the HTTP probe and the TCP probe, so you can check that web services are listening on particular ports, returning particular headers and content; everything you would write in a liveness or readiness probe in Kubernetes, you can test in Container Canary. Once you've written a manifest that describes those requirements, you can run canary validate on the command line, give it the path or URL of the manifest, give it your container image, and it will run each of those tests and give you a pass or fail. That makes it very easy to set up, either for testing locally or in CI as you're building images. There's also no reason you couldn't use this in the platform itself, to run a pre-check on a container.

So how do we use this within RAPIDS? We started by codifying the platforms we want to explicitly support, like Kubeflow, Binder, or Dask. Because we have written those down as manifests and run them in CI, we can confidently say we support those platforms. If we want to add a new platform in the future, the first thing we can do is write a manifest based on that platform's documentation and check whether we comply. This is also really useful for finding out whether different platforms have conflicting assumptions. If they expect two different user IDs, there's no way to resolve that other than having separate images for those platforms, but that's fine; we can move forward knowing that.

Container Canary is written in Go and compiled to single binaries. You can find releases on GitHub, so you can just download a binary, verify the checksum, make it executable, put it on your path, and away you go. That makes it very easy to install in CI. I would say Container Canary is fairly complete in terms of its scope: you can write the manifests, you can run them, you can write all the tests you need to write, and we support the full set of probes. There is a new probe for gRPC, which is in beta in Kubernetes 1.23, I think, and we'll add support for that once it comes out of beta. The things we're going to look at going forward are bug fixes, since this is still a relatively new project, and improving the user experience, making sure it works nicely both on your local machine and in CI.
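For reference, installing and running it looks something like the sketch below. The release asset name and the manifest URL are illustrative rather than copied from the docs, so check the GitHub releases page and the examples directory for the real paths; the invocation itself follows the canary validate pattern described above.

```console
# Download a release binary, make it executable and put it on your PATH
# (asset name is illustrative; see the GitHub releases page and verify
# the checksum for the release you download).
$ curl -L -o canary https://github.com/NVIDIA/container-canary/releases/latest/download/canary_linux_amd64
$ chmod +x canary && sudo mv canary /usr/local/bin/canary

# Validate an image against a manifest, for example the Kubeflow example
# from the repo (the image name here is a placeholder).
$ canary validate \
    --file https://raw.githubusercontent.com/NVIDIA/container-canary/main/examples/kubeflow.yaml \
    your-org/your-image:latest
```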
One future task might be to mount the canary binary into the container to run more involved checks, maybe reducing some of the boilerplate, so you're not having to provide bash commands and things in the exec probes. That's actually where the name Container Canary came from: you would mount the binary into the container and it would be the canary down the mine. But that's something we can come back to if the amount of boilerplate becomes a problem. The main things we want to get to are creating a bunch more examples in the repo, so it's easier for folks to come along, learn about all the different tests you can write, and write their own manifests.

Then we really want to start encouraging folks to write manifests and check them into version control with their projects. If you maintain a platform with a bring-your-own-container kind of model, your documentation should mention Container Canary and say: here's a URL for a manifest, you can run canary validate to make sure that your container will work. Going back to the Kubeflow example, I would really like to take the Kubeflow example from the Canary repo, upstream it into Kubeflow itself, and update that documentation page, so you still have the bullet-point list, but you also have the command you can run to programmatically validate that you're compliant.

So please start testing your containers; it will make all of our lives easier, and hopefully you won't run into regressions in the future like I did. And please publish your tests: if you maintain open source projects, please put your manifests in those projects and update your documentation to say that people can use Canary to validate their images. Thank you all very much for paying attention, and happy testing.