Hello, my name is Krzysztof Opasiak, and today we would like to tell you a few words about our bumpy ride with automating license compliance for Docker containers. So, the schedule for today: first of all, we would like to introduce you to the environment in which we are dealing with this topic, then we will tell you what we have tried, what worked, what didn't, what we have now, and where we are going in the nearest future. First of all, a disclaimer: neither of us is a lawyer, so this presentation cannot be viewed as any kind of guideline to license compliance, especially in Docker containers. If you need that kind of guideline, please make sure to ask your lawyer to provide you one. So, first of all, the problem statement. When you hear about license compliance, you usually think about companies that are shipping some products, so in general about some corporate environment. So you may be surprised, but we are dealing with this topic in an open source project. The project is called ONAP, and it's a center of gravity for 5G open source networking. It's a collaborative project hosted by the Linux Foundation that aims to provide an open source orchestrator for virtualized networks. But for the purpose of this talk, the functional part is not as important as the technical details. From the perspective of code, ONAP is just a few hundred repos, written mostly in Java, Python, JavaScript, and even Bash, which build into over 100 Docker containers that are all hosted on our public Nexus instance. And all those containers are deployed on Kubernetes using Helm with our single-click deployments. So we didn't start our journey in outer space; there was already some stuff that we had to take into account. First of all, ONAP had an automated Docker release process that allowed you to release a new version of a Docker image just by submitting a change request to Gerrit.
All the heavy lifting related to building the Docker image after such a change was submitted was done by Jenkins. It was also running some tests to verify the new Docker images. We also had a Sonar instance that was constantly scanning our source code for bad code smells, and also some periodic, semi-manual license scans that allowed us to catch license issues in our source code. Finally, the integration team also provided base images for Python and Java projects, which are the vast majority in ONAP. So how did it all start? I mentioned that we had periodic license scans of our source code. After each such scan, a report was sent to the interested people and to the people who were supposed to fix the findings. And after one of those scans, one of our developers started asking questions. Those questions were related to Docker containers, because we had been putting a lot of effort into making sure that our source code is well aligned with all the license requirements, but at that point we didn't realize that apart from the source code, we are also providing Docker containers to our users. There was a pretty long discussion, and it was really hard to understand what we are responsible for and what we are not. But fortunately, Steve Winslow shared with us a great article that had at that time been very recently published by the Linux Foundation. If you are dealing with this topic, I really recommend you take a look at it. The bottom line of that article is that if you are distributing a Docker image, you're responsible for everything that is inside. And distributing is not necessarily only shipping it to a customer; it's also making the Docker image publicly available. And guess what? We realized that all our images are publicly available on our Nexus instance. So it means that we are basically distributing everything that's inside our Docker images. And believe me, at that point there was really a lot of stuff inside.
It also means that, fortunately, we do not need to care about everything that we are deploying as a part of our single-click deployment, because we are reusing some of the community images that come directly from Docker Hub. So, according to the legal analysis, we were not responsible for those. We realized that we need license compliance. So how do you start? First of all, we decided that we really need to shrink our images. When we started this task, there were a number of Docker containers that had a full Ubuntu image inside, which contains a few hundred packages, and that's really a lot of stuff to take care of during the license compliance process. There was also a will from the community, and a TSC decision, to avoid using GPLv3 components due to some license implications that we wanted to avoid. So we also decided to create base images that are simply GPLv3-free. We knew that we needed to automate the license compliance process: with over a hundred Docker images for every ONAP release, and sometimes even five releases in a year, we really need to have this automated as much as possible. And according to the article, compliance is required for each and every layer of the Docker container. So even if you add a file in one layer and then remove it in another one, you're still responsible for all the potential license issues that this file can cause, and you need to do license compliance for this file as well. And when we started investigating what's out there, what we can use, we realized that there is no tool that can give us a full compliance report. We also realized that this task is not really simple, because Docker strips a lot of data; a lot of metadata is missing in Docker images, so often additional steps need to be taken in order to collect the data that is required to create a compliance report. So what was missing?
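The per-layer responsibility described above can be made concrete with a small sketch. This is a hypothetical helper, not part of any ONAP tooling: each layer is modeled as a list of paths it adds, and an entry whose basename starts with `.wh.` is a whiteout marker (the way OCI image layers record deletions). A file added in one layer and deleted in a later one disappears from the final filesystem, but still shows up in the set of files needing compliance:

```python
# Sketch: why every layer matters for compliance (hypothetical helper,
# not part of any ONAP tooling).

def final_filesystem(layers):
    """Files visible in the final image after applying all layers."""
    fs = set()
    for layer in layers:
        for path in layer:
            if path.rsplit("/", 1)[-1].startswith(".wh."):
                fs.discard(path.replace(".wh.", "", 1))  # whiteout: deletion
            else:
                fs.add(path)
    return fs

def files_needing_compliance(layers):
    """Every file that ever appeared in any layer, even if deleted later."""
    return {p for layer in layers for p in layer
            if not p.rsplit("/", 1)[-1].startswith(".wh.")}

# A GPLv3 library added in layer 1 and deleted in layer 2 is gone from
# the final image, but still needs a compliance check.
layers = [
    ["usr/bin/app", "usr/lib/libgpl3.so"],
    ["usr/lib/.wh.libgpl3.so"],
]
```

Running `final_filesystem(layers)` yields only `usr/bin/app`, while `files_needing_compliance(layers)` still includes the deleted GPLv3 library.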
Before we can even think about the license compliance process and a compliance document, we really need to learn what's inside our Docker images, so we need to generate a proper software bill of materials. Then we also need to define policies, because some licenses are incompatible with each other or with our goals, like the GPLv3 that we want to avoid. And obviously, if we introduced a third CI system to our project, the community would probably not be very happy about it, so we want to integrate tightly with the two CI tools that we already have. And finally, we want to generate a compliance report and publish it alongside our Docker images, so that anyone who is willing to access the source code of, for example, our GPLv2 components can do so without bothering us. We also want to integrate with the existing image release process. It's great to generate compliance reports, it's great to have policies, but if you are unable to provide feedback to developers in a timely manner, that's simply not going to work, right? So, some considerations that we had: we need to automate as much as possible, and we do not want to promote any vendor in the open source environment. And we want to be able to make any modifications we require to make sure that the report is generated the way we want it to be generated. That's why we really prefer open source solutions for all our issues, to make sure that we are able to modify them freely. As Krzysztof has mentioned, first of all we need an SBOM. So I went looking for a tool that could provide us one, and the first one that could do that out of the box was Tern. Tern is a CLI app that takes a container image or a Dockerfile and generates an SBOM for it. It is pluggable and has some support for ScanCode Toolkit, the most popular open source license and copyright scanner, and for cve-bin-tool. On its own it supports only packages, but with ScanCode Toolkit it will cover both package metadata and files.
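The policy idea above, checking what an SBOM contains against lists of acceptable licenses, can be sketched as a simple lookup of SPDX identifiers. The category assignments here are purely illustrative, not ONAP's actual policy, and certainly not legal advice:

```python
# Illustrative license policy check over an SBOM's package list.
# The categories below are examples only, not legal guidance.
POLICY = {
    "Apache-2.0": "approved",
    "MIT": "approved",
    "GPL-2.0-only": "restricted",   # allowed, but source must be offered
    "GPL-3.0-only": "prohibited",   # mirrors the (example) TSC decision
}

def check_sbom(packages):
    """packages: iterable of (name, spdx_id) pairs; returns violations.
    Unknown identifiers are flagged too, since they need manual review."""
    violations = []
    for name, spdx_id in packages:
        category = POLICY.get(spdx_id, "unknown")
        if category in ("prohibited", "unknown"):
            violations.append((name, spdx_id, category))
    return violations
```

In a real setup the policy table would live in a version-controlled file rather than in code, so that changing it goes through review.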
Tern itself will only be as precise as the metadata reported by the package managers which it queries. This tool gives us the ability to show people what's inside their containers, what licenses those components have, and what copyrights they have. So we integrated it into our weekly CI chain with Morgan Richomme, but we wanted to provide feedback sooner. Unfortunately, Tern was not suitable for running on arbitrary submitted changes. As we wanted to give feedback as soon as possible, we wanted to integrate it with Gerrit and Jenkins, and we couldn't do that with Tern because of how it works. It mounts layer after layer via overlayfs, chroots into it, and queries the package managers for the installed packages, licenses, and copyrights. But this means that anyone could push something to our public Gerrit, and we don't want to execute unknown binaries in our CI environment. Tern also requires access to the Docker socket, and some extra privileges on older kernels that don't have unprivileged overlayfs support. And both overlayfs and Docker failed on us quite often, so each week we had a bunch of Docker images that failed to scan. Tern might still be okay for teams that develop closed solutions or have restricted push access, but it doesn't work well for us. So we had to switch to another tool that emerged in the meantime. What are we switching to? We're switching to ScanCode.io, which is a Django-based wrapper around ScanCode Toolkit, extending and specializing its features. Being Django-based, it obviously has a REST API and a web GUI, but there's also a CLI. Another factor that prompted us to switch is that nexB, the authors of the tool, have a lot of supporting libraries for software composition analysis and related stuff. ScanCode.io is easily extendable and customizable due to its usage of pipelines. And we did need some extensions, especially given the challenges that Alpine Linux gave us when we were trying to scan it. Alpine Linux is well known by people who use containers.
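The pipeline idea that makes ScanCode.io extendable can be illustrated with a toy stand-in: a pipeline is a class whose ordered, named steps are executed one after another. This is only the shape of the concept; the real `scanpipe` API differs, and the class and step names here are invented:

```python
# Toy pipeline runner mimicking the ScanCode.io idea of ordered, named
# steps. This is NOT the real scanpipe API, just the general shape.
class Pipeline:
    def steps(self):
        """Subclasses return the ordered tuple of step callables."""
        raise NotImplementedError

    def run(self):
        """Execute each step in order; return the executed step names."""
        log = []
        for step in self.steps():
            log.append(step.__name__)
            step()
        return log

class AlpineScan(Pipeline):
    """Hypothetical pipeline: fetch recipe, download sources, scan."""
    def steps(self):
        return (self.fetch_aports_recipe,
                self.download_sources,
                self.scan_with_scancode)

    def fetch_aports_recipe(self):
        pass  # placeholder: check out aports at the right commit

    def download_sources(self):
        pass  # placeholder: fetch the package's upstream tarball

    def scan_with_scancode(self):
        pass  # placeholder: run the license/copyright scanner
```

The appeal of this design is that adding a distro-specific scan is just a matter of subclassing and listing new steps, without touching the runner.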
It has a small footprint while having all the typical OS stuff like a package manager, et cetera. And the cherry on top is that the base image is GPLv3-free, as they went with musl and BusyBox instead of glibc and coreutils. But not everything is fine, as the Alpine packages don't have sufficient information for license compliance. For example, there is no copyright info, which is required by every open source license I know; we only get a license identifier. And those are easy to get wrong: the BSD license, for instance, has had a few iterations. There might be a case where someone added some stipulation to a license, and it gets marked as the original license while the stipulation might be of legal importance. Luckily, that doesn't mean there is no way to gather that info, which is what we set out to do. So what we do is download the Alpine aports repository, which holds the build recipes for all the Alpine packages, check it out at the commit related to the package version in question, then parse the build recipe, download the source code, analyze it with ScanCode Toolkit, and get all the missing information. This is not a perfect approach, as there are many sub-packages in Alpine, and sub-packages share a code repo and version with their parent package but use only a subset of what's in the repo. This means that if a repo has, for example, GPLv3 code that was not used to build a given sub-package, this pipeline will report it as including GPLv3 anyway. Until we add support for parsing Makefiles and possibly other build recipes, this is what we are left with. Nevertheless, this significantly improves Alpine scanning results, and it would not be possible without Philippe Ombredanne and Mateusz. Thanks, guys! So where are we going with this? The first thing that we want to address is that we have so many Docker images in ONAP, and it's hard to control which base images are used for them. In ONAP integration, we are maintaining two GPLv3-free base images, for Java 11 and Python 3, which we are trying to popularize.
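The first step of the Alpine pipeline described above, parsing the build recipe, comes down to pulling shell variable assignments out of an APKBUILD file. A deliberately simplified sketch follows; real APKBUILDs are shell scripts that can use substitution (`$pkgver` and friends), which this parser does not expand:

```python
import re

def parse_apkbuild(text):
    """Extract simple var=value assignments from an APKBUILD.
    Handles plain and double-quoted one-line values only; shell
    expansion and multi-line values are out of scope for this sketch.
    """
    fields = {}
    for match in re.finditer(r'^(\w+)="?([^"\n]*)"?$', text, re.MULTILINE):
        fields[match.group(1)] = match.group(2)
    return fields

# Shortened, illustrative APKBUILD fragment (not a real recipe).
sample = '''\
pkgname=musl
pkgver=1.2.4
license=MIT
source="https://musl.libc.org/releases/musl-$pkgver.tar.gz"
'''
```

From the `source` field the pipeline would then download the tarball and hand it to the scanner; note that `$pkgver` in the URL is left unexpanded by this sketch.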
I've been checking the status of adoption of those images from time to time using dockviz, which creates a dependency graph, in DOT format, of all the layers in the images used by a Docker instance. Those graphs are huge; what you can see at the bottom of this slide is a limited version that only includes the named layers. So what we're doing is developing an additional pipeline that would do what dockviz does, but without docker.sock access, and present it in a more readable manner in the web GUI. Another thing we need is support for policies and waivers. ScanCode.io has built-in support for defining license policies, and there are three categories for that: approved, restricted, and prohibited. But as of now, only files are checked against policies, packages are not, and we would need support for packages. We also need support for waivers. In ONAP, we keep all the waivers in a Git-controlled repo, and we would need to check against those waivers so as not to highlight projects that use prohibited licenses but have waivers for them. The next thing is Gerrit integration. As I mentioned, we really want to provide feedback to developers as soon as possible, so that we have their attention and there is a bigger chance of fixing the bug before the next ONAP release. So it's really just a matter of integrating ScanCode.io with the Jenkins jobs that are already building the Docker containers: as soon as an image is ready, we will get it into ScanCode.io, scan it, and provide the feedback as a comment in Gerrit, so that developers are well aware of what they are trying to release. The next thing is generating the final compliance report. We really want those reports to be public in the same way as our images are public, so that everyone may see what we are really using inside our Docker images, and we would really like to provide easy access to that for all the companies that are involved in ONAP.
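Combining policies with waivers, as described above, could look roughly like this. The waiver format (a pair of project name and SPDX identifier) is invented for illustration; ONAP's actual waiver repo layout differs:

```python
# Sketch: suppress policy violations that have a recorded waiver.
# The (project, spdx_id) waiver format is made up for this example.
def apply_waivers(violations, waivers):
    """violations: iterable of (project, package, spdx_id) tuples;
    waivers: iterable of (project, spdx_id) pairs.
    Returns only the violations without a matching waiver."""
    waived = set(waivers)
    return [v for v in violations if (v[0], v[2]) not in waived]

# Hypothetical findings: two projects ship a GPLv3 package, but one
# of them has a waiver on record, so only the other gets reported.
violations = [
    ("so", "pkgA", "GPL-3.0-only"),
    ("sdc", "pkgB", "GPL-3.0-only"),
]
waivers = [("sdc", "GPL-3.0-only")]
```

Keeping the waiver list in a version-controlled repo, as the talk describes, means the scan job only needs to clone it and pass the parsed entries into a check like this.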
Maybe it would be possible to try to attach the compliance report to the Docker image itself, but that's something we will need to work on in the future, once we have the report itself. So, summing up, we learned a lot during our bumpy ride. The first and probably most important lesson is to make sure that you choose the right distro, or distroless, for your containers. If you use large base images, you're probably going to have a hard time trying to do license compliance for them, and there is also a gigantic license-related risk if you use some random image from Docker Hub, so make sure to check its reputation. We also realized that sharing base images among sub-projects is very important. When we first ran dockviz, it was an absolute mess: we saw a bunch of dependency graphs, and almost everyone used different base images in different versions. That's really a lot of packages to deal with. And we believe that there is no silver bullet; even the commercial tools that are advertised for this do a little bit less than you would expect them to. So we believe in the power of open source, and we are always looking for people who have similar interests and issues to solve, so that we can collaborate and work on this stuff together. Thank you. Thank you.