Please join me in welcoming Sam. Thank you. Can everyone hear me? Is that too loud, too quiet? That's okay. Brilliant. Well, thanks for coming to this talk. I'm going to be talking about integrating software into collections of software that work together. Which sounds easy, sounds like a solved problem, but actually isn't. So how many people here do that? Work on distributions, or creating software for embedded systems? How many people here? A few. Do you find it easy? Are there pain points?

So the Baserock project is developing tools to try and make this process easier. The goal isn't to replace traditional distributions. It's kind of a research project to develop a set of parts which work together but also work independently, so people can adopt whichever parts they find useful. All the tooling is written in Python. It's all free of legacy. Well, the project started about four years ago, so it's not completely free of legacy, but it's more free of legacy than most existing things in this area.

So the project started with this problem: build a working GNU/Linux operating system straight from the source code. If I ask how many lines of Python code you think it would take to do that, any guesses? No guesses? Well, I'll give a couple of hints. For the problem of dealing with source code, we have a solution, which is a server that mirrors every popular form of version control, and mirrors tarballs, all into Git repositories on one server. So the build tool doesn't have to deal with downloading tarballs from random places or anything else; it can assume that everything is in Git. Most things are in Git now, but it imports things from Mercurial, Subversion, or whatever else, so you get a consistent interface. And then all the build instructions that you need are spelled out in a consistent YAML format. So this is an example of a simple build instruction for binutils, and this is the instruction we have for Python, well, the CPython interpreter (a rough sketch of the format appears below). It says: use the standard commands for autotools, but override the configure commands, and run something when it finishes to create a symlink. We have a reference distribution which describes how to build a whole system in a form like this. There's another, slightly more complicated YAML document which then says what ref to build and how to fit it all together. But that's it. And so, with these parts, we actually have a build tool that can build a working operating system in about 2,000 lines of Python, which is an order of magnitude simpler than anything else you'll find in this area, I think. And that's because we've taken the approach that writing a build tool should be easy enough that this squirrel monkey could do it. If we solve all the problems around the area, then the build tool itself becomes trivial, which is good, because writing a build tool is quite a thankless task. We're always doing it, so if we remove all of the problems around it, it becomes, well, not trivial, but fairly easy. Lines of code is a bit of a horrible metric, and I don't want to assign too much meaning to it. The tool in question is kind of a prototype. It's called YBD, and we have an older build tool as well called Morph, which is a lot bigger. I'll go through the bits of Baserock. So what do you need to actually build such a system? These are the items that you need, really. I'll go through each of those in a bit more detail.
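The binutils and CPython slides aren't reproduced in this transcript, but to give a feel for the declarative style being described, here is a minimal sketch. The field names are illustrative and may not match the exact Baserock definitions schema; the point is that a chunk is just data, parsed here with PyYAML.

```python
# Illustrative chunk definition, parsed as plain data with PyYAML.
# Field names are assumptions, not necessarily the real Baserock schema.
import yaml

EXAMPLE_CHUNK = """
name: cpython
kind: chunk
build-system: autotools
configure-commands:
  - ./configure --prefix="$PREFIX"
post-install-commands:
  - ln -s python3 "$DESTDIR$PREFIX/bin/python"
"""

definition = yaml.safe_load(EXAMPLE_CHUNK)
print(definition["name"], definition["configure-commands"])
```

Because the instructions are data rather than code, a tool can read, validate or analyse them without executing anything.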
So the source code mirroring service: that's a server appliance called Trove. We have one running, which I should be able to show you in a browser. Here's one we have live at baserock.org, and it contains lots and lots of Git repositories. Quite boring, but it's good to have a consistent interface. And there's an easy way to add more: you submit a patch against a repo called lorry, and it mirrors more things. You can also set up your own instance of this. The actual mirroring tool at the heart of it is a simple script called lorry, which takes a JSON file describing where to get source code from and pushes it into Git (a sketch of such a file appears below). So that's source code mirroring.

You then need a way of describing build instructions. I'm going to go into that in more detail later on, because I think it's one of the most interesting parts of the project, so I'm not going to touch on it now. You then need a build tool to actually... hang on, I've done this in the wrong order. So you need a language for creating build instructions, and then you need some actual build instructions. We've defined the syntax for describing how to build stuff, and then, for the tool to be useful, you need a set of definitions that you can use to build a system as well. But you don't have to use those definitions.

It's quite hard to visualise build systems and build tools, so I apologise for the fact that a lot of my slides are screenshots of a terminal. I also have a few diagrams. But this is the list of package groupings. We call them strata. They're like layers in BitBake. If the text is big enough, you can see some fairly standard packages: GTK, Qt, various Python libraries. OpenStack is in there. You can actually use Baserock tools to deploy an OpenStack Juno instance, which is quite impressive, I think.

And going back to building a working GNU/Linux system: every so often we release one of the reference systems, called the build reference system, which you can download from here. To show that it does, in fact, work, this is me loading the VM image in QEMU. So this is a Baserock reference system, and it boots to a bash prompt. And there we are: a Linux system built entirely from source code. One cool thing about stuff that's built with Baserock is that every image contains metadata that shows you exactly what repo and what ref everything was built from. This is the /baserock directory. Is that big enough, by the way? Because it's QEMU, I can't really make it any bigger without using a serial console, so I apologise if you can't read it. There's a bunch of metadata files, one for each component in the system, and I'll bring one of them up briefly. This is the zlib component: it was built with these build instructions, this was the environment, these are the versions of the dependencies, and at the bottom it shows you the URL of the repo it was built from and exactly what SHA-1 it was built from. So if you've got the source code mirrored on the server I showed you, you can go from any system that's been built and look at exactly what commit of what Git repository everything was built from. So there's no "oh, the system broke, and I can't actually work out what I'm running" — that problem goes away.

So we have the build instructions. They're in a repository on git.baserock.org. We call them definitions. Then we have a tool, in fact two tools, which you can use to build them. So source code from Git goes in.
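As an aside on the lorry tool mentioned above, here is a rough sketch of the kind of JSON file it consumes: each entry names a mirrored repository and says where the source comes from. The keys, types and URLs are illustrative, not the exact lorry schema.

```python
# Illustrative lorry-style JSON: repository names mapped to upstream sources.
import json

example_lorries = {
    "delta/lxml": {
        "type": "git",
        "url": "https://github.com/lxml/lxml.git",
    },
    "delta/zlib": {
        "type": "tarball",
        "url": "https://example.org/zlib-1.2.8.tar.gz",  # hypothetical URL
    },
}

print(json.dumps(example_lorries, indent=4))
```

Whatever the upstream format, the result on the Trove is always a Git repository, which is what gives the build tool its single consistent interface.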
The build tool just runs a sequence of shell commands in the right order, which, like I say, should be easy, and produces a binary, and then we have an artifact cache which just holds tarballs of binaries. There are two tools. Morph is the older one, which has a lot of features, some of which it doesn't actually need, it turns out. It has some quite cool things. It has a distributed-build plugin, so you can set up multiple Morph workers and have them share builds at the component level. If you've heard of distcc, which distributes at the level of the source file, Morph can distribute at the level of the actual component, so you could have different packages compiling on different systems. YBD is more of a proof of concept that shows that you can make a radically simple build tool. They're both available on git.baserock.org.

So after you've built something, it's not much use just having a tarball, really. You then need a tool to deploy it. Deployment is a bit more messy than building. I think building is quite a well-defined problem. I say building — I should be saying building and integration, because there's more to it than just running compilers. But the output is a binary, and once you try to run that binary, you need to do some extra work to deploy it. For example, if you want to deploy to OpenStack, you need to create a disk image, upload that as an image to OpenStack Glance, and then boot it. If you want to deploy to Docker, you need to import it into Docker as a tar file. If you want to deploy to real hardware, you may have to put it on an SD card, wait five minutes, take the SD card out, and put it in the machine. So deployment is a bit more messy, and there is tooling in Baserock to do that at the moment, but actually I'd like to get rid of it completely. How many people here know Ansible? Good. Ansible's great. I'd really like to replace our deployment functionality with an Ansible module, so we don't have to think about it in Baserock anymore, because I think Ansible solves a lot of these problems really neatly.

So you have deployment. The last piece of the puzzle is caching, because you don't want to build things more than once. Because of the way Baserock tracks the inputs of everything it builds, and because it builds everything in an isolated staging area, like an isolated chroot, you can be sure that if you run the same build twice, you get the same thing out. Not always the same bits, although we are working on that, but you get an artifact which works the same each time. So you can cache, basically, by hashing all of the inputs and all of the dependencies, coming up with an identity, and then saying: right, this is what I built, I'll refer to it with this hash — and if something has already built it, it's already cached and you don't need to build it again (there's a sketch of this hashing idea below). We have a simple cache server which you can use for storing artifacts.

Now, there are a couple of other bits. We have a continuous builder, which is really just a shell script that runs the Morph build tool over and over again, so it's not that interesting and I won't talk about it. I mentioned sandboxing builds. We recently spun out the code to do sandboxing into a simple Python library called sandboxlib. It has one API, basically: one main function call, which runs a command a bit like subprocess.Popen, plus a couple of others, and you can specify a few things that you do or don't want to share or isolate.
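To make the caching idea concrete before the sandboxing options are filled in below, here is a minimal sketch of how hashing a component's inputs might produce an artifact identity. The field names and hashing scheme are illustrative, not YBD's or Morph's actual key format.

```python
# Illustrative cache-key derivation: hash the build instructions together with
# the cache keys of all dependencies. Not the real Baserock algorithm.
import hashlib
import json

def cache_key(definition, dependency_keys):
    """Derive a stable identity from a component's inputs."""
    payload = json.dumps(
        {"definition": definition, "dependencies": sorted(dependency_keys)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

zlib_key = cache_key({"name": "zlib", "ref": "some-sha1"}, [])
cpython_key = cache_key({"name": "cpython", "ref": "another-sha1"}, [zlib_key])
# If an artifact with cpython_key already exists on the cache server,
# there is no need to rebuild it.
print(cpython_key)
```

Because a change to any input or to any dependency changes the hash, a hit in the cache really does mean "this exact thing has been built before".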
So with that sandboxing call you can say: put the command in a new mount namespace so it can't see the mounts from the host system, or put it in a new network namespace so it can't connect to the internet, or mount these extra directories from the host, or make certain bits read-only. It doesn't implement that functionality itself, because there are lots of tools that already do it, but they have different strengths and weaknesses. For example, a lot of the containerisation tools like Docker, systemd-nspawn and Rocket need to be run as root, whereas you can use a much simpler tool called linux-user-chroot and run it as a normal user. Most of those are Linux-specific, so there's also a chroot backend which will run on any POSIX OS but doesn't support most of the sandboxing capabilities — in a chroot, using just POSIX APIs, you can't say "open a new mount namespace", because namespaces are a Linux-specific feature. So the chroot backend is fairly incapable, but it allows you to degrade the sandboxing capabilities if you want (a toy sketch of what such a call might look like appears a little further down). So there's the chart again, filled in with the names of some components. I'll just go through it quickly.

So I said the part that interests me the most about Baserock is the definitions language, which we refer to as declarative build instructions or declarative definitions. The idea is to turn build instructions into data. At the moment, they're code. There are lots of build instructions in the world — Debian has build instructions for tens of thousands of packages — but it's all code. It's really hard to reason about unless you understand all seven build systems that Debian has developed over the years. With declarative build instructions, we want to treat the build instructions as simple sequences of commands so they can be treated a lot more like data, and we discourage implementing features ad hoc in shell scripts inside the build instructions. There's no logic for the build tool mixed in. If you look at Buildroot, which is a tool written largely in make for building systems from source code — Buildroot's great, but nobody really understands how the core of it works anymore, because all of the instructions are written in make and tied up with the build definitions themselves. So while it works, it's quite difficult to actually make changes to it anymore. Finally, I really don't like shell scripts, so I'd like to minimise the number of shell scripts in the world. I'd much rather have everything as data or Python scripts.

So what we've done: this YAML language was defined a few years ago, and we're now trying to rationalise it and turn it into something formal and useful outside Baserock. I defined a schema of the current data model. I tried to make a nice graph, and instead I came up with this graph, which shows you the entities we have at the minute. We have a command sequence — that's the fundamental unit of building something: you run a sequence of commands, for example configure, make, make install. Then there's something called a chunk, which is kind of like a package. We have these groupings called strata and systems, which I think in the future we'll do away with, and just have one sort of component that contains other components. Really, I think the main problem in doing that work is coming up with a word which means "component that contains other components" without it being really long or really weird. As they say, naming things is one of the hardest problems in computer science.
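Here is the toy illustration promised above of the shape of a one-call sandboxing API. This is not sandboxlib's real interface, and the linux-user-chroot option name is an assumption from memory; the point is only how a plain chroot backend can silently degrade the isolation options that POSIX alone cannot provide.

```python
# Toy sketch of a single-call sandbox runner with pluggable backends.
# NOT sandboxlib's actual API; option names are assumptions.
import subprocess

def run_sandbox(argv, root="/", writable_paths=(), backend="chroot"):
    """Run argv inside a sandbox; isolation is best-effort per backend."""
    if backend == "chroot":
        # Plain chroot cannot create mount or network namespaces, so those
        # options are simply unavailable on this backend.
        command = ["chroot", root] + list(argv)
    elif backend == "linux-user-chroot":
        command = ["linux-user-chroot"]
        for path in writable_paths:
            command += ["--mount-bind", path, path]  # flag name assumed
        command += [root] + list(argv)
    else:
        raise ValueError("unknown backend: %s" % backend)
    return subprocess.call(command)

# Example: run a build step inside an isolated staging area.
# run_sandbox(["sh", "-c", "./configure && make"], root="/srv/staging")
```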
So at the moment we have this data model, which is still fairly simple. The final entity is the cluster, which represents a cluster of systems. When you deploy something with Baserock tools, you deploy a cluster, even if it only contains one system. Our reference systems repository contains a set of chunks for things like Python, GTK, Qt, and various Python libraries. It contains strata, which integrate those into logical groupings — for example, there's a qt5 stratum which contains the various bits of Qt that you need in order to use it. And then there are systems, which have a specific purpose. For example, the OpenStack server system contains a bunch of different things; its purpose is to deploy an OpenStack system that you can then host other VMs in. There's also a build system which has build tools and such things in it. And it's easy to define your own.

I'll show you — I meant to show this earlier, actually. I was going to show YBD starting to build something. It won't finish, because it would take hours and I would probably run out of time. But this is the definitions — our reference definitions repository — in the systems directory. I'll see if I can make that a bit bigger. We define the build system, and it contains a simple list of the strata that you want: for example, core Python libraries, the BSP (which contains Linux and a bootloader), different Python libraries, Ansible, cloud-init, and such things. And then I tell it... oops, I didn't want to do that. I tell it to build that. I'm not entirely sure how far it will get, because I'm not sure whether I'm connected to the internet or not. It won't get too far anyway. It's still loading things from disk, in fact. I'll come back to that.

Another interesting thing we can do, once the definitions are considered data, is that there are a lot of existing data-analysis tools which you can use to look at them. So I made this... This is YBD actually building something. It has calculated an identity for each component involved in the build, and pretty soon it'll get to the point of running configure for binutils, probably. There we go. So this is what a Baserock build actually looks like. It's just running commands, and in about four hours you'll get a system out of the other side, which I won't show you.

So, going back to browsing the definitions: I found an awesome Python library called rdflib-web. rdflib lets you deal with linked data in Python, and rdflib-web lets you create a really simple browser to explore it. This is running on my local machine. I implemented it in about four lines of Python using rdflib-web. And it shows you everything: I can look through what a chunk is — it has these different properties — and then I can look through all the chunks that we have defined in the reference definitions. Here's cpython, and it defines some configure commands, for example. And then it shows me the linkage between them, so you can see it gets referred to in a few different strata. My point is that this is really easy to do once build instructions are represented as simple YAML files or stored in a database: you can reuse analysis tools like this, which weren't developed for build tools at all — it's a general-purpose thing, and we can now use it for analysing build instructions. I'd like to generate some interesting graphs in future as well. Having been to a few data visualisation talks yesterday, I'm very interested in making pretty graphs and network diagrams now.
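Here is a hedged sketch of that "build instructions as data" idea: load chunk definitions from YAML, turn them into RDF triples with rdflib, and query them with SPARQL. The file path, namespace and predicate names are made up for illustration; a browser such as rdflib-web could then serve the resulting graph.

```python
# Illustrative only: YAML definitions -> RDF triples -> SPARQL query.
import yaml
from rdflib import Graph, Literal, Namespace

BR = Namespace("http://example.org/baserock#")  # hypothetical namespace
graph = Graph()

with open("definitions/chunks.yaml") as f:  # hypothetical path
    for document in yaml.safe_load_all(f):
        chunk = BR[document["name"]]
        graph.add((chunk, BR["kind"], Literal(document.get("kind", "chunk"))))
        for command in document.get("configure-commands", []):
            graph.add((chunk, BR["configure-command"], Literal(command)))

# Which chunks override their configure commands?
query = """
SELECT DISTINCT ?chunk
WHERE { ?chunk <http://example.org/baserock#configure-command> ?cmd }
"""
for row in graph.query(query):
    print(row.chunk)
```

None of this tooling knows anything about build systems; it works because the definitions are plain data.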
So the final part of the talk is how this can be useful for Python development. How many people use virtualenv? virtualenv is really useful — quite a simple way of isolating your Python dependencies. It has a few problems, though: if you want to install a library that needs a system library, and you don't have that installed on your system, there's nothing virtualenv can do about it. You can use the Baserock tooling to build a container which tracks all of the dependencies that you need, rather than just the Python ones. If you don't have a problem with virtualenv, keep using it, because it's much more convenient; but if you find yourself reaching the limits of what virtualenv can do, and find you actually have to start installing packages and tracking dependencies elsewhere, Baserock gives you a way of defining everything: all the Python dependencies, all the C library dependencies, right down to the toolchain you use to build it.

Creating definitions by hand is a bit boring, so we have a tool called the import tool, which can import metadata from other packaging systems, and we developed a way of importing information from PyPI. Quite a lot of work went into this — quite a lot of research by one of my colleagues. We tried looking at the source repos of Python projects and using a patched version of pip to analyse what dependencies each one expressed, but it's actually quite difficult to get information that way. The problem is, again, that because setup.py is code, people can really do anything there, and so you find setup.py files that don't make sense to pip when you run it in the repo. So what we've ended up doing is a solution which sets up a virtualenv, uses pip to install a package, and then uses pip freeze to get the list of dependencies (there's a sketch of this approach below). It's not the most efficient solution, because you have to compile any embedded C extensions and other things, but it has the advantage that it always works. Does anyone want to see this? Yeah? Well, the idea is to generate something which can be used in a tool that's useful beyond Python libraries. So if you name a package, I can show it working, if I have an internet connection. Or I can show you an interesting one: lxml. I've found that some packages that you'd expect to have a lot of dependencies don't actually list any. For example, Django and NumPy don't list their dependencies in a machine-readable way; they list them in the readme. Sadly, no. So I guess this is going to do quite a lot of compilation, so I shall leave it running.

OK. Well, the final bit I want to talk about is why we're doing this. There are a few reasons. One is that hacking on operating systems, and on operating system tooling, is quite fun. Another is that there are a lot of best practices today which some people follow and some don't, and a lot of the time we find ourselves cleaning up in projects where the best practices haven't been followed. So making tooling where you can't actually avoid following best practices is a goal. Some of these are: not depending on third-party hosting. Most build systems today download tarballs from upstream websites, which is great until the website disappears or gets compromised. Recently gitorious.org went offline, for example, forever, and all of the source code hosted on Gitorious disappeared. That would be really annoying, except that we'd been mirroring all of the projects we needed for years anyway, and so it didn't make much difference.
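Here is the sketch of the import-tool technique mentioned above: create a throwaway virtualenv, install the package with pip, then read `pip freeze` to see what actually got pulled in. This is an illustration of the approach, not the Baserock import tool's real code; it assumes a POSIX layout (`bin/pip`) and Python 3.7+.

```python
# Illustrative dependency discovery via a throwaway virtualenv and pip freeze.
import subprocess
import sys
import tempfile
import venv
from pathlib import Path

def discover_dependencies(package):
    with tempfile.TemporaryDirectory() as tmp:
        venv.create(tmp, with_pip=True)
        pip = str(Path(tmp) / "bin" / "pip")
        subprocess.run([pip, "install", package], check=True)
        frozen = subprocess.run(
            [pip, "freeze"], check=True, capture_output=True, text=True
        ).stdout
        # Everything except the package itself is a (possibly transitive) dependency.
        return [
            line for line in frozen.splitlines()
            if not line.lower().startswith(package.lower() + "==")
        ]

if __name__ == "__main__":
    print(discover_dependencies(sys.argv[1] if len(sys.argv) > 1 else "lxml"))
```

It is wasteful, since C extensions get compiled just to learn the dependency list, but it sidesteps the problem that setup.py can contain arbitrary code.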
Going back to Gitorious: at some point we'll have to find the new upstreams for the projects that have moved, so they keep up to date. But you can imagine having a build system which clones stuff from Gitorious, and then, the day before your release, it disappears. That's a real problem, whereas if you have a source mirror, you're insulated from it. And making a source mirror is really easy using the Trove server appliance. Trusting third-party binaries is another thing, which seems to have become really common at the moment with the rise of Docker. Which is great, right? Download a binary whose source you can't really inspect, and run it as root on your computer in a bunch of namespaces. No. So please build things from source instead. That's why we want the tooling to build everything from source, so you don't have to trust random binaries downloaded from the internet. Two other things: keeping things up to date, and making it as easy as possible to fix things upstream. Because everything's in Git, you can clone any component that you think has a problem in it — you can clone it straight away from a local server; you don't have to worry about what format it's in or anything else. And then, once you've worked out what the problem is, you can at a later date try to submit the fix to the project. But we discourage patching things in the build instructions. A lot of distributions carry endless patches against projects which never seem to get upstreamed. Some of them can't be — some are legitimate, distro-specific things — but we really want to discourage patching, because it makes everything more difficult. Then you come to upgrade from Python 3.4 to Python 3.5, it turns out half your patches no longer apply, and so you don't upgrade for a long time. So we encourage building things directly from unmodified source code. And that's all I wanted to talk about. So thanks a lot for listening. I'll be happy to take any questions.

Hello, thank you. I have a lot of questions. First, can you compare your system with Packer?

With what, sorry? Packer. Yes, I have used Packer. Packer starts by taking an image that's already built — it'll take, say, an Ubuntu base image. Then it can run a bunch of different commands: it can run Chef with a Chef script, or Ansible with an Ansible playbook, and then it can deploy the image somewhere. So it's in a related area, and they overlap. I actually thought at one point about writing a Baserock plugin for Packer which, instead of starting with an Ubuntu image, would start by building a system from source, or using a cached version of one. So the answer is they can interoperate. At the moment they don't, but I'd like to look at how to integrate Baserock with Packer.

Thanks. Does Baserock work on Windows, or only on Unix systems?

The tooling only works on... well, YBD works on any POSIX system. Some of the tools only work inside Baserock itself, to free us from having to track dependencies and make them work on old distributions. So Linux, or POSIX.

About containers: some container tools, like Rocket or Vagga or LXC, can work without root. Do you use them on POSIX systems?

Not at the moment, no. But I'd be interested in implementing that in the sandboxing library. If you want, we can talk about it later.

Great, yeah. Last question: do you know about the Nix package manager?

NixOS, yes, that's an excellent question. I do know about NixOS and think it's a great project.
I'm terrified of the complexity, but I would very much like to align everything we're doing with them as that becomes possible. Thank you.

I'm a bit slow between the ears, so forgive me — you've probably already addressed this. Do I understand correctly that with Baserock I can do a sort of Gentoo-type system, where the entire system is built from source, but there's no way I can start from, say, a CentOS base or a VM base — is that correct?

Yeah, that's it, yeah.

Okay, so that Packer integration would be pretty awesome. Maybe I should go write that myself. Thank you.

Hey, so if I understand correctly, the build happens in a chroot — you're essentially running these commands to put binaries into the chroot. You mentioned there's an integration thing that happens afterwards, where you perform modifications of things in the chroot, I'm imagining — I saw post-install commands.

Yeah, so there are post-install commands. Basically, those exist so that you don't have to override the default install commands. For example, for autotools, the default is make install.

So you're not actually having to execute any commands inside the chroot itself?

Yeah, those commands all run inside the chroot.

Okay, so now I'm wondering how you deal with architecture differences, for example, or things of that nature, where your build host doesn't support the target — like running executables built for the target.

So, cross-compiling.

Yeah, that's one example.

It doesn't support cross-compiling, deliberately, to avoid the complexity of supporting cross-compiling.

Okay. Well, there's a whole other set of scenarios that I've run into with similar tooling — SELinux is another example, where your build host doesn't support it, or crossing major kernel versions, that type of thing.

We recommend running builds inside a Baserock VM or chroot, so the only thing that affects us is kernel versions. There is a requirement on what kernel you have, but you can get around that by using a VM.

Okay, so you're trying to use the same target and build host, essentially?

Yeah.

Okay, cool.

Hi. So I have a couple of questions. How do you bootstrap this? Like, where do you get make from?

That's a good question. The bootstrap is actually quite interesting. It's based on the Linux From Scratch bootstrap. If you want to see the gory details, you can look in the definitions repository on the Git server. The gist of it is that we start by building from tarballs. We have a bootstrap build mode, which happens outside the chroot and uses the host tools. That builds, I think, a GCC and a binutils. Then with those it builds a stage two, which is six components, I think: make, GCC, busybox and glibc among them. And then it builds everything again with those tools inside a chroot. So we use, basically, clever ordering. The actual description of this is in here, so it's kind of explained in comments, and you can see it starts with stage-one binutils, stage-one GCC, and then the Linux API headers, glibc and so on. The bootstrap is quite good because it's really easy to cross-bootstrap to a new platform. Baserock's been ported to a bunch of different architectures like ARM and MIPS, and we did an ARM big-endian port, which I think is one of the only OSes you can run on ARM big-endian at the moment. That's because you only need to cross-build about six things, and the rest you can native-build.
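To make the shape of that staged bootstrap concrete, here is a rough sketch of the ordering just described. The component lists, stage boundaries and mode names are illustrative, not the exact contents of the Baserock definitions.

```python
# Illustrative staged bootstrap plan: host-tool stages first, then a full
# rebuild inside an isolated chroot. Not the real definitions.
BOOTSTRAP_PLAN = [
    # Stage 1: built outside the chroot, using the host's toolchain.
    {"mode": "bootstrap", "components": ["stage1-binutils", "stage1-gcc"]},
    # Stage 2: a minimal toolchain and userland built with the stage-1 tools.
    {"mode": "bootstrap", "components": [
        "linux-api-headers", "glibc", "make", "busybox", "stage2-gcc"]},
    # Stage 3: everything rebuilt again, inside a chroot containing only
    # the stage-2 output, so nothing from the host leaks in.
    {"mode": "normal", "components": ["binutils", "gcc", "glibc", "everything else"]},
]

for stage in BOOTSTRAP_PLAN:
    for name in stage["components"]:
        print("would build %s in %s mode" % (name, stage["mode"]))
```

Only the first couple of stages depend on the host at all, which is why porting to a new architecture mostly means cross-building a handful of components and native-building the rest.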
OK, and a follow-up question: if there's a security vulnerability in, say, glibc or something low-level, I assume the implication is you'd have to rebuild most of your image?

There's a way you can cheat, by adding a new version of the component on top. So if you wanted, you could add glibc again, overriding the existing version, and deploy that as an upgrade. But yeah, the design of it encourages rebuilding everything from source, which isn't ideal when doing a security update — you need a lot of compile machinery. The more we can do on distributed builds the better; that's an area to improve.

So, we use Nix.

Great.

And you're clearly trying to fix the same kind of problems that they are, using similar components. So were you aware of Nix when you started the Baserock project?

I was, yeah. I wasn't actually one of the founders of Baserock — I got involved later on. I was aware of Nix, but I've never really used it much. I found it has quite a lot of complexity: the build definitions, rather than being data, are a sort of functional code. So I think, long term, we definitely need to align the two projects.

But you didn't use Nix because you were scared of it?

In a way, yeah. I think the people who originally came up with Baserock didn't think of using Nix at all, so some of it has been developed in parallel.

Okay. Thank you.

I should add that part of the original goal of Baserock is to reduce complexity. Oh, yeah — let's see if lxml has done anything. There we are. It's generated a stratum which has lxml and Cython in it. I can show it to you in here. So that's quite a simple example in the end. It wasn't the most efficient solution, but it worked. It just saves you writing definitions by hand for things where metadata already exists. There are also importers for RubyGems and npm, and something else. Now, is that the least useful bootable Linux distribution ever? I don't know. It depends how much you like using lxml from the console. Any final questions? Okay, great. Thank you very much, Sam. Great presentation.