Welcome everybody. I'm Gunter Miegel. I have a background in systems engineering and I have been taking care of IT infrastructure for ten years. Today I want to talk about Coinboot: cost-effective, diskless GPU clusters for blockchain hashing and beyond. Let's start. At the beginning of 2017 there was this emerging cryptocurrency boom, and the company I was working for at that time got a customer who ordered 20 shipping containers packed with computers for cryptocurrency mining. Mining is cryptocurrency lingo for taking part in adding new blocks to a blockchain for a reward. The most popular cryptocurrency, Bitcoin, is mined with special hardware, as you may know, while other cryptocurrencies, for instance Ether of the Ethereum project, are generated mostly with GPUs. The customer wanted 20 overseas containers for GPU mining. So we crammed nearly 5,000 nodes and almost 30,000 GPUs into these containers, split into 240 nodes and 1,440 GPUs per container. The emphasis was on minimal total cost of ownership to maximize return on investment, so commodity hardware was the first choice. The customer ordered not only the hardware but also a software stack to run and operate it, and I came up with a solution. But first, let's take a look at the hardware. I have a video for you showing the hardware and the production facility. Oops, sorry, wrong screen. Let's go. So this is the production facility. The hardware you see there: each node has six low-end AMD GPUs, four gigabytes of RAM, no BMC, no IPMI, and one gigabit Ethernet. The containers you see here are air-cooled and have an electrical power consumption of 250 kilowatts each, so a lot of electrical energy goes in there. We have also produced water-cooled containers. Okay, the video is done. Let's see if I can come out of full screen without any harm. Okay, not working. Ignore the bottom. So this is the hardware we got.
So now the initial approach. The initial approach for the deployment of the software stack was the following: create a golden image of the OS plus the configuration and additional software, and deploy it during production onto a cheap USB flash drive. After the first container was completed, we switched it on and noticed that around 10% of all nodes did not come up properly. As we found out, there was a race condition between the initialization of the controller of the USB flash drive and the storage initialization of the mainboard. Sometimes the USB flash drive was simply not fast enough to be recognized by the mainboard as a disk. So a workaround was put in place: load a bootloader over the network, in this case GRUB, which had some logic to determine whether the USB flash drive had initialized successfully. If that was the case, booting would proceed from this drive; otherwise the node was shut down properly and later switched on again via Wake-on-LAN, in the hope that booting would succeed the next time. The workaround was working, and we were busy with the production and shipping of all the containers. We had a working setup, but I got an idea for further optimization: can we drop the unreliable USB flash drive altogether? The USB flash drives we used cost just five euros for 32 gigabytes and, as you may have gathered from the workaround, they have a lot of issues. The idea of getting rid of the disk comes with some pros and cons. The pros: cutting costs by having no USB flash drive, no workaround for boot failures anymore, and we could also streamline our production and operations procedures, because slow USB drives really cause trouble in production and in operations. The cons are mostly constraints imposed by the hardware and software: we have only one gigabit of network, and the golden image we had created has a size of four gigabytes.
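As an aside, the GRUB workaround described above can be sketched roughly as follows. This is not the original configuration; the filesystem label `MINERBOOT` and the file paths are assumptions for illustration only:

```
# grub.cfg fragment delivered over the network (illustrative sketch).
# Look for the USB flash drive by a filesystem label (assumed here).
search --no-floppy --set=usbroot --label MINERBOOT

if [ -n "$usbroot" ]; then
    # Drive won the race and is visible: continue booting from it.
    set root=$usbroot
    configfile /boot/grub/grub.cfg
else
    # Drive lost the race against the mainboard: power the node off;
    # it will be switched on again later via Wake-on-LAN.
    halt
fi
```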
The RAM is also four gigabytes, and the proprietary GPU driver we need for cryptocurrency mining is only available for Ubuntu, Red Hat and SUSE Linux. The OS preferred by the customer and by my team was Ubuntu. The main conflict is between the size of the golden image and the size of the RAM: if you put this four-gigabyte golden image into RAM, you don't have any memory left for running the system. Okay, so the main task for going diskless was to make the image less bulky. Less bulky because it has to fit into the RAM, and also to reduce the network load. So I set myself a goal: less than or equal to 200 megabytes of volume to be transferred over the network for booting a node. Besides this, the image should of course be bootable over the network and able to run diskless without any further storage devices. Because I'm quite lazy, as most of you probably are, I didn't want to reinvent the wheel; such a tiny distribution supporting network booting and running diskless should already exist. I knew that in the HPC domain diskless operation has been done for ages. So I searched for specially tailored distributions in the HPC domain, for instance Rocks, OpenHPC and xCAT. None of them seemed to be a good match. Rocks is CentOS-based, OpenHPC is also CentOS-based, and xCAT, TL;DR, is a very complex project mostly written in Perl; I didn't want to spend my time on this. It seems to work quite well, but I doubted that I could integrate, for instance, a graphics driver there. And then there are the new kids on the block, like CoreOS and its fork, Flatcar Linux. Both are known as lightweight and able to run diskless. So I looked at both, but both were 100% above the threshold of 200 megabytes; they need 400 megabytes of image size for booting a node. This is why I would call them lightweight, but still too bulky for my use case.
Sorry to the Flatcar Linux and CoreOS people, you do a good job, but you couldn't help me in my use case. So I had to slowly convince myself to create something of my own. I proposed my idea to the product leader of the mining container project. He liked the idea and asked me for an estimate of the effort for a proof of concept, to report to his supervisor. I said: give me four weeks. To make a long story short, the company had no further interest in looking deeper into this idea, but I was stubborn, curious and eager, and did it as a side project in my free time. Working on it on the side, the proof of concept took me four months. Most of the time was spent on getting rid of the unnecessary libraries, modules, documentation and other stuff that you don't need to run an operating system and get the job done. Taking apart the massive proprietary GPU driver was also quite an endeavor in this whole project, and I stripped it down to the bare minimum. I made the initial commit in October 2017, and the proof of concept was finished in March 2018. Then I got stuck in a corporate roundabout for a while. And what should I say, in the end I was able to publish all of this as open source in August 2018. What did I get? I got lightweight PXE booting with an image size of 105 megabytes for kernel 4.15 and an image size of 155 megabytes for kernel 5.0. I got diskless worker nodes, configuration via environment variables, a plugin system, and a Coinboot server Docker container. I'm using iPXE with remote logging, which is very handy if you want to debug the boot process remotely on all these nodes. I got support for legacy PXE and UEFI network boot as well. And, most important for moving fast in the development process, I got testing with Travis CI. On Travis CI I test, of course, the Coinboot server Docker container, and I spawn multiple QEMU VMs in the Travis CI instance to see whether network booting works at all. And, of course, daily builds and releases of the images.
Yeah, as new updates and upgrades come to the distribution this is based on, I'm rebuilding the image every day. Okay. So let's have a look at the toolset to build these lightweight OS images. A bootloader needs two files to boot a Linux system: first, a compressed Linux kernel executable called vmlinuz (why it's called vmlinuz is a long story), and an initial root file system called an initramfs. So I was looking for a tool to build these, and I found debirf, which is also tailored towards running diskless, so obviously a good match. debirf is an abbreviation for "Debian on initial ram filesystem". debirf is part of the current Debian and Ubuntu releases, so you can easily install it. But sadly, the last upstream activity on this project was 10 years ago, and even though it's in the official releases of Ubuntu, it's broken: the images you create with it do not boot properly. Some systemd-related patches were required, which I have done. debirf is, in the end, just a nice wrapper around debootstrap with the possibility to run scripts to adapt the root file system you are building; debirf calls this a profile. So I created a profile for debirf. For that, I customized the early userspace process to use a compressed RAM disk for the root file system and to have the capability of loading plugins that extend worker nodes with functionality at boot time. So now let's take a look at the customized early userspace process I created. But first, we have to look at the two types of running diskless nodes. Type A is without centralized storage, where the root FS lives in the local RAM of the worker node only, and type B has the root FS on centralized storage, accessed over NFS or iSCSI, for instance. And of course we go for A, because B does not play well with our commodity one-gigabit network. So the early userspace has to support a root FS in local RAM. I came up with a two-stage early userspace.
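Before going into the boot process, here is a rough sketch of what such a debirf profile module could look like. A profile is essentially a directory with a `debirf.conf` and a set of shell-script modules run against the bootstrapped root. The module below, which strips documentation to shrink the image, is hypothetical; the variable name `DEBIRF_ROOT` follows the upstream debirf convention, but treat the details as assumptions:

```
#!/bin/sh -e
# Hypothetical debirf profile module: z0_strip_docs
# debirf exposes the bootstrapped root file system via $DEBIRF_ROOT.
: "${DEBIRF_ROOT:?must be set by debirf}"

# Everything removed here never makes it into the image.
rm -rf "$DEBIRF_ROOT"/usr/share/doc/* \
       "$DEBIRF_ROOT"/usr/share/man/*

# Keep only English locales.
find "$DEBIRF_ROOT"/usr/share/locale -mindepth 1 -maxdepth 1 \
     ! -name 'en*' -exec rm -rf {} +
```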
The first stage runs /init. It's an initial boot script, based on a minimal BusyBox environment. It creates a zram disk with zstd compression for the root file system, and then extracts the final root FS archive onto this compressed RAM disk. After that, it pivots root to this file system and hands over to init2, the second stage. /init is what you find in all classical Linux systems for booting, but I extended it and added a second stage, called init2. It launches systemd-udevd to get all devices up and running, for instance the network interface card. Then it downloads and extracts the plugins onto the root file system on the compressed RAM drive, downloads the environment file and adds it to /etc/environment, and after that hands over to systemd by calling /sbin/init to finalize the boot process. In the future, the plan is to directly mount a SquashFS image in the first stage. Yeah, now some fancy graphs. Let's talk about RAM compression with zstd for the root file system. The RAM drive compression of Coinboot uses the Zstandard (zstd) compression algorithm. Thanks to Yann Collet from Facebook and the other contributors for this excellent project. As you may remember, yesterday there was a lightning talk about zstd given by a colleague, who showed their use of zstd. It's a really sophisticated compression algorithm. zstd has been in the kernel since 4.14, I believe, so since late November 2017. It's the fastest high-compression-ratio algorithm you can currently get, and it can also be used for a compressed RAM drive with zram. With the root file system on zram with zstd compression, you get much less shared memory usage and overall more memory available for the system, compared to a root FS on tmpfs, which is the classical approach for diskless nodes.
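A condensed sketch of what such a first stage can look like. This is not the actual Coinboot /init; the paths, the disk size and the archive name are assumptions, and the needed tools are assumed to be present in the BusyBox initramfs:

```
#!/bin/sh
# Stage 1 (/init), running as PID 1 from a minimal BusyBox initramfs.

mount -t proc  proc  /proc
mount -t sysfs sysfs /sys

# Create a zram device with Zstandard compression for the root FS.
modprobe zram
echo zstd > /sys/block/zram0/comp_algorithm
echo 4G   > /sys/block/zram0/disksize      # uncompressed capacity
mkfs.ext4 -q /dev/zram0
mkdir -p /newroot
mount /dev/zram0 /newroot

# Extract the final root file system onto the compressed RAM disk
# (rootfs.tar.xz would have been fetched earlier over the network).
tar -xf /rootfs.tar.xz -C /newroot

# Pivot into the new root and hand over to the second stage (init2),
# which starts systemd-udevd, fetches plugins and the environment
# file, and finally execs /sbin/init (systemd).
exec switch_root /newroot /init2
```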
You may notice that the blue bar, which is completely missing or very tiny in the zram section there, is the shared memory, which with tmpfs is allocated entirely for the root file system. If you look at the memory available, the green bar, you have considerably more memory available if you use zram with zstd compression. The next topic: booting is tricky these days. Mainboards come with a wide range of default boot options, for instance legacy BIOS PXE, UEFI PXE, UEFI HTTP boot and so on. Touching the firmware configuration of thousands of mainboards in production was absolutely not feasible. To cope with that, I came up with the following solution. We use dnsmasq as the DHCP server, and I configured dnsmasq so that it sets a tag based on the data provided by the client in its DHCP request, and based on this tag a different bootloader is delivered with the DHCP acknowledgement. Next topic, as I mentioned: configuration via environment variables. I think this is well known from working with containers already. You have one central environment file on the Coinboot server where you can tweak the environment variables for your whole cluster, and these environment variables are then available on each worker node. Plugins: get all you need. Coinboot plugins extend nodes with functionality. Coinboot plugins are just a set of file system changes packed into a compressed archive. At boot time they are downloaded from the Coinboot server by the worker node and extracted onto its root file system. Plugins are created with Coinboot Maker. And I also created an experimental way to keep the Debian package manager database in a valid state when using Coinboot plugins. Let's talk about Coinboot Maker. Coinboot Maker is used to build Coinboot plugins. For this, it takes a Coinboot initramfs.
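A quick aside on the tag-based bootloader selection just described: in dnsmasq terms it can look roughly like this. The tag names and bootloader file names are assumptions, not Coinboot's actual configuration:

```
# dnsmasq fragment (illustrative sketch).
# Tag each client by the architecture it announces in its DHCP
# request (option 93, client-arch), then hand out a matching
# bootloader in the DHCP acknowledgement.
dhcp-match=set:bios,option:client-arch,0     # legacy BIOS PXE
dhcp-match=set:efi64,option:client-arch,7    # UEFI x86-64
dhcp-match=set:efi64,option:client-arch,9    # UEFI x86-64 (alt.)

dhcp-boot=tag:bios,undionly.kpxe             # iPXE for legacy PXE
dhcp-boot=tag:efi64,ipxe.efi                 # iPXE for UEFI
```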
And then it runs runc to create a container with this initramfs as its base. This container has OverlayFS as its backing file system, and with that, file system changes are tracked in the overlay's upper layer. So you just install whatever you want for your worker node there, and when you are done with the changes, the files that are to be part of the plugin are collected and packed as an archive. This archive is then placed on the Coinboot server, and during boot it will be downloaded and extracted by the worker node onto its root file system. Yeah. Coinboot server. Let me give some short hints on how to use the Coinboot server. It's the central component for booting your cluster. There's a quick start guide available under this URL. There is a Docker container which brings all required services: basically dnsmasq for TFTP and DHCP, and NGINX for HTTP. It's all pre-configured, so just use it. Clone the Coinboot repo, then configure the DHCP range so that it reflects the range of IP addresses you want to hand out to your cluster nodes. Then there's a mandatory environment variable, the Coinboot server IP, which should basically be the IP address of the Docker host on which you spawn the Coinboot server. The kernel and initramfs of Coinboot are downloaded automatically when you spawn the container; if you want a different version of the kernel and the initramfs, you can specify this in the environment file. So let's go: just do a docker-compose up, wait a short amount of time until all services of the Coinboot server are up and running, then switch on your worker nodes, and magic happens. Okay, so I'm already at the summary of my talk. Coinboot can run GPU-based blockchain hashing on GPU clusters with a minimal TCO.
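Going back to Coinboot Maker for a moment, the overlay-based change tracking it relies on can be sketched with plain mount commands. The directory names below are assumptions for illustration, not Coinboot's actual layout, and the commands need root:

```
# lower = read-only base (the Coinboot initramfs contents),
# upper = where OverlayFS records every change made on top of it.
mkdir -p /tmp/lower /tmp/upper /tmp/work /tmp/merged
mount -t overlay overlay \
      -o lowerdir=/tmp/lower,upperdir=/tmp/upper,workdir=/tmp/work \
      /tmp/merged

# Install packages, edit configs, etc. inside /tmp/merged
# (Coinboot Maker does this inside a runc container).

# The upper directory now holds exactly the delta: packed up, it
# becomes a plugin archive for the Coinboot server to serve.
tar -czf my-plugin.tar.gz -C /tmp/upper .
```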
By using RAM drive compression with zstd, diskless worker nodes have more usable memory available, and Coinboot is easily extensible and can run various other number-crunching workloads, of course. So what's next? There's an ongoing transition to a monorepo, because Coinboot consists of a lot of sub-projects and it's a complete mess to work with them, so I'm working on moving to a monorepo, and of course on getting out of beta status as well. There's TensorFlow for AMD GPUs coming to Coinboot, and support for NVIDIA GPUs. There's a plan to use peer-to-peer plugin loading with a local BitTorrent swarm, and to use SquashFS and OverlayFS for the root file system. And there's also a plan to give mining hardware a second life: at some point in time mining is no longer profitable, and Coinboot will probably be part of a platform that makes these GPUs accessible to the machine learning and data science community. So at last, some thanks. I want to thank Yuli, Elmo and Steve Schmeller, who convinced me to apply here, the ambassador for the photos, Heime, Komeda Adia and Lucas Rave of Cloud & Heat for their support, Cloud & Heat for the video material, and Subramanaya Umshanka Yoshi for being the best unicorn whisperer ever. And now you can ask me questions, or you can ask them later. Thank you.