So, my talk today is called "To Run an App with Guarantees, We Must First Create the Universe." My name is Blake Irvin. I'm an engineer slash product coach at a company in Berlin called SmartBee, which is a sustainability company; I'll talk a little more about that in a second. Those are my contact details, and I'll post them again at the end. I was hoping to do more hallway-track stuff today, but I have a sick dog at home, so I have to give my talk and then run home to take care of my dog, before my flat gets destroyed.

So, I work for this company called SmartBee. We make software tools for data analysis, focused on large enterprises and industrial sites, and specifically on transparency, efficiency, and sustainability. Our strategic goal, our ethical DNA as a company, is reducing human impact on the planet. Here's an example of one of the tools we make: you can see general consumption information, savings, and how many kilograms of CO2 we're saving by taking certain actions. We're basically trying to give people a feedback loop to help them consume less. And we are hiring, so if this is the kind of work you find interesting or are curious about, please talk to me directly or via email later.

What I'm specifically talking about today is things we do at SmartBee with a tool called Habitat. If there are any Kinvolk people in the room: I think Kinvolk is collaborating with the Habitat folks on Kubernetes support for Habitat. Habitat is an application lifecycle management tool. It's very difficult to explain Habitat in one sentence, but that was my best effort, and it's something I'd love to talk more about later if anybody's curious about how Habitat works in more detail or hasn't tried it.
I don't think there's a Habitat user group in Berlin yet, but I'd love to be part of one if a few people wanted to start it. Anyway, this isn't specifically a Habitat talk; we're going to look at some things we had to do in Habitat, but I'm thinking more about big-picture stuff in general.

So, again, the name of the talk is "To Run an App with Guarantees, We Must First Create the Universe," and I think a fair question at this point is: why is it necessary to create the universe? The answer is: because we have to. Anything that happens requires the universe to exist. I'm paraphrasing Carl Sagan, who said, "If you wish to make an apple pie from scratch, you must first invent the universe." So on the left we have the Big Bang, then everything else that happens by logical evolution, and the apple pie at the end. And since it's autumn, we can contemplate apple pie for a moment; I'm looking forward to eating lots of it, especially if I get to go home to the States for Thanksgiving.

Anyway, I'm talking about this idea of a pocket universe for the application: not the universe itself, but the universe from the point of view of the app. When I say universe, I obviously don't mean a real universe. I mean the application, all the code I need to run the app, plus all the dependencies the app has. And I say pocket universes because I don't want uncontrolled interactions between these different applications; I want to define what those interactions are and control them. Another way to think about this is in biological terms, because what's a bit different about Habitat versus something like a BSD jail, a cell, or a container is that Habitat also explicitly defines the entire application lifecycle. So here's a beehive: in little cell number one, we have an egg.
Then in cell two there's a slightly older larva, a much older larva in three, and in four a pupa, about to transform, and what hatches out of that is a bee. That's the entire juvenile section of the life cycle, and it all happens in isolation from the rest of the hive.

That isolation is what we want, but one of the questions a lot of the folks I work with at SmartBee asked when I started pushing this idea of heavily isolated applications was: why does this matter? For me, as a career-long operations engineer, a lot of it is about safety. I remember lots and lots of times, before my career really started, using horrible things like Windows 95 at home to play games: I'd install a game, and if there was a problem with the game, I had to reinstall the entire rest of the operating system, because there was some kind of splash damage happening. That was not safe. Splash damage is bad: just because we change one part of the stack we're running, we don't want bad effects spilling over into other components. I don't know if anybody else has had experiences like that, but I certainly have, and we'll come back to why safety matters for another reason in a bit.

There's also the problem of entanglement. Can anybody tell me what film this is from? This should be easy for this crowd, right? I can't hear... yeah, right.
This is, I think, the first Lord of the Rings movie, and the dwarf, Gimli, is about to fall off a cliff, and one of his companions grabs him by the beard. The reason it's a dramatic moment, besides being funny, is that you don't know whether the beard, the connection between the two people, is going to save one of them or kill both of them: if he falls too far, he pulls the other guy down with him and they both die. That's the kind of entanglement that can happen between services or applications, especially ones running on the same system, so we want to control exactly what those connections look like, because we don't want one service going down to pull another service with it. In general, one of the things we're trying to do is avoid cascading failures.

Here's another good example from biology. Has anybody ever built a self-contained terrarium before? These are sealed; no air comes in or out. You have enough microorganisms, amoebae and such, in there to generate carbon dioxide, which the plant then turns into oxygen, and you have a cycle. If I have three of these and I mess up the system in one of them, the other two should survive. That's another example of isolation being a good thing for us.

Or we can go back to the beehive example. I'm not sure offhand whether there are wasps that parasitize honeybees, but almost every insect on Earth has some kind of parasitic wasp that preys on it: the wasp comes and lays an egg on the egg of its host, and its larva hatches and eats the bigger animal's larva.
If that happens here, say I'm a wasp and I land and my offspring starts parasitizing that one larva, all the rest should in theory be okay, because everything is separated and isolated.

The more important thing for us at SmartBee, though, is our emphasis on scaling down, and some of the safety arguments apply here too, as we'll see in a minute. Scaling down matters to us because the entire point of the company, the reason SmartBee exists, is that we're trying to get humanity as a species to scale down. Most of human history, especially since the Industrial Revolution, has been about growth, and now we're pushing the envelope on resources; we're starting to see places where we simply don't have more stuff to use, so we need to scale down, start reusing resources, and avoid using resources we don't need. That also applies to technical operations. I don't remember exact numbers, but there are some really interesting studies on the amount of energy that will be consumed by data center operations over the next ten to twenty years, and it's not great if we keep powering those data centers with fossil fuels.

So ideally what we want to do is workload consolidation: we want as much work as possible happening on a single compute resource, so we don't have unused resources that are still emitting carbon. But that has traditionally been a little scary, because of the safety concerns. If somebody in the '90s had told me to put every single one of my services on the same box, I would have said: there's no way I'm doing that. I'm absolutely not doing that. That's crazy.
I can't guarantee that my database and my application server can run on the same system safely, because they have too many possible interconnections. I wouldn't even be able to do an apt-get update, for example: if updating OpenSSL gives the database one version of OpenSSL when my application server needs a different one, that single update could break one of two critical services. But complete isolation makes things much safer, so we can put a lot of muscle in a very small space.

I'm not sure these are necessarily race horses, but I go horseback riding every opportunity I get, and one of the things you realize when you're close to a horse is that it's a very big animal; they weigh somewhere around 1,000 kilos, I think, or in that range. If one falls on you, it will paralyze or kill you, so safety is a big issue when you're working with horses, and horses can also hurt themselves and other horses very easily just because they're so big. When people transport horses, they want a safe mechanism for getting all that valuable muscle from one place to another, and building these extremely strong, isolated containers for the horses is how you do that without wasting a lot of resources. Instead of one truck per horse, which would be the easy way to do things, and basically what we did with compute in the '90s (and, to be honest, kind of still do today), the alternative is to build strong isolation, so we can pack a lot of work onto one set of resources.

And this is the classic picture. For half of my career, which is now fifteen years or so, this is the way I thought the world should look for the services I was
running: everything perfectly clean and nice, with lots of overhead, lots of breathing room at the top, nothing overutilized, and everything fairly cold, literally cold, or figuratively: my systems are cool, not running hot. But that's not actually what we want. What we want is something more like this, the famous This Is Fine dog, except in our case this really is fine: if everything is almost on fire, if all my systems are running at something like 98% capacity, that's actually great, because it means I'm not spending any carbon on capacity that isn't being used.

One of the big problems we have with transportation in the United States: before Berlin I lived in the Bay Area, and I would commute to work on my motorcycle. In California you're allowed to lane-split, which means you can ride between two lanes of cars on the highway, so I'd be going 120 between these cars, passing car after car after car on both sides of me, and every single car had a single occupant. They're all burning fossil fuels, but they're only operating at maybe 25% capacity. That's bad. What we want is every car completely full. These people, probably for economic reasons, are much better at utilizing the resource than most of us are, and that's something we're trying to fix at SmartBee for ourselves too; the way we run services looks a little like this. Now, this might not be ideal optimization either: there's a good chance this truck is not operating at the best load-to-carbon-output ratio. But that's something we can figure out. We're smart people, we know how to measure things, so we can figure out ways to say: I want to
put as much as I can on this resource without increasing my carbon footprint in a bad way.

And this is the picture I like to look at when I think about this stuff: NASA photograph AS08-14-2383, taken from the Apollo 8 spacecraft in 1968. I think this was the first time in history that a human held a camera and took a picture of the place where all the rest of human history had happened. Everything that ever happened, everything we care about, everything having to do with human life, is right there, more or less. That was, I think, the beginning of our realization as a species that we live in a very small container, with a very limited set of resources, and we need to protect them or things are going to get very weird. That's why I really think the future is scaled down, not up. At the moment we're still very much addicted to the high we get from tech that's focused on growth and speed and performance, although technically speaking, performant systems can also be very small; we can talk about that more later.

But there are downsides to this kind of isolation, and these are the nitty-gritty technical details I wanted to touch on quickly today: some of the places where we've had pain trying to keep things really small and isolated, and how we tried to solve them. One of the biggest problems is that if you want to prove you can run in isolation, you have to start from zero every time. Habitat, the tool we use for this (other folks might use Docker, and you can actually use Habitat and Docker together), assumes you always have to start from zero to prove that you really have isolation, which means
that if we depend on something upstream, we pay for it on every build. We do a lot of Python, which means a lot of pip installs, and PyPI is already pretty much guaranteeing the contents of the packages we install; they're doing that work for us. If these supplies were being air-dropped to me, it would be stupid to unpack and repack every box that drops out of the airplane, but this is basically me doing pip install during our builds. Whenever we build our artifacts in a Habitat build, we pip-install all the dependencies we need, which literally means downloading everything, unpacking everything, and then compressing it all again, even though it was already compressed and checksummed. That doesn't make much sense, and it's intellectually annoying, but the real, tangible pain is that it makes builds very slow. One of our applications is a very big Python monolith that does a lot of scientific computing; it's about half a gigabyte installed on disk, something like 300,000 files, and waiting for half a gig of files to be compressed makes for a long build.

One of the ways we chose to fix that was vendoring. If you've used a package manager from any of the common distributions, you've probably used a vendored package before; a really common one I used on Ubuntu systems back in the day was, I think, the apt-vendored ImageMagick, because ImageMagick is super hard to build but I needed it for a lot of my apps. It's also full of security holes. For us, though, that's not the part we care about; we don't do much with ImageMagick, but we do a lot of machine learning, which means we use TensorFlow, and this version of TensorFlow is about 150 megabytes
compressed; uncompressed it's more like 300. It's a huge package, so uncompressing and then recompressing it when we don't need to was a wasteful part of the cycle, and we wanted to fix that. We fixed it with some tricks: we unwrap the package once in the Habitat Builder service, which is part of the application lifecycle, and then store the thing we pulled out of PyPI as a vendored Habitat package, so we can depend on it as a Habitat artifact and don't have to go through the dependency song and dance twice.

I have mixed feelings about showing actual code in talks like this, because I don't think you learn much from it, but I can share it directly with anybody who contacts me later via the internets. Basically, we have a package version variable in the build plan; it's actually a function, but we treat it like a variable. We look at the requirements.txt of the main application we build, which is part of the pip install ecosystem, and extract the version from it; that's also how we tell Builder what should trigger a new build of the vendor module. Then the big trick, and depending on which dependency manager you have you may or may not be able to use it, is a feature of Habitat that lets a package push an environment variable into the things that depend on it. That lets us construct a new PYTHONPATH containing all the dependencies we've vendored. And here we're depending on it. This is very ugly code, but very readable, which is a good thing for ops, because you usually don't look at this stuff for months or years; you want it readable, not elegant.
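As a hedged sketch of those two tricks: `do_setup_environment` and `push_runtime_env` are, as I understand them, real Habitat plan-build hooks, but the package names, paths, and the `tensorflow` pin below are purely illustrative, not the actual SmartBee code.

```shell
# Hypothetical fragments of a vendored-module plan.sh (bash).
pkg_name=tensorflow-vendored   # illustrative name
pkg_origin=smartbee            # illustrative origin

# Trick 1: derive the package version from the app's requirements.txt,
# so bumping the pin there is what triggers a rebuild of the vendor
# package. "tensorflow==1.12.0" splits on '=' into three fields.
pkg_version() {
  grep -E '^tensorflow==' requirements.txt | cut -d'=' -f3
}

# Trick 2: push an environment variable from this dependency into
# every package that depends on it, building up a PYTHONPATH that
# points at the vendored module.
do_setup_environment() {
  push_runtime_env PYTHONPATH "${pkg_prefix}/lib/python/site-packages"
}
```

The application's own plan would then simply list the vendored package in its dependencies and inherit the pushed `PYTHONPATH`.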
The last thing we do is local caching, and for that we use similar tricks. This picture is a physical food cache. If you had an emergency cache like this, you wouldn't restock it every time you used it; you'd leave as much behind as you could between uses, and that's what we've been trying to do, because caching really just means reuse. I wrote a very small package that uses the same environment-variable-pushing trick to configure the npm, Go, or pip build environments to store the things those dependency managers cache in the loopback-mounted caching location that's part of Habitat. This could certainly be improved; I've also been thinking about additionally pushing the cache to S3 or some other object store. But it basically means that if, in the dependencies for my build, I depend on the cacher package, then, at least for local development builds, all the stuff I want to cache automatically goes into a permanent location instead of an ephemeral one. Every time I do a build, Habitat tears everything down and starts from zero, but the things we're caching survive that.

I don't know if we have time for questions, but I'll put my contact information up again, because time is short. And that's it.

[Audience] Python packages... yeah, so you're talking about vendoring for Python packages. Is that the primary solution you use for supporting Python dependencies, or do you use something like wheel builds?

If we were doing wheels, we'd run into the same situation: a wheel is a kind of guaranteed thing, but we'd still have to download the wheels from somewhere, even if it's our own infrastructure, so it's just doubling the work.
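(Stepping back to that caching package for a moment, a minimal sketch of the idea: everything here is illustrative except the environment variables themselves, since `PIP_CACHE_DIR`, `npm_config_cache`, and `GOMODCACHE` are the real knobs that pip, npm, and Go read. In Habitat the exports would be done with pushed environment variables from the cacher dependency, and `CACHE_ROOT` would point at the loopback-mounted cache directory.)

```shell
# Hypothetical "cacher" helper: point every package manager's cache at
# one persistent location instead of an ephemeral build directory.
CACHE_ROOT="${CACHE_ROOT:-/tmp/build-dep-cache}"   # illustrative default
mkdir -p "$CACHE_ROOT/pip" "$CACHE_ROOT/npm" "$CACHE_ROOT/go"

export PIP_CACHE_DIR="$CACHE_ROOT/pip"     # pip reads this directly
export npm_config_cache="$CACHE_ROOT/npm"  # npm reads this directly
export GOMODCACHE="$CACHE_ROOT/go"         # Go module cache location
```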
We already have a mechanism for taking some bits and putting them into an archive, in this case a Habitat artifact, which is our lowest common denominator and comes with a checksum and guarantees. We did try the wheels thing for a while, but we didn't see huge performance wins, because we still had to go through the whole unzip-expand-compress-repack cycle again. It was mainly a speed-of-light problem: going out to the internet and doing all these TLS handshakes every time we grab a new module just turned out to be really slow, so the more we can avoid that, the better. In Python's case it's especially difficult when you're using scientific computing stuff, because TensorFlow is huge, Cython is huge, SciPy is huge; these are massive modules. In the case of npm, where you have an absolutely insane number of modules, you probably have more trouble not on the compression and decompression side but on the TLS side, just the network transfer.

We have one minute left for questions; you can definitely ask me things online later and I'll do my best to respond quickly.

[Audience] Okay, yeah, so when you say vendoring, is that checked into your source repository? Is that how...

No, no. What actually happens is that the build service in Habitat is watching our repository, and that's the stuff I had to skim over quickly, the TOML definition. Every time there's a git push, Builder looks at the master branch and sees if any of the files matching those globbing expressions have changed. If they have, it goes through the process of looking at the requirements file, figuring out the version of the module, building a vendor module, and pushing the vendored package back into the Builder depot. There are a lot of moving parts.
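The rebuild decision described in that answer can be sketched like this. It's purely illustrative: Builder's real change detection is driven by the glob patterns in its TOML definition, and the `tensorflow` pin and the `.last-built-version` file are hypothetical; this just shows the logic.

```shell
# Sketch: rebuild the vendored module only when the version pinned in
# requirements.txt no longer matches the last vendor package we built.
needs_rebuild() {
  local pinned last
  pinned=$(grep -E '^tensorflow==' requirements.txt | cut -d'=' -f3)
  last=$(cat .last-built-version 2>/dev/null || echo "none")
  [ "$pinned" != "$last" ]
}

# After a successful vendor build, record the version we just built.
record_build() {
  grep -E '^tensorflow==' requirements.txt | cut -d'=' -f3 > .last-built-version
}
```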
That's why I decided to put my contact information back up on the screen: there are a lot of little details that aren't clear and that I couldn't explain efficiently in thirty minutes, so I'm happy to talk about them later online. Anything else? All right, I think that's it then. Thanks!