 Welcome everyone to Lightning Talks. Our next presenter is Gaston Thiel. The topic is static infrastructure status with Jekyll and GitHub Pages. Yeah, thank you. So seeing this up there now, I realize it's not about static infrastructure, but we're using static pages for our infrastructure status. I'm Gaston. I work at the State and University Library in Göttingen, where we are maintaining the DARIAH-DE research infrastructure projects with about 40 servers, which we're managing through Puppet. And, well, it's a rather large infrastructure. Some of the parts are in Göttingen, some are hosted in Munich, some are hosted in Jülich. And we have lots of different services that are interconnected, that depend on each other, and that depend on infrastructure components like the storage system or the virtualization infrastructure. And, well, sometimes things fail. That may be our services, or it may be that for some reason our storage backend drops. Things like that happen, which is not so nice. And users want to know about this. They want to be informed, they want to figure out what's wrong: can I work, why can't I work, is it my problem, is there something going on, are they working on it? And they want to know this as soon as possible, of course. So we do have some monitoring, using classic Icinga and so on. You see a few of our services on the right. In this instance, everything's fine. You also see the four data centers that are operating the services. And we're checking this with classic Nagios checks, NRPE checks or pings, and by checking the responses to HTTP requests. Now the problem is, if something happens, something goes red, and emails are sent out. Our admins and our developers get notified that something's broken and they can start fixing it. But are they there? Is it a weekend? Maybe they're on vacation and the only person who gets notified is not around. So what we're missing, or what we were missing, is a way to notify users that someone's actually working on it. 
And we figured what we need is some way to manually add the information that we know something's broken and we're working on it. Because sometimes it's not even visible from the monitoring, which means we have to implement a few more checks. But still we realize something's broken, sometimes because users call. We want this to be independent from our infrastructure, which is mainly hosted at our university data center, and some of the outages we experience include lots of components of that infrastructure. We want the thing to be easily accessible in case of an emergency. So if the entire University of Göttingen is offline, which does happen, we want to be able to make a change and make it visible to users. And also we want it to be low maintenance. So installing some software or some other PHP application somewhere was not really what we had in mind. Also because if you install it somewhere else, then you have to have independent authentication, because again, connecting it to our LDAP is not a good idea if the LDAP is not reachable. So independent authentication means independent credentials. And if those credentials are only used for that single thing, well, if you're in a hurry because you have to fix something, you probably won't have the password. So we figured GitHub Pages might be the solution. With Jekyll, there's a simple syntax. It's completely independent from our stuff. The availability of credentials shouldn't be a problem, because having your GitHub password around is something I assume you do. Also, with static pages, we don't really have to care about what's going on there. There's nothing that can break or that can break us. So I'm assuming most of you know GitHub Pages and Jekyll. Jekyll is a Ruby tool that creates static pages out of a bit of Markdown. This is what it looks like when everything is fine. We have a green box in the middle, which says all services are available. 
There's also information on how to contact us if there's something wrong. We also have an independent Twitter account that we're only using for notifications. In that case, it's a bit more tricky. Does the right admin have the password at the right time? That's not as trivial as having their own GitHub passwords. And another problem is that in the heat of the moment, you don't think of all the things that are broken, right? So if something breaks, like the storage, or one kind of storage, because we're using more than one kind of storage, what does that actually mean? Which services are affected? Because if I'm saying to users that our StorNext instance is broken, no one's going to know what I mean. So for this, we use Jekyll collections, which is a way to have our infrastructure inventory directly there in the GitHub repo as well. And that includes the dependencies: dependencies on each other, and dependencies on our infrastructure components such as the storage or the virtualization environment. So we're running this on plain GitHub with Jekyll, but without any additional components, so that GitHub can actually compile it. So it's possible to open the web browser and commit directly from the web interface if needed in an emergency. We have our infrastructure inventory in there as collections. On the bottom, you see a graph that's rendered from these collections. It's not rendered by Jekyll because, again, we're using GitHub directly. So this has to be created independently, but the code is in the repo. This is not the entire infrastructure. This is just the part we tested it with. The next thing is to get this filled up. We're able to put out announcements for upcoming maintenance windows that might have some implications, because if we're rebooting stuff, that might take a few minutes. And if we're doing major upgrades, this might take more than a few minutes. And we're doing dependency resolution with recursive Liquid templating. 
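The recursive dependency resolution mentioned above can be sketched outside of Liquid. This is a minimal illustration in Python, not the templates from the repo: the inventory layout, the entry names, and the functions `depends_on` and `affected_by` are all assumptions made up for this example. It shows the idea of walking the dependency chain to find every service affected by one broken component, and of failing loudly when a name is mistyped.

```python
from typing import Dict, List, Optional, Set

# Hypothetical inventory: every entry lists what it depends on directly.
# The names are illustrative and do not come from the real inventory.
INVENTORY: Dict[str, List[str]] = {
    "storage-a": [],
    "virtualization": ["storage-a"],
    "repository": ["virtualization"],
    "wiki": ["virtualization"],
    "search": ["repository"],
    "standalone": [],
}


def depends_on(item: str, broken: str, inventory: Dict[str, List[str]],
               seen: Optional[Set[str]] = None) -> bool:
    """Return True if `item` depends on `broken`, directly or transitively."""
    if item not in inventory:
        # A mistyped name should fail loudly instead of being ignored.
        raise KeyError(f"unknown inventory entry: {item!r}")
    seen = set() if seen is None else seen
    if item in seen:
        return False  # already explored, also guards against cycles
    seen.add(item)
    for dep in inventory[item]:
        if dep == broken or depends_on(dep, broken, inventory, seen):
            return True
    return False


def affected_by(broken: str, inventory: Dict[str, List[str]]) -> List[str]:
    """List every entry that stops working when `broken` is down."""
    if broken not in inventory:
        raise KeyError(f"unknown inventory entry: {broken!r}")
    return sorted(
        name for name in inventory
        if name != broken and depends_on(name, broken, inventory)
    )


if __name__ == "__main__":
    print(affected_by("storage-a", INVENTORY))
```

With this example inventory, an outage of the hypothetical `storage-a` entry reaches everything except `standalone`, which is the kind of list a user-facing status page needs.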
Now, if you've ever played around with Jekyll and Liquid templating, you know this can be tricky because it's hard to debug. You can't just output debug notifications. If the compilation of the template fails, Jekyll will tell you compilation aborted. Maybe it will give you an error, but in these iterations of your recursive function calls, it's hard to know where you actually are. You can't just output all of the variable contents that you're recursing over into the console just to look at them. It's a known limitation of how Jekyll works, that it's not that verbose, even in verbose mode. So what happens if we do a git push? GitHub will build the page using its internal tooling. If there's a problem with a dependency, this will break, because then the system won't find the right entry in the collection. This means if you've mistyped the kind of storage solution that's broken, then this will immediately send a notification back to your email account. Again, it's not very informative about what broke, it just doesn't work. We can distinguish a few things, like whether you've inserted something that's wrong. Sorry, if you've mistyped your object, then it's a different error, but the only error we can raise is a division by zero. So we divide four, five or six by zero, and the number then tells you what you did wrong. We're running Travis CI independently from that as well. Travis CI checks whether that graph is up to date. So the graph, this image, is also committed, and Travis will recompile the graph. It will also fail the build if it can't compile the graph, which could happen for different reasons. And it will also do some network analysis to check whether the new graph is identical to the graph that's already in there. We're also rendering everything that we're currently announcing as broken or as an upcoming problem in YAML format, for the trivial reason that we want to have a history function which is not dynamic. So let's first look at what it looks like when it's broken. 
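The Travis check described above, comparing a freshly built graph with the committed one, could look roughly like the following Python sketch. Everything here is an assumption for illustration: the talk does not say how the graph is stored, so the sketch assumes a plain edge list with one `a -> b` line per dependency, and the function names are hypothetical. It only compares edges, so entries without any dependency are not covered.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def edges_from_inventory(inventory: Dict[str, List[str]]) -> Set[Edge]:
    """Build the set of (dependent, dependency) edges from the inventory."""
    return {(name, dep) for name, deps in inventory.items() for dep in deps}


def edges_from_edge_list(text: str) -> Set[Edge]:
    """Parse an assumed 'a -> b' edge list, one edge per line."""
    edges: Set[Edge] = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, sep, target = line.partition("->")
        if not sep:
            raise ValueError(f"not an edge: {line!r}")
        edges.add((source.strip(), target.strip()))
    return edges


def graph_is_current(inventory: Dict[str, List[str]], committed: str) -> bool:
    """True if the committed edge list matches the inventory exactly."""
    return edges_from_inventory(inventory) == edges_from_edge_list(committed)


if __name__ == "__main__":
    inventory = {"wiki": ["virtualization"], "virtualization": []}
    committed = "wiki -> virtualization\n"
    # A CI job would exit non-zero here when the graph is stale.
    raise SystemExit(0 if graph_is_current(inventory, committed) else 1)
```

Comparing edge sets is order-independent, so reordering entries in the inventory does not cause a false alarm, while an added or removed dependency does.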
This is a service disruption, the "FOSDEM service disruption", with a screenshot showing an outage that affects these five services. We can change the text in there, of course. That's a small YAML file where you insert the title and also the things that are affected, and you can insert a date that says it's already fixed. It will still be in the data, and so we can use it for the history. So this is the history file that shows you that there was a problem sometime last year. They were doing some maintenance in the data center, which meant they were turning it off completely, which meant all of our services were broken. Well, it was a Saturday, so for most people that was not too big of a problem, but this is what our message back then looked like. Back then we were still doing hard-coded HTML documents. Now we're up to Jekyll with dependency resolution, so that's easier. Back then we had to work it out by hand, and this is where the idea came from. When this became clear, I mean, they did announce it more than a month in advance, we were putting this list together of things that would be affected and discussing what we put on there: what do users actually know that we have, what should we tell them about, which services are relevant to end users, and which are more relevant to developers. Maybe we also need to put some of our internal APIs on there, because other research projects develop independent systems that use our internal APIs. So there are a few things left. One is that we need to have the full infrastructure encoded in those YAML files in those collections. One is to actually get those YAML files from our planned config management database, because as you've seen, with those seven services it was already a rather complex graph. It will be more complex if we go up to our full 20. 
And the other thing is that maybe this can somehow be turned into a generic solution, because right now this is very much dependent on how we model our infrastructure, and we have a very specific set of collections that are also more or less hard-coded in the Jekyll templates. I mean, the theme is probably the easiest part to replace. So if you want to take a look, it's online. Thank you. Thank you, Gaston, for the talk. We still have some time for questions if anybody has one. Okay, thank you once again.
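As a closing illustration of the announcement files described in the talk (a title, the affected things, and an optional date that marks the problem as fixed), here is a small Python sketch of how the same data can feed both the current status and the history. The field names `title`, `affected` and `resolved`, the example entries, and the function `split_announcements` are assumptions for this example and are not taken from the repo.

```python
from datetime import date
from typing import List, Optional, Tuple, TypedDict


class Announcement(TypedDict):
    title: str
    affected: List[str]
    resolved: Optional[date]  # None while the problem is still open


def split_announcements(
    items: List[Announcement],
) -> Tuple[List[Announcement], List[Announcement]]:
    """Split announcements into (active, history).

    Entries without a resolved date are active. Resolved entries stay in
    the data and form the history, newest first.
    """
    active = [a for a in items if a["resolved"] is None]
    resolved = [a for a in items if a["resolved"] is not None]
    history = sorted(resolved, key=lambda a: a["resolved"], reverse=True)
    return active, history


if __name__ == "__main__":
    # Purely illustrative entries.
    items: List[Announcement] = [
        {"title": "Data center maintenance", "affected": ["wiki"],
         "resolved": date(2000, 1, 2)},
        {"title": "Storage outage", "affected": ["repository", "search"],
         "resolved": None},
    ]
    active, history = split_announcements(items)
    print("all services available" if not active else active[0]["title"])
```

Keeping resolved entries in the same data is what makes a static history page possible: nothing is deleted, the entry just moves from one list to the other.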