Well, thank you all for coming. There are lots of other interesting talks going on, so I've managed to successfully trick you all into being here. You can collect your free money afterwards. So I'm Yuvi. I work for the University of California at Berkeley, and I'm also part of the Jupyter project. How many people here know about the Jupyter project? OK. So I mostly work with them on scaling things to lots of people, and that's what this talk is going to be about. This is an image I really like to use. This is day one of UC Berkeley's Foundations of Data Science course. It has about 1,200 students in that one class, and they come from a large variety of backgrounds. Some of them have some amount of computing experience, but most of them don't. They've used computers for a long time — they're 16-year-olds in 2017 — but they've never used the command line or anything of that sort. And they're here to learn about computing. So there are a lot of challenges in how we can support these people in learning without having to deal with installation issues all the time. These are some of the problems I have to deal with. First, setting up and supporting this many environments is a major pain in the ass. If you're an instructor or a TA, you want to be spending your time teaching students, not figuring out installation issues on Windows XP or Windows 7 or Linux or whatever it is. Disparate access to computing power is also a big problem. If you have a $2,000 MacBook Pro, it's much easier for you to set things up, learn, and do things than if you have a hand-me-down Dell from 2005 or a Chromebook. That's something we take seriously. And we also don't want to invent anything new. We don't want a Berkeley-specific environment that people learn, and then when the students leave the system, they have to learn something completely new.
We want to reuse things as much as possible and not invent our own stuff. And we also don't want it to be just for Berkeley. This problem we're trying to solve — teaching large groups of people data science, or intro to programming — we want it solved in a way that everyone can use, all around the world, not just at a really top public university that can get free money from a lot of people. But we also don't have enough time or money to actually do this properly. This whole setup I'm going to describe is managed by two part-time people, one of whom is me. So we have to build this in a way that scales for humans. I do not want to be a traditional academic sysadmin. I don't know what they're really like, but in my head I see them sitting and working all the time, and someone calls up and says, "I want to do this," and they go, "Ah, this is so terrible — but fine, I will be the hero and do it." We do not want that to happen. So what this all really means is we wanted a multi-cloud — because we want to run it everywhere — arbitrary code execution environment. We want the user to think: I have a browser, I'm going to go in here and do all my work, and I don't have to install anything. I've used lots of browser apps in the past, and this is just one more of them, except it's letting me learn how to code. And we want persistent storage with all of it. This is a hard problem to solve and scale. We've been doing this for about two years now; I only got involved around the end of last year. So in this talk we're going to go through how we did this for the first semester, then the second semester, what we learned, how we open sourced it, and how other people are using it. First I'm going to talk a little bit about scale.
So this is spring 2017, which is roughly this January through June. Google generously donated cloud credits to us, and we ran this whole infrastructure — about 2,000 users — entirely on Google Cloud. About 44% of those people didn't identify as male. We had 10-plus connector courses. These aren't the foundational course: there's a legal connector, for example, where people doing legal studies learn data science to do things with law texts. There's an ecology connector, a child development connector — lots of things going on there. Then we had a very small class in the summer; I'll talk more about what we learned from that later. For the fall, which is right now, we're running on Microsoft Azure — Microsoft has also graciously donated credits to us. It's a larger class: about 3,500 total users, again spread over very different connectors and very different majors. I think this is the first time we have more people from majors outside computer science and statistics than from within them, which is a big step for us. And in a few months we're actually going to launch a massive open online course. This is the whole take-it-to-the-world situation. We're planning for 100,000 users from all over the world, which is a much bigger scale — we expect about 25K to be active at a time — and we've been planning for this for several months. This is really where I want to be, because when I was 15, on a slow connection somewhere in Southern India, this would have been very useful to me. So this is great for me. And we're running this on Google Cloud because, again, Google has graciously donated a lot of credits to us.
This is also why the multi-cloud situation is great: different people want to give us free money at different times, and we want to be able to run on all of them. Even in the summer — we ran on bare metal then, but it was also donated hardware. So I'm going to show you a quick demo of what the student experience is. This is open to everyone. Okay, I have managed to hide my full screen and I don't think I can bring it back — oh, I brought it back. Cool, okay. Right, so this is the course website for the main course. If you go there, there's a bunch of announcements, there's a calendar — it's a normal course website, everything has this. Except if you click on this assignment, it pulls down my assignment. I've already authenticated; normally at this point it would ask me for my Berkeley credentials. We use Jupyter Notebooks for everything. So this is an assignment the students are supposed to do. The notebook already has the spaces where they can write their answers, and a bunch of exposition telling them what they're doing and how it relates to the class they just took this week. Students can immediately go and execute code here. Jupyter Notebooks are great for pedagogy, and lots of people use them, because you can mix your output, your code, and narrative in the same document. And this is also persistent — I think last time I changed this from "Hello World" to "Hello Worlder," and it's still there. So I pull in my notebooks and I can do this. This is the experience students have. They don't know anything about Kubernetes, they don't know anything about Docker containers, they don't even know that they're running on the cloud somewhere. To them it's just another application like Gmail or Facebook, just in their educational system.
And we have lots of other plugins here. We have automatic grading; we have instant feedback on failing tests. I'm not going to go into that in too much detail, but this is the workflow we want for students, and we've had it for a while. It's really cool for a lot of students who are not in CS. We have modules which take only about four hours for a single class. I think we had an ecology module, for people who were studying ecology in some form, and we would do only a four-hour course for them in a semester. If you only have four hours, you cannot actually teach anything if you're spending all of that time installing stuff. This is only possible because we have this kind of infrastructure, and I think that's pretty cool. So if I go back here — and I hate Google Slides — oh wow. So how does this happen? We're using a bunch of technology here. One is JupyterHub. Do people here know about JupyterHub? Okay, that's cool. For those who don't: it lets you run multi-user Jupyter notebooks. Instead of just installing something and running it on your laptop, you can set up authentication — you didn't see it in the demo, but we use Google OAuth, and there are lots of authentication providers people can use — and then it spawns notebook servers on some provider. Kubernetes is one thing you can run it on, but you can also run it on Mesos, on plain systemd, on just Docker, whatever you want. In our case we're using Kubernetes. I'm not going to go into why Kubernetes is awesome here, because I think you're tired of that by now and you have a whole day of it to go. We also use Helm everywhere for our deployments. I'll talk more about our deployment strategy later, but it's been really, really useful to us. Now I'm going to go through a very quick overview of how JupyterHub itself works, in our setup at least.
So we have a Kubernetes cluster, and you're logged out — you haven't done anything yet. You come in and you hit a proxy. We have a proxy server running; it could be Traefik, it could be anything that supports ingress, but we also have a native Node.js-based proxy. The proxy says: okay, this user is not logged in, I don't have a route for them. So it sends them to a hub process, a centralized Python process that we run. That redirects them to the authentication flow, depending on what you've configured, and does the redirect dance. Then it starts the pod: it talks to the Kubernetes API and waits for a pod to start. We use automatic volume provisioning, manual volume provisioning, whatever you want — that's all configurable. So we wait for the pod to start, and then we tell the proxy: okay, next time this user comes in, send them directly to the pod, don't go through us. So the hub is sort of like a Kubernetes controller, although not literally one. Then we redirect the user to their prefix. Now they're logged in: the user comes in, hits the proxy, and goes directly to the pod, and the hub is taken out of the equation. This lets us scale a lot more easily than if the hub had to be in every single request. It also lets us do restarts — the hub is not HA yet, so when we restart it to do a deployment, users can't authenticate for about two seconds. That still sucks, but it's much better than nobody being able to use the system for two seconds. So that's a very high-level overview of how this works; I can tell you more about it when we catch up in person. Now we're going to go through a bunch of things that we set out to do and how we accomplished them.
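The flow above can be sketched as a toy model — this is illustrative pseudocode in Python, not JupyterHub's actual code; the point is just that the proxy keeps a route table which the hub fills in after spawning a pod, so later requests bypass the hub entirely:

```python
# Illustrative sketch of the JupyterHub routing flow (not the real code):
# the proxy checks its route table; unknown users go to the hub, which
# spawns a pod and registers a route so future requests skip the hub.
routes = {}  # user URL prefix -> pod address

def handle_request(path):
    for prefix, target in routes.items():
        if path.startswith(prefix):
            return f"proxy to {target}"   # logged-in user: straight to the pod
    return "send to hub"                  # no route yet: do auth + spawn

def hub_spawn(user):
    pod = f"pod-{user}:8888"              # hub asks the Kubernetes API for a pod,
    routes[f"/user/{user}/"] = pod        # waits for it, then registers the route
    return pod
```

So the first request for a user returns "send to hub"; once the hub has spawned and registered the pod, every later request proxies straight to it.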
So, remembering the problem section: we wanted a multi-tenant setup, and we kind of did that — it was relatively easy to do. At this scale, I find it hard to call 3,000 users a massive installation. Everyone gets a gig of RAM, so it's on the order of three terabytes of RAM across the cluster. I feel like that should count as massive, but it feels kind of wrong to say so. It's a solvable problem; we fixed it. But then we started trying to tackle all the social aspects of this. One of them is that we do not want someone who knows and understands Kubernetes to be required to set up a JupyterHub. We want it to be runnable by everyone. If you're just a TA and you're spending all of your time helping people install things, we want you to be able to do this without having to learn all about Kubernetes. So how did we get there? I'm going to go through the history of our setup. For spring '17, at the beginning of the year, we were running on Google Cloud. We had about 15 days to get set up, but we had been preparing ahead of time, so it wasn't that bad. We set up Google Container Engine, as it was called then, manually, and then we just ran Helm install commands. We were like: okay, let's get this running — oh wow, it works — okay, let's go finish all the other things. This was very useful, but it was also very specific. It was a Berkeley-specific installation. This is where a lot of academic deployments end up: "Oh yeah, we have this cool thing, let me show you." And then you ask, "Can we use this?" "Sure, you can come join our university, and then you might be able to." Otherwise it's not good enough. And we did have production and staging.
We weren't that bad — but they were all in the same cluster, just in different namespaces, which was still useful. We were able to deploy to staging before promoting to production. But setting up a new cluster was still kind of hard. For summer '17 — a much shorter course, only about 200 people — we had more time to experiment. So we did the bare metal thing. We had six nodes donated to us by the Haas Business School, who are pretty awesome people. We created a bunch of kubeadm scripts and just manually went and installed stuff there. It wasn't repeatable in any form, but for a course that only ran two months, it was totally worth it — you don't have to redo the whole setup every time. And we learned a lot about setting up manual installations. The really cool thing we did, though, was to take our Berkeley-specific installation and generalize it into an open source project — and what we installed for the summer was actually the open source project; we didn't just carry over our own stuff. This is the open source project: it's called Zero to JupyterHub. It's really very nice. It's a Helm chart and a bunch of documentation that people can follow step by step without entirely having to understand what they're doing. That sounds really bad, but it's been fairly useful in practice for a lot of use cases. We want to target people who do not consider themselves operations people. What does that mean? It means that by default it's secure, by default it has all the best practices, and it is not too jargony. The Helm chart helped a lot here, because people are able to understand: okay, I have a config file here.
I'm going to put these things in it and then run it, and then slowly learn from there — instead of: first you understand pods, then you understand deployments, and two weeks later you can have a hub running. We really wanted to go from "I don't really know anything about the cloud" to "I have a running JupyterHub" in about 10 minutes. We user-tested this very heavily with people who do not consider themselves sysadmins, and then changed both our chart and the docs to match. We kind of consider this the Ubuntu of JupyterHub: the Linux kernel was great, but you had to install Gentoo or Slackware or Debian, you had to be this really dedicated person — and then Ubuntu said, no, that's fine, this is for everyone. You don't actually need to understand what you're doing, but you can use it for the cool things you care about. That's the sort of thing we set out to do. And it became part of the Jupyter project, which is a wide umbrella with lots of projects in it. That gave us governance models; we were very happy to be accepted, and a lot of community people got involved. So it stopped being a Berkeley-specific thing and became an open source project that Berkeley happened to use. That's the URL — you can also Google for it, it'll show up. So what I'm going to do now is a very, very quick demo of how this actually works. Is that visible in the back? Yeah, okay. I'm going to try to do this from memory, because I can't see my speaker notes. I have a Kubernetes cluster that's already running — if I do a kubectl version, it's running 1.8. This is on Google Cloud. What I'm doing is literally copy-pasting the commands from the guide — well, I'm not actually going to copy-paste, because I know them from memory and I don't want to go back and forth.
So I'm going to do `helm repo add jupyterhub` with the URL of our chart repository — okay, is that wrong? That's wrong — okay, and then `helm repo update`. We're not in the official charts repository; we have our own, for lots of reasons that, if you're involved in Helm, I can tell you about. Now I'm going to create a config file. It still needs a secret token to function. We don't want to auto-generate it, because that's very insecure, but users seem comfortable doing this one thing themselves. We tell them: run openssl and copy-paste the result — but I'm going to use some vim-fu to make that happen. Right, so we have this config file now, and we basically want to realize it into an actual JupyterHub instance that does something useful. So I'm going to do `helm install jupyterhub/jupyterhub --name=kubecon-demo-2 --namespace=kubecon-demo-2 --version=0.5 -f config.yaml`. We tell people to version everything explicitly, hence the 0.5. This is, again, directly from our Zero to JupyterHub guide — you can copy-paste this and it'll just completely work. It's going to take a minute, because we also pre-pull all our images; they're sometimes really large, and we don't want users to wait for them when they first show up. Okay, so this is now set up, and I'm just going to wait for the service — what did I call this? kubecon-demo-2, okay — `kubectl get svc`, and it's still pending. Well, that's going to take a minute. But that's it — this is pretty much all of it. And the default installation is already useful: it ships with conda, and with a bunch of useful base libraries that a lot of people use. Of course, you can swap in your own image very easily by just changing the config file. We also give every user a gigabyte of personal storage by default when they log in, and we use Kubernetes' really nice dynamic provisioning to make that happen.
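For reference, the minimal config file from that era of the Zero to JupyterHub guide looked roughly like this. The token value is obviously a placeholder, and chart options have changed across versions, so treat this as a sketch of the shape rather than current syntax:

```yaml
# config.yaml — minimal Zero to JupyterHub configuration (chart ~0.5 era).
# Generate the token with: openssl rand -hex 32
proxy:
  secretToken: "<output of openssl rand -hex 32>"
```

With that file in place, the `helm install ... -f config.yaml` command above is the entire install.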
Okay, I have an external IP now, so I'm going to go to a different browser and open that, and hopefully it works. Yay. By default, the authenticator just lets you in no matter what you type — so I'm just going to say "kubecon" and, okay, I typed something, but it doesn't matter. You can easily configure this to use whatever authentication you want: LDAP, OAuth, whatever else. Okay, so you do this, and it creates a persistent disk and attaches it to your pod — that can take several seconds. A lot of people use this for their workshops. If you have a day-long workshop, a lot of the time often just ends up being spent installing things. Now, instead, they set this up the day before, using either the free credits on Google Cloud — we also have instructions for Amazon, Azure, and some of the academic clouds — so people just use whatever they have. The multi-cloud nature of this is very helpful, because we don't have to maintain a separate installation setup for each of them. We just say: get yourself to Kubernetes using any of these external solutions, and on top of that, it's all kind of the same. So... da, da, da. Well, usually it's not this slow — the disk provisioning sometimes takes way too much time. I'll talk a bit about storage afterwards; it's been really the biggest problem for us. Yay, we have it. Users can also distribute materials with it; we have instructions for how to do that. A lot of workshops have an initial set of materials, then they set this up, run the workshop, and tear it down. It's very useful, and people say: I just want to teach people PyTorch, or scikit-learn, and I don't actually want to learn about Kubernetes. This does that very well. But also, for people like me running large classes, we want this to work for us too.
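As an illustration of swapping out the authenticator: in the chart of that era you could move from the let-anyone-in default to an OAuth provider with a few lines of config. The exact keys have changed between chart versions, and the values below are placeholders, so this is a sketch, not current syntax:

```yaml
# config.yaml — replace the allow-anyone default with GitHub OAuth
# (client ID, secret, and callback URL are placeholders).
auth:
  type: github
  github:
    clientId: "<your-oauth-client-id>"
    clientSecret: "<your-oauth-client-secret>"
    callbackUrl: "https://hub.example.edu/hub/oauth_callback"
```

The same pattern applies for LDAP, Google OAuth, and the other authenticators the guide documents.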
So it has all the knobs and dials we can turn to manage a large class — automatic deployments and whatnot — but it's also useful for the other persona. All right, that was the demo; it actually went pretty well. Right, so we call this democratizing distributed application deployment: I don't have to understand what's going on to be able to use it. It's the same as with most apps — I install them, I don't actually know what's going on inside, someone else takes care of it for me. And we've been very happy with the turnout here. Okay, I'm going to go back to talking about our continuing setup. For fall, which is happening right now, we're running on Azure, and this time we fully automated our cluster turn-up. We started using NFS for storage, unfortunately — I'll go into detail about that later. But instead of creating the cluster manually, we wrote a script that turns it up for us. I'd heard a lot of "oh, you should automate your cluster turn-up," and I thought: that doesn't seem worth it. It's going to take me two hours to do by hand, and I might do it once every six months or once a year — is it worth automating? But it totally was, because we were then able to performance-test things very easily. Instead of taking two hours to set up — and not being sure the setup was fully reproducible — it's two minutes with a script. So we could say: okay, let's try this and this and test it out. That was very, very useful. We were able to fix a lot of performance issues and race conditions in our application, because we could just spin up a cluster and throw 5,000 simulated users at it. That roots out all the race conditions. And we did this for almost a month.
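A hypothetical sketch of the kind of load test described — a real test would hit the hub's login and spawn endpoints over HTTP, whereas here each "user" is just a stand-in coroutine, but the shape (users arriving in ramped batches rather than all at once) is the point:

```python
# Sketch of a ramped load test: simulated users arrive in waves, which is
# what tends to flush out race conditions in the hub's spawn path.
import asyncio
import random

async def fake_user(uid, done):
    await asyncio.sleep(random.uniform(0, 0.01))  # stand-in for login + spawn
    done.append(uid)

async def ramp(n_users, batch=100):
    done = []
    for start in range(0, n_users, batch):        # arrive in batches, not all at once
        batch_ids = range(start, min(start + batch, n_users))
        await asyncio.gather(*(fake_user(u, done) for u in batch_ids))
    return done

if __name__ == "__main__":
    print(len(asyncio.run(ramp(1000))))
```

Because the cluster turn-up was scripted, a fresh cluster for a run like this cost minutes, not hours.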
And that's all because we were able to automate cluster turn-up, which was very cool. It also let us have fully separated staging and production environments, because that became one line instead of "okay, I have to spend another day doing the exact same thing — no." And now there's the online course, which is an order of magnitude bigger: 100,000 users versus 3,000 users is totally different. So this time we have completely automated everything — everything, including provisioning the NFS servers and setting up multiple hubs that we can load balance against. Last time we had only one NFS server; this is going to be 20 NFS servers, and that all works out okay. We have a simple Python script that drives both Helm and Google Deployment Manager. Deployment Manager is Google's equivalent of AWS CloudFormation or Azure Resource Manager: YAML that lets you declaratively specify the resources you want. It's very cool. We also fully separated the environments, including into separate projects, so that we don't run out of credits on one or the other. So that's where the MOOC is, and we're very happy with where we've gotten to. The next thing we wanted was for teaching staff to be able to do most of the deployments themselves. We don't want to get a call saying: my class is at 9 a.m., I know it's 8 a.m. right now, but can you install this for me, because otherwise my class won't run? We wanted to put that power back in their hands so they can stop bugging us. So we did a bunch of work around automated deployments. First: where do we store our configuration? It's a YAML file, of course — but where does it go? For spring, we had a public GitHub repo. I come from Wikimedia.
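A hypothetical sketch of what a driver script like the one described might look like: it composes the Deployment Manager call (for cloud resources) and the Helm call (for the hub) for one environment. The names, file paths, and structure here are made up for illustration; building the commands as data before handing them to `subprocess.run` also makes the script easy to dry-run and test:

```python
# Illustrative driver for one environment: cloud resources via Google
# Deployment Manager, then the hub itself via Helm. Paths are invented.
def build_commands(env, chart_version="0.5"):
    dm = ["gcloud", "deployment-manager", "deployments", "update",
          f"{env}-infra", "--config", f"infra/{env}.yaml"]
    helm = ["helm", "upgrade", "--install", f"{env}-hub",
            "jupyterhub/jupyterhub", "--version", chart_version,
            "-f", f"config/{env}.yaml"]
    return dm, helm

dm_cmd, helm_cmd = build_commands("staging")
# a real script would now run each with subprocess.run(cmd, check=True)
```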
So my instinct is always: everything should be public by default unless there's a very good reason it shouldn't be. So we had a public GitHub repo with all of our config, plus a private repo on a single bastion server — backed up, hopefully — with all the private config, and we just merged them together when we deployed. This was okay; I'll talk in more detail about the problems we ran into. For summer, we said: this is fine, we'll just have a private repo, nobody's going to touch it, it'll be over soon. For fall, we completely redid this. We went back to a public GitHub repo, and we started using git-crypt. How many people here know about git-crypt? Okay — it's a very, very cool and useful tool, which I'm sure is going to bite us in the ass at some point, because there's no way it can actually be this easy. What it lets you do is use either GPG or a symmetric key to encrypt certain files in your Git repository. If you have the key, a file looks unencrypted to you, and it's transparently encrypted when you push. We use this for OAuth secrets, cookie secrets, various tokens that services need to talk to each other, and sometimes things like student user lists, because that's data protected by regulations. And then we deploy automatically from Travis. So if there's a security issue in git-crypt or Travis, we're kind of screwed — but so are many other projects, so I figure the world will be crumbling around me before I need to worry about this particular one. We're doing the same thing for the MOOC. It's very, very nice. I'm going to talk about our deployment workflow next, and hopefully that will make it clear. So this is spring '17.
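For those who haven't seen git-crypt: the basic wiring, taken from its documentation, looks roughly like this. The key ID and file pattern below are placeholders:

```shell
# One-time setup in the repo:
git-crypt init                      # generate the repo's encryption key
git-crypt add-gpg-user <GPG-KEY-ID> # grant a collaborator access
# (or, for CI: git-crypt export-key ./git-crypt-key  and store it as a CI secret)

# .gitattributes — files matching these patterns are transparently
# encrypted on push and decrypted on checkout for anyone with the key:
# secrets/*.yaml filter=git-crypt diff=git-crypt
```

In a CI setup like the one described, the build would run `git-crypt unlock` with the exported key before deploying.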
For us, if you were a TA and you wanted a new package, you would go make a PR on our GitHub repo and just add the new thing to the Dockerfile. Then someone would code review it — maybe another TA; we ended up building a good community of people doing this — and CI would run your tests and make sure it builds. Then they'd merge it on GitHub, and then they'd have to show up on Slack and say: hey, I'm doing a deploy, nobody do anything, please let me have this. And then they'd SSH in and run commands on a bastion. How many people have a deployment process like this? Okay. So this was not terrible. The PR let us do code review, so people didn't merge obviously broken things, and we had staging, which caught most of the rest. But we also had situations where someone said "I'm doing a deploy," forgot to say they'd finished, and went home, and we were like — what is this, the TFS or SVN days, where someone has a file locked? And deployers never felt safe. I once destroyed everyone's data by doing a deployment wrong, because of Git submodules. Thankfully it was before class started, so we were okay. But it never felt safe enough that I could let the students do it entirely on their own. For summer, we just said: it works, let's not touch it, it's only two months, let's do something else. I wish we hadn't. For fall, we spent time making this much better. It's now a proper CI/CD environment: you merge on GitHub, it automatically deploys from Travis, using git-crypt for all the secrets. And we can do Helm upgrades that way. So if this class needs four gigs of RAM next week instead of two, or you want an image update — that's fine.
The students are able to do that completely without any input from us, and that's really cool. Most PRs are now self-merged. We feel confident enough to say: I'm not going to look at your PR unless you explicitly ask me to; you can self-merge it, test it on staging, and then merge it to production. And that totally works — we have about 150 merged PRs on this. It's very empowering for them, too, because I don't know how to go at their speed and they don't know how to go at mine; this way I don't get annoyed at them for interrupting me, and they don't get annoyed at me for not paying enough attention to them. For the MOOC, we're doing something very similar, except instead of just Helm upgrades, we're also going to do full infrastructure updates. So if you say "I want a new cluster because we're running out of capacity," or "I want an entirely new hub with a different configuration because I'm teaching a separate class," you just change the YAML file, merge on GitHub, and it automatically deploys. That's different from what we have right now — we've automated the scripting of creating new clusters, but it's not automatically deployed in any form. And this is really nice, because we have to expand and contract. We're not going to provision for 100,000 users and wait; we're going to provision for a much smaller number, see that it's good, and then keep bumping up until we get where we want to go. Right, so this is great: we were able to spread out responsibility for deployments, and that helped morale a lot, for everyone. All right, the last bit of this is about storage. I hate storage, because the worst thing that could happen is students losing their homework.
Right — "the computer ate my homework" should be an excuse that's never actually true, and if it is true, that's really bad and we should never let it happen. Storage also needs to be cheap, because we don't have that much money. And I want it to be reliable — I don't want one student who accidentally wrote an infinite loop that hammers the disk to screw everyone else; I want it to only screw them. In spring '17 we had almost all of these things. We had one Google persistent disk per student: whenever the pod got created, a PVC would dynamically provision it. That was great — we gave them all SSDs, and everyone was really happy. All of this was handled for us by Google: they're the provider, and they provide this cool infrastructure. Performance isolation: if one student used up all their IO, nobody else was affected. Backup: we just used volume snapshots, and that was good enough. And quota as well: if you go over 10 gigs, you can't write more. But it was so expensive. Once we had autoscaling, the storage cost was about three to four times our compute cost, because storage is super inelastic. If we pay for 10 gigs of SSD per student and 90% of students are using only one gig, that's very, very expensive. So we had to stop using this. What we really wanted was to be able to overprovision. We want to be able to say: you have a 10-gig quota, but we expect most people to use only one gig, so we're going to provision about a gig per student. And whenever someone goes too far over, we'll find out — and swat them, or have a stern email sent from their professor, or something like that, rather than anything more drastic. For summer '17, we decided not to solve this. We were running on bare metal in a department that already had an NFS server, and there's an NFS volume provisioner available in the Kubernetes community.
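The economics of overprovisioning can be shown with toy numbers — the price per gigabyte below is made up, and this is purely illustrative arithmetic, not actual Google Cloud pricing:

```python
# Toy illustration: with one provisioned disk per student you pay for the
# full quota; with overprovisioned shared storage you pay roughly for
# actual usage. price_per_gb is an invented number.
def storage_cost(students, quota_gb, avg_used_gb, price_per_gb, overprovision):
    provisioned_gb = students * (avg_used_gb if overprovision else quota_gb)
    return provisioned_gb * price_per_gb

per_student_disks = storage_cost(3000, 10, 1, 0.17, overprovision=False)
shared_overprov   = storage_cost(3000, 10, 1, 0.17, overprovision=True)
# with a 10-gig quota and ~1 gig average usage, overprovisioning is ~10x cheaper
```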
So we just used that. It's only 200 users, so it's fine, but it's still problematic, because it was doing one NFS mount per pod. Which meant that if we had 100 users running, we had 100 NFS mounts, and when the NFS server went away, we had to go clean up all of them; some of them would be in a stuck state, and that's very, very nasty. We didn't really want to do that, and we had to do it a couple of times. And performance isolation was just, oh, it's someone else's problem. The SRE book has a really nice quote, which is "hope is not a strategy". But in our case, it totally worked out okay. For fall, we were like, okay, we can't really do the single disk per user anymore. So we had our own NFS server, and this time, instead of using the NFS volume mount, we mounted the NFS share ourselves. We had one server exporting one share, so we just mounted it on all of the nodes, and then we used hostPath to actually get that content into the user's pod. And this was much better, because instead of having one mount per pod, it was one mount per node. When it screwed up, we had 30 nodes, so we would just have to SSH in and clean up all of them. Still kind of nasty, but it worked okay. But it required out-of-band work, so we're not pure Kubernetes anymore, right? We had an out-of-band NFS server, and then we had to have something out-of-band on each of the nodes to mount the NFS, because DaemonSets were not good enough to do that yet; this was 1.6 or whatever. And that was not very nice, but it was definitely better than mounting like 800 or 2,000 NFS mounts. And again, performance isolation was just, let's make the server very big, and it totally worked out okay. For the MOOC, we have a lot of NFS servers; one NFS server is not going to be good enough. And we didn't want to do out-of-band work, but thankfully with 1.8, flex volumes became good enough that we could actually start using them.
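The hostPath version looks something like this: the node mounts the share out of band (say, from /etc/fstab), and each user pod points at a subdirectory of that one mount. The image, paths, and names here are made up for illustration:

```yaml
# Sketch: the node has already mounted nfs-server:/export at /mnt/nfs
# (out of band, not managed by Kubernetes). Each user pod then reaches
# its home directory through hostPath, so there is one NFS mount per
# node instead of one per pod.
apiVersion: v1
kind: Pod
metadata:
  name: jupyter-student-xyz        # hypothetical per-user pod name
spec:
  containers:
    - name: notebook
      image: jupyter/base-notebook
      volumeMounts:
        - name: home
          mountPath: /home/jovyan
  volumes:
    - name: home
      hostPath:
        path: /mnt/nfs/homes/student-xyz   # subdir of the single node-level mount
```

The trade-off is visible right in the manifest: Kubernetes only sees a local path, so the actual NFS mount is invisible to it, which is the "not pure Kubernetes anymore" part.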
So we wrote a custom flex volume provider that does NFS a little differently than the in-built provider, and it works pretty well. It does one mount per node, and then a symlink to get it into the pod's volumes. It also automatically remounts if you run into stale file handles or whatever. And that's been very good for us; we've been running tests with that. And yeah, Ansible plus ZFS for the NFS servers seems to work okay. For performance isolation, we're doing this with tc, which is traffic control, so we can limit the network traffic to the NFS servers for per-node isolation. That's still not as good as per-pod, but it's much better than not having any. I picked up this trick at Wikimedia, and it's pretty nice. So, NFS sucks, but we don't have enough people to run GlusterFS; I feel like it's not something someone can run in a part-time situation. And I don't even know if GlusterFS does overprovisioning, any of that stuff. And NFS totally fits our use case: home directories, where we don't really care that much about performance. We're not running a database on it; it's good enough. And lots of people have been running NFS for a long time. So yeah, someone please fix this for us. It can't be us, but I think lots of people are working on it, and that's great. Okay, so what do we have here? So this is great. That was supposed to come in one line at a time. Kubernetes works very well in practice. We were kind of able to take it for granted that once we had Kubernetes up and running on some platform, we don't have to repeat all of our work. Getting Kubernetes up and running was the work, and then we could just start from where we were last semester, or the previous time, rather than start from scratch. It was super cool. Automatic cluster turn-up is great. You might already know this, but I was skeptical, and then once I did it, I was never going back.
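From the pod's point of view, a flex volume like that looks roughly like this; the driver name and option keys are illustrative, since the real ones are whatever your custom driver registers and accepts:

```yaml
# Sketch: a pod volume backed by a custom flexVolume driver for NFS
# home directories. The driver mounts the share once per node, symlinks
# a per-user subdirectory into the pod's volume path, and remounts if
# it hits a stale file handle.
volumes:
  - name: home
    flexVolume:
      driver: "jupyterhub/nfs-flex"            # hypothetical vendor/driver name
      options:
        share: "nfs-server-01:/export/homes"   # which NFS server and share to use
        subPath: "student-xyz"                 # this user's directory on the share
```

Because the mount logic lives in the driver rather than out-of-band scripts, this gets back to "pure Kubernetes": the kubelet calls the driver, and there's nothing to clean up by hand when a server goes away.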
Democratizing deployment paid off. We do not want to be a gate; we want to have guardrails. And we do have them, and our staff and teachers really love it. It's really nice. Distributed storage still sucks, and distributing secrets still sucks, and I don't think that's news to anyone. But if you have a solution that solves all of this and is fully open source, come talk to us. So, you can be part of this. This is all open source. The Jupyter community is awesome. There's the Helm chart, and the documentation for it is Zero to JupyterHub. I'm not going to read out all the URLs; you can always Google for this and find it. The Kubernetes integration is called KubeSpawner; that's the JupyterHub plugin that talks to Kubernetes and makes all of this happen. And these are some of the things we're working on: we're going to scale to 100,000 users, we're doing lots of load testing, we're working with some of the people from Google on TensorFlow integration, and with some of the people from Dask. And we're also working on CI for our Helm charts, because it's a fairly complex Helm chart and we want to figure out how to test it. And that's our Gitter, which is what we use instead of IRC or Slack. This is all not just me. There is a massive team of people who I'm not going to list, but they're all part of the JupyterHub team, and they are awesome. So this is a very great way to do things that are both technically challenging and have a lot of social impact, while working with a pretty cool team. So that's it. Thank you all for coming. I don't know if we have time for questions. Do we have time for questions? No, OK. But I'm going to be here. You can find me by my hair. Please feel free to ask me questions.