So, hello everyone. I'm here to give a brief talk on our Jupyter.cs service, which is in use in the computer science department, mainly for courses. This is mostly an internal talk to explain how Jupyter.cs works, so that people can start working on it and developing it. But it's not just internal: we also plan on posting it so that other people can learn from us. Some notes about this recording. There's a vertical screen share, so that you can have the other half of your screen open with HackMD; please write in there to comment. As I'm talking, you can use it both to take notes along with me and to mention things you'd like me to say or comment on. If we work together, we'll end up with a good final set of notes. The talk is recorded, but I'm recording it with OBS and not with Zoom, so if you say something, it won't appear in the final recording unless I unmute the desktop audio, which I'll announce if I do. So you can talk, but please don't, because it would just be a blank spot in the recording. If I happen to show personal data, like student usernames, let me know with a note in the Zoom chat and I can edit that out. I think there shouldn't be many of those if my preparation is correct, but who knows, we'll see. With that said, here is the HackMD. Can people log in and make a note if you can connect? So, should we begin? This is basically a random-walk tour, so at any point interrupt via HackMD and let me know what you'd like me to mention. Let's start with the user interface. Or maybe: what are the components here? What's the big picture? This is built on open-source software. There's JupyterHub, which is the main piece. JupyterHub uses KubeSpawner to control Kubernetes.
It uses the PAM authenticator to authenticate via PAM locally on a Kubernetes pod which is connected to the Aalto Active Directory. When things run, it starts the Jupyter single-user server, which runs inside a Docker image with the various software installed inside. My focus here is not the standard open-source components but what makes the Aalto JupyterHub special. So you'll see a little bit about JupyterHub and images and the single-user server and all these things, but that's not the main point of this talk. Okay, should we begin? I don't see any other comments here. Please do interrupt anytime if there's something I need to mention. Okay, so, signing in. This is jupyter.cs.aalto.fi. We arrive at this login screen, and we log in with an Aalto username and password. When I click sign in, this is basically local PAM authentication on the hub pod. Yeah, okay, there we go. This pod is connected to Active Directory: the image is made with the necessary secrets in there, and each time the pod starts, it makes the connection, which is really sort of magic to me. Someone else has set this up, and I, well, I just hope it doesn't break. Let's see. Should I be saying the names of the people that have done all the other work on this? Since it's going to be posted, you can write the name of the person that did this in the HackMD if you'd like to give credit publicly. Okay, so once we log in, there are different server options. Well, there are these template messages here, which as you can see really need updating; we'll see where this is controlled via templates. For server options, the first four here are generic, so they are usable by everyone. After that, we see things which are visible to me because I'm an administrator and are otherwise not public. You see there are different courses, and there are separate ways for an instructor or a regular user to spawn each one.
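To make those pieces concrete, here is a minimal, illustrative jupyterhub_config.py fragment for this kind of setup. This is not our production config; it only shows that the authenticator and spawner are stock components, and that the "special" parts live elsewhere.

```python
# Illustrative jupyterhub_config.py fragment (not the production config):
# stock PAM authentication on the AD-joined hub pod, plus KubeSpawner.
c = get_config()  # provided by JupyterHub when it loads this file

# PAM works here only because the hub pod itself is joined to Active
# Directory (via the secrets baked into the hub image).
c.JupyterHub.authenticator_class = 'jupyterhub.auth.PAMAuthenticator'

# KubeSpawner launches one single-user pod per user in Kubernetes.
c.JupyterHub.spawner_class = 'kubespawner.KubeSpawner'
c.Spawner.default_url = '/lab'
```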
And the user spawning methods may not be public. Okay, so if we keep scrolling down, there's a test course here, and I'm going to launch this as a student. Now we're doing the normal JupyterHub mechanism: it creates a pod, schedules it onto a node, and pulls the image. Let's hope the image... this might be a problem if the image has to be pulled afresh now, because we've been reinstalling some of these nodes. So this might take a little while; I guess we'll just wait. After we log in, we're in the student environment here. What are the properties we'll see? There are user IDs and groups. Okay, so one of the most interesting things about the way we've set up this JupyterHub is that there are real user IDs and groups in here. Basically, I log in and the username will be darstr1, which is my Aalto username, and the numeric user ID will be whatever my real Aalto UID is. And this is important because... well, maybe we can talk about this some. Most people running JupyterHub in Kubernetes, or really anything in Kubernetes, seem to arrange it so that everything runs with a fixed UID: all the student pods run as user 1000 or something. But we need to match these pods up with the Aalto storage system, because of the filesystems here: home is temporary, but all of these other ones are mounted via NFS from an Aalto NetApp export. And this NetApp export uses the real usernames and user IDs. It's exported to the pods using NFSv3 with sec=sys, so access control happens by UID inside the network. As a user, whenever I make a new file in one of these, it gets stored under my Aalto UID, because that's what's running inside the pod. The advantage of this is that I can come to the Aalto system, which is open here, let's see. Why tell when we can show? Do I need to kinit? Probably. So if we look at the jupyter/u/04 directory: these are my files.
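A hedged sketch of how the real-UID part could be wired up. KubeSpawner's uid, gid, and supplemental_gids options are real; the hook body here is illustrative and much simpler than the actual pre-spawn hook described later.

```python
import pwd

def pre_spawn_hook(spawner):
    # Works because the hub pod is AD-joined, so pwd can resolve Aalto accounts.
    info = pwd.getpwnam(spawner.user.name)
    spawner.uid = info.pw_uid   # pod runs as the user's real Aalto UID
    spawner.gid = info.pw_gid   # ...and primary GID
    # The real hook would also add course groups here, so NFS (sec=sys)
    # access control matches the NetApp export's real users and groups.
    spawner.supplemental_gids = []

c.KubeSpawner.pre_spawn_hook = pre_spawn_hook
```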
This will be the same as the notebooks directory mounted in there, and if I list it, I can see all of my Jupyter files. So both as students and instructors, we have seamless access to the data in many other ways; it's not locked behind Jupyter. And we get things for free, like multi-user nbgrader, which is a problem for other people. We get data storage handled by someone else, as big as we need. We get quotas. We get snapshots. We get all these other amazing things. Okay. So, the home directory is temporary: every time the pod restarts, it gets recreated from the image. Notebooks is persistent, and there are some symlinks from home, so these things get linked into the home directory, which is how some things get persisted. Okay, let's see. It says it started; hopefully we'll get the environment soon. HackMD: for questions, HackMD is best, because what you say out loud will not end up in the recording. So please, whatever I'm talking about, just write the question in the actual agenda, like that. Maybe someone can paste the link in Zoom again. Okay, a question is coming. While that's coming, now that this has started, I'm going to open a new terminal and take a look. `id`. So I see, yeah, here's my Aalto user ID. Here's my notebooks directory, which as you notice is the same as we saw on the kosh shell server. There's a question: can you copy data from Jupyter to your Aalto home directory easily? From Jupyter, there's not an easy way; from kosh, yeah, you can basically use any of the normal tools to copy it. The reason the home directories themselves aren't mounted here is security. Would the security policies allow us to mount the home directories in here? Probably not. How long is personal data persisted on jhnas? What I wrote in the privacy policy is that it would be the same length of time as the home directory data.
So once the Aalto accounts expire, we would start removing the data once the Aalto home directory gets removed, which still needs implementing. Okay, let's see. When we look in here further... this was the test course data. Notice that we have access to the whole filesystem here: if we scroll up one level, we see the Linux system. I did this on purpose. I didn't want to hide the Unix system from the students, first because it gets the students more familiar with it, and second because sometimes they might need to go to things like another course's data directory, or, as instructors, to the course directory and so on. Okay. So we see there's course data; let's look at that here. This is something that is mounted in all the student pods, and the students should not have write access here. Why do I have write access? I guess because it's owned by my user, although I thought this was supposed to be mounted read-only at the mount level. Yeah. There are these questions about mounting the Aalto home directories with SMB; well, if someone wants to look at that, please do, it would be an interesting thing to try. Okay, let's see, what else did I want to say? The default working directory. The default working directory of the Jupyter server is the root of the filesystem, because the working directory is what's browsable; the default starting directory students see is the notebooks directory, and there's some bash magic that makes a new terminal start in the notebooks directory. That's in the bashrc or a bash hook or something like that. Okay. Kernels. If we do New, we see the different notebooks available, and if we go to the terminal, we can see where those come from: `jupyter kernelspec list`. These are included in the image, and we see the kernels run from /opt/conda. We'll get to this in the image section later.
So this is, yeah, I'm just showing you the effective user connection. Umask: the umask is set so that things are not group-writable by default, except when you're an instructor, when they should be group-writable. Yes. The image: we see the Jupyter image in the environment variables; this is configured via the spawner, and we'll get to the images later. We can also see things like the resource limits. These are all controlled by KubeSpawner at the JupyterHub level when things are starting. Okay. So most of the user environment is configured in this Jupyter pre-spawn hook, and all the software inside is configured in the image itself. Before we go to the pre-spawn hook, let's go to the control panel and stop my server. It takes a little while; there's no response, and if we try to start again, it still says it's stopping. The instructor environment is pretty similar, except there's a course directory also mounted in there, and the umask is set so that things are group-writable by default. Let's see. I'm going to start this as an instructor now. Let's hope this starts a little faster; I wonder if it's pulling again. Okay, so as some background on nbgrader: most of the people that use it haven't figured out how to run nbgrader in a multi-user setup, like with multiple instructors. What they do is have one dedicated instructor account which everyone logs in as. It's actually a bit more complicated when running as a JupyterHub service, but still, there's no real multi-user management in there; it's basically user account sharing. Here, it's actually shared between separate user accounts: I made a modification to nbgrader that writes the instructor files as group-owned and group-writable, so that it's reasonable to have multiple instructors log in.
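The umask difference can be illustrated with a tiny calculation. The 0007 instructor umask is from the talk; 0022 as the student default is an assumption (it's the common Linux default).

```python
def mode_after_umask(requested=0o666, umask=0o022):
    """Permission bits a newly created file actually gets: the umask
    clears bits from the mode the creating program requested."""
    return requested & ~umask

student = mode_after_umask(umask=0o022)     # rw-r--r--: not group-writable
instructor = mode_after_umask(umask=0o007)  # rw-rw----: group-writable, no access for others
print(oct(student), oct(instructor))
```

This is why instructor pods can share files within the course group while student files stay private by default.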
There are still some Unix problems, such as: if one user makes something that's not group-writable, then other users can't modify it. But that needs to be solved with normal Unix tools. There's an extra filesystem, /course, which is mounted in there. So let's see. Course, yeah, here we go: test course. In this path there are two different directories. The naming is a bit confusing: "data" is what we call the course data directory, and it gets mounted as the coursedata directory inside the pod, while "files" gets mounted as /course. So /course is instructor data, like the student submissions and all that stuff; coursedata is the shared data, like the big datasets, which are shared via NFS instead of being copied for every student. Let's take a look. Okay, test course data: this should look familiar, it's what we saw inside Jupyter. And if we look at files, we see all these files here which people have made. If we did `ls -l`, we would see who made them: not only my user ID, but other real Aalto UIDs as well. How do I have access to this? Well, you'll notice that my `id` includes a group for the test course, jupyter-testcourse. So it's controlled via completely normal Unix mechanisms. Okay, here we are again; we've now started as an instructor. If we go up, we see the course directories here, just like I was showing: the course files, which is the instructor data, and the course data, which is the same as for the student, except the instructor is able to write to it. If we make a new terminal and look at my `id`: yeah, I'm an instructor, so I get added to the test course group. But you notice I'm in all these other course groups too. Why is this? If we look at /m/jhnas/jupyter/course, we see all these other courses. I'm an administrator here, which basically means I'm an instructor of every course.
So I get added to the groups of all the other courses, and all the other courses get mounted in here. For a normal instructor, only the courses they're actually a part of would be mounted, not the others. We assume that a user could get root inside their own pod, and even then they wouldn't be able to access anything else, because the other stuff simply isn't mounted. So as an instructor, I can copy data from previous years into my current year and access things across different courses seamlessly. This is all due to the magic of NFS and Unix groups. Okay, there's one other interesting thing to consider, and that's the exchange directory. If we go to the course directory and look at... well, let's go through the flow of things. "source" is the origin of assignments; it gets translated to "release" in the instructor directory. Where is the exchange directory? It's at /srv/nbgrader/exchange/testcourse. And here we see the way that nbgrader communicates between the different pods. There is "outbound", which means outbound from the instructor to the students, and if we list it, we see the different assignments that have been released to the students. `nbgrader fetch` copies an assignment to the student's directory, which would be notebooks/testcourse here; for example, A2 got copied locally here. And when something gets submitted back to the instructors, it gets copied to the inbound directory. That directory's permissions are set so that others have execute permission, so they can access things, but they don't have read permission, so they can't list it. So a student can put a new directory in there, which is their submission, but they can't list the directory to see other students' submissions, which are protected with a random hash in the name. That's the way nbgrader gets submissions back: it uses Unix permissions to keep this secure.
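The execute-without-read trick on the inbound exchange can be demonstrated in a few lines. The 0733 mode here is my assumption of the bit pattern; the point is only that others get traverse (x) and create (w) but not list (r).

```python
import os
import stat
import tempfile

# Sketch of the inbound-exchange permission trick: others may enter the
# directory and create entries, but may not enumerate its contents.
inbound = tempfile.mkdtemp()
os.chmod(inbound, 0o733)  # rwx-wx-wx (assumed bits, illustrative)

mode = stat.S_IMODE(os.stat(inbound).st_mode)
can_list = bool(mode & stat.S_IROTH)      # False: can't see other submissions
can_traverse = bool(mode & stat.S_IXOTH)  # True: can still drop a submission in
print(can_list, can_traverse)
```

Combined with a random hash in each submission's directory name, not being able to list the directory means students can't find (and therefore can't read) each other's submissions.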
And then it gets collected back on the instructor pod. Okay, I don't see any other questions here, but if you want to ask about instructor stuff, now's a good time. Okay, oh, let's check my umask: my umask is 0007 now. Okay. Yeah, notice that everything I'm saying here isn't really specific to Jupyter or nbgrader; I've tried to make it look like a completely normal Unix system, and that is by design. Okay, should we start taking a look at course management then? So, okay, here we are on my work computer, in the course-meta git repository. This is where all the courses are defined. I recently made a demo YAML file: I basically copied another course and removed the personal data from it. We see there is metadata here. There's the name, which is what we see in the spawner list. There are the user ID and group ID; these have to come from Active Directory, and they're used so that it can create the filesystems with the right user ID and group ID, so that people can access them. Then there's the supervisor, which is supposed to be the teacher in charge; there are the contacts; and there are the managers. A manager is someone who can control the group via the Domesticator service, so they can do self-service adding of other TAs. Let's see. We say we want a data directory. We can give it important dates, which are not currently used automatically. It looks like this would tell it that the course should become public automatically on this date, become private again on this date, be archived on this date, and be deleted on this date. In reality there's no automation; it will just show us in a report what needs to be changed and what is old. Here are the actual dates of the course. This is what actually controls the course settings: if I change this to private: true, then it would no longer appear in the public list.
If we set archive: true, then it would not appear in the instructor list either, but the data would still be there and would be mountable read-only in other courses. And here we see the image: it uses the standard image, which basically means Python, and it uses the image as of this date. So if I make a newer image, it won't automatically get picked up; but up until that date, if I updated the image, they'd get all the latest changes. Another option here would be to specify the exact Docker tag of the image, so it would only ever use that one image. This is a point we'll be considering next week. Okay, then there's the course script, courses.py. Let's hope this doesn't show personal data. Okay, it doesn't. I need to make this smaller. When I run courses.py, it parses all the YAML files and gives us a report on everything. It tells us what is archived, private, and public. These tags mean things like: it's missing the contact, missing the instructor, missing the public date, missing the hide date, missing the archive date. All of these old courses didn't have good metadata stored with them. Note to others: don't make the mistake I made and start setting things up without collecting very good metadata. The later courses do have the different fields. We can see the currently active courses are not private or anything. This course says "delete" here, which means it's past the delete date, so it should be deleted now. If it says something like "archive", that means it should be archived, unless it's already archived. So basically what I do is run this and see what maintenance I need to do, and then I know to go and contact those instructors and say: okay, this course is now old, can I delete it for you? And hopefully they say yes. There are also other things you can do, like...
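The report logic can be sketched roughly like this. The field names and state names are assumptions based on what the report shows, not the real courses.py schema.

```python
import datetime

def course_state(meta, today):
    """Derive a course's lifecycle state from its metadata dates.
    Field names ('hide', 'archive', 'delete') are illustrative."""
    if meta.get('delete') and today >= meta['delete']:
        return 'delete'    # past the delete date: data should be removed
    if meta.get('archive') and today >= meta['archive']:
        return 'archive'   # hidden from instructors too, mountable read-only
    if meta.get('hide') and today >= meta['hide']:
        return 'private'   # no longer in the public spawner list
    return 'public'

meta = {
    'name': 'Test course',
    'hide': datetime.date(2021, 6, 1),
    'archive': datetime.date(2021, 9, 1),
    'delete': datetime.date(2022, 9, 1),
}
print(course_state(meta, datetime.date(2021, 7, 1)))  # -> private
```

None of this runs automatically; the real script only prints the state so a human can follow up with the instructors.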
courses.py also validates all the YAML files: that they have all the necessary data and so on. Python 3. We could also, for example, have it print out the contacts or supervisor of a course. Really this was made so that it prints out the contacts and supervisors of all courses, so they can be emailed and contacted. Let's see, how is it actually updated? There's an update script here. First it runs the checks, and then it pushes the git repository to jupyterhub-manager.cs.aalto.fi. Let's not go over the details; let's just say it's a little bit of a hack. You commit to git, then you run update, which pushes to a local git repository on jupyterhub-manager. There, it runs the course script again, runs setup, and creates the missing courses and so on. Maybe let's not go into the details, because they don't make a lot of sense on their own, but through the scripts in here, running locally for checking and then again on the remote end, it can create the courses. If we ssh to jupyterhub-manager... this is where all of the files are managed. This is the computer that has root access (sec=sys) to all of the JupyterHub files, so if we need to go deleting data or changing permissions and so on, this is where we'd do it. Let's see. We see the other data here. There's the course directory, which we've seen. There's u, for users, which we've seen. There's software, which has some shared Python files, like this MyCoursesExporter, and some other things that can be installed globally. There's a shared bin directory here. We have shared data, which I missed from the filesystem list above: this is data which is shared into every single pod that's spawned.
For example, there was a course from the engineering department, and they wanted to share some data but didn't need an nbgrader course. So I put it here, and then it's available to every student via... this is probably how it's available inside the pod. Let's take a look. Yeah, there we go. Okay, what else is in here? There's the exchange directory, which is what gets mounted at /srv/nbgrader/exchange. There is admin, which has the hub data, the test hub data, and the last-login timestamps of users. If we look at the hub data, it includes the JupyterHub SQLite database, which can be deleted: sessions are lost, but nothing else important, like the secrets and so on. Oh, and user and course hooks, if needed. Okay, so how is this protected? On the Aalto shell servers, things like admin and exchange should not be mounted, so that others cannot access them. At least I hope that's the case. So, group management. Let's come back here; I wonder where it's stored. One thing that happens whenever this processes all the YAML files: it makes another YAML file which gets copied to the Domesticator service, and that is used by Domesticator so that the group managers can add and remove members there. So it has to be run, and it has to be manually copied. Let's see how this is done. Oh, yeah, here it is. It's told to write it here. I won't open this file because it would show all of the instructor usernames. But then we see it gets rsynced to Domesticator and put in the right place. There could probably be a better way than manually running this. So let's look at the image next, unless there are any questions. If you have questions, now would be a good time to ask. Okay. So, the image. Here we go. This is a huge mess, and this is what we have a hackathon for later. But since it's Docker, maybe we don't need to go too deep in here. This repository is on GitHub.
The build system is a Makefile, which you can look at later if needed. The Makefile first builds the base image, which is derived from the upstream Jupyter scipy-notebook image, so it already has some Python and the Jupyter stuff in there; but we upgrade and make sure all the Jupyter stuff works. From there it builds the standard image, which is Python; an R image, which is R; and a Julia image. Let's quickly look at these Dockerfiles. We see it derives from an upstream image. There is a clean-layer script which removes temporary files. Here we see it pins some versions of things. It installs some Debian packages. It installs Jupyter at a proper version. It installs more conda and pip stuff that might commonly be used. It installs the bash kernel and widgets and other stuff. It installs a specific version of JupyterLab. It enables a bunch of extensions. It tries to enable an integration with Google Drive, which doesn't actually work. And it updates more. Let's look at the standard image. This derives from the base image. It again tries to pin the stuff that's installed. Here's all the stuff that's installed; I tried to organize it by course. And then some giant install commands: first conda install a ton of stuff, then pip install stuff, then install TensorFlow and Keras, then install even more stuff. And here we're upgrading nbgrader, because I needed to upgrade it from the old version; if I updated it in the base image, I would have to rebuild all the images, which I didn't have the time or energy to do. Even more stuff. At the bottom, we keep installing stuff: these are things which needed to be installed but which I couldn't add to the layers above, because that would break the existing stuff. So I have to just keep adding more layers at the bottom.
What I would normally do is, once or twice a year, go and merge the additions at the bottom into the upper installs and regenerate everything from scratch. But that's another story; let's discuss it later. Then some permission fixing, and then hooks. These hooks get run whenever a pod starts. Let's look at the hooks. There's, for example, a before-notebook hook, which handles things like making the conda cache, setting an editor, various stuff. Okay, this could be a whole other talk, so maybe I won't dig too deep here. Okay, there's a question: what's the philosophy of a few big images versus many smaller ones? So, I don't know; it really might not be the right choice. I think the reason they get so big is that I can't go and update them, so I have to keep adding more layers. The image for just a single course is not that big. It's quite convenient that I don't have to ask instructors for every single thing they need installed; otherwise they'd probably keep coming back to me saying, oh, can you install this now? Can you install this now? If I give them everything, people are generally happy. But this is something I hope we'll discuss in the hackathon next week; it's really a good open question for me. I think if the mega-image were actually maintainable and could be regenerated from scratch, it wouldn't be that bad. The problem is that it's big, and since it's used by so many people, it can't be easily regenerated. Actually, no: the size means it can't be regenerated, which means it keeps getting bigger, because I keep adding layers. For example, I could ask every course: please send me an environment.yml file which works for your course. That sounds really great, except that some people don't use YAML files and don't get it right and so on. Okay. So let's go to the back end. This is the real thing; this is where everything runs. This is the Kubernetes repository.
This repository is public, so that others can use it, except that secrets is a submodule, which is secret. There is a Dockerfile, which is used for JupyterHub itself, and this is a much easier thing to make, so it's actually a reasonable-looking Dockerfile. But in there you saw the connection to Active Directory. Let's see. We see scripts, which are scripts used inside the hub itself. We see bin, which has shortcuts like create-hub, which creates the hub; remake-hub, which takes the hub down and up again; and shell, which gets a shell on the hub. These are all basically wrappers around Kubernetes: here we see it finds the JupyterHub pod and runs kubectl exec on it. Okay. The crazy thing is the jupyterhub_config. This gets mounted inside the hub pod and controls everything, and we see it's 800 lines, so it's not a short thing. It sets some default images. It sets the default limits and all that stuff. It sets, well, a bunch of stuff which makes sense. But it also contains a lot of executable code. Here we see the default profile list, authentication config, spawner config; the plain config is boring. Here we see executable code that reads the courses: every time the spawning page is rendered, when someone goes there to spawn a server, it refreshes the courses from the YAML files, which are stored on NFS, and it renders only the courses which are available to that user. This is how the list gets limited to the images that should be visible for any given user. I wonder where this is configured? Like I said, it's configured down below. Select image. There's no point in even trying to go over this code. Oh, here we go: the function that builds the profile list. This is what handles the public, private, and archived states, and then it gets set as a property of KubeSpawner. All of this code gets run every time the page gets rendered.
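The "refresh courses on every render" idea can be sketched like this. read_courses() is a hypothetical helper standing in for the YAML-parsing code; the real function and schema differ, but KubeSpawner does accept a callable for profile_list, which is what makes per-render refreshing possible.

```python
def get_profile_list(spawner):
    # Always-visible generic profiles first.
    profiles = [
        {'display_name': 'Generic Python', 'slug': 'standard',
         'kubespawner_override': {'image': 'standard:latest'}},
    ]
    # read_courses() (hypothetical) re-parses the course YAML files from NFS,
    # so edits show up without restarting the hub.
    for slug, course in read_courses().items():
        # Hide private courses from everyone but their instructors.
        if course.get('private') and spawner.user.name not in course.get('instructors', ()):
            continue
        profiles.append({
            'display_name': course['name'],
            'slug': slug,
            'kubespawner_override': {'image': course['image']},
        })
    return profiles

# A callable profile_list is re-evaluated each time the options form renders.
c.KubeSpawner.profile_list = get_profile_list
```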
And we see it does things like setting the font color and so on. Okay. And now we have the pre-spawn hook, and this is the really crazy stuff. Every time a user tries to launch a particular image, it goes and... let's see, what does it do? It configures all the different things in there: it configures the volumes, the volume mounts, the notebook path, which is here. Basically, all the different Kubernetes properties that we saw set inside the single-user pod start getting set here. Okay, we see the environment getting set. This is interesting: commands can be configured in the pre-spawn hook, and then they get assembled into a shell command and used as part of the spawning process. So all of these commands run before the notebook server starts, which means I can do lightweight configuration of the image, based on shell hooks, without rebuilding the image. Okay. Here we see, for example: if it's a GPU node, we're adjusting stuff; it just goes on and on. We see this is if you're part of a course, and this if you're not part of a course. All of this happens if you're part of the course, and then there should be a hook here: if you're an instructor, do this stuff. Again, there's no point in even trying to understand this right now; you can come back and read it, and we can go over it together. So yeah, there are all kinds of hooks there. There's a stopping hook. There's some sort of remembering of state. This is what removes pods which have been running for too long. There's no way anyone should try to understand all the contents here based on what I've said; the point is, this is where a lot of the configuration happens. Maybe it could be rewritten, quite honestly. Okay. There's also a test environment, which someone has set up, but I've actually never used it. It's run via this somehow. Yeah.
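The "assemble shell commands at spawn time" idea could look roughly like this; every name and command here is illustrative, not the real hook.

```python
def build_command(setup_commands, server_cmd='jupyterhub-singleuser'):
    """Join per-course setup commands into one bash invocation that runs
    them before exec'ing the notebook server (sketch, assumed structure)."""
    script = ' && '.join(list(setup_commands) + ['exec ' + server_cmd + ' "$@"'])
    return ['/bin/bash', '-c', script, '--']

# e.g. instructor pods get a group-writable umask before the server starts
cmd = build_command(['umask 0007', 'mkdir -p ~/notebooks'])
print(cmd)
```

The appeal is exactly what the talk says: per-course tweaks ship instantly via the hub config instead of requiring an image rebuild.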
Maybe, would you like me to unmute, and then the person who set this up can talk about it, and say who you are or not, as you prefer? Okay, I unmuted it, so you'll appear in the recording if you talk now, if you want. Okay. So there is a test environment that has been set up, but it hasn't really been touched in a while. It's basically a more or less identical copy of the current deployment, but running in a different Kubernetes namespace. Currently it doesn't have the PAM or AD authentication, so it only uses local accounts, but the plan is to integrate the Aalto authentication using OAuth in the future. The idea is that we can have a separate test environment which we can use for testing new features or testing courses and everything like that, because right now we've basically had to do all the testing in production, which is not very ideal. Yeah, okay, thanks. I'm muting the audio again. So, well, we're coming up to an hour; this has taken longer than I'd hoped. I hope it wasn't too boring and repetitive. I guess we can use the HackMD for some final discussion before we go. This is basically a tour of the user side and of the key places to configure things on the other side. I hope that once you know the user side, the configuration will make a little more sense, and then you can start to understand where the configurations are. There's documentation of this in our internal wiki; perhaps it should be improved. Well, documentation should always be improved. I don't see much other discussion, so maybe I'll say thanks and stop the recording. Okay. Thanks a lot. Bye.