So, quick question. How many of you have used the Jupyter Notebook? Actually a pretty redundant question, but I'll go ahead with it anyway. I see everyone, of course — it's a Python conference, and it's the data science track. A less redundant question: how many of you have written your own, say, server extensions, or IPython line magics or cell magics, or worked with any of that? Oh, cool, we have one person. I thought I'd feel less educated here — like, okay, everyone's done this a lot — and I only got to know about these things a couple of months ago. But it seems we're all in the same boat, so we'll figure things out. Okay, so, why Jupyter? Well, it is a fascinating piece of software. Most of us start out using Jupyter as just this browser interface; some of you who are more geared towards your terminals are probably using the Qt console or something. But overall, it's a piece of software that we love. It's something we end up using for data science, for exploring our data sets — it's just nice to look at your data frame in a nice UI instead of going through Excel sheets or CSVs or what have you. And it's language agnostic, which is sort of given away by the fact that Jupyter was so heavily adopted and loved by the community that the community started shipping more kernels, more languages it could support. So it's truly language agnostic. And it's supercharged interactive computing. Would you believe that Jupyter has a kernel for Ansible? I was mind-blown. And I've even seen people using it with Terraform, or with their AWS Lambda functions, just to iterate over things and figure things out. I kind of do that myself — there's a new library,
I'll just open up a notebook and start experimenting with it. And it's just nice — it's super, super interactive. And of course, it's shareable, reproducible, and collaborative, which is, I think, what really put Jupyter at the forefront of scientific computing and made scientific papers, and the code shared with them, reproducible. You even have things spawning up now like Papers with Code, and all sorts of notebooks and material you can find for Jupyter. It's even used in education in a lot of places; you have whole books written in Jupyter. So yeah, it really ties into the spirit of the notebook platform being reproducible and collaborative. Okay. Who am I, and why am I babbling up here? Well, I'm a machine learning engineer. I really started writing any meaningful code about three years ago, after college, which is also when I learned Python. And I'm still trying to make peace with front-end tech — it's just not me. So, I came across this use case once, at the company I was working for previously: we were trying to deploy an environment wherein a data provider could share their data with a data user who's not inside their company. Of course, you wouldn't want to give them on-prem access. And generally these people are data analysts and data scientists, so they're more or less aware of the Jupyter interface. Well, we took one of the Jupyter Docker Stacks images, did some security magic behind it, and put it out there as a prototype. Which is all cool, but now you have to write tests — maybe an end-to-end integration test or something. And how do you really test how the Jupyter notebook is being used?
How do you know that someone's not keeping your notebook open for hours on end and maybe scraping the data — there's a bunch of use cases like that. One of the approaches was the Selenium route, and like I said, I'm still trying to make peace with front-end tech, so: not me. So then I thought, okay, maybe I'll try to understand how it really works, how the back end looks. Because if there's a browser, and there's a console, and they're both doing the same things, there have got to be servers and such in there — and I'd heard in passing that it's a nice server-client architecture. So, let's see how it goes. And thus started the rabbit hole of going down the entire documentation and tinkering with things, which I think we all do, and thanks to open source for letting us do that. But enough of that — what's in this talk for you? Well, thank you all for coming here and hearing me out, so of course I have to promise to give you something. We're going to look at and understand the architecture of this widely used and loved software. And since it's based on ZMQ, which forms the generalized networking protocol that makes Jupyter platform and language agnostic, we're going to look at how that's enabled. Then this basically serves as a gentle nudge into getting into the meat of things. Full disclosure: I haven't invented anything here. I just went to the docs, read the docs, played around with code, and here we are. And the Jupyter ecosystem already has a bunch of things that enable you — the people behind Jupyter have written a bunch of code and a bunch of utilities.
So they already have a bunch of stuff you can use to automate notebooks without a browser or any front end, and I'll mention a couple of those towards the end of the talk — or you can just grab me for more resources. Anyway, before we get into all of it, we always start with a history lesson. So, 2001: Fernando Pérez, who was a physics graduate student at the time, was trying to move to Python, I think from things like MATLAB and Mathematica. And we all know the native shell that ships with Python — it's terrible. It's just bad, it's not fun to use. He wanted a shell that was more than this primitive REPL. So in 2001 it starts as an afternoon hack, or as he likes to call it, his thesis procrastination — don't quote me on that. Fast forward ten years, and you have the first IPython Notebook release, which was not just a primitive REPL at this point; it was a lot more than that. It was actually a very sophisticated client-server architecture. The community starts to use it, starts to talk about it, and seeing the architecture, there are conversations like: hey, can we support more languages than Python on this? And sure enough, people come up with Julia kernels, and R kernels, and even Haskell kernels. So the question started being asked: well, if it supports more than Python, why call it IPython? And around 2014 there was finally The Big Split, wherein all the Python-specific components stayed with IPython. This is a question that still gets asked on Stack Overflow — what's IPython and what's Jupyter? — and it's essentially this: all the Python-specific components are IPython, everything else is Jupyter. So yeah, IPython is mostly the interactive Python shell, the kernel, and the console you see when you just type ipython in your terminal.
But Jupyter is much beyond that: it's all the language-agnostic stuff — the networking protocol, the core of Jupyter, which is what we're going to look at today; the format to store your notebook documents; the tools to convert notebooks across different formats; and all the fun extensions that you use. Which, by the way, is also what's behind this presentation: it's entirely made in Jupyter, and it's not doing that bad a job. I'm using reveal.js and RISE, which lets me run this as a nice slide show with notes and stuff, if you want to use that. And of course, fun interactive widgets — because who doesn't love those? So, coming back to slightly more technical stuff: what is Jupyter in a nutshell? Like I said, it's IPython — or any language kernel — plus ZMQ, which forms the base of the networking protocol, and Tornado, which is essentially the backend server that supports the REST APIs for the Jupyter server. It uses a lot more open source tools, for LaTeX and widgets and so on, but in a minimalistic view it's essentially these three things that bring it all together. Now, let's look at a single — is this visible? More or less? Okay, we can always fix this. Cool. So, the single-notebook architecture: you, the user, generally open your browser, or your Qt console, or whatever you have. It talks to the notebook server via HTTP and WebSocket traffic, and this notebook server is also responsible for managing everything related to your notebook file, which we're going to look at. And this server is the component that takes your code from the front end — from your browser — and passes it on to the kernel, and the kernel doesn't really know much.
The kernel's like: okay, I'm just connected to something that's giving me input, I'm responsible for running the code, and that's all I care about. So it's this notebook server that takes your code, gives it to the kernel over ZMQ sockets, and the kernel runs it and gives the results back to you. Which is all pretty good, because then you can swap out kernels, and you can add in more layers — it's software, so the more layers you add, the more functionality you can achieve. Now, what is this notebook document we just talked about? It's just this. Super underwhelming, no? I mean, I was kind of surprised: whoa, it's just a dictionary — an array of cells plus some metadata. Generally, this metadata would be, say, configuration for any extensions you're using; with this notebook, I can specify any RISE configuration I want to add, which is pretty nifty. And of course, the format version in which the notebook's stored. So yeah, let's look at the notebook server. We looked at it a little before. But of course, it's a server, it's going to do a lot of complicated stuff, and that's what it's doing here. It's responsible for managing your sessions: you start a notebook, a session gets created, and as you do things in the notebook it's all tracked in that session information, which the WebSocket traffic then uses. The server is also responsible for talking to lower-level components like the Jupyter client — the tooling which looks for any existing kernels on your system. The client is then also responsible for starting the kernel and giving that information back to the server, which patches it all up for you. So, lots of stuff going on there. And of course, there's the contents manager.
Super important thing: it's responsible for storing your notebooks on your local file system, or on any other storage of your choice, and it also does a lot of cool pre-processing and post-processing things, if you want to look into those. The documentation is linked in these slides, so you can always go view those — I'll put this up on GitHub. So, let's see. Since we were told not to do live demos, I didn't really prepare one, but the good thing with Jupyter notebooks is that they're super interactive and you can just run stuff. Okay, I started another session. Anyway, let's look at the sessions API. We're talking to the server running on localhost right now, hitting the sessions API and seeing what we have in store. It turns out we have two sessions right now. Ignore the second one — we'll come to that later, it's fun. The first is, of course, this notebook. It gives you this kernel information, and the one front end that's connected to it right now. We can see it's busy — of course, it's running all this code, stuff's happening. And you can get more information: you can use the contents API to look at your notebook in JSON format and do stuff with it, the kernels API to get more information about the kernel, and so on. But today we're going to restrict ourselves to the Jupyter messaging protocol, for which ZMQ, like I mentioned earlier, is super important to understand. Now, it turns out that ZMQ arrived at a fortuitous time, because twenty years ago, while Fernando Pérez was working on IPython, Pieter Hintjens was also working on ZMQ. And before that — it's history, I wasn't there, don't quote me on it — it turns out a lot of people were reinventing the wheel.
Everyone would write their own TCP sockets and try to handle all communication across different software, written in different languages, over different operating systems, all on their own. So there was a need for a single messaging library that took care of a lot of this and lets you build pretty much any pattern you can imagine. And fun fact: Pieter Hintjens, who wrote ZMQ, also wrote the AMQP standard, which is now used in RabbitMQ — which I guess people here have used? Anyone? Cool, amazing. So yeah, fun history there. If you want to go check out the ZMQ book, I highly recommend it. Anyway, there are a couple of socket types and patterns that the Jupyter messaging protocol relies on, which we'll look at now. First, there's the standard request-reply pair. The requesting socket is a connecting socket, and the replying socket is always a binding socket. Now, the thing with this socket pair is: yes, the architecture is straightforward — the requesting socket requests, the replying socket replies, all is well with the world. But the problem is that they're always stuck in a strict send-receive cycle. The requesting socket has to know, somehow, that there's a server connected and accessible — and that's hard, because you have code trying to talk to other code over a network, and a lot can go wrong when networks are involved. Also, the replying socket cannot really do anything — the server cannot do anything — until it actually receives a message. So we'll see that some patterns are really hard to build, or even conceptualize, with just this basic request-reply pair. And one more point to mention: there is no fair queuing.
So the server is just going to reply to requests as it receives them; it's not going to care how many requests it's received from one client, or whether other clients are being drowned out by requests from that one client. So yeah, a lot can go wrong in distributed systems — and as we can see by now, Jupyter is a mini course in distributed system design. Oh, wow. Well, some things happen. Okay, here we are. We can always deal with this. Cool. So, the next socket pair is pretty straightforward: the publisher and the subscriber. The publisher produces what can be considered an infinite stream of data, and a subscriber just consumes it — either the topics it subscribed to, or everything. Not much going on there. But here is where the real magic happens. There are two fancy socket types in ZMQ: the dealer and router sockets. First of all, no request-reply lockstep — which is actually cool. The router can receive a message and choose not to reply to it at all, and just forward that message somewhere else, and the dealer can do the same. But the thing that's really fascinating about the dealer and router sockets — about the router socket in particular — is that the router socket tracks the identities of all the sockets it's connected to. That makes it really useful in the context of the kernel, which can be connected to any number of front ends at the same time. Remember, Jupyter is collaborative: you can have several front ends connected. So how would the kernel know who's who? The router socket makes it really easy, because identities are tracked: you can have multiple front ends, and all is well with the world. And the router socket is also pretty good about fair queuing. So if one of your front ends is trying to drown out the kernel with a lot of requests, the router is going to say: well, hold up, let me just process these other front ends and come back to you.
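To make that identity tracking concrete, here's a tiny sketch with pyzmq. This is just the raw ROUTER/DEALER pattern, not Jupyter's actual wiring — the `inproc://demo` address, the identity names, and the message bytes are all made up for illustration:

```python
import zmq

ctx = zmq.Context.instance()

# The "kernel" side: a ROUTER that binds and tracks who's talking to it.
router = ctx.socket(zmq.ROUTER)
router.bind("inproc://demo")

# Two "front ends": DEALER sockets, each with its own identity.
frontends = {}
for name in (b"frontend-1", b"frontend-2"):
    s = ctx.socket(zmq.DEALER)
    s.setsockopt(zmq.IDENTITY, name)
    s.connect("inproc://demo")
    frontends[name] = s

for s in frontends.values():
    s.send(b"run some code, please")

# The ROUTER prepends the sender's identity to every message it receives,
# so it always knows which front end to route a reply back to.
seen = set()
for _ in range(2):
    ident, payload = router.recv_multipart()
    seen.add(ident)
    router.send_multipart([ident, b"result for " + ident])

# Each front end gets back exactly its own reply.
for name, s in frontends.items():
    assert s.recv() == b"result for " + name
```

The kernel-side code never has to juggle per-client bookkeeping by hand; the identity frame is the bookkeeping.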
So — whoops. Well, one thing: if you do decide to present from a notebook, don't do this. Okay, here we go. Is this visible? Anyway, this is the most we're going to need to look at. As you can see, there is a front end, this tiny computer here. So here's how the dealer-router and publisher-subscriber socket pairs are actually used within the messaging protocol. You generally have a dealer socket connected to your front end; it takes all the requests and sends them on to the IPython kernel, over channels that the messaging protocol defines, which we're going to look at next. These requests could be anything: code execution, tab completion, code introspection, even interactive input. So the dealer socket takes those requests and sends them to the router socket, the router socket hands them to the IPython kernel, and the IPython kernel thinks, okay, cool, I just have to deal with this input — it doesn't know what's going on in the rest of the world. Then there's the publisher-subscriber pair, and why would you have that? Well, of course, you'd have print statements in your code, you'd have errors, and you'd want those side effects propagated to the screen, to all the connected front ends. That's taken care of by this very straightforward publisher-subscriber pattern. The next thing you'd have is a dealer-router socket pair where the directions are reversed — and why would you have that?
Well, you also have situations where you're doing a raw input in your code cell, or you have a widget that constantly needs a stream of input. You can imagine the IPython kernel having a virtual keyboard connected to it — but that's not the case, because we're in a client-server architecture, and the server in some cases could be deployed somewhere else entirely. So you have a dealer socket that essentially sits with your front end and takes your input, and the request-reply direction, as you can see, is reversed in this case. It's the router in the IPython kernel that's responsible for requesting an input from the dealer, which gets it from your front end, and we have this nice proxy which lets code run interactively. Okay, now let's go back to things as they were. Lesson learned: long diagrams do not work well. So, how do these messages really look? It's just JSON — and that's one good thing about ZMQ, it can carry many serialization formats, and JSON is one of them, which makes things a lot simpler. Now, this message would generally contain the message ID, which is unique per message; the message type, which is, like I mentioned earlier, code execution, tab completion, introspection, or even interactive input; and the session ID. What's interesting about the session ID is that the kernel has its own and the client has its own, so if the session ID of either of these two parties changes within a connection, you know that something went down and came back up again — which allows the flow not to be broken, but also gives you this added piece of information. And then, of course, you have the parent header.
How does the front end really know where to render the output for a given input cell? That information is contained in the parent header. So that's the message format — and there are a lot of message types, which you can check out in the documentation; we're only going to look at a few of them today. So, like I mentioned earlier, how does Jupyter take all of these sockets and bring them together into its own abstractions? It does that by defining things called channels. The main three channels we're going to look at right now are the shell, IOPub, and standard input channels. The shell channel is the dealer-router pair that we looked at earlier — the channel which receives the incoming requests, gets them executed, and gets the results back to you. IOPub is the simple pub-sub architecture we looked at earlier, responsible for broadcasting all the side effects. And standard input is the final dealer-router socket pair we looked at, which is essentially just an interactive input channel. There are more channels — of course, in a distributed system you'd also want something like a heartbeat, because you'd want to know when a kernel goes down rather than be left with nothing to work with. Those are in the documentation; you're free to go check them out. Now let's see some action. Okay, is this visible enough? Okay, cool, this is good. So, we looked at this notebook JSON earlier. Let's see if we can define a notebook with it, which is just simply dumping it. And of course it does nothing by itself — but yeah, we do have a notebook now. There, cool. It got created a couple of seconds ago, which is great.
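That "it's just a dictionary" step can be sketched like so. The layout follows the nbformat 4 shape, but the file name and cell contents here are made up, and in real code you'd normally use the nbformat library rather than hand-rolling the dict:

```python
import json

# A minimal hand-rolled notebook document: an array of cells plus metadata.
nb = {
    "nbformat": 4,            # the format version the notebook is stored in
    "nbformat_minor": 4,
    "metadata": {},           # extension config (e.g. RISE settings) lives here
    "cells": [
        {
            "cell_type": "code",
            "execution_count": None,
            "metadata": {},
            "outputs": [],
            "source": "print('hello from a hand-made notebook')",
        }
    ],
}

# Dump it to disk and it's an .ipynb file like any other.
with open("hand_made.ipynb", "w") as f:
    json.dump(nb, f, indent=1)
```

Open that file in the running server, or fetch it through the contents API, and it shows up as a perfectly ordinary, if boring, notebook.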
Let's see, does it open and do anything for us? Okay, so it's a valid notebook — or at least it's considered a valid notebook. Cool. All right, let's get back to where we were earlier. Now let's try to create a session for this notebook, which I've actually already created earlier, but we can just recreate it. As we can see, we now have a kernel — if I hadn't started it earlier, you'd see it in a starting execution state — and it's pretty clear it has zero connections, because we don't have the notebook open in any browser tab. Now, there are two ways you can go from this. You can open a WebSocket channel over HTTP and talk to the server directly; that is, of course, going to reflect changes in your notebook and essentially allows you to run the notebook in a headless fashion — and I think Google Colab has an existing tool out there that sort of lets you do that. The other thing you could do is use the internal Jupyter client and see how that interacts with the kernel. And in some cases you can even write your own clients. You could do that in several situations: maybe you want another layer which is responsible for checking whether your code is trying to do something funny, or maybe you want to host your own interactive computing environment. So you could look at how this client works, write your own, write a whole server layer on top of it, or maybe write an extension on the existing Tornado server. There are a lot of ways to go from here. So yeah, let's see. One thing about Jupyter kernels is that they occupy some storage on your file system, generally found in a local data folder on Unix or Linux systems — I think it's the same for Mac, but slightly different for Windows. Of course, everything's different with Windows.
Generally, in this folder you'd have kernel specs residing, which is how Jupyter knows what kinds of kernels you actually have installed on your machine. It also holds information about kernels that are currently running, in this nifty little folder called runtime. So we're going to try to create our own async client and load its connection file, which is essentially just hooking it up with the running kernel, and see what goes on. Okay — not much, right? What you see here is just the message ID of the message we sent to the kernel while trying to connect to it. We'll do some more fun stuff next. Okay, so let's see what's going on. Well, now we have three kernels running, because we ran some code again and again. And it's pretty interesting how you can have all these kernels running at the same time: you can take any client, connect it to any kernel, release it, connect it to another kernel — you can go all sorts of ways. So let's see if we can run some code. One thing we noticed earlier is that all these kernels are in an idle state. Do they have channels running? Let's see. Oh yeah — well, of course, now they do. Anywho. Remember I talked about the few channels that form the abstraction for the messaging protocol: the client provides an interface to start those channels, and as you can see, we're starting the shell, IOPub, heartbeat, and control channels here. And now they're running, because we ran them a bunch of times. Let's see if we can do a hello world, because we always start out with hello world. Okay, so we got a message ID — not much happening here. Let's see what the kernel really sent us back. Okay, well, still working. All right: Jupyter and its kernels in action.
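The hello-world round trip we're doing here looks roughly like this with jupyter_client's blocking client — a sketch only: it assumes an installed `python3` kernel (i.e. ipykernel), and the demo in the talk uses the async client instead:

```python
from jupyter_client.manager import KernelManager

# Start a kernel and get a client hooked up via its connection file,
# then bring up the channels.
km = KernelManager(kernel_name="python3")
km.start_kernel()
kc = km.client()
kc.start_channels()          # shell, IOPub, stdin, control, heartbeat
kc.wait_for_ready(timeout=60)

# Send an execute_request on the shell channel.
msg_id = kc.execute("40 + 2")

# The execute_reply comes back on the shell channel; we match it to our
# request through the parent header.
reply = kc.get_shell_msg(timeout=60)
while reply["parent_header"].get("msg_id") != msg_id:
    reply = kc.get_shell_msg(timeout=60)

# Side effects and results are broadcast on IOPub; again the parent
# header ties each message back to the request that caused it.
result = None
while result is None:
    msg = kc.get_iopub_msg(timeout=60)
    if (msg["parent_header"].get("msg_id") == msg_id
            and msg["msg_type"] == "execute_result"):
        result = msg["content"]["data"]["text/plain"]

kc.stop_channels()
km.shutdown_kernel(now=True)
```

Swap the blocking client for the async one and you get the awaitable flavour the demo is running.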
Well, maybe we can start another request then. Yep, and we busied up the kernel again. While we wait for those async tasks to complete — because sometimes they like to take a while — essentially what you'd see is the message header we talked about earlier. The kernel gives you its response, and the parent header has the initial message ID, which came from the cell — or the client — that tried to execute this code. And the kernel tells you, well, okay, the execution went fine: execution result, okay, cool, great. And right now it's still connected to zero front ends, though. And — yep, it worked. No, that's something else. Okay, still working. But essentially, you could then do the same thing with the IOPub channel: you could see the output getting reflected back to every other front end. And again, you could write your own layers on top of it. You could also open a standard input channel here, although that's going to be super blocking, so let's not do that right now. Finally, here is how it all comes together. You have the IPython kernel at the bottom, the Tornado server — and I've got to zoom out again — and all your fun stuff going on top of that: the Jupyter client, which we just saw a little of; nbformat, which takes care of your notebook; and then all the front ends running on top of it — your Qt consoles, the browser interface, JupyterLab, JupyterHub. And now there's also this nbgrader thing, which is super useful if you're in education and responsible for teaching a bunch of people and grading a lot of notebooks. So yeah, it's these few core pieces, with a lot of stuff going on top, that essentially make Jupyter what it is right now. And yeah, that's kind of it.
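For reference, the message anatomy we kept coming back to can be sketched as a plain dictionary. The field names follow the messaging spec, but the `make_message` helper and the `"demo"` username are invented for illustration, and real messages are additionally HMAC-signed and split into frames on the wire:

```python
import uuid
from datetime import datetime, timezone

def make_message(msg_type, content, session, parent_header=None):
    """Build a dict in the rough shape of a Jupyter protocol message."""
    return {
        "header": {
            "msg_id": uuid.uuid4().hex,              # unique per message
            "msg_type": msg_type,                    # e.g. execute_request
            "session": session,                      # each party has its own
            "username": "demo",
            "date": datetime.now(timezone.utc).isoformat(),
            "version": "5.3",
        },
        "parent_header": parent_header or {},
        "metadata": {},
        "content": content,
    }

# A client sends an execute_request under its own session id...
client_session = uuid.uuid4().hex
request = make_message("execute_request",
                       {"code": "40 + 2", "silent": False},
                       session=client_session)

# ...and the kernel's reply carries the request's header as its
# parent_header -- that's how a front end knows which input cell
# an output belongs to.
kernel_session = uuid.uuid4().hex
reply = make_message("execute_reply",
                     {"status": "ok", "execution_count": 1},
                     session=kernel_session,
                     parent_header=request["header"])
```

The two distinct session ids are also what let either side notice that the other went down and came back up.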
I mean, we can always go back and see if we got a result or not — but yep, it turns out it's still running. We'll figure that out later. Anyway, that's it. Any questions?

[Audience] Thanks for the talk, it was great. A question regarding scalability of Jupyter — maybe you can talk about the future you see for Jupyter, but more importantly, scalability. You mentioned that it's a benefit, but it's also a cost, that you can connect multiple clients to the same notebook. What are the patterns for using Jupyter in a multi-user, scalable environment?

Well, first of all, thank you for that question. A tool for that actually exists right now — have you heard of JupyterHub? Yeah, it is a multi-user environment, and sort of scalable. Or did I not understand your question properly? Okay, cool, thank you. Any more questions? Well, okay — thank you all for sitting here anyway and listening.