Hello folks, welcome back. Hope you enjoyed today's talk, and before the next one I have a little announcement to make: the DevSprint and workshop tickets are open. You can buy them until the end of the conference today, till 8.30pm, so make sure you buy the tickets because they are limited in quantity. Now for the next talk we have Rohit Goswami with us. He is a doctoral researcher at the University of Iceland, he works on large data problems in quantum chemistry and machine learning, and he has over 10 years of experience in open source development. Today he will be telling us about reproducible development, so let's start right away. So Rohit, how are you?

Thanks so much for the introduction, Nikhil, and I am very happy to be here. So, as mentioned, I am at the University of Iceland, in the Faculty of Physical Sciences, and I am happy to be here for PyCon India. It's a long title, but we are going to take it slowly, piece by piece, and we are going to get to the end of this in a meaningful manner. So what are we going to be talking about? We are going to talk about why this is important, then we are going to discuss how we share our Python code with each other, then we are going to talk about the revolution which is the Nix ecosystem, and finally we are going to talk about some practical aspects, including usability and high performance computing clusters, the likes of which you might see at your university, and we will discuss how they are different from, say, spinning up a cloud machine somewhere.

All right, so what is the story here? Perhaps if you have been programming for a while, and maybe if you have taken a graduate level course on software maintenance, then no matter what language you are working in, you will probably end up with something like this: you will write some code, then you will refactor it into objects or more functions depending on your language, you will test them, possibly at multiple levels, maybe even fuzz tests, all kinds of tests. If you have been programming for a while you will definitely document your code, if not for your users then for yourself, and then you will have some sort of setup instructions. That is pretty standard. On the right there is an example of this workflow in Python: you might load NumPy, you might write tests in pytest and docs in Sphinx. You can just as well imagine this to be a C++ workflow, where maybe you would load Eigen, maybe you would test with Catch and maybe you would write docs with Doxygen.

But there is an issue with this approach, and that is, for one thing, this is not how we do data science, and one of these libraries, pandas in particular, is different from the rest. How so? Well, if you have done modern data analysis, if you have taken a course called machine learning, or if you have even gone over the TensorFlow tutorials, then you will realize very quickly that modern data analysis is more of a try-before-you-buy approach, right? On the right you have an example; the text is too small, but that is not the point. You see some sort of plots, some sort of descriptive statistics; you are maybe loading a file from some online source, you are inspecting your data, you are inspecting your objects, something which in traditional programming is normally left for debugging, right?
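Concretely, that try-before-you-buy style looks something like the following minimal sketch, where the file name is made up and any CSV you have lying around would do:

    # A minimal sketch of exploratory, "try before you buy" analysis.
    # The file name is a placeholder; substitute any CSV you have around.
    import pandas as pd

    df = pd.read_csv("measurements.csv")  # load some data
    print(df.head())                      # peek at the first few rows
    print(df.dtypes)                      # inspect the objects/columns themselves
    print(df.describe())                  # quick descriptive statistics

Nothing here is a library, a function, or a test; it is pure inspection, and that is exactly the point.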
This approach, of course, means that we need a strong level of interactivity, and Python supports interactivity: right from the read-eval-print loop of the regular Python REPL, to IPython, then the Jupyter interface, first the notebook and later JupyterLab, and finally through proprietary spins on Jupyter. Now, we are not going to be talking about Colab except very briefly towards the end, in terms of practical aspects, but we are going to talk a bit more about Jupyter.

So what is Jupyter really? Well, Jupyter is a gas giant. Okay, that's not fair. Jupyter is a community; Jupyter is an ecosystem of tools. What does that mean for us? Well, Jupyter can host a bunch of kernels: Julia, Python, R, even C++ through Cling, although of course Julia has largely moved to Pluto, but that's for a different talk. Now, the main interface which we'll be discussing is the Jupyter notebook, and what is a Jupyter notebook really? We'll see a concrete example of this a little while later, but for now take my word for it: a Jupyter notebook is an executable file which is meant to be consumed by a server. That's an important distinction, because it means that you as a data scientist are going to be interacting with your browser, which is going to be communicating with a server somewhere. The server will spawn kernels for you to run things on, and the interchange format is essentially the .ipynb file.

Now, you might be wondering why we have this kind of setup, and the answer is interactivity. You can do a lot of things with regular literate programming, for anyone who's used Org mode or something else, but what you can't do is manipulate HTML widgets and other things, and that's probably why we care so much about interactivity in Jupyter notebooks, okay? But there's a problem here. How does this fit into the first paradigm which I showed? Where does this Jupyter ecosystem really fit in? What is it meant to replace? Are we supposed to replace our libraries with Jupyter notebooks? Are we supposed to replace functions with Jupyter notebooks, or variables? I mean, we could just consume an .ipynb file for variables, but that seems like overkill. And this is really what I want to talk about. And of course, Colab makes things worse, because Colab unfortunately does not support a lot of the standard Jupyter tools that we'll discuss in this talk.

So, moving on, how do you generally share code with someone? Well, we know that a .py file is a module, and it's standalone if it only imports from the standard library. Which means, on a good day, if the stars align and you send someone your .py file, and you imported nothing special apart from the standard library, and they have the same version, everything works. Reproducible code, great. And we can take this a little bit further: we can have a pure Python package, with a dunder __init__.py, and install it with pip, which is even better. We can actually go slightly further again: we can have two types of distributions, the sdists, which basically just package up a bunch of these source files, and the binary dists. Like wheels of cheese, binary dists, the wheels, are pretty great because they bundle pre-built libraries, which means they work out of the box as long as a wheel exists for your platform and the bundled libraries are all you need. Okay. Now, this slide is shamelessly copied off a much more interesting presentation on packaging by Muhammad Hashmi, but we won't really be going into packaging further.
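Just to make the module-to-package-to-distribution ladder concrete before moving on, a minimal and purely illustrative setup.py (the package name is made up) for a pure-Python package laid out as mypkg/__init__.py would be:

    # setup.py -- illustrative only; "mypkg" is a made-up name.
    # Running "python setup.py sdist bdist_wheel" produces the sdist
    # and the wheel discussed above.
    from setuptools import setup, find_packages

    setup(
        name="mypkg",
        version="0.1.0",
        packages=find_packages(),
    )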
What we will discuss next, though, is: are there soft requirements we have not considered yet? And yes, there are. Pip itself. Now, for those of you who have ever compiled Python from scratch, you may have run into the problem that pip itself requires some libraries to build, which are not always present depending on what sort of system you're working with. So we find, then, that pip is not perfect: pip itself needs system libraries and build tools. But there are some standard ways of dealing with this.

And before we get to that, let's talk about the other part, the other elephant in the room. How do we manage all the different Python packages which depend on each other and have their version problems? Until now, the most standard method is requirements.txt. At this point I would like to check if people have filled out the poll... ah, okay, I see we only have one person filling these out, so that's great. Okay, I see a 50-50 split between Conda and pip plus virtualenv. Now, pip is still the Python standard. I keep hearing that Pipenv will replace it, Poetry will replace it, but pip is still currently the only standard approach. And a better way of dealing with this is if you have a lock file: if you have a TOML file or some other kind of structured file where you declare your dependencies, and then the resolution mechanism runs and stores the results in lock files.

Okay. For system dependencies, though, we have a problem. How do we manage these dynamic dependencies? How do we manage getting libraries when we need them to run our Python code? There are a couple of approaches. There are the impure file-system approaches like Anaconda, which basically creates a kind of fake copy of your Unix tree and then works through that. There are the container-based systems, all of which require running a service; perhaps this is something you may not have considered before, but that is an overhead. And furthermore, there's a caching problem. Now, we all know that when we start a new Python project we should probably have a nice little virtual environment. Back home with my 8 TB hard disk, I didn't really care how many times I was installing Matplotlib and how many times I was installing NumPy. Out here with just my laptop, it becomes more of an annoyance to have to install pandas eight different times into eight different virtualenvs just because I want to be pure and within best practices.

So how do we fix this? How do we go beyond this? Well, with Nix. Before we get into that, this is a simple dependency tree, right? Package A feeds into package D, which also depends on C and B, right? And this brings us to a very important point: how does a computer know, or what do we do to tell a computer, where to find our binaries? The answer is the PATH variable. The answer is always PATH. For those of you who have written documentation, especially with Sphinx, maybe you've manipulated your PYTHONPATH to include the subtree which you're trying to document, right? And really, to the computer, especially in the Unix file system style, everything is a path. And Nix, with a strong academic pedigree going back to 2004, says: let's not give the users the choice. Let's not let them install to, say, /opt or /usr/local/bin; let's get rid of that. Let's instead have a single store where the hash, that is, what the file or folder is called, is predetermined and can be computed in a deterministic manner. And that's the Nix store, on the far right.
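To see the PATH point in action, a couple of lines of Python are enough; the paths in the comments are only examples of what you might see:

    # Where does a binary actually come from? shutil.which walks PATH
    # exactly the way the shell does.
    import os
    import shutil

    print(os.environ["PATH"].split(os.pathsep)[:3])  # the first few PATH entries
    print(shutil.which("python3"))  # e.g. /usr/bin/python3, or under Nix
                                    # something like /nix/store/<hash>-python3-3.8.5/bin/python3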
And then we say, well, if the user wants to use it, we'll just expose the binary itself through a symlink. Now, the one in the middle is a level of indirection required because Nix supports rollbacks, but we're not going to get into that. Now, why would we do this? Well, for one thing, this way the user doesn't get to mess up and install a bunch of different things which depend on other projects which may or may not be there. Here we have a reproducible hash which is stored: the hash, name, and version, seen once again here, right? And of course there are the paradigms of Nix on the left; we won't cover them. Now, I know that in the 17 minutes I have remaining it is going to be very difficult to ask everyone to go along with me, but still, these slides are out there, so this is included. An interesting thing about Nix is that it runs a daemon and spawns a bunch of build users, which we basically like because we want to run concurrent builds.

So how does this fit with Python? Okay, maybe you will install Nix. As a first trial, let's just directly set up a Python environment. Now, how do we take this a level further? We can actually use this in scripts, which will ensure the same Python environment. Notice that this is a big step forward already. No longer do we need our users to start walking around with virtualenvs or anything. We say: I'm going to give you a script, and when you run that script you are guaranteed to get the same Python environment I used, within reason.

And part of that "within reason" is an aside into purity. If we were coding along together, then those of you who have pyenv, which messes around with your PATH (it uses shims to set your Python), might have complained: hey, I've run this in the Nix shell and I'm still getting my good old pyenv Python. The reason behind this has to do with purity. When you pass the --pure flag, you ensure that only the dependencies you have declared are going to be available in the Nix environment at runtime.

Now, how do we make this a little bit better? For one thing, we would like to expand upon our script setup. We would like to get this environment out of the script, somewhat like a virtualenv, and we can do this. Okay, a word about the syntax: this is the Nix expression language. It is pure, lazy, and deliberately small; it's a configuration language, it doesn't need to be a general-purpose one. Now, this is how you canonically build Python packages with Nix. If someone has not already contributed a package upstream to nixpkgs, this is what you would do. Note that there are several important things here, most importantly that we are able to pin the source we are building by its hash, which is a lot better than pinning by version alone, although we also list the version. There are some nuances here which have to do with building packages in Nix, which we're not going to cover.

But what we are going to talk about is how to make this a little nicer. Because maybe now you're thinking: my requirements.txt has around 200 packages, you're telling me I have to write an expression for all of them? And this is a legitimate question. In some sense, yes, but let's talk about how we can make our life easier. In particular, we're going to be talking about mach-nix. As an alternative, there's also poetry2nix, which is simpler if you already have an existing Poetry project, but we're not really going to get into that.
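To give a rough feel for what these expressions look like, here are two sketches, with made-up package names, revisions, and placeholder hashes. A hand-written expression along the canonical buildPythonPackage lines might look like this:

    # default.nix -- illustrative hand-written Python package build.
    # The owner, repo, revision and sha256 are placeholders.
    { pkgs ? import <nixpkgs> {} }:

    pkgs.python3Packages.buildPythonPackage rec {
      pname = "mypkg";
      version = "0.1.0";

      src = pkgs.fetchFromGitHub {
        owner = "someone";
        repo = pname;
        rev = "v${version}";
        sha256 = "0000000000000000000000000000000000000000000000000000";
      };

      propagatedBuildInputs = with pkgs.python3Packages; [ numpy ];
      doCheck = false;  # skip the test phase in this sketch
    }

And the mach-nix route, driven by an ordinary requirements.txt, looks roughly like this, with the pinned tag being whichever release you actually use:

    # python.nix -- illustrative mach-nix environment from requirements.txt.
    let
      mach-nix = import (builtins.fetchGit {
        url = "https://github.com/DavHau/mach-nix";
        ref = "refs/tags/3.5.0";  # pin a release
      }) {};
    in
    mach-nix.mkPython {
      requirements = builtins.readFile ./requirements.txt;
    }

These are sketches rather than drop-in files; the walkthrough that follows goes through the same pieces.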
The other tools here have to deal with the fact that even when you install Nix, when you run an import of nixpkgs you need to ensure that you're getting the same thing every time. So this is an impure import, because it depends on whatever the user has globally defined. Okay. But let's try to replace Conda completely; let's see this in action. Now, remember that word I said about impure imports? Here we're defining our sources. Here we're saying: no, I'm not going to depend on whatever the user has installed, I'm going to depend on this project-local source which I have defined. We're also going to pull in mach-nix; as you can see, we are pulling this in from the GitHub repo. And now we're going to build our Python, but this time we're going to be able to use requirements.txt. Note that this has to be a standard requirements.txt, not one of the fancy ones generated by Poetry, which includes hashes of its own. But this is great, you know, within reason, although mach-nix is a little slow at first, because it does dependency resolution up front; it's faster later. And all the other standard caveats apply; we can still write our own expressions for things that are not covered. Now, there's a lot more about Nix, too much to consider in this talk. You can even make Docker images out of it.

But now let's move forward. What is reproducibility? Reproducibility is the ability for someone else to monkey around with your code, almost as if they had stolen into your house and sat down at your PC. And that has a lot of connotations, because it means it's more than just telling someone: oh, here's the data and these are the results, figure it out. It's much more than that. It includes code, it includes tools, and there's this beautiful graphic on the right which pretty much explains it. What does this mean for us, though? Well, there's some jargon on the right; those of you who have done a lot of data science are probably aware of it, but it's not really that important. What is important is that a lot of these problems are, at least at first flush, already solved. We already know that using version control tools like Git or Subversion or Mercurial we can keep, in some sense, a log of what we're doing. We can collaborate with tools like Overleaf and Google Drive and OneDrive, and we can even reproduce environments through Docker, Nix, or even Conda. The final part is a little bit of a data analysis specialist topic: it deals with the Common Workflow Language and, essentially, how you can reproduce full pipelines. We'll see a little bit more about that in a bit.

But now, let's talk about the environment in which you or I are going to be doing the rest of this. So what is an HPC cluster? What is a high-performance computing cluster? Well, there are a couple of ways to figure out if you're on an HPC. First and foremost, you'll be thrown onto a login node. You'll probably not have any kind of GUI access. You will definitely not have Docker. If you're very lucky, or if you're in some parts of the UK, you will have Singularity, which is kind of like Docker, but without the security concerns. You'll normally be running a kernel so old that there's no user namespace support, so you can forget about PRoot and more specialist tricks. It's probably going to be running a super old version of CentOS; maybe the GCC you have is GCC 4. And it has a network file system.
That's actually quite important. And the network file system can be anything from NFS to something like Lustre or GlusterFS. Now, the resource queue is another issue. You'll normally have a resource queue, something like Slurm or PBS/Torque, and you may or may not have support for Lmod, which is a little path manipulator written in Lua. Okay, so that's the HPC cluster.

Now, what's the HPC problem? Well, the problem is that if you're on an HPC, you're very likely doing some sort of scientific analysis. And when you're doing such scientific analysis, then everything counts. So even though technically all you really need to do is get from raw data to results, all these steps in between are what's called the provenance of your data, of your analysis. And it's important, because those things are required when you write your paper or when you want to explain it to your colleague.

Okay, so how do we get these concepts to work together? Recall that when I discussed the architecture of Jupyter, I mentioned that the server which we're communicating with runs different kernels. Now, it so happens that the server itself lives in an unholy union of Node and Python. I call it an unholy union because it has, thus far at least, more or less completely defeated the ability of Nix to cleanly package things into pure environments. There are Nix projects that try to tweak Jupyter into shape, but they don't work very well for all kernels. So it's a better idea, then, to set up your Jupyter server manually with Conda. Do it once, do it right. Use a Node version manager; just never use the system anything, right? So use nvm. You might need to track bits of the provenance manually, like plugins and setup. And then you can, of course, export a nice little YAML file which you can copy around and use to spin up the same thing with Conda everywhere you go, okay? You should always consider direnv. Whatever you do on your HPC, consider direnv; it'll make your life easier. And this is an example of some of the configuration which you might want to do, okay?

Now, let's talk about the other tools. I've been programming for a long time and I personally have not found anything better than GDB for debugging. And therefore xeus-python, which does not support many of the magics which people know and love, is still the best Jupyter debugging kernel, because it gives you an interface almost exactly like GDB. And that's fantastic.

Okay, now we come back to the crux of this talk. What do we do with notebooks? Where do they go? What are we supposed to do with them? Well, as in any vibrant community, there are two approaches to this. The first approach is on the right: it's Jupytext, which says the notebook really is, or at least should be, a literate snippet. It should be code plus documentation, all in one thing. And that's actually not a new idea; there's a lot of literate programming out there if you know where to look. There's the noweb syntax, and Org mode is still alive and well. But there's an issue here. For anyone who's ever written a lot of code into a Jupyter cell, and has then realized that, oh, okay, now this works on my data, I must make a function out of it: you know it's not a pleasant experience to refactor cells into functions. It kind of feels like you're just trying to do regular, standard programming with documentation in one forced environment. It feels like there's a disconnect. That's where Papermill comes in. Papermill says the notebook is a function: the entire notebook, whatever you did in it, can be considered to be one function. You can add some parameters to it and you can just rerun that entire notebook.
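As a minimal sketch of that idea (the file names and parameters here are made up, and the notebook just needs a cell tagged "parameters"), driving a notebook from Python with Papermill looks like this:

    # Execute a parameterized notebook as if it were a function call.
    # "analysis.ipynb" and its parameters are hypothetical.
    import papermill as pm

    pm.execute_notebook(
        "analysis.ipynb",           # the template notebook
        "analysis_2020_run.ipynb",  # the executed output notebook
        parameters={"year": 2020, "input_csv": "measurements.csv"},
    )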
Now, of course, one of the pros is that this becomes a lot quicker to work with, right? And actually you can use both of these together, because you can call Papermill from Python. So you can still have one .ipynb which you're tangling with Jupytext (we'll talk about tangling in a second), and that notebook can drive other notebooks through Papermill.

Okay, so what does that mean exactly? What is tangling, and what's going on? Well, for one thing, we should never commit an .ipynb. I should have mentioned this earlier, but maybe many of you have seen this. Why shouldn't we commit an .ipynb? Well, for one thing, the .ipynb looks like this. It is not human readable; I mean, it is technically, if you enjoy reading a lot of JSON. But most of us don't like reading JSON, and we much prefer having the more literate, cleaned-up version on the right, which is what Jupytext gives you. And that's pretty much all there is to Jupytext.

Now, what about provenance? How do we actually track provenance? I mentioned CWL, but it didn't get a slide of its own because we don't really care about writing CWL, which as far as I can tell is mostly written by biologists using Toil. Instead, we will use Renku. Now, Renku is a lot of things. Renku has a web UI; it stores your data sets and versions them, because it uses Git LFS under the hood. It's great, it's a whole different platform. But for our purposes, we're just going to use it to wrap our commands and generate CWL files. That's it. That's all you need to do, nine times out of ten, to gain a lot of advantages in real life.

Okay, so now we're coming to the wrapping-up bit: a couple of parting practicalities. As I mentioned, you should keep your Jupyter server setup impure, if only because, once again, you don't want to spin up a different server every time in every project. That kind of ruins the whole point of being able to switch kernels in the first place. So per system you want to have one Jupyter server instance, and you can manage that with Conda and nvm. Don't rely on Colab. It's kind of lazy, because if you rely on Colab, then you're going to have to suffer a lot downstream talking to someone who doesn't use Colab. And you should always commit plain text versions. This is important: you should always use something like Jupytext. And wherever possible, especially when you're doing exploratory data analysis, try to parameterize the whole notebook so that you can actually use it with Papermill; a small sketch of what that looks like follows below.

And yes, petition your sysadmins for Nix. There is a user-level install setup, which is actually part of a talk at NixCon which I'll be giving later, but it's rough. And of course, use Nix derivations. There's a lot on Nix, both in this talk and elsewhere; it will pay off. The learning curve is a little bit steep for those of you who have never used functional programming, but it really pays off. And you can use Renku as much as you'd like. You can use it just to generate CWL, in which case it's a little bit more accessible to other people, because you can still host everything on GitHub and you can also just give the CWL files out. But you could also use it maximally and track databases with it as well. Do everything with Renku.
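To make the plain-text and parameterization recommendations concrete, here is roughly what a notebook paired through Jupytext as a py:percent file looks like (the content and file name are made up), including a parameters cell that Papermill can override:

    # %% [markdown]
    # # Exploratory analysis
    # The plain-text (py:percent) twin of a notebook, as paired by Jupytext.
    # This is the file you commit; the .ipynb can be regenerated from it.

    # %% tags=["parameters"]
    # Default parameters; Papermill injects its own values after this cell.
    year = 2020
    input_csv = "measurements.csv"

    # %%
    import pandas as pd

    df = pd.read_csv(input_csv)
    df.describe()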
Okay, conclusions, right? You know, congratulations. Well, kind of, not quite there yet; there are still two slides, you have to stare at my references. But yes, interactivity is here to stay. This is the way we teach data science, and this is going to continue. Gone are the days when you could say: oh, you know, once you're in production, I don't want to hear about interactive anything. Those days are gone. Jupyter notebooks are here to stay. This is the way people do analysis, this is the way people learn; a lot of people nowadays learn Python only through Jupyter. It's just the way life is. Okay. We can and should adopt some TDD practices, definitely; there's a reason why unit testing has stood the test of time. And so with these tools, xeus-python, Jupytext, Papermill, and Renku, we should be able to meet harmoniously midway. And there are some recommendations, or references. And now, finally, thank you. You've reached the end. There's no congratulations here, but yes, thank you.

We have some time for questions, but further questions and details, along with the slides, are on the stage stream. Do I have anyone online? Yeah, let's wait and see if there are any questions. Nothing as of now. I guess there are no questions. But yeah, awesome talk; yeah, that was interesting, thank you. Rohit will be available on the Zulip stream, #2020/stage/hyderabad. Please post your questions there; he will be available to answer them. See ya. Bye-bye.