And I'll turn it over to Anup to talk about JupyterLab. Hello everyone, I hope I'm audible for all of you; just let me share my screen. So again, hello everyone, I'm Anup, and I work with the Freiburg Galaxy team in Freiburg, Germany. Today we will discuss a special JupyterLab that we have developed as a Galaxy interactive tool, dedicated to developing AI programs. Before getting into the details: many of us may have used JupyterLab or Jupyter notebooks for prototyping projects and creating visualizations. In general, it's a popular editor for scientific computing and data science projects, and it's also very popular for developing machine learning and deep learning programs. It can also be useful for learning Python, for example. Jupyter notebooks support many different programming languages, such as Python and R. They provide a simple and easy way to create prototypes, and researchers who use these notebooks do not have to deal with lots of package installations; many popular packages are already installed. These notebooks can also be easily shared: a notebook can be committed along with its results and then passed on to someone else. One of the most important features is that they can be made to run on the web, which makes it even easier to share this kind of analysis. You can just share a link, and the exact analysis will be available to someone else. To develop this project, we have customized a Docker container. There are many Docker containers available that come with Jupyter notebooks prepackaged; one such collection is Jupyter Docker Stacks, and we have used one of these stacks, which already has TensorFlow installed, as our base container. On top of the base container, we have added many packages.
For example, JupyterLab, which is an improved version of Jupyter notebooks, plus packages such as scikit-learn and TensorFlow; we will soon install PyTorch as well. It also has support for GPU computation: if the host machine has NVIDIA GPUs, the notebook can access them via the CUDA packages that we have made available in the container. Then we have another package called Elyra. Using Elyra, we can create a workflow of notebooks, so an entire analysis can be executed as a workflow in one unit of software. There are other features such as Git integration: we can clone a GitHub repository directly inside the notebook and then use the entire repository. This customized Docker container is then pulled by a Galaxy interactive tool. There are many other interactive tools available in Galaxy, such as one for Jupyter notebooks, and lots of others. The interactive tool pulls the Docker container, creates the entire ecosystem, and provides a link. Using that link, JupyterLab can be opened and its features can be used inside notebooks. In this slide, on the right, we have the architecture diagram of the entire project. We have the Docker container that we customized, with the packages we added: for example, the CUDA packages, TensorFlow, and ONNX, which is a package for model sharing. The machine learning field generally suffers from model-sharing problems: TensorFlow, for example, saves a model as a directory of many binary files, which is very hard to share with anyone. With the ONNX format, an entire model can be converted to one file that can be shared with anyone and also used for inference or predictions. Another feature of our Docker container is remote model training. We have used Galaxy, and we know how we run jobs in Galaxy.
We have used a similar methodology for remote training that can be started from a JupyterLab notebook. Our Docker container takes some packages from the base container, which is jupyter/tensorflow-notebook; it already has JupyterLab installed, and when we run the container, it automatically opens JupyterLab. We have added several other packages to make it also work with Galaxy. This customized container is then pulled by the JupyterLab interactive tool, which runs on Galaxy's compute clusters. Alternatively, the container can also be used on other compute infrastructures with lots of CPUs or even GPUs; it is a fairly general container that can run on different compute infrastructures. With the Galaxy interactive tool, it opens the notebook, which looks like this and contains many different features. For example, as we discussed, it has CUDA packages: if the compute infrastructure has NVIDIA GPUs, we can run our programs on those GPUs, and model execution and training will be faster. It has many popular packages for machine learning, for example scikit-learn and TensorFlow, and computer vision packages. For data manipulation we have pandas, for saving matrices h5py, and for interacting with images NiBabel. There are also many visualization packages; the popular ones are Matplotlib, Seaborn, and Voila. Git integration, ONNX models, and Elyra we have discussed already. We have also installed BioBlend in the container; it lets the notebook interact with Galaxy, so from the notebook we can access Galaxy's histories, create histories, and interact with datasets and workflows. The BioBlend package is used internally to make remote training possible, and it also uses a separate Galaxy tool.
We will see how we can do remote model training in later slides. There are also many small features, for example IntelliSense: if we put a dot after a package name, it shows all the available methods. And there are dashboards for monitoring resources; for example, we can see on the right what percentage of the CPUs is being used, how much memory is being used, and so on. This is how it looks. When we run this interactive tool, it opens a JupyterLab container, and we have already created a few example notebooks. This is the home page, which gives a brief introduction to all the features, and there are many other notebooks that showcase different features. We will see that live. How does it compare with other notebook infrastructures? There are two popular ones: Google Colab and Kaggle Kernels. Both are available online, people use them a lot for developing AI programs, and they provide different kinds of features. Let's compare our infrastructure, Galaxy's JupyterLab, with these popular infrastructures. In terms of memory and disk space, Google Colab and Kaggle Kernels are not that generous. Their disk space is around 70 gigabytes; if you deal with millions of images, that is a pretty small disk space, and the memory is also not that high. Also, their memory and disk space are dynamic. We understand that these are big companies with lots of infrastructure, lots of GPUs and TPUs available, but to make it fair for everyone, people who use more get fewer resources, and people who use less get more.
That's how they try to make their systems and infrastructures fairer for everyone. Many times when I have used them, they give around 70 gigabytes of disk space, though sometimes it's higher, as high as 100 gigabytes. Our JupyterLab gives about one terabyte of space, which is quite big. The memory is around 20 gigabytes, and I think there are 20 virtual CPUs. In terms of GPU and TPU availability, both Colab and Kaggle Kernels have GPUs and TPUs available, but currently we have only GPUs; TPUs are not available. TPUs are somewhat better than GPUs; they were developed by Google, and I guess the design is now open. TPUs are specially designed for processing tensors, for creating and training neural networks, and are especially customized for TensorFlow. So currently we do not have TPUs available, but GPUs, yes. Let's also discuss the maximum usage time. If you're using Google Colab, you can use it for 12 hours: you create a session that gives you a connection to a remote virtual machine, and it's available for only 12 hours; after that, it gets killed. Kaggle Kernels are a little better: they give 30 hours of GPU time a week, which is also not that great, and 20 hours of TPU time. If we demand higher resources, the usage time becomes less and less. In our infrastructure, there is no time restriction. We can use it for days or even months, keeping a GPU for a long time for job execution and for training models that take many days. The resources we give are fixed and guaranteed; they do not change based on usage. Google Colab and Kaggle Kernels have dynamic resources that change based on a user's usage pattern. In addition, these two infrastructures do not provide any technique for remote model training; everything has to go through the notebook.
With them, it's not the case that we can close the notebook and the results will still become available to us. For us, that is true: we can call a function that hands the entire training and model creation over to Galaxy, the notebook can be safely closed because training is totally decoupled from it, and the models will be available like any dataset in Galaxy. To showcase the power of this infrastructure, we have reproduced results from two recent papers. The first is related to CT scan image segmentation. Here we have many CT scans from infected people, with certain regions marked to indicate which parts of each scan are actually infected. A U-Net neural network learns which regions of a CT scan are infected and predicts them: given a CT scan, it predicts which regions are infected. This entire paper can be reproduced in our infrastructure, and similar accuracy can be attained. It can be run in two ways: the entire analysis in Jupyter notebooks, or alternatively with the remote model training feature, which runs on CPU or GPU and creates the models in a Galaxy history. These models can then be pulled back into notebooks using BioBlend, and predictions can be made. For remote training, we need to convert the datasets to HDF5. In general, neural networks and machine learning algorithms take their input data as matrices, and we know that machine learning and deep learning algorithms work on a wide variety of datasets, from sequences to images to gene expression patterns. Therefore, to have a kind of standard for sending data to Galaxy, all the datasets that we want to train a neural network on should be converted to HDF5 for remote model training, so that we can upload them, either one file or multiple files, from the notebook by running a function.
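The HDF5 conversion step can be sketched with h5py. This is a minimal example under assumptions: the array shapes and the dataset names (`X_train`, `y_train`) are illustrative, not necessarily the exact names the remote-training tool expects.

```python
import numpy as np
import h5py

# Toy stand-ins for CT slices and their infection masks
X_train = np.random.rand(100, 64, 64).astype("float32")
y_train = np.random.randint(0, 2, (100, 64, 64)).astype("uint8")

# Pack both splits into one HDF5 file, the matrix format used
# for uploading training data to Galaxy
with h5py.File("ct_data.h5", "w") as f:
    f.create_dataset("X_train", data=X_train, compression="gzip")
    f.create_dataset("y_train", data=y_train, compression="gzip")

# Read it back to verify the round trip
with h5py.File("ct_data.h5", "r") as f:
    print(f["X_train"].shape, f["y_train"].dtype)
```

Validation and test splits would be added the same way, either as more datasets in this file or as separate files.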
We will see that technique in a couple of slides, and then the entire analysis can be executed. We save our trained model as an ONNX file, a single file which can be used for making inferences on unseen datasets. In this slide, we have two images, one on the left and one on the right. On the left, we see chest CT scans, and in the middle row, the infected regions denoted in red. From these infected regions, we compute masks. These masks show where the infected regions lie in relation to the original CT scan. Therefore, these two images make one training pair: this is the image, and this is the label of that image. Once we have a model, we supply one CT scan, and the model predicts which region of the scan is actually infected. In our notebooks, we could train our models and produce this kind of image. In the leftmost column, we see the original CT scans, and in the second column, the ground-truth masks, which are the true infected regions from the CT images. Then there are two error functions that were used in the paper, and they produced output similar to the ground-truth mask; for example, if you compare this image with this one, they are quite similar, though obviously not the same. The error function used in the last column, BCE plus TV, where BCE stands for binary cross-entropy and TV for total variation, produced better results as per the paper. This entire analysis was possible in our notebook infrastructure. The only thing that was different is that we had to convert all the images to HDF5 files. These HDF5 files contain several datasets: training, validation, and test datasets. They were uploaded to Galaxy from the notebook and then used accordingly.
As for the code the authors shared, we had to modify it a little. We did not modify anything in the neural network or the loss functions, just how the input data is passed to the network. The second way to do this kind of analysis is to train our models remotely. First, we need to put the entire analysis into one script. This script is executed by a custom function that we made part of the notebook itself, called run_script_job. We just call run_script_job with the correct parameters, and it runs a tool. This is a Galaxy tool that is hidden and not available in the Galaxy search. There are a few parameters to this function. The first is the path to the script to execute: we put the entire analysis into one file and supply its path, relative to the notebook. For example, if we create a data folder in the notebook, the path is data/ followed by the script name. Then we need to specify the list of datasets used in the script; we may have used many different HDF5 files, and their relative paths should be given as a list to this function. Since we are using BioBlend, we also need to give the URL of the Galaxy server it should run on, and the API key. Optionally, we can also give the name of the Galaxy history that it will create. This custom function first creates a new Galaxy history on the specified server using the API key. Then it uploads all the datasets specified in the list. After that, it reads the entire notebook, creates a Python script out of it, and uploads that to Galaxy in the newly created history as well. Now we have the datasets and the script in the history, and then the tool runs the script on the uploaded datasets.
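A call with the parameters just described might look roughly like this. This is a hypothetical stand-in, not the actual implementation: the signature, default history name, and file names are assumptions, and the stub only collects the parameters instead of contacting Galaxy via BioBlend.

```python
# Hypothetical stand-in for the notebook's run_script_job helper;
# the real one creates the history, uploads the datasets, extracts a
# .py script from the notebook, and invokes the hidden Galaxy tool.
def run_script_job(script_path, dataset_paths, galaxy_url, api_key,
                   history_name="remote training"):
    job = {
        "history": history_name,
        "script": script_path,            # path relative to the notebook
        "datasets": list(dataset_paths),  # HDF5 files to upload
        "server": galaxy_url,
    }
    return job

job = run_script_job("data/create_model.py",
                     ["data/train.h5", "data/test.h5"],
                     "https://usegalaxy.eu", "YOUR_API_KEY",
                     history_name="CT segmentation")
print(job["history"])  # → CT segmentation
```

The point is the shape of the call: one script path, a list of dataset paths, server credentials, and an optional history name.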
If the script trains a model, the tool will create that model in ONNX format. It also creates some other datasets that we will see shortly. In this slide, suppose this is our script, a Python script for creating the model; this is the script we need to run, and it has everything for training the model, creating plots, and so on. Then in another tab, we provide all the necessary parameters, for example the script path and the dataset list, and run the run_script_job function. This internally invokes the hidden tool, and it creates the entire history on the right. Suppose we have given "CT segmentation" as the name of the history: it creates a history with that name, uploads the datasets we specified and the code extracted from the notebook, and then the training happens. Suppose the training finishes in one hour; after one hour, the trained models will be available. Our script can generate multiple trained models; for example, in GANs we have different models, a generator and a discriminator, so a collection is created here with all the models. Then it creates saved arrays: if there are global variables holding matrices, it creates a separate dataset for each of these arrays inside one HDF5 file. And if the script has created lots of other files, for example a JSON or text file for saving some results, everything will be zipped and present in this dataset; it's like a dump of all the files that are present. These files can then be downloaded, pulled back into the notebook, and further analysis can be performed. The second use case is predicting proteins' 3D structures.
We recently read about AlphaFold 2, which made a breakthrough in the prediction of 3D structures of proteins. But AlphaFold 2 is very memory-intensive and takes a lot of time to predict structures. Therefore, people have come up with smarter solutions that take less time and memory and produce structures with the same accuracy as AlphaFold 2. One such technique is ColabFold, which was recently published in Nature as well. It is less resource-intensive than AlphaFold 2 and quite fast: in the paper, they claim to be 40 to 50 times faster at predicting 3D structures, and they use the AlphaFold 2 weights to do that. They have optimized the homology sequence searches, which take a lot of time, using many-against-many sequence searching, and that saves a lot of time. To pull this package into our infrastructure, we needed to add only two packages: ColabFold itself and JAX. JAX is like Google's version of NumPy; it is used for matrix computations, and some people say it will replace TensorFlow in the future. Since our infrastructure uses GPUs, prediction of 3D structures is quite fast on it. The ColabFold people have created various Colab notebooks to showcase their software. We have adapted one of these notebooks, and it is available here. This notebook uses ColabFold and the AlphaFold 2 weights to predict the 3D structure of a protein. I tried to predict the 3D structure of a protein, the spike protein of the SARS-CoV-2 virus; it is around 300 amino acids long, and this is what it predicted. So we have a running instance in Galaxy, the entire infrastructure is running, and I can show that to you.
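JAX's NumPy-like character can be seen in a tiny sketch: the same dot product written with plain NumPy and with jax.numpy, plus a JIT-compiled version. The arrays are arbitrary examples.

```python
import numpy as np
import jax.numpy as jnp
from jax import jit

a = np.arange(6.0).reshape(2, 3)
b = np.arange(6.0).reshape(3, 2)

np_result = np.dot(a, b)                              # plain NumPy
jax_result = jnp.dot(jnp.asarray(a), jnp.asarray(b))  # same API in JAX

fast_dot = jit(jnp.dot)  # XLA-compiled; runs on CPU, GPU, or TPU
assert np.allclose(np_result, np.asarray(jax_result))
assert np.allclose(np_result, np.asarray(fast_dot(a, b)))
```

The near-identical API is what makes it easy to add JAX-based tools like ColabFold to an existing scientific Python stack.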
We have also created a Galaxy Training Network tutorial that shows how to run this infrastructure, covers its many features, and explains how to reproduce the results from the two publications we discussed briefly. This is the running instance inside Galaxy Europe. We see the home page, which showcases many features in a notebook, and there are other notebooks, for example on how to share machine learning and deep learning models for scikit-learn and TensorFlow. Here we see the Git integration: we can initialize or clone a repository here, you can type the name and so on; the tutorial explains the technique. And we have GPU monitoring dashboards showing how much the GPU is being used: if we run neural network training using TensorFlow, we see our GPUs being utilized. I think that's all from my side, so if there are questions, I'd be happy to take them. This is a really amazing resource. I'm wondering, is this actively in production now so that we could go play with it? So this is available in Galaxy Europe. We have the tutorial there, under the statistics folder, and it explains all the steps needed to run this infrastructure and open the JupyterLab. I can show the tutorial: if we go under Statistics and Machine Learning in the Galaxy Training Network, this is the tutorial, and it explains the different features, how to open the JupyterLab, how to clone a repository, and how to run it. For the two papers I discussed, there are modified notebooks available that can be cloned from this repository and used to get these results. And of course, different projects can also be developed: neural networks can be developed and trained as well. Very cool. Very cool.
And then kind of a related question: BioBlend is really useful to programmatically manage files, histories, launch jobs and whatnot. And you can say no, but I'm wondering if there's a way to expose some of that functionality through a GUI, right? If you wanted to explore files or histories and just pick the ones you want, is there any capability for anything like that? So currently we do not have that. What I understood is you're asking whether Galaxy's histories should be available as a GUI in the notebook. Yeah, just some sort of widget, right? You're working interactively, you've run a workflow in Galaxy, and it's created, say, 100 output files. Sometimes it may be easy to identify the right one, but sometimes it's useful to be able to view them. Maybe the right thing to do is just pop back into the Galaxy UI and identify it there, but I'm wondering if there's a way to make that transition as seamless as possible. So if I can jump in here real quick: there is another ongoing project called GiN, or Galaxy in Notebooks, whose goal is to provide a graphical interface to Galaxy itself inside JupyterLab. Oh, cool. It enables us to run tools, upload files, and so forth. It does not yet support workflows, but obviously that's on the roadmap; you can run tools on multiple Galaxy instances, send files from one Galaxy to a different Galaxy and so forth, all using just point and click. Super cool. So yeah, it would be good to integrate all these together. I guess that's another question, right? JupyterLab is a very popular notebook, and I can see a lot of different variants growing. So do you have any thoughts on the best way to handle this? I could take your stack,
and I want to add a few more things on top of that. So now this is another variant of the same notebook; maybe it has GiN in it, or maybe it has Qiskit or some other tools, right? Any thoughts about how we stop a massive, potentially unnecessary proliferation of these multiple notebooks? So we have not yet thought about having a UI inside it, but it could be cool to converge these two projects. I mean, it's a standard Jupyter extension, so I imagine you just install both packages at the same time, right? But then that becomes another Jupyter notebook, right? Or someone wants to add their own stuff on top and make it exchangeable. So it might be nice to think about ways to have a base set of notebooks, but then also provide additional inputs, so that each time you run the notebook you don't necessarily have to install a bunch of new packages; you can just pull packages from a shared dataset or something like this. I don't know, there's a lot to think about. Yeah, it's very cool. With the GPU utilization stuff, is that node-wise, or does it only show you your specific usage? It's user-specific. The entire VM is reserved for this infrastructure when the session starts. A user can also type the nvidia-smi command and see how much GPU is being used, but it's also available via a dashboard. Yeah, I guess the question is, if two people are running notebooks, do you see only your usage or the combined usage? When I start a session, the entire GPU is for myself; it will not be used by someone else. Sharing is currently not possible: one GPU is reserved for one user.
Can you reserve more than one per user? Is that configurable? That's actually hard. Currently, TensorFlow does not provide a clean way to do that. For example, the GPU we currently have has 15 or 16 gigabytes of memory, and we are not sure how much one user will use. TensorFlow provides a way to reserve a part of that: we can specify, for example, that we need only five gigabytes for this model, and I can do that for that user. But if usage grows beyond that, it starts producing errors, which is not convenient on TensorFlow's side. So that is a limitation. So the reservation happens at the TensorFlow level, not at the Docker container execution level? Yes, it can happen only at the TensorFlow level. No, that's fine, because there are some other tools you might want to run that can make use of multiple GPUs, for example GROMACS and so forth. I was just thinking about whether anything could be reused in those cases as well. But it's very cool stuff, absolutely. Thanks. Are there other questions? Then I have maybe one, perhaps more for the audience than for Anup. The big elephant in the room here is security. We do this project because we think it's super powerful: if we could give everyone this infrastructure, you would have a really accessible GPU infrastructure that you can use for playing around with small models and then, in the same environment, outsource work to the job scheduler and run it for days. I think that's super useful. The problem is that if you enable that for everyone without any control, you will have Bitcoin miners tomorrow. We have seen that with normal notebooks already, the non-GPU ones: people create ten accounts and let ten notebooks run, just mining Bitcoin on CPUs.
And of course this will be even worse if we enable it for GPUs. So the question is, do we have a model for all these advanced interactive tools, and in particular for the GPU-enabled ones? Ways to control users, to block users. This is a broader discussion that we might want to take up either way; we don't have a good solution currently. What we do is, more or less, people need to register with an academic address and really need to write to us; then we enable it for that account, so that we have some personal contact and there is a reliable person on the other end. But of course this is not super accessible for everyone: you need to go through this bottleneck of writing to us. It's a trade-off, and maybe you have better ideas; maybe we should discuss that. We would very much like to hear your ideas or your concerns here. I guess the other infrastructures just limit the hours. For example, on Colab you can run a session for only 12 hours, which is a very simple way to restrict abuse, I would say. But we do not do that; we do not put a cap on the user's time. I mean, it seems like if it starts to become abused, then we'll have to put in limits, right? Yeah, but if you put in limits for everyone, then the service is maybe not so useful anymore, right? Maybe you want to train your model over two days, I don't know. Yeah. I think that's the tension of running infrastructure like this. Nate, I don't know if you can talk, but in general, is there anything that we need to do on the dot-org side to enable some of this, at least? Because I'm not sure what the future is: are we basically waiting for that to enable a cluster for us? If you can't talk, that's fine; we'll talk at GCC. We cannot hear you, Nate, if you're trying to talk. Blink twice. It's okay, probably in a car or something. No. Can you hear me? Yes. Okay. So yeah, it's sort of a race, right?
So there is a Kubernetes cluster that we currently submit cloud jobs to, and we have been able to do that, and we have something similar for notebooks, but that's broken. They have a new production cluster that they're going to be putting up, which is not ready yet. So if that's ready before we can come up with an alternative solution, then we can do that. We also have a lot of Jetstream 2 credits, essentially, that we can use; it's an easy, existing way to deploy, so we can set something up there, but it'll take a little bit of work. So whichever one of those two things happens first will allow us to run it. We have GPUs on Jetstream 2 as well, which is a nice benefit that we probably won't get out of the TACC cluster. Yeah, I mean, that's the extreme end of Jupyter usage; it's so nice to have it working. Anyway, perhaps we just need to have a firm road map. Yeah. Thank you for that very exciting talk. Are there any other questions? If not, the next community call is going to be on August 4th; that's in four weeks, since in two weeks we'll have GCC. And just sharing that the September 1st slot is free, so if anyone's interested, let me know and we can get you signed up. But thanks everybody for joining, and see you in four weeks or sooner. Thanks. Thanks.