Okay, so this is a talk I gave two weeks ago at the data team meeting on Tuesday, and I was asked to give it again, basically, and I added a couple of items to it. It's about the experience I had implementing the protein-protein interaction pipeline I was working on. That's a pipeline which was initially written in C++; it had some fixed paths in there, some old code from 2013. We had the idea to reimplement it on Galaxy and then apply it to different genomes, and particularly also to SARS-CoV-2.

Okay, so just a brief overview of what this pipeline does: basically, it takes a pair of sequences, sequence A and sequence B, and then it does homology detection, which in the simplest way you could imagine is a sequence alignment against a non-redundant database where we have both sequences and structures. Once we match sequence A to another sequence for which we know the structure, we can use that structure to create a model, basically assuming that if the sequence is similar, the structure will be similar too. That's protein structure modeling for individual proteins.

The approach here also predicts complex structures. So initially we predict the structures of the individual proteins, and then we look further in the PDB for monomer hits which are involved in any kind of complex with other proteins. Then we try to match our two individual proteins to one of these complex structures, basically using it as a multimeric template for complex structure prediction. The approach works on a genomic scale, but it requires that we pre-process the PDB database.

Previously, I used the approach on a local cluster, so I had a shared directory; I could just download the entire PDB, access it from every node, and use the directory structure however I saw fit. Of course, in Galaxy things are a little bit different. We had to adjust things, we had to have the libraries, and we went through a couple of iterations, also with a lot of help, to make the PDB available in an efficient way and to be able to access these currently 170,000 datasets quickly with our algorithms.

The main approach: we find a match in the PDB70 database, which is a database of approximately 60,000 sequences with less than 70% sequence identity to each other. Then we get these monomers, and we get several hits; there is a top-ranking one, but there might also be other similar ones, and some of those, as you can see in this figure, are involved in protein complexes, those are the gray ones. We do this for each sequence individually, and then we try to match our top-ranked monomer from sequence B to, for example, the binding partner identified through threading sequence A.

Then, we have pre-computed the PDB and all the relationships. This involves a lot of sequence alignment, so I use PSI-BLAST to create this index and find similar partners. PSI-BLAST is of course a fantastic algorithm, but if the sequence identity is very low, say less than 40%, which is commonly where the gray zone starts and the matches get worse, it's not so successful. But since I use the PDB70 database, I know the entries are below 70% sequence identity to each other, and whatever I haven't matched yet is above 70%, so PSI-BLAST works pretty well for that part.
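To make the overall flow easier to follow, here is a minimal Python sketch of the pipeline logic as described above. The function names and return values are placeholders for illustration, not the actual scripts or tools used in the pipeline:

```python
# Minimal sketch of the pipeline logic described above.
# All names and stub results are hypothetical placeholders.

def thread_sequence(sequence, database="PDB70"):
    """Thread a query sequence against PDB70; return ranked monomer template IDs."""
    return ["templateA1", "templateA2"]  # stub result

def find_complex_templates(monomer_hits):
    """Look up PDB complexes in which any of the monomer hits participate."""
    return [("templateA1", "partner_chain", "complex_1xyz")]  # stub result

def superpose_onto_template(model, template_chain):
    """Structurally superpose an individual model onto a complex template chain
    (the structural alignment step); return an alignment score."""
    return 0.85  # stub score

def predict_complex(seq_a, seq_b):
    hits_a = thread_sequence(seq_a)
    hits_b = thread_sequence(seq_b)
    models = []
    for template_a, partner, complex_id in find_complex_templates(hits_a):
        # Try to match a template for sequence B onto the binding partner
        # found through sequence A's complex template.
        for template_b in hits_b:
            score = superpose_onto_template(template_b, partner)
            models.append((complex_id, template_a, template_b, score))
    # The highest-scoring superposition is kept as the final complex model.
    return max(models, key=lambda m: m[-1]) if models else None

print(predict_complex("MKT...A", "MVL...B"))
```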
For the initial threading, though, we use HHsearch, which is more sensitive. Okay, so then I find the partners, and I do a structural alignment using TM-align. So I match TA1 here to TB1 and use one of these complex templates to construct a final model. That's the approach. If it's successful, we can use our NGL viewer in Galaxy to visualize the hit, and that's what you see here: the red one and the blue one are, for example, the structural template from the PDB, the yellow one is the individual model produced for sequence A, and the cyan one is the individual model produced for sequence B, and they are matched onto this complex structure template from the PDB. You can see that the match is very good in this case. We have energy terms for it, and as I said, we can apply it on a genomic scale and it works pretty well. We ran some experiments to validate the method and compared it to a variety of experimental methods using the BioGRID database; the tools are all available in Galaxy, and it performs comparatively well.

But there were quite a few challenges in transferring this pipeline to Galaxy, and one of them was the general tool update process. What I did was rewrite all the scripts in Python, with fewer dependencies; it's much more readable, of course, and much easier to modify. That was possible because I knew exactly what I was looking for. The C++ program I had before had many more variations and different ways of testing things, but when I re-implemented everything in Python it took a while, of course, yet it was fairly quick because I knew exactly what I was looking for.

Okay, the tool update process, however, is tricky, and I hope you don't see this as just complaining; I just want to mention these things. Some of them are necessary, and I may not be able to improve on how the infrastructure logically works, but there are others where we can improve, so this is just a listing, unopinionated in some way. So this was initially my tool update process: I update my Python code locally. Then I have to go to my local directory and find the miniconda files (I'm using my Mac for this) and update those files manually. The directory is pretty cryptic. It's fine once you know which files are actually used, but there are duplicates in the miniconda directory, maybe three of them, possibly in different locations, and you have to find the one which is actually used when you run your local Planemo test and your local instance. So I had to find these files, modify them manually, then of course run and test my local instance.

Björn helped a lot with the rest of the process, which also points out that basically two people were required to do this, and one of them, like Björn, was of course an expert on it. That might be challenging for others, which is why I mention it: we had to find a Björn to help. So we create the Conda packages at step four, then we push the Conda PR and merge it. Then we update the tool repository and push the tool repository PR to GitHub. Then we wait until the tests pass, then we may merge the PR. Then we put the tool in the Tool Shed, which is now automatically part of the script Björn added. And then we install the new tools on the instance, and then we execute the tools.
So that's 11 steps. And then, when we notice anything is wrong, for example runtime issues, files being too big, or a change in the datasets, whatever you change, you have to go through the entire process again. That can be cumbersome and cause delays.

Okay, let's look at a couple of things I ran into while doing this, particularly working with collections. Collections are of course excellent: they let you track your individual datasets, they let you parallelize your computation immediately while everything is tracked, and you have nice access through the history, and that works fantastically. But if you have more than 20,000 entries, like genome sequences, say the human genome with about 20,000 protein-coding genes available in the UniProt database, then your collections get really slow. Initially I worked a lot with collections, but then I started reducing their use, and I will show how in the next slide, because it got really slow. The history got really slow; I basically had to click and then wait. And if you already start off with 20,000 sequences, then your next dataset, or your next step in the workflow, will also have possibly 20,000 outputs, and so on. So you keep iterating over this collection, which I don't do anymore, because it makes it impossible to work with, and it also makes it impossible to schedule the workflows. If you make a mistake scheduling one, with a parameter, or you want to try another option, then you're lost.

Another thing is that collections cannot easily be imported. For example, I started off with one history, I noticed it got slow, and I wanted to start anew with a new history but pick just one of the collections out of the old one, so I could start fresh with, say, two collections of 20,000. It's impossible. Even when we copy datasets, we basically have to copy the URL and either upload it again or use the multi-history view, and the multi-history view of course cannot handle that many datasets right now.

Okay, so collections were an issue, and what we did was find a workaround. We didn't invent it; it's a very simple approach. Instead of producing a collection, you produce one big file into which you concatenate all your datasets, and then you create an additional file, the ffindex file, which has three columns: the first column contains the entry name, the second column contains the location of the dataset, i.e. where it was concatenated into your ffdata file, and the third column is the size of the dataset (there's a small sketch of this below). By converting the PDB into these ffindex/ffdata pairs, we could now actually run the tools easily.

The current pipeline uses both. It uses collections to do the threading, because that helps parallelize it; otherwise it probably wouldn't be possible to do it in an efficient manner. The initial threading, sequence A against PDB70 in this case, is the largest part of the computation and takes the most time. That collection is produced through the parallelization, and it worked really well on Main; I know we had some adjustments for that too, and Marius added something to allow us to run these quickly, and they work pretty well. Once that collection comes out, I turn it into an ffindex/ffdata pair and continue working with that single file, because for the subsequent steps I don't need parallelization.
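As an illustration of the three-column index just described, here is a minimal sketch of how such an ffindex/ffdata pair could be written and read back in Python. The real ffindex format and the tools used in the pipeline differ in details (for example null-terminated entries), so treat the helper names and format here as assumptions for illustration only:

```python
# Minimal sketch of an ffindex/ffdata pair: one concatenated data blob plus an
# index of (entry name, offset, size). Not a drop-in replacement for the real tools.

def write_ffdata(entries, data_path, index_path):
    """entries: dict of {entry_name: bytes}. Concatenate all entries into one
    ffdata blob and record name, offset, and size in the ffindex file."""
    offset = 0
    with open(data_path, "wb") as data, open(index_path, "w") as index:
        for name, payload in entries.items():
            data.write(payload)
            index.write(f"{name}\t{offset}\t{len(payload)}\n")
            offset += len(payload)

def read_entry(name, data_path, index_path):
    """Look up one entry by name and slice it out of the ffdata blob."""
    with open(index_path) as index:
        for line in index:
            entry, offset, size = line.rstrip("\n").split("\t")
            if entry == name:
                with open(data_path, "rb") as data:
                    data.seek(int(offset))
                    return data.read(int(size))
    raise KeyError(name)

write_ffdata({"1abc": b">1abc\nMKT...\n", "2xyz": b">2xyz\nMVL...\n"},
             "pdb.ffdata", "pdb.ffindex")
print(read_entry("2xyz", "pdb.ffdata", "pdb.ffindex"))
```

The point of the design is that random access to any of the 170,000 entries only needs one seek into a single file, instead of 170,000 tracked datasets.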
And if I do need parallelization later, which we also ran into, then I split the ffindex/ffdata pair into two, three, or five pieces and do it that way. What this also requires is tools, which we had to add; they are the dbkit ffindex tools. Those tools basically allow you to create these pairs, merge two of them, split them, basic collection-like operations on these single files. Uncompressed, the file was initially 150 gigabytes, but we have since moved on and compressed it.

That points to the fact that when we develop tools for Galaxy, in my case it was easier because, fortunately, I took the route early of rewriting everything in Python. I think it made things much easier; I would have had many more problems if I hadn't done that, particularly considering the updating process. With C++ you have to compile it, you have dependencies, and so on, so fortunately I started rewriting very early. It also shows that tools sometimes have to be customized for Galaxy. Galaxy itself pushes you to write tools better, with fewer dependencies, clear parameters, a clear interface and so on, but if the tool is not in that state, if you want to do more complex things instead of just running it, then it might be a consideration for other people to just rewrite it. Some of my tools require downloading large directories, so that has been separated: the data acquisition has been split from the algorithm itself into modules. Everything comes with pros and cons.

One of the differences is that log files are sometimes not visible, and that differs between usegalaxy.eu and usegalaxy.org. Since I ran the pipeline on both systems, I already noticed several differences, and that's one of them. I understand that it's a security consideration, which is why the log files are not visible while the job is running; even if you flush your data into them, you cannot see them in the interface, while on usegalaxy.org you can. Seeing them is very helpful, because if you have a long job, which might take a day or two (which is not a problem because you have to do it just once, say the indexing; it would be possible to parallelize it, but that's unnecessary and possibly adds complications), then you are blind, because you don't see anything until the job is complete. Otherwise, you could detect issues directly and continuously, say take a look every ten hours and see that everything is still on track.

We do have user interface bugs, several of them; we have the paper cuts effort and are aware of it. Many of these things are related to the tension between adding new features and Galaxy continuously being developed: we continuously provide new elements, but sometimes that comes with a consequence and a price, particularly when new features are added. I feel that we sometimes, and I'm possibly guilty of that too, push a new feature although it's not entirely ready, and then say, okay, we will go through the cycle and we have two weeks to fix things. Sometimes resolving the resulting issues, because the planning was not done right, takes much more time even than that, and that's an issue and it holds us back, because if we just push additions to certain components or modules into the code, it can mean that further additions have a much harder time, just because of how it locks in the code.
So you have to go back and redo what was done, at least the last parts, to restore the modularity of certain components, and of course there are as many opinions about how it should be done as there are developers.

What I noticed particularly in this project was that sharing histories, for example, did not work right for large histories. You click and you don't get any response; your sharing URL isn't there and you don't know what's going to happen. Then you check about ten minutes later and see the history has been shared, but the user interface doesn't respond. We know why this happens, and I know the working groups are working on solutions, the back-end group too. We need a different infrastructure, because right now we make these API requests and wait for a response, but the history is too large, the request takes too long and times out, and you don't get a response. In Tool Shed installation, for example, we had the same issue but resolved it differently: we submit a tool installation request, don't even look at the response, forget about it, and make a separate request to the database to see what the actual installation status is. So we separated the two things, and that workaround would work here too. It's probably a common thing to do, but we will have better approaches; as far as I understand, the strategy of the back-end group goes in the direction of WebSockets, something like that.

Okay, we do have new workflow editor features, and sometimes they lead to bugs in unexpected places. What I noticed, for example, is that changes are not always detected. A simple feature like hiding or disabling the save button, which is nice because you don't want to see the button if you can't save anything, also requires that we really catch all possible changes and track them correctly. What happened a couple of times is that you change something and the save button remains disabled, and then you have to, for example, shift a node to trigger the change. There's also a tooltip issue with the save button. Both of them are paper cuts, but you still run into them, and it can be annoying, particularly from the perspective of someone who's new to it. For me, after a minute it's clear; there are some things I don't do, and I know how to work around it.

Another thing is that outputs can sometimes only be marked as outputs after reloading. You insert a node; it's a bug I'm actually working on, a recurring bug we had solved before and which, I believe, surfaced again due to other changes. So you add a node and you can't tag its outputs; you have to save, reload, and then tag it. You also cannot navigate back in Galaxy; we have routing issues. You can't just press the back button, it's not going to work, and maybe it's not possible for it to work everywhere, but it's something that would make things much more intuitive if we got it right. There are some newer pieces, like Vue, which makes sense, and we have some issues with it, for example a console error. I put this in here because JavaScript is very sensitive, as we know: if there is an error somewhere, even an unrelated one, it may interrupt the code you're actually running, so there can be conflicts, and we sometimes see obvious errors in the console.

Okay, invocations, grids, and accessibility.
I know we've put some work in and some of these issues are known. The invocation grid currently displays all invocations; if you have a lot of invocations, you have to wait until it's finished before you can actually access the page, and it gets slower every time you run a new workflow. The history grid becomes inaccessible too: again, you click, you just wait, you don't do anything else, and it will appear after a while. But it's an issue. Deleting history content takes a lot of time and sometimes requires deleting individual items. Intuitively, I would sometimes like to just delete something and really make it disappear, make sure it's not there, not indexed, just really gone. I understand that we want to keep data available and we tag it as deleted, but that doesn't help if you run into performance issues; the deleted data still might affect them. So it would be nice to be able to really delete things, if that's something we would consider.

There were some Tool Shed issues too. It has gotten much better over the last years, but there were still some. For example, in step eight or nine, I don't remember which, when you push to the Tool Shed your tool repository gets updated, but there might be version conflicts. You fix something very minor and you don't always bump the version; maybe that's the wrong approach, maybe I should always update the version, but it would accumulate so many versions through this iteration process that I'd rather not. What can happen then is that there are delays before your Tool Shed repository becomes available; those are the version conflicts, and sometimes you have to change the version or change something else and try again to trigger the refresh.

One issue I had initially, and there were plenty more that I don't have anymore, was, I think, from when we switched from Python 2.7 to 3, where there were quite a few conflicts. My approach was to remove everything, switch the whole thing cleanly to Python 3, and then add Python 2 back, and now I know how to switch between the versions in case I need to look at an older release or something. Overall that was mostly a Python 3 issue, and there were some very cryptic errors. Tool dependency handling didn't work: I run Planemo, start up my instance, see the tool, and everything looks fine, but when I run the tool it just says it couldn't find the dependency. There were no errors or anything, and the solution was to remove all my miniconda directories and my Python versions manually and then install everything fresh.

Job runtime limits are an interesting topic, I think. Of course we need these runtime limits, because they allow us to control compute time and make the most of our infrastructure by using it efficiently. But they can also force you to parallelize jobs you don't want to parallelize. What I noticed is that there is a difference: usegalaxy.eu has a 30-day limit, which works great, and I think no job should run longer than that; for usegalaxy.org it's 60 hours. So if I have a job which takes two or three days, say indexing the PDB, which I want to do once and don't really care whether it takes three days or even four, then I have to go and create a workflow and parallelize it in order to get it running.
And that increases the complexity, and the redundancy too, because when you parallelize it you might end up duplicating computation in certain parts. For example, for indexing the PDB you need to create the PSI-BLAST database first to run against, and every job then needs to create that PSI-BLAST database. In this case it takes maybe only five to ten minutes, but it still adds up.

So, looking at zip files: we made an addition so that these ffindex/ffdata pairs now also allow zipped entries. Initially I didn't do that because I wanted to keep it simple. So when you concatenate your files into this ffdata blob, basically a single file, you can now also use zipped entries: you can pack your files and then push them in there (there's a small sketch of this at the end of this section). Of course it saves a lot of space; the PDB dropped from 150 to 30 gigabytes, and a factor of five is pretty common for text compression. We had to adjust the tools again, of course, because now they operate on ffindex/ffdata pairs whose entries are zipped, and that caused a couple more iterations to get going.

A problem is that the user has no control when uploading such files, which is another issue: when you upload a zip file, it just gets unpacked. If you have the file locally, you can actually use the packed file, which means that for test cases you have to distinguish between uploaded files and local files. That adds additional complexity, of course, and it's somewhat unintuitive that we cannot upload a zip file and say "keep it as a zip file"; it causes quite a few issues. I understand the feature: it initially sounds intuitive that you upload a zip and it gets unpacked, but there is some weakness there. We also have limits on job output file sizes, I think. I haven't hit that limit lately, just because I avoided it, but I remember hitting it. There was a huge zip file and I just needed four gigabytes of it, but the file itself was 40 gigabytes, and when unpacked it was much more. So I couldn't upload that file; I had to download it to my local machine, grab the files I needed out of it, and upload only those.

Okay, Conda packages. Of course they are great for versioning, and it helps a lot to have these specific environments our tools run in; it makes everything very modular, so I think they are definitely necessary, but they require quite a few additional steps. This is basically a summary again of the first ten steps, and there can be, as I said, server-specific complications, which I ran into, and that makes it difficult to debug the code, to figure out what actually went wrong on the server, and then reproduce or guess what it was locally and go through the cycle again. It can be very frustrating for someone new to it.

Okay, the Jenkins testing is great. I had a lot of help with this; the necessary files to automate the tests were basically written for me, and it's extremely helpful, it made life much easier. But the testing is still sometimes cryptic and unstable, and for someone who is brand new, I don't know; I didn't see tutorials, and I'm not sure how that person would actually be able to produce these files to get this automated testing, since it's very specific to our infrastructure. Basically the idea is to copy it from an existing tool or repository and then make adjustments; it's not something someone could really do on their own. So there's a blocker there.
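Coming back to the zipped ffdata entries mentioned above, a minimal sketch of the idea, building on the earlier write/read sketch and again using hypothetical helper names rather than the real tools, is to compress each entry before concatenating it and decompress it after slicing it out:

```python
import gzip

def write_ffdata_gz(entries, data_path, index_path):
    """Same idea as the plain ffindex/ffdata sketch, but each entry is
    gzip-compressed before being appended, so offset and size refer to the
    compressed payload."""
    offset = 0
    with open(data_path, "wb") as data, open(index_path, "w") as index:
        for name, payload in entries.items():
            packed = gzip.compress(payload)
            data.write(packed)
            index.write(f"{name}\t{offset}\t{len(packed)}\n")
            offset += len(packed)

def read_entry_gz(name, data_path, index_path):
    """Slice out the compressed entry and decompress it on the fly."""
    with open(index_path) as index:
        for line in index:
            entry, offset, size = line.rstrip("\n").split("\t")
            if entry == name:
                with open(data_path, "rb") as data:
                    data.seek(int(offset))
                    return gzip.decompress(data.read(int(size)))
    raise KeyError(name)
```

Random access still works the same way; only the per-entry payloads shrink, which is where the drop from 150 to 30 gigabytes came from.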
Selenium tests, when they fail, can be very cryptic; they're difficult to find, to read, and to resolve, so that requires quite a bit of training to get through. And of course, datasets: dataset size limits. As I said, large files are a challenge, and I know we are working on additional approaches to improve our data management for them. We have to, because if someone really works with Galaxy and has to manually create these files, or make sure they exist, there's no way of getting them into the system without an admin, at least on a public instance; on a local instance you are your own admin. So that's a challenge. We do have the history, we do have data libraries, and we do have other upload routes, but there's some conflict there: intuitively, if I were new, I'd say, okay, the history didn't work, I'll do it through the library, and that doesn't work either. So the data management part is really important.

Okay, and these are the other issues I ran into, my last slide. I noticed that sometimes the workflow step ordering fails, and I think I know why: there are specific situations where a job actually starts running before the input from the other job has been generated, which shouldn't happen. I can pretty much reproduce it; I dug a little deeper and then left it there for now. And as I said, even copying datasets can be challenging, and our grids generally have issues. For example, if you have started a workflow and you want to cancel it, I don't know how well that works. Ideally you're an admin and you can go to the admin interface, but then you have to access the jobs table there, which, since it shows all running jobs, can be really slow. It's just not easy to say, okay, I want to cancel this workflow, or at least the remaining steps, just take them out; it's tricky, and while your jobs are in there you can't do anything else.

Okay, and as I said, there are sometimes differences in behavior between instances. A big challenge, to summarize it to a certain extent, is that for approaches like this we often compete with specialized servers. Of course they are entirely inflexible, but someone who says "okay, I want to do this now" can just go to such a server, which has a simple page where you paste in two sequences, press OK, and get an email when the results are there. So in certain ways we have to get closer to that simplicity. And there were a lot of improvements, particularly with Björn's contributions: getting the large files in there, having everything zipped. Before that, I had one script, this dbkit create script, which is also able to download entries from sources, so I literally downloaded the entire PDB just to get the job running. That was actually my approach; fortunately we don't have to do this anymore, so that's not a worry, and if someone runs this workflow now, they should be fine; it's very efficient now.

But yeah, that was my experience. Does anyone have questions or comments? We put a lot of comments in the chat. Okay. It's a lot of things; I don't know if we maybe want to go through the slides and see whether something can be distilled, or if there are a lot of comments to be made, but I don't know if there's a point doing that now, it would take a few hours. Sam, we need to put these slides somewhere accessible.
I just pushed it into the chat, but the chat has no memory really; the Galaxy Lab maybe, yeah, maybe the Galaxy Lab, but okay, we'll see. It's just for when we plan, for example, the next period, not a quarter but the four-month period, to look at this again and at all the paper cuts. Yeah.

Yeah, I think Wolfgang made this point too, really quickly: as a power user the multi-history view just doesn't work, and if you need to copy a collection into a new history you can just use the copy datasets option, which works a lot better. But is it faster? Well, the multi-history view just doesn't work at some scale, right. Yeah, I don't know, would it be better with the new history panel, or does that not fit into it? Not right now, but it would work much better if we had the multi-history view like that; still, for copying datasets, I don't think the multi-history view is going to be your best option. When you use the copy datasets interface, it's explicitly a specific operation of complexity one, instead of showing all your histories and the whole thing. Yeah, does it support copying collections? Yeah. Okay, cool. Maybe it should be renamed then, I guess, given the potential confusion; copy items or something.

There's also, we've added, or did we add it, the idea of a new history switcher or picker. One idea would also be that you can just, I don't know, have a copy button, or Ctrl-C or Cmd-C, whatever you have: you select whatever you want in your history, switch history, and do Ctrl-V, as sort of the fastest way to do this. I was actually wondering if we should generalize that history picker with the history contents picker that Sam made a year or so ago, and just have one universal picker modal that you could pick histories or contents with; the props would determine what it shows, of course, but you'd have a single consistent interface for choosing artifacts in Galaxy. Yeah.

I mean, for this particular case, for the datasets, it would maybe be intuitive if we could do it directly in the history: you see the dataset, you click on it, you get an option that says copy this dataset to another history, and then you just get an interface, or a select field or whatever, where you specify the history and the dataset is copied over there. I think that's the idea with the batch operations, right, or maybe that is something we can add to the batch operations, where you filter stuff in your history and then you can do actions on it. Right, I think that'd be the best. It is in the list of operations. Okay, cool.

And then, I think, for me personally the biggest challenge was the large datasets: dealing with them, getting them in there. And I think there are a few things about that. I think your ffindex approach, I mean, it's not a workaround, that's the way to do it. We can make dealing with 20,000 datasets in the history faster to some degree, but I don't think there's a point in having each individual protein as a dataset in Galaxy. I really think that's the way to do it.
We should aim for being able to process as much as possible, and individual datasets are sort of the way we do map-reduce and things like that, but there's a limit to how many things we can reasonably handle, because that also leaves traces in the database. I think indexes are exactly what you would do in this situation. There's something we could maybe do transparently: an option to not keep track in Galaxy of all the datasets, so that when you submit what you would normally map over, it instead does the ffindex thing and modifies the command to do the explosion on the cluster or wherever. That's actually super similar to the old task framework that we abandoned, the one from years ago. Yeah, the first incarnation of that would take a big file, rip it into a number of pieces, and then run the same job on each; that's exactly what it did, and it threw away all the metadata because you, most of the time, didn't need it. It was never great; I wrote it, so it's okay to say that, but that was kind of the gist.

Yeah, and I think a data-type-driven way to do this totally makes sense. And I combined the two at one point, basically having a collection of these pairs, ten of them, for the top hits. But maybe the question is: is the current integration of handling these ffindex/ffdata pairs fine, or should it be more integrated into Galaxy? Right now it's just an additional tool; maybe that's fine, because collections are of course native to Galaxy and this is just a tool, but maybe that's not that critical, it would just make it more easily presentable.

The question would be, and I've brought this up a couple of times: if 20,000 is the limit, we don't want to make a statement and say "don't use more than 20,000", right; it's up to people, and they're going to run into it and figure it out. But sometimes the question is, would it not make sense to say, okay, this is too large, do you want to split it into sub-collections or into these pairs? That's a good question. I think what you have to keep in mind is that in most cases one dataset corresponds to one job, and from there on, I don't know if there is an automatic way to say keep them together or split them. But I also wouldn't say that 20,000 is the limit; it depends what you're doing. Overall, the throughput on a good, healthy Galaxy system for creating jobs is about 60 milliseconds per job, and you can do the calculation for 20,000 datasets: serially, on a single process, that's 20 minutes. So that means if you schedule a workflow and you reach a step that creates 20,000 jobs, that's going to take 20 minutes at this point. Maybe that's not too bad; if the interface still works and stays fast, then it might be fine.

Yeah, the other thing is that we're going with Celery and background tasks. The idea is that when these things take a long time, you just send them to the background, you get back a task ID, and you can check what the status currently is. That's the main reason we're going to do Celery, and apart from that, it also allows us to parallelize over these 20,000 things.
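The background-task direction described here follows a fairly standard Celery pattern: the request returns a task ID immediately and the client polls for status instead of waiting on the HTTP response. A generic sketch, assuming a Redis broker running locally and using a made-up task, not Galaxy's actual task code, might look like this:

```python
from celery import Celery

# Hypothetical app and broker; Galaxy's real configuration differs.
app = Celery("example_tasks",
             broker="redis://localhost:6379/0",
             backend="redis://localhost:6379/0")

@app.task
def prepare_dataset(dataset_id):
    # Placeholder for a long-running piece of work (e.g. copying a big history).
    return {"dataset_id": dataset_id, "state": "ok"}

# Fire-and-forget: enqueue the work and return immediately with a task ID.
result = prepare_dataset.delay(42)
print(result.id, result.status)  # e.g. PENDING, then SUCCESS once a worker ran it
```

The same separation of "submit" from "check status" is what the Tool Shed installation workaround mentioned earlier does by hand, and it avoids the timeouts seen with large history sharing.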
Because a lot of it can be broken down, you can run it in parallel, and these tasks are not as heavyweight as running a full Galaxy process, so you can also run many more of them than is currently possible with a handler. You can also limit them, because one thing that you may not have noticed but other people have is that they're waiting for their workflow steps to schedule. So if we break this down into smaller pieces and put certain rate limits on it, which again is something this gives us, we'd be in a position where we can handle this easily, and we should be able to go higher. There are also still optimizations to do that are not terribly hard. But we're also reaching a point where, overall, creating jobs for large collections is probably about ten times more efficient than last year, and maybe there's another tenfold in it; it's doable, but after that it gets hard to make the individual operation faster. And that's always excluding some weird performance bug, which can always happen.

I really think it's the interface, to be honest. If the interface would play along, with the ideas you mentioned, the Celery workers and so on, then it wouldn't be a big deal if it takes 20 minutes to schedule those 20,000 jobs; no one would complain about that. Yeah, and I think the same applies to the other things you mentioned, like things that time out: not a problem, just don't time out, just say we're working on it, right. And all that is sort of why we're doing this modernization on the back end. Yeah, that's exactly the solution for the massive copy-history task that takes half an hour or whatever. It's also ridiculously inefficient in that there is a whole lot of: one dataset flushes, all the attributes get expired, we need to load them again from the database, and so on and so on. So that in itself can be made faster, but again, there's a limit, and whatever it is we need to do, we should be doing it in the background.

The other issue was: is our goal to keep going with the library and extend that, or do we modify the history, or how does someone who finds, say, a 40-gigabyte dataset on the internet, maybe 100 gigabytes, get it into Galaxy? Is there a way to get it into Galaxy as a user and even share it, maybe? I didn't quite understand the problem there. Well, there's a max job output size, right, on Main; it was, I don't know, 20 or 40 gigs, something like that. Sam's file was bigger than that. We've got to have a limit somewhere, so for him we bumped up the size to make it all work, but what does a normal individual do when this sort of thing happens to them? We could try to accept that just for upload and data source tools, I guess, but the point of that option was that we had people running joins that would just run forever and produce an infinitely sized dataset until we ran out of space. So we could reconsider what would be reasonable. And when your quota is 250 gigs, granted we give people up to a terabyte for temporary purposes, but when your quota is only 250 gigs, a 50-gig output is a lot. Maybe something we could do, which is of intermediate difficulty, is to say we give you half your quota size as the maximum output. Yeah, I like that. Maybe up to your remaining free space, plus an extra amount.
Yeah, as an easy first step, just allowing the upload or ingress tools to bypass that, up to some much higher limit, seems reasonable. Those aren't going to be runaway jobs in the way that join always was. Well, they can be; we've had people, although it hasn't happened in a long time. Another case was people selecting stuff from UCSC that would return entire tables. I don't know if they've made some changes that prevent that, but it hasn't been an issue lately.

Yeah, if that would work, if I had been able to upload as much as my quota, or half of it, and if I didn't have the issue with the automatic unpacking of zip files, that would have been nice. I would have noticed the difference; even though we would still have made the improvements we made, because they were extremely useful, I would have been able to run things much more smoothly with those two things: the automatic unpacking and being able to get the file in there.

How is the job size limit actually enforced? Nate, I guess. It probably only works on local runners, not Pulsar, and it only works on defined outputs, but essentially it loops over the outputs while the job is running and it will terminate the job if any one of those grows larger than the limit. Okay.

Also, Sam, just a quick interjection: you can upload zip files, you just need to say that they are zip files, and then they will not be unpacked. Oh, okay. So it's this trade-off, right: if you don't say what it is, Galaxy makes a guess. Okay, but yeah, that would be a solution; I thought I tried this, are you sure they're not unpacked? There are several settings that influence this, but on the instances I know, it works fine. We don't have a "pick files to extract from a zip file" tool though, right? So if you go to extract it, you're just going to get everything dumped out of it, which is going to hit the limit. That's right, but, you know... Sam, maybe you check that; I also got a bug report, or maybe that's actually a bug in 21.01. I have a similar bug report with zip files and so on, so maybe we should check that again. Okay.

And also, I think a lot of the things you mentioned already have issues filed for them, but we should maybe go through the list and see if there are some that don't. Yeah, at least for the UI bugs there are definitely issues for some of these, but absolutely, it would be great to go through this and put them on the wish list or roadmap, or something to keep track of. I mean, I think our bugs are well labeled, right: go to the bug label, pick something; they're all worth fixing. I try not to label things that I don't think should be fixed.

I think a larger issue that we could maybe discuss in the remaining minutes is what Sam also mentioned: we currently have no feature, and probably no idea how, to actually get rid of complete histories or complete users forever, in a 100% way. I think Jen complained that her account is currently so big that she cannot use it, and she really wants to get rid of complete histories. I'm doing the COVID monitoring now, where we really produce millions of jobs, which is fine and scales nicely, so that's cool; I'm just worried that I will want to get rid of that in a few years and I cannot.
And it would be super cool if we could just drop the Zaskoff account at some point and also remove the 10 million rows in our database, as an example. I think that's worth investigating, because I'm not sure; our users are using collections, and large collections, really heavily nowadays, and they also do exploratory research with them, so they create a lot of collections that are maybe useless and maybe get bigger than they should be. What I would say is that we need a mechanism to be sustainable, to really also get rid of things in the database. And this is what I'm a little bit worried about with the COVID effort now: I'm happy to analyze all these datasets, but essentially we will end up with 20 million jobs that I would like to get rid of somehow at some point. So we need to think about the strategy for how to do that, and what we need to keep track of in our database, because I don't think it's a sustainable solution that we currently keep everything forever in the database.

What about changing what history purging actually does and transforming it into actually purging everything, the history and the datasets, from the database? Yes, if a person selected purge, then it should be purged. Yeah, but my understanding is that this is actually a technical limitation: even if I delete users permanently, they don't really get removed; the user still exists in the database somehow. We kind of replace the name with some random hash or something, but it's still in the database, so it still wastes space with all the jobs. Correct me if I'm wrong, but I don't think we have a mechanism to really get rid of a user with all the associations, or a history with all the jobs connected to it, and so on. Yeah, that's right: in all the major tables we don't ever delete anything, we just mark it as deleted. But that's a choice, right; it would be a significant effort to make sure nothing breaks, but it's also not impossible.

How much space are we actually talking about? Just pick whatever arbitrary unit of work we want to talk about, but how much space are we talking about in the database for, say, one of the big COVID runs? I don't know; I might be able to come up with that. The total Main database is now over half a terabyte, and most of that is in the HDA table, I believe.

Also, regarding slowness of the interface for people who have a lot of datasets: a new user is doing fine, right, you create a new account, you browse around, and you're amazed at how fast things really are. And I think a lot of the slowness is because we do limit/offset queries, and that's not super efficient when there are a lot of rows to skip over. There are techniques we can use instead of limit/offset; we could use keyset techniques, where you still limit but you do a comparison on update time, for instance, which in theory should have much less impact. I don't know if that's all there is to things being slow, but maybe that's something worth trying. Also making sure there are no missing indexes. Yeah, that too, but we have a script now, so we shouldn't be creating those anymore.
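To make the limit/offset versus keyset point concrete, here is a small self-contained sketch using SQLite and an invented table, not Galaxy's actual schema: instead of skipping N rows on every page, the keyset query filters on the last seen (update_time, id) pair, which an index can satisfy without walking past the skipped rows.

```python
import sqlite3

# Toy table standing in for dataset-like rows; not Galaxy's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY, update_time TEXT)")
conn.executemany("INSERT INTO dataset (id, update_time) VALUES (?, ?)",
                 [(i, f"2021-01-01T00:00:{i % 60:02d}") for i in range(1, 1001)])

PAGE = 50

def page_offset(page_number):
    # Offset pagination: the database still has to walk past all skipped rows.
    return conn.execute(
        "SELECT id, update_time FROM dataset "
        "ORDER BY update_time DESC, id DESC LIMIT ? OFFSET ?",
        (PAGE, page_number * PAGE)).fetchall()

def page_keyset(last_update_time, last_id):
    # Keyset pagination: remember the last (update_time, id) seen and filter on it.
    # Needs SQLite 3.15+ for the row-value comparison.
    return conn.execute(
        "SELECT id, update_time FROM dataset "
        "WHERE (update_time, id) < (?, ?) "
        "ORDER BY update_time DESC, id DESC LIMIT ?",
        (last_update_time, last_id, PAGE)).fetchall()

first = page_offset(0)
last_id, last_time = first[-1]
print(page_keyset(last_time, last_id)[:3])
```

The trade-off is that keyset pagination only supports "next page" style navigation relative to a known row, which fits an infinitely scrolling history panel better than arbitrary page jumps.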
What is our plan for really getting rid of stuff in the database if people choose to remove their accounts? Why do we keep that? Should we not investigate, with a proper test first, of course, whether we can actually get rid of, say, half of our users if that is needed, so that the database is also no longer half a terabyte but maybe just 100 gigabytes? One thing that affects is whether you need any kind of historical data, to figure out, for example, how many jobs and how many hours we computed on a specific resource. It's less of a concern for you, probably, but in my case, at the end of each XSEDE allocation period we have to go back and look at how many jobs and how many hours we ran on this cluster versus that other one. We can always extract the stuff we think we will need into a separate database; I don't advise keeping everything. We'll come up with some number. We could also look at which columns are really worth deleting; there are some columns that are heavier than others and that maybe nobody needs to look at anymore. Yeah, like you just saw, the peak... Yep, exactly. That took up much more space than I expected.

The problem with deleting users and datasets is that if they shared anything, it can be shared with another person, and then deleting one of these records could leave something hanging. It's not that you can't solve that; we already do some kind of reference counting with the datasets, so it's not trivial, but it's doable, right. Yeah, I'm not saying it's not possible, but it's going to be a bit of work, is what I'm saying. What you said about historical data, about what happened in the past: it doesn't have to be in the same database we use for running jobs now. It could be some kind of data warehousing effort; you put it there at a certain break point, use it for reports about the past, and purge everything from the current database we use for live work. Yep. Yeah, I agree, we could definitely do something to move it out of band.

Okay, I think we've gone over time, so I'm going to stop the recording. Thanks so much, Sam. Thanks so much for the great conversation, everyone.