Yeah, hello, my name is Laila. I'm from Freiburg, and our first session will be about New Galaxy Futures. First up is Marius, talking about the workflow editor. Where is he?

Yeah, thank you. Well, it's a big honor to do the first of the regularly scheduled talks; I'm glad you all made it. As Laila said, I'm going to talk about a few different things that are happening with regard to workflows in Galaxy in general. I think the major thing we've worked on during the last year, and actually the last couple of years, is to make workflows more reasonable. That often meant making them more flexible, so that you don't have to copy-paste your workflow around or edit it on the go; you have one set, high-quality workflow that you can use everywhere, share with your colleagues, and maybe say: hey, actually, running a workflow is much easier than running a tool.

A super brief history of workflows in Galaxy: workflows have been available since at least 2008. That's before I joined the community, and I saw on my visit to Penn State that there was even a nice, cool t-shirt all about the workflow editor noodles. It was a great t-shirt. Since then the editor has evolved to meet the community's demands: datasets have grown not just in size but also in number, and if you look back, we've really come a long way. But that doesn't mean there aren't great ideas we can still pursue.

Personally, I think this is the fastest and most reliable way to do analysis. One of my earliest experiences in Galaxy was: whoa, I found this really cool thing, but I actually just chose the wrong input dataset. That's something that doesn't really happen in a workflow where you've chained everything together. So it's very reproducible, as opposed to doing things manually. Whatever you do in Galaxy is quite reproducible, but that doesn't help you if you did the wrong thing. It's also reliable because we record every parameter at every step of the way, so you can go back and say, well, that's a weird result, and then you see: oh, I did the wrong thing. That's exactly what happened to me in my early days. And I want to say that running a workflow should really be easier than running a tool.

About the state of workflows in Galaxy in general: there's a really broad spectrum of how refined workflows are. It goes from an analysis that you did in your history and then generated a workflow from, which is super cool: it's a workflow, you went along, you did your analysis, it looked good, you deleted the stuff you didn't need anymore, you extracted the workflow, and that's amazing. But it's only going to work if you have the exact same set of datasets that you used. If you just want to change the reference genome, or you have a bunch more datasets, that's likely not going to work right away; still, it's a great start. Then, let's say, you refine it a little bit (I get into this in the workflow training later today): you can tweak it some, you can add test data, you can submit it to a repository.
And then you have some really high-quality workflows that ideally should be used widely. Sharing is caring, so you can submit workflows to the GA4GH TRS servers, that is, WorkflowHub and Dockstore, and inside Galaxy you can discover these workflows, install them, and run them at the click of a button.

Another thing that I think is important to remember is that creating a workflow requires scientific domain knowledge, plus a bit of trial and error; that's always involved. At the very minimum you have to check that the workflow actually created what you think it should have created. It also requires some knowledge of Galaxy concepts, and the more powerful your workflow, the more concepts you need to be aware of. But running a workflow should only require domain knowledge. I think that ties in with the first talk: we want to make it as easy as possible. There is a gradient of skill from workflow author to workflow user, at least in terms of computational knowledge. So really, if you create a workflow, you should be able to hand it off to a student and just say: these are the input datasets, this is what you need to be careful about, and make sure the result looks good. You can always drill deeper, but just running it should be the easiest part, and because all the parameters are already set, it should be easier than running a tool. Oftentimes we introduce users to Galaxy by saying you can run a tool on your dataset, but maybe we should say: hey, start with a workflow; if you need more detail, you can always go back to the tool level.

It's also important for users to easily find workflows, run them, track what they are currently doing, and archive them, because you don't want to keep them around forever. And I think it's important to create high-quality artifacts for people who just don't want to look at Galaxy: your bioinformatician colleagues may prefer to use Vim or nano to explore everything, and they're not going to be super happy with something that only lives inside Galaxy.

So in the broad sense, that's the direction we want to go in, and now I'm going to show you some cool new things that happened during the last year. I went around saying it's not a rewrite, but in the end it's definitely at least a partial rewrite of the workflow editor, and we've made it completely reactive. What you see here is that I'm editing some things at the top, and the data below is the internal state of the Galaxy workflow; it corresponds more or less to what you would download. It's not exposed in the user interface; it's just something I used during development to make sure that when I do things, the state really reacts in real time and we didn't miss anything. If you've used the editor extensively, you may have noticed in the past that sometimes you did something, saved, reloaded, and it wasn't actually there. So this is an easy visual way to confirm that things are really reactive.
What you do really applies.

Another thing that long-time users have probably noticed: if you take a workflow and upgrade a tool, and the tool has gained an output or maybe lost one, the affected connection would just disappear, and you wouldn't even know. That's not great. Instead, we now highlight in red the things that aren't present anymore, along with the affected connections, so you know you have to do something about it if you want to run the workflow. That makes it a whole lot easier.

We've really improved panning and zooming. That's work I started; I didn't do a great job, then Laila came in and made it really awesome. It works with the mouse, it works with the keyboard, and it also works with a touchscreen; this clip I recorded on my phone, just playing around with it. It's really, really cool.

We have improved keyboard navigation. With the Tab key you can jump along the steps in your workflow, with Shift+Tab you go back to the previous step, and a little outline shows what is currently active. If you hit Space, you trigger the action: if you're on an output node, a small menu opens and gives you the available actions, which can be connecting to a compatible step or disconnecting from a connected step. In huge workflows that can actually be easier than dragging the mouse all the way to your next input, because it just shows you what's possible.

Switching between steps is much snappier. It used to be that once you activated a step, the tool form (the same form you get when you open a normal Galaxy tool) had all of its state defined in the backend, so we had to make a request to open it. Now we don't, because all the data is already encoded in the workflow, so it feels much faster and nicer.

Laila added another thing that has been requested many times: you can add tags to outputs. Personally, I don't think you should; you should use connections, but that's a different topic. Come to the training, come to my TED talk, and I'll tell you why you shouldn't use tags for some purposes (for others they're great). Anyway, you can add your tag and it will be displayed right on the node, so if you want to know where a tag got added or removed, you can now see that without clicking through each and every step.

This next one is maybe geared towards power users. When you create a workflow, you can include not only tools but also other workflows; those are subworkflows. Say you have a 20-step workflow and each of your 20 steps has two or three outputs: you don't want to see all of them as options for your next analysis. So what we did is that only outputs that are marked as a workflow output are exposed. It's not just that an output is produced; marking it says: this is an important artifact of the workflow, this is a workflow output. You select these workflow outputs with a checkmark button. We used to have only a checkbox, and that also controlled whether the output is visible in the history. That works well if you're not using subworkflows, but with a subworkflow you may want to show outputs that are not important outputs of the subworkflow, and vice versa: you may want a workflow output that doesn't interest the user at all and shouldn't be visible in the history. So we separated the two.
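For context, in the stored workflow format this marking shows up, roughly, as a workflow_outputs entry on a step. Here's a simplified Python sketch of that fragment (field names from memory, so verify against a real .ga export; the tool and labels are placeholders):

```python
# simplified sketch of one step in a downloaded .ga workflow, as a Python dict
step = {
    "id": 3,
    "tool_id": "bwa_mem",  # placeholder tool id
    "workflow_outputs": [
        # listed here = "this output is an important artifact of the
        # workflow"; visibility in the history is now controlled separately
        {"output_name": "bam_output", "label": "mapped_reads"},
    ],
}
```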
It's something simple, but it makes creating workflows that reuse other workflows much easier. It looks small, but it's a big deal.

We got an infinite grid, again great work from Laila. It used to be possible, when you create a workflow with 200 or 300 steps, to move your workflow right out of the little grid (it's probably barely visible to you; I can barely see it myself). Now she did some magic: it's an infinite grid, with smaller grid lines and larger grid lines. If you used plotting paper in school, it looks a bit like that.

We have something else that again doesn't seem like a big deal but is really cool. It's an idea that Laila started in the tool form, and it also makes a lot of sense in the workflow editor: making the distinction between an optional input and a required input. Some tools can take optional datasets; I don't have a great example in mind, but you probably know tools that can work with additional datasets that they don't strictly need. You don't need to connect those, but the required ones you do, and the editor now shows that distinction. (I have 20 minutes, right? It's set to 15? Damn it. All right, well, it's not your fault. I'm sorry, I should have read this. Well, a lot of cool things.)

The next thing is conditional workflow steps. You may have a step that should run under certain conditions and not under others. Say you have a pre-processing step: something needs to happen to your data first, and sometimes you've already done it and sometimes you haven't. You can introduce a conditional and then only run the pre-processing step if either it was detected as necessary or the user said to do it. Another application is separating data. For instance, you can have a tool step that checks: is the data good enough? If yes, it runs the main analysis; if no, it just says, well, here are the datasets that weren't good enough. You can also do more advanced things: you can start with your dataset and have a step that says, well, this is single-end, this is paired-end, and this is Nanopore, and there are three different paths in the analysis. You can then use conditionals to route each part down its path. Even if you start out with a dataset collection that has all of them mixed together, they will all be processed separately up until this step, and then you can join them back together and get a unified output. How it works is that you toggle on a little switch, the "when" appears, and you connect a Boolean parameter to it. That Boolean parameter can be something the user sets, or something you calculate in your workflow based on other outputs. Then you run it: you select your workflow, you select the dataset, and there is the Boolean parameter for whether or not you want to skip the step.

And much better error reporting. We have invocation export, and a lot of new workflows: the IWC is the top organization on Dockstore, we have the most stars, so you can find all those workflows there. With that, I want to thank the people who are bringing the workflows to life, and please join them. Especially I want to thank Lucille, who did a lot of new workflows for the IWC that are really high quality;
Boatman, who's been pushing this for a long time; and Laila, who did a lot of the work I presented. Yeah, really, thank you all.

Yeah, thanks a lot, nice talk. Next up is Ahmed, talking about search in Galaxy.

Hi, my name is Ahmed, and I'll be presenting some recent improvements made to search interfaces in Galaxy. As we know, users often find themselves searching for stuff in Galaxy, such as histories, tools, workflows, and whatnot, and the dev team has been working hard on making uniform search interfaces all across Galaxy, because we don't want users to face a learning curve when they move from one search interface to another. This is what a typical search looks like in Galaxy: you have a search bar and then an advanced menu where you can add to your search. The two recently improved features I'm presenting today are the history item search and the improved tool search.

The history item search is found on the right side, in your history panel, where you can use it to find datasets and collections in your current history. You can either simply search with the name of the item you're looking for, or add colon-delimited filters to the query bar; or you can use the advanced menu and type your filters in there. There's a long list of these filters; I won't go through all of them, but we've added more, so now we have more than we used to in the past. A typical search could look something like this: you type in a name, and the datasets and collections that match show up in the results; and if you have a range of indices you want to search, you just type those into the menu.

A cool feature that we wanted to focus on is the related-items filter. In Galaxy you often run jobs, for example through workflows, and they take inputs and produce outputs. So each dataset can be the output of a job, and more datasets or collections may have been created from that central item. Being able to find those inputs and outputs in a nice interface was not something we had in the past. Now we have a related filter: you can either type "related" plus the index of the item you want to see related items for, or you can expand the dataset and click the sitemap icon, and it will show you the inputs and outputs, identified by arrows pointing towards or away from the main dataset. There are several use cases. For instance, you have a failed workflow invocation: you expand it, look at the dataset, click this icon, and it gives you more context, so you can explore those datasets and see what might have caused the failure. Or you could use this filter to create a new history containing the items related to one important dataset.

The next search I'd like to describe is Galaxy's improved tool search. It's found on the left-hand side, and you use it to search for tools installed on the current Galaxy instance. It has been around forever, but the prior backend implementation needed work: the search wasn't instantaneous, it still didn't produce properly ordered results (you would search for a tool and it would show up way down the list), and we also had results that did not match your query at all. We changed this to a straightforward, direct string match, so now you only see results that match your query, and they're ordered better.
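To give a flavor of what "direct string match with better ordering" can mean, here is a small illustrative Python sketch (my own toy example, not Galaxy's actual search backend): exact name matches rank first, then prefix matches, then substring matches, and everything else is dropped.

```python
def rank_tools(query: str, tools: list[dict]) -> list[dict]:
    """Return only tools whose name matches the query, best matches first."""
    q = query.lower()

    def score(tool: dict) -> int:
        name = tool["name"].lower()
        if name == q:
            return 0  # exact match ranks first
        if name.startswith(q):
            return 1  # then prefix matches
        if q in name:
            return 2  # then substring matches
        return 3      # non-matches are filtered out below

    return sorted((t for t in tools if score(t) < 3), key=score)

tools = [{"name": "Bowtie2"}, {"name": "BWA-MEM"}, {"name": "bwa"}]
print([t["name"] for t in rank_tools("bwa", tools)])  # ['bwa', 'BWA-MEM']
```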
To further improve your search and add filters, you can expand the menu and apply some of the filters there. We have a name filter for tools whose name matches; a section filter, which is linked to the current tool panel view (you can have the default view or other views like the EDAM ontology views, and this field in the menu follows whichever view you currently have); an ID filter that matches the tool's XML ID; a repository owner filter that matches the Tool Shed repository owner; and a keyword search that matches words found in a tool's help text. For instance, a typical search could be: tools within the Mapping section, from the IUC Tool Shed owner, matching the keywords "genome" and "size", and then you'd see the results for that. A minor cool feature that was fun to implement is the fuzzy search, which means you can now misspell a query and we show a little "did you mean" prompt.

So the idea is that the UI team and the backend team have been working on making not only search interfaces but other interfaces that serve the same purpose uniform, so that users don't need to learn every view over again because it looks so different. And with that, I'd like to thank everyone who contributed to the code, and also everyone who gave encouragement and advice. Thank you.

I think we still have time for some questions. Any questions? Right now we only have two ontology views; you were showing EDAM, and what's the other one? So, this screenshot is from Main, I think; on the usegalaxy.* servers we have these three views. And how easy is it to create your own custom ontology views? I think this is something we probably want to get the community involved in, because this is a good start, but there are many different ontology views you could create. I think these we get from bio.tools, but yes, we could obviously add more; it comes from XML files, I think, and it's easy enough to add as many as you want on each server. But how do you do that, what's the mechanism? I think that's just a case of requirements: if we can find views that match the tools that are usually installed, typical tools you would find on most servers, it could be implemented. Yep. Thank you.

Next up we have Dannon, talking about modernizing the Galaxy client.

Okay, this talk is about our recent efforts modernizing the Galaxy web client. The main topics are adopting TypeScript, using auto-generated API bindings, and progress on the migration to Vue 3. I'm Dannon, but this is the work of the whole UI/UX working group; everyone's contributed.

So yeah, we're adopting TypeScript. If you're not familiar, TypeScript is a statically typed superset of JavaScript: you can define types and interfaces. Tons of people use it; the Vue core team writes Vue in it. How strict it is is super configurable, but we're sticking with basically the recommended Vue settings; actually we're a little more strict, with a couple of tweaks, but it's pretty close. We figure we'll go as strict as we can on the initial implementation, instead of trying to retrofit a bunch of fixes into half-baked TypeScript down the road. You build it and lint it with a couple of utilities, and it should be pretty familiar to anyone who has used Python's type annotations: it's the same idea, optional annotations that are enforced by tooling.
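Since the comparison is to Python's type annotations, here is that same idea in Python (a toy function of my own; the annotations are optional and ignored at runtime, but a type checker catches the mismatch):

```python
def describe(n: int) -> str:
    """Annotated like TypeScript: parameter and return types are declared."""
    return f"{n} datasets"

describe(3)    # fine
describe("3")  # still runs in plain Python, but a type checker (mypy,
               # pyright) flags it: argument is str, expected int
```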
And the bottom line is that it's still JavaScript. It's not like a couple of years ago when, at one point, we were considering adopting CoffeeScript, which is another layer on top entirely; I'm really glad we didn't. TypeScript is amazing, and it's still JavaScript.

So why are we doing this? With a statically typed language you can coordinate changes across a big code base, and Galaxy's is big: we talked about it yesterday, and it's something like a million lines of code between the front end and back end. I forget exactly how much of that is the front end, but it's not insignificant. So if you change a method somewhere and the return type changes, and that return type is used in TypeScript, then you know immediately that you just broke code in some other part of the application you weren't even thinking about. It really helps coordinate changes in a very large code base. Maintainability, too: in plain JavaScript you often have no idea what you're looking at, the objects could be anything; if they're strongly typed, it's much easier to jump into something unfamiliar and at least know what data you're working with. The tooling is really the big thing: it's all optional typing, sugar on top of JavaScript, but the tooling enables code completion and inline documentation (we'll see a couple of screenshots of this in a bit), error detection, obviously, and some refactoring support. And the bottom line is just so many bug fixes. When we first adopted TypeScript, the very first file I converted from JavaScript to TypeScript didn't work, because there were bugs. I should have started counting how many bugs it finds; I really wish I had, but it finds bugs all the time.

Building on that, we now have auto-generated typed API bindings, thanks to all the work the backend team did modernizing the API framework. It's all FastAPI now, and it can generate an OpenAPI spec of the exact routes, parameters, responses, and so on from the API in one big artifact. There's an update procedure built in: you just run make update-client-api-schema, it generates the new schema, and it's immediately available for use in the client. This also runs as a check, so if you change the API, forget to update the schema, and open a pull request, it gets flagged and you know right away: whoops, I need to update that. It delineates the layer between the front end and the back end much more clearly. We use openapi-typescript to build types from the schema, and then a little openapi-typescript-fetch library that provides a small fetcher service for getting stuff; I'll show you an example in a second. So with these auto-generated typed API bindings, we now have end-to-end type safety.
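To make the schema step concrete, here's a minimal FastAPI sketch of the idea (the route and model are hypothetical stand-ins, not Galaxy's actual API): declaring a typed response model is all the framework needs to emit an OpenAPI spec, which a generator like openapi-typescript then turns into client-side types.

```python
import json

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class HistorySummary(BaseModel):
    # hypothetical response model, for illustration only
    id: str
    name: str
    deleted: bool

@app.get("/api/histories/{history_id}", response_model=HistorySummary)
def show_history(history_id: str) -> HistorySummary:
    # a real implementation would look the history up in the database
    return HistorySummary(id=history_id, name="example", deleted=False)

# the framework can emit the OpenAPI spec directly; openapi-typescript
# consumes a schema like this to generate the typed client bindings
print(json.dumps(app.openapi()["paths"], indent=2))
```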
So if you know that the API is going to give you a string, an integer, a Boolean, whatever, and that changes, you know it all the way from the server side to the client side; it's end-to-end type safety. The tooling is really the biggest thing. Here's the example: that's all you have to do to use the fetcher to make requests. Okay, moving along.

So yeah, we're migrating: we've adopted Vue 2.7 recently, so we're using the Composition API and script setup for everything, and it's a lot better. There's another talk, which Alireza is giving, on the transition to Pinia, the spiritual successor to Vuex; I'll let him talk about it, you should listen to his talk. Right now the component breakdown is: we have 495 Vue components, and we've already converted roughly 150 of them to the script-setup Composition API style, which has been really great to work with so far, and 111 are TypeScript. I didn't actually think these numbers were that high; I'm really impressed with how much we've already done. In regular JavaScript we have roughly 550 files left. So we've made a lot of progress, and all of the new stuff is script-setup Composition API components.

Why is this in the new-features segment? Because it's massive. (Did the screen just go off? It just came back on.) It's a significantly improved developer experience; I don't need to read all of that. The biggest thing is that Vue 3 is imminent: Vue 2 will reach end of life at the end of this year. We were waiting on this for a while because of dependencies, specifically the bootstrap-vue dependency we use all over the application; it just recently became compatible with the Vue compat build, which is a bridge between Vue 2 and 3. We've done a ton of this modernization work, and there's still more to do. Please come to the CoFest, because I want to organize people to make progress on Vue 3. And thanks to all the UI/UX folks who worked on all this. I don't know, you don't have to ding it.

Perfect, thanks for the talk. Next up is John Davis, talking about the data access layer in Galaxy.

All right, I'll go into the weeds and under the hood, but I'll do it fast. This is going to be a very, very fast, very brief introduction to how Galaxy talks to the database. Galaxy uses a relational database to store data: not datasets, but data about datasets, and data related to all the business logic that supports whatever Galaxy does. The database tables are represented by Python classes, and collectively we refer to these Python classes as the Galaxy data model, or the model for short. To see what the model is, you can look at the code base, at the class definitions. You can also look at the definition of the database schema using your favorite shell-based tool, like psql for PostgreSQL or sqlite3 for SQLite, or a point-and-click user interface like pgAdmin or anything else. You can also take a look at a graph-based representation of the database; that's very useful if you are looking at a couple of tables, or the relationships between two or three or four or five tables, but as soon as you move beyond that it becomes less useful. The takeaway is that Galaxy's data model is big.
It's 160 database tables, and those tables are non-trivial: there are more than 400 explicitly defined relationships between them. Most importantly, it's constantly evolving; we've had approximately 200 migrations over the years.

So, how do you talk to the database from Python? At the very basic level it's really simple: you use the Python DB-API. You create a connection object, you open the connection, you create a cursor object, you execute a SQL statement against the cursor, you fetch the data from the cursor, you close the cursor, you close the connection, and you're done. It's very simple. However, it all breaks down, because you need a separate implementation for every database we support; we support two databases, and that's quite enough, because the implementations will be significantly different. It's obviously very tedious and error-prone, and you will make errors. In this case they will not be caught by the Python interpreter or any static type-checking tools; they will happen at runtime. Furthermore, they will happen only when that particular query is triggered by whatever the user is doing on the client. And of course it's completely and unquestionably unmaintainable. Here is a sample of just one raw SQL query: we're not going to write this thing manually, and we have many, many queries like that. Which is why we use a tool, and the tool is called SQLAlchemy.

SQLAlchemy is a SQL toolkit and object-relational mapper. It is the Python data access tool: it's been around for a long time, since 2005 or 2006, which makes it about as old as Galaxy. Its architecture is very theory-based, and if anyone really wants to understand how SQLAlchemy works, I would encourage them to take a look at this book before diving into the extensive documentation. It consists of two main parts, SQLAlchemy Core and SQLAlchemy ORM. SQLAlchemy Core is basically the data definition language (Python objects used to describe the database schema), the SQL expression language (Python objects used to describe SQL statements), and the Engine, which connects to the database, maintains a connection pool, translates SQL into database dialects, and, most importantly, provides transactions. The SQLAlchemy ORM sits on top of SQLAlchemy Core: it maps Python classes to database tables, manages the state of objects we have loaded into memory, and keeps those objects in sync with the state of the rows that represent them in the database. Essentially it's yet another abstraction layer (as you know, in computer science everything can be solved with yet another abstraction layer), and it adds considerable extra complexity to a code base that is already non-trivial.

So why do we need this? Why not keep it simple? Think back three decades, to the early days (maybe not so early, but those days) of object-oriented programming, when big applications were beginning to talk to relational databases. Everything was dirt simple: you would keep all the database access logic in the domain object. What that means is, say you have a user object and you need to create it: you instantiate it (new User() in Java, say), you load it up with the data, and when you are ready to save it to the database, you say user.save(). Done.
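As a toy illustration of that Active-Record style (a hypothetical User class over sqlite3, purely for the sketch, not Galaxy code):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database for the sketch
conn.execute("CREATE TABLE user (id INTEGER PRIMARY KEY, email TEXT)")

class User:
    """Toy Active-Record-style domain object: it knows how to save itself."""

    def __init__(self, email: str):
        self.id: int | None = None
        self.email = email

    def save(self) -> None:
        # one raw DB-API round trip per save: execute, commit, record the id
        cur = conn.execute("INSERT INTO user (email) VALUES (?)", (self.email,))
        conn.commit()
        self.id = cur.lastrowid

user = User("alice@example.org")  # load it up with data...
user.save()                       # ...and done
```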
When you want to retrieve an existing object, you say UserRepository.getUserById(), you do whatever you need to do with the user, and when you are ready, user.save(). Done. Very simple. So why do we need the ORM? Again, things break down with more complex logic. When you are Galaxy, you need to load, modify, and keep track of lots and lots of objects, and writing them all back to the database with individual writes is going to be very slow and inefficient. Combining them into batched writes is very non-trivial, because you need to determine the correct ordering of all these writes (writes may depend on previous writes), you need to lock objects to prevent concurrency issues, and you need to make sure you don't load the same object more than once.

The SQLAlchemy ORM solves all of this. The Session maintains the list of objects affected by a transaction, coordinates the writes to the database (performing the required non-trivial topological sort), and resolves the concurrency issues, which is good. The identity map maintains a registry of objects mapped to their database identities, which makes sure that no matter how many times we access a specific row in the database, it will always be represented by the same object in the code. This is also good.

And this is how we talk to the database: we employ direct use of the DB-API very sparingly, in scripts (and that is a bad thing; we should not be doing it); we use SQLAlchemy Core when performance is critical; and in all other cases we use the ORM.

Finally, SQLAlchemy 2.0. I cannot do justice to it here. Essentially, it's the biggest change since SQLAlchemy version 0.2 or 0.3, the biggest in 15 years. It adds many, many good things for Galaxy's code base, and it comes at the cost of many, many breaking changes to the core API, so we'll have to rewrite most of our data access code, which is okay. Most importantly, it introduces explicit transactional database access, which is the way it should be done, but this is the main challenge. Pre-SQLAlchemy-2.0, Galaxy relied on magic, and magic is good. (10 seconds.) We did not explicitly begin and end database transactions, so read operations had no transactional context, and for write operations we never closed the transaction; we just called session.flush() and it would do things magically for us. We had basically one big transaction without a beginning and without an explicit end, and SQLAlchemy magically took care of that. Now that's not going to happen. Database writes must have a commit: if we don't commit, nothing will be saved. Reads are even worse: a read automatically creates a transaction, and that transaction must be closed. If you don't close the transaction, that means database locks; database locks mean idle transactions in the database; connections multiply, and Galaxy crashes very fast. The solution is to replace one big implicit transaction with many explicit transactions, which is something we're working on.
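To illustrate the explicit-transaction style being described, here is a minimal SQLAlchemy 2.0-style sketch (a throwaway in-memory table, not Galaxy's model):

```python
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # throwaway in-memory database

with engine.begin() as conn:  # explicit transaction: commits on success
    conn.execute(text("CREATE TABLE user (id INTEGER PRIMARY KEY, email TEXT)"))
    conn.execute(text("INSERT INTO user (email) VALUES ('alice@example.org')"))
# leaving the block commits; an exception inside would roll back instead

with engine.connect() as conn:  # a read opens a transaction too...
    emails = conn.execute(text("SELECT email FROM user")).scalars().all()
print(emails)
# ...and the context manager closes it, so no idle-in-transaction locks
```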
And that's it. Thank you.

Next up, it's Marius again, talking about job caching.

So yeah, I want to talk to you about the job cache. It's something I really care about, and I'll show you why. Galaxy does some really heavy computing and storage. We currently have 307,667 active users, active meaning they're not deleted (I didn't check when they last logged in), and we get about 36,000 new users per year. We have 24.5 petabytes of datasets, which includes deleted datasets (you run out of quota, you delete your datasets), with five petabytes of non-deleted datasets. We have 51 million jobs, and we have 705,000 workflow runs. Obviously that's Galaxy's purpose; we do this because we believe we're doing important stuff. This is all great, but not every job is new, and certainly not for tutorials; those are always going to be the same.

I mentioned it before, but I really want to drive the point home: Galaxy captures the provenance of entire analysis chains. All tool parameters are stored in the database, and we record exactly which input and output datasets were created and how they're being used. Since we know the parameter settings and the inputs and outputs, we can find jobs that ran with the same combination. For tools that produce deterministic outputs, putting in the same datasets with the same parameters produces the same output. Well, if that's the case, why run it again? We don't really need to. Examples of tools with non-deterministic output, where we shouldn't do this, are tools that query external resources (if you check the weather, you don't want the weather from yesterday; stock prices, same deal; news, same deal) and tools that use a random number generator internally, if that is important to you: if you're doing statistical permutations, you don't want to cache them.

One more thing I want to say is that failure is inevitable. The more jobs a workflow has, the more likely it is to fail at any given point in time, if you assume a constant error rate. Say every thousandth job fails: if your workflow produces 10 jobs, it's unlikely the workflow will fail; if it produces 10,000 jobs, it will probably fail at some point. We have mechanisms to deal with this. At least some Galaxy instances have set up automatic resubmission, so if something failed and it looks recoverable, it's tried again. But not everything is recoverable, like errors in external resources, or your HPC having a black-hole node that accepts new jobs but fails them immediately; that's always going to be a problem. We've had a rerun button for a long time, but it can take a lot of work to resume jobs that way, certainly if more than one failed, and it doesn't work for all types of workflows. We can of course run part of the workflow again, but that's also tricky: you have to identify which step actually failed, modify the workflow to run just that part, and select the relevant inputs, which can be a lot of work. I think the VGP workflows are a great example of when that can be a little tricky. And if only a fraction of your collection failed, you still have to run the remaining jobs, even if you were very careful about removing the steps that already ran.

I want to say something else about developing a pipeline, developing a workflow; that can be a lot of fun. Say you're developing a super fancy tool that consumes a lot of related data. For instance, you're doing trio variant calling with mother, father, and child, and you have a very specific panel of normals, maybe derived from the related family. And say that in your pipeline you have some pre-processing before your actual step, and that pre-processing takes time and space.
Also, while you work on this particular tool of yours, it requires some downstream analysis to even know how well it worked. What I'm saying is that we should only be running the moving parts again: you update your tool, you update your workflow, and then off you go again, all the way. This is an example of a workflow I built in 2018 when I was still working as a postdoc in a hospital. The red part (and it's more complex than this; there were workflows upstream and downstream) is the thing I was working on, and I had to redo it very often without wasting time and space.

Another angle on this: if you have a pipeline that requires pre-processed data, and a QC step should maybe be included, should I include the pre-processing and QC in my workflow? I would say probably yes, because then I get a consistent and complete start-to-end workflow, and anyone can look at it and see what needs to happen. But that means duplicated datasets and wasted compute if I already did the pre-processing, and you're always going to do pre-processing and QC before you start a real analysis. Conditional workflow steps are certainly an option for dealing with this. Or do you say: no, I want to be efficient, I don't want to waste resources, and leave it out? But then there's no guarantee, if you hand the workflow off to somebody else, that they're putting in the right input datasets, and the workflow is not as easy to run; you can't just look at it and say, oh, it's this, this, this. So again: we should skip the parts that have already run.

Then we can look at large-scale consortium projects. Currently I see two types. There's the type where you have a large set of datasets that are individually relatively fast to process, like what we did for SARS-CoV-2 variant calling, where we ran batches of 100 to 500 datasets. There it's tricky to find just the one that failed, and you need to write an external script to work out what actually worked and what failed. Another thing that happened in the course of that project is that there were variant caller bugs and we had to change some of the filtering, and when that happens, you have to run the entire thing again, even though the variant caller sits somewhere at the end of the analysis, and the filters even more so. And then there's the other type, where we have a modest number of datasets but really large resource requirements, like the VGP project or telomere-to-telomere (T2T), where for instance we have an optional export-and-publish step. That export-and-publish step consumes a lot of datasets, and the way we've done it is through tags, but it's a little finicky and not ideal. It would be much easier to say: okay, we're going to run that workflow again, this time including the export steps; because we're not generating new data, that is so much easier. And the next point hasn't happened yet, but it's going to happen: there will be advances in tools and methods, and again, if just the last part of your analysis changed, it should be easy to rerun only the steps that changed.

So, can we reuse previously run jobs?
Yes. Galaxy has had a job search API since 2014. It worked well for finding similar jobs, but not really for finding identical jobs, and that's what we need if we want to cache things: similar is not good enough. So I worked on extending it to dataset collections, and I figured out a way to say: okay, we've got this job, take all its outputs, pretend it's a new job, and don't even hit the HPC; just copy it. There were subtleties. Users can change the name of a dataset, for example, so we shouldn't use the dataset's current name or metadata as it appears now; we should look at what was used at the time the output was produced. So we now also track the history of the dataset, so that we can say: at this point in time, it looked like that. (There was more to it, but it's very specific, so I'm going to skip that.)

I reached the point where we could run the entire SARS-CoV-2 variation workflow end to end and have it entirely cached. Yes! It worked, but it was kind of slow on usegalaxy.*-scale instances, because the query needed to be optimized, and at the time I just didn't do the work. Then we did the COVID project a different way, other priorities came up, and the feature existed but barely anyone knew about it. Then these other projects I mentioned came up, and it became interesting again. In the end it was just a one-line change to fix the query performance: in March I sat down with John Davis, and in an afternoon we pretty much solved the performance problems. There were still some provenance problems; the cache is a really great test of whether we're really recording everything needed to reproduce a job, because if we aren't, the new job won't match.

So how does it work? You can only reuse your own jobs; that's a limitation we imposed deliberately, so there are no surprising results, like your private dataset suddenly showing up for somebody else. So the search is restricted to your own jobs, input and output datasets must not be deleted, all the job metadata needs to match, and upload jobs are currently excluded from the cache. How do I use it? In the workflow run form there's a little gear icon where you can select "attempt to reuse jobs with identical parameters". For individual tools the option is there too, but it's less interesting; you might as well copy the output. For workflows, though, it really makes a difference. Here's an example of caching an entire workflow from the IWC, an ATAC-seq workflow that usually takes about an hour with the data I'm using, so you can see how the cache works; I just ran this on usegalaxy.org. There are some limitations, so it doesn't work on every single thing. How do I know it worked? Your job turns green immediately, and the quota usage doesn't increase: we don't actually copy the physical dataset when we copy things, we just copy the reference, so that's great. The dataset path is obviously the same as in the job you copied from, and we're working on an additional indicator to let the user know that this didn't actually consume anything.
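For the API-inclined, here is a hedged sketch of what opting into the cache can look like when invoking a workflow programmatically. The use_cached_job flag in the invocation payload is my understanding of the API; treat it, and the placeholder IDs, as assumptions to verify against your instance's API documentation.

```python
import requests

GALAXY_URL = "https://usegalaxy.org"
API_KEY = "YOUR_API_KEY"  # placeholder

payload = {
    "history_id": "HISTORY_ID",                           # placeholder ID
    "inputs": {"0": {"src": "hda", "id": "DATASET_ID"}},  # placeholder input
    "use_cached_job": True,  # assumed flag: attempt to reuse identical jobs
}
response = requests.post(
    f"{GALAXY_URL}/api/workflows/WORKFLOW_ID/invocations",  # placeholder ID
    headers={"x-api-key": API_KEY},
    json=payload,
)
response.raise_for_status()
print(response.json()["id"])  # the new invocation's ID
```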
So where can we take it from here? We could enable the job search across all of an instance's users, if they consent. We could enable it for tutorial data only. We could use dataset checksums for the rare cases where we can't otherwise determine that a dataset is actually the same: if we have a hash, we can say, okay, it is the same. We could relax some of the metadata restrictions if we annotate tools to say what they do and don't need. Some tools could declare that they are always cacheable, even if the metadata doesn't match, or never cacheable, because the random number generator in there is really, really important. We could start caching uploads too, if we do deferred uploads from URIs, so from file sources or just general URLs. We could have a job and invocation search preview, so you could say: show me similar jobs or similar invocations. And a far-off goal could be searching other instances. That's it, thanks a lot.

We have a bit of time for questions. Any questions? Thanks for the great work; I had a comment on caching of external URIs: I think that depends very much on whether the data might change in the future, right, so you'd need to somehow mark every URI as cacheable or not. Yeah, or at the point in time when you run it, you can say: we fetched this last at this point, do you want to use it or not. It's all very explicit right now, and you have to opt in to use the cache (it's not enabled by default because we're still working on it), but yes, absolutely, that's a concern. Awesome feature, Marius; for the GTN, to make it useful for training, I think we need that to work across accounts, right, is that planned? It's basically just a single filter. For instance, if a user joined a TIaaS training, it would be easy to say: data that was created while they were in the training can be reused; we'd have to detect that they're running a tutorial, that this is tutorial data, and that the job cache is okay there. Otherwise, have a global setting, maybe in the user preferences, that says: I consent to sharing my data, and at the same time that means you get to use other people's data. There are a lot of things we could do, but we also need to think through to what degree users are comfortable with other users reusing their work. Thank you very much.

Next up is Paul, talking about RO-Crate.

Yeah, well, my name is Paul De Geest, I work as a software developer at ELIXIR Belgium, and I'll be talking about using RO-Crate to export provenance, mostly from workflows, in Galaxy. When we talk about provenance in the context of workflows, we have two levels: retrospective provenance, which is all the inputs and outputs of a specific workflow run, all the parameters, and so on; and prospective provenance, which is the recipe, the workflow definition (in Galaxy, for example, that's gxformat2). So why would we keep provenance? Specifically for workflows, we could think about comparing workflows, not only between workflows in Galaxy but also across different workflow systems; to do that kind of comparison, we need a common format. And in some contexts we could even think about re-creating or re-running a workflow, just based on keeping good provenance.
So this is where RO-Crate comes in. RO-Crate is a lightweight packaging format that mainly focuses on keeping data and metadata together. It aims to represent all metadata related to an experiment, or even just a single object, in a machine- and human-readable way, and it does that by keeping everything, URIs for example, within a JSON-LD format, reusing a lot of FAIR linked-data standards. Lastly, there are profiles: profiles add extra constraints for specific use cases you want to represent in an RO-Crate, which is helpful, for example, for representing workflow runs.

At the most basic level, an RO-Crate contains a dataset. This example has just one data file plus the ro-crate-metadata.json, the JSON-LD file that is the main source of the metadata. It describes the top level of the directory and anything below it, any contextual information such as the author of a workflow, and URIs related to that context.

For workflow runs, there is a set of Workflow Run RO-Crate profiles developed by the RO-Crate community, and these provide extra constraints. The three levels have increasingly more constraints: a Process Run Crate would be, for example, for a history where you just have a set of tools that were executed, with no workflow definition attached; a Workflow Run Crate needs to have a workflow definition; and a Provenance Run Crate extends that with the intermediate inputs and outputs of a workflow run. In Galaxy, we're at the Workflow Run Crate level.

Here's an example of a workflow. In the JSON-LD we represent both the prospective and the retrospective provenance: the workflow file is the top-level entity of the JSON-LD, and the inputs and outputs are represented from both the prospective side and the retrospective side.

This is the actual feature in Galaxy. Once you run a workflow, you can go to the workflow invocations section and choose how to export it: either directly, or to an FTP location, or any kind of remote location, and then you can start inspecting it. It's implemented using the ro-crate-py library. It's an extension of the standard workflow invocation export that already exists in Galaxy, but it adds this more readable metadata layer, and it's an alternative to the BioCompute export option for workflow invocations. Its main purpose is to create this more readable layer on top. And this is what it actually looks like for a slightly more detailed workflow run.

This was developed through a lot of collaboration, and it has been implemented in other workflow management systems as well. The benefits are the added research data management functionality, and it provides ways to, for example, publish your workflow runs. In the future we want to move to Provenance Run Crates, where we include the intermediate steps; we also want to make this work with Galaxy histories, and we want to enable import of RO-Crates. And lastly, it's available as a training topic on the Galaxy Training Network.
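To give a feel for the ro-crate-py library the export is built on, here is a minimal sketch of creating a crate by hand (the file and its properties are made up for illustration, the write method name reflects recent ro-crate-py versions, and the Galaxy exporter does considerably more):

```python
from pathlib import Path
from rocrate.rocrate import ROCrate

# a stand-in data file, just so the sketch is runnable
Path("variant_calls.vcf").write_text("##fileformat=VCFv4.2\n")

crate = ROCrate()
crate.add_file("variant_calls.vcf", properties={"name": "Variant calls"})
crate.write("my_run_crate")  # writes the files plus ro-crate-metadata.json
```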
And lastly, I want to mention that we have a job opening for Galaxy DevOps at ELIXIR Belgium, so if anyone's interested, please contact us. And I just want to thank everyone who worked on this, the Workflow Run Crate community, the RO-Crate community, and everyone at ELIXIR Belgium. Thanks.

Last talk of the session: we have Alireza.

Hi, my name is Alireza. I'm a software engineer at Galaxy Freiburg, and I'm really thrilled to be at GCC; this is my first talk at GCC, and you'll see me twice more during the conference for my other talks. This talk is about the new multiple-history view: interacting with data across multiple histories. In the next slides I'll go through the journey of the previous version and the challenges we faced, then introduce the new version of the multiple-history view and explain the concepts, the process, and the features we bring to it to enhance your Galaxy experience. At the end, if we have enough time, I'll go through a live demo to show you what you can do with the new one.

So let's take a journey through the timeline of the multiple-history view in Galaxy. It all began in September 2014: this is the first commit that introduced the multiple-history view, with this commit hash, and we can mark it as the birthday of the multiple-history view in Galaxy. This was the foundation on which the multiple-history view was built and evolved over the years. In 2017 we brought Vue.js to Galaxy; that was a significant milestone, bringing improved performance and much more. Then in 2020 we started rebuilding the history components, starting with this pull request, using Vue.js to leverage its powerful features and component-based architecture. And then in 2022, with this PR, we started bringing the new multiple-history view to Galaxy.

Now let's take a look at the previous version of the multiple-history view. As you can see in this screenshot, the interface displayed all the user's histories at the same time, providing an overview of the available histories. While the previous version had some useful features, it also had some limitations, and we'll explore both now. The features: users could see all their histories at the same time; they could search through the histories, and also search all the datasets across all the histories they have; they could switch the current history or create a fresh one; and they could drag and drop a dataset from a history into the current history, but only into the current history. The limitations: it was based on Backbone.js, the most popular framework we had in Galaxy before; users couldn't arrange the histories the way they wanted; they couldn't drag and drop a dataset into any history, only into the current one; and as the number of histories increased, the performance of the multiple-history view decreased, leading to unresponsiveness and potential delays. These features were revolutionary at the time, but we took them, resolved the limitations, and brought them into the new history view.
So, this is the new version of the multiple-history view. We built on the strengths of the previous version and introduced several enhancements that significantly improve the user experience of Galaxy. During the development phase we focused on creating a really user-friendly interface, similar to navigating the workflow editor. Going quickly through this: we designed some UI mockups that we could have implemented in Galaxy, but in the end we decided on the new Galaxy design you can see here. The new panel has everything the old panel had, and it's completely based on Vue.js. Now you can bring an unlimited number of histories into the view without performance delays or anything like that. You can drag and drop any dataset into any of the shown histories. The user's current selection state is saved, so you can refresh the page or go to other pages and come back without reselecting. And you can search, navigate, select, and gather datasets across multiple histories.

I think we don't have enough time for the live demo. (Fifty seconds?) Okay. So I'll just encourage you to go to usegalaxy.org, where you can see the multiple-history view: play with it, find the bugs, and if there are features you think would make it better, just tell us. At the end, I'd like to acknowledge the outstanding individuals, including Laila and others, for their commitment and contributions to building this, and the whole UI/UX working group. Okay, that's my talk. Thank you.

Thanks a bunch. Thank you. I think that concludes our session.