Hi everybody. Welcome to session 17. I'm going to be moderating this session. To start off, we're going to have a talk by Pete Binfield from the journal PeerJ.

Thanks. I'm very pleased to be up here and in person. We've supported this meeting in previous years, but it's my first time here in person, and it's great to finally make it as a co-sponsor alongside the likes of AWS and Google Cloud. So I'm here to talk about PeerJ: we're a publishing company, an open access publisher, and I'm one of the co-founders. I went to Anton's talk last night, and it was interesting to see one slide; I snapped a quick picture from up the back. Apparently the three major concerns of Galaxy over the years have been accessibility, transparency and reproducibility, and I'm going to explain how PeerJ ticks all of those boxes. Accessibility: we're an open access publisher. Transparency: we require data availability statements and so on. We're one of the leaders in the industry in requiring deposition of the data, and we encourage authors to put as much information into the materials and methods as they want, with DOIs for datasets and so on. So, quickly, about us. We're an academic journal publisher, and everything we do is open access, published CC BY. We're 10 years old now; we launched 10 years ago basically to try and come up with publication solutions for the 21st century. If you look at other publishers' processes, you'd think this was incredibly hard to do, so we tried to reinvent it for the modern internet era, and we wanted to come up with something as cheap as possible for authors while publishing at as high a quality as possible. So we have a couple of main journals. PeerJ, the bio journal, PeerJ Life & Environment, publishes across the whole of biology: health sciences, medical sciences, ecology, paleontology, environmental science. A really broad scope for this one.
We publish about 200 articles a month, and one of our real strengths is areas like bioinformatics and computational biology, which is obviously a strong area for Galaxy. We are indexed and we do have an impact factor, but we're signatories to industry agreements that say we're not going to promote the impact factor, so I won't tell you what it is, even though we know it's really important to a lot of people. Then PeerJ Computer Science is our computer science journal. This is more recent, seven years old, publishing about 30 or 40 articles a month at the moment. Again, a large academic editorial board, about 400 editors, and some big names on the advisory board for this one. Really high quality. Okay, so that's a couple of journals; they're interesting, but everyone has journals. What's unique and interesting about PeerJ is the topic of this talk. We do everything you'd normally expect from a journal publisher: formal peer review, full editorial oversight. Fast peer review, so we're getting first decisions in 25 to 35 days. Like I said, very strong requirements for data deposition, and we now have a fairly recent requirement for formal persistent identifiers for code: DOIs for the individual instances of the code, via Zenodo and the like, rather than just a link to GitHub, and we're supporting those requirements. We love reproducible science and negative results, and we make sure we publish those. We are a for-profit publisher, but roughly 40% cheaper than the legacy publishers. Okay, so: we also do a few things, and this is just a few of them, that nobody else does, and I wanted to hit on three of them in this talk. First, we have an innovative lifetime membership model.
We have tangible rewards for peer reviewers, and we've got a very interesting publication solution for small societies, research groups, and potentially communities like Galaxy. So, the lifetime membership model. The main way to pay for open access publication today is what's called an APC, an article processing charge: to publish with an open access publisher, you typically pay them something like $1,200 per article. We actually launched with a membership model; we didn't have an APC model at all for the first two years. The concept of the membership model is that as an individual you become a lifetime member of PeerJ: you pay a single fee once, and that gives you the ability to publish for free for life. Currently the basic membership is $399, and that allows you to publish one article per year, free, forever, for that one payment. The one catch is that every co-author of a paper has to be a member. So if you have three co-authors working together, they each have to buy one of these $399 memberships, but then all of their future papers together, in future years, are covered. We've got a nice infographic here of one author's network: every time somebody joins his lab, or every time they publish a paper, he buys memberships for those individuals out of his research grants. Over time he's built up this growing group of collaborators and co-authors who all have memberships now. In the years he's been doing this, and this was about a year ago, those memberships had funded the publication of 24 articles across more than 50 individual authors, and the effective article publication fee at that point in time was just $665 per article. And in that network, every new article they publish gets cheaper.
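As a back-of-the-envelope illustration of that amortization (the $399 price is from the talk; the 40-membership network size is an assumed round number chosen to match the quoted $665 figure):

```python
# Illustrative arithmetic only: one-off lifetime memberships amortized
# over the articles a network publishes. $399 is the basic tier from
# the talk; the 40-membership network size is an assumption.
MEMBERSHIP_FEE = 399  # USD, paid once, one free article per year

def cost_per_article(n_memberships: int, n_articles: int) -> float:
    """Effective per-article cost once memberships are amortized."""
    return n_memberships * MEMBERSHIP_FEE / n_articles

print(cost_per_article(40, 24))  # → 665.0, close to the talk's figure
print(cost_per_article(40, 48))  # → 332.5, halves as output doubles
```

The point of the model is exactly this curve: the numerator is fixed at sign-up, so every additional article pushes the effective fee down.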
The per-article cost gets lower and lower over time, because they've paid the cost up front for lifetime memberships. So that's a really interesting, innovative solution to the cost problem of open access. We did end up also introducing an APC, because a lot of people didn't understand this model; a funder or a grant doesn't necessarily know what to do with a lifetime benefit. So we have APCs alongside, but the membership is still a model that anyone can sign up to. Now, and I'm going to tie these three things together here, the next interesting thing we've got is something called PeerJ Tokens. We launched this in January, just earlier this year, and the concept here is that peer reviewers are always complaining, rightfully so, that they don't get credit or payment for their peer review work. Publishers and journals ask them to peer review for free. They give this peer review labor, which takes hours or days of time per article, and then the publisher turns around, publishes the article, and makes money, while the peer reviewer gets nothing. Some publishers in the past have, as it were, rewarded their peer reviewers with a discount on a future publication. We used to do that. But what we found was that the requirements around those discounts were usually very restrictive: typically some small publication discount that expires. Fine if you actually intend to publish a paper with that journal; if you're not interested, what's the point? So we abandoned that concept of giving people a little discount and instead give them PeerJ Tokens. This is almost like an actual currency: when you peer review for us, or if you act as an editor, you get PeerJ Tokens, and they're worth dollars. They accrue in your account, and you can use those dollars for future publications. So at some level it's somewhat similar to the discount concept, but the clever bits here are that they never expire; they sit in your account and accrue forever.
And you can combine them with other individuals. So if you've got, again, three or four co-authors on a paper who have each earned $100 or $200 of tokens, you can combine them together and get a free publication. Or you can transfer them out entirely to some other entity or individual, even sell them to somebody on Twitter. Or you could donate them to some research collaboration, or a group of people who happen to be publishing with us, if you don't need the discount yourself. These have really been well received. We only just launched in January, but I think people really see the value of this: tokens have only been accruing since January, yet already many people have accrued enough tokens, or had them transferred in from other users in the system, to cover publications. So those are two things together, the membership model and the tokens model. The final thing I'm going to talk about is PeerJ Hubs. This is a really nice solution we've come up with for small societies, research organizations, or communities like Galaxy who want a publication solution of their own. Picture a small society: some number of members, charging some membership fee, but no house journal. They have no way to reward those members for their membership in the society, things like discounted publication, and no voice for their research field. Many publishers now are simply not launching new journals for societies like that; it's just not economic any more. So these societies are stuck: they know they want some sort of publication venue, but nobody will launch it for them, and they don't want to launch it themselves because they're not publishers, and besides, launching a new journal takes years to establish itself; it's taken us 10 years to get PeerJ to where it is. So, if you're a society, you can come to us and sign up for a PeerJ Hub.
And we've got the first one launching next month, with an association in biological oceanography. The basic concept is that the society encourages its members to submit to this PeerJ Hub, their hub. The society themselves can vet those submissions as they come in: they can say these are or aren't in scope for our society, these people are or aren't members, and pre-approve them to go into the review process. Peer review then happens in whichever of our journals fits the scope, through the normal peer review process, and the article pops out the other end as a publication. While it's in peer review, the society themselves can actually oversee it: the society can put their members onto our editorial board, an editorial board within the board. In many ways I think of this as a journal within a journal. What comes out the other end is all the articles that have been pre-approved and reviewed on behalf of this society. Think about Galaxy: all the Galaxy community articles that have come out of your tools. And you end up with a landing page on the journal, so the articles are published in that journal but also collected in this hub, which is closely branded to the society or group. Those articles are, as it were, born with an impact factor: they're in a journal that's indexed straight away, so you don't have to wait years to establish yourself. The other tie-in, with the tokens and the membership scheme I mentioned before, is that when society members submit to the hub they get a publication discount, which comes in the form of tokens. Those tokens can be donated back to the society. So a member may say: I don't need that discount for myself, I'll give it back to the society.
The society then has a pot of tokens, dollars, credits, that they can hand out to worthy recipients among their society membership: students, or people who developed the tools. So with that ecosystem of the memberships, the tokens, and the hubs, we think we've got a really nice open access solution for groups of researchers, communities like Galaxy, and small and medium-sized societies, one that can really flip around some of the economics of publication. So those are the things I wanted to hit on. We're also just a regular publisher: scan the QR code and you get a discount. I've spoken to a couple of people this week about the possibility of PeerJ partnering with Galaxy, and I'm happy to talk to anyone about that. Any questions?

Someone asked: if someone is a co-author, does that use up their once-a-year allowance? Yes. Each co-author with the basic membership gets a once-a-year allowance, and being a co-author on a paper counts against it. But there are higher tiers: at $499, just $100 more, you can publish up to five articles a year, which nobody does anyway, so it's not much more to be effectively unlimited.

Next question: have you seen any increase in collaboration between people who have already paid into the system, who already have accounts, but haven't worked together otherwise? So yes, actually. That one network I mentioned has a big network effect. We did a nice blog post with Rob Toonen: because every person who comes into his lab gets handed a membership, they then leave the lab in future years and promote the membership to new colleagues, and I think they've encouraged colleagues to publish with us instead of elsewhere. So there's that kind of collaboration that's been enhanced, I think, by this sort of network. Thanks. Any other questions? All right. Thank you.
Before we begin the talks for the final part of the session, let me remind everybody that if you have a talk tomorrow, please send in your first slide. Thank you.

All right, everybody, welcome back. We're going to start off with one of the talks for the day. First up is Marius from the usegalaxy.* team.

Thanks. Today I'm presenting work that was mostly done by Simon Bray, who couldn't be here, so I'm presenting his slides. The IWC, or Intergalactic Workflow Commission, is a community for high-quality tools and workflows. We presented this last year, so this is a quick update on where we're standing now, what we're doing, and some really cool stuff we added recently. You may be more familiar with the IUC, the Intergalactic Utilities Commission, which maintains tools: creating them, updating them. What the IWC aims to do is be that same sort of entity, but for workflows. We do continuous integration testing with Planemo, and we deploy the workflows to Dockstore, so you can find the workflows via the registry or install them directly. There is a wide range of scientific workflows, and we're always after new ones. Right now we have SARS-CoV-2 workflows, a generic variant-calling workflow, and genome assembly workflows for the VGP, and that's just a start, with more coming. We have molecular dynamics and free-energy calculation workflows, and we really want your workflows as well. On the right, you can see what that looks like on Dockstore, which also hosts the workflows. These are nicely versioned and updated, which is all really cool. Workflows are contributed for review via GitHub, and they can be built in the UI or handwritten. The same is true for the tests, which I think are a little bit tricky because you have to remember a certain syntax; it's similar to tool tests.
I think a lot more people are familiar with tool tests, but there's always that bit of syntax you have to look up, and I always look it up for workflow tests too. But we've made this a lot easier recently: Simon added a workflow test init command to Planemo that will generate the definition of the inputs and the definition of the test case based on a previous invocation. So all you need to do, once your workflow is ready and passes the best-practice checks, is run it with some, ideally small, input data. When it's done, you just fetch the invocation ID, run that command, and it will give you a test setup. It's really cool and saves a lot of time. We're also going to try to streamline this even further, so that people can just say: here's an invocation. And on that invocation you can also do the actual scientific review. The whole process is PR-based: we review the PR and eventually merge it. But then the next thing pops up, because it's not enough to just do that; you also want to maintain it, to make sure that when authors improve the tools, the workflows see those updates. So how can we do that? Well, we have automated tool updates from the command line now. There's the Planemo autoupdate command: we can point it at a workflow, and it will spin up a transient Galaxy instance, find which tools can be updated, and do that. We can also target a particular Galaxy server, so it will update the workflow to the latest tool versions on that server. This functionality builds on the workflow actions that John added, and it's strictly equivalent to hitting the upgrade-workflow option in the workflow editor. And by having this as a command, the next step is to put it into a continuous integration job and run it at certain times, daily or once a month, depending on how often we want to create these updates.
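As a sketch of how the two commands just mentioned are invoked (the invocation ID, server URL, API key, and workflow filename below are placeholders, and exact flag names may differ between Planemo versions):

```shell
# Generate a workflow test definition from a finished (small-data) run;
# the ID, URL, and key are placeholders for your own values.
planemo workflow_test_init --from_invocation <invocation-id> \
    --galaxy_url https://usegalaxy.example.org \
    --galaxy_user_key <api-key>

# Bump the tools in a workflow to their latest Tool Shed revisions,
# equivalent to the "upgrade workflow" option in the workflow editor.
planemo autoupdate my_workflow.ga
```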
It's done with a GitHub bot, and it also prints the list of tools that changed. The action also checks the workflow's inputs and outputs, and it takes care of keeping the step positions the same, so the diffs are actually reviewable and not everything gets shuffled around, which is what would otherwise happen when you update all of it. So that brings me to the final slide, where I'm showing the bigger picture of how we can maintain and sustain workflows. This applies beyond the IWC; you can start communities like this yourselves. The basic flow is that there's a source code release on, for instance, GitHub. A bot monitors the upstream source for new releases, and when it finds a new release, a pull request is created against Bioconda. This just needs a review: if everything's passing, all that's needed is for somebody to say this is good, and that creates a package. At the same time, a BioContainer gets created, that's Docker and Singularity, on their CI systems. So that's the larger community machinery going on. Then, once we have new Bioconda packages, we have a bot that will update the Galaxy tool and create a pull request against the IUC. Again, that gets reviewed: make sure the tests pass and we still get the expected outputs. We also typically read the changelog to check for significant new options, or anything that's a red flag. When it looks good, we merge it, Planemo updates the tool in the ToolShed, and the Galaxy servers periodically install the updated tools. And then there's the part I showed earlier for the Galaxy workflows: once we have the updated Galaxy tools, a bot watches the ToolShed for updates, and it creates a pull request with the workflow and the updated tools, which we review and merge.
And once that's merged, the Galaxy workflow gets updated on the Galaxy server. So that's a strategy that will allow us to maintain workflows, make sure they keep running, and make sure they'll be able to serve your scientific needs.

Question: that's great, really interesting, I'm excited for it. But one of the things we tell users is that when you save a workflow, the tool versions are essentially fixed. How are you going to propagate versions through to the workflow version inside Galaxy? Do users just see one version, or do you keep them up to date? Answer: the workflows have their own release versions, and a new release is created with every change, so these are individual, versioned changes. That's actually an area for improvement, though: the changelog should be more visible in Galaxy. Any other questions?

Next we have Yu, from the University of Michigan.

First of all, I would like to thank the organizers for giving me this opportunity to present our work. This is the first time I've attended GCC, and our work is mostly about proteomics, so it might be a little different from the other presentations; hopefully it will interest you. Today I'll present our work on running our pipeline, FragPipe, with Galaxy. In this presentation I'm going to talk about what exactly FragPipe is and what it can do. This is basically what FragPipe looks like: it has a GUI, and it has several tabs, where each tab has one or several tools to accomplish certain tasks. Some of the tools are from our lab and some are from other labs. First I will cover several of the tools from our lab, starting with our search engine, MSFragger, which was published a few years ago.
After years of improvement, it can do mass calibration and preprocessing of the data. Here is an example of the mass calibration, from a paper published in 2020: we have data with a very big mass error, and after the calibration, as you can see, the error is basically centered at zero and the spread is smaller. We also used public datasets to demonstrate that MSFragger has higher sensitivity, using PEAKS and MaxQuant as benchmarks; both are very widely used tools in the community. Both peptide-wise and protein-wise comparisons show that MSFragger has higher sensitivity in closed search. MSFragger can also perform open search and mass-offset search. Open search is basically a search with a very wide precursor mass window, and the detected mass differences can be interpreted as modifications, so it is very good for post-translational modification discovery. Mass-offset search, in contrast, only matches certain mass shifts rather than a continuous range of masses: the shift can be equal to zero, which is an unmodified peptide, or equal to 16, which is oxidation, and so on. So with mass-offset search we know exactly what we're searching for, and because the search space is smaller than in open search, we get higher accuracy and higher sensitivity.
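To make the distinction concrete, here is a toy sketch (not MSFragger's actual code; the tolerance and window values are made up) of how the two precursor-matching policies differ:

```python
# Illustrative sketch: mass-offset search accepts only a discrete list
# of precursor mass shifts, while open search accepts any shift within
# a wide window. The tolerance and window values here are invented.
TOL = 0.01  # Da, matching tolerance (assumed value)

def offset_search_match(delta_mass, offsets=(0.0, 15.9949)):
    """Accept a precursor mass shift only if it is near an allowed offset
    (0.0 = unmodified, 15.9949 Da = oxidation)."""
    return any(abs(delta_mass - o) <= TOL for o in offsets)

def open_search_match(delta_mass, window=500.0):
    """Accept any shift inside a wide window (classic open search)."""
    return abs(delta_mass) <= window

print(offset_search_match(15.995))  # True: oxidation-like shift
print(offset_search_match(7.3))     # False: not an allowed offset
print(open_search_match(7.3))       # True: open search accepts it
```

The discrete offset list is what shrinks the search space: open search accepts all the shifts the offset search rejects, which is why the latter can be more accurate and sensitive for known modifications.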
We have also extended MSFragger to support labile modifications, for example glycosylation. You basically give it a big list of glycan masses, and it searches them as offsets. During the search, MSFragger also supports modification-specific ions such as diagnostic ions; for glycopeptides, that means the oxonium ions and Y ions which are specific to those peptides. After the search, we can also do the glycan assignment in a tool called PTM-Shepherd. PTM-Shepherd is also a tool from our lab, and it has a glycan assignment module that can take the information from the diagnostic ions to give us the glycan compositions and also estimate the FDR. After talking about open and offset search, I would like to introduce our new tool, named MSBooster. It is a tool for deep-learning-based rescoring. It basically takes the output from MSFragger, gets the list of peptides, and predicts the spectrum and retention time for each peptide in that list. It then compares the user's experimental data against the predicted spectra and computes a bunch of additional scores, such as spectral similarity and retention time difference. These additional deep-learning-based scores, together with MSFragger's original scores, are fed to Percolator. Percolator is a tool from another lab that is widely used in the community: it basically takes a bunch of scores and trains an SVM model, and this model can give a better separation of the true and false identifications. Our experiments indicate that this additional rescoring can boost the sensitivity of HLA peptide analysis. Here is a typical example of an HLA analysis with and without the rescoring: there is a very big overlap, plus the additional identifications you can see in this comparison.
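One of the similarity scores such rescoring tools typically compute is a cosine similarity between experimental and predicted fragment intensities. A minimal, self-contained sketch (illustrative only: the intensity values are made up, and real tools first match the fragment ions between the two spectra):

```python
import math

def cosine_similarity(spec_a, spec_b):
    """Cosine similarity between two intensity vectors that are already
    aligned on the same fragment ions (a common rescoring feature)."""
    dot = sum(a * b for a, b in zip(spec_a, spec_b))
    norm = (math.sqrt(sum(a * a for a in spec_a))
            * math.sqrt(sum(b * b for b in spec_b)))
    return dot / norm if norm else 0.0

experimental = [0.9, 0.4, 0.0, 0.2]  # fragment intensities (made up)
predicted    = [1.0, 0.5, 0.1, 0.2]
print(round(cosine_similarity(experimental, predicted), 3))  # → 0.995
```

Scores like this, one per candidate peptide, become extra features for the SVM-based rescoring step described above.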
We then have to show that those additional peptides are likely to be true. We analyzed the motifs of those additional peptides and compared them with the shared ones, and the motif patterns look similar, which is a primary indicator that those newly found peptides are correct. So, after speaking about identification, I will spend a few minutes on quantification. We developed a tool named IonQuant that can perform label-free quantification, and isotope-labeling-based quantification is also supported. This slide demonstrates the performance of IonQuant against tools widely used in the community. We used a benchmark dataset of mixed-species experiments: each experiment had proteins from three species, and the ratios of the proteins from those three species are different; for human it's one-to-one, for yeast it's two-to-one, and for the third species another known ratio. So basically we know the ground-truth ratios of those proteins, and we can use this information to evaluate the accuracy and precision of the tools. This comparison shows that IonQuant finds more proteins, with somewhat fewer outliers. Then we wanted to evaluate the performance of the isotope-labeling quantification using another dataset. This time we used heavy- and light-labeled samples, mixed at known ratios: one-to-one, two-to-one, and one-to-ten. And this time we used Skyline as a benchmark; we didn't use MaxQuant because it doesn't support this kind of labeling. Skyline is also a widely used tool for quantitative proteomics with a very, very large user base. This experiment shows that FragPipe is comparable overall. Okay, then let's go to the DIA analysis.
I'm not sure if you all know the difference between DDA and DIA. In DIA, the isolation windows are wide and multiplexed, so a single MS/MS spectrum has fragments from multiple precursors in it, and you don't know what their masses are. So analyzing DIA data is a little bit different from DDA. Recently we extended MSFragger to make it support DIA. MSFragger-DIA can search against the sequence database directly, without any prediction or spectral library. After the MSFragger-DIA search, the results can be used by MSBooster, and they can be used to build a spectral library. Spectral libraries are very widely used in DIA analysis because there are tools that can take a spectral library as a template to extract the quantification: they take the features from the DIA data and, with the spectral library, quantify the peptides and proteins. If the user has DDA data as well, we can search both data types together and build them all into a hybrid spectral library, containing peptides from both the DDA and the DIA, which can then be used for quantifying the peptides from the DIA data. And our experiments show that this hybrid library always yields more peptides than a library from the DDA alone or from the DIA alone. Okay, so basically FragPipe can do HLA analysis, it can do DIA analysis, it can do the conventional closed search, open search, and mass-offset search, and it also supports all kinds of quantification. We also provide several dozen built-in workflows to support all kinds of applications, some of which I'm showing here, such as TMT analysis and DIA analysis. So people might say: okay, FragPipe can do a lot of things, but what if we want to use it on HPC clusters?
For example, processing many, many jobs in batch, without having to sit and interact with it through a GUI. We listen to user feedback, and we always want to make users happy, so we added a headless mode to FragPipe. The headless mode and the GUI mode basically share the same code base; the only differences are, one, whether FragPipe initializes the GUI module when it starts, and two, how it takes input from the user. In GUI mode it takes its input from the GUI; in headless mode it takes its input from a workflow file and a manifest file. The workflow file basically contains all the parameters for the analysis, and the manifest file contains the paths of the user's mass spectrometry data and the experimental groups. With this mode, running FragPipe becomes a single command, and this command can easily be put into the Galaxy system to run in the background, while the front end can be a website taking all the parameters. By the way, I would like to thank the group members working with me on the FragPipe updates and documentation, as well as all the collaborators who develop the tools integrated in FragPipe, some of whom have projects built using FragPipe, and also our PI. Thank you, that's all. Any questions?

Question: MSFragger seems to give a lot more identifications compared to MaxQuant. What do you think contributes most to this? Is it open search, or is it just MSFragger? Which components contribute to the increased identification rate? Can you repeat the question, please? Sure, so the question is what kind of comparison you are doing in this figure, right? Yes: there are quite a few features in MSFragger which seem to boost the number of identifications.
So, for these comparisons we did an apples-to-apples comparison. We didn't do an open search or a semi-enzymatic search; if we put a semi-enzymatic search in, the settings would no longer correspond. So here it's all closed search, and we didn't do any open search because the MaxQuant interface doesn't support open search very well. So all the sensitivity gains basically come from the spectrum preprocessing, the mass calibration, and the scoring, those kinds of details, not from open search. Actually, we're running a little bit behind time; would you mind posting it on the Slack so we can continue from there? Sure. Awesome, thank you so much.

Next up are Matier and Sherrman. We'll stay in the field of mass spectrometry, where the main goal is to identify unknown compounds in a sample. During this process, which we won't go into here, we basically obtain a spectrum for every compound, and we associate it with metadata. This metadata is what actually identifies the compound, and we can see it as a set of chemical identifiers, such as the chemical formula or SMILES, InChI and this kind of stuff, plus some database-specific identifiers, such as a PubChem ID and so on. The issue is that in mass spec libraries these metadata are often very reduced: a lot of the information is missing and identifiers are absent. So what we did is develop a tool which tries to extend this set of metadata by finding the identifiers in databases and using web services. This is an overview of the tool; it's called MSMetaEnhancer. What it does is pretty simple: it takes a mass spec library which contains spectra annotated by these identifiers, runs them through an annotation process, and the output is basically the same data, just enriched with more metadata.
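The core idea, chaining source-to-target identifier conversions and iterating so that one step's result can unlock another's, can be sketched roughly like this (hypothetical function and field names, not the tool's real API; the real tool queries web services asynchronously rather than using these stubs):

```python
# Hypothetical sketch of the annotation idea: each "conversion" maps a
# source identifier to a target identifier, normally by querying a web
# service; here the services are stubbed out with placeholder strings.
def smiles_to_inchi(metadata):
    # Stub standing in for a structure-conversion service call.
    if "smiles" in metadata:
        metadata.setdefault("inchi", "InChI=<derived from %s>" % metadata["smiles"])

def name_to_formula(metadata):
    # Stub standing in for a name-resolution web service.
    if "name" in metadata:
        metadata.setdefault("formula", "<formula for %s>" % metadata["name"])

def annotate(metadata, conversions):
    """Run conversions in the user-chosen order; repeat so identifiers
    obtained by one step can unlock later ones (the iterative mode)."""
    for _ in range(2):  # a second pass fills gaps opened by pass one
        for convert in conversions:
            convert(metadata)
    return metadata

record = {"name": "caffeine", "smiles": "Cn1cnc2c1c(=O)n(C)c(=O)n2C"}
print(sorted(annotate(record, [smiles_to_inchi, name_to_formula])))
# → ['formula', 'inchi', 'name', 'smiles']
```

The real conversions differ (and can fail when a service lacks the identifier), but the enrichment loop has this shape: known fields in, more fields out, order chosen by the user.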
The annotation process is composed of so-called conversions. Basically, we specify how we want to obtain a new identifier. In our metadata we already have some known identifiers; that's the source. Then we want to find something new; that's the target. And we want to do it using a service. This service is typically a web service accessed via an API, or it can also be a compute engine, because some identifiers can be computed directly from other ones. The process is semi-automatic. The reason is that the user can have additional knowledge about the data. First of all, they have to define the targets: which identifiers to obtain. But they may also know additional information, such as: when I search this particular database, I will probably get most of what I want, and then I can try these secondary sources, which might help fill the gaps. That's why it's semi-automatic. But there is also a mode where the user can just say: let's try everything and see what we get. The annotation process is mostly asynchronous, since we are mostly querying web services and spend a lot of time waiting for responses. And as you can see in the scheme, there is also some pre-processing and post-processing, called curation and validation. This basically tries to ensure that the identifiers are correctly formed, so that we, for example, don't query a database with some random string, and so that in the output we avoid giving the user data which doesn't really make sense. And finally, the tool itself. As I mentioned, the user can specify the order in which to execute the conversions. We use a single-select form in an iterative mode, so we can iteratively add more conversions, always selecting one at a time.
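To make the source, target, and service idea concrete, here is a toy sketch of such a conversion chain. The converter uses a static lookup table standing in for a real web service, and all names here are made up for illustration; the actual tool's API will differ.

```python
def make_converter(source, target, lookup):
    """Build a conversion: fill the `target` identifier from the
    `source` identifier using a service (here faked by a dict
    standing in for a web API)."""
    def convert(record):
        key = record.get(source)
        if key is not None and target not in record:
            value = lookup.get(key)
            if value is not None:  # the service may fail: just skip
                record[target] = value
        return record
    return convert

def annotate(record, converters):
    """Run the user-ordered converters; a failed conversion leaves
    the record unchanged and the chain simply continues."""
    for convert in converters:
        record = convert(record)
    return record

# toy "service": ethanol's SMILES -> InChI
smiles_to_inchi = {"CCO": "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3"}
chain = [make_converter("smiles", "inchi", smiles_to_inchi)]
enriched = annotate({"name": "ethanol", "smiles": "CCO"}, chain)
```

Re-running the chain after other converters have filled in new source identifiers is what gives the iterative behavior: a conversion that failed on the first pass may succeed on the next.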
In this way, we preserve the order of the converters and run them in that order. But it's also possible to use a multiple-select and just pick whichever conversions you want; these are then run one after another in essentially arbitrary order. That's the "let's try it and see" mode. The user can also select all possible converters. These come from a specified list based on the supported services and supported identifiers, which can be modularly extended as we wish. Here are some of the services currently linked in the tool. And if you want to know more, there is also a demo tomorrow. Thank you. All right, any questions? If one of the resources you queried didn't have the identifier, does it just come back as a null field, or are there two fields there? Yeah, so the question is what happens if one of the conversions basically fails because the service doesn't provide the identifier we want. Well, we just continue with the next conversion. And since it's an iterative process, it might happen that some later conversion helps us obtain an additional identifier, and when we run it again, maybe this time it will succeed. But maybe not. There is a log file where all of this is recorded, what happened and when, so you can go back and see what happened. All right, thank you.

Next today we have Nuala O'Leary from NCBI. Okay, so we're going to leave the world of mass spec now and move over to genomes. So yeah, my name is Nuala O'Leary, from NCBI, and it's a pleasure to be here. It's my first Galaxy conference in person. As Anton mentioned yesterday, we've been collaborating with the Galaxy team over the last year to integrate Datasets into Galaxy, and it's been a really, really nice collaboration. I'm going to start by giving some background on what Datasets is and why we started the project.
Then towards the end of the talk I will focus on what we've done with Galaxy. So first off, what is NCBI Datasets? The project started with the very lofty goal of making it easier to find data within NCBI. We started around the end of 2019, and where we are right now is that we provide a new way to get data for genomes, genes, and related data sets, plus a SARS-CoV-2-specific entry point. So why did we even start this project? Well, for anyone that's used NCBI, you know that data keeps increasing: it gets larger, it gets more complex. This has been a real burden on the repositories, to not only keep up with the data but make sure that our interfaces can serve the people coming in and trying to download it. So this project set out with the goal of creating new web interfaces as well as programmatic interfaces for finding and retrieving NCBI data. In addition to creating the interfaces, we also changed the way we deliver data that is complex, has multiple file types, and has versioned metadata: we package it as a coherent bundle of sequence plus metadata. We also wanted that metadata to be documented, so all the metadata in our packages has documented schemas published on our website. Part of this project is also to work closely with users. It was really important for us to develop very much in the open, reach out to people, get their feedback, and not build something that turned out to be useless. So this is our home page. If you want to learn anything about Datasets, just come here; it's the NCBI URL, slash datasets. As you can see in this image, this is the web interface you can use to search for genomes or genes. You can search by organism, we have auto-suggest, and we accept all the terms that people commonly know.
And if you look at the navigation bar at the top, we have portals to look for taxonomy and for genomes, the command line tools, which I'll get to in a second, and our documentation pages. So anything you need to know, please just go look at that web page. So what data do we actually deliver? I mentioned the data is getting more complex, so we have these different data bundles. I'm going to introduce a term we use, which is the data package. Whenever you download data through any of our interfaces, you get a zip archive that contains the data you chose. It will contain the FASTA sequence and annotation files, and it will also contain a metadata file. All of our metadata files are in JSON Lines format, and they describe the dataset, for example the genome. Metadata is often scattered across different databases within NCBI: it might live in the Assembly database, or in BioSample, or in BioProject. What we try to do with these metadata files is bring all of that together in one structure. So again, no matter which interface you're using, you're going to get the same data. We created the web interfaces because we know that's where most biologists are, but we would like to move people over to the command line, because the programmatic options are really the best way forward for a lot of people. So we've also created command line tools; we actually have two. Here I'm showing the schema, from our documentation, of our datasets command line tool. We wanted this tool to be very intuitive, much like our web pages. So it's broken down by action: a download command, which actually gets the data, and a summary command, which lets you browse the metadata. And it has a pretty simple structure: the action, then genome, gene, or virus.
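The "data package" idea, one zip archive bundling sequence files with a JSON Lines metadata file, can be illustrated in a few lines. The file names below only mirror the general layout described in the talk; consult the NCBI Datasets documentation for the exact package structure.

```python
import io
import json
import zipfile

# Build a mock "data package" in memory: data files plus one
# JSON Lines metadata file, all inside a single zip archive.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("data/genome.fna", ">chr1\nACGTACGT\n")
    zf.writestr(
        "data/assembly_report.jsonl",
        json.dumps({"accession": "GCA_000000000.1", "length": 8}) + "\n",
    )

# A consumer unpacks it and reads the metadata alongside the data.
with zipfile.ZipFile(buf) as zf:
    names = sorted(zf.namelist())
    meta = [json.loads(line) for line in
            zf.read("data/assembly_report.jsonl").decode().splitlines()]
```

The point of the bundle is that data and its versioned metadata travel together, so every interface hands back the same coherent package.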
So it's just high-level commands, then sub-commands and flags, and the inputs are common terms: you can query by taxon, or by most NCBI accessions and identifiers. One thing I want to point out, at the bottom here, is a command called rehydrate. For those looking for very large amounts of data, we implemented something called the dehydrated package. You can pass a flag to get a dehydrated download, and what you get is a zip archive that has the metadata file plus fetch files pointing at the actual data. You download that, and then you go back in with rehydrate to materialize those data files. The advantage is that there are a lot of retries built in, and it does the downloads in parallel, so it's a much faster experience if you're after something very large, like all the genomes of a group or a primate genome. Now, our metadata files are JSON Lines, a machine-readable format that is not very human-readable. So we've made another tool called dataformat. dataformat takes in those JSON Lines metadata files and lets you output TSV: you just give it the field names you're interested in (these field names are all in our documentation), and it outputs a TSV file. Then we have a documentation page. Within our documentation we have a reference section, including an OpenAPI specification, and we have how-to guides, which we're building as we talk to users, trying to turn them into workflows for using Datasets. We also have a GitHub site, where you can find Jupyter notebooks, and a training folder with tutorials and training materials from various workshops we've run. So what are we doing with Galaxy?
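What that dataformat step does, turning JSON Lines metadata into a TSV of chosen fields, is easy to picture with a small sketch. The field names below are invented for the example; the real documented field names live in the NCBI Datasets docs.

```python
import json

def jsonl_to_tsv(jsonl_lines, fields):
    """Select `fields` from each JSON Lines record and emit TSV,
    mimicking the idea behind the dataformat tool: missing fields
    become empty cells rather than errors."""
    out = ["\t".join(fields)]  # header row
    for line in jsonl_lines:
        rec = json.loads(line)
        out.append("\t".join(str(rec.get(f, "")) for f in fields))
    return "\n".join(out)

metadata = [
    '{"accession": "GCA_000000000.1", "organism": "Example sp.", "contigs": 42}',
    '{"accession": "GCA_000000001.1", "organism": "Other sp."}',
]
tsv = jsonl_to_tsv(metadata, ["accession", "contigs"])
```

The resulting TSV is trivially loadable in a spreadsheet or with standard tabular tools, which is exactly the human-readability gap the talk describes.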
So in the last year, we have incorporated our genome service into Galaxy. Here I'm showing what the interface looks like. I just want to highlight that there are several tools within Galaxy called NCBI-something, so I'll point out which ones are NCBI Datasets. That's the interface, and the way you use it is you go in and select the genome you want. You can do that by taxon, by assembly accession, or by BioProject accession. Then you go in and filter. You can filter down to just the reference genomes, the annotated genomes, or by source database. That last one is often confusing: we have both the GenBank database, which is the submitted data, and RefSeq, which comes from the annotation pipeline run by NCBI. So you can choose whether you want the RefSeq or the GenBank files, along with the other filters. Then you choose the files you want, from the selection of file types we have available. Once you've selected your files, they land in your history, and you can go ahead and use them in any sort of workflow. Where we are right now is that we're very interested in building workflows: Datasets is about getting the data, and the downstream workflows people have will range across all kinds of analyses as Galaxy workflows. Coming up, on the Galaxy test server you can see our gene service and our SARS-CoV-2 service under development. This project is under very active development; we're looking across the entire set of databases and continually adding improvements and additional data. So we do really, really want to hear from people that are using this: if any aspect of it is not easy to use, or something seems to be just missing. Lastly, I'd like to thank my team and the Galaxy team; it's been really nice working together over the last year. And again, contact us; that's my email.
We also have a feedback button on our pages; it goes to the whole team, we look at it every day, and we try to respond to people within 24 hours. One last thing I want to mention: we have a new initiative at NCBI, the Verde Genome Resource, which we are part of. Part of this initiative is taking the in-house tools we have, for example for contamination screening and for genome annotation, and making them available to the public; they will be arranged with the Galaxy community's help. So thank you all very much, and I'll take any questions.

We have not done that, but that's something we could talk about, I suppose. And yeah, I mean, the nice thing about Datasets is you're getting... Could you repeat the question? Oh, sorry. The question was: do we have data cached in Galaxy, so people don't have to go and download it again? I think that's something interesting to talk about. The data at NCBI changes a lot, so I think one of the advantages of getting it straight from Datasets is that you're getting data that is very current. But yeah, I'm trying to figure out how much we'd want to address that. Yeah, I just wanted to briefly mention that we actually have a job cache in Galaxy that we don't push aggressively, but it's perfect for this scenario, because it's very easy to see what has already been fetched. Well, the challenge with Datasets is that things change underneath, so cached copies go stale. It's a dynamic archive built with the latest data, which is great, but it's a challenge for caches. All right, well, we are over time, so if you have any further questions, please send them on Slack. Thank you very much.

Next, we have a presentation on the bulk RNA-seq deconvolution tooling that we have put together.
And yeah, so you're probably wondering what RNA-seq deconvolution actually is. Essentially, you have your bulk RNA-seq data sets, and you don't necessarily know what the cell composition is; this is information you normally don't have. You can infer it by clustering methods, or by taking an existing single cell RNA-seq data set, extracting the cell types from previously clustered data or by other means, and then fitting those expression profiles onto your bulk RNA samples, which gives you the cell type composition of your bulk data. One reason you might want to do this is that maybe you find a specific rare cell type in your single cell data and you want to check whether it's real: you can download other available bulk RNA data sets and check whether you get any hits of this rare cell type in the bulk data. Also, if you want to compare different tissues, say diseased versus healthy, and see how the cell type composition differs between the two, this is a way to do that. And it's simply much cheaper than doing single cell: if you just have bulk data, you can take a publicly available single cell data set and try to fit it onto your bulk RNA-seq data. The tool we chose is called MuSiC, multi-subject single cell deconvolution. It's actually not the only MuSiC in Galaxy; there's another one, but this one has the unusual capitalization, so it's more unique, I guess. The way it works is that it takes single cell data sets from many different subjects and looks for consistent cell types, or rather consistent classes, across them, and it tries to find informative genes by looking at cross-subject mean and cross-subject variance.
Using these informative genes, it builds a gene expression profile for each of the cell types you have, and you can then fit this against your healthy bulk RNA-seq data or your disease RNA-seq data and compare the cell types. So why did we choose MuSiC instead of other deconvolution tools? Well, there was a benchmark paper in 2019 that compared many different deconvolution tools, and deconvolution tools come in three types. There is reference-based, which is essentially what I explained before: you have single cell RNA-seq data that's already clustered and you fit it onto your bulk RNA-seq. There is also reference-free, where you literally just have your bulk RNA data and you try to infer cell types intrinsically. You can actually see that this is not a great way to do it, because CAM-free and LinSeed are reference-free methods and they don't tend to score very well in the benchmarks. The benchmark itself took simulated data, added some very noisy profiles, and then checked the Pearson correlation. The third way is marker gene-based analysis, where you feed in a bunch of marker genes and it tries to infer cell types from those; I think that was DSA, which also didn't score very highly. So the reference-based methods were among the best: CIBERSORT, MuSiC and TIMER. But MuSiC was special in the sense that you didn't actually have to feed it a set of differentially expressed genes as well; it can infer cell types purely from the single cell data you fit onto it. Another reason MuSiC was chosen is simply that it had a conda package and there were tutorial materials, and some of the other tools simply did not have that.
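The core of any reference-based method in this family can be sketched as a constrained regression: given a signature matrix of per-cell-type expression profiles, find non-negative mixing proportions that best reconstruct the bulk sample. MuSiC itself additionally weights genes by cross-subject variance; the sketch below uses plain least squares with negative coefficients clipped to zero as a crude stand-in for a proper non-negative least squares solver, and the numbers are made up for illustration.

```python
import numpy as np

def deconvolve(signature, bulk):
    """signature: genes x cell_types matrix of reference expression
    profiles; bulk: expression vector of one bulk sample.
    Returns estimated cell-type proportions summing to 1.
    (Real tools solve a proper non-negative least squares problem.)"""
    coef, *_ = np.linalg.lstsq(signature, bulk, rcond=None)
    coef = np.clip(coef, 0.0, None)  # crude stand-in for NNLS
    return coef / coef.sum()

# toy signature: 3 genes, 2 cell types with distinct marker genes
signature = np.array([[10.0, 0.0],
                      [0.0, 10.0],
                      [5.0, 5.0]])
# a bulk sample that is truly a 70/30 mixture of the two cell types
bulk = 0.7 * signature[:, 0] + 0.3 * signature[:, 1]
proportions = deconvolve(signature, bulk)
```

On this noise-free toy mixture the fit recovers the 70/30 proportions exactly; the benchmark comparisons mentioned above are about how well the different methods hold up once realistic noise is added.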
I mean, some of them were written in MATLAB, which is a complete no-go for wrapping something in Galaxy, I think, and some of the others hadn't been updated in many years. And I think the tool developers here are very familiar with the general workflow of wrapping a tool: does it have a conda environment? If not, can one easily be made? If so, is it well documented, are there examples? And if so, then you can actually make a nice Galaxy tool from it. So the first rendition went surprisingly quickly. I took the tool as it was, wrapped the tutorial materials they had, and reproduced their entire workflow in Galaxy. Here you see one plot showing the proportions of alpha, beta and gamma cells; I think there are about 79 bulk RNA samples there. It produces a nice heat map, we have a nice workflow, we have nice training. And from my point of view, it was basically finished. I also implemented an ExpressionSet data type, and here I've written a note to myself: why yet another data type for Galaxy? Because this is something I'm very guilty of, putting data types into Galaxy thinking they're useful when they're never actually used. But here I can justify it, because ExpressionSet is actually a widely used data type for omics assays in R, so hopefully this one will be used by other tools, not just by MuSiC. So then I passed this on to Wendi and said: hey, this tool is finished, let's get this published. And she pointed out very specific things which I had overlooked from a developer's point of view. From a developer's point of view the tool is sound, it works, it's stable, and it's finished. But is it useful?
And this is where I began to see that, actually, no, it was very much primed toward one very specific use case: looking at beta cell proportion against a phenotype factor in the bulk data, HbA1c, where HbA1c is a marker for type 2 diabetes, and as type 2 diabetes progresses, your beta cell proportion drops. The tool was very good at showing that specific use case, but not anything else. We also saw that it wasn't actually comparing multiple single cell data sets against multiple bulk RNA data sets; it was just doing a one-to-one comparison. So we thought: okay, we need to make this much more useful. And, right, a quick side rant. This was me trying to abstract code that was written for a very specific purpose. As you can see, everything is hard-coded, everything is really, really messy, and I'm surprised it got into Nature Communications. But I guess the methodology was fantastic; it's just the underlying code that was not. Great for them, less great for me. So I had to abstract this as much as I could, and it took some time, but we finally have the second edition of the tool, where you can now compare multiple single cell RNA-seq data sets against multiple bulk RNA-seq data sets and see them in combined plots, so you can compare different bulk samples by looking at their cell type compositions. You can also look at them at a factor level: if you have phenotype factors in your bulk RNA data set which represent, say, different disease phenotypes, you can compare them in a unified way. That, I thought, was quite nice. For future work, we're going to put together a multi-factor tutorial based on this extended version of the tool, and I'm going to use it for capstone projects for undergraduates, who can then perform their own research using these bulk RNA deconvolution tools.
And we're going to integrate it with the existing tutorials in Galaxy by creating a bridge workflow where you extract ExpressionSets from existing single cell RNA-seq data sets and then feed them straight into the tool. If anybody wants to try this out, we're having a workshop tomorrow. It's going to be a crash course in single cell, essentially taking you from raw reads to clustering. It's very interactive and somewhat gamified; if you're into board games, definitely come along. And then we'll have bulk RNA deconvolution, which takes you from the clustering stage to the deconvolution stage. So with that, I want to thank Wendi and her team at the university, the Galaxy folks in general, and my own team and professor. Any questions?

Really cool stuff, and I like that you took it and pushed the changes back upstream. What was your feeling, were people happy that you improved the code? It really sounds like this should be another paper, then. Right; interestingly, I didn't push it upstream. I was thinking about it, but the conda package I took was actually behind the GitHub source code, so I had wrapped an old version of their code. At this point I'm not even sure I could rebase onto the original project. And given the rewrites and the extension, I do kind of think it's its own standalone tool, but I don't want to step on any toes by making it its own tool, because the underlying methodology is theirs, right? All I did was repackage it in a nicer, more abstract way. What I've done makes it more useful, I guess, but the theory is theirs. So is that enough for me to break it out into its own tool? I'm not sure. Have you been in communication with their team? No. Maybe I should try to. Yeah, I would strongly recommend that; otherwise, you will end up supporting this far longer than you want to.
I would also think, and assume, that they will be very happy about this; and if not, that's a red flag. Right. All right, well, we are over time, so if you have any further questions, please reach out on Slack as before. Of course.