Good morning everybody. This is session 19, the second part of the science talks session, and we'll start with Reed Wagner, who will be talking about phylogenomic data in Galaxy with CloudForest. / Yes, thanks for the introduction. Good morning. So I'm talking about utilizing CloudForest, a phylogenomic software suite that we are developing in tandem with LSU and FSU. Researchers there are particularly interested in understanding how evolutionary histories vary across different regions of the genome. We built this from the beginning within Galaxy, targeting multiple different platforms. The code base is divided into three main pieces. The first is TreeScaper, the computational core of CloudForest: a C++ command-line application that we wrap in a Galaxy tool wrapper and add to a Docker image based on the Galaxy Docker image. On top of that sits a visualization code base built in JavaScript with D3. This has been really useful for us, because the visualization plugin framework makes it super easy to couple the computational part of our application with our visualizations — there's just a little button right on the history item, so it all stays within the same cycle. The first visualization I want to talk about is nonlinear dimensionality reduction. For some background, in front of you are three phylogenetic trees. Each one has the same species but a different evolutionary history. When you infer these trees from DNA, depending on what part of the genome you look at, you can actually get different histories.
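To make concrete what it means for two such trees to differ, here is a minimal, generic sketch of the distance-then-embed idea: the Robinson-Foulds distance (symmetric difference of the trees' bipartition sets) gives a pairwise distance matrix, which an embedding step can reduce to three dimensions. The tree encodings are toy assumptions, and classical MDS stands in for whatever NLDR algorithm the suite actually uses — this is not CloudForest's implementation.

```python
from itertools import combinations

import numpy as np
from sklearn.manifold import MDS

def rf_distance(biparts_a, biparts_b):
    """Robinson-Foulds distance: size of the symmetric difference
    of the two trees' bipartition sets."""
    return len(biparts_a ^ biparts_b)

# Toy trees: each represented by its set of non-trivial bipartitions
# (the set of taxa on one side of an internal branch).
trees = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABD")},
    {frozenset("CD"), frozenset("ACD")},
]

# Pairwise distance matrix over the tree set.
n = len(trees)
dist = np.zeros((n, n))
for i, j in combinations(range(n), 2):
    dist[i, j] = dist[j, i] = rf_distance(trees[i], trees[j])

# Embed the tree-to-tree distances into three dimensions for plotting.
coords = MDS(n_components=3, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)
print(coords.shape)  # (3, 3)
```

With thousands of trees the same pattern applies: one row of the embedding per tree, plotted as a 3D scatter.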
And since there's a lot of DNA, you can imagine that if you look across regions of it you can get thousands of different trees, and researchers are interested in the structures represented in these large sets of trees. They approach that by first asking how you can measure the similarity between any two trees — or, from another perspective, what their distance is under a tree-space metric. The thing is, this tree space is of quite high dimension, not something we can intuitively understand. So we include a nonlinear dimensionality reduction tool, which takes this high-dimensional tree space and reduces it down to an approximation in three dimensions. On the right you can see trees that were inferred across 13 different proteins in the mitochondrial genome, and it's really cool — when we do this, you can see this really intuitive structure. We can also do this computationally, so we have a community detection tool, which is basically a clustering algorithm for tree space. On the right you see our results for the same data set, and you can see the clear 13-community structure. I won't go into too much depth on the left-hand side, but I will say that the communities that are detected vary based on a lambda parameter. So we put this side by side with a slider, so that users can adjust lambda in real time and watch larger and smaller structures emerge across the tree space. Finally, I want to talk about covariance networks. Circled here are two bipartitions, out of many, in these two phylograms. A bipartition corresponds to a single branch of a tree; another way to think about it is that it divides the taxa into two sets — on one side of the branch there's one set of taxa, and on the other side there's another.
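Encoded this way — each bipartition as the set of taxa on one side of a branch — a tree set becomes a presence/absence matrix, and correlating its columns yields a network of bipartition associations. The following is a toy illustration of that idea, not the actual CloudForest computation.

```python
import numpy as np

# Toy presence/absence matrix: rows are trees, columns are bipartitions.
# presence[t, b] == 1 if bipartition b appears in tree t.
presence = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
])

# Correlate bipartition occurrence across the tree set.
corr = np.corrcoef(presence, rowvar=False)

# Keep strong positive/negative associations as network edges.
threshold = 0.5
edges = [(i, j, corr[i, j])
         for i in range(corr.shape[1])
         for j in range(i + 1, corr.shape[1])
         if abs(corr[i, j]) >= threshold]
for i, j, c in edges:
    print(f"bipartition {i} -- {j}: r = {c:+.2f}")
```

Each surviving edge is a dot-to-dot line in the network view: positive when two bipartitions tend to appear in the same trees, negative when one tends to appear without the other.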
So two different trees with different histories can actually still share a bipartition. Even though these trees are different, if you looked closely you would see that for a particular branch in each, you'd have the same set of taxa on one side and the same set on the other. The researchers we're working with are interested in comparing bipartitions across a tree set — for example, bipartitions 145 and 146 — and understanding how often two bipartitions appear in a tree together, or how often one appears without the other. We do that with covariance networks. On the right is one: each dot represents a bipartition that is possible in this tree space, and a line connecting two dots represents a positive or negative correlation in how frequently they appear in a tree together. In this visualization we also put a phylogram next to the network, and you can scroll through the phylograms of the tree set. That makes it really intuitively useful for researchers: they can use the information on the right to map, in real time, the bipartitions in the covariance network onto the phylogram and get information on each one — it's a mouse-over type deal. You can select different ones and export the results. So, for the future: we're right now looking at our next round of data testing, so definitely reach out if you have phylogenomic data and want to do any of these analyses. We've got a public project site that you can check out. Thanks. / We have time for one question. The question is how large of a tree set this can handle. I don't think we've really benchmarked that — but this was a data set of 2,000 trees that you saw with the 13 proteins.
So that's just a reference point, and that was not too difficult. The other thing is that we also have this deployable locally — I'll show it on my laptop during the demo — so scaling is a bit of an open question for us. We're going to be around if you want to ask about any of this. / Hi everybody, my name is Cameron. I'm a computational biologist in Jeremy Goecks' lab at Oregon Health and Science University, and I'm going to be talking about Galaxy MTI, a tool hub for end-to-end analysis of multiplex tissue images. I'll begin with some background and motivation for Galaxy MTI, then describe some of the tools that are available and some standardized workflows we've been developing, and then talk about the interactive visualization tools that are available as well. To get into the background: there are two gold-standard microscopy methods in clinical oncology and cancer research, H&E and IHC. H&E images allow pathologists to look at two broad cellular components within a tissue — the stain marks the nuclei and the extracellular matrix differentially — so it gives them a good broad idea of the overall organization of the tissue. IHC builds off of this by staining a specific protein in a tissue using antibodies. These have been used almost universally for many different things in cancer research, including understanding disease development and subtyping different kinds of cancer. But these methods are inherently limited: H&E only gives you that broad organization of the tissue, and IHC is limited to a single protein per tissue section. And this has led to the development of multiplex tissue imaging assays.
This is a category of spatial proteomics assay. The different methods are incredibly diverse, but they all allow you to quantify many different proteins within the same section of tissue, at single-cell resolution, while preserving the spatial organization of the tissue. This generates a really rich data set — essentially a 2D tumor map — with many applications to precision oncology, including exploring the tumor immune microenvironment. At the bottom here is an example of what a multiplex tissue image might look like: each protein in the assay is captured in a separate image, and these get stacked on top of each other to give the final multiplex image. I'm only showing four proteins here, but these assays can vary between 20 and 100 proteins, so you get a sense that you can learn a lot from these images. The protein markers used will vary depending on the question the scientist has, but they can be indicative of a cell lineage — for example, in pink is pan-CK, which is used to identify epithelial cells — or they can be functional markers — for example, Ki-67 is used to identify cells that are proliferating. So these multiplex tissue imaging assays are really powerful. The issue is that performing these analyses at scale in an automated, reproducible way is pretty challenging. Among the challenges: multiplex tissue image data sets are huge and require significant computational resources to process. There are a lot of steps to go from raw images off the microscope to an analyzable cell feature table. And there are different kinds of inputs — you can have, for example, a piece of tissue that occupies the entire microscope slide, or many smaller samples multiplexed onto one slide, which would be a tissue microarray.
Those are obviously going to differ in how they have to be processed. And one of the biggest challenges is that, because there's such a diversity of methods that can produce this data, different processing methods are optimal for each kind of data. So to be able to analyze all of these, you really need a robust and comprehensive tool suite. This motivated the development of Galaxy MTI. On the left is a general overview of what we would consider a comprehensive MTI workflow, split into two phases. The first is primary image processing, which is largely handled by the tools developed at Harvard Medical School for the MCMICRO pipeline. The second phase is the single-cell analysis, where you're looking at composition and also spatial features. And throughout all of these steps there are opportunities to create interactive visualizations. This is all available on cancer.usegalaxy.org — Luke Sargent talked a couple of days ago about the infrastructure this instance sits on. In the screenshot of the instance you can see in the tool panel there is a multiplex tissue imaging tools category, with tools for every step of the comprehensive workflow. For specific steps that we've identified as varying quite drastically between imaging modalities, we've included several tools. A good example is cell segmentation — the process of identifying and indexing unique cells within a tissue — which can be really variable between different assays, so we've included four segmentation tools for that. Now I'm going to talk about some standardized workflows we've been developing for multiplex tissue imaging. This is an example of a workflow for whole-slide imaging, and it begins with a registered image, where all those separate protein images are stacked on top of each other.
It goes through cell segmentation; from the segmentation mask, mean marker intensities for every protein in the assay are quantified, as well as morphological features and spatial coordinates of each cell. From this it's converted to the AnnData format, a really common file type in single-cell analysis, and that is the basis for the second half of the workflow — the downstream compositional and spatial analyses. So, for example, we can then phenotype the cells, classifying them based on the distribution of certain lineage markers, and from there perform spatial analysis using the Squidpy package, which we've wrapped for Galaxy, to do analyses such as looking at how frequently different cell phenotypes co-occur within the spatial context of the tissue. We have also developed a standardized workflow for tissue microarrays, which, again, are where you have multiple small samples multiplexed onto a slide. The main difference in this workflow is that you first have to split all of those individual samples into separate images. And then — one of the great things about Galaxy — you can basically just invoke the whole-slide imaging workflow on the collection of those individual core images. Briefly, I'll describe a little of how we've been validating these workflows. This is some data that we processed using the whole-slide imaging workflow. It's from a piece of tonsil tissue that was imaged using three different multiplex tissue imaging assays. We ran them all through the same workflow and found that the segmentation results were very consistent. What you can see here are the results of phenotype assignment, and the distributions are very similar; the same was found for spatial analysis as well. We also did a similar process to validate the tissue microarray workflow.
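The quantification step described above — per-cell mean marker intensities computed from a segmentation mask — boils down to grouping pixels by cell label. Here is a minimal sketch with toy arrays; real quantification tools (for example the MCMICRO quantification module) also extract morphology and spatial coordinates, which this sketch omits.

```python
import numpy as np

# Toy data: a 2-channel image (height x width x markers) and a
# segmentation mask where 0 is background and 1..N index cells.
image = np.array([
    [[5.0, 1.0], [5.0, 1.0], [0.0, 8.0]],
    [[5.0, 1.0], [0.0, 0.0], [0.0, 8.0]],
])
mask = np.array([
    [1, 1, 2],
    [1, 0, 2],
])

def mean_marker_intensity(image, mask):
    """Mean intensity of each marker channel within each labeled cell."""
    labels = np.unique(mask)
    labels = labels[labels != 0]          # drop background
    table = np.stack([image[mask == lab].mean(axis=0) for lab in labels])
    return labels, table

labels, table = mean_marker_intensity(image, mask)
print(labels)  # [1 2]
print(table)   # one row per cell, one column per marker
```

The resulting cells-by-markers table is exactly the kind of matrix that gets stored as the `X` of an AnnData object for the downstream single-cell steps.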
One of the really great parts about Galaxy MTI is that it also includes interactive visualization dashboards. The first one I'll talk about is called Avivator, and it's more of a QC tool. What's cool about it is that you don't actually have to invoke a tool to run it. On the cancer instance, when an OME-TIFF — the standard data type for multiplex tissue images — is detected, there'll be a link in the history that says "display in Avivator". When this is clicked, it launches a new window where you can view the image, pan and zoom — the resolution adjusts because these images are pyramidal — turn different marker channels on and off to do a visual analysis of each marker's expression within the tissue, and toggle the intensity of each of those markers. The yellow box is just showing what it looks like when you zoom in on a particular section. The other visualization tool we have is Vitessce, which starts from the same place as Avivator: in the bottom right you're able to see the actual image itself, and again you can zoom in and turn marker channels on and off. Vitessce also takes the downstream AnnData file as an input, as well as a segmentation mask. This allows you to overlay the segmentation mask onto the actual image, so you can see how well the cells were segmented. The mask can also be color-labeled with the phenotypes that were assigned during the workflow, and those are linked to interactive bar plots and violin plots, so you can really explore the data from your downstream processing while you're looking at the image. That's especially valuable for pathologists, who like to look at the actual image in the context of the data they're thinking about. Also in Vitessce are heatmaps and UMAP plots, and you can drag these panels around and resize them.
This is all done within the middle panel of Galaxy: once the tool has been launched, a rendered HTML file is displayed. So, to wrap up: multiplex tissue imaging assays produce super powerful data sets with a ton of applications in precision oncology and cancer research, but the analysis is pretty difficult. Galaxy MTI makes it a lot easier by providing tools for primary image processing and the downstream analyses, along with interactive visualization tools. A really key thing is that these are generalizable to different imaging modalities. We've also been working on creating standardized workflows for multiplex tissue imaging analysis, which is really crucial because this hasn't existed for this type of assay — it hopefully starts to bring it more in step with, for example, omics assays, which have lots of standardized workflows available. One really important note: because we're basically taking an image and distilling it down into a tabular AnnData format, it can then be piped into any Galaxy tool that accepts that data type — there is a suite of other single-cell tools available in Galaxy — so there's a potential way to link these tools together. Finally, this bolsters Galaxy's imaging analysis capabilities in general quite significantly. I'd encourage you all to give these a try at cancer.usegalaxy.org. There are shared histories available where you can play around with the Vitessce dashboards. We also have a preprint coming out soon, as well as a Galaxy Training Network tutorial. Lots of people contributed to this — oh, many of them got cut off the screen — lots of people from my lab, as well as many others from different institutions working on the actual tool development. Thank you very much.
Do you need a big data set for an analysis like this, or can you do it with a small amount of data? The reason I'm asking is that in Canada we don't routinely divide the biopsy samples, so the amount of data is pretty limited. / Yeah — sorry, so the question was, can you run this if you have a limited amount of data? Yes, you can run these workflows on just a single multiplex image; it doesn't have to be a suite of images or a large data set, if that answers your question. / Cameron is going to be around, so feel free to meet him and ask more questions. Our next speaker is going to be talking about processing of small-molecule GC-MS data in Galaxy. / Thank you very much for the introduction, so let's get started. First of all, I'd like to highlight that this is a team effort — these are all the people without whom this would not be possible, especially the colleague who set up our Galaxy instance and keeps it running in great shape. We are based at RECETOX, which is part of Masaryk University. We have about 250 staff overall working on environmental sciences, including three core facilities, and we're based in beautiful Brno, the second-largest city in the Czech Republic. We measure chemicals using mass spectrometry, focusing on small molecules, and because of the differences in the samples and the instrumentation we use, we require dedicated workflows and tools. So we have our study design, we collect samples, the samples are prepared, they run through a chromatographic separation, and we have the detection in the mass spectrometer. We get our data, and then we want to find out what chemical compounds are present in the sample, using our data processing workflow. We have raw data, and we initially detect peaks in that data, distinguishing signal from noise.
We align our peaks across samples to get a global feature table, to look at the differences between the samples we analyzed. We might have to correct for certain errors and batch effects, which are caused by time differences and differences in the analysis. Then, since this is GC data, we have to link the individual peaks we detected into actual features — meaning the compounds we want to detect, which elute at a given point in time. We enrich our data with some additional information, and in the end we identify our spectra by comparing them to a library, getting a helpful table that includes the number of matched ions and scores for the probability of the compounds being identical. For all of that, we built a workflow from many different packages. Some of these packages we developed from scratch; some we took over from other existing repositories, refactored, and improved. We wrapped them all for Galaxy and put them into our instance. Let me quickly go over the individual steps. The first, peak detection, is a multi-step procedure that does the actual feature extraction, retention time correction, and so on. Initially we had this as a single Galaxy tool, and we split it apart into multiple tools to achieve better performance and scaling, and to reduce the memory requirements — and also to make the tool easier to understand, because it's far easier to look at fewer parameters, think about them, and tune them. This also keeps it compatible with the other packages in the same pipeline. After that, for the batch effect correction, we put the WaveICA tool into Galaxy. Alongside that, we also implemented an interactive PCA, so you can take your data inside Galaxy and get a 3D PCA plot — you can change the colors and so on. This is available as a visualization plugin, and it has also been contributed to the main Galaxy repo.
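Under the hood, a PCA view like this just projects the sample-by-feature intensity table onto its first three principal components. A minimal sketch with random stand-in data (the real plugin renders the scores as an interactive 3D scatter plot; the array shapes here are arbitrary assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in for an aligned feature table: 12 samples x 50 peaks.
rng = np.random.default_rng(0)
features = rng.normal(size=(12, 50))

# Project onto the first three principal components for a 3D plot.
pca = PCA(n_components=3)
scores = pca.fit_transform(features)
print(scores.shape)  # (12, 3)
```

Each row of `scores` is one sample's position in the 3D plot; coloring the points by batch is what makes batch effects (and their removal by WaveICA) visible.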
The next step in the pipeline performs the deconvolution. This package was originally developed by Corey Broeckling at Colorado State University, and we started working on it together at a later point. Just to give you an idea of the type of work we do: for example, there was a problem with computing correlation matrices, which get very large when you have a lot of features, and we improved the computational side of this to make it more efficient — the slides will be available online, so you can have a look at the details. In GC data, a retention index is often used as an additional means of identifying compounds: the mass spectral information is one part, and the chromatographic information gives you an index which you can also match against a database, and which is standardized across instruments and experiments. We developed a small package to do just this computation for various types of input data, and put it into Galaxy as well. Last but not least, for the actual identification we contributed to the matchms package and put it into Galaxy, wrapped as multiple tools for filtering mass spectral libraries and for similarity computation. The matchms team is currently working on a molecular networking module and on the visualization side in Galaxy, so there's lots of interesting stuff coming up. We assembled these tools into a workflow to make the processing easier and reproducible. This includes the whole GC-MS processing part, and also a dedicated peak detection workflow, since that part is made up of multiple tools. All of our tools and workflows are available in our GitHub repo and on our Galaxy instance. We haven't yet settled on a particular platform for distributing the workflows; that's something we still have to decide. And of course we are also working on training materials for the tools and workflows.
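Stepping back to the retention index mentioned above: in its simplest linear form it is just an interpolation of a compound's retention time against a ladder of n-alkane standards. The helper below is a hypothetical sketch of that formula, not the API of the actual package.

```python
import bisect

def linear_retention_index(rt, alkane_rts, alkane_carbons):
    """Linear (Kovats-style) retention index: interpolate the
    compound's retention time between the two bracketing n-alkanes."""
    i = bisect.bisect_right(alkane_rts, rt) - 1
    if i < 0 or i >= len(alkane_rts) - 1:
        raise ValueError("retention time outside the alkane ladder")
    n, t_n = alkane_carbons[i], alkane_rts[i]
    n1, t_n1 = alkane_carbons[i + 1], alkane_rts[i + 1]
    return 100 * (n + (n1 - n) * (rt - t_n) / (t_n1 - t_n))

# Alkane ladder: C10 elutes at 5.0 min, C11 at 7.0 min; a compound
# eluting at 6.0 min lands halfway between them.
ri = linear_retention_index(6.0, [5.0, 7.0], [10, 11])
print(ri)  # 1050.0
```

Because the index is relative to the alkane ladder rather than to absolute times, it transfers across instruments and experiments, which is what makes database matching possible.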
We've started now with the overall pipeline, and maybe at the CoFest we'll also work on dedicated training for the peak detection part — contributions and feedback are of course welcome. I'd also like to acknowledge our funding, which comes from the European Union; without it, this work would not be possible. Thank you very much. I'll be around at the CoFest, and there will be a demo today after the session. / We have time for a quick question. If not, Helge is going to be around today as well. Thank you very much. Our next speaker is Kelly Nygren, and she's going to be talking about x-ray imaging. / I'm Kelly, here to talk about X-IMG, the x-ray imaging of microstructures gateway. We have built a gateway for the structural materials science community — a community of scientists and engineers who study materials that can sustain a load. We're interested in understanding these materials, being able to predict their behavior, and designing new versions of them. In our field, we need to understand everything from the atoms and how they assemble, up through all the length scales in between, to the parts that are formed. There are characterizations and modeling efforts that happen at each of these length scales. One of the most important things to know about materials science is that microstructures — the features at these really small length scales — dictate a material's overall behavior. We're new to this community, so Rolf and I want to tell you a little bit about us. I'm a materials scientist and engineer; I study metals and their performance, and I have a pretty unique position at Cornell. I work at a user facility, where I lead a program in structural materials for the Air Force — I look at parts that come out of fighter jets and the like.
I'm also a faculty member across campus in mechanical engineering, so I'm also a user, and I do other modalities besides x-rays. Rolf is one of our software developers at CHESS. One thing I can say for sure is that structural materials and the synchrotron community both need data-intensive computing help. In both cases, we have big, complex data and data workflows. Materials problems require complex, multimodal 3D and 4D data sets to understand how materials behave, and it's really challenging for us to plan, collect, register, interpret, and curate these data sets. Some things to note, as is a theme around here: our unreduced data sets are really big — tens to hundreds of terabytes or more. Our complete data sets are very precious; often you have one workflow and one data set per PhD, which also means they were cultivated for that PhD. Resources like a synchrotron are rare, and curated data sets are rare in our community — not because we don't want to share them, but because the complexity and the amount of metadata that comes along is just not being tracked, so sharing isn't possible for us. One of the things we've been asking ourselves is who should be responsible for initiating this. I'm at a user facility: we are a particle accelerator, we harvest our x-rays, and we provide x-rays as a service. We also provide a compute farm. And we're at an inflection point right now — we're recognizing it's not enough to just provide x-rays and computing hardware. How do we facilitate scientific discoveries, the training of our scientists, and the dissemination and curation of data sets, workflows, and software? What it came down to was really looking at how to tackle complexity and cultivate expertise in our users and in our community. Synchrotron users in particular carry a very high burden of expertise.
They need to be experts in their domain science. They need to be experts in the x-ray techniques. Right now, they absolutely need to be data-intensive computing experts — that is a big challenge for us — and then there's any other modality they need to stitch on. In the end, we want to democratize access to our tools, share data sets within our community, and ultimately be able to focus on which pieces of the science need expertise cultivated, because we can't lower the barrier everywhere. That's why we want to adopt an extensible framework to target these goals. One thing I found out really quickly is that our workflows need a lot of help. They were never designed to be abstracted or to fit nicely into any kind of framework — I thought they were mature enough, and they weren't. That's what the last couple of years have been about. We have now been able to abstract out a lot of our workflows, but as you go into each of these blue boxes, you eventually get to a software piece that is extremely chaotic — and this is the case for every software piece. Here's one of our most mature workflows: everything you see in pink is manual and not tracked in the workflow. This is obviously a nightmare, and it's only one piece of the workflow. So we've adopted Galaxy and started putting these workflows in, and I'm very excited about this — just being able to track the histories. Where we are now: we have a production version being rolled out for users, and we've added some of our canonical and most troublesome workflows for user testing. Users are starting to do beta testing, and some of our stronger community developers are starting to produce something that looks a little closer to software and less like scientific scripts, so that we can start making more tools.
It turns out socializing Galaxy wasn't as much of a challenge for me as socializing the idea that scientific infrastructure and strategy are actually part of our responsibility as a community. Both groups were initially pretty skeptical, but I can say that this year the sentiment was pretty ubiquitous: people are really excited to adopt this, and I think we're ready to take on Galaxy as a community in both cases. Just a plug to let you all know, since we are new: our community is not very strong in software engineering. They're pretty strong at writing scientific scripts that work, but our most neglected steps are making those general and turning them into anything that looks like software that can be disseminated to somebody else. The initial feedback was that having a framework like Galaxy incentivizes us to do these middle two steps, which has been exciting. But most of the scientists have identified that they need resources — people with software expertise — to help them get through this pipeline, so I'm interested in hearing from everyone else about how you make this sustainable. My last plug is the things we really want to do that we haven't started yet: the first steps are interactive visualization, and strategies for how to move data around when we can't port our entire workflow into Galaxy at once. Thank you. / I noticed that in your Galaxy workflows you have very thin noodles, so I'm wondering whether you operate on collections of data or just on one thing at a time. / It's both. Oh, sorry — the question was, since we have thin noodles going into things, whether we compute on one thing or on collections of things. Part of the answer is that a lot of our data types actually contain collections themselves; we use a lot of HDF5 containers.
So a process may be working on a very massive data set, but in certain cases it's actually one input — that's one way to answer that. / We are running out of time, so put your questions on Slack or reach out to Kelly. We have switched our schedule a little bit: next, Fabio Cumbo is going to talk about using sequence bloom trees in Galaxy with MetaSBT. / Good morning folks, I'm Fabio Cumbo, a postdoctoral fellow in the Blankenberg lab at the Cleveland Clinic. Today I'm going to talk to you about MetaSBT, a framework for characterizing microbial genomes with sequence bloom trees. I'd like to start quickly with some microbiome background. It's pretty much clear to most people now that when you want to study human health and human diseases, we not only have to consider our own cells — in many cases we also have to consider our microbiota, the set of microbes that perform important functions for our body. These functions range from helping digest food, to training our immune system, to helping with our metabolism, or even the gut-brain axis — really affecting our body in multiple ways. We actually don't yet know the extent of the interactions between the microbiome and the different aspects I just mentioned, but it is really something we need to look into much more. Metagenomics allows us to computationally study not only well-characterized microbes, but also those that are pretty difficult — and sometimes impossible — to cultivate. The possibility of assembling genomes from metagenomes has paved the way for large comparative genomic studies of both culturable and unculturable microbes at different resolutions, including those yet-to-be-named microbial species.
The problem is that we are still lacking a systematic procedure to organize and process large collections of metagenome-assembled genomes (MAGs) together with reference genomes from isolate sequencing, and the metadata describing their relation to host health and environmental factors. This is the reason we developed MetaSBT, a scalable framework able to organize and index microbial genomes and accurately characterize metagenome-assembled genomes from sequencing data. MetaSBT comprises different modules, different subroutines. The first one is called index, and it is responsible for indexing a set of reference genomes, which you can retrieve from NCBI GenBank, organizing them following their taxonomic classification and building a sequence bloom tree for each taxonomic level. It of course includes a quality-control procedure, and a deduplication procedure that makes sure we are not going to process the same genome twice. It also provides a boundaries module: this essentially processes every cluster at each taxonomic level and records what we call the cluster-specific boundaries. For example, given a specific species, it records the total number of k-mers, in addition to the minimum and maximum number of k-mers in common among all the genomes under that species; the same applies at all the other taxonomic levels. The results produced by this module give an idea of the genetic diversity within a specific cluster, and they are really crucial for establishing whether an input genome should be added to the closest cluster, but we will come back to this in a moment. MetaSBT also provides a profile module that takes an input genome and tries to characterize it by querying the database. Since we have different taxonomic levels, each query is actually expanded into seven different queries.
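As a rough illustration of the boundaries idea only (this is not MetaSBT's actual implementation; the toy sequences and tiny k-mer size are invented for the example), the minimum and maximum number of shared k-mers within a cluster could be computed like this:

```python
from itertools import combinations

def kmers(seq, k=4):
    """Return the set of k-mers in a sequence (toy k; real tools use much larger k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def cluster_boundaries(genomes, k=4):
    """Min and max number of k-mers shared by any pair of genomes in a cluster."""
    shared = [len(kmers(a, k) & kmers(b, k)) for a, b in combinations(genomes, 2)]
    return min(shared), max(shared)

# Hypothetical species cluster: three short genome fragments.
species = ["ACGTACGTGG", "ACGTACGTCC", "ACGTTTGTCC"]
lo, hi = cluster_boundaries(species)
# An input genome is a candidate member of this cluster only if the number of
# k-mers it shares with the cluster is compatible with the recorded boundaries.
```

The exact decision rule MetaSBT applies on top of these boundaries is described in the talk only at a high level, so the candidate-membership comment is a simplification.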
With the first query we can establish the closest kingdom; then we query within the closest kingdom to establish the closest phylum, and so on, down to the closest species and the closest genome in the database, which also gives the closest strain. We also have an update module, which is responsible for updating the database with new sets of genomes, both reference genomes and MAGs. It first runs the profile module, so it establishes their closest kingdoms, phyla, and so on, down to the species and the closest genomes. Here is where the boundaries module is really required: in conjunction with the profiles, it establishes whether an input genome should or should not be added to the closest cluster at the species level, by looking at the number of k-mers in common between the input genome and the closest cluster. In case we are not able to characterize a genome, we keep it unassigned, and once we have processed all the input genomes we cluster all these unassigned genomes together according to their profiles, and finally build new clusters at the different levels. This will hopefully end up defining new families, new genera, and new species with no reference at all; those are the yet-to-be-named microbes, still completely unknown. At the moment we have on the order of 40,000 bacterial and archaeal species from NCBI GenBank, limiting the number of genomes per cluster to 50, just to be as fast as possible and build a first baseline version of our database. We also retrieved the genomes of the unknown species that were described in this paper, and we are planning to use that set of genomes to validate our procedure, so we will hopefully be able to reproduce the same unknown species found by the authors of this paper.
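The level-by-level query described above can be sketched as a descent through nested clusters. This is a toy, dictionary-based stand-in using exact k-mer sets; the names and mini-taxonomy are invented, and the real framework answers the "which child shares the most k-mers" question with sequence bloom trees rather than explicit unions:

```python
def collect(node):
    """Union of all k-mer sets under a subtree (an SBT answers this implicitly)."""
    if isinstance(node, dict):
        out = set()
        for child in node.values():
            out |= collect(child)
        return out
    return node

def profile(query_kmers, taxonomy):
    """Walk kingdom -> phylum -> ... -> genome, picking the closest child per level.

    `taxonomy` is a nested dict whose leaves are k-mer sets."""
    path = []
    node = taxonomy
    while isinstance(node, dict):
        # Closest child = the one sharing the most k-mers with the query.
        name = max(node, key=lambda c: len(query_kmers & collect(node[c])))
        path.append(name)
        node = node[name]
    return path

# Hypothetical two-level mini-taxonomy (kingdom -> species k-mer sets).
tax = {"Bacteria": {"sp1": {"AAC", "ACG"}, "sp2": {"TTG", "TGA"}},
       "Archaea": {"sp3": {"GGC", "GCA"}}}
print(profile({"ACG", "AAC", "CGT"}, tax))  # ['Bacteria', 'sp1']
```

The seven-query expansion in the talk corresponds to one such `max` step per taxonomic rank.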
We are also building a tool suite for Galaxy, which comprises a tool for indexing genomes, that is, for building a database, a tool for updating the database, and another tool for profiling genomes. These are still under development, and we will release them as soon as we have the first version of our database. As for future directions: we are planning to expand the database to all the references in NCBI GenBank, also considering genomes from other microbial kingdoms, and to build different versions of our databases. We are, of course, also planning to expand it with new MAGs from public metagenomic samples from different environments. We are also planning to host our database on CVMFS, to make it accessible to every Galaxy user, and to try to build a community-led, hopefully constantly increasing, collection of new genomes. We are also planning to build a custom Kraken database, which will be extremely easy at that point, and this will unlock the profiling of unknown species with Kraken. Finally, we are working on the development of a new procedure to define cluster-specific bloom filter sizes, which would drastically reduce the storage space required to build and maintain the sequence bloom trees. All right, that's pretty much all, and I would like to thank you for your attention.

[Audience] I assume you have some metadata about your samples; could you make use of that?

[Speaker] We don't have a lot of metadata at the moment, but we are certainly planning to collect it for all the genomes in the database. Thank you.

[Host] Thank you. If you have any more questions, please put them on Slack. We now have the last speaker for this session: Brian Raubenolt, who will be talking about targeting viral helicases.
[Speaker] Hi, everyone. I'm a computational chemist and a postdoc in Dan Blankenberg's lab at the Cleveland Clinic. This morning I'm going to give everyone a crash course in molecular dynamics, talk to you about viral helicases, and present a workflow we're working on to make these simulations accessible to both experimentalists and theorists alike. A lot of what I'm going to talk about is, I guess, a pretty good case study in reproducibility. A brief outline: I'll give you an introduction to the theory of molecular dynamics, and then we'll look at two test cases I previously worked on, Zika and SARS-CoV-2. We're going to compare the results of the original work, which was run in Amber, an engine that is commercial and proprietary, to the GROMACS simulations, GROMACS being free and open source. Once we've shown the reproducibility, I'll show a couple of different test cases where we generalize the workflow to other helicases, and after that I'll conclude with some current limitations and future directions. Okay, so you all remember the first law of thermodynamics: energy can't be created or destroyed. It turns out that applies to things like planetary motion just as much as it does to interatomic interactions. So when it comes to molecular dynamics, one of the first things you have to ensure is conservation of energy, and we do that with Hamiltonian mechanics, the equation over here, which tells you that the total energy of the system is the sum of your kinetic and your potential energy. Now, kinetic energy is usually pretty straightforward: things have mass, things have velocity, you get momentum; it's got that covered. Potential energy, on the other hand, is the tougher one, because there are so many different ways to describe it. But we usually break it down.
When we're talking about molecules, the bonded potential energies, over here on the left, describe how physically connected atoms behave. We have the bond potential energy, which describes how any two covalently bonded atoms stretch and compress; the angle potential energy, which describes the angle at which any three connected atoms vibrate; and the proper dihedral potential energy, which describes the torsional angle formed by the terminal atoms of any four connected atoms. Between these three terms you can describe the energy of any physically connected system. Then we have the non-bonded potential energies, which are the more confusing ones, because now we're talking about how atoms with no physical connection come together and push apart. We have the electrostatic potential energy, which is pretty straightforward: we use a simple Coulomb model from general physics, in which you have an atom with charge q_i and another with charge q_j, divide by the distance between them, and get your interaction energy. Then you have the most interesting one: the non-bonded van der Waals potential energy, which describes how things that are not connected come together and push apart without any consideration of electrostatics, so I like to call it the invisible force. That's the Lennard-Jones equation, and if you've taken thermodynamics there's basically no way you haven't seen it before. So why am I telling you all this? Because this is how it all comes together: what you're looking at over here is a standard leapfrog algorithm.
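The two non-bonded terms just described can be written down in a few lines. This is a generic textbook sketch, not code from the workflow; the parameter values are illustrative, not taken from any real force field:

```python
def coulomb(qi, qj, r, ke=138.935458):
    """Electrostatic pair energy U = ke * qi * qj / r.

    ke is Coulomb's constant in GROMACS-style units (kJ mol^-1 nm e^-2);
    charges in elementary charges, r in nm."""
    return ke * qi * qj / r

def lennard_jones(r, sigma, epsilon):
    """Lennard-Jones 12-6: steep repulsion at short range, attraction at long range."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

# Illustrative numbers only: two opposite unit charges 0.3 nm apart.
u_elec = coulomb(+1.0, -1.0, 0.3)               # negative, i.e. attractive
u_vdw = lennard_jones(0.3, sigma=0.3, epsilon=1.0)
# At r = sigma the LJ energy is exactly zero; its minimum, -epsilon,
# sits at r = 2**(1/6) * sigma.
```

Summing these pair energies (plus the bonded terms) over all atom pairs gives the total potential energy that the integrator differentiates to get forces.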
And this is the framework behind making the atomic motions reproducible and how you actually generate trajectories. As we all know in this room, virtually all biological activity comes down to a series of molecular motions, so these matter. The way it works is: you start with a potential energy up here; you take its negative derivative with respect to position, and you get an interaction force, as they say in physics. That gives you a vector, which gives you an acceleration. You plug that into a standard velocity equation, and you get a new velocity. From that new velocity you get a new atomic position, and since the potential energy is based solely on position, you get a new potential energy. From the new potential energy you again compute the acceleration, the velocity, and the position, and this goes on over and over again for each atom in the system. So for a system like the one you're looking at on the right, with around 150,000 atoms, you can imagine why we need some pretty beefed-up computational resources. Before I go any further, has molecular dynamics actually been useful? The answer is yes: these are just a brief series of examples in which molecular dynamics has either designed drugs from the get-go, or improved them, or in some way aided work on the vaccines. So let's talk about the helicase itself. This is the basic structure of your viral helicase: you have three domains; the top two form the ATP-binding cleft, and the bottom two define the nucleic-acid binding site. The basic function is pretty straightforward: the bottom unwinds double-stranded RNA. And as we all know at this point, the reason viruses are nature's most proficient bio-actors is that they deliver a positive-sense single-stranded mRNA that the body just interprets as its own. So, as you can imagine, if you inhibit this protein in one way, shape, or form, you can stop the rest of the cascade.
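The update loop just described can be sketched for a single particle on a harmonic spring, a stand-in for a real force field (the time step, spring constant, and mass here are arbitrary toy values). Velocities live at half steps, which is what gives leapfrog its name:

```python
def leapfrog(x, v_half, force, m, dt, steps):
    """Leapfrog integration: velocities at half steps, positions at whole steps.

    kick:  v(t + dt/2) = v(t - dt/2) + F(x(t)) / m * dt
    drift: x(t + dt)   = x(t) + v(t + dt/2) * dt
    """
    traj = [x]
    for _ in range(steps):
        v_half += force(x) / m * dt   # kick: force -> acceleration -> new half-step velocity
        x += v_half * dt              # drift: new position from the new velocity
        traj.append(x)
    return traj, v_half

# Harmonic "bond": F = -k * x, with k = m = 1, so the period is 2*pi.
k = 1.0
traj, _ = leapfrog(x=1.0, v_half=0.0, force=lambda x: -k * x, m=1.0, dt=0.01, steps=1000)
# The amplitude stays bounded instead of drifting, which is exactly the
# long-term energy behavior that makes leapfrog the standard choice in MD engines.
```

Real engines do this per atom in 3D with the full force field; the structure of the loop is the same.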
This is the first test case, the Zika virus helicase. In blue we're looking at the original simulation run in Amber, and in green the Galaxy simulation, and those are the superimposed structures. There are two main ways in this field to analyze the structural dynamics of proteins. The first is on the right: an RMSD graph, root-mean-square deviation, which is a graphical depiction of the overall structure of the protein. It gives you an idea of the shape: how much the alpha carbons deviate from your reference structure, in this case the crystal structure. The take-home message is that the figures are really close: roughly within half an angstrom of each other between the Galaxy simulations and the original simulations, which is pretty darn good. RMSF is another way to measure the structure of a protein; it tells you which regions of the protein are flexible and which are rigid: the higher the value over here, the more erratic the behavior of that residue. You can see they're pretty well aligned; the Galaxy simulation predicts the same regions of flexibility. Now let's look at the case of an inhibitor. As we all know, the universe is pretty chaotic, and by extension so is biology, so when we're talking about drug candidates, what we look for is whether the region where the drug binds suddenly becomes a little stiffer, less erratic, more stable; nothing goes hand in hand with a good K_D, or a good binding free energy, like things settling down, so you want to see things stiffen. And if we look over here on the left, the Galaxy simulation shows the same overall pattern: with no drug, in blue, things are a little more erratic, moving and bouncing around a little more.
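As a rough sketch of the two metrics (assuming frames are already superimposed onto the reference; real analyses do a least-squares fit first), with a trajectory stored as a frames x atoms x 3 coordinate array:

```python
import numpy as np

def rmsd_per_frame(traj, ref):
    """RMSD of each frame to the reference: sqrt(mean over atoms of |x - ref|^2)."""
    return np.sqrt(((traj - ref) ** 2).sum(axis=2).mean(axis=1))

def rmsf_per_atom(traj):
    """RMSF of each atom: fluctuation around its own time-averaged position."""
    mean_pos = traj.mean(axis=0)
    return np.sqrt(((traj - mean_pos) ** 2).sum(axis=2).mean(axis=0))

# Toy trajectory: 100 frames, 5 "alpha carbons"; atom 4 is made artificially
# floppy so it shows up as the flexible region in the RMSF profile.
rng = np.random.default_rng(0)
ref = np.zeros((5, 3))
traj = rng.normal(0.0, 0.1, size=(100, 5, 3))
traj[:, 4, :] *= 5.0
rmsd = rmsd_per_frame(traj, ref)   # one value per frame: overall drift from reference
rmsf = rmsf_per_atom(traj)         # one value per atom: per-residue flexibility
# rmsf[4] dwarfs the others; in a real protein such peaks mark flexible loops.
```

This mirrors what the RMSD and RMSF plots in the talk summarize, just on fabricated data.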
And when you put the inhibitor there, in this case, you see that those same active-site residues get a little stiffer. So the same pattern is reproduced in both the Galaxy and the original simulations. We also looked at the reason for that stiffness, the hydrogen-bond network, and in this case most of the hydrogen bonds seen in the original simulations, as well as the hydrophobic interactions, were also captured in the Galaxy simulation. Now a little bit about the SARS-CoV-2 case. The SARS-CoV-2 helicase maintains the same overall structure, in which you have these three domains, the ATP-binding cleft, and the nucleic-acid channel over here, and the function is exactly identical. There is one key difference, specific to coronavirus helicases: we have a stalk domain here and a zinc-binding domain here. At the current moment the field doesn't really know why that zinc-binding domain does what it does, but experiments have shown that if you remove it, the enzyme is no longer catalytically active. For that reason, we used a special force field for this study, and that's one of the things this work adds to Galaxy: we used the ZAFF force field, because pretty interesting things happen at the interface of inorganic and organic biological chemistry that a standard force field can't capture. And a force field, by the way (we say this a lot), is just a large text file that describes the potential energy of a series of different atom types. If you look at the same plots, the RMSD over here on the board, which tells you the overall structural behavior of the protein, both of them converge right around 4 angstroms, for the old simulation and the Galaxy simulation. And if you look at the RMSF, which maps the flexible regions of the protein, the Galaxy simulation once again predicts the same regions as the original simulations. Good stuff there. Now let's look at the case of an inhibitor.
This is an inhibitor called FC-1. Again, we mapped out the hydrogen-bond network that is responsible for the inhibitory activity of this drug, and the Galaxy simulation captures most of it, all within a 2-to-4-angstrom range, which, as these things go, makes for solid hydrogen bonds. That's what the drug looks like, bouncing around in there. So now that we know the workflow is reproducible and can recapture the original data, we wanted to go ahead and apply it. This is a perfect example of why molecular dynamics is a very useful tool and a really good thing to have around. Because the reality is that experiments do an excellent job of answering the question "what," but sometimes struggle to answer the questions "how" and "why." Molecular dynamics can tell you that. You can run an enzymatic experiment and find, okay, this drug is an inhibitor with this IC50, and it acts in a competitive fashion, but you don't really know much beyond that. If you're trying to optimize the structure, you're just playing a guessing game, changing functional groups here and there. What I did was take this drug we're looking at, which is being studied right now in the field and has a 0.3 millimolar IC50, but for which they didn't know why it was exerting its activity, and I ran it through the Galaxy workflow. And there it is, binding in the ATP-binding site. It turns out it forms a hydrogen-bond network, a series of hydrogen bonds as well as hydrophobic interactions, that makes for a fairly stable complex and blocks the ATP-binding site. So we're making a pretty good argument that this is how it exerts its competitive inhibition. Looking at generalizing the workflow now: I also ran it on some homologous cases I've studied before, in this case Zika versus dengue. Dengue is another flavivirus, Zika's closest cousin, I guess you could say. We ran the analysis, and again they have similar areas of flexibility.
These proteins are pretty well conserved, by the way, across different species and variants. I also ran the Middle East respiratory syndrome virus helicase: I did a homology model, using AlphaFold in this case, ran it through the workflow, and again it all worked; the simulations were successful and produced similar patterns of flexible and rigid regions. Just to give you an idea of what the whole workflow looks like: we start over here, and the first step is the complex. You can either get it from AlphaFold, if you're doing a homology model, or fetch it directly from the PDB. You then generate a force field using AmberTools, which is a new addition to Galaxy. After that you solvate the system and convert files back and forth. You then run energy minimization, which is one of the key steps in molecular dynamics: like everything else, you want to start at the global energy minimum. Then you run the dynamics. Finally, you process your trajectories, analyze them, and then visualize them. One of the things we're working on right now, which I'll talk about shortly, is incorporating visualization software within Galaxy, which is key, right? That's what this is all about: seeing what this looks like. These are just a list of some of the tools that are new in Galaxy, and some that we modified, in order to keep expanding toward a complete suite of molecular dynamics tools. Of course, as with any study, there are some limitations, one of them being that what I showed you here are single simulations. Historically that has been fine, but the field is now moving toward replicate simulations: because there are random seeds involved, with different initial velocities in each simulation, the idea is that you run multiple copies, take an ensemble average, and analyze that. In the interest of time, I wasn't able to get that far.
And then, of course, cutoffs. This graph shows that when we model things as they are in reality, atoms feel each other out to very large distances, but that's incredibly computationally expensive and we can't do it. So we have the concept of cutoffs, in which we say, okay, we're going to define an interaction sphere: an atom can interact with anything within 8 angstroms, or 9 angstroms, and so on. The original work used an 8-angstrom cutoff, which is perfectly valid, but the field is moving toward 10 or 12, which again makes the simulations take longer, but there's a certain sense in which it makes them a little more realistic, depending on what you're asking. Future directions: I'm going to augment the workflow so you can run replicates simultaneously and get the ensemble average, and the coolest thing we're going to work on, we just submitted a grant proposal and we're partnering with two people on it, is to incorporate a piece of software that will do all the visualization inside Galaxy. Currently you can do all of this in Galaxy, but you still have to download everything and have a piece of visualization software on your desktop. You know, I originally got into this game because, as you find yourself doing all this math, all these derivatives and integrals, you start to ask yourself what it all looks like, what an entropic change looks like. It's necessary. So we're going to work on that, fingers crossed, and hopefully with more examples like this we can get a little money along the way. The other things we're going to add are P2Rank, a machine-learning-based ligand-binding-site prediction program, which is necessary because AutoDock right now relies on you already having a predefined binding pocket, so this will further allow us to start from scratch; as well as "one pie," a Python package that builds phospholipid membranes.
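The cutoff scheme mentioned above (an interaction sphere of 8, 10, or 12 angstroms) amounts to building a pair list and simply skipping everything outside the sphere. A brute-force sketch follows; real MD engines use cell or neighbor lists to avoid this O(N^2) distance scan, and the coordinates here are invented:

```python
import numpy as np

def pairs_within_cutoff(coords, cutoff):
    """Return atom index pairs (i, j), i < j, closer than `cutoff` (same units as coords)."""
    n = len(coords)
    diff = coords[:, None, :] - coords[None, :, :]      # (n, n, 3) displacements
    dist = np.sqrt((diff ** 2).sum(axis=2))
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)   # keep each pair once
    i, j = np.where((dist < cutoff) & upper)
    return list(zip(i.tolist(), j.tolist()))

# Three atoms on a line at 0, 5 and 20 angstroms: with an 8 A cutoff,
# only the first pair contributes non-bonded energy.
atoms = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [20.0, 0.0, 0.0]])
print(pairs_within_cutoff(atoms, cutoff=8.0))  # [(0, 1)]
```

Enlarging the cutoff from 8 to 12 angstroms grows this pair list roughly with the cube of the radius, which is why the longer cutoffs the field is moving toward make simulations noticeably slower.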
And then I want to thank everyone: Dan Blankenberg, who's been an excellent mentor, as well as my peers like Josh; these guys are the real computer scientists, I'm just a chemist who likes molecules and knows how to type on a keyboard. I've learned a lot with this crew, so I'm very thankful to them. I thank the Galaxy computational chemistry team at large, specifically Björn, who's been great, as well as Simon Bray, who reviewed my pull requests; and the Galaxy project as a whole, a pretty cool group of individuals. It's been a pleasure to meet and share stories with all of you, and I'm looking forward to doing this on a yearly basis. I really thank Jennifer Vessio and company, Dave Clements, and the University of Minnesota Galaxy team for organizing this; this has been quite a stellar experience. I thank the developers of Python and GROMACS and all of our software dependencies, who make the really cool stuff that allows us to wrap it; the resources, NIH, USDA, and the National Park Service, for giving me a free place to stay when I'm doing some of this; and the GCC 2022 travel fellowship, as well as the James Taylor Foundation, for awarding the fellowship that made this trip possible. And I thank the research assistants, who make all this happen.

[Audience] I'm a little curious about the variations. It seems like overall the simulations are very similar to the original records, but is there anything you learned from the few spots where they differ?

[Speaker] That's an excellent question. Yes, as you can see, there are some variations; let me just go back to the slides. Okay, so here's an example: there are some variations between the original and new simulations, and there are a couple of reasons for that. One of them is that these simulations rely on random seeds.
So we do start with different initial velocities in each case. That's one of the quirks of molecular dynamics: to uphold the conservation of energy, at frame zero you have to have velocities to get kinetic energy. So there are different seeds associated with that, and eventually things converge a little differently in each simulation. Another thing that makes a difference is that GROMACS uses different constraint algorithms for hydrogen bonds than Amber does, so that adds subtle differences here and there. Fortunately, it's not enough to tilt the results one way or another, but those are some of the things that contribute to the subtle differences. For the most part it was exactly the same. A lot of the work involved in this workflow was basically ensuring that we had a force field conversion such that everything was exactly alike for comparison, but thanks to Dan and crew we managed to make that happen. But yeah: different seeds, different constraint algorithms, different integration schemes (you could use a leapfrog scheme in one and a stochastic scheme in another), those kinds of differences. Overall, the averages ended up being the same, which was cool. It was a little nerve-wracking at first, being told, okay, redo everything you did in grad school, now in a piece of software you've never used. Fortunately it worked, so thank you, Galaxy, because it's pretty cool.

[Audience question, partly inaudible, about parallelization.]

[Speaker] No, not really, it's pretty straightforward; a lot of the work was already done. You're asking whether, since an engine like GROMACS requires a lot of parallelization, it was difficult to implement. Not so much: Björn and the rest of the computational chemistry team did an excellent job of making these kinds of packages accessible.
And as far as whether you want to run MPI or not, or configure it for whatever computer you're using, it was pretty straightforward: you just change something in the XML file of the tool. I ended up doing that a lot, because some of these simulations ran on a really beefed-up computer and some on a laptop, to be honest. So fortunately a lot of the groundwork was already there, and it's a pretty malleable setup. Thank you.