Welcome. This talk is about workflows, and about how some software we've written can translate workflows between specifications. How's this? It's still like that. I might just hold it, yeah, okay. So yeah, this talk is about translating workflows between specifications with a tool we've created called Janis. And not only that, we're hoping to help some current Galaxy users, but also to expose some new users to Galaxy in the process.

So workflow systems are kind of like tools themselves: the main thing is that they have inputs, they do some processing, and they have outputs. And during the processing it's not just one thing that's running, it's a series of steps, and they might have parallelization or something like that. So you have some input data, you run your workflow, and you produce outputs. A lot of us here are familiar with Galaxy, so we're used to having Galaxy workflows. Same here: I have about 20 Galaxy workflows that I use regularly, and that's where I always start. I always start with Galaxy, and only if I kind of have to, for some reason, will I think about another solution.

All these workflow systems are really quite similar. You could write the same workflow in Nextflow that you could write in Galaxy. They all have different features, and I wish I hadn't just jump-scare revealed that slide, but I'm about to say something controversial: there's no best programming language, and there's no best workflow system. Okay, good, no riots, that's good. Workflow systems each just serve a community. They have different features and different purposes, and it's a matter of picking the right tool for the right task.

One of the issues, though, is that the space is pretty divided, and there's probably no real good solution for that; it's mainly a matter of familiarity. If you're really familiar with Galaxy, that's probably what you use. If you're really familiar with Nextflow, you really want to work within Nextflow, because that's where you're most comfortable. It's also a big investment of time, skill and, really, money to retrain staff if you want to switch from one workflow spec to another. This does present some issues. For example, say you want to run a specific workflow and you find one on WorkflowHub that's absolutely perfect, but it's in WDL and your group uses CWL. You might think, ah, that kind of sucks, especially if you have to make some small adaptations to it; you might just end up rewriting it in CWL. It also presents issues for collaborations, where a few different institutes or partners use different workflow specs, and you have to choose which spec to go with. There are no winners there.

Back in 2018, we ran into exactly this issue. There was a group of institutes in Parkville, Melbourne, who wanted some shared cancer pipelines, but the institutes involved were using different specs: some had CWL, some WDL, and some were using Python scripts. As a solution, a tool called Janis was created. It's a Python framework for writing workflows which can then translate out to different specs, so it could translate out to WDL and it could translate out to CWL.
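To give a feel for what that looks like, here is a minimal sketch of a Janis-style workflow in Python. It is reconstructed from memory of the Janis documentation, so treat the exact class and argument names (WorkflowBuilder, CommandToolBuilder, translate, and so on) as approximate rather than authoritative.

```python
# Minimal sketch of a Janis workflow, assuming the janis_core API as
# remembered from its docs; names may differ slightly in the real package.
from janis_core import (
    WorkflowBuilder, CommandToolBuilder, ToolInput, ToolOutput, File, Stdout,
)

# A trivial command-line tool wrapper: run `echo` on a file path.
EchoTool = CommandToolBuilder(
    tool="echo_tool",
    base_command="echo",
    inputs=[ToolInput("inp", File, position=1)],
    outputs=[ToolOutput("out", Stdout)],
    container="ubuntu:20.04",
    version="v0.1.0",
)

w = WorkflowBuilder("hello_wf")
w.input("infile", File)
w.step("echo", EchoTool(inp=w.infile))
w.output("out", source=w.echo.out)

# The same source of truth can be emitted as any supported spec.
w.translate("wdl")  # or "cwl", "nextflow"
```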
And one of the benefits of this approach was that the different groups involved didn't have to change their HPC setups or anything like that. They just had one source of truth, the Python workflow, and they could take that, transpile it out to the language they use, and run it on their own systems. After a little while, Nextflow translation was also added, because Nextflow was becoming really, really popular. Around the time the Nextflow translation was being developed, I joined the team; before that, I'd been working on Galaxy, writing tool wrappers. At that time we wanted more tools and workflows in Janis, so we could transpile them out to different specs. So I proposed the idea: why don't we use Galaxy as a kind of reference in some way? Now, if Galaxy had 20 tools, what we probably would have done is write them all manually in Janis, all these different tools and workflows. But Galaxy doesn't have 20 tools and 20 workflows; it has thousands, literally thousands. So we thought, okay, instead of writing these out manually, maybe we could automatically parse the Galaxy tool XML and Galaxy workflows to generate the Janis for us. So we created an ingest unit that could take Galaxy and ingest it into Janis. Fast-forwarding to now, we've got a few other ingest units, so we can parse tools and workflows from different specs into Janis, which expands that ecosystem, and then translate them out when we need them.

But once we had ingest units, we realized we could actually do end-to-end translations. We don't really need the Janis in the middle; it can just be a middleman. So what you can do now, if you have a Galaxy workflow for example, is take that workflow or tool and run janis translate, which is our new feature. It will ingest the Galaxy into an in-memory representation in Janis; if it's a workflow, all the tools used in the workflow will be discovered and translated too, into that same in-memory representation. Then, if you specified that you wanted to go out to Nextflow, it will transpile to Nextflow and write it to file.

As an example, I've got a workflow here, a Unicycler assembly. What I do is download that, the .ga file. Then, once I have that on the command line, I can use janis translate: I say janis translate, --from galaxy, that's what you're translating from, --to nextflow, that's what you want to translate out to, and then you point it at the file. What that produces is some Nextflow which is supposed to be an equivalent workflow.

What gets output is a folder, because of course if you're translating a workflow, you've got tools, workflows, sub-workflows, maybe some supporting scripts, config, that kind of thing. So it comes out as a translated folder. The main file is the equivalent of the Galaxy workflow you were seeing in the workflow editor. modules contains all of the tool wrappers that were discovered and translated. nextflow.config holds the inputs and some other things, like config for running on Singularity. We have another folder called source, which is where we place the original Galaxy wrappers, just so users can have a look at those, in case the translation wasn't fully complete for some reason.
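Concretely, the invocation looks something like the following. The --from/--to flags are as described in the talk; the file name, the subprocess wrapper, and the output directory name are illustrative assumptions.

```python
# Sketch of driving the translation described above from Python.
import subprocess
from pathlib import Path

subprocess.run(
    [
        "janis", "translate",
        "--from", "galaxy",        # source specification
        "--to", "nextflow",        # target specification
        "unicycler_assembly.ga",   # hypothetical exported workflow file
    ],
    check=True,
)

# The result is a folder of Nextflow assets rather than a single file
# ("translated" is a guess at the output directory name).
for path in sorted(Path("translated").rglob("*")):
    print(path)  # e.g. main.nf, modules/, nextflow.config, source/, ...
```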
And then there are sub-workflows, and templates for things like scripts shipped in the tool directory, or config files. We handle inputs in nextflow.config, so that's where your core workflow inputs go. The main workflow file is just classic Nextflow: we've got some imports from the modules, these are the tools that have become Nextflow processes, then we declare some variables and start the workflow. And each tool becomes its own Nextflow process in the modules folder.

Talking about some limitations of this feature: it's quite hard to translate between all of these specs, because you have to support all of them, and they have really different, competing paradigms. For example, in Galaxy you can write Cheetah, which embeds Python code, and that's completely arbitrary; you can write pretty much whatever you want. In CWL they have JavaScript, and Nextflow has Groovy, so Java; again, completely arbitrary. So even just translating expressions is very difficult, and that's still a challenge for us.

But just focusing on the Galaxy params and the freeform Cheetah processing: these can be tricky for a parser to work out. What's the actual underlying software? Of course, in Galaxy, params have two jobs. One job is to supply data into the actual command in the command section. The other job is to render a nice, tidy UI, and that's often the priority, so sometimes sacrifices are made so that it's very clean on the front end, and in the back end it's a little bit more tricky. When we started parsing these files, we assumed the params were the inputs to the tool, but we realized that's not the case: there's a lot of preprocessing and other things that happen within the tool command. What we're actually looking for, in the command body of the tool XML, is the actual software command, for example fastqc, followed by its arguments, flags, positionals, that kind of thing. It's quite tricky, though, because it's spread out, distributed throughout the whole command, and a lot of the values can be calculated dynamically. So what we're trying to do is parse this, work out what the actual software command inside the Galaxy command section is, and then, in cases like this, link the parameters to the arguments that the command has.

Looking at this example, we've got contaminants, which is a list of contaminants; this is FastQC. That renders out really nicely in the UI, and when we look at where it's referenced in the command section, it's pretty clear what it does: it's an option with a prefix and a value, and we can see the contaminants parameter referenced. So we know that's a direct link to the input parameter. So when we're parsing our workflow and we read the tool state for the workflow step, and it says "here's the contaminants file", we know that relates to the --contaminants option in FastQC. It gets a lot trickier with other inputs. For example, the input file for FastQC can be a bunch of different types, and it appears throughout the command section in a lot of different places. There's a lot going on here, but the thing I want you to take away is that there's really no direct link between that parameter and where it appears on the command line.
Actually, what you see on the command line, just down here, is no longer what it started with. Here it references input_file, and here it's input_file_sl, and that's because we've done some preprocessing on that. So because there's no direct link, things become quite tricky. Cheetah is meant to be resolved at runtime with real data; once the Cheetah is resolved, you actually have values for everything. But we're parsing these files statically, and we don't have input data. So when, for example, the wrapper looks at the datatype of the input data and tries to work out what it should do, that's quite tricky for us, because we're never going to have the data; we parse these files statically. That said, we do know that this kind of thing happens, so we have, in this case, a system of aliases, and this parses fine: it does know that input_file_sl references that parameter. But this is just to explain that Galaxy tool XML can be quite complicated, and so in some cases the translation isn't perfect, though a lot of the time it's pretty close.

If we take FastQC and translate it with janis translate to Nextflow, we get something out like this. Some of the things I want to focus on quickly: we're doing a couple of things in the backend for you so that it actually is close to a runnable process. For example, we always check whether there's a valid container for the software requirements on Quay.io, so we make some API calls. To get the tool in the first place, we make Tool Shed API calls to look up the wrapper that's referenced in the workflow and download it, and then we load up a mock Galaxy to load the XML. So we're kind of running Galaxy as we translate, which is kind of fun. When there's a single requirement, we just look it up on Quay.io and grab the relevant container. But if there are multiple requirements, we use Galaxy's mulled build toolkit to generate a new container on the fly, based on the different requirements that tool needs. And a little fun thing: when we translate this to Nextflow, we do actually understand that the input file appears last on that command line, so we've made that connection. Can it also translate to CWL? Oh, we're just focusing on Nextflow today.

So yeah, basically, with this Galaxy-to-Nextflow translation, you can expect that the workflow structure is correct, simple tools will have a completely valid translation, and complex tools might need some adjustment. I'm looking at you, HISAT2; that's a challenging one. So it really depends on the complexity of the tool wrappers, but at the end of the day you at least get all the structure and all your processes nicely templated out, and maybe you need to fix them up in a couple of spots.

In future, we'd like to do more handling of dynamic config files, which also use Cheetah templating, so we need a better solution for that. We'd also like to identify mutually exclusive arguments, look more at the select params in Galaxy, and just improve Cheetah parsing in general. And we'd also like to parse all of the Galaxy tools together with their tests, do the translation, build a test suite, and run those tests to make sure the translated tools are valid. That would be really good, because then we'd have metrics: we could tell you whether the translated tool is runnable 80% of the time, or 50%, or 90%, that kind of thing.
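The param-to-option linking described above can be pictured with a deliberately simplified sketch. The regular expression and the toy command string below are illustrative only; the real parser has to cope with Cheetah conditionals, loops and preprocessing, which is exactly where it gets hard.

```python
# Toy illustration of linking Galaxy params to command-line options.
# Real wrappers embed Cheetah logic, so this naive regex only covers the
# easy, prefix-plus-value cases, as the talk explains.
import re

# Simplified fragment of a FastQC-like <command> section.
command = """
fastqc
--outdir results
#if $contaminants:
    --contaminants '$contaminants'
#end if
'$input_file'
"""

# Find "--option '$param'" pairs: a direct prefix-to-param link.
pattern = re.compile(r"(--[\w-]+)\s+'?\$([\w.]+)'?")
links = {param: prefix for prefix, param in pattern.findall(command)}

print(links)  # {'contaminants': '--contaminants'}
# '$input_file' has no prefix; it is positional, and in real wrappers it
# may be renamed by preprocessing (e.g. a symlink), breaking the link.
```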
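The container lookup mentioned above can likewise be sketched. Quay.io does expose a public REST API for repository tags, but the exact endpoint shape below is from memory, so treat it as an approximation; the biocontainers organization is where the single-requirement images live.

```python
# Sketch: checking Quay.io for a container matching a single requirement.
# The endpoint and parameters are remembered, not verified; adjust as needed.
import requests

def find_biocontainer(tool: str, version: str) -> str | None:
    url = f"https://quay.io/api/v1/repository/biocontainers/{tool}/tag/"
    resp = requests.get(url, params={"onlyActiveTags": "true"}, timeout=30)
    resp.raise_for_status()
    for tag in resp.json().get("tags", []):
        # BioContainers tags typically look like "<version>--<build>".
        if tag["name"].startswith(version):
            return f"quay.io/biocontainers/{tool}:{tag['name']}"
    return None  # multiple requirements -> fall back to a mulled build

print(find_biocontainer("fastqc", "0.11.9"))
```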
So yeah, in summary, I hope we can help connect the workflow space. And something I really want out of this is exposing new users to Galaxy, especially people who, for example, only use Nextflow: get them to start in Galaxy, write the workflow there, and then translate at the end. Thanks.

Thank you, Grace, amazing talk. Maybe one really, really, really quick question here, I think. I've got two, but I'll just choose one. I mean, that was really amazing, thanks a lot. Your approach of rendering Cheetah templates is really cool. One thing I thought immediately: do you think we could embed that in the backend too, so that as the user changes parameters it renders the command line that will be run? It's not super easy to tell, just by changing parameters, which command line is run, but papers often say, well, use the -y option with 33 or something. And, well, which one is it? We've got to find it in the Galaxy interface. Yeah, I'm not going to stand here and say we should change the way tools are written; I don't think that's really viable. I think, if anything, we just need to be a bit clever about how we parse this kind of structure. That said, I should add something we wanted to do that might play into this in some way. We were thinking about putting janis translate up as an actual tool on Galaxy. So instead of having to docker pull or pip install Janis, you could just run Janis in Galaxy. So maybe there's a way we can work that in there. Just a tiny additional question. You mentioned going from Galaxy to Nextflow. Basically, I'm asking the other way around: can you go from Nextflow back to Galaxy, and where are you with that? We don't have Nextflow ingest yet, but if there's enough interest, we'll add it. Yeah. Thank you. Let's thank Grace once again. Thank you so much.

And our next speaker is, sorry if I'm pronouncing this correctly, Katarzyna. Yeah, that's fine; most of you know me as Kasia. Hi. And I will talk today about FAIR data stewardship in Galaxy. Whoever visited my poster knows that I'm very excited about FAIR. FAIR stands for findable, accessible, interoperable and reusable data, or research. So, what our goals are, in terms of FAIR: we would like to develop data management training which is addressed to everyone and supports everyone on the way to reproducible science. And we would also like to embed existing training and help people improve. Am I maybe too loud? We want to improve everyone's research experience. Is that good? Very good. I have only a short talk and I could talk for hours, so let's wrap it up. We have already developed a set of training materials, available through the Galaxy Training Network, so if you're interested, go there and have a look. They address very basic things, focusing on FAIR in a nutshell, and also FAIR data management solutions for people who are writing grants, publishing, and are deeply involved in data. So we're trying to address the whole scope of the audience, and also how you can improve your existing Galaxy training materials. So shout-out to Helena and Saskia for constantly improving the Training Network. And we also have our own community in the UK.
So if you visited my poster, you're already aware that we run the FAIR Data Stewardship Fellowship, with 24 fellows from 17 organizations. What is cool about that is that our fellows focus on different data and different areas of expertise. We have people from climate, from agriculture, from chemistry, so they run into different issues with FAIRness and data. So, what the whole FAIRification process is, what we can do about our data, and then training, essentially: we constantly see publications or data released lacking unique identifiers or proper metadata, or even stored somewhere that is not really reusable and makes secondary data use impossible. So we would like to highlight the opportunities and what can be done within the existing data and training. Hopefully you're aware of our FAIR research section, where you can find out more; the material there is provided by us and the amazing RO-Crate team. And yeah, hopefully you can come and talk to me after the talk. I think I fit perfectly in time, so we have time for questions. Thank you, we do. Any questions? If you have data and you don't know what to do with it, or if you have existing training and you would like some expertise on it, come find me. Sveinung is going back-to-back with two talks, and we'll have questions at the end of each one if we have time. So without any further ado, please take it away.

Thank you. So, my first talk is called Accessing and Processing Sensitive Data in a Public Galaxy Server. This is the output of an implementation study that is now finishing, called Strengthening Data Management in Galaxy; it's one small part of larger projects. Work package two is the one that was about sensitive data, and this talk is mainly about the outputs of the first task, which is named encrypted data processing. I also need to say a bit about the second task, but I won't talk about it in much detail: that one is about data access, where the sensitive data come from, and integration with the EGA, the European Genome-phenome Archive, in both its central form and its federated form.

So, a few words about that. There is a project that is very hot now in Europe called the Federated European Genome-phenome Archive, which is based upon the existing EGA. I don't know exactly when EGA started, but it's been there for years, in two locations: in the UK, and in Barcelona in Spain. The problem is that legal restrictions on sensitive data, and thus on personal genomes, often mean, at least in Norway and most places, that they're not allowed to be shared outside the country. So it's not really a solution to provide them to the UK or Spain. Instead, there's this federated system, set up with a database in each country, which works so that the data itself is only available in the country, while the metadata is shared, so that you can search across nodes, and if you find anything interesting you can apply for access and then get direct links to the data, if access is granted. I'm also a bit proud that Norway is the first node that officially signed this collaboration agreement. There are others coming too, though I don't know exactly who has signed. It's also part of the large EU-funded project called the Genomic Data Infrastructure. I need to speed up here. Okay, so there are already some tools, you might even have developed them, for downloading data from EGA.
The one thing that's important here is that these tools assume a private Galaxy installation: they decrypt the data and make it available in Galaxy in decrypted form. Which is fine if one has that, but then you basically have one installation per project, and you have to have a security infrastructure around that Galaxy. So the project was to look at different scenarios. One of the scenarios is exactly this: we have a single private Galaxy server, or a variant of that, one Galaxy server per user project, within an infrastructure that supports that. And then there's scenario number two, which is the most difficult one: how can one manage to do this in a public setting? We decided to focus on that one, because if we can solve that one, the other ones should at least be easier to solve. So let's go to the most difficult one. That has some consequences, and this is already in the task description: you need some form of data encryption at the user level, so that every user can encrypt data for their specific use, and also some sort of key management; actually, that part we don't really need that much of in Galaxy, so it fell away a bit. This is based on the standard from the Global Alliance for Genomics and Health called Crypt4GH, which is what makes all of this possible, really. We also need to consider Pulsar, and a proof-of-concept implementation. Those were the task goals.

A few words about who we can trust in this setting; that sets the scope of the solution. The one thing we definitely trust is the data provider: the EGA nodes are officially approved in the different countries, and the authentication and authorization there is assumed to be safe. The other thing we assume we can trust is the user's local computer environment. Not that it necessarily is secure, but it's out of scope: it is the responsibility of the user. That's also part of the agreement when a user gets data from EGA; it is assumed that the user is able to keep that data safe. And then the compute environment: we want to be able to do sensitive data analysis in a distributed way, right? So we assume that environment is safe; this work doesn't go into the detail of exactly how to set up that environment. And also that between these three (A, B, C) there is a safe way to authenticate and exchange keys; we're not going into the authentication issues, that's outside the scope here.

So who do we not trust? Well, the internet, obviously. We also do not trust the Galaxy code base, sorry to say. There have been some security holes, and there are probably still some left. When we're talking about sensitive data, we cannot really trust the Galaxy code base, or the Galaxy admins, sorry. We cannot trust you either; at least, not trusting you should make your life easier, though. Tools and workflows: we don't really trust them either, but we need some certification mechanism, some way to establish trust; that's out of scope, so let's assume it's there. And also, the compute should be shielded from the internet, so that it can't leak your data while a job is running.
Okay, so where we started with this is the private/public key pair issue. At least at first, we thought we would need the user to share their private key somehow, because the private key is needed for decrypting the data from EGA. So we looked into whether we could use the Galaxy vault for that, because we thought we needed to give Galaxy a way to access the data. We decided that in any case, even though that would be a rather secure solution, giving away the private key is still less secure than not doing so. So we ended up looking into the other idea: what if the user does not share the private key at all, so that the private key remains on the user's computer or environment the whole time, never shared with anyone? And Crypt4GH actually provides a solution for that. Basically, you can re-encrypt datasets without touching the whole dataset: you can re-encrypt just a small header, and that header contains the keys used to decrypt the rest of the dataset. There are two different ways that could happen in Galaxy: one is to do the re-encryption inside the browser, in the front end, and the other is to do it in the back end. Okay, I've got a little timer. This is what we came up with; it's quite a complex thing, so let me take you through it at super speed.

Data import: as I mentioned, the sensitive data is available; this is already set up in the FEGA system. There's this header, which is encrypted for a specific public/private key pair. The user provides the public key, and the header is re-encrypted so that only someone with the user's private key can decrypt it. Then there needs to be some Crypt4GH support in Galaxy: we have added a Crypt4GH-specific datatype, and the way we solve this is that the header is carried in the dataset metadata. The dataset in Galaxy remains the same, while you can change the header in the metadata and re-encrypt it, and that stays secure. Then, on the user's side, we have designed a way to set up a REST server that runs locally and provides the re-encryption. Basically, Galaxy talks to the locally running REST server and asks: can you re-encrypt this dataset? Everything connected to the private keys happens locally; it's not shared through Galaxy at all. For this to work, the re-encryption service needs access to the public key of the compute node, so there's the need for a key server that does that exchange; let's not go into the detail of that now.

How would that work in the user interface? This is one way it could work, and we have a functioning prototype of it. Basically, we just added a small key icon, which starts the request to the local re-encryption service and re-encrypts for the compute node that's configured locally, and you get a re-encrypted header, which is added back to the dataset. You get a new dataset, but it points to the same data on disk; the only change is the header. There's also an expiration date on that key, so after a certain while it will stop working.
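A minimal sketch of what that local re-encryption service might look like, assuming the EGA reference implementation of Crypt4GH for Python (the "crypt4gh" package on PyPI); the reencrypt signature and key-tuple format are as remembered from that library, Flask stands in for whatever the prototype actually uses, and the endpoint name, header field, paths and port are all illustrative.

```python
# Sketch of the local re-encryption service idea; not the project's code.
import io
from flask import Flask, request, Response
from crypt4gh import lib as c4gh_lib          # assumed module layout
from crypt4gh.keys import get_private_key     # assumed helper

app = Flask(__name__)

# The user's private key never leaves this machine.
USER_SECKEY = get_private_key("/home/user/.c4gh/key.sec",
                              callback=lambda: "passphrase")

@app.route("/reencrypt", methods=["POST"])
def reencrypt_header():
    # Galaxy sends the Crypt4GH header (stored as dataset metadata) plus
    # the public key of the target compute node, from the key server.
    node_pubkey = bytes.fromhex(request.headers["X-Node-Public-Key"])
    header, out = io.BytesIO(request.data), io.BytesIO()
    # reencrypt() rewrites only the header packets for the new recipient;
    # the encrypted payload (not sent here) stays untouched on disk.
    decryption_keys = [(0, USER_SECKEY, None)]
    recipient_keys = [(0, USER_SECKEY, node_pubkey)]
    c4gh_lib.reencrypt(decryption_keys, recipient_keys, header, out)
    return Response(out.getvalue(), mimetype="application/octet-stream")

if __name__ == "__main__":
    app.run(port=8642)  # localhost only; Galaxy's front end calls this
```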
And then, in the secure node, the datasets are decrypted, you execute the job in the containers as usual, and the output is encrypted again for the user's public key, which is already there on the compute node via the key server; that's shown there, it's not easy to see. We have yet to find the best way to set this up in Galaxy, so we need some help figuring that out. We could perhaps set up pre- and post-processing steps for the decryption and encryption, but how to do that, and how to do it across job runners, we're not really sure; that's something we need help with. But I mean, it is definitely doable. There's also the question of how to manage the intermediate datasets: there should be no traces left, but this is also assumed to happen inside the secure compute. So finally, once this is done, the output is encrypted, the user is the only one with the key to decrypt it, and they can download the datasets and decrypt the outputs, or alternatively re-encrypt again for a new round of analysis, starting from scratch, basically. A limitation is that visualizations will not be available, which is probably quite a large limitation. One could probably figure out ways to do that, but it's a bit difficult, so it's out of scope for this project; possibly it could be done within the client, possibly with some sort of terminal system. There's also more detail on the key server here, but I'll skip that now; it's too much detail anyway. It's there if you want to look at it later. Here are the acknowledgements and logos. Thank you.

Thank you, we've got time for one question before we start with the short talk. There are a number of standards when it comes to key management, the FIPS variety; which one do you satisfy? We have not looked into that. The key server in our proof of concept is just a simple Python thing, but you could put proper key management systems there; we haven't looked in detail at how to do that. Thank you. All right, we might get your slides for your short talk up now. While we're doing that: does any of that work help with the reverse, uploading to the Genome-phenome archive? From personal experience, sometimes that process is not easy. I haven't really thought about that, but possibly, yeah. Possibly it can, but I'd have to think about it. That's enough for now. All right, on to your short talk, please.

Okay, so over to something completely different: the wildcard tyranny, and embracing path-based interactive tools. This is a bit of a call to arms, or something with arms. Okay, so this is an issue that has been here for a long time, and everybody who has been working with interactive tools, developing them or administering them, probably knows a bit about this. The issue, in brief, is that interactive tools in Galaxy are really, really cool; however, they are difficult to deploy, which hinders their adoption and availability. There's a technical reason for this, and I'll go into a bit of detail here. The issue is that every interactive tool instance, every time you start a tool, gets a specific URL which then needs to be routed to the corresponding container, the container that runs the actual tool. And the first solution to that problem is to put this in the subdomain.
So you have, and it's not exactly like this, but you have some hash string that's put at the beginning of the domain, and you need to map that to the actual container. This is done in two steps. First, you have a web server proxy, like nginx, running on your server, which maps it to a service called the Galaxy IT proxy. That service then maps this specific hash to the specific container, and the specific port, that runs the tool. So this is how it's been working all along, more or less, I think. The problem with this is that it requires something called a wildcard certificate: to secure the connection, you have these TLS certificates, and this one needs to be of a wildcard type, covering basically anything.mygalaxy.org. The problem with those is that they are inherently less secure than normal ones, because a wildcard basically allows attackers, if they're able to reach the system, to set up subdomain servers that will look very, very safe to users, because of the domain; it will look like a real URL. So many admins do not like that at all, and many institutions do not allow the use of wildcard certificates at all, which is not good for interactive tools. It's also a bit clumsy, a bit out of the ordinary, and often admins just drop support for this completely.

The alternative is to have a path-based URL. So instead of putting the hash string at the beginning of the domain, you add it after some slashes in the path. That was introduced in 2019 or thereabouts, and basically there were two steps to it: a transformation from the path-based URL into the subdomain form, which happened inside uWSGI, and then the same proxy as before. This unfortunately stopped working, and it was also a bit hidden, I don't know how many people knew about it, but in any case it stopped working when we replaced uWSGI with Gunicorn. So I have now made that work again. The first solution was just to provide a similar setup using nginx instead of uWSGI, but that requires an extra web server on top, so especially for development it's not really a good solution. The second solution, which is now part of the latest Galaxy release, is to also support path-based URLs in the Galaxy IT proxy. That is now there, and it works out of the box.

The problem is that of all the tools that are there, at least the ones I tried, I only got two working with this: OpenRefine and the Ubuntu desktop. All the others crash and burn, or something. It's very easy to test: it just requires requires_domain, which is an attribute in the tool XML; set it to false, try to run the tool, and if it works, it works. But for now it's just those two. The problem is that there's typically a web server inside the Docker container, and that might not be set up to allow this. If all the links in the pages are relative, it should work, and that's probably the way it is in those two; otherwise, you typically need to configure the container with the correct path: this is the path you're being served under. I believe in most cases that will solve the situation; it might not solve everything, but I think it will solve the majority of cases. So first we need to find out, for every run, the URL the tool runs under, provide that to the tool, and configure the tool accordingly; and this needs to be done on a per-tool basis. There's a sketch of the idea below.
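To make that concrete, here is a minimal, hypothetical sketch of what "configure the container with the correct path" can mean for a tool whose UI is a WSGI app. The PROXY_PREFIX variable name and the prefix format are illustrative assumptions, not what Galaxy actually injects; each real tool has its own configuration mechanism.

```python
# Sketch: a containerized tool's web app surviving path-based proxying.
# A WSGI app normally assumes it lives at "/"; this tiny middleware
# honors the prefix handed in by the proxy instead.
import os
from wsgiref.simple_server import make_server

def app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    # Links must be built relative to SCRIPT_NAME, not hard-coded to "/".
    return [f"served under {environ.get('SCRIPT_NAME', '/')}\n".encode()]

def prefixed(inner, prefix):
    def wrapper(environ, start_response):
        if environ["PATH_INFO"].startswith(prefix):
            environ["SCRIPT_NAME"] = prefix
            environ["PATH_INFO"] = environ["PATH_INFO"][len(prefix):] or "/"
        return inner(environ, start_response)
    return wrapper

if __name__ == "__main__":
    # Hypothetical env var; the per-tool work is wiring the real one in.
    prefix = os.environ.get("PROXY_PREFIX", "/interactivetool/ep/abc123")
    make_server("0.0.0.0", 8080, prefixed(app, prefix)).serve_forever()
```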
So it's quite a bit of work, but not very much; it's doable, but one needs to get together and actually try to do this and solve this problem. So this is where I end. Can we, at this CoFest, come together and try to make at least some, but perhaps also all, of the interactive tools work with path-based URLs? I will be here, and I'll try to help you figure out how to do this. I just need people who are interested in joining this effort.

And finally in this session, we have two short talks. Hi, my name's Sujan, and I work at the Centre for Infectious Disease Genomics and One Health; we're based in British Columbia. I'd like to thank my son, Hawk, who actually helped me put this presentation together with his magic of video editing. He's joining online due to a very complex, laundry-related issue. I spend a lot of my time talking to people: people at the Ministry of Health, people at the health authorities, privacy officers, governance officers. Why? Data access. Data sharing has turned out to be a very complicated problem. Just to give you an example of some of the projects we're doing to tackle this issue: trusted execution environments. Some of you spoke to the need for creating a secure research environment, and I have led the creation of a secure cloud environment in partnership with three of the health authorities in BC. Another big challenge is mapping: different research groups use different terminologies and ontologies, so how do we ensure we can talk to each other, that there's interoperability between different research groups? To take it even further, how can we create interoperability from research communities to the clinical community, for example? So I'm working with global standards organizations such as HL7 and FHIR to make sure there's interoperability between the two. That's a mapping-related issue, which is my next presentation, not this one.

So, this presentation focuses on privacy, which is a massive challenge. I believe that, as researchers, we all have a responsibility to manage data well, to mitigate any potential harm to privacy. The good news is that there's an emerging community working on privacy-enhancing technologies, which is what my presentation is about. I have only 10 seconds, so I'll define the problem space and then play a video by Kyohan, my collaborator at Samsung SDS, who has worked with us to solve this problem. The use case for us is the current Shigella outbreak, where we're finding that genome-only clustering is too broad. We're finding that by accessing broader sets of contextual data associated with the sequencing, we can do analysis that's much more powerful. But it's been very difficult to gain access to a lot of sensitive information for that broader contextual data, especially coming from hospitals. Using Beacon, and this is where Galaxy comes in: Galaxy has integrated with Beacon, I believe the latest version is v2, so we have a way to easily search for and find the data. Once that initial contact is made and you find out who has the data you're looking for, how do you go about accessing a broader set of data that may be sensitive? That's where a technology such as PSI could come in. PSI stands for Private Set Intersection. Hopefully this gives you a flavor of a different type of privacy-preservation technique; a toy sketch of the idea follows below.
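Here is a toy Python sketch of the Diffie-Hellman-style PSI-cardinality idea. It is a teaching sketch only: the modulus is far too small, there is no hashing to a prime-order subgroup, no padding, and no network layer, so it illustrates the shape of the computation, not a secure protocol, and the isolate IDs are invented.

```python
# Toy DH-based Private Set Intersection cardinality. NOT secure:
# real protocols use ~2048-bit groups and careful hashing-to-group.
import hashlib
import secrets

P = 2**127 - 1  # toy prime modulus

def h2g(item: str) -> int:
    """Hash an item into the multiplicative group mod P."""
    d = int.from_bytes(hashlib.sha256(item.encode()).digest(), "big")
    return d % (P - 2) + 2  # avoid 0 and 1

def mask(elems, secret):
    return {pow(e, secret, P) for e in elems}

# Two parties, e.g. two labs with overlapping Shigella isolate IDs.
a_set = {"iso-001", "iso-007", "iso-019", "iso-042"}
b_set = {"iso-007", "iso-042", "iso-077"}

a_key = secrets.randbelow(P - 3) + 2  # party A's secret exponent
b_key = secrets.randbelow(P - 3) + 2  # party B's secret exponent

# A sends H(x)^a to B; B raises it to b and also sends its own H(y)^b,
# which A finishes by raising to a. Exponentiation commutes, so equal
# items collide after double masking, while raw items are never shared.
a_masked_twice = mask(mask((h2g(x) for x in a_set), a_key), b_key)
b_masked_twice = mask(mask((h2g(y) for y in b_set), b_key), a_key)

print(len(a_masked_twice & b_masked_twice))  # -> 2, and only the count
```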
Traditional techniques for privacy preservation often assume that to gain access to data, you have to see the data. This gives you a different perspective: what if you can compute on the data without seeing the data? Hi, everyone. This is Kyohan from Samsung SDS Research. I will present Private Set Intersection, which we use in our demo for privacy. I'm sorry that I cannot attend the conference in person. Let's start with the definition of Private Set Intersection, PSI for short. PSI is a cryptographic protocol to compute an intersection without revealing each party's input. Suppose two institutions want to know some information about the intersection between their two datasets, but neither wants to share its dataset with the other; PSI can be used. Actually, PSI is not just for the intersection itself. There are various variants of PSI: PSI cardinality, PSI sum, and circuit PSI. PSI cardinality computes the size of the intersection and reveals no information except the size. PSI sum computes the summation of values corresponding to the intersection. And circuit PSI is a general protocol supporting arbitrary computation on the intersection. At this point, some in the audience may be curious how PSI can be used in real-world applications. Intersection by itself is a more powerful computation than you might expect; many questions in real-world applications can be represented as an intersection. For example, if two companies want to know how many customers they have in common, they can use PSI cardinality with each company's customer list. If a web browser user wants to know whether a password they are using has been breached, a PSI between the passwords saved in the web browser and a breached-password database can answer the question without revealing any personal information. For this reason, PSI is already used in various systems, such as Microsoft Edge, Google Chrome, and Apple's PSI system, for users' privacy. In addition, the performance of PSI has improved a lot. For example... I think I'm running out of time here, so let me fast-forward to our use case with Shigella, where each party uses PSI cardinality to get the answers. This is a recorded video of our demo. First, we can see the CSV file at each party. With the proper command, including...

So I guess the key message here is that privacy is very important, and it's an important area for research as well. For those of you who would like to learn more about privacy-enhancing technologies, like I said, there's a broader community working to tackle this problem. The UN Privacy Enhancing Technologies Lab is one example, which I'm part of, and Samsung SDS is part of that consortium as well. So please do get in touch if you're interested in learning more about this privacy-preservation area. Thank you.

Now, this one is not a topic related to data access and data sharing. I spoke a little about the importance of mapping, and I believe standardization is truly powerful. I just came from the MedInfo conference, where I interacted with a lot of people from the global standards organizations, FHIR and HL7. And I was just speaking with some of you about creating a research repository and implementing the FHIR specification, to enable interoperability between the research folks and research data on one side, and clinical data, which is really the primary source for those of you working in the healthcare sector, on the other. So there's been a lot of mapping exercise and work done to date. The CINECA project worked on this as part of its work package three.
So CINECA work package three pushed out a lot of products related to data mapping. This is to enable harmonization of cohort data for curation purposes: a lot of NLP-related products, data standardization and specifications. And it works with Beacon in terms of searching for and discovering that data. So there's a lot of work this community could leverage. In terms of the mapping pipeline: imagine you have a cohort data description and the relevant entities you can identify within your cohort. Imagine you can map them easily to other relevant research work through ontologies, with standardization and consensus on how to encode things. There are four different NLP models that have been developed to really enable and automate this data-mapping process, to enable interoperability between different research work. ZOOMA and others were created by EBI, the European Bioinformatics Institute. LexMapr was developed by our institute at Simon Fraser University; it's currently adopted by the FDA for foodborne-disease surveillance. This is another video I have to cut short: we have Yizhu from EBI, who's going to talk a little about the ZOOMA model. There's also the aggregated API, so you can leverage and access this API within Jupyter, for example, or within Galaxy. We have a plan to package this tool up within Galaxy and make it available for ease of integration. I'm curious to see what people might use this tool for in further data standardization and interoperability.

ZOOMA is an ontology annotation tool developed by EMBL-EBI's SPOT team. ZOOMA is backed by a linked-data repository of annotation knowledge, which contains curated annotations derived from many publicly available data sources, such as Expression Atlas, Open Targets, and the GWAS Catalog. Therefore, ZOOMA can facilitate annotations relating to a diverse range of topics, including diseases and phenotypes, drug treatments, anatomical components, species, cell types, and more. Furthermore, ZOOMA can easily be configured to use new data sources, or to prioritize certain data sources over others, to enhance its context sensitivity. As part of the text-mining group, we have developed an aggregated text-mining API to query each of the models developed by the different teams. The API is exposed as a simple web tool, which we use to annotate short terms using the different models. In this example, we annotate a short phrase using the ZOOMA and HES-SO models. They return different ontology terms based on the knowledge they have: HES-SO gives UMLS codes most of the time, while ZOOMA gives ontology terms stored or used in its repository of knowledge. We are also developing a Galaxy tool wrapper around this aggregated model, so it can be used in Galaxy to annotate terms in text with ontologies.

The things I haven't talked about here are on the poster, for those of you who are interested in how to use this aggregated API within the Galaxy environment. I have a poster up front, and I'm happy to chat further. And yeah, a little bit of promotion: the LexMapr tool that we have developed is currently used by the FDA for their data standardization purposes for foodborne-disease surveillance. And also IRIDA, the Canadian platform used for SARS-CoV-2 surveillance, is using the data specification enabled by our technology. So that's it for me. Thank you very much. Thank you.
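For readers of this transcript who want to try the ZOOMA service mentioned above, here is a minimal query sketch. The annotate endpoint is ZOOMA's public REST API as remembered, and the response-field names (semanticTags, confidence) should be treated as approximate and checked against the current documentation.

```python
# Sketch: querying ZOOMA's public annotation endpoint for a short term.
import requests

def zooma_annotate(term: str):
    url = "https://www.ebi.ac.uk/spot/zooma/v2/api/services/annotate"
    resp = requests.get(url, params={"propertyValue": term}, timeout=30)
    resp.raise_for_status()
    for hit in resp.json():
        # Field names are from memory of the ZOOMA response format.
        yield hit.get("confidence"), hit.get("semanticTags")

for confidence, tags in zooma_annotate("mus musculus"):
    print(confidence, tags)
```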