Welcome to the second day of the Defragmentation Training School on Bioimage Analysis. Today we have the pleasure of having with us Professor Carole Goble from the University of Manchester, Christian Tischer from EMBL Heidelberg, David Rousseau from the University of Angers, and Ignacio Arganda-Carreras from the University of the Basque Country.

Okay, so hello everybody. I can't really see you very well while I'm also showing my screen, so I'm just going to put some of you at the bottom of the screen. In the chat I've put a link to the slides so you can read along, and I've also put in a couple of other links to pages in the research data management toolkit, and to videos about metadata for bio-imaging and data management, because I know a lot of you are bio-imaging people, so that's a bonus I've just given you as well. What I'm going to talk about today is FAIR computational workflows. As I said, Stuart will be helping me, and we both represent WorkflowHub, a workflow registry that represents the collaborative workflow work done in the EOSC-Life project; this training is associated with EOSC-Life, and I'm also highlighting contributions from other members of the club, because lots of projects have contributed to this.

A workflow is a way of chaining together a pipeline, some sort of multi-step computational analysis that links together multiple steps. Those steps could be tools, command-line tools, or even calls to containers or to other workflows, and you have some inputs and some outputs. The key thing that separates a workflow from normal, regular software is that it has abstraction and composition properties. There is an abstraction, a specification of the different steps, often presented in a graphical format, but it could be a YAML file or some other format used to describe the steps; and then there is some software execution engine, a workflow management engine, that takes that specification and handles all the pieces to do with access to the computational infrastructure, tool interoperability, portability and that kind of thing. The idea is to provide an abstraction layer over the analytics so you get some sort of implementation independence. We'll see in a minute that it's not always as straightforward as that, but that's roughly what a workflow is. The second property of a workflow system is that it's really designed around composition: we have these steps, the steps can be put into workflows, which can then be used in other workflows, and those steps will be heterogeneous, different codes, different languages, often from different third parties. So compositionality and abstraction underlie the whole principle of computational workflows. But workflows come in many different shapes and sizes. Some of you will be using Jupyter notebooks, and in some sense an interactive electronic notebook, Colab or Jupyter, is a kind of workflow, in that you have, by convention, a series of steps that you might execute in order; I say by convention because the notebook doesn't actually force the ordering of those steps.
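To make the abstraction and composition idea concrete, here is a tiny, purely illustrative Python sketch (not from the talk; all names are made up): the "specification" is an ordered list of steps with declared inputs and outputs, and a toy "engine" walks it and wires each output to the next input. Real workflow managers such as Snakemake, Nextflow or Galaxy do far more (scheduling, containers, provenance), but the separation of specification from execution is the same idea.

```python
# Minimal, hypothetical sketch of "specification + engine" (illustrative only).
from typing import Dict, List

def count_lines(text: str) -> int:          # step 1: a reusable component
    return len(text.splitlines())

def summarise(n_lines: int) -> str:         # step 2: another component
    return f"input had {n_lines} lines"

# The abstraction: an ordered specification of steps, their inputs and outputs.
SPEC: List[Dict] = [
    {"name": "count", "run": count_lines, "input": "raw_text", "output": "n_lines"},
    {"name": "report", "run": summarise, "input": "n_lines", "output": "summary"},
]

def run_workflow(spec: List[Dict], inputs: Dict) -> Dict:
    """A toy 'engine': walks the specification and wires outputs to inputs."""
    data = dict(inputs)
    for step in spec:
        data[step["output"]] = step["run"](data[step["input"]])
    return data

print(run_workflow(SPEC, {"raw_text": "line one\nline two\n"})["summary"])
```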
Other environments you'll almost certainly be familiar with are scripting environments, and scripts are, in some sense, a workflow too, although the abstraction and composition pieces aren't highlighted so much: scripts often muddle up the implementation of the steps with the abstract description of the series of steps you want. Workflow management systems and their associated execution platforms really try to make this separation, providing an environment for you to manage those steps with an abstraction layer on top. And there are lots of them, currently about 320 known to us, some of which you'll know: Snakemake, Nextflow, Galaxy. Associated with these are the repositories where workflows are managed and developed by different communities, for example the nf-core community in bioinformatics for Nextflow, and the Snakemake workflow catalog for Snakemake. Then there are registries that let you register those workflows so they can be found, and you can also deposit workflows in general-purpose repositories; we'll get to that a little later.

So it's an interesting landscape. Each of these workflow systems has its own community and characteristics. Snakemake is a graph of jobs; it's like make with Python. Galaxy is at the completely other end: an online portal where people build and reuse workflows from wrapped, pre-uploaded, pre-installed tools, and it's much more oriented towards people managing that place and making workflows available to others. Snakemake is much more about writing almost a makefile for my analysis, and Nextflow, with Nextflow Tower, is a fully fledged programming environment for computational pipelines. I would say these are the three most popular, open-source workflow systems we see in bioinformatics right now, really the leaders in the field; there are also things like KNIME, but that's only partly open source. These workflow systems are used for interactive and exploratory analysis with a human in the loop, and also for production and automated pipelines, as well as things like tool chaining, batch processing and job control. So there's an entire galaxy, as it were, of workflows.

How do you choose one? Partly on the specifics of your data types and your codes, and the kind of workflow you want to build: are you doing simulations, do you need HPC, is it chaining tools or doing job control? We won't get into those details, because that's workflow craftsmanship rather than FAIR workflows. Partly on the skill level of the workflow developer: if you're not very skilled, you want a platform where you basically pick the steps you want and build them in a graphical environment, perhaps like Galaxy. But the main questions are: how popular is it, is there a community of practice that does what I want, are there reference workflows I can use, and is there good documentation and support? Basically, use the one that everybody else uses, more or less.
There are many different kinds of users of workflows, and this is going to affect the FAIR piece. There are many different stakeholders, shall we say, in workflow environments: people who develop the tools that go into workflows, which need to be wrapped and maintained so they can be executed by the workflow systems; those who build workflows, which have to be developed, run and maintained; and those who then use the workflows that have been built, which need to be understood and explained. There's a lot of labour involved, and the labour diminishes as you go along this stakeholder space, but the reach, who you can affect, increases, because one person can make a workflow that many people can use. If they can find it, that is, which is what FAIR is all about.

FAIR stands for findable, accessible, interoperable, reusable. I usually assume by now that most people have heard of it, but I'm still often surprised that some haven't. It's a huge principle for science and research data, and increasingly software, across the entire scientific community. It's really a meme, and it's nothing new: how do I find things, how do I make sure I can get hold of things, how do I link them to other things, and how can I reuse them? So it's really not hugely novel, but the word is very cool, and this is the first time we've managed to get these principles really noticed. Hardly anybody has read the paper, by the way; it's now cited nearly 8,000 times, which did my h-index no harm at all, but most people have never read it. What the principles actually say, for data anyway, is: assign unique identifiers to things, describe things with metadata, make things accessible through a standard protocol, and use formal ways of describing them. That's what the FAIR data principles are about.

But what does that mean in practice for workflows? In practice, FAIR for workflows means: can I find some of these already-existing workflows of which you have just spoken, Carole? Can I make mine findable? Can I access them: where's the Git repository, what's the licence, what language is it written in, what tools does it use, do I have permission to rework it, is it feasible to rework it, is it well enough described that I can understand and use it? How much description do I need so that somebody else can use my workflow, always assuming you want to share it, of course? How are updates and versions managed? That doesn't apply so much to data, because data is normally considered fairly static, but workflows are a form of software, as we'll shortly see. Can I use it, am I allowed to use it, can I run it on my infrastructure, is it portable, can I test it, is it comparable to others? And because a workflow is a piece of software that effectively implements a data flow (most of them have a control flow, which determines how the execution of the steps is controlled, as well as a data flow, in which data flows from one step to the next), is the data that flows through, and is derived and created by, the workflow itself FAIR, and are there best practices to make sure it is? Also: how do I start using workflows, will I get credit for my workflow, and can I track that credit? All of this is tied up in FAIR workflows in practice.
So if we go back to what the FAIR data principles were all about: they were really about enabling automation. They were devised so that you would have machine-actionable metadata associated with data, so that workflows could then use that data, because the metadata associated with the data could actually be interpreted. That was the real underpinning of all of this. It basically comes down to: persistent identifiers, register things, enable automation. The FAIR data principles say nothing at all about data quality, reproducibility or credit; that's extra, on top.

So let me go back to workflows. How does all this apply? Remember I said that workflows have some sort of abstraction, a specification or description, which could be a graph, a YAML file, or some sort of JSON, some description associated with them, which also includes the inputs and outputs and the parameters you'll be giving the workflow, plus some configuration information. Then there is the execution piece, the engine itself, which manages all of the calling and use of the computational infrastructure, the codes for the steps you'll be executing, and potentially containers. And then a workflow has many associated objects: the resultant data, test data, sample data, and maybe other workflows that check whether this workflow is producing the correct data, checker workflows to make sure nothing has changed, that the results are still correct and the workflow is still in good shape, because workflows are effectively software.

So what we have here is a description that reads like the description of a FAIR method, really a data object, but the workflow itself is actually software, with data going in and out and some associated services. Workflows are hybrid objects: they're software, and they're often also interpreted as a data-like description of a method. This means that when we talk about FAIR principles for workflows, we're talking about three different things. First, they're method objects that describe the steps you're going to carry out, so we can use the FAIR data principles for that. Second, they're software, so there are issues around whether they're usable, reproducible and maintainable, and here there's a whole new set of principles, the FAIR Principles for Research Software; these don't just apply to workflows, they apply to all software, and they're hot off the press. Third, workflows are really instruments for data flow: the data that flows through those steps has to be FAIR as well, so we have to support the FAIR data principles for it too. I've put handy links here; that's your homework, to go and read those papers. So the principles we'll look at apply like this: for the abstraction, the specification (the picture here is actually a Common Workflow Language presentation of a workflow), the FAIR data principles apply, because it is an object that says what you're going to do.
Even if you lost the software, you'd still have this description; you might want to recode it in another workflow language or another workflow system, but at least you've got the instructions for the method. This is the method. But workflows are also software: the software is going to live on, so there are issues of stability and lifetime; they're compositional (all those little steps, I don't know if you can see my arrow wandering about, this modularisation and the dependencies, and all the different components may change); and then there's usability: can you actually use it, is it actually usable?

To sum up this context piece: what are the FAIR Principles for Research Software? They're basically variations of, or slight changes to, the data ones, so they look more or less the same. The differences are these. Some principles have been extended to cover versions and components, because software and workflows are versioned, there are multiple versions, and they have components: steps, libraries, things associated with them. Interoperability is based on standard APIs and common metadata exchange between the steps; for workflows, this basically says that you have to figure out what the inputs and outputs of the steps are from the handoffs of data between steps, and the standard APIs associated with that. There's another slight wrinkle to this which I'll go into later. And there's an emphasis in reusability on distinguishing usable, meaning it can be executed, from reusable, meaning it can be modified: reusable means I can take the software or workflow and adapt it for my needs, and usable means I can actually run the damn thing. But there's still nothing on software quality or credit in the principles; that's extra.

So what does that mean in practice for me and my colleagues? What we're going to talk about now is some steps towards FAIR workflows: registering workflows, better metadata, and best practices, to try to get these FAIR principles into our workflows. Findability first. You're all going to be making workflows, and by the end of this training with Rocco and all of his chums you'll hopefully have some. Where are those workflows going to live? Or you might want to find somebody else's workflows and reuse them, because it would be really good if we actually reused other people's carefully designed computational analytics. Well, there are all these different places; how do you find them? They're in community repositories: all these communities have their own little private gangs, and they have their own repositories, usually in the form of GitHub. But how do you know where they are? Telepathy, or just by reading papers, or something. Often people's stuff is simply in GitHub, like this one here. Or they're in community platforms: things like Galaxy or Nextflow have installations, of varying sizes, with multiple platforms in them, where people deposit their workflows and execute them, so they're effectively like cloud services.
You find them in publications, because people have published something and pointed to them. You can find them by using Google, which is how we all do research, isn't it, folks? You can find them in data repositories like Zenodo, Dataverse and OpenAIRE, but that's a bit hit and miss as well, and there they're just packaged up in zip files and deposited. So how do we overcome this very distributed, fragmented and variable world, all these silos? I've written a workflow using R, say, and somebody else has written a workflow using some Python platform; how do you know? They could be the same workflow, or it could have been useful to reuse that one, but I didn't know about it.

Registries are how we try to solve this problem, and there are two really major ones: Dockstore, which is from the US, very much oriented around Docker (as you'd expect from the name) and tied to several big US execution platforms for running workflows; and WorkflowHub, which is what we've developed in the European Open Science Cloud EOSC-Life project and what Stuart and I are involved in here. A good thing about a registry is that it gives you one place to go to, which is searchable; we can integrate it with all the different repositories we know about; it has some notion of standardisation around the metadata and the way things are presented; you can cite things if they're registered, because the registry can give them an identifier; and we can support interoperability by showing different workflows and supporting interoperable processes and interoperable languages, which I'll come to shortly. But the big thing for FAIR is metadata: we can finally get some metadata associated with our workflows.

We're going to concentrate on WorkflowHub, because that's the thing we use. It has two URLs, the .eu one because it came out of the EU, but we've also got the .org one. It was launched in September 2020, so it's coming up to its second birthday and is about to come out of beta as a result. It's sponsored by ELIXIR, the research infrastructure for life-science data; I know you're doing bioimaging, which is the Euro-BioImaging research infrastructure, but we're sisters and friends. It's very much an open development project, so you can join in. When I made the slide it had 273 workflows in it, from 12 workflow system types. The thing about WorkflowHub is that it really doesn't care what your workflow system is: if you've got a fancy Nextflow workflow, great, let's have that; but if you've got some Python scripts or some R scripts, okay. I know we said workflow systems have an abstraction, but frankly we'll take anything that has multiple steps in it, anything that is a pipeline or some sort of multi-step processing. We also have about 160 teams in it, and the teams are interesting because they help towards FAIR as well: you're basically creating a community of practice. Over 300 people have registered and are participating, so it's a growing resource.

So how does it help with FAIR? It's system agnostic, and you can search for and discover workflows of different kinds across it, because you've got tagging; tagging becomes very important. This example is actually a Galaxy workflow, showing itself as a native Galaxy figure.
That's what Galaxy workflows look like when they're in Galaxy. You can search for it, you can associate the authors with it, you can say what the licence is, there are analytics on how many people are using it, and you can create versions; this was version one when I took the screenshot. You can see it's only tagged with COVID-19, because we originally set this whole thing up to support the COVID work. We have done some metadata standardisation, which I'll come to briefly later; don't worry too much about that. And you can give a workflow a DOI, so you can cite it. At the bottom here I've got a workflow that has been cited: it's got a DOI associated with it, the creators, and a citation. That's really important, and in fact we're working with several publishers now so that WorkflowHub will be a recommended registry: when you're publishing a paper that has a workflow in it, the workflow gets registered in WorkflowHub, there's a DOI, and you can add that to your paper.

We have collections of different kinds, and one particular form of collection is a team. Workflows can be made by individuals (it could be a team of one), but typically they're made in labs, by teams of people, so we have a team mechanism: you can associate your workflows with a team, and that team becomes a community of practice. Communities of practice are really important when you're trying to find things, or to do better workflows, because you then have a place to go that tells you who is doing the kind of thing I am and what they are doing, and you can get together to find out the best workflows around a particular topic or from a particular community. So it's another way of building a community around FAIR workflows. That includes things like sharing permissions: you could set up a team or a space for this Defrag bioimaging tutorial, we'd create that team, we'd all be in it, and we'd all share our workflows. It doesn't matter that those workflows may be half baked, who cares, because we support workflows that are not yet ready to be published (we just don't give them a DOI), and you can share them only with your team rather than with everyone. So there are sharing permissions; it's not just an open thing. Remember, FAIR doesn't have an O in it: it's access, not open. FAIR is not about being open, it's about being able to access things, and that might mean emailing somebody to ask how to get hold of it.

We also have accessibility through the platform itself. WorkflowHub uses an API called the TRS API, which comes from the GA4GH (Global Alliance for Genomics and Health) standardisation community, and that means you can interoperate it with all sorts of other systems. For example, if you find a Galaxy workflow on WorkflowHub you can run it on Galaxy by pressing the "Use on Galaxy" button, and the same goes for other systems, like the Sapporo system from Japan. It also links across to other services, for example the LifeMonitor service, which I'll mention briefly: a service that manages and tests whether your workflows actually work.
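As an aside, the TRS (Tool Registry Service) API mentioned above is just an HTTP/JSON interface, so you can query it from any language. Here is a minimal sketch in Python, assuming WorkflowHub exposes the standard GA4GH TRS v2 path shown below; check the current WorkflowHub documentation for the exact endpoint and response fields.

```python
# Minimal sketch: list a few workflows from a TRS v2 endpoint (endpoint path assumed).
import requests

TRS_BASE = "https://workflowhub.eu/ga4gh/trs/v2"   # assumed standard TRS v2 prefix

resp = requests.get(f"{TRS_BASE}/tools", params={"limit": 5}, timeout=30)
resp.raise_for_status()

# TRS v2 /tools returns a JSON list of tool objects (id, name, description, ...).
for tool in resp.json():
    print(tool.get("id"), "-", tool.get("name"))
```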
So the TRS API gives you integration with execution platforms using standards, and that's useful because one of the things FAIR needs is assurance that the workflows actually work. The important thing here is that WorkflowHub doesn't replace community and development repositories; it works with them, and that really matters. You can upload a workflow just by filling in a form and uploading some files, or you can connect it to a GitHub repository, so you keep using the GitHub environment: all these different communities, like the Nextflow people, will carry on carefully managing, curating and looking after their workflows there, but then they can register them into WorkflowHub and everybody can see them, including version management. You can also add documents and test data, and we partner with very high-profile systems to improve things like metadata extraction: if you have a workflow from a workflow management system that surfaces its metadata in the workflow language, we can scrape that out and use it to help register the workflow into WorkflowHub. One of the most important things is the CITATION.cff format for citing software, which I'll mention briefly later on. So there are different ways of getting metadata into the registry when you register.

I mentioned as well that there's a whole new world of standards associated with this. There's Bioschemas, a life-science profile of schema.org, the metadata markup language of the web that's in all your web pages, which gives you a simple way of describing what the workflow is about. There's the Common Workflow Language, a way of describing the steps of a workflow in a canonical way, so it doesn't matter which workflow system you're using, there's one way to represent it; we're trying to move people towards that so we can do better, richer metadata collection from the different workflow systems. There's an ontology that types the inputs and outputs of steps, the EDAM ontology, in biology. And there's a way of packaging all the stuff you need, all those things from the earlier picture, all the metadata and all the companion objects, into a package called a crate (an RO-Crate), so that you can move it around between systems: you'll be able to pass the workflow to, say, a workflow testing environment, or a workflow management system can generate a crate and pass it to the Hub, which unpacks it and registers it. The good news about all this, guys, is that you don't need to know any of it; it's all under the hood. I'm just telling you about it so that if you ever see these words pop up you can say, oh yes, I know all about that.
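To give a flavour of what that under-the-hood metadata looks like, here is a small, hand-written sketch of schema.org/Bioschemas-style JSON-LD for a workflow, generated from Python. The workflow name, creator and values are made up, and the exact required fields and types should be checked against the current Bioschemas ComputationalWorkflow profile; this is only meant to show the general shape.

```python
# Illustrative only: emit minimal schema.org/Bioschemas-style JSON-LD for a workflow.
import json

workflow_metadata = {
    "@context": "https://schema.org/",
    "@type": "SoftwareSourceCode",            # workflows are profiled on top of this
    "name": "Nucleus segmentation pipeline",  # hypothetical example workflow
    "description": "Segments nuclei in 3D light-sheet images.",
    "programmingLanguage": "Nextflow",
    "creator": [{"@type": "Person", "name": "Jane Doe"}],
    "license": "https://spdx.org/licenses/MIT",
    "keywords": ["bioimaging", "segmentation"],
    "version": "0.1.0",
}

print(json.dumps(workflow_metadata, indent=2))
```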
The bottom line is that using a registry gets you a long way towards FAIR. By putting something in a registry you're giving it a globally unique and persistent identifier, you can get it back from that identifier, and we manage the common platform to do that; you're describing it with metadata, of several different kinds, that others can get hold of; the workflow is licensed; we know where it came from, because it came from you when you registered it; and we can check whether it's actually meeting community standards for how it's described. So just putting the workflow in a registry gets you a heck of a long way, and that's why we're going to focus today on getting you onboarded into a registry: for all the stuff you're going to produce, we want to help you make sure it's FAIR, even if it's just some Python you've knocked up for your homework.

A few other things before we get to the practical. If you have software and you want it cited, use CITATION.cff. Who knows about this? It's a file format that you put into your GitHub repository; it will be picked up in the next release of WorkflowHub, and it also gets picked up by Zenodo and Zotero, and it means you're instructing people how to cite your software. If you want credit for your software, put a citation file in; and if you're using somebody else's software, credit them. We need to make software a first-class research object, not just papers. I'm on a mission to eliminate papers; we're going to have software and data.
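For reference, here is roughly what a minimal CITATION.cff can look like, written out from Python with PyYAML. The title, author, ORCID and DOI below are placeholders, and the field names follow the Citation File Format (cff-version 1.2.0); check the CFF schema for your own project before relying on it.

```python
# Minimal sketch: write a CITATION.cff so GitHub, Zenodo and registries know how to cite you.
import yaml  # PyYAML

citation = {
    "cff-version": "1.2.0",
    "message": "If you use this workflow, please cite it as below.",
    "title": "Nucleus segmentation pipeline",          # hypothetical workflow name
    "version": "0.1.0",
    "authors": [
        {"family-names": "Doe", "given-names": "Jane",
         "orcid": "https://orcid.org/0000-0000-0000-0000"},  # placeholder ORCID
    ],
    "license": "MIT",
    "doi": "10.5281/zenodo.0000000",                    # placeholder DOI
}

with open("CITATION.cff", "w") as fh:
    yaml.safe_dump(citation, fh, sort_keys=False)
```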
Our workflows, like all our software, are supposed to be reusable, and they're composite things: you've got the steps, and a step could be a command-line tool, API-driven access to a tool, another workflow, another script, or just some script inside the workflow itself. The rule about interoperability in the FAIR principles for research software is that these steps, these components, should interoperate with each other through APIs and through standards, which means the workflow and its steps should read, write and exchange data in ways that meet domain-relevant community standards. What we're thinking about here is unit testing and validation of workflow building blocks: rather than monolithic scripts, think about how to produce modular components, scripts or workflows that do a particular task, which you can reuse, test to make sure they really work, validate, and then publish so other people can use them. So we need metadata on the I/O, so that a block can be understood, modified, built on and incorporated. There's a movement towards this, and one example is BioBB, the building blocks from a community that works in computational molecular modelling; they're all in WorkflowHub, all available, so you can go and see them. What they've done is build a collection of Python-compatible wrappers for their tools, which you can then use in lots of different workflow systems: Galaxy, CWL, Jupyter notebooks, or a rather exotic one called PyCOMPSs, which works over HPC. They all have equivalent functions, which means the developers have worked out things like licence combinations and access permissions, they have clean interfaces, each block is properly tested and properly managed, and then you can make lots of workflows out of them. It's like Lego: these are the building blocks, and because they're engineered together, when you put them together all the interfaces work with one another.

Another step in that direction is the Common Workflow Language, a way of describing workflows independently of their workflow management system, which means you can exchange workflows between different environments if they are CWL-compliant: workflows developed in Toil, for example, can be passed to somebody using another compliant system and still run, because they're described using the Common Workflow Language but executed using whichever workflow environment you have; this is what's done in the microbiome community. In WorkflowHub we use CWL not as an execution environment but as a way of standardising descriptions, so we can compare workflows in the registry that come from different languages, because each workflow system has its own language for describing its steps; CWL is also very good at describing what the steps and their interfaces are, so it forces you to do the metadata. There's also the question of how to make tools ready to put into workflows, and there's a whole set of rules about that which I'm not going to go into in detail; luckily there's a paper called "Ten simple rules for making a software tool workflow-ready", supported by a bunch of people from different systems, including us, about how to make your tool, your piece of software, usable inside a workflow system. And this next piece comes from some of our Japanese colleagues, who say that basically it comes down to this: if I want to make a workflow reusable and usable, then I need some basic things, like testing materials, an open-source licence, knowing the version and the workflow language, and being able to find the workflow and who made it. So another way of thinking about it is availability, validity and traceability.

That means I'm asking you, when you're producing your workflows or your scripts: does the workflow run with no errors, does it produce the expected results, does it validate its parameters, have you got test cases, have you got checker workflows that make sure it does what it should, so you can run it on test data and see that it still works and that nothing has changed? There are dedicated frameworks that simplify all this testing for different systems, and there's also LifeMonitor, an emerging testing framework for testing across different workflow systems; look out for that, it's coming up at the moment, and it's going to be connected to WorkflowHub, so if you put your workflow in, you can keep checking that it still works by using LifeMonitor. I'll skip over this bit.
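Picking up the checker-workflow idea just mentioned, here is a minimal, hypothetical sketch of what such a check can look like in Python: run one step on a small piece of test data and compare the output to a known-good checksum, so you notice when behaviour changes. The tool name, test file and checksum are all placeholders.

```python
# Minimal, hypothetical "checker" for one workflow step: run it on small test data
# and compare the output to a known-good checksum, so behaviour changes are caught.
import hashlib
import subprocess
import tempfile
from pathlib import Path

TEST_INPUT = Path("tests/data/tiny_input.tif")        # hypothetical test data
EXPECTED_SHA256 = "replace-with-known-good-checksum"   # placeholder

def run_step(input_path: Path, output_path: Path) -> None:
    # Stand-in for your real step: a CLI tool, a container, a script...
    subprocess.run(["my_segmentation_tool", str(input_path), str(output_path)],
                   check=True)

def check_step() -> None:
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "result.tif"
        run_step(TEST_INPUT, out)
        digest = hashlib.sha256(out.read_bytes()).hexdigest()
        assert digest == EXPECTED_SHA256, "step output changed - investigate!"

if __name__ == "__main__":
    check_step()
    print("checker passed")
```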
And the last point I wanted to make: when you're building your workflows, check a couple of things about the data products you're producing. Do they have identifiers? Are they licensed? Are there any restrictions on the reference data you're using, which would mean nobody else could use the workflow because they don't have access to that reference data? Are you validating the parameters? Do you know the provenance of the derived data you produce, where it came from, can you report that, are you keeping a log? The bottom line here is that the data products you produce when you write a pipeline will exist outside of the pipeline; they have to live somewhere else afterwards. So are you enabling them to do that? Are they FAIR data?

So, to summarise how you can make your workflows FAIR: register them. If nothing else comes out of this, if you remember nothing else, register your workflows, and use a citation file in your GitHub repository as well. Beyond that: use standard identifiers; think about portability, so somebody else can run it; document your workflow for a stranger, not for yourself (well, you can document it for yourself, because you will be a stranger to your own workflow in six months' time), but document it for strangers; use a workflow management system that is FAIR-enabling, so think about using one of these platforms rather than writing your own on-the-metal Python to do your workflows; and think about a workflow as software, with a management plan and a checklist. And here's an interesting paper we just came across, the first paper I've ever seen with "F***" in the title, which I think is going to make it hard to google. It's a really interesting paper, because what the authors did was find some workflows and then try to access and reuse them, and the paper describes all the pain they ran into when they did. It's really interesting to read it and ask yourself: if they found one of mine, would they have the same pain? Probably, yes. Okay, so that's the end of our talk.

Hello everyone, thanks for joining. I will give this talk, but actually my colleague Bura prepared most of the material, and he cannot be here today, so a big thank-you to him. I also want to say that Bura is employed at EMBL and is working more or less full-time just on this topic, to give you a feeling that we are taking this very seriously here and want to do more and more work of the kind I'll give you a flavour of today. We got into this whole business not for conceptual reasons but as a purely practical thing; a bit like Edwin, we just didn't know what else to do. Many, many people were involved in this project, so I will not go through all of them; many are on the biological side, some on the electron microscopy side or the image analysis side. The people highlighted here with the yellow pen are the ones who actually contributed code, Python or Java, to make all of this work. The main person behind the scenes, whom many of you might know, is of course Josh Moore in the lower-left corner, who does an incredible job of keeping the global community together on this project, so that hopefully, after 20 years of suffering, we will all have one image data file format instead of 500-plus. Let's see where that goes; I'm quite hopeful. And then I would also like to thank the Chan Zuckerberg Initiative, and not only for funding me, but because I think they're really helping.
They're giving money to, for example, projects to write software to open OME-Zarr in napari; where else do you get funding for that? So I think they've really found a place where they can help, is my feeling, so thanks to them.

Okay, so what was our challenge? We had this really big 3D EM dataset, eight terabytes uncompressed, which initially came as over 10,000 TIFF planes. You can actually look at such data: most of you know that if you do File, Import, Image Sequence as a virtual stack in Fiji, you can deal with it, so it's not impossible. The problem is that if you don't want to look at it plane by plane in that orientation, but in the other orientation, you cannot, because the TIFF file format is plane-wise: you can only load one plane at a time, and you cannot load such a cross-section efficiently without loading the whole terabytes. And unfortunately the specimen was imaged in such a way that the other orientation would have been the good way to look at it. So that was a big problem. What to do?

What you actually want, and this is no secret, it's been around for decades, is chunked, pyramidal image data storage. This is what Google Maps uses, what Imaris uses; it's really old stuff. The idea is this: the fine grid here shows the actual pixels, and the more solid grid shows what is stored together on the hard disk. That means I can read these pixels here very efficiently from the hard disk, because they're all together. Why is this nice? The cool thing is that if I rotate my dataset and want to look at just some section like that, I can load just that section without loading all the other stuff. In TIFF that's not possible, because in TIFF a whole plane is stored together on the hard disk, not in these individual chunks. So that's the point of chunking: TIFF is a chunked file format, but it's a plane-wise chunked format, and what you want for such data is 3D chunking, with little cubes that you can load. The other thing you need is multi-resolution: you store the data additional times at lower resolutions, and you only load the full resolution when you're zoomed in far enough to actually see it. And now the chunking becomes important again, because when I'm zoomed in I can only see, say, the centre part of the image on my screen, and I need the file format to support loading only that centre part; again, this is why chunking matters. This is basically agreed upon; this is the way to store big image data.
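To make the chunked, pyramidal storage idea concrete, here is a small sketch using the Python zarr library (the same idea underlies OME-Zarr, N5 and the Imaris/BigDataViewer HDF5 formats). The array sizes, chunk sizes and file name are made up, and the downsampling is a naive subsample just for illustration; real tools average or smooth when building the pyramid.

```python
# Illustrative sketch: store a 3D volume with cube-shaped chunks plus one
# downsampled level, so a viewer can read only the bricks it needs.
import numpy as np
import zarr

volume = np.random.randint(0, 255, size=(128, 512, 512), dtype="uint8")  # fake data

root = zarr.open_group("pyramid.zarr", mode="w")

# Level 0: full resolution, chunked into 64^3 bricks (not plane-wise like TIFF).
level0 = root.create_dataset("0", shape=volume.shape, chunks=(64, 64, 64),
                             dtype="uint8")
level0[:] = volume

# Level 1: naive 2x downsampling along every axis.
level1_data = volume[::2, ::2, ::2]
level1 = root.create_dataset("1", shape=level1_data.shape, chunks=(64, 64, 64),
                             dtype="uint8")
level1[:] = level1_data

# Reading a small sub-region touches only the few chunks that overlap it on disk.
sub = level0[40:80, 200:264, 200:264]
print(sub.shape)
```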
So in practice, what we first tried was the BigDataViewer file format, which does exactly that. We loaded the data as a virtual stack image sequence, and then you can save it in that format, but it didn't work: I think our installation just couldn't cope with the eight terabytes, we had some memory leak, it was crashing. We struggled for a while and then said, okay, let's just try Imaris, and that actually worked. They have a really, really good library for reading and writing image data, I have to say. The other thing they do very well, I think, is that they made the good move of making their file format open: the specification is simply on the internet, and the way they do it is almost identical to what BigDataViewer does; I don't know who copied whom. Because it's open, you can actually open such a file in Fiji; some of you may know that there have been plugins for more than ten years that let BigDataViewer open it. So this is one of the few examples where a commercial file format is actually helpful, because it's open and there's good integration.

That was great, but then we wanted to share this data with collaborators who are not at EMBL. There is in fact an add-on to this whole Fiji ecosystem that you can download and install, called BigDataServer. This is not in Fiji, you have to install it somewhere else, but you put it on top of your dataset (which, by the way, is technically stored in something called HDF5), and that thing exposes a web server where, via HTTP, you can say: give me that chunk. It translates the HDF5 chunk storage into HTTP requests, and then somebody else in the world can open Fiji, connect to this BigDataServer, and in principle lazily load individual chunks. This has also existed for many years; I actually didn't know, but then I learned it. We thought, awesome, that's it. But the problem is that our security person in IT said: no, you don't do that. We will not open HTTP access to our file system; that's just not going to happen, it's too insecure. That was very sad.

So what's the problem? I think the problem is that if you give access to a normal file system, then if some hacker gets in, there are a lot of things they can do: a normal file system supports tons of operations, and that's hard to manage, and my understanding is that this is part of why it's such a security risk. I think this is also part of the reason (I'm not the super expert) why in the cloud people typically don't use file systems if they want to share things with other people around the globe; they use this thing called an object store. The point, as far as I understand it, is that it's much safer because it's much, much simpler: there are basically just two things you can say, either "put an object there" or "give me that object", and that's it. There's none of all the other stuff, which makes it much easier to manage. This is why Amazon is very big on that, and Google too; they all have their object stores, and I think that's how they do a lot of things. Our IT department also realised that this is the thing, and Yozef spent a lot of time setting up an S3 infrastructure at EMBL, and I think they're now seriously rolling it out, so that each user at EMBL gets access to it.
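The "just give me an object" idea really is that simple at the HTTP level: a publicly readable object in an S3-compatible store is just a URL you GET. Here is a sketch, with a made-up endpoint, bucket and key, since the real EMBL addresses are not reproduced here.

```python
# Illustrative only: fetching one public object from an S3-compatible store is a plain GET.
import requests

ENDPOINT = "https://s3.example-institute.org"   # hypothetical S3-compatible endpoint
BUCKET = "my-public-bucket"                     # hypothetical bucket
KEY = "example_image.ome.zarr/0/.zarray"        # one object = one small file/chunk

resp = requests.get(f"{ENDPOINT}/{BUCKET}/{KEY}", timeout=30)
resp.raise_for_status()
print(resp.text)   # for a metadata object like .zarray this is just JSON text
```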
So how does that now work with this big image data? We need a file format that works well with this simple "give me one object" idea, and this is basically where Josh came in. In fact, in parallel, Stephan Saalfeld from Janelia did the same; some of you might have heard about N5, and I won't go into it now, but it's basically the exact same idea, a parallel development. Luckily, the Janelia people are now also behind what Josh is doing, so we will have no fighting between two new standards, which would have been annoying. So I think that's actually working.

The idea is super simple: one chunk is one object. The way you store your image is that every chunk is one thing that you store; in the end, an object store is nothing else than a hard disk at some point, so there is just one file, or object, per chunk, and then you can very easily say: if I want that chunk, just give me that file. This is the very simple idea that both Saalfeld and the Zarr community had. Long story short, we did all of that. It was of course a lot of work, and initially we didn't actually use OME-Zarr, because I think it wasn't really there yet, so we started with the N5 implementation from Janelia, but now we are switching over, and it really works. This is something we will try together in the practical session, so you can see for yourself that you can browse an 8 TB image on some random computer with this technology.

I'm always getting confused myself, so I think the easiest thing right now is if we all go to this GitHub page that Bura lovingly prepared, and we go to the BAND again and do this together. This is also part of the course: if you want to do stuff in the cloud, there aren't only Jupyter notebooks, you can also have full-fledged desktop computers, which for our field is not so bad, because then you can run Fiji; otherwise it's a bit difficult. I think here we said four CPUs, but many of you might already have a running desktop. If you have one, just click "go to desktop"; if not, choose 4 CPUs and 8 GB memory and click "launch". Mine is still running for some reason, so I will just go there. Now comes the challenging part for me: we have to open the GitHub repo website on this cloud computer, and I don't know how to copy and paste into it, but you have probably done that already. Up here is the Firefox icon. We thought a lot about how to make this not too difficult, and the easiest we could come up with (maybe there are cleverer ways) was to make this tiny URL, which I will post again; it's something you can actually type by hand up here. But maybe you did all of that already, because you must have somehow copied and pasted the first one. Everything we do right now you can, in theory, also do at home: you just have to install all the software that we installed. I won't go through that now, it would take too long, but if you click afterwards on that link, it describes how we installed everything, and if you know how to use conda a little bit, everything should be possible for you; get back to us if not. It's all based on conda installations, nothing difficult. So the first thing we will do is introspect an OME-Zarr dataset, and for that we will copy something that is stored in the cloud: this is a cloud address, hosted at EMBL, and we will copy it locally. For that I will just copy this whole command, paste it here, and press enter. So now you've basically learned how to download data from an object store onto a computer: there is this tool called mc, a command-line tool, and "mirror" means copy.
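For reference, the same "copy from the object store to local disk" step can be done from Python with s3fs/fsspec instead of mc. The endpoint, bucket and path below are placeholders, since the real EMBL address is not reproduced in this transcript.

```python
# Illustrative Python equivalent of `mc mirror`: copy a remote OME-Zarr folder locally.
import s3fs

fs = s3fs.S3FileSystem(
    anon=True,  # public, read-only access
    client_kwargs={"endpoint_url": "https://s3.example-institute.org"},  # placeholder
)

# Recursively download bucket/path (placeholders) into a local folder.
fs.get("my-public-bucket/example_image.ome.zarr", "example_image.ome.zarr",
       recursive=True)
```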
That's not too difficult. Then we can look at what we actually downloaded; for the non-Linux people, ls means "look into this folder". As was already mentioned, an OME-Zarr is actually a folder, and we see there is a bunch of files and a bunch of other folders. Now let's look at what that actually means. The important files are these .zattrs files: they contain all the metadata. Let's print the content of one of them; for that we just copy and paste this command, which is basically the Linux way of saying "please print the content of this file", and it's just a simple text file. That's also important: everything should be simple, not something complicated. So it's a text file, and inside the text file, for those of you who might know, we have this sort of structure, which is called JSON, a way of specifying metadata, and that's also an absolute super-duper standard. We have a text file and we have JSON; I think you cannot be more standard than that on this planet right now. That's just to give you a bit of the idea.

What can we actually see? Our file has several axes, and they have units and types, and this will look very familiar: it's the typical x, y, z, channel, time dataset. We currently support exactly those dimensions, or a subset of them, and you can have units, so this is a space coordinate and it's in micrometres; I think that's fair enough and understandable. Then it gets maybe a little bit more hairy, and I also have to think for a moment: now come the coordinate transformations. There were long, long discussions about how to encode pixel size in this file format. That's obviously very important, you want to know how big one pixel is, and the consensus now is that we don't want to do it in that simple way. We say there is a certain voxel space, which is how we store the data, and there should be a mapping from voxel space to physical space, in a very generic way; it could be something very complicated. The way to map from voxel space to physical space is via a coordinate transformation, as some of you may know well enough. Right now we support mainly one very simple coordinate transformation, which is called scale, and that is basically just the voxel sizes, in the same axis order as the axes above, so this would mean 0.14 seconds for the time axis. Channel was also a big discussion: is that even a normal dimension or not? Right now it is. And what's the unit of a channel? You see there's no unit here, so all these things are harder than they look. Then the z: somehow the z dimension in my case is lacking its unit, probably for you too; there should be a micrometre there, I don't know what happened. And then x and y both have the same, 88 nanometres. All right, sounds good. So this is how we store metadata at the moment, and we want to get fancier here soon, so that you can transform data more correctly, rotate it while you load it, and so on. Then you see there is another coordinate transformation, and another one, and everything stays the same except on the spatial part.
For x and y the voxel size becomes bigger, and the reason is that these paths specify the resolution levels: we store the data three times, in fact, once at the highest resolution, then downsampled once, then downsampled another time, and that means that to map from a downsampled voxel space to physical space we need a different scale. So that's basically all the information at this top level of the metadata. Any questions? I can't actually see you, so you just have to speak.

Question: Christian, maybe a very obvious one. What you specify there is just the level of resolution, but there is a single version of the data; you are not storing the lower-resolution data, right? It's just saying what the number is to get the lower-resolution version?

Answer: It is stored. In fact, if I go back here, there are three subfolders, this 0, this 1 and this 2, and they are exactly those resolution levels; and the names don't have to be 0, 1, 2, they could be "hello" and "world" tomorrow. In these subfolders the different resolutions are physically stored.

Question: And is there a reason why you need to store them? Because resampling the images may take a long time?

Answer: It's the Google Maps thing: if you're looking at a zoomed-out version of the world, you don't want to load all the voxel data that tells you where the flower pot in Heidelberg is.

Question: About these different resolutions: there are now three of them. Is it always three, or is it flexible, whatever you want?

Answer: It's flexible; this will come up again later. Thanks.

So I think we're at this line now: let's look into one of these subfolders. Copy, paste. Now we are in the highest-resolution part, and there is a file called .zarray; let's look at what's inside. This is interesting: this is the data type. The first time I consciously looked at it, I would have hoped for something more readable, but I think it means an 8-bit unsigned integer, one byte unsigned; I'd have to check why it looks so funny, and I don't know what the other symbol means, so I'm actually not sure, but that's the data type. It would be nice if it were a bit more human readable, but maybe there's a good reason, because you do want to know your data type. Then the shape is important: this is basically how big the image is in terms of pixels. I have 11 time points, 2 channels, 5 z-planes, and this is x and y. That's where you can check how big your image is, if you want to go that way; there will of course be libraries that do this for you, but I just wanted to show that you can actually read it yourself. And this is the chunking: what it means here is that each channel, z-slice and time point is one individual file. This is how some people store TIFFs as well, one TIFF per plane, so this would basically be the same: we now have one file for each x-y plane, and we don't chunk within x and y, so there everything is in one file. So that's where you can get this kind of information about how the data is laid out on disk.
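If you prefer to read this metadata from Python rather than printing the .zattrs and .zarray files by hand, the zarr library exposes the same information. A small sketch on the locally copied dataset (the folder name is a placeholder; the exact keys follow the OME-NGFF "multiscales" layout discussed above):

```python
# Read the same OME-Zarr metadata programmatically instead of cat-ing .zattrs/.zarray.
import zarr

root = zarr.open_group("example_image.ome.zarr", mode="r")   # placeholder path

# .zattrs content: the "multiscales" entry with axes and per-level scale transforms.
multiscales = root.attrs["multiscales"][0]
for axis in multiscales["axes"]:
    print(axis)                     # e.g. {'name': 'x', 'type': 'space', 'unit': ...}
for dataset in multiscales["datasets"]:
    print(dataset["path"], dataset["coordinateTransformations"])

# .zarray content of resolution level "0": dtype, shape and chunk layout.
level0 = root["0"]
print(level0.dtype, level0.shape, level0.chunks)
```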
There is actually a way to look at all of that in one go: there is this nice tool called tree. If we copy and paste that, you can see in one go how this whole thing looks on disk. You do see one of the disadvantages, namely that there are a lot of tiny files, but essentially it's a whole folder structure, and each of those zeros at the end is one plane, and it's just a simple binary without any metadata; it really contains only the raw pixel data. So if you have an application that only wants to load that plane, it can load only that chunk from the disk without touching any of the other stuff. That's the idea. You wouldn't do that manually; there are libraries that do it for you, and I will get to that in a second.

Now let's say your data is actually in an object store, not on your hard disk, but you still want to know what's in there. There is a tool for that which the OME consortium, the OME-Zarr people, maintain. I copy this, and you see the argument is actually a web address. Now comes the cloud part, because up to now there was not much cloud in this (apart from the fact that this is a cloud computer): now we're really looking into a cloud-stored version of the whole thing and getting some information, and everybody in the world who has this link could do the same on their computer, no matter what country they live in. And here you get a sort of summary of the data: it tells you how big your data is at the three different resolution levels. Let's do that for one other file, which is the eight-terabyte dataset I was talking about before; I don't know whether this will be much slower, and I don't know exactly what it does, let's see. Okay, fair enough. If you have a good memory you might remember that I said we had over 10,000 TIFF slices; that was the original data format, and this is that eight-terabyte 3D volume, stored at that web address, and you see we have a lot of resolution layers here, which are needed for smooth browsing.

Let's look at it, for example, in napari. I just copy and paste this line, and this is now opening the locally stored dataset that we copied at the beginning of the practical and launching napari; it's taking a bit, which is usually not a sign that anything crashed. Some of you might recognise this dataset, a very classic sample image, and we can do the usual stuff, go through the z dimension and through time and so on. I will not teach you anything about napari, there's no time, but the point is that you can open such data with napari; people are supporting this. I will immediately close it again, because now comes the cooler part: the next line is actually a web address. Copy, paste, and the same thing happens. You can open this, and you have to trust me that the way it does it, it does not download everything initially, but only downloads on demand what it currently needs for the display; it's using this chunked storage to load only what's needed. Good. Now let's try that with the eight-terabyte dataset; that's the ambitious part, and this will surely be the proof that it does not download everything, if it works. Oh, there's something. Let's go to a different z-plane. It does load, but it takes time.
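The same lazy, load-only-what-you-need access can be scripted directly with fsspec and zarr; the URL below is a placeholder standing in for the EMBL address used in the practical. Slicing the array only fetches the chunks that overlap the requested region, which is exactly why the chunked layout matters.

```python
# Lazy reading: open a remote OME-Zarr and pull only the chunks needed for one region.
import fsspec
import zarr

URL = "https://s3.example-institute.org/my-public-bucket/example_image.ome.zarr"  # placeholder

store = fsspec.get_mapper(URL)        # HTTP(S)-backed key/value store over the objects
root = zarr.open_group(store, mode="r")
level0 = root["0"]                    # highest-resolution array

print(level0.shape, level0.chunks)
# Slicing reads only the few chunks that overlap this region, not the whole array.
crop = level0[..., 1024:1536, 1024:1536]
print(crop.shape)
```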
So it's not super smooth. If you zoom in, and zooming is with the mouse wheel, it actually... oh, not so bad. But I think the message here is that it is somewhat worse: napari, and they admit this themselves, is not yet very good at reading very big files. It sort of works, but you can see it is hanging a little bit now. So I think the fair take-home message is that it is not super convenient, but in principle possible; maybe they will work on it. Let's close napari.

Let's do the same in Fiji. To open Fiji we just type fiji, and then in Fiji we have to do what's written here. To save time, maybe you can just trust me that the small dataset works, and let's go immediately for the fat one. So: Plugins, BigDataViewer, OME-Zarr, Open OME-Zarr from S3; S3 is the object store. And now comes the challenge: you have to somehow copy and paste from here into there. Actually, this is code that I co-maintain, so we just put that address in because we think it is a nice default, so you don't even have to copy and paste, I forgot; you can just press OK. All right, so now, in BigDataViewer, changing the z-slice is the mouse wheel and zooming in is the arrow-up key on your keyboard. And, since I wrote that code and thought it was interesting, it actually tells you what it loads from the cloud, so you can see exactly which chunks are loaded from the object store, and at which speed. You see we saved the data, unsigned 8-bit integer, in such a way that each chunk is roughly one megabyte; I think somebody had that question, and we benchmarked it a little bit to find what feels good. Here you see this is from the third resolution layer, and that is the number of the chunk within the file. This just means we can go even higher; I think we had six resolution layers or so, and zero is the highest, so now it is loading from the highest resolution.

Sorry, yes? Could you relate the identifier of the chunk to a position in the sample? Well, that is essentially what the coordinate transformation is doing, but actually not quite: the chunk is bigger than a voxel, so for the chunk itself it is not so easy. You could write down the formula, but this is really what the library is doing for you; this mapping is not something we specify anywhere in the metadata, it is what the zarr library handles under the hood, more or less. I see the point, though: it could be helpful for a user to understand, if I observe this chunk, where it is in my sample, at the bottom or at the top. That's true, but this panel on the left is honestly nothing I have ever used for anything other than giving a course; I think it is just cool to see that it actually loads the chunks, but I never had a use case where I would have to share that information with someone. In theory, though, you could do it.
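To make the "each chunk is roughly one megabyte" remark concrete, here is a back-of-the-envelope check; the chunk shape is purely illustrative, not the exact one used for this dataset:

import numpy as np

chunk_shape = (1, 1, 1, 1024, 1024)              # (t, c, z, y, x), made-up chunking
bytes_per_voxel = np.dtype("uint8").itemsize     # unsigned 8-bit integer -> 1 byte
print(np.prod(chunk_shape) * bytes_per_voxel / 2**20, "MiB")   # -> 1.0, about one megabyte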
Another cool thing with BigDataViewer, and I think this is something you cannot do in napari: if you hold the left mouse button, click in the middle, hold it and then drag somewhere, you are rotating the viewing axis. This is what I said initially, this was really what we needed for this whole project: we needed to look at the thing along this axis, because this is the head, that's the tail and these are the arms, and this is basically a non-orthogonal slicing of the whole dataset, and you can also rotate and so on. This was basically why we got into all of this, because we needed to look at that thing from different axes. Did it work for all of you, can you open it? Okay. So that was basically our main use case and why we got interested: we wanted to share such data with the world and we had no way to do it, but these days it is possible, which is pretty cool.

Then I'm closing Fiji, and let's go to yet another way of looking at OME-Zarr data, and that is in the browser. We are now looking at the exact same file again, the small one, because I think this viewer also has trouble with the big one; it is the same HTTP address. Chrome clearly wins today, this works better in Chrome, so to open Chrome you go to Applications, Internet, Google Chrome, and then copy and paste this web address. And actually, like that it is very easy to understand: the first part of the web address is the viewer and the second part is where my data is, which I think is good for FAIR science. Paste, go, and I can look at the same dataset in the web browser: here is the z-slider, here is the time slider, and you have brightness and contrast settings. This is really, I think, the whole vision, and I am honestly excited about it: I saved the data in one place and I can open it in napari, in Fiji and in the browser. I think that is really pretty cool, and there is nothing complicated about it, it is just standard software, everything very standard and simple. For my taste this is quite nice.

Let's quickly create an OME-Zarr at the end, because I think this was the question. First of all we need a normal image, so we have to copy one from an object store. By the way, you can also store normal images in an object store; it doesn't have to be an OME-Zarr, in fact you can store anything, as long as it is an object, which as far as I know just means a bunch of bytes. So now we copied something from the object store folder image-data into our local folder image-data. Let's look at what we actually copied: we copied a TIFF file, and it is actually the same thing we had before, just as a TIFF. Now, for converting that to OME-Zarr there are Python libraries and there are Java libraries. I think the current standard way, the one I would recommend, is installing this tool on your computer; you can do this with conda, you can figure out the installation from the link I had above, and it is also what is officially supported by OME. Then you have a bunch of options: how you want to compress your data, and there are several compression algorithms, I think all of them lossless as far as I know; how many resolution levels you want; and this one is a technicality that I don't even understand myself, just something we had to do to make this practical work smoothly.
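For completeness, here is a minimal sketch of writing an OME-Zarr directly from Python with the ome-zarr-py library; this is an alternative illustration, not the conda-installed converter used in the course, and the array is a random placeholder rather than a real microscope image:

import numpy as np
import zarr
from ome_zarr.io import parse_url
from ome_zarr.writer import write_image

image = np.random.randint(0, 255, size=(5, 512, 512), dtype="uint8")   # placeholder z, y, x stack

store = parse_url("converted.ome.zarr", mode="w").store
root = zarr.group(store=store)
write_image(image=image, group=root, axes="zyx")   # builds the resolution pyramid and the metadata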
You can also specify the chunking and all of that, but basically it is one command-line call, very simple, one input and one output. That is what I would use today to convert a TIFF file to OME-Zarr. The good thing is that this goes through Bio-Formats, so here you can put in anything that Bio-Formats understands, which is very important, for example whatever your microscope outputs. Then we paste this and it is converting. It doesn't actually say something like "super, I'm happy, I'm finished", which is a bit weird, so let's check whether it actually did it. I don't have the command here, so we type ls, for list, and then copy and paste the folder into which we created the thing, and I have something, but I'm not sure now whether that is just me because I did it before. So that is basically how you create one of these. The rest, for the sake of time, I can just summarise: this is how you would then copy it to the object store. So if you would like to share data with anybody in the world, it is copy and paste these two lines into a terminal window and press enter, if you have installed these two tools and if you have an object store, which is probably the biggest problem, but otherwise it is actually not so difficult: you really only create it locally and then copy it. And this works right now for you because, and this was why the installation procedure was a little bit weird, there are actually two credentials involved: this is the actual web address, and this, under the hood, is the web address with the credentials.

Hello everybody, and thanks for staying until the very end. My name is Ignacio Arganda-Carreras, and in this very last class I'm going to tell you a little bit about how to use machine and deep learning methods for segmentation, which is a crucial task in bio-image analysis. This is more or less the outline I will follow: we have this first 45 minutes of lecture, as in the previous session, where I will go through the basics of segmentation, and then, now that you are almost experts in deep learning, the most popular approaches to do segmentation using deep learning, either what we call top-down methods or bottom-up methods. We will go through the details of probably the most important architecture for this, the U-Net, in 2D and 3D, and then we will see some tricks for the post-processing to obtain what we really want. Then we will look at the state-of-the-art architectures and methods that have been proposed lately, such as StarDist and Cellpose, in a little bit of detail, and then we will propose a more generic pipeline based on a multitask network and some post-processing, which I will explain, because we are going to be able to do this in the tutorial for, well, my dataset, but you could also do it for your own dataset. Again, the last 15 minutes will be for questions and answers, and finally we will have 30 minutes in Google Colab to play with the last thing I mentioned.

So let's jump into it: what is image segmentation? Probably many of you know; otherwise, let me introduce it to you. It is the process of partitioning an image into multiple segments, and that is why we call it segmentation. Image segmentation is typically related to finding either what we call the objects of interest in our images, or at least their boundaries.
Based on this definition, if we have an image like this, then this would be a proper segmentation: we have separated what we could consider the objects of interest in this image, which for me are the nuclei and the cell bodies. But more precisely, we also call image segmentation the process of assigning one label to every pixel in the image, such that pixels with the same label belong to the same category, or share some characteristics in the image. For example, in this case we have pixels in blue, the blue label, meaning that all those pixels belong to the category nuclei; the green label is for cytoplasm, or cell bodies; and the black label could be for membranes and background. Because we are assigning categories, and therefore a meaning, to every pixel, this is what you usually find in the literature called semantic segmentation.

But there is also another type of segmentation that fits this definition, which is this one: we can also apply one label to every instance of every category that we find in the image, meaning that every instance of a nucleus gets a different number, a different identifier, a different colour, and the same for every cytoplasm. So here every different object has a different identifier; we represent it with colours, but at the end of the day this is what we call a label image, so every pixel of the same object has the same number, and we usually assign a colour map to visualise it better. Because here we are assigning labels to instances, this is called instance segmentation, as opposed to semantic segmentation.

So how do we measure, how do we evaluate, how well a segmentation method works? We have some very common metrics, for example what we call the intersection over union, which can be used either for detection or for segmentation. In detection, instead of having the mask of our object we have only what we call a bounding box, the four points that define a box around our object, but the idea is the same: we take our predicted bounding box and we want to measure how well it matches the ground-truth bounding box, or the ground-truth mask in segmentation. What we do is take the area of the intersection and divide it by the area of the union of both regions or masks, so we get a value between zero and one. This is also called the Jaccard index; it is an index where zero means no overlap at all, so our result is a disaster, we haven't detected or segmented anything that we wanted, and one means they match perfectly, so the segmentation or the detection is perfect. In general people ask: when do I know that my results are good? Results of, let's say, 0.8-something or 0.9 are already very good; with that you can usually work and do proper analysis.

But this is for one object and one class. What happens if we have different objects and different classes? How do we know that this prediction was aiming at this object here and not at that other one over there? For that, what we usually do is say: I consider that this box matches this ground-truth box if they overlap by a minimum value, and that minimum value is defined by the intersection over union.
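Since this quantity will come up again in the tutorial, here is a minimal NumPy sketch of the IoU between two binary masks; the convention for two empty masks is my own choice, not something stated in the lecture:

import numpy as np

def intersection_over_union(pred, truth):
    """IoU / Jaccard index between two binary masks (0 = no overlap, 1 = perfect match)."""
    pred, truth = np.asarray(pred, bool), np.asarray(truth, bool)
    union = np.logical_or(pred, truth).sum()
    if union == 0:
        return 1.0                     # both masks empty: treat as a perfect match
    return np.logical_and(pred, truth).sum() / union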
So when you read results for detection or for multiple classes, you will see something like: it has this precision at an IoU of 0.5, meaning that to consider a prediction a positive case it has to overlap the ground truth by at least 50 percent; a threshold of 0.75 requires much more overlap. Once we do this, we can count true and false positives and extract the typical metrics, and a very common one is the average precision; if we have different classes, we take the mean over all those classes, which is the mean average precision. So with the intersection over union we can work quite nicely.

What kind of approaches do we have in deep learning to do this? If we go to the field, or the domain, of natural images, and by natural images I mean images that can contain anything, like this one, our daily things, dogs and houses and trees and so on, it is very common to have what we call a top-down approach: we have a monstrous network that breaks the big problem of multiple-object segmentation into smaller ones, so inside the network there are other networks. For example, let me go very quickly through this one, which is called Mask R-CNN, the mask region-based convolutional network, and you will see why I say it is a monster. It takes the input image and passes it through a CNN, a convolutional network called the backbone network, and these are the usual networks that are popular at the time, maybe a ResNet-50 or anything you can imagine, usually pre-trained on classifying images like this, so it already contains a lot of information about natural images. Out of the output of that network, at different layers, we get some feature maps, which go through another network that has a single target: producing proposals of parts of the image where it is more likely to find objects. So it provides, on the one hand, bounding boxes of possible objects and, on the other hand, a score, saying it is 90 percent likely to have an object here, or 0.5, or 0.4. All those proposals are sorted, and the most likely ones go through another network that does classification, so every box is classified, saying it contains an object of this type, and the bounding box is refined at the same time. In the last version of this architecture they also included the mask, so they have yet another small network on top to do the segmentation mask of every object. So it does, at the same time, region proposals, meaning it provides the bounding-box locations, it classifies what is inside each box, in this case a dog, a dog, a cat, with the background already set apart by the region proposal step, and it does the mask prediction as well. This is usually a very large and heavy network.

So can we use this in bio-image analysis? Well, we can, why not: we can take an image like this and try to process everything that is in the field of view of the network. In this type of image, an electron microscopy image where the target is to segment every single mitochondrion, which is what we will do in the tutorial, Mask R-CNN, if we train it enough, and that is also not that easy, provides the bounding boxes of every single object and the individual masks. What is the problem here? The problem is that the images we process in bio-image analysis usually have a much larger field of view than such a network can take in.
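As a concrete reference point for the top-down approach just described, here is how one would run a pre-trained Mask R-CNN from torchvision; this is an illustration only, not the lecture's own code or weights, the input is a random placeholder image, and the weights argument assumes a recent torchvision version (older ones use pretrained=True):

import torch
import torchvision

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 512, 512)            # one RGB image, values in [0, 1]
with torch.no_grad():
    prediction = model([image])[0]         # one dict per input image

print(prediction["boxes"].shape)           # refined bounding boxes from the region proposals
print(prediction["scores"][:5])            # confidence score per detection
print(prediction["masks"].shape)           # one soft mask per detected instance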
Here is an example. This is actually a representation of this image: this is a small, old dataset from 2011, and this is a dataset that we released a couple of years ago where we have all the mitochondria segmented. We want to segment this; how do we do it with a Mask R-CNN? There is a big problem: it does not fit into the so-called backbone network. One intuitive idea would be to divide and conquer: we take the large dataset and divide it into patches, patches that do fit into that backbone network, and then, if we have trained our model for that, we can compute and obtain all these nice results. But we still have some problems, namely that we may have instances that go through different patches; you can see that some of them are cut here. Moreover, this happens not only across a couple of adjacent patches: in the dataset that we released there are plenty of small mitochondria, but there is also a considerable number of very large mitochondria that go through the entire dataset and that we need to follow. So in that case there is the problem of matching up instances that we have correctly identified in different patches, and this is not trivial at all.

Because these are additional problems in bio-image analysis, we usually try to do something different, which is the opposite approach, the bottom-up approach, where we focus on solving the small problems first and then integrate them into a complete solution by building a so-called workflow, or pipeline. In this case we have our full image, and we could say that the first small problem we want to solve is to find the object probabilities of our objects of interest, in this case our mitochondria, and once we have this, hopefully perfectly done, then out of this image we would like to extract the individual instances. So the purpose of this approach is to design a whole pipeline, or workflow, to first identify all the pixels, or voxels, since we can do this in 3D as well, of every object of interest, and then somehow, we don't know yet how, extract the instances. Of course we are talking about deep learning, so the very first step is nowadays usually performed by building a deep architecture; this task is actually a semantic segmentation task, so we can look for a semantic segmentation network, and for the last small problem we may think of different solutions.

So what kind of networks can we use for this? If we go to the literature, we will see that, especially after the year 2015, everything gets dominated by a type of architecture called the U-Net, or U-Net-like architectures; you will see that they are quite popular. This is an architecture that was actually the first one published as a convolutional network specifically for biomedical image segmentation, but, as you will see, it can be applied to many generic problems in computer vision, and a year later its 3D version was published, which is very similar; this is the foundation of many solutions. So let me spend a little bit of time explaining how it works. This is a figure from the original paper; it is called U-Net because it has this characteristic U shape, and it has mainly two paths. First there is what we call the encoder, or the contracting path, where the input is our image and it goes through some convolutional layers, so here we calculate some feature maps; and then we would also like to extract features at a different scale, so we downsample the feature maps,
usually with max pooling or an average-type operation that allows us to reduce the dimensionality by two, and then we do more convolutions to extract more features at that level. Usually we double the number of filters each time we halve the size of the feature maps, so we do convolutions, downsampling, convolutions, downsampling, until we have a very reduced, summarised representation of the feature maps in this region here, which we call the bottleneck. And then we go up: we do upsampling and convolutions, upsampling and convolutions, until we recover the original size of the input image. In the very original paper the output size wasn't exactly the same, but now it is very common to use convolutions that maintain the size, so the input is an image of a specific size and the output is one or more images of exactly the same size. Another characteristic of this network are these connections here that go directly from the feature maps of the encoder to the feature maps of the decoder; they provide a way of sharing information between both paths, and they also have a lot of positive effects on training: they help avoid some problems and maintain the stability of the network.

As I said, this was published for segmentation, but, as you can imagine, if the input is an image and the output is an image of the same size, or several images of the same size, we can apply it not only to segmentation but to any other type of operation that we call image-to-image translation. For example, we may have a noisy image here and a denoised image there, or a grayscale image here and a colour image there; we can do plenty of operations like this in computer vision thanks to this architecture. A year later the 3D version was presented; it is, as you will see, the very same idea, with the convolutions, the downsampling, the upsampling and the skip connections, but everything is in 3D. The number of filters shown here is actually a parameter that you have to play with; in 3D you usually use fewer filters because it is more expensive. Also, when you do the downsampling or the upsampling, if the data is isotropic you do it in 3D, but if you have a different resolution in z you may not want to reduce by the same number of pixels, or voxels, in that direction, so this is something you can also play with.
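To make the encoder / bottleneck / decoder / skip-connection structure concrete, here is a deliberately tiny two-level U-Net sketch in PyTorch; it is a toy version for illustration, not the original architecture or the exact model used in the tutorial:

import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class TinyUNet(nn.Module):
    def __init__(self, in_ch=1, out_ch=1, base=16):
        super().__init__()
        self.enc1 = conv_block(in_ch, base)            # encoder, level 1
        self.enc2 = conv_block(base, base * 2)         # encoder, level 2
        self.pool = nn.MaxPool2d(2)                    # downsample by two
        self.bottleneck = conv_block(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)   # upsample by two
        self.dec2 = conv_block(base * 4, base * 2)     # input doubled by the concatenated skip
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, out_ch, 1)         # e.g. one channel of object probabilities

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))    # skip connection from level 2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))   # skip connection from level 1
        return torch.sigmoid(self.head(d1))

out = TinyUNet()(torch.rand(1, 1, 64, 64))   # output has the same spatial size as the input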
So we now have a solution that has actually proved quite efficient for the very first task, the object probability calculation in 2D or 3D. How do we do the instance extraction out of this, how do we go from the object probabilities to something like this label image? A very simple idea is to apply a threshold: if we ideally get objects with very nice, high probabilities, then if we apply a threshold of something like 0.5 we identify the objects nicely, and then we can somehow find the connected regions on these white objects. How do we do this? As some of you probably already know, there is a classic method for that, called connected components labeling, or just connected components, which basically transforms a binary image into a label image, an image where every instance has a single identifier. Actually, if we have an image like this we have to invert it first, because it usually works on white objects and a black background. There are plenty of implementations to do this; the most classic one is based on flood filling, like when you are using a paint program: you click once and it fills the entire region with a single colour. This is quite efficient, and there are implementations in Python, in ImageJ, etc. The only thing that you usually have to select is the connectivity: there is one parameter to decide whether the pixels on the diagonals of a pixel's neighbourhood are considered neighbours or not. In 2D this would be 4-connectivity and in 3D 6-connectivity, versus including the diagonals. What does it mean in practice? It means that if we use 4 or 6, the objects usually come out more rounded than if we use the diagonal connectivity, so if you are working with images like this, where you have to segment blob-like structures, cells, mitochondria, etc., you want to use the smaller connectivity values.

But of course, as always, there are some problems. The two typical problems when we do segmentation are splits and mergers. Let's imagine I have these object probabilities, which are more or less fine, let's assume they are provided by our model, and then we binarise them. As we see, in the ground truth here we have a single object, and we have split it into three different regions; this is a typical error. Here, maybe we used a very low threshold, so these two objects get merged together and become a single object. So the problem is that the results are too sensitive to the binarisation process, and also to the probability calculation that we have. Can we do better? We can use something a little bit more complicated than a simple binarisation, which is the good old classic watershed method. It is still a very powerful solution; maybe some of you already know it, the watershed algorithm is a classic by now, it is from the 80s, but it is still a very nice solution. It takes a grayscale image and makes a topographic analogy: it considers the gray values as if they were altitudes on a topographic surface, so high pixel values are mountains and low pixel values are our valleys. The algorithm identifies the local minima, and then for every local minimum it floods a basin: it fills the valleys from the bottom towards the top, with a different label for every different minimum, and it goes iteratively from the bottom to the top, or up to a specific top value that we decide; and when two water sources touch each other, we set the border, what the watershed paper calls a dam. If we have an image like this, for example with a nice fluorescence marker for the membranes, then represented in 3D it actually looks like a mountain range with valleys and peaks, and it is very intuitive to see how water will rise from each of those basins; when they touch each other, they will do it at the brightest points of our membranes, so in principle this should work very nicely.

Of course nothing is perfect, because even if we have images like this, which look nicely dark on the inside, they may actually contain many local minima; it is not just zero in there, it could be 1, 7, 8, 6, and so on, and from every local minimum we are going to raise the water with a different label, so we may end up with the classic problem of over-segmentation. There are different solutions, but one is to impose the minima, either manually, semi-automatically or completely automatically. Why am I saying this? Because we can also train our models to produce an output which is the marker, one single marker per object, and then we can help the post-processing by having these outputs in our models.
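Going back to the simple threshold-plus-connected-components route described a moment ago, here is a minimal scikit-image sketch; the probability map and the 0.5 threshold are placeholders:

import numpy as np
from skimage.measure import label

probabilities = np.random.rand(256, 256)      # stand-in for the network's object probabilities
binary = probabilities > 0.5                  # binarisation threshold, illustrative only

# connectivity=1 is 4-connectivity in 2D (6 in 3D); connectivity=2 also counts the diagonals
instances = label(binary, connectivity=1)     # label image: one integer id per connected object
print(instances.max(), "objects found")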
There is of course also a bunch of solutions based on classic methods, like applying some filtering or using some tolerance in the location of the minima, that may allow us to do this in an automatic way without needing a deep learning model, and you should keep that in mind, because it is also a nice solution. You may also think: okay, but what happens if I don't have a very nice fluorescence marker that reflects the borders of my cells so nicely, but I still have some contrasted objects? Well, in classic computer vision what we used to do is apply a gradient filter, so that in the regions of maximum change of contrast we get peaks, and then we have essentially the same idea as with the fluorescence membrane marker: we have the borders nicely highlighted with respect to the rest of the flat or homogeneous regions, and then we can run watershed on top. And even when it is not as obvious as applying a gradient filter, as we will see, we can also make our model learn a representation like that, with the idea of facilitating the post-processing.

Finally, there is an idea that I need to explain first, both because it is related to what we are going to do later in the tutorial, or in the pipeline that I am going to propose to you, and because it is something very popular in ImageJ or Fiji. If you have ever used it, you probably know that there is a command called Watershed, but it does not do what I just explained. What the watershed command in ImageJ does is, out of a binary image like this, find separations between objects that are touching. How does it do that, how is this possible? The thing is, it does not only do watershed, it does more things. Imagine we have this image and we have binarised it using, for example, the threshold value of an automatic thresholding algorithm, and we get this nice initial segmentation, but our objects are obviously merged together. We cannot run watershed on top of it, because we do not have nice valleys and mountains, we have just two values, black and white. So what we can do is simulate those valleys, and we do this with an operation called the distance transform, or distance map, which assigns to every pixel in the object the distance to the closest border. So basically, for every object we get peaks at the centre, especially at the rounded centre of the objects. Again, we do not want peaks at the centre, we want valleys, so we just take the inverse, the negative distance, as we do here, and then we can run watershed on top of it, and we obtain these watershed lines that we can then impose on the original image, and separate at least those objects that are more or less rounded. This is something that you can sometimes find called distance-transform watershed, and again I wanted you to keep it in mind, to see that having a distance representation from our objects to the borders can also help the post-processing.

So we have a plan: we have one way of solving the first small problem, with a U-Net-like model in 2D or 3D, depending on our data, we get object probabilities, and then, hopefully they are good enough, we can do the post-processing with either connected components or the watershed algorithm.
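Here is a minimal sketch of that distance-transform watershed using SciPy and scikit-image; the two touching squares are a toy stand-in for merged objects, and the peak-detection settings are illustrative:

import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

binary = np.zeros((128, 128), dtype=bool)
binary[30:70, 30:70] = True
binary[60:100, 60:100] = True                     # two touching square "objects"

distance = ndi.distance_transform_edt(binary)     # peaks at the object centres
peaks = peak_local_max(distance, labels=binary, min_distance=10)
marker_mask = np.zeros_like(binary)
marker_mask[tuple(peaks.T)] = True
markers, _ = ndi.label(marker_mask)               # one seed per detected centre

# flood the *negative* distance (valleys at the centres), restricted to the binary mask
labels = watershed(-distance, markers=markers, mask=binary)
print(np.unique(labels))                          # background 0 plus one id per separated object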
If we want to apply this to a very large dataset, what do we have to do? Well, we make patches of our dataset, we apply the segmentation network to the patches, and then, in a second step, we undo the patches and put them back together; the things that overlap, imagine a mitochondrion here, will hopefully continue over there, and then we can run our instance extraction method on top of this. Hopefully the extraction method can fit the whole dataset into memory; otherwise we may need an implementation that is also able to run in parallel. What is the problem of this pipeline? The problem, maybe a bottleneck that I mentioned before, is that we may be too sensitive to the object predictions: if we do not have nice object predictions here, we may end up with mergers and splits, as I mentioned a few slides ago.

So can we help the post-processing somehow? Yes, we can. There was actually a paper a few years back where they said: if we get splits or artifacts, that is okay, because they are more or less easy to correct afterwards, in part even automatically; the artifacts we can get rid of with some morphological operations, and splits are easy for a human, click, click, click, and you put them back together. But if we get mergers, it is really a pain, especially in 3D: if we merge two objects together, we then have to start selecting how to cut them apart in 3D. So this is something we should deal with beforehand, and the idea they propose is, as I said before, to create a deep model that provides not only object probabilities but also border probabilities, so it is a multitask network, or a multitask loss, that does both things, and also with some weights: in this paper they say, let's put some extra weight on the regions that are more important, so that if two objects are touching each other very closely, we insist that the model predicts a border there. In our pipeline this is easy to do, we just need to incorporate a new output here, so the model does object probability prediction and boundary probability prediction, and then we continue the same way.

Can we do even better than this? This is actually one of the ideas behind the StarDist project, which came up a year later, where they were also trying to identify the most common sources of mistakes, especially when doing cell segmentation and nucleus segmentation, which are very common problems in bio-image analysis. They say: usually we have problems with noisy images, but also with crowded objects; when they are touching each other and they are a bit noisy, the borders are not that clear, and if we do something like a regular U-Net for semantic segmentation we end up with tons of mergers, while if we try to do this with a bounding-box-based method like Mask R-CNN, the bounding boxes overlap so much that they are considered the same object anyway. So the solution they came up with is what they call star-convex shape priors, or StarDist, and the idea is again to have a network that provides not only object probabilities but also an idea of the boundary locations, using radial distances from the centre of the objects to the borders of those objects. Together with all these rays, they produce what they call a complete set of candidate polygons. So basically we have a network, and it could be any network with the shape of the U-Net, or a classic network, that produces several outputs: we will see object probabilities and a representation of those distances, and then, in the post-processing, out of those rays we create the polygons and keep the most probable ones. Let's see an example. We get an input image like this; again it is very noisy and plenty of cells are touching each other, so it is quite a difficult problem.
Then this model is going to provide us with the object probabilities of every single cell, or nucleus, or whatever it is, and then we have more output channels, one per distance to the border, that is, one per ray. The number of rays is a parameter, and the more rays we use, the better defined the polygon is, but it requires more parameters and more training. So we get different directions that we can combine together and then produce a set of polygon proposals; in the end we usually get a whole bunch of polygon proposals per object, so we have to do a small post-processing step called non-maximum suppression to keep just the most likely one for every single object, based on the overlap as well, and then they produce nicely done instance segmentation results. A couple of years later they came up with a version in 3D; the idea is exactly the same, but the star-convex priors are built in 3D and the object probabilities are also produced in 3D, so the network is a 3D network, but it works exactly the same. It is very nice; it is actually competitive with things like Mask R-CNN but with far fewer parameters; remember, I told you that Mask R-CNN is quite a monstrous network, well, by these standards. But this approach has a small problem, which is that these rays are only well defined for pixels that are contained within the object, meaning that it only works for convex objects; only convex objects can be segmented. That is a problem if we do not work with things like cells or mitochondria, which are usually convex, potato-like, blob-like structures; otherwise this is a great solution.

Can we do better? Well, this is the idea from Cellpose, the other state-of-the-art algorithm, which is actually called a generalist cellular segmentation algorithm. They say: look, not all cells look like blobs, so we are going to propose another idea. Instead of these rays to simulate the distances, they simulate what they call a smooth topological map: from the manually annotated borders of the cells they simulate a diffusion from the centre, which can be represented with spatial gradients in x and y; actually, this can later be represented as a normalised direction from 0 to 360 degrees, in this very colourful way. And this works not only for blob-like objects but for all kinds of cell shapes, or whatever object you want to segment; it is more adaptable to all kinds of shapes. The whole idea is in the end quite similar: they have a neural network, in this case in 2D, and the outputs of the network are these flows, these gradients in x and y, and also the cell probabilities; well, this is Cellpose, so they were focused on segmenting cells. Combining these three outputs they can get this flow field, and once they have it, extracting the instances is almost automatic: for every pixel they check to which pixel it converges, and if pixels converge to the same pixel, they are assigned the same label, so they get the individual masks like this.
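Since Cellpose is one of the tools you can try yourself, here is a rough sketch of calling its Python API on a 2D image; the exact class and argument names vary between Cellpose versions, so treat this as an assumption, and the image is just a random placeholder:

import numpy as np
from cellpose import models

img = np.random.rand(256, 256)                   # placeholder grayscale image
model = models.Cellpose(model_type="cyto")       # generalist cytoplasm model
masks, flows, styles, diams = model.eval(img, diameter=None, channels=[0, 0])

print(masks.max(), "cells found")                # masks is a label image, one id per cell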
They propose an architecture that is again a U-Net-like architecture; it has residual blocks, at every level they double the number of convolutional layers, and it also has some peculiarities. For example, they are trying to be generalist, so they argue that if the images have the same style, the same appearance, they should follow a similar path; to enforce that, in the decoder they do some global average pooling, where they claim to capture the style of the feature maps, and they pass that through the rest of the network. In any case, the main idea is to use this multiple representation in the output and then perform this post-processing. And how do we enforce generalisation? Well, there is no easy way to do this, so their approach was to use a very generic, let's say generalist, dataset, a dataset that included many types of image modalities and different types of cells, from fluorescence, bright field, many modalities, and also non-microscopy images of things like this, with repeated objects that are touching each other: fruit, jellyfish, rocks, cells, etc. And it works very nicely, as you will see. For 3D, they propose a solution which is applying the 2D network on the different 3D planes: you train only on the annotations in x and y, and then you take a block, let's say for example the mitochondria one, and you can apply it to the xy, zx and zy planes, so you obtain the probabilities and the flows in those directions; you end up with six different gradients that you can combine together into a 3D flow image, and then you can post-process the same way.

So, to finish, we can learn from these ideas, and in our generalist framework we can say: why don't we do a similar thing with a generic 2D/3D pipeline where we add one more output to the network, which is the distance map? Because it seems that the distance map is important, and it is actually generic, it does not have to be a flow or a ray or anything like that. So we have exactly the same model but with three outputs, so we have a multitask network, and then we can combine them to do the final instance segmentation. And this is the key: how do we combine these three images into this one? The idea is very simple, by using watershed, because there is a version of watershed that we can use with a mask and, as we said before, with imposed markers. Let's say these are my predictions; for simplicity I put them in 2D: we have the object probabilities, the borders and the distance map. We can threshold the object probabilities to get a mask, the maximum extent to which we can raise the waters, so we can use a medium-high object probability to define how far the flooding is allowed to go. Then we need markers. For the markers we can combine these two: we use a very high probability threshold here, which essentially just gives us the centres of the objects, as I said before, and then, to make sure we separate the ones that are touching each other, we can threshold the boundaries and use them to prevent those markers from touching each other. So we get one marker per cell, and then we run the watershed on top of the distance: as we said, with the distance we have mountains on top of the objects and we want valleys, so we take the negative distance, and we can run the watershed on this image, with these markers, and with the mask as the limit of the watershed algorithm, and we get a nice instance segmentation.
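Tying the three outputs together, here is a minimal sketch of that marker-controlled watershed combination, assuming the predictions are already available as plain 2D NumPy arrays; all thresholds, shapes and the dummy inputs are placeholders, not the values used in the tutorial:

import numpy as np
from scipy import ndimage as ndi
from skimage.segmentation import watershed

def instances_from_outputs(obj_prob, bound_prob, distance,
                           mask_thr=0.5, seed_thr=0.9, bound_thr=0.5):
    mask = obj_prob > mask_thr                                  # how far the water may rise
    seeds = (obj_prob > seed_thr) & ~(bound_prob > bound_thr)   # object centres, cut by boundaries
    markers, _ = ndi.label(seeds)                               # one marker per (hopefully) object
    # flood the negative distance: valleys at the object centres, limited to the mask
    return watershed(-distance, markers=markers, mask=mask)

shape = (256, 256)
labels = instances_from_outputs(np.random.rand(*shape),    # dummy object probabilities
                                np.random.rand(*shape),    # dummy boundary probabilities
                                np.random.rand(*shape))    # dummy distance map
print(labels.max(), "instances")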
So, just to finish with this part, I will leave you here a bunch of open-source frameworks where you can do this: of course the ones from StarDist and Cellpose, which are very powerful and popular, and with Cellpose you can also play a little bit online with their model, which is very nice. Then I want to mention these last two, because one is from us and one is from some collaborators. We work on electron microscopy; we have one based on PyTorch, for connectomics, which is called like this, and with my student Dani Franco we are developing a set of bio-image analysis pipelines in TensorFlow, where we can do 2D and 3D semantic segmentation, instance segmentation, detection and classification. And if you need data, or more or less big data, to play with, I want to emphasise that we released some large datasets with some collaborators, not only on EM: this one is for mitochondria segmentation, we also have nucleus segmentation in EM and micro-CT, we have axon segmentation, and we have more and more datasets that you can play with, and also compare yourself against, because there are open challenges.