Just one more word about the NEUBIAS Academy. It started last year during the first lockdown and we have by now around 25 webinars on our YouTube channel. We have an impressive number of registrations and an even more impressive number of views on YouTube, and we are really happy to see that it's having this success. And today we are on the fifth webinar of this big data series. We saw how to visualize images, how to register images and do some quantitative analysis, and today we are going to dig a bit more into parallelization and also into how to handle very, very large images. So our speakers today are Pavel Tomancak from Dresden and Stephan Saalfeld from Janelia. Pavel will speak for around half an hour and then Stephan will take over for the last hour. So welcome to everybody, welcome Pavel and Stephan, and Pavel, the floor is yours. I will stop sharing and let you share your screen. Can you hear me? Can you hear me? Yes. Okay, great. So welcome everybody. I will hopefully ease you into the complicated topic of exploring how to parallelize image analysis using Fiji and, in particular in this case, OpenMPI. My presentation will be structured as follows. First, we will look into why we actually need to use HPC to do image analysis. Then we will discuss a little bit the efforts which we have made in the past to apply parallel processing to a very specific image analysis task. Then we will go a little bit more general and into detail about the principles of how we can really use ImageJ and Fiji to parallelize relatively simple but very diverse image processing tasks over many small images. And then I will give you a glimpse of how one can use the more advanced ImageJ2 infrastructure to also parallelize processing on relatively large images, but maybe not so many of them. At the end, I will show you that it is possible to put it all together and in principle also work with many large images, but going beyond the realm of this kind of webinar to really, really large images, that is what's going to be discussed by Stephan Saalfeld after my presentation. So, just so that we are on the same page, one example of big image data in biology is coming from light sheet microscopy. What you are watching here is the development of a Drosophila embryo, which has been imaged with a light sheet microscope from five angles. We collected altogether 715 time points. We tortured this poor embryo for about 24 hours, and by about this time morphogenesis is almost complete, the muscle systems are connected to the nervous system and the embryo will start twitching, but we are still imaging it because we want to show that after 20 hours this embryo will actually hatch into a larva. So from the point of view of image analysis this generates lots of data, 4.1 terabytes of raw image data, which is something that is hard to process by any means. Let's now wait for the embryo to hatch. And that's going to happen about now. Okay, voilà, it happened. All right, so in order to get such a video what we need to do is image processing on this data. We acquire it using either a commercial or an open source microscope, and we acquire in this case so-called multi-view data, the same specimen from different angles. So we have to register the data, which is the first step here. Then we have to combine the image stacks of the different angles together into one output, an almost isotropic stack, which might be deconvolved. So that's a very computationally demanding step. Then we have big data which we need to visualize.
We are using here as an example the BigDataViewer from Fiji to do that, and we might want to do some very diverse data analysis on this kind of data. So this technology really highlights the big data challenge in biology. I mean, until recently we were using mostly confocal microscopy. This is a slide from Jan Huisken, and if we would image an embryo for 24 hours we would generate 85 gigabytes of image data, but the embryo would probably be dead long before that. If we use a classical, let's say now commercially available SPIM and we let it run for 24 hours, we generate five terabytes, which is difficult, right? If you use a state-of-the-art machine still in development on an optical table, it is not unusual that you generate 90 terabytes of image data in 24 hours. So the big data problem in microscopy is really big. Another way to cast this problem comes from the IT department of our institute, where they say, well, I mean, maybe routinely you plan to acquire 13 terabytes a day, but you are saying that you could also acquire 128 terabytes a day, right? So just please consider, this is what our IT people said, that the combined production of CERN is 82 terabytes, right? So if you want to do something like that you have to invest some effort and some money into the problem. So what one basically does is to make this pipe of how you deal with such data in various ways into a much thicker one, right? I mean, instead of a single drive use a redundant array, get a better internet connection, and instead of a single machine use something called an HPC, a high-performance computing resource, which allows you to parallelize the processing. What we are in particular facing in the case of the light sheet microscopy data processing is that even though our individual time points might not be so large, maybe five gigabytes maximum, and it takes us minutes and hours to register and deconvolve them, if we acquire thousands of time points these minutes very soon become hours and the hours become days. And so having the solutions in open source software is a big advantage, because with open source software we can easily parallelize such a task per time point and spread it over the cluster, and we don't have to worry about anything in terms of licenses and paying for running the same software on many different nodes of a high-performance computing resource. That is actually one of the most convincing arguments for using open source software platforms for doing image analysis on big data. So we first implemented a very specific solution for this particular task, right? It is schematized here, it is a very complicated scheme, but what one should really point out here is that in such a pipeline some parts are very easily, so-called trivially, parallelizable: we can change the format of the data for every file separately without affecting the other files. We can also do registration for all the images which represent one time point, and we can spread that, each to one node of the cluster. However, at some point we also have to consolidate the data: for example, in the SPIM registration pipeline we are doing a so-called time lapse stabilization, which means we have to collect everything together on the head node and do some processing which requires interaction between the time points. And so one needs to really develop a pipeline which is able to do all of that. So we actually did it, right?
And now here I will just very briefly jump to the wiki page which describes how one can do it, right? And I gave it the title "the command line rules", right? So actually, if you are a geek, this is fun, right? You play a cluster like a piano, right? You have all this huge amount of stuff which you have to read through, many commands, many little scripts, you are on the command line, you are actually sending jobs to the cluster, you are editing text files, it's really, really fun, right? But it is complex, right? I mean, I'm still scrolling, essentially, hopefully you are seeing this, right? It's very well documented but it is not easy to use, that's very, very clear. Okay, so we addressed this by trying to make this kind of solution easier to use. And this is something which we do in cooperation with experts, with people who are running a supercomputing cluster in Ostrava, in the Czech Republic, at the academic project which is called IT4Innovations. They have several large clusters there, and they have these wonderful names like Anselm, Salomon, Barbora and the most recent one which is called Karolina, absolutely beautiful names. Okay, so what we do here is that, on our local computer inside Fiji, we upload the data to this remote cluster which is in Ostrava, it submits the job, it does the processing, and we can influence how the processing is going to be done through a single, not simple but single, configuration file which lives on our file system locally. We can change it and we can resubmit the job and try to run it again on the cluster, right? All of that is happening from within Fiji, and we can examine the results, whether the cluster did what we wanted, whether it registered the data, also from our local Fiji, by looking at the data which is still in Ostrava through the BigDataViewer. At the end of the process, we download it, right? So this is simpler than what I showed you before; in fact, it's so much simpler that it's a graphical user interface. There is a so-called HPC Workflow Manager plugin which you find in the Fiji menu if you install the appropriate update site, and you really can submit the job to the Ostrava cluster from within such a nice graphical user interface, and you can even monitor how your job is proceeding, because as a biologist I find it always super annoying if a computer scientist says, okay, so now I submitted it to the cluster and then I'll wait a day, right? And then it will be done, right? But you don't even know that the cluster is doing something, right? So it's very important for us biologists who are using this to have some visual feedback that processing is happening, besides the fact that it's actually really kind of nice to watch how a computer somewhere in another country is churning on your data. So we developed all of this, the ability to submit such a very specialized job to the cluster and to monitor how it progresses, and it is published, and if you want to look at how it works, we have pretty decent documentation for all of this. But it has many disadvantages. First of all, as of now, it only works on the cluster in Ostrava, which is actually meant to provide service to the scientific community, but believe me, it is not exactly straightforward to gain access to a supercomputer, because there are massive security issues and there are many hoops through which you have to jump. Okay, maybe you have a cluster at your local place, and in fact you can install the system which we developed also somewhere else.
We did that on our cluster in Dresden, but this is not simple. And the other big disadvantage of this is that it is a very specific solution for a very specific task, right? I mean, this parallelizes the SPIM pipeline and it doesn't parallelize anything else, right? So for the rest of my talk, I will try to show you how we are, kind of, on the way to developing solutions which address these kinds of problems. And the solution which we came up with is to build a bridge, a bridge between Fiji, open source software for biological image analysis, and something which is called OpenMPI. OpenMPI is a bona fide standard for parallelization of software, of code, on HPC resources. It is a program which you will find on any HPC cluster no matter how obscure it is. It is really a standard, right? And so we built this bridge in two different ways. We built it first as a bridge which is built around the ImageJ macro. This is something which is really targeting biology users who know how to play with macros, and we simply provide them with a relatively simple way, with a few new syntactic elements, to parallelize an existing or newly recorded macro. And here you, the user, are deciding how the task is going to be parallelized. You are in the driver's seat, you decide which node does which work. And this is really suitable if you have a directory of 25,000 images, each one of them not so big, but if you would do it serially, it would take forever. You know, here you can really batch it into groups of, let's say, a thousand and run it on the cluster. Now the second bridge is more complex. It is building on the more advanced infrastructure of ImageJ2 and it is a bridge between OpenMPI, which is the cluster software, and the ImageJ Ops. This is something which is targeting more experienced developers who can actually deal with the Ops. The beauty of this is that, as you will see, it creates no new syntax. So if you have an Op and you use our new version of that Op and you have a connection to an OpenMPI-enabled cluster, then without changing a single line of code, your code will run in parallel on the cluster. So this is fantastic for automation, and it is targeting the second big task, which is if you want to process a very large image which is simply, you know, not so easily chopped into pieces. So now we are basically giving the task of spreading the processing of such an image to OpenMPI, which is very clever in the way it distributes it over the available resources. So here you as a user are not in the driver's seat at all, you are relying on the software to do the right thing. Okay, so here it is shown, let's say, one more time graphically: we have many small images on our local computer. We have access to a supercomputer, we upload them and we say process those two images on node one and those images on node two and those images on node three, and download the results, very simple. The second way is a script, which is going to be a Jython script, able to include the ImageJ Ops commands. And once again you have one large image which you upload to the cluster, and the program itself splits the task into appropriate pieces so that the three compute nodes which are available are maximally used, and you download the data, right? Okay, so that's kind of a high-level overview. Now I will show you how this simple parallelization actually works, because it is highly unintuitive, but once you get it, I guarantee you that you can do it.
So how does OpenMPI parallelization actually work? This is a cluster, which is a big computer consisting of many smaller computers which are called nodes, and usually you don't get access to all of it, right? So for example here, wherever there is the Fiji logo, these are the nodes which are available for our specific task. Now, in order to understand OpenMPI you have to understand that each one of these compute nodes is going to run the same macro, the same code, right? And this macro will have access to two numbers which are originating not from the parameters which you put in, but which originate from OpenMPI, from the cluster itself. One number is going to be a constant which tells you how many of those nodes are available to you, how many you can use, right? And the other one, which is the rank, is a number which is going to be different on every one of the computers where the macro runs. So if the macro runs on this computer here and it asks for its rank, via parGetRank, it will get a number which is zero, but if the very same macro runs on another computer it will get another number: one on this one, two on this one, and three on this one. And they all will know that there are 11 of them in total. This is super simple, but this is all you need to know in order to parallelize ImageJ macros. So let's see how it's done, right? So you record, or you write, or you use an existing macro as you have it. And now I will show you, I mean, I will not really show you but I will give you an impression of how you can insert very little syntax into your existing macro to make it parallel. And then I will show you how you can do a slightly more complicated thing. Okay, so, sorry, you insert the syntax to make it parallel. And then, if you are running this on a computer which is configured to connect to a cluster which runs OpenMPI, this macro will be distributed across the cluster. What is a little bit more complicated, and which I will try to show you as well, is that you can also insert a little bit of syntax to provide monitoring that something is happening, that your processes are actually progressing, and you can really decide at what pace you want to have this kind of monitoring, or you just don't do it at all and trust that the cluster is going to do what you ask it to, which is kind of a tall order sometimes. Okay, so now let's see, now we have here 10 images which are numbered zero to nine and we have only one computer to process them on. So we will have some code which has a function inside which loads the image, does something to it and stores it, right? This is what we are doing, right? This is in the macro. Now, in order to apply it to the 10 images we put it in a loop, right? This is very easy. We iterate from zero to nine and we make this function run for each one of those images, right? So on this computer, and this computer is a cluster with a single node, the node is zero, it will process the first image, the second, the third and so on. So it will process serially across the images. This is the default thing: no cluster, single computer. Now let's imagine that we have a cluster which has four nodes. They are numbered zero, one, two and three, right? So now I will try to remind you that we had two numbers available to us.
One of them is the number of nodes available to us, which is four, and the other one is the rank of each node, which only the node actually knows, right? Okay, so we will have the same function to do the work, and now the only thing which we will change in our pseudo-code program is what is inside the loop, right? We start with the number which the node gives us, and we increment the loop by the size of our cluster. So what does that mean, right? If I am on node number zero, and, well, we have to put this inside a little syntactic sugar which initializes this parallel processing, but this is a detail, so if I now am on node zero and I get this image, my parGetRank will be zero and I will initiate this loop on image zero, right? When I visit this loop one more time I will increment not by one but by parGetSize, by plus four. So the next time through the loop I will do work on image number four, and the next time on image number eight. This is nothing special so far, right? But the magic comes here, right? If I am not on node number zero but on node number one, my parGetRank returns one, which means that I will actually be working on this image, and then this image, and then this image. And like this I can work on the 10 images in parallel, right? So that's basically the principle of how to deploy OpenMPI parallelization using an ImageJ macro. Now, it would of course be nice to see that it's happening, and we actually implemented ways in which you can visualize the progress of just such a computation in a beautiful way, like it's shown here. Okay, now our pseudo-code becomes a little bit more complex. This part here is the same. We have a new part here which basically repeats this loop and creates a task for every image which we generated. And then inside our function we put little syntactic elements which tell us the code proceeded until here, the code proceeded until here, right? And using those two things you can basically monitor, on a given image, where your program is, and it can report it to this graphical element which will display it in this way, right? It is very simple. It doesn't have to be done, but it makes it a little bit more accessible. Okay, so if you want to use it and if you want to find out more, then please go to this webpage. I will go there very briefly. I think I will actually manage to finish more or less on time. So this is a very nice guide which was done for the I2K conference. You really can start with the basic prerequisites, then you click next, then you get really nice screenshots of how to get to the plugin, how to set up your cluster, how to connect to it, all this stuff, and you continue, right? So this is a very, very nice tutorial which obviously I couldn't cover here, but look it up, you know, you can jump between the different places. You also learn how to write the macro; the example which I showed here graphically is shown here on another example. One should actually say, maybe we don't have that much time, but one should say that there are many other ways to use those two numbers, the size of the cluster and the rank of the node, to parallelize, right? You can group your data in whatever way you want, just do it inside your macro, and you are deciding how to do it. You know, once you've wrapped your head around it, it's very simple.
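To make the striding idea concrete, here is a minimal sketch in plain Java. The getRank() and getSize() stubs stand in for whatever the parallel macro extensions or an MPI binding actually provide (a rank and a total node count); they are placeholders, not a real API, and the image-processing step is reduced to a print statement.

import java.util.List;

public class StrideLoopSketch {

    // Hypothetical placeholders: rank of this node (0..size-1) and total number of nodes.
    // In a real deployment these values would come from OpenMPI / the macro extensions.
    static int getRank() { return 0; }
    static int getSize() { return 1; }

    static void processImage(String path) {
        // load the image, do something to it, store the result
        System.out.println("processing " + path);
    }

    public static void main(String... args) {
        final List<String> images = List.of(
                "img0.tif", "img1.tif", "img2.tif", "img3.tif", "img4.tif",
                "img5.tif", "img6.tif", "img7.tif", "img8.tif", "img9.tif");

        // Serial version (single node) would be: for (int i = 0; i < images.size(); ++i) ...
        // Parallel version: start at this node's rank and step by the cluster size.
        for (int i = getRank(); i < images.size(); i += getSize())
            processImage(images.get(i));
    }
}

With four nodes, rank 0 would process images 0, 4 and 8, rank 1 images 1, 5 and 9, and so on, exactly as in the slides.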
Okay, but now let's come back to the second example, right? What if we actually don't have such a simple, trivially parallelizable task where we have many small images and we can just, you know, put them in groups and send them to the nodes, right? What if we have one large image which we actually have to split? So here we have to do something a little bit more advanced, which is a library that was developed by the great people in Ostrava, SciJava Parallel with MPI, which builds the bridge to OpenMPI on a much deeper level, the level of ImageJ2 ops, right? So this basically takes care of splitting the problem into chunks. This is really using OpenMPI in its full, let's say, power, meaning that the nodes know about each other and they can communicate with one another; however, you know, if you wanted to do this in OpenMPI yourself, the code behind it is complicated, it's really difficult, right? So now this completely abstracts it away. I think that the computer science people like to say that it's fully transparent, but I'm not 100% sure, right? You know, as biologists you will not need to write any MPI-related code; we have developed new versions of the ImageJ ops and their syntax will not change, I will show you that. Eventually, when it's completely finished, it will be available through an update site and hopefully it will become actually useful; right now it's still kind of an experimental thing. So what does it do? How does it work, right? We have a large image, okay, this one maybe doesn't look very large, but as an example it's an image which contains nine planes, right? And we have a computer where, you know, now OpenMPI makes all the decisions; let's say it gets a computer which has only eight nodes, right? So it will divide that image, these nine planes, into eight chunks. What you can see is that chunk number one doesn't correspond to a plane, it extends into the next plane, right? Because for OpenMPI this is just a chunk of data which it needs to divide among the resources which it has. So it creates these eight chunks, then it processes them, using also multi-threading inside, that doesn't really matter, and then when it has done one processing step it re-synchronizes: it sends to all the other nodes the data which the other nodes have been computing, right? So this is a necessary step which is using the inter-process communication, and it makes it more powerful, because you don't really have to make any decisions about how to split it, it can do it completely automatically, but it introduces an overhead, because the time it takes to synchronize the data across the cluster increases with the number of cluster nodes. If you have only one node it's obviously nothing, if you have a few it's not so long, but if you have 30 nodes and they all have to receive a copy of the data, it can take some seconds, right? So this is one overhead, and I will get to the limitations of it in a moment. But what I find the absolute beauty of this approach is that now we are doing a difference of Gaussians on a, let's say, large image, and we are doing it using Jython, using the ImageJ2 ops, right?
You don't have to understand this code: we open the dataset, we do the convolution with one kernel, we do the convolution with the other kernel, we subtract them and we save the data. The beauty of it is that if this were done on a single computer, the code would be identical, there would not be a single line or even a single character different, but because we are running it on a computer which is set up for cluster access, it gets internally split onto the three nodes which are available, and it will hopefully run faster, given the overhead which is introduced here. Okay, so that basically means that this is something which is still rather experimental, but if you want to find out more, it is already documented, it is on GitHub, and I encourage everyone to give it a try, because we need people to basically test it better, right? Now, the last point: could we actually combine the two approaches? Could we operate on many large images? So the answer to that is: in principle, yes, the hybrid approach of macro and ops is very much possible. Sorry, this is a little bit of a busy slide; what you should take as the take-home message here is that inside the code, whatever is blue is macro, okay? One has to say that this is macro which has been wrapped in Jython, because you have to be in Jython to be able to use the ops, right? And then whatever is red is ops, right? So you can divide and conquer your task with the macro commands, where you decide how you divide it: let's say you take this large image and send it to those four nodes, and this large image to those four nodes, and so on; this is actually automatic, you just set it, right? And then inside, it will split it using the ops and operate in parallel on this large, large image, right? One has to say, I think I have it here in the corner, I always move your picture so that I can see, yeah, this is under construction, this is not really working entirely yet, it's a proof of principle. We hope this will work and we are finding out, you know, in what scenarios this is a good approach, given the overheads which it has, so that it actually pays off. So this leads me to my last slide, right? There are of course, you know, some gotchas, right? One, I already mentioned, is that the synchronization makes it slower; the second gotcha is that your data set cannot be infinitely large, because at some point in the pipeline it has to fit into the memory of one cluster node, so that gives a very, very hard ceiling to how big your data can be. Depending on what cluster you are on, these clusters usually have a lot of memory, so it's not such an impractical thing, right? But on very large data sets you simply, you know, cannot do it, right? You can either do something else ahead of time to divide the big data set into smaller pieces, or you stay tuned and listen to Stephan Saalfeld to show you how they do this kind of magic at Janelia.
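For readers who want to try the difference-of-Gaussians idea from the example above without a cluster, here is a rough single-machine sketch in plain ImgLib2, not the parallelized ops version shown on the slide. It assumes the imglib2 and imglib2-algorithm libraries; the empty array image stands in for data loaded from disk.

import net.imglib2.RandomAccessibleInterval;
import net.imglib2.algorithm.gauss3.Gauss3;
import net.imglib2.img.array.ArrayImgs;
import net.imglib2.loops.LoopBuilder;
import net.imglib2.type.numeric.real.FloatType;
import net.imglib2.view.Views;

public class DoGSketch {
    public static void main(String... args) throws Exception {
        // a small example image (in practice this would be loaded from disk)
        final RandomAccessibleInterval<FloatType> img = ArrayImgs.floats(256, 256);

        final RandomAccessibleInterval<FloatType> g1 = ArrayImgs.floats(256, 256);
        final RandomAccessibleInterval<FloatType> g2 = ArrayImgs.floats(256, 256);
        final RandomAccessibleInterval<FloatType> dog = ArrayImgs.floats(256, 256);

        // blur with two different sigmas (mirror-extend so the borders are defined)
        Gauss3.gauss(2.0, Views.extendMirrorSingle(img), g1);
        Gauss3.gauss(4.0, Views.extendMirrorSingle(img), g2);

        // subtract the two blurred versions pixel by pixel
        LoopBuilder.setImages(g1, g2, dog).forEachPixel(
                (a, b, o) -> o.setReal(a.getRealDouble() - b.getRealDouble()));
    }
}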
So what I showed you was, let's say, you know, more developmental stuff, more practical, on the type of data which, you know, many of us might encounter, and I think the take-home message here is that there are now ways to do it: for a very specific data set there is the SPIM registration pipeline, there are ways to parallelize your macros by injecting very little new syntax, where you are in charge and you can parallelize in whichever way you want, and there is also a way to use the awesome power of OpenMPI to parallelize automatically using the ImageJ ops, but you have to be a programmer to take advantage of that. This is a product of the hard work of the team which I established a couple of years ago in Ostrava, at the supercomputing centre in the Czech Republic, and, you know, there are pictures of all those people and their names; I mean, they have been really instrumental, and Vladimir Ulman is in the background answering your questions about this. I went over time almost not at all, so I'm kind of reasonably proud of myself, and I am, you know, either going to answer some questions or I will pass the baton to Stephan Saalfeld, who will talk about the really big data. Okay, thank you Pavel. I will allow a few minutes of questions for you. There was a question about memory limitation and this was answered by Vladimir. Actually, there is no memory limitation because we are on the cluster, so the limit is the limit of the cluster, but I think related to that, the question was also how to retrieve the information, the results you process, from the server, and I think loading the data is also an issue. Very good point, very good point. So I would like to say that we addressed it, but I mean, it's more like we are in the process of addressing it: in our grant, one of the work packages has been to figure out how to push the data to the cluster. Obviously there are some physical limits to it, right? I mean, in our experience, typically most of us are living on a backbone of the academic network, and so syncing data to the cluster is actually not so prohibitive, right? Sometimes the problem is that the cluster might not necessarily have a massive storage capacity, right? So especially in Ostrava it has actually always been difficult to find enough storage for big data to sit there for a while, right? You know, there are ways one can think about parallelizing the transfer of the data, or using compression, and we are actually exploring those, but there's nothing really concrete to show yet. The last thing to say to this: I mean, the amount of effort to set it up is still significant, right? I mean, getting access to the cluster and figuring out that it works and all that, the time you spend on that is probably going to be longer than the time it takes to copy your data there. So it's not such a prohibitive thing. Okay, and I would have a second question. So from a facility point of view, I mean an imaging facility point of view, if there is not yet an infrastructure like a real cluster, how do you go to the IT service? How do you start this process? Because, from what I understood, OpenMPI is a software, but it's probably not available everywhere. You have to use... No, this will be available everywhere. That is not really the showstopper. Any cluster will have it, right? Okay, so... It can be installed on it. I mean, it's open source, right? It's not... Okay, so there we go.
There is, of course, if you do have a cluster, some kind of expertise required to set it all up. Yeah. Some of the links which I showed during the presentation to the wiki make an attempt at explaining this, right? But a person with IT experience is definitely required to do that, right? If you are using something which is maybe a little bit commercial, right? Okay, so nowadays you would probably think more about cloud services, but we are talking here about connecting to an academic HPC resource which is available to the outside world, right? So there will be, as I said, many hoops you have to jump through initially to gain an account on that cluster, right? And sometimes it can be kind of formidable, but it might be worth it, considering that, let's say, to some degree the infrastructure to eventually run there already exists, right? So it's not without its issues, because, of course, security is a massive point for these HPC resources. They are constantly being bombarded by hackers, right? So you really have to make proper handshakes and stuff like that so that it's not penetrable, right? But nevertheless, it can be done. Okay, so yeah. Vladimir, do you see one question that would still be interesting? I don't want to go too much over time, but we have a bunch of really technical questions. Those are not for me. Yeah, I think I would prefer to have Vladimir and John answering them in writing and then give the floor to Stephan so that we don't go too much over time. So thank you. Thank you so much. This was very enjoyable, talking about the geeky stuff from my living room, excellent. Hey, thanks so much, guys. Yeah, as a proud Pavel lab alumnus, it's really a pleasure to present in this NEUBIAS session together with the Tomancak lab, and I'm very thankful for this. It's very cool. In this seminar, I want to give you a hands-on demonstration of the toolbox that we use to develop and apply processing pipelines for relatively large image data sets, and large is a relative term, so we will talk about that in concrete numbers later. You should keep in mind when you hear this presentation that this is our personal perspective. This is maybe not the right answer for everybody of what a good framework for large data processing should look like. And of course, this personal perspective is heavily biased by the fact that we developed many of these libraries and tools ourselves and we have been using them for several years and we're kind of happy with that. So I want to share this happiness with you. I want to show you how to use these things and you will have to decide whether this is for you or not. First, I want to show you a few examples of the data sets that we have been working with. Together with Khaled and folks from Janelia Scientific Computing, and in particular Eric Trautman, we stitched and aligned the first and so far only complete adult fly brain imaged with electron microscopy. This data set comprises about 7,000 40-nanometer sections, each consisting of up to 2,000 overlapping tiles imaged at 4 nanometer lateral resolution with custom transmission electron microscopes developed by Davi Bock and his team, who spearheaded this fabulous project. The size of this data set is somewhere between 50 and 100 useful teravoxels. There's a little bit of overhead, and if you consider the background pixels as real data, then you end up a little bit above the 100-terabyte range. So that's the data size that we're mostly dealing with.
So let's look at another data set. Together with then-Janelia fellow Dagmar Kainmueller, who is now a group leader in Berlin, and our colleagues from Scientific Computing, we developed the methods to stitch and align a volume of the Drosophila central brain that was imaged with focused ion beam scanning electron microscopy at 8 nanometer isotropic resolution, after it was sectioned into 13 slabs of 20 micrometer thickness; those are the stripes that you see here. And here we had to stitch and align the FIB-SEM slabs and then unwarp and align the slab series. The size of this data set is in about the same range as the serial section transmission EM data set that you've seen before. However, the resolution here is isotropic, so reconstruction of neurons and synapses is a little bit easier than with the non-isotropic resolution that you've seen before. The last example that I want to show you is this one. We developed the code to stitch tens of thousands of 3D tiles imaged with lattice light sheet microscopy by Eric Betzig and his people. This includes a method to estimate and correct the flat field of the imaging system from all tiles. The situation in this case is that we have so many tiles, but unfortunately no flat field image, so we have to estimate the flat field from the 3D tiles themselves. This has been published before by Kevin Smith and colleagues, but was only implemented for 2D images that fit into main memory; here we have something like 10 to 30 terabytes of image data and we want to use all the pixels to estimate these flat fields. So this was the last video. Here is an example of what we do then, when we have this data assembled: we have worked on correlating light and electron microscopy. As you can see here, you see an unbiased average atlas of the Drosophila brain and ventral nerve cord that we created from 60 symmetric brain samples labeled with a generic synapse marker and imaged at 0.2 by 0.5 micrometer resolution with confocal laser scanning microscopy. Larissa in the lab then trained a deep neural network to predict all the synaptic clefts in the EM volume that you see here at the bottom, shown earlier, this was the first example, and automatically registered the resulting synapse cloud at the top with the light microscopy atlas. And now, thanks to the Drosophila brain being highly stereotypical, genetically labeled neurons in light microscopy can be associated with the reconstructed neurons in electron microscopy. We've done that also for the other brain samples, and more recently a new sample came in of the entire ventral nerve cord and we also established a registration between these two samples, and the registration work is by John Bogovic, who is actually answering questions in this session. The library that we're using for most of these things is our beloved image processing library ImgLib2, which is not necessarily an image processing library but a data management layer, and the thing that we like most about ImgLib2 is the fact that it is lazy. So what does it mean that ImgLib2 is lazy?
ImgLib2 is lazy at several layers. First of all, well, we should start at a different point: ImgLib2 tries to express all mappings from discrete and continuous coordinate spaces into arbitrary pixel domains. This means we come from n-dimensional vectors that are either on a grid, that's the discrete space, or in a continuous domain, that would be real space, real functions, and we map into pixels whose type we don't know yet. These pixels, in the classic sense, could be 8-bit numbers, 32-bit numbers, floating point or integers, or something like this, but a pixel can also be an image, or a pixel can be a data set sitting somewhere in the cloud, and the only thing that you need to do is to implement logic that you then apply to these pixels. So when I say pixel in the rest of this talk, it's not always numbers, it can be all sorts of things, and they are abstracted through ImgLib2 interfaces that you use to talk to them. What does lazy coordinate access give us? First of all, yes, we have discrete and continuous coordinates, this is wonderful. We do not only randomly access coordinates, which would be the classic image processing way of doing things, saying I go to pixel x, y and then I'm reading it, and stuff like this; often you just want to loop over all pixels, that would be iteration. Or you have a data set like when an oceanographic research endeavor runs a ship over the ocean and makes samples at various GPS coordinates and stores the samples with the GPS coordinates; that would be an irregular data set, and you want to be able to identify the samples or interpolate between the samples. So ImgLib2 gives you means for this by using nearest-neighbor search or radius search and stuff like this. And also you have random access, and again, random access in the discrete domain would be on a pixel grid, some sort of rectangular pixel grid, and if you have this in the continuous domain you can access arbitrary points. So in addition to this, we have coordinate transformations that include simple things like translations, flips, crops, boundary extensions that are virtually calculated where no data exists, and we can also rotate, interpolate, project, add dimensions or remove dimensions from a data set, or even arbitrarily warp, to do things like registrations between EM and light microscopy data sets. So the next thing is lazy value access to the pixels. I mentioned before, pixels aren't always numbers, but most of the time they are; that's the codomain of our functional space, we go from coordinate space, from some Euclidean coordinate space, into an arbitrary pixel codomain. The standard way to do this for any normal image processing thing would be accessing memory. The advantage is that, because ImgLib2 virtualizes this, you can access shared memory of other processes and stuff like this, but you can also generate these pixel values depending on the coordinate, so you can implement functions. If you have an underlying image and you want to change what the values do, you can implement converters, so you can have functions that apply gamma corrections or change the pixel type or apply lookup tables or do other things with pixels. You can have multi-variate functions where you combine several inputs into a new output, which is kind of the same as a generator, because you don't really need to know what the inputs are.
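A minimal sketch of the converter idea just described, assuming only the core imglib2 library: an 8-bit image is wrapped in a virtual double-valued view with a gamma-like function applied, and nothing is computed until a pixel is actually read.

import net.imglib2.RandomAccess;
import net.imglib2.RandomAccessibleInterval;
import net.imglib2.converter.Converters;
import net.imglib2.img.array.ArrayImgs;
import net.imglib2.type.numeric.integer.UnsignedByteType;
import net.imglib2.type.numeric.real.DoubleType;

public class LazyConverterSketch {
    public static void main(String... args) {
        // an 8-bit source image
        final RandomAccessibleInterval<UnsignedByteType> img = ArrayImgs.unsignedBytes(512, 512);

        // a virtual view of the same data as doubles in [0,1] with a gamma applied;
        // nothing is copied, the conversion happens on every pixel access
        final RandomAccessibleInterval<DoubleType> gamma = Converters.convert(
                img,
                (in, out) -> out.setReal(Math.pow(in.getRealDouble() / 255.0, 0.5)),
                new DoubleType());

        // reading a pixel triggers the conversion on the fly
        final RandomAccess<DoubleType> ra = gamma.randomAccess();
        ra.setPosition(new long[] { 100, 100 });
        System.out.println(ra.get().getRealDouble());
    }
}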
And of course you can implement interpolators: trivial things would be interpolators of pixels that can multiply and add, so they can implement things like n-linear or Lanczos interpolation, but it's also possible to interpolate between images, which we recently did to un-shear SPIM recordings, because you shift these things under arbitrary deformation; you know that these pixels have a numeric type and you can generate a virtual interpolated image between them. So, last but not least, we have lazy data access. Even if ImgLib2 accesses underlying chunks of continuous memory to represent the data, this data can be generated lazily by cell loaders or array loaders, and these cell loaders and array loaders can be used to implement caches that go either through main memory or come from permanent storage like a disk or a cloud store. Okay, so this is a lot of words and it's not fun to read all these words, so this is the place where I would like to go into the first examples, and for this I need to change the way I am sharing my screen; I go to full screen and that hopefully works. So we have here the Eclipse integrated development environment that I favor these days, and what you see here is a repository full of examples that I want to use to guide you through this. These examples are modified versions of a similar, kind of related, tutorial that we did at the I2K 2020 conference, and I posted the link to this repository in the chat; that's a branch of this advanced tutorial. Let's first go into ImgLib2 tutorial number one. We could start with loading an image and showing it to you, but everybody can do this, so instead of using an image that lives in main memory we will create one from pixel coordinates. What we use is this thing here that is called a FunctionRealRandomAccessible. Things in ImgLib2 have very long names that are descriptive of what they're providing to you, so if something is able to implement random access in an n-dimensional real-valued coordinate space, then it is a RealRandomAccessible, and you see this word here, and this particular one implements functions. So how do we do this? We call this function a Julia set, this is the pixel type, the output type that we want to generate, and this constructor has a few parameters. The first one is the number of dimensions, so this is a simple case where we just say this is a two-dimensional function. And now we implement the function, and we're using Java 8 features because they're cool: this is a lambda, a BiConsumer that gets two input parameters, the first one is x and the second one is f of x; because we're overriding, this is the way it's used. x is an ImgLib2 type that is called RealLocalizable, which means nothing else but that it is a vector that has real-valued coordinates, and f of x is of our output type, and as I'm hovering here you see that this is supposed to be an integer type. So now let's see what we're doing here: we implement a function that does some magic, gets the coordinates of this x value, writes the coordinate at dimension zero into c and the coordinate at dimension one into d, and then it does some stuff and actually calculates the Julia fractal, and once it's done and it has found the iteration depth of this particular pixel, it sets this value to f of x, which is an int type; and we also have to provide an int type so that the function has an object to operate with. This is the function, that's great. And now we're using the BigDataViewer, which is a project spearheaded by Tobias Pietzsch, also in the Tomancak lab, still or formerly, so he's a collaborator at least, and this has a lot of convenience functions; the most important convenience function is that
you can display everything ImgLib2 with it very quickly by just using this BdvFunctions static function collection. Here we display this Julia set, we create some sort of interval so the BigDataViewer knows where to paint the box, and we give it a name, and then we can provide some options; in this case the options are that it is a 2D data set, so it doesn't have to do any 3D magic, and we want to use a display range from 0 to 64, because that's the limit of our Julia fractal, so it's not going to be brighter than 64. So now let's run this, and for this one I press the run button, and here we see a BigDataViewer window pop up. We can rotate this, so this is cool, and we can shift it around and we can also zoom in, and I'm using the keyboard to zoom in. And one of the nice features in this example is that this is a function definition on a real-valued input space, so there are no pixels; this goes on and on forever until at some point we reach the precision limit of double values. Let me see if we can reach that, it actually takes a while, so here we go, this is the precision limit of double values, but to show you how cool this is, we can also zoom out from here, and if you want higher precision you can certainly change your input space into something else. Good, so that's great. What else can we do with this? Here we have the second example: we do exactly the same thing, we have our function random accessible and we show this, and then we do something else lazy, we say raster this thing. Okay, so there is no in-memory imprint of this, it just rasters it, and then we show the rastered image, and before it gets too complicated let me show you this. Here we go, this is the original image, and then you see this, so we're showing only a single source, and we currently show source number one, that's the thing that you've seen before, and now if we go to source number two, oops, what happened? Well, the Julia fractal, and the space in which we display this, goes from minus one to plus one, so the resolution of the raster image that we generated from this is three by three, that's a surprise, right? So how can we fix this? Kind of easy: we could change the scaling here, and then we change the interval here, let this run, and we show this again, and, oh yes, what I didn't do is I did not change the interval of that function, so I should probably also do this, and there we go, that is now what we've been after. So this is the simple function, and then if I go to source number two, you see this is also bounded, because we cropped it virtually and generated this bounded thing. So now if I zoom in I would expect to see some pixels at some point, and that is actually true: here are the pixels, and if we rotate them we see that they're little squares, and if we switch on interpolation and do this here, we see, okay, we can also interpolate them. Something that you may not see, because the zoom is a little bit slow and transmitting the signal is slow, is that this gets a little bit slow when I'm going full screen, and the odd thing that is happening right now computationally is that this rastered image is generated on the fly from the real-valued function: every pixel access asks the real-valued function for a value and rasters it, and then if we do interpolation on top of this, for every pixel that we're rendering here on the screen we actually ask the real-valued function for four values, which is computationally more expensive than doing the original, which is this guy here. So that's a bit odd.
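Before moving on, here is roughly what the pattern of the first two examples looks like in code. This is a sketch assuming the imglib2 and bigdataviewer-vistools libraries, with made-up Julia constants, not the exact tutorial code (which is in the linked repository).

import java.util.function.BiConsumer;

import bdv.util.BdvFunctions;
import bdv.util.BdvOptions;
import net.imglib2.FinalInterval;
import net.imglib2.RealLocalizable;
import net.imglib2.position.FunctionRealRandomAccessible;
import net.imglib2.type.numeric.integer.IntType;

public class JuliaSketch {
    public static void main(String... args) {
        // a function over continuous 2D coordinates: x -> iteration count of a Julia-like series
        final BiConsumer<RealLocalizable, IntType> julia = (x, fx) -> {
            final double c = x.getDoublePosition(0), d = x.getDoublePosition(1);
            int i = 0;
            for (double a = c, b = d; i < 64 && a * a + b * b < 4.0; ++i) {
                final double e = a * a - b * b + 0.2;   // hypothetical constants
                b = 2 * a * b + 0.6;
                a = e;
            }
            fx.set(i);
        };

        final FunctionRealRandomAccessible<IntType> juliaSet =
                new FunctionRealRandomAccessible<>(2, julia, IntType::new);

        // show it in BigDataViewer; the interval only tells BDV where to draw the bounding box
        BdvFunctions.show(
                juliaSet,
                new FinalInterval(new long[] { -1, -1 }, new long[] { 1, 1 }),
                "Julia set",
                BdvOptions.options().is2D())
            .setDisplayRange(0, 64);
    }
}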
How can we fix this? We see this in tutorial number three. We start the same way, we have our function, we show our function, I changed the intervals here, and then we do this little trick here: from this interval that we rastered we create a cached instance, and we use a cached cell image with a block size of 32 by 32; it doesn't matter too much how big this is, and you can see everything else remains precisely the same. So we can run that, and we see, okay, this is the original function, then source number two is the on-the-fly, per-pixel rastered thing, and source number three looks exactly the same as source number two, but it's the cached version, and the cached version is significantly quicker. What this cached cell image is doing is it takes the original source, it looks for little blocks that are stored in main memory, and it fills them once with values and then keeps them in main memory, but it does this only as long as you have enough memory, otherwise things get lost again and will be reloaded as you need them again. So for zooming deep into this area, all the stuff that is outside of this field of view is basically irrelevant and you don't need it anymore, and so you can have these situations where you can use intermediate caching to save compute time. Okay, so the last thing that I want to show you in this ImgLib2 tutorial is lazy evaluation on top of these data sets. We've now seen that we can create functions, we can raster them, the coordinate access is transparent, we can cache things into main memory; so now we want to do something to the pixels, and what we're doing here is we create, on the fly, calculated gradients from these input images. One thing that we do first is we convert our raster data set into a new type, because gradients are sometimes negative and we cannot only deal with positive numbers; before that we had some data type that can be used, int types, that should have worked anyway, but double is better. So this is how it works: it's an on-the-fly conversion of every pixel value, using again a BiConsumer that takes an input value and an output value, and it sets the real value of the output value to the real value of the input value, and then it doesn't matter what the input and output values are, and we use a DoubleType, so calculating in doubles, cool. Now I have a helper function here: I create another RandomAccessibleInterval that is indistinguishable in ImgLib2 from a memory-backed thing, even if it's calculated on the fly, using this helper function that creates a gradient at dimension 0 from this value. And let's look at how this function is implemented; it's actually relatively simple, well, it forwards to this function here, and here we make an array of offsets that is as long as all our dimensions, and we set the offset at the dimension where I want to calculate the gradient to minus 1, and then we create two virtual sources that are virtually translated, one in the negative direction and one in the positive direction, so one of the images is shifted here, no memory duplication, they're both still completely virtual. And then we write a converter that uses two sources, so it's a bivariate function that uses the value from source one and the value from source two: first it sets the output value to the value of this pixel, then it subtracts that pixel, and then, because it's a distance of 2, we also have to multiply by 0.5 so that we get a correct centered pixel gradient, right? So that's it, let's go back to the tutorial and let's see how this works, and then we can show these gradient X and gradient Y images.
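The tutorial implements the gradient with a two-source converter; an equivalent sketch using virtually translated views and LoopBuilder, assuming only the core imglib2 library, looks roughly like this (the shifted views are virtual, so no pixel data is copied).

import net.imglib2.RandomAccessibleInterval;
import net.imglib2.img.array.ArrayImgs;
import net.imglib2.loops.LoopBuilder;
import net.imglib2.type.numeric.real.DoubleType;
import net.imglib2.view.Views;

public class GradientSketch {
    // central-difference gradient along dimension d, written into target
    static void gradient(
            final RandomAccessibleInterval<DoubleType> source,
            final RandomAccessibleInterval<DoubleType> target,
            final int d) {

        // two virtual views of the same data, shifted by +1 and -1 along d (no copies)
        final long[] plus = new long[source.numDimensions()];
        final long[] minus = new long[source.numDimensions()];
        plus[d] = 1;
        minus[d] = -1;
        final RandomAccessibleInterval<DoubleType> back =
                Views.interval(Views.translate(Views.extendBorder(source), plus), target);
        final RandomAccessibleInterval<DoubleType> front =
                Views.interval(Views.translate(Views.extendBorder(source), minus), target);

        // (front - back) / 2 gives the centered gradient
        LoopBuilder.setImages(front, back, target).forEachPixel(
                (f, b, o) -> o.setReal(0.5 * (f.getRealDouble() - b.getRealDouble())));
    }

    public static void main(String... args) {
        final RandomAccessibleInterval<DoubleType> img = ArrayImgs.doubles(256, 256);
        final RandomAccessibleInterval<DoubleType> gradX = ArrayImgs.doubles(256, 256);
        gradient(img, gradX, 0);
    }
}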
What I'm doing here is a little bit more of BigDataViewer convenience: I'm not only showing this, but I'm also storing a reference to the BigDataViewer and then changing the color of the sources so that we can overlay them on top of each other in different colors. So let's see how this works. There's a bunch of sources, you can see them here in the side panel, let me move my zoom overlay a little bit, here, this is the original thing; switch them all off and go into fused mode. Now we see two things: first of all, the gradient stuff seems to work, right? So we have the magenta channel with the x gradient and the green channel with the y gradient, the negative and positive, so around here this is 0, and we can rotate this. We have also seen that this thing built up as we zoomed, as we opened it. The other thing I noticed is that while I implemented this, and while I didn't display this, none of these gradients had ever been calculated, because it's all lazy: only at the moment when the BigDataViewer wants to show things do we actually start seeing them, and then they're being calculated, and then I'm using this cache mechanism to store intermediate versions of them. You can see here that I'm using a cache on the gradient x, so all these 32 by 32 pixel boxes are being filled. Okay, next example. So this is a tutorial for, wait a second, well, this actually goes forward to the next section of the presentation that I want to show you, so I think I have to go to slide number nine. So we've seen how we can use lazy evaluation and lazy pixel processing with ImgLib2, that's very convenient, and now the question is: we've seen how to use caches to store stuff in main memory and reproduce it as necessary when main memory is not sufficient, and of course you can use this to load data on demand, right? So instead of generating and calculating it, you can also load it from a data backend, and so now we have to talk a little bit about data backends. The classic data backend for image analysis is two-dimensional images or n-dimensional images, and for two-dimensional images we have a bunch of established formats: JPEG and PNG are the most important, then the more or less standardized TIFF format, and also the more modern HDF5 standard. This is great, but it's inefficient for very large images: if you have an image of 100,000 by 100,000 pixels and you store this as a single 2D image, which is a linearized representation in memory, and you compress this, then in order to load a small chunk of this data that you want to process you have to kind of decompress the entire thing, or, depending on your compression algorithm, you may be able to jump forward in block sizes, but it's sort of difficult, so it's not great. So what's a good solution for this? You tile your image space. Some established solutions implement this, for example, again, HDF5; TIFF has a tile tag that you can use to load images; and then there are these web-based standards that are well known, Google Maps for example is a pretty well known standard for this that stores 256 by 256 tiles and you load only the tiles that you need, and they're independently compressed, so you don't ever have to access the entire data set; CATMAID is using a similar structure as well. So that's fine, because when we have small tiles we only have to load these tiles and then we can focus on the pixels that we want to process.
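The arithmetic behind such tiled access is simple; a tiny sketch (plain Java, made-up coordinates) of which tiles need to be loaded and decompressed for a region of interest:

public class TileIndexSketch {
    public static void main(String... args) {
        final int tileSize = 256;               // e.g. 256x256 web-map style tiles
        final long[] roiMin = { 1000, 3000 };   // region of interest, inclusive bounds
        final long[] roiMax = { 1500, 3100 };

        // only the tiles intersecting the ROI need to be loaded and decompressed
        for (long ty = roiMin[1] / tileSize; ty <= roiMax[1] / tileSize; ++ty)
            for (long tx = roiMin[0] / tileSize; tx <= roiMax[0] / tileSize; ++tx)
                System.out.println("load tile (" + tx + ", " + ty + ")");
    }
}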
How do we do this when we have more than two dimensions? Well, if you use TIFF or MRC you will see these series of 2D tiles. Same problem: if you want to get a small box from this, you have to load all the tiles that are covered by the box, and this can be inefficient if the tile size is large; and of course you can use tiled images to do this, which gives you the same efficiency as mentioned before. Unfortunately, this is the place where your IT department gets a little bit angry with you, because if these tiles are small, then you have more efficient access, so you would like to make them as small as possible so you have to load little data, but that also means that you start having billions of teeny tiny files on your file system, and that's not the thing that most file systems like. So it would actually be better if, instead of these 2D tiles, we had 3D chunking or n-dimensional chunking, and again this is something that is well supported in the HDF5 format and also in web services like CloudVolume, BOSS or DVID; some of them are very heavily focused on doing this only for 3D, HDF5 however does this in n dimensions for you, and that gives you the opportunity to load only very small numbers of chunks for these data sets. So you've seen HDF5 mentioned very prominently on all of these slides, because it's really great for anything that is not outrageously large. So let's track back: HDF5, you should use HDF5 for almost everything you're doing, because it allows you to have structured data in a single file container, you can link between file containers, you can associate structured, well, not structured, but at least typed metadata with your structured data sets in this container format, you can have compression, you can have this chunking, you can have random access to these chunks, you can have data sets of several hundreds of gigabytes, this is great. The only place where HDF5 doesn't excel is in parallel writing of many chunks into a single data set. So assuming you have a volume of 50 teravoxels and it's one volume and you want to write many chunks into this single volume in parallel, this is where HDF5 is complicated, because the compressed chunks in particular are not laid out on a standardized grid in the file but are adjusted to each other, so you can only write these chunks out sequentially. There are some libraries that help you out with this and do at least parallel compression and then use some native layers in HDF5 to directly write into the byte stream, but so far this doesn't work very well when you work on a compute cluster and you have independent computers talking to the same parallel file system, for example. So that's why we thought we want to have everything that HDF5 has, but we want to make it simpler and use parallel file system capabilities to do parallel writing, because that's what they're built for and that's what they're good at, and that's why we ended up building the N5 API. Full disclosure: we started to implement a new format specification and then we figured that this is silly, because you shouldn't invent new formats, and realized that what we're trying to do here is actually just a programming API for some primitive access patterns that are consistent between file-based backends, which write individual files onto the file system, and also HDF5. So these primitives are that you want to be able to create, delete and list groups and data sets in a hierarchically structured thing. This hierarchical structure could be an HDF5 file, or it can be cloud storage, or it can be a file system, right, say you want to read a tree of directories. And you want to be able to create, delete, list and read attributes for groups and data sets.
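A hedged sketch of what these primitives look like with the N5 Java API (assuming the n5 and n5-imglib2 artifacts; paths, block sizes and attribute values here are made up):

import java.util.Arrays;

import org.janelia.saalfeldlab.n5.GzipCompression;
import org.janelia.saalfeldlab.n5.N5FSWriter;
import org.janelia.saalfeldlab.n5.imglib2.N5Utils;

import net.imglib2.RandomAccessibleInterval;
import net.imglib2.img.array.ArrayImgs;
import net.imglib2.type.numeric.integer.UnsignedByteType;

public class N5Sketch {
    public static void main(String... args) throws Exception {
        // file-system backend: groups are directories, attributes live in JSON files,
        // each data block is one compressed file
        final N5FSWriter n5 = new N5FSWriter("/tmp/example.n5");

        n5.createGroup("/volumes/raw");
        n5.setAttribute("/volumes/raw", "resolution", new double[] { 4, 4, 40 });

        // save an n-dimensional image as a chunked, compressed data set
        final RandomAccessibleInterval<UnsignedByteType> img = ArrayImgs.unsignedBytes(256, 256, 64);
        N5Utils.save(img, n5, "/volumes/raw/s0", new int[] { 64, 64, 64 }, new GzipCompression());

        // list children of a group and read an attribute back
        System.out.println(Arrays.toString(n5.list("/volumes/raw")));
        System.out.println(Arrays.toString(n5.getAttribute("/volumes/raw", "resolution", double[].class)));

        // lazy, cached read-back: blocks are only loaded when pixels are accessed
        final RandomAccessibleInterval<UnsignedByteType> lazy = N5Utils.open(n5, "/volumes/raw/s0");
    }
}

The same calls work against other backends, for example HDF5 or cloud stores, by swapping the N5FSWriter for the corresponding writer implementation.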
This is not something that is standardized in a file system, so we decided, okay, maybe we just write a JSON file into this, and that's what we did for the first implementation; in HDF5 it would mean that you associate these metadata elements with groups or data sets. So what else do you want? For the data sets you want to implement chunks, so you want to be able to create, delete and read compressed data blocks; these are the chunks or cells of these data sets. And then, for the first implementation of this, we created a file system backend that uses directories for groups, JSON files for the attributes, because that allows you to express arbitrarily structured data, and one file per data block, hopefully a file that is not too small, and we implemented the standard compression algorithms that are available in standard libraries to compress these things. Maybe in addition I have a couple of questions about, like, storing and accessing this n-dimensional data with chunks: how is the processing or analysis of structures spanning neighboring tiles done, if you have a connected component occupying multiple tiles? I think this is a problem that is common to HDF5 and to N5, actually. Yeah, so it depends on your processing algorithm, but what we did for the COSEM project... well, I'll actually get to this, but let's answer it now. What you can do when you do connected components is you process the connected components in chunks in parallel; it doesn't necessarily have to be in the chunks of your file system or your data container, but something that is feasible for your processing pipeline. Then you run over the connected components that touch each other in your extracted data structure, you make a union-find over these connected components, and you re-label them with the adjusted label ID. And if you follow some primitive rules, like in your block processing you assign label IDs to your connected components by using, let's say, the first pixel that you find that belongs to the connected component, then you don't have any overlap in labels and you can do this union-find over the adjacent blocks. Okay, thank you, and I think Hoco had a question as well; Hoco, if you want to speak. Yes, I would like to ask Stephan, and Pavel as well: if you are on a plane and you have just two parachutes, and there is the HDF5 format and TIFF, which format do you try to save? I mean, usually we acquire a data set with a microscope that produces TIFF, and of course some file formats are better for processing, so we may convert this data set into HDF5 or eventually other formats, but we are also presented with the problem of keeping the raw data, and you know, if I have one terabyte of raw data and I convert it into another format, or eventually a third format, what is your suggestion about which one to keep? It's a very practical question. So my proposal to everybody building a new microscope who is thinking about a file format to store this in is HDF5: use HDF5 directly for streaming out your data from your microscope. HDF5 gives you all the beautiful things that you want from a good data format: you can define arbitrary metadata dialects and later extend them, it supports all sorts of pixel types that you can imagine streaming from your microscope, and you don't have to resave it, because you can immediately use it through APIs like the N5 API, for example.
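A minimal sketch of what "use it immediately through the N5 API" could look like, assuming the n5-hdf5 backend; the file and dataset paths are made up, and the constructor signature should be checked against the N5 version in use:

```java
import net.imglib2.RandomAccessibleInterval;
import net.imglib2.type.numeric.integer.UnsignedShortType;
import net.imglib2.util.Intervals;

import org.janelia.saalfeldlab.n5.hdf5.N5HDF5Reader;
import org.janelia.saalfeldlab.n5.imglib2.N5Utils;

public class ReadHdf5ThroughN5 {
	public static void main(final String... args) throws Exception {
		// hypothetical acquisition file and dataset path, with a default block size for reading
		final N5HDF5Reader h5 = new N5HDF5Reader("/data/acquisition.h5", 64, 64, 64);
		final RandomAccessibleInterval<UnsignedShortType> raw = N5Utils.open(h5, "/t00000/s00/0/cells");
		System.out.println("dimensions: "
				+ java.util.Arrays.toString(Intervals.dimensionsAsLongArray(raw)));
	}
}
```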
TIFF is not so great, because it is built for 2D. Okay, thanks. So use HDF5 for everything that is moderate in size, and you get compression, arbitrary compression, so the first layer of data, losslessly compressed, is already smaller than if you store it in raw TIFFs or DAT files with arbitrary structure. Okay, good, so let's continue here a bit. We have bindings of this N5 library to ImgLib2, which allow us to transparently load these N5 containers as ImgLib2 images and have them memory cached; we use the ImgLib2 cache for this, and we have support for some rather adventurous things, like pixel types that contain an unknown multitude of labels with unknown weights. In addition to this we have alternative backends and compressors. N5 is just an interface API, whose interface is used from this ImgLib2 library for example, and we can have several backends: we implemented two cloud backends, for AWS S3 and Google Cloud, and we implemented a backend for HDF5, so you don't have to change your code when you switch from the cloud storage or file system single-file backends to actual HDF5 containers, which is why I'm suggesting to use HDF5 for everything that is reasonably small, because it's great for that. We have a backend for Zarr; there's an unfortunate dichotomy right now, because backend and type conversion are currently not separated in the N5 API, and we'll work on that at a later time so that we can support Google Cloud and AWS transparently also for Zarr. There are some forks in the wild, developed by Josh Moore and Tischi, that implement part of the AWS support, where we actually want everything; Zarr is a format very closely related to the original N5 file system specification. And then another cool thing is that you can add new compressors very easily to N5 by implementing a compressor interface and annotating them, and at runtime the JVM explores all the available compressors on your class path and makes them available. So we added Blosc compression, which was important to support Zarr completely, and also some experimental JPEG compression, which is good if you want to share data over the web. Okay, last but not least, there's a bunch of tools out there: n5-utils contains command line tools to visualize data sets, browse them, and do things like copy. So if you want to copy from HDF5 into a Zarr container, or from N5 into an HDF5 container, you just specify the data sets that you want to copy, and it finds all the attributes and does everything right. We also included this in Fiji, and you will see it later in the more practical session, and we have some utilities for parallel processing on a Spark cluster, which we will talk about a little bit later. Okay, and of course all of this is available as Maven artifacts; if you want to write software, you need the dependencies. I deposited the more established parts of the N5 library on the SciJava Maven repository and the more experimental things on a local Maven repository that we're maintaining from the lab. Okay, so this brings me to a place where I want to mention that the data I'm showing you in the following examples is from this fabulous COSEM project, with all these great people, which has trained deep neural networks to predict a multitude, over 30, of different organelles from FIB-SEM data, and here you can see a small snapshot of this. You see that the neural network knows what a plasma membrane is, what mitochondria and mitochondria membranes are, where the ER is, that this is a different cell, and that the ribosomes are denser here than there.
Okay, so we'll see some of this data in the following examples. So let's go back into the IDE and go into N5 tutorial 1. N5 tutorial 1 is meant to show you how simple it is to open an N5 dataset, which can also be an HDF5 dataset as I mentioned before, from Java code. So here we say we have an HDF5 container, and this is one that sits on AWS, on Amazon's cloud, and we want to open a dataset in this N5 container; these are two things, the container and the dataset. This is also how HDF5 works: you have the surrounding HDF5 file and then you have the dataset. Here I have a little convenience function that opens the right reader for various URL schemes, so it understands what an AWS reader is, what a Google Cloud reader is, what an HDF5 reader would be, and when it should use Zarr; that's a little helpful thing. This creates me... oh jesus... so this creates me an N5 reader, which is the interface for all of them, and then I'm using the N5-ImgLib2 library to open, again, an ImgLib2 data structure, which is something that is bounded, so it's an interval in not real but discrete random access space, so an image, and I open this from this container and this dataset, and then I use the BdvFunctions, as we've done before, to show it. So let's see how this works. It'll probably complain that I'm currently not logged in to AWS, which I'm not, but this is a public dataset, so it'll be able to load stuff anyway, and now we can see data popping up here; this is a downscaled version of one of these cells. So that's okay, I can move this around and try to scroll through it, but you will also notice that, so now I'm trying to rotate this, and it's like super stuck; this is not Zoom, this is me, and that's terrible. The reason for this is that we are currently trying to load this dataset lazily from AWS, which accrues some latency, and it takes some time to get this over the internet while I'm doing a Zoom call, right? This lazy loading is blocking; it's good for processing, because you need the pixels immediately, but it's not good for visualization. So BigDataViewer then has some gimmicks to improve this situation. So we changed this a little bit: we do the same thing here, we open a reader, and now we use a slightly different method to open this dataset, one that uses a backend that provides the opportunity to tell the consumer that grabs a pixel whether the pixel is actually there or not, but it always gives you an answer, so it doesn't wait until the pixel is loaded, it just returns you some crap value and tells you that it's not actually there yet. And this is super useful for visualization, because you want to visualize stuff while you're trying to load things, and only once it's there do you want to actually display it. So we can use this for BigDataViewer: we use a shared queue that uses several of the available processors, so we're slowly approaching parallel processing of big data, and we show this through a convenience method that wraps this image as a volatile data structure. So let's practically see how this looks. We get the same, so first it has to find the dataset, right, and then we get BigDataViewer, and you can see that I can immediately move things around, and that the data is not necessarily there when I want to see it, but it drops in, right? And then, because this is all memory cached, it keeps sitting in main memory, and once it exists I can smoothly browse through it. That's pretty neat. It's also not the highest resolution; if I'm interpolating this, it looks a little bit better. Okay, cool.
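The tutorial code itself is in the linked repository; as a rough stand-in, a sketch along these lines shows the non-blocking pattern just described. The container path and dataset are placeholders, and the SharedQueue/VolatileViews package names vary a little between BigDataViewer versions:

```java
import bdv.util.BdvFunctions;
import bdv.util.volatiles.SharedQueue;      // may live in bdv.cache in newer versions
import bdv.util.volatiles.VolatileViews;

import net.imglib2.RandomAccessibleInterval;
import net.imglib2.type.numeric.integer.UnsignedByteType;

import org.janelia.saalfeldlab.n5.N5FSReader;
import org.janelia.saalfeldlab.n5.N5Reader;
import org.janelia.saalfeldlab.n5.imglib2.N5Utils;

public class OpenAndShowSketch {
	public static void main(final String... args) throws Exception {
		// hypothetical local container and dataset; in the tutorial this is an HDF5/S3 container
		final N5Reader n5 = new N5FSReader("/data/container.n5");
		final String dataset = "/em/fibsem-uint8/s4";

		// open with a cache-backed backend that supports "volatile" (non-blocking) access ...
		final RandomAccessibleInterval<UnsignedByteType> img = N5Utils.openVolatile(n5, dataset);

		// ... and wrap it so BDV can render placeholder values while blocks load in background threads
		final SharedQueue queue = new SharedQueue(
				Math.max(1, Runtime.getRuntime().availableProcessors() / 2));
		BdvFunctions.show(VolatileViews.wrapAsVolatile(img, queue), dataset);
	}
}
```

The blocking N5Utils.open variant is the one you would use for processing, where a placeholder value is not acceptable; the volatile wrapping is purely a visualization convenience.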
So, next example: what we can do with this is exactly the same as with the standalone ImgLib2 example that we've seen before, and I decided that we just do exactly the same thing. We convert the dataset, we convert it into doubles, which in this case is actually relevant because it comes in as an unsigned type, then we make three gradients, because this is a 3D dataset, and we show the three gradients with three colors in BigDataViewer. Let's see how this works. Most of the time it's good... you can already see that this is actually generating some stuff, so that's the original image, which we may want to switch off so we can see the gradients, and here they are, three colors, one in X, one in Y, one in Z, and we can rotate things. You can also see that we're using a slightly different cache block size for the gradients than for the incoming dataset: the incoming dataset has relatively large blocks, but the cache for the gradients is much, much smaller. Still, every gradient that needs to be calculated needs to wait for the big block to emerge, but once it's there it's all relatively fast, because it works through this multi-threaded cache queue. So this is on-the-fly processing and caching of data in BigDataViewer.
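A minimal sketch of how such an on-the-fly, cached gradient can be set up with the ImgLib2 cache library (not the exact tutorial code; class names are from imglib2-cache, while the helper name and cell sizes are arbitrary):

```java
import net.imglib2.Cursor;
import net.imglib2.RandomAccess;
import net.imglib2.RandomAccessible;
import net.imglib2.RandomAccessibleInterval;
import net.imglib2.cache.img.CachedCellImg;
import net.imglib2.cache.img.CellLoader;
import net.imglib2.cache.img.ReadOnlyCachedCellImgFactory;
import net.imglib2.cache.img.ReadOnlyCachedCellImgOptions;
import net.imglib2.type.numeric.RealType;
import net.imglib2.type.numeric.real.DoubleType;
import net.imglib2.util.Intervals;
import net.imglib2.view.Views;

public class LazyGradientSketch {

	/** Central-difference gradient along dimension d, computed per cache cell, only when requested. */
	public static <T extends RealType<T>> CachedCellImg<DoubleType, ?> lazyGradient(
			final RandomAccessibleInterval<T> source,
			final int d,
			final int... cellDimensions) {

		final RandomAccessible<T> extended = Views.extendBorder(source);  // padding for the +/-1 neighbors

		final CellLoader<DoubleType> loader = cell -> {
			final RandomAccess<T> back = extended.randomAccess();
			final RandomAccess<T> front = extended.randomAccess();
			final Cursor<DoubleType> cursor = cell.localizingCursor();
			while (cursor.hasNext()) {
				cursor.fwd();
				back.setPosition(cursor);
				front.setPosition(cursor);
				back.move(-1, d);
				front.move(1, d);
				cursor.get().setReal(0.5 * (front.get().getRealDouble() - back.get().getRealDouble()));
			}
		};

		// nothing is calculated here; cells are filled (and cached) only when something pulls their pixels
		return new ReadOnlyCachedCellImgFactory(
				ReadOnlyCachedCellImgOptions.options().cellDimensions(cellDimensions))
			.create(Intervals.dimensionsAsLongArray(source), new DoubleType(), loader);
	}
}
```

Passing small cellDimensions (for example 32, 32, 32) gives the fine-grained caching seen in the demo, and the result can be wrapped as volatile and handed to BdvFunctions.show exactly like the loaded data above.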
Stephan? Sorry to interrupt, it's five o'clock; I would like to know what you still have to present and whether you think you will be done in like five to ten minutes. Yeah, I'm trying, so I'm trying to get where I'm going in five to ten minutes and I'll stop in time. Okay, so then we continue for, like, maximum ten more minutes. So now we have seen how we can visualize these things, but of course visualization is not the only thing that you want to do. So here I have an example where we load this data set and then we copy it. Again, we open the data set, we read some attributes from this file, and then we copy into several output containers, and we're measuring the time. In this case we generate an N5 file system writer, a Zarr writer, and an HDF5 writer, and we see how these things compare time-wise, and we will see something very, very surprising. So first we copy to the N5 file system and it takes a while, 24 seconds; then we copy to Zarr, which was much faster, 5 seconds; and then we copy to HDF5, which is also in the range of 5 seconds. So does that mean that the N5 file system is slower than the Zarr backend? ... Unfortunately I can't hear you. So the answer is no: we're saving the same data structure into N5 FS and into Zarr, and because this thing, through this caching mechanism, is memory cached, the time that we accrue from loading this stuff from the Amazon cloud is only spent once, namely when we're storing into the first data set. And you can actually try it at home and interchange these things; you'll see that the speed is very similar for Zarr and N5, because they're basically doing the same thing, and for HDF5 as well. So now, where does multi-threading become interesting? This is this fifth example, which is exactly the same code as number 4, except that we're now using an executor service, namely a pool of 10 threads that we're using to save these data sets, and we're doing this for all of them. So let's see what happens. My prediction would be that storing the first data set is probably not much faster, because that's a little bit IO limited by my internet connection, but the saving for the subsequent operations should be a little bit quicker. Well, okay, it was faster: 12 seconds for the first one, 1.2 seconds for the next one, and you should remember that this was the motivation to go into this N5 business in the first place. Storing multi-threaded into HDF5, at least with the libraries that we're currently using, is not particularly helpful, because the compressors block each other, and so writing into the HDF5 file is basically exactly the same speed as if I were doing it single-threaded, but writing into a multi-file container gets a lot faster when you're using several threads. And because AWS has a lot of latency, we're actually benefiting from spreading the latency over several threads that load stuff, so even the loading time is a little bit quicker. Okay, so this also shows us that we can use a simple executor service and this N5Utils save method to parallelize workflows, because what we've shown so far is something that generated new data into an ImgLib2 container, right, and we can just save this ImgLib2 container, which is then calculated on the fly, into an N5 output data set or an HDF5 output data set, and we benefit from multi-threading through this executor service. So in this case we would use 10 CPUs to do the job, and we only specify this at the very end. And this is actually what happens in this very last data set, let me see if this is useful; it's actually what's happening in the lazy tutorials: here we have several ops that are cell loaders that implement the logic to fill these individual cells not with loaded data but with something that we process on the fly. The logic there is exactly the same, because a cell is just an ImgLib2 random accessible interval, and you fill this random accessible interval with whatever you please, like a filter or, in this case, contrast correction, and we can also see that storing this stuff out is basically exactly the same as showing it, and so we can use multi-threading to accelerate our operations.
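A sketch of the multi-threaded saving pattern just described, assuming the ExecutorService overload of N5Utils.save from n5-imglib2; paths, dataset names and block size are made up:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.janelia.saalfeldlab.n5.GzipCompression;
import org.janelia.saalfeldlab.n5.N5FSWriter;
import org.janelia.saalfeldlab.n5.N5Writer;
import org.janelia.saalfeldlab.n5.imglib2.N5Utils;

import net.imglib2.RandomAccessibleInterval;
import net.imglib2.type.numeric.real.DoubleType;

public class ParallelSaveSketch {

	/** Save a (possibly lazily calculated, cached) image into an N5 container using a thread pool. */
	public static void saveMultiThreaded(final RandomAccessibleInterval<DoubleType> source) throws Exception {
		final N5Writer n5 = new N5FSWriter("/tmp/output.n5");   // hypothetical output container
		final ExecutorService exec = Executors.newFixedThreadPool(10);
		try {
			// each block is pulled (triggering any lazy computation), compressed and written independently
			N5Utils.save(source, n5, "/gradient-x/s0", new int[]{64, 64, 64}, new GzipCompression(), exec);
		} finally {
			exec.shutdown();
		}
	}
}
```

The same call with the HDF5 backend would still work, but, as noted above, the threads then mostly serialize on the HDF5 library, so the speed-up comes mainly from multi-file containers and from hiding load latency.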
So since I'm running a little bit out of time and have only 5 minutes left, I will skip the other examples; there is a link to the repository available in the chat, or later under the YouTube video, and I want to go into the last part of this talk, which is this one: how do we parallelize things on compute clusters? We use Spark. Spark is a very well established framework for distribution, and the basic magic is that you parallelize by splitting your data into a so-called RDD, the Resilient Distributed Dataset, and these Resilient Distributed Datasets are distributed by Spark over a cluster and can be processed in parallel, and they can do joins and all sorts of interesting things. However, in our scenario the data is usually too big to put the data itself into the RDD, because it faces the same problem that the data has to fit somewhere in main memory, right? So what we do is we basically generate RDDs that only define the block grid in which we want to process, and then inside every processing function, every map function over this block grid, we generate the entire dataset lazily, so we don't actually load it, we just generate the metadata, and then tell the processing pipeline which little block to process, and whether to save this block or accumulate it with other results. So you could do the aforementioned connected components, for example, or contrast correction, filters and all sorts of other things. What's great with Spark is that it has this implicit fault tolerance: on clusters, one thing that always happens is that individual nodes die or don't do the job, because computers fail occasionally; they fail rarely, but if you use very many of them, one of them always fails, and Spark is built to deal with that so you don't have to, which is wonderful, and we need this desperately. So we care not so much about the implicit data parallelism but about the fault tolerance; I talked about how you split these workflows into these grids of processing units, and in these processing units you just ramp up the entire thing lazily and work from that. And another cool thing with Spark is that you can run it not only on our local cluster but also on AWS or Google Cloud, for example. I have to admit that I was lazy, because I'm always using the Janelia cluster, but John yesterday did us the favor and ran a workflow on AWS, and he recorded a little screen video of how he uses AWS's console to start an EMR cluster and then run a small Spark job on it. You can see it here: this is loading the JAR file, the compiled source, and then he provides a few arguments to the program, what it's supposed to do, this is the source N5 container or data set and so on, this is application specific, and he's using Amazon's step functions to concatenate two Spark jobs that are independent. So this is the second one; the first one would do contrast correction on a data set, the second one would generate a scale pyramid of the data set, and then he decides how many he wants and that he wants the cluster to terminate after it runs, and that is basically it. So then, in order to give you a small glimpse into one of these jobs, I have here a small Spark tutorial where you can see how this works in practice. First of all you create your Spark context, then you want to do some global processing before the whole parallelization goes on: you need a reader, you read some input data sets, you create an output in this case, right? Then you use a convenience function to generate the metadata for the block grid that you want to process; this goes into a list, then you make this list into an RDD, and this RDD is then the structure, this resilient distributed data set, that can be distributed by Spark over as many compute nodes as necessary. And then you run the actual code, and this is down here: in the actual code you see that for every grid block's metadata and this data set we do the following: we create the complete lazy data set, we create a new N5 reader, we open the full image, then we do contrast correction using some local cell loaders, we crop the block of interest, and then we write it out.
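A condensed sketch of that structure (not the actual tutorial source; container paths, the processing step and the exact N5Utils overloads are assumptions to be checked against the repository):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.janelia.saalfeldlab.n5.DataType;
import org.janelia.saalfeldlab.n5.GzipCompression;
import org.janelia.saalfeldlab.n5.N5FSReader;
import org.janelia.saalfeldlab.n5.N5FSWriter;
import org.janelia.saalfeldlab.n5.imglib2.N5Utils;

import net.imglib2.FinalInterval;
import net.imglib2.RandomAccessibleInterval;
import net.imglib2.type.numeric.integer.UnsignedByteType;
import net.imglib2.view.Views;

public class SparkBlockSketch {

	public static void main(final String[] args) throws Exception {
		final String inContainer = "/data/input.n5", inDataset = "/raw";
		final String outContainer = "/data/output.n5", outDataset = "/processed";
		final int[] blockSize = {128, 128, 128};

		// global setup on the driver: create the (still empty) output dataset once
		final long[] dims = new N5FSReader(inContainer).getDatasetAttributes(inDataset).getDimensions();
		new N5FSWriter(outContainer).createDataset(outDataset, dims, blockSize, DataType.UINT8, new GzipCompression());

		// the RDD carries only grid offsets (metadata), never pixel data
		final List<long[]> gridOffsets = new ArrayList<>();
		for (long z = 0; z * blockSize[2] < dims[2]; ++z)
			for (long y = 0; y * blockSize[1] < dims[1]; ++y)
				for (long x = 0; x * blockSize[0] < dims[0]; ++x)
					gridOffsets.add(new long[]{x, y, z});

		final JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("block-sketch"));
		sc.parallelize(gridOffsets).foreach(gridOffset -> {
			// everything heavy happens inside the worker: open readers, set up the lazy image, crop, write
			final RandomAccessibleInterval<UnsignedByteType> img = N5Utils.open(new N5FSReader(inContainer), inDataset);
			final long[] min = new long[3], max = new long[3];
			for (int d = 0; d < 3; ++d) {
				min[d] = gridOffset[d] * blockSize[d];
				max[d] = Math.min(img.max(d), min[d] + blockSize[d] - 1);
			}
			// a real job would apply its filter / contrast correction here before writing
			final RandomAccessibleInterval<UnsignedByteType> block = Views.interval(img, new FinalInterval(min, max));
			N5Utils.saveBlock(block, new N5FSWriter(outContainer), outDataset, gridOffset);
		});
		sc.close();
	}
}
```

When a worker dies, Spark simply re-runs the affected grid offsets, which is exactly the fault tolerance mentioned above.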
And if we run this... so we actually have to run this with a bunch of command line parameters that you will see here. You can run this locally; on my local laptop I have only four CPUs, so I'm running a Spark cluster of four CPUs, which is not so much, and then the input parameters, and then you can run this. Spark generates a little bit of output and it also provides you a web interface where you can watch how it works, and here you can see, okay, this is the foreach job that Spark is running, and it's currently doing some work and should be finished shortly; it's just a few tasks, because it's a relatively small data set, so four CPUs are a good test ground for this, right? And then you can see it finishing at some point, and then we have a contrast-corrected data set. Okay, and with this, because my time is over, I want to close the presentation with an advertisement: we also developed a cool tool, Paintera, which uses N5 extensively. Paintera works again; the last time I presented it we had a problem with Java 8, now it works on Java 11, and I want to thank everybody who helped me with this, particularly the people who have been in the lab before, all the fabulous people at Janelia and elsewhere who have generated this massive data, and everybody in this open source community who is developing code. All right, with this I'll stop sharing my screen and stop. Thank you very much. Okay, thank you a lot, Stephan, for your presentation, really great. I would have one question and then we wrap up: in which cases would you go for solutions like AWS Spark, and in which cases would you go for local cluster solutions? Is it a question of accessible resources or of the size of the data? I think it's accessible resources. AWS costs money, as does Google Cloud; oftentimes a local compute cluster, if you have one at your institution, is a little bit more affordable. There is also the proximity of data: storing data in the cloud is also expensive, so if you want to store your data locally you should process it locally; if you plan to process it on AWS or Google Cloud, you should put it up once, then process it there and not move it back and forth all the time, because that costs extra money, right, and it then stays there for as long as it takes and you only consume the result, or even never move it back. The cool thing, for the COSEM project for example, is that we got a very sweet deal from AWS because we provide this data publicly, and they don't charge for public data that is accessible to everybody. That's awesome; it's not a general thing, but you can apply for it if you have data that you want to share with people, and there are ways to get it for free or at low cost. So you would recommend AWS when there is no access to a cluster solution? So the nice thing with the commercial cloud providers is that you don't pay for the maintenance of the cluster, so if what you're doing happens only once in a while, then you go for a commercial provider; it costs more in the moment, but then you don't pay anything else for the rest of the year. Okay, okay, thank you. Thank you very much, Stephan, and thank you also again, Pavel, that was really great. It's time to end this webinar on big data; I hope you enjoyed it, and I would like to remind all the attendees to fill in the NEUBIAS survey. Yes, thank you very much, and see you maybe next time.