The NEUBIAS Academy started in April last year, with the first lockdown, as a series of webinars about bio-image analysis. Today we have 24 webinars online on the NEUBIAS YouTube channel, with around 30,000 registrations and more than 40,000 views on YouTube. We are really happy to see that this webinar series is successful, and happy to have you here today to listen to this one. This webinar is the third in a series of five on big data: two weeks ago we covered visualization of big data, last week registration and stitching, and this week we talk about how to analyze this big data quantitatively. We have three speakers today. Matthias Arzt will talk about Labkit; he is from the MPI-CBG in Dresden, Germany. Anna Kreshuk will talk about ilastik; she is from EMBL in Heidelberg, also in Germany. And Jean-Yves Tinevez will talk about Mastodon; he is at the Institut Pasteur in Paris. Alongside these three speakers we have a few panelists with us who will help moderate the questions and keep the speakers on time. So I will let Matthias start. Matthias, could you show your screen? I will stop sharing mine.

Yes, hello, I'm Matthias, I will share my screen. It should be visible now.

Yes, it's looking good, Matthias. Thank you.

Okay, thank you, Marion, for the introduction. I'm Matthias, I work at the Max Planck Institute in Dresden, and I want to introduce you to a Fiji plugin called Labkit. Labkit can be used to segment big image data. I developed it over the past years with the help of many of my colleagues, so I want to thank Deborah, Tobi, Tom, Gabriela, Robert and Florian for their contributions. Labkit is a Fiji plugin that does pixel classification. Pixel classification is one algorithm that allows you to segment an image, and many of you listening might know it from other tools like ilastik or the Trainable Weka Segmentation. The advantage of Labkit is that it is part of the Fiji environment and of the BigDataViewer framework, which allows it to work very nicely with really big image data.

So what is pixel classification? We try to find out, for each pixel in an image, to which class that pixel belongs. These classes might be, for example, just foreground and background; you can then use Labkit to figure out which pixels of your image are foreground and which are background. This segmentation is usually the first step of an image analysis pipeline: it makes the downstream processing much easier, for example counting the objects in your image or measuring their size. Let's have a look at what this looks like in practice. Here I have Fiji running. I open an image and then select Labkit from the menu. Here we see the Labkit window with the image open. I mark some pixels as background and some others as foreground, and then I run the pixel classification algorithm, which automatically figures out, for all the pixels in the image, whether they belong to foreground or background. If I'm happy with the result I can save it to a TIFF file or process it further in Fiji. Here we see the result, a binary image that can now easily be analyzed in Fiji, for example by counting cells or measuring their size.
Okay, that was the basic introduction to Labkit. Now I want to talk a bit about how we actually manage to segment big images, and afterwards I want to show you two new features in Labkit: GPU support and cluster support.

So how do we manage to segment big images? A big thank you goes to Tobias Pietzsch, who developed BigDataViewer and a lot of the libraries associated with it. Labkit is built on top of BigDataViewer, which allows me to easily show big image data, and Labkit also uses the BDV image file format to store it. Once these basics, visualization and storage, are solved, the first actual problem Labkit needs to solve is the segmentation of the image data, and here the data size really becomes the problem. Some of the images you might want to process with Labkit can be bigger than the main memory of your computer, and that is a problem because before any data can be processed by a computer it needs to be loaded into main memory; if the image is too big, you cannot load it. The solution is simple: we divide the image into smaller blocks, and these small blocks can easily be loaded into main memory and processed one by one. There is a challenge with this strategy: if you segment these blocks independently, the results usually get worse close to the block borders. So we take a lot of care to do this correctly and to segment every part of the image with equal quality, no matter whether it lies at the edge of a block or not.

Dividing the image into blocks also gives us a big advantage: we can quickly preview the segmentation for the user. Segmenting a big image might take a few minutes, and it would be annoying if, as a user, you changed some settings, segmented your image, and then had to wait several minutes before seeing the first results. By dividing the image into blocks we can show individual blocks to the user, and usually, when a user looks at big image data, they only look at a small portion of it, so only a few blocks need to be segmented for visualization. Again, there are a lot of technical challenges behind this, but they have been solved by Tobias Pietzsch in the library imglib2-cache, which I use.

Now I want to talk about the two new features in Labkit, GPU support and cluster support. GPU is another name for a graphics card; new computers almost always have a powerful graphics card, and Labkit can calculate the segmentation of an image on the graphics card, which is about 20 times faster than doing it without one. This feature is still experimental and currently only works for NVIDIA graphics cards, but you can already try it out yourself if you install the update sites listed here: the CLIJ and CLIJ2 update sites and the Labkit preview update site. CLIJ, by the way, is a library that allows us, in Fiji, to use the graphics card for image processing; Robert Haase is the person who made this library, and thanks to him for that. Now I want to show you this in practice again. For this I use an image which I downloaded from the Cell Tracking Challenge and then converted to the BigDataViewer file format using BigStitcher, which you may have seen in last week's webinar. So let's start the video.
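A minimal sketch of the blockwise idea Matthias describes: processing a large array in chunks with a small overlap (a halo) so that results near block borders match the quality in the block interior. This is only an illustration in NumPy with a hypothetical per-block `segment_block` function, not Labkit's actual implementation.

```python
import numpy as np

def segment_blockwise(image, segment_block, block=256, halo=32):
    """Segment a 2D image block by block.

    Each block is read with a margin ('halo') of neighboring pixels so that
    filters computed near block borders see the same context as in the
    interior; the halo is cropped away before writing the result back.
    """
    out = np.zeros(image.shape, dtype=np.uint8)
    h, w = image.shape
    for y0 in range(0, h, block):
        for x0 in range(0, w, block):
            # extended block coordinates, clipped to the image bounds
            ys, ye = max(y0 - halo, 0), min(y0 + block + halo, h)
            xs, xe = max(x0 - halo, 0), min(x0 + block + halo, w)
            result = segment_block(image[ys:ye, xs:xe])
            # crop the halo and write only the inner block
            inner_h, inner_w = min(block, h - y0), min(block, w - x0)
            out[y0:y0 + inner_h, x0:x0 + inner_w] = result[
                y0 - ys:y0 - ys + inner_h,
                x0 - xs:x0 - xs + inner_w]
    return out

# usage with a trivial per-block "segmenter":
# out = segment_blockwise(img, lambda b: (b > b.mean()).astype(np.uint8))
```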
By the way, I'm using a normal Fiji installation with the CLIJ, CLIJ2 and Labkit preview update sites activated. Here I have Fiji running. I start Labkit from the menu, open my BigDataViewer XML file, and here you already see the image. The first thing I have to do is click auto contrast to fix the contrast of the image, and then I can already scroll nicely through it. The size of this image, by the way, is four gigabytes, so BigDataViewer does a lot to display it so nicely and quickly. Now I scroll to an interesting part of the image, I mark some pixels as background, and I mark a nucleus as foreground. Next, I go to the settings of the pixel classification algorithm and activate the GPU support. I adjust a few minor settings, and then I train the classifier. After a few seconds you already see a preview of the segmented image and can have a look at it. I'm happy with this, so I could now save the result as an HDF5 file, or, as I choose here, show the result as a normal Fiji image. To show the result in Fiji the entire image needs to be segmented, which takes roughly eight minutes here. I don't want to wait that long, so I cancel this, and what I do instead is save the classifier that is used to segment the image to a file, to be used later. Maybe a few words about the classifier: in Labkit we train a random forest to segment the image, and once the random forest is trained you can use it to segment many other, similar images; to do so you need to save the classifier, and that's what I do here.

Okay, so that was the presentation of Labkit on big images. You saw that we quickly got a preview of the segmentation; segmenting the entire four gigabyte image takes another eight minutes. That's maybe acceptable, waiting eight minutes for your segmentation. If your image data is even bigger, you might consider using an HPC cluster to calculate your segmentation, and Labkit also supports this: for segmenting an image on an HPC cluster there is the Labkit command line tool. A command line is what you usually use to interact with cluster hardware. I also have a video to show you this, pre-recorded of course. Here you see my computer connected to a cluster. The window on the left side shows a folder on the cluster, which already contains my image data and the Labkit command line tool, compressed in a zip file that I downloaded from the Labkit website. What I do now is copy the classifier that I saved earlier to the cluster, and then I go to the terminal here and run some programs on the cluster.
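For readers who want to see what "training a random forest on a few labeled pixels" looks like in code, here is a minimal sketch using scikit-learn and SciPy. The feature set (smoothed intensities and gradient magnitudes at a few scales) and the label encoding are illustrative assumptions, not the features Labkit actually computes.

```python
import numpy as np
from scipy import ndimage as ndi
from sklearn.ensemble import RandomForestClassifier

def pixel_features(img):
    """Stack a few simple per-pixel features (raw, smoothed, gradient magnitude)."""
    img = img.astype(np.float32)
    feats = [img]
    for sigma in (1, 2, 4):
        feats.append(ndi.gaussian_filter(img, sigma))
        feats.append(ndi.gaussian_gradient_magnitude(img, sigma))
    return np.stack(feats, axis=-1)          # shape (H, W, n_features)

def train_and_predict(img, scribbles):
    """scribbles: 0 = unlabeled, 1 = background, 2 = foreground."""
    feats = pixel_features(img)
    labeled = scribbles > 0
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    clf.fit(feats[labeled], scribbles[labeled])        # train on the sparse scribbles only
    pred = clf.predict(feats.reshape(-1, feats.shape[-1]))
    return pred.reshape(img.shape)                     # dense prediction for every pixel
```

The key point, as in the demo, is that the classifier is trained on a handful of scribbled pixels and then applied to every pixel of the (possibly much larger) image.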
The first thing I do is unzip the Labkit command line tool. Now you see the Labkit JAR file, which is the command line tool itself, and a Snakemake file. All that is left to do is to edit, in the Snakemake file, the name of the image I want to segment, which here is dataset.xml, to specify the classifier, the random forest to be used for the segmentation, which is this Fluo-N3DL classifier, and to specify that the GPU should be used. The next thing is to start Snakemake. Snakemake is a tool that will run the Labkit command line on the cluster; here I specify to run it on 10 nodes in parallel, set some further parameters, and then it starts. Now you would have to wait until your data is segmented, which takes about four minutes on this cluster, so I will skip forward. Now everything is done, and when I refresh this folder you see that there are new files: there is an output folder which contains the segmentation results, and I will use BigDataViewer to open them quickly so you can see them. Here you see the segmentation results in BigDataViewer. So that was the demonstration of how to use Labkit on the cluster; it's really quite simple, and it's also described, with a complete example, on the web page of the Labkit command line tool. That's the end of my presentation, so thank you for your attention. To sum up, you saw how to use Labkit with big image data, which is really quite easy, you saw the GPU support, which makes it very fast, and you saw how you could segment even bigger images on a cluster. I think we now have time for questions.

Yes, thank you, Matthias, for your talk, it was really great. We have a lot of questions from the audience, so maybe I can start with some of them. I would start with one about image size: you mentioned that the image you were using in the first part of the talk was four gigabytes and you classified this as a big image. The question is how Labkit will cope with images that are above 500 gigabytes. You showed afterwards this one-terabyte image, so it seems to be coping well, but I would add an extra layer to the question: as a user, how far can I go in terms of image size without having to go to a cluster?

Okay, so Labkit scales very well. You saw me handle a four gigabyte image, and that took about two minutes per gigabyte, so eight minutes to segment it. If you now have a 500 gigabyte image it will take about a thousand minutes, so roughly 17 hours, but you could still do this with Labkit on your own computer. Opening a 500 gigabyte file with the interactive Labkit plugin and quickly getting a preview of your results will also work; it's just that for calculating the full result you will have to wait that amount of time, which is simply what the computation needs.

Okay. I have two other questions related to the resolution of the file. The first one is: do you take advantage of the pyramidal resolution levels for the segmentation and the annotation? And the second one is: can the user define the block size?

No, I don't take advantage of the multi-resolution pyramid; it is used to show you the image nicely, but when I do the segmentation I don't make any use of the multi-resolution image. The second question was whether the user can specify the block size, and the answer is currently no. So if it's necessary for whatever reason, you can ask me to help you with that and we can see.
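Purely to make the cluster demo steps more concrete, here is a hedged sketch of launching such a Snakemake workflow on a SLURM cluster from Python. The Snakefile itself ships with the Labkit command line tool, the file names (dataset.xml, the saved classifier) are the ones referenced inside it as edited in the demo, and the sbatch resources below are made-up placeholders; the exact invocation will differ per cluster.

```python
import subprocess

# Sketch of driving the Labkit command-line workflow through Snakemake on a
# SLURM cluster. The Snakefile and JAR come with the Labkit command line tool
# download; resource requests and paths below are placeholders.
cmd = [
    "snakemake",
    "--snakefile", "Snakefile",       # the Snakefile shipped with the tool
    "--jobs", "10",                   # run up to 10 cluster jobs in parallel
    "--cluster", "sbatch --gres=gpu:1 --mem=32G --time=01:00:00",
]
# The image (dataset.xml) and the saved classifier are referenced inside the
# Snakefile / its config, as edited in the demo before launching.
subprocess.run(cmd, check=True)
```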
Okay, I'm looking for other questions. Yes, I have questions that are more related to what Labkit can do in terms of processing. Do you support object classification after a pixel classification?

No, it only does pixel classification; object classification is not supported.

And when you're using Labkit, can you also process time lapses, or is it single images only?

Yes, it works for time lapses; it works for 2D images, 3D images, multi-channel images, so you can process all these different image modalities. You can also take EM images and try to process those.

And can you just repeat what the format is? Because you said it's based on BigDataViewer, does it need to be HDF5 files?

No, you can open any image that you can open in Fiji with Labkit, and it works. It's just that if you have big image data it's usually better to save it in the BDV format; an Imaris file would also work.

Okay, so if you have TIFF files it works with Labkit, but it's better to convert them if they are big?

Yes, if you have TIFF files that works too, but for an 8 gigabyte TIFF file it takes Fiji a few minutes to open it, so that's a bit annoying.

Okay, thank you. I'm just looking for the most pressing remaining questions. Oh yes: is Labkit scriptable, can you use ImageJ macros?

Yes, and that's also a new thing: you can use ImageJ macros, there is a macro-recordable Labkit command to segment images.

Okay. And maybe can you comment on why you chose the random forest classifier and not another type of classifier, and will you add other classifiers in the future?

We use the random forest because it's fast, it's a fast random forest implementation, but there are also plans to provide deep neural networks in Labkit.

Okay, that's great. Looking at the questions, I think we went through most of them. For the other questions that are a bit more specific we will type the answers in the Q&A, and in any case all the questions and answers will be posted on the image.sc forum. So thank you again, Matthias, this was really great. We can move on to Anna now.

Thank you for having me here. Hi, so should I start sharing the screen, or is there anything else I should do? Okay, please. Okay, let me find the right one. Do you see my slides? Yes. Oh, it's so good to hear that. Okay, hi everyone, great to see such a strong crowd at the NEUBIAS webinar again. Matthias actually did a better introduction to ilastik than I had planned here, so I will not take as much time as I was planning and will jump through the first slides quickly. I want to show you how to do big-ish data analysis with ilastik. I'm not talking about tens of terabytes here, which I believe will be covered in the next webinar; I'm talking about a still reasonable size, under 100 gigabytes, because we are still talking about interactive training on this kind of data. First, about the team: ilastik is a fairly old project and a lot of people have contributed. This is the current team; they are not all working on ilastik full-time, but they are all committed to its success. Many thanks to the funders as well. Historically, ilastik was started in Fred Hamprecht's lab at the University of Heidelberg back in 2011, or even 2010, and then, as a postdoc in Fred's lab, I more or less took over the leadership of the project. When I started my own lab at EMBL in 2018, it moved here with me.
So the aim of ilastik is to make useful machine learning algorithms available to people without computational expertise, and what was also really important for us is to solve diverse problems with the same approach. I know many of you are already familiar with ilastik, but I'll give a very quick overview anyway. This idea of solving diverse problems with machine learning looks like this: we have all these different workflows, like pixel classification, object classification, tracking, autocontext and boundary-based segmentation, and they are all based on the same principle of the user giving labels, ilastik making predictions, the user correcting the labels, and ilastik correcting the predictions. If you look at pixel classification, which is probably our most popular workflow, it does semantic segmentation, so it attaches a class to every pixel: you start from the raw data, you define your labels, just as Matthias showed in Labkit, you give your labels and then it assigns a label to every pixel. You can do the same for objects: you start from two populations of objects, you give a few labels, and then it assigns a label to every object in the image, either in a volume or in a 3D time series. You can do the same for boundary-based segmentation: you predict the boundaries, then you label correct or incorrect boundaries, and it can give you the segmentation of the full objects. The other workflows also follow the same idea; I just don't want to show all of them because I don't have as much time.

If you look at small data analysis, you just load a small image of, I don't know, two megabytes, you give a few labels, and then you press live update; those of you who have used ilastik have done this many times. Then you give some more labels and it updates itself. The whole procedure really goes this way: you label, it predicts, you look where it's wrong, you label some more where it's still wrong, it predicts again, and in the end, when you're happy with it, you export the result, do your quantification and publish. In a small data setting it's also fairly clear how to do this: you can even digest the whole image at the same time, compute all the features, compute all the predictions, and then everything is really fast and really interactive.

So can you do this for big data? Yes, but what becomes really important in that case is the file format. I know you had a whole webinar dedicated to that at the beginning of the series, but this issue is so important that it really bears repeating. The reason it's so important is that if you are processing big data you want to parallelize things, and even if you don't want to parallelize because you only have one CPU, you at least want to process things blockwise, because otherwise you would not fit into your RAM. If you want to parallelize, if you want to process blockwise, you have to cut your volume into blocks, and if you want to cut your volume into blocks, you have to have a file format that supports efficient reading of sub-volumes. In ilastik, up to, say, medium-sized files, we prefer HDF5; if you do serious big data, say around 100 gigabytes, then you should probably use N5.
And you remember how I said that you cut out these little pieces and then process them independently? In HDF5 this is naturally there if you enable chunking. I know you have now heard about the next-generation file format; we are also very much looking forward to that, and we will support it once it's there. Multi-page TIFFs hypothetically also support loading parts, but for us these parts are too big, so if you want to be efficient in interactive training in ilastik you really have to use HDF5.

So how do you convert data to HDF5? There are multiple ways to do this. The simplest is that ilastik itself has a data conversion workflow; if you have worked with ilastik before you will see that there are just two applets there, input data and then output data, that's it. You put in your data and it exports the HDF5 with the right chunking, the right axes, the right everything, so that's very easy. If you have a dataset which is, let's say, not truly big but medium-sized, and already too big for smooth interactive labeling, then we also have the option where you can just copy your dataset into the project file; it will turn it into HDF5 by itself, and as an additional bonus it will also store everything you do in one place, but of course you will have a copy. That's the 'save to project' option. Both of these work if your original data is in a format that ilastik can read, and ilastik can read standard image formats. If you have something more fancy, or if you want to use virtual stacks or something like that, then you need to go to the ilastik Fiji plugin, where we have a function called 'export HDF5'. There you can open any image that Fiji can open and then export it to HDF5 with the ilastik chunking; you can even do it in a macro, you can script it, so it's all there, all available. And finally you can do it programmatically, which is how we do it all the time; it's fairly easy, and we have a notebook example that does this. Since I also shared the link to my slides, you can see it there; if you program at all, this is really not the difficult part.

Okay, is it actually worth it to do that? Let me try to show you how it looks when you process a big-ish dataset. Let me share another screen. Do you now see the ilastik screen? Yes, we do. Okay, good, thank you. So let's add an image. This is a cutout that we made today; it's a nine gigabyte volume, it's an HDF5 with chunking, about 2000 cubed, and I just want to show you quickly what you can do with it. We can select features; so far it's all like normal ilastik. In this part, because we haven't computed anything yet, nothing really changes; the only thing it has done so far is stream the tiles that are displayed here from the volume. But if you were not storing it in HDF5, if for some insane reason you had, well, you can't really do 3D in PNG, but if you had a gigantic 3D TIFF that was not multi-page but just one giant TIFF, you would have to load the whole thing now and it would already break, and we haven't even started training yet. If you go to training, this is how you can actually make it smooth for large data: you maximize one view, you come to an area that you think looks interesting, like here for example. What I'm going to label now is not very biologically meaningful, but I just want to show you the idea.
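As a companion to the "do it programmatically" option Anna mentions, here is a minimal sketch of converting an image stack to a chunked HDF5 file with h5py. The chunk size of 64 cubed is the granularity mentioned later in the Q&A, and the file and dataset names are just placeholders, so this is an illustration of the idea rather than the ilastik notebook example itself.

```python
import h5py
import numpy as np
from skimage import io  # here the source stack is assumed to fit in memory

# load a 3D stack (placeholder file name); axes assumed to be (z, y, x)
volume = io.imread("my_stack.tif")

# write it as a chunked HDF5 dataset so sub-volumes can be read efficiently;
# the chunk shape must not exceed the volume shape in any dimension
with h5py.File("my_stack.h5", "w") as f:
    f.create_dataset(
        "data",
        data=volume.astype(np.float32),
        chunks=(64, 64, 64),   # small chunks -> arbitrary sub-blocks readable
        compression="gzip",
    )
```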
Right, so this is an EM dataset of a worm. Here we could, for example, say okay, this is muscle, and this is the chromatin in the nucleus, and then we can have another label, and you could have more labels. For this other label I just need to zoom in so that I can label more precisely; we could say this is the membrane, it's like here, or like here. Now we can train on this and see what it comes up with, and then, just like I showed you for the very small dataset, you keep going: now you can say okay, this bit is actually a little more yellow here. This is the general ilastik way of doing things, and it's not that different whether you have a big dataset or a small one. The reason it works is that all the processing here is lazy. If you now zoom out (notice how I stopped live update, and this is very important, because if you keep live update on, then as you scroll out of the previously predicted region it will just pile up more and more jobs and requests for more prediction), but if you disable live update, then just looking at the data doesn't cost much. So now we can also scroll around, look at other things, see how it looks in a different region, go to a completely different slice, for example. We could go here, and then let's find somewhere where we also have muscle, like here, then we can zoom in and predict what it looks like here. This is exactly how we recommend people to do this kind of prediction in larger volumes, because I know everyone here is a big champion of reproducible imaging, but still, conditions change a little bit, and if you want to really extract quantitative results out of this you should verify how well it works in different parts of the volume. That's how we recommend doing it in ilastik: you zoom in, do it in a small sub-part, then you scroll around, pan around, find another area where you think it would work well or not work well, you check how it works, and you predict again there. And now you see that we could actually add some more labels; let's say that the nuclear membrane is also membrane. So we add some more labels here, and then some more around there. In terms of how you should be doing machine learning, this is really the way to go: you should be scrolling around the volume and trying to annotate in different areas, and even if you don't need annotations in different areas, then at least validate in different areas, even for big data.

Okay, so that's how it works. If you now want to go on, there is also the prediction export. If you want to actually predict your whole nine gigabyte block, here is the prediction export, and you can export whatever you want; it will automatically cut it into blocks and then process it on your machine, with as much parallelization as your CPUs allow. This is for the block you currently have loaded; if you want to also process some more, you can just select more raw data here and use batch processing, which is also still on your local machine. Let me now go back to my presentation. So I think it's worth it to store data in HDF5, because you can really train interactively, nicely and smoothly on a big dataset; this one was about ten gigabytes and it was really smooth on the laptop.
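The "lazy" behavior Anna describes relies on the fact that a chunked HDF5 file lets you read just the region you are currently looking at. A small illustration with h5py (file and dataset names are the placeholders from the conversion sketch above):

```python
import h5py

# Open the chunked file without loading it; slicing reads only the chunks
# that overlap the requested region, so viewing a small crop of a 9 GB
# volume stays cheap.
with h5py.File("my_stack.h5", "r") as f:
    dset = f["data"]                        # an on-disk dataset, not an array in RAM
    crop = dset[100:356, 500:756, 500:756]  # only this sub-volume is actually read
    print(crop.shape, crop.mean())
```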
You can go even bigger. So, what is important for interactive training, so that you actually enjoy doing it and don't just sit there waiting for things to happen? You have to remember that the processing is lazy: it only predicts what you're seeing, so if you're seeing too much it will take a long time to predict. Don't run live update while you're scrolling, and don't run live update when you are completely zoomed out. If you have a very big 2D image, like the pathologists have, 20,000 by 30,000 pixels, and you zoom out completely and press predict, it will start to predict the whole thing, because feature computation and prediction always happen at full resolution. Another important thing to know: some people, especially now, inspired by the way neural networks are trained, try to programmatically add dense labels. Don't do that with a random forest: it saturates and then it just chokes on it; it tries to train more and more and compute more and more features, and it doesn't actually lead to better predictions. It's better to scroll around and label the places where it's wrong. Another thing, which I forgot to show you but which is there, you can find it in the docs and in the slides if you need to remind yourself, is that ilastik can also use prediction masks. This means that if you're only interested in a sub-part of the volume, you can load a mask together with your data and it will only compute the predictions in the masked area. We found this quite useful when working with large brain images, because there people are often interested in computing predictions in just one area of the brain, or in computing statistics per brain region; you can mask out different brain parts and just compute for those, and this is much faster, especially if you have a lot of almost-dark background.

After training, as I just showed you, the prediction export applet handles the data used in training, and batch processing handles unseen data; this is all still on the machine where you've been training, or at least on a single machine. So what is parallelized when you're running, given that we all have multi-CPU machines now? All workflows are parallelized across files if you have multiple of them, and all workflows process time steps independently. Within the same volume, pixel classification is an embarrassingly parallel operation; it just chooses a block size that fits into your RAM. For object classification you have to select a block size and a halo to avoid the artifacts of cut objects. All other workflows actually need the complete volume, so if you have a long time series of not-too-big volumes it will still parallelize nicely, but if you have one very big block and you want to run, say, multicut on it, then we have to use special code and we cannot do it within ilastik yet. Then there is headless processing, which is what you do if you actually want to run on a cluster, or if you just don't fancy starting ilastik from the GUI. It's documented in great detail; it has a thousand different options, but I think it's actually fairly intuitive despite that. The most important parameter there is called cutout_subregion, which specifies what part of the data you're working on. Also, if you're running on a really big dataset, don't forget that you need a sensible output format: the N5 format was developed to enable parallel writing.
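To make the headless option more concrete, here is a hedged example of calling ilastik headless from Python. The flag names follow the ilastik headless documentation as I remember it, but the file names, the sub-region values and the axis order are placeholders you would need to adapt and verify against the docs for your version.

```python
import subprocess

# Sketch of a headless pixel-classification run on one machine or one cluster job.
# run_ilastik.sh ships with the ilastik binary distribution; paths are placeholders.
subprocess.run([
    "./run_ilastik.sh", "--headless",
    "--project=my_pixel_classification.ilp",      # trained interactively beforehand
    "--output_format=hdf5",
    "--output_filename_format=results/{nickname}_probabilities.h5",
    # restrict processing to a sub-block; the exact axis order/format must match
    # your data, so check the headless docs for the convention
    "--cutout_subregion=[(0,0,0,0,0), (1,512,512,512,1)]",
    "raw/my_stack.h5/data",                        # input volume (path into the HDF5 file)
], check=True)
```

Submitting many such calls as separate cluster jobs, each with its own sub-region, is the simple per-block parallelization Anna describes next for clusters.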
If you're dealing with really big data, you also have to remember that writing can become a bottleneck. Okay, if you want to run on a cluster, we now also have a way to do that, as a distributed MPI application. There is a flag called --distributed that you can just pass to ilastik headless; it's very efficient and very fast, you don't have to specify the sub-regions yourself, and it can be combined with SLURM. It's documented in the headless documentation. Unfortunately, because it's MPI-based, it requires some understanding of the cluster and of how MPI is set up there to configure it, and it only works with HDF5 input and output for now, because for us that seemed like the most sensible option for big data; we will of course support the next-generation format as soon as it's actually available. We also have a Docker image, which, even though we haven't publicized it anywhere, enjoys some popularity, so if you want to use that we are also happy to support you.

To summarize, the most important thing when working with big data, and if you haven't figured it out by now I will hammer it in again: the file format must allow for efficient sub-region reading. The lazy backend of ilastik then allows you to train interactively on out-of-RAM datasets for pixel and blockwise object classification, and headless processing can run on servers, clusters, clouds, whatever you want; it's fairly easy to run and fairly easy to set up. As for future plans, we are actually thinking more and more about big data, so if you repeat the webinar series next year I hope I can show you something completely different. We are planning to explore Dask, because Dask has a lot of ways to run remote, distributed computation on a large variety of cluster flavors and clouds, so we hope we can piggyback on it and just use their different backends. Another thing is that, as I said, we always compute everything at full resolution, but we are now also working on integration with pyramid-aware viewers; the ilastik volume viewer is not pyramid-aware, but others that we are now looking at are, so we will support pyramids for viewing, and then it only makes sense to support them for computation as well. We are now thinking about the most intuitive way of doing this without people getting too frustrated by losing labels at the different levels of the pyramid, but this is all solvable. So yes, we are still actively developing this part of ilastik, all the big data parts, because, working in a biology institute ourselves, we are also confronted with bigger and bigger data all the time. That's what I wanted to show; thank you very much for your attention. The docs are there, you know where to find us on the forum, we are on Twitter; ask your questions now, look at the slides later, and with all these contact options we are always happy to hear from you.

Thank you very much, Anna, for your talk, that's great, and it nicely complements what Matthias showed. I'm looking at the questions; we don't have so many related to ilastik, some have already been answered by Dominik, but maybe you could say something about the HDF5 versus N5 formats and, more generally, re-explain why this kind of format is suited to large data.

Yes, so HDF5 is basically one file for everything, but inside the file you can store your dataset in little pieces that you can then read one by one, and usually the chunking is, say, 64 cubed.
That is fairly granular, so you can read pretty much an arbitrary chunk out of it, and, as I tried to explain in the talk, this is really important, because if you can't read parts of your volume, you will always end up with all your data in RAM, and that just doesn't scale. So that was HDF5, and HDF5 has provided for the community's needs up to a certain size. HDF5 has its hiccups, it's not a perfect file format, but it's the best we had: BigDataViewer is based on HDF5, the Imaris data format is based on HDF5, the ilastik project format is obviously based on HDF5 as well. There is no single shared schema, so it's not that there is big unity about this, but it's a very useful and easy-to-use file format where you dump things in and can read them back in parts. Where HDF5 breaks is when you have to write a lot, and this is why N5 was developed, in Stefan Saalfeld's lab; I don't know if anyone from his lab is following, they could probably explain it much better than I can. The big bottleneck in HDF5 is parallel writing, and what N5 does instead, to put it simply, is write a lot of small files while keeping track of where things are: the chunks that before were all lying in one dataset can now all be separate files, with a master that knows what is where, and this way you can write to all of them in parallel, and obviously read from them in parallel as well. Zarr is very similar, and I believe the next-generation data format, which I'm sure you heard about in the first webinar, is trying to make the best of all possible worlds and actually build something that we could all use, so that we just have one format for big data; how amazing would that be. I hope that kind of answers it.

Yes, I hope so too. I have a question from my side: as a user, when would you go for more than a laptop, when would you go to a cluster, how much effort does it cost me as a user, and how much documentation is there to make the transition?

Well, if you need the interactive labeling, you can't really go onto the cluster; there you still need a single machine. So I would always encourage people to go to the machine with the most RAM they can find. There are cases where it doesn't matter: for example, if your big data is a long series of small PNGs, you don't care, you can just do it on a laptop. If you are talking about big-ish data, then the more RAM you have, the smoother everything will be; you saw that on a 16 gigabyte laptop you can already do pixel classification on big data and get somewhere, and it's actually not so bad. If you have a much bigger dataset, or you just want to go faster, then having more RAM is what helps, and for all the other workflows, which have to read the whole thing into memory, that's where it really starts to matter. As for the cluster: once you have trained the classifier and now want to predict the whole thing on the cluster, or on a different machine, for example an even bigger server where you only have headless access, I think running ilastik headless is fairly easy.
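To illustrate the parallel chunk-writing point from Anna's N5/Zarr explanation above, here is a small sketch using Zarr, which she notes is very similar to N5: each chunk lives in its own file, so independent worker processes can write disjoint parts of one array concurrently. The array shape, chunk size and worker count are arbitrary illustration values.

```python
import numpy as np
import zarr
from concurrent.futures import ProcessPoolExecutor

def write_block(z0):
    # each worker opens the same store and fills one 64-slice slab;
    # chunks are separate files on disk, so disjoint writes don't collide
    arr = zarr.open("predictions.zarr", mode="r+")
    arr[z0:z0 + 64] = np.random.rand(64, 512, 512).astype("f4")

if __name__ == "__main__":
    # create the chunked array once (conceptually like an N5 dataset)
    zarr.open("predictions.zarr", mode="w",
              shape=(512, 512, 512), chunks=(64, 64, 64), dtype="f4")
    with ProcessPoolExecutor(max_workers=4) as pool:
        list(pool.map(write_block, range(0, 512, 64)))
```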
I know people who are, let's say, not deeply computational, and plenty of biologists manage this just fine. For the MPI setup we actually have an example there, and we will provide more examples, for instance for SLURM, but because every cluster configuration is a bit different, it's hard to just provide one script and tell you "you can just run this". That script, for us, is ilastik --headless itself: it runs one ilastik there and parallelizes within it. If you want to submit many ilastik instances and have it coordinated where they write and which regions they read, then you have to submit, say, ten jobs, but each job just calls one command, so I think it's fairly easy.

Okay, and the documentation is on the ilastik website?

Yes, it's all on the ilastik website; there is a big sub-section about headless processing.

Okay, good. Related to that: if you want to prepare data in the N5 format for ilastik, how would you do that, and would you prefer N5 or HDF5?

I think it really depends on how big it is. After a certain size you can't really even manage an HDF5 file, and no one likes to have 500 gigabyte single files, so I think up to, let's say, 100 gigabytes you can probably still go with HDF5, or let's say up to 10 or 50; after a certain point it stops being efficient. I don't actually have any examples for N5 conversion, but I hope the N5 authors have some on their own website; I can look it up and put it in the shared document for you.

Okay, thank you. It's not difficult. Okay, so we will look for the answer and add it to the image.sc thread that we make for this webinar. I have another question related to time lapses: how does ilastik handle time lapses, and do you need to split the time points when you're doing the segmentation?

No. Actually, time lapses are the best case, because in time lapses we never compute features across time; it's always within the same time frame, which means it can be fully parallelized across time. If you want to do a full prediction on time-lapse data, and the individual volumes are small enough, it will just parallelize over time by itself. If you are tracking, there are parts of tracking that require segmentation or object classification, and it will also do those for every time step individually. So time-lapse data is the easiest for us to parallelize.

Okay, good to know. I just saw in the chat, related to the N5 question, that someone posted a tool for copying between N5 and HDF5, so we will also put this back into the questions document. I don't see any other question for ilastik.

Marion, just one question about the overall format, because we are already over time: we still have a full half hour for Jean-Yves, so he doesn't get disadvantaged because Matthias and I took too long. I don't know if there are general questions that you would like to ask all of us?

We don't have much time for those, but maybe we can still do one or two at the very end; overall, Jean-Yves, take as much time as you need, we kept a margin for you. And yes, for all the attendees, exactly as you said, Anna: if you have general questions you can still ask them and we will complete them at the end. Anna has to leave, but Dominik is staying for the ilastik questions.
So thank you, Anna, for your presentation and the answers, it was really great. I will give the floor to Jean-Yves now.

Thank you. Hello everyone, my name is Jean-Yves Tinevez, I work in a small facility at the Institut Pasteur dedicated to image analysis, and I will try to speak a little bit about tracking in the context of large images. I will try to convince you that we have it relatively easy compared to what Matthias and Anna do. What I'm going to show you has actually been done largely by Tobias Pietzsch and Vladimir Ulman; I think they are here today, so if we have complicated questions they can answer them. Roughly speaking, the outline is the following: I'm going to introduce MaMuT and Mastodon (disclaimer: these are our projects, and I won't be speaking much about other frameworks), and mainly I will try to explain that MaMuT is not enough, and why. I will also speak a little bit about practical aspects, where I will basically repeat what Anna just said.

Historically, we got interested in tracking because we were simply trying to plot lineages in the C. elegans embryo, and C. elegans embryos are not large images. For that we made a plugin called TrackMate, something that is tightly coupled to Fiji, and it was good enough, but this is not large images at all. It was based on actually opening the image in Fiji, which means that not only was each time point, each 3D stack, in memory, but all the time points were in memory. Of course this does not scale at all; it is not suitable for large images, and TrackMate is not a good plugin for these questions.

If we speak a little bit about data organization: this NEUBIAS workshop is about big data analysis, and if I take these words of wisdom from Kota, who defined in a famous paper and in numerous discussions in the NEUBIAS context what an analysis workflow is, it's basically that you combine several components into something that gives you an analysis, scientific results on which you can draw scientific conclusions. In the case of tracking it's fairly easy: the first component is what allows you to open and manipulate an image, the tracking component takes this image and outputs tracks, and finally, since the tracks are never the final answer in a tracking pipeline (you want to measure speed, where the cells go, and so on), you do analysis, for instance in MATLAB or Python. Interestingly, if you look at the size of the data items at each of these steps, the images are typically large, they take a lot of space on disk; the tracks don't take a lot of space, they are medium to small; and the analysis phase is smaller still, so you always go through a certain data compression. The question we have now, using TrackMate as the tracking component, is what you do when you have very large images, and in that case, I can tell you, TrackMate breaks. The reason it breaks is that we don't have a good component to manipulate large images; that's what we were missing.
I'm speaking of a fairly old plugin now; I think the work started in 2015 or 2014, when I met Anastasios Pavlopoulos and Carsten Wolff, who are here today. They were interested in big embryos and in measuring them over a long period of time on these fantastic machines, the light-sheet fluorescence microscopes, and what you see here is actually a reconstruction from a seven or nine terabyte dataset, I think. Of course, there was no way we could track anything like that. Fortunately for us, Tobias Pietzsch and the lab of Pavel Tomancak created the BigDataViewer, and together we created MaMuT. I'll introduce it very briefly; it is based on our shared interest in being able to harness and analyze large images, either in the context of developmental biology or, in my case, infection biology. It's super simple. Tobias created at the time a beautiful file format and visualization engine called the BigDataViewer; Matthias introduced it to you, it featured in last week's seminar, and two weeks ago Tobias presented it himself. It has really been at the core of so many interesting developments. And MaMuT simply says: hey, we take the BigDataViewer as the image component and we combine it with TrackMate as the annotation and tracking engine. And it worked. MaMuT is a finished project (I still maintain it, don't worry) in the sense that it achieves all these goals, and Tassos, Pavel and Carsten were able to use it as third-party users to do science. It looks like this: if you know TrackMate you will find an incredible resemblance with TrackMate, but it's really a user interface that helps you get your bearings in large data, with several windows and a lineage tree, which is what this was about. Sometimes people miss it, but I think one of the key interests of the BigDataViewer is that it can manipulate 5D data (time, z, channel, x, y) but also several views of the same data; it was made for the SPIM, so you can combine several views of the sample. This is what was useful for the movie you just saw, but you could perfectly well imagine using it, for instance, for correlative microscopy or for combining different modalities. It's built in, you have it, and it's for free, because it's done already. Scientifically, the MaMuT project was successful: this is the work of Tassos and Carsten published in this paper from 2018, in which they were able to use it to uncover the sources and origins of the lineages of progenitor cells and how they form these beautiful digitations that we see on the shrimp into which this embryo will develop. If you want MaMuT, and you like everything you've seen, you just have to tick the MaMuT box in the Fiji updater; it's documented and ready to use.

The problem was the following. If I take my small pipeline graph again, the situation is now this: we have something that can actually harness very large data, which is great, and MaMuT as a tracking component is still able to track in this very large data. The problem is that in the schematic you see here I drew the tracks as medium to small in size, and sometimes that is not correct.
We initially thought that people only wanted to manually track a couple of lineages, not the full embryo. In the meantime, techniques came out that were able to process whole embryos like these, along with the growing use of compute clusters in biology. And so then we meet, for instance, Katie McDole, who tells us: I would like to track one billion cells, because I have this mouse embryo that I would like to follow over a long time. This is reported in the paper you see here, by colleagues including Stephan Preibisch and Stephan Saalfeld, proposing techniques that can actually detect and track cells in such large data. It turns out that the TrackMate engine was not made for that at all: because it was developed on small images, it could only harness a small number of tracks. You see what I mean: small images, or reasonable images, generate a reasonable number of tracks; suddenly we had gigantic images generating a gigantic number of tracks, and neither MaMuT nor TrackMate could harness this. This is why we had to literally start again from scratch. This is the moment when Tobias patted me on the shoulder and said, you know, we probably have to start from scratch, and this is Mastodon. So far it is our best effort at proposing something that can track a very large amount of data. Mastodon just reached beta, I think around the I2K conference, and its user interface looks like this. Again you will recognize features that resemble what you find in TrackMate, such as the wizard-based detection and tracking steps, and the lineages shown in TrackScheme, but what is super important for us is that everything is interactive: you can manipulate millions and millions of tracks while still keeping the ability to interact with the data close up, edit manually, move one cell among a million, and so on. It's in beta, it's free to use; I don't have time to do a full demo right now, but maybe I can show you a bit if we have the time. Roughly speaking, what changed in Mastodon is that we had to rewrite everything: we still use the BigDataViewer as the image component, which means everything you did with BigDataViewer still works, but we had to restart from scratch for the track data model and write efficient data structures. Here is a quick benchmark, just to show the amount of space it takes to store certain lineages (I realize the alignment is not good, never mind). You see that, Mastodon versus MaMuT, the data structure, which is mainly Tobias's design, is so well done that it takes much less time to create and much less memory, to the point where MaMuT would not even be able to display such a lineage while Mastodon has no problem with it. You get something that is 30 times smaller in memory and 30 times faster in processing time. And one of the cool things is that it's easy and we don't need any complicated or elaborate hardware: everything here was done on a 2012 MacBook Pro.
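To give a feel for why a dedicated track data model can be so much leaner than keeping one object per spot, here is a toy Python/NumPy illustration of the general idea of storing spots in flat, pre-allocated arrays ("pools") instead of individual objects. This is a simplification for intuition only, not Mastodon's actual implementation.

```python
import numpy as np

class SpotPool:
    """Toy 'struct of arrays' store: one row per spot, no per-spot objects.

    Keeping positions, time points and links in flat arrays keeps memory
    compact and iteration fast, which is the general idea behind storing
    millions of tracked spots efficiently.
    """
    def __init__(self, capacity):
        self.xyz = np.zeros((capacity, 3), dtype=np.float32)   # spot positions
        self.t = np.zeros(capacity, dtype=np.int32)            # time point of each spot
        self.parent = np.full(capacity, -1, dtype=np.int64)    # link to previous spot (-1 = track start)
        self.n = 0

    def add(self, x, y, z, t, parent=-1):
        i = self.n
        self.xyz[i] = (x, y, z)
        self.t[i] = t
        self.parent[i] = parent
        self.n += 1
        return i                                               # an integer id instead of an object

# usage: a two-spot track linked across time points 0 and 1
pool = SpotPool(capacity=1_000_000)
a = pool.add(10.0, 20.0, 5.0, t=0)
b = pool.add(11.0, 21.0, 5.0, t=1, parent=a)
```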
Now, there is plenty of cool stuff in Mastodon that I will not go over, and that actually required a lot of work, sometimes excruciating; it's really made for end users, without complications. The one thing I would like to explain in this last part is how to go forward using a sensible data organization. As we move from small or reasonable images to large images, we generate a large number of tracks and a large amount of data, and in the end this data has either to be curated or manually annotated by several people, by many people. So how do you work with that? How do you share the load of having several scientists working on it? What I will detail now are some of the practical tools that work in practice, starting with the data organization, that is, the image data management. The most classical way of handling your data when it comes to tracking is having everything on your computer. You would have the BDV files, which are actually made of three files: the HDF5 file that stores the pixels, the XML file, which is the master file and contains the positions in space and the metadata, and then the settings XML file, which just contains the display settings. And then finally you have a Mastodon file that points to the image and contains the tracks. So the image data and the track data are separated, which is very convenient, but here they are all together on one machine. Now, the HDF5 file can be several terabytes, so it's not a good idea to have it on your laptop; you want to have it in a centralized place where it's backed up. A classical approach that I've seen several times, and which does separate the storage a little, is the lab network drive; I think until recently it was relatively common to have that, but it's not a fantastic approach. Why not have something like this instead: the desirable data management, I would say, is that you have the very large files, the ones storing the pixels, on a server, and as image data on your laptop you just have the master file. It turns out that HongKee Moon and Tobias Pietzsch created exactly that a while ago: the BigDataServer, something that runs on a remote machine and serves the sub-volumes I told you about (I'm sure you have seen that before), so that in your master file, instead of having a path to a file, you have an http address. In the metadata of this talk, if you look at the Google Doc Marion shared with you, you will see that you can download the Mastodon file that contains exactly that. I will share my other screen, if that's okay with you, once I find the control to do so. I hope you see my screen, please say yes. Right, so this is the Mastodon user interface. If you follow the instructions and download the file, this is what you get, and what you see here is a Tribolium, again taken from the Cell Tracking Challenge; I think it's a 400 gigabyte dataset. But this dataset is not on my hard drive at all; it is served remotely, and I'm at home, in the Paris region.
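To illustrate the point Jean-Yves just made about the master XML pointing at a server instead of a local HDF5 file, here is a small sketch that swaps the ImageLoader of a BDV dataset XML for a remote one using Python's standard library. The element names (format="bdv.remote", baseUrl) follow my recollection of the BigDataServer convention and the URL is a placeholder, so treat the details as assumptions and compare against an XML actually exported for a BigDataServer dataset.

```python
import xml.etree.ElementTree as ET

# Rewrite a local BDV dataset.xml so that it points at a BigDataServer URL
# instead of the local HDF5 file. Element names and the URL are assumptions
# based on the BDV remote format; verify against a real server-side XML.
tree = ET.parse("dataset.xml")
seq = tree.getroot().find("SequenceDescription")

old_loader = seq.find("ImageLoader")          # the local bdv.hdf5 loader
seq.remove(old_loader)

remote = ET.SubElement(seq, "ImageLoader", {"format": "bdv.remote"})
ET.SubElement(remote, "baseUrl").text = "http://bigdata.example.org:8080/my_dataset/"

tree.write("dataset-remote.xml")
```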
The data that you see is on a Pasteur machine somewhere, exposed to the network, so if you download this file you will access it, and you can see it's really doing a good job. What's key for me is that everything is interactive, so I can inspect the data, zoom, rotate and explore it, because we are microscopists and we need to interact with the data. As I move in time, you see the pyramidal resolution kick in as I zoom, and the data is loaded remotely, but nothing compromises the interactivity, so it's a very good solution; Mastodon simply plugs on top of that without any issues. What you see here in Mastodon is actually the result of the fully automated tracking of this dataset, run locally on my computer. I can color by spot intensity, for instance, so that you can see with me how the cell divisions happen, and it didn't take that much time to get this: the fully automated detection only took nine minutes. I have this on my computer; the full Mastodon file is maybe 14 megabytes, I think, and I have all the track data here with me, with the analysis, the intensities and so on. I can even use the coloring to inspect it, and the lineages of the cells (these are not true lineages, because I didn't detect the cell divisions) can all be browsed. This is kind of the desirable data organization: you have the heavy data remotely on the server at Pasteur, while I can do all the processing locally with little computational overhead, still have all the track data locally, and everything stays very interactive. It's not something that runs through a browser; it's really very responsive, I can track, delete things, undo, everything. And so there's an experiment we wanted to carry out: okay, if we have one copy of the data stored somewhere at Pasteur, how many people can work on it simultaneously? We tried this at the last I2K conference tutorial: we had two sessions of four hours, at very odd hours of the day I would say, where we taught people how to use the software, and one of the things I said was: look, there are twelve people each time, let's all try to connect to this simple server that serves big data and see if we can crash it. And we failed to crash it; and you have to believe people, because they were connecting from very different regions of the world, but they all said yes, it's still interactive and you can still browse it. So this is something real. The only thing I had to give them was these files: the Mastodon file pointing to the XML file, and the XML file pointing to the server, and this is what you will find in the materials attached to those sessions. So this is more or less the summary: this is what you've seen on my screen, and what you see here is the raw data. What would be the next step? The next step, if you think about it, is that here I have the image data remotely and the local machine with my local tracks, but sometimes embryos can be so large that you want to have several people curating them, simply inspecting, or actually manually annotating, and so you would like
What would be the next step? The next step, if you think about it, is this: I have the image data remotely and the local machine with my local tracks, but sometimes embryos can be so large that you want several people inspecting or manually annotating them, and so you would like, in the same way, to have collaborative annotations and to be able to share the data together. That is not my work at all; here I'm getting the glory for the work of Vladimir Ulman, who was supposed to be here today. He wrote a Mastodon extension that does just that: you have an image server, the BigDataServer we just saw, but he also put together a lineage server, so people track locally and then push their work up to a lineage server that makes sure all the lineages coming from several people can be merged into one. Vlad actually built plenty of things, such as something that tracks over time how many spots were added manually and by whom, and so on. It works beautifully; this is something we have right now in Pavel Tomancak's lab, with some clever strategies: when you want to merge several lineages it's a good idea to ask people to track different parts of the embryo to avoid conflicts, but there is also conflict resolution.

So roughly, this is what I wanted to say. In practice, when it comes to tracking, my first conclusion is that we have it easy: images and tracks are well separated, and even the automatic detection of cells doesn't require a lot of new image processing or image access. We can exploit the pyramidal decomposition and be very efficient both in memory and in time, and once we have the tracks we don't need the image anymore, so that's very convenient and very efficient. Our main challenge is that when you move to huge images you generate a huge amount of tracks, and this is what prompted us to trash these two guys: TrackMate and MaMuT are simply not good enough for very large images, because large images generate a large amount of tracks. So you need something specialized, not only for the image data but also for the track data, and our best answer to that is Mastodon. Finally, I wanted to draw your attention to the existing solutions for reasonable image sharing and storage. This kind of BigDataServer is really great, and there are other solutions too: we are looking, for instance, at the N5 file format stored on a server or on Amazon, or at what the OME consortium is generating now; there is a small Python sketch below of what reading such remotely stored data can look like.
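As a side note, and not part of Mastodon itself, here is a minimal sketch of how one might read an N5 container stored on S3 from Python using zarr. The bucket name, the dataset path and the store options are assumptions made for illustration, and the details depend on the zarr and s3fs versions you have installed.

    # Minimal sketch: lazily read one block of an N5 container stored on S3.
    # Bucket and dataset path are hypothetical; requires the zarr and s3fs packages.
    import zarr
    from zarr.n5 import N5FSStore

    store = N5FSStore("s3://example-bucket/embryo.n5", anon=True)  # assuming a public bucket
    root = zarr.open(store, mode="r")

    setup0 = root["setup0/timepoint0/s0"]   # hypothetical BDV-style dataset layout
    print(setup0.shape, setup0.dtype)
    block = setup0[0:64, 0:64, 0:64]        # only this block is fetched over the network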
Now with this I have to thank the awesome people I've worked with, who helped me more than anyone, Vladimir Ulman and Tobias Pietzsch in particular, and also, on the Pasteur side, Vlad, Dmitri and the others who helped me set up the whole BigDataServer system. With this I thank you very much for your attention; here are the links if you are interested in actually testing Mastodon, and there is even the manual. Thank you for your attention.

Thank you a lot, Jean-Yves, that was a really great talk about tracking in big data. I have a first question for you: in TrackMate you have the possibility of automatic detection of spots; is there something similar in Mastodon? Yes, in Mastodon you have both: fully automated, which means it goes through the whole data set and does the detection everywhere, and semi-automated, where you click on a cell and it follows it over time. Okay. Oh, I have a nice question about the future: will Mastodon replace MaMuT and TrackMate? It will replace MaMuT, because all the functionalities are in Mastodon and there is even more in Mastodon than in MaMuT. TrackMate will stay, because I'm actually working on something focusing just on 2D images, small images, but bringing machine learning and deep learning components to it, and interoperability with existing software; this I cannot do in Mastodon, but I can do it in TrackMate.

Okay, that's a good point. So the examples you showed us in the presentation are based on tracking of nuclei, but do you have other examples of particle tracking with your tools? Wait, is it okay if I share my screen with the demo again? So the key limitation, and we get this question a lot, is that the tools that can automatically detect objects in TrackMate and Mastodon are very simple so far; they are all tools that are good at detecting what I call blobs, and a blob is something that is roundish and bright. So if you have something like a cell labelled for its membrane, something that has a torus shape or a complex shape, Mastodon will not be good for that; it really needs to be something that resembles a blob. But as soon as you have that, it is, honestly, and I don't want to show off, reasonably good; there is a toy sketch of this kind of blob detection just below. And is it possible to have the merging of particles? Absolutely, it's made for that. We started, with Tobias actually, by porting all the algorithms that we have in TrackMate and more, so there is fusion detection and track-splitting detection. Most of the time we take the practical approach and say we will inspect everything by hand and correct for the cell divisions, because they are very rare events and very crucial for the detection. But I'm sure that with a good scientific project as a background we could develop something more sensible; for instance, Anna Kreshuk didn't speak so much about it, but she has awesome tracking algorithms in ilastik and they are very good at that, so that's a solution too.
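Just to illustrate what such a blob detector responds to, here is a toy Laplacian-of-Gaussian blob detection in Python with scikit-image. This is not Mastodon's own Java detector, only an illustration of the same idea, and the parameter values are arbitrary.

    # Toy illustration of LoG blob detection: roundish, bright objects on a dark background.
    # Not Mastodon's detector; just the same idea, sketched with scikit-image.
    import numpy as np
    from skimage.feature import blob_log

    # Synthetic image with two bright, roundish "nuclei".
    img = np.zeros((128, 128))
    yy, xx = np.mgrid[0:128, 0:128]
    for cy, cx in [(40, 40), (90, 80)]:
        img += np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * 5.0 ** 2))

    # Each detection is (row, col, sigma); the blob radius is roughly sigma * sqrt(2).
    blobs = blob_log(img, min_sigma=3, max_sigma=8, threshold=0.1)
    for y, x, sigma in blobs:
        print(f"blob at ({y:.0f}, {x:.0f}), radius ~ {sigma * np.sqrt(2):.1f} px")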
Okay, yeah, we have five minutes to finish. I have some more questions about how you handle the big data: basically, do you support GPUs, and what would you recommend in terms of specifications, mainly RAM, when you are processing data with Mastodon? And when you do your benchmarking tests, what kind of computer do you use? While I'm listening to you I'm just tracking these cell divisions. I'm very happy to say that we don't support GPUs; you don't need a GPU to work in Mastodon. We had this discussion before, and it's a good thing: Mastodon is a software that just requires Java and nothing else, so there is no dependency, nothing. It can run on a cluster, it can run offline, it can run on an old computer with little memory; it's made to be interactive on a small laptop. This one is not a fantastic machine at all, and I have tried to run it on a very old PC with four gigabytes of RAM; four gigabytes is a bit on the low side, particularly if you want to do the automatic stuff, so I recommend an eight-gigabyte laptop. The most important thing, if you want to enjoy the interactivity, is to use a mouse. In the demos and tutorials we make people use the trackpad, and it's really not the same; a mouse is more important than a GPU for Mastodon.

Okay. I would have a side question, first about the interaction with data that are stored on a server. You talked about BigDataServer: is it a technology that could also be applied to LabKit, or that could be used by any tool based on Big Data Viewer? Here the credit goes to the programmers, Tobias and Matthias among others, who made this; I think everything built on Big Data Viewer can work with the BigDataServer, right, Matthias? Yes, you can store your image data on a BigDataServer and use it with LabKit, but for the segmentation it has to download the highest resolution level, so it sometimes takes time. For us, moving from local storage to a BigDataServer is literally one line in the file: you point to the server, you give the IP of the server instead of the path to the file, and it's as simple as that.

Okay, that's great. I have a second question, which is also for Matthias and Anna. Jean-Yves, you showed the possibility to collaborate in Mastodon, the possibility to annotate a dataset together, which is really important when you have such large data: when you want to track billions of cells it's a problem to handle the file, but it's also a problem to curate the data with a human. Is it possible to do this in LabKit or ilastik, or will it be possible in the future, this kind of collective annotation? I don't know about ilastik, but I don't have it in LabKit. I know that people nowadays use LabKit to manually create dense annotations of their images and later use them for neural network training; I think that's something really interesting and I want to improve it, but I need to find time for it. And for ilastik, in principle you could combine labels from different project files, so if everyone works on the same data you could probably write something in Python that combines them, but we don't have anything built in.

Okay, thank you. So we are reaching the end of this webinar, it's five o'clock. I would like to thank again the three speakers of today, Anna Kreshuk, Matthias Arzt and Jean-Yves Tinevez, for their great talks, and I would also like to thank all the participants for staying until the end. Please fill in the survey that we posted in the chat window; this is really important for us and for the speakers to understand what went well, what did not go well, and what your expectations are for these webinars. So thank you to all of you, and see you next week for the next webinar on big data. Thank you, bye everyone, thank you.