Okay, welcome everybody. This is day three of the Defragmentation Training School, and today we will see example image analysis workflows. We have the pleasure of having with us Daniel Sage from EPFL Lausanne, Beth Cimini from the Broad Institute, Anna Klemm from SciLifeLab, Sweden, and Thomas Pengo from the University of Minnesota. So I ask Daniel to start sharing his screen. Okay, thank you very much. Hello everybody. This is my talk about zero-code deep learning tools for bio-image analysis. My name is Daniel Sage, I come from EPFL Lausanne in Switzerland, and I work in the Biomedical Imaging Group, but also for a new unit called the EPFL Center for Imaging. Today I have an introduction about deep learning for bio-image analysis, then I will make many demos with very easy-to-use tools, and I share all the material for the demos so that you can reproduce them yourself if you want. You don't need any special tools for that. If you go to this web link — maybe you can put it in the chat — you have access to the material if you want to redo the demos I have done. And then I have a conclusion at the end. Okay, so what kind of tasks do we have to solve in bio-image analysis? There are different tasks, and certainly when we start to speak about deep learning, we have to identify well the task that we have to solve. The first one is classification: I give you an image, and you have to classify it, typically malignant or benign in this case. Classification of images is a very common problem in computer vision. There is also denoising: we give an image, and the output is an image, hopefully a better image with less noise or enhanced contrast; we can classify that as a denoising problem. There is also something that works pretty well in deep learning called in-painting: all the cases where some information is missing in part of the image and, using the data and specifically deep learning, you try to fill in these parts. With classical tools we would speak about interpolation. There is a very cool example in the CARE software package, where we try to recover the axial resolution using the lateral resolution — that is a nice example of in-painting. There is another application using deep learning called virtual staining. You know that in microscopy, fluorescence microscopy gives very useful information but can be toxic for the sample. So we would prefer to have an unstained acquisition and try, using a deep learning system, to predict the fluorescence image. This application may be a little bit at the limit of what is acceptable in the field, but there are also nice papers about it. And obviously the main application that we have in our field is segmentation. Segmentation means we want to divide the image into parts: objects, background, or different structures in the image. There are two kinds of segmentation that are very important. The first is instance segmentation, where we mainly draw a box around each object.
That is exactly what we see in computer vision with car tracking, or when a self-driving car tries to identify the scene. That is instance segmentation: we are able to count the objects, give their positions, and so on, and maybe also classify the regions of interest. And there is the other one, which is called semantic segmentation. In this case it is more pixel classification: for each pixel we give a class. Here it is a binary segmentation, so we give the class zero for the background and one, or maybe 255, for the cells. That is obviously the main application that we have in bio-image analysis, typically to identify cells, nuclei and so on. I usually prefer to call it pixel classification, to point out that for each pixel we assign a class. And that is exactly the demo that I will do: pixel classification. It is the simplest thing to start with in deep learning. Now, when we have an image analysis question — we have seen that many deep learning systems work very well, and people just ask, which tool should I use? — I want to reframe the story a little bit. The first thing to start with, and that is what the other teachers will explain to you, is what I call the pixel-based approach. That means starting with classical methods, like thresholds, digital filters, morphological operators and so on, to extract features. Here you have to work pretty hard to design, to engineer, your system, and typically at the end you have a workflow of image analysis operations — it could be a Fiji plugin or a Fiji macro. And if it works, you don't need to go to deep learning: why would you try to learn something if the standard method works well? Another approach is to use a physical model, a model-based approach. Here you try to have a mathematical model of the image acquisition or of the specimen. That is typically what we do in deconvolution or super-resolution, or with active contours, where we have a model of the shape. When you have a strong model of the shape, you can design an optimization program to find the model in your data; that is typically what we do with MATLAB code. It is also a very good approach, but there is more mathematical background behind it, because you have to set up your equations in such a way that you can then call a solver. But if you have an explicit model, it can be a very good approach to solve the problem, because you are sure to have an optimal solution at the end. The other approach you can take is machine learning, let's say shallow learning. Here the idea is to use handcrafted filters: you pass your image through a bank of filters, and then you only use a classifier to classify the features that you have extracted with the filters. The classifier is a random forest; it is the one used in ilastik, which is a very nice piece of software.
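To make that shallow-learning idea a bit more concrete, here is a minimal sketch of the approach, assuming scikit-image and scikit-learn are installed; the filter choices and the scribble-style annotations are illustrative only, not ilastik's actual internals.

```python
# Sketch of shallow-learning pixel classification: a small filter bank plus a
# random forest, trained only on annotated pixels (labels: 0 = unlabeled,
# 1 = background, 2 = cell). Illustrative; ilastik's real feature set is richer.
import numpy as np
from skimage import filters
from sklearn.ensemble import RandomForestClassifier

def feature_stack(img):
    """Per-pixel features: raw intensity plus a few smoothing/edge filters."""
    feats = [img,
             filters.gaussian(img, sigma=2),
             filters.gaussian(img, sigma=4),
             filters.sobel(img),
             filters.laplace(filters.gaussian(img, sigma=2))]
    return np.stack([f.ravel() for f in feats], axis=1)

def train_and_predict(img, scribbles):
    X = feature_stack(img)
    y = scribbles.ravel()
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
    clf.fit(X[y > 0], y[y > 0])               # learn only from annotated pixels
    return clf.predict(X).reshape(img.shape)  # a class for every pixel
```

In tools like ilastik all of this sits behind the graphical interface; the point is only that the features are fixed and only the classifier is learned from your annotations.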
I am not sure about the pronunciation of ilastik, but if you want to try it, it is a very useful piece of software with a graphical user interface that you can play with; it is very easy to use and gives pretty good results too. That is typically what we do with this kind of approach. And if the pixel-based approach does not work, if you do not have an explicit model for a model-based approach, and if machine learning, the shallow learning, does not work — if you do not have an explicit way to set up your problem — then you can go to the deep learning approach, which is completely based on the data. Here you do not have anything to engineer; you just have to prepare the data. Your goal is to provide data — and I will explain it a bit later, not only the raw data but also the annotated data — so that you can feed a training system with these two kinds of data and learn an artificial neural network. That is the deep learning approach, and here you typically have to deal with libraries like TensorFlow or PyTorch, which are the huge frameworks we have in this field. So I will introduce this field of deep learning a little bit. Usually it is pretty hard to set up these kinds of systems, but I will show, for one specific application, very easy tools that you can use without any code — that is my claim. You see in my slide there are two kinds of approaches: the engineering approach, where a human designs the system, and the learning approach, where the system is not designed by a human but is designed from the data itself. Those are the two parts that we have in image analysis. I don't want to comment on everything here, but in the model-based, engineering approach, what is nice is the explainability: the results are very well explainable, we know from the mathematics what kind of output to expect, we can analyze the errors and track what kind of error we make, because we have a very explicit processing chain. On the data-driven side there is no easy way to explain the result — that is the black-box effect: we provide the data, we press the button, and we expect that it works. On the other hand, the main advantage of these data-driven methods is a strong adaptability to the data: we don't need to give the system all the rules to detect the objects, the system will learn them from the data. At the end of the day, for a bio-image analyst, when you have a problem you have a toolbox: you open your tools, you have a kind of image problem to solve, and you can say, I can use the watershed, morphological operators and so on — and now there are also deep learning tools available to us. Based on deep learning, for denoising there is Noise2Void; for instance segmentation we have YOLO, StarDist, and software like Cellpose, which works very well; and there is also one called the U-Net, which I will explain a little later — the one that we use for pixel classification in this talk.
So we have two paradigms. Don't forget that the classical methods, the model-based methods, can work on some kinds of problems, and if they work we should use them for sure; and when there is no explicit way to describe the problem, we can go to deep learning using the tools that are available now. Deep learning in bio-image analysis covers image classification, object classification and object detection, image enhancement, and certainly image segmentation; some people are also starting to solve tracking problems with deep learning. But that is not the only place where deep learning is important for us: it is also important in image reconstruction. Typically denoising, deconvolution, registration and super-resolution problems can be solved, or at least we can remove reconstruction artifacts, using the data itself with deep learning. But for both we need data — and when I say data, it is a lot of data; everything is based on this kind of big data. And here I put data plus labels: ground truth, annotated data, so that the system can learn. It is not only the raw data coming from the microscopes, it is data that a human has annotated using some tools. That is what we do with supervised learning — 99% of deep learning is supervised. There are other things in deep learning, like weakly supervised or self-supervised learning, which I cannot cover here; they are obviously more difficult to train, but people try them to avoid this bottleneck, which is the labelling — very often we do not have enough labels to train. Okay, I just wanted to set up the deep learning workflow; it is probably a reminder for many people. You should know that the first step is the training: when you have a new system, you have to provide the raw data and the annotated data. If it is pixel classification, we provide the raw data, which here is DIC or phase contrast, and we provide the labels, with the background in black and the cells in white, and so on. We provide a lot of data, and then we train the model — I will open a Jupyter notebook that does these kinds of things. When the model is trained, you can save it and reuse it to make an inference: you take your model, you take your images, and with very simple steps you can make a prediction, running the model, and you expect to have a good output. When you train, something very important to check is the learning curve. Training is an iterative process; the iterations are called epochs, and epoch after epoch we check the error, the difference between the predicted image and the annotated data, and the error should decrease. In machine learning we always have two datasets: the training dataset and the validation dataset. The training dataset is used to optimize the parameters of the model, and in addition we have the validation dataset, which we use to control that everything is working well. We have a good fit when the two curves decrease at the same time, down to very low values. Here is an example of a good training.
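As a rough illustration of that loop, here is a minimal Keras sketch — not the actual notebook code — where a tiny stand-in network is fitted with a held-out validation split and the two loss curves are plotted; the toy model and the random arrays are placeholders for a real U-Net and real image/label patches.

```python
# Toy illustration of training vs validation curves. The tiny conv net and the
# random arrays stand in for a real U-Net and real image/label patches.
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras

X = np.random.rand(32, 64, 64, 1).astype("float32")           # fake raw patches
Y = (np.random.rand(32, 64, 64, 1) > 0.5).astype("float32")   # fake 0/1 labels

model = keras.Sequential([
    keras.layers.Conv2D(8, 3, padding="same", activation="relu", input_shape=(64, 64, 1)),
    keras.layers.Conv2D(1, 1, activation="sigmoid"),           # per-pixel probability
])
model.compile(optimizer="adam", loss="binary_crossentropy")

history = model.fit(X, Y, validation_split=0.2, epochs=20, batch_size=4, verbose=0)

plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.legend()
plt.show()   # a good fit: both curves decrease together to low values
```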
What I forgot to mention: training is a very slow operation — iteration after iteration, we have many, many parameters to learn — so usually you need a very good computer, typically with a GPU. In contrast, inference is very easy, because there is nothing to learn: it is a one-shot operation, you just provide an image, apply the model, and you have the prediction, so you can run it on a standard laptop. I will show you deepImageJ just after; deepImageJ is a way to make an inference directly on an image, on a standard laptop, using the CPU. I should also mention another step which is used a lot, called fine-tuning. Sometimes it is too hard to retrain from scratch because you need too much data, so we take an already trained model — there are a lot of trained models for bio-imaging problems — and we provide a few data specific to our problem: the model, then a few raw data and a few annotated data, and we retrain the model. It will obviously be faster, because some part of the training is already done, and we end up with a trained model specific to our data. That is what we call fine-tuning. But for both training and fine-tuning, the crucial question behind it is the data: you should have very good, curated, labelled data that represents your problem very well. The main question is the data; and when you train, you should check the validation curve to have a quality control of what you do. You also have to choose, to design, a neural network — but I should say that nowadays it is not really a question any more, because there are already good architectures like Cellpose or StarDist, and for pixel classification there is the U-Net, which works very well for this kind of problem. But there is also a cost — not only in money, but also in CO2, because we have to use very powerful computers that consume a lot of electricity, these GPUs or cloud computing. Today I will show you a notebook that runs on cloud computing, on Google Colab. Often you also need an IT expert to set up the computer: if you want to run on your own machine, you probably need to buy a GPU and install these kinds of things, or get access to a server to do the training. So when you start with deep learning, maybe the first thing to do is to check whether somebody has had a similar problem — you don't want to reinvent the wheel, and you certainly don't want to annotate a lot of data if somebody has already done something similar. Maybe an already trained, pre-trained model exists, and there are a few that solve our kinds of problems. For denoising there is Noise2Void — it is not really a pre-trained model, it is a little bit different because it is a self-supervised story, but at least it is a very good deep learning denoiser, and you can provide your images and fine-tune it to your specific problem. In segmentation there are two nice pieces of software: one is called StarDist, the other is Cellpose. They provide pre-trained models that you can eventually fine-tune if the data are not exactly the same as yours. StarDist is very cool to detect round-ish shaped objects in very dense images.
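When one of these pre-trained models already fits your data, you can often run it in a few lines of Python. Here is a sketch assuming the cellpose and stardist packages are installed; the exact calls may differ between package versions, and the image file name is a placeholder.

```python
# Sketch of using published pre-trained models directly from Python; check the
# cellpose and stardist documentation for your installed versions.
from cellpose import models
from stardist.models import StarDist2D
from csbdeep.utils import normalize
from skimage.io import imread

img = imread("cells.tif")                      # hypothetical example image

# Cellpose generalist model
cp = models.Cellpose(model_type="cyto")
masks, flows, styles, diams = cp.eval(img, diameter=None, channels=[0, 0])

# StarDist pre-trained 2D model for round-ish nuclei
sd = StarDist2D.from_pretrained("2D_versatile_fluo")
labels, details = sd.predict_instances(normalize(img))
```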
There are interfaces for Python and Jupyter notebooks, and now even for Fiji. Cellpose is also a very good piece of software; they have a lot of pre-trained models that you can probably use directly, in 2D and now also in 3D. That is the way to avoid writing code or retraining something for your problem. And there is YOLO, which is more for detection: you want to detect objects in images and draw a rectangle around each object. There is now also another place where you can find pre-trained models, a new initiative driven by many people in the field, called the BioImage Model Zoo. It is a place where you have links to pre-trained models, and the effort of this community is to define a model format that you can share between different software. So if a model was trained with, let's say, ZeroCostDL4Mic — one of the places where we can train a model — you can use it in deepImageJ, you can use it in ImageJ, or you can use it in ilastik. It is an effort to standardize the models so that you can exchange a model between different platforms and have better interoperability. So please, before you start, visit these places — StarDist, Cellpose, or the BioImage Model Zoo. Maybe you will find something similar to your problem, and if that is the case you can use it directly, or maybe just do a fine-tuning step. In the BioImage Model Zoo, to have a model on the Zoo, people have to provide everything: the dataset and the way to retrain and fine-tune, so all the information is open. Today I want to introduce two things. The first is ZeroCostDL4Mic, and I will make a demonstration with it: it is a set of self-explanatory notebooks that run on Google Colab. Google Colab has the advantage of being free — okay, not completely, because you have to provide a Gmail address, put your data on Google Drive, and obviously Google will scan your data; that is the drawback of Google Colab. And there is an export to the BioImage Model Zoo. So it is a way to train without any setup — you don't need a GPU, you access the cloud computing of Google Colab — and at the end you can export to the Model Zoo. Then I will also make a short demo of deepImageJ. deepImageJ is a way to run inference in ImageJ; it is a standard plugin, you take a model and you just say, I want to apply this model to my image, directly in ImageJ, which is a nice piece of software too. So maybe I will start the demos. What I provide to you is the link to this folder: there is a Jupyter notebook and there is some data we can play with, and I have already saved a model. With this Jupyter notebook we can run a training using this data. In the data I have this kind of thing — I will open it — a folder called demos with images like this, and the goal here is to segment these kinds of cells.
I should say that it is pretty difficult. Even me — and consider that I am a specialist in image analysis — if I had to segment these kinds of cells without using deep learning, I would be a little bit lost. If somebody came to my office and said, hey, I want to segment that, you see that it is pretty tough. If you use a threshold it obviously does not work, there is no hope: the grey levels are the same. Probably you could use a sort of bandpass filter, but I am not completely sure — I have tried a little bit, even using some filtering to enhance the image; maybe a local variance filter would isolate the flat areas. You will see that it is pretty tough; I think there is no hope to solve this problem without deep learning. But in this case we are very lucky, because some people have annotated the data. What they have done is to put zero and one in the images — you do not see the ones because we would have to adjust the brightness and contrast — so I have annotated data: for each image there is the annotated data, where somebody has outlined the cells with the mouse. So now I have the material to train the system, and that is what I can do. I will start the Jupyter notebook. What I use is a slightly modified version of the one that you find on ZeroCostDL4Mic — which is a really good place to go and visit when you have to train something. There is a wiki where they explain how to train, and they provide a lot of notebooks — StarDist, denoising, Cellpose and so on. I use a modified version of the U-Net notebook, which is for pixel segmentation, modified so that you can open it in Google Colab. So here I am: when I right-click on it, I open it with Google Colab. Obviously I should add the data on a drive, a Google Drive — it is on my Google Drive — and then I can start, step by step, to run the cells. The first step is to install all the dependencies; here we use a library called TensorFlow 1.15, a pretty old version of TensorFlow, and installing it takes a little time. To run the training, what I have to do is run all these cells: the first one is the installation of some dependencies, some libraries; the second one is the connection with Google Drive — we give the authorization to Google to access, to scan, the data; then we load the data, including the source and the target, where the target is the label data; then there is something I did not mention, the data augmentation, which is there to provide more data; then we create the U-Net, the network that we will use; and the next step is the training, which we can start by clicking these buttons one by one. We will also observe the plot, the learning curve, and at the end there is the export to the BioImage Model Zoo. So, is it done? You will see that here, when I check the GPU — that is my case — I have used the GPU too much on Google Colab, so with this account they do not give me access to a GPU today, and I will not be able to train. But I just want to continue a little bit. I connect to my data — here I use my professional account — authorize, so I give the permission to Google to access my data.
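Roughly, those first notebook cells amount to something like the following sketch: mount Google Drive and point the notebook at the four dataset folders. The paths are placeholders for wherever the demo data sits on your own Drive.

```python
# Rough sketch of the first ZeroCostDL4Mic-style cells: mount Google Drive and
# point the notebook at the dataset folders. Paths are placeholders.
import os
from google.colab import drive

drive.mount('/content/gdrive')                     # authorise access to your Drive

base = '/content/gdrive/MyDrive/demos'             # hypothetical dataset location
Training_source = os.path.join(base, 'training_source')   # raw training images
Training_target = os.path.join(base, 'training_target')   # 0/1 label images
Test_source     = os.path.join(base, 'test_source')       # held out for quality control
Test_target     = os.path.join(base, 'test_target')

for folder in (Training_source, Training_target, Test_source, Test_target):
    print(folder, len(os.listdir(folder)), 'images')
```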
And what happens when I connect: my Google Drive is mounted, so I can go to the data, which I have stored somewhere on the Drive. In this Jupyter notebook I should provide a folder containing four folders: the training source, which is the raw data for the training; the training target, which is the label data; then the test source and the test target, which are not used for the training but only for the quality control at the end. So I copy the dataset path, paste it there, and I can run. In my dataset I have 72 images for the training source, 72 images as target — obviously I should have the same number — and 27 images for the test, for the quality control at the end. The size of the images is 512 by 512, and I can start to read them; they can be TIFF or PNG images. The system starts to read and loads the source and the target images into memory. You understand that in the source images I have the raw data, 8-bit images, and in the target I have an image with zeros and ones: zero in the background, one in the cells. That is my label image. Here I list all the images that I have in the dataset, and I can display one of them — that is an example of the data we have in this dataset: the raw data, and the target with the zeros and ones. Then there is a step that we can do or not — I will skip it here because I do not have a GPU — which is the data augmentation: usually we do not have enough data, so we rotate the raw data and the target image a little bit to increase the variability, or at least the geometrical variability, of the data. Then I define the model itself: I give a name for the model and the path where it will be saved, and here we have one important parameter, the number of epochs. Usually for this kind of data you probably need around 100 epochs, which takes around 10 or 20 minutes if you have a GPU; for me it would never end without a GPU. The other parameters are more advanced; you can keep the default values. After that you can define the hyperparameters; for the fine-tuning part we have nothing to do here, because we do not provide a pre-trained model. We can now prepare the model — I should run this cell; it says recommended, but it is not only recommended, it is mandatory. Here you see that it is very, very slow because I do not have a GPU — it should be faster — but we observe what we call the error, or the loss function, which starts around 0.9, a pretty bad value, because we start from the initial point; this number should then decrease, epoch after epoch. I have already stored a model somewhere to show you what happens if you run with a GPU. So let's imagine that I have a GPU and I run 120 epochs, around 10 minutes; at the end I am able to observe the curve and to save the model in the BioImage Model Zoo format. That gives me a zip file with the name I gave to the system, and then I can download the model from Google Drive and use it directly in ImageJ, to make a prediction on data that the network has never seen.
So, ImageJ, or Fiji — I assume you all know ImageJ and Fiji. Fiji is there, and I have installed, using the Fiji updater, something called deepImageJ: you use the Fiji updater, check the deepImageJ box, and then you have something to make a prediction on your own computer — not using Google Drive; I am running here on my laptop, with a Java-based program. I have saved this model — the model is here, saved from the Jupyter notebook — and now I can try to use it. To use it I should provide an image; let's say I provide one of the test images, an image that was not used for the training. Now I can just run it: DeepImageJ Run, to make a prediction. I get this window, I can load the model — let's say this one, trained with 120 epochs. I will do the pre-processing, which is a mandatory step to normalize the image, but I will not do the post-processing, just to see the raw inference — there is always this pre-processing, inference and post-processing when using deep learning. Then I click OK and I expect to have an output. That is the output of the network: I provide this image and I obtain this kind of pixel segmentation, which is a probability map. When I move my mouse in the black area, you see that the value is very close to zero, 0.03 and so on; when I move the mouse over the white area, I have something close to one. So this pixel has a probability of 0.987 of belonging to a cell, and that pixel has a very low probability of belonging to a cell. If I want a mask of the probabilities, I just threshold: we set the threshold at 0.5, and we now have a binary image — apply, convert to mask. That is the mask; I should invert this image, I don't even know why, and now I have 255 here and zero there. It is the binary mask telling whether the most probable class for each pixel is cell or background. It is not completely perfect, and you see there are some artifacts, but obtaining this result with classical methods would be pretty difficult.
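Those manual Fiji steps amount to something like this small sketch with NumPy and scikit-image; the file names are placeholders.

```python
# Threshold the probability map at 0.5 to get the most probable class per pixel.
import numpy as np
from skimage.io import imread, imsave

prob = imread("prediction_probability_map.tif").astype(np.float32)
mask = prob > 0.5                      # True where "cell" is the most probable class
imsave("binary_mask.tif", (mask * 255).astype(np.uint8))  # 255 = cell, 0 = background
print("cell pixels:", int(mask.sum()), "of", mask.size)
```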
Then there is a question: you decided on a patch size of 512 by 512 — is there anything special about your laptop, I mean your computer, or is it a normal computer with a limited amount of RAM? Okay, 512 by 512 is a very common setting. If you have larger images you can go to 1000 or 2000 by 2000, but around 2000 by 2000 you start to have problems with the memory of the GPU in Google Colab, so it should not be too large at some point. And if it is too small — let's say you take 32 by 32, very small patches — the problem is that, to take the decision, you will not have enough context, and that also penalizes the robustness of your system. That is something you have to deal with; 256 by 256 is also a good size. Due to the U-Net structure, which I don't have time to explain, the size should usually be divisible by 32, because there is a multi-scale structure: we divide the image by 2, by 2, by 2, at least four or five times, so the size is determined by these constraints. Another question: can you fine-tune a trained model using only the graphical user interface of deepImageJ? No, deepImageJ is just for inference and pre-trained models; there is no possibility to train. That is mainly due to the fact that the most common packages, PyTorch and TensorFlow, are accessible through Python, and all the tools for the optimization, for the training, are provided in Python. As far as I know there is no way to do it through a user interface — maybe now with napari there are some connections — but the easiest way that I know is to use ZeroCostDL4Mic, which is a Jupyter notebook. And you need a specific version of TensorFlow, which is TensorFlow for Java, because Fiji is Java; it comes along when you install deepImageJ, you download the TensorFlow Java too. You can also just visit the BioImage Model Zoo and choose a model that makes sense for you — say you have similar images and you want to obtain the membranes — just install this model and you can use it; there is nothing to install, everything is out of the box. I have another model here, called glioblastoma; this one is another segmentation problem. Normally I would have to train a network to do this, but here I can just run it with deepImageJ, and we get an image which seems okay. So now what I have done is to start to interpret the results a little bit. I have prepared an image that I call "attacks": I took these images and modified the input image a little bit. You see that I take this part of the image and copy it there; I add some noise there — speckle noise, Gaussian noise; here I blur the image a little bit, and blur also here; here I create circles containing only noise, there is no data; and here I draw a shape by hand, roughly a circle, to imitate a cell. I make all these modifications, and then we can check what happens at the output. It is free to test, and it is something we really have to do: check the output of the network in different conditions.
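As a sketch of how such "attack" images could be generated in Python — the perturbation parameters are arbitrary, and predict() stands in for the notebook or deepImageJ inference step:

```python
# Perturb a test image in several ways and compare the model's predictions.
import numpy as np
from skimage import util, filters
from skimage.io import imread

img = imread("test_image.tif").astype(np.float32) / 255.0   # placeholder test image

perturbed = {
    "original": img,
    "speckle":  util.random_noise(img, mode="speckle"),
    "gaussian": util.random_noise(img, mode="gaussian", var=0.01),
    "blurred":  filters.gaussian(img, sigma=3),
}

# for name, im in perturbed.items():
#     prob = predict(im)   # predict() is a placeholder for the inference step
```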
Let's use this model and look at the output. Where I copied part of the image, the result is exactly the same, so at least there is no diffusion of the bad stuff. But you see that if I add some speckle noise on the data, the system does not see the cell any more: it looks like background, but it is not background, it is a cell plus noise. And when we have Gaussian noise, what we see is a square, which is completely crazy — we can say the system hallucinates, it has very bad behaviour. On the opposite side, if the image is a little bit out of focus, I can consider the result not so bad. So for this specific case, with the data I used for the training and the model I used, I can conclude that the system is robust to blurring or out-of-focus, but it is not robust at all to noise. And here it is a little bit more dangerous: I do not have any cells, just noise, and we start to see some kind of cells, which is bad; or if I invent a shape that looks like a cell, it is detected too. So you see the kind of things you can find. Now a question for you: if I want to detect the cells even in these conditions, what should I do? Take the model that I have and add many examples of these images, with this kind of noise, to the training. That is always the thing: this kind of patch was never seen by the system during the training, so obviously it cannot work — there are no miracles. What you are able to segment is data that looks like what was seen during the training, not the rest. That is the message. Typically in segmentation people use a lot the Jaccard index, which is also called intersection over union: it is a metric to check this on images — you provide two images, the prediction and the ground truth, and you can compute a certain number of metrics.
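A minimal sketch of that metric, computed between a predicted binary mask and the annotated ground truth:

```python
# Jaccard index (intersection over union) between two binary masks.
import numpy as np

def jaccard_index(prediction, ground_truth):
    prediction = prediction.astype(bool)
    ground_truth = ground_truth.astype(bool)
    intersection = np.logical_and(prediction, ground_truth).sum()
    union = np.logical_or(prediction, ground_truth).sum()
    return intersection / union if union else 1.0  # two empty masks count as a perfect match

# Example: jaccard_index(mask > 0, label > 0) -> 1.0 for a perfect match, 0.0 for no overlap.
```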
There is a question: just to repeat, about fine-tuning of a model — one has to go through the ZeroCostDL4Mic notebook to do fine-tuning, right? Yes, there is a step for it; in fact it is very similar to training. You provide the data in the same way, but only a few data in this case, and then at some point you say, I want to use a pre-trained model: you check this box, provide your model, and run the training. It is just one more step if you want to fine-tune. Another question: regarding some applications, for example when you deal with very noisy images, or applications where you have trained your model on what already exists in science — do you think there is a risk that, by applying deep learning, new things in science may not be discovered, because the model has been trained on existing data and not, let's say, on something new? Yes, that risk exists for sure. It really comes down to the ethics of the people, of the scientists. Before deep learning, when you wrote an automatic system, you checked all the points: you wrote the code, and you were confident that the person who wrote the code did a good job and you had a nice piece of code. Now you place your confidence in the people who prepare the data: the scientist who prepares the data should do a very good job of covering all the aspects you can imagine in your application. So you would say the people in this field have huge power and huge responsibilities, right? Yes, it is a huge risk, in fact, and a huge responsibility, and it is completely hidden. We see, okay, we have data, that is fine — but it is not enough. Validation can help to give a sort of answer, but the distribution of the data, the variability of the data, is really important, and certainly the risk of missing something that has never been seen is huge — in fact, also in industry. Okay, thank you; thank you for ending on these philosophical notes, and thank you once again for the tutorial and the lesson. Now we can switch to the second part. Yeah, welcome all. You know that this session will be about CellProfiler, and about how to run CellProfiler on the cloud, and I have the honour to present the CellProfiler software at the beginning. I guess most of you have heard of CellProfiler and have probably also already used it — and if not, then this is your chance to try it out and to get into it. It is a free and open-source software, it runs on Windows, Mac and Linux, and it is very highly cited. Maybe I can add, as a personal note, that I work at the bioimage analysis facility in Sweden, and we also really use it a lot in user projects. Because it is very good for high-content screening, it is also used by many pharma companies — you can see here the top pharma companies using it. I assume most of you have heard of CellProfiler; CellProfiler Analyst, for example, I learned about a bit later. CellProfiler is really good for image analysis workflows, for quantification of really big batches of images, and CellProfiler Analyst comes in when you have run your CellProfiler pipeline: you can take the output and use CellProfiler Analyst to look at your data, to explore your data really interactively — and not only look at the data, but also use machine learning to classify things, for example, or to plot values. And every time you can go back to the original image and look at the data where the numbers come from, and I think that is super important. I guess that is why you have these two steps: CellProfiler to extract the parameters, and then — once you have extracted parameters with CellProfiler you can of course use all kinds of software — CellProfiler Analyst is quite good to explore them. The interface is such that you see on the left the workflow that you are building. Meaning, let's say you have images with three channels, and you want to crop something out of them, and then you want to identify, let's say, nuclei and so on and measure stuff — then this is your workflow, and you would build it here: you add modules and see the modules that you have in your workflow on the left side, which is also called the pipeline panel. On the right side, for each of these modules, you get the settings panel: you see a description of what is happening in your module, and also where you really set what this module should do. For example here, in this first one where you load the images, you can see which images are loaded in CellProfiler and whether they should all be read, or only those matching a certain filter.
You also get some information here, and you get the complete module help when you click here; and in this little window you can write your own documentation, describing what you want to do. The idea of this interface is that you basically have a test mode to build and test your workflow and to set your parameters. So you go into the test mode, and you can also set an output folder: when you run your workflow, where should the measurements be saved, and where should the control images that you might want to create be saved — this you set in the output settings. And once you are very happy with your pipeline, with the parameters you have set interactively, you can analyze your images, meaning run over all the images that you have in your folder. Here, for example, is a bit of a zoom-in on one workflow: after loading the images, with what are the default modules, the first step is to crop an image, and here you would set the parameters — okay, you want to crop it as a rectangle, and so on, and the position in the image. What you can also do is move modules up and down: with these buttons you add modules, remove modules, but also move modules within the workflow. This little triangle you can press to run that module, and you can also click here to go to the different modules in your workflow. There are quite a lot of symbols which are good to know. This little checkmark shows whether a module will be executed or not, so you can for example test something out and then say, okay, I want to run my workflow without this one step — it is a way to control what you are doing. For testing it is also super helpful that you have this pause function: you can activate it and then press run, and it will run everything until that pause symbol. That is quite nice, because then you do not need to step, step, step through all the steps. Also, especially when you are testing your workflow, you want to see the output of every step, and for that you have these little eyes: when you open one, you get the outputs — for example here you get the input and also the cropped version, I assume. Of course, once you have set everything up, you probably want to close the eyes if you then run it over, say, 100 images. It is also very good to pay attention to whether there is an error or not: there is this little red cross, the error symbol, and when you hover over it you already see an error message about what could be wrong — and then usually you go to that module and fix the parameters. There is also a warning sign, which is not a real error; I guess the most typical warning in CellProfiler is that, when you are in test mode, no data will be exported to a spreadsheet, and you also see that when you hover over this little warning sign. Then there is a workspace viewer, for looking at your workspace. And of course, when you set up your pipeline, it is automatically tested on the first image, and you can step down to the next image set.
That is very important, because you want to make sure that your pipeline works not only on one image but on a few images; you usually test it on those before you run it on all your images, and you can also somewhere randomly pick an image to test it on. Once you are really happy, you exit the test mode and run it. Then — I won't go through this whole slide — because you have a workflow, you do something with the image, let's say you crop it, and then you do something with the cropped image, for example identifying some primary objects, it is very nice to know which inputs and outputs flow where, and there is this trace input/output feature where you can see what feeds into and out of every module — super nice. Good. Then, what kinds of modules are there: there are really a lot of modules to explore, nicely sorted into categories. Some are needed for loading the data and saving the data; then image processing, which could be some filtering; object processing is really about getting the objects in the first place, like your nuclei segmentation, and then maybe making them a bit bigger; and measurements — of course you then want to measure whatever you have segmented, and that can also be measurements of the entire image. There are really many, many modules, and it is worth exploring them and checking the module help, also online of course, to see what the modules are doing. If you don't know which category a module is sorted into, you can either just click on "all" or search for a function — if you just want to explore, or if you have something in mind that might exist. Then, looking at images in CellProfiler: when you click on an output you get such a figure window, and these windows have additional menus. One is to really zoom in and move around. What you can also do is measure something — that is in the next step — but also of course important is where you are located, the x and y position of wherever your cursor is, and then you have intensity measurements of your pixels. I guess if you have never seen CellProfiler, it is worth noting that the intensities are always expressed as values between zero and one, with one being the maximum possible value of your pixels. And you can measure lengths: for example, it is always good to know what dimensions you are working with, how big the objects are that you want to segment, and you can simply measure that. Usually the workflow would be to set up your modules, play around, test on a few images; of course if you have 100, 1,000, 10,000 images you cannot test it on all of them, but what you should ask is: okay, you test it on a few images — do you agree that the segmentation works, that in general your workflow works and gives you the segmentation for most of your objects? It will never give you really perfect results, because that is how biology is — maybe you have cells that are in clusters and so on.
What you then need to check is that there is no condition where the workflow fails in comparison to the other conditions. That means, for example: are the nuclei in the other conditions also well segmented — do you catch all the nuclei you want to catch, also the dimmer ones? Are they in one piece, or in several pieces, or are they merged? These are the parameters that you would usually play with to get this right. You will always have some errors, and it is important that, if you have different mutants or different treatments, the error rates are more or less the same for all the conditions. Good. So where can you get more information — if you say, okay, I would like to do that but it does not exist in CellProfiler? You can of course search for more help, and you can for example also post on the forum, the interactive forum image.sc. If you know that something exists in ImageJ, there is a RunImageJMacro module, so you can run ImageJ macros within CellProfiler. You can also try to write your own plugins for CellProfiler — there are templates showing how to do this — and if you are a more advanced image analyst, that can help you make your analyses available to users who are not so proficient in programming. Good. We are here a bit to see how CellProfiler can run on the cloud, and that of course goes in the direction of what to do with large image sets — and you can easily have large image sets, especially if you do the high-content-screening kind of analysis. If you have only a few images — and a few could be a few hundred images — then you are fine running it on your local machine, and CellProfiler automatically multi-threads the analysis, so you can see how it opens up different processes depending on your CPU. If you have something between hundreds and tens of thousands of images, then what you want to do is move it away from your local machine: you can run it on a cluster — usually you contact the local system admin, who helps you set it up — and you can also use Docker, which is what you will hear more about in this session. And if you have really many, many images, then you would consider cloud processing, and you will also hear more about that here. I showed you already that you input files in this Images module — you can for example just drag and drop files — but what you can also do is batch files together, meaning process them in one batch, and that you can also do in CellProfiler: there is this CreateBatchFiles module, and it is easy to do. You add this module, and you basically copy the structure that you have on your local machine — you should have the same structure on the cluster — and it creates an information file, a .h5 batch file, that can then be moved to the cluster and gives the cluster the information of where your files are located.
There is a question: just to ask about the way you set the CellProfiler pipeline to run on the cluster — I see this module appears at the end of your pipeline. Just to repeat: does this module only create a file that tells the cluster how the files are organized, or is it actually duplicating the dataset that you have as a .h5 file? Yeah, it just gives the cluster a pointer, so your data needs to be organized in the same way on the cluster as it is in your local copy of the data; otherwise the file would just be really too big to practically move around. Okay, so it is very small as a file, compared to the huge data that we want to process? Yes. CellProfiler never actually stores images inside of it: it will store either just the pipeline information, or pipelines and pointers to images, but it never actually stores images, and again that is because, if you put tens or hundreds of gigabytes of images into it and you wanted to share your workflow with somebody else, it would be tens or hundreds of gigabytes. Okay, thank you. I think the only thing that I will mention about CellProfiler itself is that the reason CellProfiler has lasted as long as it has is that the folks who are outlined in blue here are professional software engineers who have contributed to the code base over the years, but there are also these folks in green, who are biologists, and there has been a lot of biologist input to CellProfiler over the years. So while we have the benefit of having people who know how to write excellent, really performant code helping us make the tool better, we try to make sure that particularly things like the documentation are all written by biologists, so that they are more approachable to somebody who does not have a degree in computer science and does not spend all their time thinking about computational image analysis. The goal of CellProfiler, when it was originally created, was essentially to have something where you could run the same analysis on lots of images without needing to learn to code — something which is still tricky in a lot of tools, like for example ImageJ and Fiji, unless you are comfortable writing scripts. And if you are comfortable writing scripts, scripts are great, incredibly useful, and definitely a skill I recommend everybody work on picking up at least a little bit — but we did not want it to be mandatory in order to do good image analysis, and so this idea of a workflow tool is what Anne and Ray, who are the original two authors, set out to build. Again, Anna gave an outstanding overview of what CellProfiler is, in terms of why you might want to use it and how you might configure a pipeline. Almost always you will be configuring a pipeline locally on your own machine, with local copies of the images, so that you can turn all the knobs and dials and see how things change. But if you have hundreds of thousands or millions of images — which in our group is not that infrequent an outcome — you probably do not want to run that on your laptop, and so CellProfiler is designed to allow you to have an interactive mode, but then be able to apply the workflow that you have created somewhere else, headlessly — and by headlessly I just mean not showing the GUI to you.
You can run CellProfiler headlessly even on your local computer, and I will talk about that in a minute, but the idea is to no longer have the overhead of actually creating the GUI and running that, so that we can run it either more performantly on your local machine or somewhere where we are not using a GUI, like a cluster. Most of the things I am going to get to toward the end of this are wrappers around you yourself having to interact with the CellProfiler command line, because in general we try to look for things where we can obscure the code, or put the code behind a GUI, to make it more approachable to people who are not used to using the command line. That being said, it is valuable to understand the information that CellProfiler needs for a command-line command, because that is the information that ultimately, no matter what wrapper you are using, is going to have to be put into CellProfiler in some fashion. So the important thing in the next several slides is not that you memorize this command, but that you understand the sorts of information that are needed for a headless run, so that whatever wrapper you choose to use, you understand what it is doing. The first part of a CellProfiler headless command is just telling your computer that you want to run CellProfiler. This might look a few different ways. If you are running with CellProfiler installed in Python on your machine — CellProfiler runs on Python 3.8 or 3.9 as of the moment — that might look like just the command cellprofiler, or python -m cellprofiler, or python3 -m cellprofiler; it will depend on how you have aliased Python on your system, so just try one of these until one of them works. But you do not have to use CellProfiler installed from Python. It can have advantages, because you have access to the latest code upgrades, and you can also then use plugin tools that require extra dependencies — for example we have plugins that allow you to run Cellpose or StarDist in CellProfiler, but for those you need to be running CellProfiler installed in Python; we cannot include them in the main downloadable program because it makes installation for end users way harder, and we would rather make it so that everybody can at least install some version of CellProfiler. If you are running the CellProfiler .exe on a Windows machine that you downloaded from our website, you can replace the cellprofiler command here with that executable; if you are running it on a Mac and you downloaded the executable from our website, it will look something like this. In most terminals on Mac and Windows you can just drag and drop the actual executable file into the terminal if you are worried about having to type all of this out — it is a lot more efficient, and then you do not have to worry about typos. The next part you do not really need to worry about: there are some flags that CellProfiler wants to see in order to run headless; if you are writing a headless command yourself you need to know this, but if you are running a wrapper you do not. The next thing CellProfiler needs is a pipeline: it needs to know what is the analysis that you want it to run.
The next part you don't really need to worry much about: there are a couple of flags that CellProfiler wants to see in order to run headless. If you're writing the headless command yourself you need to know them, but if you're using a wrapper you don't. The next thing CellProfiler needs is a pipeline; it needs to know what analysis you want it to run. If you have made a batch file with CreateBatchFiles, you can use the pipeline that is included in that .h5 file. It can also be a .cppipe text file. Recall that a .cppipe is just a flat text file with the steps of the pipeline and nothing else, while a .cpproj has the steps of the pipeline as well as the pointers to the files; because you might later be telling it which files to run on, using the project format causes problems, so we recommend either the plain-text pipeline or a batch file that was specifically designed to run CellProfiler this way. Next: where do you want the output to go? That's pretty straightforward; you want to know where to pick up your files later. One note, though: in a lot of the exporting modules in your pipeline, the ones saving images, saving masks of objects, or saving quantitative data, you can override the default output folder and say you want things to go to a specific place. If you do that and then move your workflow onto a cluster or into the cloud, the exact folder you specified probably doesn't exist there. CellProfiler therefore lets you say, for all of these modules, "use the default output folder", which you can then set at runtime, in the command, to some location you know will exist on your headless system. The next thing CellProfiler needs to know is which files are going to be put into it, and this is probably the most complicated part, so I need to take a few minutes to go through how data actually gets into CellProfiler in the first place. This is nobody's favorite part of interacting with CellProfiler, but it's crucial to getting your pipeline to work the way you want. The major way most people get data into CellProfiler is through what we call the input modules. The goal here is not for you to read every line of text on this slide, but briefly, there are four of them: Images, Metadata, NamesAndTypes, and Groups. Images is self-explanatory: you put images in. Metadata lets you extract metadata from your file name, your folder name, or from the file headers if you have, say, a container file from a proprietary format, and you can use it to associate pieces of metadata with individual images. Say I have a time-lapse movie, and the way my data is structured, all of the time points of the movie are in one file: I can say that the name of the file is what CellProfiler should call my movie, and that each time point can be extracted from within it, and that there are 20 time points, for example. In NamesAndTypes you essentially tell CellProfiler what your experiment looks like. There are probably as many kinds of microscopy experiments as there are microscopes, so there is some information you need to provide here: are you working in 2D or 3D, do you have one channel, an RGB image, or many individual channels? In theory it supports as many channels as you want; I think my personal best is fifty-something channels going into CellProfiler. But you need to tell CellProfiler how to understand your data.
Usually, whenever you write software, the trade-off is between flexibility and ease of use, and in most cases in CellProfiler we've gone towards flexibility, being able to support lots of use cases, but that means you then have to specify what your use case is. I've bolded the Groups module here even though it's usually optional; in fact it is optional even when you're running headless, but it can be really helpful. The reason I say that is that Groups tells CellProfiler which data has to be processed together. If I just have data from a high-content screen and each image is a separate thing, it doesn't really matter whether well A1 and well A2 are run by the same CPU or not. But if I return to my time-lapse movie and I want to track objects through time, it's pretty important that all the frames of my movie are run in the same copy of CellProfiler, and also that it doesn't try to link objects from movie one into movie two. So Groups is really critical for telling CellProfiler how the data fits together, and this is especially important when we're going to take the data and spread it across lots of nodes in a cluster or in the cloud. I see we have a question: is headless mode overriding parameters specified in the pipeline, so will the output folder specified on the command line be ignored, or will it use the folder you specify? Great question. In the GUI you can set the default output folder and default input folder to something, but that's not actually saved to the pipeline; it's local to your copy of CellProfiler. In the context of an individual module like SaveImages, "default output folder" is a variable, and it just says "whatever I find as the default output folder at runtime is what I'm going to use". So if you're running in your GUI, it uses whatever your GUI thinks the default output folder is; if you're running on a cluster, it will be whatever that run has set the default output folder to. But if you are not using the default output folder, if you've given it a specific path, it will still try to find that specific path. That's why not setting hard-coded paths, as we sometimes call them, and instead using the default output folder variable, or a sub-folder inside it, is really critical for your data ending up where you expect. You can also specify the temp directory for the run, because CellProfiler will make temp files; you could even put your output into something like temp, but then you have to make sure you get it out before it gets deleted and overwritten by something else. There are a lot of flags I'm not covering here, but if you go to the link at the bottom of these slides it will take you to the page that lists all of them. Why would you use headless mode locally as opposed to on a cluster? You usually wouldn't, and the reason is that when you run CellProfiler in the GUI, it will automatically multi-thread: if you have eight CPUs on your computer, it will run eight individual copies of CellProfiler. You can tell it to use fewer, but it will use as many as you let it. When you run headlessly, it runs one copy.
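Before moving on, here is roughly what a complete headless command looks like once the pieces described so far are put together. This is only a sketch with placeholder paths; I'm assuming the usual headless flags (-c to run without the GUI, -r to actually run the pipeline, -p for the pipeline file, -o for the default output folder), and the input part, which comes in a few different flavors, is discussed just below. Check cellprofiler --help on your own installation for the authoritative list:

    cellprofiler -c -r \
        -p /path/to/analysis.cppipe \
        -o /path/to/output_folder \
        <input option: see below>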
So almost always, if you're running on a local machine that supports a GUI, you might as well use the GUI, because it has the advantage of handling the multi-threading and the grouping of all your files without you needing to worry about it. You can run headless locally if, for example, something else is using a lot of your graphics card and you don't want to bother creating the GUI interface, but that's a pretty corner case; it's possible but not usually a good idea. The exception is if you're running it locally in a container like Docker, which I will get to in a few slides. A question: is there any module that will divide images into patches to be analyzed independently, if you're running on a local machine? Yes, the SaveCroppedObjects module will let you save objects out as patches so that you can use them in a deep-learning application later. But if instead you want to identify primary objects starting from a whole image, would you need to divide the picture into multiple images with a macro first, say if my image is very, very large and my computer can't handle it? That's not possible yet, but it's something we're working on right now: handling, in the same way that QuPath does, a large whole-slide image by tiling it under the hood and breaking it down into tiles. We're still right in the middle of the weeds of that, but it's something we know people want to do and we want to support. You can load a pyramidal file format, but for now it's just going to take the first tile. CellProfiler 5 is going to have a lot of improvements to file handling, including, we hope, this tiling under the hood and better access to which level of a pyramid file you want. CellProfiler is almost 20 years old, so pyramid formats are not something that was conceived of when it was created, but we're working hard to add them because we know more and more data is going this way. This is why I mentioned grouping: on your local machine it's usually an afterthought, with the exception of some cases like tracking or running a Z-projection, but it's really critical when you're setting up your pipeline to run headless. It's not totally mandatory, but it will make your life a lot simpler. Some folks on the team did a great blog post with an associated video demo; the broad.io short link on this slide will take you to it, and I've also linked all of these at the end of the slides, so you don't have to memorize the links; they'll be provided to you in a PDF. I'd definitely recommend checking it out; it was put together by five different people in my group and it's really fabulous. The other way, if you don't want to hassle with these four input modules and you're pretty comfortable scripting, or you have access to scripts that will do this automatically for you, is a module called LoadData. If you add the LoadData module to your CellProfiler pipeline, it will deactivate those four input modules and replace them with itself.
With LoadData, rather than teaching CellProfiler which files should be analyzed together, what the name of your DNA channel is versus your mitochondrial channel, and so on, if you're more comfortable scripting all of that information you can just create a CSV file and load it into CellProfiler. Your CSV file ends up looking something like the small sketch shown just after this overview, with file names and path names pointing to where your images are, plus any metadata you choose to include, and there's no overhead of configuring those input modules. When you point LoadData at such a CSV you can also add grouping if you need it, or say "only run rows one through twenty", or anything else you need to subset the CSV. So if your data names are regularized and you're comfortable scripting, this can be a faster way to do the input bookkeeping than the actual modules. The last piece of the command is the input, and telling CellProfiler which files to use looks slightly different depending on which of these input approaches you've used. If you're using LoadData, you pass in --data-file with the path to your CSV. If you're using the input modules and a .cppipe file, you have a couple of options: you can make a text file that lists all of the files you want CellProfiler to consider and pass it in with --file-list, or, if that sounds like too much work and the files you want to analyze are all together in a single folder, you can just pass -i with the path to that folder and CellProfiler will consider all of the images in it (maybe not great if you have lots of other things in that folder as well, but it can be really handy, because then you don't have to set up either kind of file). And if you're using a batch file, you already went in and said "here is the data I want you to run, and here is where on my headless machine you can expect to find it", so you don't need to pass any input at all; it's encoded in the .h5 batch file you passed in as the pipeline. The last thing is how to group. As we've discussed, when you run CellProfiler headlessly it only runs one copy, and in general, if your data is big enough that you're thinking about running headlessly, you want lots of copies of CellProfiler, so you need to let it know how to group the data, or which copies should run which things. Some workflows, like tracking, require specific groupings, which you'll typically specify using metadata. But even otherwise, setting up grouping as part of your pipeline allows parallelization: rather than a small number of CPUs each running thousands of files, thousands of CPUs each run a small number of files. That's faster in calendar time, the time from when your data run starts to when it stops. And, not that CellProfiler would ever crash, but if it did happen to crash partway through a huge run, you might lose some or all of your quantitative data, depending on how you've structured your outputs. Running many copies of CellProfiler, each of which has only a few files to run, vastly decreases the likelihood that a crash will happen, and the pain of what needs to be recreated if one does.
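As a minimal sketch of what such a LoadData CSV might look like (the column-name convention FileName_<image>, PathName_<image>, Metadata_<tag> follows CellProfiler's usual naming; the channel names, paths, and wells below are made up for illustration):

    FileName_DNA,PathName_DNA,FileName_Mito,PathName_Mito,Metadata_Well
    plate1_A01_DAPI.tif,/data/plate1,plate1_A01_mito.tif,/data/plate1,A01
    plate1_A02_DAPI.tif,/data/plate1,plate1_A02_mito.tif,/data/plate1,A02

You would then point a headless run at it with something like:

    cellprofiler -c -r -p analysis.cppipe -o /path/to/output --data-file /path/to/load_data.csv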
You can choose to group by a piece of metadata: if you've said your data is in multiple plates and has wells, you can pass a -g flag and say this copy of CellProfiler should run well A1, the next should run well A2, and so on. You can also group by image-set count, so if it doesn't really matter how your data is scattered across your CPUs and you just want it scattered, you can say "this one runs files one through ten, this one runs files eleven through twenty", and so on. You can either create these commands yourself with scripts, or, if you're using the handy batch files, you can tell CellProfiler to print the groups present, as well as the command your cluster should use, so you essentially get all of your commands, dump them into a file, and send them to your scheduler. Or you can point at a batch file with a flag called images-per-batch, tell it you want ten images per batch, and it will again create all of the commands your cluster will eventually need. So it's not painless, because you do need to figure out what the groupings are going to be yourself, but we've tried to make that straightforward for you.
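For a rough idea of what those grouping variants look like on the command line, here is a sketch. The metadata tag names, file names, and counts are placeholders; -g, -f, and -l are CellProfiler's grouping and first/last image-set flags, and I believe the batch helpers are called --print-groups and --get-batch-commands (with --images-per-batch), but verify against --help on your own installation:

    cellprofiler -c -r -p Batch_data.h5 -g Metadata_Plate=P1,Metadata_Well=A01
    cellprofiler -c -r -p Batch_data.h5 -f 1 -l 10
    cellprofiler --print-groups Batch_data.h5
    cellprofiler --get-batch-commands Batch_data.h5 --images-per-batch 10

The last two print, rather than run, the per-group commands, which you can then dump into a file and hand to your cluster's scheduler.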
Now that you understand that CellProfiler is the thing that can be run headlessly, and what information it needs to run headless, where and how are you going to do this? I haven't even put your local machine here because, realistically, as somebody pointed out earlier, you don't actually want to do that. Your local cluster might have an installation of CellProfiler. CellProfiler 3 and below run on Python 2, which is past end of life and therefore may be a security risk, so we definitely recommend something in the CellProfiler 4 family. You generate execution commands for the job in question, and I've shown you some tools for how to do that, and then you put them into your cluster submission system; everybody's cluster works a little differently, so unfortunately I can't give you exact instructions there. The pros: it's local and it's probably free, so you don't need to move the data anywhere special and it probably won't cost you any money. It depends on your local bandwidth, though: if you don't have a cluster, or your cluster is typically way oversubscribed, you might be waiting a long time for your jobs to run, and you're probably going to need to work with your local IT department (most local IT departments are lovely; not 100% of them are). Installation should be smooth: in relatively up-to-date versions of Linux and Python you should just be able to pip install cellprofiler and it should work, but I can't promise that's how it goes 100% of the time. And it's hard to support multiple CellProfiler versions: realistically, if your lab has been running a workflow for ten years on CellProfiler 3 and you need your data to match that, your cluster has to support CellProfiler 3, while somebody else is using CellProfiler 4.0 and another person 4.1; that's a nightmare for your IT department. That's why in general we recommend containerization, and all the rest of the solutions I'm going to show you take advantage of it. If you're not familiar with the concept, software containers such as Docker and Singularity are essentially an operating system in a box. The person who creates the container writes a specification file that installs things once and only once, and you use that forever after: you're guaranteed that when you use CellProfiler 4.2.4 in a Docker container, it behaves the same way as when somebody else runs CellProfiler 4.2.4 in a Docker container, and it won't change. You can usually use it anywhere, so if you have, for example, a Mac or a PC, you can still run a container that runs Linux. I've put an asterisk here because as M1 and other CPU architectures become more common this is getting a bit dicey in a few places, but hopefully that will get figured out. It typically involves some code to run; many containers don't put the GUI for the software inside the container, or you have to know how to actually get to that GUI, so it's not the best solution for everything, but there is already a tremendous number of them available: BioContainers has more than a thousand different pieces of biological software that are dockerized. As somebody brought up before, the two major flavors of containers you'll typically see are Docker and Singularity. Developers prefer Docker because it's easier to develop for and you get a lot more permissions for installing things; your local IT department doesn't like Docker precisely because it gives you lots of permissions, so they would have to grant you a lot of privileges in order to run Docker containers. That's why a lot of clusters, especially academic ones, nowadays run Singularity rather than Docker, but Singularity can run most Docker containers: CellProfiler, for example, we provide as a Docker image, and in Singularity you can take that CellProfiler Docker container and just run it; you don't need to do anything special. So containerization is the solution to a lot of problems. You can run CellProfiler in Docker locally, on your cluster, or somewhere in the cloud. You don't have to memorize this, but this is what it looks like; it's taken from the test suite we use when we build our CellProfiler Docker images. The first line says "docker, run a container". The second and third lines say what to mount as the input, i.e. where my input image files are, and the output, i.e. where my files should end up on my local machine outside the Docker container, because in general Docker only understands what's going on inside itself, so you have to tell it how to access files on your operating system rather than on its own containerized box of an operating system. You then call a particular version of CellProfiler (we have everything from CellProfiler 2.3 through whatever our latest release is), and the last part should look pretty familiar: it's just the same headless command you saw before, plus any other flags you want. This makes it really straightforward to switch versions, to run it anywhere, and to have a single command that works in a lot of places.
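A minimal sketch of such a command, assuming the image is published as cellprofiler/cellprofiler on Docker Hub and that the container's entrypoint invokes CellProfiler directly (the tag, mount paths, and file names are placeholders to adapt to your own data):

    docker run --rm \
        -v /local/path/to/images:/images \
        -v /local/path/to/output:/output \
        cellprofiler/cellprofiler:4.2.4 \
        -c -r -p /images/analysis.cppipe -i /images -o /output

On a Singularity cluster, the same image can typically be pulled and run with something like singularity run docker://cellprofiler/cellprofiler:4.2.4 followed by the same headless flags.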
You're going to cover Galaxy next, so I'm not going to go much into it, but just to say: Galaxy is an easy-to-use way for end users to put a GUI onto an analysis and to make the analysis shareable and reproducible. It's really powerful and really cool, and you're going to talk a lot about it next week, so I won't go into the details. Basically, anything that is a container can be run on Galaxy if somebody sets it up, and that is the issue with running some things on Galaxy: somebody needs to go in and make an XML file that wraps the analysis and tells CellProfiler how it should run, and depending on how that XML file is set up, some of the nice flags we've talked about may not be accessible. For example, in the Galaxy installations that exist right now (although my team is working with some folks at Galaxy to make better ones), you can't actually use any grouping flags, so you're only running CellProfiler on your whole data set without any groupings, one thing at a time. Galaxy has a concept called collections, which you could use to try to scatter things that way, but there's going to be some setup work before CellProfiler can run efficiently. The other issue may simply be bandwidth, because Galaxy is not one central place to run things; it's many, many smaller places, so the particular instance of Galaxy you want to use may be oversubscribed. That being said, in terms of making something reproducible and easy to use, it's hard to find anything nicer than this. You can access CellProfiler 3.1.9 or 4.2.1 in Galaxy: 3.1.9 should be able to run anything in the CellProfiler 3 series, and 4.2.1 handles pipelines for anything through 4.2.1. It should probably be able to run most 4.2.4 pipelines too, but it will warn you: "this pipeline came from a version of CellProfiler more advanced than me, I'm not sure it will work". Terra is a platform that is primarily used for single-cell sequencing and high-content genomics work, but it supports CellProfiler as well. It's made at the Broad Institute, which is my home institute, together with Google and Microsoft, and the idea is to put a pretty easy face on running things in Google Cloud. You can keep your data in Google Cloud or a couple of other cloud services, but the compute happens in Google. It uses something called the Workflow Description Language, a standardized language for describing workflows, to call essentially anything that's containerized, or anything that you can teach Terra how to install. Because your data is in the cloud, bandwidth is never an issue; you are never going to use up all of Google Cloud's capacity. There are a lot of examples and tutorials, and the team has worked really hard at making some nice things, but you're paying for it. You can get three hundred dollars in credits when you first sign up for Terra, but if you then need five thousand dollars' worth of compute, three hundred dollars may not be so much. I should say that CellProfiler running feature extraction on a 384-well plate in Terra costs somewhere between five and ten dollars, so three hundred dollars will actually get you pretty far, but it's not necessarily going to cover everything you need forever. The current implementations only support .cppipe files, not batch files, and they don't support a lot of those really helpful grouping strategies right now, although again, folks are working on this. And again, you have to learn another workflow language.
For Galaxy, you have to learn Galaxy's particular flavor of XML; for Terra, you have to learn something called WDL, and there are only so many spaces in the brain for learning things like that. The last wrapper I want to give a shout-out to is one that we made, called Distributed CellProfiler, which is a wrapper for running CellProfiler on Amazon Web Services. Again, this is a cloud, so it is not free, but it's also not bandwidth-limited, and you don't actually need to know how to code in order to use it. It takes care of making all of the infrastructure, keeping track of all of the infrastructure, trying to save you money wherever it can, and then shutting everything down afterward. Because it's something we make, basically anything you can do in CellProfiler, all of the grouping and batch files and so on, we make sure is supported in Distributed CellProfiler too. This preprint came out literally yesterday: we are extending this, and have already extended it to tools like Fiji, because we think this approach of easily containerizing analyses and scattering them across lots of different CPUs is useful. So if you're in a situation where, for whatever reason, you don't want to use your local cluster and you don't want to use Galaxy, and you want to use a paid cloud, we definitely recommend it. We've been using Distributed CellProfiler for six years and for at least half a petabyte of data, so it's pretty nicely bug-fixed at this point. If you want to learn how to do this, image.sc is always going to be my answer for just about everything; you can see we've got tons of groups on there that can help out with this stuff, but my team are all Distributed CellProfiler users, are used to running CellProfiler headlessly, and can help out with things like that. That, and my time, are sponsored by the Center for Open Bioimage Analysis, aka COBA, which is an NIH-funded center that allows us to work with biologists to solve their biological problems, do community engagement like writing tutorials, and develop software. So definitely reach out to us and check out our website if you're interested in learning more about what we do; there are lots of opportunities to collaborate. These are the folks in my lab and our sister lab, and the folks who have paid the money to make this possible. Distributed CellProfiler basically involves running four Python commands, which are here at the bottom, and again you can download these slides and take a look at them yourself. The first one... and of course my connection dropped. Come on, come back. I was trying to be all slick and get it set up before you got here, and then my connection dropped. So I'm just SSHing into a machine that's running on Amazon. I don't strictly need to do this, but the part of Distributed CellProfiler that essentially acts as a babysitter, keeping an eye on things and shutting things down, has to run somewhere, and if it's running on your local machine, then your local machine needs to stay on and stay connected, or the babysitter goes away. So typically we recommend that you run it on a tiny, inexpensive cloud machine; I think this machine costs something like five cents an hour.
You can see it has an alarm on it to turn it off, so that when nobody's using it, it stops running and stops spending money, except when you wanted it to stay running during your demo. We've built in a lot of instruction on how to run this on an academic budget. There's a configuration file, and we give you guidance on how to fill it in, but basically what you tell Distributed CellProfiler is what to name the run (so you can look at the logs it generates), what version of CellProfiler to use, how many machines you want, what kind of machines you want, and how much you're willing to pay for those machines; if there aren't any available at that price it will just wait until there are. Then there's some more information about the size and compute requirements, so you can run much bigger jobs with Distributed CellProfiler, taking advantage of really big machines that you couldn't necessarily use on your laptop; we've used this for large tissue images that CellProfiler is otherwise a little too memory-hungry to handle. So I'm just going to run a setup command. The setup command reads everything I pre-filled in that configuration file and makes some infrastructure: it makes a queue where I will put the list of things I want to do, it sets up a cluster for me if I don't already have one, and it gives that cluster what's called a task definition, which tells it how to act, what Docker containers to run, how to put those Docker containers onto virtual machines, all of that good stuff. This has a 60-second wait built in because it needs to check that something worked, which is always great in a live demo. If you want to know more about how these things work in general, and how to build your own version, this is the preprint that just came out this week. The next thing is to tell CellProfiler what jobs to do, and we have a script here that helps you automate that if your data is regularly structured, such as a multi-well plate: you give it the name of the pipeline and tell it how you want things grouped, and in this case there is a preset for grouping things by, say, plate, well, and site. That's preset for you, although you can do any configuration of metadata and any kind of scattering you want; this just makes it a little easier. What... oh, I probably named the queue incorrectly, didn't I. Live demos, guys. I forgot to specify which job we're doing, and this is set up for a few different common workflows; I'm really selling the whole "easy to use" thing here, aren't I, as I, insufficiently caffeinated, try to run this on the fly. There we go, it's now submitted. In this case we have a workflow where we want one job for every well of a multi-well plate, so I can show you: here is one job, the job for well C9 in this particular plate, but my queue has 96 jobs in it that are ready to be run. I'm now requesting a fleet of 48 machines to run those jobs, and of course it's throwing an error, because this is a live demo. Usually this just works, guys, I promise. What is the error actually about... in any case, once that has finished... hmm, that's not going to start for some reason.
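For reference, the four Distributed CellProfiler commands being run in this demo look roughly like the following. This is my recollection of the project's run.py interface, so treat the exact sub-command and file names as assumptions and check the Distributed-CellProfiler repository for the current syntax:

    python run.py setup                                          # reads the config file; creates the queue, cluster, and task definition
    python run.py submitJob files/exampleJob.json                # fills the queue with one job per group (e.g. per well)
    python run.py startCluster files/exampleFleet.json           # requests the fleet of spot machines that works through the queue
    python run.py monitor files/<name>SpotFleetRequestId.json    # the "babysitter" that watches progress and shuts everything down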
But the steps I've run so far have made a babysitting file that I can use: it will keep an eye on the jobs that are out there and shut the whole thing down when it's done. I apologize for all the errors in the demo; again, I swear it usually just works. But it makes it very straightforward to start running CellProfiler jobs as big as you want, and as many as you want, with just a couple of minutes of configuration. Okay, so this next workflow is a bit of a classic bioimage analysis workflow; it has also been published as a workflow in the book, and I've put a link to the PDF of the book here so you can just download it if you want. What I'll do first of all is share the link to the notebook, and if you don't mind, or if you like and want to follow along, you can click on that link, which will bring you to the notebook, and then you can save a copy to your own Drive by going to the File menu and choosing "Save a copy in Drive", so that you can modify it and run it at your own pace. We don't need any fancy GPUs for this workflow, so the advantage is that even if you get assigned a CPU runtime, it should work fine. You can also download a copy of the notebook here and run it in your own favorite environment. There are a number of packages that are installed as part of this, and that's done in this part here. This is an interesting workflow because it showcases a few pieces which are useful in many different image analysis workflows. Spot detection is quite a common task: you're trying to look at things that are bright over a dark background, or sometimes dark on a bright background. Tracking essentially takes those spots and connects them across successive frames over time, which is also quite common. Directionality analysis is something I haven't encountered that much in my own experience, but it is an interesting way to analyze the tracks, and this particular workflow showcases an example of it. It also touches on Gaussian mixture models, simply as a way of finding the two most likely distributions of data points if we model them as a mixture of Gaussians. All of the units in this case are in pixels or frames, which just makes things easier: we're looking at directionality, so the scaling is not as important here. Of course, if you're looking at distances or trying to measure velocity or things like that, then the units are really important, but in our case we're looking at directionality, which is essentially a dimensionless feature. You can in principle change this URL to any image that you like; in this case we're just using the one that was used for the book, but you can switch it out for any other image. You can also open it up in Fiji; I have it pre-opened here just because it takes a moment over the network, but you can basically select the URL and drag it onto Fiji, and that will initiate the download. I already did that, so this is the image that we will be working with. As you can see, it's a time series, and we essentially have growth of microtubules going in all different directions, and what we're trying to find out is what the quantifiable characteristics of the movement of these growth directions are.
There are a couple of parameters that you can play with; these are the ones I've used for this particular notebook, but it's interesting to modify them and see what the effect is. Then there's the figure size, which is basically just for visual purposes, nothing particular. There are a couple more parameters here; I didn't catch this error before, but essentially there are a couple of parameters here you can modify; it didn't seem to pop up earlier, so I'll fix that after the class. Now, there are a few libraries we depend on. The original workflow combined different software packages: the tracking was done in Fiji, and then the track information was exported and imported into either R or MATLAB. One of the nice things about Python is that there is a large number of libraries available, so we can do everything in a single place. matplotlib's pyplot is just a plotting library; scikit-image contains a lot of interesting image-processing routines; NumPy is pretty much a must-learn in the Python data-science world, and I'm sure you've come across it before if you've ever worked with Python in this context; and pandas is essentially a library to manipulate tables and do some statistics, with some very nice expressive features for grouping parts of a table together, calculating statistics, and doing the kinds of things for which you would otherwise use R or other statistical packages. Then there are a couple of other packages: this one adds scale bars to the image, and this one here is just so that I can put Greek letters in the matplotlib plots; it's one of the ones that takes longest to install, so it's not super crucial, but it doesn't take too long either. In any case, I'll go ahead and execute the first cell and then the install, and if you like you can do the same in your copy of the notebook. This installs a bunch of things; the LaTeX package is quite large, and also quite powerful, as anyone who has written articles in LaTeX will know; it's quite a hefty package. Then this one here is trackpy, which is a package for performing tracking. We need the latest version, and because it's still in a development phase I downloaded a specific version from GitHub and installed that, just so that if there were breaking changes to the API overnight they wouldn't mess up the workshop; this is something you can always do, install a package by giving the URL of a specific zipped release. It's finished; it takes about a minute, so not too bad. This line just tells the plotting engine that we're keeping the figure size and that we'd like to use LaTeX, and then this one actually reads the image from that S3 bucket. We can verify that it has actually loaded: when you load the image, the first dimension is the number of frames, and the other two are the x and y dimensions.
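A minimal sketch of that loading step, assuming scikit-image can fetch the TIFF directly from the notebook's URL (the address below is a placeholder, not the real bucket):

    import numpy as np
    from skimage import io

    # 2D+t TIFF used in the book chapter (placeholder URL)
    url = "https://example-bucket.s3.amazonaws.com/microtubule_growth.tif"
    movie = io.imread(url)

    # first axis is time, the remaining two are the image dimensions
    print(movie.shape)   # e.g. (n_frames, height, width)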
I've put a little warning here: if you're trying this with your own image, we are expecting a 2D-plus-time image, so if it's not a three-dimensional array it's going to throw an error. And this is just a little snippet to show the first and the last frame of the image, to make sure that we can read it. There's nothing super fancy about it, but if you're coming from MATLAB, for example, indexing the 3D array with minus one, which indicates the last element, would be the equivalent of "end". Now, the first thing that helps is to enhance the features we want to detect, and this is a classic way of doing it. This particular method, called the difference of Gaussians, takes the difference between an image filtered with a small kernel and an image filtered with a large kernel; in other words, a slightly blurred image minus a very blurred, very broad copy of the image. What that does is essentially smooth out anything above the size of what we're trying to detect and anything below it, so the first sigma should be smaller than the smallest feature you're trying to detect, and the second should be larger than the largest feature you're trying to detect. Luckily, scikit-image has a function for the difference of Gaussians, so we just use that. This next cell — there are many ways of doing this, but what it does is take the difference of Gaussians of each frame with those small and large parameters we specified at the start of the notebook: it iterates through all of the frames in the image, applies the difference-of-Gaussians function, puts the results in a list (that's what this bracketed expression is), and then this call compacts all of those into a single stack. Oops, I didn't execute this. And this is the result, for example, for the first frame after going through the difference of Gaussians. Now look at the intensity: we have the color bar here to help us identify the intensity levels, and you can see that zero is at about this level here, so we now have things that are darker than zero — these spots over here are darker than zero — and that's because, of course, when we take the difference of two things, if the second element is larger than the first, we get negative numbers. In this case that's fine, because the spot detection algorithm isn't going to worry about it: we're looking at peaks of intensity, so it doesn't matter whether the absolute floor of the image is above or below zero; we're looking at differences in intensity. This enhances spot-like structures, like here, and also gets rid of the slowly varying background that you can see over here. Then we have trackpy, the library we imported at the start. Its locate function is the one that performs the spot detection, and we're just going to run it on the first frame, specifying the typical size, which in this case was set to 11. It needs to be an odd number, which is why it's 11 rather than 10; that's just how the algorithm is implemented. It's quite fast: it finds 338 spots and calculates eight features for each.
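Sketched out in code, and continuing from the movie array loaded above, the band-pass filtering and first-frame detection look roughly like this (the sigma values are illustrative stand-ins for the small/large parameters defined at the top of the notebook; the diameter of 11 is the one quoted above):

    import numpy as np
    import trackpy as tp
    from skimage.filters import difference_of_gaussians

    small_sigma, large_sigma = 1, 5   # assumed values: below / above the spot size
    spot_diameter = 11                # must be odd for trackpy

    # band-pass filter every frame, then stack back into a (t, y, x) array
    filtered = np.stack(
        [difference_of_gaussians(frame, small_sigma, large_sigma) for frame in movie]
    )

    # detect spots in the first filtered frame
    spots = tp.locate(filtered[0], spot_diameter)
    print(len(spots), "spots found")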
We can have a look at the table: we've got the x and y positions, the mass, the size, the eccentricity, how strong the signal is in that spot, and then some additional metrics that are closer to the raw data. A question: can you say more about what the mass measurement means here? Yes, the mass is usually defined as the integrated intensity over the spot, so you can think of it as the equivalent of ImageJ's integrated density, the sum of the pixel values. That's right. So let's have a look at how those detected spots look on the image; trackpy has a function that essentially annotates an image, so it's quite easy to look at the results. We can see, in fact, that there are a number of places where the spot is very weak, and we're maybe not that interested in keeping those; we'd rather keep the ones with a reasonably bright intensity. So what can we do to filter those out? First, let's look at the features that have been calculated, which are essentially the headers of this table, and make a list of these, taking away x and y. This is Python's notation for sets — I don't know if you're familiar with them, but they are essentially groups of elements without duplication — so here we have the set of all of the column headers, and then we take away x and y; you can take a difference between sets, which you can intuitively think of as removing x and y from the list of potential columns. The result is a set; you can see that because it has these curly brackets on either side. What we're going to use this for is to automatically generate a histogram for each of those columns. I turned off LaTeX because it was messing up these special characters, and then I create a single row with a number of plots equal to the length of that set, hard-coding a figure size just to keep everything in proportion. What we do here is use this special Python function called enumerate, which takes a list, a set, or any iterable and returns two values: an index into that iterable, and the actual element from it. So i will be zero, one, two, three, four, five, and m will be each of the remaining column names in turn (mass, size, eccentricity, signal, raw mass, and ep). We take the i-th axis that was created, make a histogram of the table column with that particular name, and set the title of that subplot. If you run it — you can already see the result there, but just to prove that it works — there's nothing that really stands out as a criterion. There is something here in the signal that we could definitely use, but in this case I've chosen to plot the size and the mass against each other and see whether there's any pattern when you break those two apart. You're basically looking at a scatter plot of all of the spots, plotted by size against mass, and you can see that there is a concentration of spots with a very low mass.
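A rough sketch of that exploration, continuing from the spots table above (the column names follow trackpy's output, the order of a Python set is arbitrary, and the bin count and figure size are arbitrary choices):

    import matplotlib.pyplot as plt

    # all measured features except the coordinates
    features = set(spots.columns) - {'x', 'y'}

    fig, axes = plt.subplots(1, len(features), figsize=(3 * len(features), 3))
    for i, name in enumerate(features):
        axes[i].hist(spots[name], bins=30)
        axes[i].set_title(name)

    # scatter of size against mass: the weak detections cluster at low mass
    spots.plot.scatter(x='size', y='mass')
    plt.show()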
So what would the detection look like if we isolated only the spots with a mass above one? You can do that with the annotate function: you give it the table of spots and the image, which is the first frame in this case, you say that you want to split it by category using the mass column with one as the threshold, and you ask it to plot in red the spots below the threshold and in green the ones above. You can see that this quite nicely separates out the ones with a strong mass, so one looks like a reasonable number to try for the detection. At that point we can run the localization on the entire... well, this is still just the first frame, to check that everything is in order: it says the minimum mass is one, performs the localization on the first frame, and shows us the result, and we get the same image as before, so that's good. Now we can perform the spot localization on the entire video. It's quite quick, even on a shared node like this; it says it takes about three seconds, so the spot detection is fast. At this point we have all of the spots in a single table. What we're really trying to do here is identify a good number of these bright spots so that we can then perform tracking on them: even if we don't get absolutely all of the spots, we want enough that we can relate the same spot in successive frames and then calculate the displacement vectors. In this particular case I didn't try to fine-tune the parameters or dig deep into whether every single spot was identified correctly, because here it was more a matter of having enough spots to calculate statistics on the movement. I'm not aware of any specific comparison between trackpy and TrackMate, but there was a tracking challenge a few years ago that was a systematic comparison of many different tracking algorithms. In this case it says that trackpy uses an implementation of the Crocker-Grier linking algorithm; I'm not sure whether that's the same one TrackMate uses, but I'll put the reference in the chat if anyone wants to dive deeper into it. In the next section, what we're going to do is take all of the spots we identified and try to link them together: essentially, given two spots in successive frames, do they belong to the same track or not? That's where the algorithm I put in the chat comes into play, so it's an important step; there's also a description of how TrackMate does this in the book chapter, and I'd encourage you to have a look at that chapter for background or additional information. So now we have a new table: we keep most of the information we had for each spot, but there are two additional columns, one with the frame that the spot appears in, and "particle", which is an identifier of the track it belongs to.
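Roughly, the detection on the whole movie and the linking step look like this, continuing from the arrays above (tp.batch and tp.link are trackpy's standard batch-detection and linking functions; the search range of five pixels is the value discussed below, while the memory value here is only illustrative):

    # detect spots with the chosen minimum mass in every filtered frame
    all_spots = tp.batch(filtered, spot_diameter, minmass=1)

    # link detections between frames into trajectories;
    # search_range: how far a spot may move between frames (pixels)
    # memory: how many frames a spot may vanish and still be re-linked
    tracks = tp.link(all_spots, search_range=5, memory=3)

    # quick visual check: trajectories over a dark background
    background = movie.min(axis=0)          # minimum-intensity projection over time
    tp.plot_traj(tracks, superimpose=background)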
We can also plot all of the trajectories together on the image, just as a sanity check, a smoke test if you like, to see whether the tracks look reasonable. This little expression here hides what in Fiji you would do through a minimum projection: you would go to the stack Z-project command and take the minimum projection, and that's essentially what this image is. It works for any 3D matrix: we take the stack and say we want the minimum across axis zero, which in this case is time (remember that the first index was time), so we're taking the minimum intensity over time, and that just gives a darkish background for the track image, one that also hints at where the structures are. Then it plots all of the trajectories; it's quite a convenient API, and that's really all there is to the tracking step; it's a very quick operation. There are a couple of parameters you can tweak, two of which are specified here. One is the search range, which essentially tells it how far you expect a particle to move from one frame to the next; in this case it's set to five pixels. This is an important parameter: if it's too large, it risks merging tracks that don't belong together, and the right value really depends on the velocity of your particles; if your particles don't change position much between successive frames, you can keep this search range low. The other is memory. There are analogous parameters in TrackMate; maybe, if I have TrackMate open here, I can show how that would work. Here it is. If we go to the detection step, the difference of Gaussians, you see the DoG detector, so we could actually use something similar to what we did here; then the "link max distance" is essentially the search range, and the memory parameter corresponds to the "gap-closing max frame gap". So in case you want to get a feel for how the two relate, you can always take one of these and try it in TrackMate. Now, this part is admittedly the most interesting part of this workflow: the analysis of the directionality. We're going to look for the longest track. To do this, we can take the table of tracks, take the particle column, and count how many spots have each particle ID; the longest one in this case is particle 230, and of course I know it's 230 because I ran it before, but if you get another value it's easy to set a different one here. What this next cell does is filter the spot table down to that single track; the reset_index is needed just to make sure that the index runs from zero to, in this case, 71. Then we just plot the x and y positions, and this is what the longest track looks like.
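In pandas, picking out that longest trajectory looks roughly like this, continuing from the tracks table above (230 is the particle ID found in this particular movie, so don't hard-code it for your own data):

    # particle ID that appears in the most frames
    longest_id = tracks['particle'].value_counts().idxmax()   # 230 for this movie

    single = tracks[tracks['particle'] == longest_id].reset_index(drop=True)

    # plot the x/y positions of that one trajectory
    plt.plot(single['x'], single['y'], '-o')
    plt.show()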
Now, to get the average direction, we could take the individual displacement vectors and then take the average of those, but you'll notice, and there's a discussion in the book pointing in particular to figure 6.8, that there's a shortcut. Actually, what I'm going to do is open up the book so that we have it here and can look at that particular figure I was mentioning; it's this one here. Essentially, if you sum these three displacement vectors, the result is just the difference between the coordinates of the last point and the first point, and if you want the average direction, you take the difference between the last and the first point and divide it by the number of steps. That's what I was referring to, and this is a visual of the same concept. So, back to our tracking side. For example, to get the displacement of the longest track, we just take the last element minus the first; that's one way to express it, and then the same for y, and it tells us that it moved 20 pixels in x and minus 41 in y. We can do that for all of the particles, and this is where the expressiveness of pandas comes into play. We take the tracks table, keep just the few columns we need, and then group the table by the particle ID: it essentially splits the table up, track by track, into many sub-tables with the particle IDs as the split criterion. Then we call the aggregate function, to which you can pass a single function, and it applies it to all of the sub-tables it creates. In this case it's an anonymous function, which is just a way of writing an expression inline: given x, it returns the last element minus the first, and that gets applied to every column. So let's run that, and this is exactly what it does: this column is the difference between the last x coordinate and the first x coordinate, this is the last y minus the first y, and this is the number of frames, because you get the last frame minus the first frame, i.e. the length of the track, if you like. There are a lot of them that we're not very interested in, so we can keep only the ones that have at least one frame difference between the last and the first, essentially excluding all of those where a spot was lonely and never found a companion spot in another frame. And we can double-check that the calculations are correct: yes, for particle 230 the differences are all the same as before.
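That per-track displacement calculation is where the groupby/aggregate expressiveness comes in; a sketch, continuing from the tracks table (column names as in the trackpy output):

    # net displacement and duration of every track: last value minus first value
    per_track = (
        tracks[['x', 'y', 'frame', 'particle']]
        .groupby('particle')
        .agg(lambda s: s.iloc[-1] - s.iloc[0])
    )

    # drop "tracks" that only exist for a single frame (no displacement to speak of)
    per_track = per_track[per_track['frame'] >= 1]

    print(per_track.loc[230])   # sanity check against the single-track numbers above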
Now to the fun part: we can calculate the direction of each displacement vector. The function used for this, which has the same name in MATLAB and in NumPy, is a slightly modified version of the arctangent. You know the arctangent is the inverse function of the tangent: you give it the ratio of y divided by x and it gives you back an angle. But when you form that ratio you lose the information about which quadrant you're actually in, so the angle only goes from minus pi/2 to pi/2. With atan2 you give it two arguments, the y and the x, and that gives it enough information to calculate an angle that covers the entire plane. So we just take the two displacement columns and apply the arctan2 function, giving it the y and then the x coordinate. The length of the vector is quite easy: you take the square of the displacement in x, add the square of the displacement in y, and take the square root; that gives you the length of the vector. Now that we have all of the directions, one interesting way of looking at them is a polar plot, which I'm sure you're familiar with: a plot where you identify points by their distance from the center and an angle, which is exactly what we calculated, and so you get this polar plot here. Of course, one of the problems with any scatter plot is that when points get denser you lose information and everything merges together, so a better way to look at this is a histogram of the directions. If we do that, we see a very nice bimodal distribution, which essentially means that in our video most of the particles are going either up or down; that's what's shown here. Now, if you had an image where the directionality is not very clear and the profile looks rather flat, you might wonder whether it's just random, i.e. a uniform distribution across the angles, or whether it actually shows a preference in directionality. To check that, you can do a chi-square test and compare against the uniform distribution. In Python, the scipy.stats module includes a chisquare function, and if you don't give it any expected frequencies it compares against the null hypothesis that each bin has exactly the same number of elements. If you run it here, the p-value is very, very small, which means you reject the null hypothesis that the distribution is uniform; it does have some kind of non-uniform shape.
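A sketch of those steps, continuing from the per_track table (the bin count is an arbitrary choice; np.arctan2, np.hypot, and scipy.stats.chisquare behave as just described):

    import matplotlib.pyplot as plt
    import numpy as np
    from scipy import stats

    # direction and length of each track's net displacement
    angles = np.arctan2(per_track['y'], per_track['x'])
    lengths = np.hypot(per_track['x'], per_track['y'])

    # polar scatter of the displacement vectors
    ax = plt.subplot(projection='polar')
    ax.plot(angles, lengths, '.')

    # histogram of the directions
    counts, bin_edges = np.histogram(angles, bins=18)
    plt.figure()
    plt.bar(bin_edges[:-1], counts, width=np.diff(bin_edges), align='edge')

    # chi-square test against a uniform distribution over the bins
    print(stats.chisquare(counts))   # tiny p-value -> directions are not uniform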
The consequence of that is that when you pass it a single array, you have to tell it whether that is one point with many features or many points with a single feature, and you do that through the shape of the array: you essentially have to convert the 1D array into a 2D array, even if one of the dimensions is one. In this case we want everything in a single column, and that is what this reshape is doing: you fix the number of columns to one and then you say, just calculate the number of rows for me. The nice thing about putting minus one there is that if you change the number of spots, you don't have to manually specify or recalculate the number of rows of the output. Once you have done that, it's just a fit, and then you extract the estimated parameters from that fit; in this case it says that the two estimated angles are 1.6 and minus 1.45.
Let's have a look at what these look like. We plot the bar plot as a background, and in front of it a stem plot where the x coordinates are the two means and the y coordinates are just the maximum count, so the stems go right up to the top here; I've painted them in red (I forget what this other parameter was, but it was needed here). This notation simply says: take this list and multiply it by two, meaning make it twice as long; another equivalent, but less nice-looking, way is to write it out explicitly, and it has exactly the same meaning. Good. Now we can convert those angles to degrees, and there we have it: at this point we have the estimate of the two angles.
You might be wondering how confident we are in that estimate. This last step is essentially bootstrapping: repeatedly resampling your list of angles, recalculating the estimate again and again, and seeing what the variability of that estimate is. For repeatability I fixed the random number generator seed, which means that if you try it and I try it, we both get the same result. What I'm doing here is defining a very simple function which takes the list of angles, does the Gaussian mixture model fitting, and returns the two means as a list of two elements; if I apply it to the entire list, I get the same result as before. Then we apply it 100 times: we take a copy of the angles, shuffle it, and append the sorted pair of angles we get from the estimation to an initially empty list. We need the sorted here because there are two elements, and depending on how the algorithm is initialized it may find that the first one is the minus 1.45 or the first one is the 1.6, so sorting just ensures they are always in the same order. After that, you put everything into a single array.
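A minimal sketch of the mixture-model step might look like the following; the variable name `angles`, the bin count, and the plotting details are assumptions rather than the notebook's exact code.

```python
# Minimal sketch of the mixture fit, assuming `angles` is the 1-D array of
# directions from above (names are illustrative, not the notebook's exact code).
import numpy as np
import matplotlib.pyplot as plt
from sklearn.mixture import GaussianMixture

# scikit-learn expects a 2-D array: one row per sample, one column per feature.
# reshape(-1, 1) fixes the number of columns to one and lets NumPy work out the
# number of rows, so the code keeps working if the number of spots changes.
X = angles.reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0)  # we expect two directions
gmm.fit(X)
means = gmm.means_.ravel()
print("estimated angles (rad):", means)
print("estimated angles (deg):", np.degrees(means))

# Histogram as background, stem plot of the two means on top, reaching up to the
# maximum count; `[counts.max()] * 2` is the "multiply the list by two" trick.
counts, edges, _ = plt.hist(angles, bins=36, alpha=0.5)
plt.stem(means, [counts.max()] * 2, linefmt="r-", markerfmt="ro", basefmt=" ")
plt.xlabel("angle (rad)")
plt.show()
```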
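The variability check and the percentile confidence interval described next could be sketched roughly as follows; this again assumes the `angles` array from above, and the exact resampling in the notebook may differ (a classic bootstrap would resample with replacement, whereas shuffling mainly probes the sensitivity to ordering and initialization).

```python
# Hedged sketch of the variability check: refit the mixture on reshuffled copies
# of the angles and look at how much the two estimated means move around.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_two_means(a):
    """Fit a 2-component Gaussian mixture and return its two means, sorted."""
    gmm = GaussianMixture(n_components=2).fit(a.reshape(-1, 1))
    # Sorting makes the result order-independent: depending on initialization
    # the first component may be either the negative or the positive angle.
    return np.sort(gmm.means_.ravel())

rng = np.random.default_rng(42)        # fixed seed so the run is repeatable
work = angles.copy()
estimates = []
for _ in range(100):
    rng.shuffle(work)
    estimates.append(fit_two_means(work))
estimates = np.asarray(estimates)      # shape (100, 2): one row per repetition

# 95% confidence interval: the 2.5th and 97.5th percentiles of each column.
low, high = np.percentile(estimates, [2.5, 97.5], axis=0)
print("angle 1 CI (rad):", low[0], high[0])
print("angle 2 CI (rad):", low[1], high[1])
print("in degrees:", np.degrees(low), np.degrees(high))
```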
At that point we can compute things like a confidence interval: we take the upper 2.5th percentile and the lower 2.5th percentile and put those together. You can see that for the first angle the interval goes from minus 1.45 to minus 1.44, and for the second one from 1.64 to 1.65; you can get the same in degrees, and you can see that the estimate is really quite tight. What that essentially means is that if you model the distribution with those two Gaussians, the dependence on the actual data points is quite low and the estimate is pretty consistent. And that's it; this is just a visual representation: if we take the first angle, these are all of the estimates we got, all of the means for the first angle, and you can see that they are quite tight, from minus 1.45 to minus 1.447, and for the second angle from 1.646 to 1.652, so they are pretty tight. You can also summarize this with a circular plot, which is essentially a bar plot drawn on a polar diagram as a background. I put an exercise down here: how would things change if we used a different way of estimating those two clusters of angles? There is a snippet you can paste at the appropriate locations in the code so that you can test different ways of estimating the angles.
Looking at the chapter, you were using at least three different software packages there: ImageJ, and MATLAB and R for the statistical part. It is really a success to see how every part was integrated together, and now it all runs just through Python. You do this in Python because it is compatible with Google Colab; do you see the possibility of doing it in another programming language on the cloud? Would MATLAB be a good choice if you want to do things in the cloud?
I don't have personal experience with that, but I know there is MATLAB Online, so there is a way to run MATLAB online, although I haven't tried it much myself, and some of the other commercial packages, like Mathematica, also have the option of running things online, so that would certainly be interesting to try. R you can definitely run in the cloud; I'm not sure about Colab, but in Jupyter you can definitely specify R as the language and run R code in your notebooks if it is configured. In particular, the BeakerX kernel allows you to switch languages within a single notebook, so within the same Jupyter notebook you could combine Java, R, Python and other similar languages (not MATLAB, of course). So there is definitely the possibility of using different ones. And of course, with some work, you could create a Fiji plugin, for example, to do all of this, but there are a lot of things, like all of the plotting and the statistics, that are not best suited for Fiji; those require external libraries, and I'm not aware of many Java equivalents. I'd say that if you had to learn one language to do most of the things you might encounter, at least from a coding perspective, it is hard to find something that includes everything the way Python does. Then again, Julia might be the next Python, but I haven't tried to code all of this in Julia. Okay.
Thank you. And, as far as you are aware, are there other libraries capable of fitting multiple distributions, other than the Gaussian mixture?
Yes; scikit-learn is a very generic machine learning library and it has a lot of clustering algorithms, so there are a number of different ways you could do the same analysis. In fact, for the book I implemented a MATLAB routine to find circular modes which doesn't actually do fitting: it basically does peak detection on the histogram, and that is a somewhat different approach. The results you get are slightly different, because what you are essentially looking for there is the most common angle, the mode, and that corresponds to the peak of the histogram; in the case of the Gaussian mixture model, by contrast, you are assuming that there are two normal distributions mixed together, and therefore it works best whenever the two distributions actually look normal. Every algorithm has its own assumptions, and it is important to be explicit about them and to think about them in a way that is specific to your project. So it is definitely possible to use many different algorithms, and I put a couple in the notebook just because I think they are interesting. K-means is very generic too, and it's also quite robust to initialization... well, not robust to initialization, I should say; it just tends to give consistent results. In case you don't know how k-means works (probably everyone does, but I'll repeat it anyway): you take two initial estimates, you assign each point to one or the other cluster, you take the centroid of each of the two clusters, and then you repeat that estimation until you reach some kind of convergence. If you're curious, I'd encourage you to run the test again and try it yourself to see how it looks; but if you want the answer, feel free to contact me and I can tell you whether the results look similar or not.
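For anyone who wants to try the k-means variant mentioned here, a short, hypothetical sketch on the same `angles` array could look like this; comparing its cluster centres with the Gaussian-mixture means is exactly the kind of test the exercise in the notebook suggests.

```python
# Hypothetical sketch: k-means as an alternative way of estimating the two
# dominant directions (assumes the `angles` array defined earlier).
import numpy as np
from sklearn.cluster import KMeans

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(angles.reshape(-1, 1))
print("k-means centres (rad):", np.sort(km.cluster_centers_.ravel()))
```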