So, good morning. My name is Stefania Melillo, and I am a researcher at CNR, where I am in charge of the experimental and tracking activities of the COBBS lab, directed by Andrea Cavagna and Irene Giardina. In this talk I will first introduce our experiment, and then I will present SPARTA, a spatio-temporal tracking algorithm that we developed over the last years to track individuals in large and dense animal groups.

The activities of our lab are focused on collecting data on bird flocks and insect swarms in their natural environment, to retrieve the 3D trajectory of each individual in the group. So, first of all, why is our experiment so hard? First, because we want to reconstruct the 3D position of each bird, that is, to recover its X, Y, Z coordinates, and this cannot be done with only one camera, so we need a multi-camera system. Second, as you already saw yesterday in Andrea's and Alexandra's talks, we work in the field. As you can see in these images, we stand on the roof of a building in the winter cold to collect data on birds, and against the sun in summer, in several parts of Rome, to collect data on midges. Besides this very hard physical work, running an experiment in the field means that every day we need to bring the equipment to the site, mount everything, collect the data, then dismount everything and go back home. That is quite hard, but the real reason why the experiment is so hard is that accuracy in the 3D reconstruction does not come for free: it depends on two main factors. It depends on the setup you choose for your cameras, in particular on the baseline between the cameras, and on the accuracy of the calibration of the system, in particular the accuracy in the measurement of the mutual orientation of the cameras. Again, we are in the field, so we need an easy way to measure the mutual position of our cameras, one that we can repeat every day while keeping the accuracy high.
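To see why the baseline matters so much, here is a back-of-the-envelope sketch of the standard first-order stereo error formula (depth error grows as Z²/(f·B)); the focal length, disparity error and distances below are illustrative assumptions, not the lab's actual parameters.

```python
# Sketch of how depth error scales with baseline in a two-camera setup.
# All numbers here are illustrative assumptions, not the actual setup.

def depth_error(z, baseline, focal_px, disparity_err_px=0.5):
    """First-order stereo depth uncertainty: dZ ~ Z^2 / (f * B) * dd."""
    return z ** 2 / (focal_px * baseline) * disparity_err_px

f = 4000.0   # hypothetical focal length in pixels
z = 100.0    # birds roughly 100 m away

for b in (5.0, 25.0):
    print(f"baseline {b:>4.0f} m -> depth error ~ {depth_error(z, b, f):.2f} m")
```

With these toy numbers, widening the baseline from 5 m to 25 m shrinks the depth uncertainty by the same factor of five, which is the trade-off behind choosing a wide 25 m baseline despite the logistical cost.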
For our experiment with birds, the birds are about 100 meters away from the cameras. We chose a baseline of 25 meters between the cameras, which means we need to find roofs where we can place the cameras that far apart, and we need to bring 25-meter cables with us. What we do is fix the position of the cameras before the experiment starts, and then wait for the birds to enter our field of view to collect the data. This can be very frustrating, because sometimes crazy things happen just outside your field of view, like birds seemingly spelling out letters of your name, but you cannot move the cameras; you can only wait for something to happen in your field of view.

To calibrate the system we use a simple but very effective method. We mount each camera on a bar with a gauge on its side, and we pull a fishing line from one side of a bar to the far side of the other camera's bar. We use this fishing line as a marker on the gauge to measure the position of the cameras, and we test our accuracy by performing 3D tests. We go to the roof of our building, set up the cameras as if we were in the field, put targets on the roof in front of our building, measure the distances between the targets, take pictures of these targets, and reconstruct their positions, so that we can compare the estimated distances between the targets with the real ones. We achieve a very good result: the error is below 1% for targets that are very far apart, and below one centimeter for targets that are very close to each other, at a distance of about 70 centimeters, which is the body size of our birds.

The situation with midges is easier because the setup is much smaller: the midges are 8 meters from the cameras and the baseline is about 6 meters, and here we can post-calibrate the position of the cameras.
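The accuracy test described above boils down to comparing reconstructed target distances with tape-measured ones; a minimal sketch, with made-up distance pairs, could look like this.

```python
# Minimal sketch of the accuracy test: compare reconstructed target
# distances against measured ones. The (measured, reconstructed) pairs
# below are invented for illustration, in metres.

def relative_error(measured, reconstructed):
    return abs(reconstructed - measured) / measured

pairs = [(50.0, 49.8), (20.0, 20.1), (0.70, 0.705)]

for m, r in pairs:
    print(f"{m:>6.2f} m: relative error {100 * relative_error(m, r):.2f}%")
```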
So we start the experiment chasing the swarm, and we can move the cameras; as soon as we start the acquisition, we stop moving the cameras and take the data. At the end of the day we take 50 pictures of these two targets here, we measure the distance between the targets, and we use 30 of these pictures to fit the angles between the cameras, while the other 20 are used as control targets. Even in this case we get a very good result, because our errors are below 1%, both on the targets used to fit the angles and on the control ones.

Let me show you some videos. This is a movie of one of our starling flocks, which we took at Termini station, where the birds roost during winter. It is a flock of about 600 birds, and as you can see from these pictures, we lose all the details on the birds, which appear as small black dots over a light background. This is a movie of a swarm that we collected in Parco degli Acquedotti. Here we use the backscattering of sunlight on the midges to have them appear as white dots over a dark background. In both cases we lose the details on the individuals, and this makes the tracking very hard, because we cannot use any information about the individuals.

The main issue for all 3D tracking methods is how to deal with optical occlusions, which occur every time two objects lie on the same optical line of one camera, so that they produce the same image on that camera. When this happens, you completely lose the identity of the objects you are looking at. This is what happens in this case: the blue and the green objects are first separated, then they undergo an occlusion, and afterwards they are separated again, but even by eye you cannot say who was who, which was the green and which was the blue bird. In the best-case scenario, using two cameras, you can use the other camera to recover the identities.
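The fit/control split described above can be illustrated with a toy sketch: use part of the target pictures to fit a camera angle by minimizing the distance error, then check the residual error on the held-out control pictures. The reconstruction model, angles and noise below are all invented stand-ins, not the lab's actual geometry.

```python
# Toy sketch of post-calibration: fit an angle on 30 "fit" pictures,
# verify on 20 "control" pictures. The model is a stand-in, not the
# actual camera geometry.
import math

TRUE_THETA = 0.0300   # radians; unknown in a real experiment
TARGET_DIST = 0.50    # measured distance between the two targets, metres

def reconstruct(theta, noise):
    # Stand-in model: reconstructed distance deviates from the truth
    # roughly in proportion to the angular miscalibration.
    return TARGET_DIST * (1.0 + 2.0 * (theta - TRUE_THETA)) + noise

# Deterministic pseudo-noise standing in for per-picture measurement error.
fit_noise = [0.001 * math.sin(7.0 * k) for k in range(30)]
ctrl_noise = [0.001 * math.sin(7.0 * k) for k in range(30, 50)]

def cost(theta):
    return sum((reconstruct(theta, n) - TARGET_DIST) ** 2 for n in fit_noise)

# Simple 1D grid search over candidate angles.
best = min((cost(t * 1e-4), t * 1e-4) for t in range(1000))[1]

ctrl_err = max(abs(reconstruct(best, n) - TARGET_DIST) / TARGET_DIST
               for n in ctrl_noise)
print(f"fitted angle {best:.4f} rad, worst control error {100 * ctrl_err:.2f}%")
```

The point of the held-out control set is exactly what the talk describes: the fitted angles are only trusted because the error stays below 1% on pictures that played no role in the fit.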
This happens when the two birds are separated in 3D space, and only occluded in the 2D space of one camera, as in this case here, where the upper video is the same as before, while in the bottom video you see the other camera, where the two birds are always separated. Using the matching information of the two cameras, you can recover the identities and overcome the occlusion. Most tracking algorithms are designed to solve this kind of occlusion. The worst-case scenario, instead, is when the two objects are really close in 3D, so that they are occluded in both cameras. You lose the identities in both cameras, as in this case here, where the blue and the green objects get occluded at the same time in both cameras. At this point the information from the two cameras is useless, because you cannot recover the identity from either camera, and what can happen in the output of your tracking is that you switch the identities of your birds.

Our new tracking method, SPARTA, is designed to solve this kind of occlusion, which is particularly hard, and to do this we move from 3D body centers to 3D volumes. What we do is this: we first detect the objects in the images, but instead of associating to each object its 3D body center, we consider all the pixels belonging to the object, and we match these pixels across the cameras.
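Once every object is a 3D cloud rather than a single center, a 3D occlusion shows up as a cluster that is much larger than a single body. A minimal sketch of that size cue, with invented point coordinates and an assumed typical body diameter:

```python
# Sketch of the volume-based occlusion cue: a reconstructed 3D cluster
# much larger than a typical body signals two occluded objects merged
# into one cloud. Coordinates and sizes below are invented.
import math

def cluster_size(points):
    """Max pairwise distance (diameter) of a 3D point cluster."""
    return max(math.dist(p, q) for p in points for q in points)

bird_a = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.05)]
occluded = bird_a + [(0.5, 0.5, 0.5), (0.6, 0.5, 0.5)]  # two bodies merged

TYPICAL = 0.15  # assumed typical single-body diameter, metres
for name, pts in (("single", bird_a), ("occluded", occluded)):
    flag = "3D occlusion?" if cluster_size(pts) > 2 * TYPICAL else "single body"
    print(f"{name}: diameter {cluster_size(pts):.2f} m -> {flag}")
```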
This is something that is well known in other fields of tracking, but not in collective behavior, not with such large groups. Doing this, we reconstruct the volume of each object we are looking at, as a 3D cloud of points. In the case of a 2D occlusion, a simple occlusion like the one before, since we are working directly in 3D space, we will have two well-separated 3D clusters, the green and the blue. When there is a 3D occlusion, instead, we get a single cluster, much bigger than the normal ones, because it represents the volumes of two objects very close to each other. What we want to do is to use dynamical information to split this volume into the two occluded objects.

So we first perform temporal linking: we use point-to-point linking based on velocity to associate the 3D clusters in time, and we build a cluster graph, made of clusters linked in time. At this point we identify the connected components of this graph. These components can be made of one-to-one connected clusters, like the tree on the left, or some of the clusters can have multiple links. The one-to-one components represent single trajectories, like the ones we obtain for objects in 2D occlusion, while when two objects undergo a 3D occlusion, the connected component they belong to has a typical X shape, with the occluded cluster at the center of the X.

What we do now is to focus on a few frames around the occlusion. We switch back from clusters to points, and we define an energy function H, which is the negative sum over pairs of distinct points i and j of w_ij δ(x_i, x_j), where x_i denotes the cluster to which point i belongs, δ(x_i, x_j) equals 1 when the two points are assigned to the same cluster and 0 otherwise, and w_ij is a coefficient associated with the pair (i, j).
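The energy above can be sketched in a few lines; the weights and cluster labels below are toy values, chosen so that one pair of points should attract and one should repel.

```python
# Sketch of the energy H = -sum_{i != j} w_ij * delta(x_i, x_j), where
# x_i is point i's cluster label and delta is 1 when the labels match.
# Weights and labels are toy values.

def energy(weights, labels):
    """weights: dict (i, j) -> w_ij; labels: list of cluster labels."""
    return -sum(w for (i, j), w in weights.items() if labels[i] == labels[j])

# Four points: 0-1 and 2-3 are close (attractive links, positive w),
# while 0-2 and 1-3 are far apart (repulsive links, negative w).
w = {(0, 1): 1.0, (2, 3): 1.0, (0, 2): -1.0, (1, 3): -1.0}

merged = [0, 0, 0, 0]   # everything in one cluster
split = [0, 0, 1, 1]    # the partition we want
print(energy(w, merged), energy(w, split))
```

With these weights, keeping everything merged scores worse than splitting the cloud in two, so the minimum of H is the two-object partition, which is the behavior the weight design is aiming for.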
The coefficient w_ij is attractive when it is positive, because keeping the two points in the same cluster then decreases the energy, and we are looking for the partition that minimizes this energy; when w_ij is negative, we call it a repulsive link, because keeping the two points in the same cluster increases the energy. So the crucial point is how to choose the w_ij in such a way that the solution found by minimizing the energy splits the X component into the two different identities.

We define the w_ij from both static and dynamic information: static coefficients w_ij between points i and j belonging to the same frame, and dynamic coefficients between points belonging to two consecutive frames. w_ij is positive for points at a distance smaller than the average nearest-neighbor distance between points: here we are positively linking points that are very close to each other, so that single clusters will be highly connected. And we give w_ij negative values when the points i and j are far from each other, at a distance bigger than the average cluster size. The effect is that in a cluster representing an occlusion, points that are close together are connected by attractive links, shown here in blue, while points that are very far from each other are connected by repulsive links.
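The weight rule just described can be sketched as a simple function of inter-point distance; the two length scales and the unit weight values below are invented placeholders for the averages the talk mentions.

```python
# Sketch of the weight rule: attract points closer than the average
# nearest-neighbour distance, repel points farther apart than the
# average cluster size. Both thresholds below are invented.

def weight(dist, nn_scale=0.02, cluster_scale=0.30):
    if dist < nn_scale:        # very close: attractive (positive) link
        return +1.0
    if dist > cluster_scale:   # farther than a body: repulsive (negative) link
        return -1.0
    return 0.0                 # intermediate distances: no link

for d in (0.01, 0.10, 0.50):
    print(f"distance {d:.2f} m -> w = {weight(d):+.1f}")
```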
The idea is this: when we have an occlusion, we have a bigger volume, and we want to split this volume in such a way that the points on the two sides of the object end up in different partitions; these negative links are also useful to separate the clusters belonging to different branches of the X. In the end, from a static point of view, we have single clusters that are highly and positively connected, occluded clusters with some repulsive links, and clusters belonging to different branches connected by negative links. We then use the dynamic coefficients to link together the clusters belonging to the same branch, and what we want to do is to cut these negative links here, minimizing the energy we defined before. And this is exactly what happens: the minimization of the energy gives us two different connected components, the blue and the green one, which correspond to the two different identities that were occluded in the central cluster.

We tested this method on two datasets published by Zheng Wu from Boston University, of bats flying out of a cave; the two datasets differ in the density of the groups. We evaluated the results by comparing SPARTA against three other methods, in terms of MOTA, the Multiple Object Tracking Accuracy, which measures the percentage of well-reconstructed points, and in terms of identity switches. What we see on these two datasets is that SPARTA gives the highest value of MOTA on both, with a negligible number of identity switches, second only to GReTA, which is the other method developed in our group. So we think this is a promising new approach; we still have to improve it and test it on more data.
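For reference, the MOTA score mentioned above is usually defined as one minus the sum of misses, false positives and identity switches over the number of ground-truth objects; the counts in this sketch are invented.

```python
# Sketch of the standard MOTA definition:
# MOTA = 1 - (misses + false positives + identity switches) / ground truth.
# The counts below are invented for illustration.

def mota(misses, false_positives, id_switches, n_ground_truth):
    return 1.0 - (misses + false_positives + id_switches) / n_ground_truth

print(f"MOTA = {mota(50, 30, 2, 10_000):.4f}")
```

Note that identity switches enter MOTA with the same unit cost as any other error, which is why the talk reports them separately: a tracker can score a high MOTA while still swapping identities, and for trajectory analysis those swaps are the errors that matter most.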
The method is not published yet, but you can find all the details on the arXiv. This work was developed with Leonardo Parisi and Federico Ricci-Tersenghi, and that's it. Thanks for your attention.