Okay. Hello. Our next presenter is Florian Wilhelm. He is a data scientist at Inovex. And the title of his presentation is Handling GPS Data with Python. Thank you, Florian. Thank you, everyone. Welcome to my talk, Handling GPS Data with Python. My name is Florian Wilhelm and I'm a data scientist at Inovex. Inovex is a technology consulting company based in Germany, with offices in all the major cities. So I once had a project where the task was to relate the brake pad wear of trucks to the routes they're driving, how much they're driving and where they're driving. This was the first time I had to deal with GPS data, because what we had from those trucks was GPS data and, of course, the current condition of the brake pads. And that's how I came to this topic. When I tried to find information about what kind of libraries I could use to fulfill this task, I found it quite hard, actually. Normally, whatever Python libraries you're searching for, you get a lot of good tutorials and so on, but for GPS it was kind of hard. And that's the reason for this talk: I want to share some of the things I found out during this task. And, of course, as a mathematician I'm also interested in the mathematical algorithms in that domain. And it's always good to have a talk, right, when you're going to a cool conference like EuroPython. So first of all, when you start dealing with GPS data, it needs to be stored in some way, right? The typical format for GPS is the so-called GPX, the GPS Exchange Format. It's based on XML and it describes three different things. We have waypoints. We have a route: the route is the way you actually want to go, if you're, for instance, using it for hiking. And then the actual track, which is the route you actually took, because maybe there was some kind of mountain in between.
And this is basically what this format describes. To give you an idea of the general structure: we have the waypoints, as shown before. We have the route, and the route is just a list of route points. And we have the track. The track is composed of several segments. You use segments, for instance, when you lose your GPS signal and a new segment starts, or when you hit pause on your running watch, then a new segment starts. And each segment contains many track points, which build up the track you actually took. For this presentation I will not use the customer data, of course, but data I took from my watch, from Polar Flow, it's called. This contains only a track, because if you go for a run, you normally don't plan a route, you just go for a run and it tracks your points. We have the latitude and the longitude and the elevation, and, of course, the time when each measurement was taken. So how do we now deal with those kinds of files? There's one library called gpxpy. It's a GPX file parser for reading, and you can even write GPX files. It's licensed under Apache 2.0. It contains a nice small command line tool, gpxinfo, that gives you some basic stats about your files. So if you're just interested in, say, your average velocity, you can just use that. It's Python 3 compatible, which is always important with small libraries like this. It's written by Tomo Krajina and it's used on his website, trackprofiler.com. So how do you actually use it? It's really, really easy. You just import it, open the file as a file handle and parse it, and basically that's it. What you get back is an object, and you can then use the tracks attribute to access the different tracks. In this case, we only have one track; and the segments, in this case I also had only one segment.
And then normally, if we want to work with this information somehow, we make a pandas DataFrame out of it, and that's exactly what I did here: set the time as the index, and that's the DataFrame. So far, so simple. Of course, since GPX data is something really visual, we want to plot it. To give you an idea, this was a run I did in Hamburg. This is what you get if you just plot the longitude and the latitude with the help of matplotlib. And this was just the track, just a line. What happens if we plot not only the line but the actual points as dots? Then we see that we have a lot of points, more than 10,000 points actually. At that point you might think, okay, maybe 10,000 points for a small run is way too much. So how can we reduce the number of points, so we have less data to deal with, without actually destroying the geometry of the track? A really simple trick is to use only every 105th point. In this case it works, because I was always running at about the same speed, but it would not work if I had done this by car. We also see that on straight lines like here, you would of course need fewer points, only two points actually, to describe a straight line, but in a curve you need more points to describe the geometry. So you see there's the need for a proper algorithm to simplify a GPS track. One algorithm that is quite often used is the so-called Ramer-Douglas-Peucker algorithm. I think it's best explained by showing how it works. Let's assume we have this GPS track and we want to simplify it. What does simplify mean? We have to give some epsilon, some error that we say is okay to make, but not more than that epsilon. The algorithm works by taking the first and the last point, drawing a straight line between them, and considering this epsilon environment around that straight line.
And the algorithm goes: if all points are inside this epsilon environment, then we can just remove all points in between. If that's not the case, we take the point which is farthest away from the line, which is this one, and apply the same algorithm recursively on the two segments we get by splitting at that point. So one segment is this one here and the other will later be that one. We start by inspecting this one. We see, okay, all points are in this epsilon environment, so we can just remove them. Now we do the same here. We have the problem that not all points are included, so we take the one that is farthest away, this one, and split up again: from here to there, and then from there to there. So first this one: we can remove that point. Here we have the same problem again. And this is how the algorithm works recursively. We see that we have simplified the track and reduced the number of points, and if we go back and forth, you see this even better. This is an algorithm that is quite often used to simplify a GPS track, and we will make use of it later. So in this case, I just used this algorithm. And at that point, one word of caution: never ever use deep recursion in Python. You will run into problems if your recursion is too deep; with a recursion depth of more than 1,000, you will get into trouble. So it's better to reformulate this algorithm iteratively, which makes it more complicated, but this is what I did here: I used an implementation that is iterative. I ran it on the longitude and latitude, and I could reduce the number of points from more than 12,000 to fewer than 200. And we see here the outcome. As expected, straight lines really are straight lines, and here, for instance, in a curve, more dots are used. And that's it about the simplification.
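As a sketch of what such an iterative reformulation can look like, here is an illustrative NumPy implementation that replaces the recursion with an explicit stack. This is not the exact implementation used in the talk, just one way to write it.

```python
# Illustrative iterative Ramer-Douglas-Peucker for 2-D points.
import numpy as np

def rdp_iterative(points, epsilon):
    """Simplify a polyline, keeping every point whose perpendicular
    distance to the current chord exceeds epsilon. An explicit stack
    of index pairs replaces the usual recursion."""
    points = np.asarray(points, dtype=float)
    keep = np.zeros(len(points), dtype=bool)
    keep[0] = keep[-1] = True
    stack = [(0, len(points) - 1)]
    while stack:
        start, end = stack.pop()
        if end <= start + 1:
            continue
        chord = points[end] - points[start]
        rel = points[start:end + 1] - points[start]
        norm = np.linalg.norm(chord)
        if norm == 0.0:                   # degenerate chord: plain distance
            dists = np.linalg.norm(rel, axis=1)
        else:                             # perpendicular distance to the chord
            dists = np.abs(rel[:, 0] * chord[1] - rel[:, 1] * chord[0]) / norm
        idx = int(np.argmax(dists))
        if dists[idx] > epsilon:          # farthest point outside the band:
            keep[start + idx] = True      # keep it and split there
            stack.append((start, start + idx))
            stack.append((start + idx, end))
    return points[keep]

# Straight runs collapse to their endpoints, corners survive.
track = [(0, 0), (1, 0.01), (2, 0), (3, 1), (4, 2.01), (5, 3)]
print(rdp_iterative(track, epsilon=0.1))  # keeps (0, 0), (2, 0) and (5, 3)
```

Because the stack lives on the heap, the depth limit of the Python call stack no longer matters, which is exactly the problem with naive recursive implementations on long tracks.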
So when I was doing this project with the trucks, I had only longitude and latitude. And, of course, I was interested in finding out the uphill and downhill distance the truck was actually doing. But where do I get the elevation from? So I was looking around. In my track, this is the track from a bicycle ride I did in Italy, I do have the elevation, but I will now just remove it, so that it's like it was in the project, where I didn't have it. In that case, there's a really cool extension to gpxpy called SRTM.py. SRTM stands for Shuttle Radar Topography Mission, so it's SRTM elevation data for Python. Maybe some of you remember: in the year 2000 there was this huge NASA mission where they used radar to measure the elevation almost everywhere on Earth. This data is publicly available, and SRTM.py is just an interface to it. You can really easily use it to get elevation data. You just import it, call srtm.get_data(), and add the elevation to the GPX object we opened before. What SRTM.py will do is start downloading the partial files with this data and add the elevations to your GPX object. Additionally, we can ask it to smooth the elevation, so that neighboring elevations are averaged. And if I plot it to compare the elevation my sensor measured with the SRTM data, we see that it's basically almost the same. So this works great if you don't have the elevation data but you somehow need it. Of course, if you do a project, in the end you want to present something to the customer, and the customer always likes nice pictures. So I was also facing the problem of how to visualize the data in a nice, customer-friendly way. And I found out about a library called mplleaflet. The nice thing about it is that it can take any matplotlib plot and put it into a pannable, zoomable slippy map.
So all these maps you see on web pages, where you have OpenStreetMap or Google directly embedded in a web page. It's extremely easy to use, also under a new BSD license. It integrates fantastically well with the Jupyter notebook and it's Python 3 compatible. And one word of caution: you should definitely simplify your tracks with the Ramer-Douglas-Peucker algorithm first, because if you start using it with 10,000 points, it will not work. Okay, so how does it work? As I said, it's really simple. We start with the matplotlib plot we have seen before, and now we just project it onto OpenStreetMap. We import mplleaflet and say: display the figure we created before. In Jupyter this will embed it into the output widget; or if we had said show, it would just open a new window. And this is fully interactive and zoomable, and it's two lines of code to actually embed this somewhere. I really, really like this, it is so easy. Another track that I want to talk about a little bit more now is this bicycle ride I did in Austria and Italy here. When I was looking at the data of this track, I found some curious things. If I call the get_uphill_downhill method of gpxpy, it tells me that I actually climbed 4,446 vertical meters, and descended about the same. I was wondering, okay, sounds good, that's a lot of vertical meters to do by bike. But the organizer told me, or it said on the website, that it's roughly 2,700 vertical meters. Strange. So somehow my sensor must have measured something strange, or maybe I was doing something wrong. So I started investigating, and when I looked closer at the elevation, I realized that there are these bumps here, and these bumps are quite unrealistic. We directly see that the elevation sensor measures with an accuracy, or precision, of one meter.
And sometimes it jumps up and down, and up and down again. If you do that a lot and you measure the uphill distance on that side, you gain uphill meters, although this seems quite unrealistic, right? So we have to somehow smooth the data, because this is an artifact of the sensor, not the actual position I was at at that time. Also, when I was looking at the speed, it felt quite strange that I was sometimes doing 230 kilometers per hour on a bicycle. I mean, I was going downhill sometimes, but even downhill that's way too much. And I was using gpxpy's add_missing_speeds function, which basically takes the times and the distances and calculates the speeds, so it shouldn't be a mathematical problem; the problem has got to be somewhere in the data. We directly see that this peak is extremely unrealistic. But how do we know that this is unrealistic? Well, we have the picture of a bike in our head, and we say: a bike has a position and some velocity, and if we know the velocity, then the next position will somehow be based on the current position and on the speed vector we have. Or if you see someone walking, if you see me walking, then you know that in the next second I'm going to be over there, because that was the direction I was going. So we have this kind of model in our head, and this is exactly what the Kalman filter is all about. Kálmán, by the way, passed away just three weeks ago; a really beautiful mind. He came up with the idea of giving a model to all kinds of physical processes, and in this case we want to describe my bicycle ride. The idea is that you have states, states you can't really directly observe, the real states. And I can think of me on my bicycle as having a position and some velocity.
So this is the state, the state x. And somehow I want to project it into the next state. If I have, like I said before, the position and the velocity, I can take the velocity times the time step, add it to the position, and I get a rough idea of the next position. In the general form, he was a mathematician, so he made it as general as possible, we also have a control vector. In this case we don't need it, but for instance, if you're looking at a falling apple or a falling ball, this could be the acceleration due to gravity. And of course we always have an error term: when I'm walking and slowly changing my direction, that's part of the error term. Besides the state equation, we also have the measurement equation. This is where the GPS idea comes into play. The state, as I said, we cannot directly observe, but we observe a measurement, and the measurement is somehow generated from the state plus an additional error. That's basically the main idea. So now, on the one hand, I have a model of how I think some physical process behaves, like my ride on a bicycle; on the other hand, I get measurements, and those will never be exactly the same. I somehow need a method to bring the two together, and this is the main idea behind the Kalman filter. I take everything I know up to a certain point about the process and about all prior measurements; this is called x-hat-minus here, a kind of a priori state estimate. Then, in the next step, I get the measurement, z_k. And then I want to get an a posteriori state estimate that is better than my a priori knowledge. I do this by taking my a priori estimate plus some residual between the measurement and the measurement I would predict directly from my a priori state. And this residual is multiplied by K_k, the so-called Kalman gain.
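Written out in standard notation, the prediction and correction just described are (with F the transition matrix, B and u the optional control part, H the measurement matrix, and K_k the Kalman gain):

```latex
\begin{aligned}
\hat{x}_k^- &= F\,\hat{x}_{k-1} + B\,u_k
  && \text{(a priori prediction from the model)} \\
\hat{x}_k &= \hat{x}_k^- + K_k \left( z_k - H\,\hat{x}_k^- \right)
  && \text{(a posteriori correction with the residual and gain } K_k\text{)}
\end{aligned}
```

The term in parentheses is exactly the residual between the actual measurement z_k and the measurement the a priori state would predict.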
And finding an optimal Kalman gain is now the hard part, optimal in the sense that you want to minimize the error covariance. And if we have something like this, and actually we have, we can use it to predict and to correct, and do this iteratively with each new measurement, and then we have the Kalman filter. So what happens is that we use our transition equation, our model of the physical process, and make a prediction with the a priori knowledge, also for the error covariance. Then the measurement comes in and we can correct: we calculate the optimal Kalman gain, this is the formula, but just ignore it for now. Then we do this optimal averaging of our a priori knowledge and the measurement, we get an update, and we can go back and do the same for the next time step. For instance, if you're looking at Google Maps and you're losing the measurement update because you're losing your GPS signal, you see this circle around your position which starts to increase. This is exactly what happens in the prediction part: if you lose the GPS signal for five, six seconds, the circle starts increasing, and then when measurements come in again, the circle shrinks. This is directly what you see in many, many GPS applications. So for our Kalman filter, in our concrete case, the state equation of the model is just: the next position is the current position plus the velocity times dt, so velocity times the time step, plus some noise. And the next velocity is the current velocity plus some error term. That means I'm not accelerating or decelerating too fast; it should be a smooth process. And if I write this in vector notation, I get this matrix; this is more or less just transcribing the equations into this matrix here.
In our case, the sampling rate of this watch is one second, so dt is just one second. And the measurement equation is also really easy. We are measuring only the GPS position; with GPS we only have the position, and only implicitly do we know something about the velocity. So our state, which includes the position and the velocity vector, gets mapped to a measurement of only the position, plus some error. It's a really simple equation. What we also know in this case is something about the precision of GPS. You can look that up: it's something like 10 to 30 meters, which corresponds to about 10^-4 in longitude and latitude. And I assumed an error of 100 meters for the elevation; elevation from GPS should never be trusted, it's extremely imprecise. This is what I take for the measurement error covariance. So far for the math, and I hope you got a good understanding of the basic idea of Kalman. As a small summary: you have a model equation and a measurement, and you optimally average the two to get a better idea of the current state. A really cool library for this is pykalman. It's a Kalman filter, smoother, and expectation-maximization library. It's simple to use and really powerful, comes with many examples and good documentation. Besides affine or linear transition matrices, you can also define non-linear and non-affine state models. It's licensed under a new BSD license and it's written by Daniel Duckworth. Now we're going to use it to actually resolve the curiosities we found in the data. But before that, of course, we have to do some data wrangling. So again, we start with the pandas DataFrame of our data.
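The state and measurement model just described can be written out with NumPy like this. A sketch: the state ordering (longitude, latitude, elevation, then their velocities) and the concrete numbers in the example state vector are assumptions for illustration.

```python
# The constant-velocity model from the talk, written out as matrices.
# State vector: (lon, lat, ele, v_lon, v_lat, v_ele); dt = 1 s sampling.
import numpy as np

dt = 1.0
# Transition matrix F: position += velocity * dt, velocity stays the same.
F = np.block([
    [np.eye(3), dt * np.eye(3)],
    [np.zeros((3, 3)), np.eye(3)],
])
# Measurement matrix H (3x6): GPS observes only the position part.
H = np.hstack([np.eye(3), np.zeros((3, 3))])
# Measurement error covariance, roughly as in the talk:
# ~1e-4 degrees in lon/lat, ~100 m for the untrustworthy elevation.
R = np.diag([1e-4, 1e-4, 100.0]) ** 2

state = np.array([9.99, 53.55, 5.0, 0.001, 0.002, 0.1])  # invented example
print(F @ state)   # next state: positions advanced by one step of velocity
print(H @ state)   # what the GPS would observe: the position only
```

Multiplying F into the state advances the positions by one velocity step, and H simply throws the velocity components away, which is exactly the "we measure only the position" statement above.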
And we start looking around a bit, and by looking at the tail, I realized the sampling rate is not really one second; it's sometimes less than a second. So the difference between timestamps is not always exactly one second, and it's really important for this discrete Kalman filter that the time interval is uniform. So what I did here was round the timestamps to the next full second. But what about signal loss? Maybe I lost the signal during my bike ride; how can we check that easily? For instance, we can just use the numpy function diff over the time index: if the difference between two entries is not one second, we know there's got to be some kind of signal loss. And here we see we have signal loss three times. We can fix this, and we need to fix this, because, as I said, we need a uniform time interval. We can fix it with the pandas resample functionality, resampling to one second. We get some additional rows whose values are, of course, not available, because that's data from the time of the signal loss. But we need them; the Kalman filter is going to fill in these values later. Now, since pykalman works with numpy arrays, we have to convert. We have our pandas DataFrame with longitude, latitude and elevation, and we take the values, so we get the numpy array from it. But we don't use a normal numpy array; we use a masked array, because pykalman expects the missing measurements to be masked. That's exactly what I'm doing here, so the NaN values are masked. Additionally, I plotted the signal loss points to get an idea of where the signal was lost. Here was a tunnel, for instance, just to double-check.
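A small self-contained sketch of this wrangling, with made-up timestamps and coordinates containing one signal-loss gap:

```python
# Round timestamps to full seconds, detect signal loss, resample to a
# uniform 1 s grid, and mask the gaps the way pykalman expects.
import numpy as np
import pandas as pd

# Irregular timestamps with a hole of several seconds (signal loss).
times = pd.to_datetime([
    "2016-07-20 08:00:00.2", "2016-07-20 08:00:01.1",
    "2016-07-20 08:00:04.0", "2016-07-20 08:00:05.1",
]).round("1s")                              # snap to the nearest full second
df = pd.DataFrame(
    {"lon": [11.350, 11.351, 11.354, 11.355],
     "lat": [46.500, 46.501, 46.504, 46.505],
     "ele": [1000.0, 1001.0, 1004.0, 1005.0]},
    index=times)

# Signal loss: consecutive timestamps more than one second apart.
loss = np.diff(df.index.values) != np.timedelta64(1, "s")
print("signal losses:", loss.sum())

# Make the grid uniform; the gap rows are filled with NaN for now.
df = df.resample("1s").mean()

# pykalman expects missing measurements as a masked array.
measurements = np.ma.masked_invalid(df.values)
```

The NaN rows created by resample become masked entries, so the filter later treats them as "no measurement here" and fills them from the model.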
So now we are basically all set for using the Kalman filter. We have our state transition matrix F, the matrix we've seen before, with dt equal to one second. Here we have our measurement matrix, a 3x6 matrix, and we have our measurement error covariance. And we need to define an initial state mean and an initial state covariance, like an initial condition, for which I just take the first measurement. We give all that to the KalmanFilter from the pykalman package and say that what we don't know is the transition covariance matrix, so how fast I'm, for instance, changing my direction on the bike. This is something we don't really know, but we can estimate it, and that's what I'm saying here: my expectation-maximization variable should be this transition covariance matrix. And then we can just fit it. Fitting is just calling kf.em, so expectation maximization, on the measurements with a number of iterations. I took 1,000 iterations just to be sure it really converged; this takes a very long time, like a few hours. Now that our model is fitted, we can use it to actually smooth our measurements, and we get back a mean estimate of the states I was really in at each point. So now we're going from the measurements to the x, the states I described before. If I plot these states, that's on the right side, and on the left are the measurements, we see that it actually looks much, much smoother. We got rid of all those little bumps you saw, and it's looking fine. And of course we can write the smoothed track back; basically we iterate over all the segment points and fill it back in. Then we call get_uphill_downhill again, and we end up with 2,677 vertical meters. And this is roughly the 2,700 vertical meters that the organizer told us about on the web page. And so, yeah, we are fine.
We used the Kalman filter without having to come up with any clever smoothing tricks. I talked to other people who said, why didn't you just take two or three points and average them somehow, but then you have a lot of variables you need to fit. Here, with the Kalman filter, we only described the physical process, the relation between velocity and position, and it turned out just fine. And of course we did the same for the speed; the speed was a little bit more involved, since one had to be careful with the points where the signal was lost. The problem is that most GPS sensors already have some kind of Kalman filter inside. That means if you lose the GPS signal, for instance you go into a tunnel, and you get your first measurements at the end of the tunnel, the sensor still thinks you're in front of the tunnel, because this measurement is way off from what it expects. Then it starts chasing the real measurements until it realizes, okay, the measurements really are somewhere else, and so you get jumps where you're apparently really, really fast. That's what happened here. So I had to delete a few points and use the Kalman filter to fill in more realistic values. But in the end it turned out really well. According to this, I was riding at about 77 kilometers per hour, which is still fast, but I was going downhill, and I was checking my speedometer and it said something about 70 as the maximum speed, so they're still a little bit off. And what's cool about the Kalman filter: you could even use the data from your bike's speedometer to average the measurements from different sensors and get an even more precise estimate of your state. But that's beyond this talk; it's possible, though. All right, so these are the libraries I used.
So I want to give a short summary of all the libraries we have seen: gpxpy for reading and writing GPX files, as well as imputing missing values like the elevation, which you can do with SRTM.py. We visualized our GPS tracks with mplleaflet, which internally uses Leaflet, a JavaScript mapping library. We talked about the Ramer-Douglas-Peucker algorithm; there actually is an implementation of it, rdp, which you can just pip install, but as I said, be careful with it: it's programmed recursively, and for my tracks it just didn't work. And then there's pykalman for the discrete Kalman filter. I have a few more notebooks that I created during the preparation of this talk, and you can find them here. It's a total of four notebooks where you can play around; I left several parts out of the talk, but there you get the full picture and can just play with it. Of course, credit where credit is due. The RDP example, the nice animation, I took from Marius Karthaus; the waypoint and track graphic from Wikipedia. The logo on the first slide is from GNOME Maps; if you haven't used this app, it's really, really cool, for all the Linux users a really nice OpenStreetMap tool, similar to the Google alternatives. The Kalman prediction-correction graphic is from Bilgin's blog, and if you're interested in reading more about Kalman filters, there's a really good book, Kalman and Bayesian Filters in Python by Roger Labbe; he's also using Jupyter notebooks to show how Kalman and Bayesian filters work. Okay, so you can find this talk under that URL, and thanks for your attention, thanks for listening, and I'm ready for some questions. Thank you. Thank you, Florian. Is there any question? Hi, great talk. I was wondering if standard geo libraries like GDAL or OGR don't supply comparable functionality. Sorry, I didn't get that, what kind of libraries?
Do you know about libraries like GDAL or OGR, which are pretty standard geo libraries? I haven't actually checked them out, but the question is, do they have a good Python interface? They do have a Python interface. I was just wondering if they also provide GPS-specific functionality, but you don't know. So when I started this research, the most important point for me was that I can really easily access everything from Python, and maybe I missed it, but so far I don't think you can use them as easily. But thanks, I'll try them. Hi, thanks. What would you do if you had to deal with GPS data in the FIT format? Garmin exports just the FIT format, and there's a Python fitparse library, but it hasn't been updated in two years. Would you use something like GPSBabel to transform it to GPX, or what would be your approach, if you have experience? So I don't know this format of Garmin; since I quite like using the Polar products, they directly allow you to upload or download everything as GPX. But in this case I would look for a converter; there surely is a converter. Is it a human-readable format, this Garmin format, or is it binary? It's more efficient in memory; when you have really long tracks it's really memory-efficient, so it's not human-readable. Okay, then you might have a hard time, but maybe there are converters for it, I guess. FIT files can be imported into the Garmin website and exported there as GPX. It's a cop-out, but okay, that works. Okay, that's a good point. It's always problematic if you use proprietary formats. But then it should be possible to find software that converts it to GPX, I'm quite sure. So, often GPS data also comes with a value of accuracy; I don't know if that was available in this system.
Would it be possible to also use that data to make the estimate more accurate, if you know the accuracy of the GPS data? Yeah, that should definitely be possible. At each point you have this error covariance matrix, and if you have an estimate of what the error is at that point, you can use that additional information. You could maybe use the error as a new state variable, and then use your measurement of the error to correct it. So I would need to think about this, but it should definitely be possible. Basically, with the help of the Kalman filter, whatever sensors you have, with different error variances, and even if you have a measurement of the error variance, you can merge them together to get a better estimation. Okay, thank you. You're welcome. Any other question? Okay, thank you very much.