So, hey everyone, thanks for coming along. I'm going to be talking about a project of mine called intertube that aims to explore London Underground open data and ended up getting rather deep into the weeds.

So, standard "who am I?" slide. Hey, I'm Eater, my pronouns are she/her. I live in London, and I have done all my life. I'm not just incredibly curious about the tube from afar, although that's cool if you are. I like trains, not as much as some, actually, but I do appreciate a good train. My day job involves rewriting Tor in Rust, so absolutely nothing related to what I'm going to talk to you about now, in case you cared.

So, the origin story for intertube starts with one of these. This is one of those next train indicators you see on the platforms while waiting for a train, if you're not familiar. You generally get a list of what trains there are, their destinations and their estimated time to arrival. This is not actually the right kind of next train indicator for the story. This is one for the Northern line, and these generally work with no problems. The experience I was having was more like this. This is Bow Road on the District line, and the boards on the District line were frequently complete garbage, often displaying nothing, sometimes just displaying the words "District line", or sometimes only filling in when the train arrived, and so on and so forth.

So, my experience with poor information quality at some stations led me to look at what APIs were available to get data on the tube. There are two here, the Unified API and TrackerNet, and at first glance it might seem like the Unified API is the better idea. The Unified API uses nice modern new technologies like JSON and Swagger, whereas TrackerNet uses XML. If you go and look at the PDFs for TrackerNet, you have this lovely document with pages and pages of stuff that is still marked as a draft, despite being last revised in 2010.
But, despite this, it turns out the Unified API gets its data from TrackerNet anyway, and then just gives you less of it. So, I ended up using TrackerNet and having to parse lots of XML.

This is what using TrackerNet ends up looking like. You put together a URL that looks like this. This asks for TrackerNet's view of a particular station in detailed mode (there's also a summary mode, but that is not as useful for my purposes), and you give it a three-letter code representing the station you want, along with the tube line you're interested in. These codes are given in tables inside the massive PDF, which you have to copy out yourself. So, you put together this URL, you hit it with your HTTP client of choice, and you get a page of XML. The XML has this massive text disclaimer about how you shouldn't use this to make safety-critical decisions. So, if you work for the Underground, you shouldn't use TrackerNet to tell you whether the line is clear. And you get some metadata and train information. The really interesting parts are these individual little train elements, though, which describe all the trains near the station.

I won't make you read XML. This is the plain English overview of the data you get from TrackerNet: some identifiers, a destination, a time to arrival, an English language description of where the train is, and a track code. The destination and time to arrival are really how you are supposed to use this API. This is just an electronic form of the next train indicator, and that's kind of the point. But, if you look past that, there's this extra bit of information called a track code, and I was like, hmm, what's that?

To explain this, let's imagine a hypothetical tube line with five stations, A to E. A signalling system at the London Underground might represent this by splitting it up into nine distinct sections. So, you might have, you know, the bit of track at station A, and then the bit between A and B, and so on and so forth.
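As a rough illustration of the train elements described above, here is a minimal Python sketch that pulls those fields out of a TrackerNet-style response. The element and attribute names (`T`, `SetNo`, `TripNo`, `SecondsTo`, `TrackCode` and so on), the idea of combining two attributes into a train identifier, and all the sample values are assumptions made up for the hypothetical A-to-E line, so check them against the TrackerNet documentation before relying on them.

```python
import xml.etree.ElementTree as ET

# Made-up response for station B on the toy line. Real responses carry more
# metadata (disclaimer, timestamps) and may use different names.
SAMPLE = """
<ROOT>
  <S Code="B" N="Station B">
    <P N="Platform 1">
      <T SetNo="101" TripNo="1" Destination="E" SecondsTo="60"
         Location="Between A and B" TrackCode="T002" />
      <T SetNo="102" TripNo="4" Destination="E" SecondsTo="120"
         Location="At A" TrackCode="T001" />
    </P>
  </S>
</ROOT>
"""

def parse_trains(xml_text):
    """Return one dict per train element found in the response."""
    root = ET.fromstring(xml_text.strip())
    trains = []
    for t in root.iter("T"):
        trains.append({
            # Assumption: set number plus trip number identifies a train.
            "train_id": f"{t.get('SetNo')}/{t.get('TripNo')}",
            "destination": t.get("Destination"),
            "seconds_to": int(t.get("SecondsTo")),
            "location": t.get("Location"),
            "track_code": t.get("TrackCode"),
        })
    return trains

for train in parse_trains(SAMPLE):
    print(train["train_id"], train["track_code"], train["location"])
```

The departure-board fields (`destination`, `seconds_to`) are what most users want; the rest of the talk only needs `train_id` and `track_code`.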
If you have a very simple signalling system, we might say that each train can be in any one of these sections. You can't have two trains in the same one; that would be known as a crash. TrackerNet gives each section its own little track code to identify that bit of track, and that is what you get through the API, exposed as a track code parameter.

If we put some trains on our line, red, pink and green, for simplicity we're assuming our line is unidirectional and only goes from A to E. In the real world, tube lines run both ways, but, you know. The different colors here actually correspond to different train identifiers from the API. That's just the way I'm choosing to represent that here. And if you ask TrackerNet for the predictions for station B, the bit everyone normally uses is "how far away is this train from the station?" The red train arrives in two minutes, the pink train arrives in one minute, et cetera, et cetera. However, it will also tell us that the red train is in track section T001, and the pink one is in T002. You do get an English description of where the train is as well, so it will return "At A" (T001) and "Between A and B" (T002).

If we ask C for the predictions, we get much the same thing, except the green train can now be seen by C; B couldn't see it because it had already passed B. So we get the same results, plus the green train being at track code T004.

So I got thinking, and I kind of realized this opens up a new possibility for us. Originally, we wouldn't really be able to combine information from multiple stations' departure boards. It's all "it'll be at the station in this amount of time". You could kind of do some clever stuff to mesh it together, but that's, you know, difficult and not very useful. But if we use the track code information, we get the same track codes reported for each train, no matter what station we ask.
And because the trains have unique identifiers that are also the same no matter what station you ask, we can actually go ahead and merge the information we get from multiple scrapes together into one view of what the line looks like. This is pretty useful. All we need to do is make sure we're looking at enough stations' departure boards to cover the whole line, and then we can get a picture of where all the trains are. We can also keep doing this over time. So if we do one set of scrapes at 9 a.m. and zero seconds, and another three seconds later, where you can see things have kind of moved around a bit, then we can note down what track codes we saw all the trains at at various points in time and build up this little history of where each individual train has been.

So to recap the last sort of confusing diagram bit: TrackerNet seems to divide the line into track sections, which are given track codes to identify them. Asking any station for its departure board will return, you know, a departure board, but also each train's current track code. Given that trains share their identifiers and track codes across stations, we can build up a track code history for each individual train.

But that's only useful if we actually know what the codes mean. We need an idea of how they're connected together and what stations in real life they correspond to. If you recall the diagram I presented to you a few minutes ago, we need a way to generate that, something that shows how each track code maps onto something useful. And we don't get this. TrackerNet literally only gives you the stuff I said earlier. Maybe the English language description of where the train is might be useful, but it turns out that multiple track codes can actually have the same description, so we can't just use those. And the descriptions aren't really structured enough to be useful. They do do the nice thing where, if a track code represents a station, they'll usually say "At", the station name, and then maybe the platform number.
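The merging step recapped above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `record_scrape` and the data shapes are hypothetical names, each departure board is reduced to `(train_id, track_code)` pairs, and the timestamps are plain seconds.

```python
def record_scrape(histories, timestamp, boards):
    """Merge one round of departure boards into per-train track code histories.

    histories: dict mapping train_id -> list of (timestamp, track_code)
    boards:    one list of (train_id, track_code) pairs per station asked
    """
    current = {}
    for board in boards:
        for train_id, track_code in board:
            # Several stations may report the same train; they agree on the
            # track code, so keeping the latest report loses nothing.
            current[train_id] = track_code
    for train_id, track_code in current.items():
        history = histories.setdefault(train_id, [])
        # Only note a new entry when the train has actually moved.
        if not history or history[-1][1] != track_code:
            history.append((timestamp, track_code))
    return histories

histories = {}
# 9:00:00, boards for stations B and C on the toy line
record_scrape(histories, 0, [
    [("red", "T001"), ("pink", "T002")],
    [("red", "T001"), ("pink", "T002"), ("green", "T004")],
])
# Three seconds later, red and pink have moved on one section
record_scrape(histories, 3, [
    [("red", "T002"), ("pink", "T003")],
    [("red", "T002"), ("pink", "T003"), ("green", "T004")],
])
print(histories["red"])
```

Keying everything on the train identifier is what makes the merge trivial; it is also why the identifier problems described later in the talk hurt so much.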
And actually, I kind of lied earlier: you can also ask the API for these. But for everything that's not a station, you're kind of screwed. In practical terms, it's as if you have a line from A to C: the information you get lets you kind of correlate, okay, maybe these red track codes are station A, the blue ones are station B and the pink ones are station C, but there's this massive orange stuff in between where you don't really know the structure. So we've got a vague idea of which track codes correspond to which stations, but not really the order. I mean what order the stations are in on the line, but also what order the little track sections between the stations are in.

So to help us model the structure of a tube line, we can use this thing from computer science called a graph. Now, this isn't the sort of x, y line and axis thing that you might be used to, but rather just a bunch of blobs with a bunch of lines connecting them. In our case, the lines have arrows, so it's called a directed graph, but they don't always. We can use graphs to represent the connections between bits of track. If we make a new blob for each piece of track and label it with its track code, we can then draw lines between blobs that are connected to each other. Most of the time, this will result in us drawing things that look like straight lines, which is good, but it should also let us represent train tracks splitting up and joining at junctions, which does happen. A computer together with a bunch of graph algorithms can, in theory, use the graph to answer important questions for us, like: I'm at this track code, which station is this, or which stations am I between? And if I wanted to go to another station from here, what track codes would I cross over?

To build the graph, though, we need to know all the connections between the track codes. So how do we discover this, given we're not really given it? How about collecting a bunch of data and then using the order we get from that?
So to give you an idea of what that means, if we've recorded a history of two coming after one and three coming after two, you could just link one to two and two to three, and so on and so forth. So from this amount of data here, we could generate a graph like this and build the following links. Since two came after one, we link one and two together. And maybe we get some more data, hypothetically, and then we can even discover the original structure, which is, you know, a straight line from one to five.

So I ran something to save logs of each journey on a line for a few months, by scraping TrackerNet as we've discussed before. You know, you take the departure boards at various stations, note it all down and combine it together, and then extract all the track codes from each train in our journey log and link them together based on observation order. In simpler terms, track code links go into our mythical graphomatic and out we get a graph. Remember, the fancy computer science version of the graph, not the x and y axis thing. Hopefully.

So we run it and we get something like this. Now, I don't know about you, but at least to me, this does not look like a tube line at all. Like, I appreciate that very little of what I'm going to present to you is going to look like a tube line in a way that you can empathize with, but this isn't it. Like, this is no good. We can zoom in and you can kind of see the mess up close. It has linked together some of these track codes in a vaguely sensible way. Like, yeah, I guess "Paddington" is after "approaching Paddington", but the structure is horrifically messy. You can't really get much out of it. So yeah, it turns out that the data quality isn't high enough for you to just log data and link it together. We do need to be a bit more clever.

Now, why does this happen? Well, there are lots of issues, but there's one of them that the District line specifically really loves doing.
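The "link them together based on observation order" step above might look something like the following Python sketch. The function name `count_links` and the input shape (train identifier mapped to the ordered list of track codes seen) are assumptions for illustration. It also keeps a count per link, which is not needed for the naive graph but costs nothing to record.

```python
from collections import Counter

def count_links(histories):
    """Turn per-train track code histories into directed links with counts.

    histories: dict mapping train_id -> ordered list of track codes seen
    Returns a Counter mapping (from_code, to_code) -> times observed.
    """
    links = Counter()
    for codes in histories.values():
        # Pair each observation with the one that followed it.
        for earlier, later in zip(codes, codes[1:]):
            if earlier != later:  # a train sitting still is not a link
                links[(earlier, later)] += 1
    return links

links = count_links({
    "red": ["1", "2", "3"],
    "pink": ["2", "3", "4"],
    "green": ["4", "4", "5"],
})
for (a, b), n in sorted(links.items()):
    print(f"{a} -> {b} seen {n} time(s)")
```

With clean data these links would form the straight line from one to five; the messy graph in the talk is what comes out when the observations themselves are unreliable.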
So, we have a train with a given ID here, red, going along its merry journey, you know, doing train things. Dum-de-dum. Everything's normal. Then it suddenly disappears for a few scrapes. And then we mysteriously get another train under a new identifier, conspicuously just where the red train would have been had it continued. I can only assume this is just the same train, and TrackerNet isn't actually good at keeping track of things, despite its name. Trains don't just appear out of thin air?! And, by the way, this probably explains why my experience was so bad with the District line. Like, the underlying signalling system turns out to be very disjointed and patchy. This has now changed, by the way; ask me about that later. But yeah, there is actually some data issue here.

So, this isn't really as simple as I thought. We're going to need slightly better quality data to feed into our mythical graphomatic blob. So, here's one thing we can do. If you read the TrackerNet documentation, everywhere it claims that data is cached in the Microsoft CDN for 30 seconds. You only get a new departure board every 30 seconds. You look at the TfL Open Data guide: definitely 30 seconds. Now, given the previous information I've given you, can you guess how frequently the feed actually updates? Anyone? No, it's every two to three seconds. It looks like that layer of caching has been removed. I have no idea whether this is accidental or deliberate. If it's accidental, TfL, if you're watching this, this is really useful. Please don't remove this functionality, because my entire website kind of depends on it.

But there's a problem here, because I don't think they know. Like, if you ask them, you know, how frequently can I hammer your API with requests? They go, oh, well, it only updates every 30 seconds. Why would you want to hammer it frequently? And I'm like, "so..." In the terms of service, it says you should, you know, limit yourself to 300 requests per minute.
So, I'm like, I'm just going to limit myself to 300 requests per minute and hope no one notices. So, yeah, by scraping more frequently, we can make our track code links a little bit better. We get more links, and the links we get are also more accurate; more on that in a bit. But also, we can keep a count of how many times we saw a given pair of track codes come after one another and feed that information into our graphomatic as well. So, basically, we can exclude links that look like they don't really happen very often and are likely due to some sort of data issue. I won't go into the exact details, mostly for time reasons, and this, you know, wasn't really what I ended up using in the end. But it's a bunch of statistics like quartiles and stuff. And so, with this information, we can make a hopefully improved graphomatic 2.0 that maybe will give us better graphs.

And so, you know, we feed this new information in and we get this. Now, this might not look like it, but this does look like a tube line. If you don't believe me, here is a nice pretty annotated version that you might not be able to see, but on the left-hand side, there's the eastbound leg, and then it comes back up with a westbound leg. You can see it's kind of figured out that you can turn around at Tower Hill, Barking and Upminster. There's Earl's Court, which is a mess, but it's a mess in real life. There's Edgware Road, Ealing Broadway, Richmond, you know. It's surprisingly good if you zoom in. Some parts are perfect. This is a straight line between Acton Town and Chiswick Park, which is how it actually looks. It's kind of cool how, zooming in, you can see it's figured out some stuff on its own. If you look at the tube map, you might not think that trains can reverse at Tower Hill, but they can, and it has managed to figure this out, because you can go from "between Monument and Tower Hill" to "reversing in Platform 2" to "between Tower Hill and Monument".
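To make the "exclude links that don't happen very often" idea concrete, here is a deliberately simple Python sketch. It is not the quartile-based statistics mentioned in the talk, whose details are not given; instead it assumes a much cruder rule, keeping a link only if it was seen at least some fraction of the time of the strongest link leaving the same track code. The name `drop_rare_links` and the `min_ratio` default are hypothetical.

```python
def drop_rare_links(link_counts, min_ratio=0.05):
    """Keep only links seen reasonably often relative to their siblings.

    link_counts: dict mapping (from_code, to_code) -> times observed
    min_ratio:   assumed cut-off, as a fraction of the most frequently
                 observed link leaving the same from_code
    """
    strongest = {}
    for (source, _target), count in link_counts.items():
        strongest[source] = max(strongest.get(source, 0), count)
    return {
        (source, target): count
        for (source, target), count in link_counts.items()
        if count >= min_ratio * strongest[source]
    }

observed = {
    ("T001", "T002"): 400,  # the normal next section
    ("T001", "T003"): 3,    # rare: probably a data glitch
    ("T002", "T003"): 380,
}
print(drop_rare_links(observed))
```

A relative threshold is used rather than an absolute one because quiet branches of a line see far fewer trains than the central section, so "rare" has to be judged against what else leaves the same piece of track.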
That's kind of cool, but it's not all great. This was Temple, Mansion House, etc., etc., before they upgraded the signalling system, mind, and it is a bit of a jumbled mess still. This bit might look fine, but there's a more insidious kind of problem here. Really, there should be a straight line between Tower Hill and Aldgate East, but there isn't. It's pointing in the vaguely right direction, but if you were to use some of that graph algorithm stuff that I discussed earlier, you would probably just go and miss out a whole bunch of track codes, which is not really great.

In summary, our graph is nice to look at, and it has broadly the correct structure, but it's not really the quality we need to make a lot of decisions. It still doesn't faithfully represent the actual order of the track codes. There are extra links about the place, and the extra links mean we can't really use this for routing between two stations.

So if you think about why this happens, it's somewhat of a fundamental problem with the way this works. We are polling the departure boards every few seconds, currently every 11 seconds for complicated reasons, but because we're not getting every single movement, we're always going to catch the train at snapshots of where it is. So if both of these trains move from A to E and stop at every station, we might only get a snapshot like "at A", then "between B and C", then "at D", or "between A and B", then "at C", "at D" and "at E", because we're taking snapshots. So if you try and build a graph out of this information, we get something jumbled like on the right. Even though it's supposed to look like what's on the left, a straight line, you end up with these piecemeal links from only polling every now and then. We don't observe every movement, and so we end up with a jumbled mess of links. So I looked around for things that might help here.
I don't have a degree, so Wikipedia it is, and it turns out this is a problem other people have had. There's this thing you can do called topological sorting, which is basically saying: okay, I have a graph. Let's find a way to list it such that if A is linked to B, then we put A before B in this list that we're going to make. You can see some examples here on the left. These are both valid sorts of the graph on the right, hopefully, and you can see, for example, the blue sets of links, A, C, C, C, C, C, E, B, E, etc, etc. So this is kind of useful, right? And there's an algorithm for doing this called Kahn's algorithm. You can just copy it down and implement it. That's not very interesting.

So this is pretty useful. If we know a specific bit of the graph is supposed to be straight, we can use topological sorting to straighten it out. So, for example, station platforms are generally known for being straight lines. If your station platform splits into two midway through, you might be in for a fun ride. So if we take Tottenham Court Road, platform three on the Northern line, and the set of data I collected for this, well, first of all, we run into a problem immediately, because we have cycles. A cycle means, you know, you can start from A and find a path that goes all the way back to A. So if you've got A linked to B and B linked to A, you can't really sort that, because you can't have both A before B and B before A. So we need to get rid of the cycles somehow.

If we go back to our original graphomatic and the data we used for that, there's something that could come in handy here. We can add the number of times we saw each link to our graph. And then it kind of becomes clear which ones we should break. So, for example, going from TN30061 to the one ending in 63: we saw it go in that direction 3,066 times. And the inverse direction we saw once. So I think it is safe to assume here that the one is probably bollocks.
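For reference, Kahn's algorithm is short enough to write out. The Python sketch below is a textbook version, not the talk's actual code: it repeatedly takes a node with no remaining incoming links, appends it to the output and removes its outgoing links. If some nodes are never freed up, the graph contains a cycle, which is exactly the problem described above, so the function raises an error in that case. The name `kahn_toposort` is just a label chosen here.

```python
from collections import deque

def kahn_toposort(nodes, edges):
    """Return nodes ordered so that every edge (a, b) has a before b.

    Raises ValueError if the graph contains a cycle.
    """
    successors = {n: [] for n in nodes}
    incoming = {n: 0 for n in nodes}
    for a, b in edges:
        successors[a].append(b)
        incoming[b] += 1

    # Start with everything that nothing points at.
    ready = deque(n for n in nodes if incoming[n] == 0)
    order = []
    while ready:
        node = ready.popleft()
        order.append(node)
        for nxt in successors[node]:
            incoming[nxt] -= 1
            if incoming[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(nodes):
        raise ValueError("graph has a cycle, so no topological order exists")
    return order

# A "should be straight" bit of track with one skipped-section link (A -> C).
print(kahn_toposort(["A", "B", "C", "D"],
                    [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]))
```

Note how the extra A to C link, the kind produced by polling too slowly, does not disturb the result: the sort still recovers the straight-line order.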
And in fact, you can basically make an algorithm that does this. And it's like, okay, well, the really small ones, let's just get rid of those and see whether we can sort it then. And yes, you can. So you remove all the links that look like they probably shouldn't exist, and then you sort it and you get a straight line, which is good. So, our topological sorting works. If we know something should be straight, we can straighten it out.

So how do we actually use this to fix our graph? Well, one way we might do this, that I had an idea for, goes something like this. First, we find something to group together bits of track with; we need some way to identify bits that should be straight lines. Then we can turn the bits that should be straight lines into straight lines, and then link together the groups. And what do I mean by link together the groups? Well, if we have a graph containing two separate groups, say, a bit that's a platform and a bit that's immediately after the platform, we know that it shouldn't be possible to teleport from midway through the platform to midway through the bit after the platform, you know, generally, laws of physics. So what we can do is take all the links between those two groups and just collapse them into a single link from the last item of the first group to the first item of the second group. So we can link together groups and topologically sort.

The question remains as to what we should choose to group by. Well, I was like, okay, we could just use the English descriptions for now. They have a whole bunch of issues that I laid out earlier, but, you know, let's give it a go. You end up with a graph that looks like this. This is the Northern line. This is vaguely the right shape, ish, kind of. It's not really that usable. It's very bunched up in the center, for one.
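The "get rid of the really small ones and see whether we can sort it then" step from the start of this passage could be sketched as follows in Python. This is one possible variant, not necessarily what intertube does: whenever the sort fails because of a cycle, it drops the least-observed link on that cycle and tries again. It uses the standard library's `graphlib` (Python 3.9 or later); the function name `straighten` and the third track code in the example are invented for illustration.

```python
from graphlib import CycleError, TopologicalSorter

def straighten(link_counts):
    """Topologically sort track codes, dropping weak links that form cycles.

    link_counts: dict mapping (from_code, to_code) -> times observed
    """
    links = dict(link_counts)
    nodes = {code for pair in links for code in pair}
    while True:
        # graphlib wants each node mapped to the nodes that come before it.
        predecessors = {code: set() for code in nodes}
        for a, b in links:
            predecessors[b].add(a)
        try:
            return list(TopologicalSorter(predecessors).static_order())
        except CycleError as exc:
            cycle = exc.args[1]  # list of nodes, first and last are the same
            on_cycle = [
                (x, y) if (x, y) in links else (y, x)
                for x, y in zip(cycle, cycle[1:])
            ]
            # Assume the least-observed link on the cycle is the bogus one.
            del links[min(on_cycle, key=links.__getitem__)]

counts = {
    ("TN30061", "TN30063"): 3066,  # figures quoted in the talk
    ("TN30063", "TN30061"): 1,
    ("TN30063", "TN30065"): 2990,  # made-up extra link for the example
}
print(straighten(counts))
```

Only links that actually sit on a cycle are ever removed, so a genuinely rare but harmless link is left alone.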
This is the Victoria line, which maybe does an even better job of showing off what the problem is. It's vaguely a circle. I mean, it makes sense, but there are rather too many errors. Everything's a bit bunched up. If we change our visualization software to undo the bunching a bit, so we can actually read the text, and zoom in, we can see the problem. The descriptions are too broad, as I said. Seven Sisters has two platforms going in two different directions, but they're bunched up into one big "At Seven Sisters" blob for both directions, and the same goes for many other things. So, yeah, the vague descriptions are kind of hurting us here. As you could probably have guessed, they're not the best grouping. Quality varies between lines, but we're always going to have this problem with the English language descriptions. So we need to do something different.

The solution is, thankfully, rather simple. You just split each description into its subgraphs first. So, say you have this description called "description". You can kind of visualize it, and you can write some code to do this as well, and be like, oh, there are actually two separate graphs inside this. Then we just split that up into two separate groupings, and I'm going to label them as slash A and slash B, and you'll see that in the visualizations later. So, yeah, we can take the English language descriptions, but split them up first, run the same algorithm, and we get this for the Northern line. This is already a lot better than what we had previously. You can see the graph is mostly straight lines that only connect to each other at junctions, instead of a big bunched-up mess. Let's take a close look at some parts of this to examine the quality again. Some parts of this are beautifully straight, as before. This is mostly good.
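Splitting a description into its separate subgraphs amounts to finding connected components among the track codes that share that description. A minimal Python sketch of that idea, assuming hypothetical names (`split_description`) and made-up track codes, and following the slash A, slash B labelling mentioned above:

```python
def split_description(description, track_codes, links):
    """Split one description's track codes into connected groups.

    track_codes: codes that all carry this description
    links:       iterable of (from_code, to_code) observed links
    Returns a dict mapping group label -> sorted list of track codes.
    """
    # Ignore direction and ignore links that leave the description.
    neighbours = {code: set() for code in track_codes}
    for a, b in links:
        if a in neighbours and b in neighbours:
            neighbours[a].add(b)
            neighbours[b].add(a)

    components, seen = [], set()
    for code in track_codes:
        if code in seen:
            continue
        seen.add(code)
        component, stack = [], [code]
        while stack:  # flood fill outwards from this code
            node = stack.pop()
            component.append(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        components.append(sorted(component))

    if len(components) == 1:
        return {description: components[0]}
    return {
        f"{description}/{chr(ord('A') + i)}": component
        for i, component in enumerate(components)
    }

groups = split_description(
    "At Seven Sisters",
    ["N1", "N2", "S1", "S2"],
    [("N1", "N2"), ("S1", "S2"), ("N2", "X9")],
)
print(groups)
```

Here the two directions never link to each other directly, so they fall out as two groups even though the feed gives them the same text.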
You can see, well, you probably can't see, in fact, but at the bottom we've got "approaching South Wimbledon" slash A, so it's identified that there are two bits of track that are called "approaching South Wimbledon", probably for the two different platforms, and split them up into two different sections, as it should. So, that's good. We also have a lot of intermediate blobs. So, for example, between Tooting Broadway and Colliers Wood, we have "between Tooting Broadway and Colliers Wood" and "approaching Colliers Wood". We don't really need that level of detail. They're just the bits between those two stations. It would be nice to group those together.

Looking at another part of the graph, this is also mostly straight, but there is a small issue. Stockwell and Clapham North are next to each other, and there is no junction there, but somehow it thinks there are multiple ways to get between these two stations. This is not correct.

So, in summary, the strategy produces pretty nice graphs. In general, our strategies seem to be making things incrementally a bit nicer and cleaner, which is good and motivating. There are still problems, though. We have too many blobs now, so we have to combine them, and we also have to deal with the multiple paths problem. And, actually, this "multiple ways to get between things where there shouldn't be" problem is somewhat familiar. This is the same problem we had earlier with the track codes, with a very similar solution.

So, here's a slightly crazy thing I came up with to deal with this. We want to simplify paths between stations. We start by looking at each station in turn and finding all the paths from that station that end in other stations, here in red. We consider each path in turn, so I'm just going to take this red blob and examine it separately.
And looking at each path, and this is just the red bit, we take all the blobs between the start and end station blobs and just smush them into one new blob representing that whole path. Now, you might think, well, what if there are a lot of them, and what if they're linked together? This is where we use the topological sorting. We topologically sort the removed blobs and links and then shove all of their track codes, in that sorted order, into the new one. This wasn't a problem in the previous graph, just to make things simple, but it happens in real-world ones, so it's nice to have that in there. But, basically, after applying this process, we can go from a graph like this, for the paths between A and B and C, to one like this, where the intermediate nodes represent paths between stations in a much more useful and interpretable way.

So, we can add "run the graph through our freaky topo-sorting machine" to our strategy, rerun the graph generator one more time, and we get out something that actually looks pretty great. This is the Northern line again. If we zoom in a bit, we can see that this has actually done a pretty good job of representing the line faithfully. The intermediate blobs have been made by collapsing things nicely. With the green arrows here, you can see we've got into 2835 between Angel and King's Cross and 2842 between Euston and Mornington Crescent. You might think, well, there's that pink blob in the center. That looks complicated and ugly. But, no, actually, this is the Camden Town junction tunnels. This is actually really complicated in real life. So, I think it's pretty cool that it's managed to figure that out in a way that it can actually use to make decisions, without me having to tell it that, and without me necessarily knowing that this was a thing that existed underground.
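The path-collapsing idea can be sketched roughly as follows in Python. This is an interpretation under stated assumptions rather than intertube's implementation: for every pair of stations, it gathers the non-station blobs that lie between them, topologically sorts those blobs using the links among them, and returns the sorted list as the contents of one new blob. It assumes cycles have already been removed, uses `graphlib` from the standard library (Python 3.9 or later), and all names and codes are hypothetical.

```python
from graphlib import TopologicalSorter

def _reachable(start, neighbours, stations):
    """Nodes reachable from start without travelling through a station."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                if nxt not in stations:  # stop expanding at stations
                    stack.append(nxt)
    return seen

def collapse_paths(edges, stations):
    """Map (start_station, end_station) -> ordered blobs on the path between.

    edges:    iterable of (from_blob, to_blob) links
    stations: set of blobs that represent stations
    """
    successors, predecessors = {}, {}
    for a, b in edges:
        successors.setdefault(a, set()).add(b)
        predecessors.setdefault(b, set()).add(a)

    collapsed = {}
    for start in stations:
        forward = _reachable(start, successors, stations)
        for end in forward & stations:
            backward = _reachable(end, predecessors, stations)
            between = (forward & backward) - stations
            # Sort the intermediate blobs using only links among themselves.
            order_deps = {
                blob: predecessors.get(blob, set()) & between
                for blob in between
            }
            collapsed[(start, end)] = list(
                TopologicalSorter(order_deps).static_order()
            )
    return collapsed

edges = [
    ("A", "t1"), ("t1", "t2"), ("t2", "t3"), ("t1", "t3"), ("t3", "B"),
    ("B", "t4"), ("t4", "C"),
]
print(collapse_paths(edges, {"A", "B", "C"}))
```

The stray `t1` to `t3` link, a second apparent route between A and B, disappears into the single collapsed blob, which is the effect described for Stockwell and Clapham North.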
So, yeah, we've managed to go from just making observations about trains, from scraping the API and noting down everything, to having a mostly complete model of the network, with tricky junctions and everything represented pretty well, without having to input any of this manually. I, for one, think it's pretty cool that you can just write some code to figure all this out.

But it's not all roses. The District line is still an ongoing issue. We can do this for all lines except the District and other subsurface lines, but the District is really quite bad. If we try and run our new funky algorithm on that, we get a bit of an ugly mess. This is not very usable. It probably can be fixed eventually. I need to do a bunch more thinking, but it's more work that I haven't gotten around to yet.

Talking of stuff that is more work that I haven't gotten around to yet: this scraping stuff is kind of not great. It would be nice to be able to poll a bit more frequently than every 11 seconds, but I have to do a whole bunch of other stuff to enable that. And there are a whole bunch of other data quality issues, like the train destination that's shown on the website often flapping back and forth, which isn't great. It's a lot of time and effort to track these down and squash them. I have a day job. This is not what I do full-time. I don't necessarily always have the time and effort.

But this has been packaged up into a nice, lovely website, which you can use on your phones, which is actually mobile responsive, and on desktop. You can click on each individual line (if it's not the District line, that is), and you get a nice display of the line and all the trains on it, which is kind of cool, in many different directions. And you can click on each individual train and you get this history of where exactly it has been and how long it has been there for, and it even updates live with WebSockets, which is pretty cool. Some brief notes on what this is actually built in.
The back end and front end are both written in Common Lisp, which is a strange, weird, esoteric programming language that I actually quite like. All the live data gets stored in Redis, and historical data is archived both on disk and, more recently, in ClickHouse. The main goal, since this is something I'm working on in my free time, is to make it easy to hack on without much faff. If you like diagrams, here is a diagram of intertube. We're getting data from TrackerNet. There is a service called intertube scraper that does a lot of the processing and the modelling and stuff, and stores all of its stuff in ClickHouse and Redis, and intertube web can read out of that and display it to users.

There are a bunch of things I still want to do with this. One idea I had was, you know how TfL give "minor delays", or maybe you don't. TfL give updates on the state of the lines, like minor delays, severe delays, et cetera, but this is all manually input. It would be nice if, since I've got a whole bunch of historical data, you could figure out the average running time at a particular time of day, and then compare against that to figure out whether a train is actually delayed. Actually, some other people had this idea, and they beat me to it, and I'm really mad. I'll get around to it at some point.

In general, it would be nice to do a whole bunch more stuff with the historical data that I have, like maybe rewinding the whole web page to an arbitrary point in time. I'd like to build maybe some sort of funky live map visualization of trains moving around the network. I know there already are funky live map visualizations of trains moving around the network, but those are mostly lying, and being like, well, it said it would arrive in two minutes, so it's probably going to arrive in two minutes. I have the actual data, so I could build something more reliable on this. And there's a whole bunch of other stuff. Talking of having actual data, do you want this? I have months of tube data.
Quality cannot be guaranteed, but if you want to run some sort of cool analysis on historical tube journeys, please do get in touch. I'm more than willing to give this to you. I also have a private beta real-time API that is much easier to use, and hopefully much more performant, than TrackerNet, mostly because a friend of mine was like, hey, I really want your data, can you build an API for it? And I was like, yeah, sure. So yeah, get in touch.

But yeah, that is basically it. Thanks for listening. I hope you found that interesting. Feel free to try it out on your phone or whatever, and if you want my data, please feel free to find me afterwards or contact me using the link.