already been on the road for quite a long time. At 31C3 he took on Xerox. At 33C3 he data-mined Spiegel, looked at what Spiegel Online actually is, and gave us a truly perfect data analysis on the topic. And at 36C3 it continues with this train ride. Please welcome, with a huge round of applause, David Kriesel.

I don't think I have ever been introduced this well. I think that was the best introduction I have ever had. Welcome, everyone, and welcome to everyone on the stream. Welcome also to everyone who is not where we are right now. Maybe we can change that during this presentation. I am David Kriesel, a computer scientist from Bonn. In real life, my job is to make large amounts of data understandable and to find the interesting things in them. These days that is often called a data scientist. On the side I have many projects and hobbies, and sometimes I take one of them and turn it into a talk like this one. In the Rhineland, where I come from, we say that twice is a tradition and three times or more is a custom. So now this is becoming a custom, and I am happy to be here again.

Our story begins in 2018. At the end of 2018, Deutsche Bahn said that around 75% of its long-distance trains were on time. For that we first need to know what it means to be on time. Deutsche Bahn calculates it as follows: if a train arrives less than six minutes late at a stop, that stop is considered on time. That is the definition. Then they calculate the percentage of all stops where that is the case, and in 2018 that was about 75%. That did not match my personal experience at all, and it bothered me. In recent years I have travelled across Germany by train very often.
Then I checked my mailbox, and for more than half of my trips I had received an email from the delay alarm at bahn.de. And I thought, I'm probably that guy who has to be careful not to be struck by lightning while winning the lottery. I wanted to explore this in more detail, so I checked the statistics on the Deutsche Bahn website. And I saw that there are hardly any. Deutsche Bahn only publishes the percentage of punctual stops per month for the entire network, separated into long-distance and regional trains. You can't filter anything. You can't sort by train station. All the things I would find interesting are not possible.

So on January 8th, I started stockpiling Deutsche Bahn data. That is the data set we are looking at today. We will not only evaluate it. I will also spend a little time explaining roughly how to approach such a data project and how you can tell whether you can have confidence in the data. And throughout the talk I will give you free practical tips that you can take home and think of the next time you book a train ride. Disclaimer: I did not talk to Deutsche Bahn about these evaluations. Always keep in mind that this is a small hobby project, as always, and it may well be that I made mistakes. So let's take a look, and then you can decide for yourself whether you trust my data or not.

Here we see an ICE connection. I'll give you a few seconds for a first overview and then I'll explain more. Three seconds are enough. The ICE starts in Munich, and every further row is a stop. At some point it arrives on Rügen. A trip is the sequence of all stops that the train makes from start to finish. Here are the delays per stop.
The train left six minutes late according to Deutsche Bahn's measurement. That counts as late, just barely. And then the delay fluctuates. At one point we were six minutes early, so the delay is negative. Usually this only means that the train arrived early and stayed at the station longer, not that it left early. We don't stop at Berlin Airport. Maybe in 20 years I can still make these jokes.

These stops of all trips are the basis of our data. My table has 25 million rows. That's 25 million stops of some train. These are all long-distance stops from January 8th until now, and also local transport, but only at stations served by long-distance trains. It has several columns, the ones I am showing here and a few more. We can look into individual places, look at individual trips, look at exact time periods, and also carry out complex evaluations. That is what we are going to do today.

At the beginning, however, we do some very simple things with the data. This is not to bore you. We have to get to know the data set and get an overview first. So we sort the entire table of stops by train station and determine the number of stops for each station over the whole year. Each bubble is a long-distance train station. We have 350 of them. The size of the bubble reflects the number of stops recorded for the station, for everything that goes there, long-distance and local transport. The biggest by that measure is Cologne Main Station with 380,000 stops. I've labelled the top six stations on the map, exactly six because the sixth station is Hamburg Dammtor, and I didn't want to keep that from you. That's where the Congress used to be. By the way, Leipzig Messe station is significantly smaller: including local transport, we have about 60,000 stops per year.

While we're at it, we can do something new that is interesting from a customer's point of view. So this is the punctuality.
60% is bright red. Roughly 75% is white, the alleged average. So we see that in Germany a lot is blue. In Eastern Germany, everything is blue. These must be the blooming landscapes that Chancellor Kohl always spoke of. In North Rhine-Westphalia almost everything is red. Cologne only has about 66% punctuality. Bonn is really one of the worst spots at 59%, as are other stations I usually pass through. In general, the entire densely populated area in North Rhine-Westphalia is pretty bad. And I told you, I only started this whole project because I thought DB's statistics must be incorrect. But the fact is, I just live in a bad spot. Hamburg up there is also bad. I think that's justice. Why should I be the only one to suffer at around 60%? In fact, it looks more red than it is, because the plotting is transparent. In terms of punctuality, here in Leipzig we're pretty good at 80% or more.

Very important: from now on, all the stops I am showing will be long-distance transport only. In fact, this whole talk is only about long-distance transport, because that is what is talked about most in the media. If I say that I mostly talk about long-distance transport, I have to be fair and also say that local transport almost everywhere has punctuality values over 90%. Please keep that in mind for the rest of the talk. These trains are of good size, and they bring a lot of people to their jobs every day. I hope everyone from DB is listening and has heard this.

So we change the view. Punctuality per train station was interesting for the customer. But if you want to improve something at DB, you want to see which train stations generate delays. That is this view. The worst are big train stations with many stops. They give every train that stops there a little bit of delay. The worst ones are Hamburg, Cologne, Frankfurt Airport and Mannheim.
All of these added more than 50,000 delay minutes to the network. And the worst one is Frankfurt Main Station with more than 93,000 minutes of delay. Who came via Frankfurt? How did you get here? They probably only got here today. I hope you didn't have to hurry too much. However, there are also train stations that work so well that, overall, they take delay out of the network. The best of these were Bremen, Berlin Main Station and Spandau. This surprised me. Out of the blue, proof of this magnitude that something about Berlin actually works.

Let's look at a comparison. We compare how many stops we have from local trains and from long-distance trains. We can see that most of the trains here are regional trains. There are also trains run by companies other than Deutsche Bahn, and we filter those out before we really start. Here you see the regional trains, which are split into these three main types. I will use the abbreviations you see here during the talk. And these comparatively small blue dots are the interesting ones. These go all throughout Germany.

We take the train types that are interesting for us and check which type is late most often, meaning stops that are six minutes late or more. The most punctual are Intercity trains with about 66%. Not even 70% of the ECs are punctual according to Deutsche Bahn's definition. And I can confirm this. The quality of these trains is overall much worse, even from the inside. They look quite old. But they are of course international, because they are EuroCity trains, and they might actually import delay from other countries.

There is another type of disruption that we can measure, but Deutsche Bahn keeps silent about it. Of course, that basically means you are applying to be analysed by me. And that is the percentage of train cancellations.
The ICE is the flagship of the German railway system, and apparently it is cancelled most often by far. EuroCity about 2%, Intercity over 3% and ICE over 5%. So if you book an ICE, then in one of 20 cases it is just not going to arrive. I thought that was pretty tough. So my practical tip for you is: be careful with ICE trains. To be fair, I point out again that this is an evaluation from the outside. So there is a possibility that this is not correct and that other trains replaced the cancelled ones. But in their data, these were explicitly marked as cancelled. And Spiegel also published an analysis recently that came to a similar conclusion. So I assume that this is about correct.

One of the highest delays was from Stuttgart to Hamburg in October 2019. The train had over 400 minutes of delay. That's more than 7.5 hours. And it was not cancelled.

To complete our overview, we now look at this distributed over time. These are all the long-distance journeys I have. We have about 800 journeys a day. Fridays are usually a bit higher, Saturdays a bit lower. What else can you see? For example, you can see that I messed up in between and lost a few days of data. This happens every time. But this time I had built a download monitor, and I thought I was so cool. Then apparently I crashed the server, and it didn't even respond anymore. I had to hard reset it. But I was on vacation and didn't notice. So, technical tip for you: don't only build a download monitor, also monitor it from the outside, so that you still notice when the entire server crashes.

Since the Bahn applied to have its cancellations checked, we will now look at them in a bit more detail. These spikes are storm Eberhard on the 10th of March. In the evening, the storm decided that was enough train traffic. And this is the hottest day of this year's heat wave.
There you can also see that train cancellations are much more frequent in summer. Why is that? Because of the air conditioning inside the trains. Now we look at the cancellations per week by train type, and you can see how huge the ICE air conditioning problem really is, because compared to the other train types, ICEs have an even more strongly increased cancellation rate. When it's warm, every 12th ICE is simply cancelled. In the week of the 22nd of July, more than 10% of all ICE stops were cancelled. I don't know how you feel when you hear this, but for me this goes beyond fault tolerance. So my practical tip for you is: be careful with ICEs in the summer. Now that it is getting colder, it starts again. But we still have to wait until it gets really cold to see whether this is actually true.

We do two little things now, and then we'll talk about how to set up a project like this and the basic rules. First, something obvious. I have sorted the stops by how long the train had already been travelling before the stop. So from left to right, the running time of the train increases. When a train has only travelled a short distance, it is more punctual, and the longer it travels, the less punctual it becomes. Why do I say this? I want to defend the Bahn a little bit here, because in the media you hear that there are really big punctuality problems between large cities, and people point to the fast trains between large cities in Japan. That is not really a fair comparison, because Deutsche Bahn has to share its rails. In Japan, they control everything that runs on those rails. My practical tip for you here is: be careful with trains that have already had a long journey.

Next, I asked myself: after which delay does it not get any better? So for every stop I check how late the train already is.
So from left to right: on the left the less delayed stops, on the right the more delayed stops. Then I checked what percentage actually reduce their delay by 5% while still running and not being cancelled. What you can see here: with a delay of less than 40 minutes it's okay, but after 40 minutes there is a step in the chart. It seems that Deutsche Bahn gives up on these trains, and it doesn't really get better anymore. Why that is, we will talk about later. Practical tip: from a delay of 40 minutes onward, consider another means of transport.

So, that was a wild ride. We have already collected various practical tips, and now I can tell you what you should think of when you do such a project yourself.

First, organize the download well. Deutsche Bahn has some public interfaces, two of them. Somebody else already gave a talk about this here, and I'm happy to see that somebody else can feel the pain that I have felt. You can look up train connections from your mobile phone. The schedule notes when which train should arrive, and the changes state what is changing: cancellations, delays, and so on. This is a bit exhausting, because unfortunately you have to retrieve the two in separate queries. And you can only query them for the past few and the next few hours. That means we can't wait until the end of the year and download all the data. We have to work constantly and pull this data continuously. This is a very common situation, so keep it in mind. We first download all of this and sanitize the data, and the analysis happens later.

We have six and a half thousand train stations in Germany. For each of them, we have to query these two things. Let's say we do this every ten minutes. This results in 6600 times two times 144. That is almost two million calls a day. Such a retrieval is 22 kilobytes on average, a little bit less for the change data.
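The request arithmetic quoted here can be checked in a few lines. This is only a back-of-the-envelope sketch using the figures from the talk (6600 stations, two queries each, one poll every ten minutes, 22 kilobytes per response); the function and constant names are made up for illustration, and the volume is an upper bound because the change data is said to be somewhat smaller.

```python
# Back-of-the-envelope request volume, using the figures quoted in the talk.
STATIONS = 6600            # roughly all train stations in Germany
QUERIES_PER_STATION = 2    # schedule and changes are fetched separately
POLLS_PER_DAY = 24 * 6     # one poll every ten minutes -> 144 per day
AVG_RESPONSE_KB = 22       # average response size (change data is a bit less)


def daily_requests(stations: int, queries: int, polls: int) -> int:
    """Number of HTTP calls per day for a naive polling scheme."""
    return stations * queries * polls


def daily_volume_gb(requests: int, avg_kb: float) -> float:
    """Approximate download volume per day in decimal gigabytes."""
    return requests * avg_kb / 1_000_000


requests_per_day = daily_requests(STATIONS, QUERIES_PER_STATION, POLLS_PER_DAY)
volume_per_day = daily_volume_gb(requests_per_day, AVG_RESPONSE_KB)

print(requests_per_day)          # 1900800, "almost two million calls a day"
print(round(volume_per_day, 1))  # 41.8 GB per day as an upper bound
print(requests_per_day * 365)    # 693792000 requests per year
```

Changing `STATIONS` or `POLLS_PER_DAY` shows quickly how much a smaller station list or a longer polling interval reduces the load.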
So we would end up with 40 gigabytes of XML data per day. That doesn't parse itself anymore, either. For the whole year, that's 14 terabytes in 700 million requests. At this moment, the admins at Deutsche Bahn are probably having a heart attack. And when we are done, they will probably look at the logs, see what requests I made, and send me a huge bill. Sorry, but obviously that won't happen, because I try to minimize traffic. These are points two and three.

Point two: act responsibly. This means you should not generate so much traffic that you kill the target's infrastructure or incur unnecessary costs. This is more realistic than it sounds. Maybe not for Deutsche Bahn, but with justice portals you have to be careful, because they are surprisingly weak. At least, so I've heard. My solution is that I request only once an hour and only the 350 long-distance stations. That means I'm down to about 16,000 requests a day. It's even a bit less when you do it adaptively. The admins no longer get a heart attack, but they're still disappointed, because this is no longer worth sending a bill for.

Point three: fly under the radar. This is supposed to remain a Christmas surprise, and it would be bad if one million calls came from the dkriesel.com server and showed up in the logs. So the solution is to use anonymous proxies and send the requests via hundreds of anonymous IPs. When I download lots of data, it simply disappears in the noise of all the requests that come in from around the world. That's what lots of people do. Nobody sees me, but of course the data is still funnelled to my server, unless I crash it. This is probably the moment when the Bahn admins stop looking through their logs, and I'm glad to have them back in my talk. You don't have to do this kind of thing for all your data projects. It might have been a bit of overkill, because I wanted to try the proxy thing.

It can happen that you're not sure what you're legally allowed to do.
Most of us aren't lawyers, and lots of terms and conditions are very hard to read. So if you're unsure, ask a lawyer. Ask a lawyer to read the terms and conditions. There are portals online where you can ask lawyers questions for not a lot of money, and then you get an answer. The result was that I should ask DB for written permission. That's when I thought the project was in jeopardy, and that would have been a shame, because I had done a lot of work in advance. So start by reading the terms and conditions.

Point five: try it anyway. I took the gamble and asked DB whether I could be allowed to collect data anonymously and then give a small community talk about the subject. And they let me, without any further questions. Whether they really are that open-minded or simply forgot to Google, I don't know. But this might be worth a round of applause for DB, because that was really sporting of them. Not bad, not bad. I hope they're listening.

And point six: be fair in your analysis. If you have data for an entire year, don't pick the four worst months so that you can say mean things about DB. And check whether you can trust your own data. That's not so easy, and I'm going to demonstrate it. Then you can decide whether or not you trust my data. And that's my excuse for going back to looking at the data.

The best way to build trust in your data set is to rebuild a query that the maker of the data would have run themselves. They publish the percentage of on-time stops, and they document exactly how they calculate it. I rebuilt that myself. And would you look at this, it looks almost exactly the same. The main discrepancies are that in January I measure 0.5 percentage points worse than the official results, and in September 0.8 percentage points worse. That's where I'm missing a few days. Other than that, it looks like they actually come off a bit better in my data.
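A minimal sketch of such a rebuilt query, assuming the documented rule that a stop is on time when it is less than six minutes late and that cancelled stops are left out of the score. The record layout, the function names and the sample values are assumptions made for illustration only; they are not the actual table from the talk or real DB figures.

```python
from collections import defaultdict
from datetime import date

ON_TIME_LIMIT_MIN = 6  # a stop is on time if it is less than 6 minutes late


def monthly_punctuality(stops):
    """Share of on-time stops per (year, month), in percent.

    `stops` is an iterable of (day, delay_minutes); delay_minutes is None
    for a cancelled stop, which is skipped as in the official definition.
    """
    on_time = defaultdict(int)
    counted = defaultdict(int)
    for day, delay in stops:
        if delay is None:
            continue
        key = (day.year, day.month)
        counted[key] += 1
        if delay < ON_TIME_LIMIT_MIN:
            on_time[key] += 1
    return {key: 100.0 * on_time[key] / counted[key] for key in counted}


def deviation(own, official):
    """Own result minus published result, in percentage points."""
    return {key: round(own[key] - official[key], 1) for key in own if key in official}


# Made-up sample records, only to show the mechanics.
sample = [
    (date(2019, 1, 8), 0), (date(2019, 1, 8), 5),
    (date(2019, 1, 9), 6), (date(2019, 1, 9), None),
    (date(2019, 2, 1), 2), (date(2019, 2, 1), 12),
]
own = monthly_punctuality(sample)
print(own)
```

`deviation` can then be fed with the published monthly figures; small, stable differences are a sign that the rebuilt query and the collected data match the official method.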
You're never going to get exactly the same results. But it's pretty damn accurate for an external measurement. If your results are this close, you're on the right path. So that was external verification. Now we're going to do an internal verification, and we're going to use the times of day.

All of these dots are long-distance railway stations. It's the 9th of March 2019. I'm going to go through this day hour by hour, and we're going to see how it looks. The dots grow in size with the number of stops, and the colour represents the percentage of trains that were cancelled. We're at midnight here. A few of the trains from the day before are still on the rails. That is going to decrease. Time passes. It's night. And a new day awakens. We're nearing rush hour. It's 8 o'clock. There are some red dots here. It might be due to weather. It's noon. The day is nearing its end. The last hour of the day.

And a new day begins. It's the 10th of March. These are the last trains of the previous day. Everybody's sleeping. 6 o'clock. There's some traffic. We're nearing rush hour again. It's noon on the 10th of March. And we remember that something happened that day. This is where storm Eberhard shows its first effects. And now it has stopped nearly all long-distance traffic in Germany. I had to change my colour ramp, because you rarely have 50% cancellations. And we end this very bad day. It's midnight again. Of course, a disruption of this magnitude has consequences for several days. We're not going to look at those. But we can see that it's not always the Bahn's fault.

So when you check your data like this, be sure to use very good visualizations that cover several dimensions at once. We had location. We had time. We had the magnitude of the disruption. The best tool for spotting patterns is our brain. And there's only one high-bandwidth wire to that.
And that's our eyes.

So we're going to do some more analyses. First, I would like you to switch sides in your mind. Imagine that you're not producing analyses but reading them. When you read analyses from other people, it's important to sniff out what they don't want to talk about. For a company, you can take a close look at the key figures. Deutsche Bahn said that they wanted to have 76.5% on-time stops this year. At the beginning of December, they had to admit that they would be below 75%. They're still slightly above that in my data, but they are missing the goal they set themselves.

Deutsche Bahn keeps silent about cancellations. But imagine you're standing on the platform and the train is simply cancelled. It won't arrive. And you have to decide whether it's on time or not. Who of you would say that it's on time? I see two hands, three out of 5,000 in this room. It's measurable. Who of you would say that it's not on time if it is cancelled? Pretty much all of you. And I pretty much agree with that.

Let's see what Deutsche Bahn says about that. Complete or partial cancellations are not included in these statistics. The same applies to other European rail carriers. This is due to two factors. First, it is difficult to find a reasonable mathematical model. Punctuality is scored in a binary way, a stop is either punctual or not, and of course that doesn't work with cancellations. And secondly, the so-called fulfilment rate of all DB passenger trains running daily is over 99% on average for the year. This applies to both long-distance and local transport over the last few years.

I can't really agree with that, because we saw that long-distance rail has a cancellation rate of 4%, not 1%. But maybe this fulfilment rate is something else that I don't understand. The main point is that a train that is cancelled is not counted as unpunctual. It is simply removed from the scores.
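To make concrete what removing cancelled stops from the score does, here is a minimal sketch that compares the score over stops that actually ran with a score over all planned stops. The four-stop journey is a made-up example, and both function names are hypothetical; the second variant is just one straightforward way to count a cancelled stop as not on time.

```python
ON_TIME_LIMIT_MIN = 6  # a stop is on time if it is less than 6 minutes late


def punctuality_excluding_cancelled(delays):
    """On-time share among stops that actually ran (cancelled ones dropped)."""
    ran = [d for d in delays if d is not None]
    if not ran:
        return None
    return 100.0 * sum(d < ON_TIME_LIMIT_MIN for d in ran) / len(ran)


def punctuality_over_planned(delays):
    """On-time share among all planned stops; a cancelled stop is never on time."""
    if not delays:
        return None
    on_time = sum(d is not None and d < ON_TIME_LIMIT_MIN for d in delays)
    return 100.0 * on_time / len(delays)


# Made-up journey: two stops on time, one late, one cancelled (None).
journey = [0, 3, 14, None]

print(round(punctuality_excluding_cancelled(journey), 1))  # 66.7
print(round(punctuality_over_planned(journey), 1))         # 50.0
```

The gap between the two numbers grows with the cancellation rate, which is why the choice of denominator matters.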
So these cancellations are covered up by the statistics, because apparently they can't be included. Come on, guys. I do this kind of analysis for my job, and I've heard some bad excuses. But this is crass. This is top-tier bullshit. When you hear this kind of thing, you know that you've found it. You have to look here and not elsewhere.

So we're going to be helpful and find a way to get these cancellations into our statistics. We see a train journey with four stops. The white ones are on time. The blue one is not on time. And the one on the right, shown in red, is cancelled. They count the ones that weren't cancelled and measure their punctuality. That would be 66% in this case. I would suggest that we count all stops that were planned and then count those that arrived and were on time. That would be 50% in this case. So this is really groundbreaking maths. And when you're honest about your cancellations, you're not at 76.5% and also not at 75%, but at 72.5%. With each percentage point less, it becomes much more unlikely that people will reach their connecting trains. So don't underestimate the difference that they have credited themselves here.

Now I want to talk about something important: the success criteria in your organization. They steer your decisions. If Deutsche Bahn cancels a train, that is actually better for them in terms of their statistics, because the train is just removed. So the question is: when is it most beneficial for Deutsche Bahn to cancel trains in order to push their punctuality score? You're already clapping. I can't work like this.

The solution is at the end and at the beginning of trips. Trains often just travel the same route back and forth. So this one starts. Everything has gone well. Here it collected a huge amount of delay. That happens. At this point it is to be expected that the last two stops will also be late. So let's just cancel those. We cancel the rest of the trip and turn the train around immediately.
And voilà, the train is punctual again. And the statistic improves, because cancelled stops are just removed. But how could we measure this? Very easily. Here again is a train journey with all its stops. We just create these classes: early stops, middle stops and late stops. Early stops are the first three, late stops are the last three, and all the others are middle stops. If cancellations occurred for technical reasons, one would expect statistically fewer of them at the start of a trip, then more in the middle and even more toward the end. And that's exactly how it is with the EC: the cancellations increase in the last three stops. And for ICEs, this actually fits perfectly. I have asked two independent sources, and they confirmed this behaviour to me. This was also in the press, so I'm not telling any state secrets here. We can call this, after our Minister of Transport, the Scheuer turnaround, or the Pofalla turnaround.

So, another practical tip: caution at the start and at the end of an ICE trip. Try not to book those.

And for the sake of neutrality: of course Deutsche Bahn has an interest in keeping the train network roughly on schedule, and they want to minimize the number of passengers affected by delays. If you can cancel a train and leave a few people stranded but help others get to their destination punctually, then this is actually in Deutsche Bahn's interest. What I'm criticizing here is that only the positive side of this manoeuvre shows up in the statistics, and the negative side is not reflected in them. I wonder how many people at the transport ministry are buying this.

So let's get back to a few practical tips. Be careful with ICEs in general, especially in summer. Be careful late in long trips and when the delay is more than 40 minutes. And also at the beginning and the end of ICE trips.
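The early, middle and late classification described above can be sketched as follows. This assumes each trip is available as an ordered list of flags saying whether a stop was cancelled; the data layout, the function names and the sample trips are illustrative assumptions. For very short trips the classes overlap, and this sketch simply lets "early" win.

```python
from collections import Counter

EDGE_STOPS = 3  # the first three stops are "early", the last three are "late"


def stop_class(index: int, n_stops: int, edge: int = EDGE_STOPS) -> str:
    """Classify a stop by its position within the trip."""
    if index < edge:
        return "early"   # for trips with fewer than 2 * edge stops, early wins
    if index >= n_stops - edge:
        return "late"
    return "middle"


def cancellation_rate_by_class(trips):
    """Cancelled share of planned stops per position class, in percent.

    `trips` is a list of trips; each trip is an ordered list of booleans,
    True meaning the stop was cancelled.
    """
    planned = Counter()
    cancelled = Counter()
    for trip in trips:
        for index, is_cancelled in enumerate(trip):
            cls = stop_class(index, len(trip))
            planned[cls] += 1
            cancelled[cls] += int(is_cancelled)
    return {cls: 100.0 * cancelled[cls] / planned[cls] for cls in planned}


# Made-up trips with eight stops each; the first one loses its last two stops.
trips = [
    [False] * 6 + [True] * 2,
    [False] * 8,
]
rates = cancellation_rate_by_class(trips)
print(rates)
```

Comparing the three rates per train type is then enough to see whether cancellations pile up at the edges of trips.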
I could do lots of standard stuff with you now, but that won't help many of you. So let's do two other things. First, we do our last big thing with the train data, and I hope you will gain something from it for at least a few months. When you buy a ticket, you can choose between a supersaver ticket, which binds you to the trains on your ticket, and a flexible ticket, which allows you to travel on any train between where you are and your destination. The following rule applies to supersaver tickets: if your connection is more than 20 minutes late, your ticket is basically turned into a flexible ticket. So now we look at the stops that are more than 20 minutes late or cancelled completely. That's at least 12.4%. If you catch one of these, your supersaver ticket is magically converted into a flexible ticket.

So it would be very interesting to see whether we can somehow predict this in advance. You cannot predict it completely, but there are some trains where this happens more often than with others, and there are some days of the week where it happens more often than on others. With all this, you can try to create a prediction.

This is an example. Read with me. It means that for Intercity 2221, all stops at Mainz Hauptbahnhof have a 53% probability that your supersaver ticket becomes a flexible ticket. So 53% were either more than 20 minutes late or cancelled completely. And on Friday, this was 50%. I have to keep the notation this short to save space. You are probably already guessing what follows. I searched all combinations of weekday, train station and train where I had at least 10 data points, and for those I measured for what percentage of connections the ticket would become flexible. And I only want those above 50%. The result is almost 500 combinations of weekday, train station and train.
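The search just described can be sketched as a simple group-by. The thresholds (more than 20 minutes late or cancelled, at least 10 data points, share above 50%) come from the talk; the record layout, the names and the sample records are assumptions for illustration, not the real data set.

```python
from collections import defaultdict

FLEX_DELAY_MIN = 20   # more than 20 minutes late lifts the train binding
MIN_SAMPLES = 10      # ignore combinations with fewer data points
MIN_SHARE = 50.0      # keep only combinations above 50 percent


def flex_candidates(stops):
    """Share of stops per (train, station, weekday) that would lift the binding.

    `stops` is an iterable of (train, station, weekday, delay_minutes);
    delay_minutes is None for a cancelled stop.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for train, station, weekday, delay in stops:
        key = (train, station, weekday)
        total[key] += 1
        if delay is None or delay > FLEX_DELAY_MIN:
            hits[key] += 1
    result = {}
    for key, count in total.items():
        share = 100.0 * hits[key] / count
        if count >= MIN_SAMPLES and share > MIN_SHARE:
            result[key] = share
    return result


# Made-up records: delays are invented, only the mechanics matter.
records = (
    [("IC 2221", "Mainz Hbf", "Fri", d) for d in [25, 40, None, 31, 22, 60, 3, 0, 8, 12]]
    + [("ICE 1", "Station A", "Mon", 45)] * 5                  # too few data points
    + [("ICE 2", "Station B", "Tue", d) for d in [30] * 5 + [0] * 5]  # exactly 50%
)
candidates = flex_candidates(records)
print(candidates)  # {('IC 2221', 'Mainz Hbf', 'Fri'): 60.0}
```

The minimum sample count matters here: without it, a train that ran late once on a single recorded day would show up with a 100% share.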
So these are the stops and the trains. I would not buy an expensive flexible ticket if you enter or leave the train at one of these stations on one of the given days. You can have a closer look at this later when you download the slides. This will change over time. Also, make sure you don't rely on it, because it is possible that the train will be on time and you will not get a flexible ticket, especially if you really have to be punctual.

And Deutsche Bahn is improving. For example, a new connection was created between Munich and Berlin, where you can now travel between these cities in roughly four hours, and this is a real alternative to flying. So even with all the criticism I have voiced today, I'm also happy to see improvements. On my way back, I will take the train as well. And I will summarize with: be nice to our Deutsche Bahn, because we only have one of them.

So, one more thing. This is the last talk I will give in this decade. I will now leave you alone for a few seconds and let you think about what the most significant change in civil society was in this decade. For me, it is the rise of the people who feel offended. And by that I mean every political direction. I have learned how important the skills of the natural sciences are, and that means rationality. I'm really sad to see that it is apparently such a good argument nowadays that somebody is offended by something. I would like to find a culture where we don't just criticize everything, but a culture where we look at the data, present it to each other and then sit down together. And who could possibly start this, if not us? Let's not rely on the media for this, because they like to create chaos. And of course, the more controversial you are, the more likes you generate. The shitstorm culture.
Let's not trust the media stars who live off this sort of thing, who try to set a certain tone and just try to survive until the next crisis. So I would like to get you all to focus on these topics and give you some instincts, as little as I can, and hopefully show you that this is not rocket science. I ask you all again: who should do this, if not us? How can we work together to get at least some of those people who have nothing better to do than sit on the Internet to analyze, to become investigators? If we can do that for at least some of them, we will have made a big, big difference.

5000 people in this hall, people sitting next to the seats, next to the stands, who gather between Christmas and New Year's, when other people do nothing but drink alcohol and get their kicks. Why do these 5000 people do something like that? To listen to a statistics talk. That gives me hope. I will go home happy. The trains can do what they want. Thank you all for coming, and I wish you a very nice...

You've been listening to Tom, translating this talk. We would be very grateful for feedback via Twitter using the hashtag C3T. Until next time. Thank you very much for listening.

You are even standing up. Thank you. Thank you so much. Thank you very much. Wow.

Wow, yeah. Thanks from me as well. Great talk, as always. Very funny. Thanks, David, for doing this every time. We have some time for questions and answers. So please come to the microphones in the room if you have any questions, and we're going to start with microphone number one.

Yeah, you started by saying that, to be fair, you applied DB's punctuality criterion, using six minutes. Did you try another criterion that feels a little more sensible? You can argue about where you would set that threshold. Did you? And which ones?

I did, yes.
And when you do that narrowly, I mean, lots of trains end up being delayed by one minute. But what I did is that I took a "seamless" metric. Seamless for me is everything that is at most three minutes late, that is not cancelled, and that actually arrives at the scheduled platform. That was about 60%, but don't quote me on that. I don't remember it exactly, but it was a lot less if you take that into account. Let's ask the signal angel. Standing ovations from here too. Lots of people said that for cancelled trains there will be replacements. That's not in my statistics. So I wasn't sure what to do. If they have a completely new journey that didn't really appear in the schedule, then it's probably not in my data. If they were scheduled in whatever way, then it will be in my data. I know that the colleagues from Spiegel did a similar analysis on a smaller dataset, and they too found cancellation rates beyond 4%. So I'm not entirely certain, but it's somewhere in that ballpark. Microphone number five, please. Thank you for this very interesting talk. It was surely a lot of effort to analyze all this. I'm really scared to ask a critical question, but at the start, on the slides, you said that the train stations add or remove delays to train travel. But isn't it rather the case that the distances between those stations add or remove the delays? And wouldn't it be more interesting to look at the connections between the train stations? That's a brilliant question, because this analysis was a bit tricky for exactly that reason. Maybe the delay isn't due to Frankfurt Main Station, it's due to the rails coming in and out of the station. That's why I measure the difference on the segments before and after the station, and then the station gets the average of that. I do this to heal this effect, so they always get the average.
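Two of the ideas from the answers above can be sketched in code: the "seamless" criterion (at most three minutes late, not cancelled, scheduled platform) and the averaging of delay changes over the segments around a station. This is only a minimal sketch of one possible reading, not the speaker's actual code. The `Stop` class, its field names, and the functions `is_seamless`, `seamless_share` and `station_delay_change` are hypothetical. It also assumes a single delay value per stop, skips cancelled stops when forming segments, and gives the first and last stop of a trip only the one segment they have.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Stop:
    # Hypothetical record for one stop of one trip (field names are assumptions).
    station: str
    delay_min: float                 # delay at this stop in minutes
    cancelled: bool = False
    platform_changed: bool = False   # True if not on the scheduled platform


def is_seamless(stop: Stop, max_delay_min: float = 3.0) -> bool:
    """'Seamless' as described in the answer: at most three minutes late,
    not cancelled, and on the scheduled platform."""
    return (not stop.cancelled
            and not stop.platform_changed
            and stop.delay_min <= max_delay_min)


def seamless_share(stops) -> float:
    """Fraction of stops that count as seamless."""
    stops = list(stops)
    if not stops:
        raise ValueError("need at least one stop")
    return sum(is_seamless(s) for s in stops) / len(stops)


def station_delay_change(trips) -> dict:
    """For each station, average the delay change on the segments directly
    before and after it, then average that value over all trips."""
    per_station = defaultdict(list)
    for trip in trips:
        # Assumption: cancelled stops are dropped before segments are formed.
        stops = [s for s in trip if not s.cancelled]
        # deltas[j] is the delay change on the segment from stop j to stop j + 1.
        deltas = [b.delay_min - a.delay_min for a, b in zip(stops, stops[1:])]
        for i, stop in enumerate(stops):
            # Segment before (i - 1) and after (i); end stops have only one.
            around = deltas[max(i - 1, 0):i + 1]
            if around:
                per_station[stop.station].append(mean(around))
    return {name: mean(values) for name, values in per_station.items()}
```

For a single trip with delays of 0, 4 and 10 minutes at three consecutive stops, the middle station would be assigned the mean of +4 and +6, so +5 minutes. Averaging over both neighbouring segments spreads the blame, which is the effect the answer describes: a station only stands out if the segments on both sides of it tend to add delay.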
So it always depends on the segments around the station, unless it's both of them; in that case it actually would be a railway station problem. But I look at the area around the station. Thanks a lot for that question. I thought a lot about whether I should take the delay delta. If I had caught somebody else doing this, I would have torn that statistic apart, so I didn't want to do it myself. I always have to look around a bit; please excuse me, I can't quite see you back there. Back there, yeah, I found it. You criticized in the beginning, or rather in the middle, that the cancellations do not count towards the delays. But in the beginning you had a slide where the Berlin Airport stop is always cancelled, or stops under renovation that are cancelled regularly. According to my data, they're not in the schedule at all. You have schedule data with stops, and then you have the change set. When something is cancelled, it will have a cancellation time saying when it was cancelled, and then you have to look at short-term cancellations. As far as I know, that would look different in the dataset. But of course I'm reverse engineering. They don't document everything, and there's a lot of reverse engineering to be done here, so take that with a grain of salt as well. I take trains too, and the regional trains are delayed a lot more often than long-distance trains. So my question is: when will you have the analysis for local and regional trains? Where are you from, then?
South of Stuttgart. Right. I didn't scrape the small local and regional railway stations, but what I did capture is all regional trains that stopped at long-distance railway stations. Those are strategically placed, which means that I can see regional traffic. So maybe I'm going to do an analysis of those trains, because I have them, and I might just upload them to my website as a table, and then you can have a look. So, we have some time left. Thanks again for the talk. Coming from Munich, we have a chronically terrible S-Bahn, a regional train. So I wonder: is there a difference between regional trains and metro or local trains? Yes, in my dataset, so I could have a look. This 90% on-time performance in regional traffic is for the Deutsche Bahn and its contractors, but of course I should have them in my data. Maybe I should include them in my analysis. That would be interesting, if I have the time. Don't expect it tomorrow morning. You over there? On slides 80 and 84, we saw how the DB removes the partial cancellations from the statistics. But shouldn't the whole train be taken out of the statistics? The delay accumulates, obviously, and the statistic would be accordingly better. I killed my PowerPoint here. Why would you have to remove the entire train? I don't quite get that. It's nice, granular data based on stops. Imagine that all trains are on time for half of their stops and not on time for the other half; that would mean 50% on-time performance. So that's more granular and finer and better. Are partial cancellations removed too? Is it really the whole stop?
No, I think it's really only those stops where it was cancelled. Thanks again for the talk. My question is about perverse incentives. DB is being measured on the cancellation rate; it might be better if the incentive was better. Another problem with this is how high the goals are, the objectives that the DB is setting for itself. Do you have anything in terms of how fast trains should be going, what the target speed is? I don't have the intervals, but as far as I know, the intervals are quite tight compared to air traffic. This means that air traffic is more often on time, or roughly on time, but the intervals between trains are very, very short. And of course these are interdependent, so if one train stops on the tracks, all the trains behind it have a problem as well, and this leads to the fragility that we've seen. Do you expect to continue this analysis in the next years, and to see something in this direction? I'm not sure. For one thing, I have work and family, like everyone else. This analysis is very complicated. I'm going to have a look now and then, but I can't promise anything. And another little addition: as for the data, would you be able to give those out? I don't think so, because I'm not allowed to. Deutsche Bahn has copyright on these data, and I couldn't really infringe on that copyright much more than by giving you all the data from the DB's timetable API. Just download it. It works. It's not rocket science. Thank you all. Give a round of applause to David.