Data, searching for traces in electronic subscriber directories, or: what can happen to you when Erdgeist invites you to browse his phone books and his DVD collection. This is about mapping phone numbers to subscribers: what you can experience along the way, how it works, and what you can learn from it. Our speakers are Erdgeist and Andreas Sehne, two Berlin hackers and phone book collectors with obscure passions, a collecting habit, and a tendency to stare at data for a long time and recognize patterns in it. Please welcome them with a warm round of applause. Have fun!
We gave a similar talk years ago, and back then we had about three hours. In the meantime another four years of research have gone into it, so we will have to hurry. As the nice Herald said: time we got to know each other. This is the point where I invite you to come and look at my phone book collection. I want to show you what my collection of telephone directories looks like. This is not just a line; the collection really exists, and this is what it looks like. I have been building it over the past 16 years. That's us. Now a few friendly questions. Who of you knows [unclear]? Who of you uses electronic phone books? Who of you still has a printed phone book at home? Who of you is still listed in the phone book? Who of you was ever listed in the phone book? OK, we will come back to that later. Telephone directories have been around for quite a while. Back in 1881 there were on the order of 600 subscribers, about 100 of them companies that could afford a telephone. Over time, phone books became ever larger and ever more important. In their heyday, in the 80s and early 90s, they were very thick tomes.
If you had a telephone line, you would receive a postcard telling you to pick up your phone books. Later this was liberalized, because the books were seen as a means of advertising. I started telling my friends about this, and they [unclear]. You can imagine what it looks like when you have four meters of telephone directories on the shelf and then try to do that for several years as well. We were very happy when that changed and you could order phone books by post and have them delivered to your home under your neighbors' disbelieving eyes. You have just seen it: that really is the complete collection, kept in one place. The directories were also digitized on CD, and we have a collection of phone book CDs going back to 1992. In the beginning those phone books were sorted by area code and came in a weird HTML format, until I noticed that it was an edition for all of Germany that had landed on my desk. After I had taken it apart, unwilling to accept that something on my computer would not be investigated completely, I started collecting, and I ran into the problem that not all editions were available at the time. My wish for a complete collection only became fulfillable from 2010 onwards. Then eBay came up, but that became more expensive than the originals, with CDs costing up to 50 euros, which is unacceptable, so I was happy to see that libraries still have all the editions. That was the first edition, from 1992, that I got my hands on. There are rumors of one from 1990 as well, but I never saw it personally. If any of you happen to have it, that would be awesome.
But Fefe also wrote about it and told people to donate their phone books to me, and after investing hundreds and hundreds of euros my collection is now fairly complete. But having these DVDs, or earlier the CDs, on your desk doesn't really help you a lot. You can run them in an emulator and click around in the FAT file system underneath. I run this in an old DOS emulator, and if you look at the files you see a lot of small files. In the days of DOS there was no simple way to compress something like this efficiently while keeping it searchable, so throughout the years there have been interesting attempts to reduce the amount of data. You see one of these files here, with the .o1 extension. The header tells you the area code, and below that there is a file format that saves a lot of bits by using a text encoding that only uses seven bits per character. So they cut down on the amount of data by cutting off one bit, which makes reverse engineering even more fun, because I had to debug things on an operating system that is no longer maintained. That's what it used to look like.
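The talk does not spell out the exact bit layout of the .o1 files, so the following is only a minimal sketch of the general idea: storing 7-bit characters back to back so that eight characters fit into seven bytes. The MSB-first bit order and the function names `pack7` and `unpack7` are assumptions made for illustration, not the format used on the CDs.

```python
def pack7(text: str) -> bytes:
    """Pack 7-bit ASCII characters back to back, MSB first (assumed layout)."""
    acc = 0      # bit accumulator
    nbits = 0    # number of valid bits currently held in acc
    out = bytearray()
    for ch in text.encode("ascii"):
        acc = (acc << 7) | (ch & 0x7F)
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
            acc &= (1 << nbits) - 1          # keep only the bits not yet written
    if nbits:
        out.append((acc << (8 - nbits)) & 0xFF)  # pad the last byte with zero bits
    return bytes(out)


def unpack7(data: bytes, count: int) -> str:
    """Read `count` 7-bit characters from a buffer written by pack7."""
    acc = 0
    nbits = 0
    chars = []
    for byte in data:
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7 and len(chars) < count:
            nbits -= 7
            chars.append(chr((acc >> nbits) & 0x7F))
            acc &= (1 << nbits) - 1
    return "".join(chars)


if __name__ == "__main__":
    packed = pack7("ABCDEFGH")
    print(len(packed), unpack7(packed, 8))
```

The character count has to be stored separately (here it is passed in as `count`), because the zero padding in the last byte could otherwise be misread as an extra character.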
I don't know, I never owned a PC that old. That's version 3.0, and also version 3.0 of the compression algorithm; PKWARE used to be the hot shit back in the day, and that is what they used for compression. I could talk about this for hours, but unfortunately we don't have time. There is this dark blue part that tells you how many records there are. Then there are the length bytes of the strings, which have a dynamic length; they are parsed backwards. Before that come the bit fields that tell you which bits are in use. That took a while to reverse engineer as well; unfortunately I don't have time to go into this in more detail. This wasn't entirely chronological, I had these in reverse order: the first thing I ever got my hands on was the 2003 edition, and there they were already using LH, as you can see in the first bytes of the header. Some files were compressed; alongside these there was some plain text left over from development, and then there was a binary block. But at the same offset where the other files carry their LH header, this block showed the same bytes every time, so you only have to XOR them with the known header and you get the key. As you all learned in Crypto 101: if you can see the penguin, your encryption is crap. So that was my first success, and I was very happy. This was in 2003, and immediately afterwards I bought the new CD and, tada, everything had changed: there was a new format. But fortunately there was an edition for Mac OS, which they had probably outsourced. I sat down, ran gdb on it, looked at what happened when a block was read from the CD, analyzed and displayed, and stepped through this PowerPC code until I noticed: hang on, there are symbols in here. I looked at the backtrace, and they had a function called secret XOR encryption. So they read a block from the CD, decompress it with LZW, and then take the blob's first 27 bytes, which are encrypted. Having looked at their secret key, it is a static XOR string, which is
a book title that some nerd probably found on his bookshelf and thought: I'm going to take this. After two days of debugging and learning a lot about PowerPC assembler, I got this key more or less for free alongside the symbols. In most cases this led to me seeing source files, something like 13 or 14 files next to each other, that contained all of the names and all of the street names, and then you could basically join all of them together and you have the entire phone book of the year, or maybe even of the quarter. At one point there was one edition for each quarter. By now, if you have a subscription with DeTeMedien, you get an updated directory to download every two months. Then this DVD came along, for which they partnered with another company and which also includes geodata, which is great: the idea is to drive there instead of calling. So after work you sit there wondering: why don't we know how that works? But lo and behold, the power of followers. I asked: can someone who knows something about geo information tell us how this works, how the street maps to this coordinate? And then, on February 7th, came the answer: it is a Lambert projection, which allowed them to normalize something. The great thing is that this person even made an account just for this, so that was amazing, thank you. There was even a link to a piece of C code that allowed me to convert this projection to WGS84, which helped me a lot because it allowed me to do geo mapping. Once you have decoded this, you get the coordinates of all telephone numbers in Germany, and you know that you were roughly right when decompressing and reverse engineering, because this heat map looks plausible. This is the distribution of phone lines as listed in the phone directories, and roughly that of all phone lines. And it was one big file
that was hard to picture. We wanted to talk about big data, so: there is a big file, you work your way through it, and sometimes you need to know whether you are on the right track. Sometimes you look for something and find nothing, and you think maybe something went wrong. And sometimes you grep for something and find something good. If you grep for Horno, you will find something, because Horno is a place with people in it, and you can run wc -l on the result and see how many people have a telephone line there. Then you do this for the next year and see that there are actually fewer phones. If you check, you find that there really were fewer of them. Why is that? Because of the open-cast mining there: by now the village no longer exists. But sometimes you ask yourself why the phone numbers in the directories are becoming fewer in general. We see that fewer and fewer phones are listed in the phone directory; the peak was in 2001. It is mostly older people who are in the directory, and younger people do not even get themselves listed. So what we are doing here is by now more of a historical exercise. We still have time, so take this fresh DVD
that arrived here: it went from 18.6 million to 17.9 million listed numbers. What is sad is that on current CDs the most prominent records are locksmiths and lock-opening services. That is actually spam: they try to get to the top of the listing, and we find that a lot of spam is polluting our database. This changes from edition to edition, which makes it difficult for me to maintain my database. It also led to a new sorting scheme for the German directories: sorting used to be alphabetical, which put the "AAA" lockpicking service at the beginning; now it is actually at the end. What you see now is the result of 16 years of tidying up: a script that lets you dump a sorted telephone book from each and every CD or DVD. This also works on the Yellow Pages in Germany, which are telephone books specialized on businesses; it is basically the same format. And for our Austrian friends: the DVDs from Herold can also be decompressed with this. OK, to sum up: there are four different binary data formats that engineers came up with, each following the motto of its time. As for crypto, there are two schemes, and I cracked both, so to speak. There are 51 editions, unfortunately not all of them physically in my possession, because some of the older ones exist only in libraries, where you cannot take them anywhere and can only read them on site. And when we look at the last couple of years, there are a lot of changes within the different columns. Now the question: why are we doing this? The easiest answer would be: hackers will hack. If a DVD is lying around somewhere or sits in my PC, then I really want to be able to access the data and also run fun queries on it. There
are people with stranger hobbies. In addition to that, maybe something about area codes and how telephone numbers are structured in Germany. Germany is divided into 5,202 area codes, for example 03307 for where we currently are, in Mildenberg. They have grown historically. Switching used to be physical: there were mechanical devices that actually moved one step for each pulse of a digit. Some of the older ones among you might remember the click, click, click on the telephone. With these mechanical components, the system was optimized so that the area codes that were called often could be dialed very quickly, because the larger your digits were, the longer it took to dial. Bavaria, for example, got 7 or 8, because there weren't that many people living there. And why did it need to be fast? Because each click was basically money: it cost you money from the time you started your call until you ended it, so they wanted to make sure it would go very quickly. Obviously this is more of a retrospective; this hobby is really dying out, but I think it is quite interesting to see. Today's dump was at around 10 am, the last one at 3:30 am, so this is kind of a strange hobby as well. OK, again: what is the area code? This is the canonical format of a phone number in Germany. We have a country code, then an area code, for which there is a map where you can look it up; in this case it is 40 for Hamburg. Then comes the part of the number that is unique to each subscriber. 401801-0 is the CCC Hamburg, and the leading 4018 is a historical marker of the exchange the call was switched through; that is where it was decoded. So it used to be that when you knew that Schütze was in Zehlendorf, you would go through the phone book until [unclear].
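Because German area codes vary in length, a number first has to be split into area code and subscriber part before it can be sorted or mapped. A minimal sketch of that split follows, assuming a lookup table of known area codes is available; the three-entry table, the function name `split_number` and the handling of the +49 and 0049 prefixes are illustrative choices, not taken from the speakers' scripts.

```python
# Tiny illustrative excerpt; the talk mentions 5,202 area codes in total.
AREA_CODES = {
    "030": "Berlin",
    "040": "Hamburg",
    "03307": "Mildenberg",  # the area code quoted in the talk for the venue
}


def split_number(number, area_codes=AREA_CODES):
    """Split a German phone number into (area code, subscriber number).

    Returns None if no known area code matches. Forms such as "+49 (0)40"
    are not handled in this sketch.
    """
    stripped = number.strip()
    digits = "".join(c for c in stripped if c.isdigit())
    # Normalise the international forms +49... and 0049... to national 0...
    if stripped.startswith("+49"):
        digits = "0" + digits[2:]
    elif digits.startswith("0049"):
        digits = "0" + digits[4:]
    # Try the longest candidate prefix first (3 to 6 digits incl. leading 0).
    for length in range(6, 2, -1):
        prefix = digits[:length]
        if len(digits) > length and prefix in area_codes:
            return prefix, digits[length:]
    return None


if __name__ == "__main__":
    print(split_number("+49 40 401801-0"))
    print(split_number("03307 12345"))
```

Trying longer prefixes first keeps codes such as 030 and 03307 apart even though they share leading digits.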
In Germany, in addition, numbers starting with 11 are usually reserved for specific services, like 112 for the fire service, and oftentimes city councils and the like will also have something with 2-1-something. So you have these phone numbers and these number ranges, and you can basically sort by them. OK, what are we doing with this data? Let's just look at it first, a very brief analysis. What do people say in the phone book? What do they have recorded there? These are actually real entries from the phone book. First of all, we see that people list their academic title, like "Diplom-Ingenieur". The telephone book used to be something like StudiVZ, or by now maybe Facebook, except that it was not voluntary. And it matters that it is like an encyclopedia of your city, a who's who: if you are in there, you put yourself in there with your title or your job designation. So even from a historical perspective it is really interesting to look at. And if you have the means, you can look at the addresses with small scripts. I wrote a shell script that did nothing but pick out those entries that only have a surname. The blue dots are all of the addresses, and the pink ones are all the addresses that only have a last name. For those of you who don't recognize it, this is Berlin. Grunewald is the black part, because there are no phone lines there. You can see the suburbs of Berlin; originally I was planning to make a time lapse of Berlin, where you can see how the suburbs grow. But heat map tools allow you to plot other things, too. Here is a short quiz. I studied something empirical, and I always want to check whether our expectations are met, so we painted something. What do you think this is?
These are all those who said they do not want to be found in a reverse search. The flag has to be included in the data so that these people can still be found in the original direction, by name, but not in the reverse direction, by number. We painted this picture because we had asked people to please write that letter if they value their privacy, and for these people it worked. We also painted all the ones who have a PhD. In Berlin they seem to live in Charlottenburg-Wilmersdorf, all the way out in West Berlin, and towards Potsdam. We did the same for Munich, and you can do it for many other things. For example, for the hundredth anniversary of the Weimar constitution, which abolished the aristocracy, we had a look at who still had aristocratic titles. We also checked who had typically Turkish first names. As expected, this was mostly in West Germany and in some cities in the East. It is quite exciting to check whether you can verify or falsify your assumptions based on the data you have. We tried to do this across several years, and as soon as we were able to, we could visualize trends over time. For example, there are some government offices that moved from Munich to Berlin, and we had a look. The expectation would be that these people moved from equally wealthy places in Munich to Berlin Dahlem and Zehlendorf, and that is indeed the case. It is interesting not only to visualize large trends, but also to trace them back to the original persons. But in order to do that, you have to find them across the years, and that is really difficult.
Obviously, even place names aren't a reliable feature, because they can change between years, or they may, for example, get suffixes added to their name, and you need heuristics, which makes matching really difficult. There are corrections: there were errors in data entry in the early years. Using the letter O for the number zero is the easiest example, but there were many others. And this is really difficult, because many things, like house numbers, aren't standardized in Germany. There is a house number zero in Bielefeld, there are negative numbers, there are fractional house numbers, and there are of course A to F as suffixes, so trying to sanitize this house number field is doomed to failure. I built a very rough heuristic that does as good a job as possible. You may also have noticed that the number of the Chaos Computer Club in earlier editions contained a Q. Until I was told what that Q meant, I thought: oh God, I'm going to have to rip out all the capital letters, which is a shame, because they carry information. But then I was told about things such as "D1 fu", which marked the first car telephone numbers. The first generation of car telephones didn't have area codes, so you would have to know roughly in which area code the car was, and then you could try to reach the car in that area code. So geomapping such a number to a region is completely impossible. And then these 51 editions with up to 38 million subscribers each: that's a lot of I/O. Just extracting them from the DVDs takes about half a day until you have the data lying around, and then you have to do something with it. You can sort the records and check whether two people in subsequent years might be the same person. But with two billion records, the index is already too large for my RAM, and SSDs were really expensive when I started playing around with this.
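The speaker's own heuristic is not shown in the talk. As a rough illustration of what such a normalization step might look like, here is a minimal sketch that handles only the cases named above: the letter O typed for zero, house number zero, negative numbers, fractions, and single-letter suffixes. The function name `normalize_house_number`, the accepted input shapes and the returned tuple are assumptions; anything the pattern does not recognize is reported as `None` rather than guessed at.

```python
import re

# number, optional fraction such as "1/2", optional single-letter suffix
_HOUSE_RE = re.compile(r"(-?\d+)(?:\s+(\d+/\d+))?\s*([A-Za-z]?)")


def normalize_house_number(raw):
    """Return (number, fraction, suffix) or None if the field is not understood."""
    cleaned = raw.strip()
    # Undo a common data-entry error: letter O next to a digit means zero.
    cleaned = re.sub(r"(?<=\d)[oO]|[oO](?=\d)", "0", cleaned)
    match = _HOUSE_RE.fullmatch(cleaned)
    if match is None:
        return None  # e.g. ranges like "12-14" are left for manual inspection
    number, fraction, suffix = match.groups()
    return int(number), fraction or "", suffix.upper()


if __name__ == "__main__":
    for sample in ["12a", "1O", "12 1/2", "-3", "0", "12-14"]:
        print(repr(sample), normalize_house_number(sample))
```

Returning `None` for unknown shapes keeps the rough heuristic honest: unmatched fields can be counted and inspected instead of being silently coerced.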
Luckily, the telecom company, which used to be the post office, provides a feature that lets you extract chunks with a roughly similar number of subscribers: the postcode. You can use that, because postcode areas are about equally sized. There are, I think, some 21,000 postcodes in the telephone book, and a chunk never grows larger than about 100 megabytes, and that is the extreme case. So you can sort the records within each chunk, and you get a good starting point for checking whether two subsequent records are the same. The only difficulty arose when I got hold of the 1992 phone book CD, because between 1992 and 1995 a lot happened. There was this event, the reunification, and suddenly the postcode systems and area code systems of two countries had to be thrown together, alongside the phone number systems. During this time they also renamed a lot of streets; in some regions Bulgarian communist leaders had had streets named after them. The postcodes, which went from four to five digits, caused a lot of debate, I'm sure, because there were various systems of assigning them, and there were conflicts. Initially they thought about merging them and only renumbering those that collided; they decided to do something else. That meant, for example, that people in the Berlin suburb of Hellersdorf suddenly found themselves confronted with all this bureaucracy, where they might get a new phone number, a new area code, a new postcode, and even a new street name and house number. This makes it really difficult to follow people through. You cannot get a grip on this; even the best regex cannot solve your problems here, and if each run, each trial and error, takes a really long time, it gets tricky to refine. And the 1992 phone book still contained area codes of the German Democratic Republic.
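The chunking idea described here can be sketched as follows: write each record into a per-postcode bucket file, then sort one bucket at a time, so that only a bucket and never the full data set has to fit into memory. This is a minimal sketch under assumptions; the record fields, the CSV bucket files and the function names `partition_by_postcode` and `candidate_pairs` are hypothetical, and the real data is considerably messier than an exact-match comparison suggests.

```python
import csv
from pathlib import Path


def partition_by_postcode(records, out_dir):
    """Append every record to a bucket file named after its postcode."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for rec in records:
        # Reopening per record is slow but avoids holding thousands of handles.
        with open(out / f"{rec['postcode']}.csv", "a", newline="", encoding="utf-8") as fh:
            csv.writer(fh).writerow(
                [rec["year"], rec["surname"], rec["first_name"], rec["street"], rec["phone"]]
            )


def candidate_pairs(bucket_path):
    """Yield pairs of rows from consecutive years that agree on all other fields."""
    with open(bucket_path, newline="", encoding="utf-8") as fh:
        rows = sorted(csv.reader(fh), key=lambda r: (r[1], r[2], r[3], r[4], int(r[0])))
    for prev, cur in zip(rows, rows[1:]):
        if prev[1:] == cur[1:] and int(cur[0]) == int(prev[0]) + 1:
            yield prev, cur
```

After sorting, matching records from adjacent years sit next to each other, so a single linear pass over each bucket is enough to find candidate pairs.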
There are also phone numbers that start with 37, because those are still East German numbers, and that makes it even more difficult. In some cases they would extend these numbers, and then [unclear]. The street is gone as a key, the postcode is gone as a key: we cannot rely on the street, and we cannot rely on the postcode. And of course we look at things by hand, too. It can be a lot of fun to see what happened in Wernigerode, where Karl-Marx-Straße suddenly became Holzweg, which in German roughly means "the wrong track". Finding records across several years was so important to me that I sat down and wrote a larger and more powerful script that tries to brute-force the next edition and find the respective person. The postcode you see on the left here is already mapped. If we could do this for all of them, that would be really easy, but there is no official way of mapping old postcodes to new postcodes. I could generate one, though: I went hardcore and put postcode, street name, surname and first name into one large list. If I can frequently find the same tuple under an old postcode and a new postcode, then great. I would then take away some of these factors step by step. If at some point I can't even find the street name with the house number, then something is fishy, and it might be, for example, that a street was renamed. So if many people from one house in a given street moved to a different address with a different number and a different street name, that indicates that the street was renamed. In Munich, for example, they renamed F.-J.-Strauß street to Franz-Josef-Strauß street. But most string similarity algorithms aren't designed for looking at two strings in a phone book and saying whether they are similar.
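The voting idea behind the generated postcode mapping can be sketched like this: people who appear with the same name and address in an old and a new edition each cast a vote for the pair (old postcode, new postcode), and the most frequent target wins. This is a minimal sketch, not the speaker's script; the field names, the `min_votes` threshold and the function name `derive_postcode_mapping` are assumptions, and mapping each old postcode to a single new one is a simplification, since an old postcode may in reality have been split across several new ones.

```python
from collections import Counter, defaultdict


def derive_postcode_mapping(old_records, new_records, min_votes=2):
    """Map old postcodes to new ones by majority vote over matched people."""
    # Index the new edition by the identifying tuple, without the postcode.
    new_index = defaultdict(set)
    for rec in new_records:
        key = (rec["surname"], rec["first_name"], rec["street"], rec["house_no"])
        new_index[key].add(rec["postcode"])

    votes = defaultdict(Counter)
    for rec in old_records:
        key = (rec["surname"], rec["first_name"], rec["street"], rec["house_no"])
        candidates = new_index.get(key, ())
        if len(candidates) == 1:  # only unambiguous matches are allowed to vote
            votes[rec["postcode"]][next(iter(candidates))] += 1

    mapping = {}
    for old_pc, counter in votes.items():
        new_pc, count = counter.most_common(1)[0]
        if count >= min_votes:  # ignore mappings backed by too few people
            mapping[old_pc] = new_pc
    return mapping
```

The same counting approach would extend to detecting renamed streets, by voting on (old street, new street) pairs for people whose name and house number stay the same.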
So I had to combine three of these powerful algorithms in order to generate this result. I now have a mapping that I might even publish in a repo. This mapping maps old postcodes to new postcodes, and I was finally able to continue. Continuing means getting this into an SQL form, which is not that easy. In an old phone book, an entry is actually one line, take [unclear] for example, which has a name, then an address, then a phone number, and sometimes there is even a pointer to somewhere else. These lines somehow belong together, and if you want to put them into a database, you need to keep them together, and the parts basically need to become columns. That is quite difficult, because what happens if there are different postcodes within one entry? That can make sorting very hard. So, first of all, I only sorted entries by the postcode of the first line. This led to the schema on the top left, which basically used arrays within columns. But if I want a full-text search, I need to split up these arrays again, because the trigram index in Postgres is about the best thing available, but it is not very nice here, which is a shame. So I looked for different ideas of how to do this. The table at the top right, for example, uses a string index and a trigram index and then does an inner join. But none of this is nice, and it especially isn't nice because I wanted to do all of it within SQL, so that academics and journalists could just do their research there; it would be cool if they didn't always have to come to me and ask me to write them a script. For us I wrote this web interface, where you can see this search. It is basically a Flask application that builds a query. You can see the results up there: in Jena, Beate Schippe, between 1989 and 1991, and she did not object to the reverse search. So this is where
we and our friends can do research now. For us it was a kind of nerd hobby that was quite cheap, because it costs something like 25 to 50 euros per year plus a lot of time; in the end we actually invested almost 5,000 euros in computers. We also looked at what other people are doing with this kind of historical analysis. We found a lot of genealogy work, and we found some religious cults that baptized people based on telephone books, using those from the 1930s. The central library has now started donating telephone books to us, and they have scanned phone books from the 60s onwards. It is just more comfortable to be able to search digitally than to look at books side by side, which of course is still easier than going to a different city to look at its telephone book. And of course there are some ethical issues, which we already discussed within the club at the 11th Chaos Communication Congress, so these are obviously long-standing topics that keep accompanying us. What else we can do, we can't really tell you in this short time frame, but there is a very practical area outside, so we will be meeting outside in front of the LTE mast if you are still interested. That is for any questions, for things we weren't able to go into right now, or for ideas about what we could visualize here or who should do research on this. We do have the problem that data protection is involved here. The option to object to the reverse search was included precisely because of data protection concerns, and we don't want all of those people who want to sell advertising to have this digital record. There are some people I trust, of course; some journalists and some academics have received this data. So if you want this data, please just approach me, but please also tell me why you want it. Thank you all for your attention. Maybe we'll meet outside for a
beer. Thank you very much, Erdgeist and Andreas.