Kriesel is a data scientist, and some of you may remember him from 31C3, where he gave the now famous Xerox scanning bug talk. That talk made him known not only here but internationally as well. With that, I welcome him, I look forward to an exciting talk, and I ask you once more for a very warm round of applause for David.

Thank you. Thank you very much. A warm welcome from me as well, to the people here, to the people online, and to the people from Spiegel whom I know to be in the audience. It's good to be back here. My name is David Kriesel. I'm an IT professional from Bonn, and professionally I work in data science and machine learning. Essentially, that means extracting knowledge from large amounts of data. Since 2014 I have been saving data from about 100,000 Spiegel Online articles, and in those two and a half years I told no one about it.

Public opinion is a tricky thing right now, because these days people are arguing about facts. And we now have a large amount of data on what is perhaps the biggest opinion maker in our nation. There are two things we'll use it for today. First, we'll look at the data and learn something about Spiegel Online, in a way that lets you take it home and use it yourselves. Second, we'll see how data processing works in the age of data retention, in a way that everyone can understand, not only IT professionals. I'll look at it from a social perspective and see how modern data processing can change our society. Whether that change is for the better, I'll leave up to you at the end.
So, first, let's see how Spiegel mining works. Around the clock, every couple of minutes, one of my servers automatically looks at Spiegel Online. If any new articles are available, they are downloaded and stored. This gives my dataset a huge advantage: I get new articles minutes after they've been published, in their original state, without any corrections that may be applied later. That is of course much better than downloading articles that have been online for years and might have been edited hundreds of times.

From these articles I extract certain features. A feature can be the date of publication or the category. I take these features and analyze them, and from the most interesting analyses I write blog articles, which give people insight into the world of Spiegel Online.

We're going to do some very simple analyses first to see how it works. First of all, I want to see how often each category occurs. The size of a circle shows how many articles have been published in that category. Of course it's dominated by politics, panorama (the green circle), and sports (the violet circle at the bottom). These three categories make up half of the articles.

The next very simple feature is the date of publication, which lets us measure how many articles they write per day. This looks very, very messy; it's almost impossible to see any patterns. That is due to the weekends, when of course fewer articles are published than normal.

And here's the first practical lesson. As you can see, I had a gap in my data in March. That is because the German word for March, "März", contains an "ä", an umlaut. Luckily I noticed this after only a few days. So if you gather data, build a warning system that tells you when no data has come in. Luckily I had one, but I had set the timer too long.
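The collection loop and the warning system described above can be sketched roughly as follows. This is a minimal illustration, not the speaker's actual crawler: the function names, the URL strings, and the six-hour threshold are all assumptions, and the network fetch is left out entirely.

```python
from datetime import datetime, timedelta

def new_articles(seen_urls, listing):
    """Return the URLs in the current listing that have not been stored yet,
    and remember them so the next poll does not report them again."""
    fresh = [url for url in listing if url not in seen_urls]
    seen_urls.update(fresh)
    return fresh

def is_stale(last_new_article, now, max_gap=timedelta(hours=6)):
    """True if no new article has arrived for longer than max_gap.
    The threshold is a hypothetical value; too long a timer delays the alarm."""
    return now - last_new_article > max_gap

# Two consecutive polls of a hypothetical front-page listing.
seen = set()
print(new_articles(seen, ["/a/1.html", "/a/2.html"]))
print(new_articles(seen, ["/a/2.html", "/a/3.html"]))

# The alarm stays quiet after two hours but fires after a full day of silence.
last = datetime(2015, 3, 1, 12, 0)
print(is_stale(last, datetime(2015, 3, 1, 14, 0)))
print(is_stale(last, datetime(2015, 3, 2, 12, 0)))
```

The point of the second function is the lesson from the gap in March: the check is trivial, but the threshold has to be short enough that a silent failure is noticed within hours rather than days.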
So it makes more sense to calculate the articles per week instead of the articles per day. That shows you that they publish roughly 100 articles per day. The valleys that you can see here are the Christmas weeks, when fewer articles are published. The dataset for this analysis is from the 4th of December, so you can't see this year's valley yet.

You can look at features individually, but it's even more interesting if you look at several features together. This shows us, for instance, that the output in politics and panorama is constant, but science and academia has decreased quite a lot. And that is not limited to these; there are several other categories. This kind of data is very interesting if you are in competition with Spiegel Online.

So half of what we have to do is just taking features and putting them together cleverly. This shows you the typical length of an article per category. An average article in culture is almost three times as long as an article in sports or panorama, and still twice as long as one in politics. But despite their short average length, politics, sports, and panorama are the categories with the most articles. This means that what they're optimizing for is reach: the categories with the shortest articles are those with the highest output. I'm saying this without trying to judge it. It's simply an observation. Those who hate aren't taken seriously. And most of the things I'm going to talk about probably hold for other media as well.

Another important thing is experimenting with features. This shows the volume of publication per day and per hour. The rows are the weekdays and the columns are the hours of the day. And of course most articles are published during work hours, during the week.
And now you can learn how things are done in data science. First, you always find things confirmed that you expected all along. That's the boring part of data science, but it's good for checking your measurements. We can see, for instance, that few articles are published during the small hours of the day.

Secondly, in data science you always find patterns where you didn't expect them, and this always happens when you combine features. So I'm going to show you articles by weekday and time of day, together with the length of the articles. Red means longer articles than blue. The long articles are most often published at five o'clock on weekdays. The same is true for the weekends; it's just delayed a bit.

And thirdly, data science also feeds the worst kinds of prejudices. Give me a show of hands: who thinks that the people from the culture section like to sleep in a bit? For those on the internet: we have a room of thousands of people, and almost everybody raised their hand. And the answer is yes, they're right. The culture writers tend to publish their articles later. The upper graph shows the articles in all categories except culture, and the bottom one shows just culture. But because they come in so late in the morning, they also go home early.

But so as not to just feed your prejudice: I was invited to Spiegel Online in October, and I told them this as well. And they said, no, no, David, some of these articles are scheduled in advance. I have to mention that. So, a word of caution: if you work on these things, don't stop thinking for yourselves about what you can really conclude from such findings, especially if you went in with some prejudice, like we did just now.

We have seen how evaluations like this work. Now we can go one step further, and of course, on the internet things get really spicy when personal data gets involved.
So I thought, wouldn't it be a nice feature to extract who the authors of those online articles were? That's what we'll do now, and we'll evaluate it in two ways: one that is friendly, and a second that is somewhat politically incorrect.

The first evaluation tries to uncover internal staff structures at Spiegel Online. You don't just know who wrote an article, but also who writes with whom. When authors often write articles together, you can assume that they work together a lot. So you can understand which authors are important for which other authors, and those who do not write together often are not important to each other in this view.

This way we can build a kind of map of authors, and here it is. This is part of the social network of Spiegel Online authors generated in this way. Every author is one of these bubbles; authors who don't occur often have been filtered out. There are clusters of authors that group together, and it looks like these are the teams.

Now we have to check whether this kind of uncovering from the outside is actually true. So we color the authors according to their departments, which we can get from the Spiegel imprint, and in many cases we see that the different departments have formed automatically. In this network, politics is a bit more distributed, so I didn't circle them all; there's panorama, travel. I'm not going to name all the teams, but you'll see how it works.
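The first step of such an author map, counting how often two authors share a byline, can be sketched as follows. The author names and the `min_count` threshold are placeholders, and the layout and clustering of the resulting graph are not shown; this only illustrates the edge weights described above.

```python
from collections import Counter
from itertools import combinations

def coauthor_edges(bylines, min_count=2):
    """Count how often each pair of authors appears on the same article and
    keep only pairs that co-wrote at least min_count articles."""
    pairs = Counter()
    for authors in bylines:
        # Sort so that (A, B) and (B, A) count as the same undirected edge.
        for a, b in combinations(sorted(set(authors)), 2):
            pairs[(a, b)] += 1
    return {pair: n for pair, n in pairs.items() if n >= min_count}

# Hypothetical bylines, one list of authors per article.
bylines = [
    ["Author A", "Author B"],
    ["Author B", "Author A", "Author C"],
    ["Author C", "Author D"],
    ["Author A", "Author B"],
]
print(coauthor_edges(bylines))
```

Dropping rare pairs corresponds to the filtering mentioned in the talk: authors who seldom write together contribute no edge, so only the recurring collaborations shape the clusters.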
The red, scattered bubbles are the bento team, who work across departments; that's the young Spiegel offshoot, if you don't know it. The thing is, we have been able to map quite exactly who is internally in the same team with whom.

Now look at all those gray bubbles. They are gray because they cannot be categorized from the Spiegel imprint. Maybe they've left; for example, the editor-in-chief recently turned gray all of a sudden. Because they sit next to these colored groups, we can still assign them to the same teams. We can say something about them although we don't really know anything about them, and that's quite interesting. So we can find something out about these people.

And now we come to the politically incorrect part. Let me turn your attention to this: every line here is an author, time passes from left to right, and every stroke is an article published by this author at a certain time. If we know one of these authors, then we know who publishes when. We see this line with a regular pattern: a columnist who publishes once every week, apart from certain weeks. With other authors the density is higher. So we also know very well when these people are on holiday, because those are the gaps in the denser lines. And if we know the holidays, we know whose holidays overlap disproportionately with someone else's. Things like Christmas, when almost everyone is on holiday, you can just factor out.

Now I'm appealing to your experience. Assume that you have certain colleagues who always go on holiday together. Joking aside, with such data you can read who is romantically linked with whom, and this is why I've anonymized the authors. To be clear, of course these are not all couples, but they are candidates for couples, and if you're interested in such a thing, you are 99% of the way there. You've all been laughing now. Can I just ask for a show of hands: who of you has taken holidays from their employer to be here? So, about all of you. These data
exist, believe me.

So let's just stop for a moment and ask ourselves what we've just seen and what the social implications are. What we've just seen is gaining knowledge about internal information in very personal areas of life, from data that doesn't look like it would be about this at all. It's just Spiegel articles, and now we've got some clear evidence of who is romantically linked to whom. And here is one of the key messages of my talk: if you publish data, it's not you who decides what you are publishing. It's your adversaries.

And we haven't even looked at the data themselves. We haven't looked at the contents of the articles, just metadata: times and authors. Just as with data retention; that's all just metadata as well. So just take a few months of your metadata, whom you've sent mails and messages to, what websites you've visited. I can then tell you who your best friends are, whether you have an affair, what your sexual orientation is, whether you're pregnant, whether you're sick, what your political orientation is, what your religious beliefs are, whether you have financial problems, and everything I've just forgotten. The abuse potential of this kind of data, and of data retention, cannot be put into words.

I'm not going to start with conspiracy theories. We can all believe that data retention is there to solve crimes and is useful for that; that's quite a plausible thing to think. And you can also believe that the people storing and interpreting these data are all well-intentioned. We can all assume that. But that does not rule out that someone will come into power next month who has very different intentions. So what we're building here is the infrastructure for a general surveillance that even George Orwell's Big Brother would be ashamed of. And this surveillance infrastructure is what we are now expressly setting up, ready for the case that a new government is malicious and wants to use it. That's what is happening right now.

So we've now had a brief detour to
metadata. Now back to Spiegel Online, to raise the mood a bit, with a small aside that you can use the next time you read Spiegel Online. So let's go for something slightly bigger.

When I was reading the authors' names from those articles, at some point I was quite annoyed, because sometimes they are at the top of the article, as you see on the left, and sometimes at the bottom, as you can see on the right. If the authors are at the top, the names are written out, and at the bottom you have short forms. So at the top you have a full phrase, "by Marcel Rosenbach", and at the bottom you just have an abbreviation, sometimes just the last name, or four or five words: the friendly Mr. Philipp Alvares de Souza Soares. I've written it down explicitly; five words just for one name. So data science can be annoying from a technical point of view. Don't say I haven't warned you.

So I said, what the fuck, why are authors given in so many different ways in different places? I took that as a feature, asking whether the authors are mentioned at the top or at the bottom, took some measurements from these two groups of articles, and compared them. The articles with authors at the bottom, without full names, are typically roughly 300 words long or less. You can see the lengths here; towards the right, articles get longer. But if the authors are at the top, an article is typically more than two and a half times as long, about 750 words. So you know which articles you want to be googled with as an author, right?
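Comparing the two groups of articles comes down to a grouped median. The sketch below uses invented word counts, chosen only so that they land near the roughly 300 and 750 words mentioned above, and a hypothetical helper name; it does not reproduce the speaker's measurements.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (byline position, word count).
articles = [
    ("top", 800), ("top", 700), ("top", 750),
    ("bottom", 280), ("bottom", 320), ("bottom", 300),
]

def median_length_by(rows):
    """Group word counts by a feature value and return the median per group."""
    groups = defaultdict(list)
    for key, words in rows:
        groups[key].append(words)
    return {key: median(lengths) for key, lengths in groups.items()}

medians = median_length_by(articles)
print(medians)
print(medians["top"] / medians["bottom"])
```

The median is used here rather than the mean because article lengths are skewed: a few very long pieces would otherwise pull the "typical" length upward.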
And another thing: the long articles have only about a two percent probability of being based on agency material; for the short articles the percentage is higher. So if you want to know which articles the authors wrote themselves, look at where the names are. If you have a short agency news item, the names are at the bottom. We've already seen that the beginning of the day is the time when most long articles are published, and these are in fact the ones that are written by the authors themselves; their share is comparatively high in the morning.

So we can now take a step back and ask what we've done. We have this huge amount of articles, and we've just cut them apart in very simple ways. We've divided them by weekday, time of day, and category, and even with these simple ideas we've had some very interesting results already. But what we haven't done at all yet is look at the contents. Wouldn't it be totally cool to divide the articles into the actual topics they are about?

Spiegel Online gives us good help here, because they have keywords. Every article is given about 10 keywords by its authors. The article on the left has the keywords politics, abroad, Saudi Arabia, King Salman of Saudi Arabia. So I took those keywords from all the articles; I have found about 65,000 keywords. Now let's look at how often different keywords appear in the same articles. Keywords that almost always appear together are, in a way, married to each other; you can regard them as one and the same. On the other hand, there are keywords that have their own existence and are never, or almost never, in the same article, so they are not related. And then there is a certain middle way. An example: "politics" and "Angela Merkel". An Angela Merkel article quite often has the politics keyword, but the other way round it is not quite the same; there are many more articles on politics without Angela Merkel as a keyword. So these keywords are not the same, but clearly they are linked. So we measure, for all 65,000
keywords, pairwise, how related they are. Then we link those that are strongly related with very strong springs, in a physical sense, which pull those keywords together. Less related keywords get weaker springs. Now we run a physics simulation and see how these thousands upon thousands of springs adjust each other and themselves, and you can see that some keywords are brought together strongly, others not so strongly. So we have a topic map of all the things that Spiegel Online reported on in the last two and a half years. It looks like nothing is happening, but this is where the detailed work is going on; you just can't see it from such a distance.

So let's zoom in very closely to really learn what we have created here. This is the Dieselgate affair, the Volkswagen affair. You see the keywords have different sizes; the size of a keyword reflects the number of articles that carry it. The color shows the primary category of these articles, so this kind of yellow is economy, which fits. The nice thing is that this kind of depiction is very powerful. You can gain a lot more insights from this kind of image than just what's related. You can show all kinds of measurements with colors, so you get these colored keyword landscapes, and you can see whether topics and measurements are related. That's what we will do today, but let's look a bit further first.

Various airline accidents: this is between panorama, green, and politics, red, and the political elements come from the shooting down of an airplane over Ukraine. Now this is the Greek crisis, which clearly has been politics, red, and economy, yellow, and Wolfgang Schäuble has been laid out in the same place. He is in gray not due to his age, but because his keyword has no dominant category. And now something more recent: this is the US presidential election 2016. We see Hillary Clinton and Donald Trump and
everything that gathers around them, and of course it's all red for politics. And you can see the keyword "emails" being added to that.

From there we'll now try to understand the whole size of the landscape. Did you see the microscope talk today, where people were zooming in? We'll now zoom out and understand how huge this whole map is. We're zooming out now; you can still see the old frame, and you see how the US presidential election is embedded into the larger landscape. You see the Syrian civil war, the Islamic State, Islamist terror, and then France just next to it. Yeah, the mathematics has no mercy. At the top you have the recent Turkey topics, the recent coup attempt and the dictatorship; to the right you have Russia and the Ukraine conflict; and at the lower left is Israel and the conflict there.

Now let's zoom out again. This is the whole political landscape, this time with two rectangles marking where we started: the presidential election, then the foreign politics section. That's in the upper right; in the lower left you have domestic politics. And recently you have this huge cluster in the middle. That's the refugee topic, a huge cluster that has developed right between domestic and foreign politics, which fits, of course.

Now zoom out again. Now you can't read anything anymore, just differently colored landscapes, so just for some broad orientation: this is where we're coming from. The red thing is the politics part. Then in poisonous green you have panorama, divided by economy. This turquoise cluster chain is Netzwelt, the net world, and then you have culture, and so on and so on. We can't go through them all, but you see these areas have overlaps and are linked with each other.

One more step, and this is the whole thing. We've seen the lower part, and this is where we actually started. Somewhat distant from the rest of the world you have science; I see you can relate to that if you've worked in that field. And very far away from the main
continent you have sports. Now we see how large the whole thing is and how broad the Spiegel topics are. You can see all of that on my website, where you can do your own research on it, in a Google Maps kind of way, which is much more fun than me doing it for you.

So now we're going to apply this. Under many articles, Spiegel Online offers... well, the laughter starts before I've even mentioned it; you're not even sure what I'm going to say. They offer you the chance to state your own opinion, and under some articles they block this opportunity. That's what we'll look at now. I did say at the start that articles are retrieved just a few minutes after publication. So if I found that an article could not be commented on, that was the case right from the start; no one would have commented that quickly.

Let's very briefly look at how things developed over time. You see the ratio of articles: in red, those that cannot be commented on, and in blue, those that can. When I started downloading, about 80% of articles could be commented on. And exactly since the big refugee topic came up and was reported on, the share of commentable articles has been declining, and for a short time now it has actually been the majority of articles that cannot be commented on; the red line is overtaking the blue one. That's not just in the politics category, it's across the whole offering. Whether hate on the net has become so much worse across the internet, or whether Spiegel Online is simply too afraid of these mean comments, I can't read from the numbers.

The interesting thing is the small green plot there. Again, these are non-commentable articles, but ones that carry a small excuse or apology at the bottom, along the lines of "because of netiquette we've blocked this". You don't have to read it through. This apology used to appear on refugee articles at the beginning, and it seems like Spiegel Online itself wasn't quite happy with the rise of comment blocking, but as you see, this hint has been
taken out a lot now, even though commenting has been blocked more and more.

And now back to the landscape. The more red a keyword is, the fewer of its articles can be commented on, and the more blue it is, the more of its articles can be commented on. The gray keywords are in between, where about 70% of articles can be commented on. Of course, this is a continuous scale. I'm going to publish this map on my website as well, where you'll be able to click around in it.

We're going to start with some simple things. You guessed it: sports is almost always commentable; it's very blue. If you wonder about the red dot in there, that's a specific article format which technically cannot be commented on. Another topic that usually allows you to comment is technology and economy. These are strikes at the German railway, and speaking of strikes, most of you are probably thinking of Lufthansa, whose main business seems to be strikes: very blue. You may laugh; I arrived here by plane.

After all these blue topics, let's look at something red. Deep red is the area around justice, which means reports on crime, murders, attacks; they prefer to have fewer people comment on these. The commentability on these is around 80%. This is the NSU topic, neo-Nazi activity, and it is the same for all neo-Nazi topics; the redness of these corresponds to 18% commentability. Also deep red is refugees, and not just news articles but also columns.

So from the outside it looks like Spiegel blocks comments systematically depending on the topic, and it's very powerful that we can see this systematically. It's important not simply to analyze; it's also important to visualize information, which allows people who are not IT professionals to find patterns. There's only one direct connection to the brain, and that's your eyes.

Things get really interesting when you look at how Spiegel Online sorts commentability by nation. This is the Middle East conflict in Israel, and they allow virtually no comments on any of these
articles. Let's move to the conflict in Ukraine, and suddenly... So as a take-home message, ladies and gentlemen: it's fine to bash the Russians. What we did here is simply visualizing and measuring our filter bubble. Iran is okay, you can comment on that. Great Britain, yep. Turkey, yeah, they're not quite sure about that yet.

And France is interesting. This part of the map would like to be blue, but all the keywords around the terrorist attacks are deep red, and that extends to their neighbors as well. Let's look at that in more detail. All of these are articles on France, but over time. The blue line is the number of articles that could be commented on, and red is what could not be commented on. We can see 2014 and 2015, and then we had the series of terrorist attacks in November 2015. So obviously we get a large peak in articles published on France, and most of these could not be commented on. So you're allowed to comment on France, but not on the attacks. And the interesting thing is that this has persisted: ever since the attacks, fewer France-related articles could be commented on.

Let's take another step back. I can understand that Spiegel Online may block articles based on their past experiences, and of course they have every right to decide whom they give a platform and whom they don't. But we also have the right to make this visible. And I think it looks like Spiegel Online prevents comments on articles where they suspect that readers' opinions might not be politically opportune. Whether this says something about Spiegel Online or about our society at large, I will leave up to you to decide.

So my talk up to here could be divided into two parts. Firstly, we just divided our collection into a few buckets, and afterwards we divided it into way more buckets, where each article could even appear in several buckets at once. This was far more complex but also far
more powerful. So remember these two ways of bucketing articles, because now we're going to do something political. We're going to look at campaigns.

Political campaigns work very similarly. They do voter targeting, where they divide their voters into several categories. For instance, you could divide them by gender, skin color, age, income. So you could take all black women in California of a certain age and send them targeted advertising. This is a very rough way of targeting, and it works in analogy to the bucketing we did in the left part of the previous slide. But what would be the right part of the slide?

A few weeks ago, this article from the Swiss Tages-Anzeiger went viral. I'm sure many of you have seen it. I was sent it dozens of times every day, partly because, of course, I do research on this topic. It claimed that a data analysis company had succeeded in doing a very fine-grained analysis of potential voters, which would be the analogy to our map. It went on to say that they did it for the US presidential election as well as the Brexit debate, and it claimed that this is the reason Trump was elected and Brexit went through. Of course this is spooky, and it sells well: oh dear, the same company behind Trump, which is where your tinfoil hats start glowing.

They claim that their partitioning of the voters is so fine-grained that you can send perfectly targeted advertising to each voter, and that they can go even further: they can target the tone of their advertising so that it hits very precisely. Of course it's not clear whether they actually succeeded, because most of the information comes from the company itself. I think they sent a very good salesman to the press, who gave a very good talk, and the press just bought it. All he really said is: I can give you advertising that's very well targeted to your target audience. In other words, we can finally send Viagra spam to only those people who actually need it, but we can't force
them to vote or to buy something; they still have to do that on their own. Big data can't do that. So if you're afraid of this, what you should really question is your own judgement. I'm sure some of you had the same train of thought and felt relieved at this. I didn't expect any applause for that.

But the problem is that very few people question their own judgement. In fact, most people vote for the person who shouts something that fits emotionally just a few days before the election. And of course this is what politics wants; I mean, where would we end up if politics rewarded long-term success? This emotional targeting works very efficiently with highly personalized advertising, and this means that data science technology can influence elections.

I was speaking of data retention earlier, and I'm at the CCC, so I think that most people here share my opinion. It makes sense to be critical of data retention by the state. We have to be, in fact. But if we are completely uncritical ourselves and throw all of this indiscriminately at Facebook, then we haven't gained a single thing.

My talk is coming to an end. There are two more things I'd like to leave you with: I'll close with a surprise and a plea to you. Firstly, the surprise. Did I say I downloaded about a hundred thousand articles from Spiegel Online? I meant more than 700,000. I download all the articles not just when they first appear, but at increasing intervals afterwards as well. So we can measure what has changed in an article. To keep this short, I'm not going to give you a huge analysis, not just because of the time in this talk, but also because I didn't have the time myself. But I have a short demo. I track whether or not titles have been changed, and you find some interesting things here. There's the title, the headline itself. There's also the HTML title tag.
It appears at the top of the browser window, and of course I track those as well. This article is from the 20th of January, and on the 21st of January, one day after it was published, I was notified that the title tag had changed and now said that SAP grows more slowly in 2014 than planned. So I wondered: what was it titled before? When the article was first published, it was not SAP that was growing; it was apparently the CEO of SAP who was growing. I like these little quirks, because they show that there are still humans working on the articles, not just computers. And today it's called "SAP is unable to reach their growth and gain targets". So this shows you how powerful the dataset actually is. I have a history of each article, and this gives me even more powerful opportunities.

That was my surprise, and here's my plea to you. You've seen all manner of things now. We've divided articles in simple and in complex ways. We've seen that different ways of visualizing things have different power. And we can track lots and lots of features from all of these articles, even more complicated ones. For instance, you could track the links in each article and see whether certain authors have their little friends to whom they like to link. There are no limits to your imagination. And of course, as I just showed you, we can also track what has been changed in articles. For instance, we can see where there was the biggest uproar, where the comments were closed after a certain amount of time.

So if you have any ideas for how I could analyze this dataset, please send them to me. And there's a message I would like to send to you if you work with data: raw data is awesome, or even sexy. So keep all the raw data if you can at all afford to store it. I have more than 60 gigabytes of pure HTML, and it's not a problem at all to add features later on. So please don't limit your imagination. Imagine new features, imagine new analyses. Send them to me.
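Tracking title changes across repeated downloads can be sketched as a small version history per URL. The function name, the URL, and the placeholder titles below are assumptions for illustration; only the two dates echo the example above.

```python
def record_title(history, url, seen_at, title):
    """Append (seen_at, title) to the URL's history only if the title differs
    from the last recorded one. Returns True when this is a change to an
    already known article, False on first sighting or when nothing changed."""
    versions = history.setdefault(url, [])
    if not versions or versions[-1][1] != title:
        versions.append((seen_at, title))
        return len(versions) > 1
    return False

# Hypothetical re-downloads of one article at increasing intervals.
history = {}
url = "/example/article.html"
print(record_title(history, url, "2015-01-20", "Title v1"))  # first sighting
print(record_title(history, url, "2015-01-20", "Title v1"))  # unchanged
print(record_title(history, url, "2015-01-21", "Title v2"))  # changed
print(history[url])
```

Storing only the versions that differ keeps the history small, while the raw HTML of every download can still be kept separately, in line with the advice to hold on to raw data.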
Not everything that you send me may be possible. I have a job as well, and I am going to have lots to do at the start of the new year. But I really will try to do these. So, thank you so much. All that's left for me to say is thanks for spending this hour with me. Here are my links. See you.
Well, we can't release you so early, because we have our Q&A, of course. First of all, thanks a lot. It was great to see how exciting mathematics can be when analysing such data. And, as always, if you have questions, go to the microphones. And to all those who will probably leave quickly for the yearly review: no, actually, the speaker for that is still here in the audience, so it won't be starting that quickly. But whenever you are leaving, and why ever you are leaving... so, where are we with questions? Microphone 3.
Hi, fantastic talk, really great, I thought. What I'd be interested in: did you ever see whether they change articles depending on how many people click on them?
Well, if I understand correctly, they are testing. Do they do split tests and find out how many people click on the article? I think they are doing that right now. Correct me if I'm wrong, but I think they are trying it out right now. So, what is split testing, just for the audience? They publish articles with different titles and see which one people click on most, and that is the title that survives. So you change Spiegel Online directly by visiting it.
Microphone 1.
Okay, I wanted to ask whether you archive Spiegel Plus articles too, whether you include them. Do you have a Plus account?
Yes, I am including them. Of course I have a Plus account, but they are also decrypted automatically. I was really annoyed when they started appearing, because I couldn't decrypt them initially, and that's why you can find instructions for Spiegel Plus.
Okay.
There is a positive thing about them: the Spiegel Plus articles are longer in the median, so you get something for your money as well. 1.001 words.
1.001 tales.
Did you look at the contents in your analysis as well? Word frequencies, perhaps, linked to categories or keywords, to find out whether the keywords are complete or correct?
No, I didn't. You can take the keywords; that was quite quick and easy. But I didn't look at the relevant words within the article. Those would of course be the nicer keywords, but I haven't done that yet.
I would like to know which software you used to collect the data, to analyse and to visualise it, and whether the data is available anywhere.
No, they're not available anywhere, partly because I'm not sure if I'm allowed to distribute them. The software for downloading I wrote myself; it runs on one of my servers. Beyond that I use pandas for my analysis, which is based on Python, and then the whole PyData stack. Just Google it, you can find a lot on it. For visualising I use Tableau, which is visualisation software that can use pre-aggregated data up to a few gigabytes, and you can make nice graphs with it. And you can use Gephi.
Did you analyse the data in real time, or did you do that in retrospect? Do you analyse the data while you collect them?
No, I simply collect the raw data, and in the next step I parse the raw features from it. Those are so few that I can actually keep them in RAM and create higher-level features. So it happens in three steps. I don't do it initially, but ever since I started giving my talk it has downloaded another ten times, so it does happen in parallel.
One idea for the evaluation: you could look at whether certain word groups appear in older articles, to see whether they've been copied and pasted together.
Do you mean an analysis like: per article, you get 73% new content? Yes, good point. I'm going to do that.
Hello.
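The "percentage of new content per article" idea from this exchange could be sketched roughly as follows in Python. This is only one possible reading of the suggestion, not something the speaker describes having built: it compares word groups (word n-grams, often called shingles) of an article against those of older articles. The function names and the choice of five-word groups are assumptions for illustration.

```python
import re


def shingles(text, n=5):
    """Return the set of word n-grams (as tuples) in `text`, lowercased."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def new_content_fraction(article, older_articles, n=5):
    """Fraction of the article's word n-grams found in none of the older articles.

    Returns None if the article has fewer than n words.
    """
    own = shingles(article, n)
    if not own:
        return None
    seen = set()
    for old in older_articles:
        seen |= shingles(old, n)
    return len(own - seen) / len(own)


if __name__ == "__main__":
    # Tiny invented example with three-word groups.
    older = ["x a b c d y"]
    print(new_content_fraction("a b c d e f", older, n=3))  # prints 0.5
```

For a corpus of around 100,000 articles, building the set of all older word groups once and updating it article by article in publication order would avoid recomputing it for every comparison.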
I just wanted to give a hint, but I'll put it as a question: could it be that the non-commentability of Israel articles is just a resource problem, that there would be more to censor on legal grounds? For example, there might be peculiarities in German criminal law, and of course there are, that you can't say certain things. So that could be...
Yes, had it been only Israel, I would have thought that immediately. But no... yes, it could be, of course. This is a very important part of data science. I did that in a slightly sarcastic way, but of course it's up to you to draw your own conclusions from the data. Yes, that could be. I mean, the only people who really know are the people from Spiegel. But Israel wasn't the only category that wasn't commentable, and there is nothing peculiar about simple justice topics.
Hello David, thanks a lot for your talk. Did you consider offering the software as open source, so that other sources could be used, such as Tagesschau, the German TV news?
No, I didn't, but to be honest it's not all that difficult. You just write a script that runs every couple of minutes, downloads all the articles and saves them in a database. Done. The source code behind this is really uninteresting; there are thousands of people who did it before. But yes, of course you could do a comparison with other media.
How did you remove the strain? You broke it all down into two dimensions.
The what?
The tension, because you have so many dimensions and you project them into two. So how did you make sure that things are not put on top of each other when they are not really close to each other, or that the links aren't shown?
In theory you can never exclude that altogether, but I put a lot of care into the graph. I kept only the most important edges, because otherwise you get far too many. And then there are professional graph layouts, for instance the one that Gephi uses; you saw it earlier. Of course you have to invest a bit into filtering the edges. When you've done that, you're still not all the way there, but it's most of the way.
You said that in October you visited Spiegel. Was there a reaction to your analysis?
It was positive. I'm not sure if this was simply because they can't do anything about it anyway, but I found the exchange very positive and very interested, and I liked going there. They took it far better than the colleagues at Xerox.
Maybe a suggestive question again, but maybe this is leaning towards, well, opportunities for further research. The procedure, the physics that was used to visualise topical closeness: would it be mathematically more correct if you did a singular value decomposition of the adjacency matrix of these keywords, such as Patreon has done?
Yes, but if you do that you can't make as nice a graph as I did, and you'd probably get a similar result. I mean, I see the values of the edges; if you use all those dimensions, it's kind of equivalent.
Yes, all is silent. Okay, 3.
Another short remark on the maps. The methods, the springs that ultimately position things: how stable are those?
I didn't dive very deeply into the theory. I'd be surprised if you could prove this, but they're established for large graphs, because there's nothing else to do there anyway. You just iterate until it looks right, and if it looks wrong you just press start again. This is the practice.
Hello. Did you ever use Markov on your data to generate Spiegel articles?
I mean, no. Could you send me an email with that idea? Oh, we are going to have fun, I can just see it. But we won't just generate articles; we will also generate whether or not you're allowed to comment on the generated article. Oh, and we could also generate author names, yeah.
Right, I believe we've come to the end of our time. So if you have further questions, surely you will be...
Actually, I'll be going out to the next beer bar that I can find. If that's not in front of Saal 2, it's in front of Saal 1, so people can find me.
So, on to the beer bar, and it is time for that. Thank you.
And thanks for listening to the translation. Your translators were zebalis and Phillip, and we love your feedback: hashtag c3t, Twitter account c3lingo, email hello at c3lingo.org. We love to hear