Okay, welcome to the no-stage. We're going to have a talk with Hernâni Marques. He's an activist with the CCC Switzerland. Give him a welcoming round of applause. After all the Snowden revelations, he's going to bring us closer to mass surveillance in terms of human language technology, that is, natural language processing. So have fun with this talk.

Can you hear me? Yeah, okay. My name is Hernâni Marques. I'm an activist with the CCC in Switzerland, and I also fight the laws there that introduce selector searches, like those that have now been legalized to a further extent in Germany. I'm talking about abusing computational linguistics, also known as natural language processing (NLP), for mass surveillance purposes, and I am against it. That's already clear from the title, I think.

For the outline: first we will look at the term mass surveillance and how I understand it. Then, how many of you know what computational linguistics is? Okay, not so many indeed, so there is a justification for showing a little of which fields exist. Then we will look at the mindsets and methods used for selector search. It's not just about searching for "bomb" and things like that. There are somewhat more sophisticated approaches, but nevertheless it doesn't work, as we will see. I also ran an experiment of my own while writing my master's thesis. It's not representative; it's more to show what this stuff looks like. Then there's a reality check looking at the results of mass surveillance, and then we are open for discussion.

With mass surveillance you always have this metaphor of the haystack in which we need to find the needle, if there is one at all. That's not even clear, but they search for it. It's not necessary to save all the data, as in the Utah Data Center where everything is stored. This is not the true nature of mass surveillance.
Mass surveillance already takes place when you just go through all of the data, even if you don't save it. If you compare it with away-from-keyboard, physical situations, it's a little like airports, where everyone must pass through a control and is somehow suspected of being a terrorist, a criminal or something like this. You can also compare it with snail mail: imagine all our letters being opened and copied. Usually this is not done, at least not in Switzerland, because it's cost-prohibitive, not because they wouldn't find it cool, I think. This already shows the mindset a little: everything that is possible gets done. And on the internet it's easier to search everything than in the physical world. But there are already situations where this is done in the physical world too, like video surveillance in public spaces and at airports. In trains, too, they are starting to introduce more controls. So this is the direction we are heading in.

By the way, even if no hits occur, they have still searched your data, so that's still mass surveillance. You don't need the condition that something was found in your data to have experienced mass surveillance. If they search you, physically or electronically, this is mass surveillance, and the data doesn't need to be saved in the end. Even if only the results are stored, mass surveillance is still occurring.

So what kinds of selectors are there? In secret service circles, these search terms or patterns are called selectors. Basically, you can distinguish between two kinds. Hard selectors are metadata selectors such as phone numbers, email addresses, chat nicknames and IP addresses. Good examples are the cell phone of Angela Merkel, the phone numbers of CEOs and of journalists, and the email addresses of such people or of diplomats.
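To make the idea of a hard selector concrete, here is a minimal sketch in Python. It is not taken from the talk or from any real system: the selector list, the function names and the normalization rule (strip whitespace, lowercase) are all assumptions made purely for illustration, and the address and number are invented.

```python
import re

# Hypothetical selector list; both entries are invented placeholders.
HARD_SELECTORS = {"alice@example.org", "+41 00 000 00 00"}

def normalize(s):
    """Remove all whitespace and lowercase, so spacing and case do not matter."""
    return re.sub(r"\s+", "", s).lower()

def find_hard_selectors(text, selectors):
    """Return the selectors that literally occur in the text after normalization."""
    flat = normalize(text)
    return sorted(s for s in selectors if normalize(s) in flat)

message = "Call +41 00 000 00 00 or mail Alice@Example.org today"
print(find_hard_selectors(message, HARD_SELECTORS))
```

Even this toy version shows the point made above: every message has to be read in full to decide whether it contains a hit, whether or not anything is stored afterwards. Removing all whitespace is a crude choice that can also create accidental matches across word boundaries.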
On WikiLeaks there were several disclosures showing that lots of people in the diplomatic and economic sectors were tapped. So the targets are known there, and it's not that they just tap the phone. They probably search for all occurrences of, say, an email address in all kinds of data to see what's going on with the person identified by the selector.

If you have less of a clue what you are searching for, as is usually the case with terrorists, organized crime and things like this, you can also do content searches with soft selectors. You just assume that they must be using some specific words, some phrases, some kind of language. This can of course also be done with voice: if you transcribe speech with speech-to-text technologies, you can search the text, so you can transform everything into text. This talk is about natural language processing, but you can do the same with graphics and videos and search for patterns there.

You can of course also combine both kinds, hard selectors and soft selectors. In most of the recent cases that were really about terrorism, they, curiously enough, already knew who the terrorists or potential terrorists were. But in other cases they don't have any clue and have no hard selectors to search for, so they just try to search for everything that could somehow match the language they expect.

This is a well-known slide, I think. It's not that China, Russia and other countries aren't doing the same, but as you can see, the Western NSA surveillance complex is quite well distributed. They tap everywhere, which probably also has to do with the military stations they have everywhere, basically the US, but also the EU partners. This is something China and Russia probably cannot do to this extent, on a worldwide scale, as you can see here.
But in the teeny-tiny country of Switzerland you also have this kind of thing. That's for satellite-based communications, in Leuk, a nice place in the mountains, and there they also search for all kinds of things. They once captured a fax, a telefax message, which proved that there are illegal CIA prisons in Eastern Europe. This was probably one of the better things they did. On the other hand, we don't know what they are doing all day, so they are just searching everything in the satellite-based communications sector.

I probably need to introduce the field of computational linguistics and human language technology a little. From Wikipedia: computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective. The main interest when you talk of computational linguistics is theory and methods. It's about using computers to analyze natural language like German, French, Esperanto, Klingon, whatever. Everything is possible here. There is also the term human language technology (HLT), which is used in some of the slides from the Snowden revelations. That term is used more for concrete applications, for tools, so for the application of computational linguistics. But these terms are not so important.

So what are typical fields of computational linguistics and HLT? A basic one is tokenization: given a text, where are the boundaries of the words, and where is the punctuation? You just want to have the words. In a raw text the words are not marked directly, but you can of course look for spaces and similar cues and try to split the text into tokens, which are supposedly words, with some exceptions.
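The tokenization step just described can be sketched in a few lines of Python. This is only an illustration under simple assumptions (a single regular expression, words may contain internal apostrophes or hyphens, every other non-space character is its own token); the function name `tokenize` is made up here, and real tokenizers handle many more cases.

```python
import re

def tokenize(text):
    """Split raw text into word tokens and single punctuation tokens."""
    # A word is a run of letters/digits, optionally joined by ' or - to further runs;
    # anything else that is not whitespace becomes a one-character token.
    return re.findall(r"\w+(?:['-]\w+)*|[^\w\s]", text)

print(tokenize("Don't panic, it's a well-known issue."))
print(tokenize("U.S."))  # an abbreviation is split apart: one of the exceptions
```

The second call shows why tokens are only "supposedly" words: an abbreviation with dots is cut into letters and punctuation unless the tokenizer gets extra rules or a list of exceptions.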
There is also part-of-speech tagging, which is about marking the words you found according to their type: nouns, verbs, pronouns and so on. Once you have achieved these two steps, named entity recognition might be the next thing you need, that is, recognizing words that denote something very specific in our world. WikiLeaks, for example, is a named entity. Or if you say Edward Snowden, there might be several people with that name, but statistically you usually mean the whistleblower. So here you already see a possible problem in finding the right things, but that's the idea.

Then you can also parse a text to find syntactic structures. In the languages we usually speak here, you have a subject-verb-object kind of structure. You can try to find these elements, see how they fit together, who the actor is, who has been acted on, and so on.

Coreference resolution is needed when you have pronouns or words like "where" that refer to other words or whole phrases. You have just this one word and need to find out which noun or phrase it refers to. This is quite a hard problem, and it doesn't always work well on larger texts, where a small pronoun may refer back to the very beginning of the text. It's not easy because you sometimes need to jump across several sentences and recognize this in the structure of the text.

Then there are collocations, which are words that tend to occur together in a text. These can be named entities like names, but also just words that tend to be related to each other, for example computer and hard drive when the text is about hardware.
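As a rough illustration of collocations, the following Python sketch simply counts how often two words appear directly next to each other. That is a deliberately simplified stand-in: real collocation extraction usually uses association measures rather than raw counts, and the function name `bigram_counts` and the sample sentence are assumptions made for this example.

```python
import re
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs (bigrams) in lowercased text."""
    words = re.findall(r"\w+", text.lower())
    return Counter(zip(words, words[1:]))

counts = bigram_counts("The hard drive failed and the hard drive was replaced")
print(counts[("hard", "drive")])   # pair that occurs repeatedly
print(counts[("drive", "failed")]) # pair that occurs once
```

Pairs that recur far more often than chance would suggest, like "hard drive" in a text about hardware, are collocation candidates. Raw counts also rank frequent but uninteresting pairs such as "the hard" just as highly, which is why a real system would need a better measure.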
Then there are more complex fields which in general build on all of this. Sentiment analysis means finding indicators of emotions in text. If you write "I'm crying", you can probably infer from the text that the person is not feeling well. Or if you write "I hate this product" on Twitter, you can analyze that and see that this person feels negatively about the product. That is sentiment analysis.

What you probably know best, not always for the better, is machine translation, that is, translating from one language into another. In the 80s and a little afterwards there were attempts to do everything rule-based: analyze how English works and how German works, and then try to map these structures onto each other. It was very hard and didn't work very well. These days it's done more statistically. Say you have the texts of the European Parliament, once in German and once in English. Then you can see statistically which words might fit together and which phrases might mean the same, and in this way find patterns for translating other texts. There are also hybrid solutions where you combine insights from linguistics with statistics. I think Google Translate is also going a little more in this direction.

There's also text mining and relationship mining, for example to find persons and where they are. You have different entities in a text and want to find out who is involved and which places are involved, and then connect these things.
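The sentiment analysis idea mentioned above ("I'm crying", "I hate this product") can be sketched with a tiny word list. This is a hedged toy example, not a method from the talk: the lexicon, its scores and the function name `sentiment_score` are invented for illustration, and practical systems use much larger lexicons or trained classifiers.

```python
import re

# Invented mini-lexicon: negative words score -1, positive words +1.
LEXICON = {"hate": -1, "crying": -1, "terrible": -1, "love": 1, "great": 1}

def sentiment_score(text):
    """Sum the lexicon scores of all words; below zero suggests negative sentiment."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(w, 0) for w in words)

print(sentiment_score("I hate this product"))
print(sentiment_score("I don't hate it"))  # negation is ignored by this sketch
```

The second call is scored as negative although the sentence is not, because the sketch ignores negation. It is a small reminder of how easily word-based indicators misread what a person actually means.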
Automatic text summarization: you can imagine this as writing an abstract for a text. If you have a long text and don't want to read it, but want to grasp directly what it is about, you can read the abstract, because usually the results are also outlined there. In most cases, though, you don't have an abstract. You have a long text and want to know what's going on in it, so you can employ automatic text summarization. Usually it turns out that the sentences at the beginning, some in the middle and the ones at the end can be connected to form a summary. You still need to do a little gluing at the ends of the sentences so that things fit together and read fluently. That's a field you can also invest lots of time in.

Authorship recognition: who wrote the text? This is something that might also pose a threat to democracy. If activists need anonymity but have a really specific writing style, or are the only ones who use specific words, then methods like this can probably lead to the author.

Topic analysis is a field where you have lots of documents and want to find out which topics these documents cover. Is it about computers, is it about politics, or is it about both to some extent? It's a probabilistic approach, and in the end you just receive a list of words. You can choose how many words you want, and they somehow denote what the topic could be. In the case of computers this could again be computer, hard drive, monitor and so on. But sometimes texts are about several things, so here too the result is not always, let's say, deterministic.
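The positional heuristic for summarization mentioned above (take sentences from the beginning, the middle and the end) can be sketched as follows. This is only an illustration under simple assumptions: sentences are split naively at ".", "!" or "?" followed by whitespace, exactly one sentence is taken from each position, and the function name `positional_summary` is made up. No "gluing" of the selected sentences is attempted.

```python
import re

def positional_summary(text):
    """Return the first, middle and last sentence of a text as a crude summary."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = [s.strip() for s in parts if s.strip()]
    if len(sentences) <= 3:
        return " ".join(sentences)  # too short to shorten further
    picks = [sentences[0], sentences[len(sentences) // 2], sentences[-1]]
    return " ".join(picks)

print(positional_summary("A one. B two. C three. D four. E five."))
```

Whether the selected sentences actually carry the main points depends entirely on how the text is written, which is why this remains a field that absorbs a lot of research time.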
So now let's see a little how such methods are employed. This is an example from XKeyscore, which is, let's say, the NSA search engine, not just for public data but for any data, so also your communication. Here they simply ask the question: what do I do if I don't have a strong selector, meaning metadata, for a terrorist or whomever? And then they just start some selector or needle creationism. If someone is in a place where he speaks a language other than the one usual there, this might already be a feature of suspicion. Or if someone is using encryption. Or, and that's very vague, if someone is searching for suspicious stuff, whatever that means. So you see that what is going on here is not rocket science. The analyst just defines something and then it goes on.

We also have very concrete examples of such words. I have to explain this slide a little. The topics of interest here are weapons of mass destruction, advanced conventional weapons and government organizations, so they are somehow trying to detect proliferation. What do they use as items? These are just examples, and you can add other items: machine gun, grenade, AK-47. Then you of course have the people who want to buy this kind of stuff, political actors like "minister of defense" or "defense minister". Then you have the countries that want to buy, in this case Somalia, Liberia or Sudan, and the brokers who sell this kind of stuff, South Africa, Serbia or Bulgaria. So we also see a little of what the NSA is looking at here. Then probably also the ports these items go through. And then you can of course just do a Boolean search on that: with AND and OR you combine all of this and search in email bodies, so in all emails. This is of course mass surveillance, because it's not specific, it's targeting whole groups and populations,
it's not that they know these three guys are doing something and go after them; they are just searching everything. And you can of course imagine that these words are also used in other contexts that have nothing to do with proliferation, terrorism or anything like this, so false positive rates are naturally quite high.

Here you see that it's not just about email bodies; it can also be about chat bodies. The example here is really the classical one, how to build a bomb or a weapon, and you see the Boolean search with AND and OR. You can of course also search document bodies, calendar bodies and archives. The NSA, or whoever, can of course unzip things on the fly and look at the material. You can also convert everything into text, and if there are binary components you can probably decode them, so that's all possible. And you can do the same for the web, with HTTP requests. So it's not limited to messaging; it's basically just everything.

Just as an overview: the NSA has filed some patents, so you see they are really active in this field. Most of these are concerned with searches, such as methods for finding large numbers of keywords in continuous text streams. That's a sign that they try to generate selectors out of text streams which they have probably marked as suspicious, to have a kind of training material with which to then search the real-world data. Text summarization is also something that interests the NSA. This can of course have to do with the fact that they want the news that is out there in a compact form, since they don't have the time to read all of it. In the case of the Islamic State they probably should, because I think some attacks were more or less announced there. They should have read it completely, not with computers like this, and
then of course there is optical character recognition. If you have a scanned document, as with telefax, or if you send around a PDF in which you cannot search directly for the text, you can apply optical character recognition to get out the actual text. This can also be done with images and videos, with basically everything. There are lots of other things as well. It's of no use to go into the details here, but if you are interested I have more details on these things.

It's not just the NSA and the other partners in this area; it's also the European Union. There was a project called INDECT, which stands for "Intelligent information system supporting observation, searching and detection for security of citizens in urban environment". What the fuck. This project ran from 2009 to 2013, so more or less until the Snowden revelations. So it's not that the EU wasn't doing or planning such things before, and you will see what kinds of things they were researching. The project had a volume of 15 million euros and was part of the framework programme. There are always these big research programmes from the EU Commission. The current one is Horizon 2020; this was the one before. In this area they put 1.3 billion euros into research on such things. There was also a very strange research project, whose name I don't remember now, that cost 4.5 million euros and was just about baggage detection at airports: finding out whether someone is walking away from his baggage, and then everything turns red and someone has to intervene. So you see how much money they put into this kind of thing.

There are 131 INDECT publications. 80 of them are open access, some others are closed access in the sense that you have to buy them, and some are even restricted because, they say, they go too far into the details. So there are also some details about the concrete operations they would like to have. I have to say
that this is a research project, but they created some tools, and they also have a systemic approach to putting them all together in one big platform to surveil all of the European Union, for our security of course. 15 of these open access publications are about computational linguistics and human language technology, the fields we covered a little before.

Now you can see a little of what kinds of features they use. It's probably not so readable, but here they cooperated with the police to find out what kinds of crimes occur. As features they would like to have something like the type of the offense, the presence or absence of some behavior, the gender of the offender, the age of the offender and the ethnic appearance. That means if you happen to be the wrong age or have the wrong skin color, because people of that age and skin color committed some crime in a very specific area, you might statistically be more suspicious than others. This can in fact have consequences in real life.

Here you also see that for an HLT application they constructed a binary model. Gender is just male or female. Age is above median or below median. It's zero or one, nothing else. Then there are white Europeans and there are Afro-Caribbeans, nothing else. And you are either occupied, in the sense of having an occupation, or not occupied. So these are the great features they have. If they find lots of crimes associated with such features, and you also have them but are not a criminal, you might still be suspected of something.

Here you also have pattern matching. What they did is take a list of suspicious websites, which they say are really bad websites with content we don't want, and a list of normal websites, and on a suspicious website you
might have patterns like "hand package boss". These are just placeholders: you can replace "boss" or "package" with some other term, so each is more like a slot you can fill with different words. You might also have "Everest mountain" and "tall mountain world" as patterns. On the normal side, which is supposedly clean, you have "Everest mountain", "tall mountain world" and "temperature cold winter". Then you can of course remove the patterns that occur on both sides, and you know that "hand package boss" is the only one that is suspicious, and with that we now go out and search everything. That is roughly the idea here: they check their patterns against normal websites, suspicious versus normal.

You can also do relationship mining, and they do it in a paper with some examples. There are two different kinds of events, one about illegal transportation of something and one about an illegal financial transaction. Here you see the coreference resolution thing. You don't have concrete names; you have words like "we" and "them", so you need to know whom they are talking about. Probably there is a name behind each of them, and from the text you can find out who the person is and that they are doing something, indicated by certain words in the sentences. So here people are doing something in Bangladesh, some illegal activities, and the interesting thing is of course that there are no concrete names, but they may still know who the named entities behind this are.

The best one, I think, is this one here, the terrorist chat. Here they identify a name. That's a named entity, so it denotes a specific person, an individual. There are also countries involved. GPE means geopolitical entity and the second tag means nation, so this is a geopolitical entity of the kind nation. There are also other kinds of geopolitical entities. So you see how they can find the named entities
and relate them to each other, or at least they try to do so.

From WikiLeaks there was also an interesting leaked document. It is marketing material, but there are still some technical things in it. It's from a French company which doesn't exist anymore, Scan & Target. You see that their business is extracting intelligence from SMS, instant messaging and emails. So you also see that it's not just about governments; you of course also have private actors. Snowden himself was not working directly for the NSA but for Booz Allen Hamilton, so there is a public-private partnership here.

This is a slightly silly example, but they just want to say that they of course handle lower-case and capital letters, and if you try to write, in this case, Viagra a little differently, they can still recognize it, or if you write bomb with a zero or whatever. So they just want to tell you that they allow for some edit distance between words. They do that for English, French, Spanish and different Arabic dialects and transliterations. Here you see that they also provide APIs for homeland security, so that's a good example of public-private partnership. They also claim that with some IBM machine you can analyze 1,200 tweets per second and so on. So the possibilities are there, and they also say: now it's possible, we can do mass interception from a technical point of view, so let's just do it. You also see that they handle different transliteration schemes for Arabic. If you want to put Arabic letters into the Latin alphabet, there are different ways to do that, and they take care of that.

Here, for pedophilia, they claim to have found some features which could show that someone is sharing material, such as a filename like hidey.rm, or "four year old", which might refer to a child, and then the term pthc, for preteen hardcore. So I'm pretty sure you also find
other things with pthc if you search around. But the idea here is that they have different features, different terms, and with them can find potential pedophilia on the internet. The same goes for drug traffic detection. The numbers are already quite astonishing here, with 20 to 30 million SMS per day that you can analyze if you install the system yourself. Other interesting things are examples you probably know, like saying snow instead of cocaine. Of course, if two, let's say, real criminals are always talking about snow, this might just be another term for the real thing.

They claim they have high precision. I don't think this is really high precision. 14 true positives per one million SMS, and that only if the terms are right, which you have to take into account, is not a great result. In a way it's catastrophic. The customers are also interesting here. You see not just the NSA, Chinese and Russian ones but also ordinary companies. Whatever they are doing with this kind of system, they are probably analyzing their clients and customers.

In Switzerland, because I'm from there, we also have the NDB, the Nachrichtendienst des Bundes, the Swiss secret service. They of course have the usual things, like COMINT. COMINT, if you want to distinguish it from SIGINT, is when you analyze the discourse of communications, meaning material that involves natural language. Sometimes you just have signals: when computers communicate with each other, they usually don't talk in English but with some protocols, so they have their own language, which is probably not so interesting in this area. So there is SIGINT and COMINT, and you can distinguish them like this. Open source intelligence is just everything you
can grab in public space, including physical spaces. Human intelligence is when you involve persons to do some surveillance. Then there are also military relations and partners, like the German BND and the cantons, because Switzerland has different states. And of course there is also image intelligence, for example when you look with drones. In any of these sources you might have linguistic information which you can put into your database and search, so that's the link here.

The thing I showed in the beginning, this satellite-based mass surveillance system, is called Onyx. It was built in 1999 illegally, that is, without a legal basis, for 45 million. There are claims that the costs have now gone up to three or four hundred million or more. It became operational in 2005, and the legal basis was only created in 2012, after there had been some media scandals, so it ran for seven years illegally. So Switzerland too, despite being a democracy, is not behaving properly. The interesting thing about this legal basis, which from a Swiss perspective is just completely exaggerated, is that they retain content data for 1.5 years and metadata for five years in order to make retrospective searches.

Now there was even a popular vote about this. We launched a referendum to stop them from also doing cable-based selector searches. As you know, most communication does not flow through these satellite links but through fiber-optic and other kinds of cables, so they want to tap the cables in Switzerland to see what's going on there as well. There they did exactly the same thing: 1.5 years for content and five years for metadata, and then they can search. This will take effect in about a month. We
tried to stop it, but the population didn't understand the law, I would say, and just voted yes to it with 64 percent. I think the debate was also a bit difficult. We didn't have enough money, so it was difficult to win this kind of thing, because usually you need to make lots of propaganda and really break things down into easily understandable sentences, and we somehow just failed at that. Also, because it's digital, about electronic things, people cannot really feel it. If you voted on searching all the snail mail, they would probably vote no, because that feels like an invasion; it's physical. That's my view of it. Usually if you approach someone and want to look at his cell phone or open his snail mail, he gets a little nervous and doesn't want to share the data. But on the internet everything seems fine: they should just search it, I have nothing to hide. These are the typical sentences you already know.

The special thing in Switzerland is that it's not the civilian secret service doing the selector search but the military itself, directly. There's a unit for that, the center for electronic operations (Zentrum für elektronische Operationen in German), and they just pass the results over to the NDB. The search categories must be accepted by three instances: a court, an administrative entity and also political control, that is, the elected executives of the federation. However, there is a passage in the ordinance to the law saying that they can also add other selectors if the results are not good enough. That's quite an issue, because from a linguistic perspective you can show that just by adding some very specific words you can create a kind of new category. So even though the oversight of these practices thinks everything is right, there is still a kind of backdoor to just add more words and do
things, and in the end they cannot really control what's going on. We saw that in Germany too, with the NSA-Untersuchungsausschuss, the NSA inquiry committee: there is a big mismatch between what politicians think and what's really being done. In some cases they probably knew things, but I'm quite sure that in lots of the very specific cases they had no clue which terms were being entered into the system to search for.

There's also another nice thing which is complete nonsense. The NDB is not allowed to create categories or propose search terms that include Swiss named entities, like Swiss companies, let's say Novartis, or Swiss politicians, or even Swiss extremists. That shall not be possible under this law. But you can just find other search terms which yield exactly the same results. That's one way to circumvent it, with information retrieval and computational linguistics. The other is simply to cooperate with other secret services and tell them: please give me this data we cannot have, and I give you the other. That's how things work, so it's just a strange thing.

I then also did experiments of my own. There are two groups in Switzerland which the NDB says are the main extremist organizations: on the left wing the Revolutionärer Aufbau and on the right wing the Partei National Orientierter Schweizer. What I did was take their websites as training material, with the idea that from these two websites we can probably extract typical words, phrases and terms used in these scenes, and with these selectors you can then search for other groups that are similar to them. So in the end I am auto-generating keywords, because that's always the somewhat mystical question: what are they doing, how do they come up with these words? I'm quite sure they are not just taking a sheet of paper and writing
down something; they are concretely analyzing material and then generating selectors out of it. Probably they add some things manually, but that they do everything manually is not realistic, I think. So in the end I had 70 selectors for left wing and 70 for right wing. More concretely, 30 selectors were based on term frequency and inverse document frequency measures. That's just standard computer science and computational linguistics stuff to rank documents; if you make a basic search engine, it usually starts with TF-IDF. There are also other, somewhat more complex approaches, but it's just a method to find out the best terms to represent the documents. Then I also created six selectors with intensification vocabulary. There was a paper which showed some terms which are typical for either scandalization or conspiracy kinds of things. In the easiest case this can be things like: if you say 'so-called democracy', you show you don't believe in democracy, so probably you are claiming that there is someone controlling everything, stuff like this. Or if you rant around with intensifiers, with specific adjectives: you can have lists of such scandalization vocabulary and create selectors with that. Then there are also topic words. That's the thing I said before about topic analysis: when you have a bigger amount of documents, you can try to find out what the topic is in there, and this is then shown by specific words. As for the selectors, 10 of them were single words, so very specific words which appeared there; 10 of them were word 2-grams, that means two words in a specific order, or just a phrase; and 120 of them were combinations, so order doesn't matter, of two to ten words, which
showed to be somehow typical for these groups. A typical example with the TF-IDF model is 'anti-racist action' on the left-wing side, or, and that's now specific to Switzerland, 'August 1st'. That's the national holiday, and there the right-wing groups tend to gather and tell everyone how great Switzerland is and that we have to defend it against everyone. So that's a very important kind of thing for them, and this even appeared as a combination. In the area of the scandalization and conspiracy indicators, on the left-wing side the indicator was 'pretext', and the connotations, collocations, sorry, which were found were 'action' and 'militant'. On the right-wing side you had things like 'adversaries' and 'campaigns', so they were talking a lot like: the adversaries are lying in their political campaigns. So you see a little bit that it seems to make sense. In the topical area, so topic analysis, on the left-wing side we had, and that's probably even a good summary of what they are talking about all day, 'life', 'politics', 'capital'. On the right-wing side you had 'refugees', and their own name was quite important there, because they were claiming they are the only ones doing something against it. So these are examples of selectors, and I had 140 of them in the end. Then, because I am not the NSA or the Swiss NDB or something, I just used public sites to search around a little bit and look if there are matches, of course excluding the training material itself, otherwise it would be nonsense. So the evaluation corpora were the Startpage search engine, DuckDuckGo, and then there's also not Evil, a search engine for Tor hidden services, at that time with 15.8 million documents; documents means HTML sites, PDFs, books, everything. Then there's also this decentralized search engine which has no social bias in there, so it doesn't matter
who is searching for something. You don't need to log in; you just download the software and you are part of the search system. There you have two billion documents. Then I also did some surveillance of my own, I mean of my own material at home, where in the end I had 208 thousand documents. It was just everything, even PGP lookups, so web of trust, everything was in there a little bit. I could then search with the selectors I created based on the left-wing and right-wing extremism material. So 140 selectors times five corpora makes 700 selections, or searches, and then I just cut off after five results. I cut it because sometimes you have thousands of results and other times you have just three or even zero; if you have very specific terms, you know that from your own searches, if you enter something very long and specific, the result is probably zero. So in total there were 3,500 potential hits to evaluate manually, of course to be done by two persons to have some control that you are not too biased. About 2,200 were successful, so were shown as, not true positives, sorry, just matching; these can be true positives and false positives at this point. Coming to true positives, depending on the corpus and the terms it was zero to 25 percent. And if you compensate for too few results, so if you have fewer than five results, there was another scoring function to compensate a little bit for that, it was up to 22 percent. As for the inter-annotator agreement, so the two persons doing that: in the left-wing field they were quite agreeing, they marked the same things as true positive or false positive. On the right-wing side it was not so easy. I think this has to do with the fact that there were lots of conspiracy sites where you don't really know if they are right wing or just completely out of order somehow. So yeah, sometimes it's not
so easy to say: oh, these are right-wing people. They can also be left-wing people or just other people, so here it was not so clear. And 1.0 would of course have been a perfect match. This also shows that if you have different analysts, they might not have the same perception, so that's already another point where things are getting a little bit less rocket-science style. And what's the reality? This is from Netzpolitik: from 25,000 hits they had in 2014 with the selector search being done in Germany, just 0.26 percent were in the end marked as true positive, or relevant from a secret service point of view. This doesn't mean you can avoid an attack or something; it's just interesting. That's an absolutely catastrophic result, and it has of course to do with the fact, and that's now from Bruce Schneier, that by making the haystack bigger the needles don't just become more. It doesn't make any sense; probably there is not even one. So just by pure mathematics it's clear that the false positive rates will just explode the more data you put into your system. So this kind of stuff is complete bullshit; it doesn't work in the end. And also in the talk of Binney [unclear] the same conclusions. So that's a little bit of an overview. Probably we have time for questions, 10 minutes, or if you want to know something more specific than what was in the slides, just go on.
I think from the timing we could ask some questions. Thanks a lot for the talk. So please go ahead at the first mic.
I have a question: do you have the same suspicion as I have, that mass surveillance is input for mass manipulation?
Yes, absolutely.
I mean, the next step.
Yes, absolutely. It has to be noted that, for example, the NSA has a National Intelligence Priorities Framework. There are 32 surveillance fields. One of them is counter-terrorism; others are things like economic innovation, or let's say economic
espionage; another one is called leadership intentions, or let's say what it is, diplomatic espionage; and then there are also things like environmental movements, so this would be a political kind of activist surveillance. So yeah, it's absolutely clear. And by the way, in these fields it works of course better, because there you know exactly what people you are searching for, and it depends on the features you need to classify people. I mean, if you want to know who likes a certain band, you just have to look up Facebook and Twitter, put them all together and say they all like this band. So that's feasible, and you will have high precision rates, of course, if they are not lying. But in the case of terrorists, of course, they don't show themselves like this, and the features you have there are really difficult to model. There's not enough data, there are not enough terrorists, to do that kind of stuff. And there is no correlation of the kind that when the haystack gets bigger, more needles are in there, so false positive rates, purely mathematically, just explode.
But what I mean is: they are going to move the needles. If they profile people, they can send them messages which confirm their views, because they know their views, and then very slowly they can move them in a certain direction.
Yeah, that's also possible, yeah.
They influence their position, and then they can influence elections.
Yes, of course. There are also documents from GCHQ where they manipulate polls and stuff like this, or where they set up bots to influence public opinion. So all this kind of 'the Russians are doing something' is a little bit ... they help with ... yeah, I mean, both are doing the same thing.
And there are slides from Snowden about how they create bubbles and move these bubbles around.
Yeah, in the way that it's [unclear], yeah, but of course the one thing helps the other
thing.
Yeah, so that's the end of democracy.
Yes, of course, so we need to stop this.
The public doesn't understand that this is happening.
Not really, no. I mean, we had a public debate in Switzerland with a popular vote, and they just didn't understand what's going on; that was my impression.
Yeah, mass surveillance is necessary for mass manipulation, that's my statement.
Yes, I would also sign up to that.
Please step to the mic up there in the front, thanks.
Hi, I was a little unclear on the search terms used to categorize something as either left wing or right wing. Those were automatically generated?
Yes.
But were they then confirmed by anyone, like by you, that this actually looks like something that would be left wing, or this looks like something that would be right wing? That would imply some of your bias, but also that it needs more human hands in it.
Can be, but no, I didn't change anything. So it was pure mathematics, or computational linguistics, which is somehow getting more and more like mathematics, because, I said it in the beginning, the rule-based approaches are getting less funded, and with them it is very complicated to construct a model. So everything is going towards statistics, and here too I just applied statistical methods. I didn't remove any selectors; I just took them as they were. But of course in reality perhaps they would remove some of them, maybe, I don't know.
So the human selection that was then done was that you pointed out which websites you were using, and this was the evaluation?
I then just used ... you can just imagine, I just took the terms, put them on Startpage, on DuckDuckGo, wrote down the five ...
I mean the step before that: when you're picking out how to get these keywords and key terms automatically, what you fed it was the websites, so that is the human selection.
Yes, I just used the websites as training material. So I did pre-processing there: I just tokenized
everything, removed stop words, and then I used Apache Tika to also extract text from PDFs and stuff like this. Then I had raw text which I could use to generate selectors out of.
What I mean is: that also means that the people who publish stuff on these websites will indirectly affect how well people will be able to find them automatically.
Yeah, that was just an assumption, yes. But that's the problem here: it's not clear if it fits together in the end. I mean, in the end the false positive rates are not that good; the bigger the corpus gets, the more false hits you have in general.
Good, thanks a lot. I think that was the time we had to spend. Thank you very much, Hernani Marques, for the great talk.
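
The pre-processing and TF-IDF ranking mentioned in the talk and in the last answer (tokenize, remove stop words, rank terms, take the best ones as selectors) can be sketched roughly as follows. This is only a minimal illustration under stated assumptions, not the speaker's actual code: the toy documents, the stop-word list and the function names `tokenize` and `top_selectors` are all made up for the example, and the score is one common TF-IDF variant (raw term count times log(N/df)).

```python
import math
import re
from collections import Counter

# Hypothetical, tiny stop-word list; a real list would be much longer.
STOP_WORDS = {"the", "a", "of", "and", "to", "is", "in", "under", "about", "for"}


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def top_selectors(target_docs, background_docs, k=3):
    """Return the k terms of target_docs with the highest TF-IDF score.

    TF is the raw count of a term in the target documents; IDF is
    log(N / df), computed over target and background documents together.
    """
    all_docs = [tokenize(d) for d in target_docs + background_docs]
    n_docs = len(all_docs)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in all_docs:
        df.update(set(tokens))

    # Term frequency, counted over the target documents only.
    tf = Counter()
    for tokens in all_docs[: len(target_docs)]:
        tf.update(tokens)

    scores = {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
    # Highest score first; ties are broken alphabetically to stay deterministic.
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


# Toy stand-ins for the two training websites (invented for illustration).
left_docs = [
    "capital politics and life under capital",
    "politics of capital and militant action",
]
right_docs = [
    "refugees and the national holiday for refugees",
    "campaigns about refugees",
]

print(top_selectors(left_docs, right_docs))
print(top_selectors(right_docs, left_docs))
```

The word 2-grams and the unordered combinations of two to ten words mentioned in the talk are not covered here; only single-word selectors are ranked.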