Ten years ago I created a distributed search engine called YaCy.net, but I'm not talking about that today. I'm talking about a new thing, and it's a little bit political, it's a philosophy, and it's a lot of fun with a new toolbox you can use to get a lot of data. The title of this talk is "Download your data", but what does that mean?

I like to scare my audience a bit with the first slide, so here it is. It's called "The World Government", and it's from a German news magazine from last month. You probably know these faces; I think some of them shouldn't be there and some others should be, but I won't go through the names. These companies can be called netocrats. The term comes from an idea from about 15 years ago, which predicted that such netocrats would come to exist, and the philosopher Alexander Bard gives nice talks about it that I recommend very much. It's essentially about removing broadcasting from the community so people can talk to each other directly: they have social platforms, broadcast is not necessary anymore, and governments don't have so much power, so the power comes from the people themselves, steered by the netocrats. But governments consider this dangerous and are trying to take it over, and there is the PRISM project, which we all know about.

So from the open source, open data perspective we should create open alternatives, because in my opinion, without open source there is no free speech. Alternatives appeared recently for Facebook or LinkedIn, I don't know of anything for Instagram or similar, and for Twitter there was identi.ca. But all of these projects declined in some way and haven't been successful. The question is: why can't the market decide, why can't the market say there is something better? I believe it's because of the amount of data in these platforms. The more data there is, the better the platform gets and the more attractive it becomes, so the free alternatives don't have a chance to catch up. There is one exception, one free software project which is really nice, and that is OpenStreetMap. In my opinion OpenStreetMap is the best map you can get, and they have the best data. This is an example where free software succeeded because it has free data.

So I thought about how we can catch up with this problem, and one solution is to scrape data from one of these social community platforms, in this example Twitter. Twitter has an API, but you need authorization to get data from it, and then you only get a limited amount of JSON. This restriction was introduced three years ago: there was a blog post, "Following up on API housekeeping", which announced essentially that the open APIs were being closed and you now need authorization. So if you want to get data from Twitter without authorization, you need to create a scraper which takes the data from their HTML pages. This is the same thing I did ten years ago with search engine technology: we take data from web pages, and we can do the same with Twitter and scrape down millions of tweets.
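To make the idea concrete, here is a minimal sketch of such an HTML scraper in Python. This is not loklak's actual implementation (that is Java); the search URL and the "p.tweet-text" selector are assumptions based on the 2014-era page markup, which is not guaranteed to match today's pages:

```python
# Sketch: fetch Twitter's public HTML search page and pull tweet text
# out of the markup, with no API authorization involved.
# Assumptions: the /search?q= URL and the p.tweet-text selector reflect
# the old page layout and are used here purely for illustration.
from urllib.parse import quote
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup  # pip install beautifulsoup4

query = "fossasia"
url = "https://twitter.com/search?q=" + quote(query)
# Send a browser-like User-Agent so the server returns the HTML page.
page = urlopen(Request(url, headers={"User-Agent": "Mozilla/5.0"})).read()

soup = BeautifulSoup(page, "html.parser")
for tweet in soup.select("p.tweet-text"):  # one <p> per tweet in the old markup
    print(tweet.get_text())
```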
So let's try this. Other people have tried it as well, and their services went down: one of them, a Twitter search engine, announced that "unfortunately our Twitter data supply has become unreliable over the past few months", so that service is dead. There is another search portal for Twitter called Topsy which is really nice and works really well, but it's a closed platform and it was acquired by Apple, so they have enough money to pay for the tweets they want to get. That's not something we want to do; we want to free the data and get it ourselves.

Maybe the data could look like this: if you know Kibana, it can do analysis of mass data, and this is an example analysis of tweets about FOSSASIA from a few days ago. If you have enough data you can do this, so let's get enough data.

First: how many tweets are there? In 2010 there were 50 million tweets every day, the next year 200 million, then 340 million, and 500 million two years ago. The approximation is that they are now growing by 30% every year, so we can guess at least 800 million tweets every day, maybe one billion. That is the amount of tweets, and the question is: can we get them all? Can we get so much data?

Before we start, we need to know how a search engine works in its basic parts. You always have a content harvester, a search index, and a search front-end. What we try to do is create only the first two parts; the search product can then be built by everybody else, because the data is available. I called the project loklak, and it is structured like this: as the search index we use Elasticsearch, and as the content harvester there is a Twitter scraper, but there can be more scrapers; you can plug in your own scrapers or APIs to take messages from other services as well. As a front-end you can instantly use Kibana. You don't have to, you can take something else, but this is just an example. What we also do is dump all the tweets we get from Twitter into a JSON text dump, and we add a peer-to-peer interface so that all these installations of Twitter scrapers can communicate with each other and exchange their data. The JSON dump then becomes a very large collection of archived tweets.

Here is what happens if you search on the interface. A search portal makes a search request, and the request is transported to three targets: one is the internal search index, one is the Twitter scraper, and one is a so-called back-end, another loklak peer. You can configure which peer that is, so you can build your own network. When the results are computed, the new tweets are stored in the JSON lists, and as a third process the new JSON texts are transmitted to the back-end peer, so all the new information you collect is spread through the network you set up.

Here is a small calculation from a dump I already created. As you can see from the numbers, a compressed message with all the JSON data and metadata needs only about 106 bytes. Scaled up to about one billion messages, that is roughly 128 gigabytes, and if you calculate this for one month, you need only a 4-terabyte disk to collect all tweets of one month. This is a kind of small feasibility study. I don't expect that we will actually collect that many tweets, but it looks like it's possible: a 4-terabyte disk is sufficient to archive all 30 billion tweets a month from Twitter using loklak.

loklak is the portal which is already up; the software exists, you can download it, and it's LGPL licensed. I want to show it now. This is the live webpage from loklak.org. Loklak is a spicy Cambodian meat dish, and you can say that if there is data in it, then you have meat in your application; that's the idea behind the name. If you search for FOSSASIA, for example, this is what you get. It's JSON; it's not very easy to read on this screen.
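For reference, here is a minimal sketch of making that same search request from Python. The /api/search.json endpoint is the search interface shown here; the exact result field names ("statuses", "screen_name", "text") are assumptions based on loklak's Twitter-like JSON and may differ between versions:

```python
import json
from urllib.request import urlopen

# Ask a loklak peer for recent tweets matching a term.
# Assumption: the peer answers on /api/search.json and returns a
# "statuses" list whose entries carry "screen_name" and "text" fields.
url = "http://loklak.org/api/search.json?q=fossasia"
with urlopen(url) as response:
    result = json.load(response)

for status in result.get("statuses", []):
    print(status.get("screen_name"), status.get("text"))
```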
But you get all the hashtags extracted, and there's also a de-shortener for all the shortened links. It's very easy to de-shorten links, and this is done by loklak itself, so you get clean data and a lot of statistics, formatted in such a way that Kibana can extract entities from it very easily.

What else is on this webpage? There's a nice about page which describes the technology itself, the same slides I have shown you here. The network is a bit slow, a timeout; I don't know what happened with the network here, so please try it yourself, go to loklak.org. I wanted to show you the API. There is, for example, a status API, and there's a crawler as well. The crawler is able to fetch mass data from Twitter starting from one search term: in the results, the crawler looks for all the terms which appear as hashtags and names, feeds them back into the crawler, and so creates large lists of terms and downloads the search results for them. This way you can easily collect 100,000 tweets in a few hours in your own index.

So that's almost it. I don't know why the internet connection is not working, it would be so nice... here it comes up, it's only slow. It describes how to download and install the application: you need git and a JDK, and then you have it up and running in one minute. And there is this dump directory, the place where all the JSON dumps are stored as text files, so from every peer you can download all the scraped tweets as JSON data.

So we did it: tweet search with a JSON API. To sum it up, you can collect, dump and index tweet search results with loklak. It's kind of anonymous, because it's a portal: you can anonymize your requests, and you don't need authorization from Twitter to get their tweets. There's also another kind of anonymity, because the built-in de-shortener removes the need to access all the shortener services, so they also don't know that you visited their pages. And it's peer-to-peer software: you can set up a peer-to-peer network, a distributed peer-to-peer scraper with Kibana on top.

That's the end of the presentation, and it explains why "download your data" means it's your data: you can download all your own tweets, and you can even modify downloaded dump files by just doing a grep for a term, which gives you a new dump list that you can easily import into loklak by throwing it into a handover directory (there is a sketch of this step at the end).

One thing I forgot to show is this Kibana page, which shows the most recent tweets here in the statistics. You have navigation elements here, and it shows the most common hashtags used in combination with the search term FOSSASIA, and so on.

I hope you like it and use it for your own purposes to analyze tweets and statistical data. Please send me your comments on Twitter, show me the applications you build, and I will happily link all the applications you create with the Twitter scraper on the loklak homepage. Thank you.
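As a footnote, here is a minimal sketch of that dump-filtering step in Python (the spirit of the grep mentioned above). The file paths ("dump/own/..." as source, "dump/import/" as the handover directory) and the one-JSON-object-per-line dump layout are assumptions for illustration, not checked against loklak's documentation:

```python
import json

# Filter an existing loklak JSON dump for one term, writing a smaller
# dump that a loklak peer can pick up from its import directory.
# Assumption: each line of the dump file is one JSON message object.
term = "fossasia"
with open("dump/own/messages_00001.txt") as src, \
        open("dump/import/fossasia.txt", "w") as dst:
    for line in src:
        try:
            message = json.loads(line)
        except ValueError:
            continue  # skip any malformed line
        if term in message.get("text", "").lower():
            dst.write(line)
```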