Hello, everyone. My name is Michael Christen. Last year I presented at FOSSASIA and we started a project there called loklak. We have collected a huge amount of tweets, and today I want to show you how to handle mass data in three aspects: how to collect mass data, how to store and index it, and how to evaluate its content.

Here is a quick view of a tweet collection about FOSSASIA. This is a histogram showing how many tweets carried the hashtag #fossasia, and you can see who tweeted the most, who was mentioned most, and which hashtags were attached to tweets most often. So this is a quick view of what you can do with this at the end.

There was a Google Code-in last year where students made posters explaining what loklak is, and this is the best one. It is a server application; we provide an API that we created; we use Elasticsearch; we do everything with JSON; and we use Kibana to do the magic of evaluating the data.

The task was: we want to collect every tweet that is tweeted out there. The question was, how many tweets are there actually? Public statistics stop at 2013, and the prediction was that in 2015 about 800 million tweets would be tweeted every day. Let's say one billion tweets every day, and let's collect them all.

How far did we get? This is the histogram of the number of tweets we collected. After one year we have 708 million, so not even the amount of a single day, but we sped up in the last few months. Altogether, out of every 200 tweets tweeted in the world, we now have one. That's not too bad. We can increase the speed if we scale horizontally with more servers, and we know how to do it.

So what's inside this technology? If you create a search engine, you have the concept of an index, something that harvests the content, and a portal which presents what you have collected. The left part is the loklak server, and everyone is invited to create search interfaces on top of it; I can show you how this was done. Inside we use Elasticsearch and a scraper. For evaluation the slide says Kibana, but that is not true anymore; we can use Kibana and much more. We have a peer-to-peer sharing interface, we dump everything as JSON lists, and we have loklak.net, which is a kind of Twitter clone. It looks like Twitter and you can use it like Twitter. It's not finished yet, but you can see what's on the way.

These are details about the peer-to-peer sharing mechanism. We have a hierarchical tree as the collecting topology. I don't want to go into the protocol details here, but we have a kind of tree in which we collect the data, and I want to show you how we organize it.

This is the home page. If you go to loklak.org, just click through About, Showcase, Architecture and so on; you get a huge amount of information. Have a look at the API. Everybody loves our API because it is open: there is no password, you don't need to apply, you can just use the API and do whatever you want to do with loklak.

This is the drawing, again, of our search topology. We use Elasticsearch as an embedded node, so it's easy to work with, but you can also take it apart, run an external node, and make a cluster out of it. Then you have shards on all these nodes, and you can scale this up by putting on more shards.
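To give an idea of how little wiring this needs, here is a minimal sketch of creating such a sharded index over the standard Elasticsearch REST API. The host, index name, and shard counts are illustrative assumptions, not our exact production settings.

```python
import requests

# Minimal sketch: create an index and let Elasticsearch place the shards.
# Host, index name, and counts are assumptions for illustration only.
index_settings = {
    "settings": {
        "number_of_shards": 16,   # spread the documents over 16 shards
        "number_of_replicas": 1,  # keep one copy so a lost shard can reappear elsewhere
    }
}

response = requests.put("http://localhost:9200/messages", json=index_settings)
print(response.status_code, response.json())
```

Once the index exists, growing the cluster is mostly a matter of starting more nodes with the same cluster name; Elasticsearch rebalances the shards onto them.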
You can scale horizontally: the more data you have, the more shards you need. What we also did is connect two loklak servers, so we can load-balance across them. What is running now, and what you see when you go to loklak.org, is a construction of 16 shards on eight disks on two servers, with two loklak instances and two load balancers balancing each other. This is the search cluster we use right now to host 700 million tweets. If we want to get more tweets into this large-scale search engine, we just need more servers and we put up more shards; Elasticsearch does most of the work for us. We just say "this is an index with a specific name", the node joins the cluster, and the capacity to store mass data increases.

There are nice front ends to organize Elasticsearch. This is elasticsearch-head. Some years ago, when clustered search became available in Solr, you had to do the wiring yourself; it was planning work, and you had to draw where the data should be stored. Elasticsearch does this for you, and the elasticsearch-head plugin visualizes which shard is stored on which node. This layout is created automatically, and if a shard dies, which happens sometimes because data on disk gets destroyed or there is a memory failure, the shard disappears and reappears somewhere else. If you use replication, this is no problem. You can also see which indices are inside; these are all the names of the index schemes. In ElasticHQ you can also see the distribution over the different servers and the nodes on the servers, so this is a nice thing to have.

This is the cluster status page, which shows the number of messages, slightly below 700 million; I made this screenshot three days ago, and now we have 710 million documents. If everything works fine, I will show it live later. This is the cluster health status, which is unhealthy because the response time is too high; the server I rented in Singapore is not good enough. And this is a view of Kibana, where I can do a test search and see a histogram. You can do faceted search using the fields on the left side, and this is a presentation of one attribute that we collect in the tweets.

If you have a large amount of text, there are interesting things you can do, for example emotion analysis. We have Bayesian filters which try to identify which emotion is expressed in the tweets. The emotions we distinguish are joy, fear, trust, sadness, anger, surprise, and anticipation. If you search over the whole collection of tweets and make a histogram of the emotions, represented as percentages, for search terms for the presidential candidates, which we have here (this is Obama, this is Hillary, and this is Trump), you can see how the emotions change, how joy and fear increase and decrease. I don't know if this is going in the right direction, and I don't know if our Bayesian filter is working correctly, but it's a nice thing to play around with and see what happens. I think the training of the emotion dictionary is not good enough right now, but that's the way to go. It is just one example of what you can do if you have mass data and analyze it.

Yes? Pardon? What is my prediction? I don't have a personal prediction.
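Our actual classifier is a Bayesian filter trained on an emotion dictionary; as a stand-in, to show the kind of aggregation behind the histogram, here is a toy sketch that scores tweets against a tiny hand-made word list and turns the counts into percentages. The word lists are invented for illustration and are not the loklak dictionary.

```python
from collections import Counter

# Toy stand-in for the real Bayesian emotion filter: count tweets whose text
# contains a word from a tiny, invented emotion word list, then report
# percentages per emotion. Purely illustrative, not the loklak classifier.
EMOTION_WORDS = {
    "joy": {"happy", "great", "love"},
    "fear": {"afraid", "scary", "worried"},
    "trust": {"trust", "reliable"},
    "sadness": {"sad", "sorry"},
    "anger": {"angry", "hate"},
    "surprise": {"wow", "unexpected"},
    "anticipation": {"soon", "waiting"},
}

def emotion_percentages(tweets):
    counts = Counter()
    for text in tweets:
        words = set(text.lower().split())
        for emotion, vocabulary in EMOTION_WORDS.items():
            if words & vocabulary:
                counts[emotion] += 1
    total = sum(counts.values()) or 1
    return {emotion: 100.0 * counts[emotion] / total for emotion in EMOTION_WORDS}

print(emotion_percentages(["I am so happy about this", "really worried and afraid now"]))
```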
These are two maps. We also do location analysis on the tweets. From Twitter we only get location names, so we have a location database, calculate the geo-coordinates, and put them into our database. When we make a query, we can then render the result on a map. What you can see here on the left side, although it is not very easy to see, is Europe, and the query term was "refugee": in which parts of Europe are tweets made about refugees? There is no reference to emotion or anything else, just the number of tweets, and it is a bit misleading. If you do an analysis like this, it can be completely wrong, because in Europe there are many languages, and only in England refugees are called "refugees". That is why there is a big red dot there. Every time you do a representation like this, you have to interpret it: it is not that refugees are the biggest topic in England, it is just that I used the English term to search. On the right side you see Singapore; this is the haze map, the number of tweets with the word "haze" in them, and it is tweeted mostly in Singapore. But maybe you could also interpret it as Singaporeans tweeting more than people in other countries. Maybe that's true, I don't know, so it does not necessarily mean there was more haze. You must think about what kind of representation you have. But it is nice to have; you can do a normalization, and then what the picture means becomes clearer.

During Google Summer of Code, a nice frontend was created, like a Twitter clone, which is still being built. And in December we had a Google Code-in where we asked students to make applications for the search frontend. People made whatever they wanted: here is somebody who made a Star Wars tweet retriever using loklak as a back end; this person just wanted Star Wars terms and tweets, and it collects all of this using loklak. Then we made an artificial intelligence chat bot; this is a chat bot for, I think, Telegram, and it works like this: you say something, it searches for the term, takes the last 100 tweets, and selects the tweet which has the most retweets (a small sketch of that selection step follows below). We think this is a nice game to show what you can do with a large database of tweets, and it is an experiment to see whether some kind of intelligent answer can come out of it. Somebody made a frontend to search the loklak API from an Android application. And then there is the loklak wok, which is an Android harvesting application. loklak is peer-to-peer software which needs clients to do the harvesting, and this is a harvesting client on Android. It does not have any interaction; it just tells you what it does. If you don't have Wi-Fi, it does nothing; if you have Wi-Fi, it shows the tweets it is collecting, how many there are, and more statistical data, all moving like a science-fiction head-up display. You can't interact with it, but it is one of our ways to collect data.

This is the loklak.net frontend, which is like a Twitter clone, and it makes some nice statistics. If you ask for statistics for one account, it can show you an analysis of the followers: a distribution over countries, so you can see where your followers come from, and the most influential followers. So it's a nice tool if you use Twitter.
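The chat bot mentioned above boils down to a single selection step: search for the term, take the most recent tweets, and reply with the one that was retweeted most. Here is a minimal sketch of that step against the open API; the endpoint path and the field names ("statuses", "retweet_count", "text") are assumptions based on the Twitter-like JSON the API returns, so check the API documentation on loklak.org for the exact names.

```python
import requests

# Sketch of the chat bot's answer selection: search loklak for a term,
# look at up to 100 recent tweets, and pick the one with the most retweets.
# Endpoint path and field names are assumptions; see the API docs on loklak.org.
def most_retweeted_answer(term):
    resp = requests.get(
        "https://loklak.org/api/search.json",
        params={"q": term, "count": 100},
        timeout=30,
    )
    statuses = resp.json().get("statuses", [])
    if not statuses:
        return None
    best = max(statuses, key=lambda s: s.get("retweet_count", 0))
    return best.get("text")

print(most_retweeted_answer("fossasia"))
```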
It also has, of course, the Twitter timeline. This is my timeline from a few minutes ago, and it looks like a screenshot from Twitter, but it is our own application. We did some more: you can also tweet from there, you can write something like markdown long-text tweets, where the long text is attached as an image, and you can tweet maps as well.

So this is the end of the slide presentation. I hope I am still online and can show you some of the index live. This is the ElasticHQ frontend, showing the current status of the cluster. You can also retrieve a live diagram from the cluster. It was unhealthy this morning and it is trying to heal itself: there is one node here which does not have enough shards on it, and it is obviously doing the rebalancing right now; if you wait some hours, every node will have the same number of shards again. If you do a query, you get JSON. This is what you get if you make a query on the loklak API: you see the tweet text and the link to the tweet, you can see where the tweet comes from (Italy, which is kind of strange), then it tries to identify the language and says how good the probability for that is, and we extract the hashtags from the tweets.

This is the Kibana interface. We can do any kind of search, for example search for Singapore. I hope this is working. Now you see the number of tweets per hour, and there are also options here to count things: the account which was mentioned most is Singapore Star, and Singapore Star is also the account which tweets most; you see the other accounts as well.

This is the live frontend from loklak.net. There is a map where you can see where your followers come from, and you can zoom into the map. This is my account. Because we have the information about every account, which followers you have and which accounts you follow, we also have the information about their locations, and we can create this kind of map. We can also create a live report here; it is formatted differently because it doesn't fit, and I can see where my followers come from and who my most influential followers are. So there are a lot of opportunities to analyze the data.

If you want to start with this, go to loklak.org. We have a really rich API, and it is explained in detail what it does and what kind of access rights you have. We have different classes of rights: green is open without any restrictions, red is restricted to localhost, and yellow is open for everyone, but localhost access gets more data. You can of course set it up and run it locally, so you can collect the data yourself.

There is also a nice apps section. We made loklak applications, where everybody can provide their own application. For example, there is this query browser. The query browser just uses the API for the queries, which is also used for suggestions, and it shows how many queries have been made. It is slow because of the JavaScript on the front end, not the back end. The query which was used most is "live", for some reason, and it is re-queried automatically, so the index fills itself up all the time; you can see how many times this has already happened, when the next retrieval point is, and so on. You can see all the queries made here, together with statistics on how often they were made. The apps section is open for contributions from everybody; it is very easy to make one, and there is a primer which explains how to do it. We would be happy if you contribute such front ends, which do evaluations on the data using the loklak API; even something as small as the sketch below would be a start.
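As a sketch of what such a small evaluation could look like, in the spirit of the FOSSASIA histogram at the beginning, here are a few lines that query the API for a term and count which hashtags appear most often in the results. As above, the endpoint path and the "statuses"/"hashtags" field names are assumptions; the API documentation has the exact schema.

```python
import requests
from collections import Counter

# Tiny evaluation sketch: which hashtags appear most often in tweets about a term?
# Endpoint path and field names are assumptions; check the loklak.org API docs.
def top_hashtags(term, limit=10):
    resp = requests.get(
        "https://loklak.org/api/search.json",
        params={"q": term, "count": 100},
        timeout=30,
    )
    counter = Counter()
    for status in resp.json().get("statuses", []):
        for tag in status.get("hashtags", []):
            counter[tag.lower()] += 1
    return counter.most_common(limit)

print(top_hashtags("fossasia"))
```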
So thank you for listening. Do you have any questions so far? Any questions for Michael? I don't have a question of my own.

How much does the server cost per month? How much do you pay, Amazon for example, for your Elasticsearch hosting?

This is a server sponsored by IBM. It has a price, but I don't want to talk about the price. Regarding the location of the server: for this event I rented the server in Singapore, so the access is best here. But if you want to use this on your own, it's easy to set up your own server. You can run this on your home computer and collect up to about 100 million tweets. That is possible, although scaling that high with a single Elasticsearch node is not recommended if you are in a multi-user environment. But if you run this only for yourself, in a single-user environment, and you can accept that a search may take a few seconds instead of being instant, then you can do this on your own computer.

Is there a way of channeling the tweets you collect, channeling them to a particular geographic area? From the scraper, getting the content together, it seems that it is just a general collection.

You mean to collect only from scrapers in a specific location, or to collect only tweets from a specific location?

Both, but probably the first more than the second.

The peer-to-peer topology is a tree, and you get a delivery of tweets from every peer which has assigned your peer as its back end. So if anybody assigns your peer as a back end, you get their tweets. Essentially, you must tell friends to configure their loklak server to send its tweets to you. The default configuration is that it sends the tweets to loklak.org, but you can change this and set up your own private network. So it is not something you can select; you must tell the right friends to send it to you.

OK. More questions? Yes?

How do you monetize this?

I believe there is a concept, but I am not really sure if we should do it and whether I should talk about it. So, really unsure. But maybe, yes.

Thank you for attending. I guess now we have the next talk coming up. Thank you.