Hi everyone, and thanks for coming. There are two things. The first one is that Amleto here is going to explain an app for your mobile. It's in beta phase, and he will explain it better than me. He asked us if we could use it and try it, because they are starting this initiative, and they work with different tabs. And we thought that it's good to support local initiatives and that it might be a good thing, so we are going to try it for some time, and I think it's a good app. If you feel like it, we will send information on how to download and install it and so on. And for any comment, you can send him an email saying, oh, this is not working, or whatever. But OK, that's the thing we're going to try. So first, Amleto is going to make a small introduction about the app, and then Carlos Castillo is going to make a presentation about crisis situations using big data. He will introduce himself better than me, but he's the director of data mining at Eurecat, which is a research center here in Catalonia, and he's done many interesting things. So I hope you enjoy the talk. Thank you. Thank you, Aleix. We appreciate what you're doing for us. So I'll talk a little bit about Flutter. It's going to be quick. But how many of you have been to Meetups before? How many of you have met more than five people at a Meetup? How many of you have actually left a Meetup thinking that they missed out on meeting great people and didn't know who they were? OK, good. That's Flutter. We're building a company. It's a platform, not just an app. What we are trying to do is allow communities to come together. I go to quite a few professionally based Meetups, things about business, and many times I leave a conference too and think, OK, I'm probably missing a lot of the interesting people I could meet.
I don't know who they are. I just generally talk to the people sitting next to me or drinking a beer next to me, and I think it's a shame. The other thing is that between Meetups there's a lot of knowledge that could be shared, but that's impossible if you don't know everybody in the Meetup, and, for example, the Meetup forum or the comments don't work for that. So what we are trying to do is build a platform where professionals that share the same passions can interact and also meet each other. And then the idea is to expand that and allow different communities in different places that share the same sort of topic to meet each other, virtually and hopefully physically. So this is the beta version, although you'll see it works extremely well. Basically, there are a couple of functions. One is members' geolocation. So, for example, if you download the app, you can see everybody that is here, see their profiles, and connect with them. The idea is that this way you can meet people that might be useful for you to work with, or that might be interesting to meet and have a discussion with. Then, at point number two, you'll find the professional communities. We now have a few professional communities in Flutter; you will find Machine Learning Barcelona, BCN, in there. And the idea is, you'll see later, that you can share things: doubts, questions, articles, everything you want. And if somebody is posting, that gives access to a clickable profile, so you can see who that person is, their LinkedIn profile, and decide if you want to meet them or not. One other feature is members' feedback. We don't like spam; we hate spam, actually.
So the idea is that if somebody is spamming and posting advertisements, you can downvote, downvote, downvote, and that will send flags to us and we will remove that person from the group. That is the idea. It's pretty harsh, but it's necessary. And the other thing is that if you see somebody posting good things, the idea is to upvote them. Then what we want to do is basically a weekly digest and a monthly digest: the best posts, the ones that get more upvotes, get sent out and shared with the community and other communities too. There are quite a few more functionalities, like chats, as you can see in there, but the most important ones are these. The first one is Around Me: there's a functionality to see whoever is around you and decide what you want to do. Right now in the app we have about 100 people, and there are some very interesting profiles. The first one you know: it's Aleix. The second one is a data scientist that we met, and she has an amazing background, so once you have the app open, I suggest you go in and look for her. And the third one is a guy that built a company in San Francisco, sold it to Apple, worked for Apple for two years, and now he's here in Barcelona; he has an extreme wealth of knowledge about everything that is going on in the Valley, if you're interested. The second part that is interesting is My Feed. For example, there I shared an article that I found extremely interesting about deep learning. There's a lot of debate on deep learning and AI, but the idea is that you can share anything. If you want to share an article, ask questions; the idea is for the community to learn. Our objective is for people to grow in their passions and topics by interacting with other professionals. That's the whole objective of Flutter. So, I mean, a few things. There's the direct communication that you generally miss between Meetups.
There's the continuous interaction that you can have with other people that participate in this community. You can establish new connections, and then the idea is to grow the community, because if it becomes a kind of ongoing thing, more interesting people will come in because they know of this great community, and it will become more interesting for all of you too. So, as I said, grow the sense of community. The whole objective is to create quality interactions; that's what we're trying to do, and there's no cost, it's free for everybody. We are not looking for money at this point at all. So, I mean, this is Flutter. If you want to download it, I posted two things in the comments of the Meetup. One is the Wi-Fi details, so you can connect to the Wi-Fi and get a good connection, and then you need to go to flutter.ine, and basically you put in your email address, we send you the invitation, and you can download it. Okay, thank you. Another thing: we are not a very large crowd, so don't be shy, and everything that you want to ask, please ask Carlos. It's quite informal, so don't be shy and ask. No, it's okay. Okay, good afternoon. So yes, as Alex said, we can make this more conversational, so interrupt me. I have a ton of slides, and it doesn't matter if we see only a few of them and then discuss more and learn more. This talk is not about machine learning; it's about applied computing, although there are some elements from machine learning, as you will see. It's more about how you can apply some very simple techniques to a very interesting problem. So, I work on crisis data and disaster data, so I always look for the emergency exits. This room has four: two in the back, two in the front, and one bicycle here. So if there is a fire, you can jump over the bicycle and then exit through the emergency exits. It's going to be fun.
I study information retrieval and I work on a lot of things that are not really of much interest to you right now, but what's really interesting is that most of this work I did while I was in Qatar. Qatar is a tiny peninsula attached to a larger peninsula, the Arabian Peninsula. It's an Islamic state. It has the largest GDP per capita in the world, far exceeding that of many European countries. It is very secretive about its demographics, so that's why the question marks are there: maybe 1.5 million blue-collar workers, then about 500K white-collar workers, 300K citizens. They put a lot of importance on education and science, and this here, around mid-2015, was the social computing team there, and this was the best they had in Qatar. It's a desert, actually, so there is very little to do, and the best you can do is stand in the desert and look at the sea. This is the skyline of Doha, which is the only city there is in Qatar. It's a very interesting place, and most of this work was funded by them. Now, bringing us back to the topic of the talk: if you remember what happened in March this year, there were these explosions in the Brussels airport, and, as I mentioned in the talk at DataBeers, it's interesting to see what the police asked of the people who were in Brussels. The police asked them to use social media and not the phone, to avoid streaming when possible, and to avoid sharing real-time information about police actions. Okay, three things. Now, this being the internet, of course, people did exactly the opposite, so you could see a lot of information in real time. This is unavoidable, and this happened just a few hours after the attacks: I took these screenshots, and you can see almost 3,000 words in the Wikipedia article.
Now, if you look at the Wikipedia article for the Orlando shooting, there is a lot of information there: lots of references, many YouTube videos, Facebook pages, images, photos, a post on Reddit with 17,000 comments, and so on. There is a lot of data here. Now, the talk is about disasters. I'm going to tell you more about the domain; I think it's interesting to look a bit more into the domain, and then you can think of the ways in which maybe you can work and contribute to it, through social media, computing, and so on. On a lighter note, the guys at Twitter made this video some time ago. Maybe you have watched it, but it's about what you can do with social media. Now, if you try to do the math on this: can I actually receive a tweet before I get the shock waves from an earthquake? The answer is more or less yes. You have some speed for the seismic waves and some speed for the tweets, and one is much faster than the other, so after about 100 kilometers you can have messages about an earthquake overtaking the actual earthquake. Now, I started to work on this topic in January 2010. I was in Barcelona at the time and there was this big earthquake in Chile. This is Concepción, in the south of Chile, and I called some of my colleagues at the University of Chile, my alma mater, and asked them if they were okay and if their houses were okay, and I also asked them if they wanted to do a paper out of what happened. And they said that something that was very interesting to them was whether social media, and Twitter in particular, contributed to the chaos that followed the earthquake in some areas, to the feelings of unsafety of the people, and so on. So there were many research questions there, and we eventually ended up working on some of them, as I will show you in the rest of the presentation.
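The race between tweets and seismic waves mentioned above can be sketched as back-of-the-envelope arithmetic. The speeds and latency below are illustrative assumptions (roughly realistic orders of magnitude), not figures from the talk:

```python
# Back-of-the-envelope: can a tweet outrun an earthquake's shaking?
# Assumed figures, for illustration only:
#   - damaging seismic shear waves travel at roughly 3.5 km/s
#   - a tweet takes a few seconds end to end (notice, post, deliver)

S_WAVE_SPEED_KM_S = 3.5   # assumed seismic wave speed
TWEET_LATENCY_S = 10.0    # assumed end-to-end tweet delay

def wave_arrival_s(distance_km: float) -> float:
    """Seconds until the shaking reaches a point `distance_km` away."""
    return distance_km / S_WAVE_SPEED_KM_S

def tweet_wins(distance_km: float) -> bool:
    """True if a tweet from the epicenter arrives before the shaking."""
    return TWEET_LATENCY_S < wave_arrival_s(distance_km)

# Break-even distance: beyond this, the tweet overtakes the earthquake.
break_even_km = TWEET_LATENCY_S * S_WAVE_SPEED_KM_S
print(break_even_km)      # 35.0 km under these assumptions
print(tweet_wins(100.0))  # True: at 100 km the waves need ~28.6 s
```

Under these assumptions the crossover is a few tens of kilometers, which is consistent with the "after 100 kilometers" remark in the talk.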
This research belongs to an area that you may or may not have been in contact with, which is humanitarian computing: all the applications of computing to humanitarian issues, like analyzing and managing a crisis, seeing how mobile phones can be useful, crowdsourcing, human-computer interaction, NLP, et cetera. So no matter what field you are coming from, maybe there is something for you to contribute to these areas. And of all the topics in humanitarian computing, social media studies is the part on which this talk is mostly focused. So we have here all the ingredients for doing interesting applied research. The problem is of global significance, because there are disasters everywhere. In Europe the most damaging disasters are floods; in other places they are typhoons, hurricanes, earthquakes, droughts, and so on. The current way in which people make sense of all these thousands of comments, tweets, and videos is mostly manual. There are attempts to make sense of what the public does during a crisis, but they are mostly based on manually monitoring Twitter feeds, Facebook pages, and so on. So we can try to do it automatically: if you provide a better solution, you have a public good. We have data sets; I will point you to some. And there are also volunteer communities. And in all that, there is a tiny detail that remains to be tested, which is whether what we can provide is relevant to practitioners. Of course it's not a tiny detail; it's a lingering question over many of these things. In some cases I would claim that we have actually made a difference; in other cases I would say it's not clear that we are providing anything better. Now, this whole research program of trying to understand what the public says during a disaster in social media is not motivated by what is the best way of facing a disaster.
It's not that we have emergency responders or humanitarian agencies telling us, okay, what do you think is the best way in which we can improve disaster response, and we said, hey, Facebook and Twitter are your best allies. I don't think that's the case. This research starts from the other side. It starts from: okay, there is a certain phenomenon, there are these thousands of messages, photos, videos. What can we do with them? There is certainly something we could do with them. Then you can say, well, it's very noisy, it comes from the internet, I don't know if it's true or false; but this is the life we live, and we navigate these questions all the time. It's mostly about whether this object, this activity, can be mined in real time for something useful, and this is the question that we are trying to answer. And there are many collaborators that I'm very thankful for, people at QCRI, EPFL, and Microsoft. Now, the point of this talk is disasters, so I'm going to define them. Disasters are disruptions of routines: a disaster is anything that changes the routines of the people in a city, and hence it's socially defined. What is a disaster in one place is not a disaster somewhere else. In Beirut, there are three hours of electricity per day, and it is not a disaster that you have 21 hours without it. In Barcelona, 21 minutes without electricity would be a significant disruption of routines and hence could be called a disaster. Something interesting for trying to unpack what this data one can work with is, is to understand that the people who study disasters tend to classify them according to several categories: whether it is a sudden-onset disaster or something that progresses slowly, whether it is localized in space, whether it is natural or induced by humans. And there is, of course, the observation that a lot of people make, that there are no natural disasters.
I mean, the river may have flooded, but you chose to build a city next to the river, so it's not really entirely natural causes. Here are some examples. Most of what we know about disasters is heavily modulated by what we learn from movies, because we are exposed to only a few disasters during our lifetimes, they are not really very generic cases, and we don't have enough data points to generalize well. So, and this is what the sociologists of disasters tell us, we tend to reason about emergency situations and disasters based on what we see in movies. It's very influential. So, have you seen the movie on the left here? Anyone? You have seen it? The entire thing? Is it good? No, it's not. The movie is called Sharknado, and it's about, as the name suggests, a tornado that grabs sharks from the sea and throws them over a city. Is that right? I have seen only the good parts of the entire thing. Now, the right side is a real disaster. These are floods in Brazil. Of course, they look very different, but I'm perhaps picking a very particular data point. This is more interesting. The left frame is something that no movie director would ever film, most likely; I have never seen a scene like that in a movie. You have an airplane in flames, and people are walking very matter-of-factly with their bags. There was a lot of criticism against these people, also because the instructions say: leave your luggage behind and abandon the airplane. And everybody is carrying their backpack and nobody is running or anything. The right frame is from a movie, I don't know which one, but it looks more like a disaster. You would say this is a more faithful portrayal of what a disaster is, right? And I would claim no; it's because we are used to seeing disasters in a certain way. This is relevant for what we can expect: in real life, it's true that some people panic, but a lot of people don't panic.
They gather information from sources that are familiar to them: the people around them, mass media, social media, phones, whatever they can grab in a disaster situation. And they have to decide what to do: whether to flee, to stay, to gather water, what to do in this situation. And a lot of people improvise rescue operations on the spot. So this mindset in which you say, well, the public is actually actively trying to get themselves out of trouble or to solve the disaster, is a very peculiar mindset that is not the mindset of many emergency management organizations, governments, and so on. So as a researcher you first have to say: okay, I'm going to try to free myself from this view from the movies of people running in circles, panicking. I'm going to say: well, there are some people who panic, but there are a lot of people who are actually very capable of getting themselves out of trouble, and the way in which they do it involves the usage of communication technologies, information technologies, and social media. These are some messages handpicked by researchers from the crisis informatics area who have studied this problem. They are handpicked because the majority of what you read in social media, much like the majority of books, the majority of music, and the majority of the internet, is crap: there is very little real information out there, there is repetition, and so on. But there are these very peculiar messages that are very informative, because they contain something that you cannot get from other sources. Maybe you do not know exactly the height of the water at some place. Maybe this is one of the messages that broke the news when there were these attacks in Norway; the first news about this disaster was actually posted on social media.
So then the question is: can we, as this thing happens, very quickly detect that something is going on? Now, the scale of this thing is large but manageable. It's large for humans, but it's not large from, let's say, a big data perspective. The day Pope Francis was elected, the peak rate on Twitter was 2,000 messages per second. Now, a tweet is like four kilobytes, so it's not really a ton of information; it's a small quantity of information. What happens is that it's coming very fast, and in those 2,000 messages per second about the Pope being elected, maybe there is some interesting information. Maybe somebody started shooting somewhere. Maybe somebody planted a bomb somewhere. Maybe there is a reaction that is very important that one needs to handpick from what's going on there. The same in typical disaster situations: you observe hundreds, 500, 1,000 tweets per second, and much more in videos, Instagram photos, and many other sources. When I was working in the crisis informatics team, we were trying to collaborate with the UN and with other agencies, trying to answer two sets of requirements that are very different, and that maybe have also appeared in applications for data mining that you have built using citizen data. The second point is the easiest; this is what we tend to focus on when we build these applications, not only for disasters but in general when we do mining of public, citizen information. There is a lot of focus on this second part, which is: okay, I have all these reports from a city, all this information. I want to find the two or three best restaurants. I want to find two or three places where people need water in this catastrophe. I want to find where the shooting is taking place, and so on.
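The "large but manageable" scale argument above is easy to make concrete with arithmetic on the two figures just mentioned (2,000 tweets per second, roughly four kilobytes per tweet):

```python
# Rough size of the peak Twitter stream mentioned above
# (assumed figures: 2,000 tweets/s, ~4 KB per tweet including metadata).

TWEETS_PER_SECOND = 2_000
BYTES_PER_TWEET = 4 * 1024

bytes_per_second = TWEETS_PER_SECOND * BYTES_PER_TWEET
mb_per_second = bytes_per_second / (1024 * 1024)
gb_per_hour = bytes_per_second * 3600 / (1024 ** 3)

print(f"{mb_per_second:.1f} MB/s")   # 7.8 MB/s: trivial for machines
print(f"{gb_per_hour:.1f} GB/hour")  # ~27.5 GB/hour: hopeless to read manually
```

A few megabytes per second is nothing from a systems point of view, which is exactly the speaker's point: the difficulty is not volume, it is picking out the few interesting messages in time.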
So I want these actionable insights, and there are some groups of people, for instance the police, emergency services, medical services, that are deeply interested in this kind of requirement. They want these actionable insights, and your job is to take this data stream and select them. But there is another part that is much less explored from a data perspective, which is trying to understand the big picture. There are agencies that ask us in a very direct manner: okay, five minutes after the typhoon passed through the capital of a country, can you tell us more or less how many people were injured, or how many houses will need to be rebuilt, or what is the amount of damage that the typhoon caused, based on these preliminary observations? The answer is no. I mean, not yet. But then the question is: what is the right mechanism to do something like that? In the case of other types of applications: could you tell, from all the social media activity around Barcelona, whether employment is going to recover next year or not, or whether house prices are going up or down? All these kinds of big-picture questions have received much less attention than the questions related to actionable data points. Of course, you can build the big picture from the actionable data points, but perhaps there is some way of modeling it directly. Okay, I'm going to pause. I haven't given you time for questions, but are there any questions so far? Comments, anything? Okay, I will move over there. So, coming back to the disaster scenario: you have this stream of messages about the disaster and you want to do something with them. And the typical reflex of a data miner is, of course, to classify things.
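Classifying messages starts with turning raw text into features a model can consume. As a minimal sketch of how little machinery this needs, here is a bag-of-words representation using only the Python standard library (the tokenization rules are illustrative assumptions, not the pipeline from the talk):

```python
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase and split into word-like tokens; keeps #hashtags and @mentions."""
    return re.findall(r"[#@]?\w+", text.lower())

def bag_of_words(text: str) -> Counter:
    """Term-frequency vector as a sparse mapping: token -> count."""
    return Counter(tokenize(text))

tweet = "Flooding on Main St, bridge CLOSED, avoid the area! #floods"
print(bag_of_words(tweet))
```

Feature vectors like these are what a simple linear classifier would be trained on; nothing about the text itself is especially scary.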
Now, for classifying messages, you need to be able to do some text mining. So, how many of you work in NLP, natural language processing? Anyone? The rest, a bit, with text? One. Great. I would say that most people who work in machine learning are not afraid of text, but people who work in computing in general are afraid of text. People tend to work only with data that is readily available: people tend to be very comfortable working with a data table, even with nested columns, no problem; some structure, time series, people are very comfortable working with that. People are deeply uncomfortable working with text, particularly computer scientists and computing engineers. They say, well, I'm not an NLP person, or, I don't know how to process text. Maybe not you, but I've seen a lot of people react this way when you tell them: okay, let's try to understand what people are saying in this city, what people are saying in this disaster. And the truth is that you don't need to know much to work with text. There are many tools, many well-established methods, mature software libraries; there are libraries for doing almost anything: dependency parsers, entity extraction, entity linking. Lots of tools are readily available in case you want to understand text. Which brings me to the Chinese Room analogy. Do you know it? Have you heard about this thing, the Chinese Room? Okay, that's enough for me; I'm going to skip it, but if you want some interesting reading, look it up. It's really interesting, and you get a better sense of why I'm telling you that you shouldn't be afraid of text. Other times, you do need to be afraid of text. So, for instance, take this kind of message posted in social media. Try to parse it with your 11,000 million neurons and see what comes out of it.
So, to understand this message: this is probably beyond the reach of a computing algorithm, and beyond the reach of most people unless you're familiar with Indian politics, because there is a lot of context information here. There is China NC, a member of the parliament. The RSS is not the party of China NC, but it is a party that has close ties to the party of China NC. And the photo on the left, which supposedly shows a party from Malaysia going to help in Nepal after the earthquake, is actually an old photo, from two or three years ago. So in order to understand this, you need a lot of context; it cannot be done in isolation. So understanding social media, trying to extract meaning from this content, is very difficult. You are basically looking at a conversation. There are people who use fragmented language that is not grammatically correct, full of typos, abbreviations, and lots of other things. And it's also conversational, so people answer each other, and if you don't know the history of the conversation, it's difficult. So understanding this in general is not so easy. For instance, this was after the attacks in Belgium. Both messages show sympathy in very different ways, but they are both ways of sympathizing with Belgium, with the Belgian people. Slang is not easy to manage. And then, typically, you start classifying these things. So let me show you a more domain-specific example of what we did, classifying these messages from social media. Each of these colored bars represents a disaster; these are data samples from Twitter. The leftmost one is an earthquake in Costa Rica in 2012; then floods in Manila in 2013, Typhoon Pablo, floods in Sardinia, an earthquake in Guatemala. The next to last is the derailing of that train in Galicia in 2013.
You remember, this happened in Galicia: a high-speed train was going even faster than it should have and it derailed. So the colors are types of messages. Something that we observed very early in this research, and that sociologists of disasters have also said, is that disasters are very different but have very common elements, and people tend to worry about more or less the same things. So they start first with things such as: what is the main problem, trying to advise people on what to do, trying to warn them about the danger. Affected individuals is the yellow part; the blue part is infrastructure, things that are damaged; green is donations; pink is sympathy; and the last part is useful information. This is from Twitter, and it's done with a combination of crowdsourcing and automatic classification. For the crowdsourcing, we used a platform similar to Amazon Mechanical Turk, called CrowdFlower. We used it to collect labels; with those labels we trained a classifier, and when we were confident, we started labeling data automatically. It costs money, yes; doing this study cost some money. But it doesn't cost you anything, because the data are free now. There is this website, crisislex.org, with nine or ten data sets by now; we have been collecting data sets about disasters. They are good for machine learning exercises; if you are into teaching, they are also interesting for showing something; and they're good for small demos on some useful topic. Right. So, it's interesting: what can we do? Suppose you take a disaster and you say, well, in this disaster there are lots of messages about donations and very few about infrastructure damage.
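The crowdsourcing step described above, collecting several judgments per tweet and keeping only the ones annotators agree on before training a classifier, might look like this minimal sketch (the category names and the 0.7 agreement threshold are illustrative assumptions, not the actual settings used):

```python
from collections import Counter

def aggregate_labels(judgments: dict[str, list[str]], min_agreement: float = 0.7):
    """Majority-vote each tweet's crowd judgments; keep a tweet as
    training data only if enough annotators agreed on the winner."""
    training_set = {}
    for tweet_id, labels in judgments.items():
        winner, votes = Counter(labels).most_common(1)[0]
        if votes / len(labels) >= min_agreement:
            training_set[tweet_id] = winner
    return training_set

crowd = {
    "t1": ["donations", "donations", "donations"],      # unanimous -> kept
    "t2": ["sympathy", "infrastructure", "donations"],  # no consensus -> dropped
    "t3": ["sympathy", "sympathy", "infrastructure"],   # 2/3 < 0.7 -> dropped
}
print(aggregate_labels(crowd))   # {'t1': 'donations'}
```

The tweets that survive this filter become labeled examples; a classifier trained on them can then label the rest of the stream automatically, as in the study.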
Now, we made several attempts at relating that to actual physical variables of the disaster, for instance official reports of people dead or injured, houses damaged, and so on. We never succeeded. We tried many ways of normalizing the data, of interpreting what we should count: one tweet, one person, normalized by the activity in this part of the world in the previous week, when there was no disaster, and so on. We had a data set that was very appropriate for this, a typhoon in the Philippines that affected several islands. For each island we knew the amount of infrastructure damage, the number of injured people, and so on, and we had the Twitter data. There is some correlation, of course, but it's not really obvious how these variables respond to the actual magnitude of the disaster. We have the conjecture that this is concave. If nothing happens, then nobody posts tweets about the disaster. If something mild happens, people start getting more excited and write more about it. Now, if the thing is really serious, then they start having less time or less inclination to post about it in social media; in the extreme, if a meteorite wipes out the Earth, there will be zero messages. And maybe there is a sweet spot where the disaster is really important but leaves you enough time and inclination to post some information. So it's a weird relation; this is why this is difficult. No answer yet. We also observed a progression of these things over time, and this was across 26 disasters. Even if you think of the Orlando shooting, well, infrastructure damage is not the most important issue there, while in other cases it is.
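The concavity conjecture above, volume first rising and then falling with severity, can be illustrated with a toy model. This exact functional form is my assumption for illustration; the talk only states the conjecture, not a formula:

```python
import math

def expected_posts(severity: float, scale: float = 1.0) -> float:
    """Toy unimodal model: posting volume grows with severity but is
    damped as severity leaves people less able to post. Peaks at `scale`."""
    return severity * math.exp(-severity / scale)

# Zero severity -> zero posts; extreme severity -> posts vanish again.
for s in [0.0, 0.5, 1.0, 2.0, 10.0]:
    print(s, round(expected_posts(s), 3))
# The maximum sits at severity == scale: the conjectured "sweet spot".
```

Any unimodal curve would do here; the point is only that both a non-event and a catastrophe that wipes out the posters produce near-zero volume, with a peak somewhere in between.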
But there is this progression, where first people say: look out, there is a shooting, people should leave the Pulse nightclub. Then people start expressing their sympathy and support, then describing who was killed and who was injured, then giving other, more specific information, and then come the donations and the volunteering, sometimes in a timeframe of days, sometimes shorter. Of course, every disaster is different, but there are some commonalities. You can extract information from these tweets: what is the important part of the message, what is the time, who is the person posting these messages. There are many things one can do, like tracking hashtags and URLs, and this is an example of information extraction here. These are the same, yes. Yeah, it's true. We typically created separate collections per language, because all of these things are language-specific. In some cases, when you need a quick and dirty solution, you can build a classifier that works across languages, but in many cases there is very little data to bootstrap a classifier. If there is something happening today in Barcelona, a bombing, will we have a classifier for Catalan and Spanish and English that can classify that information? Maybe not; we would need to bootstrap it with some training data from before, or take a classifier and try to transfer it to the new setting. Maybe you need to translate. It's complicated. So in general we were very pragmatic and said: okay, different collections per language; but maybe one can do something better. It's a good point. And the same applies to information extraction; classification is not the only thing one can do.
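Tracking hashtags and URLs, as mentioned above, can be done with simple patterns. A minimal sketch (the regexes are illustrative and do not cover every edge case of real tweets):

```python
import re

def extract_entities(tweet: str) -> dict[str, list[str]]:
    """Pull hashtags, @mentions, and URLs out of a tweet's text."""
    return {
        "hashtags": re.findall(r"#\w+", tweet),
        "mentions": re.findall(r"@\w+", tweet),
        "urls": re.findall(r"https?://\S+", tweet),
    }

msg = "Shelter open at city hall, info via @RedCross http://example.org/map #flood"
print(extract_entities(msg))
```

These surface features are the cheapest form of information extraction; the heavier machinery (taggers, extractors) comes later in the talk.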
Another thing we did was to try to match people offering something with people needing something. This was done independently by other groups, also in Japan, and they came up with a solution very similar to ours. You basically try to locate the important part of the tweet. The tweet is 140 characters, but it has a part that is more important than the rest — it's about the donation of blood, it's about the closing of the bridge. So you try to identify the central part which is the problem, then identify the central part which is the solution in some other message, and then try to match them. If you match on the full text alone, precision is very small; if you first identify the important parts, you can do better. Now, the problem with the matching is that very few messages involve quantities. Suppose I say there is a shelter open on this street — think of a hurricane — and people ask, where can I find a shelter? And you say, well, there is a shelter there. How many people can that shelter accommodate? I don't know. People don't post this information. The same with food donations and so on: you don't know if the place is saturated or not. So it's tricky, yes. We didn't get into much complexity beyond saying, okay, this message is the solution to this problem. This is the evaluation, and this is the 21%. In 21% of the cases — one in five — we were able to find a solution to the problem, which means you would have to scan five messages to find a solution, instead of 300 per minute or whatever it is. Here we use something similar to what part-of-speech taggers do.
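To make the matching step concrete, here is a toy sketch — not the authors' actual system — that pairs "need" messages with "offer" messages using only the extracted central phrases, scored by token overlap (Jaccard similarity). The phrases and the extraction step itself are assumed to come from upstream; all names here are illustrative.

```python
# Toy sketch: match "need" phrases to "offer" phrases by token overlap.
# Phrase extraction is assumed done upstream (e.g., by a sequence labeler).

def tokens(phrase):
    return set(phrase.lower().split())

def match_needs_to_offers(needs, offers):
    """For each need phrase, return the best-overlapping offer phrase."""
    matches = {}
    for need in needs:
        best, best_score = None, 0.0
        for offer in offers:
            a, b = tokens(need), tokens(offer)
            score = len(a & b) / len(a | b)  # Jaccard similarity
            if score > best_score:
                best, best_score = offer, score
        matches[need] = best
    return matches

needs = ["need blood donations type O"]
offers = ["blood donations accepted at central hospital",
          "bridge closed due to flooding"]
print(match_needs_to_offers(needs, offers))
```

In practice the talk notes that text overlap alone gives poor precision; this is only the skeleton of the idea, and a real matcher would add location, time, and quantity signals where available.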
They use structured learning models where the labels are a binary vector: the vector has a zero wherever a token is not part of the important part, and a one wherever it is. It's annotated by experts — you have an expert who annotated this vector this way. And now the features are a matrix: every position in the text has a series of features. One feature can be the word itself. Another feature can say this is capitalized, or completely capitalized. Another can say this is a word of length two, and so on. So for every example, instead of a feature vector, you have a feature matrix that has, for every position in the text, several characteristics of the word at that position. And your labels, instead of being a scalar, are a vector. Your learning method maps from the matrix of features to this vector of labels. That is how you train this thing. The particular learning scheme is a conditional random field, which is a model that is like a chain: every label depends on the labels to its right and left, so it represents the dependencies in that way. Now, this is the same way a part-of-speech tagger annotates that a word is the verb in a phrase. The same method that recognizes a verb is the method we use to recognize the important part of a phrase. Actually, we just changed the training data; we didn't even implement anything. As for language — yes, it's a good question. For instance, this will reflect a typical case: Joplin was a tornado.
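The feature matrix described above can be sketched as follows. This is a minimal illustration of the kind of token-level features fed to a sequence labeler such as a CRF — the specific feature names and the example tweet are hypothetical, not taken from the actual system.

```python
# Illustrative token-level features for sequence labeling: each position
# in the tweet gets a feature dict, so the whole tweet becomes a feature
# matrix; the label sequence would be a binary vector marking the
# "important part", annotated by experts.

def token_features(words, i):
    w = words[i]
    return {
        "word": w.lower(),
        "is_capitalized": w[:1].isupper(),
        "is_all_caps": w.isupper(),
        "length": len(w),
        "prev_word": words[i - 1].lower() if i > 0 else "<BOS>",
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "<EOS>",
    }

def featurize(tweet):
    words = tweet.split()
    return [token_features(words, i) for i in range(len(words))]

feats = featurize("URGENT blood donations needed at City Hospital")
print(feats[0]["is_all_caps"], feats[1]["word"])  # True blood
```

A CRF library (e.g., CRFsuite-style tools) would consume exactly this shape: a list of feature dicts per tweet, paired with a list of 0/1 labels of the same length.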
In Missouri, I think, in 2011, and Sandy was the hurricane in 2012. If you train on data annotated for the Joplin tornado and test on data annotated for Hurricane Sandy, you get a recall of 11% and a precision of 78%. If you train on Joplin plus a bit of Sandy and test on Sandy, your numbers increase. So this is sensitive to whether your training set has information that is current to the present disaster, instead of just historical information. Of course, this is the crappiest domain adaptation method you can use: instead of changing the model, we just add a bit of training examples from the second setting. This is all practical, right? In practice, we had a stacked classifier. The first level is: is this tweet relevant for the disaster or not? That is the most generalizable model we found, because across the 26 disasters or however much training data we had, there is a fairly consistent notion of an informative tweet. It is a tweet that is long, contains places and locations, is well capitalized, and doesn't end with a smiley — it has a certain shape of being well written. So you start with that, you filter out a lot of things, and then you start digging in: okay, what category is this? This is a tweet describing a problem — which type of problem? Injured people, or damaged infrastructure and utilities? Then you go to that branch, and in that branch you start identifying the root of the problem. It's all practical. And there are things that are irrelevant — there is a lot of irrelevant information. Tomorrow there could be an alien invasion and people would still be tweeting about lots of other things. It doesn't go away with a disaster. It's very weird: you have these heavy disasters and people keep tweeting about insane things. Okay, so it's 8 p.m.
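The cascade idea above can be sketched like this. The rules are toy stand-ins — in the real system both levels were trained classifiers, and the category names here are only loosely based on the ones mentioned in the talk.

```python
# Minimal sketch of a two-level stacked/cascade classifier: a first-level
# relevance filter (toy hand-written rules standing in for the trained
# "informative tweet" model), then a categorizer applied only to tweets
# that pass the filter.

def looks_informative(tweet):
    """First level: long, starts capitalized, no trailing smiley."""
    words = tweet.split()
    return (len(words) >= 5
            and tweet[:1].isupper()
            and not tweet.rstrip().endswith((":)", ":(", ";)")))

def categorize(tweet):
    """Second level: keyword rules standing in for per-category models."""
    text = tweet.lower()
    if any(k in text for k in ("injured", "dead", "casualt")):
        return "injured_or_dead_people"
    if any(k in text for k in ("bridge", "road", "power", "building")):
        return "infrastructure_and_utilities"
    return "other_useful_information"

def classify(tweet):
    if not looks_informative(tweet):
        return "not_relevant"
    return categorize(tweet)

print(classify("Bridge on Main Street collapsed, several people injured"))
print(classify("lol what a day :)"))
```

The design point is the same as in the talk: the first level generalizes across disasters, so it can be reused, while the second level is the part that benefits from event-specific training data.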
I'm going to go until 8:20 maybe — is that okay? 8:15, and then. So, disasters have a very important geographical component, and this is a visualization that was done by the people at Facebook. I showed it at Data Beers, but now I can discuss it in a bit more detail. These are the people who used Safety Check, and these are their friends. Safety Check is something that pops up in Facebook when you are in the disaster area. Then these people in yellow are going to appear in a moment: these are people sending donations to Nepal — money, from all over the world. From Japan, Australia, North America, Brazil, Europe — lots of money flowing into this area. Now, just as people can contribute money, there are lots of people — and this is what makes this setting really interesting — who are volunteers, willing to provide these annotations. In our research we used a mixture of paid annotations, through Mechanical Turk or CrowdFlower, plus volunteered annotations. The volunteered annotations are much larger data sets; volunteers are motivated. Digital volunteering for disasters started with the work of the people at Ushahidi, a company in Kenya that has a platform for community-powered mapping. This was picked up by Patrick Meier, one of the pioneers in this space, and he is very fond of maps. People have been producing these community-powered maps for a long time. These are more modern versions — where is Zika, where is a pond of water — and for Syria there have been maps running for a long time. These are some of the maps we produced in Qatar; this one is for Typhoon Pablo in the Philippines. And these are other types of maps — graphs, but they also map things.
Now, the maps I showed are powered by people; these are powered by algorithms. This is research by a group of people who studied floods in Germany. The top part is the water levels, and the bottom part is a kind of heat map of tweets saying something related to the floods. If you are really generous, you will see a relationship between these two things. Interestingly, here you can also see the biases. This city is Magdeburg, I think, and it is the closest city to the river that was flooding. The physical phenomenon is: you have the river Elbe, which follows that trajectory through a part of Germany that is not as inhabited as the rest — there are no huge cities on the Elbe except for Magdeburg, which is this city. So in a sense the people bias the reports towards themselves: you see the flood where there are people. If there are no people, you don't see much of the flood. Something similar was done for dengue. At the top are official reports from hospitals, which arrive with a delay of weeks, and at the bottom you have tweets. Of course there are also biases here: if you take a map of Brazil in terms of population density, you could produce a similar map and make us believe it's about dengue. So I find these very preliminary results, because they're still very heavily biased. This is a map of earthquakes in Italy. Earthquakes are a very interesting case, because you might believe that, given that there are sensors, you can tell fairly well what the magnitude of an earthquake is in a given place. But the damage an earthquake does does not depend entirely on the waves. It depends on hyper-local characteristics.
It depends on the quality of the ground, on the quality of the foundations of the building, on the year of construction — on so many hyper-local things that even with a very dense network of sensors you don't know exactly where the houses are that you need to inspect, or where there will be damage. But people know. People can see cracks on the wall and so on, and report them. And that is what this group of Italians did. I worked for a few years on a tool called AIDR. AIDR is a mixture of machine learning and crowdsourcing. It's actually two platforms that communicate with each other. On one hand, you have a machine learning core, which is nothing more than Weka wrapped in several layers of things so that it can auto-configure and be used by a non-expert. A person who is interested in the disaster goes there, enters some keywords, and starts tracking some tweets about the disaster. At the same time, we instantiate a micro-mapping task on the other platform, which communicates with the machine learning platform. Examples that need to be labeled are sent to the crowdsourcing platform; from the crowdsourcing platform we receive labeled examples, and we keep iterating behind the scenes until we're satisfied with the quality of the classification. What's interesting is that this is very easy to use. You don't need to select the algorithms or the models or anything. You just create your collection, we create a crowdsourcing task in the background, you invite volunteers to it, and this kind of self-sustains. We do have a simple version of active learning: whenever we classify a tweet with low confidence, it is prioritized for sending to the crowdsourcing platform.
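That active-learning step — prioritizing low-confidence tweets for the crowd — is simple enough to sketch directly. This is a generic uncertainty-sampling illustration, not AIDR's actual code; the tweets and scores are made up.

```python
# Sketch of uncertainty sampling: given classifier probabilities, send to
# the crowd the tweets closest to the 0.5 decision boundary, i.e., the
# ones the classifier is least confident about.

def prioritize_for_crowd(scored_tweets, budget):
    """scored_tweets: list of (tweet, p), where p is the classifier's
    probability for the positive class. Returns the `budget` tweets
    closest to the decision margin."""
    by_uncertainty = sorted(scored_tweets, key=lambda t: abs(t[1] - 0.5))
    return [tweet for tweet, _ in by_uncertainty[:budget]]

scored = [("shelter open downtown", 0.51),
          ("buy cheap sunglasses", 0.02),
          ("bridge maybe closed?", 0.48),
          ("massive flooding on 5th Ave", 0.97)]
print(prioritize_for_crowd(scored, 2))
```

The confidently classified tweets (0.02 and 0.97) are skipped: crowd labels are most valuable exactly where the model is unsure, which is what keeps the label budget within what volunteers can provide.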
So basically we are choosing tweets that are close to the decision margin. The classification accuracy is often decent. You can very quickly get a separation of the messages that are about the disaster from the ones that are not, and you can set up your own categories. This kind of works. We were very careful to choose models that are maybe not perfect for one particular situation, but that generalize well across several situations. We need on the order of a few hundred messages per category to create a good classifier, and that was well within the reach of the volunteers. Typhoon Pablo was world news, and we got maybe 1,000 volunteers, some of whom produced around 1,000 labels each. It's of course very heavily skewed, but you get some people who do a lot of work. Yeah, that's a problem I'm still worried and concerned about. I wouldn't say working on, because I don't even have a good path to a solution. The biases are a big problem — but biases are a big problem in all of social media research. There are ways of overcoming bias. We gave a tutorial about a month ago on biases in social media research, and most of the answers we suggested were related to a careful design, similar to an observational study design — a natural experiment design. So you have things such as: you have these two cities in Brazil, about the same size, about the same population. One of them is reporting this amount of tweets and the other is reporting that amount. Then I have a data point that says, well, maybe this city has more dengue than the other, because their underlying characteristics are similar. And if you generalize that, you can have something like a propensity score.
You can say: according to everything I know about these cities, the probability of observing a tweet about dengue is this much in this city and this much in that one. Now I stratify the cities by this probability, and in each bucket I compare cities that are equally likely to write about dengue. So there are some designs one can try in order to remove bias. But it's tricky, because in the end you are not controlling anything. You're just observing what's going on; you cannot tell people, okay, now please go outside and see if there is a hurricane. People self-select, and you have all the possible biases. Yeah — so on the screen there is one comparison that we can do. But we did compare with, for instance, in this same case in the Philippines, the damage in each island and so on. There is some relationship, but it's not enough to draw conclusions. It's clear that people in social media start speaking about an earthquake when there is an earthquake. Beyond that — if the earthquake is stronger, how much more will they speak about it — that relationship is not so clear. Not at all, and for hurricanes it's the same. It's messy, it's complicated. What we can show is this type of graph: this is the hurricane path, those are the tweets, and the red ones are photos of things that were damaged. Most of the people who posted photos of damaged things in the Philippines were in the path of the hurricane — except for that guy at the bottom who ruined our visualization. But the rest were all in the path.
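The stratification design can be illustrated with a toy example. In a real study the propensity would be estimated from covariates (population, income, climate, and so on); here it is simply given, and all city names and numbers are made up.

```python
# Toy sketch of propensity stratification: bucket cities by their
# estimated propensity to tweet about dengue, then compare observed
# tweet counts only within a bucket, where cities are comparable.

def stratify(cities, n_buckets=4):
    """cities: list of (name, propensity, observed_tweets).
    Returns {bucket_index: [city names]}."""
    buckets = {}
    for name, propensity, _ in cities:
        b = min(int(propensity * n_buckets), n_buckets - 1)
        buckets.setdefault(b, []).append(name)
    return buckets

cities = [("A", 0.10, 40), ("B", 0.12, 400),   # similar propensity,
          ("C", 0.80, 900), ("D", 0.85, 950)]  # very different counts
print(stratify(cities))
```

Within bucket 0, city B reports ten times more dengue tweets than city A despite a similar propensity to tweet — that within-bucket contrast, not the raw counts, is the kind of data point the design yields.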
This also requires a lot of empirical work: try, observe different situations, and see what happens. Okay, let me show you the last video, then the conclusions, and then we wrap up. Maybe this needs a small introduction. I think one possible end goal is to be able to create systems that do real-time data mining using crowdsourcing — using people. A system for participatory mining of streams. Let me give you one example. This is a UAV in Vanuatu, a DJI Phantom. It's, I think, a 600-euro machine that has four propellers and can fly for 15 minutes. This is a simulation in the sense that the video was shown to a group of 50 volunteers online, who were asked to click whenever they saw infrastructure damage. What the system does is just aggregate: for instance, that guy there thought this roof was damaged, and the other 49 said no, so they didn't click on it. And this other guy said, well, this is damaged — maybe not really, not as damaged as, for instance, on the right. You see there, there is a rooftop that is missing — let's wait a moment, the drone is looking in the wrong direction, but now it's coming back. You see there is a rooftop missing, and there is another one here on the right. What's interesting here are the possibilities. This is a prerecorded video, but the possibility is: you can make a map of a crisis in the time it takes to fly the drone and land it.
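The aggregation of volunteer clicks can be sketched as simple agreement counting. This is a hypothetical illustration, not the actual system: it assumes clicks have already been mapped to grid cells of the video frame, and the agreement threshold is arbitrary.

```python
# Sketch of aggregating volunteer clicks: each volunteer clicks on grid
# cells of the frame where they see damage; a cell is flagged as damaged
# only when enough volunteers agree, filtering out lone spurious clicks.

from collections import Counter

def aggregate_clicks(clicks_per_volunteer, n_volunteers, threshold=0.5):
    """clicks_per_volunteer: list of sets of (row, col) grid cells.
    Returns the cells clicked by at least `threshold` of volunteers."""
    counts = Counter()
    for cells in clicks_per_volunteer:
        counts.update(cells)
    return {cell for cell, c in counts.items()
            if c / n_volunteers >= threshold}

clicks = [{(2, 3), (5, 1)},   # volunteer 1
          {(2, 3)},           # volunteer 2
          {(2, 3), (7, 7)}]   # volunteer 3
print(aggregate_clicks(clicks, n_volunteers=3))
```

Cell (2, 3) is flagged because all three volunteers clicked it; the single-volunteer clicks — like the lone click on the undamaged roof in the video — are filtered out.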
Now, extrapolating this to other situations, I believe there is a lot of potential to create real-time data mining systems that use a mixture of algorithms and people. Maybe you don't need a lot of people — maybe just a few operators, five or six people who sit there with you and watch whatever it is: this news, this football game, this disaster, this map of a city, these transit maps. And with the help of algorithms they help you make sense in real time of what's going on and generate an output that is valuable. That's the vision: a system for data mining that can involve people and algorithms and work in real time. Maybe you can build something like that, maybe not — who knows. So, is labeling all a crowd can do? What about validating? Selecting features? Generating hypotheses over the data, detecting biases and discrimination, suggesting an interpretation, and so on? There are lots of things humans can do, and we could use them to try to understand the data. Okay, I will not speak much about this. Let me finish with the conclusions. This research has good, bad, and ugly aspects. The good part is that it's interdisciplinary, so you learn a lot, and the experience of talking to people who are passionate about their own domain is really great, because given that you don't know anything, you learn a lot. The best part of the learning curve is at the beginning, when you don't know anything, so every time you speak with them you end up thinking, oh, this is really cool. And of course it's stuff they have been doing for 50 years or so. The bad thing is that this particular domain has a lot of difficulties, because these organizations are very old-fashioned and difficult to engage with, and the organization is risking its life in every disaster, because they are very easy to criticize.
So they are exposed to criticism, and all they want is to cover their backs. This is very important, because otherwise they don't survive — you cannot have an emergency management organization that changes every time there is an emergency. And the ugly thing is that you have two competing goals: you want to do your research, and they want 24/7 support and so on. Most of what we do in data mining is here: we have the data and we have computers, so we do it. This space is smaller, but every now and then one can have a good idea or a good partner and find something useful to do with this data. But most of what we do is here, right? There is no shame in recognizing that, as long as we understand that we are trying to make progress along this path. I wrote a book on this; it's called Big Crisis Data. If you are interested, you can search for Big Crisis Data, take a look at the table of contents, and see if you like it. If you have any comments, you can write to me. That's it. Thank you. Yeah — I will filter that as noise, for me. So, is there any comment or question? [Audience:] I understand that for different purposes you prefer one thing or another — for different purposes you analyze differently. And since there is a lot of subjectivity: in theory, the Bayesian point of view in machine learning is that you make the most of subjectivity to make good predictions. So it seems that in your case that would really make sense, because you start from human subjectivity. Is Bayesian machine learning more useful for you than in other applications? Or is it just something you sometimes do or don't use? In theory, would it be a good application for Bayesian subjectivity?
Yeah, I would agree that essentially we're trying to model what a human would say about this thing — but a human that has access to information that this particular labeler doesn't have. So I don't know. It's partially true: it's true that we were essentially concerned with reproducing faithfully what the human would say, but it's also true that the system is using information that is not available to that person. [Audience:] So your machine learning needs to find relevant tweets and categorize them — but what about the implicit structure there is in any social network? You would expect that relevant information gets more retweets, that the hashtags contain a lot of semantic information. — Yes, that's a good point; we have used some of that. For instance, we have used the propagation of the tweet to determine whether people believe what they are retweeting — whether it propagates as a credible tweet or as something people are hesitant about. But we also have some negative results. For instance, the amount of retweets something gets is not strongly related to whether it's true or false: we have very false tweets that end up being retweeted a lot, and very true things that end up being retweeted little. So it helps for some things, but it's information that needs to be used with care. It's not only the quantity — it's about the propagation tree, maybe the number of people involved, whether people change their minds, how the conversation evolves. [Audience asks about cultural differences.] Yes, there are cultural differences. We don't put that information in the models, but other people have observed heavy cultural differences in the response to these things.
For instance, there is a very sharp division. There is a study comparing Japan and Pakistan: users in Pakistan trust social media more than the government, and users in Japan trust the government more than social media. There is also linguistic variability in the contents. You might think that, in the Philippines, the tweets in English and the tweets in Tagalog would have a similar composition, but that's not the case. They are written for different audiences. The tweets in English are written for a foreign audience: they express how serious the situation is and try to inform others. The tweets in Tagalog are more interpersonal — do you know something about this local singer or football star, and so on. So even within the same culture, the things you say in one language and the things you say in another can be different. Yes — some volunteer groups tend to avoid human conflict, because it endangers their own people, and also because there is much more deception: many more people trying to make you believe what is not true. We do some of it, but the volunteer groups are hesitant to enter this space, plus you have much more disinformation — wrong information purposefully injected into the data. In a natural disaster, whenever you see somebody saying something false, most often they simply think it's true and they are wrong. They post about a tsunami warning because they think: maybe it's true, and if I don't post it, that's bad, but if I post it, maybe nothing happens. So human conflict is different — it shares some of the same techniques but has extra complications. And the things that we have — yeah, it's the same kind of people.
There is also some data protection work to do. For instance, in the case of maps of Syria, what people do is delay the map: they have a public map that is about two days old, and a private map that is current. And in the public map you don't show all the information, because there are factions in conflict and you don't want to alert one about another. There are also cases where people try to retaliate against those who are posting information. So you need to protect the data; working with people's data is complicated. How do we get the data? We typically use the public streaming API, the filtering API, but we had to talk to Twitter a few times. They have a governments-and-NGOs person — someone who interacts with governments and NGOs — and that person can authorize them to be more lenient with someone, because they believe it is not a commercial enterprise. As for queries: a bounding box, and in CrisisLex there is a vocabulary of about 380 words that are used very repetitively in English when there is a disaster. So we would use, for instance, a combination of words. There is also a CrisisLex paper on this: with a bounding box you get a lot of recall but very little precision, and with the hashtags of the event you get very high precision and very low recall. You can find a trade-off in the middle where you incorporate more words that may be relevant to the event; your precision is going to suffer, but you will have higher recall. In information retrieval this is called the information filtering problem: you have a stream of data and you have to identify which queries you are going to run against it to separate one part from another. It's a problem in itself.
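The precision/recall trade-off in that filtering step can be shown with a toy filter. The five-word list is only a stand-in for CrisisLex's real lexicon of roughly 380 terms, and the hashtag and tweets are made up.

```python
# Illustrative information filtering: hashtag-only matching is high
# precision / low recall; adding a broad crisis vocabulary raises recall
# at the cost of precision. The tiny word list stands in for the real
# CrisisLex lexicon (~380 terms).

CRISIS_WORDS = {"flood", "damage", "evacuate", "shelter", "injured"}
EVENT_HASHTAG = "#typhoonpablo"  # hypothetical event tag

def matches(tweet, use_hashtag=True, use_vocabulary=False):
    text = tweet.lower()
    if use_hashtag and EVENT_HASHTAG in text:
        return True
    if use_vocabulary and any(w in text.split() for w in CRISIS_WORDS):
        return True
    return False

stream = ["Please evacuate the coast now",
          "#TyphoonPablo has made landfall",
          "great movie tonight"]
narrow = [t for t in stream if matches(t)]                       # hashtag only
broad = [t for t in stream if matches(t, use_vocabulary=True)]   # + vocabulary
print(len(narrow), len(broad))  # 1 2
```

The narrow filter misses the evacuation tweet (lower recall); the broad one catches it, and on a real stream it would also let through off-topic tweets that merely mention a crisis word (lower precision).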
It's an interesting problem in itself. Good. Okay, thanks for coming. Great.