Hi, everyone. First of all, thanks for coming. Today I'll be talking about how you can generate actionable intelligence from open source data using AI.

A little about myself: I'm currently in the advanced analytics practice at PwC. Before that I worked with the Indian Army on various applications of AI, including actionable intelligence gathering, and I've also built a bracelet that warns you when you're about to have a stroke. We won't talk about all of that today; we'll focus on how you can use only open source data to generate intelligence, because that was most of my work.

Before I get started, I should tell you that open source data is extremely difficult to work with. There's a quote I want to share that should stick with you for every data science project: I see a lot of people ignore the data in their machine learning projects. You have to spend 90% of your effort on your data and 10% on your machine learning code. You need to extract as much information as you can from the data; without that, you're basically nowhere. Machine learning itself is easy, it's one line of code to call any model, so that's not the hard part.

Now, open source data gives horrible results. Straight off the bat, if you work with open source data, it gives horrible results. How many of you have done any online certification courses? Okay. The thing is, when you work with the data given in certification courses, you get good results. When you work with competition data, the results get worse. With real-world data, even worse. And open source data is garbage, so I made a very neat visualization so you can understand what open source results look like. As you can see: not very good. You can't even make out what it was supposed to be.

Fine, let's get started. Open source data, in a nutshell, is data that anyone and everyone can use, modify, and share; they're free to do whatever they want with it. Nothing big. But open source data is an ever-expanding source: it grows every second, every minute.

Now, can anyone tell me the number one factor that affects the performance of any machine learning or deep learning model? Quality of data, yes, but not just quality: quality and quantity. You need a lot of data. Even logistic regression will beat the top algorithms if it has more data to work with.

That brings us to the first paradigm we'll look at: large data versus relevant data. When working with open source data, say you get 100 rows that you think are relevant to your use case; chances are only 40% of them will actually be useful. Open source data is tricky in the sense that it disguises itself as being useful, so you need to really know what you're looking for and do your research to understand what's going on. You'll see this in the use cases I cover later. My main point in this talk is to walk you through a few of the things I did with the Army and elsewhere, and the challenges I faced.
Hopefully you won't be working on projects this critical, but you can apply the same ideas anywhere.

The second thing we come to is analytics versus intelligence. When I was first contacted by the Army, they told me they wanted a social media intelligence platform. I said fine, went home, did some research, and looked at a few of the top tools. I thought, this isn't that difficult; I'll take the best features from every tool, put them into one, and show it to them. Here's a sample dashboard from a social media tool that's available today. You can see it's giving you tremendous insight: audience interaction, audience impressions, how many retweets you're getting. Here's another dashboard, same idea, in a more textual format but with more detail: your interactions, how your brand is faring, the sentiment, and so on.

When I presented this to them, they said simply that it was not acceptable, that it was nowhere close to what they were looking for. Can anyone tell me why? I'll give you a clue: they said this is analytics, not intelligence. Someone said it's not actionable. Exactly, it's not actionable. Sure, you can see that you had good customer interaction on 1st January and again on 1st November. Intelligence, for this use case, would be answering why you had good customer interaction there. What were you doing differently before, when you didn't get any interaction? Is there anything you can do, anything driving your customer interaction? That is where the crucial intelligence comes in: answering the why, which many tools lack. In fact, in many data science scenarios people don't answer why either; they give you analysis, leave it at that, and you're supposed to infer the rest. That's what we mean by analytics versus intelligence, and we'll be moving towards intelligence.

We'll cover a few use cases, but before that, let's go through the open source workflow. The first two points are very, very important, not just for open source but for any data science or AI project, and I think they're heavily ignored. The first is research. It doesn't matter what domain you're working in; especially in consultancies, because people have to work across many domains, they don't do enough domain research. They just go from project to project, and that's not acceptable. In any data science project, you need to know what you're looking for and what exactly the problem statement is. The second point is more specific to open source: you don't know your problem statement when you're working with open source data. If you get a project in a company, you're usually given a problem statement: here's your data, predict risk; here's your data, predict customer interaction. With open source, you don't have one. For example, when I was working with the Indian Army, I had data from Twitter and other sources, and they told me, okay, get intelligence. "Get intelligence" is not a problem statement. You have to formulate something you can actually act on.
The other two steps I'm sure you all do well: understand and clean the data, then try various approaches and aggregate them to get results.

Now, three things to consider before we get started. First, with open source data you must be good at getting something from virtually nothing. I say virtually because there is always data, and there are always patterns in the data; you just don't see them. That's why we have machines and machine learning, to see what we cannot see. You need to understand that, and you need to torture your data to get those patterns out. Second, again, you need to understand your problem; if you don't, you're nowhere. Third, and most important: stop trying to reinvent the wheel. How many of you would say you're good coders? Seriously, one guy? Come on. I come from a coding background myself, but let me tell you, you don't need to code everything from scratch. In data science, most of the work has already been done. There are great researchers who have already done the research. Any problem you're having, 99% of the time someone has already worked on it, presented a paper, and in a lot of cases there's a GitHub repository for it. Always check whether someone has already done it and build on that rather than starting from scratch.

So let's get to the first use case: region estimation. While social media intelligence was my overall problem statement, the first thing I thought was: I have all this Twitter data, and they wanted to know the locations of the people using Twitter. They weren't ready to give me data from their side to train on; they said they couldn't give me anything until I gave them a POC. So I had to look at Twitter data and predict location, with no training data and no testing data. This is where formulating the problem statement comes in: you need to understand how to frame the problem so you can actually use machine learning on it.

Here's the workflow. The first and last steps are just open source search; they have nothing to do with data science. The three in the middle, activity-based prediction, slang-based prediction, and network approximation, are what we'll actually use to estimate a person's location. Open source search is something we have to do anyway. Suppose you're on Twitter and you haven't given a location, but you're also on Quora or Reddit, and someone asks, "What's it like living in Mumbai?" and you answer, "I've been living in Mumbai for 10 years." Why would I bother approximating your location then? I can just read it. And you'd be surprised: even with a single username, how many other websites have an account with that same username? I'll show you a quick demo of that later.

So the first thing we do is activity-based prediction, where we try to estimate your location from when you're active and when you're not. The second is slang, and the third is network approximation. I'll show you a quick video of how I'm approximating the location.
When it comes to region approximation, identifying a country from activity is simple, because we live in different time zones; it's easy to tell which country someone is tweeting from based on when they're active. But once you get inside a country, say inside India, it becomes difficult and you lose accuracy. So for every prediction, we don't just take the model's number one guess; we take the top two cities. Suppose my model is predicting Mumbai and Kolkata: I take those two, draw a line between them, and place a pointer in the middle. I like to call this the slide-point approach. Then I move the pointer based on what my other models are telling me.

Let me play the video. Sorry for the bad animation, but this is how the workflow goes. We start with region estimation, where we try to estimate a country first, which comes fairly easily. Say we've identified India. Now we need to go deeper, and this is where the real problem starts. We take, say, five cities as our sample set and try to predict within those five. The first region estimation model says it's either Mumbai or Kolkata, so we take those two cities, draw a line between them, and place a slider in the middle. The model says the probability of Mumbai is 0.6, so we move the pointer towards Mumbai. The second model, slang detection, says Kolkata, so we move the pointer towards Kolkata. Like this, we collect the results from all our models. In the end, the closer the pointer is to a city, the more confidence I have and the smaller the radius I can give you: this is where he's located. The farther it is, the worse the estimate.

I'll show you a demo before going further, and then I'll take questions. This is the tracking model. To start, we'll look at one person we used to be very interested in. If you look at the location he's given on his profile, it's basically "everywhere", which is not a small radius to search in. So let's try to get his location. We used to track him a lot; in fact, I used to track him a lot. This guy moves around constantly, and every time I track someone I keep a record in the database, so I can understand where he's moving and what his pattern of movement is. I've pre-loaded the result for him because he has a lot of data and it takes time to predict. The moving pointer you see here is the latest location we found from his latest tweets, and you can see quite a few places where he was tracked earlier. One in particular, the one in the middle around eastern China, has a big circle around it; that's because we weren't able to predict his location very accurately there. The radius on that one is about 337 kilometers. At other places we can get down to around 28 kilometers. One thing I tried recently: I spoke at Google DevFest last Sunday.
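To make the slide-point approach concrete, here's a minimal sketch in Python. Everything in it is illustrative: the city coordinates, the model votes, and the way the radius shrinks with confidence are my assumptions, not the exact implementation from the talk.

```python
# A minimal, illustrative sketch of the "slide-point" approach described above.
# City coordinates, model outputs, and the radius heuristic are all assumptions.

CITY_COORDS = {            # (latitude, longitude), approximate
    "Mumbai":  (19.08, 72.88),
    "Kolkata": (22.57, 88.36),
}

def slide_point(city_a, city_b, model_votes, base_radius_km=300.0):
    """Start midway between the two candidate cities, then slide the pointer
    towards whichever city each model favours, weighted by its confidence.

    model_votes: list of (city_name, probability) pairs, one per model.
    Returns (lat, lon, radius_km)."""
    (lat_a, lon_a), (lat_b, lon_b) = CITY_COORDS[city_a], CITY_COORDS[city_b]
    position = 0.5                             # 0 means city_a, 1 means city_b
    for city, prob in model_votes:
        step = 0.5 * prob                      # stronger confidence -> bigger slide
        position += step if city == city_b else -step
    position = min(max(position, 0.0), 1.0)
    lat = lat_a + (lat_b - lat_a) * position
    lon = lon_a + (lon_b - lon_a) * position
    # The closer the pointer ends up to either city, the smaller the radius.
    distance_from_city = min(position, 1.0 - position)
    radius_km = max(base_radius_km * 2 * distance_from_city, 25.0)
    return lat, lon, radius_km

# Example: activity model says Mumbai (0.6), slang model says Kolkata (0.55)
print(slide_point("Mumbai", "Kolkata",
                  [("Mumbai", 0.6), ("Kolkata", 0.55)]))
```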
At DevFest, what I tried to do was take one of the speakers who spoke before me and predict their location, and it came out correctly. I wanted to do the same for ODSC, but I couldn't find a speaker who fit, because most of us have already made our location public. This person was the closest match I could get, and again her location is already listed, so I'll go with the same example I used at the Google DevFest. Priyanka Sinha is a researcher at TCS. Hopefully this will help you understand how the thing works. Let's try to get her location; hopefully my internet is behaving.

At the bottom of the screen you can see the models trying to approximate the location, and the radius keeps getting smaller based on how confident each model is. We get to about 300 kilometers using just the machine learning models. To go below 300 and get more precision, you need a bit more domain research: some backdoors, some cybersecurity knowledge, IPs and so on, and then you can pinpoint it further. We're already down to 48 kilometers; hopefully it improves. Okay, it's done. You can see the location shown here: Kolkata. Note that we don't show a pinpoint. The radius is triangulated from a lot of models, and even if the result is the same when you search multiple times, the exact point deviates a bit. For example, I searched for her repeatedly while testing, and you can see several spots around here where she was tracked. So even though all the models predict the same place, it won't land on exactly the same point every time you search. That's just a small caveat; otherwise, that's how it works.

I'll show you the open source search part as well. I ran it for my own username, and you can see there are a lot of websites where this username exists. Out of those, I'm only actually present on three, but it can still give you a helpful lead on where your users might also be present, so you can do more research from there.

Now, moving back. Like I said, the first step of region estimation is trying to estimate which country the person is from, and that's actually pretty simple: different countries are in different time zones, so estimating country from activity is straightforward. You can see the activity graphs for Mumbai, London, and San Francisco; the difference between them is clear. For the city-level prediction in this presentation, I took five cities: Mumbai, London, New York, San Francisco, and Canberra. I couldn't gather too much data, for a lot of logistical reasons.

So, can anyone tell me which models you think will work better when you don't have a lot of data: generative or discriminative? What we found is that if your data is limited, which is very likely because most of you will be using the Twitter APIs to get your data, you'll run into rate limits. You won't be able to get too much data, especially since out of 1,000 rows only 200 or 300 are going to be useful. It's genuinely hard to get data.
As long as you have less data, generative models will work better, because they model the distribution. But if you manage to get a bigger dataset, go discriminative all the way. We just wanted to put that out there.

Now, remember large data versus relevant data from the start. Just cleaning the data and removing the stuff that can hamper the model pushed the accuracy from 59% to 71%, without changing the model, without doing anything else. Keep in mind what this data looks like: I take all the users from, say, Mumbai, and for every user I look at what time they tweet and map it onto a plot, so for every hour there's a percentage of activity. Between 1 and 2, say, he posted 22% of his tweets.

So can you help me out: what sort of data do we not want? What sort of people do we not want in this data? Someone said people who work odd shifts; fair, but there's no way for me to find that out, so it's not something I can handle in data cleaning. Very good, another point: if someone has too few tweets, that will skew the whole graph. For example, say there's a guy who never tweets, and today there's a giveaway so he tweets once at 12 o'clock. His graph will be 0% for the entire day except 100% at 12 o'clock, and that's not something reasonable to feed any model. That's one. What's the second one? Bots. Fake tweets are fine, their activity pattern is still mostly normal, but bots are a big problem.

So the two biggest problems you'll face are bots and cyborgs on one hand, and rare users on the other. By cyborgs I mean most company Twitter accounts: for routine stuff like answering queries they use bots that handle the simple replies, "sorry, we'll get back to you, please write to this email id," but when it gets complicated a human takes over, which makes them very difficult to detect. For bots, I had to train my own support vector machine classifier using various features pulled from the profile. Bots will destroy your model, I can tell you that right now: once you start collecting data there will be a lot of bots in there, and you need to get rid of them. The second thing to remove, which is easier, is rare users. Like he pointed out, if someone tweeted only on one day, you can apply a cutoff: fewer than 30 tweets in the last month, and I'm not interested in you.

Let me show you some of the accuracy stats I got from just this data. These are balanced accuracies. Internationally, the worst class is at 77%, which is not so bad and is still workable, and we get as high as 81% for some classes. But at the local, city level we're doing as badly as 50%, and that's not something you can work with. We need to improve it. So what do you do to improve results? Get more data. We don't have more data, so forget that option. I've used Naive Bayes to get these numbers. What more can I do to get better accuracy? Anyone? And assume the data question is handled; you'd work on that part properly anyway. So, deep learning, anyone?
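As an aside, here's a minimal sketch of the activity-based model just described: a 24-bin tweet-time histogram per user, the rare-user cutoff, and a Naive Bayes classifier. The data layout, the 30-tweet threshold and the choice of scikit-learn's MultinomialNB are illustrative assumptions, not the exact setup from the talk.

```python
# Minimal sketch of the activity-based city model described above.
from collections import Counter
from sklearn.naive_bayes import MultinomialNB

def hourly_profile(tweet_hours):
    """Turn a list of tweet hours (0-23) into a 24-bin fraction-of-activity histogram."""
    counts = Counter(tweet_hours)
    total = len(tweet_hours)
    return [counts.get(h, 0) / total for h in range(24)]

def build_dataset(users, min_tweets=30):
    """users: list of dicts {'hours': [...], 'city': 'Mumbai'}.
    Rare users (too few tweets) are dropped; bots are assumed filtered earlier."""
    X, y = [], []
    for u in users:
        if len(u["hours"]) < min_tweets:      # rare-user cutoff
            continue
        X.append(hourly_profile(u["hours"]))
        y.append(u["city"])
    return X, y

# Tiny fake example just to show the shapes involved.
users = [
    {"hours": [9, 10, 10, 13, 22] * 8, "city": "Mumbai"},
    {"hours": [2, 3, 3, 4, 15] * 8,    "city": "London"},
    {"hours": [23],                    "city": "Mumbai"},   # dropped: rare user
]
X, y = build_dataset(users)
model = MultinomialNB().fit(X, y)      # generative model, copes with little data
print(model.predict([hourly_profile([9, 10, 11, 13, 22])]))
```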
Okay, you're a good crowd, because we will not use deep learning for this. It's too trivial a problem for deep learning; you don't need deep learning for everything. I say you're a good crowd because when I was at Google DevFest, everyone was shouting deep learning, deep learning.

Now we'll move to the second piece, which is slang detection. Before I get into the problem statement, this is the basic NLP pipeline: you have a document, you tokenize the words, remove stop words, do stemming, then bag of words or TF-IDF to get your features, then model creation and model evaluation. I think everyone is familiar with this.

So, slang detection. Another thing I thought about is that India is a very diverse country. I've lived in Mumbai, Chennai, and Delhi, so I know people talk differently in every place. They all speak English, but there's a difference in how they talk, and the same differences show up when they write on social media. One interesting example: when you go south, to Chennai or even Bangalore, tweets tend to be written in fuller sentences, so a tweet will be "Hey, there is no cake at this party." For the same tweet as you move north, to Mumbai for instance, people don't use full sentences; it will be clipped, with contractions, "there isn't any cake" and so on. So I thought maybe I could look for patterns like that, plus slang terms like "bhai" that people use in different regions, since those change from place to place. That's the problem statement.

If you look at it, this is a sentence classification problem. So which is considered the best model family for text classification? RNNs, right? Because you have a memory component, so the model understands the meaning of the sentence as a whole. But no, we don't want RNNs here. Why? For Twitter it might be fine, because a tweet is sometimes only 8 or 9 words, but what about a Facebook post, where posts are huge? The real reason is that for the slang task we don't actually care what the sentence means. We don't care what they're talking about; we care whether particular words or phrases are present that hint at where they're from. RNNs take too much computation and too much training time, and they're doing something we're not even interested in: we don't want the meaning. So RNNs don't suit our needs, and we'll work with CNNs instead.

So how do you use CNNs on text? The idea is that the convolution filters fire when they see a particular pattern. And how do you set the size of that pattern, say two words per pattern or three words per pattern? By changing your kernel size and concatenating the outputs, you can train your CNN to look for phrases of a particular size. That's how CNNs can be used here, and they actually gave me some really good results for text classification.
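Here's a minimal sketch of that multi-kernel-size CNN for text, in the style of the standard TextCNN pattern. The vocabulary size, sequence length, kernel sizes and the two-class output are assumptions for illustration, not the configuration used in the project.

```python
# Minimal sketch of a multi-kernel-size CNN for text, as described above.
from tensorflow.keras import layers, models

VOCAB_SIZE, MAX_LEN, EMBED_DIM = 20000, 50, 100

inp = layers.Input(shape=(MAX_LEN,))
emb = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inp)

# One convolution branch per phrase length: 2-word, 3-word, 4-word patterns.
branches = []
for kernel_size in (2, 3, 4):
    conv = layers.Conv1D(64, kernel_size, activation="relu")(emb)
    pooled = layers.GlobalMaxPooling1D()(conv)   # did this phrase pattern fire anywhere?
    branches.append(pooled)

merged = layers.Concatenate()(branches)          # concatenate outputs of all kernel sizes
out = layers.Dense(2, activation="softmax")(merged)   # e.g. two regional slang classes

model = models.Model(inp, out)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```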
Ideally, although this again varies from case to case, a phrase size of 2, that is, 2-grams, works best.

Just a bonus slide on vectorization: I highly suggest using pre-trained GloVe vectors. The vocabulary is around 400,000 words, and it's a heavy download, around 1.8 GB for the larger models, so be careful about that. The reason I like it so much here is that we're looking at slang terms, and I highly doubt the pre-trained model was ever trained on words like "bhai"; with GloVe it's very easy to represent unknown words as randomized vectors. And again, you don't need to do your own word2vec training; pre-trained vectors are available, so use them. No need to generate your own.

The next thing is networks. This part isn't heavy on machine learning; I won't go into how I feed graphs through machine learning models. It's more about getting insight. In some cases you reach a point and get stuck: machine learning is not always accurate, and sometimes you just won't be able to predict the correct output. Do you know graph theory? There are a lot of big names for these graphs, but I'll keep it simple: actor networks, bimodal networks, semantic networks. These networks, fused with social media data, give you tremendous insight into what's going on. If you're an e-commerce firm, or any firm doing social media analytics, and you're not looking at these graphs, you're missing out on a lot. This is again a bonus; I didn't cover it at DevFest.

I'll show you two graphs and how they're used. Actor networks are simple: as the name suggests, the actors are the users in your social network, and the graph shows who is interacting with whom. This is a simple actor network I made while doing social media analytics for major pharma companies; it's one of the outputs. Every node represents an actor, a user, and every edge represents an interaction with another user. This one is very simple; in practice you'll see big, complicated actor graphs, and they help you understand things like: this person is an influencer, this many people are following him, this many people are retweeting him. It gives you a very good picture of how your network behaves. Yes, it's an interaction graph.

The second one is the bimodal network. The idea is: you know who your users are, but what's the use of finding out who your users and influencers are if you don't know what they're talking about? So a bimodal network takes users as nodes and also takes hashtags as nodes. I'm using hashtags; you could use any common term, but hashtags are easy to isolate on Twitter, so it's easy to build around them. For every user we record how many times they talk about a particular hashtag, and which other users are talking about it. On the bimodal network slide, if you can read it in the background, the hashtag is EmpoweringDoctors.
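Here's a minimal sketch of both graph types using networkx; the tweet structure and the field names are illustrative assumptions about how the data might be laid out.

```python
# Minimal sketch of the two graph types described above, using networkx.
import networkx as nx

tweets = [
    {"user": "alice", "mentions": ["bob"],   "hashtags": ["EmpoweringDoctors"]},
    {"user": "bob",   "mentions": ["carol"], "hashtags": ["EmpoweringDoctors"]},
    {"user": "carol", "mentions": [],        "hashtags": ["HealthTech"]},
]

# Actor network: nodes are users, edges are interactions (here, mentions).
actor_net = nx.DiGraph()
for t in tweets:
    for m in t["mentions"]:
        actor_net.add_edge(t["user"], m)

# Bimodal network: both users and hashtags are nodes; an edge means
# "this user tweeted about this hashtag".
bimodal = nx.Graph()
for t in tweets:
    for tag in t["hashtags"]:
        bimodal.add_node(tag, kind="hashtag")
        bimodal.add_node(t["user"], kind="user")
        bimodal.add_edge(t["user"], tag)

# Simple insights: who looks like an influencer (most incoming interactions),
# and who is talking about a given hashtag.
print(sorted(actor_net.in_degree(), key=lambda x: -x[1]))
print(list(bimodal.neighbors("EmpoweringDoctors")))
```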
That's one hashtag, and you won't see edges going from the hashtag node out to other hashtags; what you'll see are lots of edges from users, including some big company accounts, who are tweeting about empowering doctors. This is a very high-level overview, but understanding the networks in your social media gives you a lot of insight into your data.

So that was the first use case. Obviously, if you want to get location down to much better accuracy you need a lot more than this, but it's a good guideline. A lot of e-commerce and pharma companies do social media analytics, but on Twitter, as you're probably aware, only about 8% of users reveal their location. So you're doing analytics on your customers, you know a lot of them are complaining about this and a lot are happy about that, but you don't know where they are. Even without an in-depth setup that gets location down to 9 kilometers or so, if you can predict someone's city with 70 to 80% accuracy, that's more than enough: you know where your base is, and where it's interacting from.

Now, this next one is a little, how do I put it, this was hands down the most difficult thing I did with the Army, or on any project I've worked on. I can't reveal the exact problem statement, so I'll put it this way. Suppose there are 10 robbers, and they want to rob certain banks on certain days, but they have to communicate over social media, because not all of them have phones and they don't want to be tracked. Now, they're not stupid enough to say "we're robbing bank XYZ on Saturday." They're smart people. What they'll say is, "hey, let's go have cake at XYZ place on Monday." My job, or what I wanted to do, was to identify when the word "cake", or any code word, is used out of context, because that could be signalling something bad, in this case a robbery. My actual use case was different, but I'm simplifying.

At the time I was new to NLP; this was more than two and a half years ago. I thought the pipeline should look like this: it's just a text classification problem, no need to go too deep, simple text classification. First, n-grams. Does everyone know n-grams? Okay, I'll explain quickly. In NLP an n-gram is basically a phrase, and n is the number of words in the phrase. A 1-gram means you consider one word at a time as input, 2-grams two words, 3-grams three words. Generally 2-grams are best for generalization: 1-grams are too general, 3-grams are too specific, so 2-grams are the ideal phrase length, at least in my experience. Again, it depends from project to project; like I said, you have to research, you have to know your domain.
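Here's a minimal sketch of n-gram extraction with a made-up sentence; in the real pipeline these n-grams would feed a classifier, and the noun-phrase step described next filters out the meaningless ones.

```python
# Minimal sketch of n-gram extraction as described above.
def ngrams(text, n=2):
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tweet = "let us have cake at XYZ place on Monday"
print(ngrams(tweet, 1))   # 1-grams: single words, very general
print(ngrams(tweet, 2))   # 2-grams: usually the sweet spot
print(ngrams(tweet, 3))   # 3-grams: very specific, sparse features
```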
The second step is noun phrase extraction. How does noun phrase extraction help? Has anyone heard of it? Not named entity extraction, noun phrase extraction. It works like this: suppose I have the sentence "this is a dog" and my n is 3, so I'm taking phrases of three words. One phrase could be "this is a", which doesn't mean anything. All noun phrase extraction does is pick n-grams in such a way that there's always a noun present, or so that they make sense grammatically: noun-verb-noun, adverb-noun, and so on. With that, you cut your garbage inputs down a lot.

The third step is theme extraction. Again, nothing complicated: you take all the phrases you've extracted and assign a relevancy score to them. How do you assign the score? Through a simple process called lexical chaining. Do you have an idea what lexical chaining is? It's a fairly low-level procedure: it joins related phrases into chains and gives them a score, that's all. You don't need to know its internals.

What I thought, and I was being very optimistic, was: let's get these features, and hopefully the perpetrators are using the same code words in some other contexts, so I might get something out of the themes; there might be other words associated with them that would bubble up during the analysis. Unfortunately, that didn't work. I extracted all of this and put it into a classification algorithm, and it didn't work out very well.

Another problem is that opposite topics will eventually turn out to be the same. I was hoping that when these people tweeted in another context they might have added a few words here and there to throw us off, but even if they did, the topics come out the same, because topic extraction is frequency-based: you only get the top words. The graph here explains the next two points. It's a fairly simple graph done on a small sample dataset. Is this graph easy to comprehend? Someone said yes; who actually understood this graph? I don't understand this graph. A lot of data science output is pretty pictures that don't actually make sense, and this is one of them; you really can't make sense of it unless you make it much coarser. And this is how your graphs will look on this problem, because your data will be very large; this one is from a very small dataset.

So what can we do? First question: is plain NLP classification really the best approach here? Can anyone think of anything else? Yes, it's anomaly detection, but anomaly detection on text. And the previous slide, the LDA model, shows why: the topics eventually come out the same, and that was my problem here. So I thought, okay, maybe this isn't working, and this is where knowing your problem comes in again. What are we trying to identify? Out-of-context words. But it boils down to one thing: we're trying to identify the person who is tweeting out of context, and in this entire pipeline we haven't considered the person at all. There's a profile we can look at, and it can give us good insights. Maybe he always tweets like this. Maybe people who tweet code words out of context create
newer profiles and delete them soon, so their profile age will be low. So the first thing we can add is user profile data: append it to the model and try to get better results. That's the first thing I did, and yes, we did get better results, still nowhere near where we wanted to be, but it was a significant improvement, because earlier we had almost nothing.

Then I thought of another approach. Can anyone think of what else we can do here? Clearly this isn't working out for us. Time and so on are already covered in the numerical features; we've considered all of that. So at this point I was hitting a brick wall: however many numerical features I added from the profile, I wasn't getting much higher. But brick walls are only there for one reason, to help us break them. NLP has to be used, that's out of the question; we can't drop NLP entirely. So instead of using NLP to classify whether the text is in context or not, I did something like multi-label classification. You can see one pattern in this kind of text, "come here, for this, at this time": three things you want to identify, name, occasion, place. So I took a set of training data and labelled it for multiple labels: this text contains a place; this text contains a place and a time; this text contains a place and an occasion. Using all of that, we generate additional features: we classify the text and output everything that's associated with it.

So the final model uses only numerical features. There's the profile data, like he said: profile age, what time it is, and so on. On top of that there are features like: how many times does he talk about a location, how many times does he mention a location together with another place or a time, how often is he just talking about cake, is he into sweets, does he talk about it every day, what's wrong with him? That, again, was a tremendous breakthrough for me. It pushed the accuracy very high; by very high I mean workable, at least. There was still a lot of work to be done, but the point is you will face a lot of brick walls when working with open source data, and you have to break them down. And that, again, is what data science boils down to: you have to work with the data. Machine learning is the simplest thing in the world now; it used to be difficult, but now it's one line, you give the kernel, you give the parameters, done. The hardest part is working with the data and formulating it to get the best out of it.

Here's a bonus for you. I'm guessing most of you are in business settings, so context analysis like this won't be a big problem for you; I don't think people usually do that. But documents are a big area where a lot is happening: identifying the context of documents, news posts, blog posts, Reddit threads. Document classification is very important, and one of the best architectures for it right now, in my view, is hierarchical attention networks. How many of you are aware of these? One, okay, two. Fine. So what are they? They try to mimic how we learn to understand documents. How do we understand documents? We know the
fact that sentences are made up of words, and documents are made up of sentences, so the model tries to do the same thing. Even within one sentence, not all words are equally important: some carry more weight than others. So the architecture has two attention models, and each level has a bidirectional RNN, which you all know, plus an attention layer. At the word level, the bidirectional RNN goes word by word, tries to capture the meaning of that sequence of words, and produces an output vector for each word. The attention layer doesn't explicitly pick out "the important words"; it takes the vector from each RNN step, computes weights, and at the end of a sentence outputs a weighted sum of all the word vectors for that sentence. The same process is then repeated over sentences to get the document representation. So you have two layers of attention working one on top of the other, which is why it's called a hierarchical attention network. How can you use them? If you can tune them to your needs, the applications are unlimited. Document classification is very important nowadays, risk classification too; there's a lot of contractual work out there. We work on engagements where we receive a lot of contracts and try to identify risk in them, and HANs are really great for that.

Okay, I actually have only three minutes, so I won't go deep into this. The third problem I had was identifying potential suspects, so I'll just list the things you can look into if you're interested in cybersecurity or open source intelligence. These are ways you can actually get someone's location, and in fact I've worked on all of them.

First is landmarks. Landmark detection is something I did because when people upload photos, Twitter strips the metadata from those photos, so there's nothing you can do there: you can't pull the photo metadata and read off the geocoordinates. I took photos of buildings from cities in almost every country, on the idea that there are differences in architecture. If there's a monument in the background, well and good, but beyond that, every country and every city has a style of architecture associated with it. So you look at the background of a photo, try to identify the style of architecture, and from that, which location it might be from.

The next is billboards and signs. In short, in any photo, if you can see a billboard, a sign, anything readable, it can give us a lead on the location. OCR, simple.

Then there are low-hanging fruit, which are more on the cybersecurity side. How do I put this simply: sometimes you'll find an email address, not even necessarily the person's own, a phone number, something like that. A low-hanging fruit, in a nutshell, is anything that can give us a lead on a location, because right now we have nothing, so any small thing that gives us a lead is a low-hanging fruit. The really good stuff is the high-hanging fruit; that's a different matter.

Then mentions. Mentions played a crucial role. In the demo I showed you, where I zero in on the location, there are a lot of people we still can't track: they either don't have enough data or have very little, and we just can't get a good location for them. So we fall back on
the assumption that, fine, we can't get your location, but the people you talk to the most are probably close to you: the people you retweet the most, the people you tag the most, I assume they're closer to you. So if I can't get your location, I'll get the locations of ten of those people and approximate where you actually are. Again, it's not very specific, but we're getting something out of nothing, which is the entire basis of open source data.

And backdoors: if you do your research, there are a lot of backdoors for Twitter, and for almost everything from Google to Reddit, where you can get locations. I'll share one cool Twitter one with you. How many of you have used the Twitter API? Good. In the Twitter API there's an option to search by location through a geocode parameter. Given an area, it returns the tweets from that area, but a lot of the time those results include users who have not enabled location sharing. That's because Twitter is using its own backend algorithm to estimate the location based on IP. You won't know this until you do your research and understand what's actually happening, but it's very interesting: if you don't want to do all this machine learning, by all means just call the Twitter API and you have your result. Although, the Twitter API is rate-limited, so you won't be able to get as much data as you want.

I was actually planning on releasing some data wrangling and data cleaning tools; I don't like releasing machine learning models because anyone can build those. Unfortunately I couldn't get them out by today, but when I upload my slides on the conference page I'll include everything. So I think we're done. I'll take questions now, and thank you, you've been a really nice audience.
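For reference, here's a minimal sketch of that geocode search using tweepy against the classic v1.1 search endpoint. The credentials are placeholders, the query is just an example, and the exact method names depend on your tweepy version; treat it as an illustration of the idea, not a guaranteed recipe.

```python
# Minimal sketch of the geocode-based search mentioned above.
import tweepy

auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET",
                                "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

# "latitude,longitude,radius" around Kolkata. Results can include tweets from
# users who never enabled location on their profile, because Twitter estimates
# the location on its backend.
results = api.search_tweets(q="cake", geocode="22.57,88.36,50km", count=100)

for tweet in results:
    print(tweet.user.screen_name, tweet.text[:60])
```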