of finding signal in the noise, how to put big data to use. For any of you wondering what that graph is, it's a social graph; you can generate a similar one for yourself at LinkedIn Labs.

So how many of you here are familiar with machine learning or natural language processing? How many of you have used it at work or at school? I took machine learning when I went to college in the US, and it was really funny in the sense that it involved a lot of linear algebra, a lot of calculus, a lot of probability distributions. It was all very mathematical and, for me, very boring. So I thought maybe this is my chance to set it right, to make it a little more fun. What we'll do here is take two problems from two different disciplines and see whether we can take a step toward solving them. For both problems we'll try to answer three questions: Is it a big data problem? What answers are we trying to get out of the data? And what techniques can we use to get those answers?

The first problem is from astronomy: we want to find aliens, or at least habitable planets. Let's say we have the data. The Hubble Telescope was launched in 1990 and generates close to one lakh (100,000) measurements per second. That has been running for over two decades, so you can do the math on how many measurements it has taken since then: roughly 100,000 per second, times about 3.15 × 10^7 seconds per year, times 22 years, which is on the order of 7 × 10^13 measurements. And just our own galaxy has close to 100 to 300 billion stars. With that kind of data, does it qualify? Is it a big data problem? Yes.

What are we trying to answer here? We're trying to find habitable planets. The very first thing we have to answer is how many galaxies, stars and planets there are. To get to habitable planets, you first have to find galaxies. Once you've figured out all the galaxies, you have to find a star in a galaxy that is similar to the sun. Once you've found a star similar to the sun, you have to figure out whether there are planets orbiting it. And once you've found a planet orbiting it, all you have to do is figure out whether there is water and carbon on it. Once you've done that, you've found a habitable planet. Broken down like that, it's a very simple problem. Now, I don't think we can find aliens in the next 20 minutes, but we can set our sights lower and find the galaxies. So that's the first problem we want to solve.

The second problem is figuring out the collective consciousness: what's happening in the world. You've all heard the stats: 900 million people on Facebook, 300 to 400 million on Twitter, similar numbers on YouTube. We all know there is far more data being generated online today than a few years back. What we want to find out is what people are talking about and sharing, to listen in on all the conversations happening anywhere on the planet. Now, again, it's a very interesting problem. I think the only three companies that can analyze data at that scale today are Facebook, Twitter and Google, because they have access to it. So again we have to lower our sights: given a topic, we'll try to find out what people are saying about that topic in real time. We want to figure out whether they're talking positively or negatively about it, what kinds of issues they're raising about it, those kinds of things.

So now we have two problems from two different disciplines. The first is a friend-of-a-friend problem. Anybody who has used Facebook or Twitter knows that when you log in, you get friend recommendations: if you and I have ten mutual friends, it will recommend your name to me. It turns out this is an age-old problem, and it came from astronomy. In astronomy, when you're trying to find galaxies, you want to figure out which stars are friends of each other, and then you can use transitivity: if a is a friend of b and b is a friend of c, then quite possibly a and c are also in the same cluster, in the same galaxy.
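To make the transitivity idea concrete, here is a minimal sketch in Python using a union-find structure. Everything in it is illustrative: the node names are made up, and in practice the "friend" pairs would come from a distance threshold between star positions, or from mutual-friend counts on a social network.

```python
def find(parent, x):
    # Follow parent pointers to the root representative of x's cluster.
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

def union(parent, a, b):
    # Merge the clusters containing a and b.
    parent[find(parent, a)] = find(parent, b)

# Hypothetical "friendship" pairs between stars (or people).
pairs = [("a", "b"), ("b", "c"), ("d", "e")]
nodes = {n for pair in pairs for n in pair}
parent = {n: n for n in nodes}

for a, b in pairs:
    union(parent, a, b)

# Group nodes by their root representative.
clusters = {}
for n in nodes:
    clusters.setdefault(find(parent, n), []).append(n)
print(clusters)  # a, b and c land in one cluster; d and e in another
```

The useful property is exactly the transitivity above: a and c end up in the same cluster without ever being compared directly.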
For the social networking problem, on the other hand, what we want to do is make real-time text analytics possible, so that we can analyze what people are saying and provide some kind of summary that lets people make educated guesses about opinion on a given topic.

So let's jump back to machine learning. In machine learning there are two important concepts: clustering and classification. Clustering is more about discovery: given a set of data points, you want to group them and figure out which points are related to each other. The catch with clustering is that you may not know what the clusters are; in some cases you might know the number of clusters, but not what they represent. In classification, by contrast, you have a set of predefined classes, and given a set of data points you want to figure out which points belong to which class. Classification algorithms can be supervised or unsupervised. We won't go into the details, but for supervised algorithms you train the algorithm on a set of data points, it learns the patterns underlying those points, and based on those patterns it tries to classify new data points as they come in.

So let's look at a clustering algorithm: the k-means clustering algorithm. The very first graph shows a set of data points plotted on a two-dimensional surface, and what we're trying to find out is what the clusters are. In k-means you make the assumption that you already know the number of clusters, so let's say we know there are two clusters here. You start by picking two arbitrary data points out of all the data points and assuming them to be the centroids of the clusters; that's what's happening in the first graph. Next you bring together all the data points nearest to each centroid, which is what you see happening in steps two and three: each point is color-coded with its centroid's color to group them together. Every time you do that, you recompute the centroids. When you repeat this enough times and the assignments start to converge, what you get is two clusters that represent the set of data points.
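Here is a minimal sketch of those steps in Python, assuming numpy is available. The toy data and the fixed iteration cap are illustrative, and a production version would also need to handle clusters that end up empty.

```python
import numpy as np

def kmeans(points, k, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k arbitrary data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Step 2: assign every point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # the assignments have converged
        centroids = new_centroids
    return labels, centroids

# Toy "star map": two blobs of 2-D positions, assumed to be k=2 clusters.
stars = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 8])
labels, centroids = kmeans(stars, k=2)
```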
To solve the friend-of-a-friend problem in the galaxy case, figuring out the galaxies, we can use a clustering algorithm like this. Given a set of data points, where each data point is nothing but the location of a star on a two-dimensional map, you compute the distances between stars. If you make the assumption that there are a certain number of galaxies in that two-dimensional space, you can start clustering around those and figure out which stars belong to which galaxy.

Let me just go back a second here. The galaxy in the background is the Andromeda galaxy. Typically, when you see a picture of a galaxy, it has been artificially colored to look that way; in reality it just looks like a bunch of dots on a black plane. Astrophysicists color it to differentiate it from the nearby stars, so that you can actually see that it's a galaxy. Otherwise there is no good way of knowing that those dots constitute a galaxy.

Now let's jump to classification algorithms. Decision trees are among the most popular and most widely used classification algorithms, and we've all used the idea informally when making choices among available options. The fundamental part of a decision tree is that you essentially have a flow chart, and each decision along the way is chosen by information gain: at each level, you want to split on the attribute that gives you the maximum information gain.

One more classification algorithm that's quite popular is the support vector machine. Again we have a two-dimensional space with a bunch of data points, but what I really want you to look at is the bottom-left charts. On the left we have a bunch of data points on a single-dimensional plane, on a line; you can see the green and red dots on the line. We know these points belong to two different classes: some belong to the green class and the rest to the red class. The trouble is that there is no good way of finding a mathematical formula that separates the classes on that single-dimensional plane. So what you do, and this is one of the neat tricks of support vector machines, is project the data into a higher-dimensional space. The graph right next to it shows the same data points projected into a two-dimensional space. What you get there is a new geometric shape, and now you can find a plane that separates the two classes. Once you have that separating plane, you have a mathematical function: given new data points, you apply it to classify them into the green class or the red class.
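That projection trick can be shown in miniature. The numbers below are made up, and a real SVM would use a kernel function rather than projecting explicitly, but the geometry is the same idea.

```python
import numpy as np

# 1-D data: greens sit near the origin, reds on both sides, so no single
# threshold on x can separate the two classes.
x = np.array([-4.0, -3.5, -0.5, 0.0, 0.5, 3.5, 4.0])
y = np.array([1, 1, 0, 0, 0, 1, 1])  # 1 = red, 0 = green

# Project each point into 2-D with phi(x) = (x, x^2). In the lifted space
# the classes are separated by the horizontal line x2 = 4.
phi = np.column_stack([x, x ** 2])
pred = (phi[:, 1] > 4).astype(int)

print((pred == y).all())  # True: the projected data is linearly separable
```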
Coming back to the real-time text analytics problem, what we want to do is analyze what people are saying about a topic, and it turns out that's what I do for a living. I run a startup called VWALT, and we do real-time social analytics. As you can see here, one of our customers is monitoring the LHC experiment that ended a few weeks back, in which the Higgs boson particle was found, and this is the real-time feed of what people are saying about that topic online. With that, I would like to end. Are there any questions I can answer?

In the SVM example you applied a nonlinear transformation. If you were clustering similar data, would you apply a nonlinear transformation there as well, or how would you generate the clusters?

Clustering would be tricky there. You would have to apply a nonlinear transformation to generate the clusters, because otherwise the greens would not form a separate cluster on a single-dimensional plane.

So you would always have to apply a nonlinear transformation?

Not necessarily. If you understand the data set really well, what its characteristics are and what the feature vectors are, then you can figure out whether it requires a nonlinear transformation.

Which are the popular tools, and which tools do you generally use?

There are lots of open source libraries. There is Weka, an open source machine learning library. There is Mahout, which sits on top of Hadoop and implements a good number of machine learning algorithms. And there are lots of other open source libraries from universities that you can leverage.

With the exception of Mahout, how well do those other packages scale once the data no longer fits in memory?

From practical experience, those open source libraries require a lot of tweaking to scale to large data volumes. For example, we process a large volume of data for our customers, and it turns out most of these libraries don't work out of the box for us, so what we ended up doing is writing our own algorithms.

Again with the exception of Mahout, do you know of any library, say a simple support vector machine library, that can be deployed on a MapReduce kind of framework?

As far as my knowledge goes, there isn't one. There is libSVM, but again you have to write wrappers on top of it to make it work with MapReduce.
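As a sketch of what such a wrapper might look like, here is a toy classifier pushed through mrjob, an open source Python framework for writing MapReduce jobs. The word-list "classifier" is deliberately trivial, a stand-in for a real model such as libSVM, and the word lists themselves are made up.

```python
from mrjob.job import MRJob

POSITIVE = {"love", "great", "awesome", "best"}
NEGATIVE = {"hate", "awful", "terrible", "bored"}

class MRSentimentCount(MRJob):
    def mapper(self, _, line):
        # Map step: classify each conversation independently, which is
        # what makes the job embarrassingly parallel.
        words = set(line.lower().split())
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
        yield label, 1

    def reducer(self, label, counts):
        # Reduce step: aggregate per-label totals across all mappers.
        yield label, sum(counts)

if __name__ == "__main__":
    MRSentimentCount.run()
```

You would run this with something like `python mr_sentiment.py -r hadoop conversations.txt`, swapping the toy scoring for calls into the real model.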
Do you really do real-time analysis of Facebook and Twitter data? What sort of Facebook data do you get? There are a lot of privacy issues there; Twitter I can understand.

We analyze a lot of fan page data. Fan page conversations are public: if a business has a fan page on Facebook, most of those conversations are public. And frankly, you'd be surprised at how many wall conversations are public too, especially in the 10-to-20 age group. The younger generation, I think, is more open and more willing to share data with the world.

One more question: do you do any kind of lead generation as part of the real-time analysis?

So what you're talking about is real-time lead generation. We actually wrote machine learning algorithms that could classify leads, and it's quite funny. We were trying to find people who wanted to buy luxury cars, and it turns out people between the ages of 10 and 20 tweet a lot about wanting to buy a BMW. Most of them are students. So it's very difficult; you need to understand the context. Is it aspirational, or is it a real need? The people who can actually buy a BMW don't tweet about it, right? That's just one example, but it drives home the point: trying to identify purchase intent from conversations is a very, very difficult task. We do sentiment analysis, and even that by itself is very difficult. People say all sorts of things: "I am bored", followed by a smiley. Would you classify that as positive or negative? So it's very interesting data, but it's very difficult to analyze.

You talked about analyzing Facebook data. What analyses do you do, and how different are they from what Facebook Insights gives?

We do sentiment analysis, as I just mentioned: whether people are talking positively or negatively about your brand. We also do text analytics: we try to bubble up the topics people care about when they talk about your brand. And we try to give an overall picture. It's not just Facebook; we pull data from Twitter, from blogs, from review websites, from everywhere, so you get a comprehensive view of what's happening with your brand.

Is it real-time?

It's near real-time. We have a latency of five minutes for Twitter and Facebook, and 30 minutes for blogs and other sources.

Presently, do you use any specific tools for text analysis or sentiment analysis?

We have our own home-grown algorithms for sentiment. For text analytics we use a number of natural language processing parsers, which help us parse text and understand the syntactic structure behind it.

Are there any open source tools to start with?

If you want them for commercial use you still have to license them, but there is the Stanford parser, which is quite popular. There also used to be a parser from the Palo Alto Research Center, PARC, but I believe it's now exclusively licensed to Microsoft.

Adding to the last few questions: you mentioned that identifying the intent of a message, such as whether someone actually wants to buy a car, is critical. So how do you rate the accuracy of the semantic engine you've built? You said yourself that a statement like "I am bored" with a smiley is difficult. How exactly do you design algorithms to handle such cases?

So what you're asking is: how do we ensure we're doing good, correct analysis, given how tricky the data is? That's a constant challenge, and we're always trying to keep up with it. We use a lot of supervised algorithms, so we keep training with more and more data.
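As a minimal sketch of that supervised approach in Python with scikit-learn: the tiny training set here is made up, and a real system trains on orders of magnitude more labeled conversations with much richer features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "i love this phone, the camera is great",
    "terrible battery, i hate it",
    "best purchase i have made this year",
    "so bored with this slow, awful app",
]
train_labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(CountVectorizer(ngram_range=(1, 2)), MultinomialNB())
model.fit(train_texts, train_labels)

# The default tokenizer silently drops the smiley, so this likely comes
# out "negative"; exactly the kind of case that needs human judgment.
print(model.predict(["i am bored :)"]))
```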
To see how far more training data can take you, I'll give you one example: machine translation. Translating from English to Arabic or English to Chinese used to be a big challenge, and there were companies spending millions of dollars trying to perfect those techniques, chasing that one percent increase in translation accuracy. Then one fine day Google came along and changed the game. They threw so much training data at the problem that they basically blew everyone else out of the water. They took the roughly 200 GB United Nations corpus; the UN charter is translated into 193 languages, if I'm not wrong, and they used MapReduce to train on that corpus.

So there are two things we do. One is that we constantly refine our algorithms. The second, and here I'll refer back to the active learning an earlier speaker was talking about, is that we do a lot of active learning. Say you're one of our customers and you find that the system rated a conversation with the wrong sentiment. You can go in and override it, and the system learns from your input.

So if you're retraining on the data over time, you can definitely take that input?

Exactly. What we do is retrain every month rather than every day.
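Sketched in Python, that override-and-retrain loop might look like the following. All of the names here are hypothetical rather than a real system's API, and `model` stands for any scikit-learn-style classifier exposing `fit` and `predict`, for instance the pipeline sketched earlier.

```python
class SentimentService:
    """Illustrative active-learning wrapper: serve predictions, queue
    customer overrides, and fold them into the training set monthly."""

    def __init__(self, model, train_texts, train_labels):
        self.model = model
        self.train_texts = list(train_texts)
        self.train_labels = list(train_labels)
        self.pending_overrides = []
        self.model.fit(self.train_texts, self.train_labels)

    def predict(self, text):
        return self.model.predict([text])[0]

    def override(self, text, corrected_label):
        # A customer disagrees with the system's sentiment call; queue
        # the correction instead of retraining immediately.
        self.pending_overrides.append((text, corrected_label))

    def monthly_retrain(self):
        # Fold the month's corrections into the training set and refit,
        # matching the retrain-every-month cadence described above.
        for text, label in self.pending_overrides:
            self.train_texts.append(text)
            self.train_labels.append(label)
        self.pending_overrides.clear()
        self.model.fit(self.train_texts, self.train_labels)
```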