Good afternoon everyone. I would like to introduce myself a little bit. I did my bachelor's in computer science at IT-BHU, and then I did my master's and PhD at the University of Texas at Austin. After that I joined a company called Fair Isaac. They work with financial data: credit card fraud detection, credit rating, things like that. I was part of the research group that did what they call blue-sky research, which is advanced research in different areas such as text, vision, and so on. After eight years of that I moved to India and was part of LAMO Labs for some time. Then I did something here. And for the last four and a half years I've been with Google in Kedavra. With that background, what I would like to share is a lot of what I have learned over all that time, looked at from a visualization perspective. I generally talk about the same kinds of things, but with one theme in focus, and today the theme is visualization. Before I get to that, I want to say a little bit about the conference. Sharing knowledge is a very important thing, right? And it has to happen beyond the premises of companies and colleges. The more these kinds of conferences happen, the more people from different industries come together. Some of them want to learn, some of them want to teach, and we all come together and talk about different topics. We really need a forum like that. It's great that companies like ASCII and others are providing that forum, an independent forum for people to come together and do these things. The last conference was just mind-boggling. I was completely amazed by the kind of people who were attending. They were all geeks. They knew what MapReduce is, what Hadoop is, and all this great stuff. And they were very eager to learn these things and to share what they had learned.
So I was excited and enthusiastic about the amount of talent I saw. I could just see the beginnings of a new Silicon Valley in Naguio, and events like this are what make that happen, right? So I really want to participate in some form in these kinds of events, which go beyond your college education. With that, welcome to this conference. Hopefully you'll like what you see and you'll participate in such things. All right, today I'm going to talk about a picture that's worth a billion numbers. You've all heard the phrase "a picture is worth a thousand words", right? I just borrowed that to talk about pictures worth a billion numbers. This is about the billions of numbers coming from the data that we have collected, and what we are going to do with them. Visualization is a very important part of understanding the knowledge, the data, the information that is coming at us at such a huge pace. It is an area of ongoing research, with new technologies and algorithms being developed, so it's an important area to know about. I'm not going to tell you about a lot of visualization techniques. This is not a tutorial on how to do visualization. What I'm showing you is a lot of real-world visualizations that we have developed in our company, and how they can be used to make interesting decisions about the data and find interesting insights. You'll see a lot of real-world applications of visualization, and we'll touch upon a few techniques here and there. This is just to give you a sense that this is an important area, a practical area, that it involves a lot of work and has a lot of value. All right, so how many of you are following the games? Are you guys watching? Who do you think is going to win? Germany. Germany, okay. Anybody else? Yes. All right. I'm really not into it, so I have no idea.
But I'm going to use it to introduce this idea of visualization. And this is not what I mean by it, right? This is not visualization. Somebody sent me this page recently, and I'm going to use it as a way to explore some ideas. Somebody has come up with this list, and I think they did it early on. It is a list of probabilities. If you just look at it as it is, it's not very appealing. It's just a list of the countries that might win. But they came up with a very beautiful visualization of this whole thing, right? This is just to give you a sense of how you can present a complex thing in a very beautiful way. What they've done is, for each group, they have the first and second teams, and then there are going to be matches between them: the winner of this against the winner of that, and so on, as the teams move forward. So you see how they visualize the tree, right? It's a very small space, with only so many teams, but the idea is to see how visualization makes it really interesting. Let me tell you how they get the numbers. Take any of these teams, say Brazil. They come up with the probability of winning as you go through each of the matches. You can come from here or from here: if Brazil is second in the group or first in the group, what is the probability of winning? So it's like traversing the tree for each group. It's a beautiful visualization of a very useful set of things going on, and there is value in just being able to see it, right? All right, let's start with the talk in general. We have all heard this cliche, right? We are drowning in data and starving for knowledge. We have way more data being collected today.
Look at gene sequences, remote sensing. Every time you swipe a credit card, that makes a data point. Every time you sell a stock, write a book, file a patent, buy stuff, all of that. Data is being generated at a pace that far exceeds anything humankind has ever seen before. It is beyond the capacity of any single individual to take even one field and say, okay, I know all the data in this field. It's just impossible. So we want to be able to use all this data, right? We can't just collect it and store it. That's a waste of space. So what do we do with this data? We want to make a lot of decisions on it. If I have your credit card swipe, I want to know whether it's fraud or not. If you're giving me a query, I want to decide which top 10 results I'm going to show you. So we are going to make decisions out of data, and that is the topic we are going to talk about: how to make decisions on data. Data is not just something to be stored in a database and queried. It is more than that. You plan business strategy. You plan your airline routing. You plan what coupons to send to people. You plan where to put the next movie theater in the city. You plan a lot of things based on data. So we will talk about that, and I'll give you a sense of how visualization helps in this whole scheme of things. Picture data on one side and decisions on the other. From data you generate a lot of insights, then you generate some features, you build some models, you make some predictions, and you get some feedback. Think of one example, web search. From web search data you generate a lot of insights. Is this a local query? If I search for a preschool, I know it's a local query that has to be served with local results.
If I type the name of a movie, I know that you might be looking for the release date of the movie. So there are a lot of insights there. You can generate a lot of features, like keywords and other things. You can build models. In web search, you are building a retrieval model over keywords that picks the top 10 documents for the query. Using these models, you make a prediction and sort all the pages, billions of pages on the web. For that query, you decide that you want to show this page here and that page there, or that you want to show images on top or video on top, things like that. All of these are decisions. And based on the clicks the user makes, you get feedback. Say I show 10 results and people keep clicking on the third one. Maybe they do not like the first one. If millions of people do that for that query, it means the third result is the most relevant one. That's the feedback, and we use that feedback all the time to improve search quality. So you understand, in one way or another, how we do this. I don't want to go into the details. What I want to talk about is how visualization helps in all of this. Visualization really helps us understand what the data is like. How many market segments do I have in my customer base? All kinds of things. I will show you lots and lots of examples of these kinds of insights in different areas. How do you analyze features? How do you interpret these insights? How do you make decisions? How do you track feedback? At every stage of this process, you can use visualization. That's what we'll talk about, and you'll see how interesting it can be. Before I get started, let me talk a little bit about philosophy. And this is not just my favorite. This is the world's favorite guru on visualization. We'll look at the video, and then we will continue from there. So let me...
It's not my favorite, this is not a relationship with viewer and I have a reason about it. I'm inspired and said it's not a recipe that failed content. If the words aren't useful, the final slide will be the specific powerful order lies in the truth. Honestly, I move forward with visualizations, but that's a lie problem of the truth, that was not the information. In the next step, I'll scan the child information. Now, with the biography of 6,000 years ago, the first map was scratched into a piece of stone. And that is one of the most widely seen visualizations in the world. Before using visualizations, I should go do something. The next step was about the real science. Raleigh about his telescope going, he saw things, never seen before. He saw the rise of the sunspots, and he watched the sun for about 40 days. And he then created some sunspots and visualized what he saw. And so the history of visualizing data is very substantially a history of science. Visualization is not just some airy, airy creative process, but it's actually a very linear process of decision making that you could do based on some basic principles. Three games are the four we design always. One is you as the designer. What you have to study and what you want to communicate. Two is the reader from assumptions that you need to account for that, or to the truth. You also see things that make some out of decisions in order to survive. And that again and again is the design that you can communicate a lot of information very quickly because we all have brains that are designed to be able to design and to write to the aesthetics of a piece just as much as we react to the information that's even in it. And so you want to change someone's mind, you want to change someone's behavior. Sometimes presenting the information of the individual provides a process way to get into the game of that information. You see things that you can't really define and probably changes the laws and the norms. 
And you have that data as a result of research. So I would say that data is just a clue to the end truth. I think a successful graphic tells us a story that communicates hopefully, at heart and sometimes complicated in a way that many people can understand. I think the first step is always to stay really committed to the data results and define each point and create a hierarchy and an harmony of that story. When you start to immerse different pieces of information, when you start to learn really what it's all saying, the narrative is the fact that everything we walk around is a hero of a piece. There's one single piece of data or insight that people respond to and kind of be cast from the whole mission. And then you might be willing to see the nuances and all the rest of the story around it. When you look at pieces that's successful, when it translates data or something, that's navigated to something similar. It communicates a message that otherwise would have taken somebody hours to digest and find him and his sisters. To keep this interest license down to the data, we gave our measurements of something. That's some things that we're talking about, our human systems. The data systems that are larger than anything that humans have ever built or experienced before. And these are really large systems. Things happen within the litter apartment. For example, he changed my shot footage from airports for the budget of the airport world and then Eric Babel made it as well. So the central idea was for some people that free time that you were in an airport, you were standing on the surface of a system that is almost compressed. Negative in height. There are more than a billion people in the air. And so there's another purpose of data visualization. Which is to show us something that you've never seen before. This is for me, much more attention. It's about data and it's down on these hard parts. 
For me, it's about showing them something in a kind of frame that they can interpret. So we show them some pieces of the picture, and the idea is that they can stand back from it, look at it for a little bit, and come up with some understanding. My interpretation is also not necessarily valid. I don't have some mathematical understanding of these systems. I have some ideas about how these systems might be changing, how they might be growing, and how they might be important for culture and society. I'll share some of those ideas with you, and maybe you will have something to add. I think audiences are a lot smarter than a lot of people think. So it's about respecting your audience and really knowing your content. Look after truth and goodness, and beauty will look after itself. You want to see to learn something, not to confirm something. We usually see to confirm. It's very common. How can we see, not to confirm, but to learn? So he talked about some very interesting things. He teaches visualization with that kind of focus, and I'll keep quoting him because he is a very big inspiration for a lot of people in visualization. He says respect your audience and know your content. So it's not about the color scheme of your PowerPoint, you understand? Then he says look after truth and goodness, and beauty will take care of itself. We'll see what that means in a lot of the visualizations I show. And to me, this is a very, very important thing that he said: how can we see, not to confirm, but to learn? I'll show you a lot of examples where this comes in.
One of the other things he mentions is that often the most effective way to describe, explore, and summarize a set of numbers, even a very large set of numbers, is to look at pictures of those numbers. So I'm trying to motivate this idea of visualization through a lot of things that he has said. Now let's talk about my visualization model. One way of interacting with data is to learn a model, to build a predictive model. Let me predict fraud. Let me predict the success of this. Let me recommend a movie. The goal is to predict something, right? We do a lot of that, and what we are addressing there is what we call the first order of ignorance, which is lack of knowledge. I know what I don't know. I know that I don't know where the nearest pizza place is or when it opens. I know that I don't know it, but I know how to find out. I can run the right query and find it. So that is something I know that I don't know. What is the next level of interaction with data? That's where visualization comes in. It addresses what we call the second order of ignorance, which is lack of awareness: I don't know what I don't know. Let's say a new pizza place opened up next door. Maybe I don't know about it, and I don't know that I don't know. It's a recursive thing, and that is where visualization really comes in. Now, if you are an open-minded data scientist, what are you going to do? You're going to go after these kinds of unknowns. You're not going to say, I know I have five customer segments in my customer base, so let me force a clustering with five cluster centers. No. You let go and say, find me however many clusters there are.
That's when you need to know how to do these kinds of things, and visualization comes in very handy. So keep this philosophy in mind: we are trying to address the second order of ignorance, and I'll show you what that means through a lot of examples. So let's start with the very basics. Let's start at the very beginning, a very good place to start, as the song from The Sound of Music goes. If you ever take a machine learning course, this is what we call the first data set in machine learning, the iris flower data. It was created by, I think, Fisher a long time ago to understand how to discriminate between these three kinds of flowers. There are three kinds of iris. He measured four different features on 50 examples of each flower. So there are three classes and 50 examples of each class. Now let's see what we can do with this most basic kind of data and how we can use visualization to understand it. I'm going to start from the beginning and take you all the way. In machine learning language, you can say this has three classes, because there are three kinds of flowers. There are four features: petal length, sepal length, petal width, and sepal width. And there are 150 examples in all. This is what we call the original data set for machine learning. The most basic kind of visualization is the scatter plot. We are used to two-dimensional views. Three dimensions we can still do. Four dimensions we can't imagine. So the best thing to do is take two dimensions and draw a scatter plot. Here we have two of those dimensions, petal length and petal width. I can color the points by class label, and you can immediately see some structure emerging, right?
You can see that one type of flower is very different, and the other two types are very similar to each other. If you go back to the pictures, you can probably see that these two flowers are quite similar to each other, and this flower is very different from them. You can see that structure emerging in all these views: one flower is very different from the other two. So instead of giving you 150 examples with four numbers each, if I show you this picture, it is far more instructive. You can immediately see how to understand the data. The scatter plot is the first basic kind of visualization. Let's look at the next one. Here is an example with handwritten digits. Imagine the postal service is trying to route your letters automatically. To do that, they need to read your PIN code. If a human has to read every PIN code, it takes a long time to sort the letters into different pockets. But if a machine could read the handwritten PIN code, you would have a very nice automated system. That need led to the emergence of research in handwritten digit recognition. They collected tens of thousands of samples of the different digits, because people write the same number in so many different ways. It's a classic machine learning data set that everybody uses. What can you do with it using visualization only? Here you know that this is an image data set. So can I take the average of all the pictures of each number? Take all the zeros, take their average, and see what it looks like. To understand the data, that is an interesting visualization. You can also look at the average of the whole thing, and this is what the embryo, so to speak, looks like: the average of all ten digits.
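The averaging idea just described can be sketched in a few lines of Python. This is only an illustrative sketch: the tiny 2x2 "images", the labels, and the function names `class_means` and `overall_mean` are made up for the example and are not from the talk; a real digit image would be a flat list of 28 x 28 pixel values.

```python
from collections import defaultdict

def class_means(images, labels):
    """Average the images of each class; every image is a flat list of pixels."""
    sums = {}
    counts = defaultdict(int)
    for image, label in zip(images, labels):
        if label not in sums:
            sums[label] = [0.0] * len(image)
        for i, pixel in enumerate(image):
            sums[label][i] += pixel
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}

def overall_mean(images):
    """Average of all images regardless of class (the 'embryo' picture)."""
    n = len(images)
    return [sum(column) / n for column in zip(*images)]

# Hypothetical toy data: four 2x2 images flattened to 4 pixels, two classes.
images = [[0, 1, 1, 0], [0, 1, 0, 0], [1, 0, 0, 1], [1, 0, 1, 1]]
labels = [1, 1, 0, 0]

means = class_means(images, labels)
grand = overall_mean(images)
print(means[1])  # average picture of class 1
print(grand)     # average picture of everything
```

Reshaping each averaged list back to the image grid and drawing it as a grayscale picture gives the kind of per-digit average image described above.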
And you can see that if you remove certain parts, you get to zero, one, two, three, four. So it is a very beautiful visualization that shows you what the average looks like and how the digits emerge from it. All right, the next kind of visualization is projection methods. You've all heard of principal components and all that. If you have data in two dimensions, what is the most dominant direction, the one along which there is the most variability? What is the less dominant direction? These are what we call the first principal component and the second, and then the third and the fourth, and so on. Say you have data in a high-dimensional space, a 200-dimensional space. You can't see a 200-dimensional space. You need to project it onto some two-dimensional space. How do you project it? There are so many options. Think of a torchlight on one side and a screen on the other. You have the data cloud here, and you can project it in so many ways. What is the best way to project? This question has infinitely many answers. It depends on what you want to find. There's no single answer. One good way is the one that shows me where the data varies the most. Imagine the data is like this. If I put the screen like this and the torch like this, I won't see much. Everything will look like a dot or a small cloud, so I don't see the variability, the structure in the data. But if I put the torch and the screen like this, I see a lot of variation. That is what principal components mean: what is the most dominant direction, the one with the most variability? In this case, this is the right direction. So this is what we call the first principal component, then the second, and so on.
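To make the torch-and-screen picture concrete, here is a minimal sketch that finds the direction of greatest variance for a small point cloud. It uses plain power iteration on the covariance matrix, which is one simple way to get the first principal component and is not necessarily how the plots in the talk were produced. The sample points, the all-ones starting vector, and the function names are assumptions made for illustration.

```python
import math

def covariance(points):
    """Return the mean vector and the sample covariance matrix of the points."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov

def first_principal_component(points, iterations=200):
    """Power iteration: repeatedly apply the covariance matrix and renormalize."""
    mean, cov = covariance(points)
    d = len(cov)
    v = [1.0] * d  # assumed start; must not be orthogonal to the answer
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return mean, v

def project(points, mean, direction):
    """Coordinate of every point along the given unit direction."""
    return [sum((p[j] - mean[j]) * direction[j] for j in range(len(direction)))
            for p in points]

def variance(values):
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)

# Hypothetical cloud stretched roughly along the line y = 2x.
points = [(0.0, 0.1), (1.0, 2.1), (2.0, 3.9), (3.0, 6.0), (4.0, 8.1)]
mean, pc1 = first_principal_component(points)
along_pc1 = variance(project(points, mean, pc1))
along_x = variance([p[0] for p in points])
print(pc1, along_pc1, along_x)
```

The variance seen along `pc1` is larger than the variance seen along the raw x axis, which is the sense in which this "screen" shows the most variability.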
So you can imagine that in a 200-dimensional space you can have 200 ordered principal components, and you can sort them. Here is an example on the same MNIST data, all 10 digits, as a two-dimensional projection. There are 10,000 points. Each point is a 28 by 28 image, so this is 784-dimensional data projected onto 2D and color-coded by class label. You can see that there is some structure, but a lot of the classes are mixed up. You can't see much. Is there another reason to project? Is this the only way to project? Let me give you one more example of why you want to project. Let's say you have a two-class problem and you want to see the structure in it. If you project onto this direction, what will happen? You can't distinguish between the two classes. Let's say the classes are fraud and not fraud, or spam and not spam, or relevant documents and non-relevant documents. It's a two-class problem, and you want to understand the data. So what is the right projection? If you project this way, both classes get mixed up, although this would be a good direction for a PCA projection, because the variability is very high this way. It's a very bad direction if you need to discriminate. But if you project onto this other direction, if you put the light on that side, you will clearly see the points separate into the two classes. So you understand what I mean when I say there is no single right visualization. It all depends on what you want to take out of the data. If your data has no class labels, no distinction between this class and that class, you can just do PCA. But if you do have class labels, you do what is called a Fisher projection. And so we keep building on this idea, getting more and more insight with more and more complex kinds of visualization.
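As a small illustration of the difference, the sketch below computes Fisher's linear discriminant direction for two classes in two dimensions, using the standard form w proportional to S_w^{-1}(m_b - m_a), where S_w is the within-class scatter matrix and m_a, m_b are the class means. The two toy classes and the helper names are assumptions chosen so that the high-variance direction (x) mixes the classes while the Fisher direction separates them.

```python
import math

def mean_vector(points):
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]

def scatter_matrix(points, mean):
    """Within-class scatter of one class as a symmetric 2x2 matrix."""
    sxx = sum((p[0] - mean[0]) ** 2 for p in points)
    sxy = sum((p[0] - mean[0]) * (p[1] - mean[1]) for p in points)
    syy = sum((p[1] - mean[1]) ** 2 for p in points)
    return [[sxx, sxy], [sxy, syy]]

def fisher_direction(class_a, class_b):
    """Unit vector proportional to S_w^{-1} (m_b - m_a) for 2D data."""
    ma, mb = mean_vector(class_a), mean_vector(class_b)
    sa, sb = scatter_matrix(class_a, ma), scatter_matrix(class_b, mb)
    a = sa[0][0] + sb[0][0]
    b = sa[0][1] + sb[0][1]
    d = sa[1][1] + sb[1][1]
    det = a * d - b * b  # assumes the pooled scatter matrix is invertible
    dx, dy = mb[0] - ma[0], mb[1] - ma[1]
    wx = (d * dx - b * dy) / det
    wy = (-b * dx + a * dy) / det
    norm = math.hypot(wx, wy)
    return [wx / norm, wy / norm]

def project(points, direction):
    return [p[0] * direction[0] + p[1] * direction[1] for p in points]

# Hypothetical classes: long along x (high variance), separated only in y.
class_a = [(0.0, 0.0), (2.0, 0.2), (4.0, -0.1), (6.0, 0.1)]
class_b = [(0.0, 1.0), (2.0, 1.2), (4.0, 0.9), (6.0, 1.1)]

w = fisher_direction(class_a, class_b)
proj_a, proj_b = project(class_a, w), project(class_b, w)
x_a, x_b = [p[0] for p in class_a], [p[0] for p in class_b]
print(w)
print(max(proj_a), min(proj_b))  # separated along the Fisher direction
print(max(x_a), min(x_b))        # overlapping along the high-variance x axis
```

With real digit data the same idea extends to more classes and more dimensions, but that generalization is beyond this sketch.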
So this one captures the variance in the data, and this one maximizes the separation. Here is an example. I am using three classes of digits: class 0, class 3, and class 8. These are all very similar looking, right? 3, 8, and 0. So I have three classes out of 10, and I am projecting them onto 2D. Now maybe you can tell me which one is the better projection, in the sense that it shows me more structure, more separation between the classes. Awesome, right. So this is the Fisher projection and this is the PCA projection. This is a projection of thousands of points. I am just sampling them to show you the structure you get. So now you can see the structure emerging. Let's look at one more example. These are classes 5, 6, and 8. Again, very similar looking classes. And again, this one has a better structure than the PCA one. So visualization depends on what you want to get out of it, what kind of structure you are looking for. All right, so we have talked about multivariate data. That's the very basic, 101 kind of visualization. Let's do something more interesting, which is what I call multidimensional scaling. Let's do a thought experiment on smell. Let's try to visualize smell. How many of you wear perfumes and things like that? Nobody will admit that. I don't know if this was a thought experiment or a real experiment, but imagine you want to understand the structure of smell. We understand the structure of visual data, right? We have RGB. We can place a color somewhere in that RGB space. So we understand the space of vision. We understand the space of sound. We can do Fourier transforms and create orthogonal dimensions. What about smell? We don't have a space to put a smell in. So it's a very interesting problem. Let's think about how you would do it. Any ideas?
How would you visualize smell? What kind of data would you collect in the first place? You have to think about that. So here's a beautiful example of how we can take something as abstract as smell and create a visualization of it. I won't create one, because we can't do it here now. This is what they did. They said, okay, let's get all these people who are good at smelling things. I don't mean those people. The real people. They are good at smelling things, so they can tell you how far apart two smells are. That's all they can tell you. They can't give you a point in a space, but they can tell you that this smell is similar to that smell. So they take two smells and give the pair a score between zero and one. They smell them and say these are very similar, or less similar, or very different. So now the data is really a matrix of distances between smells. You have 200 perfumes, and you can create a 200 by 200 matrix of smells by taking the average score for each pair. Understand? This is how we have to be creative with certain kinds of data. Take gene sequences. You can't directly create a visualization of a million-long gene sequence. Or a document. Documents have different numbers of words, so you can't just create fixed-length vectors like that. You need to be more creative. It's not simple multivariate data with so many features and so many data points. So once you have this kind of matrix, which gives the distance between pairs of things, you can actually create a visualization. The technique is called a proximity-preserving method. What it does is assume that the points live in some high-dimensional space, and that there is some distance between each pair of points.
I want to project them into a low-dimensional space such that if this distance is high, that distance is also high, and if this distance is low, that distance is also low. It's a different kind of projection, and there are equations for how to do it, but that's the idea. So you can take any matrix of distances between things and visualize those things as points in two dimensions. Two smells end up close to each other or far from each other depending on whether the number in the matrix is small or large. You understand? So now they are able to visualize smell. Don't tell your wife that I'm able to visualize smell. That would be very interesting. You can visualize all the perfumes and stuff. Now imagine you want to do the same thing in different domains, wherever you have relationships between things. Imagine you have two products in a retail store that are sold together more often than random. You can visualize them in the same way. You don't have a point in some space for each thing, but you know how two things relate to each other. So you can create a graph and visualize it. This is an example of one of those graphs. You can also visualize people in a social network by how close they are to each other. You can visualize all kinds of pairwise things: a group of authors, writers, and so on. We'll see a lot of examples of that in this talk. Here's a classic example that I commonly show, a visualization created from retail data. The idea is that if people buy products A and B together more often than random, that becomes a large number in the matrix. And if two other products are farther apart, that becomes a smaller number. So bread and butter will have a strong correlation, and milk and TV will have a weak correlation. And this all comes from the data.
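One common way to quantify "bought together more often than random" is lift: the observed co-purchase rate divided by the rate you would expect if the two products were bought independently. The talk does not name the measure that was actually used, so treat lift as an assumed stand-in; the baskets and the function name `pairwise_lift` are likewise made up for this sketch.

```python
from collections import Counter
from itertools import combinations

def pairwise_lift(baskets):
    """Lift for every product pair that appears together at least once.

    lift = P(a and b) / (P(a) * P(b)); values above 1 mean the pair is
    bought together more often than independence would predict.
    """
    n = len(baskets)
    item_count = Counter()
    pair_count = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore duplicates within a basket
        item_count.update(items)
        pair_count.update(combinations(items, 2))
    lift = {}
    for (a, b), together in pair_count.items():
        expected = (item_count[a] / n) * (item_count[b] / n)
        lift[(a, b)] = (together / n) / expected
    return lift

# Hypothetical shopping baskets.
baskets = [
    ["bread", "butter"],
    ["bread", "butter", "milk"],
    ["milk", "tv"],
    ["bread", "butter"],
    ["milk"],
    ["tv"],
]

lift = pairwise_lift(baskets)
for pair, value in sorted(lift.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 2))
```

Pairs are keyed in alphabetical order, so bread and butter appear as `("bread", "butter")`. A matrix of such scores is a similarity matrix; to feed it to a distance-based layout it has to be turned into a dissimilarity first, for example by taking a decreasing function of the score.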
And once you have the data, you can use this technique called multidimensional scaling, or proximity-preserving visualization, to come up with a 2D map. The idea of the map is again the same: two things are close to each other if the number is very high, and two things are far apart if the number is low. This could be used as a way to do the layout of a store. How do you lay out the store? Based on what people are buying together, you can redo your store. Within each department, you can do the same thing again and lay out the department. Here are some other beautiful visualizations people have created. This is a gene network from genome data. They have gene expression data, and they are able to see which genes correlate with which genes. It's graph data. And there are different kinds of gene groups that they have found based on this kind of visualization. This is a graph visualization of author collaboration. Who is the most important author in the field? How many people collaborate with that person, and how many people are connected to that person? You can color-code this and see who the dominant experts in different fields are, based on co-authorship data. Here's a visualization of tags. What I'm showing you is what makes up the central theme of this collection. This is the central theme, and these are the peripheral things in that group of tags. So this is also a graph visualization of communities. You can do this with tag communities or product communities or people communities. This next one is one of my favorite examples. Here we wanted to understand the customer base of a retailer. Imagine a large retailer like Reliance or Walmart or Facebook or Flipkart or something. Imagine they are all over the country and they have a lot of customers.
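The multidimensional scaling step mentioned above can be sketched as a small optimization: place the points in 2D so that their pairwise distances match the numbers in a distance matrix as closely as possible. The sketch below minimizes the squared mismatch ("stress") with plain gradient descent. This is only one simple variant, not necessarily the method behind the maps in the talk, and the 4 by 4 distance matrix, the circular starting layout, the step size, and the function names are all assumptions for illustration.

```python
import math

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def stress(points, target):
    """Sum over pairs of (map distance - target distance) squared."""
    n = len(points)
    return sum((distance(points[i], points[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def embed_2d(target, steps=2000, learning_rate=0.01):
    """Gradient descent on the stress, starting from points on a unit circle."""
    n = len(target)
    points = [[math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n)]
              for i in range(n)]  # assumed deterministic starting layout
    for _ in range(steps):
        grads = [[0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = distance(points[i], points[j]) or 1e-12  # avoid divide by zero
                coeff = 2.0 * (d - target[i][j]) / d
                grads[i][0] += coeff * (points[i][0] - points[j][0])
                grads[i][1] += coeff * (points[i][1] - points[j][1])
        for i in range(n):
            points[i][0] -= learning_rate * grads[i][0]
            points[i][1] -= learning_rate * grads[i][1]
    return points

# Hypothetical dissimilarities between four items (they happen to be the
# corner-to-corner distances of a 3-by-4 rectangle, so a perfect 2D map exists).
target = [
    [0, 3, 5, 4],
    [3, 0, 4, 5],
    [5, 4, 0, 3],
    [4, 5, 3, 0],
]

layout = embed_2d(target)
final_stress = stress(layout, target)
print(layout)
print(final_stress)
```

For real data such as the perfume or retail matrices, a perfect fit usually does not exist and gradient descent can stop in a local minimum, so the resulting map is an approximation whose orientation is arbitrary.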
How do you understand this space of all customers? Right? So what we did was we said let's create a similarity between all pairs of customers based on what they buy. Right? And then let's do some clustering. So these are 50 clusters nationwide for that retailer. And how do you interpret that? So we looked at the top products that each group of customers buys. So these are TV and DVD kind of customers. These are Xbox and PS2 and GameCube kind of customers. Right? They are in close proximity. These are more of the geek kind of customers: internal hard drives, graphics, PC, CPU kind of customers. Right? Then these are customers who buy a lot of appliances. Right? Home appliances like washing machines and all that. So these are all customers. And what we can do on top of it is show that there is a structure to this space. Three types of structure I'm showing. One is, you know, what do they buy? Another is how they are related to each other. And then I can color-code them by how many customers there are and how much revenue they generate. Right? So in one picture you can actually see the entire space of all your customers. Right? And that's the power of visualization. So then this is an example of how we can use this visualization to understand what is going on in this store versus that store. So that was created for the entire country. And now we will do this for two stores. So we picked two stores, and I'll tell you what the stores were. And we saw some very interesting differences. We saw that, you know, some of the clusters were very different. Right? The two pictures look very different in terms of revenue and segment size. Right? So what it means is in this store, there are lots of people in this segment. So they generate a lot of revenue here and some revenue here, but not much revenue here. But here they generate a lot of revenue in this store. Right?
So we saw that there are two different kinds of stores, and what happened is we realized that, remember, these are customers in the DVD and game kind of area and these are customers in the appliance area. And then we looked at the stores. We checked, and this store was near one of the universities. So you have this kind of behavior. And this was the store near a rich, posh residential area. So you have these kinds of customers. You see? So with the same visualization, when you put store against store, you can get some very nice insights. And what they did was, based on this kind of insight, they laid out this store differently from this store. Right? So as soon as you walk in, here they will put these kinds of things near the front or the cash counter. Here they'll put this kind of stuff, the appliances, near the cash counter. Right? So now you see how we went from data to visualization to an insight and to a decision. Does that make sense? And it's very hard to do when you look at 100,000 or 100 million customers and what they buy. Okay? All right. Here's one of the very interesting examples. How do you make recommendations? So imagine that you bought these two products. Right? And we recommend these other products to you. Right? You know, radios and other things. So these are the scores. So if I use this score as it is, I'm going to recommend the top products. Right? Obviously I'm going to recommend the first product in the recommendations. But if I looked at this picture along with this list, this is what comes out. So these are the products, right? And this is the top product, which is the first guy. But if I look at this product, this product is not connected to many things. So if I recommend this product to the person, he'll buy this and go away with it.
But if I recommend this product, which is lower on the recommendation list, I'm going to land this customer in a place where he would want to buy other things also. Does that make sense? So decisioning should not be based only on this first-order ranking. First-order, you know, this one-dimensional ranking doesn't mean much. And if you want to improve the quality of your decisions, you want to look at some of the things beyond that. And then you can probably make more useful decisions. So that's what this slide is about. They, you know, used this to improve their overall sales as opposed to the sale of the first product in the recommendation list. There's a difference between predictions, which is the recommendation score, and the decisions that you take. And they need not be directly proportional. We'll see more examples of that. All right, so let's look at another example from the financial domain. So this is when a lot of banks are trying to understand the space of their customers. What kind of customers do I have? And how do they do it? So let's imagine that a bank is able to create scores for a customer. This is a risk score. That means if I give you a loan, will you pay back the loan properly or default? That is risk, right? Attrition means, you know, if I increase your interest rate, will you close the bank account? Right? Attrition, right? Credit score, behavior score. So let's say they have four scores, and if I show them this large table, it doesn't mean much. How do you convert this table into a beautiful picture that can tell them a lot more? And in the table you can only look at, let's say, 20 rows at a time, but you have 10 million, 100 million customers, right? How do you see the big picture and make some interesting predictions? And let's say they also have two metrics: how much revenue this customer generates and how much profit this customer generates. So what we've done is we've created a visualization, you know, using the same ideas of multi-dimensional scaling.
So we said, okay, we're going to create a topography. Topography is like a map, right? Let's say you create a map of India or a map of something, and then you can overlay the map with anything. Like, on the map of India, you can overlay anything, right? Population density, weather, right? Crime rate, all kinds of things. You can overlay them on top. So think of creating a topography and an overlay. And on the same topography, you can create many overlays. So what I'm going to show you is that, but before that, some philosophy. So, you know, it's very easy to look at the individual bar charts, right? So I can take each of the four dimensions and look at its histogram by itself, and it will look like this, right? You've heard of the five blind men who look at the elephant differently. Some people say this is that, this is that. This is what happens when you just look at histograms. You're looking at the data from one perspective. You want to look at the joint space in an interesting way. So PCA is one of the ways, this is one of the ways. This is one more way, right? So what we've done is we created this space and we saw something very interesting: oh, there seems to be a structure emerging in this space, right? So let's look at what that means. So let's go over the process. We start with the data. We sample some of the points. We don't take all the points. Then we create the topography, right? We try to cover all the kinds of points. So I want to include very rich people as well as very poor people, because the distribution is very skewed, right? I mean, imagine the distribution of wealth in the world. So obviously we don't want to sample based on the natural distribution. Otherwise I won't get the really rich guys into my sample, but they're important too, right? So I create the topography and then I can overlay anything.
So here I'm giving you an example of overlaying the revenue. So if I look at the revenue, this group of customers who fall in this part of the space don't create much revenue. These generate a lot of revenue, right? So you get a nice insight. Then you say, what are these customers, right? Let me go and see what they look like. Then you drill down into your customer base. So it gives you a big picture of the whole thing. And then you can also look at trajectory, which means if I take these customers, and if I make a decision, let's say I increase their credit limit, then these customers will move to a different place in this space, right? So they'll start spending more money, they'll stop paying their bills on time, right? They'll behave in a certain way, and they will move either to this side or that side of this space. So customers are not static things. A customer is not a fixed point in space, right? So again, you know, some equations just to scare you. Not to scare you, but what I'm saying is the message is, these pictures are not coming because I'm an artist and I know how to draw beautiful pictures. They're coming from serious maths behind them, right? So we use this kind of technique to do this kind of work. So here, what happened is we have different kinds of overlays. So we looked at how do I understand this space? It looks like this, but what does it really mean, right? So we say, okay, let me overlay the rating. Customer rating. So these are very high customer rating people. These customers have very good credit rating, and the bank would be happy to give them a loan, right? These customers have very poor rating. So that's what we understand. Behavior score. So these are the four scores that I use, the four columns. Revenue score. I get a lot of revenue from these guys, but the rating of these guys is not good. So this is the dilemma that banks keep facing. Which is, I don't want to take too much risk.
If I give loans to these people, there is a lot of risk involved because their rating is low. But these are the kinds of guys who generate revenue for the bank. Right? Because if they pay their bills in full, the bank is not making any revenue from them, right? So that's the idea. So they want to do something here, but they don't know how to do it. And they know that there is one group of customers who, if they move into this part of the space, want to move to another bank or quit, right? So this is how this visualization gives me the lay of the land. Now we're talking about one million customers in one picture, and what the whole space looks like. This is what we call the big picture. Okay? So now let's look at one of the overlays, the density overlay. So same thing. There are two groups. So let's try to understand what this group looks like, just using the visualization. Right? So we start with this group. Let's say you're a manager of this big bank, right? All over the place. And you want to understand what is going on here. So you say, okay, overlay something on top. So let's overlay card utilization. Which means, if you have lots of credit cards, what percentage of your credit limit are you actually using? So these people are using very little of their credit limit. I mean, they have credit cards with a limit of 5 lakhs, but they are using, let's say, 1 lakh or less, right? But these guys are using almost all of their credit. You can see the combination of this with the revenue as well as the other things, right? Then you look at percentage payment to balance. These are different features. These are additional features. So we said, oh, these guys pay almost 100% of what they are due to pay. These guys, you know, pay the minimum balance. Right? So that's the kind of correlation that we're seeing. Then this is age of cards. So how long have they been customers? When was the first credit card issued to them? These are kind of new customers.
And these are the guys who are also ready to attrite. Remember, the attrition score is high here, okay? And then we looked at something very interesting. And this is where the new insight came. We looked at the number of mortgages. Mortgages means how many loans you have. Car loans, home loans, education loans, right? So we looked at how many loans you have. It turns out that within this region, which is, you know, a high credit rating region, there are two kinds of customers. One kind has very low mortgages. And one kind, you know, has high mortgages. And all these guys are good customers, okay? And the reason is that these guys are the ones who just got a job and just bought a house. Since they have a job, they have a good amount of money in the bank. So they get a good, high rating. And these guys are the old-timers. They are retired people. They have paid off their mortgages. They have no more credit to take. So they have very few mortgages, zero mortgages. Right? They have no more loans. They have paid them off. So both of these are good groups of customers with high credit rating. And you can do something different for these two groups of customers. You can create different policies for these two groups of customers. So you understand how visualization is giving me a lay of the land? And without doing any complicated clustering or anything else, I can see a lot of things visually with these overlays and make a lot of decisions. Okay? Any questions so far? All right. So here's another interesting thing we saw. We said, okay, let's look at a group of customers who are here. Right? And here. Now, this is wrong. Because it's both in here. Either this is wrong or that one. Just the other mic. Sorry. Okay. Does that make a difference? I don't see it. Okay? Okay. So what we did was we said, okay, let's take a group of customers who are here. Let's make a decision on them.
So let's make a decision to raise their credit limit. And let's follow them for six months and see what happens. Right? So it's an experiment, because customers are dynamic things. A point in this space is not a customer. It's a customer at a given time. Okay? It's a behavior. It's a combination of the four scores. It is not a particular customer. As the customer changes, he moves to a different place. So, if you look at customers here, some of the customers move to a higher risk level when we raise their credit limit. And some of the customers move to a lower risk level. Right? As we raise their credit. So, you can actually visualize what happens, the feedback part. Right? We talked about the whole cycle. We saw insights, prediction, modeling. We talked about feedback. Right? Feedback is what happens when we make a decision about a customer. So, you can also visualize your feedback and see what happens to those customers. Right? Now, imagine you created a rule. Let's say the rule is, if the guy got married or, you know, he got a new job, increase his credit limit. Let's say you created a rule out of some thinking. Right? Now, you want to know how well the rule is doing. So, you apply that rule to some customers and see what happens to them after six months. Right? So, you can use visualization to see what is going on. And then you can learn what the difference is between this group and this group of customers, apart from the rule. You know, the rule is the same. It was applied to both. But what is the difference between customers who moved up and customers who moved down? Let me treat that as a classification problem. Positive customer, negative customer. And refine my rules. You understand? So, then you can say, okay, the rule is not enough; I need a better rule to make decisions about this. You understand? So, that's how we go through the whole cycle. All right. Let's talk about text data.
And text data is also a very interesting kind of thing because it's all over the place, right? You have text data on the web, on Facebook, on Twitter, in all the enterprises, in the papers that are published, in the patents that are written. So, text data is all over the place, right? One of the hardest things in text is how do you get a computer to understand text the way humans do, right? Humans can read only a small amount of text data, but they can understand it very quickly. Computers can consume lots of web data, but they don't understand it the way humans do. So, how do we bridge that gap, right? So, visualization is one of the ways. I'll show you how to visualize text data and what we can do with it, okay? So, let's start with a simple exercise. What is this word? Important. Important, right? I mean, it took a little bit of time to figure that out, okay? Now, tell me what is this word? Same word, but because it is in context, it's easier to interpret, right? Now, let's read this diagram. Okay? So, the idea is: what is the meaning of a word, what is the word you expect after I say this word, right? There's a whole philosophy around how to understand language and how the brain works, and this quotation kind of summarizes it. It says, you shall know a word by the company it keeps. The words before it and after it are going to tell you the meaning of the word, right? If I say the sentence, Apple filed a suit against Orange. Now, do you know what Orange is, right? We know it's a color, we know it's a fruit, but when I say it in that sentence, you think it must be a company or a person, right? And actually, it is a company in France. It's the Airtel equivalent there, a company called Orange, and if you have a sentence like that, how does your brain understand the meaning of all the words, right? Filed could be a file on your desk, or filing your nails; suit could be what you wear, right? Or a suitcase.
How does your brain make that connection? That this is what the meaning is. So the meaning of the text lies in the context in which the words are. The context gives you the meaning, okay? So what we do is we use the same kind of philosophy to say let's do clustering over some text. So what we did was for every word, we created a high-dimensional representation, and we looked at the words in the neighborhood, and we moved these words around such that words that co-occur in the same neighborhood are together, right? So it organizes the space of our points. Now, if you do clustering on that, it kind of tells you that, oh, this is a very meaningful space, right? If you look at regions in the space, the words are semantically very close to each other, right? How? Same class, same origin: vegetables, spices, all that. Now, the question is how do you visualize this? You still have clusters. You can't see your entire corpus. So what I'm going to do now is show you a whole bunch of such visualizations and how to see the corpus, and what we can do with it. Okay, so here is a visualization of a news corpus. This is news from one of the countries that the CIA is interested in, and they want to understand something in this news, which will come in the next slide. Now, what we did was we took all the news articles from the last ten years emerging from that country, and then we put them on a two-dimensional map. This is a self-organizing map, which is a form of clustering that can sort of put things on a two-dimensional surface so that things that are nearby are similar in meaning. Remember, we talked about the meaning of a word. Orange, does it mean, you know, this or that? So the meaning space is organized in two dimensions, again because we are able to visualize two dimensions. This technique is called self-organizing maps.
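To make the self-organizing map idea a little more concrete, here is a minimal sketch in Python with NumPy. It is not the implementation behind the slides; the grid size, the learning-rate and neighborhood schedules, and the toy vectors standing in for word or document representations are all assumptions chosen for illustration.

```python
import numpy as np

def train_som(data, rows=4, cols=4, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small self-organizing map; returns a (rows, cols, dim) weight grid."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    weights = rng.normal(size=(rows, cols, data.shape[1]))
    # Grid coordinates of every cell, used for the neighborhood function.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays to zero
            sigma = sigma0 * (1.0 - frac) + 0.5      # neighborhood shrinks over time
            # Best matching unit: the cell whose weight vector is closest to x.
            dists = np.linalg.norm(weights - x, axis=-1)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # Pull the winning cell and its grid neighbors toward x.
            grid_d2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
            influence = np.exp(-grid_d2 / (2.0 * sigma ** 2))
            weights += lr * influence[..., None] * (x - weights)
            step += 1
    return weights

def assign_cell(weights, x):
    """Return the (row, col) of the map cell closest to vector x."""
    dists = np.linalg.norm(weights - np.asarray(x, dtype=float), axis=-1)
    return tuple(int(i) for i in np.unravel_index(np.argmin(dists), dists.shape))

# Toy data: two well-separated groups of hypothetical 3-D "meaning" vectors.
rng = np.random.default_rng(1)
group_a = rng.normal(loc=0.0, scale=0.3, size=(50, 3))
group_b = rng.normal(loc=5.0, scale=0.3, size=(50, 3))
som = train_som(np.vstack([group_a, group_b]))
cell_a = assign_cell(som, [0.0, 0.0, 0.0])
cell_b = assign_cell(som, [5.0, 5.0, 5.0])
```

Each input ends up assigned to one cell of the grid, and the neighborhood update is what makes nearby cells hold similar items, which is the property the news map relies on.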
So then what we did was we said, okay, let's do another clustering on the cells, and then we realized these cells are actually telling me something. So this is mass destruction, international relationships, diplomacy, verbal conflict, religion and politics, freedom and justice. See, the topics are also very similar within a part of the space. This is banking, government election, meetings, news media, arrest, detainment. This side is transportation and infrastructure, military technology, agriculture and commerce. So you see how the news is organized; now you know the lay of the map. You know what the different topics in the news might be. What can we do with it? So then what we said was let's look at one of the cells, let's look at the words that are nearest to it. So what we have done is, imagine I have 100,000 words in a high-dimensional space. I put them on two dimensions. Now each part of the space is in one cell, and if two cells are close to each other, those parts of the space are close to each other. Okay, so now if I look at one cell, these are the words in it. So this is chemical, biological, bacteria, radiological, gel, munition. If I look at this cell, this cell is about Richter scale, seismology, tremor, aftershock, earthquake. So see, what we've got is some kind of mass destruction, through warfare or through natural disasters. And they are coming out as similar because words like disaster and things like that occur in this space. So now you understand how to take ten years of a news corpus and just see it in one shot. Now let's see what we can do with it. So let's say you're a government agency and you want to understand whether something is real, so imagine you're doing this query: Russian proliferation of nuclear weapons to Iran. Okay, so this is the query. And now, in the same space, we highlight the region that matches this query the most. Same space.
So remember this part of the space was mass destruction and nuclear and all that. So this is where the query is, and there are some other hotspots all over the place. Understand? So imagine now all the web data on this two-dimensional surface. You type the word Apple and Apple comes up in one part of the space, another part of the space, another part of the space. Three different meanings of the word Apple, and different parts of the space highlight. Now you click on one of the cells, and you get another layer of the space. So imagine doing a visual search on that. It's a part of the meaning space. So imagine a 300-dimensional space, and this is one region in the meaning space. See, meaning is a very sort of fluid thing, right? So whatever it learns, based on that it says, okay, all these guys went to this cell and they are very close to each other, therefore they belong together. It's just a discretization of the continuous meaning space. So it has to have some boundary, and something might be next to it and all that. Okay, so then if I click on it, I can see, oh, Russian deputy fears increased proliferation of WMD, right? So this document belongs to this cell. So not only words belong to each cell, but documents also belong to each cell, right? So I think the ranking of the documents is based on this thing, right? Sorted by this. Iraqi scientists report on Jaguar and other things in general. So it's a very visual way of exploring your 10 years' worth of news content, right? All right? So let's see a lot more examples of this kind of thing. So this is scoring from text data. So this is another example, from insurance companies. So insurance companies, what they do is, imagine a Flipkart guy comes here to deliver a box and he slips because there is water on the floor. And he hurts himself and he needs to go for surgery. Who is going to pay for it? Is Flipkart going to pay for it? Or IIIT
is going to pay for it, right? This is called subrogation. So the two insurance companies argue: one says, no, you're going to pay for it, it's your fault. The other guy says, no, you're going to pay for it, it's your fault. So it's called a subrogation process. And there is a lot of text data: whenever such things happen in insurance, they write down a text description of what happened. And we took the text data and we said, what kinds of things are there in the text data? Remember the first thing we talked about, lack of awareness. We don't know what we don't know. So we can't say that there are these five things in it. We don't know what we don't know. This is a very important idea. And that's what the visualizations are telling us, right? So what came up is that there are names, generally a lot of names of people, right? Names and comments. Then there are muscle injuries, symptoms, bone and neck injuries, traffic and auto. So this region is very big. There are a lot of subrogations that happen here, right? Always. It's the other guy's fault, right? If you have a traffic accident, subrogation is a very big deal in traffic and auto. Then, you know, food and bruises and cuts, because somebody is cutting, you know, something at home and he hurts himself. So can he sue somebody for that mistake, right? Eye injuries, slippery floors. We talked about slippery floors. So that kind of space emerged from this visualization, right? And now, let me show you another beautiful overlay on top. This is what we call the scores. We actually built a prediction model that predicts whether this text will lead to subrogation or not, right? So the score is high if this text leads to subrogation. And now, each of the cells is color-coded with the probability that, if your text belongs in this cell, it leads to subrogation. Does that make sense?
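The overlay step can be sketched as a simple per-cell average. In this minimal Python sketch, the cell assignments and the model scores are made-up inputs (hypothetical values, not the insurer's data), and averaging the score within each cell is an assumed way of producing the color for that cell.

```python
import numpy as np

def overlay_mean_score(cells, scores, rows, cols):
    """Average a per-document score within each map cell (NaN where a cell is empty)."""
    total = np.zeros((rows, cols))
    count = np.zeros((rows, cols))
    for (r, c), s in zip(cells, scores):
        total[r, c] += s
        count[r, c] += 1
    # Empty cells have count 0; suppress the divide warning and mark them NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(count > 0, total / count, np.nan)

# Hypothetical inputs: the map cell each claim text fell into, and the
# model's predicted probability of subrogation for that text.
cells = [(0, 0), (0, 0), (1, 1)]
scores = [0.9, 0.7, 0.1]
overlay = overlay_mean_score(cells, scores, rows=2, cols=2)
```

A cell with a high average would be drawn hot and a cell with a low average cold; empty cells stay NaN so they can be left uncolored.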
So it's a classification problem, a two-class problem, and I can overlay the prediction on top of the space. And when I look at these two together, I kind of understand that this region is very hot, right? Auto and traffic. So let's look at some examples. So now, I'm going to overlay one more thing on top of it. So not only am I showing you the space, the meaning space of the whole corpus, I'm showing you the prediction output overlaid on the same space. Now, I'm going to show you this query. It says motor vehicle accident next train rear-ended by truck pushed into ditch hit by four others, right? Pretty bad situation for this guy. And this query falls in this region of the space. Understand? So obviously it's going to have a very high subrogation score. So imagine you are an insurance agent who wants to decide this. If I show you this visualization, as you enter the text, I'm going to light up the region where your text belongs, and you can actually see whether it belongs to a hot region or a cold region, right? Okay. Here is another query. Was cutting board with saw. Chip went into eye. Whose fault is it? Right? It's not subrogation, right? The guy has to pay from his own insurance because it was his fault. So see, remember this eye injury? And this region was machinery and cuts and bruises. So this query falls in this region. And this is kind of a cold region for that kind of classification problem. So now you see how you can visualize the entire text space, the overlay of the model, and the overlay of the query. Any questions? Okay. I'll show you something even more interesting. This is an example from another banking application, which is collection notes. So let's say you've not paid your bills for a while. The bank guys will call you and say, hey, what's happening? And why are you not paying? And people are going to give one kind of excuse or another, right? Now the banks don't know what to do with that data.
So this guy is typing this data, but they don't know what to do with it. So they said, can you do something with this data? So we said, okay, let's not assume anything. Let's look at the visualization, right? Same technique, same process. We created the visualization. So we saw all these concepts, actually. Forgot, mix-up, unreachable, confirmed payment, waiting for funds, account mix-up, you know, hospitalization, bankruptcy, not sufficient funds, financial difficulties, forgot, out of account, right? So these are the kinds of things people are saying when the bank calls them for money, okay? So there is a structure to this space. It's not like people are randomly saying something. There's a whole structure to that space. And now we did something even more interesting, which is, we said, okay, let's actually look at the probability of defaulting also. Along with the space of what they are saying, let's look at the hotspots. So here is another model that has been built to predict whether you will actually pay or not, right? So it's already made. Will he pay? Should I spend more resources on this person? Or should I give it to somebody else to collect the money, or let it go? Or send a notice? What should I do with this guy? So there's a model they have, and we said let's overlay this model on top of this visualization and see what people are saying and what their score is. And it turns out that, you know, there are different hotspots. It's not like one continuous hotspot in one place. There are many, many reasons for what they are doing. This is very much like the recommendation example we saw before. Remember the recommendation engine? We said you can't depend on a score alone. You need to know what the extra benefits from that decision are. Here, again, if they depend just on the score, they will say if the score is above this number, do this action. If the score is above this number, do that action, right? That's a very bad idea.
And the reason is that different people have different reasons not to pay, right? So this guy is not paying because you are unable to reach him. He's not picking up your call. You have to take a different action. You can't keep calling him every six months. You have to send somebody, or send a letter or whatever, right? Go to his permanent address. So this part is different because he is unreachable. For this reason, this score is high. Here, the score is high because of something else. He's waiting for funds. So it's kind of medium. Maybe he'll get the funds, maybe not. So the decision here could be let's wait for some more time, right? Here, the decision could be something else, right? So here, for example, this is not sufficient funds. He's out of money. He's out of a job. So he's not able to pay. So the point of this slide is that the score alone is not enough. You need to know why the score is high. And once you put these two together, the decisions should be different. You're going to take different decisions for different people. And you're not going to just say, score is this, do this. Understand? So that's a new insight that the banks got. Earlier, they were just going with the score. And when we showed this together, they revamped their whole strategy. And then they started to get better results. All right. So here's an example: sending the affirmation to attorney, mailing the affirmation to attorney, court case still pending. So there's a court case pending. So unless that is resolved, he's not going to pay. So he's at high risk here. And then this one, you know, thought she was already caught up. She will transfer the money to checking on Monday and make the payment, right? So this is kind of in that region which is, I think, forgot, fell behind, and all that. So it falls here. So she's going to pay; no need to do anything.
So that's the idea: while you're talking on the phone, you can actually type this and it can show you where the thing falls, and you can take the right decisions. All right. So the last example on visualizing text data: this is accident reports. So imagine, you know, we have accidents in the US, and every time there's an accident, the police come, people exchange insurance numbers and all that, and they also type a report. It says, you know, what was the car people were driving, right? What was the weather condition? Where did the accident happen? And the description of the accident. So the government was collecting this data for a long time, tens of years, collecting descriptions of accidents, but this is text data and they didn't know what to do with it. They know what to do with the other things, right? Insurance field and phone number, they know what to do. Text data they didn't know. So they gave us this data and said, can you do something? So we said, let's visualize it first. So here is the visualization of what is happening in the car world. Accidents. So here you have concepts like tire size, tire brand, car model, mileage, smells and sounds, death and injury, chassis problem, brake problem, insurance agents, transmission. So these are the concepts in that kind of space, in the accidents. Now something very interesting happened, and I want to show you that, which will tell you that if a human could actually have been reading all this text data, they could have saved some lives. So here is an example. So this is what happened in August 2000. You know, Ford is a car maker, we all know, right? Ford used to use tires from this company called Firestone, and what happened was one of their tires was randomly exploding. There was some fault in the tires and these tires were exploding, and people were getting killed because of that, or accidents would happen, right?
And in different parts of the country, in different places, this thing was happening. So nobody was looking at the big picture, right? The guy who is filing a report, he doesn't know that this thing has been happening for a while, right? Everybody is just filing reports. Nobody is looking at the big picture and doing analysis, because it's text data, we don't know what to do with it, right? That kind of thing. So we said, let's look at this data and look at this thing in retrospect. So remember, the news broke out in August 2000, but obviously the accidents were happening all along up to that point. So now we looked at this car, the Ford Explorer 96, and we looked at the density map, which says how many reports in a given month fell on different cells of the map, right? So we take a report and we put it on the map, and that gives me a density map, right? Now observe this area for this car, see what is happening. March 1999, right? I'm just going to do an analysis. See how this area is becoming red, right? More red. So this is by September 1999, which is almost a year before the news. There was evidence in the data that something was wrong with this car: more and more reports were coming in this area, prior to it, right? And if somebody had been able to look at this visualization before, they would have been able to figure this out. And that is one of the problems: when you don't see the data the way you could, you land up losing a lot of things, right? So Ford had to get all the tires back and spend millions of dollars to recall all the tires, and a lot of money was spent, right? Now imagine doing this on other things. Let's say you are doing a complex project, right? Building a new airplane, and let's say you have all the email exchanges going on between project managers, right?
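As a rough illustration of the month-by-month density map described above, here is a minimal sketch. It assumes each report has already been assigned to a grid cell on the map (the talk does not say how that placement is done), and the function names, the thresholds, and the toy counts are all invented for the example; they are not the speaker's data or method.

```python
from collections import Counter, defaultdict

def monthly_density(reports):
    """Count reports per month and per map cell.

    `reports` is an iterable of (month, cell) pairs; how a report is
    mapped to a cell is assumed to happen elsewhere.
    """
    density = defaultdict(Counter)
    for month, cell in reports:
        density[month][cell] += 1
    return density

def rising_cells(density, months, factor=2.0, min_count=3):
    """Flag cells whose count in the latest month is at least `factor`
    times their average over the earlier months (hypothetical rule)."""
    *earlier, latest = months
    flagged = []
    for cell, count in density[latest].items():
        baseline = sum(density[m][cell] for m in earlier) / max(len(earlier), 1)
        if count >= min_count and count >= factor * max(baseline, 1.0):
            flagged.append(cell)
    return sorted(flagged)

# Toy data, invented for illustration only.
reports = (
    [("1999-03", (2, 5))] * 1 + [("1999-03", (7, 1))] * 4
    + [("1999-06", (2, 5))] * 2 + [("1999-06", (7, 1))] * 4
    + [("1999-09", (2, 5))] * 8 + [("1999-09", (7, 1))] * 5
)
months = ["1999-03", "1999-06", "1999-09"]
density = monthly_density(reports)
flagged = rising_cells(density, months)
print(flagged)  # cells that are "becoming more red"
```

In this toy run only cell `(2, 5)` is flagged, because its count jumps well above its earlier average while the busy cell `(7, 1)` stays roughly flat. A real system would need a more careful baseline than a simple ratio.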
And they are talking about different things. They are saying the parts are delayed from Australia, or Taiwan, or whatever; this guy is talking about something else. Now imagine if somebody, a single person, could read all of that and create this kind of visualization, right? So imagine the regions of the space could be things like, you know, quality is poor, or whatever, right? So you can have this space, and as emails are coming in or discussions are happening on the project, you can actually start to see the density, and somebody at the top of the project can look at this and say, you know, why don't we investigate this as an emergency, right? Because individually people don't know, but the big picture has it. So our contention is that the data as a whole has a lot of insights. We are behind in terms of finding them, okay? And, you know, obviously, as this goes on, you could actually have found this almost a year before the whole thing happened, okay? So that was my last slide. So let me talk about the three things that we need to think about when we do visualization. One is, what is the nature of my data? Is it images, text, gene sequences, speech signals, all that stuff? What kind of distribution do I have, right? I mean, some data will have a normal distribution, some will have a different distribution. How high-dimensional is it? Is it sparse or not, right? Text data is very sparse, gene sequences are very high-dimensional, whatever, right? So you need to think about the nature of your data. Think about what kind of things you can do with this kind of data, right? Can you actually project it using PCA? You can't project a gene sequence using PCA. You need to think of other techniques, right? All right, second: what kind of distance do I want to define? As in my example, we said we can define distances between people. Similarly, in text, we saw distance and similarity between meanings, right?
You can imagine distance between two people, you know, based on how many emails they exchange, how much they talk on the phone. You know, different companies can look at distance between people in different ways. So distance is a very important notion in visualization. You understand that, okay? And then, third, what kind of insights do you want to look at? Do I want to look at the big picture? Do I want to visualize a decision? Do I want to visualize a model? So we saw examples of all those kinds of visualization, right? All right, so that brings me to what Edward Tufte said: in the arrangement of the visualization, every single pixel should testify directly to content. All right, so I'll stop here if you have any questions. I have a couple of questions. One is related to this. There are two visualizations for seeing the result. How do you figure out if this is the best visualization? It is possible that you have to think differently. How do you figure out if this is the best visualization? Yeah, so there's no best. It all depends on what you want to see. So you can define best in so many ways. One is, you can say, is it losing anything? Am I losing something? Or is it complete? Is it giving me the complete structure? Or is it giving me partial structure? That's one abstract notion of what is best, right? Then relevance. Is it showing me something I care about? Or something that I don't care about, right? So in the financial world, for example, if it is just showing you people based on where they live, you know, if that is the distance function, then it's not a useful visualization for you. But if it is based more on, you know, their credit score or whatever, right, their social score or social relationships, maybe that's more important, right? Say it's a Twitter visualization. Are you doing it on IP address proximity versus, you know, the kinds of people they follow? So what dimensions do you use, and are they relevant to the work you want to do?
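To make the point about distance functions concrete, here is a small sketch showing that the same records give different "nearest" people depending on which distance is used. The records, the field layout, and the two distance functions are all made up for illustration; they are not from the talk's data.

```python
# Hypothetical records: name -> (home location in km along a road, credit score)
people = {
    "A": (0.0, 720),
    "B": (1.0, 480),
    "C": (40.0, 715),
    "D": (42.0, 500),
}

def by_location(p, q):
    """Distance based only on where people live."""
    return abs(p[0] - q[0])

def by_score(p, q):
    """Distance based only on credit score."""
    return abs(p[1] - q[1])

def nearest(name, distance):
    """Return the other person closest to `name` under `distance`."""
    others = (n for n in people if n != name)
    return min(others, key=lambda n: distance(people[name], people[n]))

print(nearest("A", by_location))  # B: the next-door neighbour
print(nearest("A", by_score))     # C: far away, but a similar credit profile
```

Any layout built on `by_location` would put A next to B, while one built on `by_score` would put A next to C, so the choice of distance decides what the picture can show.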
So if Twitter wants to open a data center in some location, maybe they want to look at how many people are in different parts of the world. Maybe that is a good visualization for that problem. But if they want to do some other analysis and recommend people to follow each other on Twitter, they may want to do a different visualization, right? So there is no one answer of this is the best visualization, but there is relevance; there is the holistic nature, as in it doesn't miss anything, it shows you all possible angles. And obviously, how quickly can you get the right insight from it? Is it taking you, you know, a bit of time to understand, or do you just see it and know what it is, right? Like the visualizations I showed you: once you understand what they mean, you can immediately say, oh, that's what is happening there, right? So it should be user-friendly in that sense. And whether there is a dynamic to it: if I need to drill down on a place, go deeper in one area, or if I want to change the distance function on the fly, for example, right? So that it changes the visualization quickly. If those kinds of things are there, that is what we would call a good visualization. But it all depends on the right choice of distance function, which can change over time, depending on the application. And on the completeness of the data, as in, have you sampled the entire space, or are you only able to look at one part of the space, right? Don't miss anything, and show me what is relevant. Those two things. Yes? Sometimes it happens that visualizing data actually leads to ambiguity in understanding the answer to a question, as opposed to numbers. Very good. So my understanding from doing all this is that ambiguity is not inherent in the data. It is inherent in the limitation of the number of variables we have collected. So say there is a phenomenon that is happening because of some variable which we are not collecting. Okay?
Then however we look at the visualization, we are not able to find out why these things are different. Right? Two things will look very similar to each other, because the one thing that really mattered in distinguishing the two people was not present in the data. So in that case we need to understand the completeness of the features we are collecting. Or maybe you are using the wrong distance function. Right? We talked about PCA versus Fisher, right? So if you are not projecting it correctly, you may not see the right things in the data. So therefore it's an art and a science. It's an art. It really is. I mean, you want tools that can give you multiple perspectives, tools where you can change the distance function, where you can add new features. You know, you can work on a very large amount of data. You can sample it in a different way than the others. But the source of ambiguity and noise, I think, is never in the process that generated the data. It is always in the limited number of features we have collected and the smaller perspective with which we are looking at things. So you provide extra features? Collect more, yeah, additional features, and do that. Right? For example, if you are looking at people's buying behavior and you are only looking at, you know, what they buy, that may give you only a limited perspective. But if you start collecting their demographics, right? Where do they live? How far do they live from the store? How many people are there in the family? Then you start with an additional input which can give you a better visualization. So it's always about this. And I think if you look at the history of science, this is how science has evolved. It started by looking at a limited part of the universe, right? The one that we can touch and feel. So we thought we knew everything about it. We thought these were the laws of physics. When we started looking at space or the quantum world, we realized those laws don't apply.
So it's not like this theory was incomplete. It was complete with respect to the data that we saw. When we started looking at this data and that data, now we say, okay, we need a bigger thing. So it's always like that. We are running behind a truth which lives in a much higher-dimensional space, and we have a very narrow view of it. And that's why we see things which look ambiguous or noisy. And I believe that probability theory was really invented to cover up this limitation, right? That we are not able to completely understand that. That's why the CERN laboratory is trying to understand the Higgs boson by collecting so much data on it that they don't miss any perspective. Right? So that's a good question. Okay? Any other questions? Yes? We know that we have a lot of previous data. There's no such thing as too much data, but yeah, I think, yeah, show me. And the trick depends on sampling, really. I mean, I don't think today we have to worry about too little data. We have to really worry about too much data. And in that space, we need to worry about my sampling strategy. It needs to be correct. If I am going to do density-based sampling, I'm only going to look at data which is in the dense clouds. I'm not going to see the outliers, right? Like in the banking example, most of the people are in the middle-income group, and if I sample based on that probability, most of my sample will come from there and I'm not going to cover the complete space. So sampling is one direction of completeness of visualization: am I sampling so that the distributions are honored? And the other is the dimensions, the features. If I do both of these things correctly, holistically, then the third thing is the distance function: once we have the samples and the features, how do we define it? So if you do these three things right, you get a good result. That's a good question. Is there an exhaustive list of different visualizations that are possible, and can they be mapped to a type of data? And if yes, why is it so hard?
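The sampling concern raised above can be sketched as follows: a plain random sample is dominated by the dense middle group, while drawing a fixed number from every group keeps the sparse regions visible. The population, the group sizes, and both helper functions are invented for illustration, and equal-per-group sampling is only one possible strategy, not something the talk prescribes.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Made-up population: most customers sit in the middle-income group.
population = (
    [("low", i) for i in range(50)]
    + [("middle", i) for i in range(900)]
    + [("high", i) for i in range(50)]
)

def proportional_sample(items, n):
    """Plain random sample: dense groups dominate the result."""
    return rng.sample(items, n)

def stratified_sample(items, per_group):
    """Draw the same number from every group so sparse regions survive."""
    groups = {}
    for group, value in items:
        groups.setdefault(group, []).append((group, value))
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample

plain = Counter(g for g, _ in proportional_sample(population, 30))
strat = Counter(g for g, _ in stratified_sample(population, 10))
print(plain)  # mostly "middle"
print(strat)  # ten from each group
```

If the stratified sample is then used to estimate overall proportions, each group would need to be weighted back to its true share, which is one way of keeping the distributions "honored" while still seeing the outliers.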
Because there are good data ideas, right? Even though you have a client-based data set, you have three different ways of visualizing it. If I don't, if I can't, you just have some sort of the software. Like, let us have a look, but they don't recommend a space of this data for 32. So is there an exhaustive list, or are there still ways of visualizing data that are yet to be found? See, it's a very creative field, right? So the more creative the field is, the more options you have, right? And yes, of course, you know, if you give the same data to different designers, they'll come up with different visualizations. Even if you give exactly the same data, same samples, same features, same distance function, they'll still come up with different visualizations, right? So it's a creative process. It need not be constrained to five boxes. But yes, there are lots of visualization techniques for different types of data. See, when does something become an art? That is the question we should ask. When does something become an art? And when does it become, you know, science, right? So there are all these algorithms, right? Clustering, let's say, projection algorithms. Where it's an art is when you have to decide what the distance function should be, right? How do you tie that to your business problem? No machine learning or visualization method will tell you what to do. You see, it's like a carpenter who knows all the tools. But there is a necessary element of creativity, right? He can still create so many different things out of the same tools. So that part is the art, the talent of the carpenter. So that's what is there. And I think, you know, you can think of it in terms of how you choose your parameters. Like, how many clusters do you want to see? In our visualizations, you saw the text visualization. I used a different number of clusters in different places. How do you choose that, right? So that is one example of an art.
You would try different things and see what makes sense. Too high a resolution, too fine-grained, and people are not going to be able to consume it. Too coarse, and it is losing too much information. So you make a decision on that. These are the kinds of things that make it an art. It's not just about color scheme and this and that. It's about choices of parameters that help the visualization come to life. So yeah, there is an exhaustive list. I'm sure, you know, there are faculties and different things. There are books written on visualization. There are courses. And now there are departments on visualization only. So I'm sure there's an exhaustive list. We can find some there. But you come in as a carpenter who knows what visualization to use for what data and what business model. That's where the art is. Yes. In reality, the space is spherical. But we can't show that on a 2D map; otherwise we would have needed a dynamic thing. Imagine a sphere, and you are inside it, and I could rotate it. That would have been the right way to do that. Is it something similar to what they do with maps? Yes. So yeah, that would be the next evolution of this visualization. But the space is like that. Actually, in reality, what we see is that space. But basically, like a map, when you have to put it in a textbook, you show a 2D version. But when you have a globe, you can show it there. So it depends on where you want to show that visualization. So what is looking for? I can't do this right now. But how do you look for features and the right features? Which is supposed to be around two edges. Right. Yeah, I can't imagine. But you can just turn the thing and say, if I want to show you the India part of the globe, I just have to turn that part to the left side. See, the mapping is an artifact of the real space. But that is the limitation of the 2D visualization: unless we are in a spherical 3D virtual reality world, we have to do it that way. So you just turn it around and say, show me the map.
And that's how it is with the globe, you can see that. It's the same thing, but your point is right. It's a spherical space. So does that mean that the extreme left point is next to the extreme right point? Yes. So you guys are excited about visualization now? Start using it as your data understanding tool. Don't just start by downloading SVM software and putting the data through it. That's not the way data mining happens. Unless you understand something in the data, you can't be a good data scientist. All right, thank you.