 So I'm gonna be talking about networks. Now, networks are pretty much the most generic form of data structure. I mean, you've got entities, you've got relationships, and you can move them across one another. It's pretty much the very basis of any kind of data structure. Now, this can be represented in a variety of forms, both data structure-wise as well as visually. And what I'm gonna be talking about is some ways of representing network data, except that these will be networks that are fairly tailored to specific contexts. This is not a talk about technique. This is not a talk about what tools to use. This is not a talk about any product. This is a talk about what one can do with network data of variety of forms. And to help you identify what kinds of forms of networks are useful for what kinds of analysis. It's more like a, let's see what can be done with networks kind of talk. I'm gonna be focusing on four types of network representations. There's the classic grid or table form. Ultimately, a table is a network. You take each cell. Each cell has two attributes, a row and a column, and that's still a network. Or you could represent it as a hierarchy, parent-child relationships, that's a network. You could represent it as a flow, an entity flowing from one to another to another, or a force-directed layout. Now, across these types of representations, I'm gonna be giving you a few examples. I don't know how much we'll have time for, but we just go through them as they come along. And see what we can do. The first one that I'm gonna be showing you is a piece that's very close to my heart. I'm a huge movie buff. Sometime in the early 2000s, I decided that I'd start seeing every single one of the internet movie database top 250 movies. So 2001 to 2005 was a slightly bad time. I managed to cover only about 50 of them. 2005 onwards was a slightly better time. Within a year, I managed to get to about 100. And then 120 in the next year and so on. 
And during those days, my movie-watching record was about 190-plus or so per year, on average, between 2001 and 2011, 2011 being when I joined my present job. And since then, I've watched about a dozen movies or so. That's pretty much it. Which is not particularly good. But the thing is, the Internet Movie Database sort of runs out fairly quickly beyond a certain point. I'm now at 175 and I've been at 175 for the last four years. Not because there aren't movies to watch, just that the movies now fall into the category of foreign language movies, which I have trouble with. Or horror movies, which I have even more trouble with. So I said, let's do something. Let's just see, based on more data that is available on IMDb, if we can get some information. So what you have here is the top 10,000 movies on the Internet Movie Database. Each of these boxes represents a set of movies. And let's take some box that has multiple names. Let's take any one of them. Let's take that one, for example. Now, this is a box that contains all the movies with an average rating of 7.4 plus or minus 0.1, and a vote count between 50,000 and 51,000. The horizontal axis of the grid shows you how popular a movie is. So something that's to the right is very popular. Something that's on the top is very highly rated. You can see that by and large, there are lots of popular, highly rated movies. So if I take something that's near the top right, sorry, I can't see this from here. What is this? The Lord of the Rings, right up there. And the "seen" indicates that I've seen it. So this is my tracker. The chunks in green are where I've seen most of the movies in that block. The chunks in red are where I haven't seen most of the movies in that block. This is publicly available, but I'm not gonna give you this URL. It's on App Engine. The quota limits are crazy. I have, thankfully, a set of very faithful users. But unfortunately, they keep complaining about the quota limits. 
I don't intend to pay for others' movie watching experience. So I'm gonna keep this URL to myself. But what this tells me is that this bunch is among the movies that I have to see next. The Thing. Unfortunately, that's a horror movie, so I'm gonna skip that. Then there is Hotel Rwanda, which I have half seen, but not quite managed to complete. So that's my movie tracker. Ultimately, all you have here is two attributes of data, which is what is the movie's rating and what is the movie's number of votes, grouped together to form a matrix. And that is a useful representation in itself. So I had created this, was playing around with it for a few months, opened it up to the public, and then all of a sudden there was a mail from the CTO of Amazon, who said, look, I've seen this site, I'd like you to come over to our office. I was in London at that time. Okay, fine, I was not particularly gung-ho about Amazon, but whatever. It turned out to be the IMDb office, which perked up my interest. I said, okay, fine, let's go to Bristol and have a look. And then the day before the meeting, they sent an email saying, Col Needham, the founder of IMDb, is gonna be there. Okay, that changes things considerably. So we spent a good hour with Col. I thought I was a movie buff, okay? I've managed five movies a day on a number of occasions, typically on long flights to the US. You see four in the flight, and then you struggle to the motel and see one more at the end and finish the day with five movies. And you've got to pace it as well. The movie at the end has got to be a nice light action movie, a Die Hard type, if you will, or an animation movie. You can't watch Moulin Rouge at that time or whatever. But Col does eight movies a day regularly. He's not only finished the IMDb top 250, he's finished well above the IMDb top 1,000 and is now working his way down the list. And he says, I've got 8,000 DVDs sitting at home, most of them unwatched. 
So okay, that's passion for movies, what? But anyway, that is also a tabular representation as well as a network representation. Let's take another topic that I'm very fond of, which is education. See, if you take every single student's marks across a variety of subjects and say that I'm gonna correlate these, I'm gonna take every pair of subjects and look at the correlation between this pair of subjects and treat that as a distance metric or as a correlation metric. Put that as a table. Now I effectively have a network. I have a relationship between every pair of subjects. Now some of these numbers may not be significant in the sense that there are probably only about 100 students out of 100,000 who've taken a combination of economics and botany. So in which case you say that's not large enough a sample, you skip it. But the rest of these numbers are showing you the correlation between every pair of subjects and grouped in a reasonably intuitive way. So let me explain. Physics has a fairly high correlation with chemistry. Okay, that makes sense. Reasonably high correlation with biology, with mathematics. It's almost like there's this block of subjects, physics, chemistry, biology, mathematics, and computer science, all of which have a 0.8 correlation with each other, which means that you do well in any one of these subjects, you tend to do well in all of the others. It's almost like the sciences have something in common within them that if you get it right, you will do well. And if you don't get it right, you will not do well. Whereas if you look at the other subjects, let's take English, zoology, botany, accountancy. Now these have relatively lower correlations between themselves, indicating that if you did, let's say accountancy really well, there's not that much of a guarantee that you will do well in computer science, nor is there that much of a guarantee you will do very well in English, or in history, or any of these. 
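The subject-by-subject table described here can be sketched in a few lines. This is only a minimal illustration, not the code behind the visual shown in the talk: the `marks` data is made up, and `MIN_STUDENTS` is an assumed cutoff standing in for the "not large enough a sample, you skip it" rule.

```python
from itertools import combinations
from math import sqrt

# Assumed cutoff for "not large enough a sample"; the talk does not state one.
MIN_STUDENTS = 3


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


def subject_correlations(marks, min_students=MIN_STUDENTS):
    """marks maps student -> {subject: mark}. Returns {(subject_a, subject_b): r}.

    Pairs taken together by fewer than min_students students get None,
    which corresponds to a blank cell in the table.
    """
    subjects = sorted({subject for m in marks.values() for subject in m})
    result = {}
    for a, b in combinations(subjects, 2):
        pairs = [(m[a], m[b]) for m in marks.values() if a in m and b in m]
        if len(pairs) < min_students:
            result[(a, b)] = None
            continue
        xs, ys = zip(*pairs)
        result[(a, b)] = pearson(xs, ys)
    return result


# Hypothetical marks, purely for illustration.
marks = {
    "s1": {"physics": 90, "chemistry": 88, "english": 60},
    "s2": {"physics": 70, "chemistry": 72, "english": 85},
    "s3": {"physics": 50, "chemistry": 55, "english": 70},
    "s4": {"physics": 80, "chemistry": 79, "botany": 65},
}
corr = subject_correlations(marks)
```

Each key of `corr` is one cell of the matrix, so the same dictionary can also be read as a network: subjects are the nodes and the correlation is the weight on the edge between them.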
Each of these subjects is different and different in its own way. Economics needs a skill set which is slightly different from the skills that you will need for commerce, which is different from the skill set that you will need for history. Whereas the sciences seem to be a bit more of a monolithic thing. So effort-wise, if a student were trying to optimize marks and wanted to (a) choose a group in class 11, or (b) do well, then I would argue that it may make more sense to go for science. At least there's only one set of concepts that you've got to learn, and it's a hit or miss thing. You get it wrong, it's all gone. Or, you know, spread your time across a variety of subjects and try and learn a different set of things, in which case you're less risk prone, but still you're unlikely to become a topper. So it depends on your ambition. Very ambitious, highly, you know, risk seeking, go for science. Less ambitious, wanna play it safe, don't go for science. Let's take money. The same principle can be applied to practically anything. In this particular case, what we're doing is taking a series of currencies and commodities like gold and silver and so on. And each of these cells represents the pairwise correlation between two currencies or securities or whatever. Let's take an example. So I will take the Indian rupee and the New Zealand dollar. Now these two have an 89% correlation. That scatterplot shows you every single day for the last six months, actually seven months, starting from the first of January to the first of July. Six months. Every single dot is one day, and that's the best possible line going through them. 89% correlation is fairly good, which means that if you happen to have New Zealand dollars, then there's not much point holding onto them. You may as well just convert them to Indian rupees. When one goes up, the other goes up. Not too much of a difference. Let's take these two. Now these two seem to be fairly negative. 
There is this block at the bottom right, the Australian dollar, Philippine peso, Brazilian real, all of these. By and large, they move with each other. When any one of these goes up, all of the others go up. So Singapore dollar and Indian rupee, well, 74% correlation, not too bad. But the IXIC, which is the NASDAQ, and the S&P. Now that's a different story altogether. When the Indian rupee goes up, these two go down. In fact, when many of these currencies in this block go up, those go down. So if you are hedging, let's put it this way, if you wanted a decent hedge against the Indian rupee, walk up and down this column and see what's your best bet. At the moment, the Chinese yuan is the best bet. So if your business is in India and you're looking for another country to trade with, and your objective is to reduce currency risk, China is at the moment your best bet. Though putting your money in the S&P or the NASDAQ is not too bad either. Right on top, there's the ILS, which I think is the Israeli currency, and ISK, which I think is Icelandic. They're not too bad a hedge either, but not as good as any of these. Again, all you're taking is pairwise correlations between a set of numbers and looking at this in a matrix form. Let's take another example. I wanna do this. Maybe I do or maybe I don't. I'm gonna tell you what that is, but I'm not gonna show it to you because it's hopefully gonna hit the press in a few weeks. What we did was took all of India Today's text, every single issue from 1975 to a week before last, and said, let's see how the names are correlated. When an issue mentions one person, does it also mention another person? Are there certain people that are covered more together? For example, what's your guess? Do you think Raj Kapoor has been covered more by India Today or Amitabh Bachchan? Raj Kapoor? Amitabh Bachchan? Okay, that's how many people are wrong. 
There are significantly more mentions of Raj Kapoor than there are of Amitabh Bachchan, believe it or not. Who's mentioned together? Sanjay Gandhi and Indira Gandhi or Rajiv Gandhi and Indira Gandhi. Sanjay Gandhi and Indira Gandhi, okay. Rajiv Gandhi and Indira Gandhi. Okay, in this case, the majority is right. Sanjay Gandhi and Indira Gandhi are mentioned reasonably often. Interestingly, Rajiv Gandhi and Indira Gandhi are mentioned fairly often, but Sanjay Gandhi and Indira Gandhi and Rajiv Gandhi and Indira Gandhi are not mentioned that often. Now all of this gets even more confused because Rahul Gandhi also starts with Ra. So when I start with Ra, I don't quite know which way to go. And he's being mentioned fairly often as well. All of which is saying that I can take data and create the relationship between two entities, in this case a correlation or some kind of a metric, and say that I can figure out interesting things out of that. I can figure out, for instance, if Chandrasekhar is a supporter of the Gandhi family or not, or at least, is he mentioned more often with the Gandhi family or not? So that is the tabular set of representations. But the other set of representations that are reasonably interesting are the hierarchical representations, where you say the relationships are not between every pair of A and B. The relationships are more like a hierarchy: there's a boss who has subordinates and so on, or there is a country and there are states, or whatever hierarchy you have. Now, while I'm gonna show you these, I'm also gonna show you something else. Now, the last few visuals that I showed you were web pages. The last two times I was here on this stage, I'd been praising Excel a fair bit. Some of you may know that I'm a huge fan of Excel. 
Today, however, I'm gonna be talking about PowerPoint and I'm gonna be singing PowerPoint's praises. Before you shoot me, let me show you what it can actually do. Let's take this data set. We're working with an ISP and I've taken some anonymized data from there. This is a representation of how popular different types of sites are. In other words, what do people browse on their mobiles? That's the question. And let's open this and see what we have. Okay, so the size of the bubbles represents the categories that they browse. The horizontal axis represents the number of pages each visitor visits and the vertical axis represents how large the pages are. And what do you see on the top right? For those of you who can't read at the back, that's pornography. Fine, but the largest category where users are spending their time is content servers. But let's hold off on that. Let's start exploring this a little bit. Let's see what is happening in the pornography category. Each of those things that shot out from there is one site. So this, for instance, is a site that has the largest number of pages per visitor and the largest volume of traffic, or the largest size per page. Sorry, I can't see it. You'll probably be able to see it better. xvideos.com. Very popular, 5,600-odd visitors. The numbers are anonymized. But that's roughly how many visitors there were, say, last week, if you will. Whereas if you take the social networks, yeah, let's take social networks. Now, Facebook is the big daddy, but it's not too large in terms of number of pages or the volume of downloads. What's larger is, let me close this one. Let's go there. That one seems to be larger in terms of number of pages. What's that? Peperonity, never heard of it. Anyone heard of it? If you're using your mobile, just remember that a lot of people are spending a lot of time visiting Peperonity. 
Now, they may also be visiting a lot of pages simply because it's very poorly designed, so you have to click through 10 pages to see what you actually want to see. But at least this is some metric of engagement. Let's take search engines. Yes, the answer there is obvious. That's Google, somewhat heavier than many of the other search engines. But this one seems to have even more, okay, that's Google again. So Google's pages are pretty heavy on a per page basis, which may make sense from one perspective: you're trying to reduce the number of pages visited and therefore reduce the latency. Let's look at what technology sites are being visited. Is that technology? Hardware and electronics, technology sites. It's not expert technology, it's expert. Something else, what is this? Uncategorized, no content, mobile phones. Web-based email, let's see what that was like. So something popped out to the right extreme. Now that's something that people are visiting and spending a lot of time on, or at least pages on. TataPower.com, okay, that beats me. Maybe they have their own web-based email system, but they're sure answering a lot of emails. So if you don't get email from Tata Power, just think about it. You may wonder what they're doing with all of their email-based systems. But all of this, incidentally, just in case you didn't believe it, is just PowerPoint. And it doesn't have to be 2013, it could be 2007 for all you care. While we are on PowerPoint, let's take another example of stuff that can be done with it. And while we're doing this, I'm also gonna show you, no, actually, sorry, that was a PowerPoint. I'm gonna show you more examples of what can be done with hierarchical representations. Let's take that one and download it. See, earlier, the hierarchy was simply that there is a category and there are sites, and you broke from the top level to the next level. 
Now it doesn't have to be just two levels, it could be multiple levels, and here's one way of exploring it. So this is anonymized data for an FMCG company who said, we've got so much sales worldwide, this is not the actual number, which is broken up by country, so there's so much sales happening in UK, little less in Japan, little less in China, little less in India, of which so much is coming from stores, so much is coming from our partners, so much are coming from the direct channel and so on, within which these are the various products. And that's what we are planning to do with each of these products. Either we are planning to accelerate the growth of this product because it's doing extremely well, or we are planning to catch up with the rest of the market, or it's de-growing, so we need to turn it around one way or the other. Now that is a hierarchy, that's a reasonably straightforward representation, but drill down is something that certainly helps as well. So if I wanted to see, for example, what exactly is happening in, let's say, let's take Japan, you've got all of these products doing relatively bad. So why is this product, or how is this product doing in general? Is it just doing bad in Japan? So that's the breakup for this particular product. It's sort of doing okay, so it's doing pretty bad in two segments in Japan, but it is doing okay in one segment, whereas in the US it's doing bad right across, and in the UK it's, again, a mixed bag. Okay, that's the case in Japan, is it just this product? Well, barring a few bright spots, most of the products are doing pretty bad. Okay, what about specific segments? If I look at the store segment, how does that span across country? Or so on, you can click on any one of these products or go blah, blah, blah, and get to see how this works. 
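The drill-down described here amounts to summing one measure at every level of a hierarchy, then listing the children of whichever node was clicked. A minimal sketch follows, assuming flat rows with hypothetical `country`, `channel`, `product` and `sales` fields; the figures are invented, and this is not the product that generated the slides.

```python
from collections import defaultdict


def roll_up(rows, levels, measure="sales"):
    """Sum the measure for every prefix of the hierarchy.

    The key () is the worldwide total, ("Japan",) is one country,
    ("Japan", "Stores") is one channel within it, and so on.
    """
    totals = defaultdict(float)
    for row in rows:
        for depth in range(len(levels) + 1):
            path = tuple(row[level] for level in levels[:depth])
            totals[path] += row[measure]
    return dict(totals)


def drill_down(totals, path):
    """Return {child: total} for the nodes directly below the given path."""
    path = tuple(path)
    return {
        key[-1]: value
        for key, value in totals.items()
        if len(key) == len(path) + 1 and key[:-1] == path
    }


# Hypothetical, anonymized-style rows purely for illustration.
rows = [
    {"country": "UK", "channel": "Stores", "product": "A", "sales": 40.0},
    {"country": "UK", "channel": "Partners", "product": "B", "sales": 25.0},
    {"country": "Japan", "channel": "Stores", "product": "A", "sales": 30.0},
    {"country": "Japan", "channel": "Direct", "product": "A", "sales": 10.0},
    {"country": "Japan", "channel": "Direct", "product": "B", "sales": 5.0},
]

by_country = roll_up(rows, ["country", "channel", "product"])
japan_channels = drill_down(by_country, ["Japan"])

# Re-ordering the levels answers "how is this product doing in general?"
by_product = roll_up(rows, ["product", "country", "channel"])
product_a_countries = drill_down(by_product, ["A"])
```

Changing the order of `levels` is what lets the same rows be explored country-first, product-first or segment-first, which is the kind of pivoting the walkthrough does by clicking.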
This is, again, another hierarchical representation, but here the hierarchy happens to be radial, if you will. You still have something at the top which is worldwide, within that you have countries, within that you have this, and so on. And with this what you're able to see is roughly what the structure of this data is. Let's take another example, which is the foreign currency donations received by organizations in India. So the Ministry of Home Affairs has this lovely website where any organization that receives foreign currency donations worth more than one crore is listed. This is fcraonline.nic.in, if I'm not mistaken. And what they do is have incredible amounts of data on every single NGO: who gave them money, how much did they give them, what is the address of the person that gave them that money, what is that money going to be used for, you name it. So we can track a fair bit. And what this shows in a treemap is, okay, the size of the box is the number of rupees that each of these states gets. So Delhi is getting the maximum by way of foreign currency donations, or at least organizations that are registered in Delhi are getting the maximum amount of money, followed by Tamil Nadu, followed by Andhra Pradesh, and so on. And there's not that much money going to Himachal Pradesh, nor to Meghalaya, a reasonable amount to Punjab, and so on. That's sort of the total volume of money that came to India in 2012. Now, the color here is the growth. So the greener it is, the better it is. So Assam had the highest growth. And one can start drilling down into this. So why did Assam have higher growth? That's because, to a good extent, there's this one particular NGO, the Don Bosco Society, which grew by 400-odd per cent to get to donations of about 23 crores. Okay, fine. So who gave them this money? Let's click on it and find out. 
That is the FCRA website, where they have information on what type of university it is, what's the purpose for which they're putting in the money. So all of the money that they're using is for activities other than those mentioned above. It's interesting to know that there's nothing mentioned above. But other than educational society, so whatever it is they're doing there, it's not for this. But who's giving them the money? Let's look at the donors. So half of it is coming from, I will not pronounce the name, from someone in Vienna for the construction and maintenance of schools and colleges. That's a hefty sum. Now that's in fact four and a half crores. No, that may not be the largest amount. This looks slightly larger. 13 crores came from Don Bosco itself, from one. So yeah, that's their main funding. And the bulk of the funding seems to be of this nature. Does anyone have any specific organization whose funding you'd like to see? CIS? CIS, do you think I'll find it? I mean, let's find out. Let's find out. Center for, please tell me if I'm typing it right. Internet, nope, sorry. You're not above the, sorry. Oh, it doesn't matter, it'll search. Let me just search for internet. It'll find all the app. Internet, no, there's no internet funding. But let's look for, I don't know, something religious, Christ. Yeah, there's lots of funding there. Let's look for something that's, Sri. Yeah, there's some amount of funding there. Actually, that could be spelt in two ways, S-H-R-I or S-R-I. Yeah, there's some amount of funding there, but probably not as much as Christian missions. Who's getting all the money in Delhi? The largest is, sorry, that was the second largest. Who's that in? Okay, somebody religious. Public Health Foundation of India, and they grew a fair bit. Aga Khan Foundation, again, large and growing. In AP, it's the Rural Development Trust. In Karnataka, it's primarily, sorry, I just can't read this from here. ActionAid, okay. And so on. 
You get a sense of the hierarchy here again, right? You see which states are getting the money and drill down into which are the organizations. One can then further drill down into who the donors are. I'm not gonna show you that drill down just yet. And then among the donors, what are the purposes they're donating for? So here's another way of looking at networks as a hierarchy: start at a certain level, drill down, drill down, drill down, blah, blah, blah. Let's look at, okay, I'm tired of networks as a hierarchy. Let's look at slightly different ways of looking at networks. Let's look at networks as a flow. So this is, when it comes through, if it comes through, okay, not coming through. Let's try something else. We have a problem. Let's try that. We have something coming through. Okay. So we were working with, let me pause this for a second, working with a beverage company. And they were looking at fraud in terms of tea auctions. They said, look, we've got a huge supply chain. We get tea from the estates, and then there are a bunch of people that pick it up. There are brokers. Then that comes from there. We get it via a transporter who brings it to one of our plants and so on. So we're playing around with the data and trying to see where exactly the fraud is happening. Now, what kind of data do we have? We have every single lot of tea, and we know the entire cycle that it's been through. We said, okay, that looks like a network. Let's represent it as a flow diagram. But before that, let's first see where the fraud is concentrated. So if I take every single one of the plants that are receiving it, then the size of each of these plants is the volume of tea that these plants are processing. And the color of the plant is the percentage of adulteration. So that plant's got a fair bit of adulteration. That plant's got slightly less adulteration and so on. These are the transporters that are pushing in these lots. 
So there's one particular transporter who's fairly adulterated. The rest are reasonably clean. And there are different sets of people that are collecting the tea from various sources. A few of them are somewhat adulterated, but then look at the brokers' side. There are a bunch of brokers whose tea is completely adulterated, which appears to be the source of the problem. Now let's look at how this is flowing across these. So if I take each of these brokers and look at the volume of tea that's flowing across, let's take any one of them. In fact, let's not take the broker. Let's take this one. So all of the tea that's coming from this particular person, who's collecting it from the warehouse, is going to one particular transporter. This one is again going to a single transporter. Some of them are broken up though. So this particular agent is shipping across multiple transporters. Now this transporter is shipping to multiple plants, but notice something weird. Normally when you have something adulterated and you send it downstream, let's say you've got ink mixed in water and you send it through a pipe. When it comes out, you're not likely to see all of the ink on one side and the water on the other side. It's likely to be reasonably mixed up. But for some reason, when it comes to this supply chain, it gets more concentrated, not more diluted. Now, how likely is that? Unless there is some collusion happening between these two guys, wherein this person invariably gets all the bad lots or most of the bad lots and this person is sending the clean lots to other places where it won't get detected. Obviously I can't give you too much of the details, except to tell you that what came out of this visualization was a set of arrests. 
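The ink-in-water argument can be turned into a simple check: if adulterated lots were mixing normally, each outgoing link from a transporter should carry roughly the same adulteration rate as that transporter's overall volume. Here is a rough sketch of that idea only, not the analysis actually run for the beverage company; the lot records, the field names and the `min_gap` threshold are all assumptions.

```python
from collections import defaultdict


def adulteration_rates(lots, key):
    """Share of adulterated volume for each group returned by key(lot)."""
    total = defaultdict(float)
    bad = defaultdict(float)
    for lot in lots:
        group = key(lot)
        total[group] += lot["kg"]
        if lot["adulterated"]:
            bad[group] += lot["kg"]
    return {group: bad[group] / total[group] for group in total}


def concentrated_edges(lots, source, target, min_gap=0.3):
    """Links whose adulteration rate exceeds their source's overall rate.

    min_gap is a hypothetical threshold: how much higher the link's rate
    must be than the source node's rate before it is flagged.
    """
    node_rate = adulteration_rates(lots, lambda lot: lot[source])
    edge_rate = adulteration_rates(lots, lambda lot: (lot[source], lot[target]))
    return {
        edge: rate
        for edge, rate in edge_rate.items()
        if rate - node_rate[edge[0]] >= min_gap
    }


# Hypothetical lots: T1 sends every bad lot to P1, T2 mixes evenly.
lots = [
    {"transporter": "T1", "plant": "P1", "kg": 100, "adulterated": True},
    {"transporter": "T1", "plant": "P2", "kg": 100, "adulterated": False},
    {"transporter": "T1", "plant": "P3", "kg": 100, "adulterated": False},
    {"transporter": "T2", "plant": "P1", "kg": 50, "adulterated": True},
    {"transporter": "T2", "plant": "P1", "kg": 50, "adulterated": False},
    {"transporter": "T2", "plant": "P2", "kg": 50, "adulterated": True},
    {"transporter": "T2", "plant": "P2", "kg": 50, "adulterated": False},
]
suspicious = concentrated_edges(lots, "transporter", "plant")
```

In this toy data only the T1 to P1 link is flagged, because a third of T1's tea is adulterated overall but everything it sends to P1 is. A flag like this is a prompt to look closer, not proof of collusion.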
Turns out that three of the people that were collecting the tea from the auction houses were in fact brothers, and they were running a group of companies under various names and were doing this sort of a thing. Which brings us to the idea that this could be a mechanism for detecting corporate fraud of various kinds. For example, let us take a group that hasn't had too much by way of controversy, which I would say is the Tata group. And see if we can figure out the structure of this group, assuming it loads. It is a slightly, ah, there we are. So those are the companies and people in the Tata group, ever in the history of the Tata group. Every circle here that is in orange is a person. Every circle here that's in blue is a company. The size represents the number of connections that they have. So if they are a director of a company, then they are connected to that particular company. So let's keep it simple for now. Let's just take some of the larger companies and people. So if you take the top few directors, and let me just increase the, you know, distance a bit so that you can see it clearly. So there's, who's that? Ratan N. Tata, okay, so that's Ratan Tata. That's Ishaat Hussain, is it? Okay. That's Ratan Tata again. No, Farrokh Kavarana, okay. So those are the people that have had the maximum number of directorships in the Tata group, and these are the companies that they belong to, or that they've had a directorship position of. Let's relax this a little bit and bring in a few more people. Not that many. Let's get to something around this size. Now you notice something interesting. The Tata group seems to be breaking up into two sets of people. It's almost like, okay, not quite two sets of people. Wish I could control it from here. So we're around there. You're able to see it a little more clearly. Almost like there are two camps within the Tata group. Let's not get into the names here. This will be in public at some point. 
People who control one set of companies, people who control another set of companies. And then there's one company right out there in the corner which is exclusively the domain of this particular gang, which is what company is that. That's Tata Motors. So Tata Motors seems to be the hub for one group of people that they are rallying around. If there was a split in the Tata group, then my bet is it would happen along this axis. All of these people would move to one side and all of these people would move to other side simply based on the corporate structure that there is. And this is yet another way of looking at networks which is just a force-directed layout. What is the distance between any two metrics? In fact, one can do that for a community like this of, I'm assuming, a community of geeks. If you look at, for instance, Github and see, and see what structure of Bangalore looks like. So that's every single person, as of two weeks ago, on Github whose location is Bangalore. The size of the circles, the number of followers that they have, the color of the circles, the language that they program, and you'll see the legend at the top. So the blues are the JavaScript programmers, that's the majority, followed by Ruby programmers, Java programmers, Python programmers and so on. You can see that there's a reasonable network, a bunch of people that are connected to one another, though there are, of course, isolated people. Now there's one person out here. I don't know how closely you folks follow Github, I don't. But there's one guy who seems to be, the size incidentally is the number of followers that they have. So this person doesn't have too many followers. But he sure is connected to practically everybody in Bangalore. He's just decided that I will follow every single person in Bangalore. Let's see who that is. I hope, I'm not embarrassing somebody in the audience. 
But that's Jagdish Singh R, who has zero repositories and two followers, but definitely is interested in the rest of Bangalore. And then there's Kiran, who's out here, one of the biggies. Who's that? That's me. There's, okay, sorry, I can't read from here. Vinay Raikar, that is Sudevar. Where's Anand? He's the biggest. Anand Chitipothu, wherever he is out here. In real life he's out here. And he's, sorry, where's the biggest circle? Oh, okay, that one? No, that one? Yeah, he's one of those big red circles out there. In fact, the biggest red circle. But let's, in fact, since all of these red circles are Python developers, let's get to the network of Python developers and see what the Python community looks like, not just in Bangalore, but across the entire country. So now the color scheme is slightly different. We have the city as the color, and the size is still the same. So that's the Python community. Let's make the network go a little further apart. Increase the gap a little bit so you can see it more clearly. That's the Python community. Reasonable network, not quite so. There are a few scattered networks. So this incidentally is the Singapore Python network. So understandably, they're not too well connected. But even within India, I wouldn't be surprised if you find a few odd ones. Let's just take Python within Bangalore and see if we should be doing something different. So if I take the Python network here, it's mostly connected, but here are a few isolated spots. So somebody probably ought to reach out to these three guys beginning with Nandakishore and say, hey, join the gang. The city's well connected as it is. Let's improve it. You would not necessarily find this in many other places. If you were to take, for example, the NCR region too. So take all of Gurgaon and Noida and Delhi. So the color scheme is, the brownish, maroonish thingy is Delhi. The blues are Noida and the grays are Gurgaon. Not that much of a network, nowhere near strong enough. 
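Spotting "isolated spots" and judging how connected a city is comes down to finding the connected components of the follower graph. A minimal sketch under stated assumptions: the handles below are placeholders, follows are treated as undirected links, and nothing here calls the GitHub API.

```python
from collections import defaultdict, deque


def connected_components(people, follows):
    """Group people into connected components, largest first.

    follows is an iterable of (follower, followed) pairs; direction is
    ignored, so one follow in either direction links two people. Anyone
    who appears only in follows is still pulled into a component.
    """
    neighbours = defaultdict(set)
    for follower, followed in follows:
        neighbours[follower].add(followed)
        neighbours[followed].add(follower)

    seen = set()
    components = []
    for person in people:
        if person in seen:
            continue
        # Breadth-first search outward from this person.
        seen.add(person)
        queue = deque([person])
        component = []
        while queue:
            current = queue.popleft()
            component.append(current)
            for other in neighbours[current]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(sorted(component))
    return sorted(components, key=len, reverse=True)


# Placeholder handles, not real accounts.
people = ["a", "b", "c", "d", "e", "f"]
follows = [("a", "b"), ("b", "c"), ("d", "e")]
components = connected_components(people, follows)
isolated = [group[0] for group in components if len(group) == 1]
```

A city whose largest component covers nearly everyone is well connected in this sense, while a city made up mostly of one- and two-person components is not.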
In fact, the network exists only in Delhi. If I knock off Delhi, then Noida and Gurgaon, even put together, have barely any networks of more than two people. Not that connected. But then if you look at the connections across cities, let's take Bangalore and Singapore, for example. How are these two connected? As you would expect, these two cities are not too tightly integrated. Within themselves, they are great, solid networks. But across these cities, there are only a few people. In fact, I wouldn't be surprised if I found them in this audience. And if for any reason I shot them or the roof collapsed or something, Bangalore is disconnected from Singapore. Now we have a bit of a problem there. And this is despite Singapore being the second strongest city from a social network perspective on GitHub. The ties between these cities are just nowhere near strong enough. So that's another way of looking at networks, like I said, force-directed layouts. One could use this to see, for example, if different people are actually providing different credentials as names and so on, and a variety of other uses.
So to summarize, what I've done in the last 40 minutes or 35 minutes or so is show you different kinds of representations: matrices, hierarchies, flows, network diagrams. Like I said, I've not talked about tools, but I hope what you will take away is ways in which you can use this data and put it to real-life use. Some of these examples are publicly available on demo.grammar.com. Feel free to take a look at them. I'm open for a few minutes of Q&A right now.
How do you feel about chord diagrams for visualizing networks?
Chord diagrams work great as well. Chord diagrams are where you have a circular representation and, inside of that, you break it up into chunks and draw links from one chunk to another. They are perfect when you are looking at networks where the domain and the target are the same.
So for example, take migration patterns in India: if I have movement from a state to another state, you could have movement from a state to itself or from one state to another state. In that case, the domain is the states and the range, or the target, is also the states. In that case, chord diagrams work beautifully. Or if you want to see, in a shopping mall, which stalls people are moving from, or heck, from here to the other room, how are people moving? A chord diagram certainly makes a lot of sense then.
What do you use as a data source for the PowerPoint graphs? Like how do you supply the data? Is it XML or JSON?
CSV actually, for some of them, or a database. The question was, how do we provide a data source? A database or a CSV file, or it could be XML or JSON. The PowerPoints were generated in Python using our product. We built a product that does this sort of stuff. Without exception, all of the stuff that I showed you are things that we built on our product.
Okay. So it's not out of the box in PowerPoint?
It is not out of the box in PowerPoint, but PowerPoint can do a lot of stuff. No, it's not open. We're hoping to make a lot of money from it.
Is there a kind of product performance figures or something?
Okay, are there product performance figures? Yes, of course, we have a product, and therefore there are figures for performance. We'll take that offline. Okay. I don't want to promote that anymore.
Hello? Yeah, hi. You mentioned you did some analysis on India Today text. Where did you get that from?
Where did I pick that up from? From India Today. I was in Delhi a few days ago and they gave it to me. So, okay, we have a commercial engagement with India Today. The weird thing about being in a company that does data visualization is that, unlike a lot of the community which has to hunt for data, we actually get data and we are given money to take the data.
So in this particular case, recently we did their best colleges survey. And again, they gave us the data for the colleges and they gave us the money. They're now giving us all of their issues and some money and so on.
So, suggestions for freelancers on where to get the data?
Sorry, that wasn't too clear. So, where can a freelancer get this sort of data from? Historical archives are available for some sites. If I had time, I might have covered some media work that we're doing, but I may not have the luxury of time. Let's take one example. Is this election cartograms here? So for example, one of the pieces of work that we did was with Vijaya Karnataka, where all of these are full-page spreads of our visuals that came out in Kannada. No, I can't read them either. I don't read Kannada, but we managed to produce them. But if you wanted to get media content, now at least the images are publicly available. So I've been scraping the Times of India and the Economic Times images just to get a sense of how the color of these newspapers has changed over time, for example. So that is available. Which actually leads me to the more general part of the answer, which is that your easiest bet, if you're comfortable with it, is to scrape data. For a programmer, that may be the easiest. For a non-programmer, the easiest bet is to go to the source and ask them. You'll find that, more often than not, you're likely to get the data. RTI, if it's a government source, might be a pretty good route as well. One clever way of getting data that has come up in the recent past is to tie up with a company that conducts contests, like Kaggle does in the US. Crowd Analytics is one such organization that does that here. Now, when you put this under the umbrella of a contest, then companies, interestingly, are happy to give data and say, look, I'd like you to play around with this and see what you find. And they end up analyzing it reasonably well.
So with Crowd Analytics, or Kaggle for that matter, if you participate in the contests, you might get lucky and get the right kind of data.
Anand, do you mind if I use this kind of a laptop so that you can take questions meanwhile?
Sure.
Tools of choice for scraping data: until last year, it was exclusively Python. We have now added CasperJS, or specifically PhantomJS, to that choice of toolset. If you have data that does not involve weird JavaScript, and you're fine with effectively pure server-side scraping, Python is a fantastic toolset. If, however, it has these weird ASPX tags, like most government sites do, then PhantomJS, which is effectively a browser that does not have a GUI, is your best option. I have worked mainly with PhantomJS. Qt WebKit works just fine as well.
What is the database that you use to store this raw information and this raw data?
We just use CSV files by and large. But if it goes beyond that, we pick any relational database. No, not non-relational, usually because it's not required for the kind of volumes that we handle.
Right. And have you also considered sharing this donation information with people like Arvind Kejriwal or people like Julian Assange, who would probably thank you a lot for this?
Short answer is yes. In fact, I just found out a few days ago that Arvind's brother is my ex-colleague. And yeah, we will be talking. And it's not just Arvind; Narendra Modi and a bunch of other parties as well. See, political visualizations are a great space to work in. Partly because political parties have such fantastic information. I just learned that parties get to know, for each booth, what the polling percentage is for each candidate. Now, that's granular. For every single booth, I know what percentage of people have voted for them. Which means that in the next elections, I can attack those booths where the polling percentage is relatively lower, for example, because that's where I have the maximum room for swing.
Or I can look at the demographics of that region, try and map it to the overall... I mean, right now the census habitat information is extremely rich. So I can figure out what type of people live where, at a village level or a very low level, and start playing around with it. So yeah, it's a great field. We'll be doing it, hopefully.
Yeah. So the question is, what is our hit rate in getting insights out of data? Because we can't always be sure that we will get insights out of data. What we find is that there is roughly a certain probability of getting insights out of any one analysis. And if you are able to accelerate the process and, in the time that it takes to do one analysis, are able to do 10, you increase your probability of finding something interesting dramatically. So we work on optimizing the techniques to make it faster to do analysis rather than trying to improve the quality of analysis. Quantity beats quality in the long run. So we do lots of analyses, and each of these that I'm showing you is one among at least a dozen analyses that we've done, and we discarded the rest.
So on Python, I tend to use lxml, and requests for pulling the URL. On PhantomJS, I use CasperJS. The question was, what libraries specifically within Python and PhantomJS do I use?
Selenium?
Yeah. No, I tend not to use Selenium. Not because there's anything bad. I just don't have enough experience with it. There's Mechanize as well, which is an interface to Selenium, which is great.
Yeah, if you have more questions, please take them offline. Thank you, folks. Thanks, gentlemen.
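As a footnote to the hit-rate answer above, the "do ten analyses in the time of one" argument can be put into numbers. The sketch below assumes each analysis has the same chance of producing an insight, independently of the others. The 10% figure is purely hypothetical and was not quoted in the talk.

```python
def chance_of_at_least_one_insight(p_single, n_analyses):
    """Probability that at least one of n independent analyses finds something.

    Assumes every analysis has the same success probability p_single.
    """
    return 1 - (1 - p_single) ** n_analyses


p = 0.10  # hypothetical chance that a single analysis yields an insight
one = chance_of_at_least_one_insight(p, 1)
ten = chance_of_at_least_one_insight(p, 10)
print(f"1 analysis:   {one:.3f}")
print(f"10 analyses:  {ten:.3f}")
```

Under these assumptions, ten quick analyses find at least one insight roughly 65% of the time, against 10% for a single one, which is the sense in which quantity can beat quality.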