I'm going to start with the one thing that I'm hoping you will take away: data can tell some very interesting stories. I'm just going to speak up. Data can tell some very interesting stories, and some of these can be picture stories. I want to share some examples of what can be done with data. This talk is more about expanding your mind and showing you some new things. Some interesting stories come up purely out of practice. What I will not be doing is talking about the techniques behind this. There are a couple of other talks on how one can do things like this. This is just about examples of what can be done with data.
The picture that you see here is a map of the flights in the United States. If you saw the animation, you would have seen that the flights start from Europe, landing into the country while the rest of the country is dark. Then it wakes up and the local flights start. Then the flights come in on the Pacific coast, and the two streams meet. This animated visualisation was made entirely from data. The only source of every dot on this image was the flight position at that point in time. Nothing else was drawn.
Let me introduce myself. My name is Anand. I'm a data scientist at a data visualisation company. What I'll be showing you is how this works, through some examples of pictures.
One of the earliest examples of data visualisation is Florence Nightingale's. During the Crimean War, she looked at the number of people dying because of illnesses, which she depicted in blue, versus the number of people dying because of war injuries, either directly or indirectly. You can see the story that it tells. This was her petition to Queen Victoria, saying we need more hospital facilities. She got it. It turned out to be one of the wonderful things for England.
John Snow, at around the same time, prepared this visualisation to show where cholera was occurring. At that time it was not known that cholera is waterborne. He plotted where families were affected and the pump that they were drawing water from, and found a very strong relation. That established quite clearly that water was the source of the disease, and it ended up saving several lives. This was more than a century ago. If they could visualise data this effectively then, we with computers ought to be able to do better.
To be fair, we are doing quite an amazing job. This, for instance, is a visualisation of London. Every single dot here is drawn in an automated way. The red dots are places where people have taken photos and posted them on Flickr. The blue dots are where people have tweeted. You can see the business districts, which are somewhat bluish, and the tourist spots, which are reddish. You can see the layout of London. You can see Oxford Street. You can see the structure of the streets. All of this is without overlaying any map-based data. This is purely the data, visualised.
The question is, why visualise at all? Well, there are some things that a visualisation can tell you that you cannot see very clearly in the data. For example, take a look at this data set. It shows, for different cities over a year, the price and sales of a product. Our first guess would be that the average prices look the same: nine, nine, nine, nine. The average sales look the same. The variance is consistently 10 for the price, and the variance for sales is 3.75. Looking at the numbers, I can't quite tell much. Maybe I can figure out some pattern, but it will take me a while. Let's just plot it. The first thing I want to highlight is that, going by the summary, the four cities in this data set look completely identical. They're not.
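
The figures quoted above (a mean of 9, variances of 10 and 3.75) match Anscombe's quartet when the population variance is used. The sketch below assumes that this is the data set behind the four cities; the labels I to IV are Anscombe's, and which city corresponds to which set is not stated in the talk.

```python
from statistics import mean, pvariance

# Anscombe's quartet. Treating x as "price" and y as "sales" is an assumption
# made for illustration; the talk does not name its data source.
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarise(xs, ys):
    """Return the summary statistics quoted in the talk, rounded to 2 places."""
    return {
        "mean_x": round(mean(xs), 2),
        "var_x": round(pvariance(xs), 2),  # population variance
        "mean_y": round(mean(ys), 2),
        "var_y": round(pvariance(ys), 2),
    }


summary = {name: summarise(xs, ys) for name, (xs, ys) in quartet.items()}
for name, stats in summary.items():
    print(name, stats)
```

All four sets print near-identical summaries, which is why the numbers alone cannot separate them and a plot is needed.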
Bangalore has a slightly positive correlation: as you increase the price, the sales increase. In the second city, sales increase up to a point and then stop. Hyderabad looks like there's a near-perfect correlation, but there's an aberration. That might have been an error in the data. And Mumbai: we never bothered changing the price, except at one location where we did. But I would argue that there's not enough data to prove the point. Now, it's the same data set. The conclusion that we're drawing from the picture is extremely different from the conclusion that we're drawing from the summary statistics, from looking at the numbers themselves. Again, one thing that I want you to take away is: when you get a data set, just plot it. Whatever way you can, whatever tool you have, Excel, paper, pencil, whatever, you are bound to get something that is different and more insightful than if you just look at the data.
We were working with an energy utility. They said, look, we know that meter readings are being taken, but not quite accurately. In some cases it's intentional. In some cases it's not. And we have this ton of data which we can't process. You have new technology. See whether there's a pattern in all that. So we did what I just suggested. Let's just simply plot it. Nothing sophisticated. Let's see how many people have a meter reading of zero, one, two, three, and so on. And what that looks like is this. It's roughly like a normal curve. But there are a few spikes. The biggest spikes are at 50, 100, 150, 200. Those are exactly the slab boundaries. A person with a meter reading of 100 pays at a lower rate than a person with a meter reading of 101. That's where the spikes are. What surprised us, though, is the spikes at 10, 20, 30, 40, and so on. There's no economic reason for that. The unit rate is the same. A meter reading of 10 rather than 11 saves just one unit. So why were these happening?
This supports the other thought that we had, which is that these are not meter readings that were taken and adjusted. These are readings that were never taken in the first place. Someone just put in a round number. The utility was trying to automate the readings. They needed a case to prove to the unions that there is a reason for automating them, and this was strengthening their case.
But then a second look at the data shows a few other things. Now, sure, people are trying to reduce their meter reading. But how is this happening? Is it occasional, where at some point we reduce the meter reading? Or is it more like a fixed pattern, where on a regular basis you keep the meter reading down? Is it the same set of people putting the meter reading down to the boundaries? Or is it a very diverse set of people? You probably are not able to see this, but each row shows the meter readings of an individual over the course of a year. The lady in the first row had a meter reading of 200, 200, 200, 200, 200. In the case of some people, we were seeing zero. Except for this... So this gentleman looked at the name and said, okay, I know what happened. There was a huge event at that place. His guess is the actual reading would have been 5,000. So the lineman would have gone to them and come to some arrangement, something like that.
But it also varies by geography. For instance, if you take different sections of the city, the degree of fraud, which we can characterise as the height of the bump at the slab boundaries, varies. In section one, it's 70% going all the way up to 100%, which means that there are close to twice as many people with a meter reading of 100 as compared to 99. Huge jump. So that section has got heavy fraud happening. And as you go further down, the degree of fraud is decreasing. You can't see the bottom of the section.
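
The "bump at the slab boundary" measure described above can be sketched as a ratio of counts. This is a minimal illustration with made-up readings, not the utility's data; the function name `spike_ratios` and the choice of comparing against the two adjacent readings are assumptions.

```python
from collections import Counter


def spike_ratios(readings, candidates):
    """For each candidate reading, divide its count by the average count
    of its two neighbouring readings (value - 1 and value + 1)."""
    counts = Counter(readings)
    ratios = {}
    for value in candidates:
        neighbours = (counts[value - 1] + counts[value + 1]) / 2
        if neighbours > 0:
            ratios[value] = counts[value] / neighbours
    return ratios


# Made-up data: 10 households at every reading from 90 to 110 ...
readings = [r for r in range(90, 111) for _ in range(10)]
# ... plus 10 extra households recorded at exactly 100 (a slab boundary).
readings += [100] * 10

ratios = spike_ratios(readings, candidates=[95, 100, 105])
print(ratios)
```

A ratio near 1 means the reading is as common as its neighbours; a ratio near 2 corresponds to the "100%" bump mentioned in the talk.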
The bottom was something like 30% fraud, which is good. I guess one has to live with some degree of fraud. But here there's an interesting anomaly: there was a big gap here and then, as you can see, this moved up. Checking back with the records, it looked like that's the time when the section manager moved in and then moved out of that section. So yes, the data is about readings, but in some cases it can also show you the people that are behind the data.
We were working with the Tamil Nadu Education Department, and the question was around what predicts marks. Given a child, can we find out up front what marks they are likely to get, purely based on demographics? Does gender make a difference? Does community make a difference? Does subject make a difference? And you'd be surprised to find that the choice of subject actually makes a massive difference. We were also testing some weird things. Does astrology have any significance? Does numerology have any significance? Do people born in certain months score better marks? Do people with different first letters have different marks, and so on? For people with different first letters, there is no statistically significant difference. If you're interested, in 2011 the letter T scored the highest marks and the letter W scored the lowest, but it is not a statistically significant difference. Sun sign makes a huge difference: the significance value that I've found is less than 0.0000000. That's what the curve looks like in 2011. That's incidentally what the curve looked like in 2010 and 2009. That's exactly the same curve that you have by subject. That's exactly the same curve that you get by district. June borns score the lowest. July borns, a little higher. August borns, higher still. After September there isn't really much of a difference. I'm really sorry for June borns. Also remember, this is the average. There are, of course, June borns who do well.
But this pattern is repeated in every single way we cut it. And it's quite simple. For some of you, I guess the explanation is obvious now. It isn't as much the sun sign as the age. Children born in June would have only just become old enough at the time of admission, and they're the youngest children in the class. At that age, one year makes a huge difference and gives a huge advantage. The same pattern was seen with the Canadian hockey team, which has people consistently born in January, February and March, with very few exceptions. The reason, I guess, is that the cutoff for them was at the end of December. So anyone born in January just misses the cutoff, therefore becomes the oldest student in the next batch, and therefore has the best position. Which is an interesting thing. A possible implication of this is that you start splitting sections by age, and you teach them differently. But on the other hand, it's probably a bit sad that 12 or 14 years of education can't really wipe out this one-year advantage at the beginning. What are we teaching?
Incidentally, the choice of subject makes an even larger difference: 18 percentage points. If you had a choice in Tamil Nadu between taking science with zoology versus commerce with computer science, you should go for commerce with computer science. 18 percentage points.
We wanted to see who really is the greatest one-day batsman. Who has scored the fastest and the most in one-day cricket? The data was all there. We had every innings in every match since the 1970s, when the first match was played. Let's take a look at it. Excel can show you some interesting things, but we thought we'd just put all of it in one picture. And that's the result. The size of each of the boxes is the number of runs scored in one-day cricket. The colour is the speed at which they score.
So the darker the red, the slower they are, and the brighter the green, the faster. You can't quite see the differences in the greens very well here. But take my word for it, that's the brightest among the big greens. Actually, there are others who score a little higher from a strike rate perspective. But then, that's what Sehwag has scored.
And then we can get deeper into the individual matches. There is a certain consistency factor as well. Sehwag is consistently fast, but also quite inconsistent in how much he scores, whereas one would argue Tendulkar has been mostly consistent. And Gavaskar has had his share of fairly fast innings. There are two really big scores; the rest of them were fairly small. And while the strike rate is fabulous, it comes largely from two large innings, and you can't really judge a person on that.
The thing is, if you print this data set out as a table, it comes to 70 pages. You're not going to be able to process 70 pages at one time. Part of what data visualisation does is compress information into a very compact representation. And that's the value that's always there.
The other thing I missed out was that we did the same thing with the top 100 cricketers. Again, I'm not sure if you are able to see the colours that well, but the fastest scoring batsman stands out by a huge margin: a strike rate of 190, while the second highest would probably be close to 100. Interestingly, there's also an increase in the strike rate over time. If you look at a person like Kapil Dev, who batted more than a decade ago, the average strike rate in those years was lower. It goes down every decade by about 3 percentage points. If you make that adjustment, Kapil Dev's strike rate compares to that of Sehwag. So in his time, he was about as fast a batsman as Sehwag is today.
What can we make out of the securities pricing information that we get? Can we see any pattern in holding a given security or a set of securities? Are there any specific securities that I should be betting more on? To reduce risk, should I be betting less on some of those? Here's a picture that shows the correlation between currencies, commodities like gold and silver, and stock indices like the S&P, and so on. Each square shows you the numerical correlation. So the Australian dollar versus the euro, for the six-month period ending in 2011, was about 68%. And what is barely visible there, unfortunately, is the scatter plot. Now, remember what I said earlier: in general, always plot. The number can smoothen many things. For instance, the Singapore dollar against another currency had a nearly zero correlation. And there are two reasons why you can have a nearly zero correlation. Either for the first three months you had a positive correlation and then a negative correlation, and therefore it's averaging to zero. Or there really is no correlation. For the Singapore dollar, the former was the case: there was a strong positive correlation early on and then a strong negative correlation, which is very different from something that is genuinely close to zero, like the one here between the Malaysian ringgit and another currency.
But what do you also see? There are certain blocks of related currencies. The Singapore dollar, the US dollar, the Japanese yen, gold, the Swiss franc and the Chinese yuan tend to move with each other almost perfectly. And there's another block here: the S&P, the FTSE and the BSE, which also move with each other. For some reason, Pakistan also correlates reasonably with that. But these two blocks move against each other.
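
The point about a near-zero correlation hiding two opposite halves can be checked with a small sketch. The two series below are made-up numbers chosen so the effect is exact; they are not currency data, and `pearson` is a plain textbook implementation written here for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical series: b follows a in the first half, mirrors it in the second.
a = [1, 2, 3, 4, 1, 2, 3, 4]
b = [1, 2, 3, 4, 4, 3, 2, 1]

half = len(a) // 2
full_period = pearson(a, b)
first_half = pearson(a[:half], b[:half])
second_half = pearson(a[half:], b[half:])
print(full_period, first_half, second_half)
```

The full-period figure comes out at zero even though each half is perfectly correlated, one positively and one negatively, which is why the scatter plot matters alongside the number.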
So if this block, the Asian currencies plus gold and the Swiss franc, goes up, the other block goes down. So if you held gold and wanted a hedge against it, the most negatively correlated against that would be the FTSE. So the FTSE, the British index, is the best hedge against gold. If you hold the FTSE, the best hedge is gold. And if you hold the Indian rupee, then your best possible hedge against that would be the Japanese yen. But it's not strong. It's still not good.
Before I go on: I've been talking for 15 minutes. Stop me any time you've got any questions. I'll try and wrap this up.
Another thing we were trying to do was, sort of like that popular visualisation of flights in the US, can we take weather data and plot it? So we took the last 100 years of weather data for each district and plotted that as a video. First we tried to see whether the temperature was varying a lot. Unfortunately the colour combination is not showing up well here, so you'll just have to see it on our website. But each column is showing you India's weather by district by month, January, February, March and so on, and each row is showing it to you by decade: 1900, 1910, 1920 and so on. What you see in the full colour version is that there really isn't much of a difference between January 1901 and January 2001, whereas there is a huge difference between January 1901 and May 1910. But also, if you play it like a video, there are some patterns. For instance, the northeast and Jammu and Kashmir are pretty cool throughout the year. So is the west coast. The south-west coast, the Malabar coast, is consistently cool too. But what was most interesting was that there were two places in India, two districts, which have an odd pattern.
You have these areas being cool while the surrounding areas are hot, and hot while the surrounding areas are cool. As you play the video, you can spot the areas quite obviously. One is Velasco, the other is Shibara. One common thing we could find is the Vodafilms, so perhaps that's a significant difference. We don't know. Now, the thing about all these visualisations is they can tell you what's happening, not necessarily why.
The next one is something that we did with a group of students. We were monitoring their computer usage, so for a period of 40 days every single activity of every single student was recorded. When I say every single activity, I mean right down to every keystroke that they typed, which applications they were using and so on. Which means that I have the Gmail password of a few of the students. They were told, by the way, and they had the option of turning it off. Each column is a student and each row is an application that they were using. The colour indicates the degree to which they were using it. The browser, Firefox, was the most used. Microsoft Word, on which they were writing assignments, was the second most used. Chrome was the next most popular browser, and you can see clearly who's using Firefox and who's using Chrome. In the next few rows there is something that only some of them use, and we have some strong doubts on why they're using it. This might explain it: the fifth most used application was VLC. They did have course videos, actually. You can see the titles of the content they were watching, and one good thing came out of it: we got some solid movie recommendations. The width of each of these is the amount of time that they're spending on the computers. So the people out here are spending that much time, and these are spending a lot of time.
And the person here, and the first two top users: between them, they are among the few people using Half-Life. There are four people playing Half-Life, as opposed to, for example, Notepad, which is used by practically everyone, or the command processor, which is used by practically everyone. Just these four people have managed to push Half-Life into the top half of the list of applications, spending 10% and 11% of their time on this. This guy is a solid gamer.
What this tells you is partly who's using what and how, but it can also go a step further than that. What we haven't done yet, but are in the process of doing, is trying to see how they switch between applications. When I'm using, for instance, my editor, do I switch to the browser, or do I switch to something else? When I use a video while I'm learning, what are the patterns of learning, and how long does it take? The other thing that we're exploring is how long they are spending on a video. Is a video of five minutes too short? Is a video of one hour too long? How do people pause it? That's another kind of behavioural insight.
When I was relocating back from the UK, one of the things I was trying to decide on was which city to settle in. I thought of reaching out to a bunch of people, including a bunch of people I knew, and it was very complicated: trying to reach each person independently of the others, city by city. Then try the same thing in Hyderabad to find people. So I said, let's take GitHub. For those who are not aware, it's sort of like Facebook. And I said, let's draw a line between people who are following each other. We filter this network by who is living in the city and draw the picture to see what the pattern is in the city.
I think you're not able to see it, but that's a very large network, and it has a strongly connected component. That means that if you get to one person in that network, at least on GitHub, there will be some other person that you should be able to connect to easily. And even if you're not part of the strongly connected component, there are plenty of nice clusters. The colour indicates the language that they work in. The clusters may be around languages, maybe organisations, and once you know this you can also see that if you connect with this cluster, there is some good that can come out of it.
That's Chennai, the second largest community. There is a community there; it turns out that I had missed them. It's far smaller, and it's an emerging community. That's probably the third. There is no centrally connected component; there are a few hubs here and there. If you go to Hyderabad, there are hubs, three or four clusters. All the ones that I could identify were organisations: these are startups, and the people there are connected through them. We also did this with cities in the Middle East, Sri Lanka, Singapore and so on. Depending on which place you are looking at, the Middle East, for instance, hardly has any people connected; in contrast, Singapore has a massive engagement. This incidentally explains my choice of location.
But not all designs need to be complex. Here is a simple redesign that we did of an electricity bill. Sometimes even something like an electricity bill can be visualised. It's really just the number that you want to see: how much do I need to pay? Let's make sure that that's prominent. Once you know that, the next thing that you want to know is why I got this. So what was my current reading, what was my previous reading, therefore what is this month's consumption; last month you paid so much, this month so much. That's what's needed, and
also, sometimes, just tell me how much I used last month, how much I used the month before, and how much I used the same month last year, because there's a huge seasonal variation. The months in which I switch on air conditioning are going to make a big difference to the amount of power that I'm consuming. Tell me where I stand, say in the top 45%, and let me compare with my neighbours. Things like that can be put into an interface, and it's a reasonably simple thing. This doesn't really call for visualisation as much as it does for design.
This is another very simple visualisation. It tells you: if you had bought the stock any time in 2006, 7, 8, 9, 10 and had sold it at this period, how much money would you have made? The reds are where you have made a loss, the greens are where you have made a profit, and the whites are where it's somewhere in between. Along the diagonals, in this period, you have lost money; at best you are barely recovering. If you had bought around this time and sold early, you would have recovered a decent bit of your money, but if you had sold anywhere after that, you would have made a loss. You can also see the holding period: even if you held for one year or more, you could not be sure of making money. Systematic investment is great, they say; the data for the last 10 years doesn't seem to indicate that much.
So let me end with four books, which you can't read on the slide, but it's easy enough to remember: just pick up any book by Edward Tufte. These are all books by Edward Tufte, and you'll get a sense of what visualisations can do. Like I said, I have not spoken about how you can create these visualisations. I will emphasise, though, that half of these visualisations can be created in Excel. You just need a sense of what can be done, a sense of the story that you can weave around it, and you need a picture
story. So let me end. We are a data visualisation company. You can find more about us on our website.
I have two questions. The first one has to do with the design that is shown on the front end, and also the kind of technology that makes it possible for you to display something. Basically, I just want to understand this dynamic between design and technology, and what makes it possible for you to show something in a certain way. That's my first question. The second question is still something I am trying to arrive at in my own mind: this whole process of how you read the pattern. Are you starting with a particular question, or is it a problem that's posed to you?
So the first question was between design and technology. Design is something that unfortunately comes only with practice. There are a few guidelines that one can pick up, but unfortunately even applying those guidelines takes a lot of practice. The good part is that the technology is quite easy, and in many cases the technology itself drives the design: if something can be easily created, we use it; if it's really difficult to create, we often can't. So in a way, design is driven by what you can do. There are applications like, for instance, Tableau and Spotfire, which let you drag and drop, and Excel 2010 at least gives you a quick answer. Therefore the design is driven by the tool.
The second question was: how do you know what to look for? Part of this comes from domain knowledge and experience,
but there are two ways of going about this. One: ask the question and find the answer, in which case you know what you are looking for, and it comes down to the ability to do that analysis, which can be easier. The tougher part is if you don't know the domain, don't know what to ask, and just want to know what's interesting. What I found is one simple rule of thumb. Any data set that you take can be thought of as a spreadsheet. It will have a bunch of columns: numbers and categories, typically text. Let's assume you split the columns: these columns have text, these are numbers. Take any one of those numbers that you are interested in and find out how it varies by each text column. For example, we are working with a poultry company, trying to see what the variation in mortality is for the chicks. One simple automated way of detecting this is to say: what are the roof types? For each roof type, what is the average mortality? Is there a significant difference? For each floor type, what is the average mortality? Is there a difference? So for every single categorical variable, find out what the average of the number is. Now, this is a simple technique. It is just saying: find the average of each of the segments. But about 80% of the analysis that I have seen in all my life comprises just this one thing: tell me how a particular variable varies based on a category. The other 20% is where you can probably go deeper. But the good part is that this is something that can be done blindly, without knowing the domain.
Hi, I am Laena, and let me first congratulate you on a fantastic presentation. What I wanted to point out was the danger of making correlations between various variables and then making a broad generalisation out of it. The birthday one especially made me a little worried. How do you protect yourself from making such generalisations? What are the processes you follow within your team to make sure you are not falling into that
fallacy?
Eric Raymond had given the answer to that one: he taught us a technique for how to find something different. It is not necessarily a simple process. We create something, and two months later we find out that it is not true. What we have found in such cases is that when you have access to the raw data, it is possible to go back and check what has been done with the data. Yes.
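
The rule of thumb from the Q&A, averaging one number across every categorical column, can be sketched as follows. The rows, column names and values are made up for illustration (they are not the poultry company's data), and `averages_by_category` is a hypothetical helper. The sketch only computes the averages; checking whether a difference is statistically significant is left out.

```python
from collections import defaultdict
from statistics import mean


def averages_by_category(rows, number, categories):
    """For each categorical column, average the numeric column per category value."""
    result = {}
    for category in categories:
        groups = defaultdict(list)
        for row in rows:
            groups[row[category]].append(row[number])
        result[category] = {value: mean(nums) for value, nums in groups.items()}
    return result


# Made-up records: one row per shed, with two text columns and one number.
rows = [
    {"roof": "tile", "floor": "mud", "mortality": 4.0},
    {"roof": "tile", "floor": "concrete", "mortality": 6.0},
    {"roof": "sheet", "floor": "mud", "mortality": 8.0},
    {"roof": "sheet", "floor": "concrete", "mortality": 10.0},
]

averages = averages_by_category(rows, number="mortality", categories=["roof", "floor"])
for category, by_value in averages.items():
    print(category, by_value)
```

In this toy data the averages differ more by roof type (5.0 versus 9.0) than by floor type (6.0 versus 8.0), which is the kind of lead the technique is meant to surface before any deeper analysis.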