Welcome back, everyone. We are going to start our first talk of the day. It's called "Consuming Government Data with Python and D3". I guess that would be interesting, because data visualization is booming nowadays. To introduce our speaker: he's Pratap Vardhan. Pratap is a data scientist at gramener.com, a data analytics and visualization company, and he did his B.Tech at NIT Bhopal. Before we start the talk, a few quick announcements again; I'm reiterating the same points. Please don't create Wi-Fi hotspots. You can check out the open space availability, and if you want to propose a talk or something, you can do that. Make sure you ask the volunteers for the feedback forms so that we can have feedback for this talk. Also, registration for lightning talks is going on outside. There's a board; you can just clip in whatever topic you want to speak on during the lightning talks, or get in touch with one of the volunteers and they'll help you out. So, first talk: over to you, Pratap.

With Python and D3. And by Python, do I mean all of the Python libraries? No. In this talk I'll mostly focus on how we use pandas, which is one of the data-crunching libraries, and also D3-based frameworks, which let you do some interactive visualizations. This whole talk is in the form of an IPython notebook, which is on GitHub, and I might have just tweeted it right now. So if you want to run through the examples yourself on your system, you can download the notebook and run it. All the data sets of this talk are up on GitHub, and you can just use them. Now, let me start off with what it is that is particularly interesting about government and data. Rather than me saying it, I'll use Calvin: what would he think of why he would need government, in a sense?
This is one of the reasons why even I think it's pretty important. Not that I'm doing any good for the government or with the data; rather, I want to have a powerful say, or, heck, I just want to run things. Of all the things that you do with the government, which are the couple of things that you think are pretty much connected to you? Can I have three or four words that come to mind when you talk about government data? Let's see if we capture those today. Just shout out. Okay, okay. We have census. Okay, we have a lot of... okay. Let's see if we can answer some of these things. But before even getting into the formation of a government, before anyone gets to form the government, the first thing they do is launch a political party. For the political party to survive, the next thing you need is some money, so you probably want to do a funding session, right? You want to collect some donations. Unfortunately, in India the data on political donations is not very open, although we do have some parties or some sources of information where we can do some analysis. Interestingly, there was one particular party which published its donations on its own website. However, there was a tricky part: the donations were not in the form of a data set. Rather, they were part of an ASP website where you have to scrape the data by clicking "next" and so on, which is a little tricky again, because scraping ASP websites in particular requires some understanding, since they don't expose the query parameters that well. Any guesses which party I'm talking about? Right. So the way we did it was: take the data, keep scraping the data, and after a week, or let's say for a daily update, instead of scraping the whole bunch again, just take the delta of the information. You have the last-updated date from when you scraped the information. Just take the list of the next
hundred records or 200 records, whatever it could be, and then update the DB. The DB, as you see, is in the data folder as the AAP donation .db file, which is also available in the repo. The one thing that you need to notice here is this: let's forget for a moment how we would usually do it with Python. Let's assume that pandas didn't exist. You would have a number of steps even to read a simple CSV file, or to do a simple group-by, or to do some aggregations; it involves a certain number of lines of code. In most of the talk you'll see that every aggregation, or any sort of analysis that we do, involves just a single line of code, without a single for loop. I guess you would appreciate why we are not using for loops in Python, at least when you're doing data analysis. So, the simple thing: I've just read the data, and the data that I have currently has name, country, state, and a transaction ID; the moment you make a donation, they give you a transaction ID. You have the amount, and the date stamp is also recorded. Just to get a feel for the data: in total, at least up to last month, August or so, about 80 crores has been donated to this party. There were about 2 lakh donations made, and the average amount donated was around 3,000 rupees. If you were to think of this like a three-year-old startup: it started off, got crowdfunded by two lakh people, and raised close to a billion rupees. But that's a rosy picture. We are looking at a very high level and commenting on all this. Let's get into further details: what is it that we can actually look into?
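A minimal sketch of the high-level summary just described (total, number of donations, average), assuming a pandas DataFrame with the fields mentioned in the talk. The column names and the rows below are made up for illustration; the real data set lives in the talk's repository.

```python
import pandas as pd

# Hypothetical sample rows; column names are assumptions based on the
# fields listed in the talk (name, state, country, amount, date).
donations = pd.DataFrame({
    "name": ["Donor A", "Donor B", "Donor C", "Donor D"],
    "state": ["Delhi", "Delhi", "Karnataka", "Maharashtra"],
    "country": ["India", "India", "India", "India"],
    "amount": [100, 5000, 1000, 5900],
    "date": pd.to_datetime(
        ["2014-01-05", "2014-02-10", "2014-02-11", "2014-03-01"]
    ),
})

# Each summary figure is a single expression, with no explicit loop.
total = donations["amount"].sum()      # total amount donated
count = donations["amount"].count()    # number of donations
average = donations["amount"].mean()   # average donation size

print(total, count, average)
```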
I hope the next slides will have a bigger font, but there's nothing I can do right now because that's just how it renders; I'll see what I can do about it. One of the things that we just saw, which they keep talking about, is that the donations coming in were usually of a certain size, like a thousand rupees or five thousand rupees, not large amounts. In one peculiar case you have the Bajaj Electoral Trust, which donated close to three crore rupees. This was the single largest donation made to that party, and it was in 2014, right around election time. But what's interesting to notice is this: if you donate money to anyone, and you're donating more than one lakh rupees, you must be super rich, or at least able to afford donating that much money. So suppose I pull out the donations which are more than one lakh, and ask what the sum of those donations was. It turns out that only 400-odd people have donated a quarter of the money. Out of the two lakh donations that we have been considering, just 400 super-rich people have donated a quarter of the amount. Now, pandas gives you some reasonably good functions: to deal with time series data you have resampling methods, and to do group-bys you have groupby. I'll talk about the importance of using resampling, especially when you're dealing with time series data, and why it's really fast. If you look at this, what we have done is simply take the data set, set the index to the date, take the amount, resample by month, and take the sum. So this is the month-on-month value that we're getting for the Aam Aadmi Party. This is good, fine, I'm getting the data. But it doesn't look that good, because what sense do I make of it? I just need to read the numbers.
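The two steps described here, pulling out donations above one lakh and summing the amount month by month, could look roughly like this. It is a sketch on invented rows with assumed column names (`date`, `amount`); the `"MS"` (month-start) frequency is simply one alias that works across pandas versions, not necessarily what the notebook used.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2014-01-05", "2014-01-20", "2014-02-10", "2014-02-11", "2014-03-01"]
    ),
    "amount": [500, 150000, 1000, 250000, 100],
})

# Donations above one lakh (100,000 rupees) and their share of the total.
big = df[df["amount"] > 100000]
big_share = big["amount"].sum() / df["amount"].sum()

# Month-on-month total: index by date, pick the amount, resample, sum.
monthly = df.set_index("date")["amount"].resample("MS").sum()

print(len(big), round(big_share, 3))
print(monthly)
```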
I don't know where it is growing or where it is going down. Likewise, most of the time we just focus on the absolute figures, like what the sum is, but we don't focus on the count, like how many donations were coming in at certain periods of time. With the count you would actually see whether a spike in the amount was just because of one or two people, or whether it reflected popular perception, with the public actually donating a lot of money. Now, what you notice here is that instead of doing this in two different functions or two lines of code, I'm getting the amount as well as the count in a single line of code, using the `how` parameter, where I can pass NumPy functions that say, for this column, do a sum, or a mean, or a standard deviation, or any aggregation method you want to use. Now, one thing that I've noticed, and this was something that I actually skipped when I first did this analysis: when you do any sampling or resampling, you have something which, at least in machine learning, is called the curse of dimensionality, which tells you that if you have too many features, your model is not going to be very good. We could probably argue similarly that, at least from a visual perception, or when trying to get initial insights, if we don't sample the data at a proper frequency we might lose some key information. Now, if you notice the data that's been presented here, it's almost identical. It's the same data.
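A sketch of getting the sum and the count in one expression. The talk refers to the `how` argument of `resample`, which belonged to older pandas releases; in current pandas the equivalent is to call `.agg(...)` on the resampler, which is what is assumed here. The rows and column names are invented.

```python
import pandas as pd

# Invented rows: three small donations in April, one very large one in July.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2014-04-02", "2014-04-15", "2014-04-28", "2014-07-10"]
    ),
    "amount": [1000, 2000, 500, 3000000],
})

# One expression yields both the monthly total and the number of donations.
monthly = df.set_index("date")["amount"].resample("MS").agg(["sum", "count"])

# A month with a huge sum but a count of 1 points to a single large donor
# rather than a broad wave of donations.
print(monthly)
```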
It's just the sum of the amount; it's just that on one I've aggregated on a daily basis and on the other on a monthly basis. On the daily one, if you notice, in July 2014 there was a huge spike, whereas on the monthly one the spike was during the election period. The July spike was because there was a single donation from Bajaj that spiked up this figure. Now, all this is good. How do we actually get into doing some interactive visualization? One of the things that helped me a lot was using JavaScript, or D3. Now, how do I combine JavaScript and Python here? You have other frameworks like Tornado or Flask, where you pump in the information and then render it on output. But what if I don't want to do all of that? I just want to focus on my IPython notebook; I don't want any other interactions or any other overhead. So here is one simple way to include JavaScript in a notebook. IPython provides you with certain functions with which you can inject HTML, and all that I'm doing is passing a string. What that string does is this: you initially define an HTML tag, which is identified by a "time" ID. Then all that I'm doing is taking the current date and, every half a millisecond, updating that value. So you'll notice that the time gets updated. Right now it's 10:55; I'm almost 10 minutes into my talk. Now, this is good. So there is something whose value I can change dynamically in the browser without blocking my console, so I can move on to the next piece of code. So next: how do we bring visuals into this?
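The self-updating clock could be built along these lines. This is only a sketch: `clock_html` is a hypothetical helper, the element id and the 500 ms interval are assumptions, and whether the embedded script actually runs depends on the notebook front end.

```python
def clock_html(element_id="time", interval_ms=500):
    """Return an HTML string: a tag with the given id plus a script
    that rewrites its text with the current date on a timer."""
    return (
        f'<div id="{element_id}"></div>\n'
        "<script>\n"
        "setInterval(function () {\n"
        f'  document.getElementById("{element_id}").textContent = new Date();\n'
        f"}}, {interval_ms});\n"
        "</script>"
    )

snippet = clock_html()
print(snippet)

# Inside a notebook, the string would then be handed to IPython, e.g.:
#   from IPython.display import HTML
#   HTML(snippet)
```

Keeping the markup in a plain string means the Python side stays free to run the next cell while the browser keeps updating the element.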
To use D3, or other frameworks like C3 or dimple, you can again use the require config, where you can import these libraries. The moment you do this, D3.js and dimple are loaded into your browser, and you can access these variables and call their internal functions to do all sorts of plotting that you want to do. Now, one example. D3 is a very low-level library; if you're very imaginative with your visualizations, you can do all sorts of things. Here, what we're simply doing is defining an SVG with a width of 600 and a height of 300. Next, I'm putting in a rectangle, which I'm defining as a square, at position 30, 30. Then, if you remember the auto-updating time that we wrote two slides back, I'm using a similar function, but I'm just replacing the x and y with random values. So it starts at position 30, 30, and then every two seconds it keeps moving around. All this is happening in the IPython notebook, and I'm neither stopping it nor controlling it. Now you can imagine, if I have this control or flexibility, that once I have the data, I've pushed it out and can let the JavaScript behave however it wants. It could be me hovering over some element to get a tooltip, or I could use a brush or a filter to narrow things down. Now, to give a sense of what sort of visualizations we could do, here is one example. I've just included the links that we have used. Let's walk through some of the features that we used. So, if you notice here, all the aggregations, the back-end crunching, are done in pandas, and then you visualize it in the browser. The way it comes out, we primarily have a summary section that updates every time you select different filters, and all this is auto-annotated with whatever you want to say, for instance if you want to speak about the donations coming from a certain state, and so on.
To make it more intuitive: looking at charts is fine, but this is something to do with politics, and some commentary would be interesting. So, if you notice here, there are spikes at certain periods of time, or there was some dip. When he began the strike, the one related to the power bills issue, there was a spike in the number of donations. This was the day when, yes, the election was announced; then there was a huge spike; then it formed a minority government. The moment Kejriwal announced that he would take on Modi, there was a huge spike. And the moment people realized, okay, we bet on the wrong horse, and Modi actually won, there was a collapse in donations. But there was one spike, and that spike was due to Bajaj. So our first reading could be wrong; as we go back into the data, we see it was not because of the usual public perception. You'll notice that in 2015 another election was called, and there was again a spike in donations, and then it literally fell off. Among the other visuals that you see here, this one is called a treemap, where the size of the rectangle represents the total amount of donations made from that group. It could be from Delhi: how many donations are coming in, and what's their sum? The color represents the average value. It might so happen that Delhi is donating a lot of money, but that it has been donated by a great many people, so the average donation could be a thousand. It's actually Maharashtra and NRIs where the average donation is much higher than Delhi's. But how do we get to the crux of the information that we can put into a visual? All that I need, for every state, is the total donation amount, the number of donations, and the mean or average amount.
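The per-state table asked for here (total, number of donations, average) can be sketched with a single groupby chain. The rows and the column names `state` and `amount` are assumptions for illustration.

```python
import pandas as pd

# Invented rows: many small donations from one state, few large from another.
df = pd.DataFrame({
    "state": ["Delhi", "Delhi", "Delhi", "Maharashtra", "Maharashtra"],
    "amount": [100, 200, 300, 5000, 7000],
})

# Sum, count and mean per state, sorted by number of donations.
by_state = (
    df.groupby("state")["amount"]
      .agg(["sum", "count", "mean"])
      .sort_values("count", ascending=False)
)

# In a treemap, "sum" would drive rectangle size and "mean" the color.
print(by_state)
```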
And one way of doing it is, again, using group-bys. Here, what I've done is a groupby on state: take the amount and write an aggregation which does the sum, count and mean, and just sort on count. That's it; it's a single line of code. All that you have to do is take a DataFrame, which is like your tabular data, and do some aggregations, and all that you wanted to display in a treemap is right in front of you. So it's just a single line of code. The same thing could be done for countries, although it's worth noting that the data also classifies India as a country, so you might have to remove those rows. For filtering, all that you have to do is use square brackets and put in a condition: I don't want rows where the country is India, so all that I want is the set of rows where country is not India. The moment you get that, you chain it with another groupby on country, write your aggregate functions, and then sort it however you want, based on count, or mean, or total amount of donations. In countries, if you notice, it's the United States that has the highest number of donations coming in, and then you have the UAE, UK, Singapore, Canada and Australia. This is fine and interesting, and probably expected, because a lot of NRIs are in these countries. But if you notice, the cases of fewer people donating large amounts of money are actually coming from Hong Kong. I don't know what's happening in Hong Kong. It so happens that there have been 161 donations, and on average they've donated more than one and a half lakh or so. And the biggest of all, where a single donation has come in, is from Somalia. We keep hearing that there's some conflict in Somalia, but it does seem that, at least when it comes to the politics of our country, people going there, working there, with all those issues
that they're facing, they're still able to save up some money. What's more interesting is that you don't see any of the top countries figuring in the list by average donation. All you see is Trinidad and Tobago, Turkey, Indonesia. So you have all sorts of countries that you don't expect, where people, or at least individuals, actually are donating high amounts. And the good part is, if I now want to see what is actually happening with Hong Kong, or what it is that they're doing, I can probably just click. We'll get back to that. Now, all that we have done so far is use the data as it is. We haven't created any data; we haven't really looked into what else we could do with it. We have taken the amounts, we've taken the countries, and we've taken the cities. Is there something where we could actually create new data from the existing data? One such thing that we did was this: I'm not interested in whether the guy paid two rupees or ten rupees; for me, anything between one and a hundred is the same, he's making a small donation. So let's bucket them into bins of one, hundred, five hundred, thousand, and so on. Now, if I want to see the distribution, you do a histogram and you get the values that you want. One note: pandas inherently has NumPy as well, so all that you need to do is access it as `pd.np`, and all the NumPy functions that you want you can get from that. So the histogram that we used here was coming from NumPy. Next, we created a new column called amount bucket, where I'm saying cut the amount on these bins, explicitly mentioning that `right` is equal to `False`, which means: do not include the value on the right edge. So if the amount is a hundred, it actually goes into the next bucket, not one to hundred. So what we just did here was create metadata, or create new information
that didn't exist. It might be interesting to see what we can do with this data. The first thing we'd probably be interested in is, among the states that are donating, where is it that people are actually donating one rupee or a hundred rupees? You take all the pain of coming onto the internet, using your credit card or whatever, and you donate one rupee. I guess your internet cost and data charges would be more than one rupee, but you're still investing your time to come and donate one rupee. So which are the states where people are actually coming out and donating such small amounts? Again, a simple aggregation lets you do it: group on state and amount bucket, and get a count. And you would see that, at least in Andhra Pradesh, a lot of donations are in the thousand-to-five-thousand range. Not bad. And if you go to certain other states, Uttar Pradesh in particular leads in donating small amounts. People are fascinated with, you know, the denomination they give. What's more interesting is that people have donated every single amount that exists between one and a hundred: one, two, three, four, five, keep counting. The moment they see a sheet of the donations that have been made, sorted by amount, they would ask: okay, why am I getting 67 rupees? How would someone even donate 69 rupees, or 72 rupees in fact?
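A sketch of the bucketing and the per-state count described above, using `pd.cut` with `right=False`. The bin edges, rows and column names are assumptions; the talk only says the bins run "one, hundred, five hundred, thousand, and so on".

```python
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "state": ["Uttar Pradesh", "Uttar Pradesh", "Delhi", "Andhra Pradesh"],
    "amount": [1, 100, 250, 2000],
})

# Assumed bin edges. With right=False each bin is closed on the left,
# so an amount of exactly 100 lands in [100, 500), not [1, 100).
bins = [1, 100, 500, 1000, 5000, 10000]
df["amount_bucket"] = pd.cut(df["amount"], bins=bins, right=False)

# Number of donations per state and bucket (only combinations that occur).
counts = df.groupby(["state", "amount_bucket"], observed=True).size()

print(df)
print(counts)
```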
It's not just 1 to 100: if you go from 1 to 200, there are just four values missing. People have donated every single amount that exists. What's more bizarre is that people are even superstitious in their donations. You'd see that there are donations like triple one or triple two; that's fine, a fancy number. But you also have numbers like 2014 or 2015, probably for the auspicious year, the thinking being that if you donate the auspicious year, it would be good. There's a mindset that it's generally people in the low-income groups who make this sort of donation. But you see instances where donations of more than one lakh rupees came in the same way: they've given one lakh and a thousand-something, with the date on which they donated, so it's one lakh 23 thousand and 14, and so on. So you have all those weird sorts of numbers coming in. And this is interesting, in that we can at least try to extrapolate how people react, what people are saying through their donations. Can we say something about their psychology, or our psychology?
Now, all that we've done so far is group-bys, but this is not very interesting if I want to compare, say, an amount bucket across states. What I would rather do, in Excel, is called a pivot, right? The same thing I can do in pandas as well. All that I need to do is call pivot and say what the index is, which is state here, what the columns are, which is the amount bucket, take the transaction IDs as the values, and just do a count. Now I've got a count for every state and every amount bucket. It might be very easy to read if I just put a heat grid on it: for every value that exists here I put a color, and then see what I can see. Most of the time you don't want to see all this information; all that you want is to focus on the top five states, do this sort of pivot, and then put a heat grid on top of it. To filter for that, if you remember, initially we took top states as a variable, where we defined the top ten states. Now I'm saying: the DataFrame's state `isin` it. Even the syntax is very readable: take the top ten states, check wherever the DataFrame's state is one of them, take just that data, and do a pivot on it. Now all that I need to do is put a heat grid on it, and the moment you do that, you get some sort of pattern, where you see Delhi has a lot of donations in the hundred-to-five-hundred range. There's also an anti-pattern that you might see: barring NRIs, Karnataka actually donates a lot in the thousand-to-five-thousand range, probably from Bangalore, because of IT or whatever fan base they have. Now, so far we've been talking about the good parts, the good analysis, the kind which feels good if you have donated to this party. What if there's something?
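The filter-then-pivot step could be sketched as below: keep only the states with the most donations via `isin`, then count transaction IDs per state and amount bucket. The rows, the bucket labels and the choice of the top two states (rather than ten) are assumptions to keep the example small.

```python
import pandas as pd

# Invented rows; amount_bucket is given as plain string labels here.
df = pd.DataFrame({
    "state": ["Delhi", "Delhi", "Delhi", "Karnataka", "Karnataka", "Kerala"],
    "amount_bucket": ["100-499", "100-499", "1000-4999",
                      "1000-4999", "1000-4999", "1-99"],
    "transaction_id": ["T1", "T2", "T3", "T4", "T5", "T6"],
})

# States with the most donations (two here, ten in the talk).
top_states = df["state"].value_counts().head(2).index

# Rows for those states only, pivoted to a state x bucket table of counts.
pivot = df[df["state"].isin(top_states)].pivot_table(
    index="state",
    columns="amount_bucket",
    values="transaction_id",
    aggfunc="count",
    fill_value=0,
)

# This table is what a heat grid would color, one cell per count.
print(pivot)
```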
Mischievous, or something that's not intentionally wrong, but where the data is probably pointing to some funny things. Whenever you make a donation, you get a transaction ID that is supposed to be unique to you, right? You're not supposed to give the same transaction ID to multiple people. Now, here is what happened. I just took the transaction IDs, got the value counts, and looked for wherever a transaction ID occurs more than once. Now you'd notice names like Anil P Wilson. Okay, I'll be reading out a few of the names; if we're lucky and you're in the room, just raise your hand. So we have Anil P Wilson, who donated a certain amount of money on a particular date, and you'll notice that it's actually the same date and the same amount. It might so have happened that the bank tried to credit the same amount again; maybe the database recorded it twice. I don't know; something is happening where people's names are getting duplicated. There could be some issue there, and let them figure out what it was. But what if they're actually different people? You have transaction IDs like the first one, where you see Tushar Shah and Harish Kumar Sharma, though they were born at different places.
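A sketch of the duplicate check: take the value counts of the transaction IDs and keep the IDs that occur more than once. The rows, names and column names are invented placeholders.

```python
import pandas as pd

# Invented rows; T100 is deliberately shared by two different donors.
df = pd.DataFrame({
    "name": ["Donor A", "Donor B", "Donor C", "Donor D"],
    "transaction_id": ["T100", "T100", "T101", "T102"],
    "amount": [500, 500, 1000, 50],
})

# How often each transaction ID appears, then the IDs seen more than once.
id_counts = df["transaction_id"].value_counts()
repeated_ids = id_counts[id_counts > 1].index

# All rows that share a transaction ID with at least one other row.
repeated = df[df["transaction_id"].isin(repeated_ids)]

print(repeated)
```

Comparing names, dates and amounts within each repeated ID then separates a likely double recording (same person, same amount) from two different donors sharing one ID.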
They were connected with the same TIDs. It's a weird thing, you know: you and I are connected by this TID. The other thing that you notice is that this is probably some legacy issue, because it was the donations made during 2013 that had this blip, where the same transaction IDs were given to multiple people. Probably they later figured out what had happened and moved to something better, or they must have fixed it. Also, while looking at this, if you have access to the database, you can open the DB and write a simple SQL query, and you'd notice that donations are missing during November to December 2013, for 15 or 20 days. I'm not sure whether they wiped them off, or the database didn't record them, or something else happened. Since inception there has not been a single day without a donation, be it even 1 rupee or whatever the amount; but for those 20 days, there was no donation. Now, we all have fun when we're deeply connected with politics. Especially if you're on Twitter, you know how people react to things. I don't know who did this, but it seems a lot of people wanted Narendra Modi to donate to AAP. I'm pretty sure he did not, but you see a lot of donations in the name of Narendra Modi, and one person even wrote "Narendra Modi for AAP". The interesting thing is that they're not donating much: 1 rupee, 5 rupees. They're probably trying to make a point, or they'll take a receipt and show it to their friends: you know, this is what I did today. But that's not all. A lot of fans of Kejriwal also tried to do the same thing. If you have access to the slides, you can go through some of the names, which are pretty funny. I don't want to read them out loud. Just to give you some sense: one, 17362, says "Kejriwal Nautanki". But, yeah, fine; you can't do anything about trolls, they just keep doing this stuff. And again, the same thing: the donations are mostly small. It's not huge amounts that people are actually donating. Then we get international people into our politics. You have Osama Bin Laden. You have Obama. Then you have Hitler. And we don't just leave it at Hitler; we want to make statements like "Hitler Lover Arvind Kejriwal". People get really creative. But if you notice, again, it's the same thing: it's just 1 rupee. So I guess a lot of people are donating 1 rupee probably to make a point, like, I can fool the system, or, before I give a big amount, let me check whether this actually goes through. Fine, this has all been fun. You have this data, and the good part is that you can play around and do something with it. So what if you get elected and become the prime minister of the country, or a chief minister? You start doing things, and the media obviously reports on the things that you do. What is the most famous thing that you remember about the ex-PM, Manmohan Singh? Anything? Pretty good. I guess everyone agrees that he's very silent, right?
Let's see if he was really silent, or if someone cooked up the data. I'll let you read this later, but this is actually more important, because what it says is: "I know more about the private lives of celebrities than I do about any government policy that will actually affect me. I'm interested in things that are none of my business, and I'm bored by things that are important to know." "The media aim to please." I'll leave it at that and let you fill in whatever you like. Now, the data; again the same thing. The PMO had a site for the PM's speeches; by now it has been moved to an archive or something. We had to scrape all the speeches: each had a unique ID and a link, and you have to scrape all that information. What you get is an ID, the date on which he gave the speech, the place, the full content of the speech, and the title of the speech. On top of the place, we added meta information such as the state, the country, etc. Since you anyway said he's very silent, I'll not ask what your guess would be. I'll be silent for a moment and let you read what this says. So much for whoever said that he's pretty silent, right? He actually gave a speech every three days. I guess most of us would never do that, giving a speech every three days. In all, he gave 1,200-odd speeches in a span of ten years. That's about a speech every three days. That's pretty high. But it would be interesting to see: did those speeches really occur, or were they just press statements? Where were they given? When were they given? We have the data, right? So all that we need to do is group by time, or group by date and state, and extract some information. All that we have done here is take the DataFrame, do a groupby on time,
extracting the year part, since I'm not working on timestamps directly, and I just took the count. You get some values, which may not be very revealing, because we can't really compare them by eye. So what I did was use the dimple.js library, where I just injected some information. All that I did was pass a JSON object, where x is based on years and y is based on speeches, and set the geometry to a bar. Now it seems that in 2008 he gave almost 218 speeches; that's a speech almost every one and a half days. If you remove Saturdays and Sundays, and I guess even public holidays and other holidays, it's almost a speech every working day. Every night he must have been preparing the speech, and the next day going in and giving it. But we keep hearing that he doesn't talk, right? We haven't heard much of him, we don't see him, and it's true, because we have rarely seen him speak. But the peculiar thing to notice here is that his tenure began in 2004, and until 2008 the number of speeches kept increasing, through 2005, 2006 and 2007. And you'll remember there was a second election in 2009. Then, I don't know what might have happened. They might have thought that it was time to pull back Manmohan Singh and push some fresh person in the party, and probably that's the reason they tried to push the speech content to the other person. People are the best judges of what happened to the speeches by that person. But I'll leave it at the point that from 2009 onwards
they've been pretty much stagnant. The number of speeches has been pretty low, and he's kept to himself. Now, all that we have done here is a groupby, right? I keep mentioning resampling. Resampling is pretty fast. If you're using IPython, you can use the magic command `%timeit` to measure the time that it takes. It takes about 18 microseconds if I use groupby; if I do it using resampling, it runs at about 6x the speed, and both give identical values. So the takeaway is: generally, if you're dealing with any time series data and all you want to do is measure some quantity by date, hour-wise or month-wise or week-wise, it's pretty good to use resampling. Now, we have seen when he started speaking. Where did he speak? Which were the top places, where was he speaking the most? Any guesses again? I guess this will be pretty easy. A little louder? Parliament; I guess we'll take that as Delhi. But do you know what the proportion could be, how much he spoke in Delhi versus outside of Delhi? Any rough idea? 80 in Delhi and 20 outside? That's pretty huge, right? So he's a Delhi man. It turns out that in Delhi he gave 760 speeches, and the second closest was Maharashtra, and then the US. The funny thing is that after New Delhi and Maharashtra, the popular places that he went to were the USA, Russia and Japan. Then there are a few more countries. Assam, probably, his home state, from where he got himself nominated: at least he went there. But South Africa was more popular than Assam.
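A sketch of the two ways of counting speeches per year, a groupby on the extracted year and a yearly resample, showing that they give the same counts. The rows and column names are invented, and no timing is claimed here; in a notebook each expression could be prefixed with `%timeit` to compare them on real data.

```python
import pandas as pd

# Invented speech records; only the date matters for the count.
speeches = pd.DataFrame({
    "date": pd.to_datetime([
        "2004-06-01", "2004-09-15",
        "2005-01-26", "2005-08-15", "2005-11-02",
        "2006-03-10",
    ]),
    "title": ["s1", "s2", "s3", "s4", "s5", "s6"],
})

# Approach 1: group by the year extracted from the date.
by_group = speeches.groupby(speeches["date"].dt.year)["title"].count()

# Approach 2: index by date and resample at yearly ("YS") frequency.
by_resample = speeches.set_index("date")["title"].resample("YS").count()

print(by_group.tolist(), by_resample.tolist())
```

Note that resampling also emits a zero for any year with no speeches, while the groupby simply omits it, so the two only match exactly when every year in the range has data.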
There were a few places that he hasn't actually visited. So we could do all these sorts of things, like counting how many unique places he has been to. We get some idea of how many places he visited each year: 2006 was the maximum, when he visited about 50 places. Today we keep talking about Modi visiting a lot of foreign nations. I guess once the data is out, we can even compare with this and see who is going where.
Now, we have text information as well, right? So what is it that we can do with the text? There are string methods that even pandas provides. All that we need to do is measure the length of the text and then see how it varies. As you notice again here, year-wise, after 2006 or so the length of the speeches also decreased. So there were cuts at both ends: you cannot speak for long, and the number of times you speak will also be less. Either it's a self-imposed rule, or I don't know where it came from.
Now the next thing I want to talk about. So far we've been talking about data that we scrape from somewhere because it is not readily available. But we have some nice initiatives: data.gov does provide some good datasets. What is it that we can do with those other datasets? One of the datasets I was working on this week was exports and imports. I'll just skim through some of these slides; you can look them up for how to read the data. The interesting thing we looked at was where our major exports are going and where our imports are coming from. Obviously, we were exporting more to the USA and China, and the imports were actually coming from China anyway. But how do I compare these two? I would probably need to do a merge or a join of the two, and then try and see what it is that I can compare. This is how you would merge the two datasets: by doing a concatenation of the two, because both are series and I can join them on their keys. So where is it that my exports are pretty high?
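The two speech measurements described above, unique places per year and speech length via pandas string methods, could look something like this. It is a sketch only; the table, its column names (`date`, `place`, `text`) and the rows are assumptions made up for illustration.

```python
import pandas as pd

# Hypothetical speeches table; all rows are invented.
speeches = pd.DataFrame({
    "date": pd.to_datetime(["2005-01-10", "2005-08-15",
                            "2006-03-02", "2006-11-20"]),
    "place": ["New Delhi", "New Delhi", "Mumbai", "Washington"],
    "text": ["Friends", "Thank you", "Namaste", "Jai Hind"],
})

# Length of each speech in characters, using the .str accessor.
speeches["length"] = speeches["text"].str.len()

year = speeches["date"].dt.year

# Number of distinct places per year.
places_per_year = speeches.groupby(year)["place"].nunique()

# Average speech length per year.
mean_length_per_year = speeches.groupby(year)["length"].mean()

print(places_per_year)
print(mean_length_per_year)
```

The same `nunique` per year check works on any categorical column, so it can also be used to spot years where a dataset is suspiciously complete or incomplete.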
And if I sort on that ratio of exports to imports, you get the UAE, where you're exporting a lot, but it's almost a one-is-to-one ratio. To the USA we actually export more than we import. But there are certain countries, China for example, from which we actually import more than we export. Now, all this is good; you have certain methods to do it.
What I found more interesting was that you have information on which commodities we are actually trading. We would assume that any commodity we either import or export; we won't do both, right? I mean, if I'm importing petrol, why would I export it again? I need it. Why would I buy it and sell it again? If you notice some of these examples: cashews. We export a lot of cashews, but there is some sort of in-shell cashew that we actually import. I don't know what that special variety is that people want to import. But there are some glaring examples where we both import and export. For example, petrol: we all know that we import a lot of petrol. That's a reason for the fluctuation of the dollar as well. But why would we export it? Could the reason be that you are buying from someone and selling to someone else at a higher price? We need to look at all these policies. An interesting thing at the bottom, if you see, is wheat. We would assume that we are exporting it and importing only in the rarest of cases. If we're importing wheat, it means that we made some policy thinking we had a surplus of wheat, exported it, and later realized that our country doesn't have enough to feed itself, so we import. This data lets us look, in a more meaningful way, at whether our policies are working or not.
This was again a simple representation. I just skimmed through this, and I'll try to touch upon crops a bit, because so far in the media we hear very little about farmers or what may concern them.
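The export and import comparison described above can be sketched as below: concatenate two series on their shared keys, compute the ratio, and sort on it. All the figures are invented for illustration and are not the real trade numbers; the series names are likewise assumptions.

```python
import pandas as pd

# Made-up trade values per country, in arbitrary units.
exports = pd.Series({"UAE": 30.0, "USA": 40.0, "China": 15.0}, name="exports")
imports = pd.Series({"UAE": 29.0, "USA": 22.0, "China": 51.0}, name="imports")

# Both are Series keyed by country, so concatenating along axis=1
# aligns them on those keys and gives one column per Series.
trade = pd.concat([exports, imports], axis=1)
trade["ratio"] = trade["exports"] / trade["imports"]
trade = trade.sort_values("ratio", ascending=False)
print(trade)

# Made-up commodity values; join="inner" keeps only the commodities
# that appear in both series, i.e. those both exported and imported.
exported = pd.Series({"petroleum": 60.0, "wheat": 2.0, "tea": 1.0}, name="exports")
imported = pd.Series({"petroleum": 150.0, "wheat": 1.0, "gold": 50.0}, name="imports")
both_ways = pd.concat([exported, imported], axis=1, join="inner")
print(both_ways)
```

With the default outer join, a country or commodity missing from one side shows up as NaN in that column, which is a quick way to see one-directional trade.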
So I just want to touch upon two topics, which are crops and rainfall. In the crops area, we carried out a similar sort of analysis: try some aggregations and see if you can find some patterns. What we noticed was that the data was incomplete. All that was required to find the anomalies in the data was to extract how many unique crops I'm actually getting in the dataset for every year. For some reason, in the years 2002 and 2003 there were a lot of crops, and for the other years the data was not coming in, so there was a mismatch in this data. I guess I'm a bit out of time, so I'll skim through this, and you can take it up when you have the notebook in your hands.
I'll just touch upon the last point, which is more about telling data stories. Once we have the data and the visuals that we want to show, I can't simply put them as data, then a visual, and describe it. There must be some intuitive way that people can understand. If you're publishing in print, in a newspaper, you can't do much about it. But on a powerful medium like the web, you can do all sorts of things in terms of interactivity. One thing that we tried was this: how do you plot this? All that we had was census information. The data points we had were: a state, and the percentage of households having bikes, having smartphones, having land, water, all sorts of different variables. So the data story that we wanted to tell was: you select your x-axis, you select your y-axis, and radius and color also vary based on that. All that you need to do is select anything. So if I change this to percentage of four-wheelers, the data moves, and you see which states are actually moving. This is fine. But what if I don't know anything to start with? Could you actually tell me a story? Fine, let's write a story.
We'll start with some of the data points or insights that we initially identified. For example: we may think that irrigation equipment would be more common where there is a lot of irrigated land, but it turns out there are a lot of exceptions. You do see a sort of regression pattern, but there are some outliers in the east and the west. So you can tell your story: insert your narrative over there, and once the person comes to that piece of text, the visual gets refreshed. Especially if you're working in the telecom sector, this must be an interesting data point: the belt, specifically Madhya Pradesh, Chhattisgarh and Orissa, has the lowest mobile penetration compared with the rest of India.
So, in all: we have a lot of data coming out from the government. Some of it could be very messy; you just need to find the means to work with it. You could marry some datasets, say crops and rainfall, and combine them to tell some stories. But we will come to a point where some information is incomplete. We might have to take the help of some methods, like how to extrapolate or interpolate or fill these values, some of which we have walked through. The important part, however, is that we have all sorts of tools that we can play with, and we have a good hold of data: at least in recent years we are getting a good amount of data, and it's being made public. Probably we just need some imagination and concern for the data that's coming out, so that we can tell some stories, so that we are not blinded by the media, and we can do something. Thanks. I have a few minutes; if you have any questions, I can take them.
Which tools are you using, any packages?
Okay, I guess I mentioned it in the beginning: some of the charts that you've seen in the IPython notebook were done using D3 or dimple. Yes. Okay.
Those are JavaScript-based libraries. Otherwise, you do have matplotlib in Python itself. That is pretty good for exploratory work, but for interactivity and such you might need to use these.
In the story, when a piece of text is highlighted, the visual shows only the data for that part. Say the second paragraph is highlighted: only the data related to that second part is shown. So how can we do that?
Basically, if you notice, all the filters on the top are URL-driven. The moment I come to any section, the URL gets changed, and based on the URL the visual gets refreshed. So I push new data once the URL changes, and the visual gets refreshed.
Thank you.
Hello. Hi, I'm [unclear]. I've been working on a similar problem set. Basically, I'm dealing with healthcare data. I have a small Flask server which just serves JSON content to an Angular front end. There I'm using D3 and Crossfilter to display the data. One of the bottlenecks I've been hitting is the amount of data which I need to push out. So that JSON should be streamable in some form, and one such thing I've come across was Bokeh, Bokeh plots. Have you ever used it? And if you have, what are the advantages or disadvantages?
Okay, I haven't used Bokeh, because I used D3, and that pretty much solved my problem. But at a high level, the principle is again the same behind both these things: instead of passing all the data that you have, you pass only the amount of data that the visuals you need to represent require. The thing is, if I draw a scatter plot, all the information that I need is just the state name, the x value, the y value and the radius, right?
Even if you pass a hundred rows of information, I cannot display a hundred rows of data, because you cannot even consume that. As far as possible, when you're working with real-time streaming, you would pass only aggregates into the browser. That is a saving, and you keep control over the data.
Just to interrupt you guys: can you take these questions outside, offline, because we are ready for the next talk. So sorry to interrupt. So can we have a round of applause? Thank you for such a fantastic talk by Ashutosh
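The point made in the Q&A, sending the browser only the aggregates a chart needs rather than every raw row, can be sketched as follows. The table, the column names and the JSON shape are all assumptions for illustration; this is not the payload format of D3, dimple or Bokeh.

```python
import json
import pandas as pd

# Hypothetical district-level rows; all values are invented.
households = pd.DataFrame({
    "state": ["A", "A", "B"],
    "district": ["A1", "A2", "B1"],
    "households": [100, 300, 200],
    "with_mobile": [40, 160, 50],
})

# Aggregate on the server: one row per state instead of one per district.
per_state = households.groupby("state", as_index=False)[
    ["households", "with_mobile"]
].sum()
per_state["pct_mobile"] = (
    100 * per_state["with_mobile"] / per_state["households"]
).round(1)

# Serialise only the columns the chart will draw.
payload = per_state[["state", "pct_mobile"]].to_json(orient="records")
print(payload)
```

A front end would then fetch this small JSON document and redraw; when a filter or the URL changes, the server can recompute the aggregate and send a new payload of the same shape.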