The announcement: at 12.30pm we have an open space on ZeroMQ by Balaji. It's on the first floor, so those of you who are interested can go there. That's it. And I hope the password is still working. In Audi 3 also, we have got open spaces, so you can check with the volunteers.

Thank you, Arun. Our next speaker is Anand Chitipothu. He is going to speak on the topic Messing with Government Data using Python. He is a co-author of web.py, a micro web framework in Python. He currently works for the Internet Archive and is an active member of the Indian Python community. Thank you.

Can you hear me now? So this is something unusual: I'm going to talk about messing with government data at a Python conference. I did something unusual during these elections: I provided technical assistance to a couple of election campaigns. How many of you have worked in an election campaign? Raise your hands. That's very few. I was fortunate to be associated with an election campaign long before. A friend of mine contested as a candidate in a similar election, so I was closely following it and helped him build his website. I was looking at how an election campaign runs, and it's just chaotic. Anything that you want is not there. If you ask someone, how many volunteers do you have? He has to call someone who has the information somewhere, in an Excel sheet or wherever, and that person won't be available to take the call. So it probably takes a couple of hours to get very basic information like how many volunteers we have or how many places we have covered. So what I did was provide technical assistance to help solve some of these problems.

Let me give a brief background of how it started. In December, I decided to move out of Bangalore and I moved to Visakhapatnam. I was relaxed and sitting quietly, and suddenly I got a call from a close friend of mine.
He called me and said, hey Anand, I'm taking a year off from work. I said, wow, that's nice. What are you going to do? He said, I'm working for a political party. Wow, that's unusual. I don't know if I can be of any help. Then one day he called me and said, we probably need your help in providing some tools for the campaign. I said okay. I thought it wouldn't take much of my time, but it ended up being a big adventure. I'm going to talk about the challenges that I faced during this process.

During this process, I actually ended up building a lot of tools. I built a campaign management system. By the way, is the font visible at the back? No, sorry. Okay, is that better now? Okay, but that would probably get cut off. I guess there won't be much text anyway, so I'll probably read it out. Sorry for that. So I ended up building a campaign management system, a volunteer sign-up system, an app to find voter details by voter ID, a script to format voter lists as PDF in a compact form so that they can be printed on paper, and a lot of other tools as well. If you can see, that's my GitHub contributions. From mid-February to April, that's when the elections were, you just see a spike in my activity. I was writing code and committing all the time. That's the period when I did all this work.

Since most of you haven't followed elections closely, let me give a glossary of the terms I'm going to use, so that you're comfortable with them. One is a parliamentary constituency, like Bangalore North or Bangalore South. A parliamentary constituency will have about six to eight assembly constituencies. Hebbal is an assembly constituency in Bangalore North. And there are wards inside an assembly constituency: J.C. Nagar ward in Hebbal, Sanjay Nagar, and so on are different wards. And there's a polling center.
A polling center is typically a school building; that's where the elections are conducted, and it'll typically have a bunch of polling booths. There will be multiple rooms in the polling center. So the polling center is some public school, and then you have room one, room two, room three. Those are the polling booths. Every voter has a unique voter ID, and the voter will go and cast his vote in one of these polling booths.

Now, this process has a lot of challenges. Let me show you the challenges that I faced. This is a screenshot from one of the tools that I built, a campaign tracking tool. It basically has the hierarchy of all the places in the constituency. This is what we were using in Karnataka. So we have Karnataka, and this is the assembly constituency, and inside the assembly constituency, if you look at the text here, it says it has eight wards, 75 polling booths and so many polling centers. Now how do you get this information? Where is this information available? Even just to get the basic system up, you need to know the different kinds of places that you have in your state and your parliamentary constituency. If you look at the progress bar there, it's showing the number of booth agents that we have and how many more we need to be able to cover it completely. But that's not very interesting at this point; this talk is really about how to get this data.

So let's see that. It's the same page, but the previous one was at the assembly constituency level and this one is at the polling booth level. If you see on the right, it's showing you the navigation. The region is actually stated by hand, but you have a parliamentary constituency, an assembly constituency, a ward, a polling center and a polling booth.
And at the bottom, it shows the other polling booths in the same building. Now, where is this information available? How do you get it?

The first thing is, how do you find the mapping from polling booth to parliamentary constituency to assembly constituency? Apparently, the government doesn't speak that language. They have a different nomenclature: they have districts and assembly constituencies, and they don't talk about parliamentary constituencies. But elections are happening in parliamentary constituencies, and you want to know the assembly constituencies so that work can be divided and tracked easily. The only place I could find reasonable data about this is the Wikipedia page. That page has the constituency number and name, and all the assembly constituencies in each parliamentary constituency. How do you get this information? Write a Python program to scrape it. That's not too hard.

Let's look at the polling booths. How many polling booths do you have in an assembly constituency, and what are they? There is an Election Commission website that has one web page for every assembly constituency. It has names in Kannada and English. Actually they have English only for the Bangalore region, whereas elsewhere they only have Kannada. Now, those are the polling booths that I showed in the previous slide, numbered 1, 2, 3, 4 and 5: the five polling booths inside the same building. First I have to extract this information and load them as polling booths, and then somehow I have to smartly identify that all five polling booths are in the same building. How do you know? I do some text processing and figure out that all of them are actually in the same place. So I wrote some Python programs to do grouping based on some heuristics and identify the polling centers.
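The grouping step described above could look roughly like the sketch below. The talk does not spell out the heuristics, so the normalisation rule here (dropping a "Room No. N" suffix and punctuation) is only an assumption, and the function names and sample booth names are made up for illustration.

```python
import re
from collections import defaultdict

def center_key(booth_name):
    """Normalise a booth name so that rooms in one building share a key."""
    name = booth_name.lower()
    # Assumed pattern: strip suffixes such as "room no. 2" or "wing 3".
    name = re.sub(r"\b(room|wing)\s*(no\.?|number)?\s*\d+\b", " ", name)
    # Drop punctuation and collapse runs of whitespace.
    name = re.sub(r"[^a-z0-9 ]+", " ", name)
    return " ".join(name.split())

def group_into_centers(booths):
    """Map each normalised building name to the booth numbers inside it."""
    centers = defaultdict(list)
    for number, name in booths:
        centers[center_key(name)].append(number)
    return dict(centers)

# Made-up sample data in the spirit of the Election Commission listing.
booths = [
    (1, "Govt. Primary School, Room No. 1"),
    (2, "Govt. Primary School, Room No. 2"),
    (3, "Govt Primary School Room No 3"),
    (4, "Community Hall"),
]
centers = group_into_centers(booths)
```

Real listings would need more rules than this (spelling variants, Kannada names), which is presumably why the speaker calls them heuristics.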
So we have got parliamentary constituencies, assembly constituencies, polling booths and polling centers. What about the ward? This is a PDF of the voter rolls. The Election Commission gives one PDF for every polling booth, containing information about that booth and all its voters. That PDF file is the only place where you can find which ward the polling booth belongs to. So just to get the bare bones of the system up, the ability to navigate through all the localities in your state or in your parliamentary constituency, you have to parse all this information. That's fun, isn't it? These are the sources of data I need to extract before I can start doing something.

And there are many tools that were built along the way. One of them is a volunteer sign-up system. All these things were done in a very short span of time, about 45 days. The requirements change rapidly, and the election campaign goes on its own path. For example, they start with one approach, it doesn't work, and they want to do something else. So you have to change your software to meet those needs. One day they said, see, signing up volunteers is not working very well. There are a lot of people signing up, but they're not becoming volunteers. We need to fix that problem. The problem was that the process they had was very time consuming. The lead time was very high, in the sense that between the moment someone shows interest in becoming a volunteer and the time he actually becomes one, there's a big gap. The reason was the approach they took. Usually what happens is you give a missed call to some number or you go and fill in a web form, and someone will call you and ask where you are from and what you can do. And then you become a volunteer like that. But the issue is that it has to go through a central process.
So usually there's a two to three day or even a week's delay between when you sign up and when you get a call and get confirmed as a volunteer to work on something. So they called me and said, can you build a form where people can search for their location, find out who the coordinator is, and contact them directly? I said, that doesn't sound like an interesting solution. So I thought for a while and built this.

I did it in two different places, one for volunteer sign-up, the other for booth agent registration, and it's the same thing. There's a field called locality, and the person signing up fills in the locality there. It gives an auto-complete using a Google location API and identifies the location. And I built a service which figures out which ward it is, which assembly constituency it is and which parliamentary constituency it is. So the moment someone fills in this form, I know which ward and which parliamentary constituency he is from, and an email gets sent to that person and the ward coordinator, introducing them to each other. Now they can start working immediately from that point in time. So the whole delay of two to three days to a week is completely cut down by the system. Unfortunately, it was built very late in the cycle and they couldn't really make much use of it, but it's a very interesting system.

And this is the other part of it, for adding a volunteer. The previous one was people signing up. This is an internal system where the admins want to add someone as a volunteer. So instead of locality, there's a voter ID field. When someone fills in a voter ID, it has to automatically figure out which polling booth he belongs to and add him as a volunteer there. Now how do you do that? I have to go and query the Election Commission website, find out which polling booth he belongs to, and then add him as a volunteer.
That was the challenge. One option is to query the Election Commission website. The other is to preload the data in some form. We do have that information; I'll get to that in a minute. But just one day before the election, there was a need to build a small website where you can find your polling booth from your voter ID. This already exists on the Election Commission website. But as expected, government websites don't work when there's a need. They go down. In fact, it happened even here: the Election Commission website in Karnataka didn't work on election day. And they did make plans to scale up. I think they had an ASP.NET application or something. They made a copy of it in three different directories and gave three links, hoping that would take 3x the load. But it didn't work.

So I built a small service. It just says, find your polling booth in Bangalore. You type in your voter ID and it tells you which polling booth you belong to.

If you have ever observed on election day, you see a lot of people from different political parties sitting at a table with a voter list in their hands. People go and ask, where should I go and vote? Or they'll come and shout at you: my wife's name is there, my name is not there, what are you doing? What these volunteers do is help these people and tell them where they should go and vote. Inside the booth, the person in charge of the booth will have a sheet of voters, and there is a serial number for each voter. So each voter should go with his booth number, which is called a part number, and his serial number. They take those two things and show them inside the booth. The officials look that up in their list, make sure it's that person, and let them vote. So for every person that comes to the table, the volunteers have to tell him what his booth is and what his serial number is.
So that's what this thing provides. But where do you get that information from? It is only available in the PDFs given by the Election Commission. I've put some bars over it so that the names are not visible, because it would be a privacy issue if I exposed all that. So this is a PDF containing all the voter information. Now I have to extract this and build the database.

The other interesting thing is that on the day of elections you have to put up a table and have this list at hand, so it has to be printed. Each polling booth typically has about 1,000 voters and each printed sheet has about 30 entries. So that's roughly 30 sheets per polling booth. A parliamentary constituency has about 2,000 polling booths. That's like 60,000 pages, just for one single copy. That's so much paper and also so much money.

The other issue is this. This is a voter list for one polling booth. A polling center, the school building, will typically have four or five polling booths inside. If someone comes and asks, where is my polling booth? I have to look it up and tell him which booth he belongs to. The typical way is that people use brute force. They put five people there, each holding one of these lists, and when someone comes and asks, where is my polling booth? All of them start searching and whoever finds it gives the answer. That's very, very inefficient. It's not sorted by name or anything. You have to go and search.

So I wrote a program to extract this information and combine all the voter lists in a polling center together. If we have five booths, all five are combined together, sorted alphabetically, and printed in a compact form. Now if someone comes, you ask, what's your name? You quickly search through this and tell him his booth and his serial number. This has close to 150 entries per page, whereas the previous one has only 30 entries.
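The merge-and-sort step described above can be sketched as follows. This is a minimal illustration, not the speaker's program: the data layout, the function names and the sample voters are all assumptions, and the `lookup` helper uses the standard `bisect` module to mimic what a volunteer does by eye on a name-sorted printout.

```python
from bisect import bisect_left

def build_center_index(booth_lists):
    """Merge per-booth voter lists into one list sorted by name.

    booth_lists maps a booth number to a list of (serial, name) pairs.
    """
    index = []
    for booth, voters in booth_lists.items():
        for serial, name in voters:
            index.append((name.lower(), booth, serial))
    index.sort()
    return index

def lookup(index, name):
    """Binary-search the sorted index; return every (booth, serial) match."""
    key = name.lower()
    pos = bisect_left(index, (key,))  # first entry whose name is >= key
    matches = []
    while pos < len(index) and index[pos][0] == key:
        matches.append(index[pos][1:])
        pos += 1
    return matches

# Made-up sample: two booths in the same polling center.
booth_lists = {
    1: [(1, "Ravi"), (2, "Anita")],
    2: [(1, "Meena"), (2, "Anita")],
}
index = build_center_index(booth_lists)
found = lookup(index, "Anita")
```

Several voters can share a name, so the lookup returns all matches; on paper the house number or age printed next to each entry would settle which one it is.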
So it's actually a 5x saving, and basically I'm building a database index on paper. You can do a binary search and quickly figure out where the name is, whereas with the previous one you're doing a sequential scan on the database table. Here you're doing an index scan. This was very useful because it saved a lot of money and you're also able to tell people their booth very fast.

There's another interesting thing. I did this in Bangalore, and back in Visakhapatnam, I gave this sheet to them on the day of elections. They had very few volunteers. So I told them, since you already have very few volunteers, I suggest you concentrate on the first set. These are the polling centers which have the most voters and the most polling booths, so those schools and buildings get more voters. If you have a limited number of volunteers, you had better send them there first. That would cover close to 40-45% of the voters if you just cover those 10 polling centers. The remaining set is the ones which have four or more polling booths. That was very helpful, because just with this information they were able to schedule where to send their volunteers. And again, this was possible only because I have the data grouping polling booths together as polling centers.

So those are a couple of the things that I did during this period. But the fun part is how I did these things, so let's look at the approaches I took for solving these problems. The first thing is parsing HTML pages. How do you parse HTML? What I realized is that Beautiful Soup is pretty good. There are a lot of scraping frameworks that people use. I find them actually counterproductive: if you start using those frameworks, they slow you down very much. Beautiful Soup is very simple and you can start using it easily.
And always save intermediate results, because it will save you a lot of time. When you're parsing thousands or tens of thousands of pages, somewhere a page will have malformed markup or some Unicode error, and you'll have to start again from the beginning. So saving intermediate results will save you a lot of time. And ASP.NET is the worst thing that has happened to the web, and the government loves ASP.NET. Every government website you see is built in ASP.NET. So I had a really fun time extracting data from all those ASP.NET websites. You'll see what kind of tricks you can use to parse information from those websites.

Let's look at Beautiful Soup. Is this visible? No. I'm very sorry. But these are the only slides where you need to look at the text. Beautiful Soup 4 is the newest version of the library. You import BeautifulSoup, and then you create a BeautifulSoup object from the HTML. It has two different ways of selecting elements. One is that you can select using a CSS selector. Here I'm trying to parse the polling booth information from the Election Commission website, the slide I showed you. So that's the ID of the table, and then I'm picking all the rows. That gives me all the rows in the table. The first one is the header, with the names of the columns, so I skip that. And for each row, there's find_all, the other API for finding elements. I'm finding the immediate children which are td. So I've got the tds, and then I extract the text and return it. This is pretty simple. There's nothing fancy happening here. I just find the right element by its ID, take all the elements inside and take the text out of them.

Next, saving intermediate results. This is the other interesting thing that I did in the scraping process. I wrote a small decorator called disk memoize. What it does is save the result of this page in a file. If the file is already there, it just returns the saved result.
So when you're scraping 10,000 pages, if it parses 100 files and fails there, when you restart the scrape, for the first 100 it reads from those files and it starts from the next one, where it failed. That saves a lot of time.

Let me show you how this disk memoize works. It's a decorator. You can specify the file name, or you can say how to construct the file name from the parameters. You could say, take the parameter ac, and if its width is less than 3, pad it with zeros at the beginning. So you get something like AC012_booths.tsv. You can specify that by the name of the parameter, or by its position. If you see here, it says 1, parameter 1, so that is the state, sorry, the district. That's a dictionary, so it's taking the state attribute and the district attribute of it and using them. That is very helpful, because I don't have to write the same code in all these functions. It also supports different formats. If you see the first one, the format is JSON, so it serializes the result as JSON and saves it. The second one saves it as a TSV file and reads it back as TSV.

Now the ASP.NET part. Oops, I'm sorry, something ran forward. So if you see, when you select a dropdown, oops, sorry, something is missing. If you look at any ASP.NET website, you'll typically find a lot of dropdowns. When you select a dropdown, it sends a POST request to the server. And if you click on any link on the page, it again sends a POST request. It's the worst abuse of REST principles I've ever seen. And along with it, it sends a lot of fancy things. Sorry, it's not visible at the bottom. It usually has a special hidden field in the page called __VIEWSTATE. It's a huge string, probably base64 encoded, and at times it becomes very large.
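The disk memoize decorator described earlier could be sketched roughly like this. It is a reconstruction from the description, not the speaker's code: the name `disk_memoize`, the use of `str.format` for building the file name, and the JSON-only storage are assumptions (the talk mentions a TSV format as well, which is left out here).

```python
import functools
import json
import os
import tempfile

def disk_memoize(path_template):
    """Cache a function's JSON-serialisable result in a file.

    path_template is filled in with the call's arguments via str.format,
    for example "cache/AC{ac:03d}_booths.json".
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            path = path_template.format(*args, **kwargs)
            if os.path.exists(path):
                # Already scraped on an earlier run: reuse the saved result.
                with open(path) as f:
                    return json.load(f)
            result = func(*args, **kwargs)
            os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
            with open(path, "w") as f:
                json.dump(result, f)
            return result
        return wrapper
    return decorator

# Usage with a stand-in for the real scraping function.
cache_dir = tempfile.mkdtemp()
calls = []

@disk_memoize(os.path.join(cache_dir, "AC{ac:03d}_booths.json"))
def fetch_booths(ac):
    calls.append(ac)              # records how often real work happens
    return [{"ac": ac, "booth": 1}]

first = fetch_booths(ac=12)       # does the work and writes AC012_booths.json
second = fetch_booths(ac=12)      # served from the file, no second call
```

With `str.format`, a positional dictionary argument can also feed the file name, as in `"{0[state]}_{0[district]}.json"`, which is close to the by-position variant mentioned in the talk.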
And it sends that along with every single thing that you do. So I had to work around all these things, and I built this small utility for that. There is a function, get select options. I give it the name of the select. It parses the page using Beautiful Soup and gives me all the options available. Now I know all the districts I have in the state. I loop through each of them and find all the assembly constituencies in the district by selecting the other dropdown. Then I want to find the booths inside each one of them. So I start afresh again. I select the district option; that sends a POST request and comes back. Then I select the assembly constituency; it sends a POST request and comes back. Now I have an HTML page containing all the booths, and I go and parse it. It sounds tedious, but these functions make it very easy, because it's almost as if you're manually clicking: select the district dropdown and pick this value, select the assembly constituency dropdown and pick this value. And then you parse it using Beautiful Soup like you did before.

Now the fun part is parsing the PDFs. How do you parse PDFs? You have seen this before, and it's very tricky. Don't worry about the text, because it's not meant to be readable; I just converted the PDF into text. I converted it using a tool called pdftotext that comes with the xpdf package on Linux. So this gives plain text, and I need to parse it and find the ward information. How do you do that? First, I wrote a function, read section. From those two markers, "2. Details of part and polling area" and "3. Polling station details", it selects the text between the two. I basically have a huge chunk of text and I'm trying to narrow down to the region that I'm interested in. So it comes down to this part. What I do next is identify the index.
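The dropdown workaround described at the start of this passage amounts to echoing the page's hidden fields back in a POST. The sketch below shows only that payload-building step and makes several assumptions: it uses the standard library's `html.parser` instead of Beautiful Soup so it has no dependencies, the control name `ddlDistrict` and the sample page are invented, and the actual HTTP request is left out. `__VIEWSTATE` and `__EVENTTARGET` are the standard ASP.NET Web Forms field names.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

class FormScanner(HTMLParser):
    """Collect hidden inputs and <select> options from one page."""

    def __init__(self):
        super().__init__()
        self.hidden = {}    # hidden field name -> value
        self.options = {}   # select name -> list of option values
        self._select = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and attrs.get("type") == "hidden" and attrs.get("name"):
            self.hidden[attrs["name"]] = attrs.get("value") or ""
        elif tag == "select":
            self._select = attrs.get("name")
            self.options[self._select] = []
        elif tag == "option" and self._select is not None:
            self.options[self._select].append(attrs.get("value") or "")

    def handle_endtag(self, tag):
        if tag == "select":
            self._select = None

def get_select_options(html, name):
    """Return the option values of the named dropdown."""
    scanner = FormScanner()
    scanner.feed(html)
    return scanner.options.get(name, [])

def postback_payload(html, select_name, value):
    """Build the POST body that mimics choosing `value` in a dropdown."""
    scanner = FormScanner()
    scanner.feed(html)
    data = dict(scanner.hidden)          # echoes __VIEWSTATE and friends
    data["__EVENTTARGET"] = select_name  # control that triggered the postback
    data[select_name] = value
    return urlencode(data)

# Invented sample page; a real one carries a far longer __VIEWSTATE.
page = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <select name="ddlDistrict">
    <option value="0">Select</option>
    <option value="21">District A</option>
  </select>
</form>
"""
district_values = get_select_options(page, "ddlDistrict")
body = postback_payload(page, "ddlDistrict", "21")
```

Each response contains a fresh `__VIEWSTATE`, so the payload for the next dropdown has to be built from the page that came back, not from the first one.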
This is the text, and I want to identify the index where these things appear. Now, this data often has all kinds of weird things in it. Sometimes I can find "ward number" on the left side, or I can find something near "police station", and then I would pick the wrong index. So what it does is take three or four things and find the best of all of them. If there is an outlier, it ignores that. That way I reliably find the position where that information starts, and then I select just that region. Now, from the whole PDF, I've narrowed down to a small window which has the required information. I take it from the ward number to the polling station, and that gives me the ward number and the ward name. That's fun, isn't it? I do all this just to get the ward information of a polling booth. Unfortunately, it's not available anywhere else.

The other one is the volunteer sign-up system I showed you, where there's a map-based thing. Fortunately, I didn't have to do much work. The Open Bangalore and DataMeet groups have made Bangalore ward and parliamentary constituency maps available. All I had to do was take their maps and write a small Python program to build those APIs. That wasn't hard. There was some learning about GeoJSON and all that, but that wasn't too complicated.

The last part is formatting the voter lists. I've shown you the format, right? A compact voter list. I used ReportLab for doing it. It worked fine, but there was a small tragedy on the day of elections. One or two days before the elections, they told me that they wanted it done like this. I wrote a program and tested it for one polling center. It worked fine. I said, okay, I'll give it to you by tomorrow morning. They told me that they wanted it by 10 o'clock so that they could send it for printing and be ready by evening. I started running it and realized the program got stuck at some place. It was not coming out at all.
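The two PDF-text steps described earlier, cutting out the section between two markers and then finding the column where the right-hand block starts, can be sketched as below. The talk only says that three or four labels are checked and outliers ignored; taking the median column is one assumed way of doing that. The label names other than "Ward No" and "Police Station", and the sample page, are invented.

```python
from statistics import median

def read_section(text, start_marker, end_marker):
    """Return the text between two marker strings, or "" if one is missing."""
    start = text.find(start_marker)
    if start == -1:
        return ""
    end = text.find(end_marker, start + len(start_marker))
    if end == -1:
        return ""
    return text[start + len(start_marker):end]

def label_column(lines, labels):
    """Estimate the column at which a block of labels starts.

    Uses the first line containing each label and returns the median
    column, so one stray match elsewhere on the page is outvoted.
    """
    columns = []
    for label in labels:
        for line in lines:
            col = line.find(label)
            if col != -1:
                columns.append(col)
                break
    return int(median(columns)) if columns else -1

# Invented sample laid out in two columns, as pdftotext output tends to be.
left = ["Ward No near margin", "Street: Main Road", "Area: North"]
right = ["Ward No : 12", "Police Station : X", "Tehsil : Y"]
body = "\n".join(l.ljust(30) + r for l, r in zip(left, right))
text = ("1. Header\n2. Details of part and polling area\n" + body +
        "\n3. Polling station details\nmore text")

section = read_section(text, "2. Details of part and polling area",
                       "3. Polling station details")
lines = [ln for ln in section.splitlines() if ln.strip()]
col = label_column(lines, ["Ward No", "Police Station", "Tehsil"])
ward = lines[0][col:].split(":")[1].strip()
```

In the sample, "Ward No" also appears at the left margin, which is the kind of stray match the speaker mentions; the other two labels pull the estimate back to the right-hand column.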
So I had ReportLab and it generated some PDFs, but something just got stuck. It was not working at all. I went and looked at why. I saw that that polling center has more than 10 polling booths and the number of pages is more than 100, and somehow it gets stuck there. I tried different things and they didn't work. We had just one day, and I didn't have time to understand ReportLab or switch to something else. So I thought, let me put it on a high-end machine and get it done. I thought I'd go and get an EC2 server, run it there and solve it. I went to EC2 and tried to start a server, and it said, your limit doesn't allow you to create that, please send an email so that we can activate it. I had just one day left. I couldn't wait for them. So I went and started running it. Same issue. It got stuck there. So I started running these things in parallel. The ones which have fewer polling booths finished, but the ones which have a larger number of polling booths still didn't work.

I spent some time and realized there are some performance issues with ReportLab. If there are too many entries, it takes too much time. Probably there is some O(N²) issue inside. So when I looked at each page, each page has 144 entries. Since I know how many entries each page has, I can split the data into chunks of 144 entries, give each to ReportLab, generate multiple PDFs and combine them later. That's what I ended up doing. But I lost a lot of time and money by the time I realized this. It was already the evening of the day before elections by the time I found the solution. But I could solve the problem.

So these are the challenges and the fun that I had during the elections, and I couldn't resist and continued to mess around. If you can see again now, that's my GitHub history. I started messing with more government data and started improving the system.
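The workaround of feeding the layout engine one page's worth of rows at a time can be sketched like this. Only the chunking is shown: `render_page` is a hypothetical stand-in for whatever builds one PDF with ReportLab, and merging the resulting PDFs is left out.

```python
def chunks(rows, size=144):
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]

def render_in_pages(rows, render_page, size=144):
    """Render each chunk separately and return the outputs in order."""
    return [render_page(page) for page in chunks(rows, size)]

# Stand-in renderer: report how many rows each page received.
rows = list(range(300))
page_sizes = render_in_pages(rows, len)
```

If the layout cost really is quadratic in the number of rows, as the speaker suspects, rendering N rows in one go costs on the order of N² while rendering them in chunks of 144 costs about 144·N, which is consistent with the large polling centers being the ones that got stuck.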
And the good thing is it's all open source. It's all available online. So if you want to take a look at it and play with it, or you want to collaborate with me and work on these things, you're welcome. The brief summary: messing with government data is really challenging and a lot of fun. I recommend all of you try it once. Beautiful Soup is enough for parsing HTML, even ASP.NET websites. You don't need any special frameworks; I even think they are counterproductive. Saving intermediate results saves you a lot of time. Parsing PDFs is a bit tough but not impossible. All this data will be very valuable if it is made available openly, and that's what I'm trying to do. Let me know if you have any questions.

Hi. First of all, it's very interesting work. I really appreciate your courage in taking this challenge up. What I want to ask is, whatever you have developed, what is its future? Do you have any plan to use these tools in the forthcoming elections, or any other plans?

The system that I built is probably going to be used in the next elections. And all the data continues to be used in other places. In fact, just two days before coming here, I was trying to get the Delhi data. The Delhi polling booth names are not on the Election Commission website like we had in Karnataka. They are available only in the PDFs. So I tried to write a program to parse the names from those PDFs, and then I can combine them into polling centers and so on. So it's very valuable and very challenging.

Yes. Mike here, please. Can you please sit down?

You just said that you had some performance issues. Obviously, when we are scraping this kind of huge data, especially with Beautiful Soup and that kind of library, there will be scaling issues. And you mentioned one issue that you got around with some hacking, like running it in parallel.
Do you have any scalable solutions or any techniques that you have used for that?

I don't think Beautiful Soup is a performance issue. The performance issue was in ReportLab, in the final report, and I worked around that by limiting the data size. Beautiful Soup was perfectly fine. There were no performance issues with that.

I have used Beautiful Soup once. It was a huge page, and it was taking at least 30 seconds to get even one piece of data.

Beautiful Soup can work with multiple parsing engines. If you create a BeautifulSoup object with the second argument as lxml, it tries to use the lxml library, which is a C library and pretty fast. You can look at the Beautiful Soup documentation; search for Beautiful Soup and lxml. That should give you an idea.

Thank you.

I have got a question for you. I worked with data.gov.in on the beta version of it, just the visualizations part. The real pain was the data that was there. It's in PDF, we all know that. Sometimes the open data format is very crappy indeed. When I say crappy, I mean you've got null entries and the data is not consistent enough. So is there any automated way to make the data more consistent?

Not really. The thing is, the data is crappy, but even if you can get the crappy data, that's very valuable. For example, when I got the ward information, there was one polling booth, I guess in East Bangalore, listed in J.C. Nagar ward. Very unlikely. But these guys just enter data like that. We can't help it. So we had to go and manually fix those kinds of entries.

You basically spoke about making data handling easier during elections. So what happens to all that data once the elections are over?

No, this data is not really about elections. It's not election results or something. This is about the structure of political boundaries in the country. How many states do we have? What are the parliamentary constituencies? What are the assembly constituencies?
And what are the polling booths inside each one of them? A lot of these tools are still valuable. For example, look at the volunteer sign-up system that I built. I built a small service with an API: given a location, it tells you which ward you belong to. If you want to do some work with wards in Bangalore, you could use it even now. I think we should really have a repository of all this open data so that anybody can use it when they need it. You should really spend your time on doing your work, not on all these kinds of things, right? You want to spend time on the election campaign, not on sitting and scraping stuff.

Next question.

You told us you had some problem with the ReportLab PDF conversion. What is the base data that you converted to PDF as a listing?

I basically have tabular data. I parsed the PDFs and got the tabular data. So I have the name, house number, gender, age, voter ID, booth and serial number. I want to put that out as a table. I was putting, as you see, two tables on a page, and I was trying to fit everything for one polling center. When the number of pages is high, it was just taking forever.

I have some experience with converting text to PDF and printing. I always found that text to Markdown and Markdown to PDF is much faster.

But it's not plain text. It's actually a table.

You can create a table in Markdown and convert it to PDF. It's pretty fast actually. I'm just wondering if you tried that option.

Markdown to PDF? I've not tried it, but I can see that it's essentially HTML to PDF, right?

Almost. Yeah. You can use, I don't know if you know this toolkit, I don't think it's Pandoc. I forgot the name of it, but it's not really ReportLab. It gives you a command line option for PDF.

Let me show you that thing. Yeah, so it's this. Okay.
So now, I think it's pisa.

pisa.

Sorry?

pisa.

Okay. So tables don't work very well with HTML printing. The reason is, I have a table and the header, if you see, I want it repeated on every page, right? The top has name, relation name, et cetera, and it is repeated on every page. In ReportLab, you can just say, lock the first row, so that it will be repeated on every page, because ReportLab is meant for printing on pages, whereas HTML and Markdown are not meant for printing. You could probably print this table from Markdown, but you won't be able to repeat those kinds of things. I'm also putting these two tables together on a single page. That would be very hard to do in Markdown.

Last question.

Hey. I used to work with the ReportLab libraries, but I had some problems with the versions of the PDF. For some PDFs, some header was wrong, so I was not able to read them at all. Then I had to manually save them in a different version and start using them. Do you have any experience with that?

No. My use of ReportLab is really limited. I just spent two days and cooked up something. I'm really not an expert in ReportLab.

Okay. Fine. Thank you.

Okay. So I know you guys have a lot of questions for Anand. He's been one of the prominent speakers at PyCon. What you can do is catch up with him after this talk. Before that, you can go out for lunch and come back for the next one. There will be another session; you can check the session plan.