So, good afternoon everyone. Today I'm going to discuss how to transform India's budgets into linked open data. Let's first see what budgets mean. Budgets reflect the priorities and values of the people of a state. They tell us about the promises of a government, and about how far those promises have been achieved. They have been referred to as moral documents across several geographies; there is one story linked here from Vox.com, a well-known data journalism website, that talks about how in the US budgets are considered moral documents.

But in India, the state of budgets is something like this: they are very hard to access and very difficult to comprehend. That makes it really difficult to analyze them in time, to set our priorities accordingly in terms of analysis and reporting, and to see where the government's priorities fit in. On your left you can see the budget of the BBMP, and on the right the budget of the Karnataka state government. You can see the disparity in format and structure, and the difficulties involved there.

The major issues with India's budgets are these. First, most of them are unstructured PDF documents, difficult to parse and difficult to analyze. Second, limited availability of budgets online. For example, the Tamil Nadu government doesn't keep budgets on its website for more than a year, so suppose I want to go back and see what was happening over the last two years: I can't do that. That's a major issue. Third, inconsistent formats. Each of these government bodies keeps changing its formats whenever it changes vendors, adds a new policy, or goes through a major political change. Fourth, no metadata. None of these websites give any metadata, so we don't have detailed information about the currency or about what's inside a budget document. You have to really go through thousands of pages to understand the key terms related to a particular document, which is a pain. And lastly, inconsistent and incomplete budget codes. You can think of budget codes as unique IDs for your database; if the budget codes are inconsistent and keep changing, how are you going to map the whole time series and see what the trend has been? Those are the major four or five problems we have been facing.

That's where OpenBudgets India comes in. We are a platform to make India's budgets open, usable and easy to comprehend. It's a community-driven initiative focused on open budget data. From public accounts to trust in government, there is a cycle which has been followed across various geographies. Public accounts are where you can see detailed information about how a government is spending. If governments publish that budget data in an open format, you can then use it to enable fiscal transparency: you can see where the priorities have been, what kinds of tenders have been invested in, and where the money has been going across various departments and ministries. Eventually, that can lead to trust in government. For today's talk, we will focus on just the open budget data aspect of things.

Let's see how things work. Open budget data is data that is publicly accessible: available online for everyone to use. It is in a reusable format, not just giving the analysis but giving the hard data points at the maximum level of disaggregation possible, so that people can find their own trends and do their own analysis. It should be free, without any restriction, and it should be legally open.
So, whenever you see that small C symbol in a circle, the copyright mark, it should be missing from the government website: no copyright. Also, the data should be machine-readable and editable, in Excel or CSV to begin with. That is what we say open budget data should ideally be.

As per Tim Berners-Lee, the inventor of the World Wide Web, there are five stars of open data. Number one is PDF, which is what we get at the moment from most government websites. The second is XLS, slightly more open, but it requires proprietary, vendor-specific software to use. The third is CSV: any machine can read this format and understand what is inside. The fourth is RDF: each data set should have a web URI so it can be referenced directly. And the fifth is linked open data: you are able to interact with these documents in the form of a graph, move from one database to another, link them together and draw your analysis. That is the fifth level of transparency, and that's where we want to take India's budgets: to linked open data.

We are an open-source, community-driven initiative. All our code, design, algorithms and documentation, everything is available on GitHub. Today we will dig into the data pipeline. This is how it looks: simple steps. Number one, scrape the documents from the various government websites in whatever format they are available. Some give us XLS, like Sikkim and the Union Budget website; the rest still give us messy PDFs. Second, parse them into clean, machine-readable data. Third, transform them to make them more usable on a timely basis: find the unique IDs and make the data completely machine-consumable. Fourth, publish. Publish it online and give a URL to each data set so that it can be used via an API. And the last step is analyze. The interesting part here is that analyze comes after publish: all the analysis should happen once we have already published the data. So you're not restricting analysis to in-house work; you're cultivating open analysis of budget data.

Let's focus on the scrape component first. There are around 150 such budget websites which give us information on the various priorities of government. As you can see, each one follows a different template, which means each one has a different HTML structure. So we developed a utility, a centralized set of functions and methods which can be used for scraping; call it scraping utils. Here you can download a file, do session management, cookie management, XPath selection and so on. Then for each particular website we write a very small plugin, so that all the custom logic sits with that website. If the website structure changes, we just change the plugin; this is a relatively small amount of code compared to the utility. At the end, we get the PDFs and XLSs, or whatever format is available on the website.

Just to give a brief idea about XPath: XPath lets you treat an HTML or XML document as a tree structure and then access a particular node, or a set of nodes, which follow a certain rule: a certain parent, a certain child, and so on. The example here accesses specific language editions, German and French: Wikimedia is the root, its projects sit below it, and then come each project's editions. That's the hierarchy.
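As a rough illustration of this utility-plus-plugin split, here is a minimal sketch using requests and lxml. The function names, the URL and the XPath rule are all hypothetical, not the project's actual code:

```python
import requests
from lxml import html

def download_file(url, session=None):
    """Shared utility: fetch a URL, reusing a session for cookie management."""
    session = session or requests.Session()
    response = session.get(url, timeout=30)
    response.raise_for_status()
    return response.content

def scrape_budget_site(index_url, link_xpath):
    """Per-site plugin: only the URL and the XPath rule are site-specific."""
    tree = html.fromstring(download_file(index_url))
    # XPath walks the parsed HTML tree and returns the nodes matching the
    # rule, e.g. every PDF link sitting under a specific table.
    return tree.xpath(link_xpath)

# Example plugin call: the selector is the only thing that changes per site.
pdf_links = scrape_budget_site(
    "https://example.gov.in/budget-documents",  # hypothetical URL
    "//table[@id='documents']//a[contains(@href, '.pdf')]/@href",
)
```

The point of the split is that the utility stays stable while each plugin stays tiny: a new website usually means a new URL and a new selector, nothing more.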
The second step is parse. From those 150 websites we get 150-plus budget document structures, and that's another messy problem to solve. These are the kinds of PDFs we get. As you can see, there is some tabular information there. As humans, we can see the lines and the columns and say, okay, approximately these are the cells. But how do you train a computer to do the same? That's the challenging problem. So, similar to the scrape pipeline, we have created a parsing pipeline. The centralized repository is PDF to CSV, where most of our munging and parsing code resides. Then there are individual plugins to customize and tailor the processing for each state or the Union.

Let's dig into the parse algorithm. What are the steps we follow? Step number one is to loop over each page in the PDF and convert it into an image. Sometimes these PDFs are rotated, so we unrotate them: we straighten the pages. Sometimes we also need to change the page layout, for example converting an A3 page into A4, and so on. These kinds of page-layout fixes happen at this stage.

Then, for each page, we try to identify the set of vertical and horizontal lines prominent in that image. This happens using the Hough transform, a very popular computer vision technique for detecting lines. You can see the demo there: it works like a lighthouse. It keeps checking all the points available in the vicinity and tries to move in the direction where more points are present. Step by step you progress, and you start drawing the line. Those are the votes you can see accumulating on the right-hand side of the animation: the more votes in a particular direction, the more confident we are that a line runs that way.

Once we have the lines, we detect the largest contour. This is done using OpenCV, a popular computer vision library. We detect the biggest bounding box possible, the largest rectangular contour present, and that gives us the table boundary. Another thing happens in this step as well: we extend the vertical lines to touch the table boundary, to make sure the whole table structure is in place.

Next, we compute the coordinates of the table and the columns for each page. We call these the table attributes: top-left, bottom-right, and the column coordinates extracted from the previous step as c1, c2, c3 and so on. These are then passed to a popular open-source library known as Tabula. Tabula is very good at parsing PDFs, but it normally requires human input, so what we do is give it the bounding boxes and the column coordinates ourselves. It then detects the cells: for each character in the PDF we get information like this, the top, left, height, width and rotation, so given a bounding box we can work out which characters fall inside it. Internally, Tabula uses Apache PDFBox, a popular Java library for parsing PDFs, to extract the characters in this format.
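Here is a rough sketch of the line-detection and table-boundary steps just described, using real OpenCV calls (OpenCV 4 signatures) but simplified, hand-picked parameters; the production pipeline is more involved:

```python
import cv2
import numpy as np

def find_table_boundary(page_image_path):
    gray = cv2.imread(page_image_path, cv2.IMREAD_GRAYSCALE)
    edges = cv2.Canny(gray, 50, 150)
    # Probabilistic Hough transform: edge pixels vote for candidate lines,
    # and peaks in the vote accumulator become detected line segments.
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180, threshold=100,
                            minLineLength=200, maxLineGap=10)
    if lines is None:
        return None  # no ruled lines found on this page
    # Redraw the detected lines on a blank mask, then take the largest
    # rectangular contour as the table boundary.
    mask = np.zeros_like(gray)
    for x1, y1, x2, y2 in lines[:, 0]:
        cv2.line(mask, (x1, y1), (x2, y2), 255, 2)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    largest = max(contours, key=cv2.contourArea)
    return cv2.boundingRect(largest)  # (x, y, width, height) of the table
```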
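And a sketch of handing that geometry to Tabula via tabula-py, the Python wrapper around Tabula (which in turn uses Apache PDFBox). The file name and all coordinates here are made-up placeholders; in practice they would come from the OpenCV step above:

```python
import tabula  # pip install tabula-py; Tabula itself needs a Java runtime

tables = tabula.read_pdf(
    "state_budget.pdf",                 # hypothetical input file
    pages=1,
    guess=False,                        # don't auto-detect; we supply geometry
    area=[90, 40, 780, 560],            # table bounds: top, left, bottom, right
    columns=[120, 260, 340, 420, 500],  # x-coordinates of column separators
)
tables[0].to_csv("page_1.csv", index=False)
```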
Beyond the common munging in the central repository, we need some state-specific munging, which happens in the plugins: things like fixing header values, because sometimes headers are inconsistent, and sometimes rows or columns have been merged or split to make the document print-friendly. So what we do is either split them or merge them back, based on the logic, to make the data more machine-readable. We also filter out non-UTF-8 characters, because when you are pushing data through an API, non-UTF-8 characters become a problem. And similarly, we run other data sanity checks, like verifying totals and cross-checking sums, to make sure the PDFs have been converted into CSVs correctly. Finally, we get something like this out of it, which is much easier to consume.

But there are still some problems with it, and that's what we look at in the transform step, the third step. For state budgets specifically, this is how the unique ID looks. There are seven heads. The demand number identifies a department of that particular state government. The major head is a function of that government. The sub-major head is a sub-sector within that function. The minor head is a program within the sub-sector. The sub-minor head is a scheme, like Sarva Shiksha Abhiyan or Swachh Bharat. The detailed heads are objects of expenditure, like salaries or office expenses. And under the object heads we have sub-schemes, wherever possible. Those are the seven heads, and the demand number is the unique identifier for a particular demand.

But this is what we actually get from a state budget; this is Karnataka's budget. You can see here that you get just one code, while all the information is hierarchical in nature: you can see "Urban Health Services, Allopathy" being repeated multiple times. They are trying to present hierarchical information in a flat structure, which is a pain to deal with. So what we do is collect all of this and bucket it under a specific budget code. Once these budget codes are ready, they can act as unique IDs, and the data is finally ready to go into a table in your database.
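As a toy illustration of that bucketing, here is a sketch in pandas. The column names and rows are hypothetical, but the idea is the same: carry each hierarchy level down to the rows below it, then join the levels into one composite code that can serve as a primary key:

```python
import pandas as pd

# Hypothetical flattened extract: each level appears only on the row
# that introduces it, the way the PDFs present it.
df = pd.DataFrame({
    "major_head": ["2210", None, None],
    "sub_major_head": [None, "01", None],
    "minor_head": [None, None, "110"],
    "description": ["Medical and Public Health",
                    "Urban Health Services - Allopathy",
                    "Hospitals and Dispensaries"],
    "amount": [None, None, 1234.5],
})

# Forward-fill each level so every row knows its full lineage,
# then concatenate the levels into one composite budget code.
levels = ["major_head", "sub_major_head", "minor_head"]
df[levels] = df[levels].ffill()
df["budget_code"] = df[levels].apply(lambda r: "-".join(r.dropna()), axis=1)
print(df.loc[2, "budget_code"])  # 2210-01-110
```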
Next, we publish. For publishing the data sets we use CKAN, a platform which enables open data publishers to publish their data in a much more structured way. You can add detailed metadata: as you can see, there is a description, you can see which formats are available for a particular data set, and there are keywords, sources and tags for each data set, which is really helpful for anyone trying to consume the data.

This is what the CKAN architecture looks like. It's a typical MVC architecture. The base is the model, which deals with the system database and a data store, mostly in PostgreSQL; on top of that sit the SQLAlchemy ORM and search via Solr. Then there is the logic layer, the so-called controller, where authentication, actions, business logic and background tasks happen, like updating data sets, generating the sitemap XML and so on. Views are rendered using Python's Jinja2 templates, a popular templating engine for HTML documents. As you can see, from the logic layer there is also direct access to the API, where most of the information is served as JSON or multipart. And on top sits a simple routing layer, which maps a URL to a particular data set. On the left you can see there is the opportunity to add custom plugins.

Let's see what we can do with the help of plugins. You can add custom libraries: JS libraries for visualizations, and whatever Python libraries are required. You can add your own Python scripts for the controller (CKAN uses Pylons as its base framework). You can create your own Jinja2 templates and change the views and the hierarchy of documents completely. And you can add custom CSS, image files and so on. What you get from each extension is one piece of functionality: from some you get visualizations, from others a sitemap, and so on.

With the help of a sitemap, you allow bots to index all this information. None of the budget websites publish a sitemap at the moment, and that's a big problem for making them searchable. So we generate a huge sitemap with detailed information so that these documents can be found by users. And this is how the categorization looks. We categorize by tier of government: combined budget, union budget, state budgets, municipal corporations. And in terms of sectors, we have chosen 12 developmental sectors, like agriculture, education, drinking water and sanitation, and so on, so that researchers in a specific area can look into the budget directly.

We also publish the data sets via an API. This is how the API looks: you can easily access the keys and the corresponding values, and for each data set there is a resource ID, a unique ID used to get the rest of the information. All these data sets are under the Creative Commons Attribution 4.0 (CC BY) license; you just have to use them with attribution in your work.

And how does linked open data come into the picture? Take the unique ID we discussed a couple of slides back, 2210-01-110: Urban Health Services, Allopathy, Hospitals and Dispensaries. With the help of the Comptroller and Auditor General, these three codes are mandated to be unique across all the states. So with this unique ID, you can query any state and link the whole information together: you can connect Karnataka's budget for 2017-18 with previous years and compare it with Sikkim's budget at the same time, just with the help of this unique ID. That's the power of linked open data. This kind of analysis used to take months to do; now, with linked open data, you are able to do it in 15 minutes or so.
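Since the portal runs on CKAN, such a query can go through CKAN's standard datastore_search action. Here is a minimal sketch; the resource ID is a placeholder, and whether the production portal exposes this endpoint exactly like this is an assumption of the sketch, not a confirmed detail:

```python
import requests

# CKAN's standard action API; assumed, not confirmed, for this portal.
API = "https://openbudgetsindia.org/api/3/action/datastore_search"

params = {
    "resource_id": "00000000-0000-0000-0000-000000000000",  # placeholder ID
    "q": "2210-01-110",  # free-text search, e.g. a budget code or keyword
    "limit": 5,
}
result = requests.get(API, params=params).json()["result"]
for record in result["records"]:
    print(record)
```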
The last component of the data pipeline is analyze. This is how we analyze. You can see a time series of the budget at a glance for the union government, the central government, from 1994 to 2017-18. You can compare the recovery of loans, the total revenue, the total expenditure of the government and so on, and see what the trends have been. You can also compare the actual accounts, the budget estimates, the revised estimates and so on. You can likewise analyze the budget of a particular municipal corporation and see what is happening in your municipality. This one is for Ahmedabad: you can see the trends and the priorities, how your municipality has been performing over the last couple of years.

Also, with the help of linked open data, we are able to produce an analysis of the union budget in less than 15 days. As soon as the budget is out, we can put up a dynamic tool where you can see the sectoral priorities of the government and of the budget, and then argue from the facts rather than just opinions. You get all the data in one single place.

To extend our efforts, what we are building at the moment is an aggregated comparison of state budgets. You can see which state has been investing more in which sector, again across the same 12 sectors: how Karnataka is doing in agriculture compared to Tamil Nadu, Madhya Pradesh, Himachal Pradesh and so on. And you can look not just at total expenditure but also at sector expenditure as a percentage of the state budget. You can do per capita expenditure, per capita according to the population of that particular state, and you can see that the figures change drastically: the total expenditure may be huge, but relative to the population of the state, it goes down drastically. You can also separate revenue expenditure from capital expenditure: revenue expenditure is an ongoing cost, while capital expenditure is something like the recovery of loans or the buying of land and so on.

One more motive of our initiative is to educate people. Currently it's very difficult to understand what is happening in the budgets, what these codes are. Demystification is a key aspect of open data. So we are creating something called Budget Basics, where you can get very detailed information on how budgets function, where the money comes from, where the money goes and so on.

As for future work: with these budget codes in place, we want to create a public national database of budgets covering all levels of government, the union budget, state budgets, municipalities, and eventually district parishads as well. This would help us compare, with the help of the unique IDs, what is happening in each government body, and see the time series for the same. Suppose I don't know the budget code for hospitals. I could simply type in "hospital" and see all the matching data come in: the index is not just on the code, it is also on the description. This would also facilitate fuzzy search. For example, Karnataka writes "Sarva Shikshana Abhiyana" while Madhya Pradesh writes "Sarva Shiksha Abhiyan"; when you type either of these, you should get both results. So fuzzy search would also be in place. That's what we have been building so far.
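Here is a minimal sketch of that fuzzy-matching idea using only Python's standard-library difflib. The codes and the cutoff are hypothetical, and a real deployment would more likely lean on the search engine (Solr) behind the portal:

```python
import difflib

# Hypothetical description index: budget code -> description.
descriptions = {
    "KA-2202-SSA": "Sarva Shikshana Abhiyana",  # Karnataka's spelling
    "MP-2202-SSA": "Sarva Shiksha Abhiyan",     # Madhya Pradesh's spelling
}

def fuzzy_lookup(query, cutoff=0.6):
    """Return budget codes whose descriptions roughly match the query."""
    matches = difflib.get_close_matches(query, descriptions.values(),
                                        n=5, cutoff=cutoff)
    return [code for code, desc in descriptions.items() if desc in matches]

print(fuzzy_lookup("Sarva Siksha Abhiyan"))  # finds both spellings
```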
How can you contribute? Help us generate more open budget data. We are struggling to scale up, since the number of departments is huge, so help us evolve our algorithms, and contribute your ideas and suggestions on what else we can do. Help evolve our code base: everything is in the public domain, everything is on GitHub, so give your suggestions, open some issues, help us find some bugs; that would be really helpful. Cover budget data in your geography: currently only 98 municipal corporations out of more than 300 publish their data, so help us cover more municipal corporations on the portal and help us convert those PDFs into clean CSVs. Refine our designs, and help us do more analysis in time. Give us more suggestions on how we can make this data searchable and usable. We are open to new ideas, suggestions and feedback. Here is where you can reach us: these are the slide URLs, the code URL, my email address and all our Twitter handles. That's it. We will take only a couple of questions, as Gaurav and the next speaker, Rakesh, will be available at the OTR session; OTR is at 3 p.m. in room one.

Hi, this is Paridhi from American Express, and thank you so much for the presentation; it is one of a kind among those I've seen so far. One quick question: this is one use case, for budgets. But suppose I want to export this technology, this complete architecture, to banks, for example, because I want to know how my rival banks have been performing over a period of time based on financial audits. How well do you think this technology can be ported, and how fast can it run?

Sure. Each component of this pipeline is open source. You can start by picking up the scraping utils from here; you can pick the PDF to CSV code from here if you're dealing with PDFs; you can take the transform code from here; and the publishing platform is already open source. Analysis is the one part which, I think, would be very custom-tailored to your use case, but the first four key components are already in the public domain, so you can start picking them up and playing around with them. How complex the structure of your data is would govern the time it takes to replicate this pipeline for your use case.

The processing, or the development of the pipeline?

Development of the pipeline took us almost one and a half years. But now it's open source, so adapting it should take hardly a few weeks; that's what we estimate. Next question.

This is a great initiative; it will make data available for all of our citizens. (Put your mic a little closer.) Hello. I wanted to ask: are there any initiatives, not related to this one, to help governments generate data in a systematic way, rather than them generating PDFs and us going through this complex pipeline? You'd give a tool to a municipality, to a central government or a state government, where they key in the data in a tabular format and it generates the PDF as well as the CSV.

I think Rakesh will be covering that part in detail, but I'll add a few points. It's currently difficult to convince governments to use open-source technology and give open data by default; that's where the struggle has been. Most of these departments use treasury software, but all of that software is proprietary in nature and suffers from vendor lock-in. Those are the major problems. We do have a policy in place, the National Data Sharing and Accessibility Policy, but very few data sets are out in public. One last question.

Hi. Does CKAN store data in RDF format?

Yeah, so the URL itself is actually in RDF form; you can access it via the API.

Okay, and what is the number of triples that you have?

Number of triples?

Yes: each entry is a subject-predicate-object triple, so what is the total number?

I think it depends on the structure of the document. Some have more, some have less, but on average we deal with at least 15 triples per row, 15 per object.

And the performance so far, of the platform or the parsing part?

For the platform, so far we are able to serve more than 10,000 requests per hour. Thanks, Karan.