I'm coming from an organization known as CivicDataLab. What we do is connect data, science, design, tech, and social science together to strengthen civic engagement in India. One of our projects is Open Budgets India. It's currently the largest community-driven open data platform of its kind in India. It consists of the fiscal information of the country: union budgets since 2010 onwards in machine-readable formats, a couple of state budgets in machine-readable formats, like Karnataka, Sikkim, West Bengal, Assam, and so on, and the budgets of a lot of municipal corporations; close to 55 municipal corporations are covered on the platform. Everything is free, open for use, and under a Creative Commons license, the same license you can see at the bottom of my slide.

What we will mostly be discussing today: first, setting the scene, letting you know what district treasuries are; how we can do time series analysis on them; some exploratory data analysis through open source tools; evaluating various algorithms for time series anomaly detection, because that's our focus area; then visualizing the trends and the big picture; and lastly, how we can scale it up, how you can contribute, and your ideas around it. The last bit is much more discussion oriented, so I would like to wrap up my talk a bit early so that we can discuss these ideas more.

There was this article in The New York Times by Carl Richards, where he says budgeting equals awareness. We are in this era of data where we don't know how our money is being used, how our tax money is being used to deliver public services. How much of your tax goes into the school in your neighborhood, or into the pothole repair work in your ward? We don't have that kind of transparency in India, but other countries are working to put that transparent information in place. So how can we advance India in that direction with the help of data science?

This is how a typical government expenditure flow looks. The Government of India, the union government, transfers a certain amount to the state governments and a certain amount to societies and autonomous bodies. These societies are generally set up for specific national schemes, like Sarva Shiksha Abhiyan, the flagship scheme for the right to education, the National Health Mission, and so on. These are the flagship schemes of the national government for delivering basic benefits to the citizens of India. From these two sources, the money goes to the district offices, or the zilla panchayats, from where it trickles further down to block panchayats for public works, or to the gram panchayats, and in the end to the beneficiary. The societies also give money directly to beneficiaries. And sometimes the state government can invest directly in public works, for example state highways.

So this is generally how the money flow works in India. It's pretty complex. Even if we are looking into one particular aspect of government, we can't be certain we are getting the complete picture. And specifically, societies and autonomous bodies don't publish data in the open on a regular basis, which becomes a bigger hindrance to doing the math properly. But there are certain data sets which do come out regularly and are publicly available for citizens to work with.
And these should be of key interest to data scientists who are trying out new algorithms and new techniques for dealing with data. The administrative bodies publish this information at different cadences. Union budgets come on a yearly basis, on the first of February since 2017 onwards, and we get 1,000 plus documents to analyze for the Union Budget of India. State budgets also come on a yearly basis, from March onwards to, say, May, and sometimes, if there is a change of government, we get a revised budget and so on. But these are all yearly data sets, with just three timestamps to work with, so there's not much scope for time series analysis on them; you really need to collate a lot of data points over the course of years to create a proper time series.

But interestingly, some states publish disbursement information. If you look at this cycle, how much money is disbursed from the district treasuries, the zilla panchayats, to the block panchayats and other beneficiaries, that is the disbursement information. This is near real-time expenditure of district treasuries, and a couple of states publish it on a daily basis. So that is the focus of our talk today: district and sub-district treasury data.

The approach we have taken for this exercise is: first, find the research questions we want to answer using this data. Second, start gathering the data in a clean, usable, machine-readable format, and clean it further so that you can train a model or do more analysis on top of it. Third, do a detailed exploratory data analysis to see the trends and patterns in the data; if there are outliers, identify them; figure out what kinds of algorithms you can try. You get a lot of information from the EDA. Then evaluate various algorithms; for lack of time, we will go into just two basic algorithms today, evaluate them, and see how we can do time series analysis with them. And finally, look at the future of it: how we can scale it up and create a public product out of it.

Defining the research questions: the basic questions we need to ask are how we can track funds and the utilization of those funds using this data; how we can identify anomalies, off-beat transactions happening at the district level; how we can track expenditure for certain national and state level schemes; and finally, how we can engage various stakeholders with near real-time district fund analysis, making it participatory and accessible for citizens to use on a regular basis. Those are the basic research questions we have come up with for this data.

For today's scope, we are looking into Odisha's district data. A lot of districts in Odisha publish almost near real-time disbursement information. There is a portal, odishatreasury.gov.in, where they publish detailed information. Once you enter the right codes and the date here, you get something like this: a PDF which is unstructured to deal with, and a lot of information is hidden in it. But we have figured out a way to extract HTML out of it, and then we parse that HTML into clean CSVs. For today's talk we are focusing on the Balasore, also known as Baleshwar, district of Odisha, and seeing how that district is performing.
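To make that extraction pipeline concrete, here is a minimal sketch of the HTML-to-CSV step, assuming the treasury PDF has already been converted to an HTML file containing tables. The file names and the cleanup rules here are hypothetical, not the production script.

```python
import pandas as pd

def html_to_csv(html_path: str, csv_path: str) -> pd.DataFrame:
    # read_html returns one DataFrame per <table> element found in the file
    tables = pd.read_html(html_path)
    df = pd.concat(tables, ignore_index=True)
    # Hypothetical cleanup: normalise headers, drop fully empty rows
    df.columns = [str(c).strip().lower().replace(" ", "_") for c in df.columns]
    df = df.dropna(how="all")
    df.to_csv(csv_path, index=False)
    return df

# Hypothetical file names for one day of Balasore treasury data
df = html_to_csv("balasore_2018-03-01.html", "balasore_2018-03-01.csv")
```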
So we wrote a script along those lines to parse the HTML after extracting it from the PDF, and to clean it. The PDF we were looking at here, it might be difficult for you to read from the back, but it has some budget codes, and for each one the allotted amount, the expenditure, the amount which was surrendered, and the balance of the treasury. Each data point carries these four basic figures, along with the budget code, which defines the purpose of the transaction. Once we run our script and clean it up, it looks something like this, which is much more digestible for humans as well. I wonder why they don't just publish it like this.

So there's the DDO code, the Drawing and Disbursing Officer. These officers are responsible for taking money from the state treasury and disbursing it among the gram panchayats and other institutions falling in that geography. Then the DDO name; for example, this is the Deputy Inspector General of Police, Eastern Range, Balasore. So this is the position of the DDO which gets the money. Then the budget code defining the purpose of the money, why it was received; the allotment serial number, which increases if there are multiple allotments; and the allotment date. So, for example, on 1st March 2018, 3,000 rupees were allotted to this particular Inspector General for a certain purpose, but that person made no expenditure that day, so what we have is a balance of 3,000 rupees. It's a simple balance sheet, like the ones we keep for our own accounts, but for all the public money going into the district.

But the difficulty is the codes. To digest what these codes are referring to, we need to look into their mapping, which is, again, a very cumbersome process. The budget code generally has seven to eight levels of categorization. It's like writing your address: the first part would be your country, the second your state, the third your district, the fourth your ward, and so on, with the house number being the last figure of your address. Similarly, they create budget heads so that people can know where the money behind each particular number is going. So we have the department, major head, sub major head, minor head, sub minor head, detail head, object head, and so on. Each state in India gives information in this format about how it is spending money. Interestingly, as per the CAG guidelines, the codes need to be common only till the minor head. Beyond the minor head, states can have their own coding, which makes this data much harder to keep interoperable and comparable after the minor head.

Now that we have the data in place, we scraped it from 2014 onwards till the end of the previous financial year, 31st March 2018. So we have a good four years of time series data with us. What we will do now is run exploratory data analysis on top of it. The purpose of this exercise is to identify consistent patterns; identify outliers and discuss them with various stakeholders to understand their reasons; highlight visible trends; recognize factors of seasonality, if there is any seasonality hidden in the data; and visualize relationships among key variables. Simple steps you would take with any time series data set. For this purpose I'm using open source tools. Everyone is aware of Jupyter. Apart from that, I'm using Apache Superset.
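As an aside before the tooling walkthrough, a quick sketch of unpacking a budget code into the heads just described. The hyphen-separated layout and the sample code below are assumptions for illustration; the real delimiters and field order vary by state, which is exactly the interoperability problem mentioned above.

```python
# Ordered head names, following the hierarchy described above
HEADS = ["department", "major_head", "sub_major_head", "minor_head",
         "sub_minor_head", "detail_head", "object_head"]

def parse_budget_code(code: str) -> dict:
    parts = code.split("-")
    # Pad shorter codes so every head gets a slot (states differ beyond minor head)
    parts += [None] * (len(HEADS) - len(parts))
    return dict(zip(HEADS, parts[:len(HEADS)]))

# Hypothetical code; 0731 is the ICDS sub-minor head mentioned later in the talk
print(parse_budget_code("05-2202-01-101-0731-78041-26001"))
```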
Apache Superset was incubated at Airbnb and then taken up by multiple organizations when Airbnb decided to make it open source. The Apache incubator has been hosting the project for some months now, improving it and making it friendlier for open source contributions. If you work with Tableau or Power BI, this is the open source version of the same.

So let's do some exploratory data analysis. I'm not sure how readable it is from the back, but I will try to call everything out loud so that you can follow. Here we are seeing the monthly allotment and expenditure for the Balasore district. Blue is the allotment, the budgeted figure, and red is the actual expenditure. As you can see, in none of the months from April to March of financial year 2017-18 was the district able to spend the money allocated to it. It's a very interesting insight: more money is being allotted to the district than the district is able to spend, in every single month from April to March. And obviously there's a big surge in April; we will try to figure out the reasons for that once we look in more detail.

Let's see the biggest five consumers of this money. We have the education department, primary education, the one in blue. Then a very small component, at least in April, and a bigger one in May, in red, for the higher education department. The yellow one is health and family welfare, the green one is the home department, and the purple one is the women and child department. That's how the distribution has been.

Let's look at the top 10 departments receiving funds through Balasore district. As we saw, education has been at the top; the least popular of the top 10 is agriculture. You can see the same thing in a table. I'm not sure how visible it is at the back, but it says this much was allocated, this much is the remaining balance which was not utilized, and this much is the expenditure. Health and family welfare has a good amount of balance remaining, apart from the school and mass education department. These are the questions we need to ask the specific departments in that district: why were you not able to utilize the money given to you for the welfare of the district?

Another way to look at it is the partition diagram. Suppose I want to see who the major consumers are within the school and mass education department. I click on it, and I see the district education officer of Balasore consuming the maximum amount, then the district education officer of Sadar, then Pongal, and so on; these are the sub-districts of Balasore, or Baleshwar, and you can see the components for each. You can create these partition diagrams without any knowledge of coding. You just install the software, it runs on the web, and it's good to go for your purpose.

You can also create box plots. I have created a box plot up to the 98th percentile, so I have removed two anomalies, one at the top and one at the bottom: every district treasury might have a zero allotment at some moment, so the bottom anomaly would be zero, and the top one you can easily see from the graph. The link is included in the slides, so if you have trouble seeing it, you can look later as well. Again, this is education, this is social security, this is social welfare, and so on.
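For anyone following along in a notebook rather than in Superset, here is a pandas sketch of the aggregations behind these charts. The file name and the column names ("allotment_date", "allotted_amount", "expenditure", "department") are assumptions about the cleaned CSV, not the actual schema.

```python
import pandas as pd

df = pd.read_csv("balasore_treasury.csv", parse_dates=["allotment_date"])

# Monthly allotment vs expenditure for FY 2017-18
fy = df[(df["allotment_date"] >= "2017-04-01") & (df["allotment_date"] < "2018-04-01")]
monthly = (fy.set_index("allotment_date")[["allotted_amount", "expenditure"]]
             .resample("M").sum())
monthly["unspent_balance"] = monthly["allotted_amount"] - monthly["expenditure"]

# Top 10 departments by total allotment
top10 = fy.groupby("department")["allotted_amount"].sum().nlargest(10)
print(monthly)
print(top10)
```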
Back in the box plots, you can see the Superintendent of Police and so on, and how these DDOs are performing. What we were seeing in this whole dashboard were just the data points for financial year 2017-18. Let's see what was happening across the complete data. Superset also gives you a very useful filter selection feature. Here I'm seeing data on a monthly basis, from April 2014 up to March 2018. We collected all this data, and now you can clearly see the seasonality pattern. From April to somewhere in August we get a surge, which is the biggest allotment of the whole financial year, and the departments are expected to spend this money during that surge period; this happens in every financial year. Sometimes the surge is in August, sometimes in April, but it's mostly in the first half of the first quarter, or the first half of the second quarter, that we get these surges of funds.

You can then filter this by department. For example, I was seeing the total transactions of the treasury, but if I want to look into a specific department, say the school and mass education department, the primary schools, it generates the dynamic visual for the same. You can see the red one is the allotted amount, and the blue one is the actual expenditure. The gap has been increasing since October 2017. It's a very interesting pattern in the data: suddenly we observe a growing gap between allocation and expenditure from October 2017 onwards. It could be a very interesting question for district budget analysts: why was this happening? Once I deselect, it refreshes again, and here I get the balance time series. You can see it was pretty smooth till October 2015, but then came surges, with a drastic surge in February 2018. One can zoom in further to see what the problem was. This sort of exploratory data analysis lets you ask the right set of questions just by looking into the data, slicing and dicing it, and it's very simple with the help of these open source tools. You don't need to code at all.

Summing up the key insights: there is a yearly surge in fund allotment around April to August. The school and mass education department and the DDOs associated with it spend the most. Expenditure in comparison to the allotted amount of various DDOs declined substantially in financial year 2017-18, especially after October. These are crisp data points for a public accountability tool, to put to the right people in government. And apart from the pay of the district administration, the major expenditure happens on a sub-minor head known as 0731, the Integrated Child Development Services scheme. So a lot of money in Odisha is going into the Integrated Child Development Services scheme. You are able to see these kinds of trends with such a dashboard.

Now that we have done the exploratory data analysis and have a rough idea about our data, it's time to move to the algorithmic approach of finding anomalies. The simple thing: create a moving average window and see how the anomalies come up every month. This is how the anomalies look for the whole financial cycle, from 2014 till March 2018. There are a lot of anomalies.
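A sketch of that simple moving-average baseline: flag points that fall outside mean ± k standard deviations over a rolling window. The window size, the value of k, and the input file are assumptions, not the parameters actually used in the talk.

```python
import pandas as pd

def rolling_anomalies(series: pd.Series, window: int = 30, k: float = 3.0) -> pd.Series:
    # Rolling mean and std define the permissible band at each point in time
    mean = series.rolling(window, min_periods=window).mean()
    std = series.rolling(window, min_periods=window).std()
    return series[(series > mean + k * std) | (series < mean - k * std)]

# Same hypothetical cleaned CSV as in the earlier sketches
df = pd.read_csv("balasore_treasury.csv", parse_dates=["allotment_date"])
daily = df.set_index("allotment_date")["expenditure"].resample("D").sum()
print(rolling_anomalies(daily))
```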
You may ask why, on one particular vertical line, you can see multiple points. Since we are seeing the data for the complete treasury, there would be multiple DDOs on that particular date spending different amounts of money for different purposes. That's why we see more than one anomaly on one line.

Let's see how the balance looks. The balance, again, is the amount they were supposed to spend in the designated time but couldn't. As we saw from the data, the anomalies were fairly few till March 2017, increased after March 2017, and drastically increased after October. A simple moving average window can tell you the same thing. If I just plot the anomalies, they look something like this: there's a big red cluster on your right-hand side, which is something you can clearly point out, print out, and hand to the district administration, saying this is the trend we are seeing in the district spend data.

Moving on from the moving average algorithm, there are other algorithms which take into account the specific nuances of time series data. One is something Twitter developed and made open source, known as Seasonal Hybrid ESD. It's an algorithm they use for detecting alerts in the way people are tweeting about a particular topic in different geographies. It builds on a basic algorithm developed in 1983, the generalized extreme studentized deviate test, acronymed ESD, which is used to detect one or more outliers in a univariate data set that follows an approximately normal distribution. So there are two caveats for this algorithm to work: your data needs to follow an approximately normal distribution, and it needs to be univariate. But unfortunately, the real world does not always follow a normal distribution; sometimes we need to deal with multimodal distributions as well. So how do we deal with those scenarios?

Twitter's answer was to first do a seasonal and trend decomposition using LOESS (STL). Let's see what this decomposition is. This is your data. What we do is identify a trend in it and identify the seasonality component, and once we subtract the trend and seasonality vectors from the initial data vector, what we get is the remainder, the STL remainder, which is unimodal and most likely close to a normal distribution. So your ESD test can now work on the remainder, once the trend and seasonality factors are removed.

However, there is still a problem. Extremely anomalous data can corrupt the residual, the remainder component, itself. How do we deal with that? To fix it, the paper proposes using the median to represent the stable trend, instead of the trend found by the STL decomposition: a simple swap of the fitted trend for the median. Finally, for data sets which have a higher percentage of anomalies, the paper proposes using the median absolute deviation, MAD, as the measure of spread. So you change the way you take the average: instead of the mean you take the median, and if the anomalies are dense, you measure spread with the median absolute deviation, MAD.
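A loose sketch of the S-H-ESD idea just described, assuming a daily series with weekly seasonality and a unique date index: decompose with STL, subtract the seasonal component and the series median, then run a generalized ESD test on the remainder using the median and MAD in place of the mean and standard deviation. This follows the outline of the Twitter paper, not their R package; the period, alpha, and max_anomalies values are assumptions.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.tsa.seasonal import STL

def sh_esd(series: pd.Series, period: int = 7,
           max_anomalies: int = 20, alpha: float = 0.05) -> pd.Index:
    # STL gives trend/seasonal/remainder; per the paper we discard the fitted
    # trend and subtract the overall median instead
    seasonal = STL(series, period=period, robust=True).fit().seasonal
    resid = series - seasonal - series.median()

    x, n = resid.copy(), len(resid)
    candidates, keep = [], 0
    for i in range(1, max_anomalies + 1):
        mad = (x - x.median()).abs().median()
        if mad == 0:
            break
        # Robust test statistic: distance from the median in MAD units
        scores = (x - x.median()).abs() / (1.4826 * mad)
        idx, R = scores.idxmax(), scores.max()
        # Critical value for the i-th generalized ESD test
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam = (n - i) * t / np.sqrt((n - i - 1 + t ** 2) * (n - i + 1))
        candidates.append(idx)
        if R > lam:
            keep = i  # the largest i with R_i > lambda_i sets the anomaly count
        x = x.drop(idx)
    return pd.Index(candidates[:keep])
```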
With these changes, your remainder component is much more stable as input to the algorithm. The key benefits of this particular technique: it detects both global and local anomalies, which is very interesting; very few algorithms out there help you recognize both in a seasonality-aware approach. It employs time series decomposition and robust statistics together with the ESD test. And for longer time series, such as six months of minute-level data, the algorithm employs piecewise approximation as well. Those are the broad benefits of using this algorithm for your purpose.

So we looked at a specific quarter, the last quarter, where we were seeing a lot of anomalies. With the help of this algorithm, the counts are much crisper, and when we looked into these anomalies, they were much more accurate compared to the very basic rolling window model. You can try it out on your own time series data, especially if it is more nuanced or much bigger; we had close to 50,000 data points for one district to test this sort of algorithm.

These algorithms are really helpful, but there are other, very real problems when you try to scale things up, and the biggest one is proxy errors. When you try to scrape the data frequently from the website, we keep getting these sorts of errors, which block us from getting near real-time information out of the system. As of now, the website is still down; I checked a few minutes ago. I was thinking I could pull out five districts' data for this presentation, but unfortunately the website is not letting us gather that much information at the moment. So we need more infrastructure support for this sort of public, open data to be sustainable, and at the same time more use cases for pitching to the government and other organizations, civil society organizations, activists, and media, showing that such data is helpful for drawing insights.

Just some closing notes so that we have enough time for discussion. We will evaluate more algorithms, things like exponential smoothing, double exponential smoothing, triple exponential smoothing and so on (a sketch of the triple variant follows below), to see if we can get better, more accurate anomalies to share with the government. We will try to scale up to multiple districts: currently we are focusing on five districts in five states of India, and we will try to cover all the districts in those five states where we get this sort of time series data, and make everything open source and openly available for people to consume. We will set up contribution guidelines so that you can also participate in the process, look at funding patterns in your own district or a state of your interest, and eventually we would have more data-driven governance. That's the kind of future we envisage with this sort of effort.

I'm open for Q&A. You can access the code here; that's my email address; those are links to the slides, and those are our Twitter handles. I'd like to thank my team, both at Open Budgets India and at CivicDataLab, as well as CBGA India, the Centre for Budget and Governance Accountability, for helping us gather this information and for the groundwork they are doing to enable more data-driven decisions in public finance in India. The URL for the slides, I'll just read it out for you: it's tinyurl.com/odsc-india-cdl. And the email is gaurav at civicdatalab.in.
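One more sketch before the Q&A, since the closing notes mention evaluating exponential smoothing variants: a minimal triple exponential smoothing (Holt-Winters) anomaly baseline, fitting the model and flagging large residuals. The additive components, the weekly period, and the threshold k are assumptions, and this is an untested idea from the closing notes rather than part of the pipeline shown earlier.

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

def holt_winters_anomalies(series: pd.Series, period: int = 7,
                           k: float = 3.0) -> pd.Series:
    # Triple exponential smoothing: level + trend + seasonal components
    fit = ExponentialSmoothing(series, trend="add", seasonal="add",
                               seasonal_periods=period).fit()
    resid = series - fit.fittedvalues
    # Flag points whose one-step residual is far outside the typical spread
    return series[resid.abs() > k * resid.std()]
```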
Everything is open, and we'd like to hear more comments from you. There is one commit which is pending; I couldn't push it because of internet issues, but I'll push it by evening as soon as I get proper access to the internet.

It's a very good initiative; I appreciate you for taking this up. I've been building knowledge in the field of economics and social data science for quite some time, and whatever insight you derive out of this has to be taken to a lot of people, even illiterate people. Data storytelling plays a huge role here, because not many people can understand and interpret graphs and bar charts. So how are you going to take the storytelling to a lot of people? That plays a huge role; I just wanted to add that comment. And there's one more thing. I've not dealt with time series data in government spending, but I have dealt with time series data in finance, and I doubt moving average is the better algorithm for your situation, because there's not much noise here; there's definitely a trend, since every year there's not much change in how much gets allotted for everything. Also, you need long-term forecasting for the kind of time series analysis you're doing, especially with government, and ARIMA modeling, the MA and AR parts, has the disadvantage that it works well only for short-term forecasting. There's a lot of literature right now about using neural nets for time series analysis. I haven't tested it with government data, but I would suggest you explore that side. And I appreciate you a lot for taking up this initiative; a lot of people should be aware of it, and they should also be contributing to it. Yeah, thank you.

Awesome. On the first bit: we have set up a data fellow in each of the five districts we are really active in. The job of the data fellow is to take all these data points from the technical team and make them much easier to consume; for Odisha, to translate the same into Odia for people to consume. They work with the district administration on a regular basis so that this data is consumed within the government, and they also host public consultations with gram panchayats and gram sabhas in Odisha, to inform them that these money-related issues are happening in their district for this particular purpose. So there are people on the ground who are actually taking this data and acting on it. We started this project just six months ago, and so far we are seeing a lot of curiosity among people in government, but we are yet to witness a lot of public accountability in this ecosystem, because that would require certain tech and data literacy, like you mentioned, and not just participation from people on the ground, but participation from people like us who have access to such technology and tools. On the algorithm aspect: yeah, the moving average was just to set the base, and then we started using this algorithm we call S-H-ESD. It's actually meant for longer time series data, like tweets, which you keep getting for certain hashtags over time.
And this has been working phenomenally so far, but we would definitely like to explore neural networks, and see if we can do something with LSTMs to make a neural network learn the pattern of a district treasury. That's definitely in the pipeline, but it requires more talent and more resources. So, open for contributions; please help us. Yeah, thanks.

I wanted to ask you more about the tools you're using. You mentioned Superset and Jupyter. Is there a specific reason why you picked these? Actually, I have a couple more questions on that, so I hope you don't mind. Is there a capacity issue which you faced in the past? You said you have taken 50,000 data points; will it have an adverse effect if you have a bigger data set, or haven't you tried? That's something I wanted to know. And these insights you're showing: for people who have dealt with data and graphs, obviously they can make them out, but do these tools have a capacity to throw up some insights on their own? Is something like that available?

Yeah. First of all, we use these tools because they are open source. We are a hardcore open-source-believing organization; we just work with tools which are open source. Superset is one tool which doesn't require any coding to build a dashboard for your exploratory analysis. It gives you more than 47 chart and graph types to work with. It has worked properly on a billion-row Druid datastore, which is the data store Airbnb is currently using to track the logs of all the bookings happening at Airbnb. There are close to 35 other big organizations, like Uber, using Superset at the moment; that's why you see the deck.gl integration in Superset, a mapping library developed by Uber. It's very actively maintained. That's one reason. Jupyter is like the de facto tool for doing visual data science. Sometimes you need to see what you're doing: you need to visualize your weights, visualize each step of your algorithm, to make sure it's accountable in nature. It's the perfect tool for applying FAT/ML, fairness, accountability, and transparency in machine learning; we haven't come across anything else which helps you do the same thing at that pace. FAT/ML, FAT/AI, and FAT/DL are the three major terms for fairness, accountability, and transparency in the data science sector; you can visit fatml.org to see the guidelines for a fair algorithmic approach to data science.

On the insights part: some of it gets printed, and surprisingly everyone loves bar charts. Line charts and other more involved charts might be difficult for people to navigate, but once you give bar charts, obviously this one is pretty complex, but if you give something like this, where you see a figure for how much is being spent each month, it is very simple for people to consume. So we create bar charts and add a lot of narrative to them. Superset also lets you add a paragraph explaining what a graph is showing, and that too in multiple languages, as long as the script is UTF-8 compatible.
Odia is, so you can write a paragraph in Odia as well as Hindi as well as English explaining a particular chart, and you can take a printout of it and show them the URL as well. Surprisingly, governments are now investing a lot in data science and blockchain, and at the same time they are very keen to build their tech literacy. We just did a workshop with the Assam government on Superset, and 50 people in Assam who work in taxation, finance, and other fields are now using Superset for their in-house data exploration. And since it doesn't require any coding, you just sign up and start using it. You host it on your server and make it flexible for data to flow in. It also has a refresh option, so if you have near real-time or real-time data, you can set a refresh interval. It uses Celery and RabbitMQ for scheduling, and you can set the schedule down to milliseconds and so on, based on your need.

As for Kibana: Superset is Python based, so for me personally it's much easier to navigate. I face a lot of issues with Kibana because of the heap storage and other things; this is just a Python alternative that does things better. I'm not sure Kibana supports 50 plus data visualizations at the moment; this tool does, and it has a better UI and UX, because it comes from the design team of Airbnb, who are well renowned for making design much more user-friendly. You can see the different nuances if you are building a dashboard. When I try to teach Kibana to a non-techie, it's slightly complicated; I haven't received that positive feedback from them. But with Superset, so far we haven't faced many issues doing the knowledge transfer and other things.

Hi, I have two questions, but first, it's commendable that you picked up something that's socially very beneficial, so that's much appreciated. One narrow technical question: you talked about anomalies, but how do you define an anomaly? What is an anomaly versus what is incidental? Does the anomaly have to be specified up front?

In this case, anomalies, if you look at this example, are expenditures which were supposed to happen from that particular DDO office, the Drawing and Disbursing Officer's office, but didn't happen by that particular date: the money was still kept with the treasury and not utilized. For us, that's one sort of fiscal anomaly. Generally there is a percentage band; if you see here, the red lines show the lower bound and the upper bound. This is the permissible limit, how much money the treasury is allowed to hold, and as soon as it's above the upper bound, it's problematic. But this concept of fixed upper and lower bounds was not scalable. That's why we moved to the Twitter-based algorithm, which is much more nuanced in understanding seasonality patterns, because the upper and lower bounds may vary for a particular season, a particular week, a particular DDO office.

So you wouldn't set the anomaly in advance? No. The second question is very different; it's broad. I would have thought government departments already have a statistical department, or a sub-department or group, that would do a lot of this analysis. So how come we have these sorts of gaps? Unfortunately, we don't. At a national level it is done in some way, right? But there are gaps, is what you're saying.
We don't have any public access to such knowledge, as far as I'm aware. There are a lot of efforts being made by the Ministry of Information Technology and NITI Aayog, but the reports we get are the final outcomes; we don't know what algorithms they used, or how frequently. And there is the problem of provenance: if I had to do the same thing with the open data sets available, I'm not sure how to do it. So that's one major problem at the national level. And when it comes to the district level, I'm not sure we have a national monitoring system for all the 650 plus districts in India.

I think my question simply is: the UK, for example, has a huge department for statistics that produces not just the statistics but also the analysis. Yeah. We have a much weaker version of that in India at a national level. So my question simply is, at the state level we don't have anything like that, is it? Okay, all right. Okay, thank you very much. Even the UK's freedom of information law lets people ask whether such an effort is being carried out by the state. We tried filing RTIs on statistical analysis at the state level; we haven't received any response yet. Okay, so it's a struggle. Yeah, it's a struggle. All right.

Yeah, data.gov.in. So data.gov.in, first of all, is more of an aggregate data website, not a detailed data website, and what we work with is detailed data. And with the change of government, after 2013-14, we haven't witnessed uploads of budget data on the platform. We have tried to contact the data.gov.in chief data controller multiple times, but we haven't received any response. That's when we decided we needed a budget data platform, targeting just budgets in India. And as you see, there are only 479 catalogs, 479 data sets, there, which I'm sure is a very small number for a country like ours, because for budgets alone we have 10,700 data sets available. Unfortunately, no, it's not a community platform which lets you upload data; it's not an open platform in that sense.

As for where the analysis gets published: it's going to be published here, and you can see a couple of analyses which are already live. We try to do advocacy. As I mentioned, we have district data fellows who do the groundwork of taking these reports to the government, as well as to the gram sabhas and gram panchayats in that particular district; they do a lot of advocacy effort. And in terms of web presence, we make sure that whatever analysis we do is uploaded in an easy-to-consume format on the web. This is what you were asking about, the state budget comparator; we call it the story generator, and it's still in alpha. Another one is the union budget explorer. In the union budget you get 1,000 plus data sets every year; it's very hard for a human to comprehend all that and give feedback in the limited time we get for budget queries. So within a span of 10 days we launch something known as the Budget Explorer, which is a detailed outlook of where the national government is spending money. You can see the basic schemes, you can see the sectors, you can see the revenue; everything is publicly available for you to explore. And all of this information is open: you can download it, use it, query it. You can see how the core schemes are performing, and you can filter again.
If I'm interested in highways, I can just type "highways" and I get the highway-related schemes and so on. We want to make it accessible for people to use this data. There is a lot of data already out there, but it's in messy PDFs, which is a big struggle for people to deal with.

Hi, I have a question. Two questions, actually, sorry. Do you and your team do this full time? That's one question. And second, this is analysis of the spend; is there another kind of analysis you are also working on? For example, with the elections coming up next year, are you going to do something with analyzing how money is spent there, or detect anomalies, so to say?

Yeah, a lot of analysis is being done. You can access everything at the Centre for Budget and Governance Accountability website; this is the nonprofit organization we are working with. In the publications section you get all the chapters, journal articles, blogs, and policy briefs; everything is accessible. It's a much more traditional approach to public finance research. I'm not sure elections are our focus; our focus is more on fiscal transparency and governance and budget credibility. But we do analyze budget patterns along with changes of government, before and after elections, and so on.

And do you do this full time, you and your team? We do similar projects. A couple of members of my team do this full time; I did it full time for the last three and a half years, and now I do it part time along with other projects. We are trying to create a similar platform for the judiciary, known as the open judiciary platform, and similarly other things like an open education platform for Karnataka and so on. Our objective is to make more open data platforms for people to use. Nice.

It's being funded as a research project. The Gates Foundation and Omidyar Network, the philanthropic firm of eBay's founder, are the two major organizations funding this, along with others like IDRC, Tata Trusts, and NFI for specific aspects of the work. CBGA is a nonprofit; they can't monetize their products or books. All the books are free, all the research work is free to access, and on Open Budgets India everything is under a Creative Commons license and it's all open data. So there's no way we can monetize this.

I've got two questions. The first one: my initial impression was that the model building was done considering only the y variable, not the external factors. But the anomaly could be happening due to external factors, so how do you incorporate those; do you use algorithms like ARIMAX and so on? That's one. Secondly, once you detect those anomalies, be they internal or external, how do you influence the concerned bodies?

On the first part: this project started just six months ago, and we are still struggling to mine all that data into one place, so the effort on algorithmic research has been quite minimal so far. We definitely want to look into external factors. The problem is that we don't get external factors on a daily basis, the way we get this time series data, so we would need to do some aggregation, and as soon as you start aggregating the treasury data, you start losing a lot of information, which is a major trouble for us. We are still figuring out which other open data sets give us daily information. A few we have identified, specifically in education.
As you saw, education spends the most, and there we have daily data like the attendance of teachers, mid-day meal information per school on a daily basis, and so on. So we are trying to correlate: when very little money was spent on education in, say, the month of November, was there a substantial effect on mid-day meals in that particular month? Because as per the academic calendar, November is a working month: we don't have examinations in November, and generally we don't have festivals in November. So what was the reason the expenditure was so low in November compared to other months? Are there other data points? We are trying to investigate that part, and that's where our data fellows are really helping, being our eyes, ears, and hands on the ground, gathering the information for us which is not publicly available.

The second question was on influencing the concerned bodies. What we try to do is hold public consultations with different stakeholders. Alone, we can't influence them; we need different stakeholders in place. So we find ground-level activists, the MPs and MLAs responsible for the respective wards, and different government stakeholders. Sometimes the finance department is not willing to give money to a specific department, or sometimes the specific department is not raising a big enough demand for grants to the finance department. By bringing everyone to the same table, with representatives from civil society organizations as well as society at large, we try to do consultations. We hold these consultations once every three months in the districts we work in, and we make sure our data fellow stays in that particular place for the maximum of his or her time, so that they can bring back more objectives for us to work on. That has been the approach so far, but we are open to new ideas, so if you have something, please feel free to share.

We haven't explored IMD data, the India Meteorological Department data. I'm thinking it could have good potential for agriculture-related activities specifically. But something like November is not even a weather anomaly, generally. The Indian festival calendar, yes, we are already incorporating that: we try to map these anomalies against the festivals in that particular district, because each district celebrates differently. So far there is no obvious correlation in the last six months of work.

He was asking something; can we ask him how it was? Yeah. What I'm allowed to say is that the MP is very curious to have this tool in-house. They want to see something like this on a daily basis, not just for their own district but for all the other districts in that state, so they can check whether an anomaly was just in that particular district or across districts, which is the kind of comparison I'm yet to do: the time series analysis and anomaly detection for all the districts in Odisha, if they let us download the data without any 502 errors.

One quick question. So far, no. If you let them know the purpose of it, it's always OK, and they keep seeing the outcomes. So far, no. And most of the data I have used for this talk is publicly available online; it's not something we are investing special effort in collecting.

On wrongdoings and data sanity checks: for that, as the other person asked, we would require other data sets, like, for example, the mid-day meal information.
If the mid-day meal expenditure for November is very small, I need to see how many mid-day meals were served in November so that I can match the figures. We need other open data sets to gather the kind of concrete insight that something was a wrongdoing. It's a very complex structure in government; not every department is in sync with every other. We don't have a district data officer, per se, who takes care of all the data efforts in a district, and because of these issues we don't have a near real-time database existing at the back end. Certain schemes are tracked: mid-day meals is a national scheme, so it's tracked because of the national effort, and all districts are mandated to fill in the information. But again, that data is not exposed as an open data API for you to use, so you might need to figure out sources which are not easily available to do this sort of analysis.

Can we have a look? If you need to ask a question, please ask; I'll come to you. Let's do it in sequence. Instead of considering the overall expenditure, can we consider the per capita expenditure, so you can compare down the line? And can you also consider demographic factors? Some districts have not been developed since independence; if you compare western India with the eastern part of India, the government could decide how much to allocate in the upcoming years, and similar things could be considered for the eastern states. So could that kind of data be available?

For states, we do per capita expenditure. For districts, we don't have a very concrete estimate of the district population; the census gives an estimate, but we have found that it's not that accurate. We haven't covered inter-district comparisons as part of this talk, but whenever we do them, we compare per capita for the district, and we also try to calculate a gross domestic product for the district, so we report figures as a percentage of district GDP and per capita for that district. These are the two economic statistics we always use when we do inter-district or inter-geographical comparisons. Thanks.

Gaurav, this is truly inspiring. The question is, how can we contribute? To begin with, go through the slides and see which component excites you the most. Go through the website; I've shared a lot of links. And let us know what you want to work on: drop us an email. We have a Slack channel available for volunteers, and we are very active on Twitter, so you can connect with us via that platform. If you are super interested in the coding aspect, everything is available on GitHub; the technology organization is github.com/CivicDataLab. For Open Budgets India we have a specific GitHub as well for you to work on; it's linked at the bottom of the Open Budgets India website and goes to the CBGA India GitHub account. There are close to 13 repositories, and already close to 13 people have contributed to this project. We have two teams who work with us: one in-house, and the other the community team from DataKind Bangalore, the organization I started in 2014 with the purpose of doing more data science for social good in India; it's a volunteer community doing data science projects over the weekend. So these are the various ways you can collaborate. And if you have an interesting data set, or access to a data set we don't have, just drop us a mail: info at openbudgetsindia.org.
It's listed in the contact information. Or, if you have a very detailed query, you can fill in the form.