 I work with Civic Data Lab. Civic Data Lab's goal is to use data, tech, design, and social science to strengthen the course of civic engagement in India. Today, I'll be presenting on fantastic Indian open datasets and how to find them. The goal of today's talk is not to present you with some awesome data analysis on open datasets, but to introduce you to the broad variety of datasets that are available in the open, how you can find them, and how you can work with them. So what is open data? Open data is data that can be freely used, reused, and redistributed by anyone, subject only, at most, to the requirement to attribute and share alike. Where can you find open datasets? Now you know that there is open data available and you can work to solve some real issues with it, but how do you find it? The first source is government open data portals. But as we all know, it is difficult to find good data from government sources. The first issue is availability of data. If it is available, the next issue is accessibility: most of the time the data is locked behind captchas or other restrictions that make it difficult to access. Even if you get the data, you don't get it in a format that is useful for analysis. After all these hurdles, even if you manage to get some proper data, it is often not as granular as you would want it to be. To give an example: say I want to analyze crime statistics and get some data on FIRs filed. What I get from government sites is aggregated statistics on the number of FIRs filed for a particular geographic region. What I'm mostly interested in is their geographical coordinates and the different cases filed there, or more granular data along those lines, which I'm not able to find on government websites. So what do we do?
We could spend some more time, dig deeper, and find specific websites that give the specific data we're looking for. There are departmental websites: for example, the Indian Council of Social Science Research and the Law Commission of India publish a good amount of data that is useful to an extent. Then you can go through reports and data published by various NGOs. As we all know, there are NGOs working across different civic sectors, and they mostly publish their work in the open and make it accessible to all. One of them is ADR, the Association for Democratic Reforms. They publish data on electoral candidates: their activities, educational backgrounds, criminal backgrounds, wealth, et cetera. So it is a good resource for reliable data. Then you can explore the work of various public institutes and universities. They're happy to collaborate; even we at Civic Data Lab are collaborating with some of them, for example IIIT Hyderabad and IIT Delhi, to get datasets. They also provide mentoring on that data if you want to work on it. So you can go through their websites or contact them directly; they're really happy to help if you're genuinely interested in a problem. If none of that works out for you, there are open data communities dedicatedly working to get this data into the open and make it accessible and useful for the public. These are some of the communities that we have worked with. These communities are not just working in these sectors; they make their work accessible to the level that it is easy for you to even replicate what they are doing. So you can definitely collaborate with these communities. They're very open to working with others and easy to reach.
After talking about open data, I want to walk you through a case study we have worked on, to explore how you can use open data to solve a real problem. I'll hand over to Gaurav to take you through it. Thank you, Swati. So, as Swati was mentioning, there are a lot of issues when it comes to data in the public space in general. Let's dive deep into one example and see how we can make something meaningful out of it. One of the case studies we wanted to cover today was looking at open data and public transit in Bangalore. We know commuting is a big problem in the city: there are different buses running at different times, and it's just so difficult to track them. So, in 2015, Sajjad Anwar, a renowned geo mapper in the Bangalore community, mapped all the bus stops and routes of Bangalore and their timings and published his work on GitHub, where anyone can access it. He scraped this data from the BMTC website, and we thought we would extend it and look at other aspects like school accessibility, public space reachability, and so on. But unfortunately, when we started working on this data, we realized the website had been updated, and when we tried to scrape it, we ran into out-of-memory errors. Their servers are not ready for the kind of workload required to get this data. Even though BMTC signed an open data policy in 2016, we still haven't seen any open data published by them, either in collaboration with other organizations or individually, which becomes a major roadblock in analyzing this data on a timely basis and mapping out what the use cases could be. So we moved to the obvious option: let's not restrict ourselves to Bangalore, let's go to Delhi and see if we can gather data there. Interestingly, in 2018, Delhi launched a program known as Open Transit Data, an initiative run by the Delhi Transport Department along with IIIT-Delhi.
And this gives us static data of all the stops, routes, stop times, and trips. On an on-demand basis, they can also give you real-time data: you need to register and describe your use case to them, and once they approve it, you can get real-time data of all the bus vehicles in Delhi, updated on a daily basis. So this is a huge resource for anyone looking into things like multimodal transport, the reachability of buses in small pockets of the city, and so on. The way they give their static data is something known as GTFS, the General Transit Feed Specification. It is a specification developed to share public transit data across the globe, and it has been adopted globally by different communities and state governments. What we see in GTFS data is the variety of data available; that's the data model on the right. You have things like stops and stop times. If there are fare attributes associated with stops, that information is also available. You have trip information; through trips you can get detailed routes, the geospatial mapping of those routes, and how these routes are run by different agencies. So if there are multiple agencies — for example, we have BMTC here, and if you want to marry that with, say, Namma Metro data — you can see the different agencies' data in the same dataset. All those facilities are there, along with things like calendars recording when services are and are not running. But this is how it looks. This is the GTFS data, and it's really hard to make sense of it directly; you need some tools and techniques to process it. It is highly useful information, but its presentation might be a major challenge when you're getting started.
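To get a feel for the relational structure just described, here is a minimal sketch that joins stop_times.txt back to stops.txt using only the Python standard library. All the stop names, IDs, and coordinates below are made up for illustration; real files would come from the Open Transit Data static dump.

```python
import csv
import io
from collections import defaultdict

# Tiny, made-up samples of two GTFS files. Real dumps also include
# trips.txt, routes.txt, agency.txt, calendar.txt, and so on.
stops_txt = """stop_id,stop_name,stop_lat,stop_lon
S1,Stop A,28.6675,77.2273
S2,Stop B,28.6690,77.2310
"""
stop_times_txt = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
T1,06:00:00,06:01:00,S1,1
T1,06:10:00,06:11:00,S2,2
"""

# Index stops by stop_id.
stops = {row["stop_id"]: row for row in csv.DictReader(io.StringIO(stops_txt))}

# Join stop_times back to stops to recover each trip's ordered stop names.
trips = defaultdict(list)
for row in csv.DictReader(io.StringIO(stop_times_txt)):
    trips[row["trip_id"]].append(
        (int(row["stop_sequence"]), stops[row["stop_id"]]["stop_name"])
    )

for trip_id, seq in sorted(trips.items()):
    names = [name for _, name in sorted(seq)]
    print(trip_id, "->", " -> ".join(names))  # prints: T1 -> Stop A -> Stop B
```

The same join pattern extends to trips, routes, and fare attributes, which is what the heavier GTFS tooling automates.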
So once you download all these TXT files and start playing around with them, there are certain tools and techniques you can use to start visualizing the data and seeing what is being presented. A colleague and friend of ours, Nikhil VJ, who works with WRI, created a tool known as static GTFS manager, where you can load a particular route and visualize its bus stops and bus times. This still has a problem, though: you can't export all of it in something as simple as JSON, which you could input into any software and start playing around with. So we moved on and started using a Node.js library known as gtfs-to-geojson. It was developed quite recently, and once you input all the GTFS text files, you get GeoJSON as output. With GeoJSON as output, you get benefits like this: you can easily import the data into open source mapping tools like QGIS, an open source application for visualizing and analyzing geospatial data without needing to know the intricacies of coding geospatial information. What you see in green are the bus routes in Delhi, and the pink points are the bus stops. There are close to 570 bus routes being visualized, and around 3,000 bus stops in this particular diagram. Similarly, you can import this data and play around not just in desktop tools but in web tools as well, like Mapbox, a popular library for visualizing geospatial information. Here, again, there is much more detail: we have put together 3,213 bus stops along with 574 bus routes. You can zoom in and see how traffic and different buses move in South Delhi and North Delhi, do comparisons, and see what school reachability looks like. You can zoom into CP — Connaught Place in Delhi — and see what kinds of buses pass through it, and so on.
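As a rough illustration of why GeoJSON output is so convenient, here is a sketch that wraps parsed stop records in a GeoJSON FeatureCollection, which tools like QGIS and Mapbox can load directly. The stop records below are hypothetical stand-ins for parsed stops.txt rows.

```python
import json

# Hypothetical parsed stop records; in practice these come from stops.txt.
stops = [
    {"stop_id": "S1", "stop_name": "Stop A", "stop_lat": 28.6675, "stop_lon": 77.2273},
    {"stop_id": "S2", "stop_name": "Stop B", "stop_lat": 28.6690, "stop_lon": 77.2310},
]

def stops_to_geojson(stops):
    """Wrap each stop as a GeoJSON Point feature.

    Note: GeoJSON coordinates are [longitude, latitude], in that order.
    """
    return {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "geometry": {
                    "type": "Point",
                    "coordinates": [s["stop_lon"], s["stop_lat"]],
                },
                "properties": {"stop_id": s["stop_id"], "name": s["stop_name"]},
            }
            for s in stops
        ],
    }

geojson = stops_to_geojson(stops)
# Writing this with json.dump(...) to a .geojson file produces something
# QGIS or Mapbox can open as-is.
print(len(geojson["features"]), "features")
```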
So the opportunities just increase drastically when you move this data from desktop to web and draw further analysis on top of it. I'll stop here and give it back to Swati to talk about other data sources, like satellite imagery. Yeah, so satellite imagery is a recent development at Civic Data Lab; it's a new sector we have been exploring, specifically for disaster management. I just want to walk you through the journey we have been on to get started with the analysis of satellite imagery. There are various data sources available. Different countries are working in their own ways: they launch satellites and publish open data through various channels. NASA has the USGS EarthExplorer, where the Landsat satellite data lives; the latest satellite launched is Landsat 8, and they have also recently incorporated data from other satellites around the world. Then there is the European Space Agency, which has Earth Online, where they publish Sentinel data. And we have ISRO in India, which has a platform called Bhuvan. Those are some government agencies; there are also private players in the sector doing a great job of getting this data out. One of them is DigitalGlobe, which I'll talk about a bit later. Now I want to give a brief overview of the bands present in a satellite image. In total, there are 11 bands available from Landsat 8. The first band is deep blue. The next three bands are from the visible spectrum: red, green, and blue. Then there are near infrared, shortwave infrared 1, and shortwave infrared 2. Then there is the panchromatic band, which has the highest resolution among these bands. And there are thermal infrared 1 and thermal infrared 2. The reason I bring up this classification is that different bands are used for different use cases.
If you want to perform a specific analysis, you need to look at a particular band, and there are ways to separate out these bands and get a specific band of an image to analyze. About the use cases: the visible bands are mostly used for vegetation mapping. The infrared bands are used for flood mapping, and for vegetation mapping as well. Then there is shortwave infrared, which is used for distinguishing wetland from dry land, so you can use it for soil moisture indexing. These are some use cases we have been exploring; there is a whole lot more. USGS has its own product descriptions that you can study to understand what all the bands are and how they can be used. DigitalGlobe, as I was mentioning: they have their own WorldView satellites, whose data is proprietary, so that is not open. But they have started an Open Data Program specifically to publish open data on natural disasters occurring around the world. For each such natural disaster event, they publish imagery for people to analyze, to help speed up disaster response and disaster management. For India, they have published three datasets: the Kerala floods, the recent Cyclone Fani, and the monsoon of 2017. You can get this data from their website and start exploring it — for example, you could do an impact analysis of these disasters. They give you two types of images: pre-event and post-event. For some countries — not for India — they have vector images along with the raster images. The Indian datasets contain only raster images, but for some hurricanes around the world the vector images are also available, so you can do labeling and that sort of thing.
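A common example of this kind of band-specific analysis is the NDVI vegetation index, computed from Landsat 8's red (band 4) and near-infrared (band 5) bands. The sketch below uses tiny made-up reflectance values just to show the formula; in practice the pixel arrays would be read from the band GeoTIFFs with a tool like GDAL.

```python
# NDVI = (NIR - Red) / (NIR + Red); values near +1 suggest dense vegetation,
# values near 0 or below suggest bare soil, built-up areas, or water.
def ndvi(nir_px, red_px):
    return (nir_px - red_px) / (nir_px + red_px)

# Tiny made-up reflectance grids standing in for the band-4 (red) and
# band-5 (NIR) rasters.
red = [[0.10, 0.30], [0.25, 0.05]]
nir = [[0.50, 0.35], [0.30, 0.45]]

grid = [
    [round(ndvi(n, r), 3) for n, r in zip(nir_row, red_row)]
    for nir_row, red_row in zip(nir, red)
]
print(grid)  # → [[0.667, 0.077], [0.091, 0.8]]
```

The same per-pixel pattern applies to other band indices, such as water or soil moisture indices built on the shortwave infrared bands.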
DigitalGlobe also has a platform, discover.digitalglobe.com, on which you can view these satellite images live for whatever regions they cover. It is useful when you're just starting to figure out what data you need for your analysis: since satellite images are huge, it's difficult to pull all the data onto your system before exploring, so this platform helps you narrow down to the specific region you want to analyze. There are tools you can use. One of them is QGIS, which Gaurav already mentioned. Then there is GDAL, the Geospatial Data Abstraction Library. It is a very popular open source library, mostly used for exploring satellite images as well as manipulating them for pre-processing. Then there is Shapely, again an open source Python library, mostly used for exploring geo-coordinates and playing around with them. Then there is Rasterio, which is a wrapper around GDAL; if you are just getting started and not comfortable with the intricacies of GDAL, you can use Rasterio instead. So this is a suite of tools you can use to get started with analyzing satellite imagery. I also want to introduce you to SpaceNet. It is a repository of open satellite imagery, through which they also encourage people to analyze these images via a series of SpaceNet Challenges. Every challenge is focused on some sort of analysis; the recent one was about building detection. The winning solutions can be found in the GitHub repository, they publish their data in the open, and you can find research papers around these solutions as well. So that was satellite imagery. But there are various other sectors that we at Civic Data Lab work with, present around the world, where you can get open datasets.
Gaurav will now talk about some of those sectors; I'll hand over to him. Now that we have dug deep into two specific sectors, I'll give you a brief overview of other sectors where there's a lot of data available for you to play around with, try some models, and build some insights — while at the same time contributing in one way or another to societal good. One prominent source is budget documents. Some of you who know me know I've been working in public finance for the last four or five years, trying to see how we can mine much more of the country's fiscal data. In my previous talk, at ODSC 2018, we discussed how to use things like anomaly detection on the near real-time spending of different district treasuries. These district treasuries publish data in states like Maharashtra, Odisha, Jharkhand, and Himachal Pradesh — some of the progressive states in terms of fiscal transparency. You can access near real-time data from them directly, or through one of our initiatives, known as OpenBudgets India. You can try different analyses: see how much money is coming into your district, see how your district is performing compared to other districts of similar size and nature, and so on. Budget documents are also a rich source of multilingual data; you can develop parallel corpora from them. Budget speeches generally come in at least English and Hindi for the Hindi-speaking states, and for other states they come in the regional languages as well. So you can build NLP models on top of them and use them for assistive machine translation and much more. These are the kinds of experiments going on at Civic Data Lab as well, so if you're interested, we'd be happy to get in touch. Apart from that, another sector that produces almost real-time data, and which could be of huge interest to you, is the judiciary.
All our courts — the Supreme Court, High Courts, District Courts, and lower courts — give almost real-time information on what kinds of cases are being listed on a daily basis, which court establishments are hearing them, what the pendency status is in those establishments, and which case types have the maximum pendency. You can see things like how your district is performing on crimes against children, gender crimes, land rights issues, digital rights issues, and so on. The opportunity in such data is huge, and it's fairly easily accessible through the different court websites. Each of them has its own way of publishing the data, but I'm sure you'd be able to figure out a way to work with it. Beyond that, courts also produce a lot of judgments and orders, including in Indian languages, so that is of interest as well: you can do topic modeling on top of them and create different corpora, which could be useful. With that, we would like to end this session and leave some time for Q&A, to see if any of this interests you. We hope you'll be excited to get your hands dirty with much more open data in the Indian ecosystem. Thank you. Thank you so much — this was fantastic stuff. My name is Ranga, and I just had a question on the last point you made. You said that the courts publish their judgments too — they are public and available? Yes. For example, I've read articles, probably from the US, saying that typically when a case is going on, the lawyers or their apprentices go through all similar cases from the past to get an idea of where the judgment might land, or what the precedent is — a complete prior art search. Is something like that being done in India? Is it available, or is it an opportunity, maybe for aspiring data scientists? Yeah. So some of this data is already available.
There is an organization known as Indian Kanoon, which publishes some data from the High Courts. When it comes to the lower courts, very little work is happening at this moment, and that's where I see a huge opportunity for data scientists to collaborate and co-create datasets. We have been trying to tackle that kind of problem at Civic Data Lab. In the visualization you saw, we tried to capture three years of cases from the Pune court establishments — all the court establishments in Pune — looking into their orders and judgments to understand various nuances of how Pune as a district is evolving in terms of access to justice. Such efforts are possible now: the government websites are updated almost in real time when it comes to these cases. We have seen a few glitches here and there. Sometimes they stop publishing certain court orders, and you need to wait for the case to finish and read the judgment to get into the nitty-gritty of it. Those issues still exist in this space. Swati, do you want to add? So when it comes to court establishments, everything comes via a standard platform known as eCourts, so the metadata is pretty structured. Judgments and orders, as you know, depend on the judge or the bench, so there's a lot of subjectivity there. At the same time, it's free-flowing text, available as PDF; you can extract the text out of it and start processing it. But it's hard to say what the quality of the data is when it comes to orders and judgments. I can say there is an opportunity for people like us to work on it. Okay. Major district courts publish in English, while certain smaller district courts still publish in regional languages — though even there, the final judgment is often in English. Okay. Hi, thanks for the presentation. This is Ankit.
I just wanted to know if you came across any platform where we could get information on traffic lights: where they are, and what the duration of the lights is — 15 seconds, 20 seconds — how does that work? So if we go back a bit, we mentioned the data communities. OpenCity, one of our partner organizations, hosted under the Oorvani Foundation, publishes traffic light timings and traffic light locations in Bangalore — both timings and locations. Only for Bangalore, right? Yeah, at this moment they have data for Bangalore, and it's in PDF at this moment. PDF? Yeah. But you can write an algorithm to get over that. Sure, thank you. Please, one by one — I think that gentleman, please. Okay, so weather data for India, can we get that? IMD publishes a certain gist of weather data, and they also publish a few images which could be of interest. IMD? Yeah — the India Meteorological Department. And one more question: I had a problem statement on the impact of Swachh Bharat Abhiyan. Is there any way to get information about cleanliness and all that? It's very hard, because the Swachh Bharat rural mission and urban mission websites have stopped publishing the granular data they used to. With the granular data they were publishing earlier, there were some privacy-related issues, so with recent amendments they've stopped publishing that information. What you can look into is the outcome budgets for Swachh Bharat: certain states produce them in detail, and that is available on OpenBudgets India, the open data platform for public finance. And how about health data — can we see whether a particular Indian city's health is improving or declining? The National Health Mission does provide certain data, but I am not aware of a comprehensive data-intensive effort for public health per se.
There have been certain efforts by different states — Rajasthan, for example, has its own platform tracking pregnant women and child well-being — but they don't publish that data in the open at this moment. Thank you. Hello, good evening. With all these data sources, in most cases there is no specific problem statement or context attached for the public. So how do we proceed in those cases? Sure. If we go back: apart from step zero, steps one through four are also your opportunities to collaborate with people. I am part of some of these communities, and we organize weekend events where we bring in non-profits, researchers, and government agencies to discuss the problems that are most pressing for them, and how we can use the data available — from their own websites, from other government sources, or from nonprofit sources — to tackle them. So do a problem-scoping exercise, just as we would with clients: the same kind of brainstorming exercise with research groups, non-profits, and government agencies. In our limited experience, all of them are very open to collaboration. Okay, hello. Hi, I wanted to know whether you, or anybody else, have successfully tackled some problem statements using these kinds of datasets. Yeah. We have been working with a couple of government agencies in the public finance space to reduce fund flow gaps: how funds flow from the state treasury to district treasuries, from district treasuries to gram panchayats, and from gram panchayats to beneficiaries. There's a lot of money flowing, and at the same time gaps appear in that flow. So we are working with certain governments to track these fund flow and fund utilization gaps, and trying to create an automated algorithm for early detection of such gaps.
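That early-detection idea can be sketched, in toy form, as a z-score check over daily spend. This is only an illustration of the general technique, not the actual method used in that work, and the figures below are made up.

```python
from statistics import mean, stdev

# Made-up daily spend figures (in lakh INR) for one district treasury,
# with one obvious spike planted at 95.0.
daily_spend = [12.0, 14.5, 13.2, 11.8, 14.0, 95.0, 12.6, 13.9]

def zscore_anomalies(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

print(zscore_anomalies(daily_spend))  # → [95.0]
```

Real fund-flow monitoring would of course need robust statistics and seasonality handling, but the shape of the problem is the same: flag transactions that deviate sharply from the expected pattern.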
Similarly, we have been working with some non-profits on crimes against children, trying to track down where the maximum pendency lies in the whole timeline of a judicial proceeding, and how different court establishments perform within it. Ideally, when you work with partners on the ground, you have much more touch with these issues, you can experiment more, and you have an opportunity to scale solutions as well. Okay, hello. If a data scientist wants to volunteer on one of these activities and explore some of these problems, do you have any platform for that? All six of these are platforms you can reach out to; they all have opportunities for volunteers to collaborate. And apart from that, we are always open to collaboration as well — you can reach out to Civic Data Lab; those are our contact details. We have fellows and volunteers coming to work with us on some of these datasets and contributing along the way. Hello. Do you know any specific data source where I can find information about public entities like railroads, with their geolocation, in the United States? You had a slide where you went through the layers of satellite imagery — how can that be utilized with geolocation? Railroads, you mentioned, right? Railroads, water bodies, and such. Yeah. So OpenStreetMap, one of the communities I mentioned earlier, hosts a lot of this data openly. You can reach out to the OpenStreetMap India community, or go through the OpenStreetMap website to capture some of this data. They also have a tool known as Overpass Turbo, which helps you extract all of this data from OpenStreetMap as GeoJSON, shapefiles, and other popular geospatial data formats. Thank you. Excuse me, just wait for the mic. Hi. You mentioned that most of the data from the health sector is not as easily available as in other sectors.
So if I want to collect some data from hospitals, what steps should I follow? I think partnering with organizations working with local hospitals could be one way to go. And how can we work with hospitals? Maybe contact some of the communities I mentioned and look at what kinds of public health projects are going on in them. I'm aware that when I was volunteering with DataKind, we were working with an organization known as the Antara Foundation to look into public health in Rajasthan, focusing on certain districts and creating early warning indicators for Anganwadi workers, auxiliary nurse midwives, and ASHA workers on resource-related issues, high-risk pregnancies, and so on. So you can contact communities like DataKind and DataMeet, which keep running these sprints with non-profits who work closely with rural health or with hospitals. Thank you. Hi, I briefly worked with DISE data to gather information on schools in India, in one particular area, Ahmedabad. The quality of the data is horrendous — both the data entry and the data the providers supply are very bad. True. So how can you be confident that any analysis you do on that data is accurate? That is a real problem. True. I think that's where we need more data critique as well. Whatever analysis you have done, if you have clear evidence that the ground reality differs from your analysis, publishing that as a case study is also helpful, so that we can then go to DISE and say we need a rectification in this area. Organizations like the Akshara Foundation in Karnataka have been working on this front: they have been course-correcting the mapping of schools and some of the information collected on schools, with their teams of volunteers going into all the districts of Karnataka at the moment.
So with partnerships and much more data critique, I think data quality issues can be fixed to some extent. Yeah, my question was around the licensing model. Can any of this be used for commercial purposes? What I understand so far is that most of these projects are run out of universities or nonprofits or the like. So can commercial systems build on them? It depends from source to source. DigitalGlobe, which Swati mentioned, publishes under a non-commercial license: all the satellite imagery of those disaster locations is non-commercial, so you can't use it for any commercial purpose. Initiatives like OpenBudgets India, on the other hand, publish under a Creative Commons CC-BY license, where you can reuse the information with just an attribution to OpenBudgets India and the original creator of the dataset — and you can use it for commercial purposes. So that is possible; you need to check the source and the license. Most of these sources mention on their websites how you can use the data, how to attribute it, and the legalities involved. There's generally a license page on all these data platforms. Gaurav, Swati, amazing work. I just have two questions. One: for the work you are doing, the use cases, do you get it published? And two: when you're working with open datasets or with organizations — you especially mentioned Akshara — do you sign an NDA with them while using their data? And how does it work when you have to publish? Yeah, that was our recommendation to the Akshara Foundation, that we should sign an NDA. Most of the time, non-profits are cognizant of the fact that some of this information may be very crucial personal information, and we need to keep it under certain legal safeguards. In terms of publishing, for most of our work we try to write blogs, and those are available on our website; you can have a look.
So Swati recently published a blog, and a couple of other colleagues have also published some recently. That's one way. Another way is through talks: we try to disseminate some of our learnings via talks, so you can look into our talks as well, which are available to everyone. A third way we disseminate our work is code. Most of our code is open source, so you can access it and derive your own insights; we haven't restricted you to just the insights we have generated. Those are the three ways we have been disseminating our work. The fourth angle, which we are still trying to figure out, is how to partner with more communities, more volunteers, and more collaborations in general — not just with the data community but with other communities like designers and storytellers. If you have any suggestions, we are more than happy to take some. Thanks. Hi, this is very good information. One point I wanted to raise: these are organizations that have been working on their use cases and going through government data, doing cleanup and organization. For the data you have taken from public sources, has any effort been made to make it available in a proper format so that it is accessible — so that everybody does not have to do the same work of going back to the original dataset and understanding it? Is any work happening in that direction? Yeah. That's what most of these communities do, and most of the time we partner with them. With OpenBudgets India, we are the tech partner, and it is housed under the Centre for Budget and Governance Accountability. Whatever cleanup we do on the PDFs from different public finance websites, we publish our code as well as the pre- and post-processed data in the open domain.
So you have an opportunity to replicate it and work on it. Similarly, I'm aware OpenCity does the same thing: whenever they convert a PDF to CSV, they publish the code as well as the pre- and post-processed data in the open. DataMeet is also a group where a lot of volunteers do the same exercise, and we all link our code via GitHub and other sources. So speaking to some of them and being part of their conversation channels might be helpful for you. Thank you. You had a question? Yeah, thank you. Hi, I work for an organization that deals with a lot of different open datasets, and one of the problems we frequently run into is trying to stitch them together — for example, DISE and different census years, sometimes with other data sources. Even within the census years, and relating to DISE, all of the village codes have changed, with transliterations of names, districts changing, and villages changing. It's really hard to stitch them together and get one accurate dataset, especially to run a machine learning algorithm on it. I was curious whether you knew of people who had done work even on the intermediate steps — getting accurate lists of districts that have changed, or of the different ways districts have been spelled — so you're not redoing the fuzzy merging, or other ways to merge those datasets. Yeah, I think this remains one of the most critical and unsolved problems in this sector. Unfortunately, I don't have a straightforward answer, but what we have done is this: when we see a major change in a source, we don't shy away from writing to them that we need a change log. Earlier we were relying on one coding; now you have completely changed the coding, or completely changed the definitions — if you can just provide a change log, that would be super helpful. And most of the time people do respond. It's just about getting the right message to the right person.
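For the fuzzy merging itself, here is a minimal reconciliation sketch using the standard library's difflib. The district spellings below are illustrative variants, not taken from any real crosswalk.

```python
from difflib import SequenceMatcher

# Illustrative spelling variants of the same districts in two sources.
canonical = ["Bengaluru Urban", "Ahmedabad", "Khordha"]
variants = ["Bangalore Urban", "Ahmadabad", "Khurda"]

def best_match(name, candidates):
    """Pick the candidate with the highest character-level similarity ratio."""
    return max(
        candidates,
        key=lambda c: SequenceMatcher(None, name.lower(), c.lower()).ratio(),
    )

mapping = {v: best_match(v, canonical) for v in variants}
print(mapping)
```

In practice you would also want a minimum-similarity threshold and a manually reviewed crosswalk for low-confidence matches, since renamed and split districts can defeat pure string similarity.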
So doing more data-driven advocacy, especially if you're working with a lot of data, has helped us as an organization, and that's what we encourage our partners to do as well. That's one way to go about it, because it's very hard to keep rewriting algorithms again and again in this ecosystem. Okay, so thanks for the amazing talk, Gaurav and Swati. We are up against time, so we'll stop here; the rest of the questions you can take offline. Okay. Thank you, everyone.