To quickly introduce myself: my name is Udit Poddar, and I work for a data intelligence company called SocialCops. What we do at SocialCops is build a platform, deployed by organizations and governments, that helps them make sense of their internal data and tag it with over 600 external, publicly available data sources. Our platform also helps these organizations collect primary data, transform and curate all of it, and visualize it for the key decision makers. Going back a year, I was actually living in Bangalore, working for a business analytics firm here, but I was constantly wondering how I could use my data skills to drive decisions in sectors like health, education, agriculture and rural infrastructure. That's when I came across SocialCops, and during one of my interviews with the founder I talked about how I wanted to work on decisions in the agriculture sector. Within one month of joining SocialCops, I had a project where I helped one of the largest philanthropic organizations in the world target $8 million of investments in agriculture in India. Now how did I do that? I had no prior knowledge of the agriculture sector, like most of us here. So how did I manage it? Before I go into the solution, let me first talk about some interesting problems that organizations and governments are trying to solve in the agriculture sector. This is a district in Uttar Pradesh called Faizabad. It has a high production of rice and wheat, which are the basic staple food crops in India, but it also has an extremely high maternal mortality rate and infant mortality rate. Imagine a girl, say Pramila, living in this district. Her family is a rural, agricultural family that produces more than enough wheat and rice to sustain itself, but still she's undernourished.
She is much like the 194.6 million people in India who are undernourished. India is actually home to the largest undernourished population in the world. So what is wrong? We are producing food grains in surplus, but something is clearly wrong in our rural sector if the people there are undernourished. If you see this graph, the production of rice and wheat, denoted by the black and blue lines, has constantly increased since independence. But the production of pulses has practically remained stagnant since independence. So India is facing something called the pulse problem, which means most people in the rural sector don't have access to pulses. Pulses are very high in nutritional value. We Indians in urban areas typically eat dal in some form at least once every day, right? But our rural population doesn't get to eat that. So how would a government organization or a philanthropic organization solve this pulse problem? They would typically invest in areas that increase the productivity of pulses. Irrigation: availability of irrigation increases the productivity of pulses. High-yielding, disease-resistant seeds: pulses are disease-prone crops, so availability of such seeds can also help increase production. Crop rotation, which we have all studied in our 8th standard social science textbooks: crop rotation literacy plays a very important role in pulse productivity, because farmers then know that they can grow pulses between the rice and maize seasons. And accessibility to markets: anything that a farmer grows, he or she would want to sell in a market.
So these are the typical areas that any organization would want to take decisions in if it wants to solve the pulse problem in India. Now, these organizations generally take a lot of time to figure out the granular geographies that they should invest in. Most of the time, only very macro-level research is done. Organizations will narrow down on a couple of states, saying that they will solve the pulse problem in these states, but they won't know where in these states to invest unless they conduct a primary survey and spend a lot of time on secondary research. We, the data community, can solve these problems for these organizations much faster by using the public data, the open data, that is available online. Before I go into the public data available for the agriculture sector, let me give you a picture of how complex a country India is. We have 1.26 billion people living in 36 states and union territories. These states are divided into 684 districts, and the number of districts keeps increasing with time: back in 2011 there were 640 districts in India. This has direct implications for the data that we work with. These districts are divided into 6,000-plus sub-districts, which contain over 6,40,000 villages. So if an organization takes a decision only at the state level, it is practically impossible to make those investments reach the people who need them the most. So how can the data community solve this problem? India is actually very rich in public data. The Indian government releases a lot of data that is publicly available online. The only problem with this data is that it is not very accessible.
That is why I would call it public data but not open data: we cannot easily use it. At SocialCops, we have built systems and technologies that can scalably acquire this data and clean it into a format that the data community, or data scientists, can use to build models. Before I go into the challenges and the solutions, let me take you through a couple of data sets that are available in the agriculture sector. There's something called the Agriculture Census. This is one of the most comprehensive data sets available in the agriculture sector. If any of you want to explore the agriculture sector, this is probably the best data set to look into to understand the agriculture scenario in India. It covers the different kinds of farm holdings in India, the different crops grown on those holdings, and the different kinds of irrigation facilities available on them. The data is disseminated at the sub-district level. Again, there are more than 6,000 sub-districts in India, so this entire data set is disseminated in more than 2 lakh Excel files. Now how do we clean that? How do we curate that? That's the big question that we've answered at SocialCops. The next data set I would like to talk about, which is very important for decision making in agriculture, is the Input Survey. It covers all the inputs that go into the production of crops. Any decision maker who wants to increase the productivity or production of, say, pulses would typically invest in these inputs. This data set tells us, at the district level, what kinds of inputs are available to farmers. We've used such data sets to arrive at decisions in less than 3 months, which would otherwise have taken more than a year.
Now, the biggest problem with working with public data is that it comes in PDF files and in a lot of Excel files. So how do we deal with that? At SocialCops, we've built internal PDF parsing tools that use image recognition to parse PDFs like these. They detect tables within reports and parse all those tables scalably. We're also planning to open this tool up for the data community to use, so that we can augment decision making in the socio-economic sector. Another challenge we face in using all this data is how to stitch the data sets together to form a master data set for any sort of analysis. Stitching data sets is a problem because the only unique identifiers here are the district names, the geography names, and these keep changing with time. In 2011 there were 640 districts in India, in 2015 we have 684, and in 2016 the number is expected to increase to up to 715. The names of these districts might also change. Now how do we stitch these data sets together? We've built something called the entity recognition system, which learns these district names while we're working on them. If any data set comes out tomorrow with errors in the district names, or with different district names, the system will assign each name a unique identifier so that we can identify which district it is. We use phonetic matching and fuzzy matching to match district names in this entity recognition system. This is how we've solved the problem of data curation and data stitching for all these public data sets that are available online.
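To make the district-matching idea concrete, here is a minimal sketch of combining phonetic and fuzzy matching to map a messy district spelling to a stable identifier. It is not the SocialCops system, whose internals the talk does not describe. The master list, the identifiers (`D001` and so on), the 0.8 threshold and the 0.1 phonetic bonus are all hypothetical; classic Soundex stands in for "phonetic matching" and Python's `difflib.SequenceMatcher` stands in for "fuzzy matching".

```python
from difflib import SequenceMatcher

# Hypothetical master list: canonical district name -> stable identifier.
MASTER = {"Faizabad": "D001", "Allahabad": "D002", "Varanasi": "D003"}

# Spellings resolved so far; a stand-in for the "learning" described in the talk.
LEARNED_ALIASES = {}

# Letter -> digit table used by classic Soundex.
_SOUNDEX_DIGITS = {}
for _letters, _digit in [("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")]:
    for _letter in _letters:
        _SOUNDEX_DIGITS[_letter] = _digit


def soundex(name):
    """Return the four-character Soundex key, a simple phonetic code."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    prev = _SOUNDEX_DIGITS.get(letters[0], "")
    for c in letters[1:]:
        digit = _SOUNDEX_DIGITS.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def match_district(raw, threshold=0.8):
    """Map a raw district spelling to an identifier, or None if no good match."""
    cleaned = " ".join(raw.split()).title()  # normalise whitespace and case
    if cleaned in MASTER:
        return MASTER[cleaned]
    if cleaned in LEARNED_ALIASES:
        return LEARNED_ALIASES[cleaned]
    best_id, best_score = None, 0.0
    for name, district_id in MASTER.items():
        # Fuzzy part: character-level similarity between 0 and 1.
        score = SequenceMatcher(None, cleaned.lower(), name.lower()).ratio()
        # Phonetic part: small bonus when the names sound alike (assumed weight).
        if soundex(cleaned) == soundex(name):
            score += 0.1
        if score > best_score:
            best_id, best_score = district_id, score
    if best_score < threshold:
        return None  # leave for manual review rather than guess
    LEARNED_ALIASES[cleaned] = best_id  # remember this spelling for next time
    return best_id
```

With this sketch, `match_district("Fezabad")` resolves to the identifier for Faizabad and records the spelling as a learned alias, while a name with no close counterpart in the master list, such as `"Mumbai"` here, returns `None`. Returning `None` below the threshold is a deliberate choice: a wrong join silently corrupts the master data set, whereas an unmatched name can be reviewed by hand.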
Using all these public data sets, I did a project last year where I collated 31 different data sets, crunched 2,000-plus variables from them, and came up with 209 impact indicators that could help a philanthropic organization drive decisions in the agriculture sector in India. I would like to take you through the tool that we finally built, the dashboard that we built using this data. This is the dashboard for the agriculture sector. The decision was to take place in three states, so we have data for three states here: Uttar Pradesh, Bihar and Orissa. This is a typical data product that we create for effective decision making. The user can go inside the Uttar Pradesh map. I guess the internet is slow, so let me just take you through snapshots of the dashboard. This is the Uttar Pradesh map, and we have a query tool on this data product, where a user can enter queries on multiple indicators and arrive at a single conclusion. Say, for example, an organization wants to understand which areas have irrigation facilities but don't have pulse production. These areas would be the low-hanging fruit for implementing a pulse program or increasing pulse production. So that's what we've done here: percentage of area receiving irrigation. We can query on that in this tool and also add other queries, like where the productivity of, say, urad dal is low. We get the one particular district where the organization can go ahead and run a pilot pulse program to increase awareness of pulses and improve nutrition. And this, guys, is ultimately the motivating factor, the factor that keeps us going in this exercise of using public data. Thanks. Open for questions here.

The end users for this dashboard will be farmers?

I'm sorry?
The end users for the dashboard or the report you are building: will they mostly be people living in remote areas, or farmers?

No. The users of these dashboards will be people who can make investment decisions in these areas. The dashboard gives us information at a district level, right? So the users are governments, philanthropic organizations and NGOs working in the agriculture sector. They will typically use these kinds of dashboards to decide the geographies where they should invest.

Okay. In my case, I also have farmland, but I don't think anyone is there to suggest what to cultivate. So will you or someone go to each and every village? Because when you talk about investors, mostly they are farmers. They have their own plots and they cultivate what they want. How will you educate them?

The education program is then implemented by partners, by clients that we are working with. They typically increase awareness of, say, crop literacy or things like that in those areas.

Excuse me, is the data set available publicly? Is it available over the net?

The data set is available. All this data has been taken from publicly available data.

And the refined one?

The refined one is not available publicly.

Hello. I have two questions. It's a very interesting initiative. My first question is, who are your customers? You've built this interactive map which allows us to do things. Are you expecting self-governing bodies? Are you expecting agricultural research institutes? Who do you think will buy this? Because somewhere you want to make money. My second question is, tell us something about how you reduced it from 2,000 variables to 209. How did you reduce it, keeping what in mind?
Typically, any authority or organization that can take investment decisions, or any decisions that will affect the agriculture sector in an area, can use these data products for targeting their investments. Policy-making bodies, any sort of organization that has the authority to make investment decisions? Yes.

And the second question was, how did we come down from 2,000 variables to 209 variables? This tool was built to increase nutrition through the agriculture sector in these three states. While we are scraping and cleaning data, we do it scalably, so that we have all the data sets available to us. But then we come to the decision-making aspect of the process, where we narrow down to indicators. We run some models to understand which indicators are actually doing well.

I have a question on the comprehensiveness of the tool. Apart from irrigation or other agriculture-related problems, are we solving other problems associated with agriculture, like storage and transportation, as well?

Yes. We've also tried to solve problems related to accessibility to markets. We have a lot of data available. Storage we haven't looked into as yet, but we do have the data for storage, so we can take up any problem that pops up in storage. We're collecting data on all aspects of all socio-economic sectors.

Hi. As part of the dashboard, you pointed out data for three particular states. Is the data available for literally everything?

Yes.

Did you have any issues in having to translate some of the data? Because it could be different if it's published by the state government, right?
It could be in a different language.

Yes. We also face problems where data comes in local languages like Marathi or Bengali. We typically try to solve those issues through translation. But in this particular case, we focused mostly on national data sets, along with some state-level data sets, so we didn't face any translation problems. But yes, this is one of the issues that arise while working with public data sets.

The data is public data, which means it was collected by a government body, right? So there's always a question about the reliability of the way the data is collected. How do you tackle this? Do you make any adjustments in your models?

All this data on socio-economic factors is collected by a lot of different bodies. Say, for example, agriculture production data might be collected by the Ministry of Agriculture in the central government, and a lot of state authorities might also be collecting it. So we triangulate among different data sources to understand whether the data is reliable. We also do a lot of research: there is a lot of published research on the reliability of different data sets. We typically follow that before we come up with a product and use a source for any sort of data.

So what's your revenue model? Is it something you're doing just for society?

Sorry?

What's your revenue model?

We have an entire stack of products which help clients transform data, visualize data and collect data. Our platform is deployed by different clients, different organizations, and that gives us a continuous stream of revenue. That's how we are. Exactly.
A lot of corporates who want to do CSR activities can also use this socio-economic data for specific targets. Okay, guys. Thank you so much.

Thank you. Please take all your questions to him offstage.