This is Omkar Moissila and we are from Fields of Youth. We are here to talk about structuring data from surveys, and what we learned from a case study of a slum survey that was conducted in 2010. All of us here in this 700-people group are big on data; that's an established fact. We're a group of people who love analyzing data, visualizing it and showing what it means in a given context. So we thought it would be interesting to see what data looks like before it actually comes to this form. We'd like to present the journey of data up to the point where it can be analyzed. This is the structure we'll be following in today's presentation. I think it should take about 25 minutes, and we'll take questions after that.
Some context, I think, is very important. Most of us here live in cities, and urbanization has been really, really rapid. In the last 10 years we've seen an enormous increase in traffic, an increase in prices of both land and property, and increasing distances; there's been a lot of flux in cities. But along with this, there's also been increasing poverty. Officially, the number of slums in Bangalore is said to be 770, but a lot of NGOs, analysts, and even the SBB and the Commissioner think it is over 1,000. The reason there is such disparity in the number itself is that slums are very contentious in nature; their contribution to the city and the city's contribution to them is a big fight. There's been a constant argument about this, and it's a big topic in lots of parts of the world. So as researchers, we were interested in what the city's contribution is to these settlements, where about 60% of our people live, and what their contribution is to the city.
The smaller chunk of that question we wanted to look at was migration. I think one of the talks said about 200 to 300 billion migrations happen all the time. What is the relationship between migration, the livelihoods of people, and the mobility that is available to them?
With any such question, you first go deep into what already exists. How many people have studied this problem? What are their findings? Are we reinventing the wheel? That's the first step any of us would take. So these are the surveys that have already been done. You can go through them; I will not jump into the details of each of these surveys, and we can discuss them after this talk. Why we didn't use them is the most important point, right?
The existing formats are, a lot of the time, non-digital. At the Karnataka Slum Clearance Board especially, the data was in stacks of books. They were piled high, they collected dust, and we kept telling them to get it digitized. As researchers, we're greedy: any piece of information about the question you're trying to solve, you really want to have on hand. So we digitized this data, and we found that it had two data points, 1994 and 2008, and in a lot of places both of them had the same number. For example, the number of Christian children studying in school in 1994 was exactly the same in 2008. This happened for multiple variables, and it just didn't add up for us. So the integrity of the data was not really great.
Then there are the aggregation levels. A lot of this data was collected at the slum level, where people spoke to the slum leader, or, as was pointed out in the morning, it's at the state level or the district level. So if you want really disaggregated data, it's missing. And the census data, which is quite disaggregated, was not available to us at that point in 2009.
And then, availability.
A lot of the time, the exact data point that you're looking for is not available. Data on transportation availability in slums is not available; at least, we did not find anybody who had extensive data on it at that point in time. And some of the household data, as I said, the census has. But that had data points from 2000, and we wanted something more recent, because a lot of change happened between 2000 and 2009.
So I would like to give an overview of the survey now. 1114 households were surveyed in 30,000 slums across Bangalore. The smallest slum had 50 households and the largest had 500, so we tried to cover both small and large slums. This is how we stratified the slums. By religion: we wanted slums with different religious majorities, such as Muslim-majority and Christian-majority slums. By language: we wanted Kannada-speaking slums as well as slums where other languages are in the majority. By location: core and periphery slums, that is, slums in the centre of Bangalore and slums that are cropping up at the edges of the city, across all four directions, north, west, east and south, and slums in planned localities such as Jayanagar or Corporation, right in the centre of Bangalore. And by size, as I said: really small slums and really large slums. Every tenth household was surveyed, so about 10% of each slum, we would say.
I would also like to mention that one group of slums we did not survey was the really newly formed slums that are cropping up here and there. They are called the Jellabadi based slums; it means renting land. There are a lot of private developers who give open land for small amounts of rent, where people come and set up their shacks. A lot of new migrants are very, very skeptical of talking to you, because they think you're going to chase them away.
And the developers, if they find out people are talking to you, are scared that you're from the government and that you're going to declare the land. So they're really skeptical. These people survive under a cloak of invisibility because they're really vulnerable. So that bias does exist: we haven't surveyed really new slums, and the slums we did survey are fairly well established.
These are the themes of questions that we surveyed. We had demographics: the basics, name, education level, marital status. Employment: previous employment, current employment, employment benefits, self-employment. Migration: where they came from. Water was extensively covered: where they got the water from, how much water, how much they paid for it, and so forth. Household data: what material the house is made of, what services they access, and a lot of other details of the household itself. Loans: where they got their loans from, how much interest they paid, what the time period of the loan was. The assets they have. Aspirations, problems and help, which was mostly qualitative data. And transport was also extensively covered.
So we come to one of the most interesting parts, or the most useful for us to talk about here, which is the on-ground data collection. If you want to go into a slum, you need to know somebody in the slum before you can go in and talk to people, because otherwise they will not talk to you. So we required partners who were familiar with these slums and the people in them. Despite having partners who knew the slums extensively, we had to turn back from one of the slums, because the slum leaders would not let us talk to people. There are such circumstances, so having these networks becomes really important.
There were eight women surveyors, and the thing we really like to talk about is that all of these women were from these slums. So they knew the context well, because they lived it. If I had gone there, I might have missed a lot. The comfort level a person has in talking to you when you are one of them is different from talking to someone from outside. That was the most interesting thing for us in this survey. Most respondents were women. Since these were all women surveyors, they surveyed during the week, and they couldn't get men to talk to, because the men had gone to work; we couldn't even get families where both members work, because there would be no one at home. So 80% of the respondents were women, and the comfort level was really established because the surveyors were also women.
There was an extensive training period, given that all of them were new to this. We had a questionnaire, which was adapted from something that was done by the world life, and we had to replace a lot of questions and answers to suit. They were trained in three to four sessions to navigate the questionnaire: how the questions were to be asked, and what to write down for each question. After that, a pilot survey of 200,000 was done; they went out to a couple of slums and asked these questions. We went through the answers that came in, and some of them didn't seem to fit. Some questions were too personal, some couldn't be navigated properly, so we had to change the questions.
But the biggest learning we had in this process was that data entry is an absolute must right after the pilot, given the cost of missing something in this process. What happens is, sometimes you have a question that you codify, right? You say one is yes, two is no, three is maybe, four is not good. If most of the people give you an answer that is not in these codes, what's the point in having a code?
We have extensive free text in this survey. Data formats also change a lot once you convert the data into digital form. So it becomes very important to digitize as soon as you get the pilot out.
Somebody raised a question: why didn't you use technology, why not have them enter the answers on a mobile phone? The scope for human error is large when using technology. It works really well when you have a few parameters to enter. For example, in general health you have weight, height and a couple of other measurements. For a few parameters that you want to update constantly, it would be really useful to have a mobile phone on which someone can enter them. But when you have too many questions, a lot of which are text and lengthy, I don't think a mobile phone is really useful. Which is why we chose the traditional route of using a form.
This is the longest part of any survey. We're greedy, right? We spent a lot of time collecting data, so you want to be able to use every piece of what's available to you; you want to be able to verify, you want to be able to justify.
We are in a country where there are 18 different languages spoken, and a lot more around the country. English is not one of the major ones; especially if you're talking about slums, you're not going to find a lot of English speakers. So you want to have someone who knows the local language really well. Initially, the survey forms were written in English, because the person working on the research was very comfortable with English. That got translated to Kannada. The survey was conducted in Kannada, and that got translated back into English. So you have three points where a lot of detail can get lost.
For example, and this is something I've also said, there was one person on the research team who knew Kannada, was on the field, and knew the research background really well, right? He was the only person who was doing the translation and who got to pick which questions to put in. In this process, because there were not a lot of people who knew Kannada as well, a lot of questions got discarded. For example, we did not capture disability data. So you do miss a lot of these things in translation. Our favorite example is time for travel. The question essentially was the time taken to travel. Translated to Kannada, it said "time for travel", which means most people answered with a time in the morning. Once we saw the answers, we really wished we had captured that data too: what time you leave for work, which you could feed into a traffic simulation. You realize that there are very small nuances that you miss during translation.
Again, we had a team of about 2 to 13 people digitizing this data, and they were also trained, because we wanted to be able to go back to them and say that a mistake had happened. So there was a one-to-one connection. As I said, we were very greedy about data, so we did not outsource this work, because we wanted to be able to do that. We trained them so that they understood why certain separators are required in certain places, and the difference between the missing states. Take not applicable: for example, asking someone who is not in school what class they are attending is a not-applicable question. Another state is not answered: somebody chooses not to answer, so it's not answered. And sometimes the answer is simply not there; that is blank. So we wanted them to be an integral part of digitization. So, yeah.
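The three missing-value states described here (not applicable, not answered, blank) could be kept apart during digitization roughly as follows. This is only a sketch under assumptions: the marker strings, the `classify_answer` function and the `question_applies` flag are hypothetical names chosen for illustration, not the codes the team actually used.

```python
from enum import Enum


class Missing(Enum):
    """Three different reasons a cell can have no usable value."""
    NOT_APPLICABLE = "not applicable"  # the question does not apply to this person
    NOT_ANSWERED = "not answered"      # the respondent chose not to answer
    BLANK = "blank"                    # nothing was written on the form


# Hypothetical markers a surveyor might write down for a refusal.
REFUSAL_MARKERS = {"refused", "no answer", "not answered"}


def classify_answer(raw, question_applies=True):
    """Return a Missing state, or the cleaned answer text if there is one."""
    if not question_applies:
        return Missing.NOT_APPLICABLE
    if raw is None or raw.strip() == "":
        return Missing.BLANK
    text = raw.strip()
    if text.lower() in REFUSAL_MARKERS:
        return Missing.NOT_ANSWERED
    return text
```

Keeping the three states as separate values, rather than one empty cell, means a later analysis can still tell a refusal from a question that was never relevant.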
You can't set data formats in stone when you begin. You can't say, I want people to report water consumption in one unit. A lot of these people have collected water in pots all their lives. So if you go and ask them how much water they consume per month, they're not going to tell you 50,000, 5,000 or 40,000. They're going to tell you what they collect in a day, and from that you get what they use in a month. So that conversion is something that is required. And how far do you walk? A lot of us would probably say, I walk 100 metres to collect water, but a lot of people would put it differently. So when you're deciding formats, you need to keep metric perception in mind: how we perceive different metrics is different from how the respondents perceive them. Metric perception, I think, is very important when you're deciding data formats.
The greedy researcher is back here. You want to check and verify; we have scores and scores of Excel sheets recording problems found in different parts of the survey. This person has entered 5 years instead of 5 months: do I keep the data? This person has said 5 kgs instead of 5 grams. The age of the baby is 7 months: what do I do, is it 0.7 years? All because of the metric. Having somebody on the research team who knows this data, who knows the language context, who knows the data formats and who understands databases is also very important. As you see, you need an intersection of all these different disciplines to be able to do this really well.
So again, how far do you believe your data? You might have excellent, uniform-looking data, where the people in one slum said all of them had the same problem. You can say, great, we have the problem, let's go solve it. But is it true? Or do you go back to the surveyor and ask, are you sure this is what happened? Or everyone in the slum said they married at the same age.
For example, they all married at exactly 21 and 18. So is your data right? Or are you throwing away this data? That's the discussion that you have to have. We try very, very hard to hold on to our data. So you decide under what parameters you will keep the data, and these assumptions need to be stated when you are writing about it. Even as researchers, we forget down the line that we made these assumptions in order to keep the data. So I think keeping track of these changes, noting "this is the change I made", is important, because you have to state these assumptions after the analysis.
I have a little bit more time. Can everybody hear? I am just going to give a very brief overview of how we store all this data. We have close to 300,000 data points. As I mentioned, we have each household's data in one Excel sheet, so we have 1014 Excel sheets. Excel is great for some things, but when we want to query across these sheets, or query just one slum, it becomes a little painful. So we decided to design a database for it, and for simplicity's sake we decided to use MySQL. There are great clients for MySQL and great GUIs, so it's simple not just for computer scientists; other people can use it too. We had economists and social scientists who wanted to query the data. One thing we found was that all of these people were somewhat familiar with SQL. They didn't really know the importance of database design, and they didn't really know what normalization was, but they did know how to write a select query. Which is great, and which is why we could use MySQL. So after discussions with all of these people, we decided we would come up with naming conventions for the tables.
What we did was name each table as question number, underscore, question name. For example, question 55 was travel mode, so the table for that was 55_travel_mode. Very simple; it was exactly the same as the questionnaire. All you need to do is select * from 55_travel_mode to get all the travel mode data. If we had to store a set of questions in one table, it was first question number, underscore, last question number, underscore, question name. For example, cycling data was questions 83 to 86, so 83_86_cycling. So it was very simple for them to get as well.
Normalization was a little more difficult. When we tried to normalize too much, beyond 3.5NF, it became a little problematic. From the feedback we got from people who did not know what normalization was, they did not understand why there were these extra tables and these extra IDs. They did not get why it was necessary. So we had to strike a balance between normalized data and usable schemas. We ended up having most things in BCNF, but around one third of the tables are in 2NF.
A lot of times we are asked, why this format? There are a lot of standards to store data and to distribute data, so why SQL? I think one of the biggest advantages is that, because of the naming patterns and BCNF, it is a kind of simplified schema. It becomes a great way to store data and also a great way to distribute data. In fact, the version that is online is an SQL file. It is very easy: people just load it back into MySQL and it works. So that is how we have stored the data.
All of us here know how to analyze data. So how are we analyzing this data? We have just got the data cleaned up, and we have it in almost all the formats, so the analysis is under way.
One thing is, there were a lot of partners involved.
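The table naming convention described above (question number, underscore, question name, with a number range for grouped questions) can be sketched like this. It is an illustration under assumptions, not the project's actual schema: Python's built-in sqlite3 stands in for MySQL so the example runs anywhere, the helper `table_name` and the columns `household_id` and `mode` are hypothetical, and the three rows are made-up sample values.

```python
import sqlite3


def table_name(question_no, question_name, last_question_no=None):
    """Build a table name that mirrors the questionnaire numbering."""
    slug = question_name.strip().lower().replace(" ", "_")
    if last_question_no is None:
        return f"{question_no}_{slug}"
    return f"{question_no}_{last_question_no}_{slug}"


travel = table_name(55, "travel mode")   # single question
cycling = table_name(83, "cycling", 86)  # questions 83 to 86 in one table

conn = sqlite3.connect(":memory:")
# In SQLite a name that starts with a digit has to be quoted.
conn.execute(f'CREATE TABLE "{travel}" (household_id INTEGER, mode TEXT)')
conn.executemany(
    f'INSERT INTO "{travel}" VALUES (?, ?)',
    [(1, "bus"), (2, "walk"), (3, "bus")],  # made-up sample rows
)

# The one query a non-specialist needs: everything for one question.
rows = conn.execute(f'SELECT * FROM "{travel}"').fetchall()
```

Because the table name can be read straight off the printed questionnaire, someone who only knows how to write a select query can find the data for any question without a schema diagram.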
We had the surveyors, we had the researchers, and we had the people in the slums who gave us this information. I think all of us had vested interests. A lot of these people are surveyed often, but not a lot comes out of it for them, so I think we would want to try to change that in this process as well. Even some of the themes we included were what the surveyors thought were most problematic. Water was extensively covered because the people in the slums had a huge problem with it. It was in the summer, right? They wanted to know what the problem with water was and why they were having it. I think that is one of the reasons why we covered water extensively.
So, the finale. You have a report and you want it to impact policy, or you want to write a paper in a journal; you want to be able to write something. But as I already mentioned, we want to first take this data back: the producers of this information should become the first consumers of this information. So we are trying to find ways to engage with them actively and get feedback from them as to how valid the data is, how valid our analysis is, and how we can use it better. So, yeah, we would like to acknowledge these people on the board. Any questions?
I think I spoke about it: because so much of it is text, we did not even attempt it. Yes, human errors too; human errors are there. But going through OCR is very hard. The questionnaires were answered in Kannada, everything was in a different handwriting, and sometimes even we had trouble finding out what the answer was. OCR would probably have taken a long time.
One thing was safety. A lot of people felt safe in the slum. Elsewhere, safety is your biggest concern. Here, a lot of people rely on their networks, and they migrate to where people they already know live. So the one problem they didn't have living there was safety; they felt safe. That, according to me, was one of the biggest findings. And there were equal numbers of men and women at all ages. I don't know; it could have tipped either way, but there were equal numbers of men and women in the households.
We went back to the slums to verify the data. It had been a year since we started data cleanup when we went back to them. It's a very extensive process: I think it has taken us two years, and we are still in the process of cleaning the data. So as of now we have no substantial way of taking the data back to them. It would be lovely, and I think all of us have been talking about data. But I think the reason that often does not work is that they don't get anything back. Maybe we need to find a model to do that. The data is public; the link is in the presentation. So, I think the presentation is done.
Yeah, I have a question.
The primary question was, and I think it's a very grand question: what is the contribution of the slums to the city, and of the city to the slums? We could probably find out a great deal, but we have no answer to it yet. It was the JDT cut-up, just 50 Tata trust in Pankat. We are still analyzing the data. I can't say there is one most common thing. It depends on where they are located: where they are located and the work they do have a very nice give-and-take relationship with each other.
So you have places that are completely construction-specific, places that are business-specific, and places that are for marketing. Construction people travel a lot to work.
Yeah, so if you take the 20 percent sample, maybe we can try to see whether the responses of the men were drastically different from those of the women; maybe they weren't. But I think there is a slight bias in terms of the problems they're facing. I don't think you'll have the same responses from them. Thank you so much. Any questions?