OK. So my name is Sumandru. I wanted to title this talk "the aRt of NSSO data", with the R written in this specific way, because I wanted to talk both about using R, the software and the language, and about NSSO data. I had a feeling that there would be people in the audience who use different kinds of software techniques to work with quantitative data. So if I keep the R part light, I'll just talk about the art of NSSO data, and show some R bits for people who are interested.

The attraction of NSSO data, and why you should be aware of the art of NSSO data, comes from the fact that it remains the best sample data giving national estimates in our country. Take this particular table. I mostly work on urban development issues, and I was working on affordable housing policy. This is one fascinating table that I took out of the NSSO 2008-09 housing sample survey data. I don't want to get into its details; all I want to say is that it gives you the per capita floor area of people living in India. Of course, it's sample data, so you can create national estimates based on it.

Do you understand what the quintile class is? MPCE is monthly per capita expenditure, and these are the quintile classes. If I take the monthly per capita expenditure of the entire population of India, arrange people in order of increasing monthly per capita expenditure, and divide the population into 0 to 20%, 20 to 40% and so on, I get five classes. Against these, the table gives you the per capita floor area by the kind of housing. What is pucca? Pucca means concrete houses, semi-pucca semi-concrete, and katcha houses are made of materials that decay, all kinds of organic materials. What the table tells you is that up to the 60-to-80 quintile class, which means 80% of the population of India, the average per capita floor area is around 10 square metres. If I take the national average of five people per household, that gives a 50 square metre house for 80% of the population. The argument I was trying to make is that a 50 square metre house is what, in government terms, is called economically weaker section housing; if somebody builds economically weaker section housing, it gets certain kinds of subsidies, and so on. What I was trying to argue, basically, is that the subsidies given to housing are completely insufficient to target the kind of housing need India has, given that this is the real picture of what size of housing people in India live in. This is fascinating, because you can take out this kind of data.

Throughout the other parts of the presentation, I keep talking about the pitfalls of using NSSO data, acquiring NSSO data, and sometimes even publishing NSSO data. Navigating all these pitfalls is the art of NSSO data. This is the structure of the presentation: I begin with a brief history of the sample survey, I talk about certain concepts in use in the sample survey, and then I talk about data organization. I end with data organization just to tell you that what you really need to understand in working with NSSO data is how the data is organized.
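Before moving on, just to make the quintile computation from that table concrete, here is a minimal R sketch. The data frame and the column names (hh, mpce, floor_area) are made up for illustration, and a real national estimate would also have to apply NSSO's sample multipliers, which this sketch ignores:

    # Made-up household data: mpce = monthly per capita expenditure,
    # floor_area = per capita floor area in square metres
    set.seed(1)
    hh <- data.frame(mpce       = rlnorm(1000, meanlog = 7),
                     floor_area = runif(1000, 4, 40))

    # Order the population by MPCE and cut it into five quintile classes:
    # 0-20%, 20-40%, ..., 80-100%
    breaks <- quantile(hh$mpce, probs = seq(0, 1, 0.2))
    hh$quintile <- cut(hh$mpce, breaks = breaks, include.lowest = TRUE,
                       labels = c("0-20", "20-40", "40-60", "60-80", "80-100"))

    # Average per capita floor area within each quintile class, as in the table
    aggregate(floor_area ~ quintile, data = hh, FUN = mean)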
Once you understand how the data is organized, whether you're using SPSS, Stata, R or Python, for that matter, is really up to you; there's no need to get stuck on one kind of software or language.

Starting with the history of NSSO data. It begins in 1862, which is when a statistical committee was created by the colonial administration, just to get a sense of what was happening in the subcontinent. I don't want to go through all these points; this will be up on the net, and you can check it later. What I want to focus your attention on is that it had a very strong commercial interest from the beginning. The colonial administration's primary interest was checking what kind of commerce was taking place in the subcontinent. And if you look at the present themes of data collection under NSSO, you see the same thing. NSSO collects data on housing: that's one important market for you. It collects data on health facilities, as well as the health-seeking practices of individuals, which is again a big market for you. It collects debt and investment statistics for the entire population of India: again, a very important market. So I feel that NSSO data is sometimes seen as governmental data, about administration and not about markets, and a lot of market-driven analytics work usually doesn't look at this sort of data, which I think is a sad situation.

Then, after 1947, P.C. Mahalanobis, which is the big story, gets appointed as honorary statistical advisor. CSU, the Central Statistical Unit, is the initial name given to the centre that was created; it was later called the Central Statistical Organisation, and at present it's part of the Ministry of Statistics and Programme Implementation, MoSPI.

On to the concepts. When you go to the CSO, the Central Statistical Organisation, and ask for data, you have to tell them what round and what schedule of data you want. Each round is one annual collection of data by NSSO. So if you go and tell them you want the 2008-2009 data on something, they would rather appreciate it if you say you want the 67th round data. That's how it works; the list is on the CSO website, so it's not too difficult to look up.

The schedule is the thematic focus. The census is a complete enumeration on one hand, and on the other it asks the same questions every 10 years; there are some additions, but the thematic focus doesn't change. It's not that one census talks about housing and the next one about health facilities. NSSO, on the other hand, is not a complete enumeration; it's only sample data. Sample data makes it, not difficult, impossible for you to understand anything below the level of what they call the state region. So if you want district-level averages, you cannot get them from NSSO, which is a major limitation. I'll come back to what a state region is in a bit. What the schedule does is say that this year NSSO will undertake sample surveys on, say, four different thematic areas: consumption expenditure, employment and unemployment, debt and investment, and quality of housing. All of these are called different schedules. Some schedules are repeated every five years; the consumption expenditure schedule, for example, NSSO would do on, say, the 60th round, then again on the 65th round. These are called the thick rounds.
These are the quinquennial surveys, surveys repeated every five years. If you want a good time series of data, you should look for the thick series, because that's where you get very good sample data every five years; at least the state-level averages are the best you can get. I mean, not perfect, of course there are data collection issues, but that's a separate matter. Then there are the minor, not minor rounds, minor schedules, which are called the thin rounds. These are schedules that are not repeated at such a regular frequency. For example, 2002 was one housing schedule, 2008-09 was another, and we don't know when the next will be. So there was one in 1996, one in 2002, one in 2008-09: roughly a six-year cycle, but it's not so regular, and the size of the sample is also limited. So thick are the major series, thin are the minor ones.

I was already talking about state regions. A state region is a cluster of districts within a state. Each state is carved up into usually three to four state regions, and at the state region level, at that level of aggregation of three or four districts, the NSSO sample survey gives representative statistics, not below that. So another question that often comes up is whether NSSO can tell me about the city of Bangalore, which it really cannot. It can tell you about Bangalore urban, Bangalore rural, and a couple of other districts as a total. Okay.

So, the fixed-width data format: this is again the main concern with NSSO data. I'll show you one sample just to make it clear. This is what NSSO data looks like, okay? It's really not very pretty. Has anybody here worked with census data? Census data is really much nicer. It comes in Excel sheets or an Access database; it gives you variables and nice columns; you can understand what each number is talking about, right? NSSO doesn't. And this is the trick: the art of working with NSSO data is to get this data into a format which is structured, which is something you can work with, and also to assign variables to this data, because the raw data itself doesn't come with variables.

This format is called the fixed-width data format. Each row has a fixed width, okay? Say it has a width of 150 characters; then each row can only take 150 characters. Again, this quirk of the data you should read through the history of NSSO data. NSSO is not something that started happening 10 years back; it started 50 years back, and the data handling capabilities then were severely limited. So there are a whole lot of holdovers from those data technologies in the present practice of NSSO data.

The schedule file gives you the questionnaire for that schedule. So if the survey was about housing, the schedule file will give you the full questionnaire; I'll show you one as well. The layout file tells you which questions are mapped to where in the raw data that I showed you: which variables are found where in the raw data. So the trick is to take the layout file and the raw data, put them together, and create a structured data format, okay? That brings us to data organization.
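Just to make the fixed-width idea concrete first, a tiny R sketch, with a made-up record and made-up column positions standing in for the layout file:

    # One made-up fixed-width record. Suppose the layout file says:
    # columns 1-3 = person serial number, column 4 = level, columns 5-6 = age
    row <- "101212"

    substr(row, 1, 3)              # "101" -> serial number
    substr(row, 4, 4)              # "2"   -> level
    as.integer(substr(row, 5, 6))  # 12    -> age, per the layout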
Now, we talked about the fixed-width file, which comes as a TXT file. It is purely numeric coding of information. That means it doesn't have yes or no; it wouldn't have "the person's state is Karnataka"; it would have "the person's state is number 27". It's all in terms of numbers; there are no words or attributes involved.

Then there is a list of supporting files which tell you how to work with that data. The schedule file I talked about: that's the questionnaire. The layout file tells you where the variables are in the dataset. The read-me file gives you some more information. Usually what it tells you is that if there are, say, five data files, they divide up the entire data in terms of states, and it tells you which data file has the data for which states. So if there are five data files, one of them would have state number 27, and the read-me tells you that those state codes are in that data file. That's what the read-me files usually are. The state and region codes are very important if you're working with this. They tell you, say, that Bangalore falls under number 273: 27 is the state code, 3 is the region code, and together the state-region code is 273, something like that.

The real difficulty comes in with, again going back to what I was saying about fixed-width files, the thing about levels. Each row has 150 characters, and the entire questionnaire cannot be coded in 150 characters. So the answers given by one person or one household have to be spread over multiple rows, right? These multiple rows are called multiple levels: each set of answers by one person is divided up into multiple rows, and each row is called a level. There's a variable called level, which takes values 01, 02, 03 and so on, which you can use to understand what exactly that row is about, or what part of the questionnaire that row refers to.

There are other major difficulties as well, for almost all schedules really. Say for consumption expenditure, there will be some questions asked at the household level and some questions asked of each individual in the household. So if you're taking the entire schedule and working with 16 different variables from across the questionnaire, you always have to keep in mind that certain variables need to be multiplied by the household size, or divided by it, so that they're all comparable; a figure can be the monthly expenditure of the whole household rather than per capita, and you have to keep that in mind when you're working with it.

While we're talking about levels, I'll try to show it with a simple example. The schedule tells you that the first question is: what is the serial number of the person? And the layout tells you that the serial number is coded in columns one, two, three. Is this visible, mostly? Yes? Okay. So you know that the first three characters in each row tell you the serial number of the person to whom that row refers, right? Please ask me questions if something is not clear. Sorry? Yeah, right, I'll just come to that. Yes, sir? "I'm noticing this is kind of like a Jurassic era of data representation." Sadly so. "Has someone brought it into, like, the last hundred years by creating a database which has looked at this zoo of data and actually translated it into SQL-queryable tables, or anything like that?"
Not yet, sir. As for NSSO, they're supposed to start publishing something like that, I think, but we can quit expecting NSSO to do it; if they haven't done it by now, they won't. Half of that thing, half of making it a SQL-queryable database, has been done at Azim Premji University, where I was working. The part that has not been done is getting the structured data into a SQL database; that hasn't been done yet, even there. What we did is we took the unstructured, not unstructured, it's fairly structured, but as you see it's structured in Jurassic-time customs, right? We took that data and structured it in a more easy-to-understand way, but the next part hasn't been done. The question also raises concerns regarding the copyright of the data, whether somebody is actually allowed to do that. It's not just the government; it's a government data product, and that is not too clear. You cannot just take it out, put it in a SQL database and make it available for others to either freely access or buy, right? So there is a catch there.

Right, so coming back: the first three characters were the serial number. The next two are the age. So we know that for 101 and 343, two different persons, the fourth and fifth characters are the age, right? And the daily wage is given by the next four columns. This is how it would have looked had 150 characters been sufficient to capture the entire questionnaire, right? But sadly they aren't. So what happens is that there are multiple rows, and each row about the same person carries the unique ID number of that person, right? That is fixed. So the serial number would be there in columns one to three for all rows. But the thing is that column four, which is 2 and 4 here, gives the level number, okay? The person's age is given by columns five and six, which is 12 here. But we have to look only at rows with level equal to 2, as I said here, and take columns five and six as the age variable, okay? And there would be another row; see, the daily wage is encoded in the next row, which has the code 4. There, 3434 becomes the wage, though in the original data file the column locations are the same for daily wage and age, right? Please tell me if I need to repeat anything.

Okay, so what you have is multiple rows about each person, right? Columns one to three are the person's ID; that remains fixed throughout all the rows. Column number four, sorry, gives you the level number. And on the right-hand side of that row, how you read the column structure, the widths of the columns, is determined by what value that level variable takes, because different levels have variables in different columns, yeah? A bit clearer? So if I just go back a bit: not easy to see here, but say 101 and 101, that's the serial number, and 2 and 4 are the different levels, level number 2 and level number 4. Then 101-2-3434 and 101-4-3434 would actually be talking about different variables, because the level number is different. So what this 3434 means depends on which row, which level, it sits in, right?
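In R, that toy example looks something like this; the two records and the column positions are made up to match the example, not taken from a real NSSO layout:

    # Two made-up rows about the same person (serial number 101).
    # Columns 1-3 = serial number, column 4 = level; the meaning of
    # columns 5 onwards depends on the level.
    rows <- c("101212  ",   # level 2: age 12 sits in columns 5-6
              "10143434")   # level 4: daily wage 3434 sits in columns 5-8

    lvl <- substr(rows, 4, 4)

    # Same column positions, different variables depending on the level
    age  <- as.integer(substr(rows[lvl == "2"], 5, 6))  # 12
    wage <- as.integer(substr(rows[lvl == "4"], 5, 8))  # 3434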
Right. But in the raw data file itself, as I showed you here, there is no separation given to understand where a variable ends and where it begins. Exactly. So what you have is something like this: this is the schedule file. These recent schedule files are actually pretty easily parsed; they are Excel files with fixed column locations for different things. So you can take this and immediately create something like, sorry, I don't want to show you that yet. What we do is we create something like this; I'll just increase the font size. We create a standard text file with the description and the column range as two variables. So serial ID is one to two, round-schedule number is three to six, and so on. You take the Excel file, get it into a text file, and make it easy to understand what is happening in each row.

Do you want to see a bit of R? Absolutely, yeah. So this is the bit of R I want to show you. What you do is you take a variable called codes, and you read the table that I just showed, codes.txt, right? Don't even think about the parameters; you can work them out later. Then what I do is split the range, which is, say, "1-2", using the split character, the hyphen, okay? That's again pretty straightforward: I tell it what the beginning column is and what the end column is. So R now knows that, say, for the variable ID, the beginning of the range is location one and the end of the range is location three within the row, right? Then you create another variable, data: you read the fixed-width file, this is the name of the fixed-width file, with the column widths given by the ranges within codes and the column names given by the description variable within codes. Please ask me if I need to repeat. Five more minutes? Okay, I'm mostly done. Have I published all this? No, not really. But I should, right? I think I will, yeah. There's some other code which we should not be looking at right now.

Has everybody here used R? Okay, a lot of people, I see. That's nice, that's nice. Oh, sorry, I wanted to show you this. Right, so this is what you do, basically. Let me go through the logical sequence again. You take the description of the variable, the name of the variable, and you take the column width, or rather the column location: from the first to the third character, from the fourth to the sixth character, and so on. For each level, you do this, okay? Then you take the dataset and say: read this dataset given this column information, and then delete every row where level is not equal to one. Then you repeat the process for the next level. So you have the entire dataset for each level, and you already have the person's ID, so you take the different datasets and put them together, merging by the person ID. That's pretty straightforward, right? It requires some process, but it can be done without too much work; put together, the whole thing looks roughly like the sketch below.

R is really nice to work with, which is the reason I wanted to talk about R here. I personally like working in a text editor, so you can use Notepad or something like that to write the R code. Yeah, pardon me? Yes, absolutely; otherwise things wouldn't be possible, yeah. And that's also part of the art of NSSO data: appreciating the data.
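Here is that read-and-merge workflow as one minimal sketch. The file names (codes.txt, data.txt) and the column names (description, range, level, id) are placeholders I've made up; it also assumes the ranges in codes.txt are contiguous, and in practice you would repeat the read with each level's own layout:

    # codes.txt: one row per variable, with a 'description' (variable name)
    # and a 'range' like "1-3" (column positions in the raw data)
    codes <- read.table("codes.txt", header = TRUE, stringsAsFactors = FALSE)

    # Split each range on the hyphen to get the begin and end columns
    rng   <- strsplit(codes$range, "-")
    begin <- as.integer(sapply(rng, `[`, 1))
    end   <- as.integer(sapply(rng, `[`, 2))

    # Read the fixed-width raw data, naming the columns from the layout
    data <- read.fwf("data.txt",
                     widths    = end - begin + 1,
                     col.names = codes$description)

    # Keep one level at a time, then merge the levels back by person ID
    lvl1   <- subset(data, level == 1)
    lvl2   <- subset(data, level == 2)
    merged <- merge(lvl1, lvl2, by = "id")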
You have to appreciate the fact that these guys are really good at their work as well, because it's a lot of data, and they've been doing this without most of the techniques we use regularly. A lot of frustration? Exactly, yes. But there's a significant amount of work and a significant amount of logic behind this.

Anyhow. So you can write up your R code in a text file, save it as .R, and it runs directly in R. I'm using RStudio, which is a really nice IDE for R. In this part, it's not visible, I know, you can take a look at your entire code; it allows on-the-fly editing, which is really helpful. And if you run it directly here, in the console part, you can see what's happening: whether it's getting stuck somewhere, whether it's delivering something. At the moment it's still running, so I'll wait a bit; it's a slow computer. And then you can see something like this directly. What I'm doing here is basically breaking up the entire dataset by the column positions, which I can then easily merge, given the codes you saw already created; the column ranges I've already defined. You can also say that if the value at column location number seven is not equal to one, then drop the row, which is the same as saying that all levels not equal to one are to be deleted: all rows where level is not equal to one get deleted. And that's about it.

The last thing I want to talk about is a really nice visualization package called googleVis. What it does is use the Google Chart API and create Google Chart visualizations on the fly. Basically it takes the data, uses the RJSONIO library to output it as a JSON object, and uses that JSON object to create a Google visualization on the fly. This is something that I've been using; I'll show you an example.
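For instance, a minimal googleVis sketch of the kind of thing just described; the data frame and its numbers are made up for illustration, not real NSSO output:

    # googleVis turns an R data frame into a Google Chart, rendered via the
    # Google Chart API; internally the data is converted to JSON
    library(googleVis)

    df <- data.frame(quintile   = c("0-20", "20-40", "40-60", "60-80", "80-100"),
                     floor_area = c(8, 9, 10, 11, 18))  # made-up numbers

    chart <- gvisColumnChart(df, xvar = "quintile", yvar = "floor_area")
    plot(chart)  # opens the chart in the browser, on the fly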