This talk is about what data is, and I'm going to get into why we should even do this exercise. But in the interest of time, why don't some of you tell me: we spent the whole day talking about data, so what is data? "Data is information." Data is information, okay. Who else? "Data is the raw format of the information." The raw format of the information. Anything else? Any other thoughts? "Plural of anecdote." Sorry? "Plural of anecdote." Plural of anecdote, okay. I haven't read that one anywhere. Any other thoughts? I want to quickly pen these down. So one was information, right? I'll just write "info". The other one was... how do I get this thing down? Pull it down and look? This is the automated one; there's probably a remote or something. So "info", that's one. Plural of anecdote: I'll just write "anecdote". No, there's a point in there. Okay, who else? "You can take the black marker." There you go. So we have information. We have plural of anecdotes. Somebody else, please; come on, we talked the whole day about data. "It's quantitative, not qualitative." It's quantitative, not qualitative, okay. What else? "Relevant information expressed in numbers." Relevant info in numbers, okay. Anybody else? "It's just a set of anything." A set of anything is data, okay. That's what we're going to work with. All right, so I'm a programmer. I've spent the last 10 years or so doing serious programming, and before that, fun stuff. As a programmer, very early on you get used to this idea of abstraction. Believe me, this is exactly what every programmer does: we'll go write 10 lines of code, then we'll say, this is a duck. Then we believe in it.
We get so tied to the idea that this is a duck. It quacks like a duck, so it must be a duck, right? To take a non-programmer parallel: say I write "quack" on a piece of paper and say, hey, it quacks like a duck, so it's a duck. I send it to my customer. I go to a little kid and say, hey, look, here's a really nice duck, go play with it. What's going to happen, though? And this is another lesson you learn very quickly as a programmer: the kid is going to do something you didn't expect. He's going to say, you know what, ducks can swim. He's going to take that sheet of paper and put it in water. It won't swim. And he'll say, your duck's lame. So that's one scenario, the scenario where your customer had a problem. Another scenario could be somebody else who's building around your duck. Somebody decides to build a zoo, or, I don't know, a jungle. And they build a lion. The lion comes to eat the duck, and he says, this is a paper duck, this doesn't work for me. So some other guy complains: you said this was a duck, but this isn't a duck. But I'm so passionately sold on the belief that this is a duck, because I built it. I built a duck. Come on, how can it not be a duck? So early on, as programmers, you learn this idea of abstraction, and you get tied to it. But slowly you realize that abstractions leak. There's a very nice article by Joel Spolsky on leaky abstractions, which you may know if you're a programmer. So you might simplify an idea and say, hey, anything that can quack is a duck. But that's not the entire picture. In some cases that might work, but in other cases it will fail. Abstractions leak. Now all programmers have realized this idea.
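The paper duck above can be sketched in a few lines of Python. This is only an illustration of the story, not anything shown in the talk: the class and function names (`PaperDuck`, `RealDuck`, `is_duck`, `put_in_water`) are made up for the example.

```python
class PaperDuck:
    """A sheet of paper with 'quack' written on it (hypothetical example)."""

    def quack(self):
        return "quack"


class RealDuck:
    def quack(self):
        return "quack"

    def swim(self):
        return "paddling"


def is_duck(thing):
    # The simplified abstraction: anything that can quack is a duck.
    return callable(getattr(thing, "quack", None))


def put_in_water(duck):
    # The kid's unexpected use: ducks can swim, so make it swim.
    return duck.swim()


paper = PaperDuck()
print(is_duck(paper))  # the abstraction holds as long as we only ask for a quack

try:
    put_in_water(paper)
except AttributeError as err:
    # The abstraction leaks as soon as someone relies on behaviour we never defined.
    print("abstraction leaked:", err)
```

The check in `is_duck` is deliberately too weak: it passes for both classes, and the failure only shows up when a caller builds on the assumption.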
They realize that stuff must be defined properly. Why? Because we take a small abstraction and we build bigger abstractions around it. And all of software is fiction, frankly. All of it is just signals of 1s and 0s. Just because I believe that the stuff underneath is a database, and I build stuff on it, doesn't mean it really is a database. Abstractions allow us to think like this, so our mind gets set to it. But we also realize that abstractions break. That's why, as programmers, we realize that defining stuff is very, very important. So much so that if you go to standards bodies like the W3C, or the Apache foundation, which does open source stuff, you'll notice people having pages and pages of discussion around the definition of a small idea. Why? Because those small definitions are used to build bigger things. It's easier to think at a large scale when you have good abstractions. But if your basic abstractions are wrong, you have a problem thinking at a large scale, because things are bound to break. So, the word data. We've built so many abstractions around data. We've got data analysis, data mining, databases, data quality, metadata, master data. And information. The terms before this were stuff that BI people and data analysis people deal with, but information is something everybody deals with. Can somebody here define information? "Meaningful data." Meaningful data. Cool. So philosophers have spent a lot of time on this idea of information and knowledge. People as far back as Plato and Socrates spent a lot of time thinking about what knowledge is, what information is, how we learn, and stuff like that. So I mentioned knowledge. Knowledge is another abstraction, and it's based on data. It kind of sounds the same. Then you have things like facts. A fact is, what, true data?
Maybe something like that, I don't know. There's truth. What is truth? Philosophers have spent a lot of time thinking about this stuff too. Wisdom. And believe me, I'm not kidding you: I have read at least 10 papers in the last two days, since I proposed the talk, which relate data to wisdom. Wisdom is the stuff Zen masters think about. So what the hell is this thing, if so much of our world is built around it? Fundamental concepts in philosophy, fundamental concepts in science, fundamental concepts in this new science we invented, information science. And these days we're trying to invent a brand new science, data science. I'm not sure what that means. So what is data? "Sorry, there is a degree in data science already." There is a degree in data science already? I would debate that degree. Anyway, so what is data? I was planning to ask that question at this point, but some of you have already answered it. The first one was: information is data. So what's the difference between the two? What is information and what is data? Are they the same thing? She said meaningful data is information. Fair enough, I will not argue with that one. Quantitative, not qualitative, information is data. Cool. Information in numbers is data. Something like that. We've got a bunch of definitions, just in this room, and I'm sure there are others who have thought of other stuff. I think somebody said something in the morning about variables; that's also one of the theories. It turns out, because I'm so passionate about this idea of having strong basic abstractions, that four or five years ago, when I started working on data analysis problems, I became a little worried about what the hell this thing is. It didn't make a lot of sense to me, because a dictionary is the first place you go to find the meaning of a word.
You go to a dictionary, and if you go to different dictionaries, they define data in terms of information. They define information in terms of data. They define knowledge in terms of data. They define data in terms of knowledge. So it's basically all cross-linked. When you get down to it and really analyze what means what, you get nothing out of dictionaries. But you do get one basic idea of what data is. Let me get to that slide. So that's the first definition: data as the given. The Latin root of the word data is dare, which means to give. Hence the past participle, datum, means something that is given, one item that is given. The plural of that is data, which is several things that are given. Now what does given mean in this sense? Given is the starting point of something. I'm tempted to think of math problems, things like: given A is 2 and B is 5, what is A multiplied by B? So given, in this sense, is something you start out with before you do anything. It's the starting point. If you're a programmer, it's essentially the input to your program. Or you could have a slightly better example: given the length of a rectangle is 2 feet and its breadth is 5 feet, what is the area? So my question is, did we add more data? What did we do? What is the difference between the statement "given A is 2, B is 5, what is A multiplied by B?" and the statement "given the length of a rectangle is 2 feet and its breadth is 5 feet, what is the area?" "It's the context." The context, a very important thing. Let's remember that point about context. So, as we mentioned, if you do some initial looking into things, you find that data, information and knowledge seem to be related. We're not sure which is which. We don't know whether data is made up of knowledge or knowledge is made up of data, because it's all a little weird, at least in dictionaries.
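The two "given" statements above can be sketched in Python to show that the numbers stay the same and only the context changes. This is an illustrative sketch: the `interpret` function and the shape of the `context` dictionary are assumptions made for the example, not something from the talk.

```python
# The data: the same two given values in both statements.
data = (2, 5)


def interpret(data, context=None):
    """Multiply the two given values; attach meaning only if a context is supplied."""
    a, b = data
    product = a * b
    if context is None:
        # No context: we can compute, but the result is just a number.
        return str(product)
    # With context, the same number becomes a statement about something.
    return f"{context['quantity']}: {product} {context['unit']}"


# Hypothetical context for the rectangle statement.
rectangle = {"quantity": "area of rectangle", "unit": "square feet"}

print(interpret(data))             # 10
print(interpret(data, rectangle))  # area of rectangle: 10 square feet
```

Nothing in `data` changed between the two calls; the second result says more only because the caller supplied what the numbers stand for.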
But it seems that these three things are linked. Now, if you start digging deep into what these three things are, you come up with two, actually three, main subjects that have looked at these three words. The first, which is the oldest of them all, is the study of knowledge. It's called epistemology, and it's a sub-branch of philosophy. That's where scholars have spent a lot of time trying to define what knowledge is. This stuff predates computers, because people have been thinking about what knowledge is for a very, very long time. In that paradigm there are several explanations, and I'm not going to attempt to define all of them. There is one that makes a lot of sense, and that definition is: knowledge is justified true belief. I saw that there seems to be a lot of consensus on knowledge as justified true belief. What does that mean? From a brief look into epistemology research, it looks like this: if you want to say p is knowledge, then p must be true, you must believe that p is true, and you must have a justification for why p is true. If you have these three things, then you have knowledge that p is true. So this is how they try to define it. For now, since we're trying to define data, we'll just keep this as context and forget about the term knowledge. Let's look at data and information, because they seem to overlap way too much. So, I started talking about subjects; I'll go back to my slide, actually. Knowledge is defined mostly in epistemology. There's this other subject called information science, which is fairly recent. It's in the last 50 or 60 years that people have been talking about this thing called information science. There they try to define all three things. Most of the research on the topic seems to deal with all of them, but their definitions of knowledge are, in my opinion, incredibly narrow.
They're force-fitted into the idea that knowledge is all about computers. It isn't about information systems. Knowledge is an experience. It's something that happens in our mind. It's very hard to define. When we know something, we have knowledge, right? And how do you define this knowing? It's a pretty hard thing to deal with. So I don't think the information science definitions of knowledge do justice to knowledge, and the philosophy definitions are very different. So what we'll do is look at data and information in the information science papers, and forget about knowledge. So I started digging into several papers. One of the first basic definitions I came across was: data has no meaning or value because it is without context and interpretation. This is trying to say that data is raw. It's raw in your context. Whatever your setup is, data there has no context, and hence you don't have any way of understanding it. It means nothing, and because it means nothing, it has no value. But you know what? As somebody who's trying to make a career out of data: data has no value, are you kidding me? So, another definition. The first one was from Jennifer Rowley, in a paper called "The Wisdom Hierarchy: Representations of the DIKW Hierarchy", that is, data, information, knowledge and wisdom. The second one is from a different paper. This one says: data are discrete, objective facts or observations, which are unorganized and unprocessed and do not convey any specific meaning. All right, so some more things are added in there. One, they are discrete. They're not continuous things. There's one piece of data, there's another piece of data, there's a third piece of data; they're discrete things. Why? Because data are datums, right? So they must be discrete units or something.
Objective facts or observations. So, discrete observations which are unorganized and unprocessed and do not have any meaning. This adds a little more: it says data is unprocessed. It's the given in whatever you're trying to do, because you've not done any processing on it yet. This makes sense because it's in line with the original root of the word, which is that data is the given. So: unorganized, unprocessed, conveys no meaning. That's the whole definition. When the same paper tries to define information, they say information is data that have been given meaning by way of context. Which I think is absolutely right: information is data in a certain context. Information is data plus context. So if you have data and you have the context to understand the data, then you have information. You have something that makes sense. And you have to think about this context very, very deeply. What is this context? It's several things. Let's take this English sentence for a minute. This is conveying some information. (Okay, you said five minutes.) How are you able to understand this sentence? First, how come I have written some white stuff on the board and you can understand it? What is the context? "Data plus metadata equals information." Data plus metadata, okay, fine, data plus context. Same thing; I think context and metadata are similar things. But what is the metadata thing? Because we have the open data thing. The context is that you know the English alphabet. You are the context. In the past you have learned a different piece of information, which is the English alphabet, and that's why you can understand what is written on this board. So the context is deeper than what you might think.
Context is about who you are, what you've learned in the past, and what you have now. It's also about how I write it, who I am, who is sending the information, and how he is sending it. So context is a really broad concept. But once you have data plus context, you have information. This slightly broader definition says: information is data that have been organized so that they have meaning and value to the recipient. The root of the word information is inform. It's about informing somebody. It's about communication. And if it is about communication, there's the other person: I might be sending this text, and you may or may not receive it. If you do not know English, you will not receive it. So if you don't have the right context, this information is just data. It means nothing to you. So we look back at our example. Given A is 2 and B is 5, what is A multiplied by B? That was our first example. The second example was: given the length of a rectangle is 2 feet and its breadth is 5 feet, what is the area? As he said, we added more context. We added more information. We did not change the data. I'm rushing through this because I want to get to more. So far, we've looked at one definition of data, which comes from the root of the word itself: data is something that is given to you. Another definition that is explored in a lot of information science papers is data as a signal. This stuff is really very interesting. Why? Because it helps us explain all the data analysis abstractions we use. So, data as a signal. Here I'll try to define it; I'm not going to bore you with who these people are. Data are the sensory stimuli which we perceive through our senses. Very simply, you can think of data as signals coming in through your senses. You can hear data, you can see data. If you're a computer system, you can see it coming in through your API.
If you're a mobile phone, you can see it coming in through your sensors. But basically, data is stuff coming in as signal. Information is the meaning of that signal. But to understand the meaning, you must have the right context. So it's the same definition, slightly twisted. But it says that data is constantly coming at you. I heard something flap here; that is data. It just doesn't mean anything to me, which is why it's not something I care about as information. But it is data. Somebody's moving around. It is data. I just don't care about it, I don't try to comprehend it. Here is an example from one of the papers: the noises that I hear are data. The meaning of these noises, for example a running car engine, is information. So if you have the context of having heard the sound a car makes when it's running, then you can interpret the noises as the sound of a car. I don't like this definition, but at the same time I also like it, because it solves all the problems we face when dealing with data inside computer systems. And he's not talking about computer systems at all. There are three important things to look at. Data is one or more kinds of energy waves or particles. So data is a signal. Leave everything else aside and just think of it as light. So data is light. Then the definition says: data is light selected by a conscious organism or intelligent agent on the basis of a pre-existing frame or inferential mechanism in the organism or agent. So the signal you're dealing with is light, and I am the organism seeing the light. And I have some context. I have the context that something that looks green is green. I look at an apple because that light reflects at me, and I see that this is an apple because I've seen an apple before. I'm the one who has the context in my head. So I'm the intelligent agent in this case.
And I have an existing frame, my state, my existing context. So basically there's a signal coming at me, and I'm the context. Given my context, I decide what to keep and what to throw away, because there's always signal coming at us. And this is true in computer systems as well. There are computer systems perceiving data from sensors, computer systems perceiving data from several different things. Now, what is a database? This is where it all ties into what we do. A database is a record of this constantly incoming signal. There's signal coming in, coming in, coming in, and we've just chosen to keep a record of a certain part of it, given our context. Let's imagine a business. A business has customers coming in, customers buying stuff, customers just browsing. The business decides that, in its context, what matters is when a customer buys something, so it chooses to record whenever a buy transaction happens. So data is a signal in the recorded state. Now, let's go back to our light analogy. What is recorded light? "A photograph." Right, exactly. So imagine data as a photograph of what happened in your business. Now, what if it was several photographs taken continuously? You would imagine a video. Instead of a video, I would choose to imagine a reflection. What if my computer system was a mirror, with light coming in? So now, this guy: my computer system is a mirror. My ERP system, whatever business process system you want to imagine. I have signal coming in, light. I have a mirror; this is my system. And this system has a context: it knows that it must record buy transactions. And this here is my database. So my DB is a reflection of everything that happens in my business. The data I have in my enterprise data warehouse is a reflection of everything that happened in my business. Now, why is this word reflection very interesting?
It's interesting because light displays all the properties of a signal we might want to imagine. For example, if this mirror is bad, my reflection is bad. (So this is, like, my last two minutes.) If the agent that is creating the reflection is bad, the reflection is bad. This is what all the books about data quality deal with. They are talking about the fact that if you record badly, you have bad data. And who's recording the bad data? The context. It's not just the system. The context also includes the person doing the manual recording. It also includes the medium through which I got the information: if it came over a phone call, and the call was garbled, I did not get the right information. So if the mirror is bad, I get a bad reflection. Another thing that happens is that you can have multiple mirrors reflecting the same light. Then you end up with another data problem we deal with, where we get multiple reflections, and these reflections might be out of sync. Say there's a mirror in India and a mirror in the US. Light travels in a straight line, for some reason, I don't know. But let's say there's a mirror here in Bangalore and a mirror in Bombay, and I shine light from here. The reflection first comes back here, then reaches Bombay, and there's a time delay. So the reflection in Bombay is out of sync with the reflection in Bangalore. And this can be a serious problem. It could also be that the mirror in Bangalore is more sophisticated, that the system in Bangalore is better at recording stuff than the system in Bombay. So again, the reflections are out of sync. The moment the reflections are out of sync, we hit another common data problem: which one wins? Which one is the correct information? And this again is a data quality issue.
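The mirror picture can be sketched as a small Python model: a stream of business events is the light, and each system is a mirror that keeps only what its context cares about and sees it after some delay. Everything here is a hypothetical illustration; the `Mirror` class, the event fields, and the delay values are assumptions, not a description of any real system.

```python
from dataclasses import dataclass, field


@dataclass
class Mirror:
    """A recording system: it reflects only the part of the signal it cares about."""
    name: str
    delay: int  # ticks before an event becomes visible to this system (assumed)
    records: list = field(default_factory=list)

    def reflect(self, events, now):
        # Context: this system records only "buy" events, and only those
        # that have had time to reach it by the current tick.
        self.records = [
            e for e in events
            if e["kind"] == "buy" and e["time"] + self.delay <= now
        ]
        return self.records


# The incoming signal: everything that happens in the business.
events = [
    {"time": 0, "kind": "browse", "customer": "c1"},
    {"time": 1, "kind": "buy", "customer": "c1"},
    {"time": 3, "kind": "buy", "customer": "c2"},
]

bangalore = Mirror("Bangalore", delay=0)
bombay = Mirror("Bombay", delay=2)

now = 3
near = bangalore.reflect(events, now)
far = bombay.reflect(events, now)

# The two reflections of the same business disagree at this moment.
print(len(near), len(far))
```

At tick 3 the nearer mirror has both purchases and the delayed one has only the first, so any report built on one of them has to decide which reflection wins. Neither mirror ever holds the browse event, because the recording context threw it away.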
It can be an issue of which one is master data, and things like that. So what I've realized is that thinking of data as a reflection of whatever happened in your business really works in the context we are in, data analysis. Why? Because then you can factor into your analysis all the problems reflections can have. Now I want to make a final point, and then I will stop; I'm going to skip all of this. This is an idea that hasn't been explored. Actually, it's nowhere; it's stuff I've been thinking about. Data is a social entity, a social object. Social objects are something a lot of people in the social sciences have defined as topics of conversation. So why are all of us here? We're talking about data. The reason we are here is data. There have been a lot of smaller conversations around toilet data and particular numbers, and we talked about insurance a little while ago. So singular data points are possible topics of conversation, provided the right people are involved. In fact, the main reason we do data analysis is to help with conversations. We have business problems where we want to make arguments, and we do data analysis to support our arguments. So, if conversations happen around data, why don't these conversations live with the data? What happens today is that you get a data point, you generate a report, you go to a meeting and discuss that data point. Or you go through a chain of emails discussing that data point. Or you post it on a blog and get policy makers to discuss it. You publish it in a newspaper and then several people discuss it. So there are conversations happening around data. There are even better conversations happening around higher levels of data, stuff that is turned into more useful information: visualizations, for example, summary statistics, things like that. There are conversations happening around data.
Data is a social object. Data can trigger conversations, provided the right people are involved. So we must think about bringing these conversations closer together, because today they're siloed. I have an email conversation with somebody; somebody else has an email conversation with somebody else. There is value lost here, because there is information that doesn't feed back into the data. That's why I think thinking about the social impact of data is very important. And this is, in fact, what open data is about. The whole idea is that you open up data so that there can be conversations that will trigger better governance. So if that's the goal, we need to bring these conversations together somehow. That's kind of my question.

"So I have a question. The end of your talk focused on making information, on joining different pieces of information. What is your take on information science versus knowledge management?"

Well, actually, information science is a subject that contradicts the definitions in its own papers, because they say that information, when understood, becomes knowledge. If that's the case, information science is left behind: knowledge is a higher-level entity than information. But information science still studies the management of knowledge. Now the question becomes, and this is also there in a lot of information science papers: is knowledge something that you can manage? The philosophy side of things holds that there are two kinds of knowledge. There's knowledge that happens inside us, and then there's universal knowledge. And they say that universal knowledge may be something you can manage. So maybe at the organization level you can manage certain knowledge, but you will still end up with knowledge that is inside people, which is why a lot of information science knowledge management theory deals with managing people, training them, and improving the quality of their knowledge.
But that knowledge still lives inside people. Only a percentage of it ends up in computer systems. Even though the subject is called information science, knowledge still remains an experience you and I have individually. Thank you.