Well, good morning, everybody. Good morning, everybody. That's better. Thank you very much for inviting us to participate in this, and thank you all for sitting in this session so early in the morning. As the previous speaker mentioned, it's all about data. Data is one of those interesting things we all intuitively think we understand, but one of the challenges is how we make sense of all the data we have. The presentations earlier were excellent, because they show that there are a lot of interesting possibilities and a lot of areas we haven't even started thinking about. A lot of that comes from the fact that the means of getting to data have been essentially non-existent, and today, with the way technology has moved forward, we have some very interesting opportunities to make things happen. So what I'd like to do is cover some of the things I think are interesting, share some thoughts about the technologies we are working on from an open source and a Red Hat point of view, and hopefully ignite some ideas in all of you here, so that we can create the next big thing as well. I entitled this talk "In Data We Believe," because data ultimately drives everything. Every single claim gets tested with: can you prove this? Where's the data to prove this? We all say that. You say something can happen a particular way. Show it to me. Prove it to me. And when you want to prove something to somebody else, what do you base your proof on? On data. I'm sure some of you have also heard the phrase about statistics: lies, damned lies, and statistics, right? At the end of the day, it comes down to how you interpret what you see. The data is there. How do you interpret it? What kind of tools do you have? What kind of skill sets do you have? What kind of biases do you bring to analyzing the data? So I'm not going to teach you everything about this.
A lot of it I don't know myself; we are discovering this together. So what is data, then? Let me just throw one item up here. I would say data is anything that is ones and zeros, right? We all pretty much know that. But if it has some meaning, who determines what that meaning is? Why would some sequence of ones and zeros mean this thing to you and something else to somebody else? So let me pick an example: 101001. The answers are up there. Read as binary, it is 41 in decimal, 51 in octal, 29 in hexadecimal, and if you look up 41 in the ASCII table, it's the right parenthesis. And it turns out to be my postal code as well, where I live in Singapore. So if I were to see 101001, I'd say, oh, that's my postal code, that's great. But it means different things to different people. It's a question of interpretation. The data is there, but how do you see it? What lens do you use to look at the data? We all natively and intuitively understand that. The challenge has been: how do I get to all of this data and make any sense of it? So I would put this out here: there are essentially two very large classes of data. It's by no means a definitive classification scheme, but I think it makes a lot of sense and helps us figure out, when I want to analyze data, what kind of tools I should be using. As they say, you should always use the right tool for the job. If all you have is a hammer, everything looks like a nail. So use the right tools. Let's take this and work the next step. First, there's a thing called structured data. What does structured data actually look like? I'm sure there are database administrators here who live and breathe this on a daily basis. I would say structured data is data that lives in a fixed data format.
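To make the lens idea concrete, here is a small Python sketch, my own illustration rather than anything from the slides, that reads the same digit string 101001 through each of the lenses just mentioned:

```python
# The same digit string, read through different lenses.
s = "101001"

value = int(s, 2)   # read the digits as binary -> 41 in decimal
print(value)        # 41
print(oct(value))   # 0o51 -> 51 in octal
print(hex(value))   # 0x29 -> 29 in hexadecimal
print(chr(value))   # ASCII code 41 is ')', the right parenthesis
print(int(s, 10))   # read the very same digits as plain decimal
                    # -> 101001, which is how a postal code reads it
```

Same digits, five meanings; the interpretation lives entirely in the reader, not in the data.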
With a schema, it's usually normalized, and it's found in a database. It's predictable. It's known. It's filled in, captured, stored away. Usually you find structured data in enterprise data warehouses. Why? Because that's where you want to derive business intelligence from. That's something we have been doing in the industry for the last 30 years, so there's really nothing new in this space. Let's look at the other side of this coin, which is unstructured data. What is unstructured data? If structured data is the stuff that stays inside a table and a row within a database, what would unstructured data be? Essentially, everything else, including what people sometimes label semi-structured data. Take log files, for example: all the click-throughs, as the previous speaker was mentioning, all the cookies in your system as you pull in data from all the visitors to your site, all of that is stored in a log somewhere. The log has some logic to it. It has some form around it, some structure, but it is not structured the way a database is structured. Essentially, it's a flat file in most cases, and there is no formal schema around it either. One log file may differ from another log file, and yet you need both of them. It is sometimes not even normalized: there is repeated data, repeated information, repeat users, and so on. So unstructured data throws a lot of interesting aspects at you. Let me ask a question here: is there more structured data or unstructured data? What do you think? Show me a show of hands: who thinks there is more structured data? Nobody? More unstructured data? The rest of you, I hope. It is an intuitive thing, right? There really is more unstructured data. Unstructured data rules the world. Every single thing, even your heartbeat.
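As a concrete illustration of "some structure, but no schema," here is a short Python sketch that pulls fields out of a web-server log line. The sample line and the field names are made up for illustration; real log formats vary from system to system, which is exactly the point.

```python
import re

# A made-up line in the common log format: there is a shape to it,
# but no database schema enforcing that shape.
LOG_LINE = ('127.0.0.1 - - [10/Aug/2012:09:15:00 +0800] '
            '"GET /index.html HTTP/1.1" 200 2326')

# One regex for one log dialect; a different log file would need a
# different pattern, unlike rows in a schema-backed table.
PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+) (?P<size>\d+)'
)

record = PATTERN.match(LOG_LINE).groupdict()
print(record["host"], record["method"], record["path"], record["status"])
```

The structure here is imposed by the reader's pattern, not declared by the data itself; change the log dialect and the pattern must change with it.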
If you have a monitor that monitors your heartbeat, that is unstructured data too. Every such source is unstructured. There is more unstructured data, and that's the reason you see the data explosion. Terabytes, petabytes. We now have to invent new words to describe the sizes of these data sets; I think the largest current term is zettabytes. Is there anything beyond that? We are running out of letters in the alphabet, so we will need to invent new words to describe even bigger data set sizes. But that's a real problem. Sometimes you need to know when to throw away data. For example, recently, with the discovery, or almost-discovery, of the Higgs boson at CERN, one of the scientists was saying that they capture something like a petabyte of data every day from all the particle collisions they analyze, of which 99.99% they have to throw away. Why? Because they've got no place to store it and no easy way to access it and do anything with it. They have to be judicious about what they throw away. And yet, how do you know that the answer you're looking for is not in that 99.99%? It's a tough call, but that's the reality they face. So we sometimes also need to know when to throw away data. Sometimes you just want to keep it all. It's a tough call. These are early days; we need clever algorithms, and I'm sure somebody in this group is going to come up with ways to manage and store everything, so that you don't have to throw anything away. So, some considerations. How can I get insights from the data I have captured? What tools can I use to get those insights? Insights come in many ways. You can look at the data from a holistic perspective, or you can look at it from a prescriptive perspective, in the sense that: I want to know, from the data I have, how to keep traffic on this particular road flowing smoothly from a certain hour to a certain hour. I am specifically looking for that. I want to create policies, from a government point of view, to divert traffic elsewhere. I have a specific need, so I go and discover data that meets my need. In other words, I am bringing my biases to the data as well. So when you have data, you can believe in the data, but how you interpret it becomes very interesting. And it's not a science, per se; interpretation of data is really an art form, because you choose an interpretation based on the question you have asked. Will the data be available all the time, or in batches? One of the challenges we have always had over the last 30 years is: how can I access all the data I have actually captured? You have all heard of the data warehouse. We have seen that. You have also seen data being stored off-site on non-spinning disks, on tapes. When do I bring that data in so that I can do some analysis? And when I bring it in, I only have so much space; how do I do anything? It's a very difficult problem that we never had a good solution to, and I think today we are on the cusp of just about making it happen. And the last consideration: is the ability to access and assess the data going to be cost-prohibitive? If it is, you are not going to do it; you must have a business case to make it happen. But if it is not cost-prohibitive, if it is cheap enough for you to access and assess the data, things can change. I liked the comment earlier about the mapping of the streets; I didn't know there were so many data points for a particular street. I've used the maps before, and when I go down a particular road, I say: okay, if I stand on this road and turn around and look, what am I supposed to see?
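An aside on the earlier CERN point about throwing away 99.99% of a stream: one classic, generic technique for keeping a bounded but statistically fair subset of a stream you cannot store is reservoir sampling. The sketch below is purely my illustration of the general algorithm; it is not what CERN actually does.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of
    unknown, possibly enormous, length using only O(k) memory."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)   # fill the reservoir first
        else:
            # Keep the new item with probability k / (i + 1); every
            # item seen so far ends up retained with equal probability.
            j = random.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Sample 5 readings from a million-element "stream" we never store.
sample = reservoir_sample(range(1_000_000), 5)
print(sample)
```

The trade-off is that you keep a fair sample rather than the "right" subset, and, as the speaker notes, knowing which subset is the right one is exactly the hard part.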
What I would like to do, and it's probably a suggestion for the map makers, is to also record the time when that particular street-level data was captured, so I know whether it's still valid or a year old. That would be good to know, so that the information can be expired if it's no longer there. But the point I'm trying to make is that there's an enormous amount of information you need to be able to access and assess, and that's not easy. Now, one of the things Red Hat has been able to help push, over the last 10 years as a business and the last 19 years as a company working in the open source world, is the commoditization of technology. These are some of the areas where commoditization is happening in technology. The first and most interesting one is computing. You have heard of IaaS, PaaS, SaaS, and so on: cloud computing. This is a very nice and interesting space to be in, because you now have the ability to think beyond just a set of systems in your data center. You are now able to look at your systems not as an isolated subset, but as part of a much bigger whole. And all of this, interestingly, is being driven by open source technologies. That, again, is a very fundamental change in how we do things. This is not an open source versus proprietary debate; the idea is that if we want to build these kinds of very large-scale systems, it is very difficult for one organization to do it alone. We have to collaborate among developers around the world, and the only way successful collaboration can happen is via open source methodologies. Storage: that's the next big thing. After you have all this data, great; now how do I get to the data and retrieve it in real time? How do I ensure that I don't have to go back to a data store somewhere, load up the additional information, work with it, and, when I'm done, put it back? No, I want it there all the time, next to my compute resources. Networking.
Networking may look like it's done, but actually we are not done with networking yet. Networking is still, in some sense, in a renaissance phase in terms of the capacity and the speeds we need to move and shift data between different nodes. 100-gigabit Ethernet is going to be the norm in the not-too-distant future, and maybe 1,000-gigabit Ethernet as well. It's going to become very important, because at the end of the day, when you have so much data, if you need to move data around for whatever reason and it takes too much time, as the previous speaker was also mentioning, if it takes more than 30 seconds or a minute, your interest goes down, you go and do something else, and you never get to the answer. So we need to ensure that networking also moves forward. Then there's the notion of a software-defined data center. I don't know how many of you have heard of this next phase in data centers. You define your data center as a piece of software, and you don't care what the underlying hardware is. There is some hardware there, but you define your entire network, your storage, your compute resources, your access, your quality of service, every single component, via software. So you have a software-defined data center. That's an interesting concept to think about as well. And the last bit: it's great that you have this stuff sitting somewhere, but how do I get to it? If your end users don't have high-speed data networks to connect with, people are not going to look at it in any useful shape or manner. So access speed becomes an issue. The last mile, so to speak: LTE, xDSL, extra-high-speed DSL and so on are going to be very critical. And as technology moves forward, a lot of these things are essentially commoditized. I was reading about the 4G rollout in India. That's an amazing piece of engineering. The people rolling it out want to have 99.9959 percent coverage.
If they can succeed in making that happen, imagine the amount of innovation that will follow; it's going to be enormous. It's going to be very interesting. Now, all of this is great, but we also need new skill sets, and I can tell you that the skill sets we need are not being taught in schools today. None of them. We are inventing new things today. Take the idea of a data scientist. What is a data scientist? Whatever you think it is. It's still early days. Exactly how do you define a data scientist? I don't know. What about a visualization engineer? What is that person trying to do? Is it an artist? Does he draw pictures? What is the role, and what is the objective, of that particular job? These, again, are new opportunities that come up because you have this enormous amount of data to look at. And there are yet-to-be-formalized jobs that need skills like deciphering and acting on patterns in big data, and so on. I would encourage you to look at the link I provide on the slide, on FastCompany.com, about a whole series of different job opportunities and skill sets that need to evolve in order to manage these things. So you have a lot of very interesting days ahead. I would say this conference is happening at the right time. It's at the point where the internet was in the 1990s, when it was beginning to take off: data is beginning to take off now because of all the connectivity, and we need the right kind of skill sets to manage it. There are opportunities for startups, and so on. So what? You say all these things; so what? What does it mean to me? Do I care? Why should I even bother? What do I have to do? Does it matter? Actually, it does matter. Because the future belongs to whoever has access to the information faster than the rest.
Whoever can access, assess, and analyze the data, come to an interesting conclusion, and act on that conclusion will be the person, or the business, that is going to succeed. If you don't do it, well, nobody remembers number two. We only remember the winner. Talking about the Olympics, right? Who cares who came second, as long as Usain Bolt is number one in the 100 meters? I don't care who the number-two guy is. To me, it's only him. Even if he had come in second, I would still think he should be number one; it's just the way he runs. He's an amazing athlete. The point I'm trying to make is that the quicker you can get to the data and come to a conclusion, the sooner you can benefit from it. And big data is about gaining insights from data while keeping the cost of retrieval relatively low. It's all very well to say, yes, I need to do all this fantastic stuff, but if it's going to cost you an arm and a leg, and you have to justify to a whole bunch of bean counters that you need this piece of software and this piece of hardware, you know what, it's not going to go very far. The way open source has helped organizations is that you didn't have to justify the acquisition of the tools. You just downloaded them, installed them, used them, and found a benefit. And then the CFO wondered: wow, how did you do that? Oh, I used these particular tools. Wow, can we use more of that? Yes. And so the opportunity becomes interesting. That's where open source and open standards help push the evolution. Now, I used both phrases there: open source and open standards. Open standards are extremely important in many, many situations. If open-standards-based tools and techniques are not there, we have a problem. We have an incumbent, or a closed group, that you may not have the opportunity to leverage and break out of. You may end up in what is unfortunately termed vendor lock-in. You don't want a situation like that.
That's the last thing you want. If you really have no other choice, try to build it yourself; try not to go down the path of a non-standard way of doing things. Be fair to yourself. Be nice to yourself. Now you may ask: haven't I seen some of these things before? Actually, we have. Some of them were called business intelligence and data warehousing, right? We have seen some of this. And yet, how many of you are very proud of your data warehousing and the business intelligence you got out of it? When you have a business intelligence piece of software, or an EIS, or whatever they may be calling it today, how quickly can you get to the insights that can be derived from it? Do you do it yourself, or do you ask somebody else to do it? As the previous speaker was saying, you have to ask some kind of analyst to go and analyze the data and give you a report. Can I just do it myself? Can I get to it myself? So the problem, as also mentioned by the previous speaker, is that ad hoc querying of data with these tools is not easy. Not that these tools are no good; at what they do, they are good. But we need more than what they are able to provide today, and ad hoc querying is a very difficult thing for them to do. We need different thinking and different tools. This is why we feel open source technologies could improve on data warehousing and business intelligence tools. But we want to go beyond that. So this is where my pitch, from a Red Hat point of view, comes in. It's probably my only sales slide here. I'm not selling these things yet, but I want you to think about them. One of the things we have been doing is making sure that the operating system across the board is standardized. It's a commodity OS. It's reliable, it's predictable, and there is somebody you can go to when you need official help. You want corporate backing for it? Done.
That's Red Hat Enterprise Linux. You may also require some kind of middleware toolset. Fine: we also have a JBoss-based, Java-based environment, standardized, Java EE compliant, fully open source. You can do whatever you want with it. We also have something called Red Hat Storage. Now, storage is going to be the next big thing, and that's why we are here. Storage is a very important component. We need to commoditize storage and move it forward; otherwise, it's going to be the same old proprietary storage systems, which are going to be cost-prohibitive, and you may not get the business insights you want. Now, I'm very glad to mention that Red Hat Storage is the evolution of Gluster. You may or may not have heard of Gluster. Gluster is a Bangalore-based business that Red Hat acquired about a year ago, and it is now the storage we provide to be able to access petabytes of data in one go. You need that kind of system to be able to scale out. And the last item here is two other pieces we are talking about: one is called OpenShift, and the other is called CloudForms. These two technologies are about PaaS and IaaS respectively. So, given all this data, how do I now create applications with which I can access and assess it? What do I need to make that happen? I need some kind of platform. Do I want to build it myself? By all means, be my guest; build it yourself. But hey, can we build something and then make it available to the open source community and the rest of the world at large? That's why we created OpenShift. It allows you to write your applications, whether in Python, Perl, PHP, Ruby, Node.js, or Java. You pick your poison and do whatever you need to do on that one platform. Now, you may ask: well, it is on a cloud somewhere. Is that safe? Is it good? Is it okay? Well, you know what, we thought that through as well.
Being Red Hat, and our track record is proof of this, we made sure that it is fully open source. More importantly, you can download it and install it on your local machines, so you can have your own in-house PaaS environment. Deploy it there, build on it, test it. When you're happy with it, then maybe you choose to run it on a cloud somewhere. That's your choice. But you're not restricted to putting it only on the cloud. Have it local, move it across, bring it back; whatever you do, it's all standards-compliant, which makes it easy for you to move from one system to another, along with your analysis, your data sources, and so on. As I was mentioning, the open source version of OpenShift is called OpenShift Origin. You can go to that URL, download the ISO, and run it on your local machine. Why not? Please build upon it. Use it. Break it. Tweak it. Improve it. Share it with the rest of us, so we can have a much better environment to do the kind of things we are about to be able to do. Exactly what those are, I still don't know yet. Gluster.org: as I mentioned, that's where Red Hat Storage came from. It's still a very active open source project, and a lot of interesting things are happening there as well. R: how many of you actually use R here? That's a very good audience. Excellent. R is very important. Processing: how about Processing? There's a group of people using Processing. Hadoop: we all know it works well; it does its job for the kind of roles it's expected to fill. Cassandra. Has anybody heard of Kaggle.com? A few of you. Anybody on a project there yet? Any visualization work there? That's an interesting idea as well: a site that provides opportunities for people to collaborate around the world, to create new ways to look at data, new ideas, to make them even better and refine the understanding. You're not restricted to only yourself. And I put the last bullet point there: insert your own.
I'm hoping to see more interesting opportunities and ideas coming up there. So, what about data sources? We've talked about one aspect; what about the sources of the data itself? One area I think is very interesting to look at is data that has been captured and stored by governments around the world. I'm sure you have heard of data.gov in the US. There are equivalents: data.gov.uk, .nz, .ca in Canada, .sg in Singapore. Governments that collect and store this data have now made an enormous amount of it available to you and me, to do what we want with it. That, again, is a very interesting idea: a different form of data coming from a different source. The problem with some of this government data, unfortunately, is that sometimes it is not easily readable, in the sense that they give you a PDF file that you have to go and scrape, which is really not a very clever way to share data. They want to share the data, but they take one step back: here's a PDF, now go scrape it. Well, okay, fine, thank you very much; we will do something with it. But the point I'm trying to make is that there is an enormous amount of interesting data being made available as well. So when you're looking at problem solving, there are interesting solutions that have already been built and can be improved upon, and there are sources to look at, especially sources that have immediate social impact. Doing social good is an important consideration as well. Data from corporations is another interesting aspect. Not all corporations want to share their data; there are secrets, so-called secrets, not to be shared, which is fine. But where you can, you want to be able to get to it and do something with it. What does that actually mean? Universities. Universities are also a very rich source of data. And I mentioned CERN earlier: CERN has just an enormous amount of data. It's incredible.
During the announcement of the Higgs boson, there was actually a website that pulls in all the different sources of data within CERN itself and presents them in a browser, and when you click through each of these little portals, so to speak, you can drill down and look at whatever else they are capturing. So there is an enormous opportunity to do very interesting things. And of course, build everything upon REST APIs, JSON, XML databases, and so on. XML, data sources, standards, standards, standards. Standards are what we need. Open standards are the most important thing here. Whether or not something is ultimately open source, well, that's a good thing to aim for, but being standards-based is the fundamental starting point. So, in summary: information is not a privilege; it is a right. You have the right to get the information, and there is no reason why you should not be getting it. You gain information by analyzing data, copious amounts of it, in a timely and accurate fashion. Accuracy is, again, a biased term. How accurate do you want it to be? I'm sure you have heard the joke about accountants. You want to hire an accountant; what's the one question you ask? How do you balance the books? And the answer that gets the job: you tell me how you want your books balanced. One book for this, one book for that, and you're hired. So it all depends on your point of view and your objectives, the question you're asking. Open source tools and open standards are there to drive this. To me, that is a very important component. And Red Hat has been in this space, and will continue to be in this space, to make sure that when corporations and organizations deploy it, it is accountable: there is somebody you can turn to for help. At the end of the day, we are here to support you to do what you want to do. And I think that is my last slide. Thank you very much.
If you have any questions, I'll be happy to take them, or you can send me an email or on Twitter. Thank you. Thank you, Harish.