Good morning, thanks for coming. We just want to understand a little bit about the audience, okay? Before we start the event, we try to focus on what you really want from this, so let us do a really quick show of hands. How many people have already worked with big data in some way or another? Okay, awesome. Good, really good strength. How many of you have heard of it or kind of know it, like me, maybe read about it, but have not actually worked with big data? Okay. So why are we doing this session again? I don't understand. Maybe we should start with the panel right now.
So what we thought is, just to bring everybody onto a common page, I have a bunch of slides, like 42 of them. Which means I will just flash through them very, very quickly, and you can ask questions from the audience. We just want to get to a common vocabulary.
When you look at an industry, there are several ways of looking at it. What interests me, if you take a certain industry, is first of all: is it hype or is it reality? Second: what are the opportunities? We have an ecosystem of startups here, and as a kind of perennial startup myself (even though we've been around 10 years already, I still consider myself a startup), I am always interested in finding out what the new opportunities are. How is this going to change things? What is going to happen? Who will play in this? All those types of things. So this overview is an attempt, at least, to run through that.
So let me put it up on the screen. Actually, if something odd shows up, like a number, just ignore it, because I just acquired this this morning. I was at a different event yesterday, I gave away my pen drive and, you know, that's what happened.
So, I think the first ever exposure I had to big data was when I saw a talk on InfoQ. A Facebook engineer was talking about how they handle petabytes of data, how every day they get petabytes of data to process.
That actually terrified me. I just looked at it and said, wow, that's a lot of information. And it was a pretty good discussion. These guys were talking about what the challenges are, how much they have to do in real time, and all that sort of stuff. And then obviously no talk about big data is complete without some mention of Hadoop, which seems to be at the core of most big data processing. So that's one of those things.
And then what do you do with all this data? You want to analyze it in some way and then visualize it in some way. So all big data discussions also pull along topics like analytics. I'll skip through the rest of it, except that there is this interesting debate around the term data science, or data scientist. People either love it or hate it. A lot of people say that's all bullshit, there is no such thing as a data scientist. Others say data science is very much a thing that we need to pay attention to. Why can't we teach data science like computer science? Because lots of decisions are going to be made with data.
So there's a whole evolution of things. When you go from very structured relational database work to semi-structured data and then completely unstructured document data, you need new technologies to handle it: key-value stores, NoSQL databases, in-memory databases. They're all part of the picture.
So we'll start with: What is it? Where does it come from? How do we process it? What do we do with the results? Who are the players? And what are the opportunities? None of this will be complete, because things are changing. The interesting thing is, go Google "big data": the result itself is big data, right? You get large amounts of information, and then how are you going to figure out what makes sense and what doesn't? So let's look at what it is.
Like the term cloud, it's a bit of a loose label. When cloud started, everything could be cloud, right? If you look at the papers, there's a private cloud, a public cloud, a hybrid cloud, a data cloud, and you can attach cloud as a prefix or suffix to anything you want: cloud computing and that kind of stuff. Big data suffers from a similar thing. So I said, somebody must be trying to make sense out of all these confusing terms; let's see what they say. Of course, you always go to the big guys who do research reports, sell them to a lot of companies and make a lot of money. And Gartner says big data is defined like this, okay? Increasing volume of data. Increasing velocity, the speed at which the data arrives: a burst of tweets, for example, when some disaster happens. And a wide variety of data types. This seemed to be consistent, and I was pretty happy: okay, finally we've nailed it down to three Vs. Volume, velocity, and variety. Sounds really good. It's like hotels, you know, five stars, so many courses, that kind of stuff. And you say, hey, this is cool. And then suddenly, recently, as I continued reading, I found another article that says it is six Vs. They added something else, like viscosity or something like that. For now, let me stick to these three Vs, which we have already talked about.
So, where does it come from? It depends, right? It depends on what you're talking about. If you are an enterprise, it comes out of every click a customer makes on your website. You want to find out what they're looking at, why they're looking at certain things, why they stay on certain pages. This is all web analytics anyway. But you can get lots and lots of it; basically, a lot of data to gather. It can come from financial transactions.
And, you know, we have someone from PayPal who's going to tell us: how do you know, when a transaction happens, that it's not a fraudulent transaction? They can't put it in a batch, wait, and come back and tell you later. The same goes for all the credit card companies. They've been doing a lot of this, I think, from early on. But how do they do it now? How do they manage this? What happens when micropayments take off? The amount of data that you're going to gather is just phenomenal.
Okay, some of this is transaction data. But of course all of us, as users of Facebook and Twitter, generate a lot of it too. In fact, there's one guy who keeps tweeting all the time with the hashtag big data, saying, every day I'm generating big data. What he means is he's tweeting. I said, okay boss, you said it once. Now why do you want to say it again? You can go and search for it; you can find like 10 of those tweets. And I would like to get some money just for you. And he says, okay, we are all generating it. So these are basically the drivers from which it starts. These are the sources.
Okay, before I jump into this, please don't think that all these are my original slides. I copied them. Every slide has been copied from either a book or an article. There may be a few slides here and there that are my own, which are more likely to be questions. I have a list of all these resources, and I have a link to that list also, so I'll give you that. It's a bundle on Bitly.
So we talked about chatter from social networks. Web server logs: that is all data. Traffic flow sensors. See what happens when we have this Internet of Things. The Internet of Things is where, with IPv6, every little device can be addressable. As if our social chatter is not enough, they're all going to start chatting too. So maybe a small sensor in some garden will say, hey, my plant needs water, that kind of stuff.
So we're going to get larger and larger amounts of information, and that is going to keep coming. Then satellite imagery; every little bit of information that comes from there. So we'll go through this: scans of government documents, GPS trails from Foursquare and others, telemetry from automobiles, financial market data, and just bits and pieces of it. Think about it. Look at each one of them. No two are exactly the same. Everything is different. And not all of them are structured, right? So there are some pictures here. I don't even know how to pronounce that. Looking. I'm just going to put it upside down.
So, look at these numbers. Now, 25, 20 million points. What does that mean, right? In the last two years, and this is the energy. So when you start thinking about whether this is hype or whether there is something real going on, look at some of these numbers. Not that we're going to take these numbers as exact: from the time the numbers were printed to now, they might have changed. They're just one milepost. And this is where it was generated. You may ask, hey, I'm a startup, why should I care? We'll come to that at some point. But look at it: blogs. These are all millions of posts a day, or hundreds of thousands of posts. I don't expect you to scan through all this. We'll not cover that. But I'll be happy to hand over all these slides along with the links to where they came from.
So how do we process this? This is taken from O'Reilly. One of the nice things is that O'Reilly, a company in the U.S. that publishes books, also runs some interesting conferences. One of the big conferences is called Strata. They talk about data, and O'Reilly focuses a lot on government data, raw data from sensors and stuff like that. They have three or four really good books on data. At some point you can Google that to find a list.
But essentially, this is the processing pipeline. There will be variations of this; I'll show you some others. The essential thing is: one is collecting data. Then there is a big process of cleaning data, because not all the data that you get is clean. Let me give you an example. A couple of years ago, at a NASSCOM product event, Guy Kawasaki came and gave the opening keynote. He became an instant hit. Everybody loved him. Then he did some Twitter event and even more things. Then he went and ate masala dosa somewhere and tweeted, I ate masala dosa. Guess how many retweets? I was on the train back from Bangalore to Chennai, going through the Twitter stream, and a lot of people coming back from the event were on that train too. There were 200 to 300 retweets of Guy Kawasaki eating masala dosa. Everybody was thrilled about it. There was a similar one recently from one of the data conferences. I'll have to get that tweet properly, but the essence is: if you teach somebody how to do some job on Hadoop, they will stay and work; but if you train them deeply in Hadoop, they leave the company for a better job. That was the one tweet from one of the data conferences, and again everybody took pleasure in retweeting it.
The reason I'm saying this is the reason we want to clean things. Take the data: there's a basic piece of data, right? It is maybe interesting. Then there are other pieces that are basically duplicated, or are not directly the data itself. The number of times it was retweeted is a useful count; it's kind of metadata. The popularity, and the way the information propagates, are all different kinds of data. So this data has to be cleaned, because we want to keep the base data only once. And there are other kinds of cleaning too, in some of the projects that, for example, Chandu and I both run.
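The cleaning step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`id`, `user`, `text`, `retweet_of`) is made up for the example and is not Twitter's actual schema. Each base tweet is kept once, and the retweets are folded into a count that travels with it as metadata.

```python
from collections import Counter

# Hypothetical raw stream; the field names are illustrative assumptions.
raw_stream = [
    {"id": 1, "user": "guy", "text": "I ate masala dosa", "retweet_of": None},
    {"id": 2, "user": "a", "text": "RT I ate masala dosa", "retweet_of": 1},
    {"id": 3, "user": "b", "text": "RT I ate masala dosa", "retweet_of": 1},
    {"id": 4, "user": "c", "text": "Hadoop meetup today", "retweet_of": None},
]


def clean(stream):
    """Keep each base tweet once and record its retweet count as metadata."""
    # Count how many times each original tweet id was retweeted.
    retweet_counts = Counter(
        t["retweet_of"] for t in stream if t["retweet_of"] is not None
    )
    # Base tweets are the ones that are not retweets of something else.
    base = [t for t in stream if t["retweet_of"] is None]
    return [
        {
            "id": t["id"],
            "user": t["user"],
            "text": t["text"],
            "retweets": retweet_counts[t["id"]],
        }
        for t in base
    ]


cleaned = clean(raw_stream)
for row in cleaned:
    print(row)
```

The design choice here is simply to separate the base data from what is derived from it: the retweets disappear as rows but survive as a count, which is the "kind of metadata" mentioned above.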
People are talking about billions of transactions from large companies, where the company name may be spelled in five different ways. You want to take it and standardize it, and stuff like that. So cleaning is one part. Then you process it. Depending on your needs, whether it's real time or whether you can do it in batch, and what kind of information you want to derive from it, you can do a lot of things, okay? I don't know anything about how Amazon internally processes information, so later I may get some help on that. But let us theoretically say that we have a huge retail store like Amazon where orders are flowing in. As the orders flow through the pipeline, you want to tap that information for a variety of reasons. You want to tap it for actually shipping the orders, but you also want to tap it to ask: why are these people buying these books? Can I add it to the profile? Does it make a difference to the person's profile? Can I update the profile? And as soon as they place the order, or even when they are just viewing the page, Amazon shows three other things, right? Amazon shows: people who bought this book, or this cell phone, or this device, also bought these kinds of things. That data is coming in, and now we have to pull that out. So just think about a simple information stream that you can tap into and filter, get large amounts of information out of, and then use to update a whole bunch of things. Some in real time, some not in real time. So processing pipelines are one thing that we'll talk about a little bit more.
So here is a bunch of keywords. If you are in this space, you probably know them; this is just for the sake of establishing them and putting them out there. Hadoop is a distributed processing framework.
It seems to be the most popular tool that we have today, and there are a couple of others coming up that may be interesting to look at. You can't talk about Hadoop without MapReduce, the approach that Google used to take. It's basically a very simple way to take the work, partition it, distribute it, process it, merge it back, and provide the results. Google did it for Google Search. They talked about it and wrote a white paper, and then a bunch of guys from Yahoo started on it, began publishing information about it, and created an open source project with Apache. And then of course there's a bunch of companies making money out of it. Cloudera is one example. They have commercial implementations and they also provide support and services for that. So there are a series of these, and I don't think I need to get into this.
So once you have big data, you need analysis platforms. You need a higher-level language for analysis: conventional programming languages versus higher-level data handling or data languages. Machine learning is something that is going to be very popular. You can use a lot of data as training data; from there you can observe patterns, run machine learning on it, and build a variety of possibilities. Each one of these topics by itself can fill a couple of sessions at conferences. There are companies that I know, even here in Chennai, using some of these tools. In fact, there's a company called Rage Factory that is doing something with power. So Hive is again related: data warehouse software on top of Hadoop. So you can see this ecosystem. When you see Hadoop, you see all these other things that go in and around Hadoop, and then management applications. Some of these we will be able to cover during the panel.
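The partition, distribute, process, merge idea behind MapReduce can be sketched as a toy word count in plain Python. This is a single-process illustration only, not Hadoop's API: the function names (`partition`, `map_phase`, `shuffle`, `reduce_phase`) and the sample lines are assumptions made up for the example, and the "workers" are just list slices.

```python
from collections import defaultdict
from itertools import chain


def partition(items, n):
    """Split the input into n chunks, one per (simulated) worker."""
    return [items[i::n] for i in range(n)]


def map_phase(chunk):
    """Each worker emits a (word, 1) pair for every word in its chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key so each key is reduced in one place."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Merge the grouped values back into one result per key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, workers=2):
    chunks = partition(lines, workers)
    mapped = chain.from_iterable(map_phase(c) for c in chunks)
    return reduce_phase(shuffle(mapped))


lines = ["big data is big", "hadoop processes big data"]
counts = word_count(lines)
print(counts)
```

In a real cluster the chunks would live on different machines and the shuffle would move data over the network; the sketch only shows the shape of the computation.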
So think about unstructured databases. One of the interesting things here is that I keep going back to Twitter, mostly because we are all familiar with it and it's a common example. In Twitter, you have tweets, which are the basic units of information. Then you have connections: this person is connected to this person. Then you have retweet patterns: somebody propagates. When somebody retweets another person, it means not only that they are following this person but also that they like this particular topic. So indirectly you can say that if I am retweeting a lot about startups from Vijay Anand's tweets, it means that I am also interested in startups, for example. So there are levels of metadata that you can glean from the basic data. You can bring out all those kinds of things and then look at them. The social graph, like the one that Facebook has, and the ones you can construct, are all interrelated graph databases. It's a slightly different way of processing data and working with it.
So we are talking about really big data. How do you get the data into and out of this system? You need a set of tools for doing that too. You are talking about petabytes of data, or large amounts of data. In the case of Facebook, users generate the data and they gather it. But if you are an enterprise and you want to take all your existing data and bring it in and create some kind of internal data cloud, you need tools to do that too. Okay? With this I think we will stop.
Then there is log data. In fact, some of the very interesting companies in the big data space are log analysis companies. Splunk is one, which went public and raised some 100 million dollars or something like that. You can see their ads everywhere on the web; everywhere I look I see an ad from Splunk, and it is a company that does a lot with logs.
So big data in Hadoop is more like batch-oriented processing. Kind of semi-batch-oriented processing. In the way you feed in the data and get the results back, there is a kind of batch nature to it. So there are a bunch of other projects coming up that are closer to real time. There is one from Yahoo, I don't know how popular it is, and the next one is from Twitter: Twitter Storm. They said they would release it in July. I don't know. Is there anybody in the audience using Twitter Storm or playing around with it? Okay. So it works. So basically there is a technology available there. That is as tech as I can get. Fortunately we have a panel with all the practitioners, so in the next session we can go a little deeper.
But let us look at what is happening in this space. This is actually the answer to the question: is this hype? Let me step back a little, to somewhere around 1998. I came across this thing called XML, got very excited, and asked: is this hype or is it reality? How do you verify whether something is hype or reality? The first thing is, you don't necessarily listen to the analysts, right? The analysts have their own reasons for saying something is going to take off. Instead, you watch all the big established companies to see whether they are doing anything with it. My simplest criterion at the time was: is Oracle doing something with it, is Microsoft doing something with it, is Google, is IBM, all the big companies? What are these companies contributing? In what directions are they pulling the standards? So that's one thing we look at.
So this is the funding stuff. Cloudera is one company; in fact, one of the committers of Hadoop, I think, eventually joined them. I know Cloudera because someone I know works for them in the U.S.;
he used to be in California, but is now in New York. They also have people working for Cloudera out of Chennai; we were trying to get that guy here, but you know how it goes. Essentially, what is happening is, if you know Hadoop (and Krishna, I'll ask him to talk about it in the panel), if you put Hadoop in your resume and upload it to Monster, you'll find some very interesting patterns. There's a large demand for people. I believe in Bangalore, last year or something like that, they had a conference on Hadoop. They expected a little over 300 people; in Chennai, 600 people showed up. They could accommodate 400, and the rest they couldn't fit in. So there's a large amount of demand. Cloudera is basically a Hadoop company, but they're also building a lot of things around Hadoop. MapR is a competitor to Cloudera. They've raised over 200 million dollars. One of them is a database company that raised some money; it's built on an Apache project, another NoSQL database. And Splunk: I was wrong about 100 million, it's 230 million. This is one of the earliest data companies to go public and get this.
So I was saying, I wanted to make my work very simple, so I went and typed in a lot of questions: what is the marketplace, what is the landscape? We can't really go through all of these details, but if you really look at it, there are technologies. I have some questions, and maybe I don't fully agree with this slide or the next one that I'm going to show you, but you get a sense of the space: Apache Hadoop, MapR. Some of them are products, some are technologies, some are concepts. And then this is the full stack as it stands. I think this came from Forbes. Then somebody looked at it and said, no, this is not complete, so let's get to that one. This is one of those what are called audience-killer slides. Nobody can see anything in there, so don't worry about it; I'll give you a copy of this. There are some areas where there are a lot of questions: should it belong here, should it belong there? What I'm going to show you is, let's just look at the headings: infrastructure, analytics, applications, cross infrastructure/analytics, data sources. These are all the large groups. Under each one of these you can put a lot of these companies. When I was doing a list of top cloud companies, I got 250 companies building in the cloud space. Then I can pick out infrastructure, some tools for building applications. So there's a lot that goes on. When you're looking for opportunities, you want to actually look at some of these guys, pick some area, go deeper, and figure out what they are building. There will be a time check; somebody will stop me if we exceed the time.
Whenever you look at technologies, you also want to see who is using them, what sectors of the industry are using them. Healthcare is one that seems to be a really, really good fit; obviously they collect tons of data. I've not mentioned it here, but the intelligence community is there; assume that they are even looking at everybody's recommendations. Take manufacturing, because they
need to gather all that information. Personal location data, we already know about. And finance, which we'll be talking about in the panel.
Then there is Klout. How many of you know what Klout is? Okay, good, so you're not being accepted. So Klout is an application; it's kind of an interesting application. There are two: one is called PeerIndex, another is called Klout, spelled with a K. You go and give it your Twitter handle and your Facebook handle, and it gathers all your activity. So you're giving it permission to take your data. It gathers all of it and then somehow computes your influence. Don't take it too seriously, because if you stop tweeting for four days, your Klout score drops and you'll be in depression. But it also does some very interesting things. It mines all the topics from your tweets and Facebook posts and says, hey, you are an authority on this topic. Guess what: something I tweeted about only once became one of my areas of speciality. So I don't take that seriously anymore. But it is kind of interesting that the top three or four are fine, because I talk a lot about startups, about technology, about innovation, and some social media stuff, so those came up.
So what they have to do is take all this data: all the tweets, all the data coming from Facebook posts, and probably eliminate the duplicates, because I have linked Facebook, so anything I do on Twitter appears on Facebook. So they have probably cleaned these up, you know, analyzed it. It's a great analytics application. If you go to Klout and type in the name of any person you know, some Twitter handle or something like that, you see the influence, the reach a person has. For example, if a person has a reach of 5K, that means when they post a tweet it can potentially reach 5K people, whether they read it or not. So Klout is one of the users, and this is the level of data they process, to give you a sense of it: social interactions processed every day. They not only do Facebook and Twitter, they do a whole
bunch of things. So I just went to their site and I found this nice picture, so I thought, let's put it up here. It's a logical architecture of how they use Hadoop, and I won't even attempt to talk about it; I'm going to leave it to you. I got it, I got it.
So, market and market segments. There is a slide, but since we have only 5 minutes I'm not going to talk much about this. Forecasts: if you believe in forecasts, it's good to take a look at them just to get a sense of what they are. There is one forecast that says by 2017 it is $53.4 billion, and there is another forecast that goes to $86.4 billion, okay? These are from various companies. And what is the current contribution? This is kind of an interesting slide. I don't care about hardware, so: software and services. This is the current picture. So maybe a great entry point into big data is through services: become an expert in some specific area which is popular. And the way you find that out is data.
This is simply taken from a piece by Edd Dumbill on big data predictions. You will see more expressive tools; we will actually talk about some of these things in the panel. There is something called Infochimps. If you do any kind of search on Twitter data, you will not get more than 3 days of data. If you want more, you write your own programs to collect it, which is what I do. There is a little Python program I found on GitHub that backs up my tweets; if you want, I will tell you where it is. It is bfdu.py. I just go and say, I want Vijay Anand's tweets, anything that he ever tweeted, I want it in the database. I just type Vijay, and it takes the tweets and sticks them in the database. Then you can run it in batch mode every 3 days, and you can build your own little repository. But with Infochimps you can go and ask, because they source it from Twitter. You can actually go and say, I want Twitter data on big data, or on any political information. They have an API, and you can
actually build applications using that. So you will see these. Infochimps was listed, and I think Microsoft is having a marketplace too. I am not going to even go into that.
So this is another slide, on the skills gap: what are the areas in which there is a shortage of skills? Obviously many of these are analysis skills, but I am very interested in this data hacking thing. So I am glad that you guys have a hackathon. Who is going to hack my account? Why would anyone hack my account? Yes. So, more concentration; don't get distracted with that.
So I always believe in actually going and looking at jobs as a leading indicator for any technology. Indeed.com is one of my favorites, because they aggregate job data from about 3,000 different job boards and companies and stuff like that. You can go and do these searches, called relative searches and absolute searches. I would type big data, and I want it in the title: any job that is titled big data. I want to see how it is going. I am just interested in the trend. It is interesting to keep watching the trends to see where they go; to me that is a leading indicator. I think there are about 4,000 jobs. There are other ways to analyze the data too: if you can create an RSS feed out of it and start tracking it, you can see how long a certain job stays without being filled. That will give you an idea of how scarce a particular talent is. So that is what I wanted to give you.
Okay, I promised you something. While I was doing this research and copying and pasting all these slides, I also went and saved all the URLs, because I said it is good to have one place where I acknowledge all these sources. There is a bit.ly bundle. bit.ly is a URL shortener; I use it everywhere for my tweets and all that. So there is a bundle on big data: data resources, big data, data science, analysis, visualization. I will get some traffic from it, but that's not why I'm doing it. That is the end of my
talk. Any questions? A few questions. The technical ones I may have to ask others to answer, but let's just take some questions. Yeah?
My name is Karthik. We talk about the three Vs, velocity and more. Assume the variety comes in where there is no structure to the data. But assume there is a structure, say a transaction like a payment. I think there is some structure to all the data in this case. Is it mandatory to use a NoSQL database, or can we still work with a relational database?
I think there is a place for relational databases. What is happening is that the mere fact that these are all mixed up means you can't really say, I am going to go only with this kind of store. You have some structure when you see a payment transaction. You are assuming it will be unstructured, though I doubt it; I think it is probably semi-structured, or actually structured. But for validating the transaction there are a bunch of lookups, and that data will be structured. I have seen ODBC and JDBC connectors to all these NoSQL stores. Which one you choose depends on the actual volume of data and where it is projected to go. There are some advantages and disadvantages. The way I look at it, with my little knowledge, is that NoSQL is a very fast way of writing, whereas with relational databases, because of all the indexes, there is a big overhead in updating all those B+ trees, and that takes time.
So we had to split off the transactional part of that semi-structured payment request, things like that, from the actual analytics part. Anything that is fraud or escalated gets split off from the real-time transaction itself, so they work on two different data sets. On the analytics part, we are actually working on moving the data sources from a traditional RDBMS to something more useful for performance, evaluating things like data grids, and in some cases converting the data sources. But yes, the way to do it is to actually split between the actual OLTP type of
use cases and the analytics application, so you don't have to mix them up. Right.
I am Chandu Nair. The question I have is: a lot of the data and the maps, particularly in India, particularly among the tech-oriented community, are very largely US-centered. But there are a lot of things happening outside the US; in Europe, the trends seem to be kind of different. So do you have any indications of what the differences are across geographies?
So there are a few slides I can pick out. When I started looking at it, the little I have seen is mostly from government data. There are huge initiatives in Europe, in Australia for example, and then of course in the US; in California, healthcare data. There seem to be a lot of large initiatives, because they require a lot of infrastructure. They are all coming up in the open data movement. There is a related movement called linked open data, and there is a whole class of interesting technologies; they use a lot of these big data tools to process it. There is a lot of open data. In fact, the US is mandated to publish a lot of information. I think that is true of India too, though I have no idea whether they have an API or how much of it is available. In Europe there are a lot of things. The UK is leading the way; they are also building a semantic layer on top of the data. If you look at the Billion Triple Challenge and all those kinds of things, a lot of data is coming from the UK. In Europe, I think there are a whole bunch of initiatives, again centered around developing data