The supercomputer Watson. Not heard of it? Last year Watson won Jeopardy! Now, Watson is not a person; it is a supercomputer created by IBM. Some of you may be aware of it. On Valentine's Day last year, Watson was pitted against human champions, and Watson won. And what is the relevance of Jeopardy? For those of you who are not aware, Jeopardy is a game show in the US, a bit like Who Wants to Be a Millionaire. The only thing is that it's a little twisted, in the sense that the host doesn't ask you a question; he gives you clues, and you have to come up with a question. The question is the answer, actually. So, how was it done? The way they created the supercomputer was by feeding this computer millions of pages of information. This is not tables; it's not structured data in the form of rows and columns, but unstructured data, text. Large volumes of books were actually digitized and sent to this computer. Just like a human being, you know, over a period of several years we read and then we become intelligent; in the same way this machine was fed. So that's the relevance of big data, and the technology behind it is again Hadoop and so on and so forth, which we'll be touching upon today.

And since the audience has increased, let me do a quick introduction about myself. I'm Mohan Kumar, the CEO of Grandvise Analytics. Prior to starting this company, I had about 20 years of experience working for companies like IBM and SAP. I know you are all, you know, geeks here; I graduated from IIT Kharagpur, but somewhere down the line my geekiness got blunted. That's what 20 years of corporate life does, perhaps. So, we are into big data, and today the topic is basically: what is big data, and why should you care?

Alright, so back to our question of why this big data has suddenly become such a big thing, right?
It's not really new. The only thing is that it existed earlier; just that only companies like Google and Yahoo were using it. But today, with the advent of social media tools, everybody is on the internet generating information. Gigabytes, terabytes and zettabytes of information are being generated, and that has led to this big data. It's like a big monster, which consists not just of the structured information in the form of tables, which has always been there, but a lot of unstructured information in the form of text, audio and video files. And businesses across all industries, every industry, public sector, government, everybody: this is relevant for all of them. They are struggling to get hold of this information, make sense out of it and gain some insights from it. Because data as such is of no use unless you really make sense of it and get what we call actionable insights; that's a very common term in business analytics. So we need to get actionable insights out of this huge volume of information. That is the reason it is now a big wave; what ERP was 15 or 20 years back, big data is becoming now. So that is big data. What is the definition of big data? Obviously there is no textbook definition as such, but if you will, there is a small thumb rule. So what is big data?
As the name suggests, it is data of large volume, typically in the range of a few terabytes, at least a terabyte of data, plus it has to have a combination of structured and unstructured information. Again, this is not cast in stone, just a thumb rule, because there are situations, like in telcos, where they probably don't need unstructured data but their structured data itself is in the range of terabytes, and that's also called big data. But by and large the definition is: a combination of structured and unstructured data of large volumes.

Now, there are typically three characteristics attributed to big data: volume, variety and velocity. Volume is self-explanatory: large volumes of data. What is variety? Variety is basically what you'd call the heterogeneity of the data. It is not just structured information, like how information was traditionally stored, but a lot of unstructured data, and within that it is not restricted to text files. We can use this for handling video; we can do video analysis, for example on video surveillance data, and even audio files and so on and so forth. And velocity is the speed at which the data is generated and has to be processed. All of this put together is what we call big data. Again, that's a loose definition. So what is the solution for this big data?
Again, there are a couple of things. I'll call this big data technology, but there is no such word; just to help everyone understand I'm calling it that, but there is nothing formally called big data technology. There are, however, some platforms and tools that have come up and are getting popular. Again, Hadoop is not exactly new. The ideas came from Google's papers, by the way, and the implementation came from Yahoo, so thanks to them for that. Doug Cutting, who developed it at Yahoo, named it after his son's toy elephant; Hadoop is the name of his son's toy elephant. That's how it started. And then, of course, it was spun off as open source, and Hadoop is right now an open-source community project. But on top of that, as was mentioned, I'll add a few more and talk about the commercial versions of Hadoop as well. So Hadoop has become very popular as a solution for this big data, but there are a few others. HPCC is one more which is coming up, on the same lines, though it is not exactly Hadoop. So, this is a platform, and it has a set of tools that help us in processing these large volumes and this variety of data.

Now, what are some of the advantages? The biggest advantage is that it is open source, so anybody can access it; that is number one. Number two, it really helps in processing the data on commodity hardware. Earlier, you had to have something like a supercomputer to process this large volume of data, but with Hadoop the main advantage is that you can actually break the data into smaller chunks, process each chunk where it sits, and then bring the results back together, all on commodity hardware. I think that is another big advantage. And there are nice tools and mechanisms available to take care of failures, fault tolerance and so on and so forth. That is one of the beauties of this framework, or platform if you will.

Very briefly again, some of you are probably aware, but just touching upon the components of Hadoop; anybody who wants to pitch in, please do so. As I said, I am not a geek and this is the technical stuff. It consists primarily of three components. One is the storage: one of the big problems with big data is storage, so HDFS, the Hadoop Distributed File System, is one of the components. It means you can build clusters, so that the data can be processed within a cluster, on a bunch of computers, and again these can be commodity hardware. So HDFS is one component. Second is the programming model, or the programming environment if you will, which is MapReduce. This again helps in processing the whole data set in a distributed manner. And the third is basically a bunch of tools to manage this whole thing, the common components as they are called; there are several things like ZooKeeper so that you can manage the whole environment. So these are the three main components.

And one of the drawbacks, as with any open-source software or open-source community, is managing it in terms of the various releases and versions and so on and so forth. So some companies have taken advantage of that. They said, okay, fine, we have the open-source core, and they have gone ahead and commercialized it, so we have commercial versions of Hadoop available. Again, this list is by no means exhaustive, just indicative. Cloudera was, I think, one of the early players to come into this space; Doug Cutting, the guy who founded Hadoop, is obviously there. And then there are others like Hortonworks. I just wanted to point out HPCC as well; there are a couple of other vendors, like HPCC, that are not really based on Hadoop, though the philosophy is the same. Just to give a little bit of background, because they are also our partner: the way HPCC has come up is that they are a spin-off of a company called LexisNexis. They were into content management, and they had a similar situation like a Yahoo or a
Google, where they had to manage unstructured information. So they developed their own proprietary technology, which they were actually selling earlier, and now they have also made it open source. That's the power of open source, I guess: we are getting a lot of stuff for free. And of course IBM. IBM has come up with their version of Hadoop called BigInsights. The advantage here, and I say this since I come from IBM, is that they have added a little more flavour to it in terms of processing streaming data and so on; their version, BigInsights, has a component which can process streaming data and so on and so forth. And then of course there is EMC Greenplum; Greenplum was a company which EMC acquired.

Any questions so far? Again, as I said, this is soon after lunch, so it's very difficult; this is just to keep you guys awake.

Audience: Is there a certain class of problems that are solved by this Hadoop environment, or can anything that we do be done by Hadoop?

Speaker: Very good question. I'll come to that; I have a slide on the use cases. Today, of course, pretty much what we say is that we can solve any kind of problem, but again, that is like pie in the sky. There are specific use cases, industry-specific ones, which I will touch upon.

Audience: Can I add something?

Speaker: Yeah, sure.

Audience: See, it's a distributed environment, so the only kinds of problems that you can solve are problems where breaking the problem down and then bringing the results back together is allowed. I mean, the operation must be commutative, so that you can bring the results back together; otherwise you can't distribute it. You can run it on one machine, but that's not really the point. The point is that you need a problem the results of which can be added back together. Like taking the average of 100,000 items: you can take the average of chunks of 10,000 and combine them into the average of the entire thing. Because the operation is commutative, you can distribute it; otherwise you can't. So there is a limit on the operation.

Speaker: Thanks. That's one. Another thing is that you probably will not be able to do a lot of real-time analytics using this. The idea here is that you take a large amount of historical data, analyze it, and then show the results.

Audience: Any specific reason why it has been limited to only historical data? Maybe someone wants to answer.

Audience: It's a technical challenge, the speed of distribution. See, what you're trying to do is take real low-end hardware and make it do something as quickly as possible, and at that scale it wasn't really designed for speed; it was designed more for scale. That's the original design goal. But a lot of people, like IBM with BigInsights, actually try to do real-time things. There's also a company called Hadapt, by a really big-time database guru who has written the core of several big databases, and they are also trying to solve the real-time problem in Hadoop. But that's really a physical problem: the data streaming, sending the data out to several machines, takes longer than the data analysis on those machines. Still, you know, it's a challenge.

Speaker: So I guess what was meant by historical here is not really old data; it's store-and-process, so you could have stored it yesterday, or maybe an hour back. This space is moving; it's a jungle out there in terms of the tools. Very true. You can go back and, you know, pick from the selection available. Okay, any other questions?

Audience: If you're talking about government, would something like this be appropriate?
Speaker: Yes, absolutely. You don't really need very high-end hardware. Earlier, the issue was that since you had to put everything in kind of one block, you needed a very powerful machine, with huge storage connected to it. Here you can actually split the whole thing into smaller pieces, and the hardware can be commodity hardware. In government, too, it is applicable; in fact, the folks at UID are already using Hadoop, so it's already being used. For them, more than anything, I think it is the volume of data; in fact, that was the question I was asking them, but I didn't get an answer.

Audience: Do you have an answer as to what hardware they're using?

Speaker: Not exactly the hardware part of it; even I don't know that, I'm sorry. But I did ask them about the platform, and they confirmed that they're using Hadoop. So that's definitely there, but they can enhance it, because a lot of the power of Hadoop is also in processing unstructured data. If you just have structured data, you're probably not getting 100% out of Hadoop. It really helps when you start getting unstructured data, and that will probably happen soon. For example, with UID there will be a lot of feedback, people telling them things through communities and so on and so forth, and I think they can actually feed that back in, because one of the use cases I'm going to touch upon, social media analytics, is a big use case of big data and Hadoop.

Audience: Am I jumping in on the discussion here?

Speaker: Yeah, sure.

Audience: I think Sanjay probably assumed that you understood his answer. It is HDFS; there is nothing magic there.

Audience: No, but in UID's context, you're saying he assumed that I knew it was HDFS when he said Hadoop? A very difficult assumption. I mean, I'm sure he didn't have anything to hide, but I didn't get the answer, and it still doesn't answer the question of what kind of hardware.

Audience: I can tell you that: commodity x86.

Audience: That may not necessarily be true, because, you know, you can use HDFS and still have a powerful computer, because the government can afford it.

Speaker: Yeah, we are throwing around acronyms like HDFS here. Alright. These are the use cases that I was referring to; I'll touch upon some of them in a little more detail, and let me know if you need more detail. I think social media analytics is the most commonly used use case within big data, because it is the easiest in terms of getting the data: it's public data which is available on the web. You define a problem, say, a study about what people are talking about, what's the buzz around a brand, what are people saying. So companies can actually use big data technology, as I was saying, to gather all this information, put it together and do a sentiment analysis, and what have you. There are others, Social Mention for example, which already do that directly.

Audience: Which one? Is it open source, Social Mention?

Speaker: It is not really open source, but it is kind of free to use; it's an application, a website, available online. So social media analytics in general is one of the use cases. Then there are other, industry-specific ones. Retail is a big area. I don't know if some of you have read the article about a girl's father coming to know about the girl being pregnant through a retailer's promotions. That's again an extreme case, but that's how the information gets used: microsegmentation of customers, sending them campaigns, promotions and so on and so forth. So again, a lot of it is done with social media data, rather, but a lot of it is also done using their point-of-sale
data and all that.

In telco, in case some of you are not aware, there is something called CDR data, which runs into millions of records; CDR is basically call detail records, I guess that's the full form. And then there is OSS data, which is again not directly related to the customer, but records where the calls have gone, how the calls have been routed and so on and so forth. Again, here the volume comes into play; it's not really unstructured data as such, but the volumes are really huge, and we need to process them faster. That's another advantage: since you're splitting the data and processing it over several computers, it's much faster than if you had not done that. In telco, for example, they have millions of records and they need to process them almost on a daily basis, either to run some campaigns or to gather some insights or whatever. So telco is another one.

Similarly, healthcare. Healthcare has two dimensions to it. One is obviously healthcare information about the people, and this is related to the government: which areas are prone to certain kinds of diseases, and so on and so forth. There is also another part which I probably did not touch on: there is a lot of sensor data that is generated from all kinds of sensors, RFID-generated data and all this stuff, and healthcare is a prime example of that. That is again very relevant in the case of big data.

Utilities, obviously. I think in the morning we were talking about utilities in the data presentation Anand was doing. So again, a great example, where large volumes of data get generated, especially with smart metering, from all these meters and so on. That's another use case. India's UID project I already mentioned. GIS we talked a lot about, right, through the course of the day; we are also doing a POC with one of our clients using GIS information. SCADA data is all available too. Because traditionally the attitude has been: I can't store more than one terabyte of data, so delete it, or back it up and keep it, and that's it. So now we slowly need to also increase the awareness among people that you can actually use that data, make sense out of it, and really, you know, probably even monetize it. We have to increase that awareness as well.

Another example, or use case: we have so many malls, and, you know, even in countries like the UK there are cameras all around the streets. Very often what happens is, again because of lack of storage and all that stuff, they just store two days of data and then back everything up, and getting it back is a huge problem. If something has happened a week back, it's a major process to get all that back and then run through the whole thing. So this is another area; there are already a few companies, I cannot name them off the top of my mind, doing video analysis and so on and so forth.

Okay, so quickly on the market size. Again, this is an upcoming area and it is very difficult to get this data, but there was one very aggressive market analysis done, and they put it at about 50 billion dollars. But the B.I.
market itself is only around 15 billion. So, again, I am only the messenger here; these are not my numbers. I couldn't get the exact figure right away, but IDC is definitely one source, which says 18 billion, and that's why I mentioned that the first one, 50 billion, is very aggressive, though it feels nice to say that we are in that market. Okay, so I think, yeah, you're right, 17 or 18 billion is more realistic. But the point is that the growth is really very high, so for all of us there's a lot of opportunity: people who are looking for a change, people who want to start a business, people who have some good ideas. Innovation in all respects. And that's one of the primary reasons why we should care about it. Okay, so with that I would like to stop and take any questions. I have a few slides on what we did, but I think they are not required now.

Audience: The relationship between big data and social network analysis; could you maybe talk a little bit about that?

Speaker: Yes. Social network analysis is in a way related to big data, in the sense that, you know, what you try to do in social network analysis is work out how influential a person is. To find that out may appear to be a very simple problem, but actually it is not; you really need to do a recursive computation. For example, one simple use case we were trying was: using the CDR data, can we find out, within the network, who is the guy who is probably most influential among the hundred people he knows, and can we get that guy to promote this network, and so on. Now, that needs a lot of processing, and it obviously means crunching a large volume of data.

Audience: Can I add something? The major market is about analysis, like after-the-fact processing. Do we use it the other way around, as a transactional database?

Speaker: No, not really transactional.

Audience: Okay, that's what I was asking. Do we have any other product coming up? Is there a market for that, where big data...

Speaker: Absolutely, absolutely, there is absolutely a market, and that's why, as I mentioned, IBM's BigInsights has a small component which can tackle that kind of somewhat streaming data, or a little bit of real-time data. So there is a big market, or a demand, for that, but right now you don't have any mature product, if you will.

Audience: That's right. Hadoop, see, is very simple from an idea perspective. It's a very simple idea: you send out data to several different machines, get the work done there, get the results back, put them together and start working with them. So Hadoop can handle problems like social data analysis, but it would still not be a very simple problem, because the algorithm to understand the text itself is a complex task; people are working on text analysis in separate form. So Hadoop alone is not the solution there. And to your point, there are several players in the market actually; you mentioned EMC Greenplum, and there is Aster Data, there is SAP HANA.

Audience: No, but the question was real-time; none of them do real-time.

Audience: They very much do real-time, because there are in-memory solutions, unlike Hadoop.

Audience: No, in-memory is different from real-time. In-memory provides performance. Performance of what? Real-time means that as the data comes, you analyze it; that is real-time. That's the point.

Speaker: Okay, we will come back to that. Yeah?

Audience: First of all, I am not a technology person, so I need an answer in a very simple language to understand.

Speaker: We are in the same boat.

Audience: Basically, that is why my interest; your title, like a big picture, something attracted me. The point is: two different shapes of data, like one language, one ministry, one set of data with one ministry and another with another ministry, two big departments, two different MISes. People like us struggle with what relationship we can have between them. So what kind of technology solution can you offer, so that data from different MISes... and how you... tell us something concrete.

Speaker: Okay, so again, we can probably take this offline as well, but this is definitely within the scope of big data and Hadoop. This is a very typical scenario; it's not very specific to you. We can get data from different sources, merge it together, process it and analyze it as one unit. If that's okay, we can take it offline.

Alright, any final questions? I think we are running out of time. Alright, thank you very much. Thank you.
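[Editor's illustration] The split-process-recombine idea described in the talk (map, shuffle, reduce, and the point from the discussion that only combinable operations, like an average computed from per-chunk partials, distribute well) can be sketched in miniature. This is a single-machine simulation for illustration only, not Hadoop's actual API; all function names here are invented.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (word, 1) pair for every word in one chunk of the input.
    return [(word, 1) for word in chunk.split()]

def shuffle(mapped):
    # Shuffle: group emitted values by key, as Hadoop does between map and reduce.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine all the values for each key into a final count.
    return {key: sum(values) for key, values in groups.items()}

# HDFS splits a big file into blocks processed on different machines;
# here two small in-memory "chunks" stand in for that.
chunks = ["big data is big", "data about data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}

# The averaging example from the discussion: an average of averages is wrong
# in general, but (sum, count) partials combine correctly, so each "node"
# returns a partial and the final average comes from the combined partials.
partials = [(sum(nums), len(nums)) for nums in ([1, 2, 3], [4, 5])]
total = sum(s for s, _ in partials)
n = sum(c for _, c in partials)
average = total / n
print(average)  # 3.0
```

The key design point, as raised in the Q&A, is that the reduce step only works because the per-chunk results can be merged; an operation without a combinable partial form cannot be distributed this way.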