Okay, we're going to take the whole panel up here, and since I have questions I'd like answered, I'll ask the first one myself. How many of you have questions for the panelists? Okay, all right, good. So can we sit here, on the chairs? All right, let's get started. First of all, thanks a lot for sharing all that information. It was really great; from each of these presentations we got a lot more than we expected, and the numbers, as well as the reasoning behind your choices, were very good. I have a bunch of questions, but I'm going to ask just one and then open it up to the audience, and if they don't have questions, I'll continue. So we won't go through a lot of questions. Are you giving us lunch? Yes. Okay, that's fine. So, one thing that came across again and again in each of your presentations is that you use a mix of technologies: MySQL; Vertica, which is a columnar database; Vectorwise — and you said 99.9, I forget how many nines after the dot, in MySQL. And several of you started with MySQL and then moved on to these other systems. So I'd like each of you to tell us a little about what that move was like. Did you do any measurements, or how did you break it down? How did it happen, and why? There's a confusing array of databases out there: graph databases, columnar databases, key-value stores, three or four different types of NoSQL databases, and of course there's always Google and Amazon. Did you look at any of those cloud-based services before making the decision to do it on your own? And add anything you think the audience should hear.

So I guess I'll answer your last question first, right?
So we are the cloud; we can't really look to somebody else to give us a cloud, right? Think of us as having been, in a sense, the wallet in the cloud for the last 10 years, even though 10 years ago the word "cloud" wasn't really used. So to us that wasn't an option, because our use case is big enough and important enough that in-house is the way to go for us. That said, when we actually started evaluating, we set up criteria. Like I said, we had targets: five million concurrent messages being streamed into the database, and 2,000 simultaneous queries. Those were our benchmarks, and that's what we tried to evaluate each of these technologies against. With Oracle, we ran into a limitation at about 15,000 writes per second — a long way from five million — even when we used the batch interface, OCI, for loading batch files in. When we compared that, numbers-wise, to Vertica, it was a difference of about two orders of magnitude: the columnar database was in fact around 100 times faster than trying to load this data into Oracle. So yes, we applied the same set of criteria to all of these solutions, and the criteria were based on our scale, our internal requirements; I shared some of those numbers earlier. Beyond that, some of it is also experience. Some solutions we ruled out because we've seen them not work before, or because of operational constraints, not just experience. In our case, even though we use MySQL for some parts of the site, it's not a technology we widely support from an operational perspective. Teradata is an appliance we looked at on the data processing side.
We ruled that out because, from experience, we know the cheapest appliance would cost us X million dollars to actually stand up. So some things get ruled out by experience; to the others, we applied the same criteria and did an evaluation across the board.

Just to add one thing to that: we always start out with open source technologies. We also distribute software — it's not just the cloud, it's on-premise software too — and somebody jokingly told me that if you open up our package, you'll find all the open source inside. So we use and evaluate open source packages a lot. And there are a variety of customer segments we sell to, from individual consumers to SMBs to large enterprises, and applying the same technology across all of them doesn't make sense from a cost standpoint: if you use an enterprise-grade technology for an SMB, the cost is just not going to work out. That's primarily why we make these technology decisions — MySQL here, a columnar database there. So cost is definitely one factor.

I just want to add one thing quickly on the cloud aspect. We didn't use the cloud back then, but thinking back, it would definitely have made a lot of sense. A lot of what I said about taking three or four days to catch up was because we had to do capacity planning, and we were a startup; we could not afford to throw infrastructure at an analytics cluster. But with the cloud, you can easily spin instances up and down, so catching up would have been as simple as spinning up 40 or 50 instances on AWS, doing the computation, shutting those instances down, and going back to your steady-state cluster of five instances.
That's a very popular model these days. AWS wasn't that popular back then, and we would have had to take on a technology risk we weren't comfortable with, so we did not go with it. But that would definitely be the way to go today if I had to do it again.

I wanted to answer the question of why we choose a particular technology. I think it's really about your use case; I keep saying this in most of my sessions. You sometimes have to step back and ask yourself: what is my use case? If you've got a log with a lot of fields, you'll typically think, hey, that's a MySQL table. You shouldn't think of it as a MySQL table. It's a log, and you can do log processing in other systems as well. If it's below a certain threshold — a couple of hundred million records, a couple of gigabytes of data — you can surely use MySQL to do the processing too. If not, if it's going to be very large data that can't be managed in a regular database or a regular cluster, then you step back and say: this is my data structure; what other solutions are there that suit my data structure and the processing I need? Don't fall for the fashion of "everybody is using NoSQL and the cloud, so let me also jump into that." Don't do that. So that's my answer.

Okay, so... I have a comment about this. I'm speaking as a bit of an anomaly here: I'm not really a geek, but I work a lot with geeks. So rather than a question, the suggestion I have is about the fact that this is a very challenging set of decisions.
It would make a lot of sense, I think, to apply the concept of crowdsourcing a book or a resource on choices of technology. There's a book called the Business Model Generation workbook, done by about 470 contributors. This might be a good project for us to undertake: how to choose different technologies based on different criteria, because there seem to be a lot of challenges here, and we could have shared learnings. It would be an excellent resource for anybody, startups included. Yeah.

Somebody had a question? There are two choices: you can walk up to the mic, or we pass it back and forth and bring it back. Okay.

I think there are two things: big data, and big data analysis. Hadoop fits into the big data analysis part of it — Hadoop is really a way of accessing HDFS and so on; it's a technology that helps you access your big data. So those are two different things. And going with that definition of big data — volume, velocity and variety — there's a clear indication of certain limitations of a relational database: you cannot support elasticity. Today you want to support 100 million and that's fine; tomorrow there's a particular hour of the day where you want to support 100 times the volume you handle on a regular basis. And variety: the schema issue comes in, the variety of the data keeps changing, along with the velocity and the volume. All of that together is a clear indication that you have big data. Moving on to the Hadoop part of it: Hadoop is not just a big data solution, it's a way of accessing your big data. You can apply MapReduce on your traditional database as well.
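To make that last point concrete, here is a minimal sketch of what "applying MapReduce on a traditional database" can mean: MapReduce is a processing model, not a storage engine, so the same map, shuffle, and reduce steps can run over rows fetched from any relational table. The rows below are invented stand-ins for a SQL result set, not data from any panelist's system.

```python
from collections import defaultdict
from functools import reduce

# Invented rows standing in for a result set from a traditional database.
rows = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 5},
    {"user": "a", "amount": 7},
]

# Map phase: emit a (key, value) pair from every input row.
mapped = [(row["user"], row["amount"]) for row in rows]

# Shuffle phase: group values by key, as the framework would do
# between the map and reduce phases.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: fold each key's values down to a single result.
totals = {key: reduce(lambda a, b: a + b, values)
          for key, values in grouped.items()}
```

Per-user totals come out as `{"a": 17, "b": 5}`; a real Hadoop job distributes exactly these three phases across machines, which is where the scaling comes from.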
So the question is: if you really mean big data, is it even possible to use a relational database in the first place?

Maybe I should take that, since he's one of my colleagues. I think you're right in one sense: the volume, velocity and rate of change of the data is certainly one aspect. The other aspect is the type of data you're actually talking about. You have to look at the data and ask what solution fits that data best — that's what I was going to say.

But if it is big data — even transactional data; say it's a completely transactional system, the volume of data is huge, the velocity keeps changing, and the variety is changing — there's an inbuilt limitation in a relational database: it only scales vertically, you cannot scale horizontally. You cannot plug in additional hardware just for a couple of hours. You said MySQL works at 99.99, but if we're actually talking about big data, there are techniques you can try with a relational database; going by the definition of big data we've discussed, though, it might not be the solution. It might be some flavor of NoSQL, or, if it's a pure analytical system, you might want to use Hive to access the data. But on the data storage side, the relational model's inability to scale horizontally is a limitation.

That's absolutely right. That's why I think how we define big data in today's world also matters.
Beyond just the volume and velocity of the data, you also have to add the sort of data you have, and then ask what solution fits that level of processing, that scale of processing. Sharding is one way to at least try to scale horizontally. Separating out transactional nodes is another way, even though it's not real sharding. So there are different ways to scale capacity, in terms of data velocity, access, or change, whichever way you want to look at it. That's why I think the real thing to look at is the use case and your actual data before you jump into a solution.

I somehow tend to think you believe big data equals "not MySQL," or "don't use MySQL." Is that what you're trying to say?

I'm talking about relational databases in general. Even if you have 10 million events, if they're transactional events, obviously it's better to use a relational database. Yes, you can use techniques like sharding or replication; you can handle 10 million events in MySQL using sharding — I'm talking about 10 million events a month, or 120 million a year. You can do that. You take a MySQL cluster and shard it: take the data you get, divide it, segment it, and say that certain data goes only to certain servers; spread it out, do the transaction processing, and collect the results later in a separate database. And you say elasticity is the problem — that you want to be able to add additional hardware out of the box and have the solution take care of it? We've done something like that. I have put a presentation on SlideShare — or maybe I haven't — called How We Scale. We had a similar situation where transactions were going up: on the fly, we initiate a pre-installed MySQL instance, put it into the data processing cluster, and the transactions start the next minute.
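The sharding scheme just described — divide the data so that certain data goes only to certain servers — can be sketched roughly as below. The shard names and the hash-modulo routing are illustrative assumptions, not any panelist's actual setup.

```python
import hashlib

# Hypothetical shard servers; a real deployment would keep
# connection settings for each MySQL instance here.
SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2", "mysql-shard-3"]

def shard_for(account_id: str) -> str:
    """Route an account to a shard deterministically, so every read
    and write for that account lands on the same server."""
    digest = hashlib.md5(account_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Because the routing is deterministic, transaction processing for one account stays on one server and results can be collected later into a separate database, as described above. Note, though, that adding a shard changes the modulo and remaps accounts; that rehashing pain is part of why elasticity is the hard case here.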
A lot of people have done that — we've done something similar ourselves. There's a very interesting presentation on highscalability.com about 10,000 databases per minute, a very unorthodox approach to database scalability and design. Have a look at that; it probably answers your question.

Just to quickly add to that: more than the type of data, it really comes down to what question you have to answer, and that dictates which technology or framework you use. On your specific question: it is possible to use MySQL itself as a NoSQL data store, which is how Facebook uses it; and FriendFeed is another example — FriendFeed has a public write-up on how they use MySQL as a NoSQL data store. Obviously MySQL is better suited to relational data, but you can also use it as a NoSQL store. That said, you run into the same kinds of limitations you face with MySQL even with a system like Hadoop. I'll give you an example. At our company we had to do a lot of analytics on how many unique users we were getting — that's an important metric in the web space. It turns out, at least back then, that if you wanted to do a distinct over a set of data in the reduce phase of Hadoop, it all had to fit in memory on one machine. If you weren't doing anything special, you would run out of memory on that machine and you would not be able to compute the distinct, even on a distributed system. So you work around it: you shard, compute uniques across different groups on different machines, and then sum up those individual partial results. You're going to run into these kinds of limits whether you use SQL, MySQL or Hadoop; the difference is in how you work around them.

I would like to add one more thing. When you actually do a big data architecture, you need to split it up. The first thing is: you store the transactions.
The second thing is: you build the values on top of them. And the third thing is: you build models on top of the values. For the basic transaction capture, MySQL does the job — like he said, 100 million rows don't matter. But when you want to run even a small aggregated query with three or four joins, that's where it gets stuck. That's why you need to use BigQuery or some other system that columnarizes the data; then, if you want to aggregate the results, it gets them to you in milliseconds or seconds. And you pass this aggregated data on to the models, and that gives you near-real-time answers.

Okay, so can we move on? Can I just add one thing? Sorry. I think we're looking at the same problem from different angles, and you're both right from your own perspectives. Both of us, sharing knowledge about the PayPal system, have a view that's colored a certain way by working on a transactional system, and that's different from how an analytics system works. In a transactional system we don't try to aggregate things; it's more about actually running the transaction, or getting the data for a particular row in that entire data set — each individual row represents a different data set. So there are differences, because of the type of data, in how a transactional system should be handled versus an analytics system. But again, even in a transactional system, like I was saying, there are ways to cluster your data sources, there are ways to shard your data sources, and there are routing-based ways to scale your data out. The one important point that I think we're mixing in with the concern of transactionality is consistency. If you require strict consistency, then you run into the problem of only being able to scale vertically. If you move to eventual consistency, then you have looser constraints, and you can actually have more solutions for how you horizontally scale.
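A toy model of that eventual-consistency trade-off: a primary that acknowledges writes immediately, an asynchronous replication queue, and a replica that can serve stale reads until the queue drains. This is a sketch for intuition only, not any production design from the panel.

```python
class EventuallyConsistentStore:
    """Two replicas with asynchronous replication between them."""

    def __init__(self):
        self.replicas = [{}, {}]  # replica 0 is the primary
        self.pending = []         # writes not yet applied to replica 1

    def write(self, key, value):
        # The primary applies the write immediately and acknowledges;
        # the replica only sees it once replication catches up.
        self.replicas[0][key] = value
        self.pending.append((key, value))

    def read(self, key, replica=1):
        # Reading from the replica may return stale (or missing) data.
        return self.replicas[replica].get(key)

    def replicate(self):
        # Drain the replication queue; the replicas converge.
        for key, value in self.pending:
            self.replicas[1][key] = value
        self.pending.clear()
```

Between `write` and `replicate` the two copies disagree; that window is exactly the looser constraint that buys horizontal scalability, whereas strict consistency would force each write to block until both copies matched.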
So I think that's really the one piece that was missing from the discussion. We'll come back to that after we finish with the current question — I didn't want to cut it off. We'll take two more questions and then bring the mic back so we can get answers.

I actually have a few technical questions for everyone, but I'm going to stick to a more business problem here. Since all of you have experienced big data to some degree: what sort of opportunities do you see for new startups to address problems in this area? And where do you think a certain application, a certain type of platform, or pretty much anything out there could help you do your jobs better, faster, or easier? That's three questions, I'd say. I have more.

I think the fundamental opportunity is that there are very few folks who understand how to process a very large amount of data. This space is exploding in the U.S. right now, and there are companies, different kinds of people, doing different things. There's Cloudera, with their own Hadoop stack, and that's pretty complicated. But I have friends who are offering Hadoop as a service: they host and manage it on the Amazon cloud, and they handle the outages and downtime as well as the SLA, which makes it pretty straightforward for the customer. So providing that as a service, whatever amount of data people have — that's a massive opportunity. Then analytics: there's a lot of opportunity in advertising and data modeling, which are becoming more and more common. So it's really huge. I would hesitate to call it big data; it's just a data opportunity, and it's very wide. Big data is when the data size becomes so large that it can't be managed in conventional commercial databases.
And if you look at the personal data phenomenon — Nike with the FuelBand, companies like Fitbit — these guys are basically trying to track everyday activity, and that's supposed to be a really huge opportunity. There's a company called Lift that has been gaining massive traction on AngelList; they basically say, we help you track your activity and complete your goals. So everything around personal data — that's a huge opportunity. Hadoop as a service — that's an opportunity I know of. Tools around Hadoop are an opportunity people are pursuing: like I said, at a convention two months ago people were offering something as simple as a better compression service for your big data. So that's an opportunity.

The obvious ones are data mining and analytics, but there's really no one big data technology; there's this exploding array of technologies. So one opportunity could be: how do you make provisioning and operating these big data technologies easier, or orchestrate between them for people? Nobody has really gotten to that level yet.

Yeah, I think before we have startups on big data, there's a lot of opportunity in simply knowing big data — having more engineers and product managers with the skills. Not even "big data" as such, but knowing and working with tools like Hadoop. There's a company that's been pursuing me for one and a half years to come and do some work with them on Hadoop — that's because they've been unable to find anybody in India. It's a US company working in India, based out of Pune. I don't have the time, but if you know Hadoop, I can talk to them for you. There are a lot of these companies: just try uploading your resume with the word Hadoop in it — it's all automated and indexed anyway — and see how many requests you get.
It's just incredible how many people think they need Hadoop, whether or not they're actually going to use it; at the very least, there's no shortage of demand for people who know Hadoop. So I'd say that before we have startups working on this, we need more people who can do this.

I also want to add something, because it's a bit different. There are a couple of ways to go about it. First, I would look for the problems these companies are solving and where you expect those problems to show up across industries. One of the tools I keep going back to is Indeed, and the reason is this: the jobs listed in the US in this space are very highly priced. Go and search for things like big data, Hadoop, or Hive, and then add a qualifier, say, greater than $100,000 a year. (I was talking about the business side — no? Okay, that's fine, it's a good question to ask.) The reason I say this is that the companies looking for these people are either product companies or companies trying to use this kind of stuff. Coming back to business opportunities — I'll pause for a while and come back to it, I have to think about it a little — one of the biggest problems I faced, when I was trying to decide, either for myself or for some of my customers, whether to go to EC2 for elastic cloud computing, for example with Elastic MapReduce as a service, was: will I use it or will I not? There's a lot of knowledge required. So even if you're not talking about a product company, there are business service opportunities. I know a company in Chennai, for example, that even before all this big data happened specialized only in high-performance computing. Their specialty is very simple: Oracle and SQL Server with terabytes of data, and they'll build you solutions for it. So if you're looking for niches,
there are lots of things you can find. I can't come up with a bunch of them right now, but it's a conversation I would love to have with you, and it would bring up lots of ideas about how people can find them. Some other questions?

My question is about decision making on which database to use. I was just reading an article about relational databases: we used to talk about the ACID properties, then suddenly it became BASE, where you can even forego consistency, and then came CAP, which is actually consistency, availability and partition tolerance. I read that Facebook basically relaxes consistency and ensures availability and partition tolerance, while Google uses Bigtable, which foregoes availability and goes for consistency and partition tolerance. And they say all three cannot coexist together. First of all, why is that, in your experience?

I don't think you can optimize for all of them. They do coexist to some extent; you just can't optimize for all three corners of that triangle at once. Why can't you? Think of it this way: the speed of light is finite. If I have to have availability, I can't have a single point of failure. If I can't have a single point of failure, I have to have two or more instances. If I have to have two or more instances that represent the same state all the time, that means data has to travel back and forth between them — which means, regardless of how fast it travels, if you have enough volume, those two will never be exactly the same at any given point in time. So do you really need strict consistency? Again, it's based on the limitations of computational power, of the network, of the way systems are built. The properties do exist together to some extent; it's just a question of which facet you choose to give up. It boils down to whether you can work on stale data and give approximate results. Correct. So, to give background to everyone who
probably doesn't know this: there's a theorem known as the CAP theorem, also known as Brewer's theorem, which says that a distributed computing system cannot have all three of consistency, availability and partition tolerance; it can only have two of them. It means that any distributed system in the world right now, as he rightly said, due to the limitations of computing and of the way things are designed, can have either consistency and availability, or availability and partition tolerance, or consistency and partition tolerance — you can't really have all of them at the same time. You can architect systems that reduce the tradeoff to a very, very low percentile, but you still have some tradeoff. I've seen this on Facebook, for example: if I've traveled to a new region and I try to search for some of my friends, they don't show up for a while, and I suddenly wonder, did they unfriend me or block me? And then a couple of hours later I find them again. That's an example of consistency going missing; Facebook has probably designed that system on the basis that it's okay to be eventually consistent. Cassandra is one such database that emphasizes this: it says, I will give very high availability, and I will give partition tolerance — that means we will always be up, we will not go down as far as we can help it, and if one of the partitions goes down, the other part will still be serving data. So there's tolerance there, but consistency might not be: the data will be eventually consistent, meaning a while later you'll see the correct data, but it won't be there all the time. Most systems are designed around these principles, and every distributed system at some point has to bear with one of these issues. We've always had such issues in our systems, and we keep trying to reduce them, but you can't
really eliminate them completely.

Is that the main decision-making factor? It's one of them, but it's really not the primary factor in what database or system you would use. The primary decision boils down to: what kind of data do you have, what is your use case, and what is the amount of money you have to spend on it? Those are the top three criteria. But every system architect who's experienced enough understands that these are also constraints you need to think about. The minute you start using, say, Amazon's DynamoDB — a distributed hash table — they will also tell you that it's eventually consistent. So some of the data — for example, if two of us are looking at the same screen at the same time, something user A did might not be immediately visible to me. That's how Cassandra operates: once you start clustering it, it will eventually make the data consistent. That's the tradeoff; the advantage is that it gives you high uptime and really fast reads.

On my question: Navjo, during your presentation you mentioned that, for building analytics in, the applications split the data — they take the transaction part of the data from one place, and the analytical part of the data from another. I just didn't understand what you were describing.

Any other questions? Okay, so let him ask the related questions first.

Hi, this is Dengal, from high-performance computing. Before asking a particular technical question, I'd like to add some points to the questions that were asked previously — why can't we look at cloud solutions and other things before going to Hadoop, even Hadoop as a service, which, as was rightly pointed out, Amazon and others provide. When you look at Hadoop, you basically
look at storage and computation, which should happen with data locality — that's a core concept of Hadoop: the computation should occur where the storage is. If you put your data in the cloud and the computation doesn't run where the data lives, a lot of it goes into input/output, and that's where the performance goes down. If you look at Hadoop solutions on the cloud, the one product you've got is EMR, and compared to an internal Hadoop cluster, performance-wise — believe me, I have spoken with the EMR guys — they said performance is lower than internal Hadoop clusters. So the cloud solution based on Hadoop is slower. That's one point. Second, most of the discussion has treated NoSQL and Hadoop as if they were separate things, but NoSQL is also a part of the Hadoop ecosystem — that's one thing I want to make sure of. Hadoop is an ecosystem; for those who are not working in Hadoop, I'm telling you, the Hadoop ecosystem contains NoSQL stores and more, and even if you want to connect to SQL, there are SQL services you can connect with. It depends upon the use case you're working on; you can pick your services accordingly — whatever works for documents, especially for unstructured or semi-structured data, and so on. That's the second part. And third: we talk with people who come in from outside Cognizant as well as from inside Cognizant about the cost parameter and the amount of data involved. You don't have to go directly to big data or directly to Hadoop — you have to think about the cost of what you're going to use. And you can scale it up for data, like Asing said, but the computation you're going to do can be very costly, in terms of performance as well as cost. Those are the points I wanted to make. Okay, now I have a
question. You said here that, I mean, incremental updates are not possible with NoSQL. With NoSQL in the Hadoop ecosystem — HBase, which is part of the Hadoop ecosystem — you can have incremental updates; that's one thing. And real-time analytics: even that contradicts what Asing said, that near-real-time analytics is possible using Hadoop, because that's what he has been doing for a while.

So firstly, my apologies — I thought those questions were related; sorry, they weren't. I'll try to answer your question first. What I said was not that the application splits the transaction, but that a business process can be designed that way. You have the transactional part of the business process, and then you have the fraud, or risk, or analytics scoring part of that transaction itself. Those are two different steps, two different tasks in the process, and we don't try to mix those two tasks in the same system; we actually do them in different systems. The risk or analytics scoring is more often the data-intensive application: it uses all the historic data for that account, the transactions for that account in the past, to come up with a decision on whether this transaction should be allowed or not. The transactional part of the system does debit and credit in the end, at the simplest level. So by splitting it up in terms of the tasks that different applications have to perform, you actually have different use cases in those two applications, rather than forcing your analytics application to work the same way as your transactional application. When we started out, we were doing it all in the same system, and even today a large part of it is done in that OLTP system. But what we're realizing is that that's not a scalable way to do things, so we're actually moving it out into more of a data grid model, where we have the analytics, the OLAP type of use cases. Does that answer your question? Now I'll try
to get to yours. I guess since that question was pointed at me, I can take it. You said a lot of things, man, so I'll start with the last ones that I remember. You said near real-time analytics is certainly possible. But like I said, it took 30 seconds to start up in our proof of concept. That's valid near real-time for some folks; it fits the bill perfectly for them; for us it didn't. So I'm not saying it's not possible. Even for incremental updates and things like that, if you look at the way Hadoop has evolved, in the recent releases of Hadoop it's certainly possible; when we did the proof of concept, it wasn't. The same thing goes for, and I don't want to name the evil empire, Google search: if you've used it recently, you've noticed that it reacts faster to you now. That's because of the evolution of incremental indexing as you're doing your search. They moved past the base MapReduce they started with for indexing to a system with incremental capabilities, and came up with a faster way to do the search. What I'm saying is, in our use case, we would like to feed the streaming data in. We don't have a large log file sitting there that we try to do analysis on; we have a stream of events that keeps coming. So you try to stream those events into a MapReduce job and get real-time analytics on it, thinking the MapReduce job is always running and I'm just giving it more and more and more input, right? But getting that first meaningful query started took 30 seconds, and when I streamed 5 million more events into it in the next second, guess what, it starts over again. So it didn't work for our use case. Does that answer your question? I'm not going to go to the first part, because I forgot it; don't blame yourself for that, blame my small mind. Anyone else want to weigh in? Is there anything else
at all that you guys want to add? No? I was also going to say I'm an expert on scaling, since we talked about technique and technology; just kidding. Just on the real-time analytics part: Landry was talking about MapReduce not being suitable when you have a lot of streams coming in. There's been a lot of work in that space; a lot of people have realized that real-time systems don't work that way. They're just two different things, two different problems, and Twitter has done some amazing work there. They have something called Twitter Storm; you should probably look at that. It is the equivalent of Hadoop in the real-time world. What it basically does is MapReduce in a different way: you have clusters sitting and waiting for events, you shard the events into different categories or whatever, and they just increment counters. Real-time analytics, in the time-series space, are more or less aggregate functions. So I think Twitter Storm is the equivalent in the real-time space, and you shouldn't really be trying to put everything into Hadoop; that's not the problem it solves. We are working on that problem too, and there are other systems, not only Hadoop. When you talk about big data, there are a lot of other big data platforms, for example HPCC; that is one more big data platform, and we are also working on analyzing how we compare to the other platforms. As was rightly pointed out, you don't have to do it exactly the same way; a lot of other vendors are available. Any other questions? Any other questions? Please, one more question. A question or a statement? Okay, we will get into that. Two years back, when I joined Onenscape, it was all hype about cloud and everything; right now it is all about big data. So what comes after two years? What do you see there? Who wants to speak?
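The counter-incrementing pattern described above, shard incoming events to workers that keep running aggregates instead of re-running a batch job, can be sketched in a few lines. This is an assumed, in-process toy, not Twitter Storm's actual API; in Storm the shards would be bolts fed by a stream grouping rather than a hash over a local list.

```python
# Toy sketch of sharded streaming counters: events are routed by key to a
# fixed set of "worker" counters, which are incremented as events arrive,
# so aggregates are always up to date without restarting any batch job.
from collections import Counter

NUM_SHARDS = 4
shards = [Counter() for _ in range(NUM_SHARDS)]  # one counter per worker


def route(event_key: str) -> Counter:
    """Shard events into categories so each worker sees a stable subset."""
    return shards[hash(event_key) % NUM_SHARDS]


def ingest(event_key: str) -> None:
    route(event_key)[event_key] += 1  # increment, never recompute from scratch


def total(event_key: str) -> int:
    return route(event_key)[event_key]


for e in ["click", "view", "click", "click"]:
    ingest(e)
print(total("click"))  # → 3
```

Because each event only touches one counter, ingest cost stays constant no matter how much history has been streamed in, which is exactly the property the 30-second MapReduce restart lacked.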
I'll take that question. I don't think it is just hype; maybe we are not actually using some of the apps, so we don't tend to see what is really happening. I don't know how many of you are actually using the Amazon cloud, but if you do use it, you realize it is pretty powerful: you can spin up instances and additional computing power in a way you simply couldn't earlier. The same goes for big data. If you actually land up with massive amounts of data that are increasing by the hour to insane levels, you really need good technologies. "Big data" is probably just an analysts' term, but to us it really means very, very large data sets that can't be processed in the usual ways. So the whole big data "hype", as you may call it, is basically a movement to try to figure that out; it is not pure hype, and I don't know what comes after two years. There is one very interesting article I read two days back, about the recent discovery of the Higgs boson; the title was "big data meets big data". It took them like 50 years of research; they've been doing this analysis for a long time, but there is definitely value in trying to speed up the research and the correlation of results. In fact, one of the data points they gave was that they generate huge volumes of data, scattered over something like 150 data centers and 150,000 processors. So there is definitely a real element to it, and also hype. There will be opportunities that are real, but those will be determined by real pain points and real use cases, rather than by latching on to the technology first and then finding the use cases. So instead of thinking framework and technology first, think use cases first; that's my point of view. I think one thing that is really starting to show up a lot is big data married with big compute. If you saw the recent launch of Google Compute Engine, they demoed how they were able to spin up 600,000 cores and
then do cancer research in seconds, like what used to take weeks and weeks, because you're working over a lot of data. I think you're going to see a lot more of that: companies coming up that can speed up cancer research, genomic research, particle-physics data analysis, those types of things. But there are also, at least outside the hype, people who know they have a big data problem, and then there are people who think they have a big data problem but actually don't, and they end up choosing the wrong tools. So in two years I think you'll also see a lot of clarity around when to go in for it and what's the right tool for which job, and you'll have a lot of documentation and case studies and so on, which is a lot more than there is today, let alone in 2010 when we started out. One metric to separate hype from reality: watch the applications. I don't even pay attention to what the analysts predict, because two or three years later the prediction can be something entirely different. But if you watch applications being built and being used, successful applications being talked about, you can separate out the hype portion from the reality. Any other questions? Yes. Just to add one thing to that: look at the past and think about why big data actually came up as a problem. We've had the internet for many years, not counting when the defense department actually controlled it, but as users using the internet, how many years now? 15 years? And how long ago did big data actually start becoming visible? Two years? Three years? Something like that. So why did it suddenly start becoming visible?
Because you have a big influx of user-facing applications, and of things that people wanted to do with the data being collected from those users. So, to your question, Amazon's cloud and big data in my mind are not directly related. If you want a crystal ball to look into the future, look at what kinds of applications are going to come up. I recently attended a talk by a well-known physicist, sorry, I forget his name, but he's a physicist and a futurist, and he thinks the world is going towards ubiquitous computing capability and ubiquitous connectivity. Not just waving my iPhone around: going up to the wall and saying I want to order coffee; looking through my window at a car standing outside and saying, I like that car, I want to see all its specs right there on my window, and then ordering that car and buying it and having it sent to me, standing at my window, without picking up a single device. Everything around you becomes a device. Now, that may be 50 years down the road; I don't know if it's even going to be reality. But I guess what you have to look at is what kinds of applications are coming up and what problems are just beginning to raise their heads from those applications. If you look at India, with 400, 500 million mobile devices, very few of them smartphones, even now you run into bandwidth limitations; maybe that's something completely unrelated to data, for all I know. The only way to foresee, or guess, is to look at the kinds of applications that are coming out and the problems that come up with them; that might tell you what's going to happen in the future as a big trend. Okay, one other thing, an interesting story actually: the other reason why big data has become so popular is that it's
basically within reach, not only from a technology point of view but also from a cost point of view: you can actually run all these computation algorithms on cheap hardware, and that's one of the things that makes it possible. The story is that 50 years back my dad, who was an engineer as well, used to run some engineering problems, actually complex design problems. They used to do it on punch cards, and he used to travel from Delhi to Kanpur, where IIT Kanpur had this big machine he could run them on. Think of it today: you really don't have to go through so much difficulty to find a system to run these problems on; the technology and the means are well within your access, and that's what makes it possible. I just wanted to give some applications quickly, three things. There's a company called IPM Web that I know of: they take our data model every hour, run statistical modeling on it, and come back and tell us, hey, this has a higher probability of a click, and they charge a cool $4000 for that simple service. That's a sample big data opportunity. I was reading an article about police putting all the crime patterns into databases, doing analysis, and coming up with hotspots where crime would likely occur next; that is a great opportunity. And big data is big in advertising; advertising, I would say, is one of the big users of big data, if not the biggest. I think Facebook is using a lot of data; your own data is the open graph. They are crunching that data, and pretty soon Facebook will launch an ad network, on mobile and online, which would basically advertise to you, based on what they know about you, on the third-party sites you use. So if you Like Nike today, you will probably get a Nike shoe ad on several sites that you use; that requires big data. So there are a lot of interesting
opportunities; you have to go out there, see what's available, get into it, and take advantage of it. Thank you. Last question? Should we take a last question? No? Okay, I won't ask mine then. I think we got a lot of good information. First of all, let us thank the panelists, and the team for actually making it happen here. Now, not for myself, but how many of you know that we have a tech community in Chennai called Chennai Geeks? On Facebook it's called Chennai Geeks; just go and search for it and you can join the group. We meet once a month; it varies, sometimes the 3rd week, sometimes the 4th week, and we talk about topics like this, and we are open to a variety of topics. It would be really great to have participation from all of you, because a lot of this learning takes place when you start listening to the real stories of what is happening and how things are being used, and to whether new technologies even make sense before we go after them, and so on. I think it's time we had a big tech community here. Just to give you one metric: before we started Chennai Geeks, I was living in the Valley for some time, and they have something called the Software Development Forum. At the time I left, somewhere around 2007, it had 17 special interest groups, and each group would meet at least one day a month, so you could go out in the Bay Area on almost any day of the week and land up in some technology meetup. You know, Americans don't do anything technical on Friday evenings, so it's Monday through Thursday, and sometimes there will even be overlapping groups: there's mobile, there's capability modeling, that kind of stuff. I'm hoping that we can at least come up with a whole bunch of special interest groups as part of Chennai Geeks. So please come and join. And Vijay, really, thanks for making this happen here; we keep coming back to the Start-up Center all the time. So
in case nobody noticed: he's Vijay, and he's the guy who runs the Start-up Center.