So, thanks, everybody. I know that we are the seven people between you and your dinner, so we will try to make sure we use the time in the best possible way. Thanks to the panel, first of all, for agreeing to be up here at the end of the day, after what was, for some of the speakers, a very grilling day.

The idea of this panel is twofold. One, we do not want to get too technical unless there are people who want to get down to brass tacks. But we do want people to leave knowing when and how to use big data techniques, what the tools are, and what pitfalls, at the end of the day, we should be aware of.

Having said that, I will go very briefly over the panel members. We have Kalpana, founder of Metaome, who has kindly agreed to be here in spite of her injury, so thanks again. She comes from the biosciences side and has a chemistry background, so we will get a completely different view from her. We have Rohit Chhattar from Yahoo, who is part of the team building the star model on the grid, which is growing to pretty much a petabyte very soon. We have Prithvijit, who has a distinguished background in analytics from his work at HP and at Genpact; we are hoping he will provide a business angle to a lot of the thinking we discussed today. We also have Anand; I am sure many of you attended Anand's talk today. He brings a different perspective on a lot of things, has a background in analytics as well, and you can look up most of his work on his blog, s-anand.net. Then we have Navjot Sidhu from PayPal, who is responsible for the big data platform decisions at PayPal and has experience in various other things too; we are hoping for some of what one would call straight talk from him. As he asked in his session: is there one answer to everything? Then we have Joydeep, Joydeep Sen Sarma, who presently heads Qubole India. He is ex-Facebook and the creator of Hive, so we are hoping he can speak to the size of data people have to handle when they say large data or big data, and also point out the challenges in handling that data.

Having said that, we have a set of four questions we are going to ask, and once we finish those, if there are questions from the audience, we will take them up. We have a time limit of about 45 minutes, so let us go ahead with the first question, which is directed to all the members. I think the mic is working. The first question is very simple: what is the problem that you are trying to solve that involves the use of large data? It is a very open-ended question, and since we have a varied panel, we are hoping to get all the different use cases, so that people can walk away saying, these are the places I can possibly look at. So, yeah, go ahead. Kalpana can start and then we can go around.

Yeah, sure. Can you guys hear me?
What we do at Metaome is help biologists make sense of all the data that is coming out in the life sciences these days. And when I say biologists, we are talking about people who don't ever want to write a line of code or do any kind of scripting; that is the audience we are addressing.

When you look at data by itself, it is a pretty boring thing; it doesn't tell you much. What really becomes interesting is when you start mining the relationships in the data, and that is what we are attempting to do, and where we find a lot of challenges. There are two kinds of relationships: things that are explicitly stated, like your name is such-and-such, and then implicit, inherent relationships which are not explicitly stated. For example, if I said Sue was Mary's daughter and Barbara was Mary's mother, then you have a granddaughter-grandmother relationship between Sue and Barbara which was never explicitly stated. It is the same way in biology, and it is mining that which is really interesting in the life sciences. That is where people start seeing new patterns, new uses for drugs, chemicals, compounds, and a whole lot of new insights and hypotheses come up.

So what are our challenges? There are a lot. Biological data is inherently very, very complex, because it is scientific data that is coming out. The second big thing is that this data sits in silos, and each silo of data has a different context. By different context I mean: if you look at a planet, say Neptune, it means one thing to the astrophysicist and something else to the astronomer. Biological data is like that; it means something to the physician and something else to the molecular biologist. So when this data sits in silos, you might be talking about the same thing, but how do you make the data interoperable and connected? That is the second big issue. The third complication is complexity that is completely unnecessary, created by the scientists themselves: every person who works with a gene gives it a new name, every person who works with a protein gives it a new name. They say biologists will share their toothbrush but they won't share their gene names. These are the sorts of issues we deal with just connecting the data and making sense of it.

But once you've connected it, and once you put in some level of reasoning and are able to deduce implicit relationships, how do you serve it out? That is a big issue. These are people who don't want to write code, and even I wouldn't want to write the huge queries that would need to be written. So how do you visualize graph data as a query, and how do you serve it out so that people can explore the nature of that graph? Those are two issues that we have struggled with, solved to some extent, gone back to the user, and found there are a lot of other issues. So these are the big things we've looked at. Thank you.
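A minimal sketch of the implicit-relationship inference Kalpana describes, using the hypothetical family facts from her example; real biological data would chain gene, protein, and disease relations the same way:

    # Explicitly stated facts: (child, "daughter_of", parent)
    facts = [
        ("Mary", "daughter_of", "Barbara"),
        ("Sue",  "daughter_of", "Mary"),
    ]

    parents = {}  # child -> parent, built from the explicit facts
    for child, _, parent in facts:
        parents[child] = parent

    # Derive the implicit grandmother relationship by chaining two
    # explicit parent links: Sue -> Mary -> Barbara.
    for child, parent in parents.items():
        grandparent = parents.get(parent)
        if grandparent:
            print(f"{child} is the granddaughter of {grandparent}")
            # -> Sue is the granddaughter of Barbara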
So, PayPal, as you all might know, is a payments-processing company, but some of the things that were just said about the complexity and variance of the data sound very familiar to me too. Our problems are multifaceted. Part of it is real-time decisioning: not analytics that can be done offline, or even in near real-time, but decisioning that actually drives the business transaction while the user's transaction is happening. Other parts of the problem are command and control for the site. There, as I mentioned, we have hundreds of billions of messages coming through, and these are what determine whether something on the site is working or not, which part is working and which is not, and how it can be put back into a working state. So even though some underlying similarities exist, our problems are perhaps a little more about running the business in real time, versus finding relationships later. We do also have problems in finding relationships later: if a failure happens, for example, what was the real cause? Tracing a business transaction all the way from its entrance into PayPal to the time it actually entered a failed state is very important to us. So the problems are multifaceted, and they are aggravated by the scale we run at. In a nutshell, that's what we struggle with.

So, what we do at BRIDGEi2i is try to solve business problems based on data. We go and talk to businesses, and many times it is almost a latent thing: the business realizes they have a problem, say in their marketing or supply chain or finance, but they don't realize it can be solved with data and analytics. So we try to solve those business problems, primarily by seeing what data is available, being able to augment that data with different forms, not just what's coming from the core systems; sometimes seeing what unstructured data is available and trying to marry that in; then driving the right level of visualization, even of the current state, because, you would not believe it, in many cases the business doesn't even have that. Telling them what's happening is a big aspect in many cases. Then we try to tie it back and solve the problem in terms of what they should be doing in future. That's where a lot of data mining, predictive modeling, and optimization techniques come into play, and where it becomes very important for us to understand which techniques will work, because the scenarios can be very different. In some cases you might be trying to understand patient behavior, in some cases customer behavior, in some cases resource optimization, say managing how many beds should be in a hospital; I'm just making these up, but there are different kinds of scenarios in which analytics can play a role. Figuring out how to solve the problem and which techniques apply is one part, and then comes implementation, to make it real for the business so they see real value out of it. In certain contexts that might be easy: it might be fully offline, you give them a strategy and they implement it. When it becomes a real-time implementation, it is even more challenging, because your algorithm has to run and make decisions in a split second, as in many of the online fraud-detection kinds of ideas. So the challenges arise across the board, in terms of data, in terms of what data to use and what kind of visualization should come into play.
There is also the question of what techniques you should be using, what layer you show to the business, and how you implement it; a variety of those come in when we get involved in project form. Another thing we are doing, still in baby steps, is building what we call analytics apps, which are really prepackaged solutions in very specific areas, where the business gets a solution and doesn't have to worry about what techniques or tools sit behind it; it just solves their problem. That's where the machine has to do a lot of the work: there is no data scientist available, so the technology has to take the decision in many of those cases, and the business has to get an almost idiot-proof tool to play with, because business managers are also getting more analytics-savvy. That's a little different, because there you're marrying technology in to quite an extent. So those, at a very high level, are the kind of work we do and the kind of challenges we face in our day-to-day work.

Hi, I'm Rohit from Yahoo. Most people know that Yahoo is the biggest publisher and has a wide variety of content, and with that comes the problem of multi-billion, 30-billion-plus events and around 20 to 30 terabytes a day, which presents us with a challenge for close to 600-plus million users. We use the technology to solve business problems like identifying user value and generating what we call the user's digital genome. By doing that, we are able to tell which stories a user would like and what would interest them, and engage them on the Yahoo network. Based on their past history we create a digital signature for that person, and then we know which categories they fall in: is it a sports fan, or a shopping maniac, and things like that. Based on that we can predict or suggest articles and stories. The same thing can be used for advertisers trying to advertise their product to the users who are actually looking for something. Since we have search and display all put together for the same user, we have strong information about that user: that he also searched for a digital camera, browsed some of the sites on Yahoo itself, clicked on some of the display ads. That gives us not only a huge business challenge to solve but also a good opportunity to serve the users.

Another thing is identifying unique users, which, as most people in the digital world will agree, is hard at different levels, the geo level or the interest-category level, because a person can have multiple interests. If you can tell advertisers they can reach, say, 100 million unique users in the sports category, an advertiser is thrilled, and if they can target it properly, they get it. So we try to connect the publisher, the advertiser, and the content with the user, which brings in the whole thing. There are other things, like scoring algorithms for publishers, the ad quality score we create, and a user value we also create. All of this put together is a multi-dimensional problem to be solved, and Yahoo has solved it to a great extent.
I am Anand from Gramener. The problem we are trying to solve is that of helping people understand the results of analysis, or data. Humans are great at language; we're born wired for it. We're not so good at reading tables of numbers. What we are trying to do is tell picture stories out of that. To give you an example, the back of my t-shirt is a chart that shows the days on which people were born in India, at least according to official records, and you'll see that practically nobody is born in August. The reason: schools open in June, and most birthdays are adjusted accordingly. Now, it's one thing to show that as a table of numbers; it's another to say, okay, I've got a full row for August that's completely blank: what's that doing? So we just tell picture stories.

We also do analytics of the non-traditional kind; think of it as bringing Freakonomics to corporates. For example, we observed that students born in June, who just make it into a class, tend to score consistently lower than students born in August, who score about ten percentage points higher. We find that certain items sell extremely well with coffee and tea, to the point that one is never bought without the other. It turns out that when you're buying children's clothing, the sales of every product move with every other product, with one exception: kids' jackets, which are somehow a very separate category. Things like this, which are interesting and which the business might not know, are the kinds of things we're trying to extract and, most importantly, present in a way that people get the answer instantly.

I'll talk about my Facebook experience; I have very little to add beyond what Rohit said, since Yahoo and Facebook are very similar companies. So I was trying to see if I could answer this question differently, and I was thinking: why did we hold on to so much data? There's a lot of data coming in, and you can do stuff with it: just count things, show how many page views you got, figure out how much advertisers should get billed, all the standard kind of stuff. I think the more interesting question is why the hell we were actually holding on to 50 petabytes of data, or whatever. And I think the most interesting use cases are often the ones you don't know about in advance. Somebody will ask a question you don't know the answer to, and you've got to go back and analyze a year's worth of data to figure it out. Some kind of cohort analysis: hey, we adopted this strategy last August; how did that fare? How did the group of users acquired last August behave over time, what was their pattern relative to users acquired via other means? To me the most interesting applications are the ones you don't know about. Maybe I should stop here.
Thanks. The next question is, I think, more interesting. Today we have various options. How many of you are already using the Hadoop ecosystem for big data? How many of you are using a non-Hadoop ecosystem? There are some of you here as well. The intent is to find out: whether you use Hadoop, something non-Hadoop, a relational database, or an MPP system, how do you make that decision at the end of the day? And going further, especially for Rohit and Joydeep, and Navjot if you have the details: what percentage of your analytics runs at a high level, say Hive or Pig, versus very custom jobs? Do you encourage one over the other, and at what point? So basically, how do you choose the technology?

I can go first, I guess. Just like any other technology decision, the first criterion, as always, is that the solution has to fit the problem; the first requirement, as always, is that it must work before anything else. So even as you go through solutions, you have to look back and make sure you understand your problem correctly, and you have to do that continuously. With respect to applicability to the problems we have in the big data space, we have a whole slew of technologies. And this is not to say we don't do a lot of processing on Hadoop: we have about a 300-node Hadoop cluster, but only about 5% of our analytics actually run on it. A lot of analytics run off Teradata, the traditional way, sourced from an OLTP system, for which a relational database is the best fit. You have structured data and you want to run analytics on it; yes, it's massive volumes, yes, it's five petabytes, but it's structured, and there are certain traditional solutions that work perfectly fine for that.

Where we deal with unstructured data, again, as I was saying in my talk today, we evaluated Hadoop, and Hadoop does offer certain advantages for us in terms of leveraging cheaper, more distributed storage rather than paying for slightly more expensive disks. But you have to do the trade-off between how much custom development we would have to do and how we would have to invest in even utilizing that 300-node cluster for some of our semi-structured data. It's that decision and the fit, first of all; fit for the problem, meaning if it takes me 30 to 45 seconds to get a response back, that's not a path I'm willing to follow. But there are certainly A/B testing and multivariate testing use cases for which it's perfect, because that analysis can run after two or three days, and in many cases that is semi-structured data. There are also external sources we mine for impressions of PayPal: what are people saying about us on Facebook, on blogs, on other sites, on Twitter? That's completely unstructured; that's mining text. So Hadoop does fit well where it makes sense to make that investment, and we have made that investment. But really what you have to look at is: does this solution solve my problem? And you have to continuously keep evaluating that as you go through technologies, and there are multitudes of things out there, like I talked about earlier today: columnar data stores, writing your own custom things, all kinds of options outside Hadoop, and of course, on top of Hadoop, tool sets like Pig and the other things that are value-added services on top of it. But I'll finish by saying that the first and foremost criterion for me is: does it solve my problem, does it work? Beyond that, technology choices are like any other technology choice you make.
It's a very good question, how you choose the technology platform. As Navjot just said, if you have an analytic need that can be modeled in a relational way, then you have better technology and tools available to visualize and work with the data the way you want. But most people in the digital world will agree that the requirements for analytics, and probably even the KPIs, are changing so fast that the introduction of a new dimension or metric, one that changes the way you compute or look at the data, presents a challenge. Sorry, I'm talking about the modeling part, but even Teradata or Oracle or whatever you take has its own limitations there. For the known analytics you're going to generate, which are finite in nature, what people call KPIs, things are not going to change much. For the web, for example: page views, clicks, revenue. Those are not going to change no matter what you do; page views will be page views. For things that are known and not going to change, your bread and butter, like an analytics dashboard where you should be able to say I'm doing okay, a relational system helps.

But the moment you dig deeper and try to create a correlation, say something on the revenue side, some fraud happening, or some robotic activity, and you want to dig in and see whether the variables are related or not, the way you look at the data changes. It's no longer a record; it becomes a logical record, and you have to go through statistical modeling, predictive modeling, and things like that. For example, the quality score for an ad cannot be measured just on the click; it has to be measured across geos, across publishers, across different types of users. When you bring all these together, it becomes a large-data problem. When you see that your existing system has a boundary, that the work cannot be made fully parallel, cannot be divided into subsets that are each processed in parallel and brought back to you, that's when you realize that Hadoop is the framework you need to embark on, whether you write custom MapReduce, write a Pig program, or, if it can be modeled that way, use Hive.

Once you know your existing system has hit a boundary... for example, a lot of people say, I have a two-petabyte Oracle system running. Are you really scanning all two petabytes? That's the question. If you are, Oracle will probably die, and even on the Teradata side, if a query scans two petabytes and there are ten concurrent users, it just kills the system. So the question is how much of the data sitting online on your RDBMS is really being used. If out of two petabytes you're using only, say, ten terabytes, then go with the RDBMS. But when you know that most of the time you have to scan a large amount of data, like at Yahoo, to do user behavior analysis, to see how the clusters are moving, which keyword is becoming popular, then we have to tell the advertiser, okay, start bidding on this, and things like that.
If you have to analyze the marketplace, if you have to see how California is reacting to certain news versus how New York is reacting, you enter into a very different space, and you have to look at it through multiple channels. That's when you realize your existing systems are not able to scale, and they won't scale, because they were not designed for it. That's when you want a system like Hadoop; and if you want real-time lookup with some processing, you can go to HBase or Cassandra. But the thing is, as Navjot said, you need to see that the problem you have fits the solution you want to come up with. Is Hadoop cheap? I don't know; it depends on the skill set you have and the problem you're trying to solve, because many people have tried to solve problems Hadoop is not made for. If you cannot think of your data as key-value pairs, you should probably think differently.

Let me see if I can add something here. One part of the question has not been answered: high level versus low level. One thing I've seen is that it's always best to capture problems at the highest level of abstraction. Sometimes that might be a C++ function, sometimes a SQL fragment, and sometimes something even higher than that. One of my favorite examples: in almost every company here, somebody out there is computing a 30-day average of something. That person is writing the same thing that a hundred other people have written, and I can give it to you in writing that many of them will get it wrong, and they will put a lot of burden on the system. So that's one principle I like to follow: think at the highest level of abstraction about how to capture what you're trying to do, and capture it somewhere, because tomorrow, when you revisit your backend and rework something, that's the thing you'll want to re-implement (a sketch of this idea follows at the end of this answer).

The other thing I would say is that, unfortunately, some of us sitting here are from very big companies, and a lot of people on the other side are from very small companies, and I don't think the same parameters apply to a large company and a small one. When I was working at a very large company, I knew I had an army behind me. My job was to make the best technological decisions, and I was confident that whatever decisions I made, the army would execute on them and fix the problems; as long as the design was right, we could always implement it. But if you're a small startup, which is my current role, you are aggressively trying to find something that works. For most of us, the answer is: try everything aggressively, find what works, and keep it. And of course, ask around and get references. My experience has been that most of the stuff doesn't work; it's marketing, people talk about cool stuff, but a lot of it doesn't work, and trying it is the only way to find out. The other principle I've seen articulated, and like, is: choose small things, because if you choose small components well, even if they break, you can replace them with something else; but if you choose big, honking systems that absorb everything within themselves, your odds of fixing things later get harder and harder. So those are the things I can add.
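A minimal sketch of that idea, assuming pandas: one shared, tested definition of the 30-day average that a hundred teams can call instead of each rewriting it (the column names and data are hypothetical):

    import pandas as pd

    def rolling_30d_mean(df: pd.DataFrame, value_col: str) -> pd.Series:
        """One canonical 30-day rolling average over a daily time-indexed frame."""
        return df[value_col].rolling(window=30, min_periods=30).mean()

    # Usage: every team calls the same definition instead of re-deriving it.
    daily = pd.DataFrame(
        {"revenue": range(60)},
        index=pd.date_range("2012-01-01", periods=60, freq="D"),
    )
    print(rolling_30d_mean(daily, "revenue").dropna().head())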
Yeah, so I think it's mostly covered. The only point I would add: it still mostly comes down to the fact that 90% of the work we do is on a relational database, with analytics on top of that really solving the problem. On the analytics package and tool side, it depends. When I was at HP, we would pick up things like SAS and SPSS, which would be the standard choices. Now, being a startup, we rely a lot more on open source, be it R, be it Python, which we are adopting and which work beautifully well. And you have to make the choice in a different form too: going back to things that are more real-time or near real-time, that's where non-traditional database concepts become very important. There are choices there as well: are you doing the learning of the algorithm in real time? If not, certain traditional methods will probably work very well. And when you're using techniques on unstructured data, that's where you probably have to think about non-traditional databases in a big way. But in general, it's mostly covered: use open source as far as possible, and the techniques themselves are the standard ones.

Do you want to add something? Your field is completely different from most of the panel. Yeah, and I'll keep this a little short. We decided to go with a graph database simply because we are looking at a very open-world system, with data across several different contexts and very diverse data sources. But again, as a startup, a lot of your technology decisions are made based on the problem you need to solve and your resources. We went with the Virtuoso stack; one big reason is that it's open source, and we have a partnership with them, so it worked out that way. For traditional housekeeping things we use regular MySQL databases, and when we do auto-suggest on our UI we use an in-memory database, or a trie. So we've been using a mixture of technologies, whatever best solves the problem, at the risk of actually repeating and duplicating our data to some extent; we've tried to find the most economical way to solve the problem.

And you want to add? Yeah, just a few things on the analytics stack. Here's what I'd suggest: start with Excel, and if you need to go deeper, go deeper. If you want to move to the programming side, Python seems to be emerging as a de facto standard, mainly because of the power of its libraries, NumPy and NLTK being at the forefront. On the front end, if you want to show it in the browser, SVG seems to be the de facto standard, mostly because it's vector as opposed to raster, so you can zoom it, and d3.js seems to be the most powerful tool emerging today.
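A minimal sketch of that stack: compute in Python with NumPy, render to a vector SVG that a browser (or d3.js) can consume (matplotlib is assumed for rendering; the data here is fabricated for illustration):

    import numpy as np
    import matplotlib
    matplotlib.use("Agg")  # headless rendering, no display needed
    import matplotlib.pyplot as plt

    births_by_month = np.random.poisson(lam=1000, size=12)  # fake data

    fig, ax = plt.subplots()
    ax.bar(np.arange(1, 13), births_by_month)
    ax.set_xlabel("month")
    ax.set_ylabel("recorded births")
    fig.savefig("births.svg")  # SVG scales cleanly when zoomed, unlike raster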
Thank you very much; hopefully that was useful for many of us. The next question is, I think, the most open, and not really discussed unless you are working in a large company, or in a startup facing these challenges every day. It is a very specific question that Navjot can answer in some way, Rohit and Anand too, and Joydeep, I'm surely looking at you: how do you debug and fix things when they break down? Let us assume the majority of people are using a Hadoop stack of some kind, whether Pig or Hive or custom code; in your case it may be something else. And it involves the whole pipeline: data cleansing, data ingestion, data processing, and data output, so it is not one tool that does everything. I think it's a very open-ended question.

It is a fairly open-ended question, because part of the reason we built some of the tools we built is to do exactly that. Now, if the question becomes how you debug and fix those tools, then you do the exact same thing you built them for: eat your own dog food. But I think the basic answer always boils down to intelligence you have to build into the system. No piece of software you write, be it anything on top of Hadoop, be it anything: Pig is not going to be any easier to debug or fix than anything you wrote in Java two years ago, or are writing in Python now, until you actually build in the intelligence to spit out what the errors are. In terms of specifically how you debug a Hadoop cluster, I'm not going to venture into that here; I'll let the experts take that, because I might get a few things wrong myself.

As far as debugging goes: at PayPal, is there something like what Flipkart, I think, were mentioning today, a framework they are building where events flow whenever things move within their system? Yes, we have that. One of the biggest things is an event stream that gets over a hundred billion events a day; it is intended to monitor the whole site. We base it completely on custom instrumentation: we try to drive all of it into common instrumentation in a small set of libraries. But again, it's based on instrumentation: your applications have to tell you what's wrong before you'll know. The event stream itself isn't going to interpret that your application did something wrong or went into an unsafe state until there's actually an event in that stream saying so. Part of the system I was talking about earlier today, for command and control, is to give us the capability across the site to monitor, debug, and solve problems: trace a transaction through the entire site, correlate events as they happen across the different layers that make up an application. That is one of the systems we built, and it's all based on custom instrumentation, with the common libraries and infrastructure components driving most of it. But there's no magic bullet. If you're talking about just debugging applications, you either have to instrument them, or, if you're using Java, for example, perhaps you can use bytecode instrumentation, and then you figure out at your scale whether that will cause you a problem or not. There's only a finite set of techniques.
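A minimal sketch of the custom-instrumentation idea Navjot describes: every emitted event carries a transaction id, so a business transaction can later be stitched back together across layers (the field names, component names, and print transport are all hypothetical):

    import json, time, uuid

    def new_transaction_id() -> str:
        return uuid.uuid4().hex

    def emit_event(txn_id: str, component: str, state: str, **details):
        """Append one structured event to the (stand-in) event stream."""
        event = {
            "ts": time.time(),
            "txn": txn_id,        # the key that lets us stitch layers together
            "component": component,
            "state": state,
            **details,
        }
        print(json.dumps(event))  # a real system ships this to a collector

    txn = new_transaction_id()
    emit_event(txn, "web", "received", path="/pay")
    emit_event(txn, "risk", "approved")
    emit_event(txn, "settlement", "failed", reason="timeout")
    # Grouping the stream by "txn" reconstructs the transaction's full path.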
So, five minutes? Sure, I'll keep it short. On the Hadoop side, Hadoop itself provides some of this: if a mapper fails, it automatically reschedules the mapper. But I agree with Navjot that the application has to be smart enough. If your application code is divided into components, which was also mentioned, and each component can restart itself, not necessarily from exactly where it failed, but from its own starting point, then you can maintain consistency. So when you design your application, or the workflow or jobs you're going to run, whether through Oozie or whatever mechanism you have, break it down so that if there are four stages, A, B, C, D, and stage B fails, stage B should be able to restart itself from the point where it picked up after A left off. That way you don't have to run a monolithic job from point A to point D with no restartability. In design, restartability is the biggest thing: Hadoop provides a lot of failure-restart machinery, but it doesn't always work great, and a lot of the community is coming together to enhance it. If you design your system to be restartable, most of the problems will be solved, and only small pieces get restarted instead of one monolithic job. Because if you are reading a petabyte of data, you don't want to keep reading it again and again when you know that from that petabyte you generated a hundred terabytes and you're only going to work on the hundred terabytes on the restart. So restartability is pretty key.
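A minimal sketch of the staged-restart design Rohit describes, assuming per-stage completion markers; paths and stage names are hypothetical, and on a real cluster the markers would live on HDFS (Hadoop's own _SUCCESS file per completed job output plays the same role):

    import os

    MARKER_DIR = "/tmp/pipeline_markers"

    def run_stage(name: str, fn):
        marker = os.path.join(MARKER_DIR, name + ".done")
        if os.path.exists(marker):
            print(f"skipping {name}: already complete")
            return
        fn()  # do the actual work for this stage
        os.makedirs(MARKER_DIR, exist_ok=True)
        open(marker, "w").close()  # mark completion only after success

    # Four stages: if B fails, a rerun skips A and resumes at B.
    run_stage("A_extract",   lambda: print("reading raw petabyte..."))
    run_stage("B_filter",    lambda: print("filtering to 100 TB..."))
    run_stage("C_aggregate", lambda: print("aggregating..."))
    run_stage("D_load",      lambda: print("loading results..."))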
The other thing is to fail visibly. In the sense: let's say a new product category has been added and you don't have the metadata, the reference data, for it. Fine; don't ignore it, show it as an unknown category. It's fine to handle all failures, but it's equally important to show that there has been a failure, so that people can take action based on it (see the sketch after this exchange). That's the only thing I'd add.

Yeah, the things that come to my mind: it's humans who debug problems, and human beings need logs. So just log everything; log the heck out of everything. That's one thing I appreciated during my work life. Don't worry, there's enough space. And give people a way to retrieve the logs. That's the number one thing people internally used to ask for: don't build stuff, just give me the logs; you've run a hundred tasks, give me all the logs, I've got eyes, I'm going to figure out what went wrong. I've found the same thing in our current system, with all the Hadoop jobs and such: give us a way to get access to all the logs to figure out the problems. The other thing that comes to mind, in addition to what has already been said, is that a lot of the problems are actually caused by human beings. It's not machine errors or chip failures or thermals causing most of the problems; most problems happen because people do stupid things. So make your systems idiot-proof. That's the other cardinal rule of thumb. Prevent people from deleting your entire system; prevent them from writing jobs that are not going to finish in, whatever, a few days; don't let them log so much data that no other data can come in and everything gets thrown out. What do they say, trust but verify? Something like that, except: don't trust anybody. When somebody is a user of a system, put quotas on them, make sure they don't exceed their quotas, put those defenses up, because if you don't, one bad user will impact everybody. I think that's the key learning: put in quotas, and implement them from the beginning itself; I am sure that applies across all systems.

While we are at it, I just wanted to counter one thing; I agree with everything else, and a couple of my guys are here, so they'll testify: we tell people not to log everything. I say, please don't log everything, and don't log everything everywhere, because that itself, given the volume we run at and the things we have to deal with, can cause problems. And I think you acknowledged that at the end, so I think we're in agreement: you've got to exercise judgment here. But keep the critical logs, like Anand said; if you're seeing something bad, log it.

Actually, I wanted to follow up on that, because part of the issue is this: here's what happens. You hire a smart guy, he develops stuff, he logs everything, he follows the advice, and then he moves on, to another project, another company, and then nobody's looking at those logs. So the very common error pattern is: yeah, we knew this was happening, we were not joining because we didn't have the join key or whatever, half the data was missing, but nobody was looking at the damn logs. I really don't have a good answer to that. You want to set up alerting, but false positives are one of the biggest problems with any alerting system, and I haven't seen a good solution. So yes, there are some unsolved problems there.

It's a sticky point, but I'll just take half a minute. I'm actually a believer in logging as much as you can, but filtering it when you really want to process it. Filtering is cheap on Hadoop, but if you miss the bus, you cannot get that user click back. So, according to me, in the digital space, if you didn't log, you missed an opportunity, probably at a high cost. Thanks.
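A minimal sketch of the fail-visibly rule Anand mentioned at the start of this exchange: missing reference data surfaces as an UNKNOWN category instead of silently dropping rows (the data shapes are hypothetical):

    category_names = {"c1": "clothing", "c2": "electronics"}

    sales = [("c1", 120), ("c3", 45)]  # c3 has no reference data yet

    totals = {}
    for cat_id, amount in sales:
        label = category_names.get(cat_id, "UNKNOWN")  # visible, not ignored
        totals[label] = totals.get(label, 0) + amount

    print(totals)  # {'clothing': 120, 'UNKNOWN': 45} -> someone can act on it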
So then we come to the last question. I guess we're just about out of time; should we stop? The question is very straightforward: at the end of the day, when a business analyst is trying to make a decision, there is some kind of representation, either interactive or static; it could be derived from a lot of data, or from a lot of interaction with the data. What has been your experience? I'm sure it varies across places, but since Anand is here and we are running out of time, we'll just request Anand to take this one.

If there's one piece of advice I'd suggest when you are creating visualizations, it's: copy. Copy blatantly. There are enough good visualizations out there; just do a search for data visualizations and you'll find enough examples; that's more than enough. If there's a second piece of advice I'd offer, it is: read Edward Tufte. That's T-U-F-T-E, the person who started data visualization as a revolution; sorry, for those who couldn't hear, the spelling is T-U-F-T-E, Tufte. His books are an amazing source. Beyond that, just remember that the medium you use, to a good extent, constrains the work that you do. If you create your visualization mock-ups in PowerPoint, you're likely to be constrained by PowerPoint shapes. If you create them in Photoshop, you're likely to be constrained by Photoshop styles. Paper and pen work just fine, so you may just want to do a design on paper and pen and see where that takes you.

One of the things I would like to tell people, and I agree with you, is that people think of analytics and the tools as: the more the charts, the better it looks; a dashboard loaded with pie charts and trend lines and some heat map. Yes, it's beautiful. But does it tell you the story? Does it lead you to the answer you're looking for? If your dashboard or visualization does not tell you the story, the way the back of your t-shirt tells a story, then I don't know whether the visualization makes any sense.

My question is pretty basic: is there a best practice for storing data somewhat neatly in Hadoop? For example, say there is a large log file that contains unstructured data, but I can easily identify that there are ten different types of unstructured data in it. Is it good practice to run that log file through a parser and store it in ten different directories in Hadoop, or should I just dump the big file in one Hadoop directory and let Hadoop handle it?

I think the answer will ultimately depend. The cardinal rule of thumb is: if you are going to divide the data, look at how fine-grained the data you are creating is. Take one extreme, the trivial case: there's a record per user; am I supposed to store one file per user? Obviously the answer is no. So there is always a fine line, and you've got to look at how fine-grained the data sets you're producing are. If they're really, really fine-grained, go for a system that indexes, like HBase or one of those kinds of systems. If they are still relatively coarse-grained, then yes, split into directories, and make sure the directories are big enough.

Because you're talking specifically about logs, I assume the logs are not directly storable and queryable, because that's not how you log. You have to think about how you're going to use your data going forward. If you store it one way and use it a different way, and every time you end up reading all the data and then filtering most of it out, then you definitely didn't store it right. If most of the use cases you have for the data can be satisfied with a particular layout, then that is the right way to store it; there's no thumb rule saying store it like this. Like I said, if you need lookup, where you get the log and just need to show real-time analytics, you take it, use whatever stream processing you have, dump it into HBase, and show it. But there is definitely a performance difference. If you're getting, say, 50 terabytes of data a day and you're storing it in a way where you'll only ever read, say, 100 gigabytes of it, say you are doing geo analysis and most of your use cases are geo analysis because you are a company that mostly cares about the geo part, then you should store it partitioned geographically, because then you can produce independent analytics. And if you are going to do real-time analysis, okay, how many clicks happened just now, tell me in five minutes what happened, did this advertiser get this user, then you do stream processing and store it in HBase. So it purely depends on what use cases you have; otherwise it's all a blob. Like someone was mentioning, data is boring to look at unless it tells a story.
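A minimal sketch of the directory-partitioning practice discussed here, routing a mixed log into per-type, per-date directories at ingest so later jobs read only the slice they need (the record format and paths are assumptions; very fine-grained data would go to HBase instead):

    import os

    def partition_path(record_type: str, date: str) -> str:
        # Hive-style layout: /logs/type=click/date=2012-07-27/part-0
        return f"/tmp/logs/type={record_type}/date={date}"

    def ingest(lines):
        for line in lines:
            record_type, date, payload = line.split("\t", 2)  # assumed format
            out_dir = partition_path(record_type, date)
            os.makedirs(out_dir, exist_ok=True)
            with open(os.path.join(out_dir, "part-0"), "a") as f:
                f.write(payload + "\n")

    ingest(["click\t2012-07-27\tuser=1 ad=9",
            "search\t2012-07-27\tq=camera"])
    # A click analysis now scans only type=click, not the whole blob.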
This is not a question, it is a clarification. As we understand, there is a difference in the way we look at structured data and unstructured data in the industry. The question is: what is the industry trend in terms of processing unstructured data into structured data? Could you repeat the last part, please? What is the industry doing in terms of processing unstructured data into structured data, because the volume of unstructured data is almost two times that of structured data?

So, unstructured data is used to derive information that is human-readable, and that has to be structured. Almost every time you process data, the output that comes out has to be structured; structured data is always there, a hundred percent of the time. Unstructured data is used to make sense of things. For example, if you are getting messages, like Facebook does, they're unstructured; but when you do text analytics on them and a geography analysis, you can make out that, okay, in California this particular theme is really popular. That's structured information, because then you can represent it. The thing is, if you have unstructured data, you have to process it, and eventually structured information has to come out of it.

One of the interesting things happening as a result of needing to process unstructured data is a confluence between science and technology. The kinds of unstructured data we process are text, video, audio, images, and so on. Now, sound engineers have been processing sound in various ways over the years, and there are startups today, for instance, that take the audio spectrogram of songs and try to see if they can predict similar songs. People are doing the same for images using visual recognition techniques. We're taking the techniques of linguistics, which is as far removed from technology as one could imagine, and applying them to extract structured data. So one of the side effects, or perhaps one of the drivers, of structuring unstructured data is this confluence, bringing in new insights from science, from fairly diverse fields.
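A minimal sketch of that unstructured-to-structured flow: raw messages go in, structured (geo, topic) counts come out (the keyword lists are hypothetical stand-ins; a real pipeline would use proper text analytics, e.g. NLTK):

    from collections import Counter

    messages = [
        ("california", "loved the new sports final last night"),
        ("california", "sports sports sports"),
        ("new york",   "looking for a digital camera deal"),
    ]

    TOPICS = {"sports": ["sports", "final"], "shopping": ["camera", "deal"]}

    structured = Counter()
    for geo, text in messages:
        for topic, words in TOPICS.items():
            if any(w in text for w in words):
                structured[(geo, topic)] += 1  # a structured fact emerges

    for (geo, topic), n in structured.items():
        print(geo, topic, n)  # e.g. california sports 2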
Just a footnote: my question is about NoSQL. In RDBMS we had the paradigms of normalization and entity-relationship modeling. Do we have any such paradigm in NoSQL? How do we model our data in NoSQL? I guess we'll have to take that question; do you want to take it now? Yeah, please.

So, I think in my profile I mentioned that we have implemented a star model on the grid. What we did is: the files you have on the grid, you can call tables. For the information within a file, whether it's structured or not, you can write a UDF to identify the columns, so you call those the columns. Then you define the relationship between two feeds. If you have one feed of users and another feed of advertisers, and for whatever reason you want to find out whether this cookie clicked on this ad, then you know the link between the advertiser feed and the user feed is the cookie. If you represent this as meta-information, maybe in your MySQL or anywhere, you can then do star modeling: use any tool, or write your own custom tool, where you allow people to think in dimension-and-metric form and drag and drop in your UI. Internally it knows which feeds and which columns it is talking about and how they need to be joined: if it's a small file you do a map-side join, if it's a big file it goes to the reduce side, and so on. Then you can implement it. But it's a hack; it's still not like firing an SQL query and getting the data back. It will still be a batch thing, unless you put something like HBase behind it, and then you can actually play with the Java APIs of HBase.
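A minimal sketch of that star-model-on-grid idea: metadata declares feeds, columns, and join keys, and a tool generates the actual query from a drag-and-drop choice of dimension and metric (all names are hypothetical; Hive-style SQL is assumed as the target, but the same metadata could drive custom MapReduce or Pig):

    feeds = {
        "user_clicks": {"columns": ["cookie", "ad_id", "clicks"]},
        "advertisers": {"columns": ["ad_id", "advertiser"]},
    }
    joins = {("user_clicks", "advertisers"): "ad_id"}  # the linking column

    def build_query(dimension: str, metric: str) -> str:
        key = joins[("user_clicks", "advertisers")]
        return (
            f"SELECT a.{dimension}, SUM(u.{metric}) "
            f"FROM user_clicks u JOIN advertisers a ON u.{key} = a.{key} "
            f"GROUP BY a.{dimension}"
        )

    # The analyst drags "advertiser" and "clicks"; the tool does the rest.
    print(build_query("advertiser", "clicks"))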
Can we take the last two questions? Yeah. So, I'm basically trying to collect some open problems. What are some problems that are well solved in the batch-processing or offline space, a typical Hadoop kind of scenario, that would make sense in an online, real-time scenario but are difficult to solve there? Problems you are grappling with, that would really add value if solved in real time, and that are maybe already solved in the batch-processing space?

Any real-time analytics capability. I think it's early to say that real-time analytics over very large data sets is solved. For example, problems like those in the life sciences, in biology: finding those implicit relationships in real time, I don't believe you can call that solved yet, especially as the data sets increase, as the volume increases, as the dynamic nature of it increases. I want to keep adding more events while you're doing your processing, and I want you to keep revising the answer you give me. I don't believe that's solved yet, and I think that's a common problem everybody sitting up here has, and most of you out there will have too.

One of the things I like to ask, actually: even at Yahoo, when people say, I want real-time analytics, I ask, what will you do with it? And they say, you know what, it will help me tell the advertiser they can bid more here. I ask, how long does your system take to make that change and push it to serving? Say it takes two hours. Then you don't qualify for real-time analytics. Real-time analytics usually helps in something more like machine learning: for example, if I'm generating a user value, constantly tracking how user behavior is changing, and suddenly I see some news break out and interest spike, that is when, if I have real-time analytics, I can feed it into a system which automatically makes sense of it and pushes more content to that user. That is real-time analytics. A human consuming it in real time, even if the numbers suddenly jumped from 20 million to 40 million users, big deal. If you can't answer the question of what action you'll take on it, and Anand can say more on this, if analytics has no action related to it, it's just information; it's a core dump from a program.

Maybe I can bring this back a little. So, okay, I was in Rohit's camp; I'm no longer there. The reason is that when people pay for stuff, they ask for stuff, and whether it makes sense or not, you have to give it to them. When your advertisers come, and they're paying you money because you show ads and get them clicks, and they say, Google Analytics does real-time analytics, why aren't you doing it, what are you going to say, take your money to Google? We've got to do what customers say. I've seen that even from very small players: I remember talking to an online help desk company serving mom-and-pop shops, and their mom-and-pop shops are asking them for real-time analytics on their customer logs. What are they going to do with it? But they are paying customers; you've got to give them what they want.

What was the second part? Right, open, unsolved problems. There's so much stuff going on in the world that it's very hard to claim you actually know what's going on and what's unsolved and what's not. But here's one thing where, if you developed it, I would be a happy customer. There are real-time systems that do a very good job of processing low-latency, high-transactions-per-second kinds of data, and there are batch systems that can obviously go crunching on six months' worth of data. I haven't seen a good abstraction, a layer, that says: you can define these real-time counters, and if you tell me to back-populate them six months, I'm going to just do it for you, you don't have to worry about it. If somebody in the audience knows a system that does that, I'll be very happy to hear about it. Where I've seen systems break down is that the people who build real-time systems focus on the real-time part, the people who build batch systems focus on the batch part, and nobody tries to span the two. From the end user's point of view, I don't care: I just want my real-time analytics, starting now and going back six months, please give it to me. So that might be something interesting for you.
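A minimal sketch of the missing abstraction Joydeep describes: one counter definition answered from a batch store for history plus a real-time buffer for the live tail, merged at query time (both stores are stand-in dicts here; a real system would back them with a batch output and a low-latency store respectively):

    from collections import defaultdict

    batch_counts = {"2012-07-25": 1400, "2012-07-26": 1520}  # precomputed
    realtime_counts = defaultdict(int)                        # today only

    def record_event(day: str):
        realtime_counts[day] += 1  # low-latency write path

    def count(day: str) -> int:
        # One query spans both: history from batch, the live tail from realtime.
        return batch_counts.get(day, 0) + realtime_counts.get(day, 0)

    record_event("2012-07-27")
    record_event("2012-07-27")
    print(count("2012-07-26"), count("2012-07-27"))  # 1520 2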
Some of you are suppliers of data analytics. Even though I'm from a different field, I'm an educational researcher, one of my toughest issues has actually been to get people to use data. So, as suppliers, how do you generate demand for your product?

I completely agree with you. I think that's still the fundamental problem we face in a lot of businesses: people really have a problem but cannot relate it back to data or analytics and how it can be solved. So there's a huge education process. At least as a startup, in most of the cases where we go and work, we have to educate businesses: if we do this, this is the value you can generate, this is the final revenue enhancement or cost take-out that comes, and by the way, what goes into it is analytics, and this is the solution we provide you. So it's an education process. It can be easier if you have some prototypes, if you can visualize the whole thing and show it, and if you have some real case studies where you have done it, that makes it more real. But going back to the whole thing: even before you start a project, many times you have to build a visualization which explains to the client how they can get value out of what they are trying to do.

It probably also needs a change in mindset. You're not getting them to consume data; you want them to consume stories. That's what's more interesting, and it's therefore also a change in mindset for the people pushing this out: don't push out data, push out stories. And that in turn means training on the part of the data analysts to learn how to tell stories. The story could be numbers, it could be a simple statement, it could be pictures, it could even be a table of numbers at the end of the day. But let's just recognize that, at the end of the day, you don't want to say the chi-square value of this is 7.3; you want to say Jack and Jill went up the hill, or down the hill, something of that kind.

I think we're done. Great, thank you so much, and thanks to the panel; that was excellent. I'll see you guys tomorrow.