Hi everyone, my name is Ashish Thusoo and I'm going to be telling you about a company called Qubole and the software we built there to really democratize big data. I am one of the co-founders and the CEO of the company. A little bit of background about myself: I have been associated with big data for a very long time. Before starting this company I used to run the big data team at Facebook, and I'm also one of the co-creators of Apache Hive, which is one of the projects in the Hadoop ecosystem. So my association with Hadoop goes back all the way to 2007, and one of the critical things we had realized, and still see in the Hadoop ecosystem, is that there is a lot of education to be done about how it can be used, how it can be operated, and so on, which adds certain impediments to the use of this type of technology. Qubole tries to solve that, and that's what I'm going to be talking about.

Diving a bit further, the talk is outlined in four sections. First, I'm going to talk about the key drivers for big data. This is probably old news by now, a lot of people know what is driving big data, but it gives you a flavor of what kinds of applications and data sets can make use of these next-generation, highly parallelizable, scalable technologies. Next I'm going to touch on the big impediments to its adoption. A few years back the question used to be, hey, what are the use cases? Now there are a lot more use cases, but there are still impediments to the adoption of big data technologies, and I'm going to touch on that a bit. Then I'll dive into what Qubole provides and how it tries to solve those impediments, and if we have time we will also look at a few key features of the software stack itself. Please feel free to interject at any time and ask questions; it's a small group, so an interactive session would work very well.

So let's dive into what is actually driving big data. Of course, we as a society are increasingly hearing in the news about cases where data analysis has clear benefits, but purely from the technology side there are a couple of things that have happened in the past decade. It's a secular trend that has been going on for a long time, but especially post-2000 we have become more connected as a planet. There is a lot more connectivity of devices and of people, and at the same time we have seen a proliferation of devices, whether personal devices or metering devices. As a case in point on connectivity, this is a chart of how mobile connectivity bandwidth has increased over time, and it continues to increase as more backbone infrastructure is added, so you can carry more and more bits and bytes on the pipes from any location. So that connectivity bandwidth is increasing, and at the same time the density of connectivity has also increased: there are a lot more regions in the world today that are less blue compared to what they were in, say, 2004.
There are a lot more regions where the density of mobile access has increased, which means there are a lot more places where you can get your information where you want it delivered and when you want it delivered. These are just data points that buttress the claim that connectivity in general is becoming more and more ubiquitous and available, not just to people but even to devices across the planet.

The second part is the device revolution. I don't have to preach this much: there are the personal devices we all use, PDAs and smartphones and so on, but apart from those there are monitoring devices being added, such as smart meters and health monitoring devices, which people are using a lot. These devices, in combination with the connectivity, ensure for the first time an environment where you can collect data all the time: you can instrument these devices, collect data there, and move that data out through the near-perpetual connectivity to a central location where it can all be analyzed together. That, in my view, is really what is driving the big data revolution and the need for this next generation of technologies.

And this data has some fundamental differences from the form-generated or order-entry-generated data of the previous generation. This data has a lot more volume and a lot more velocity: since things can be instrumented so extensively and so fast, a lot more data is collected, which adds to both. At the same time, this data is fundamentally semi-structured in nature. A lot of it is log data, which keeps changing; there is no single schema, no single set of columns that stays static over time, because applications evolve and the data changes with them. And there are data sets such as photos or videos which are fundamentally unstructured, so to say: you have to derive structure out of them, extract metadata from them, and so on. So the data sets being generated because of these technology trends are fundamentally different from those of the previous generation, which was all about order entry and inventories and so forth, and since the nature of this data is different, a lot of the focus in the last five to ten years has been on systems that can deal with it in a cost-effective and scalable manner.
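To make the semi-structured point concrete, here is a small illustration I've added (not from the talk) of evolving log data: records from different periods carry different fields, so rather than forcing a fixed schema, a consumer just takes whatever arrives.

```python
import json

# Hypothetical log lines: the application evolved over time, so records
# carry different fields -- there is no single static schema.
log_lines = [
    '{"ts": "2012-01-03T10:00:00Z", "user": 42, "action": "click"}',
    '{"ts": "2012-06-18T11:30:00Z", "user": 42, "action": "click", "device": "phone"}',
    '{"ts": "2012-09-02T09:15:00Z", "user": 7, "action": "view", "device": "tablet", "geo": "US"}',
]

records = [json.loads(line) for line in log_lines]

# A relational table would need a schema change for every new field;
# here we simply take the union of whatever keys show up.
all_fields = sorted({key for rec in records for key in rec})
print(all_fields)  # ['action', 'device', 'geo', 'ts', 'user']
```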
So if you look at the properties described for this data, it is increasing, and increasing with high velocity, which means you need an infrastructure that is very adaptive, one that can scale horizontally and not just vertically. It's not about adding ever-bigger machines to process this data; you need infrastructure that can grow incrementally, elastically, to keep up with it. At the same time, you need infrastructure based on commodity servers, not specialized hardware, because each individual data point in the pipe may not be high fidelity, but in aggregate the data is very, very useful. So much less time is spent on "how do I ensure that each and every value of this data set is stored" and much more on "how do I store this data set cost-effectively, with compute that scales along with it." This has really been the genesis of Hadoop: the MapReduce systems that came out of the Google paper, and then Hadoop as their open-source implementation. So that sets up the tone of the conversation: we are living in a time when a lot of data is being generated, the nature of this data is fundamentally different, and therefore there has been a need for systems that deal with it in a fundamentally different way than previous-generation systems did. That is what systems like Hadoop do.
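As a quick reminder of what the MapReduce model looks like, here is the canonical word-count example, sketched in the Hadoop Streaming style in Python. This is illustrative boilerplate I've added, not code from the talk; in a real Hadoop job the framework performs the sort-and-shuffle between the two phases.

```python
import sys

# Canonical word-count in the Hadoop Streaming style: the mapper and
# reducer each read lines and emit key/value pairs, and the framework
# sorts by key between the two phases.

def mapper(lines):
    # Emit (word, 1) for every word seen.
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Input arrives grouped by key; sum the counts per word.
    current, total = None, 0
    for word, count in pairs:
        if word != current:
            if current is not None:
                yield current, total
            current, total = word, 0
        total += count
    if current is not None:
        yield current, total

if __name__ == "__main__":
    # Simulate the shuffle locally: map, sort by key, then reduce.
    mapped = sorted(mapper(sys.stdin))
    for word, total in reducer(mapped):
        print(f"{word}\t{total}")
```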
So what have been the big impediments in the adoption of this data infrastructure? If you go back a few years, people would ask, and there is still some talk about, hey, what are the use cases? But there are industries now with fundamental, very well-established use cases of big data. However, there are still bottlenecks that a lot of these businesses face.

One is simply the human bottleneck. These are very new systems, so it becomes hard for a lot of businesses to find expertise in setting them up. That expertise ranges from Hadoop operations experts, people who can do cluster management and understand the software well enough to really set up these clusters, all the way to the other direction: can I find people who have actually prepared data at this scale, whether they are data scientists or ETL engineers? That is one tremendous bottleneck a lot of companies face today.

The second is the provisioning bottleneck. Yes, these systems are built on commodity hardware, primarily because of the unpredictable nature of the workloads and of how the data grows, but keeping up with the data scale becomes more and more difficult if you are operating in a closed data-center environment where you have to provision machines six months in advance. Many companies are deploying these technologies in dev and test environments, but even there, to run POCs, they need a plan: I need a 20-node or 30-node cluster, and I need to put together those clusters and the infrastructure around them before I can use them. And when they move to production, it becomes even harder to keep up with the growth. One property of these systems is that, since storage is so cheap, people dump a lot of data into them, and as a result they keep expanding and expanding. Yes, people go back and cleanse out data and do those sorts of things, but in general the trend I have seen in most places is tremendous data expansion, and to keep up with that you have to be constantly provisioning hardware. So those are the two big impediments I see today in the adoption of these technologies.

What we realized at Qubole is that the cloud is one of the best ways to solve both of these impediments. On the cloud, if you provide this as a service, you take away the bottleneck of human resources, the need to find expert cluster operators, because it is provided as a service; your data analysts and data architects can leapfrog the stage of putting together infrastructure before they can even do analysis. And the second great thing about the cloud is its elasticity, its dynamic provisioning: you can get machines within minutes on the cloud, as opposed to a private data center where you have to go through procurement processes and the like. Both of these work really well against the impediments I talked about: the service model solves the lack of expertise, and the cloud's intrinsic elasticity solves "how can I provision hardware when I need it" as opposed to "I need to plan my hardware layout for the next six months." Together they fundamentally change the dynamics and really take away some of these impediments that companies have in dealing with big data.

So Qubole does exactly this, among a number of other companies as well. We provide big data as a cloud-based service: an optimized Hadoop and Hive big data stack in an integrated platform, along with a bunch of connectors, workflow tools, and so on, all in an integrated manner, provided as a service on the cloud. Right now it runs on the Amazon Web Services cloud. The value-add is really taking away both of those impediments, certainly the impediment of expertise: if you are thinking about Hadoop or Hive or any of these big data technologies, you can go to a service-based model where you get these technologies right away and focus purely on your analysis and your data sets. The platform is in production; we do dynamic provisioning of all the hardware on the basis of demand, and this is all built into the software.
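To make the elasticity point concrete, here is a minimal sketch of on-demand provisioning against the EC2 API using boto3, the AWS SDK for Python. This is my illustration of the general mechanism, not Qubole's actual provisioning code, and the AMI ID is hypothetical.

```python
import boto3

# A minimal sketch (not Qubole's code) of cloud elasticity: asking EC2
# for worker nodes is a single API call that completes in minutes,
# versus months of procurement in a private data center.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-12345678",   # hypothetical AMI with Hadoop preinstalled
    InstanceType="m5.xlarge",
    MinCount=10,              # desired initial cluster size
    MaxCount=10,
)
instance_ids = [i["InstanceId"] for i in response["Instances"]]

# ...and when demand drops, the nodes can be given back just as easily.
ec2.terminate_instances(InstanceIds=instance_ids)
```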
We currently process about a petabyte of data every month, and that's increasing. So if you're interested, definitely go check it out. I'm going to talk about some of the things we have built there to enable more and more companies to use big data technology more easily.

This slide sort of captures the architecture of how a big data technology stack works in a cloud environment, and it is really the architecture of the Qubole Data Service. At the very base we are, of course, on Amazon's cloud: we use EC2 for compute and S3 for all persistent storage, so all your data is stored in S3, in open formats. What we have done is take the big data technology stack and invest in two parts, which I call the red box and the blue box.

The red box is really our data platform, which takes away the impediments of operations. It does things such as adaptive, on-demand cluster provisioning: it grows or shrinks clusters completely on demand, on the basis of the demand predicted from the workloads coming in. The real value of this red box is that it takes more and more of the operations piece out of the picture, so you don't need Hadoop experts or Hadoop cluster operations experts. It does this completely adaptively: it gives you full visibility into what is happening on the clusters, but at the same time automates things to a degree that makes consuming these technologies much easier. As a data analyst or a data architect, you no longer have to think about how many nodes you need for your workloads, or what kind of nodes. All of that is built into the system: you send your workload in, and the system figures out how many nodes it needs, provisions those clusters, brings them up, keeps them around for your workloads, grows them adaptively as your workload grows, then shrinks them and takes the clusters away, handling the entire lifecycle management of the cluster.

At the same time, the red box also has technologies built in to make something like Hadoop work well in a cloud environment. If you think about the thesis of MapReduce and of big data technologies, the whole thesis so far has been: keep compute and storage together. But in a cloud-based environment that is completely turned on its head: you have S3 as a separate storage entity and EC2 for compute, and there is a gap between them; they are not on the same machines or anything like that. So another thing we have invested in, in the red box, is how we can continue to take advantage of separated compute and storage, the big advantage being that you can scale them independently of each other, with different cost structures, while still gaining the performance benefits of having compute and storage together. For that we have built a bunch of caches and so forth into the red box. So the red box really focuses on how you can run big data well on a cloud, in such a way that the cluster operations costs are reduced, with that expertise built into the software instead of humans having to make those decisions.
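The talk doesn't detail how those caches work, but the standard shape of the idea is a read-through cache: recently read S3 blocks are kept on the compute node's local disk, so repeated scans get local-disk performance while S3 remains the system of record. A minimal sketch, with all names hypothetical and no claim to Qubole's actual design:

```python
import os
import hashlib

# Read-through block cache sketch: serve repeated reads from local
# disk, fall back to S3 on a miss, and keep a local copy afterward.
CACHE_DIR = "/tmp/s3cache"

def read_block(s3_client, bucket, key, start, length):
    os.makedirs(CACHE_DIR, exist_ok=True)
    block_id = hashlib.sha1(
        f"{bucket}/{key}:{start}:{length}".encode()
    ).hexdigest()
    local_path = os.path.join(CACHE_DIR, block_id)
    if os.path.exists(local_path):        # cache hit: local-disk speed
        with open(local_path, "rb") as f:
            return f.read()
    # Cache miss: fetch the byte range from S3, then cache it locally.
    obj = s3_client.get_object(
        Bucket=bucket, Key=key, Range=f"bytes={start}-{start + length - 1}"
    )
    data = obj["Body"].read()
    with open(local_path, "wb") as f:
        f.write(data)
    return data
```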
So on top of this substrate, the red box, we have also built a whole set of data tools to enable access to these technologies not just by core developers but also by analysts and more statistically minded folks, as opposed to only people who can write MapReduce. For that we built a library of connectors that can pull data in from multiple different sources, help ingest data incrementally, and create incremental pipelines on these data sets. On the consumption side we have built interfaces, primarily through ODBC, so this can work with analyst-friendly tools such as Tableau and Excel. In our data tools layer we have also built some simple interfaces for exploration, scheduling, and sharing.

The point about sharing is that when you are working with these technologies in a very dynamic environment, where the data is changing very fast, the traditional approach of cataloging each data set and each data set's metadata sort of breaks down, because the data changes so quickly. Sharing and collaboration are one way to get around that: simple things, like connecting users of a new data set with expert users who have worked with it in the past, or showing them sample workloads that have run on it before, can really help people discover data much more easily. So a lot of the focus in the blue box has been on which simple data tools can bring these technologies closer and closer to end users who are analysis-driven, who think in terms of transformations and data sets, while hiding a lot of the infrastructure complexity from them. So that's a bit about these two boxes. I'm almost done, but any questions so far? Yes?

[Question about security.] The security is basically what Amazon AWS provides. AWS has IAM and security groups; security groups are more for compute, while for S3 you have ACLs and so forth, and we continue to honor those ACLs. And since this is a service with a lot of different clients using it, what we ensure is that the clusters we bring up on demand are brought up within the client's own AWS account, so there is complete separation of the VMs going into these clusters from client to client, and S3 is likewise separated. Now, within an account there are other security mechanisms built into technologies such as Hive and Hadoop that you can leverage. For example, if you are coming from the RDBMS world you know about roles and grants and privileges; similar structures exist in Hive, which you can set in order to control access within an account. Those are all things that we expose in our service.
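As a sketch of what those Hive-level controls look like, here are standard HiveQL authorization statements issued through a Python client. The host, role, table, and user names are hypothetical; this illustrates the roles-and-grants model, not Qubole's specific configuration.

```python
# Roles, grants, and privileges in Hive, much like an RDBMS.
# Connection details are hypothetical.
from pyhive import hive

conn = hive.connect(host="hive.example.com", port=10000)
cursor = conn.cursor()

cursor.execute("CREATE ROLE analysts")                           # define a role
cursor.execute("GRANT SELECT ON TABLE clickstream TO ROLE analysts")  # grant a privilege
cursor.execute("GRANT ROLE analysts TO USER alice")              # assign the role
```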
[Question about how this compares to other Hadoop vendors.] A big difference between all those companies and us is that we are a service, not a distribution. We are not shipping software; we provide this entire integrated stack as a service in the cloud, so you just log in and you get everything. The Clouderas and Karmaspheres of the world cater to shops that might want to take their components, integrate them, and create a similar service internally. We take it one level up: this is all integrated and provided to you.

Yeah, correct, these are our tools. Now, some third-party tools also work through ODBC connectivity, not the Karmaspheres of the world, but tools that speak ODBC can work with our platform as well. Our general thesis, though, is that at least 80 percent of the functionality you need should be provided within the integrated platform. The big USP here is that we are operating this for you, all as a service. Just as Amazon operates its data centers for you and provides them as a service, we take it one level up: we operate and provide the entire big data stack you would need, so you don't have to build all of this by taking the Clouderas and Karmaspheres of the world in-house. There are cases where people do that, and if your data is all tied up in internal data centers then of course that is the only choice you have. But if you are able to move your data to the cloud, or your data is being generated in the cloud, this becomes extremely viable, because now you don't have to run Cloudera in the cloud, then a Karmasphere server in the cloud, and connect them together; with this platform you get everything without having to set it up or operate it.

[Question about how concurrent queries are handled.] Yes, we'll run them in a single cluster. The way the software is set up, you can define a policy saying the minimum size of my cluster is 10 nodes and the maximum is 100 nodes, and, when acquiring nodes, use either Amazon's on-demand market or the spot market. Once you have set up these policies, as new queries come in they go to that same cluster, but it keeps adding nodes adaptively in response to the workload those queries generate, expanding the cluster within the policy framework until the policies are met and the queries are running. All of this happens adaptively and transparently to the user, from the platform itself. It goes even further than that: if you don't run any query for a configured number of hours, we take down the entire cluster and get rid of it. The first time the cluster is launched, we launch the minimum set of nodes. The trade-off here is that adding nodes takes time, so the first few queries take that hit; if you already know your initial workload always needs 10 nodes, why not bring up those 10 right away rather than having the algorithm adaptively figure that out and grow into it? On the cloud it is possible to bring clusters up completely on demand, which is not possible in a private data center. And a lot of analytics workloads are extremely bursty: you might need a hundred machines for a couple of hours in a day and just five or ten machines for the rest. The cloud substrate gives you the APIs and capabilities to do exactly that, and we make full use of them, so you run just the number of machines your workloads actually need.
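A minimal sketch of the kind of policy that answer describes, minimum and maximum cluster size, spot versus on-demand purchasing, and an idle timeout. The field names are hypothetical, illustrating the policy framework rather than Qubole's actual API.

```python
from dataclasses import dataclass

# Policy framework sketch: the user states bounds and purchasing
# preferences; the platform sizes the cluster adaptively within them.
@dataclass
class ClusterPolicy:
    min_nodes: int = 10
    max_nodes: int = 100
    use_spot_market: bool = True   # else buy on-demand instances
    idle_timeout_hours: int = 2    # tear the cluster down when idle

def target_size(policy: ClusterPolicy, demand_nodes: int, idle_hours: float) -> int:
    # Idle past the timeout: release every node; the data lives in S3.
    if idle_hours >= policy.idle_timeout_hours:
        return 0
    # Otherwise grow toward the demand, clamped to the policy bounds.
    return max(policy.min_nodes, min(policy.max_nodes, demand_nodes))

policy = ClusterPolicy()
print(target_size(policy, demand_nodes=40, idle_hours=0.0))  # 40
print(target_size(policy, demand_nodes=0, idle_hours=3.0))   # 0
```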
[Question about prioritizing workloads.] Good question. There are a couple of things you can do. Internally we use Hadoop's Fair Scheduler, which has levels you can set up: high-priority job, low-priority job. The second thing we have added, a feature that is just rolling out, is that within an account you can create multiple different clusters as well, so you can have a designated cluster that is extremely high priority, to which you send all the extremely high-priority jobs, and a separate cluster that is more ad hoc, and so on. So it's twofold: one, you can use a big hammer and simply separate out the clusters; or two, you can specify priorities in the API itself, saying run this at a higher or a lower priority, and deal with it that way.

I think I'm already out of time, it flies. Thank you very much for coming to the session, and if you have any other questions please feel free to ping me. I have my cards here; if you need one, just ask. Thank you.