So firstly, thanks for coming to this session; it would be really great if we can make it interactive rather than a monologue. What we want to see today is how to select the right NoSQL technology for a given enterprise's big data technology stack. Obviously there is no straightforward answer, but over the course of this presentation we would like to see which aspects are more or less known to us, in terms of our experience and the maturity of certain problems in the enterprise space.

A little bit of introduction: I am Saurabh Mazumdar, I work for Infosys Limited, where I head the technology group for the big data practice. Right now I am personally involved, in an advisory capacity, with around five different enterprise clients, helping them select their big data technologies.

What I plan to cover today: first, NoSQL fitment in enterprise big data use cases, which creates the foundation on which an organization should select NoSQL; then the challenges in selecting NoSQL databases; a framework which can be used for selecting the right NoSQL technologies; and techniques for scalability analysis of NoSQL databases — just as, in the previous session, I was hearing from FoundationDB how they do the scalability analysis for their solution. I will also talk a little bit about the Infosys big data practice and our solutions around big data, which incorporate NoSQL and other big data technologies.

Coming to the first section, NoSQL fitment in enterprise big data use cases: there are multiple applications of big data, as all of us know, but in our experience these are the four particular areas where we mostly see big data working today.
The first is customer relationship management; second, sales and marketing related use cases; third, supply chain management — every organization has a lot to do there; and the fourth is plain and simple IT cost optimization. These are the four quadrants where we are seeing the maximum use of big data technologies.

Customer relationship management starts from knowing your customer — knowing the entire data around the customer — and goes to the level of giving a very fast, immersive experience: giving the customer the capability to find all their information quickly, the experience of fast search, all those things. In sales and marketing, pricing strategy and figuring out the effectiveness of different sales and marketing strategies are also pretty important and very much candidates for big data use cases. In supply chain management, starting from product quality management, supply chain effectiveness, and contract management, these are again cases for big data. Fourth comes cost optimization, which many of us talk about today as the augmented data warehouse, where you complement your existing data warehouse technology with big data technology; then low-cost near-real-time analytics, which is very important and really gathering momentum right now; and next-generation EAI, through which different applications can interact much more seamlessly, with very low latency.

Given these key use cases for big data in enterprises, if we look at what the reference architecture for big data should be: on the left side of it we see all the different types of data which accumulate every day in any organization.
It starts with transactional data — the regular relational data; then there is social data, generated from interactions; and then observational data, created through application logs and other systems. These three main kinds of data come into this big hollow box, what we call the big data platform; a bunch of processing happens; and eventually the use cases on top — the big data applications — try to actually use that data. Most of those use cases read from the platform, there are some which also write to it, and the results finally flow back to the systems through the data integration layer.

So which are these use cases? I am not saying this is the end of it — that these six are all there are — but these are the predominant use cases derived from the previous slides. The first one is dashboard reporting with drill-down: the very traditional use case where you create a report at the aggregate level, and after creating it people want to drill down to the nth level of granular detail to figure out the exact reason for a certain behavior. This can be any type of report — say a marketing analyst has run a campaign and is trying to figure out where exactly the campaign was really effective, and how many people actually responded to that marketing campaign and bought some product or service.
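The aggregate-then-drill-down pattern just described can be sketched roughly as follows; the campaign, regions, and customer rows are hypothetical stand-ins, not data from any real system:

```python
from collections import defaultdict

# Hypothetical campaign-response rows: (campaign, region, customer, bought)
responses = [
    ("spring_promo", "north", "c1", True),
    ("spring_promo", "north", "c2", False),
    ("spring_promo", "south", "c3", True),
    ("spring_promo", "south", "c4", True),
]

def aggregate_by_region(rows):
    """Top-level report: response rate per region."""
    totals = defaultdict(lambda: [0, 0])          # region -> [responded, total]
    for _, region, _, bought in rows:
        totals[region][0] += bought
        totals[region][1] += 1
    return {r: responded / total for r, (responded, total) in totals.items()}

def drill_down(rows, region):
    """Granular level: the individual customers behind one aggregate cell."""
    return [(cust, bought) for _, reg, cust, bought in rows if reg == region]

aggregate_by_region(responses)   # {'north': 0.5, 'south': 1.0}
drill_down(responses, "south")   # [('c3', True), ('c4', True)]
```

At big data scale the aggregate is usually precomputed in batch while the drill-down query hits the granular store, which is exactly where the serving-layer choice discussed later comes in.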
Next comes application context, or insight as we can call it. It can be as simple as throwing advertisements onto the customer's interface — it may be a gaming console, it may be some web application — where the advertisement is selected based on the customer's context. Then there are use cases like what-if analysis, where you have a particular report — say a report around value at risk, which is a very interesting use case in the financial domain — where you see an entire portfolio at a certain level and then apply a criterion: if the price or the index of a particular equity changes by 10%, how much does my entire portfolio get impacted? It is not essentially predictive analytics; based on the existing data itself — data which can give you a definitive result — you do some what-if analysis. It can be as simple as a customer representative getting a call — let us assume it is a manufacturing company — that there is a problem in a particular device.
So he is trying to figure out what the problems are and what could be the reason for them. Then comes search and discovery — typically the use case of data-scientist type people, who try to see the emerging patterns across different varieties of data. Then comes predictive modeling, where based on the situation you are trying to predict some implication. The prediction can be in terms of predicting that fraud is going to happen, or is happening, in an account-takeover type of scenario or in a particular credit card transaction; predictive modeling is equally applicable to things like a point of sale, where a recommendation is made while a particular customer is purchasing a product. The last one is enterprise application integration, where I am putting most things like caching and other online scenarios together under one bucket. In many cases NoSQL technologies today are being used as an intermediate caching store, where you capture the data as early as possible from the front-end application and then, using either strong consistency or eventual consistency, process the data. That is what we bucket as the enterprise application integration situation.

Now, these six being the key use cases for big data within the enterprise stack, let us see what the typical solution could be.
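The intermediate-caching-store pattern just mentioned (a fast NoSQL store in front of the system of record) can be sketched roughly as below. The in-memory dicts stand in for a real NoSQL store and backing database, and the key names are hypothetical:

```python
nosql_cache = {}      # stands in for a fast key-value NoSQL store
system_of_record = {"cust:42": {"name": "Alice", "tier": "gold"}}

def read_through(key):
    """Serve from the cache when possible; on a miss, load from the
    system of record and populate the cache for subsequent reads."""
    if key in nosql_cache:
        return nosql_cache[key]
    value = system_of_record.get(key)   # slow path: the backing database
    if value is not None:
        nosql_cache[key] = value
    return value

def write(key, value):
    """Capture the write in the fast store first. The system of record can
    be updated in the same step (strong consistency) or asynchronously
    later (eventual consistency); here it is synchronous for simplicity."""
    nosql_cache[key] = value
    system_of_record[key] = value
```

The strong-versus-eventual choice in `write` is exactly the consistency trade-off that drives much of the NoSQL selection discussed in the rest of this talk.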
Coming back to the technical solution: the first step is surely to get the data into this big box and process it — apply a bunch of transformations, maybe some machine learning algorithms, maybe different business logic and business rules — to get data that is consumable. Sometimes you may even need to consume the raw data itself, which is more a use case for the data scientist. In this entire picture, when it comes to the layer of applications, there are three types of interaction typically happening. One type is batch-like interactions, which I put on the left side, where the latency requirement is typically minutes to hours — people can wait a couple of minutes, or maybe one or two hours. Next come interactions at the level of a couple of seconds — typically a web application where the end user is waiting for the response and wants to see the data within a couple of seconds. And finally there is the sub-second level, where the end user wants to see the entire information within one or two digits of milliseconds.

In the current state of big data technologies, the first two cases — minutes to hours, and a couple of seconds — can be taken care of by the regular, non-NoSQL big data solutions that are part of the Hadoop ecosystem. When I say Hadoop ecosystem here, I am leaving out HBase, because that is itself a NoSQL system. Beyond that, anything which has to happen at the sub-second level is where we see the need for a NoSQL database. Now, obviously there are different schools of thought on whether the couple-of-seconds response can really come from the Hadoop ecosystem.
The caveat is that I am assuming technologies like Stinger, Impala, and Drill, running on an ecosystem like YARN, where you can have multiple workloads on the same Hadoop cluster. Whereas as of today, that couple-of-seconds type of interaction is actually also being done through NoSQL databases. So overall what happens is: the data comes from the regular systems and data stores — whether transactional data, interaction data, or application logs — into the Hadoop-type cluster, where the data processing happens. Then parts of the data in different forms — maybe some aggregations, maybe some machine learning model output — get transferred to the NoSQL data store, and from the NoSQL data store the final queries are served, which all these applications access in real time.

Given that these are the six types of use cases and this is the overall reference architecture that makes sense in the enterprise space, let us move to the next section: the challenges in selecting NoSQL databases. This is a very common representation of the characteristics of a NoSQL database. Take any of the systems of interest — MongoDB, Cassandra, Couchbase, HBase, Voldemort, whatever you want — these are the layers around which they provide their implementation. At the topmost layer you have the interfaces, where most options are predominantly available: REST-based interfaces, Thrift-based interfaces, language-specific APIs — and people are even thinking about how to put SQL on top. At the next level comes the logical data model, where there are different choices: key-value map, column family, graph database, document database.
At the third layer we have the data distribution model, characterized mainly by CAP support — whether, in terms of the CAP theorem, the system leans toward consistency and availability (CA), availability and partition tolerance (AP), or consistency and partition tolerance (CP). Then it is very important to figure out how sharding and replication are managed, and also what the options are for dynamic provisioning — because when it is about NoSQL serving a lot of use cases over huge data, dynamically adding and removing servers is very important. Finally you have the persistence layer, which in many cases is purely memory-based — GridGain, for example, is essentially a NoSQL database where the entire thing happens in memory, and we can probably say the same about Memcached. Then there are disk-based stores, combinations of memory and disk where the data moves transparently across the two layers, and in many places custom pluggable persistence.

Given all these varieties of characteristics, it is obviously very challenging to select the right NoSQL, because different NoSQL technologies support these characteristics to different degrees. As a result, the challenges we all see in picking the right NoSQL database come from both the use cases and the technology perspective. From the use case perspective, you need to support multiple types of documents and data sources, which also typically change over a very short time span — many times you get the data from third parties or other vendors, and the data interface itself changes over a period of, say, three or six months. There is a large volume of data to be handled, and it has to be related. You need flexibility in changing, say, the machine learning model.
Today you are probably using a support vector machine for the machine learning; tomorrow you want to change it to, say, a random forest. Then there is predictable response time, and scalability with general and seasonal load. On the technology side, there are too many NoSQL solutions, and you need to figure out which type of solution gives you the best total cost of ownership — because many times selecting a NoSQL technology is not just about selecting it and forgetting it; you also need to figure out how you scale, how you operate, and how you find the right type of skills. Integration with the existing technology stack is also very important, as is the absence of any standard, which is very much the state of the NoSQL world today; and finally, obviously, security and audit. Any other particular challenge you people are seeing in selecting NoSQL? I will move on — maybe we can discuss these things at the end of the presentation.

So now we know the use cases and we know the characteristics; let us come to the framework: what could be the right framework for selecting a good NoSQL technology?
These are the six use cases we discussed, represented here as the columns, and the features needed by each use case are evaluated here in terms of the level of maturity needed. To start with, things like read scalability: most of these applications need a huge volume of reads in a given second — scalability in terms of throughput and in terms of latency — and the same goes for write scalability. Then load balancing, which is very important: when you are trying to serve data at the scale of maybe a couple of gigabytes, spread across ten different nodes in a cluster, and say you are trying to support 100 requests per second, there is a very good chance that all the requests will flock to one or two nodes — those nodes get overused while the other nodes do not get used at all. The next important thing is secondary indexing, because many of these use cases need the capability to search on data which is not the primary key of a particular table or collection. Then one-shot access: this is also very important, especially when you try to build some transaction capability on top — once you do a search on a primary key, you would like to get the entire data from the same store, the same place. Then support for a multi-dimensional schema: this is where you figure out whether your particular use case needs just a simple key-value, or maybe a column family, or complex flexibility across a data structure which may go to the nth level of depth. And finally, though we are talking about NoSQL here, it is a very common ask today at every organization: how do I support SQL access to my NoSQL data store?
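The secondary-indexing need just described can be illustrated with a minimal sketch: in a plain key-value store you can only look up by primary key, so queries on any other attribute require a separately maintained index that must stay consistent on updates. All names and documents here are hypothetical:

```python
primary = {}       # primary key -> document
by_city = {}       # secondary index: city -> set of primary keys

def put(key, doc):
    """Store a document and keep the secondary index in sync."""
    old = primary.get(key)
    if old is not None:                            # on update, un-index the old value
        by_city.get(old["city"], set()).discard(key)
    primary[key] = doc
    by_city.setdefault(doc["city"], set()).add(key)

def find_by_city(city):
    """Query on a non-primary-key attribute via the secondary index."""
    return [primary[k] for k in by_city.get(city, set())]

put("c1", {"name": "Alice", "city": "Pune"})
put("c2", {"name": "Bob", "city": "Pune"})
put("c1", {"name": "Alice", "city": "Delhi"})   # update: the index must follow
find_by_city("Pune")    # -> [{'name': 'Bob', 'city': 'Pune'}]
```

Real NoSQL systems differ widely in how much of this bookkeeping they do for you, and in a distributed store the index entries may live on different nodes from the documents, which is why secondary-index maturity is a real differentiator in the framework above.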
Now, these are the key attributes or features needed, but at the same time I am leaving certain things out of this framework — security, for example, because all of us know security is probably needed in each of these use cases, and at the technology level none of these technologies currently gives very good support for security; it is still evolving.

With this as the overall guidance for the use cases, next let us rate the staple NoSQL choices against it: MongoDB, Cassandra, and HBase. Obviously there are many others, and it would not be possible to do the entire evaluation, but since these are the most popular I picked these three — and within Infosys we also have the most experience with these technologies. If you look at MongoDB, it is probably good in most of the cases apart from write scalability and load balancing. One reason it is so popular today — many people are using it — is that it is also comparatively mature: it has been in use since around the 2008-2009 time frame. Then we have Cassandra, which exceeds MongoDB considerably in write scalability and load balancing, because Cassandra goes by availability and partition tolerance rather than strict consistency. That gives it the scope to really increase write throughput: all the data gets written first without any consistency check, and at read time Cassandra resolves any conflicts to figure out the right data to serve.
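The write-first, resolve-at-read behavior just described for Cassandra can be sketched with a last-write-wins, read-repair toy model. This is a simplified illustration of the general idea, not Cassandra's actual implementation, and the replica layout and keys are hypothetical:

```python
replicas = [{}, {}, {}]   # three replica nodes; each maps key -> (timestamp, value)

def write(key, value, timestamp, reachable=(0, 1, 2)):
    """Writes land on whichever replicas are reachable, with no consistency
    check at write time -- this is what keeps write throughput high."""
    for i in reachable:
        replicas[i][key] = (timestamp, value)

def read(key):
    """Collect all replica versions, serve the newest (last-write-wins),
    and repair stale replicas as a side effect (read repair)."""
    versions = [r[key] for r in replicas if key in r]
    newest = max(versions)              # tuple comparison: newest timestamp wins
    for r in replicas:
        r[key] = newest
    return newest[1]

write("k", "v1", timestamp=1)
write("k", "v2", timestamp=2, reachable=(0,))   # partial write during a partition
value = read("k")                               # -> "v2"; all replicas repaired
```

The trade-off is visible in the sketch: writes never wait for agreement, so the reconciliation cost is paid at read time instead.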
Cassandra is also very good at load balancing because of its consistent hashing, and in the latest version there is the concept of virtual nodes, where even a particular range of the key space can be distributed across multiple nodes — so it is very good in that respect. When it comes to HBase, HBase is predominantly good at write scalability and at the strict consistency level — since Hadoop and HDFS are all consistency oriented, that is the biggest positive characteristic of HBase — and it is also very good in terms of scalability. On SQL access, HBase is probably where the majority of the work is going on compared to the other databases like Cassandra and MongoDB, so there it is slightly better than the other two. Any other viewpoint on these things based on your experience?

Now comes this complex framework — and I purposefully put up this complex picture — just to show that right now, in any enterprise context, there cannot be a single silver bullet which takes care of all six use cases and their requirements across the different architectural or technical qualities, whether read scalability, write scalability, or the rest. You will always find that a couple of use cases are good for MongoDB, a couple are probably good for Cassandra, and a couple are good for HBase. The main question then is how you go about it, and to answer that — how to go about selecting the right NoSQL technology — the main point to remember is that the further forward we go, the more there will be a need for polyglot persistence. You cannot have one type of database technology, one persistence store, which takes care of all the different use cases; that is almost impossible. There is some work going on where people are trying to merge things together — I was listening to the FoundationDB talk, where they say that on top of FoundationDB you can define different layers, whether document based or key-value based, and still get ACID transactions — but I still suspect it may take another five to ten years to really make all the use cases converge to a single solution, if that is at all possible.

Having said that, what techniques can you still use to select a NoSQL technology, at least from the perspective of scalability? The basic technique we use very regularly is a queuing model, to figure out the bottleneck and scalability of the technology of your choice. There it is very important to select the right type of load, the right data mix, and the right depiction of the software and hardware model. Here is a very simplified hardware execution model for, say, HBase. The main thing I would like to show here is that it is not just about modeling the data nodes or name node of HDFS: when you are modeling a workload on HBase you also need to model the HBase region server, you may need to model the HBase master, and for each of those processes an understanding of how it works is important, so that you know whether it is more CPU bound, memory bound, or both. For HBase, you know that the HBase master is typically CPU bound — it really does not have much to do with memory — whereas the HBase region servers have quite a bit of contribution from memory, because of the way the write-ahead log (the HLog) works. The key points to benchmark for a NoSQL technology choice include ensuring that the workload- and service-center-specific service demand is linearly constant over a large volume of requests. So if you see a lot of requests hitting the region server CPU, we need to figure out whether the region server CPU is
performing consistently over a range of workload volumes, maybe starting from 100 requests up to 10,000 requests. Then write scalability of representative complex entities — what is the insert and update rate — is very important to figure out. Similarly for read scalability, it is important to figure out how the system behaves when the entire working set is in memory versus when most of it is on disk, and that also gives us the next important parameter: what percentage of the working data set is good for a particular NoSQL technology. Then optimal replication: to how many nodes would you like to replicate to ensure your availability is maintained? Then the optimal strategy for the sharding key — salting it so that you do not get into lopsided distribution — and also the latency of the SQL layer, if you are using SQL on top of NoSQL at all.

With this I am mostly through with the main presentation. Here are a couple of slides on our Infosys big data practice, where we work with multiple types of NoSQL databases — we have partnerships with most of the different product vendors — and most importantly, we have created an IP which we call BigDataEdge. BigDataEdge is largely about how you integrate a NoSQL solution with a Hadoop-type ecosystem, so that you can very easily move the data from the basic Hadoop-based processing into NoSQL for ease of access. As you can see here, you can create an entire workflow for your data processing — and where you see MongoDB here, you can put Cassandra or HBase or any particular technology — and this workflow can finally implement the use cases we saw in the reference architecture. That is the overall paradigm we selected, based on the reference architecture, to ensure that depending on the NoSQL choice you can have different types of use cases implemented with less cost and less time to reach the final implementation. So here ends my overall talk; if you have any questions, please go ahead.

That is a wrong notion — that because we are using HBase, we can still use the same Hadoop cluster. HBase typically cannot perform within the same Hadoop cluster where you are running Hive or MapReduce, because the block size needed for a regular MapReduce-type cluster is much higher and the block size needed for HBase is much lower — one is at the megabytes level, the other at the kilobytes level — so this essentially will not work in the same cluster; anyway, you have to have two different clusters.

Again, I would still say you probably need to look a little more deeply at whatever your use cases are, because NoSQL is more about random access, and random access needs the data block size to be much smaller, whereas regular data processing is more about sequential scans, where the bigger your data block size, the more performant it is. And beyond whether it is HBase or Hadoop, if you look at the framework presented here, there are all these other parameters you need to consider, and HBase may or may not excel across all of them for your requirement. That is the reason you need to go through the scalability and performance benchmark, to see whether you can really achieve it — having said that, it may as well be very suitable for your case. The only way I can see an HBase cluster mapping onto the same Hadoop cluster is when there is some integration through YARN, using ORC (optimized row columnar) files — those are probably the only cases where it would be possible; otherwise I would not say so. Any other question? HBase, right — obviously you can replicate the same approach for those cases as well, as the parameters are pretty much common across any of these use cases, just for the sake of brevity. Okay, if there are no more questions then we will close the session now. Thank you.