So I'm going to talk about the Lambda architecture today. It's a new way to build distributed data systems, and I'll talk about how we use it at Indix to analyze large-scale, unstructured, dynamic data — we've been using it in production for about a year now. So this is going to be the story of how we ended up implementing the Lambda architecture at Indix.

Before we start, a quick overview of Indix. At Indix, we are trying to build the world's largest product database for brands, retailers, and developers. What do I mean by that? We provide price and product intelligence and answer questions like: am I priced higher or lower than my competitor on, say, a product like the Nikon D700? An interesting one is at the bottom of the slide: what is the average price change of all Nike shoes at Walmart in the last three months? These are the kinds of questions we answer for customers.

How do we do it? Let me walk through the data pipeline at Indix and how this data gets to customers. The first step is that we do a focused crawl of the web to get product information; what you see on the slide is a product page. The next part of the pipeline is what we call parsing, where we convert this unstructured — or rather semi-structured — HTML into structured product records. A product page might contain a title, a price, availability, specification text, and unstructured data like a description. We extract this data and convert it into product records using a combination of manual as well as machine-vision-based algorithms, and we do this for thousands of sites. The next step is classification, where we build machine learning models that annotate these product pages with categories from our taxonomy. After that, we take these categorized products and cluster them to find similar products that are sold across stores. For example, an iPhone 5 64 GB might be sold in multiple stores; we group those together using clustering algorithms. All of this finally goes into our catalog, which contains products and prices. The next step in our pipeline is analytics: we take the product and price catalog, do pre-computations, and derive insights across multiple dimensions. We also create a search index, which lets you search the data in the catalog. Finally, we expose all of this to end users through multiple endpoints: an app, an API, and mobile. One small announcement I need to make: we released version one of our API today. If you want to know more about us, you can go to developer.indix.com.

Okay, so that was a quick overview of the data pipeline at Indix. The next thing I want to do is talk about some of the challenges we have in building this pipeline. The first is that the data is dynamic. What do I mean by that? If you look at a product page, it is constantly changing, in the sense that e-commerce companies keep adding more information as they learn more about a product.
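To make that concrete, here is a minimal sketch of the kind of structured record our parsing stage might emit from a product page. The field names are purely illustrative, not the actual Indix schema; the optional fields reflect exactly the sort of data (availability, specs) that may show up on a page later and has to be picked up by a re-parse.

// Hypothetical shape of a parsed product record (illustrative only).
case class ProductRecord(
  url: String,
  title: String,
  price: BigDecimal,
  currency: String,
  availability: Option[String],        // e.g. "In Stock"; may appear on the page later
  specifications: Map[String, String], // structured spec text
  description: Option[String],         // unstructured text
  category: Option[String],            // filled in later by classification
  crawledAt: Long                      // epoch millis of the crawl
)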
So our parsers need to keep up with these changes; they need to extract this new information. For example, specification text could be added to a product page, and the next time our parsers run, they should be able to extract it. Similarly, there could be information that is present on the product page that our parsers could not extract for various reasons — maybe we simply haven't written code to extract it yet. Once we do, we need to go back, re-parse the historical data, and extract that information. Examples are availability, sizes, colors, and other attributes. There's more: our ML models are also evolving on a regular basis, which means our categorization can become finer or coarser. A product that was categorized as a shoe might later be categorized as something like a sneaker. These are things that keep changing as our ML algorithms evolve.

Now, what happens because of all this? Categorization is an important input to the rest of the pipeline, so when it changes, we need to go back and recompute everything downstream — the product and price catalog, and even the pre-computations you saw for analytics and the search index. You could try to go back and selectively update data, but that is too complex. And on top of that, you have to work at scale: our product catalog currently has about 400 million product URLs, on a daily basis we crawl about four terabytes of HTML data, and our data pipeline processes about 100 terabytes of data every day.

Given all this, we implemented the first version of our data pipeline almost two years back, and this is how it looked: a batch system built on HBase and MapReduce. Our storage layer was HBase and the processing was done with MapReduce. Why did we choose HBase? We were primarily inspired by Google's Bigtable paper, because we wanted a system that could store millions of HTML documents, and the wide-table architecture of Bigtable fits this well: you can store the HTML page in one column family and the information that gets extracted in subsequent steps in other column families. So we used that approach. For example, the crawler would write the HTML pages into the first column family, the parser would read the data from there and write its output into the next column family, and so on.

This was in production for a year, but we ran into multiple problems, and I'll go through them one by one. The first one — there's a red circle on the slide; any guesses what the problem could be? Focus on the arrows. I heard "mutation" somewhere — yes. That was one of the biggest problems for us: multiple systems would read from and write to the same column families, so there was a lot of mutation happening, and we ran into the same issues that plague concurrent systems.
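Concretely, the access pattern looked roughly like this: two independent systems mutating the same rows of one wide table. This is a sketch using the modern HBase client API; the table name, column family names, and qualifiers are hypothetical, not our actual schema.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object WideTableSketch {
  val conf = HBaseConfiguration.create()
  val connection = ConnectionFactory.createConnection(conf)
  // One row per product URL; hypothetical column families "raw" and "parsed".
  val products = connection.getTable(TableName.valueOf("products"))

  // The crawler writes the fetched HTML into the "raw" column family.
  def writeCrawl(url: String, html: Array[Byte]): Unit = {
    val put = new Put(Bytes.toBytes(url))
    put.addColumn(Bytes.toBytes("raw"), Bytes.toBytes("html"), html)
    products.put(put)
  }

  // The parser reads the HTML back and writes structured fields into "parsed" —
  // a second writer mutating the same row, which is where the trouble starts.
  def parseAndWrite(url: String): Unit = {
    val row = products.get(new Get(Bytes.toBytes(url)))
    val html = row.getValue(Bytes.toBytes("raw"), Bytes.toBytes("html"))
    val put = new Put(Bytes.toBytes(url))
    put.addColumn(Bytes.toBytes("parsed"), Bytes.toBytes("title"), extractTitle(html))
    products.put(put)
  }

  def extractTitle(html: Array[Byte]): Array[Byte] = ??? // parsing logic elided
}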
It was very difficult to reason about the real cause of any issue that occurred in these systems. Another problem with mutable systems is what I call human fault tolerance — or the lack of it. To put this in perspective, let me ask a question: how many of you have written an UPDATE statement and forgotten the WHERE clause? Please raise your hands. Okay, I was expecting more — I don't think you're a professional programmer if you've never done that. The point I want to make is that bugs are inevitable; we as humans make mistakes. So what we want is for our data systems to help us recover from those mistakes. In this system, if a bug crept in, there was a cascading effect, because each system takes as its input the output of the previous system. We would run into Byzantine errors, where a small error here shows up as a big problem in the downstream systems, and there was no way for us to recover, because the data had been mutated in place and the original state was gone. There are some war stories around this that I could tell; we used to run migration jobs to try to recover, and we were never sure whether we had recovered everything or not. The hard lesson we learned is that data systems should be human fault tolerant.

This was the second problem we faced — again, a question coming. This is a plot of requests per second versus time for HBase; writes are in green and reads are in blue. You can see the downward spikes there. Any guesses what causes them? Running a query? Okay, but reads were happening before that as well. What runs every 30 minutes? What's running every 30 minutes is compactions. If you're familiar with HBase, it uses compactions to make sure read performance is not affected. Let me spend a minute on what compactions are. To sustain a heavy write throughput, HBase stores all new incoming data in memory, in what is called the MemStore. As soon as the MemStore grows beyond a certain limit, it is flushed to disk, and the files flushed to disk are immutable. The problem is that if the files are small, read time is affected. So HBase runs compactions — which you can turn off, tune, or run on a periodic basis — that merge smaller files into bigger ones to improve read performance. While that is happening, because compactions are disk and CPU intensive, both read and write performance of HBase goes for a toss. The analogy is very similar to JVM garbage collection: you have to tune it. We tried a lot of tuning techniques, but we realized we could not guarantee the performance of HBase, because if you turn compactions off, read performance suffers instead.
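For a flavour of the kind of knobs involved, here are two standard HBase settings one typically experiments with — shown programmatically, although they normally live in hbase-site.xml. The values are illustrative assumptions, not a recommendation, and certainly not the tuning we ended up with.

import org.apache.hadoop.hbase.HBaseConfiguration

object CompactionTuningSketch {
  val conf = HBaseConfiguration.create()
  // Disable time-triggered major compactions (0 = off) so they can instead be
  // triggered manually during off-peak hours.
  conf.setLong("hbase.hregion.majorcompaction", 0L)
  // Let more store files accumulate before a minor compaction is considered.
  conf.setInt("hbase.hstore.compactionThreshold", 5)
}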
So what is the lesson we learned from this? Random-write databases are hard to manage at scale. The third problem: this batch process took about 16 hours per run, and we ran it continuously, in a while loop. The trouble is that the business will not accept that kind of latency — 16 hours is a lot; we wanted it to be a couple of hours.

Let me summarize the three problems: no human fault tolerance because of mutable state, operational complexity because of random writes and compactions, and high latency — and architecturally you can't guarantee low latency in a batch system, because of all the bookkeeping and overhead required for running the jobs. I'll come back to these three problems and refer to them again later in the presentation.

Okay. So it was chaos for us about a year back, primarily because of these three problems, and we realized we had to rethink our data systems from scratch. Our approach looked fine on paper — we had read the research papers — but it didn't work in reality. That's when we started looking into the Lambda architecture, and it seemed to be a step in the right direction.

So what is the Lambda architecture? It's an approach to building big data systems. It provides guidance on architectural components and principles, and it tells you how to tie batch and real-time systems together. It is general purpose: we applied it to our domain, but the concepts can be applied to yours as well. The term was coined by Nathan Marz, an ex-Twitter engineer and the creator of Storm, a real-time processing engine.

One of the key concepts of the Lambda architecture is in the next two slides; this is how we rebuilt our data system. We had HBase as the source of truth, and the application would read from and write to HBase. What we had to build instead was this: you store your raw data in an immutable store, which acts as your source of truth, you create processed views out of it, and your application reads from the processed views — it never updates the immutable data store. This is a slide I'll pause on for a few seconds. If I could time travel and tell my past self one technical thing two years ago, it would be this: if you need to build a data system, this is the way to do it. I'll refer back to this slide in the subsequent part of the talk. So again: this is the traditional approach, and this is the new approach we're talking about.

Okay, to make this concrete, let's take a small example. Let's try to find the count of unique products in any given category for the entire time range. The key parts are "any given category" and "for the entire time range", which means you can't do on-the-fly computations — you need to pre-compute these values. So let's take a stab at implementing a solution.
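That core principle — an append-only master store, precomputed views, and queries that never touch the master store — fits in a few lines. This is a minimal sketch; all the names here are made up for illustration.

// Immutable, append-only master store: facts go in, nothing is ever updated.
case class ProductFact(productId: String, category: String, timestamp: Long)

trait MasterStore {
  def append(facts: Seq[ProductFact]): Unit   // the only write operation
  def readAll(): Iterator[ProductFact]        // consumed by batch jobs
}

// Precomputed view, rebuilt from the master store by the batch layer.
trait CategoryCountView {
  def uniqueProducts(category: String): Long
}

// The application queries views; it never touches the master store directly.
def answerQuery(view: CategoryCountView, category: String): Long =
  view.uniqueProducts(category)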
And of course there are two requirements we can't forget: the data is dynamic, which means you need to recompute — or use whatever other technique — to make sure the dynamic nature of the data is taken care of, and it needs to work at large scale, because you might have millions of products in a given category.

Let's start with the batch layer. The first thing to decide is how to build this immutable data store: what are its properties? One stab at it is an append-only data store, where all new data is appended to what is already there. If new data arrives, it only gets appended to the end of the data store; there is no mutation. In the slide, I've depicted products as boxes, and the colors represent categories. We're not working with HTML data here; we're working with structured product data that has already been categorized. Now you can see that the data is fragmented: products sit at different time boundaries, so similar products are not stored together. If you go back to the query — we want the unique products in a given category — this data store is not conducive to answering it. What you need is a view that lets you answer those queries efficiently, and you can build it like this. One point here: we use HDFS for this vertical partitioning. You can write a MapReduce job that converts the raw, time-ordered data into a category-based view, where all the products in a given category are stored together — in HDFS you represent the categories as directories. The next thing is the actual calculation: another MapReduce job goes through each of these categories and counts the unique products in it. That gives you the batch view, which you can serve from HBase. The final step is that when somebody issues a query, you look it up in this batch view and return the data.

Okay, let's see how this meets the requirements. The first requirement was recomputation: whenever new data comes into the product master data, you recompute the intermediate views and the batch view, and replace the old batch view with the new one. The second requirement was scale: you handle that by using HDFS, MapReduce, and HBase, all three of which are proven to work at scale — HDFS and MapReduce in particular are known to work at petabyte scale across thousands of machines, and HBase itself scales to terabytes. So the two requirements our original pipeline had — recomputation and scale — are handled by this new design. Now let's see whether it handles the three problems; quick recap of the three problems, and let's look at them one by one.
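Before going through those problems, a small aside: the two batch jobs above boil down to something like this. It's shown with plain Scala collections purely to illustrate the logic — in the pipeline it runs as MapReduce over the category directories in HDFS — and ProductFact is the same hypothetical record from the earlier sketch.

case class ProductFact(productId: String, category: String, timestamp: Long)

// Step 1 (vertical partitioning): group the append-only product facts by
// category, mirroring the category-per-directory layout in HDFS.
def partitionByCategory(facts: Seq[ProductFact]): Map[String, Seq[ProductFact]] =
  facts.groupBy(_.category)

// Step 2 (batch view): count unique product ids within each category.
def buildBatchView(facts: Seq[ProductFact]): Map[String, Long] =
  partitionByCategory(facts).map { case (category, ps) =>
    category -> ps.map(_.productId).distinct.size.toLong
  }

// A recompute simply rebuilds the whole view from the master data and
// replaces the old one — no in-place mutation anywhere.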
What about human fault tolerance? By human fault tolerance I mean bugs, and how the system lets us recover from them. Say you have a bug in the batch jobs: no problem — you can discard the views and recompute them. What if you have a bug in the master data itself, where you've added incorrect data? The answer is simply to reprocess the master data and hide the old data. Here I'll refer to a talk from earlier in the day about data being immutable and data being a fact: we are not deleting the data, we are adding new data that hides the old, incorrect data. The implementation of this can vary, but that is the core concept. And of course you could have bugs in the query layer, but since it doesn't hold any state, you just redeploy it. We also get traceability as a side effect: because the computation happens in intermediate steps, if there is any problem you can trace back and see where it was actually introduced.

The other problem we saw was operational complexity because of random writes. In this batch layer there are no random writes: we use bulk updates to build the batch views. That sounds good — but what about latency? This is where I'll introduce a new concept called the speed layer. The speed layer compensates for the latency of the batch layer. What do I mean by that? Say your batch layer takes a couple of hours to run; any new data that came in during the last two hours is processed by the speed layer. How does it do that? When new product data arrives, it can be added to a queue — Kafka, for example — and then processed in real time using something like Storm.

Now, here is where some of the issues with real-time processing come into the picture. Remember the original problem we're trying to solve: unique products in a category. For uniqueness, you need to keep the products that are already part of the category in memory, so that whenever a new product arrives you can check whether it already exists. Now imagine keeping all the unique products in memory when you have, say, millions of products in a category. A typical implementation would use a key-value data store, or in-memory hash maps, and these won't scale. So what you do here is trade accuracy for memory. You can use data structures like HyperLogLog: these are probabilistic data structures that take much less memory but give you unique counts with some bounded error, and you can choose the trade-off — if you need a smaller error, you spend more memory. So for implementing algorithms in the stream layer, you have to trade off accuracy for memory.
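As a rough sketch of what that speed-layer counting might look like using Algebird's HyperLogLog: the 14-bit sketch size and the surrounding plumbing are assumptions for illustration, not our production code.

import com.twitter.algebird.{HLL, HyperLogLogMonoid}

object SpeedLayerSketch {
  // 14 bits gives roughly 1% standard error; more bits = more memory, less error.
  val hllMonoid = new HyperLogLogMonoid(14)

  // One HLL per category, updated as new products stream in (e.g. from Kafka
  // via a Storm bolt). This is the mutable state we deliberately keep small.
  var perCategory = Map.empty[String, HLL].withDefaultValue(hllMonoid.zero)

  def observe(category: String, productId: String): Unit = {
    val h = hllMonoid.create(productId.getBytes("UTF-8"))
    perCategory = perCategory.updated(category, hllMonoid.plus(perCategory(category), h))
  }

  // Approximate unique-product count for the window not yet covered by batch.
  def speedCount(category: String): Long =
    perCategory(category).approximateSize.estimate
}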
Some example data stores you can use here are Riak, HBase, or Cassandra. But one thing to understand is that here, random writes and updates are happening: whenever a new product comes in, we check whether it's already in the HyperLogLog; if it is, we don't need to add it, but if it isn't, we update it. So this is the first place where we have updates. And finally, you can query the HyperLogLog to get the unique counts from the speed layer.

Now let's talk about updates. I said mutation is bad, right? But the speed layer has mutation. Yes, the speed layer has mutation, but it deals only with a small amount of data. The batch layer might be working with months or years of data, while the speed layer is only working with a few hours or a day of data, which means it is easy to manage — it's a small amount of data compared to the batch data. What this gives us is what is called complexity isolation: the complexity of handling random-write databases has been moved from the large-volume batch layer to the small-volume speed layer.

The final step is merging the results. We saw how the batch layer calculates its view and how the speed layer calculates its own. Let's take an example: the category count you get from the batch layer is 50,000, and the speed layer, with its approximation, gives, say, 499. The merged query gives you 50,499, which is not what we wanted — the true answer is 50,500. So how do we get this accurate? That's where the beauty of the batch layer comes in: because the batch layer runs in a loop, the next run of the batch layer will override whatever was calculated by the speed layer and give you the exact value. So what you get is eventual accuracy.

With this, let me formally introduce the Lambda architecture. New data comes in and goes into the batch layer, into the master dataset, and batch views are created from it on a regular basis — you could have multiple batch views depending on how you want to query them. The new data also enters the speed layer in real time, to compensate for the latency of the batch layer. Finally, you have the query layer, which merges the two to produce results.

A quick overview of how we implemented the Lambda architecture at Indix; this is the overlay of our tech stack. New data comes from our real-time crawling engine; our master dataset is in HDFS; our MapReduce jobs run on Hadoop using Scalding, and we also use Spark for some of our ML algorithms; our batch views are stored in HBase and Solr; our real-time layer is actually implemented as a micro-batch implementation of the batch layer itself; and finally we have the query layer. A little deeper dive into our batch layer: we use an abstraction called Pail that lets us do the vertical partitioning I spoke about on top of HDFS — HDFS also has the small-files problem, and Pail solves it using what are called consolidation views. Scalding is a MapReduce abstraction that lets you write map and reduce jobs in Scala using Scala's higher-order functions.
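For a flavour of what that looks like, a Scalding job for the per-category unique counts might be sketched like this, using the typed API. The input format and paths are invented for illustration; treat it as a sketch rather than our actual job.

import com.twitter.scalding._

class UniqueProductsPerCategory(args: Args) extends Job(args) {
  // Input: (category, productId) pairs produced by the earlier pipeline stages.
  val products = TypedPipe.from(TypedTsv[(String, String)](args("input")))

  products
    .distinct      // drop duplicate (category, productId) pairs
    .group         // group by category
    .size          // unique products per category
    .toTypedPipe
    .write(TypedTsv[(String, Long)](args("output")))
}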
Our first version was written in vanilla MapReduce, which made it difficult to write complex workflows; Scalding has made that really easy for us. We also use Thrift for imposing schemas on the views, and HBase and Solr for serving views, which we build with bulk updates. The speed layer indexes are still work in progress: to reduce latency we are using micro-batches that start from the last batch run and apply bulk updates. If you want to know more details about how we implemented the Lambda architecture, you can talk to me and the team from Indix at our booth today and tomorrow.

I'll just spend a few minutes on the open challenges. One thing we've realized is that managing both batch and real time is still painful. There are different architectures at play: the speed layer is for low latency, but you have to trade off exactness and data size, while the batch layer is for exactness and large data, but you trade off latency. So managing both systems is really difficult. We are seeing two broad directions. One is abstractions that try to cover both architectures, where you write code only once and it gets executed in both — Twitter's Summingbird is an example. The other is the notion of unified stacks: Spark, with Spark Streaming as the speed component; LinkedIn is also proposing Kafka with Samza (or Storm) to unify the two; and Google last month released Cloud Dataflow, which again tries to solve the same problem.

In conclusion: the Lambda architecture is a different approach to building data systems. It's based on the solid principles of immutability and complexity isolation, and it's domain agnostic — you can apply it to your domain — but the tools are not yet mature. Here are a few resources: I'm actually writing a series of blog posts on the engineering blog at Indix — ten parts are planned and two are finished — which you can refer to if you're interested in a deep dive into our implementation of the Lambda architecture, and there are some other links you could refer to as well. I'll stop with the key takeaways and take questions.

Question from the audience: with respect to the Lambda architecture, there is actually a myth that it somehow overcomes the CAP theorem — what's your take on it? Okay, so in a sense it does try to beat the CAP theorem, because the hard problems are isolated to the speed layer. We know we have to choose between consistency and availability — let's take partition tolerance out of the way — and most systems would choose availability, so what you're dealing with is consistency, and that's where the problem of latency comes in: if you need to make your data system consistent, you need to ensure you're replicating all the data across the system, and in that case latency will be high. What I'm saying is that the speed layer ensures that this complexity is isolated to a small part of the system: you still have your batch layer, which is available and consistent, while your speed layer, for its subset of the data, is not consistent — but it becomes accurate eventually.

Another question: hi, I have a question here — where exactly do you store the raw data? Is it in HBase or in HDFS?
Right now we store it in HDFS; the earlier implementation stored it in HBase. The follow-up question was: why can't you store it in HBase as a different column family? That's a good question — our original implementation did store it in HBase itself. What used to happen is that the column family holding the HTML is very dense — you have 200 KB rows, for example — but your other column family, say your parsed data, only has a few small records. The questioner pointed out that HBase allows you to arbitrarily isolate and collocate data: you can move a column family to a separate set of machines and not mix it with your processed data. True, but when you're running MapReduce jobs, the job has to take the entire row and hand it to the mappers. Think about the number of regions: the HTML data is fine, but for the column family holding the sparse, smaller data, the amount of data in any single region is not high, which means the number of mappers spawned is large relative to the data they actually process. So yes, it is quite possible to isolate all your raw log data in a separate column family and not mix it with the processed facts — but then the problem shows up when you run MapReduce jobs on the non-raw, processed data, because ultimately everything is keyed by row: if, for a given region, the number of rows of HTML data is, say, a hundred, the corresponding column families in that region also only have a hundred rows, which means we're only processing a hundred records at a time, even though a mapper could handle a lot more of the processed data. That skew between column families — because the data sizes are so different — caused problems for us; we can talk about it offline. The questioner's point was that the other column families still get written into those same regions.
Exactly — you could actually have a lot more of the processed data per mapper, but because this other, denser column family fills up the region, only a subset of the processed data lands in each one. Yes — so the gradual progression we followed was to first move the HTML column family into a separate HBase table, and then move it to HDFS. That progression helped us: moving to a separate table was already more efficient, and HDFS with the vertical partitioning is definitely working really well for us.

We can take two more questions. Next question: I do see that the data is split in this design — one part in the speed layer and another in the batch layer. Say we have a query where I have to aggregate data across these two layers; how can we do such a query? So there are some constraints there. Firstly, the operations in your query should be associative and commutative, in the sense that, if you take sum as an example, if you get one here and two there, you can just add them to get three. But if you're talking about uniques, that operation is not easily aggregated, and that's where it gets a little tricky. It's something we are also struggling with: some of our algorithms are not easily aggregatable, and there are different data structures you have to use. The HyperLogLog example I gave is a good one — you can implement an associative unique count with it, but it gives you approximate values, not the exact ones. So I still see it as a challenge. Yes, it's a challenge. Thanks for your question.

Next question: hi, this is Gowri from Exactly Corporation. One of the design principles you mentioned on the slide was to reduce the batch processing time from 16 hours down to probably a couple of hours, right? So when you create different views of the master data, how does that help in reducing the MapReduce times? In fact, it should be higher, right? Well, you need the different views because of the final queries you're making — you need to optimize for those. One good thing about MapReduce is that you can scale linearly, but on the query side, if your view is not stored in the right indexing format, end users will have a problem. So what we do is increase the number of machines to ensure that we build the right views, optimized for the end user. Follow-up: so you build the row keys pretty much optimized for the query needs? Yes, the views are completely optimized for the final queries. If you can do it with a single view, you do; but in cases where you can't — where there are two different queries and a single view can't solve the problem — we actually create two different views. Thank you.
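As a closing illustration of that associativity point from the Q&A: plain sums merge trivially across the batch and speed layers, exact distinct counts do not, and keeping a HyperLogLog sketch on both sides restores an associative merge at the cost of approximation. This is a hedged sketch using Algebird; names and structure are assumptions for illustration.

import com.twitter.algebird.{HLL, HyperLogLogMonoid}

object MergeAcrossLayers {
  val hll = new HyperLogLogMonoid(14)

  // Sums merge trivially across the batch and speed layers.
  def mergeSum(batch: Long, speed: Long): Long = batch + speed

  // Exact distinct counts do not: 50,000 uniques plus 499 uniques is not
  // necessarily 50,499 uniques if the two windows overlap. Merging the HLL
  // sketches instead is associative, but the result is approximate.
  def mergeUniques(batchSketch: HLL, speedSketch: HLL): Long =
    hll.plus(batchSketch, speedSketch).approximateSize.estimate
}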