So, I work for Myntra, which is India's largest online fashion store. I am going to talk about a data platform which we have been building for the past 7 months now; the USP for us is that it is cloud based, quite scalable and low cost. Okay, so how many of you have heard of Myntra.com, and how many of you have actually bought anything from it? So yes, Myntra is a fairly popular website. Last time I checked, it ranked among the top 50 websites in India on Alexa.com, but still we are nowhere near the scale of Google or Facebook, or even Amazon or eBay. So you should take whatever I present with that pinch of salt. My prime audience for this are people working in small to medium startups who want to foray into big data but are inhibited by the lack of upfront investment, or by the feeling that it is going to be quite a challenge to maintain such a system, or because they do not have the engineering bandwidth, etc.

Okay, let us start. Why do we need big data? Why do we need to look at big data in a company like ours, which is purely a business driven company? Why do we need to hunt elephants? We need it because we need to know what to sell, whom to sell to, and how to sell.

What to sell? We are not an FMCG company, and we are not an electronics company. In FMCG, the top 50 products will keep on selling this month, next month, next year, probably even after 3 years. The pattern is predictable: you can predict which products you have to focus on and which you do not. Similarly in electronics: Apple releases the iPhone 5 and sells 2 million handsets on day one; BlackBerry releases the Z10 and there are hardly any buyers. Both of these were expected, so you can predict with a fair bit of confidence which products are going to do well. Things are not the same in our industry, the fashion industry, where each product has a short life cycle of 3 to 6 months. Products will keep coming in; some will take off, some won't find many buyers. So we need to know which products are doing well, so we can promote them, and which products are not doing well, so we can liquidate them, and that too in a short span of time.

Whom to sell to? We are not a very niche market; we are open to almost every online user in India. There can be people for whom budgetary constraints are paramount, and there can be people who are very fashion oriented and always want the new season's products. Also, people in the northern part of India will purchase winter wear in December or January, but people sitting in Bangalore might not need it. So we have to know whom we are serving.

The how is a very important part as well, because some products are such that they will need discounts, and some products are such that we will have to promote them as high fashion. If we invest, let's say, $5 on Facebook marketing versus $5 on Google marketing, which one is fetching us more returns? So we need to know the how part as well.

These are some very high level use cases; let's look at some more concrete ones. The first is deciding the product display order on list pages. In Myntra we currently have more than 100,000 active products, and probably around 50,000 products which are in stock. Of course not all of them can be displayed, so what should be the display order? A very good signal is that products which have a high click-through rate are generally attractive, so let's try to promote them.
So we take impressions of all the products, along with whether they were clicked, purchased, added to cart, etc. Some products will turn out to be more attractive than others, and we give a boost to such products. Now suppose I have to model this the traditional way, the RDBMS way. It's fairly easy to model: I'll have to do some summations and some mathematical operations, and what you see here is probably a fair representation of the query I'd write if I had to model this in an RDBMS. The problem is that there are more than 100,000 products and more than 500 million impressions a day. This is going to be difficult to scale. I'm not saying it can't be done through an RDBMS, but if you are using commodity software like MySQL and commodity hardware like ours, then it's going to be difficult to scale.

Let's look at another use case: user segmentation. Different users have different browsing patterns, so let's segment them based on their history and provide them different experiences. If a loyal customer comes to my website, I would like to showcase newer products to them. If a customer is coming for the very first time, I might like to showcase the best sellers and provide an experience where buying is very easy, because he or she might not be familiar with us; but if a loyal user is coming, then we can showcase the new features. Again, modelling this in an RDBMS is not very straightforward, but probably what I'll do is look at the depth traversed by each user: how many users come to the home page, how many look at products, how many buy or add something to cart, etc., and then do a group-by operation to find out which segment each user belongs to. The challenge: currently we are getting more than a million unique users a day, and this number is only going to increase. They are coming from different browsers and different devices, from iPads, from phones. So collating them is going to be difficult, and hence it's difficult to scale.

Another use case: we have to recommend products which you might like. A very basic approach can be computing scores for a product based on its attributes, and attributes for us can be gender, brand, what kind of product it is, the colour of the product, whether it's winter wear or summer wear, what its cost range is, etc. Then compute scores for users based on the products they are browsing or purchasing, and recommend them similar products. So if you are always browsing t-shirts, it makes sense for me to recommend you t-shirts; if you are always browsing ethnic wear, it makes sense to recommend ethnic wear. Now, can I model this in an RDBMS through SQL? Probably yes. We will have 10 or 15 attributes for any product, give different weightages to each, and then assign weightages for users accordingly, depending on whether they are buying products, looking at products, etc. But it's very difficult to compute: these weightages are ever changing, we can't fix in advance what the weightage of any attribute will be, and the attributes themselves keep changing as well. Also, with so many users and so many products, if I do a join between them, it's not easy to scale on commodity hardware. So we know that, for such use cases, we need to look out of the box, and probably big data is the solution for us.
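To make the first of these use cases concrete, here is a minimal sketch of the click-through-rate idea in plain Scala. This is not Myntra's actual job; the Event shape, the field names, and the in-memory collection are assumptions standing in for what would really be a distributed computation over hundreds of millions of impression events, and loadEventsForDay in the usage comment is a hypothetical helper.

```scala
// Sketch: rank products by click-through rate so that attractive products
// can be promoted higher in the list-page display order.
case class Event(productId: String, action: String) // action: "impression", "click", "addToCart", ...

def rankByCtr(events: Seq[Event]): Seq[(String, Double)] =
  events.groupBy(_.productId).map { case (productId, evs) =>
    val impressions = math.max(evs.count(_.action == "impression"), 1) // avoid divide-by-zero
    val clicks      = evs.count(_.action == "click")
    productId -> clicks.toDouble / impressions
  }.toSeq.sortBy(-_._2) // highest CTR first

// Hypothetical usage (loadEventsForDay is not a real function here):
// val topProducts = rankByCtr(loadEventsForDay("2013-07-11")).take(50)
```

The same aggregation is easy to express as a single SQL query, which is exactly the point the talk makes: the logic is trivial, the scale is not.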
But being a business driven company, there are constraints: development has to be fast paced, we have to show tangible results, the budget is limited, and engineering bandwidth is low. Most of these are the same constraints which Edward talked about in his keynote speech as well. A business wants to move fast; they don't have patience. If I tell them, yes, I'm building a platform which will be ready by 2015 and will solve each and every use case in the world, that's not good enough for them, because we are not an R&D firm. So we have to time-box the development cycle; we have to say that we'll deliver results in two or three months for the first version, and then iterate probably every alternate week. Also, bandwidth is typically low in start-ups; for example, our platform was built by two engineers. Most start-ups go through such constraints.

So what are the design goals, if I have to design keeping such use cases and such constraints in mind? My solution should be able to scale up and down. This is non-negotiable, because on day one the data volume I receive is going to be very low and the use cases I'll be catering to will be fairly simple, so my solution will be simple. But as and when we grow, the data volume will keep increasing and people will start asking more and more difficult questions, so complexity will increase, and the scale should increase with it. Also, there will be some use cases where we want to compute not one day's worth of data but probably a quarter's or a year's worth; in such cases the solution will go through a spike. If my solution is not able to handle both the ups and the downs, then it's not going to solve a lot of business problems.

Again: record data now, ask questions later. This should be the philosophy for everyone working on big data, because if you say you'll only record these 10 or these 100 data points, you might miss out on something which could prove beneficial tomorrow, or next quarter, or next year. We should be greedy while recording data. Probably some of the data points will never be used, but still we should be greedy, and that's why we have to cater to a generic data model. This is also why people have moved away from RDBMS to things like NoSQL storages, document storages, storages with unlimited columns: because they want to cater to a generic data model.

We need to segregate reads from writes. There are many solutions (in fact, the one which I worked on earlier was based on MongoDB) that have a single point for both writing and reading. Now, if my solution has to scale up, then sometimes it will be very, very read heavy, and in that case writes can get blocked, which is a big, big bottleneck. So segregating reads from writes is a design goal. It should have a low running cost, of course, and a low maintenance overhead, of course.

Most of these things hint towards cloud computing, towards moving to the cloud. Let's quickly go through some of the pros and cons of the cloud. Most of them are general opinion; people can argue both for and against each of them, but the pros and cons which I'm going to list are general sentiments. The pros are, of course: pay as you use, scaling is definitely easy, and we get managed services.
And because we have to have a low running cost and a low maintenance overhead, and because we have to cater to a faster development cycle, these pros help. Of course, there are cons. One of the prime cons is that performance is not comparable to what you get in a physical data center. But the solution we are building is essentially for batch processing (it has a real-time component as well, but I'm not going to talk about that here), and because it's a batch processing system, performance is not of paramount importance. It is important, but not paramount. Reliability: yes, even the cloud providers themselves say that reliability is lower, and you, the developers, have to build for failure. Very much so. Security: people say that data security is a concern. Yes, there are now many solutions coming up to answer this, but security is not as strong as in your own physical data center. Control: because a lot of things are hidden from you, you don't have very low-level control. But again, we are working at a high level, so control is not of paramount importance either.

Okay, so let's look at what a very basic data system looks like. It should have a data accumulation layer, a storage layer, and a data crunching layer. A more mature system will have many more layers: probably a proper scheduler layer, a good monitoring layer, a layer for exporting and importing data from your traditional systems. But to begin with, you need these three layers. What are their characteristics?

The accumulation layer has to be highly available; if it's not available, then you might miss out on data which you don't want to lose. It should have low latency, because otherwise other systems might not want to talk to it. And it should be agnostic of the storage which is going to be used. There can be different storages; the NoSQL movement alone now has 20 or 30 different ones. If the accumulation layer can cater to only one storage, then it's not solving the purpose: tomorrow you might want to move to another storage, and this layer will become the bottleneck.

The storage has to be highly reliable. Again, non-negotiable, because we are saying that we are working on data, so we cannot afford to lose any of it. It has got to have huge capacity. Again non-negotiable, because as you grow, your data will keep increasing, and you don't want to discard old data or archive it; that's why the capacity should be huge. It should cater to any data model: the relational model, the non-relational model, the document model, et cetera. And it has to be cheap. If it's not cheap, then probably your CFO is going to knock on your door and say, cut this down, we don't want this. It has to be cheap.

Data crunching: it should scale up and down in a fairly straightforward way, and it should be essentially distributed, because there will be cases when you can't scale up a single machine. It should also be very easy to use, because generally, while developing such a platform, you will be done with the first and second layers in some time, and then crunching will happen day in, day out. People will start asking more and more difficult questions, so if it's not very easy to use, then the developer is the person who will suffer. Okay.
So, keeping in mind such constraints and such design goals, what is the architecture which we used? This is the high-level architecture for our system. We receive requests over HTTP; the requests are POST requests, and the data comes as JSON in the POST params. A web server based on Finagle receives this data and writes it into Kafka. Then a batch job runs every five minutes to transfer these events from Kafka to S3, and we run EMR on top of that. Let me walk through each layer.

Why HTTP? There are solutions like Scribe, Flume, etc., which capture logs in real time from your web servers and application servers, and sometimes it's easier to transmit logs at a small frequency, say every 15 minutes or every hour, into your eventual storage. But the problem is, if you are using, let's say, 50 servers to serve your website, they might be in different data centers and running different OSes, so ensuring that continuous agents like Flume or Scribe keep working is a bit difficult, compared to an HTTP request: every website already makes tens or hundreds of web service calls, so this is very easy for the other systems to code against and very easy to maintain at our end. Things like high availability and reliability come along with it. Also, we want to capture some data from our application servers and some data directly from the browsers; if we were collecting data through log files, then we would need a separate endpoint to which our browsers could send data. That's why we went ahead with HTTP requests.

For the web server, we didn't use things like Tomcat or nginx, because we wanted very high performance and the functionality is pretty limited. We went ahead with Finagle, an open source technology from Twitter. It can be used to build highly concurrent servers. It is built on top of Netty, it supports asynchronous operations, so your latency will always be low, and it's very flexible: Twitter is using it as an app server, as a load balancer, as a web server, etc. It's very easy to use as well; in a later slide I'm going to show you one line of code that demonstrates how easy it can be.

We are using Kafka for event aggregation. Apache Kafka is again open source, developed at LinkedIn. It is a persistent queue, it is distributed, and it can handle very high throughput; they have tested it beyond 200,000 requests per second. It can have multiple subscribers, so you can fork out one real-time processing engine and one batch processing engine on top of it. You can segregate events based on their properties, and if a consumer is not interested in some kind of event, it can very well discard it. It's written in Scala, but it still beats technologies built in C and Erlang, like ZeroMQ or RabbitMQ. The reason is that it uses things like the OS page cache very effectively and doesn't create intermediate objects, so GC never kicks in and the performance is very predictable. Kafka is distributed: it uses ZooKeeper to keep track of which are the servers, which are the producers and which are the consumers. ZooKeeper is a distributed configuration management system; it has become sort of the de facto configuration management system in technologies like Hadoop, and since Kafka has a very strong dependency on it, we use ZooKeeper as well.
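To make the accumulation layer concrete, here is a minimal sketch of a Finagle HTTP service that accepts event POSTs and hands them to a Kafka producer, along the lines described above. This is not the production code referenced in the talk: the port, the topic name ("events"), the broker list and the producer configuration are illustrative assumptions, and it uses the current kafka-clients producer API rather than whatever client version was in use at the time.

```scala
import java.util.Properties
import com.twitter.finagle.{Http, Service}
import com.twitter.finagle.http.{Request, Response, Status}
import com.twitter.util.{Await, Future}
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object EventCollector {
  // Assumed configuration: broker addresses and serializers are placeholders.
  private val props = new Properties()
  props.put("bootstrap.servers", "kafka1:9092,kafka2:9092,kafka3:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  private val producer = new KafkaProducer[String, String](props)

  // Finagle service: take the JSON body of the POST, write it to Kafka, respond.
  val service: Service[Request, Response] = new Service[Request, Response] {
    def apply(req: Request): Future[Response] =
      try {
        producer.send(new ProducerRecord[String, String]("events", req.contentString))
        Future.value(Response(Status.Ok))
      } catch {
        case _: Exception =>
          // Still respond so callers are never blocked; real code would log and alert.
          Future.value(Response(Status.InternalServerError))
      }
  }

  def main(args: Array[String]): Unit =
    Await.ready(Http.serve(":8080", service)) // listens for event POSTs over HTTP
}
```

The key design point from the talk holds even in this sketch: the writer's only job is to get the event onto a durable, horizontally scalable queue as quickly as possible; everything downstream reads from Kafka, keeping reads segregated from writes.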
Our eventual storage is Amazon S3, the Simple Storage Service. It comes with practically infinite capacity; I don't know how many zettabytes of storage they support, but for an end user it's infinite. It comes with very high durability and very high reliability, so there is a very remote chance of any event getting lost. It is flat file storage, so it can cater to any data model, and it is cheap, much cheaper than your own disks.

For moving data from Kafka to S3, we wrote a consumer. The way Kafka works, it keeps everything on disk and there can be consumers listening to it; it isn't like your normal AMQP consumer reading message by message, it reads messages in bulk. For us, the Kafka layer can be scaled horizontally and the eventual storage, S3, is definitely scalable, so this intermediate layer should also scale horizontally. That's why we built it as a Hadoop-based consumer layer which can scale as and when we need.

So the data is in S3, and what we run to crunch it is EMR, the Elastic MapReduce service provided by Amazon. It's very easy to scale up and down; you just fill in a config file to say that I need 10 servers instead of 20, or 100 servers instead of 10. Again, it's pay as you use, so it won't burn your pocket if you are not running anything.

This is the architecture which we are using; let's look at the numbers we have been able to handle. Currently we are getting close to 20 million events every day. They correspond to close to 800 million data points, and 25 GB of data is getting added daily. We are running close to 100 jobs a day, and the biggest job has a footprint of 100 days or so, which comes to around 2 billion events. The cost, as I said, was one of the criteria. Currently we are paying close to $35 daily, which is around $1000 a month. That's not nothing, but had we developed this on physical hardware, the cost would have been at least 3 times this. Also, we can shut it down at any time: if we are no longer interested in extracting any data, if the trends we have analysed so far already give us a good enough direction, then we can cut down the jobs part completely, and the steady cost comes to close to $20 a day.

What are the key learnings from this project? I was saying that Finagle is very easy to use; you can literally code in English. This is code taken from one of our live servers: my service handles exceptions, records the event in Kafka, and then responds. Again, I talked about Kafka previously. Generally, while evaluating any system, we ask what language it's written in and assume that those written in C or Erlang will be more performant, but Kafka is an exception. I encourage everybody who is interested in building scalable solutions to at least look at the design discussion on the Apache Kafka website. It is very detailed; they explain how they have maintained performance and scalability without making their developers suffer by writing in languages which are very difficult to deal with.
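As a rough illustration of the Kafka-to-S3 hop and of the date-partitioned key layout discussed below, here is a sketch using the plain kafka-clients consumer and the AWS S3 SDK. The real pipeline is a Hadoop-based consumer running every five minutes, so this is only an approximation of the idea; the bucket name, topic, group id and key are assumptions, not production values.

```scala
import java.time.Duration
import java.util.Properties
import scala.jdk.CollectionConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import com.amazonaws.services.s3.AmazonS3ClientBuilder

object KafkaToS3 {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka1:9092") // assumption
    props.put("group.id", "s3-archiver")          // assumption
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(java.util.Arrays.asList("events"))

    val s3 = AmazonS3ClientBuilder.defaultClient()

    // Drain what has accumulated since the last run and write it as one flat object,
    // keyed by event name and time (bucket/event/year/month/day/hour/minute, as in the talk).
    val values = consumer.poll(Duration.ofSeconds(30)).asScala.map(_.value())
    if (values.nonEmpty) {
      val key = "clickstream/2013/07/11/10/05/part-0000" // illustrative key only
      s3.putObject("analytics-events", key, values.mkString("\n"))
      consumer.commitSync()
    }
    consumer.close()
  }
}
```

Because the consumer reads in bulk and the key encodes the time window, any batch job can later select exactly the slice of raw data it needs by S3 prefix, without any index.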
Also, while working on data, we are always conscious about how we will extract it, and we tend to say that data stores which provide secondary indexes are easier because we can query on them. Yes, definitely that's the case, but you can still search without an index. This is typically how we arrange our data in S3: the bucket name, then the event name, and then segregated by year, month, day, hour and minute. So if we have to look at any data, it's essentially going straight to that data point; it won't be like a MySQL query, but you can still access the data in very little time.

On the EMR cluster, the Elastic MapReduce cluster, a piece of advice to anybody who is thinking about it: it's very easy to start with, but if you are not monitoring it well then it won't stay cheap, because all the pricing is at the hourly level. If your job runs for 1 minute, you still pay for 1 hour; if your job runs for, let's say, 1 hour and 1 minute, then you pay for 2 hours. So you should segregate your load, have some always-running clusters, and decide which cluster to submit each job to. Also, Amazon has this concept of spot instances; you should use them effectively. And m1.small is not that small: all the servers in the architecture we saw on the previous slide are m1.small, and so far things are working fine.

I've been a fan of awk and grep for a long time, but this project really drove it home, because sometimes you just want to get a very basic look at what data is coming in, and awk and grep can help you a lot there. And the Apache mailing lists: this is me before I started working on this project, and since then I have pulled out each and every strand of my hair, and the beard as well. So this is my request to Apache: some of the finest engineers in the world work for you, please, please, please make your mailing lists more usable. They are very difficult to navigate and very difficult to search; take some inspiration from Stack Overflow or Google Groups.

So with this, well in time, I've come to the end. We are hiring; this is the email you might want to forward your CV to. Okay, questions.

So, S3 is the eventual storage? Yes, the question was: we are using S3 for storage, and it's definitely reliable, but the write performance is slow, so how do we deal with that? We are not using S3 as a streaming endpoint. We write everything into the Kafka brokers first, and every 5 minutes a job kicks in which transmits it to S3, so S3 is not a bottleneck for us. And the real-time analytics which we do is not done on S3; that is done directly off the Kafka brokers. Only the batch processing happens through S3, and for that a delay of 5 minutes, 10 minutes or even an hour is good enough.

Are these Kafka servers EBS backed, and how many Kafka servers are you using? Three. Can you give details about how many EMR servers are being used? Okay, so the EMR servers are different clusters for different uses; like I said, there are some jobs which are hourly and some which are daily. Our daily cluster is a 10-node cluster currently, and we keep tuning it based on how much time the different jobs are taking. So let's say my whole day's quota gets done in 2 hours 50 minutes: I will increase the capacity by 10% to get the compute done within 2 hours. If it's taking 2 hours 45 minutes or 3 hours 45 minutes, then I will decrease the number of instances so that we stretch to the hourly boundaries. Okay, and what is the memory of each node? Around 24 GB. Okay, fine.

One more thing: do you use personalization based upon user browsing history? Personalization based on user browsing history is very much in the pipeline. The analytics part for that is almost over; now it has come to the website
team to decide how to personalize there; probably in 3 to 4 months, if you browse on myntra.com, you will see it.

How do you compare your architecture to the Lambda architecture? I suppose you have heard of it? I am not aware of that, fine.

Hi, sorry, I am going to interrupt. There is a queue here and a lot of people who are asking questions are not getting to ask, so if we can just follow the pattern; also, the mic gets recorded, that's all, so wait for the mic to come to you and then talk. Is that okay? Sorry.

How are you managing your EMR jobs right now, because AWS doesn't provide you any job scheduler or anything, right? We have written custom Java code for triggering the jobs and for finding out which is the appropriate cluster to fire them to. The monitoring part is currently something we are working on. We have talked with the AWS people and they suggested a couple of things; I don't remember the names offhand, one was developed by Yelp and another by Netflix. If I try to remember and look into my notes... if you catch me here afterwards, I can tell you. But we have not evaluated them yet.

Hi, how are you managing historical data? Does it also remain in S3? It also remains in S3. Something I forgot to mention here: you can see the EMR jobs are taking input from S3 and writing their output back to S3, so the output of all the jobs is in S3, and that's why we don't recompute. Let's say I have to get 15 days' worth of a pattern; for some extreme use case I might have to recompute, but typically that's not the case.

Are you also thinking about Amazon Glacier for your historical data? Glacier doesn't solve the problem, because Glacier is write once, read never. You need it for things like tape backups, where there is a very low probability of ever reading the data. You can move your historical data back to S3 in case you want to re-aggregate it, but that has its own cost involved. To be fair, we didn't evaluate what the cost would be if, suppose, I had to pull out 10 days of data from Glacier into S3. Glacier is cheap, but compared to S3 it's only around 20% cheaper, so we didn't find much value. I was thinking you have historical data for which the aggregation is already done, so you don't need the raw data to be in S3. But you don't know what aggregation you will want: currently we are running some 10 kinds of jobs; tomorrow we might need to run an 11th job for which the raw data is required, and these new jobs are coming every 15 days, so we can't keep the raw data in Glacier.

Can you give me an example of when you said the performance will be low, on the con side of cloud computing? A specific example you have experienced where performance was an issue? If you run MySQL instances on EC2, not using RDS (I am not sure about RDS because I have never worked with it), and a similar config in a physical data center, I have seen that the physical data center, with physical boxes, gives you better results. Also, earlier, before this architecture was there, it was a one-box architecture with MongoDB as both the reader and the writer, and there we were using EBS as the backend. This was around 6-7 months back; at that time provisioned throughput wasn't there in EBS, so if your traffic increases then writes become slow, and we faced problems where EBS became unresponsive for a short duration, for 30 seconds, but unfortunately those 30 seconds caused the MongoDB system
to go down.

I just wanted to ask: you are using Kafka and then going ahead and storing in S3, so this looks like a good candidate for real time as well, so why not Storm? Yes, Storm: in fact, the real-time analytics we are working on is using Storm. Storm needs some input from which to get the data, and for that it listens to these Kafka brokers; we are using the Kafka spout for Storm, and the Kafka spout is very clean. I think the two most popular spouts for Storm are probably the Kestrel spout and the Kafka spout. But it's a challenge to scale down with Storm? Yeah, it will be a challenge. So far that is in POC mode only, so I probably won't be able to answer the scale-up and scale-down questions.

Hi, when you say real time using Kafka, how many seconds are we talking about, how real is real time? Around a 3-4 second delay. Thank you, that's all. In Kafka you can tune when to write; again, it's based on a number of events or a number of seconds, and I guess we have kept it at 1000 events or something, so it's not very high.

So that's all we have time for today; please catch up with the speaker afterwards. Thank you. Also, just in case you are thinking of developing something: though we are not a consultancy provider, you can definitely contact me, and I'll be happy to provide whatever inputs I can to help you.