I am Shantanu, and I am an architect at Flipkart, where architects get to do a lot of different kinds of things. One of the things I do is take care of a few of the analytics systems in play. Most of what I work on faces developers, as opposed to metrics and data for business. Hyperion is primarily a system used by developers, ops and so on; there is another system called Bigfoot that is used for business-level analytics. I will mostly concentrate on Hyperion today.

So what is it? It is basically an event processing system used by dev teams to get insights into what is happening in a deployed system or in the apps. It is used by on-call engineers for debugging customer service issues, especially issues arising in the apps or on the web. By web I do not mean the main website, but the web reader used to read books on the site, which is pretty JavaScript heavy and has to work on a lot of different platforms; a couple of days back we got one user on Firefox 3 on Windows NT. Debugging those kinds of issues is very difficult because we do not even have all of those environments locally. It is also used by business teams for certain use cases.

Let us look at a few of them. This is one of the screens: we have an in-house storage system that we built, a distributed file system, and this screen shows the events happening on it over 24 hours; resource get is the highest-volume event from this particular application. It is basically a monitoring screen. We also have a trend graph showing how many events of each kind happened from a particular system over a month, plus search and a few other functions.

So why did we need this? Hyperion actually came out of our digital team, which I was a part of for some time; I recently moved to a different team.
In the digital team we were developing apps to read books on different platforms: Windows, iOS, Android, web. We were also building the backend infrastructure to support those apps, for example the download systems. Some downloads are samples, which are cached; the ones that send out the actual books are obviously not cached, they are DRM-protected, and so on. We needed to monitor all of those systems. We needed to know how many people were reading books, how many books were downloaded, and, if the app crashed, what the state of the system and of the app was before the crash. Then there are CS issues: somebody comes and says, "I am not able to download the book I already paid for." What are we supposed to do now? And that JavaScript error I mentioned is a big problem to debug remotely, especially since you cannot expect your customers to be savvy about what is happening in the IndexedDB inside their browser.

Besides that, there are a few small business use cases. We get ebooks from multiple publishers, say Penguin or Random House. They upload a book, then come back and ask: "I uploaded this book, what happened to it? It was supposed to go live and it did not." And Hyperion is now also integrated with our retail app, which has, I think, a couple of million installs. We send out marketing notifications, so the questions become: when was this notification sent, why was it sent, which downstream system did it come from, and so on. Those are some of the major use cases for Hyperion. There are also the use cases we have as developers: if I push some new code, there are certain metrics I want, to see whether some section of the code is actually getting hit; if there are errors, what kinds of errors are happening; and I want to be alerted in those situations.

All of these requirements came up around the time we were starting our internship program, so we started Hyperion as one of the intern projects, and we came up with a fairly expansive problem statement. The first requirement, the primary condition from everybody who wanted to use this, was: we will send you data, but you must not slow us down. If somebody clicks download, the download should start immediately; you cannot slow down downloads just because you want to persist some download information somewhere in your system. Second, we have a lot of different technologies in use: one team uses Scala, another Ruby, others Java, somebody else Python, so this system needed to work with all of that, at least the event-push part. Onboarding had to be simple; there are many teams, and if it is complicated to push an event, nobody is going to use it, and that is not something you want. And there are some basic guarantees you need to give if you are planning to build this kind of system. The minimal one is that once you accept an event, you cannot later say you were unable to persist it for some reason; if you accept an event, you have to push it through.
However well a database is written, there is only one truth about it: it will go down for some reason or other. That is a reality of life. If nothing else, a couple of boxes in your cluster will go down, and you cannot stop that from happening. So the system you are writing should be resilient to that, in the sense that you can tolerate a slowdown: if your processing latency was 20 milliseconds, it can go up to, say, 100 milliseconds, but the system should not go down. Even if part of it goes down, you should still be able to take messages, and when you are back up, you should be able to process the backlog very fast and catch up. And if you have critical functionality built on your event stream, you should plan so that even if your database goes down, you can still support those critical use cases, for example alerts. Say you have a critical alert on one of the machines doing user checkouts going down; that event cannot get blocked just because you are unable to persist to the database.

There is a lot of different analytics you can do on an event stream, but we decided to go with a small, mostly-used subset: querying, so you can query events based on the fields you send through them; grouping, like group by component, or group by publisher and book and the components an event went through, so group-by counts; trends; and of course histograms like the ones I showed you. Histograms are important because they give you a visual representation of the state of your system: if the graph is at zero, you know something has gone wrong, and you do not even need alerts for that. Visual representation is very, very important. Most of the people who wanted to onboard Hyperion said they would build their own consoles: we do not want the generic thing, we know which fields we will query on, we know which graphs we want. So we had to have a set of simple APIs those teams could use. And on the backend, we knew upfront that we, as a platform team, would probably not be able to code up all the different use cases people would have, so individual teams would write their own custom analytics, and the storage system needed to be somewhat friendly to that.

This is the base architecture we came up with. These are the clients that push events, and there is an API whose only job is to ingest events; it takes single events and batch events and has no other function. From there we put events into a messaging system, which works as our staging area. From there a stream processor writes to two different stores: a short-term query store used for the base functionality I mentioned (histograms, simple search queries, groups, trends), and a long-term store on which people run their own custom jobs. We expose APIs on the query store, and people can write their consoles on top of those, or pull data from the backend store into their own storage and build their own APIs.
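To make the "ingest and do nothing else" idea concrete, here is a minimal Java sketch of such an ingestion path, assuming a Kafka 0.8-era producer; the topic name and the handler shape are my own illustrative choices, not Hyperion's actual code:

```java
import java.util.Properties;
import kafka.javaapi.producer.Producer;
import kafka.producer.KeyedMessage;
import kafka.producer.ProducerConfig;

public class EventIngester {
    private final Producer<String, String> producer;

    public EventIngester(String brokerList) {
        Properties props = new Properties();
        props.put("metadata.broker.list", brokerList);             // 0.8-era producer config
        props.put("serializer.class", "kafka.serializer.StringEncoder");
        props.put("request.required.acks", "1");                   // ack once the leader has it
        this.producer = new Producer<String, String>(new ProducerConfig(props));
    }

    /** Accept a raw JSON event, do the bare minimum, and stage it on the queue. */
    public void ingest(String appName, String rawJsonEvent) {
        // Only the sanity checks that keep the pipeline functional; nothing else.
        if (appName == null || rawJsonEvent == null || rawJsonEvent.isEmpty()) {
            throw new IllegalArgumentException("app and event body are mandatory");
        }
        // "hyperion-events" is a hypothetical topic name; keying by app keeps one
        // app's events on the same partition, so they stay ordered.
        producer.send(new KeyedMessage<String, String>("hyperion-events", appName, rawJsonEvent));
    }
}
```

Anything heavier than this, validation beyond the mandatory fields, enrichment, lookups, belongs downstream in the stream processor, not in the ingestion path.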
With that, the basic architecture was fixed, and we had to fix the tech stack. First, the messaging system: what to choose? Here we wanted replication, we wanted parallel read capability, and we wanted the system to be resilient to downstream failures, meaning that if I read from the queue and try to write somewhere and the write fails, I can come back and read the same message again. Second, this is valuable data, so other systems might want to consume the same stream, and they should get the same capability. So we chose Apache Kafka. The problem was that when we started, the stable version of Kafka was 0.7, which did not have replication, at least not inside the system, and we needed it because of the base guarantee: once we acknowledge a message, we cannot go back and say we could not persist it. We wanted replication on the queue, which 0.7 did not have, so we went ahead with Kafka 0.8, which was at beta 1 at the time.

The good things: it is open source and it is extremely fast. The base philosophy of Kafka is that if you write sequentially to a particular file, or read sequentially from one, your reads and writes are almost as fast as reading from memory. Since our use cases mapped very well onto that, and we were never planning any random seeks other than in failure scenarios, we went with it. If you are starting on something like this today, I would also urge you to take a look at a project called Luxun. I have not used it, but it looks pretty promising; the source is on GitHub, and I think it is from a Chinese company.

As I said, 0.8 was beta, and the other thing about Kafka is that the server keeps practically no consumer state. If you have used a traditional queue, you attach a consumer to the queue and the server sends you messages. Here it is a little different: you have to specify which offset you are going to read from. Every queue, or topic as it is called in Kafka, is broken up into partitions; you use a single thread to read from a particular partition, and you specify an offset, saying, for example, from this partition I will start reading at offset 20. There are metadata calls you can make to the server to learn the valid start and end offsets. So one of the major things you have to figure out when working with Kafka is where to store your offsets. Kafka has two kinds of consumers: the high-level consumer and the low-level consumer. The high-level consumer has no ack mechanism of any kind: when you ask for a message, it writes the offset to ZooKeeper and then hands you the message. What we wanted was the low-level consumer, where you have absolute control and can read from any offset, but where you have to maintain the offsets yourself.
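For illustration, a minimal sketch of an explicit-offset fetch with the 0.8-era low-level consumer (the Java SimpleConsumer API); the broker, topic, partition and sizes are placeholder values:

```java
import java.nio.ByteBuffer;
import kafka.api.FetchRequest;
import kafka.api.FetchRequestBuilder;
import kafka.javaapi.FetchResponse;
import kafka.javaapi.consumer.SimpleConsumer;
import kafka.message.MessageAndOffset;

public class OffsetReader {
    public static void main(String[] args) {
        // Low-level consumer: the broker does not track our position; we pass an
        // explicit offset with every fetch and persist the next offset ourselves.
        SimpleConsumer consumer =
                new SimpleConsumer("broker-host", 9092, 100000, 64 * 1024, "hyperion-reader");
        long offset = 20L; // e.g. the last offset we committed to our own offset store

        FetchRequest req = new FetchRequestBuilder()
                .clientId("hyperion-reader")
                .addFetch("hyperion-events", 0 /* partition */, offset, 100000)
                .build();
        FetchResponse response = consumer.fetch(req);

        for (MessageAndOffset mo : response.messageSet("hyperion-events", 0)) {
            ByteBuffer payload = mo.message().payload();
            byte[] bytes = new byte[payload.limit()];
            payload.get(bytes);
            System.out.println(mo.offset() + ": " + new String(bytes));
            offset = mo.nextOffset(); // this is what would be written back to the offset store
        }
        consumer.close();
    }
}
```

A real reader also has to handle leader changes and out-of-range offsets via the metadata calls mentioned above; that error handling is omitted here.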
For the ingestion API, because of the multi-platform requirement we did not want to go with anything fancy, and we wanted to see how much latency this layer itself was adding, so we decided to go with the fastest available framework. After evaluating a couple of them, Dropwizard, RestExpress and a few others, we found RestExpress was the fastest. Honestly speaking, we would also have tried Spray, but we did not at the time. RestExpress is pretty damn fast, and it is also very lightweight. As I said, in our ingestion API we wanted to do almost nothing other than write to the queue.

Storage, as I said, was broken into two parts. In the query store you keep data indexed, and it gets TTL'd out: you do not store all your data, you store a relevant subset for a certain time. How long depends on the team; when somebody wants to onboard, we ask how much queryable data they want. One team will say one month, another two months, another says they can do with one week, and we work with that. The long-term store keeps all the data, and it is also our golden source: in case the query store goes down, we have scripts and code ready to pull data from the golden store back into the query store.

At the time, we decided the short-term store would be MongoDB and the long-term store would be HBase. From the long-term store we basically wanted no features; it was just a key-value dump for us. We have two or three centralized clusters across Flipkart, so we decided to go with the fast-write HBase cluster that has ops support and all of that, but we could have gone with any other backend storage that works well with key-values.

For the query part we went with MongoDB. I do not know how many people would do that for a near-real-time system, but we did some tests and it is actually pretty good, speed-wise. Why we went with it: it has a good set of functions you can actually use, for example aggregates, groups and trends. With aggregates you can create pipelines of stages and do fairly complicated things; a few days back I wrote an eight-stage pipeline that builds a funnel of all the events across a particular app: this many people started reading, this many finished, and so on. It comes with its own quirks, of course, like every database, but we saw that most of our use cases were supported, and it is easy enough to maintain.
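As an example of the kind of query this enables, here is a small aggregation with the old MongoDB Java driver that counts events per event type for one app over a time window, a single building block of the funnel described above. The database, collection and field names are illustrative assumptions:

```java
import com.mongodb.AggregationOutput;
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class FunnelStep {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("mongo-host");
        DBCollection events = client.getDB("hyperion").getCollection("events");

        // $match on app and time window, $group by event type, $sort by count.
        DBObject match = new BasicDBObject("$match",
                new BasicDBObject("header.app", "ebook-reader")
                        .append("header.timestamp", new BasicDBObject("$gte", 1400000000000L)));
        DBObject group = new BasicDBObject("$group",
                new BasicDBObject("_id", "$header.eventType")
                        .append("count", new BasicDBObject("$sum", 1)));
        DBObject sort = new BasicDBObject("$sort", new BasicDBObject("count", -1));

        AggregationOutput out = events.aggregate(match, group, sort);
        for (DBObject row : out.results()) {
            System.out.println(row); // e.g. { "_id" : "BOOK_OPENED", "count" : 40231 }
        }
        client.close();
    }
}
```

Chaining more $match and $group stages on top of this is how a multi-stage funnel pipeline gets put together.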
The storage architecture we use for MongoDB is sharded, though until now we have been using only one shard. A shard for us is three machines, which we felt was the optimal size: in a replica set you need three machines, one primary and two secondaries. One of them will go down at some point, or you might have to take a node down voluntarily for certain operations. For example, MongoDB will not release disk space: say some app pushed a lot of events during a spike and took up about 100 gigs; that spike's data gets TTL'd away, but the space is not released. So after some time, if you have not taken this into account, you might have to take one node down, delete its data and resync it from the other nodes. That is fast, but it is still latency you have to plan for; our setup helps because our queries never hit the primary, they always hit the secondaries, and the primary is only for writes. These are things we knew, and some we learnt along the way while using Mongo. It is pretty good; we are currently in the process of moving away from it, but I will discuss the reasons later on.

For the processing pipeline we wanted something that was good with retries, and that was Storm, without question. There is a seminal talk by Nathan Marz, the creator of Storm, where he describes the philosophy behind it. The base philosophy is that your systems will go down. Even Storm itself will go down: Storm has supervisors and Nimbus, which is the central server, and all of them can go down; your Storm UI will go down. The point is that everything should be able to recover once the components it depends on are back up. So Storm is built on being very fast, obviously, and secondly on recoverability.

One of the most complicated pieces you might ever have to write on Storm is the spout. The representation of a computation you want to execute on a Storm cluster is called a topology. Inside a topology you have something called a spout, which is the component that reads data into the cluster, and bolts, which are the components that actually process the data; there are other pieces too, and you can go through the Storm documentation to understand all of that. Writing a spout is fairly complicated stuff. The good part about Storm is that there is a good community of people who keep writing spouts for different data sources: there are spouts for HBase, for Kafka, for RabbitMQ, and interesting ones like the Twitter firehose. When we started, there was no spout for Kafka 0.8, and the 0.8 protocol is fairly different from 0.7, maybe 60 to 70 percent different, so there was no question of reusing much of the existing spout code. Still, we decided to go ahead with it.

So the major parts were covered: the queue, Kafka, acting as our staging area; the processing layer, Storm; and two data stores, MongoDB for queries and HBase for the long term. Now, what do we actually do in the Storm cluster? We decided to use HBase as the offset store, because we knew exactly what we were going to write as well as read: it was never going to be a scan, always a get or a put, which is extremely fast on HBase. So the offsets went into HBase.
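A skeletal sketch of how such a topology might be wired together with Storm's pre-Apache (backtype.storm) Java API, including the all-or-nothing ack/fail discipline described below. KafkaEventSpout is a hypothetical stand-in for the custom spout the team wrote, and the store writes are stubbed as comments:

```java
import java.util.Map;
import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Tuple;

public class HyperionTopology {

    /** Writes one event to both stores; any failure fails the tuple so the spout replays it. */
    public static class StoreWriterBolt extends BaseRichBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            // Open HBase and MongoDB connections here, once per task, and reuse
            // them for every tuple (see the context trick later in the talk).
        }

        @Override
        public void execute(Tuple tuple) {
            try {
                // 1. write the event to HBase (long-term store)
                // 2. write the event to MongoDB (query store)
                // 3. commit the Kafka offset to HBase
                collector.ack(tuple);
            } catch (Exception e) {
                collector.fail(tuple); // replayed from the Kafka offset; no asynchronous writes
            }
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaEventSpout(), 2);   // hypothetical custom spout
        builder.setBolt("store-writer", new StoreWriterBolt(), 4)
               .shuffleGrouping("kafka-spout");
        StormSubmitter.submitTopology("hyperion", new Config(), builder.createTopology());
    }
}
```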
For MongoDB, we did not want to hit it for frequently used queries, for example histograms. If you open the Hyperion console, the first thing that comes up is your histograms, and if you walk around Flipkart you will see many of the people using Hyperion with that console open on one of their screens or on some monitor. That is quite a bit of load, and we did not want to put MongoDB through it, so we decided to do certain precomputations, at least for metrics like how many events are coming in.

Another thing: we did not want to expose all of the functionality MongoDB gives, for several reasons. First, no database is optimized to do everything. Second, we wanted to keep our options open as far as the query store goes. So we expose only selected functionality. Even in search: if you are searching on an integer field, we give you all the normal operators, less than, greater than, greater than or equal to, and so on, but not something like, say, modulo; actually, that is a bad example. For string search we give you equals, not-equals, etc., but we were not keen on giving you contains, because MongoDB is not very good at substring queries even on indexed fields. They have made some improvements there, but at that point it was not an option. So in the cluster, while processing the events, we do a certain amount of metadata analysis on what comes in: once we get an event with its various fields, we work out which field is an integer, which is a float or a string, and so on, and store all of that metadata somewhere. When you are querying from the frontend, or even from the APIs, you can hit an API to get the metadata for a particular event or event set, and only those operations can be used to query; anything else raises an exception, and in the console the unsupported operators are not even visible.

Our failure recovery in Storm is pretty simple: either we write to both stores, or we fail. If there is an outage on, say, HBase or MongoDB, the write fails and gets retried; there is no asynchronous write. A topology is arranged as a series of operations; think of ours as: write to HBase, write to MongoDB, then commit the offsets to HBase. If any of these three fails, the event gets processed again.

One more thing that was important: since this was going to hold a fairly large amount of data, we wanted to divide the query space. That was simple enough; we came up with the concept of apps. An app is a logical separation of the whole event space: for example, the ebook reader is an app, and ebook delivery is an app.

So what does an event look like? It looks like this. We have a mandatory header where you give a few things. The app. The event type, which is for your use only: reader crashed, book downloaded, book download rejected due to IP conflict, whatever makes logical sense to you. The platform, since apps were one of the major use cases, is a field in the header itself, where you say Android, iOS, and so on. Then we have a timestamp; an instance ID to identify where the event is coming from (if it is coming from a machine rather than a handheld device, this would contain the machine's IP address); and an event ID, which is just an identifier for the event: you can generate and put a UUID or any other string you want there.
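An illustrative event, with JSON field names of my own choosing for the header components just described (the actual wire format is not shown in the talk):

```json
{
  "header": {
    "app": "ebook-reader",
    "eventType": "BOOK_DOWNLOAD_REJECTED",
    "platform": "android",
    "timestamp": 1400000000000,
    "instanceId": "10.2.3.4",
    "eventId": "a3f1c2e0-7b9d-4e55-9c1f-2d8a61f0b4aa"
  },
  "data": {
    "bookId": "9780143424623",
    "reason": { "code": "IP_CONFLICT", "retries": 3 }
  }
}
```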
The key for an event in our system is actually a hash of all of these header fields, and all of these fields are mandatory. The interesting part is the data. In data you can send any type of field, any valid JSON object, including nested JSON. Through our metadata analysis we come up with the list of fields you have created and their types, and as I said, there are APIs from which you can retrieve that metadata.

For Hyperion we actually have two topologies, not one. The first is the one I described fairly in depth: it does the metadata analysis, writes to both stores, commits the offset. The other one writes to a secondary Kafka cluster, and we have a small library with a sort of predicate language that you can use to subscribe to a subset of events. This helps other teams build systems, say alerting: you push some new code and just want to monitor it for a day to see if certain sections are failing, so you hook into the secondary cluster with the subscription system, pull only the error events, and write a small amount of code that mails you when that kind of event comes. Those kinds of use cases mostly cannot be allowed to fail because of a data store failure, which is why the secondary cluster is written to by a separate topology.

The status currently: Hyperion takes around 35 to 40 million events a day. We have a 3-node Kafka cluster, two worker nodes on Storm, a 3-node MongoDB cluster with around 900 gigs of data, and about 5 to 6 terabytes of data on HBase. We are doing a fairly large amount of upscaling: all of this will at least double, and the event volume will actually go up 5 to 6 times. Kafka is becoming a 7-node cluster, the Storm workers go from 2 to 4 to 8, MongoDB increases to 6 or 7 nodes, the query data will go to 3 terabytes, and HBase will grow accordingly.

Certain things we found out while doing this. In the ingestion API, do basically nothing: only the basic checks that keep the system functional. For example, if you depend on fields in the header, just make sure the requisite fields are present and push; do nothing else. Batching is mostly a no-brainer, but it made a genuinely large performance difference. If you are writing to Kafka, writing a single event and writing a reasonably sized batch take almost the same time, and the Kafka protocol itself supports batching, where multiple events are compressed into one message and forwarded; that speeds up writes enormously. Similarly, MongoDB batch writes are fast, and with HBase, if you do a put of a list, writing 20 or 30 events at a time instead of a put per event, it gets very, very fast.
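A small sketch of that HBase tip with the classic HTable client: buffer a batch of events and flush them with a single put(List&lt;Put&gt;), one round trip instead of one per event. The table, column family and qualifier names are made up:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "hyperion_events"); // illustrative table name

        List<Put> batch = new ArrayList<Put>();
        for (int i = 0; i < 30; i++) {
            // Row key: in Hyperion's case, the hash of the header fields.
            Put put = new Put(Bytes.toBytes("event-key-" + i));
            put.add(Bytes.toBytes("d"), Bytes.toBytes("json"),
                    Bytes.toBytes("{\"header\": \"...\"}")); // the raw event JSON
            batch.add(put);
        }
        table.put(batch); // one batched write for all 30 events
        table.close();
    }
}
```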
One more caveat about MongoDB: do not call update a lot; it will become a problem. As long as you use it as an append-only store, it is very good. One more thing about Mongo, which we are not using but which you might if you plan to use it: when writing to Mongo, try to use the ObjectId as the ID for the document. Mongo has a system that generates an ID for your document, that key generation is very fast, and the ObjectId also contains certain other information you can leverage; it was not an option for us for various reasons.

Another thing: if you are planning to use Storm, it has a nifty feature, not very well documented, where your processing units get a context object in which you can save connections and other frequently used things. Once a task is spawned on a cluster node, it lives on; it does not die after every batch, so that state is persistent. If you save something in your topology context, you can retrieve it later; it is like a map, you save it with a name and an object, and use it later on. It is very fast: our processing time for an event came down from about 2,000 milliseconds to 45 milliseconds after we started using the context, even though we were already batching inside the topology.

One important part is how you set expectations about what you are going to do. The main thing I try to explain to everybody is that the system only facilitates getting at and seeing your data; if you put garbage into it, it is completely useless. The point is to find a good balance: send enough data to let you analyze your systems, but not so much that it buries you. For example, do not send a 1 MB message; I can think of almost no event-level data that needs 1 MB of context. It might happen in some case, but my point is, what I ask everybody planning to onboard Hyperion or any system like it is: do you know what you are going to push? Coming up with a schema for your events before you start writing to this kind of system is very important.

The analytics you build on the real-time side should be minimal. It should of course cover what you need from the data, but it should be the minimal set. Any fancy stuff on top of it you can do offline, or in a pseudo-real-time way: process a delta of the data in the backend, from your permanent store, using a table scan or a timestamp-based scan, keep generating the results, store them in a cache, and when people ask, serve it from the cache. Do not put unnecessary analytics pressure on the query store. Keep it simple, as was said in the last presentation: of all the things you can do, always go for the simplest, especially if you are looking for speed; speed should be a primary focus. And you have to consider scalability as an integral part of the system, a primary requirement for every component you choose. With Kafka, you add a few nodes and partitions, rebalancing happens, and you can spread the load; same with MongoDB, you add one more shard and it should be okay; and the same for HBase, of course.
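A rough sketch of that pseudo-real-time pattern, under the assumption of a scheduled job that folds only the new delta from the permanent store into an in-memory cache; the actual store scan is left as a comment since it depends on your schema:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Heavy analytics stay off the query store: a background job periodically
 * processes only the events that arrived since the last run, and the API
 * layer answers from the cached result.
 */
public class DeltaAggregator {
    private final Map<String, Long> cachedCounts = new ConcurrentHashMap<String, Long>();
    private volatile long lastProcessedTs = 0L;

    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                long now = System.currentTimeMillis();
                // Scan the long-term store for events with timestamp in
                // (lastProcessedTs, now], e.g. an HBase scan with a time range,
                // and fold the counts into cachedCounts here.
                lastProcessedTs = now;
            }
        }, 0, 5, TimeUnit.MINUTES);
    }

    /** Called by API servers; never touches the query store. */
    public long countFor(String eventType) {
        Long v = cachedCounts.get(eventType);
        return v == null ? 0L : v;
    }
}
```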
Besides Hyperion, there are a couple of other systems we are working on. One that I am actually very proud of is what we call the pre-cog system. It started as a hack day project. What it does is create a computational cluster: there is a Debian package, you install it somewhere, and that machine becomes part of the execution cluster; there is also a server component. You can write small primitives that take a message, do something, and then release the message, or not release it, at which point the message's execution stops. A problem with Storm is that deploying a computation is essentially a code deployment; it is not something you can do from a console. Pre-cog gives you a sort of toolbox instead: like Storm it has sources, and you can take a source, attach a few computations to it, and deploy it, click, click, click, creating a computation and deploying it on your cluster, with control over how many nodes it executes on and so on. We are working on this, and hopefully it will get open sourced at some point.

The other part, let me come back to data storage. The problem with MongoDB, as I think you must have understood by now, is that indexing is a huge problem in databases. You can have indexes for your computations, but as the number of indexes grows, it starts to hurt. And the problem with Hyperion now is that it is fairly popular: a lot of teams want to use it, they want to create indexes, and we do not have a lot of control, so sometimes they fire queries from the console on fields that are not indexed, which affects everything across the cluster. So we wanted something built around indexing, and we knew even back then that that was Elasticsearch. The problem with Elasticsearch at that point was that there was only one kind of analytics you could do on it, aggregation in the form of faceting. In the new version they have come up with the aggregations framework, which is actually quite powerful; I urge you to take a look at it. So what we are doing is writing a data access layer, which looks like a database, where the data files are actually HBase and the indexing is Elasticsearch: you query the data on Elasticsearch, and the actual data comes back from the HBase cluster. We are working on that; it is called hockstrat, it might be open sourced fairly soon, and it will go into production soon. Pre-cog, meanwhile, is already being used as a log aggregation system by multiple teams and does more than a hundred million events every day. And these are the resources, including our custom RestExpress pieces.
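A hedged sketch of that split, my reading of the design rather than hockstrat's actual code, using the old Elasticsearch TransportClient and the classic HBase client: Elasticsearch answers the query and returns document IDs, and the documents themselves are fetched from HBase by row key. Hosts, index and table names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;

public class IndexedStoreQuery {
    public static void main(String[] args) throws Exception {
        // 1. Ask Elasticsearch (the index) for the IDs of matching documents.
        Client es = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("es-host", 9300));
        SearchResponse resp = es.prepareSearch("events")
                .setQuery(QueryBuilders.termQuery("header.app", "ebook-reader"))
                .setSize(100)
                .execute().actionGet();

        // 2. Fetch the actual documents from HBase (the data files) by row key.
        HTable table = new HTable(HBaseConfiguration.create(), "hyperion_events");
        List<Get> gets = new ArrayList<Get>();
        for (SearchHit hit : resp.getHits().getHits()) {
            gets.add(new Get(Bytes.toBytes(hit.getId()))); // ES doc id doubles as HBase row key
        }
        for (Result r : table.get(gets)) {
            System.out.println(Bytes.toString(r.getRow()));
        }
        table.close();
        es.close();
    }
}
```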
Now, questions. The first question is whether we looked at other kinds of queuing systems when we decided to go with Kafka. We did, and at that point of time, for our use cases, Kafka was pretty much the only one that was fairly stable and matched; but as I said, the Luxun project looks promising, so anybody planning to do this now should probably take a look at it. The Kestrel spout, by the way, was also written by Nathan Marz. The next question is whether we considered Kestrel when we decided on Storm. We did consider it, but Kestrel was also not that stable at the time, and some of our other teams were using Kafka and had only good things to say about it, so we went with Kafka and wrote the spout ourselves. In case you want to know, Kafka 0.8 is stable now, and there is an official spout for it that you can use ready-made.

Next question: why did we not go with MongoDB as our long-term storage? For the long-term store we were expecting a large number of long-running jobs, hitting it much harder than the query store gets hit; the number of scans would be high, and most of our teams are comfortable using Hadoop for that, pulling the data with Sqoop and so on, so HBase made a lot of sense. Cost-wise the long-term data also amounts to about one-fifth, if you look at it, because the query store carries indexes; and the data is not evenly balanced, some teams have much more data than others.

And on multi-DC: we are not multi-DC yet, but we are planning for it, and it should be fairly simple. If you are planning multi-DC with a setup like this, take a look at the Kafka component built exactly for this purpose, MirrorMaker, which copies from one cluster to another across data centers. At least from what I have seen of cross-DC connections, I think it is a bad idea to write directly to a database from one DC to another; you should transport the data from queue to queue, since the queue is your staging area and your backup. There is one more reason: cross-DC writes will be slow, and Kafka has a TTL just like MongoDB. As you do slow writes, events will for sure arrive faster than you can write them out, so you will develop a lag irrespective of how many partitions you have, and after some point of time, say seven or eight days, you will start losing messages. So it is better to copy Kafka to Kafka and then write from the local cluster. Thanks.