For the biggest takeaway from this exercise, I would like to focus more on the design aspects: how we approached the whole exercise, how we defined the problem, how we scoped it out, how we ended up executing it, and what the key takeaways were.

Some of the old-timers will recognize Curly from the Three Stooges. That is the kind of situation we found ourselves in at the startup I was working at. We were just six months old, but we were in a very rapid phase of growth, and already some of the ad hoc initial systems we had around events and analytics, based on Elasticsearch, were not sufficient; they were not scaling up. If that was the situation just six months in, while we were still doing POCs, then if the company did well it was going to be a scary scenario. That was when we actually sat down and a very broad mandate was given: create a system that can scale over time, at least for the next two years. It was a very broad mandate, with no definition of what the data is, so everything was open to discussion, starting from what kind of data we need to capture, what capabilities we need to provide, and how we would end up executing it.

To give you brief context on who we were: it was a fintech startup, with a lot of transactional data, lots of profile information, and in a very rapid state of growth. Along with that come a lot of requirements about where exactly you can store the data. The data needs to be localized; you can't really put it in the cloud, because you have sensitive credit card information and personally identifiable information. Then there are other concerns: you're a startup, you don't really know where you will grow, so you need a flexible system. You can't invest a lot in a particular technology meant for a particular product, because you don't really know how you will evolve. A lot of design decisions were based on that: we needed to be flexible over time, and if the business model changed, we shouldn't incur a very heavy cost because of upfront capital investment.

Coming to the problems we identified: from the very start we had a lot of people with good and well-justified opinions about what kind of systems should be used to create a generic data lake. But I think we realized that this is a mistake we often make: we look at a solution, we get excited, we execute it, and then we realize it is not solving the problem, because we didn't really define the problems up front. So the very first step was identifying the problems we were facing, and the ones we could face in the near future. Again, being a startup, we couldn't over-engineer; we didn't have the luxury of either money or time, so we wanted the minimum thing that could get the work done for the foreseeable future.

The very first problem that companies in growth face is probably the proliferation of data silos. Every new product, feature, or team starts storing its own data, it expands, and over time you'll find 20 different systems, maybe different technologies, each with its own experts and its own understanding of data representation, and it's a very difficult task to actually bring them together. The problem with data silos is not just the duplicated effort and the complexity, but also that you cannot get higher intelligence by combining all the data, because people in one team cannot understand the data in another team; it is all structured in a different fashion, and there is no single language that they all talk.
So that is the cost of data silos: duplication of effort. If you look at the production systems, you'll need a different specialized DevOps team for each one; there could be an HBase cluster, there could be a Cassandra cluster. These are real-life cases I'm talking about, things we faced over time: we had any number of different systems, and experts for every different system, and still you could not bring all that information together and do really interesting analysis.

The last one is an interesting one: data outlives tech. It is very often said that one of the key advantages Google has is its data; it has data going back decades. Obviously the technology it uses to store that data has changed over time, and that will happen in your organization too. Very often we look at a solution and then we are constrained by how that solution stores and represents data, but we need to decouple the two. We need to look at the information and how to structure it, and then map it onto one of the storage solutions, so that even if the technology in the back end changes, my data representation and the capabilities do not.

Having identified the problems, we then started defining the solution: exactly what technologies, what stack, and what architecture should be used to solve it. But before actually jumping into execution, we needed to define the problem in an engineering fashion. By engineering fashion I mean defining both the functional specs and the performance specs. The functional specs are things like: what kind of data do I have? Do I have strictly defined, structured data, or do I have free-text schemas that I don't know how they will evolve, or do I want a middle ground, a free-form schema that can only be hierarchical, like a JSON tree? I need to take that call and define what kind of data representation I want to support. So those are the functional specifications: what kind of data, how much data, what is the complexity of the data, and so on.

Second are the performance specifications. Those are also things you need to define before you can choose a solution, because a solution like Postgres may seem obvious when you have 50 gigs of transactional data, which is the order of data we had in the beginning, but just one and a half years down the line we actually had a seven-and-a-half-terabyte Elasticsearch cluster. You don't really know how things will evolve, so you need to have those discussions and define the performance specs: what is the maximum latency you can live with in your systems? To give an example, if an event is raised, is it okay if it arrives 100 milliseconds later, or 10 seconds later, or one hour later? The cost difference between a 10-millisecond storage solution and a one-hour storage solution could be 1000x. Those are all the kinds of specifications you need to define as you implement. Especially in a startup you can refine those numbers over time, but you need to write them down so that everyone is on the same page, and it really helps when choosing a technology. When we wrote some of these specifications down at the very onset, we were able to discount many of the solutions out there because they did not meet one of the criteria. For example, we used HBase for one of the storage systems. HBase uses a Bigtable-style storage format, a key-value store, and we used it to store rich, deeply nested JSON objects, so at the onset we knew we would need a transformation layer to translate one into the other. You will not be caught by surprise later on if you write down all your performance and functional specifications at the very onset.
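To make that concrete, here is a minimal sketch of what writing those specifications down might look like, as plain data that every candidate technology gets checked against. The field names and numbers are purely illustrative assumptions, not the actual values we used.

```python
# Illustrative only: functional and performance specs written down as data,
# so every candidate storage technology is judged against the same numbers.
FUNCTIONAL_SPEC = {
    "data_kinds": ["events", "entities", "logs", "metrics"],
    "schema_model": "semi-structured, hierarchical JSON that will evolve",
}

PERFORMANCE_SPEC = {
    "max_ingest_latency_s": 10,      # how stale data may be at the sink
    "volume_gb_today": 50,           # roughly where we start
    "volume_gb_in_2_years": 5000,    # the growth we must survive
}

def meets_spec(candidate: dict) -> bool:
    """Discard a candidate that misses any hard requirement."""
    return (
        candidate["ingest_latency_s"] <= PERFORMANCE_SPEC["max_ingest_latency_s"]
        and candidate["supports_nested_json"]
        and candidate["scales_to_gb"] >= PERFORMANCE_SPEC["volume_gb_in_2_years"]
    )

# Example: a hypothetical candidate that scales but cannot hold nested JSON
# natively would fail here, telling us up front that a transformation layer
# (or a different store) is needed.
print(meets_spec({"ingest_latency_s": 2, "supports_nested_json": False, "scales_to_gb": 10000}))
```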
So that is one thing we did: we defined the problem and broke it down in this style. Then we defined the data model. For the data model, forget about how it is stored in a file; that is just the storage representation. How do you think of the data in your head? Is it a relational schema, key-values, a Bigtable schema, or a JSON schema? We also looked at how other people were solving these problems. We identified that we needed something like events, something like entities (a rich user profile is an entity), we would have metrics, and we would have logs. Then we asked: what common behavior do all of these share, and can I have a single data representation so that I don't have to create four systems? Can I have a single pipeline for all of them, one data representation, and some annotations or some way in which, wherever necessary, I can fork them out into different systems at the edges while keeping something common in the middle? Reduce duplication: that was the theme. And in general, the vision should never be blocked by the implementation. We always tried to look at the specifications we had defined and create a solution from those, not start from a particular technology we had already chosen and then see what data we could fit into it; it should not be the other way around.

Coming to the implementation, this is a very rough diagram; I didn't have enough time to create a proper architectural diagram, but it is a typical pipeline. At the source you have multiple sources of data. It could be mobile apps: there could be a million users using your mobile app, creating events, doing all those modifications, and producing logs. We had several hundred instances of microservices running and creating all these data objects: logs, metrics, events, entities. We were able to create a single unified data representation model, and we used Kinesis as our data pipeline. People use Kafka, Kinesis, there are a number of solutions out there; Kinesis worked for us, and we had interesting experiences scaling and optimizing it, so we can have that discussion later if there are performance or scaling aspects you want to discuss.

Implementing Kinesis allowed us to decouple the destination from the source. As long as the source has handed the data over to Kinesis, we know that robustness and durability have been taken care of, because it is a hosted service with on-demand scaling; we knew we could easily scale up 1000x or 10,000x and only pay for the capacity we use. For the data sinks we created our own distributed microservice architecture, where we could add instances in real time; they read the relevant data from Kinesis and push it forward to the individual sinks. A sink is a data destination. It could be a persistent store (we had Elasticsearch, we had HBase for a while, we had S3 for disaster recovery and cheap storage), or it could be other consumers, like the API consumers which, for example, read data from here, and we did a POC where we were sending data to Amplitude, a third-party system. So that was the glue here: it read from Kinesis and sent the data to each sink.
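To give a rough idea of what that common representation and the "glue" consumer could look like, here is a minimal sketch in Python against Kinesis via boto3. The envelope fields, stream name, and sink interface are assumptions for illustration, not our exact production code.

```python
import json
import boto3

kinesis = boto3.client("kinesis")
STREAM = "unified-data-stream"  # hypothetical stream name

def make_record(kind, source, payload):
    """Hypothetical unified envelope: events, entities, logs, and metrics
    are all wrapped the same way before entering the pipeline."""
    assert kind in ("event", "entity", "log", "metric")
    return {"kind": kind, "source": source, "payload": payload}

def publish(record):
    # Producers only talk to the stream; they never know about the sinks.
    kinesis.put_record(
        StreamName=STREAM,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=record["source"],
    )

def run_collector(shard_id, wanted_kinds, sink):
    """One data-collector instance: read from the stream and forward only
    the kinds it is responsible for to its sink (Elasticsearch, HBase, S3, ...)."""
    it = kinesis.get_shard_iterator(
        StreamName=STREAM, ShardId=shard_id, ShardIteratorType="TRIM_HORIZON"
    )["ShardIterator"]
    while it:
        resp = kinesis.get_records(ShardIterator=it, Limit=500)
        for rec in resp["Records"]:
            record = json.loads(rec["Data"])
            if record["kind"] in wanted_kinds:
                sink.write(record)  # sink = anything with a write() method
        it = resp.get("NextShardIterator")
```

The point of the sketch is the decoupling: producers call publish() with one envelope, and each collector instance is configured with its own wanted_kinds and sink, so adding or removing a destination is just a plumbing change.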
With that we were able to achieve isolation: if one of my technology clusters, say Elasticsearch, is down, it does not affect any of the other sinks. Back-propagation of failures, data backing up, all those issues were taken care of, along with isolation between sinks and scaling. So that was essentially the implementation. Each of these technologies or pieces is complex enough for a complete talk by itself, so if there are any specific questions at the end, I can field them.

No, a data lake is effectively this. This data pipeline is one way of implementing how your data comes from any number of different sources and technologies and flows in, while taking care of scalability, latencies, throughput, all those aspects. To give you an example, a data lake could be a simple HDFS cluster using, let's say, the Parquet file format, where I have dumped all the data (entities, events, logs, metrics) in different file hierarchies, and I just need a layer of data governance and data isolation, so that if someone wants to look at logs they only look at that hierarchy, and, for example, I can put a SQL layer on top of it to query it. A data lake is just a way of storing data and enabling people to query it; it is just this part (a small sketch of that idea follows after these questions).

Yes, right, it's a good question. The context is that as a startup we initially didn't want to create n different systems, but this gave us enough flexibility later: if you wanted log data to go elsewhere, there was a different consumer for it, pushing data to a different destination. It could be the same Elasticsearch cluster, it could be two clusters, or it could be two different destinations altogether. In terms of plumbing, conceptually speaking we had four different pipes, one for logs, one for metrics, one for events, one for entities, and we could just change the plumbing here to send data to different destinations. I'll field that other question later.

Yes, so one thing I haven't shown here is this plumbing, basically the data collector, as we called it, which read from the persistent queue and pushed into whatever the destination is. Think of it this way: there are different destinations. There is S3, which is for backup and stores everything; there is HBase, which only has events; there is Elasticsearch, which has events and entities. This application has different instances running, and it is a distributed application, so you can create a thousand instances. Each instance knows, for example, "I only have to read logs and store them in S3"; another instance knows "I have to read logs and store them in Elasticsearch." So there is that glue over here, which the diagram does not represent, that takes care of it, and any business logic about filtering goes there too. We actually had business logic doing some amount of data health checks and rejecting dirty data.
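Since the dirty-data checks came up, here is the kind of minimal health check a collector might run before accepting a record. The rules shown (required envelope fields, rejecting debug noise, normalizing a city name) are hypothetical examples, not our actual rule set.

```python
REQUIRED_FIELDS = {"kind", "source", "payload"}

def is_healthy(record: dict) -> bool:
    """Reject records that would pollute the sinks downstream."""
    if not REQUIRED_FIELDS.issubset(record):
        return False                      # malformed envelope
    if record["kind"] == "event" and record["payload"].get("level") == "debug":
        return False                      # debug logs masquerading as events
    return True

def normalize(record: dict) -> dict:
    """Example normalization so one value does not become several categories."""
    city = record["payload"].get("city")
    if city:
        record["payload"]["city"] = city.strip().title()
    return record
```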
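And to make the earlier data lake answer a bit more concrete, here is a rough PySpark sketch of "Parquet files in hierarchies plus a SQL layer on top". The paths, view names, and columns are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("datalake-sketch").getOrCreate()

# Hypothetical layout: one directory hierarchy per data kind.
events = spark.read.parquet("hdfs:///datalake/events/")
logs = spark.read.parquet("hdfs:///datalake/logs/")

# Expose them through a SQL layer so analysts never touch the files directly.
events.createOrReplaceTempView("events")
logs.createOrReplaceTempView("logs")

daily_signups = spark.sql("""
    SELECT date(ts) AS day, count(*) AS signups
    FROM events
    WHERE name = 'user_signed_up'
    GROUP BY date(ts)
    ORDER BY day
""")
daily_signups.show()
```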
Coming to the key wins, these are some of the biggest takeaways that went well based on the design, and we validated them in production over the next one to one and a half years as we scaled up to, I don't know, 100x the data volume and size. One was standards: because we had defined the standards clearly, even though there were different types of data in the model, there was a single, simple JSON format, and we had published that this is how data is stored. So everyone, both the developers storing data and the people querying it, knew what to expect and how to expect the data, as well as the performance guarantees. We had said that in the worst case you can expect this much latency, so even product managers knew that when doing some analysis they should not complain if they see latencies up to that. So standards were one thing that really helped us communicate and implement. Two, standards helped us avoid a lot of band-aid fixes; sometimes you want to do X but a technology doesn't provide it, so you do a lot of hacks on top of it to make it work, and we had less of that. Third was less duplication: we had a team of just two or three DevOps people handling this whole infrastructure, because there was only one common data pipeline in the middle. Those were the key wins.

Some of the things we did not cover: one was scaling of metadata, which was the next thing we were going to do. As your data grows, your data dictionary also grows, and it becomes very hard to figure out what data you have in a system; only then can you actually query that data. So a data dictionary was the next logical evolution step, and we did not tackle it. Data health was something we underestimated over time. For example, mobile clients did not have a way of logging, so they started using events for logging information, and then we had debug logs inside events. More data is not necessarily good: if 90 percent of your events data is noise, effectively no one will use that system. Data health is something you need to define as part of the spec; just like unit tests, you need tests for your data health, and at all times you should be able to visualize your score and take proper steps to keep your data neat and clean, so that, for example, business results come out right. It should not happen that your city name is sometimes "delhi" with a small d, sometimes "Delhi" with a capital D, and sometimes "New Delhi", and then someone doing an analysis sees three different categories. All those data health issues really bite you hard later on. That is something we did not do up front and had to build later. And data governance: we created a distributed, asynchronous data pipeline, so if there is an unexpected error it is very hard to propagate it back, or even to realize whether it is a recoverable or a non-recoverable error.

Do we have time? No. So that is all from my side; maybe we can take one or two questions. Yeah, we had a health report where, whenever there was a failure, a notification was sent out to the devs, which meant me, and we would basically classify them and take a call. Over a period of time we were able to classify most of those recurring issues as either actual errors or not, but no, there was no advanced classification system. Yeah, all of that was taken care of; those are basic Kinesis data consumption patterns. Yes, I can field any questions offline. Yeah.