Okay, welcome everybody to the winter edition of The Fifth Elephant. I don't know if that's too much of a contradiction to talk about in Bombay — our winter in Bombay — but I guess life is full of contradictions, just as the data ecosystem is full of contradictions, and I suppose what we're going to do today is actually unravel the contradictions of data, not really the frontiers. I'll introduce myself and the conference, how we plan to proceed, and who all is here today. I'm Zainab Bawa, the co-founder and CEO of Hasgeek, the company that has organized The Fifth Elephant. The Fifth Elephant started off as a conference in 2012. The history is that around 2012 there was a conversation going on about how data was being gathered rapidly in e-commerce, a discourse was building up around big data, and it felt interesting to understand what was happening in the space of data. We launched the conference in Bangalore in 2012, had 700 participants, talked about visualization, engineering and analytics, and discovered that there's a long way to go in the data ecosystem because there is both anxiety and aspiration: anxiety in terms of the messiness that lies with data — the way data is collected, ingested and utilized — and the challenges that lie in organizations where there is a business requirement for data as well as an engineering side of data. Over the years we rapidly discovered the kinds of challenges that lie between data engineering and data science teams in terms of working in coordination, and I think we've been unraveling a lot of this over the years.
One of the things we felt this year was that it's important to take this community out of Bangalore and start conversations in other cities, because while we stay in Bangalore we are also the frogs in the well — we don't know what's really happening outside of Bangalore, and it's useful to understand the kinds of challenges that you face. In that sense it's also interesting to note that many of us over here are actually not from Bombay: there are people who've traveled from Delhi, from Chennai and from Bangalore to attend this conference, apart from the Bombayites. So I'm expecting that there will be some interesting discussions around here. One of the anxieties in the data ecosystem is centered around how do I hire the next best data engineer or the next best data scientist when, to be honest, they don't really exist — and I suppose that anxiety has also driven some of us to come here. So, to put your anxieties to rest, one of the things we've done is put up a whiteboard over there with some sticky notes. If you have a job posting you can just announce the position, the company name, and either a Twitter handle or an email address, and put it out there. I wanted to mention that it's a job board and not a job horde, so please keep it to one sticky note per position and not multiple ones. The purpose of doing this is to ensure that if your goal in coming here was to talk to people about hiring, then that's taken care of, and you can continue to have your conversations around the real and more meaningful issues that we're going to address over here: how do you scale your data infrastructure? How do you build these data platforms? How do you deal with messy data? Et cetera.
And we are open to feedback at any point in time — if you think there's something we can do better, please come and talk to me. My colleagues are here, so do look us up; we're all wearing the Hasgeek lanyards. On that note, a couple of quick pointers. Some of you are wearing a red lanyard, like the three individuals here — actually four, sorry. The red lanyard is for people who do not wish to be photographed, so we will make sure from our end that the photographer who gives us the pictures will blur or delete the images of people wearing the red lanyards — that's if you don't want your pictures posted online. If you're comfortable otherwise, you're wearing the regular lanyard, like I am myself over here, and your pictures will be posted. We'll get started in about three minutes, but a couple of quick house rules first. There are washrooms for women on this side and for men on the other side. We've kept some water bottles outside, so please feel free to help yourselves. And please turn your phones on silent or switch them off — it's just a matter of pressing a button. It's extremely disturbing and insulting to a speaker and to your fellow participants if your phone rings in the middle of the conference, so please be kind and sensitive on that note. We're going to start with the first set of presentations. The way the schedule is organized, the first set of five talks will primarily focus on the building of the infrastructure and the building blocks. We have Pushpesh, Jayesh, Venkat Pingali and Govind Pandey, who will talk about the engineering side, how their data platforms have been architected, the general principles, and their own learnings in the process. Following that, we will get into domain use cases, with Thuang Nayan talking about credit scoring, Sudipta talking about telecom, Piyush talking about logistics and data science, et cetera.
We have a birds-of-a-feather session in the afternoon which will help us tie the first four talks together, get a broader understanding, and have you share your experiences. There is one change in the schedule: because we are a very small group of about 50 to 60 people, the BOF on machine learning and logistics will happen over lunch, so that we don't need to split — you can have your lunch and have the conversation, and we will ensure it goes off smoothly. That's the only change; otherwise the schedule stays as it is. So on that note, thank you very much for coming here, and hoping to have some really interesting conversations and to take this forward in terms of building the community. Over to Pushpesh, our first speaker. — Cool, so good morning everyone. I underestimated Bombay traffic earlier, but it's all right because we are all here on time. So, cool. I am Pushpesh, co-founder and CTO at Moonfrog. Some of you might have heard of us; most of you probably have not. I'll go through that, but what I especially want to talk about here is our journey in the early stages of being a startup. Some of you might already be founders, or working in a startup, or might want to do a startup, or might want to build a new product from scratch at some point, right? And we all run into this: we are building a consumer product, we want to understand how users are behaving, what's happening in the product, where the drops are, how the differentiators are doing, what features are working and not working. And then — analytics or data platform? Oh, we have been focusing on building the product, the application, but not the data analytics platform. So this is our journey, how we did it early on, from day one itself, because we knew that we would have to build multiple products, so we would need data analytics again and again, all the time, from day one of each product launch.
So these are a few steps, a few guidelines, that we think one should follow and generally keep in mind. Again, this is our journey — you can obviously take your own path — but these are guidelines shared from our learnings and the kinds of mistakes we made. Just to give a brief idea about who we are: we are a mobile game development company which makes mass-market mobile games. We aren't talking small size; we are saying mass market — anybody who's out there, who's here, who's on the street, who's in the shops, who's driving cabs, who's working in a big corporate company, everyone is a potential target audience for us. And not just by making one product, but by making a suite of multiple products, again and again. So you realize that we as a company have to make new products continuously, every interval, every cycle. So we need data from day one for each new product to figure out product-market fit, again and again. This is very important to us. What's our current scale? This is slightly old data, but we have approximately six to seven million daily active users — daily active players — more than 15 million, and more than 25 to 30 million monthly active users across all our games. On the right side you can see some of the games we have made — not all of them, but some of the prominent ones. The top one, yeah, Teen Patti. Teen Patti is famous in India, and we are one of the top ones there — the top grossing across the board in India as a Teen Patti game, as a gaming company, whatever you want to call it. The other prominent ones: there's an Alia Bhatt game we did, and we did the official Baahubali game — a big, like okay, 20 MB real-time strategy game, of the shape and size of Clash of Clans or Clash of Kings, made for India.
Then at the bottom you can see our Ludo Club game, which is currently very, very popular — not so much on the Google Play Store, but especially on Facebook Messenger. So across all these games we have more than 6 to 7 million daily active users. All these games, at some point or in some features, are real-time games — real players playing with each other in real time on Indian, or Indian-subcontinent, networks. We have been doing this for five years. Obviously internet speeds have now improved dramatically and will continue to improve, but that was not the case back in 2013. So we have been doing this for India for quite some time. All the games are cross-platform games, optimized for our primary markets, which is India and the subcontinent, and as a company we are profitable. So this is our journey, and in short, I'll talk about our current scale in terms of data analytics. We do approximately 20 billion events per day — unique events per day — as of now. The total size of data that we ingest every day is more than 800 GB. So this is not small scale. The user numbers are what they are, but for each user we capture tons of information, continuously, all the time they are playing our game — anything they do, we track it, as close to real time as possible. Obviously this is not where we started, but we knew that if we got to scale, we would see a lot of scaling requirements in analytics. So how did we go about it? Now, just to roll back four to five years: what did we actually want on day one — the non-negotiables? We needed access to data of any product or game launch immediately. You can't say we'll launch the product and learn about the data, about how the product is doing, next week or next month. No — that is a strict no-no for us. If we do not have data, we will not launch the product. This was the rule to be followed from day one.
We should be able to query at row level. In the beginning, when you just launch the product, you have one user — you don't have 100,000 users. You have one user, two users, a third user just came. So you need to know what's happening: what is that user's journey, what is he doing, what is he not doing, where is he dropping off, what is he not liking. For that you need to see each and every event fired by that user in that session, as quickly and as close to real time as possible. So we said in the beginning: less than five-minute latency to the warehouse, where we can query at row level. This is the requirement we put down on day one. Second, cost-sensitive — in both senses. Infrastructure: being a bootstrapped startup, no money to start with, so obviously we'll use the AWS free credits; that was the first thumb rule. And ops-light: when you are busy building the product, you can't be spending your mindshare and time cycles fixing analytics ops. Third, resources. All startups, for sure, do not have unlimited resources — you do not have resources to spare to build analytics rather than building the main product, right? So in the beginning we said we'll start with a shared engineering resource — only that we'll try to write down our requirements very clearly — and then increase it to one dedicated person after we start. That's it. Not 10 people, not 100 people, because you can't. This is reality; this is our journey. Not saying it's good or bad, but this is what it was. And the last one: the large scale-up requirement. We knew games. We had worked in games before; this was not our first foray. Me, my co-founders, my early people — everyone had worked at Zynga. You might have heard of it.
Games like Mafia Wars and FarmVille — really large-scale games, 30, 40 million daily active users kind of games. So we knew what scale means and what it would look like if it was a success. We knew that we would need to scale up at some point in time, but we also knew that we didn't want to over-architect. So we said: it should not cause headaches till one million daily active users. It's all right to plan for some future, but not necessarily for eternity. Now, I've listed down seven steps — I've made it as concise as possible. Seven steps which I feel we ended up following; obviously in hindsight it's much easier, but take it as a guideline and see if it helps you in your journey of building a data pipeline for your products. First, understand. Understand your requirements. Understand your business constraints. Each business is a little different from the others. Do not go by what's written in the blogs verbatim. Understand what is a must for your business — for the product, or for the product-market fit to be figured out — versus what is good to have and what is okay to delay for the future. That is going to be very different for you as a business and as a team. If nobody's going to look up the data, then what's the point of just storing it and capturing it for the sake of it? And that's where, as an engineer, I feel the difference lies between what you need and what you want versus what is cool. It's very important to differentiate early on, because you don't have resources. I'll go into details. Second, think generic. This is more from our perspective — obviously, if you're building one product you don't need to think generic to its full definition. But we had to support multiple games. We knew that this is the first game; we'll make another game next quarter; we'll have to make four games next year.
So we knew we would have to rinse and repeat the same analytics platform or data pipeline, so we had to think generic. I'll give examples of how we went about it — a lightweight way of looking at a generic data schema. Third, produce data well. If you do not have a foolproof, clean system of data production, your whole pipeline is not going to be of much use. A lot of the ML, a lot of the data analysis that you or your team are going to do later on — if your data is noisy or has corrupt things coming in, it will be more of a headache for you later. So keep it clean, keep it simple, and produce it well. Fourth, design V1.0 of your data pipeline. Do not design for eternity; design V1.0. I'll talk about how we did it — where to cut corners, where not to, and how it panned out for us. Fifth, open up: enable many data interfaces. Once you have captured the data, it's very important to open up whatever you have captured, as quickly as possible, to your team. If you are not using the data and have just captured it, it's useless. So know how to use it in your product — for the benefit of the product, of your business, of your tech, whatever it is — but open up. Sixth, obviously, tune and repeat; but specifically I'll talk about optimizing for usage. Not just because, oh, there's a new tool available in the market, or somebody is saying this one is better. Optimize for your own usage, because you have certain people and certain business requirements. Fix for that first, and then go to the next step. Seventh, upgrade to V2.0 after that. After you know that you are capturing data well, storing and ingesting it well, and using it well — before that, do not try to upgrade to a new system, no matter what. And this is especially important because games do not give you warning before they scale up.
If someone tweets about it, they will go to the next level immediately. In a matter of a day I've seen games go from 1 million to 5 million daily active users. So games do not give you warning. But I think this is an example where it's all right to think in this framework: even with market forces acting on their own will, you can still catch up and do the right thing. So let's go deeper into the first one: understand requirements and constraints. This is an example of what our business requirements and constraints were — listed down, divided into three columns: primarily business, tech and ops. You can obviously do the same for your own products. We knew that we are a tech company, but we have a real business to run. We did not want to just make games for the games-as-a-product part of it; games as a business is what we do. So we first listed the business requirements and constraints. Real-time ingestion is important. What we mean by real time is not necessarily the engineering definition of real time, but a business definition: a user clicked a button, and I need to know about it in some time — not one hour later, not a day later, but it doesn't have to be within the microsecond or the millisecond either. So know what the real-time definition for your business is and stick to that. Second, fast query speeds. We did not have dedicated data teams and so on, so in the beginning we had to worry about: can I run the query and get the result? Rather than submit the query and get the result later in an email or a notification — no. I want to run the query. And who will be running the query? Your founder, or you, the developer, or the QA, or the product manager — all these people who are actually one shared resource at the end of the day.
So you need to run the query and get the result immediately, rather than submit the query and see it later. Third, SQL query interface. Again, this is our choice. We came from a gaming company, we were already using something, and we knew that whoever we were going to work with would all know SQL, and that's it. So let's stick to that. SQL was very important; we put it in as a constraint upfront. Let's not kid around here — we don't have time — so let's put SQL in as the query interface. It helps that all our product managers have been engineers or generally have exposure to SQL, and SQL is very easy to learn. Your engineers know SQL; your QA can also pick it up. Anyone in the company can potentially learn SQL very quickly and write queries. Fourth, row-level granularity: we need to know what each user is doing. And fifth, in the business requirements, rich events. We cannot have some other team figure out how to join — "this event I have, but I don't know what this user's metadata is about." So try to make the event as rich as possible, so that the event itself carries some contextual information. I'll go through the example and you'll see what we mean by rich. The point is that one person, whoever that is, can answer a business question, or get the answer to it, as quickly as possible, alone, without requiring an army of people. Coming to the tech constraints and requirements: we wanted a generic data design, because we knew we would have to rinse and repeat — new games would have to use it again. Forward and backward compatibility: with games as products, you are releasing more or less every day. We have been in business five years, and I do not know of a week where we have done fewer than three releases on one game. So we are doing releases continuously. We have to maintain forward and backward compatibility; we cannot break the whole data pipeline.
Third, simple architecture, because it will be simpler to manage and less resource-intensive. Fourth, immutable data — we put that in as a constraint: once written, we will not change it, whatever the event is. I'll go through how that impacted our design. Ops is equally important. When we say that we are a data-first company and need analytics on day one, if you cannot keep the uptime 24x7, then you're not really a data company. So, hosted services: we have to use hosted services; we don't have the bandwidth, we don't have DevOps and SRE and all that. Scale-out capability will be required from day one. And resilience to bad queries — this is very important. Due to the choices we made, we said that anyone can access data in the company. That means that probably half of them, or most of them, will write inefficient queries. It's all right; we will handle it, we will take care of it. You'll see how that matters in the design. At the bottom, I've given examples of the kinds of questions we ask in the business, to give you a flavor of it. One question: how many users played at least one game yesterday? A very simple question anyone asks. But we ask even more complicated ones: how many users who played at least one game yesterday saw the bonus pop-up too? Even more complicated: how many users who played at least one game yesterday saw the bonus pop-up for the first time? And this is not a complicated question for a business to ask. It's very simple — while walking, we'll ask this: okay, yesterday we added this pop-up; how many people saw it for the first time? And we have to answer it. We have to convert this question into a SQL query as quickly as possible, so your schema has to back it up. And this is not where the business question ends — this is just the opening of the rabbit hole. The next question: oh, how does it compare to last week?
What happened yesterday? What happened with the bonus pop-up? This is the trend — what about the other pop-up that we added last week? So it goes on like that, and your whole data pipeline has to support that kind of questioning. Otherwise, you will hinder the questioning from the product side itself, which I think will show up in your speed at the end of the day. But that's just to give you a flavor of what we do. So, at a high level, these are the three different buckets where all these requirements and constraints matter. First, data design. The data design should support multiple features and games, so a generic data model is required — and changes to the schema are often required in the beginning; that's the reality, and it's all right. Second, interface and usage: it should have a SQL interface, since SQL is what the teams use, and it should be resistant to errors in queries as well as inefficient usage. People will write bad queries, they will do extra joins and inefficient filters — it's okay, we have to make it work. Third, scale: have a scalable system. Scale up and down, because games do go up and down in daily traffic — I'll show you a graph at the end — and scale out easily, because you will see scale in terms of the number of users, with the ceiling going up. So how do you scale it out? Now let's go to point number two: think generic. On the right side, in the box, I've written some high-level guidelines, in terms of keywords. If you don't remember anything else, it'll be good if you just try to remember some of those. The first rule was: generic. The data schema has to be generic — that was our requirement, that's the rule; nobody gets to ask why it has to be generic. Second, we will add more columns as we go along. It's okay; that's the reality. So how do we support that? Third, normalization: how much normalization, and how much not? We all go through that whenever we are doing schema design.
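To make the earlier bonus pop-up question concrete, here is a minimal SQL sketch using an in-memory SQLite database. The table, column names, events and dates are all hypothetical — they are not Moonfrog's actual schema — but the query shape ("played a game yesterday" intersected with "first-ever pop-up was yesterday") is the kind of thing the schema has to support.

```python
import sqlite3

# Hypothetical single events table: one row per event, tagged with a user and a day.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id TEXT, event TEXT, day TEXT)")
con.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("u1", "game_played", "2017-12-06"),
        ("u1", "bonus_popup", "2017-12-06"),  # u1's first ever pop-up: yesterday
        ("u2", "game_played", "2017-12-06"),
        ("u2", "bonus_popup", "2017-12-01"),  # u2 first saw the pop-up last week
        ("u2", "bonus_popup", "2017-12-06"),
        ("u3", "bonus_popup", "2017-12-06"),  # u3 saw the pop-up but played no game
    ],
)

# Users who played at least one game yesterday AND whose *first ever*
# bonus pop-up event happened yesterday.
query = """
SELECT COUNT(*) FROM (
    SELECT user_id FROM events
    WHERE event = 'game_played' AND day = '2017-12-06'
    INTERSECT
    SELECT user_id FROM events
    WHERE event = 'bonus_popup'
    GROUP BY user_id
    HAVING MIN(day) = '2017-12-06'
)
"""
(count,) = con.execute(query).fetchone()
print(count)  # only u1 qualifies
```

The follow-up question ("how does it compare to last week?") is then just the same query with a different date, which is exactly why keeping the schema query-friendly matters.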
We do have some normalization requirements, but we also don't go all the way to the full definition of all the normal forms — stop somewhere in the middle, knowing your business requirements. So this is the example — again, a subset of the tables. These are some of the tables we created. The first is mtable_count, the Moonfrog count table, for all generic events in the game — like the default events table you would have in Google Analytics or Flurry; every event comes there as a dump. The second is mtable_economy. In games, if you have played them, you know there is virtual currency, right? Currency is coming into the game, going out of the game; users are gaining it from bonuses, buying it, and then spending it. For a game as a business, that's the most important thing we have to worry about — more than the game itself, the economy is important, because if there is no economy, you're not making any money. So the economy gets its own table. mtable_user is for all user-specific events and user metadata. mtable_open is for all app-open events. When you're doing app development, you know that half the time people have opened the app, hit some tech problem, and dropped off, or something of that sort. That doesn't necessarily require user data; we just need to know how many people opened the app and whether the loading screen loaded or not, so that becomes a slightly more aggregate-level analysis. And we created a separate payment table — payments, obviously; we all know why that's important. Now, why do these tables help us? If you just join mtable_count and mtable_user, it helps you answer questions like: how many users opened the bonus pop-up? How many users from Rajasthan played Poker and Rummy both yesterday? For these kinds of questions, you can just do joins.
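The "just do joins" pattern can be sketched like this, again with SQLite. The column names here (player_id, state, and putting the game name in a taxonomy column) are my guesses at the flavor of the schema, not the real Moonfrog tables.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# mtable_count: one row per event; mtable_user: user metadata (hypothetical columns).
con.execute("CREATE TABLE mtable_count (player_id TEXT, counter TEXT, kingdom TEXT, day TEXT)")
con.execute("CREATE TABLE mtable_user (player_id TEXT, state TEXT)")
con.executemany("INSERT INTO mtable_user VALUES (?, ?)",
                [("u1", "Rajasthan"), ("u2", "Rajasthan"), ("u3", "Kerala")])
con.executemany("INSERT INTO mtable_count VALUES (?, ?, ?, ?)", [
    ("u1", "game_played", "poker", "2017-12-06"),
    ("u1", "game_played", "rummy", "2017-12-06"),
    ("u2", "game_played", "poker", "2017-12-06"),  # u2 played only Poker
    ("u3", "game_played", "rummy", "2017-12-06"),  # u3 is not from Rajasthan
])

# "How many users from Rajasthan played Poker and Rummy both yesterday?"
query = """
SELECT COUNT(*) FROM (
    SELECT c.player_id
    FROM mtable_count c JOIN mtable_user u ON c.player_id = u.player_id
    WHERE u.state = 'Rajasthan'
      AND c.day = '2017-12-06'
      AND c.kingdom IN ('poker', 'rummy')
    GROUP BY c.player_id
    HAVING COUNT(DISTINCT c.kingdom) = 2
)
"""
(count,) = con.execute(query).fetchone()
print(count)  # only u1 played both
```

The point of the design is that any one person can write this two-table join directly, without handing the question off to a central team.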
For each question, the person who's writing the SQL is now just thinking: which two tables can I join? I just have to write that, and it's okay, and it'll work. Very basic, and it works — for every single person. Engineers can do it very easily, but it works for every single person in the company; I'm telling you from practical experience. Then you can merge any two tables, like user and payment: how many users made successful transactions in the last five minutes? Fair enough. How many users on iOS faced transaction failures yesterday? That can also be done this way. And since it is joined by user, you can ask the subsequent question: I need to know which user exactly, so that I can pull up the whole session information — every single event he or she fired in that particular session — to retrace their steps and find out where the problem was, or what exactly happened. So this helps us ask the subsequent question very fast and answer it. You don't need to run this query somewhere and then submit the result to a central team saying "we'll find out more about it" — no, you can find out yourself. Now, going a little deeper: the example of mtable_count. What we did here is very interesting, and I don't know if you have seen this anywhere, but it worked beautifully for us. This is a sample of what the table looked like in the early days. This is the schema: counter, count, kingdom, phylum, class, family, genus. This is specifically very important. Nomenclature-wise, we have all seen this somewhere — we all studied it in school. The way this works: Google Analytics gives us two, three levels; maybe some other tools give three levels. We wanted more — and why did we want more? I'll give you an example. But the naming we came up with has nothing to do with the game we are building.
No matter which game, which feature, which part of a feature, you can track it this way. The whole biological system has been divided like this: every single animal, tree, any living being fits into this nomenclature. So why can't our data fit in? It's a very interesting approach to naming, and trust me, it has become second nature. New joinees in the company are no longer asking, "okay, what is kingdom exactly? Why did you name it that?" Nobody is having a special "hold on, what is kingdom? What's the difference with phylum?" moment. Frankly, the usage of the data doesn't have to care what kingdom and phylum are, and the data analytics doesn't have to care what is inside phylum and class and genus et cetera. The context is up to you. The data pipeline doesn't care about context; it cares about the structure, and that's it. So the data is structured for the sake of the tech, but at the same time it is unstructured — the context is up to you, the way you want to use it. Now, an example. Say start_session is the counter. Whoever is the product manager, or the person responsible for that part of the business, looks after that. Kingdom can be, say, player_load. It's up to them to define it — the context is theirs. They can say player_load, or player_load_not, whatever they want to name it. We just say it has to be a varchar of 50 characters max; within that, they put whatever they want. Phylum: player load started. Class: where it started — the loading screen. Family: which button was clicked — button_120. Genus: empty, I don't want to use it, that's okay. Then, which is the player ID? Okay, here's the player ID. What is the client timestamp, and the timestamp on the server side when we saw it? There can be differences between them, fair enough to capture both. OS: which OS was he playing on — Google Play, iOS, web, whatever it is. Game ID: which game are we talking about?
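Put together, the taxonomy table might look roughly like this. The column widths, extra columns and sample values below are reconstructed from the talk, so treat this as an illustration rather than the actual Moonfrog schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Generic event table: the taxonomy columns carry free-form, feature-owned context;
# player_id, timestamps, OS and game_id are the fixed, generic part.
con.execute("""
CREATE TABLE mtable_count (
    counter   TEXT,   -- e.g. 'start_session'
    "count"   INTEGER,-- numeric value of the counter
    kingdom   TEXT,   -- each level is free-form, ~50 chars max,
    phylum    TEXT,   -- meaning owned by the feature team
    class     TEXT,
    family    TEXT,
    genus     TEXT,
    player_id TEXT,
    client_ts TEXT,
    server_ts TEXT,
    os        TEXT,
    game_id   TEXT
)""")

rows = [
    ("start_session", 1, "player_load", "started",  "loading_screen", "button_120", "",
     "u1", "t1", "t1", "android", "teenpatti"),
    ("start_session", 1, "player_load", "finished", "loading_screen", "button_120", "",
     "u1", "t2", "t2", "android", "teenpatti"),
    ("start_session", 1, "player_load", "started",  "loading_screen", "button_120", "",
     "u2", "t1", "t1", "ios", "teenpatti"),  # u2 started but never finished loading
]
con.executemany("INSERT INTO mtable_count VALUES (?,?,?,?,?,?,?,?,?,?,?,?)", rows)

# "How many users tried to player-load?" -- just filter/group on kingdom.
tried = con.execute("""
    SELECT COUNT(DISTINCT player_id) FROM mtable_count
    WHERE kingdom = 'player_load'
""").fetchone()[0]

# "How many fired a 'started' event but never a 'finished' one?"
dropped = con.execute("""
    SELECT COUNT(*) FROM (
        SELECT player_id FROM mtable_count
        WHERE kingdom = 'player_load'
        GROUP BY player_id
        HAVING SUM(phylum = 'started') > 0 AND SUM(phylum = 'finished') = 0
    )
""").fetchone()[0]
print(tried, dropped)  # 2 users tried, 1 dropped off
```

Notice that both queries only touch the generic columns (kingdom, phylum, player_id); nothing in the pipeline needs to know what "player_load" means.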
These are the generic parts for the data pipeline. So when the data engineer is looking at it, he's just checking: are all the kingdoms coming in fine or not? Are all the phylums coming in fine or not? Not whether player_load is coming in fine — we just look at whether kingdom is coming in fine. The same game, for some other event, will look like: start_session, player_load — player load started, and now player load finished. Now, what kinds of questions can you ask with just these two events? How many users tried to player-load? You can get that with just a group-by on kingdom. How many start sessions fired a started event but not a finished one? That too. How many times did users successfully load by clicking on button_120? Because you will change buttons, right — you will add more buttons, remove buttons, and so on. That also you can do. It lets you do all of this, and the data engineer — the data pipeline guy — doesn't have to come into the picture. He doesn't know what is inside kingdom, and that's all right. So that was the example; now let's come to the data production part. Produce data very well. If you don't produce properly from all the different sources, it will bite you tomorrow, and it won't help with the next levels. Some thumb rules we followed. Keep your data producers dumb. Do not put brains on that side, because then it will be very hard to monitor them, change them, fix them when errors happen. Keep it very simple, very clean, very dumb. Fewer transformations — that's a rule we followed: we let data be inserted the way we want to consume it, as simple as that, and the data structure we talked about helped us do that. Enrich the data: you saw kingdom, phylum, class, family, genus — five levels. Add more context, and the row will make sense to the right person by itself. There's no "oh, what is this saying? I do not know" — the row itself has enough information. And misuse is okay. It's okay.
People will make a typo: player_load, and they wrote something else. It's okay. Why do you care? As a data engineer, you should not care. Let them fix the context themselves, rather than doing very tight linking between the two sides; we removed the link completely. So what we did here: identified all the data producers and understood the requirements accordingly. In our case we obviously have Android, iOS and all these apps, and they have to generate data; users are clicking directly there. In one of the games, for a short time, we actually captured every single swipe users made, because we were trying to figure out where they were dropping off, why they weren't getting the right-swipe action; in games that's important, right? So we could enable that on the fly too, and that's okay. Now, some constraints come in from reality. You cannot keep sending each event over the network, so we solved it by batching. You cannot lose data even if the app crashes or is killed by the user. We want to capture as much as possible; 100% is not possible, but still as much as possible, so that we know why a user uninstalled or why he dropped off. That's important. So: use local disk storage, back it up, and so on. Keep the context out of the application itself. Remember, when I said we do three releases a week on average, for the last five years, some weeks we were doing multiple releases in a day. Your code is in flux at a tremendous pace, and your data pipeline cannot get caught in that fire. It has to be separate enough, but at the same time not breakable just like that. So develop it as a library; keep it generic. The second part of our stack is a lot of microservices, which most of you must also have. A lot of microservices means a lot of data producers, and they will scale at their own will.
That team will decide what to do: whether it's a monolith or a full-on microservice, whether they're scaling horizontally or vertically, whatever. So your data production part has to scale along with it. You cannot keep sending each event over the network; your microservices cannot be sending so much over the internal LAN that it starts to become a bottleneck at some point. So again: batching, use local disks on the boxes, figure out how you can do a network sync easily. Keep data collection agnostic of the microservice itself. You don't know how many microservices that particular team is going to write. Today they have three, tomorrow they'll have 30, and this is a reality, they will. So develop it, again, as a library. You should not care which microservice is calling you or sending you data. Why do you care? Data is the first-class citizen, data as it is. Whoever wants to send, sends it like this. I'm not going to make an exception; I'm not going to accept any other format either. So those requirements, batching, using local disk, and standardized common libraries, became the requirements for writing the data production layer, which is the SDK, or your own microservice to capture data, all that stuff. In the example I've given, there are three calls: count, countSample, visit. And they all take similar parameters, counter, count, kingdom, phylum, class, family, genus, exactly like the data structure, how it's going to sit in the table. So it's a direct mapping; not much data transformation is happening, and for a reason. Also, all these functions are just wrappers, just abstractions. You've given that abstracted layer to the application developer so that they can do the right thing, but your data structure remains the same; you're ingesting everything in the same format.
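The wrapper functions above can be sketched like this. The names count, countSample and visit come from the talk; the exact signatures and the "visit_" prefix are assumptions for illustration:

```go
package main

import "fmt"

// Row is the single ingestion format every wrapper funnels into.
type Row struct {
	Counter, Kingdom, Phylum, Class, Family, Genus string
	Value                                          int64
}

// track is the one real entry point; the wrappers below are just
// abstractions over it, so the ingested structure never varies.
func track(counter string, v int64, k, p, c, f, g string) Row {
	return Row{counter, k, p, c, f, g, v}
}

// Count increments a counter by 1.
func Count(counter, k, p, c, f, g string) Row {
	return track(counter, 1, k, p, c, f, g)
}

// CountSample records an explicit value, e.g. a timing sample.
func CountSample(counter string, v int64, k, p, c, f, g string) Row {
	return track(counter, v, k, p, c, f, g)
}

// Visit is sugar for a page/screen visit event.
func Visit(screen, k, p, c, f, g string) Row {
	return track("visit_"+screen, 1, k, p, c, f, g)
}

func main() {
	r := Count("start_session", "player_load", "finished", "", "", "")
	fmt.Printf("%+v\n", r)
}
```

Three interfaces for the developer, one row shape for the pipeline.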
So you've just given different interfaces, and that's it. A sample usage is statsController.count: start_session, player_load, finished or cancelled, like the example we talked about. Now the fourth step, designing V1.0, which is very, very important. The V1 is the most important part. Now, this image is not what we did; it's a very typical data pipeline architecture: data sources, then your integration layer, ETL, going into a data lake, then a data warehouse, and then dashboards, reporting and BI engines on top. Very standard. But you don't have resources, you don't have time, so how do you build it? First we defined the guidelines. Fewer layers: this has too many layers for us to handle practically in the beginning; we don't have the time and resources. Keep it transparent: what you ingest is what you get, the rule for the data pipeline engineer. You ingested player_load, so you will get player_load. I do not care. No product manager can come to a data engineer and say, okay, I ingested player_load but I can't find it, or I put buy_popup in the kingdom but I can't find it. The data engineer has to be agnostic of that. You entered it in kingdom, right? So it will be in kingdom. I don't even have to look it up; I have not done any transformation, so it will not go anywhere else. Did you put it in kingdom? Did you search in kingdom? That's it. So it helps. And it's only V1. Most important: know that it will change. You will change it yourself; more requirements will come. You should make it hard enough that the right questions get asked, but still, it will change, and that's okay. So this is what we started with, at a high level. We merged the data ingestion, the ETL layer, into the left side, which is the data sources themselves. So microservices, whatever it is, everyone gets an SDK or a client to send data. We've given you the library; you send us data using this.
This ingests directly into our data warehouse, not into the data lake. Why? Because we needed to query immediately. If a row has been ingested, there's an end user waiting right there: okay, can I query that? Can I do a join? Can I do a select count(*), whatever? So we had to insert into the data warehouse first and, in parallel, sync it to the data lake. The data lake became our secondary. In multiple discussions with a lot of AWS and data folks around Bangalore, I realized this is an anti-pattern: we are going data warehouse to data lake rather than data lake to data warehouse. But it fits the use case. If your data consumers can wait, if they're fine submitting a query and coming back after an hour to look at the results, then maybe the standard pattern works fine. For us, that's not the case, so we had to choose this anti-pattern. Users query the warehouse directly, and the dashboards, reporting and BI tools also query the data warehouse. So it helps. This is the first cut, obviously, and it let us do the early things on the products. We launched a product and we knew immediately whether users were dropping off or not. What did the install funnel look like? A user installed the app; what did he do in the first minute? For games, that's the most important question; we would practically kill each other over this. What did he do in the first session? The first minute, the first 30 seconds, what did he do? If we don't know, we are blind; then we're screwed. Games are not transactional platforms. It's not that he wanted to buy a shoe, so he bought a shoe. No, users don't know what they want to do. They clicked because Shahrukh Khan's picture was there, and then it didn't turn out to be that. So we have to know. We have to know their favorite color before they do. And that's important.
Some details of this. You might have seen a squirrel in the diagram; that's the name we gave our internal client. It was a data collector, but a thick client, not thin, because we merged the whole data integration and ETL into it. Written in Golang, because we knew it had to be performant to some extent. Deployed homogeneously: we don't care which microservice or where you want to deploy it. Here, game team, take it, deploy it wherever you want. Rotate-and-move style in a local folder. It processes events exactly once, guaranteed. Only once, and for sure. The input interface was a file being written locally by the SDK, with batch reading, file rotation, archiving and cleanup so that the disks stay clean. The output interface was directly to Redshift and S3, Redshift being the data warehouse we chose in the beginning. Again, free credits. It worked, it had a SQL interface, and in the beginning it works very beautifully. We didn't have to pay at the start, so it was a good decision in that sense. Batch writes to Redshift and S3, because you're making a network call: S3 APIs versus Redshift APIs, which one works at what pace, and then tune the batch accordingly. I'll come to that in the tuning part a little later. On Redshift we started with dc2.large, for whoever knows it: the smallest node type, I think, with 15 GB RAM. We started with that because we thought it was enough, and since it can scale horizontally, it was okay to start with. Have different users and queues for ingestion and usage: it's all going to one place, so your compute is going to happen there too, and compute should not starve ingestion and ingestion should not starve compute, so have different queues. Keep a moving window of data: we started with 120-day retention so that disk is limited, with daily unloads beyond that. Fifth step: open up many data interfaces. Keep it simple, transparent.
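The "different users and queues for ingestion and usage" point maps onto Redshift's workload management. A hedged sketch of a `wlm_json_configuration` parameter value; the group names, concurrency numbers and memory percentages here are illustrative, not the talk's actual settings:

```json
[
  {
    "user_group": ["ingest"],
    "query_concurrency": 5,
    "memory_percent_to_use": 40
  },
  {
    "user_group": ["analysts"],
    "query_concurrency": 10,
    "memory_percent_to_use": 50
  },
  {
    "query_concurrency": 2,
    "memory_percent_to_use": 10
  }
]
```

The last entry, with no user group, is the default queue. Giving ingestion and analyst queries separate queues with their own memory slices is what keeps heavy compute from starving ingestion and vice versa.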
Misuse is okay, again, behind the scenes, and use third-party interfaces. So we directly gave access using DbVisualizer and SQL Workbench to do whatever. Several tools: Metabase, Redash. Daily business reports through JDBC or ODBC scripts and so on. We did a hack: we used Datadog as our charting library. We sent data to it at fifteen-minute, five-minute intervals and used their charting. It's a hack; as a startup, obviously, you can't build the whole UI upfront. So for monitoring and alerting, we used them. There are a few more things we had to do, but we don't have time; we can talk about it later. Since we kept less data online, we had to load data back if somebody wanted it. They wanted it rarely; maybe once a week someone would want older data, so we just enabled that with scripts. Now, tune and repeat. Know that not all data is important. Always back up at all levels. Drop and rebuild quickly: on any issue, drop the data from your warehouse immediately and rebuild it quickly. Have that capability; if you can do that, it enables you to do a lot more. And data will increase with time, not just in terms of the number of rows but also in terms of columns, and the columns will also increase in complexity. So on Redshift you have to enable compression on certain things, and when to do it and when not to also comes into the picture. Bottlenecks in ingestion, we had to worry about: insert versus copy in Redshift; parallelization requirements; the thick client leading to problems as the application itself scaled up. Bottlenecks due to usage: we realized we needed more columns as we went along, for your data scientist or product manager. And chained queries: a user asks one query, then another, then another, like ten queries in a row to answer one business question. That means if the table is big and you are looking at only a subset of the data, it becomes an issue.
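The fix the next step describes, splitting the big count table so that chained queries scan a smaller subset, needs a routing rule on the write path. A sketch under assumptions: the split table names come from the talk, but routing on a prefix of one taxonomy column is my illustration, not necessarily how they keyed it:

```go
package main

import (
	"fmt"
	"strings"
)

// TargetTable picks the split table for a row, so that popup and
// button events land in their own smaller tables and the usual
// chains of queries scan less data.
func TargetTable(class string) string {
	switch {
	case strings.HasPrefix(class, "popup"):
		return "count_popup"
	case strings.HasPrefix(class, "button"):
		return "count_button"
	default:
		return "count" // everything else stays in the main table
	}
}

func main() {
	fmt.Println(TargetTable("button_120")) // prints count_button
	fmt.Println(TargetTable("popup_buy"))  // prints count_popup
	fmt.Println(TargetTable("other"))      // prints count
}
```

Because the routing lives in the ingestion layer, the producers stay dumb and unchanged.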
So we had to split the bigger tables; the main count table was split into count_popup, count_button and others, to give faster query speeds. Then comes the last step. This is roughly how V2.0 looks. On this side are all the applications, different kinds of applications, from mobile clients to web services and microservices, some 20 of them now. They all send data to what we call the Badger pipeline. That's our new system, with centralized controls, written in Golang. NSQ, if anybody has used it for distributed queuing, works beautifully there for us: very small footprint, but it handles our current scale. Again, the data lake anti-pattern we have still held on to; we push to the data warehouse directly. Redshift is still there, but we added a real-time layer as well, powered by MemSQL, so the SQL interface is maintained but at the same time some data is available much faster than otherwise. And over time we've added priority queues and so on, for different data, different priorities, different SLAs; I'm not covering that here, just the high level. We've now moved from the thick client to a thin client, which we call Badger. It does more or less the same thing, but it doesn't upload over the network itself; it just sends data over TCP to NSQ, and that's it. So it's faster, very lightweight, works beautifully for us. And again: keep them dumb, fewer transformations, misuse is okay, and centralize for performance, not just for the sake of it. Centralize only when performance becomes a bottleneck for you. Current scale: total events per day, 20 billion plus; this is unique, not counting some of the redundancy. Total size of the data, 800 GB plus. Average events per second, 200,000.
Peak events per second: in the bottom graph you can see that before dinner and after dinner, games see high traffic, and when everyone is sleeping it's low. Peak events go up to 350,000, and even at the lowest point it's around 65 to 70,000; it's never zero. So our infrastructure has to scale up and down accordingly, and also be ready for higher scales. This is what it looks like, and I think this journey has helped us get there: build it incrementally, don't worry about that big-bang aha moment, but know the business inside out, know the usage. And the most important thing I'll emphasize: collect well, but at the same time make sure you are actually going to use the data, and that you are using it. Make sure data is a first-class citizen in your company, in your product. Otherwise all of this is just a good tech problem to solve, not a business problem; it's not going to help the company or the product much. So just to recap, these are the steps. Happy to talk about it if anybody wants to go deeper into any of the specifics. Thank you. Thanks Pushpish, I think this was a good presentation. And now, going from the general to the specific: unfortunately we do not have time for questions, but we'll take Jayesh's talk, and then based on the time we have left, we might have a question-and-answer session with both of them. In any case, Pushpish and Jayesh will come back for the BOF session, and Pushpish is going to be around. So we'll go to the next presenter, Jayesh Siddhvani, who will go from the general presentation that Pushpish made into the specific case of how they've scaled their platform at Hotstar. Over to you, Jayesh. All right. Thanks Pushpish, wonderful talk. I'll just pick up from wherever you left off.
I think we've had a similar journey from when we started, but I'll focus this talk more on what we did as a V2 of how we started earlier. I'm going to talk about data platform patterns at scale. A bit about me: I'm Jayesh, director of engineering at Hotstar. Previously I've led data engineering teams at Grofers and TinyOwl. I'm quite active on Twitter; reach out to me there, and I'm always available at jayesh@hotstar.com. A bit about Hotstar, to set the context. I'm guessing most of you have either heard of Hotstar or have used it, but a quick recap. Hotstar is the number one OTT platform in India. We have over 350 million downloads across our iOS and Android apps and living-room devices. At peak, we've done 10.3 million concurrency; that's 10.3 million users active on our platform at one point in time. To put it in context, that's a little more than the entire population of the United Arab Emirates, or of Portugal, and all of those people were on the platform at the same time. We hit a world record with that concurrency; this happened in the IPL 2018 finals. And at peak, we've consumed approximately 75% of the total internet bandwidth in India. We're available in more than 15 different languages today and will soon be available in nine different countries. We started off with India, but we're going to the US, Canada, the UK, Indonesia, Vietnam, Australia, Singapore, the UAE. What makes us a little different, and poses different technological challenges, is the variety of content that we have. We do live content, we do on-demand content, and then you have sports, news, TV, movies, and everything is there in the regional catalog as well. So there's a lot of diversified understanding of our users and our demographic, and hence building for that.
This talk is more about why we chose to build a platform, not just what. So I'll quickly start off with what we have, and then we'll drill down into the details of each component. A quick thousand-foot overview: we have a very simple, traditional data platform. It starts with the clients on the left. We have around 13 different clients, like I said earlier, all the mobile and living-room devices, which constantly send data points to our platform. These data points are mostly clickstream events: what did you view, which page did you open, which button did you click. Likewise, we've instrumented around 120 different events. All of these events are ingested into a central message bus. This message bus is built on top of Kafka, and Kafka is the only proxy for any external data coming into our system. Past the message bus, we have a bunch of processing compute units. These processing units continuously listen to the incoming data, massage the data, transform it, and push it for downstream consumption and storage. We've tried to make a very simple storage layer; I'll talk about the storage component later, but we have a data lake, which is primarily built out of S3, and we have a data warehouse, which is primarily HBase, but we'll dig into the details. Moving to the data products: at Hotstar, data is very ubiquitous. It's not just used for analytics; engineers build a bunch of data-driven products on top of it as well. In fact, the entire app is personalized for your experience, and the whole personalization engine runs off the data that's ingested here. So a lot of content personalization, a lot of ad personalization, a lot of traffic modeling, and of course analytics.
Those are the consumers of our data platform. With that said, this is what I'm going to talk about: different patterns we came across along the way, and a few anti-patterns, given the context of where we were. We'll break it down into ingestion patterns, storage patterns and consumption patterns. A quick brief on ingestion: we've built a platform that has ingested a peak load of one million events per second, and we've been able to sustain that for a good period of 30, 40 minutes. Throughout that period there were one million events per second coming right at us, and we've been able to ingest them with high durability. Like I said, it's built entirely on Apache Kafka: highly available, highly durable, because those are our primary expectations of this tool. We'll talk about it later in detail. One million events per second was what we did last year, IPL 2018; we've grown tremendously from that point. This year around, we have a projection of 50 billion events per day during IPL 2019. On storage: on a regular day, we end up producing about 3.5 terabytes of data per day. The data is primarily stored in S3; we're also actively looking at HDFS, now that our engineering teams are more mature. The warehouse, like I said, is entirely in HBase. With the 50 billion projection, we're looking at 14 terabytes of data to be ingested per day this IPL 2019. Our consumption patterns are also very unique. Like I said, the entire company uses data to build their products, and on a regular day we end up supporting about 300 terabytes of data scans each day. This includes everything, all our analytics and the data products built in-house. We wanted to make consumption very simple, so we have one single interface to talk to data; again, I'll talk about it in detail later.
We wanted to bring parity between streaming and stationary data, and we didn't want anyone to be limited in the amount of data they can scan. Starting with data ingestion, the broad vision: we wanted to build a scalable HTTP ingestion proxy. Sure, there are multiple other protocols, but a simple HTTP proxy is something everybody in the company could understand. And something that's unique for us is the data spikes that we get. We handle about 3x spikes in less than a minute, and 3x here means millions: we go from 3 million to 9 million in less than a minute, and the platform needs to know how to handle that scale. Good, pattern number one for data ingestion: allow everyone to ingest any data. That's important, because one of the biggest success factors of a data platform is that you can't stop anyone from ingesting any data, and people really don't know in advance what data they want to ingest; they only know things in English. So let them ingest JSON, Avro, protobufs, whatever they want; you should understand binary. And that's what we wanted to do. For us, a bunch of clients, like we spoke about, send data, and a lot of microservices send their data via their databases. So we replicate the entire databases of our different services. We have a content service, we have a user management service, and each of these services has its own database, which could be MySQL, Postgres, Dynamo, et cetera. All of that data is ingested from the write-ahead logs, like the binlog in the case of MySQL, into our common interface, and from there on it's available for query. Like I said, we wanted to make ingestion simple: HTTP for people who don't want to understand anything beyond it, and low-level TCP libraries for our power users who want to extract more juice out of the system.
And Kumar spoke a lot about how client SDKs should be built; we pretty much follow the same principles, so I'm going to skip over it. But in brief, our client SDKs are super simple. All the client needs to worry about is just one function, which is track, and that's just about it. The SDK transparently takes care of batching, retrying, ensuring high deliverability, et cetera. This allowed our consumers to just go wild and instrument any data they thought could remotely be useful, and that gave us a lot of power. The second pattern: quality of data is super important. I tend to draw parallels with how we would build a database system. Imagine you're starting a new service, a new application, and you have to choose a database; let's say I pick MySQL. The first thing you do is start modeling the data; the first thing you do is design the schema of that table. But what happened with us, and what's also common in a lot of companies handling data, is that events and data are not treated as first-class citizens. There's not enough thought on what should be instrumented and how. That hit us hard. For a good one year we were suffering with clients not thinking about what the data format should be, and two different clients thinking about the data differently. A simple example: watched video is one of the most critical, the most important metric at Hotstar, and it has a property called watch time. The clients would go to the extent where Android would send watch time in minutes as an integer, while some other client would send watch time in minutes as a string.
Now, these seemingly trivial things hurt really badly when the end users consume the data. And think about it: we have something like 150 engineers and 45 data analysts, and they would all need to carry that insight around: hey, iOS is sending a string, let me cast it. All of those things hamper us big time, so we brought in a whole schema registry layer in front of our data ingestion system, and we mandated that everyone define an event before it could be ingested. Arrest the quality right at the beginning, not at the end. And we ended up doing fancier things later. First, we mandated that everyone send us data in Avro, which also made it leaner on the wire. And we started doing strict checks on the data: checks like, if you're sending a value, say subscription status, it can only have a fixed set of values; it has to be an enum. So if, let's say, you misspelled "cancelled", or you introduced your own subscription status for God knows what reason, the data won't be ingested. And we have two flavors of not letting you ingest. One: we just cut it off and say, you sent me bad data, I'm not going to let it go through, and I'll raise an alert; I'll raise a PagerDuty if you've been doing it for a very long time, or I'll just send an email to the owner of that event. In that case the event goes to a dead-letter queue, and people come in, manually scan it, see what's wrong and what's right, fix it and replay it. The other mode, of course, is pass-through mode, in which case the producer takes the responsibility: hey, look, I know for certain that the downstream consumers won't be impacted; this is just an event for me.
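The validate-at-ingestion idea above, enum checks plus a dead-letter queue for offenders, can be sketched like this. A minimal sketch under assumptions: the field names and the in-memory queues are illustrative stand-ins for the real schema registry and Kafka topics:

```go
package main

import "fmt"

// Schema is a minimal stand-in for a registry entry: required fields
// and allowed enum values per field.
type Schema struct {
	Required []string
	Enums    map[string][]string
}

// Validate returns nil if the event passes, else the first violation.
func (s Schema) Validate(ev map[string]string) error {
	for _, f := range s.Required {
		if _, ok := ev[f]; !ok {
			return fmt.Errorf("missing field: %s", f)
		}
	}
	for f, allowed := range s.Enums {
		v, ok := ev[f]
		if !ok {
			continue
		}
		good := false
		for _, a := range allowed {
			if v == a {
				good = true
				break
			}
		}
		if !good {
			return fmt.Errorf("not in enum: %s=%q", f, v)
		}
	}
	return nil
}

// Ingest routes invalid events to a dead-letter queue for manual
// inspection and replay, instead of letting garbage through.
func Ingest(s Schema, ev map[string]string, main, dead *[]map[string]string) error {
	if err := s.Validate(ev); err != nil {
		*dead = append(*dead, ev)
		return err
	}
	*main = append(*main, ev)
	return nil
}

func main() {
	s := Schema{
		Required: []string{"user_id", "subscription_status"},
		Enums:    map[string][]string{"subscription_status": {"active", "cancelled", "expired"}},
	}
	var mainQ, deadQ []map[string]string
	Ingest(s, map[string]string{"user_id": "u1", "subscription_status": "active"}, &mainQ, &deadQ)
	err := Ingest(s, map[string]string{"user_id": "u2", "subscription_status": "cancelld"}, &mainQ, &deadQ)
	fmt.Println(len(mainQ), len(deadQ), err != nil) // prints 1 1 true
}
```

The returned error is what would drive the alerting side: email the event owner, or page if it persists.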
You could optionally choose for the event to just pass through with normal alerts. We don't casually allow that in practice, but we have the option. The image on the right shows how the monitoring system catches the alerts: you can see a not-in-enum alert, a type-mismatch alert, a missing-field alert. All of these have allowed us to keep strict control on quality, and of course, garbage in is garbage out; we don't want that. The third pattern for us is very common: you cannot drop my data. If I had to pick only one thing that's crucial, it's being very durable. Hotstar has abnormal peaks. The graph you see on the right is an actual traffic pattern during the IPL finals; this was a match between Hyderabad and CSK, and the peaks are crazy. The data comes unannounced. Dhoni comes to the crease, he hits a six, boom, we have three million more users on the platform. Then nothing happens, and four million users just leave the platform. These are events you just cannot predict, but you still have to build for them. And we've learned things the hard way. For scaling these abnormal peaks, we started off by proactively putting up boxes. If we knew there was a Bombay versus Chennai match, we would say, hey, look, let me provision 60 boxes, because I know that at least for one second during the match I'll have something like eight million concurrency, and I'm building for that for the entire six hours. Of course, not useful; we ended up burning money like anything. And from there, we've matured very well. The second thing we did was start scaling on the number of requests.
We always plot the number of requests we're getting, and a simple derivative of the incoming requests can tell us, hey, look, we're expecting a peak in the next three seconds, or the next minute. That worked well, but now we have much richer data. This is the third IPL I've been at the company for, and with these three IPLs we've got a good sense of how the traffic behaves. Now we're in a position to do a lot of predictive traffic modeling: we can figure out that if Dhoni is coming in to bat, we can expect an X percent increase in traffic. That's helped us be more proactive with scaling. The other important thing we've realized is that systems will fail, no matter what, no matter how. One very good example, again, is the IPL. There was this match; IPL usually starts at 8 p.m. At 7:10, our Kafka brokers go down and we don't know why. We're struggling to figure it out, looking at logs, doing everything we could do in the world. Panic kicks in, of course: there's an 8 p.m. match, and the whole app is based on what you've built. And luckily, for that match, at around 7:53, 7:55, the brokers suddenly started working again. We don't know why, we don't know how, but we could do the match. And one thing we realized from that: you are not always lucky. We did an RCA, our CTO lashed out at us, and we took all of that, but we realized one thing: tomorrow, come the Asia Cup, if the systems fail, we're not sure we'll be back on time. So we spent a lot of time ensuring resiliency in the system. We have a lot of circuit breakers now, and even apart from that, we figured out a degenerated mode for the system.
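The "simple derivative of incoming requests" trigger described above can be sketched in a few lines. The threshold and window here are illustrative numbers, not Hotstar's:

```go
package main

import "fmt"

// ShouldScaleUp looks at recent per-second request counts and fires
// when the rate of change exceeds a growth threshold: the derivative
// of traffic, not its absolute level, is the early-warning signal.
func ShouldScaleUp(reqPerSec []int, growthPerSec int) bool {
	if len(reqPerSec) < 2 {
		return false
	}
	last := len(reqPerSec) - 1
	return reqPerSec[last]-reqPerSec[last-1] >= growthPerSec
}

func main() {
	// A six is hit: traffic jumps 3M -> 5M within a second.
	fmt.Println(ShouldScaleUp([]int{3_000_000, 5_000_000}, 500_000)) // prints true
	// Flat traffic: no action needed.
	fmt.Println(ShouldScaleUp([]int{3_000_000, 3_010_000}, 500_000)) // prints false
}
```

The predictive modeling the speaker mentions would replace this reactive rule with a forecast, but the derivative check is the V1 that buys you the few seconds a spike allows.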
Now, one thing is very important to us: like I said, we cannot lose data. But we have degenerated SLAs. SLA one: I'll give you the data, and I'll give it to you in real time. SLA two: I'll give you the data, but I'll give it to you delayed. So now, if at 7:10 I realize my systems are not up to the mark, my circuit breaker opens and the data starts going to a secondary data store, which in our case is mostly S3. The data goes to S3, and then we have a backup channel that takes data from S3 and puts it back into the main flow. That at least ensures we don't panic when the match starts, and we still don't end up losing the data. So that happened, and I think we'd just survived that scare when we got another one. The other scare was when we started doing a peak of one million, 1.2 million events per second: our load balancers started throwing up, and for us it was the first time a load balancer couldn't survive the scale. After a lot of research, we ended up sharding our load balancers, so now we have an LB1, LB2, et cetera. That worked for a while, but we've again matured and become a little better at it. Now we do smart routing on the edge: we have compute happening on the CDN provider. That has allowed us to do smarter things. Let's say I'm browsing Hotstar from, say, Kanpur. I'm not connecting to my Mumbai data center or my Singapore data center first; I'm connecting to maybe a UP PoP of my CDN provider. And we run compute there, to figure out from the headers what kind of event the user is sending.
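The circuit-breaker-to-secondary-store flow above can be sketched with mock sinks. A minimal sketch under assumptions: the sinks stand in for Kafka and S3, and the failure-count threshold is an illustrative policy, not Hotstar's actual breaker logic:

```go
package main

import (
	"errors"
	"fmt"
)

// Sink abstracts the primary path (the Kafka ingest) and the
// secondary (S3, for delayed replay).
type Sink interface{ Write(ev string) error }

type okSink struct{ rows []string }

func (s *okSink) Write(ev string) error { s.rows = append(s.rows, ev); return nil }

type downSink struct{}

func (downSink) Write(string) error { return errors.New("broker down") }

// Breaker opens after maxFails consecutive primary failures and then
// diverts everything to the secondary store for later replay: the
// degraded "SLA two, delayed data" mode from the talk. Either way,
// no event is ever dropped.
type Breaker struct {
	primary, secondary Sink
	fails, maxFails    int
}

func (b *Breaker) Write(ev string) error {
	if b.fails < b.maxFails {
		if err := b.primary.Write(ev); err == nil {
			b.fails = 0
			return nil
		}
		b.fails++
		if b.fails < b.maxFails {
			return b.secondary.Write(ev) // breaker still closed; divert this one
		}
	}
	return b.secondary.Write(ev) // breaker open: secondary only
}

func main() {
	sec := &okSink{}
	b := &Breaker{primary: downSink{}, secondary: sec, maxFails: 2}
	for i := 0; i < 3; i++ {
		b.Write(fmt.Sprintf("ev-%d", i))
	}
	fmt.Println(len(sec.rows)) // prints 3: nothing lost
}
```

The backup channel that drains S3 back into the main flow would sit on the other side of the secondary sink.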
Is he a premium user? Is he a free user? Those decisions help us figure out how critical that event is. So a watch-video event gets a fast lane: of all the events coming in, this is the one I let through immediately. Versus maybe a downloaded-video event: dude, you come, and I'll start analyzing you in maybe an hour or two. Those smarter decisions have helped us scale better and will help us handle whatever input scale comes onto the platform tomorrow. I'll quickly move on to data storage. We just wrapped up the first part, data ingestion; data storage is the middle part. Data storage is simple, right? No frills. I just want to store my data and I want to query my data, no other asks. Our data size has grown dramatically. I spoke about this earlier: from last year to this year we've grown 200%, and my sense is that next year we'll grow a triple-digit, maybe even a four-digit percent. Until now we were only in India, but now we have India, Indonesia, Vietnam, countries as populated and as crazy about entertainment as India is. So the things we've built today will of course be useless tomorrow, and we have to continuously iterate and make them better. But with data storage, one of the most important things we wanted was for the data to be available in near real time. I'll give an example. Let's say there's an RCB versus CSK match happening. Okay, I'm a Dhoni fan, so I'll keep bringing up Dhoni a lot. So it's Kohli, an RCB versus CSK match, and let's say for a brief period there's a lull, like three or four overs where nothing is happening in the match. A dead phase, and users drop. And they do.
And let's say suddenly Kohli starts hitting sixes. He goes berserk like he does whenever he gets into his mood. We want the ability to call back those users who dropped from our app in the last five minutes, because those are the users who are super likely to come back onto the platform. Now you want to build a storage layer, a platform, that allows our analysts to do that. You can't expect an engineer to be pulled over to run scripts right then. An analyst should just come and write a SQL query: hey, give me the users who dropped in the last five minutes, I want to send them a push notification. You should be able to do that. With that in mind, we've built a system that makes the data available for query in less than 10 seconds. Data storage pattern one: that noisy analyst will always be noisy. I don't know how many of you relate to this, but it's a very common thing I've noticed. There's this one dude who will run queries for eight hours, ten hours, and nobody else in the company will be able to use the data system. However much you tell that guy, he won't listen. So we did the other thing. We said, we won't tell you anything, but we'll build systems that allow you to do whatever you want to do. And Hotstar is very seasonal: you have IPL that happens every year, GoT is happening after two years now. And while it is seasonal, the other important thing for us is that there's always a new season. This month we're launching Hotstar Specials, our version of originals, so every month there will be a new original season coming.
So every month there's a new season that our analysts get busy analyzing, and for each in-season period they end up doing two, three years' worth of querying, all on my cluster. When we started, it was very difficult. Folks used to tap a button, run the query, go for coffee, maybe go for a smoke break, have lunch, come back and still see it running. But we've done better from there. We realized one thing: one of the super important things for us is to decouple storage and compute. Storage is not elastic, it always keeps growing, but compute is very elastic. If you're running a query over three years' worth of data, maybe I could provision ten boxes for you for those 15 minutes, but I should be able to scale them down immediately. When we started off, we were mostly on Amazon Redshift, and that didn't allow us to decouple well, and I'm talking about last year. From there, we've given a lot of thought to decoupling it, and we've been able to do that. With that, we've been able to do resource isolation very nicely. The retention team has its own compute instances, which can scale up and down however they want. The sales team has its own cluster, and the engineering teams have their own cluster. Everything scales independently, but the data is the same: all of these compute clusters in the end talk to our data stored in S3. That now allows us to scale independently. Data storage pattern two, simple: find patterns and optimize the query further. This is one thing I've realized: however much you optimize a query, there's still room for more optimization.
To that extent, one thing we're very particular and alert about is that we log everything that happens in a query. The query planner: what happens in the planner, how is the planner running the query, how many scans is the query doing? All the while we're interested in understanding: look, is there data that's frequently queried by multiple queries, and do all those queries run through the same lineage? What we've done is, periodically, after all this analysis, figure out that if I break a query down into multiple lineages, lineage one and lineage two are common to most of the queries. So let me do one thing: let me create an aggregated, consolidated dataset of lineages one and two, and ask the analysts to start from lineage three. That has allowed us to make the queries more performant. And when we do that, our analysts, who are of course much smarter than us, come up with new definitions of data that just break the whole lineage pattern. One such example: last IPL, IPL 2018, our analysts ran out of all the definitions of things they wanted to do, and they came up with a new definition called IPL reach. That's anything on the platform that could have a first-degree, second-degree, or third-degree association with IPL, which could also mean an ad run on the platform that's related to IPL. I want to analyze all that content and figure out the impact on users. So then again our engineers jump into the picture: hello, please, please, please don't run the query on the raw data, I'll create a new definition for you, which is IPL reach. We did that. But again, that meant our engineers spent a lot of time doing this.
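The lineage factoring described here amounts to materializing the shared prefix of many queries as an aggregated table, so analysts can start from "lineage three" instead of re-scanning raw events. A minimal sketch using SQLite, with made-up table and column names:

```python
# Sketch of lineage factoring: when many queries share the same first
# stages (scan raw events, filter, group), materialize that shared prefix
# once so analysts query the small aggregate instead of the raw data.
# Table and column names are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_events (user_id TEXT, event TEXT, ts INTEGER);
INSERT INTO raw_events VALUES
  ('u1', 'watch_video', 100), ('u1', 'watch_video', 110),
  ('u2', 'watch_video', 105), ('u2', 'click_banner', 106);

-- "Lineage one + two": the expensive shared prefix, computed once.
CREATE TABLE agg_watch AS
  SELECT user_id, COUNT(*) AS watches
  FROM raw_events
  WHERE event = 'watch_video'
  GROUP BY user_id;
""")

# Analysts now start from "lineage three" against the small aggregate.
rows = conn.execute(
    "SELECT user_id FROM agg_watch WHERE watches >= 2").fetchall()
print(rows)
```

In production this would be a scheduled job over the warehouse rather than an in-memory table, but the trade-off is the same: pay the shared scan cost once, and every downstream query gets cheaper.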
So progressively we built this thing we loosely call ETL as a service, which allows analysts to just write SQL, and all the aggregation, all the heavy lifting of scheduling it on a Spark cluster, running it as MapReduce jobs, et cetera, is taken care of by the platform. Analysts still just write the query and create a derivative data source, so to say. This worked beautifully for us, and we've now invested more resources in making it better. My last chunk is on data consumption, super critical for us. Just as we wanted our producers to not have to think before ingesting any data, we didn't want our consumers to think before consuming the data. Talk is cheap, show me data: cliché, but we swear by it. And it's something that runs in the organization top down. Our CEO wouldn't entertain any business meeting without looking at data on our BI tool. And that just helps us: helps us get more work, and helps us be more relevant in the organization. It's built into the DNA and we stick to it. Another thing we wanted was a single interface to query, and we wanted that single interface to be tightly controlled by engineers, because we realized one thing: the more control we have, the more quality checks and the more goodness we can extract from the system. In the end it doesn't become a matter of convincing an analyst to work in a certain way; we just do it automatically, on the fly. Data consumption pattern one: don't make me think. The data stack is very complex. When we started, we had Kafka, we had HBase, S3, Redshift. For a very long time we were thinking of using Ignite.
There's a bunch of technologies behind the scenes, but our analysts need not worry about them. They shouldn't even know what's beneath, or what sits between their query and the data. We spend a lot of time building platforms and libraries that abstract all of this away from the analyst, and we have only one single SQL interface to the data platform. So an analyst can come and write a SQL query, and that query can talk to data coming from Kafka, talk to S3 via Hive, talk to HBase, or talk to any other data source for that matter, even a relational store like MySQL or a document store like MongoDB. And it's super simple: you just write a SQL query. We build systems that resolve where a query needs to go and how. While doing that, we've given a lot of thought to treating streaming data and stationary data alike. Kumar, you mentioned a point about how the data that lands in the data lake and the data warehouse needs to be enriched, needs to be full of information. If you want to support that, the data ends up arriving late. In our case, the data in the data lake, or rather the data warehouse, shows up one hour after it has been ingested, and we wanted it that way. Yet we wanted folks to query the data in real time. So we let people consume data from Kafka. Our Kafka ingestion system, like I said earlier, gets the data within 10 seconds of the moment it's generated. You click a button, and within 10 seconds I know person XYZ has clicked a button, and I can run my analysis and aggregations on that 10-second window. So analysts just write a SQL query, and underneath we figure out: this query needs data within the one-hour window, so I talk to Kafka; if it's beyond that, I go and talk to my data warehouse or my data lake. This is what we've been doing.
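The time-window routing just described can be sketched as a small dispatcher: if the data a query asks for falls inside the freshness window of the stream store, send the query to Kafka; otherwise send it to the warehouse. The names and the one-hour threshold below are illustrative, not the actual routing logic:

```python
# Toy sketch of time-window query routing: data newer than the warehouse
# freshness lag is served from the real-time system (Kafka), older data
# from the data lake / warehouse. Names and thresholds are illustrative.
import time

FRESHNESS_WINDOW_SECONDS = 3600  # warehouse data lands ~1 hour late

def route_query(oldest_ts_needed, now=None):
    """Return which backend should serve a query whose oldest needed
    timestamp is `oldest_ts_needed` (seconds since epoch)."""
    now = now if now is not None else time.time()
    if now - oldest_ts_needed <= FRESHNESS_WINDOW_SECONDS:
        return "kafka"       # inside the freshness window: stream store
    return "warehouse"       # beyond it: Hive / S3 / Redshift

now = 1_000_000
print(route_query(now - 600, now=now))    # data from 10 minutes ago
print(route_query(now - 7200, now=now))   # data from 2 hours ago
```

A real dispatcher would also handle queries that straddle the boundary, typically by splitting them and merging results from both backends.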
We're always enhancing what we've built. Let me spend some time on our future work. One of the biggest things we're now working on is something I like to call local data, global view. Like I said, we'll be in 15 countries, so we need a data lake and a data warehouse local to each country. The Vietnam Hotstar CEO should be able to mine Vietnam's data efficiently, and a global CEO should get visibility into how all 15 countries are performing. While we do that, we have to ensure GDPR compliance in Europe, comply with whatever the new privacy law in India turns out to be, and likewise in other countries. A bulk of this year and next year will be spent building systems that can scale globally for us. Our vision is to have a global warehouse: I come, write a SQL query, it figures out where it needs to go, and the routing happens transparently. The other thing: we want to be leaner on the wire, with more in-flight enrichment. The client should just tell us, hey, look, I'm user ID U1 and I'm watching content ID C1, and that's about it. Our in-flight enrichment should do streaming joins, get data from multiple places, and make the event richer. And our expectation is that whatever we produce in the one hour we take to enrich the data for the warehouse, we want the same data available in the 10-second system. So within 10 seconds you should have everything at your fingertips: stop talking to your warehouse, start talking to Kafka, and attach the rich data in real time. The third thing we want to do is build in a lot of anomaly detection models, and that's super important for Hotstar. I'll give you an example. Yeh Rishta Kya Kehlata Hai: how many of you know that show? Yeah? Only one, Zainab?
So that's apparently the most watched show. Oh, you also know it? He does, you know. You were hiding. So Yeh Rishta Kya Kehlata Hai is the most-watched show on Hotstar, and one stat I'd like to share: for this particular show, Hotstar, the OTT platform, accounts for about 25% of the entire watch time the show does. And that's remarkable, because the OTT industry is just a year and a half old, and within that time the internet has been able to do 25% of what TV has been doing for 20, 30 years. Two more years, and I don't know if cable will still be there; I hope it is. So with that, there are a lot of anomalies that happen in the system. One such instance: sometime in August, watch time on the platform suddenly dipped, and it dipped by a double-digit margin. Everyone panicked, from the product managers up to the CEO. People were not sleeping at night; people were in a room bringing in their hypotheses, running their queries, analyzing data to figure out why it had dipped. And you know what? It took a day and a half to find the reason. The reason was that two lead characters were planning to have a divorce, and people didn't support that storyline, so they stopped watching the show. But we wasted 48 hours figuring out why people had stopped watching. This is very common for us, and we're now spending a lot of energy building anomaly detection models that work on that same 10-second window: figure out if there's an anomaly in that window, and if there is, alert the person responsible so we can act on it immediately. Last, I think we are heavy consumers of open source tools.
Everything we've done: of course, we started off with hosted solutions, but over the last year we've moved entirely to our own in-house platforms, and everything is built using open source systems, from Apache Kafka to Hive to everything. We will be open sourcing our ingestion system; it's a highly available, durable system, which I'm sure a lot of folks would benefit from. It's something we're working on. We write a lot about everything I just spoke about at tech.hotstar.com, and we have monthly meetups across our Bangalore, Bombay, and Gurgaon offices. Do check them out if you want to know more about how we work. That's all. Open for questions. Mike, so we have a question here. Did I miss anyone else? Anyone else have a question? Okay, we'll start with Amit. So this is a technical question. I just want to understand: what is that single SQL interface? What kind of technology are you using behind it, so that based on the SQL it will hit the right source system, whether the query should run on a relational store or HBase or MongoDB? Yeah, that's right. We use Hive as a proxy for that, and the SQL interface is HiveQL, basically. We've forked certain parts and put in our routing logic, which figures out which table is where, and more goodies there. But Hive is what you're looking for. Okay, so based on Hive, you decide the source system. That's true. And what about the placement of that data? Because with the volume of data that's coming in, I'm sure you would have had to do partitions on any number of dimensions. Did you do something interesting there? Sorry? Did you do anything interesting as far as the placement of the data, or scoping the query to be able to just go?
You can't scan 800 billion, or whatever the number of events is, every day. Yeah. I think for us, so far, we've not done anything fancier than regular partitioning patterns. We've been very conscious about how we partition the data and where we store it. But incrementally, we've started thinking about hot data and cold data more seriously, and we're now talking about tiered storage. So you have the historic data in S3, which is cold. You have warmer data, if I may use that term; say Redshift could hold four or five months' worth of data. And we're thinking of a third tier, like a cache, something like Ignite, which has the most recent data. Then if you want to query the recent days or the recent week, you talk to Ignite; months, talk to Redshift; years, go to S3. This is what we think could solve it. Of course, this is something we've just been debating, because we do see the problem, but we certainly haven't fixed it yet. Kumar, do you want to take that? Yeah. So one thing we've realized is that usage patterns are not fixed; the questions will change. Users start using the product differently, click on different buttons, or you see a dip somewhere else. As Jayesh mentioned, a popular show saw a dip, or the sports vertical sees a dip; for us it's, did monetization see a dip, or did user engagement or retention see a dip? So user queries will change. The only constant in all of this, especially when your data is immutable, is time. It is time series at the end of the day. And queries will generally be very co-located in time: if an analyst is analyzing what happened last week, all their questions are going to be within that window. So what we've done is generally split into hot and cold over time.
Keep the recent data as easily accessible as possible, because you don't know where the dip will happen. Like demonetization: at 8 p.m. on that particular day, we came to know about it by looking at the data. We were in a meeting, all the graphs and alerts started firing, and we traced it back to 8 p.m. That's the power of it. Only time will tell you what is wrong, and that's all right. So we partition on that basis, and you talk to either MemSQL, in our case, or any real-time system, or the queue directly, or you talk to Redshift or S3 for different time ranges. This question is primarily for you, but you can also pitch in. The architecture of the data pipeline that you shared is mostly the present state, right? But going back three, five years, it must not have been; you must have done a couple of iterations, modifications, improvements. So for somebody who's starting off fresh, what are the common pitfalls they fall into, or any learnings you can share with the group that we should consider? Yeah, so there was also a question about scaling; it's a general question, but probably Jayesh can answer. You generally have to scale up in a matter of seconds, but scaling up instances generally takes a while, right? So how do you do it? Do you have warm standby instances, or how do you achieve this quick scaling up? That's one question, and one question for Pushpesh: you talked about immutable data. So what is this immutable data, and how do you achieve immutability?
For data immutability, do you have any specific systems around making data immutable, like blocking updates and allowing only ingestion, or something like that? Okay, so you're taking that one? Yeah? All right. To answer the first question, or I'll try to answer it: that's what I tried to cover in the talk. It did not look like what it looks like now, at scale; in the beginning you don't have to worry about scale that much, more about correctness. So a few pitfalls and a few things, very typical of any engineering system, and they obviously apply to data pipelines. Do not over-architect it, or do not try to over-architect it. Know the boundaries, and design the data schema to be as generic as possible, depending on what your business is and whatever event patterns you can foresee. What helped us is that we had contextual experience in gaming, so we knew what scale would bring, or at least had a fair idea. We used that and applied it to the data schema design more than the architecture design, because the system architecture will most probably change. We ended up choosing Redshift in some places, and for some other things we chose a queue, and over time we changed them, because the tech ecosystem does change and that's all right. But fix the data schema. Luckily we haven't changed our data schema much: the whole Kingdom-Phylum nomenclature we chose on day one has not changed till now and has served us well. So I think that's a pattern that will help, and tech architectures, I think it's okay for them to change over time; new technologies will come. Only one point to add; he's pretty much covered it. For us, we need to understand what business we are in. Hotstar is not here to build tech platforms from day zero. Hotstar is there to just let people watch entertainment content.
For a very long time we were only using third-party services. We were heavily using hosted services from Amazon, and we use a lot of CleverTap; the CleverTap folks are here. We relied a lot on external systems because we really wanted to figure out what it is we want to extract from such a system. So: no engineering until you figure out what the use case is. Over a period of time we gained an immense understanding of how users want to send data and how users want to query data. Only then did we decide to build our own. I'm sure I would have continued to use those systems as long as my CTO kept giving me more dollars, but that stopped, and once it stopped we had to go back and figure out alternatives. But that's how I'd go about it if I had to start again from day zero. I can take the first question, on scale. That is very difficult for us, and it's a very good question. When we did the last IPL, we did a lot of dry runs; we used to call them game days. IPL started April 7th, and from mid-Feb we did multiple game days. In each game day we'd simulate the entire traffic pattern. Day one, let's do a one-million peak; then five million, seven million, ten million. And we'd test all the systems, up to the limit where I figure out: this is where I want to cut off, this is where I stop keeping the service operational. There was only one rule: if everything fails, the video still has to run. Even if a user comes and sees a blank screen, the player should play. That gave us our cutoff limits. The first year was entirely manual pre-provisioning. With these game days we got a lot of understanding, and then when a match ran, we used to have a game marshal who'd marshal the match, and that guy would call out: dude, looks like we're at three million now, but Dhoni's come on the screen, we'll definitely see it crossing five million, go upscale the systems. So we had a brief five-to-ten-minute period to upscale the systems.
But that was again quite manual; we did it that way to get more sense of how it would behave. This year we're investing heavily in containerization. We have our own Kubernetes platform, hoping the spin-up times will be much faster this time around. We're also doing a lot of predictive traffic modeling this time, like I said. So no more figuring out peak capacity using dry runs; it's more about how the application has behaved over two years and how it's expected to behave. That'll give us more control. Maybe an opportunity for me to give one more talk, on predictive scaling. Thanks. So, just our learnings on the scaling side: obviously Hotstar sees much higher peaks on match days. We also see peaks, but not at that level. What we generally do is a mix of time-based and load-based scaling; whatever the pattern, it's either CPU or network, something will be there. So a mix of both. For us, thanks to the product being a game, you know the traffic pattern of the product, so whatever pre-provisioning you can do, you should do. And at the same time, some unexpected events will happen, like Dhoni will come in or Chris Gayle will go berserk, and what do you do? Then the load-based scaling takes care of that. Now, the immutability part: we had it as a requirement, a constraint, from day one in the thought process. Every analyst knows and completely understands that it's a requirement: once you have written that the variable should be this, or the counter should be this, you can't change it. You can't come back and say, okay, all the data we collected yesterday, can you please rename it, or can you add one column? If the column did not exist yesterday, it will not exist for the data that was collected yesterday. And that's all right.
Now, different businesses might find reasons to change this, or to not have this constraint. But we think that in a consumer application, where your product is taking shape as users use it, the product will look very different next year. IPL 2019 is going to be different from 2018, and your product looks different; look at the UI, it looks different. So why should your data not be different? It's all right. It works dramatically well for what I'd call games, extreme consumer businesses, but choose depending on your situation. It helps, and that's why the data is read-only. For example, the data scientists and product managers don't have write access at all, and that's okay; only the data engineering team has write access. They can change things if they want to, but in the last five years there have been maybe one or two instances where some specific payment tables had to be corrected because of something seen on the Google or Apple side. Overall we haven't changed much. So my question is around ETL as a service. You were explaining the lineage identification and pre-computing of base data. I believe you're doing it from the DB logs; do you run any kind of modeling on that to identify those patterns? And the second question is around on-demand aggregation: what do you do to make those on-demand aggregated queries run efficiently? Yeah, sorry, I'll just summarize. The first one was how we do the query lineage efficiently. Honestly, we do a poor man's job; we don't have modeling around it. We aggressively capture the logs, and then it's a lot of data munging that happens after that. That just works for us so far. Maybe we'll think about modeling it to make it better, maybe we'll add an AI layer, but today we do it manually. We just look at logs.
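The "poor man's" log analysis described here is essentially frequency counting: scan the logged queries, count which tables (or lineage stages) appear most often, and treat the hottest ones as candidates for pre-aggregation. A toy version, with invented query strings and a deliberately naive parser:

```python
# Toy version of mining query logs for common patterns: count how often
# each table is referenced across logged queries so the most-shared ones
# become candidates for materialization. The regex is a crude stand-in
# for real query-plan or lineage parsing.
from collections import Counter
import re

query_log = [
    "SELECT * FROM watch_events JOIN users ON ...",
    "SELECT count(*) FROM watch_events WHERE ...",
    "SELECT * FROM ad_events WHERE ...",
]

def table_frequencies(queries):
    counts = Counter()
    for q in queries:
        # Grab the identifier after each FROM / JOIN keyword.
        counts.update(re.findall(r"(?:FROM|JOIN)\s+(\w+)", q, re.IGNORECASE))
    return counts

freq = table_frequencies(query_log)
print(freq.most_common(1))  # the hottest table surfaces first
```

In practice you'd count whole lineage prefixes from the query planner's output rather than bare table names, but the ranking idea is the same.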
The second question was how we do aggregations efficiently, how you aggregate on demand. Two things. On-demand aggregation can be on pre-aggregated data, or on raw data that's coming in. Pre-aggregated is very simple. For raw data, people end up using Spark Streaming jobs or Flink jobs that continuously listen on a Kafka stream. More recently, we've been experimenting with the Hive connector that can also talk to Kafka, and that's again very interesting. Of course, it's still in POCs; we've opened it up for alpha users. But it's interesting: you write a query and you define a window. Say I'm looking at a 10-minute interval; I'll get all the data in that 10-minute interval from Kafka, do an in-memory aggregation, and throw the response back. Does that answer the question? Of course. So my question is: you said you can't drop data. When there's a need for more than one cluster, how are you planning to achieve that? Is it segregation based on regions, where you'll have separate clusters altogether, or is it replication of the same data to handle a huge volume of requests? So we did multiple things, and we started simple. We already had a cluster that was writing to Kafka; we dropped in a circuit breaker, so whenever Kafka is unresponsive, the breaker kicks in and throws the data to S3. That's what it started off with. Then over a period we made it fancier. We now distribute data into a slow lane and a fast lane: even if I'm getting 10 million concurrency, my fast-lane events, such as watch-video and viewed-page, won't have that much traffic; they end up peaking at around one or two million.
So we give an entire cluster exclusively to them, and the other events go to a different cluster. And we've been able to achieve this by making the decision on the edge: the PoP decides which cluster to talk to. Maybe we'll have a better pattern as we see more of this, but for now it's static partitioning across clusters to ensure it doesn't fail, and downstream, we just break the circuit and let it go to the secondary, degenerated mode. Hi, thanks for a wonderful talk and for sharing your experience. When Kafka came out, all of us data engineers immediately jumped onto it, because its popularity is a testament to the fact that it solved a real problem: the guaranteed ordering of events across partitions at high throughput. What is the one piece of technology you wish someone would invent right now? In Kafka? No, at your scale, for the kind of problems you're seeing. We used to spend an inordinate amount of time just making sure events were ordered correctly at high throughput. What is the big problem you're dealing with today for which a solution doesn't exist, that you wish someone would invent? I wouldn't know whether a solution exists, but with Kafka, one of the things we want is a way to decouple compute and storage there as well. We've been finding it hard to keep our brokers always up and running, because sure, the data is always there, but our compute just dies down in the end. That's something we're in fact talking to Apache Kafka committers about, something they've been planning to do late this quarter.
But once they go on Kubernetes, you'll have one cluster that just manages routing and persisting of data, and a different cluster that just stores the data. I think I'd want that, right? If I have that, I'm no longer worried about the spikes that Kafka can handle and the amount of data that I can persist in Kafka. That's one. Another thing that I would love to have, and I spoke about it in my latest slides, is that we need to blur the definitions of streaming and stationary data. We're doing a poor man's job of blurring that definition today, but I think a lot could be done with one system that allows me to query real-time data at this scale and yet fall back to stationary data. Okay. I think tea is about to be served, but just a couple of quick announcements before we break for tea. Thanks to Pushpesh and Jayesh for a Dhoni-like start to the Fifth Elephant. Of course, that also reminds me that we should be aware of tech materialism, but that's another discussion. All of you have badges here. There's a QR code on them, which is to enable you to share contact information with others. You can share this contact information by downloading the HasGeek app, version two, from the Google Play Store. There's unfortunately no app for iOS, and if you try to scan the QR code without the HasGeek app, you won't get anything. So if you want to share contacts with each other, you can just scan the badge with your phone. So that's number one. The other is that each one of you has got a set of four coupons: tea, lunch, tea, and t-shirt. Each one of you will be getting a Fifth Elephant t-shirt, which will be available at 1 PM when we set up the t-shirt counter out there.
And you can exchange your coupons for tea over there, one coupon each, because one's for the morning and one's for the evening. Hopefully we'll have some extra chai. Since we are running 13 minutes late, let's come back here from the tea break by 11:45 to resume the next session, which will be Govind Pandey from Flipkart. Have a good break. Okay, can I please ask everybody to settle down? We'll start with the next session. Can we move inside? We're starting the next session now. Okay, as Govind is settling in, can we please take our seats? We'll start with the next talk, Govind Pandey from Flipkart. Can everybody please settle down? Can I please ask people who are standing to sit down? Okay, I'm personally quite excited about Govind's talk, because the prequel to this talk was what Govind delivered at the Fifth Elephant in Bangalore last year, and this is work that he has built on. At this point, it's fair to introduce Govind as one of the people who has also helped in reviewing talks for this conference. He's gone from a speaker to a reviewer, to somebody who's carrying this work forward. And I'm very glad that you're part of this community, Govind. Over to you. Hi, everyone. I'm here to talk about my experiences, and my team's experiences, detecting anomalies that happen in Flipkart's fulfillment network. Now, I realize that not all of you would be familiar with the logistics industry, so we'll spend a few minutes introducing you to the logistics side of things, and then I'll go a little bit into the details of the engineering challenges as well as the data science challenges. Now, I might be the one standing here and talking, but there's a big team behind this; here's a photograph, and credit where it's due. A little bit about me: I've been in the industry for about 12 years now. Everybody went west; I chose to go east and spent time in some of the universities in the east.
Now I'm spending time on supply chain automation, predictive optimizations, and actionable insights at Flipkart. The talk that Zainab mentioned is linked there; that's the platform we built on and that led to this. Let's take a little segue, right? Let's pause our engineering brains, pause our data science brains, and talk about aviation for just a few minutes. How many of you came to Mumbai via flight, through Shivaji International? Awesome, right? Let's talk about that airport. How many flights per hour on average? 42 flights landing or taking off in an hour. And that's a big number. It might seem like two digits, but for an airport, it's a big number. I know 40% of flights are delayed, and 8% of those are because of air traffic control. Now, I don't want to crib about flights being delayed; that's not the intent. The point I want to make is that those 8% of flights which are delayed because of an ATC call are delayed because the ATC detected a risk. The ATC detected that something could potentially go wrong, and hence paused the flight. With me so far? That's the reason for the delay. Let's not assume that they're not doing things right. They are actually doing things right: they are pausing the flight because there is a risk. For them to do their job, they need visibility. They need alerts. They also need recommendations, right? And these are the things which help them do their job better. Now, with this in mind, let's move on to the next step: I'll introduce you to the Flipkart logistics domain. Keep this analogy in mind as we go through the talk; it will help you, and I will come back to it a little later. I'll cover the motivations for the problem that the team is solving, the approach that we used, and some of the learnings towards the end. A fairly simple agenda. What does the Flipkart fulfillment network look like?
On the left-hand side, we have the sources of the packages that you get, of what's inside the package. It could be a marketplace seller; this person could own a big store somewhere or could own their own warehouse. Or we could have our own warehouses, right? For context, the warehouses are thousands of square feet, with a vertical as well as a horizontal footprint. Items then move through the network and end up at the delivery hub, and this part is fairly self-explanatory: from here, the Wishmaster picks things up and brings them to your doorstep, right? This is a very brief view of our fulfillment network. The point to understand here is that it's a complex network in itself, and there are end-to-end processes here which are complex. There is an end-to-end process starting right from the point a shipment is received to it getting to the customer. Not just that, each component has complex processes. A sortation center, for example, is a five-leg process with multiple inputs and multiple outputs. A warehouse is an even more complex process, right? So it's an end-to-end complex process with complex processes in each leg. None of these are homogeneous. Each process has its own identity, its own complexities, its own differences, right? And we deal with high variability. I'll not talk much about the high variability because the previous talks touched upon it significantly, but yes, we do have it. And to manage all of this, we have something called a control tower. This is where the analogy comes back: the air traffic controller and a control tower, right? The traffic control for aviation works at multiple levels; likewise, our control tower. We have central control towers, we have asset-level control towers, and then we have process-level control towers.
And they're looking at all of these things and trying to make sure that the fulfillment network works as expected and the customer gets the delivery when we promised it. Now, yes, we talked about 42 flights per hour, and I'm not comparing the impact or the importance of an ATC versus this, but in terms of pure volume, it's orders of magnitude higher: we're talking lakhs of shipments. They have a zero-delay target; we have a zero-delay target. Hundreds of facilities, thousands of processes, like I said. Mutations, again, 200 GB per second. These numbers, across multiple systems, should give you an understanding of the complexity of the problem that we are talking about, right? We need to tell them that something is going to go wrong, with a 15-minute turnaround time. And 15 minutes works for us, not seconds, because we are dealing with physical processes, right? Physical processes require somebody to go do something physically, and that doesn't happen in microseconds or seconds. It takes minutes, even hours sometimes. An example: if I have more shipments than trucks, I need to get a new truck, and that truck will take time to come. So we can live with 15 minutes, right? But what we can't live with is inaccuracy. For example, with top-of-the-funnel data where you're looking at visitors, you could drop 100 or maybe 1,000 visitors, but your millions-of-visitors trend will remain constant. Here, if one shipment is missing, a person will spend time hunting through the entire warehouse, the entire sortation center: where did that phone go, right? So accuracy is paramount; latencies are okay. Again, they need visibility, they need alerts, they need recommendations. So this is where we get into the solution aspect of it. These are the characteristics of the problem; these are the motivators for us to go and solve it. We started off like most companies do at an early stage, with just crons.
You understand crons; I don't have to go into the details. Anybody here who doesn't understand crons? Okay, we have an engineering audience. So we started with crons, right? We schedule crons, raw data gets pulled, it gets sent via email to an analyst or an operations person, that person crunches it in Excel or some place and then takes a decision, right? The previous talk covered the point where we moved to dashboards completely: self-serve automated dashboards that refresh periodically, with the data presented to you. Then we discovered the next problem, where we ended up with a plethora of dashboards. Has anybody encountered this problem, where you have just too many dashboards to handle? Awesome, so somebody has. And we had a scaling problem here, right? How do we staff our control tower? How do we make sure that nothing gets missed in the multitude of widgets and dashboards that we have? So we had reached this point on the maturity curve. Any guesses what we did next? Some of the previous talks touched on this: we looked at alerts. Can we proactively detect things as and when they are happening and let the relevant person know? And that's where I could have ended, because alerting in itself is not a new problem. People have solved it in different manners. Systemic alerts have existed for eons, if I may say so, and they have matured to a significant degree. We could stop here, but we didn't, and I'll cover the reasons why. So push-based alerts based on dashboards is where we are today, right? The reason alerting is a different problem for us is because, imagine a CPU: you can have a flat-line threshold. You can say, whenever it reaches 99.5% usage, let me know. With me? The different problem that the previous two speakers spoke about is, how do I predict when Dhoni is going to hit a six, right? And that's something that we would want to know.
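To make the contrast concrete, here is one simple variability-aware alternative to the flat-line threshold just described: a rolling mean plus k standard deviations. This is a much cruder baseline than what the talk builds up to, and the throughput numbers are made up:

```python
import statistics

def adaptive_alerts(series, window=7, k=3.0):
    """Flag points that deviate more than k rolling standard deviations
    from the rolling mean of the previous `window` observations."""
    alerts = []
    for i in range(window, len(series)):
        hist = series[i - window:i]
        mu = statistics.mean(hist)
        sigma = statistics.pstdev(hist) or 1e-9  # avoid zero-division on flat history
        if abs(series[i] - mu) > k * sigma:
            alerts.append(i)
    return alerts

# Hypothetical daily throughput (units elided), with a sharp dip on day 10.
throughput = [100, 98, 103, 101, 99, 102, 100, 97, 101, 100, 40, 99]
flagged = adaptive_alerts(throughput)
```

A fixed threshold tuned for one day would either miss this dip or fire constantly on naturally variable days; the rolling statistics adapt to each process's own recent behavior, which is the property the talk is after.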
Here also it's a little bit of a similar problem: how do you predict what the threshold for action is? This image shows the planned throughput of a warehouse versus the deviations, day by day for a month. The x-axis is the day of the month, the y-axis is deviation from plan. I can't give you the exact numbers, but that's deviation from plan, right? As you can see, it's very variable. It's difficult for somebody to come up with a flat-line threshold and for it not to be noisy. Similarly, I spoke about sortation processes, right? Let's take an example of three processes in sortation. There's a primary scan, where things come into the sortation center. There's a secondary scan, where they are batched together in big volumes. And then there's a bagging scan, where they're put into bags which go onto trucks, right? So three processes, for simplicity, and each of those has a throughput which varies like this. They typically move together, but even isolated processes have different throughputs. A similar example for a delivery hub: the throughput of a delivery hub is again very variable, with day of the month on the x-axis and throughput on the y-axis, right? So what is the nature of the problem, and what would be an ideal solution for it? Let's define that now. We want to enable proactive action for any anomalies in the fulfillment network. And proactive is a single word, but it has a lot of connotation behind it, and I'll touch upon some of it a little later. So this is the problem statement, a very simple problem statement. What would be the characteristics of a good solution? We don't want people waiting in front of a dashboard. We don't want people refreshing their emails. We want them to know when things happen, right? Worst case, from the time an event has happened, let's put an SLA of 20 minutes to make sure that they know something has happened. And again, I explained why 20 minutes is tolerable here.
In certain other use cases, it may not be. It has to be backed by precise data. And I'm not using the word accurate, I'm using the word precise, because even single digits matter to us. We should allow for natural variability. So if it starts raining in Mumbai a few months down the line, we should know that it's going to rain and that the throughput is going to change, right? We should not be alerting people for that, because then it becomes noise. It has to be very specific to each use case. Each process is different; each asset is different. An asset for us is a warehouse or a sortation hub. Each component is different. So it has to be specific, right? And once a disruption happens, needless to say, it has to be excluded from the history. If, for whatever reason, the throughput went to zero, or if, for example, we have a six-day Big Billion Day sale, the seventh day's baseline should not be based on the previous six days, right? Otherwise everything would get flagged on the seventh day. We use many machine learning algorithms for this. I'll touch upon one that has served us well, and feel free to go read about it. I'll not go into the details of the solution; all of you are smart enough to do that on your own. So, the isolation forest algorithm; the references are there. The goal is to isolate the few-and-different very quickly. The first step is to construct a tree. We take a random sample of data, select a dimension, select a value in that dimension, and draw a line. We do this repeatedly until we isolate each and every point in the dataset. Now, how do you detect an anomaly? The points which were isolated quickly, by the fewest number of lines, tend to be anomalies. And then you can have your own score around it and decide what's an anomaly and what's not, right? As simple as that. So here's an image representation: that point gets isolated in two cuts and hence is an anomaly.
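The isolation idea just described can be sketched in a few lines of Python. This is a generic toy version (random dimension, random split, average isolation depth over many trees), not Flipkart's proprietary time-series adaptation; points that isolate in fewer cuts score as anomalies:

```python
import random

def tree_depth(point, data, depth=0, max_depth=8):
    """Depth at which `point` is isolated in one random isolation tree:
    pick a random dimension and split value, then recurse into point's side."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = random.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth  # nothing left to split on in this dimension
    split = random.uniform(lo, hi)
    # Keep only the points on the same side of the cut as `point`.
    side = [row for row in data if (row[dim] < split) == (point[dim] < split)]
    return tree_depth(point, side, depth + 1, max_depth)

def avg_depth(point, data, n_trees=100):
    """Average isolation depth over a forest; anomalies isolate in few cuts."""
    return sum(tree_depth(point, data) for _ in range(n_trees)) / n_trees

random.seed(0)
# A tight cluster of inliers plus one far-away point (the anomaly).
cluster = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(50)]
outlier = (10.0, 10.0)
data = cluster + [outlier]

outlier_depth = avg_depth(outlier, data)  # small: isolated in very few cuts
inlier_depth = avg_depth(cluster[0], data)  # large: buried in the cluster
```

Turning the average depth into the 0-to-1 anomaly score used in the original paper, and mapping streaming time-series data into static feature vectors this can consume, is exactly the part the speaker says is their IP.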
Here's a visual representation of the same thing; I'll pause to let it play through. So the score is eight for this particular point, right? And you can have your own thresholds for how many cuts indicate an anomaly and how many don't. So this is the isolation forest. However, if you go read the paper on isolation forest, it was intended for static data. You understand static data? Data which is not evolving, right? Our physical world is essentially time series data. Somebody made the statement earlier that almost everything is time series data. Likewise, ours is time series data, right? And this is where some of our proprietary innovation has come in: we have taken isolation forest, which works on static data, and applied it to time series data. So this is what the time series data looks like, and we transform it to map to certain static data characteristics that can be consumed by the isolation forest algorithm. I'll not go into the details; this is IP for us. We'll be publishing a paper, and we'll let folks know when that's available. But this is something that we have been able to achieve, and that's the point I wanted to make here. So the whole platform works together. There are multiple components in the platform which work together to trigger business alerts for our control tower, right? And this slide shows the various components. This is the part which I spoke about in my previous talk: the incremental data processing platform. What we do here, and this is another one of our innovations, is that we don't process data in huge passes; we process data incrementally, so that we are able to guarantee precision within a given timeframe, right? We have a lot of sources on the left-hand side; they come into the incremental data processing platform, and, like I said, there are dashboards.
And the dashboards are still used today and will continue to be used, because alerts in isolation are not enough: you have an alert, you want to deep-dive, and that's where the dashboards come in. What we have built for business alerts starts here. We built a sampling service which samples the relevant metrics in a pre-configured, periodic fashion, right? So it keeps building a trend of those metrics and puts it into our Flipkart data platform. From there, we build and retrain models, and those go into our machine learning platform where the models are consumed. Then we have the business alerting service, which uses inputs from the sampling service telling it what the metric value is right now, across dimensions. And then we use the machine learning platform to figure out whether that particular value is an anomaly or not, right? Remember the photograph of the team that I showed you at the beginning: the team is spread across all of these platforms. Some of these are existing platforms; some are new. And then eventually we have a business alerts dashboard where a person can see what alerts have come in and their details, and then click through to the dashboard. So I'm coming towards the end of the talk here. What are the learnings that we have had? I think some of you would be most interested in this slide, so let's spend a few minutes here. Just from a scale and manpower perspective: going from a team which is monitoring dashboards and trying to crunch patterns in their minds, to a machine-learning-based system which tells you that things are anomalous and you should act on them, is a definite cost saving for the business. And although we are all engineers, this is the most important point that I want to make, right? There's a significant cost saving here.
The second point I want to make is that we tend to feel that machine learning will do it all for us. It won't. Machine learning alone is not enough, and that's been a learning for us. You need deep domain expertise. You need somebody who understands the primary scan process in your mother hub, and has been looking at it for two years, to come and tell you the nature of that beast, and then a data scientist can deliver a model that works for it, right? So this is again a very important point: machine learning by itself is not enough. Make sure that whenever you are trying to solve a problem through machine learning, you embed a person who understands the domain well. So in a gaming system, somebody who understands the gaming domain well should be there; Hotstar, similar, right? Make sure you embed somebody who understands the domain well. For us at FSG, or Ekart if you know it, breach is a very important number, and it gets dragged into every leadership meeting. Breach is the number of shipments which did not reach the customer by the promised time or promised date. But if I start detecting anomalies on breach, it's not going to help me. With me? Why isn't it going to help me? Sorry? Yeah, it has already happened. I can't do much. The breach has happened, the shipment is delayed. So now what can I do? Figuring out only now that it's delayed doesn't help me. What's important is to identify leading metrics. What leads to a breach? Breach is a lagging metric. What can potentially lead to a breach is the packing throughput in a warehouse. The packing throughput in a warehouse is a leading metric which will help you prevent breaches. If there's an anomaly in the packing throughput of a warehouse, you can do something: put more people in that area, or start diverting orders to another warehouse, and prevent breaches from happening. So leading metrics are important, and this is where the challenge comes. Your lagging metrics, your metrics of concern, may be few.
Most businesses run with three, but there is a plethora of leading metrics. And that's where things get interesting, because each metric and use case is different. Just because you have a model which can detect anomalies in warehouse throughput doesn't mean that the same model will detect anomalies in your sortation center throughput. And the more metrics, the more models. So we have, or intend to have, a dedicated model for each and every metric, each and every use case. A single model may not be able to catch everything, and that's been our learning. I know there are people out there who'll debate this, and I'm open to that debate, saying that one model should be able to do it. Yes, let's get there. But today's learning is that no, there is no one catch-all model. False positives and false negatives are a way of life. Be receptive to them and learn from them. This will happen; at the end of the day it's still a machine, so a human may be needed. For an alert which may not require action, let the human decide. Or if an alert did not come for something that happened, let the human come in and give you feedback. The next step for us would be to try and automate actions. Today the control tower receives these inputs and then acts on them. So the next leg of our journey is where we would want to automate actions and automate recommendations. And this is where I want to go back to the analogy I started with, air traffic control: how many of you would want automation there? Thank you. And the reason, for those of you who are into the details, is that the algorithm I described, and most of the algorithms that we use, are unsupervised.
So for us to get to a place where we can leverage a supervised algorithm, one which tells us that this breach happened and that this metric value is related to it, that's maybe when we can look at automating actions, but we are a little far off from there. I think the industry is far off from there. So these are, overall, the learnings that we have had. The last point that I want to leave this audience with is this borrowed slide, the reference is there: on the left-hand side we have a projection of the passenger volume growth for the civil aviation industry in India, and that's the growth. I can't share Flipkart's numbers, but use your imagination, and if that happens, somebody detecting anomalies manually on dashboards is not going to work. Even alerts themselves will have to start getting more and more intelligent. So this is something that would be of value to anybody who is going to see growth soon. If you project growth in the near future, start working today, because these platforms take time to develop; they don't happen overnight, and as it is, no out-of-the-box solutions exist either. Start thinking about these; there are a lot of learnings that Flipkart has, and a lot of players out there: read their tech blogs and you should be able to learn. Going back to the aviation industry: you don't want things to go wrong, nobody does, and that's why you should leverage the various tools at your disposal as much as you can, and this is one such tool that we have been leveraging. I'll end my talk there, and I'm open to questions. Good talk. I have two questions. One: is this tool built in-house, or are you using anything off the shelf? Interesting. Just to qualify that statement: we leverage open source for the tech stack, we have Elasticsearch, we have Kibana, but a lot of this intelligence is in-house. Are you planning to open-source that? Hopefully; I can't give you a categorical answer today. Thank you.
My other question was more on the anomaly detection models: are those models supervised or unsupervised? And what was the strategy when you started building those models: did you just go all out in the wild, or did you start with something and then iterate? On the first question: most of them are unsupervised today, and I'll get into the reason for this. The data collection aspect, of what happened on the ground and whether it was an anomaly, is something we're still a little behind on. People are working with their own methods, or it's a false alert. So we're trying to get that data; once we have that labeled data, we'll move to supervised algorithms. The second part of your question, again, is very fair. We did not have data up front. So the first few models took a few days to go live, and our accuracies were low, but they have improved, and that's the natural cycle of any data science or machine learning solution, right? The one learning that we have is: make sure your business is aware of it. Get buy-in up front, saying there will be noise for a few days, please live with it, work with us, and at some point it will be good. Did I answer both your questions? Have we tried that out? No, we haven't. Yeah, I agree, we might explore it. We use Elastic as one of our terminal data stores. Good talk. So you mentioned you used isolation forest, right? What are the other kinds of models that you tried? And also, what is the feedback loop that you have, so that when these anomalies happen, you record them and try to build your labeled data for the future, to take care of the false positives and false negatives?
On the first question: the basics are where we started, just looking at moving averages, for example. We use a few other algorithms that I want to keep internal for now. On your second question about the feedback loop, that's what I just spoke about: each alert today is being labeled as either an alert that makes sense to the business or simply a false alert, and that's the feedback that goes in. We try to look at how many alerts were generated over a period of time and how many of those were actually acted on. One of the insights, and thank you for asking that question, is that your end metric cannot be the judge of your model, because the end metric is going to improve; otherwise you're not doing a good job, right? So make sure that your end metric is not the judge of your model; there is something else which will help you evaluate whether your model is working or not. So you identify those tools and then use them to keep improving your model. For example, I'll take an anecdotal example: we raised an alert, and the operations person came and told us, yes sir, we moved our tea break by one hour, and that's why throughput went down. Now, as a data scientist, what does that mean to me? Is my alert wrong? Is my alert right? What do you think? Yeah, my alert is right, and hence it becomes a true alert. But then, do I need to learn from this? No. So you should be very clear about what's a good positive and what's a negative, and that's where deep domain expertise comes in. There is no catch-all for this; you need embedded domain experts. Does that answer your question? I like the groundedness of the talks until now; hopefully I'll continue in the same flavor. I'll talk a little bit about our experience doing feature engineering and what has helped us do it, and a little bit about us. Unlike the previous companies, we
are very small: we're a company called Scribble Data, an ML engineering platform. Our users are data scientists, and we do a lot of training-dataset generation for them. I'm summarizing the lessons that we have learned doing feature engineering at large scale: large not at the scale of Moonfrog or Dream11, but large for most organizations. Our customer base is not the new-age technology companies but the old-school retail companies; everybody in the industry is building up their data teams. This is experience from deploying our tooling and learning the process of feature engineering. The key message is that productionization of ML is actually very expensive, but if you deconstruct it and look at it very closely, we know the elements that are driving these costs, and therefore we can go about it very systematically and address them. We have personally seen a 10x productivity improvement, and I would like to share the few things we have done to get to that level. While they set this up, just to give you a sense of how expensive it is: the way ML engineering teams work today, their work starts after the data engineering work ends. That means the data is delivered through Kafka into your lake, Hive, HBase, whatever it may be. From there it goes through a series of transformations, and you generate matrices that are input to the models, and then there is a separate step of actually serving the model. Just for my understanding, how many of you are data scientists here? Okay, data engineers? Okay, good mix, so we're talking to the right audience. And it happens in stages, because the input data volume itself runs into terabytes and terabytes, and you have to bring it to a point where you can feed your scikit-learn or your Spark ML or whatever it may be. And you need to do this reliably. The expectations of the data scientists have also been changing in the last two to three years. The
expectation today is that you are able to build this model, put it into production, and actually improve the KPIs, like Govind was talking about. The expectation is not that you will impress me with your wonderful new algorithm; the meter has to move forward, right? Now, this is a picture from both Uber's blog and a recent presentation from the Uber team at the Fifth Elephant in August. This system, just to give you a sense, took 20-plus people over a period of multiple years to build and get right, and what it does is essentially drive 5,000 models. What we have consistently seen over time is that the moment you put in the first model, the second model, the third model, there is a proliferation of models, a proliferation of use cases, a proliferation of pipelines, and so on. That is what starts building up a lot of this complexity. And why do you need so many pieces? There was a very interesting paper, which I totally love, from Google (sorry, the attribution is missing here), from the 2015 NeurIPS conference, where they talked about where their effort goes at Google. If you look at this picture, the ML code is this black box in the middle; everything else is about making sure that this box is actually being fed properly, that it is computing the right things, and that it keeps doing so consistently, day after day. The entire 20-person Uber team is essentially making sure that all of these pieces are running all the time. And if you deconstruct it further, what are all of these pieces achieving? The first thing is speed. The point is that every day we can imagine new use cases and new models, and we need to be able to put them into production; data scientists are under increasing pressure to productionize their models, because it translates into direct business impact. The second purpose that all of these boxes achieve is
correctness. Because of the statistical nature of the computation, and because the world is very complex with all of its nuances, it is very hard to get an ML model right continuously over a period of time, or for all corner cases — so a lot of this work goes towards making sure the system is working as per expectations and, if there are deviations, keeps correcting itself. The third one is evolution. The assumption we make is that data scientists actually know which model to build ahead of time, but there is a huge learning process, because you have to discover a lot of tacit knowledge, constraints, and assumptions about the world, about the process, about your own system deployment, and so on — so this machinery enables you to deploy many models, and many versions of the models, really fast. And finally, operations: driving a lot of data through this entire system. I'm going to focus on one particular problem in this; the rest of the pieces are for a future talk, and something we are thinking a lot about. Let me zoom into this particular step: feature extraction. This is essentially what it's doing: there are billions and billions of rows of detailed events coming in, and they have to be translated into a large matrix. The mapping could be as simple as an arithmetic operation, or it could be a full-fledged model itself deriving one of the columns — for example, a column could be the predicted spend or the affluence category, whatever you can infer; there could be a model driving each of these columns. And you have to do this at what is typically a 1000:1 compression between this point and this point. Because of the complexity of this step and the large volume of the data we are dealing with, it never happens in one go. For one of our customers, for example, this is a 12-hour process — even for not a very big data set, and we are not even talking about Google scale — and this has
to run every day, because essentially it is generating data for every customer: there are new customers every day, and they do new transactions, so this feature matrix has to be updated every single day. And it doesn't happen in one step; it happens in three steps, roughly mapping to the scale of the data. There is one computation that consumes the last one year's worth of data — probably terabytes — which runs in batch mode, maybe every week. Then there is something near-time: this could be the last 24 hours or the last seven days. And then there could be the last few minutes. The feature engineering — this mapping of the actual transactions to the matrix — happens at all three of these levels, and once the matrix is available you feed it either offline or online, depending on your modeling strategy, and the output of this model is what is actually served through the end application. What we have observed is that once you set up the first model, this quickly proliferates, because we are not limited by the ideas or by the use cases — the most friction, the real slow-down, was in putting together this massive process. This is traditionally done in-house: big firms like Flipkart have an entire ML engineering team doing all of these things. What has happened in the last one year, because of the growing importance of this, is that all of the companies who built homegrown systems have come out and talked a lot about what they have done; new offerings have come from the big data engineering names; and there are newer players entering who are focused on just this particular piece. I would strongly recommend looking at some of the conceptual work that has been done — there is a great presentation by Loin from the last three weeks or so that you may want to look at; Loin is an ex-Airflow committer.
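The multi-window feature computation described above can be sketched in pandas. This is a minimal illustration of the idea — raw events compressed into one feature row per customer over several lookback windows; all column names, windows, and numbers are assumptions, not the speaker's actual pipeline:

```python
# Sketch: raw transaction events -> per-customer feature matrix,
# aggregated over a batch window (1 year) and a near-time window (7 days).
import pandas as pd

events = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2],
    "amount":      [100.0, 50.0, 25.0, 300.0, 10.0],
    "ts": pd.to_datetime([
        "2019-01-01", "2019-11-20", "2019-12-09",
        "2019-06-15", "2019-12-09",
    ]),
})
now = pd.Timestamp("2019-12-10")

def window_features(df, days, prefix):
    """Aggregate raw events inside a lookback window into per-customer features."""
    recent = df[df["ts"] >= now - pd.Timedelta(days=days)]
    agg = recent.groupby("customer_id")["amount"].agg(["sum", "count"])
    return agg.rename(columns={"sum": f"{prefix}_spend", "count": f"{prefix}_txns"})

# One wide matrix: batch (1y) and near-time (7d) features side by side.
features = window_features(events, 365, "1y").join(
    window_features(events, 7, "7d"), how="left"
).fillna(0)
```

In a real pipeline each window would be computed on its own schedule and joined into the serving matrix; the shape of the computation is the same.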
So the question is: what is it that makes all of these things expensive? The first problem is that you need to have confidence. Today you cannot put a model into production just like that; the business will immediately ask, why should I believe this model, on what basis should I have confidence? And trust is an end-to-end property — it is not a property of only your model. If the model is being fed garbage, it does not matter that the model is great or that the statistics are great. Meeting this need is a very expensive operation: in the past, what I found is that when there are mismatches or errors in the computation, chasing them through the entire pipeline, or going all the way back to the raw data, burns up time like nobody's business. The second problem is that this is a continuously evolving system. It will never stabilize, it will never settle down — unless you have a very established, very focused use case. In most organizations there is a proliferation of use cases, new algorithms are coming in every day, and there are questions about the stability and evolution of the input data sources: a lot of the data we consume comes from legacy systems, and they keep changing schema and changing semantics continuously, and you have to cope with that. These are not the highly controlled, well-thought-out systems that were discussed in some of the earlier talks. The third problem is the cost of development: the time and effort it takes for data scientists to define the features they need is not to be underestimated, because the features run into the hundreds or thousands. The last one is operations. This thing has to run every single day, so invariably things break down, and you have to keep doing a bunch of activities to make sure that the current feature matrix is not only accurate but also available.
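The "garbage in" half of the trust problem can be caught before it reaches a model with aggressive checks on ingested batches. A minimal sketch — the specific checks, thresholds, and column names here are illustrative assumptions, not anyone's actual rules:

```python
# Sketch: validate a freshly ingested batch before feature computation.
import pandas as pd

def quality_report(df, required_cols, max_null_frac=0.05):
    """Return a list of human-readable problems found in an ingested batch."""
    problems = []
    for col in required_cols:
        if col not in df.columns:                 # schema drift from upstream
            problems.append(f"missing column: {col}")
            continue
        null_frac = df[col].isna().mean()
        if null_frac > max_null_frac:             # sudden spike in nulls
            problems.append(f"{col}: {null_frac:.0%} nulls")
    if df.duplicated().any():                     # duplicated rows
        problems.append("duplicate rows present")
    return problems

batch = pd.DataFrame({"customer_id": [1, 2, 2], "amount": [10.0, None, None]})
issues = quality_report(batch, required_cols=["customer_id", "amount", "ts"])
```

Run hourly on the lake, a report like this becomes the early-warning system the talk comes back to later: a non-empty `issues` list stops the pipeline instead of feeding the model.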
We had a natural experiment set up, in the sense that we did something similar for a tier-one food and beverage company in India about three or four years back — that is where my concern and my interest in this whole thing emerged — and we took a lot of the lessons, and the embarrassments, from that time and embedded them into the current approach. This is a comparison of what we gained and where we gained it. The key thing to remember is that given this is a continuous activity, an intensive activity, an evolutionary activity, you need to think through the whole process. There are three or four dimensions along which you need to think it through, and if you do, you are looking at very significant productivity improvements — it's a good use of your time. The first one is trust. If, every day or even infrequently, you are having questions about whether you believe the numbers coming out of this whole pipeline — which runs for three hours, or 12 hours, or several days — you should stop there and start focusing on this, because every investigation of yours is going to be very, very expensive. Building in visibility and auditability — knowing where any data set in your entire system has come from — will give you enormous benefit over a long period of time, in my opinion. So what does auditability require, what does trust require? It involves things like metadata standardization. What this means is that for any data set ever generated in this very long compute process, I should be able to know who generated it, when it was generated, why it was generated, and what its dependencies are on other bits of information in the system — and that should be readily available for investigation at any point in time. We do things like linking a data set directly to the git commit of the code that generated it — I actually know the commit that generated this data — and this has saved us a lot.
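The kind of per-dataset provenance record being described — who, when, why, upstream dependencies, and the exact code version — can be sketched in a few lines. The field names and the helper below are illustrative assumptions, not a real product's schema:

```python
# Sketch: an auditable metadata record attached to every generated dataset.
import hashlib
import json
from datetime import datetime, timezone

def make_metadata(name, generated_by, purpose, inputs, git_commit):
    """Build a provenance record for a generated dataset."""
    record = {
        "dataset": name,
        "generated_by": generated_by,   # who
        "generated_at": datetime.now(timezone.utc).isoformat(),  # when
        "purpose": purpose,             # why
        "inputs": inputs,               # dependencies on other datasets
        "git_commit": git_commit,       # exact version of the generating code
    }
    # A content hash makes the record itself easy to reference and tamper-evident.
    record["record_id"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()
    ).hexdigest()[:12]
    return record

meta = make_metadata(
    name="features/customer_spend/v3",
    generated_by="daily-feature-pipeline",
    purpose="training data for churn model",
    inputs=["raw/transactions/2019-12-09"],
    git_commit="0a1b2c3",  # in practice, read from `git rev-parse HEAD`
)
```

Stored next to each output file and indexed, records like this are what make the search interface and the "which commit produced this data?" question answerable.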
We have several test deployments and production deployments all over the place, and in order to know whether something was wrong, linking data with the code has helped us a lot; this is part of the metadata that we keep. The other thing is that data sets proliferate — lots of files keep getting generated all over the place — and you have to be able to surface that information. Having a simple search interface where you can type out any name, having visibility into all of these processes and the files they generate, is critical, and this has saved us a lot of time. Next, an early warning system: we routinely see data quality issues at ingestion, because these legacy systems are made by third parties and there are not many controls over the incoming data. So very aggressively watching the ingested data, and very extensively building quality checks, is very important — by the time you compute and feed the model, it is too late; bad data ideally should not even reach the model. The last one, like I was saying, is complexity: there is a proliferation of pipelines, data sets, models, versions, and runs with various parameters, and you have to have a way to cope with all of them and clearly identify each one. So you incorporate things like namespaces, versions, and linking of code and data, clearly isolating the outputs of various runs — and this has helped us a lot to cope with the volume of the output, not in terms of the number of bytes but in terms of the number of different files and sets that are generated. The other thing that has helped us a lot is to start looking for abstractions that give a natural interface to the data scientist. Remember I was telling you how these features run into the hundreds or thousands — imagine a data scientist writing all of that code; more code is more errors, in my mind. So the question is: what is the most compact way for them to express what they are looking for?
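The kind of compact, higher-level specification being asked for might look like a tiny declarative spec plus a small interpreter. This is a made-up illustration of the concept, not the speaker's actual DSL — every name and window here is an assumption:

```python
# Sketch: a declarative feature spec, interpreted into per-customer features.
import pandas as pd

SPEC = [
    # name          source column   aggregation   window (days)
    {"name": "spend_30d", "col": "amount", "agg": "sum",   "days": 30},
    {"name": "txns_30d",  "col": "amount", "agg": "count", "days": 30},
]

def compile_spec(spec, events, now):
    """Interpret the declarative spec into a per-customer feature frame."""
    out = {}
    for f in spec:
        recent = events[events["ts"] >= now - pd.Timedelta(days=f["days"])]
        out[f["name"]] = recent.groupby("customer_id")[f["col"]].agg(f["agg"])
    return pd.DataFrame(out).fillna(0)

events = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "amount": [5.0, 7.0, 9.0],
    "ts": pd.to_datetime(["2019-12-01", "2019-10-01", "2019-12-05"]),
})
features = compile_spec(SPEC, events, now=pd.Timestamp("2019-12-10"))
```

Because each feature is a few declarative lines rather than a page of pandas, the definitions are easy to review, reuse, and test mechanically — which is exactly the payoff argued for next.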
Can we create a higher-level language for them? We have introduced our own DSL for our customers, and we have now started working on an open-source version of it — a reusable, platform-independent feature specification mechanism, if you will. Then, a lot of reuse: pipelines are not that different, and you should be able to reuse a lot of the development that has been done before. Today the code of a typical data scientist looks like long pandas or Spark code, and imposing a structure on it to provide reusability is going to be critical. Part of what comes out of this reusability, and the framework associated with it, is the ability to test. What we have observed is that even data scientists sometimes lack discipline as far as testing their own ML code is concerned; having an easy way for them to build a lot of tests and make sure they are running all the time gives them confidence in their own code. So some of this structure is developing, and a lot of it looks like importing what we have learned in the last 30 years of software engineering — none of this should be a surprise — but I think the opportunity, at some level, is to build the right abstractions, with the starting point being the lessons learned from the past. I was talking about versions and metadata: metadata was not a big thing in the past, but now it will become critical, because we routinely handle millions and millions of files all over the place, and this is only going to increase in future. The last one: because of the resource-intensive nature of this computation, you have to keep a watch on what is happening. If there is a 12-hour process and it fails in hour 7, it is very hard to recover — you have lost a day, and all the pain associated with it. So one of the simplest things required is a default integration with a set of tools that are already available, which will give you visibility into the performance aspects of your system so that you can understand their
behavior as well as debug it. Some of the tools we like: Netdata, for example, which gives you memory and CPU consumption — for us, memory is a big deal, because pandas blows up very quickly the moment you cross about 100,000 records. Simple scheduling: I want, for example, my data quality monitor to run every hour on my data lake, or I need some background processes with supervisor, and so on. All of these tools are available, and an out-of-the-box integration with a bunch of them helps the data scientists understand the behavioral aspects too, so we can fix problems earlier. This has sped up our ability to deliver this feature engineering almost 10x: we are doing 10x the volume of data, with 10x the number of features, in only half the time we took in the past. Maybe that says something about my not having thought it through the previous time — that is one possibility — but I think these are all good ideas that have been demonstrated to have value in other places and at other times as well. So with that, let me leave you. This is a long topic; I expect it will be a thread that comes up again in future, and there are more elements of this pipeline we need to discuss as a community. I am a strong believer in end-to-end discipline at all levels of this data product, if you will, and I look forward to that conversation. Any questions? Okay, if there are no questions, we will continue the conversation. — So Venkat, you brought up an important point about metadata. What is your view of the industry as such, in terms of data standardization? Your platform seems to be looking into streamlining what comes in before it gets consumed, so maybe you can share your thoughts on what you have seen in industry — what is the overall maturity from a metadata perspective? — We looked around at metadata and asked what standards exist out there; today the standards are very small, I
mean very limited. There is Data Package, which is done by the data journalism community, and there is DVC, and all of these have different flavors. So there are two or three things that I see. First, I see a need for open-source tooling for metadata — something standardized and used by everybody, whatever their computation model: R, Julia, Spark, whatever it may be. The second thing is that we have to agree upon interoperability; that means the standard has to be shared across people, so part of the conversation that needs to happen is what goes into the metadata — what would cover the bulk of the discovery and auditing needs of its users. The third thing, in my opinion, is that none of this can save you if the data scientists and the data engineers don't believe in the need for trust in your offering — because why do you need auditing? Because you want to believe the output you are generating. So the third one, the bigger thing in my mind, is an appreciation by the community as a whole of the need to build trust in the end-to-end service. On the question: we are open-sourcing a tool, partly — based on our experience — to drive the conversation forward; we would like to see metadata standards emerge. — Thanks, Venkat. Up on stage next is Thuong — how do I pronounce your last name? — Thuong will be talking about his work with Trusting Social on credit scoring algorithms and the approaches they have tried. — Thank you, Zainab, for the introduction, and hello everyone. My name is Thuong, and I work as a research scientist at Trusting Social. Before going into my talk, I want to give a quick introduction to Trusting Social and what we are working on. Trusting Social is a company that delivers data science technology for financial inclusion: we do credit scoring using alternative sources of data, so that we can assess creditworthiness for the whole population. We have offices around Asian countries — the headquarters in Singapore, an office in Ho Chi Minh City, a few in
India — and the research team is in Ho Chi Minh City and Melbourne; I am based in Melbourne. Here is the outline of my talk today. I will divide it into two parts. The first part is the motivation: the credit scoring problem, credit scoring using alternative sources of data, and the economics of our solution at scale. In the second part I will focus more on the technical side, with a brief introduction to prediction modeling, the challenges we have faced so far in delivering machine learning and data science for this problem, how we worked through them, and what we are working on at the moment. First — because I prepared the talk for a general audience — a quick introduction to credit scoring. Some of you may already know this topic, but basically, in finance, lending is the main business of the institutions, and before a bank or a financial company can provide a loan, they need a way to estimate the level of risk associated with the consumer. By definition, credit scoring is the process of evaluating the risk of a consumer defaulting on a financial obligation. To give a bit of context: at the moment, FICO is the biggest credit score provider in finance, and if you work in banking and finance you must have heard of FICO. As a statistic: every year about 10 billion FICO scores are sold — 27 million credit scores sold by FICO every day — and in the context of the US, 90% of lending decisions are made using FICO scores. Here is how a credit risk scoring model works. At a prediction or observation time — normally the time a person applies for the loan — we want to predict whether they will default in the future. What we need is the history of the person: in FICO's solution, the credit attributes of the person, which I will show in detail later. Using those attributes, the model predicts whether, in the future, that particular consumer
is going to default or not. So what do they need? Basically, the financial behavior of the consumer. Some examples are payment history, amounts owed, and the length of credit history; other things like new credit and the types of credit the consumer uses; and other records that can be obtained publicly. Given that information, can you give me a problem FICO may face — for example, in the context of India, what is the problem they are facing? Lack of data — but in particular, what kind of data do they lack? Any other answer? In the US, most people have a bank history: they have a bank account and a long financial history, even if some of them have never taken a loan. But in the context of India, for example, most of the population has no record with a bank and has never taken a loan — so how can you assess the creditworthiness of consumers who have no credit history at all? That is the reason we are here. This is the issue with FICO-style credit scoring: it cannot assess the creditworthiness of consumers who have no banking or credit history — and there are about 1.5 billion adults without any credit history. So the FICO score keeps those consumers out of the financial loop: if they have no credit history, they have no established creditworthiness, and the bank will not give them a loan, or will only give one at a high interest rate. Basically, they are excluded from financial access. So what is the solution? If we face that problem, can you think of a solution — what can we do? Anyone have an answer? We have to use alternative sources of data, because if you insist on credit history to establish creditworthiness, then you cannot solve the problem in developing countries like India, Vietnam, or Indonesia. That's why we have to use alternative sources of
data. But what are they — what sources can we use? Can you give an example? Some companies use social media data, but that direction is basically very hard now, because Facebook closed its APIs — I think in 2014 — so we can't pull data from Facebook anymore, especially after the Cambridge Analytica scandal. At Trusting Social we partner with telcos to use telco data, combined with other sources of data as well, to assess creditworthiness. The main principle is this: traditionally, we use the financial behavior of consumers to assess their creditworthiness; we go the other way around and use their overall behavior. The reason is that behavioral data is much richer than financial behavior: with financial behavior you only have the credit history, how they paid their loans, and things like that, but behavioral data captures much more about the person — we can continue on this topic offline. The effect of using these alternative sources of data: we assess creditworthiness based on behavioral data, and we have already shown that our score performs better than scores built on financial behavior; we can cover the whole population, and especially the unbanked population — that's why the slogan of the company is that we provide financial inclusion for all — and we can also scale the business much further than other solutions. To give an example: a few slides back I showed that the FICO score can only assess the creditworthiness of consumers who already have a credit history; because we can give a credit score for the whole population, we can provide better, fine-grained, customized products to the consumer. For example, if we can assess creditworthiness with good accuracy, we can confidently provide a loan to the consumer at a lower interest rate. In the second part of
my talk, I would like to focus a bit more on the technical side — as a research scientist in machine learning, my talk will favor the machine learning perspective — and on the challenges we have faced so far in the journey. Because the talk was prepared for a general audience, I want to give a very quick overview of prediction modeling. In prediction modeling, we want to predict an outcome for some example; in our problem, the goal is to predict whether a consumer will default or not, and with what level of confidence. Here is the whole process of training a machine learning prediction model. What we need is a set of training examples, and of course we have to extract features from that set. We feed that into a supervised model like logistic regression or a random forest. Those models require some parameters, and at the start we initialize values for those parameters; the model then gives predictions for the examples in the training set, and based on the model's predictions and the labels — the ground truth that we already have — the learner improves the parameters to provide better predictions. We iterate this process until we are satisfied with the performance of the model. After we train the model, we test it on a test set that was not seen during the training phase: we use the parameters we trained to make predictions, and then compare those predictions with the labels to evaluate the performance of the model. So, going through this process, what do we need to make prediction modeling work? First, the data, of course — data with labels. Then, from the data we need to extract the features; we need the labels; we need the models; and of course we need computational facilities to run this process.
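The train-then-evaluate loop just described can be sketched with scikit-learn in a few lines. The synthetic data is purely illustrative — a stand-in for default/no-default labels, not real credit data:

```python
# Sketch: fit on a training set, evaluate on a held-out test set never seen
# during training.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                   # synthetic feature matrix
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in default labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression().fit(X_tr, y_tr)    # learner adjusts the parameters
acc = model.score(X_te, y_te)                   # evaluate on unseen examples
```

The held-out test set is the key point: the score on `X_te` estimates how the model behaves on consumers it has never seen, which is all that matters in production.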
But each of these elements comes with its own challenges, and I will go through a few that we have faced so far. The first is how to obtain labeled data. The second — especially in the context of credit scoring — is the class imbalance problem. The third challenge is complex and noisy data; the fourth is the curse of dimensionality; the fifth is the huge volume of data; the sixth is the reliability of the model; and the last one is concept drift. I will go quickly through each of these. The first challenge is obtaining the labeled data. This is the most important part, because it is necessary for building any machine learning model — certainly in the context of prediction modeling — and the golden rule here is garbage in, garbage out. How can we obtain the data? We have to build partnerships with different stakeholders, including the banks, the telcos, and other institutions. It is not an easy task, and we had to spend a few years to actually establish these partnerships. The second challenge is imbalanced data. To give you an example: if we have a set of 10,000 loans, maybe only 500 to 600 of those loans default, so the rate of default is very small compared to the total number of labels we have. There is no model that can deal with this imbalance problem easily; there is a fundamental limit on the accuracy we can achieve with these kinds of labels, and we have to accept that. The lesson learned from this challenge is that we have to be mindful not only of the data, the features, and the labels, but also of the imbalance of the data: if we do not use a suitable performance metric, our model is going to be screwed up. The second lesson is that we should not use machine learning models as black boxes; we have to customize them to suit our needs.
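The point about choosing a suitable metric can be made concrete. With roughly 5% defaults, a degenerate model that predicts "repays" for everyone looks excellent on accuracy while being useless — the numbers below are illustrative, mirroring the 10,000-loan example above at smaller scale:

```python
# Sketch: why accuracy misleads on imbalanced default labels.
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

y_true = np.array([1] * 50 + [0] * 950)   # 5% of loans default (label 1)
y_pred = np.zeros(1000, dtype=int)        # degenerate "everyone repays" model

acc = accuracy_score(y_true, y_pred)      # high, despite learning nothing
recall = recall_score(y_true, y_pred)     # fraction of defaulters caught: zero
```

This is why metrics like recall, precision, or AUC on the rare class — not raw accuracy — have to drive model selection in credit scoring.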
The third challenge that we faced was complex and noisy data. The data we have is quite bad, in the sense that there are a lot of missing values and a lot of duplicated and correlated columns across sources of data. To give an example: from the telcos we have the call and SMS transactions, the top-up transactions, and value-added services, and most of these data sources are noisy and unstructured. I want to emphasize the golden rule again: garbage in, garbage out. Even if you have a very good machine learning model — XGBoost, for example, a very powerful one — if you do not process your data well, if you don't have a clean feature set and a clean label set, your model is not going to work. How did we deal with it? First, we had to rely on the data engineering team to do the data cleansing — a very heavy job for them — and then to do the feature engineering: they need to understand the characteristics of the data to come up with the set of features that can be extracted from it. But this process comes with its own problem: we can generate up to maybe 10,000 features, and with that many features, the problem we run into is the curse of dimensionality. We have huge dimensionality and a high level of sparsity — some features may only cover 10 or 15% of the population — yet we cannot simply drop those features, because we are not sure whether they contribute to the models or not. Also, because we extract them using statistical functions, many of the features are strongly correlated with each other. With that kind of data, it is very difficult to train the machine learning models: they easily overfit our dataset, they produce unstable models, and they are difficult to scale. For example, if you want to predict the risk score of 200 million consumers, how long is your machine learning model going to run — for the prediction phase alone — if you use 10,000 features?
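One of the simplest mitigations for the redundancy just described is to drop features that are near-duplicates of ones already kept. This is a sketch of the idea only — the threshold and the synthetic columns are assumptions, and a real pipeline would combine it with proper feature selection:

```python
# Sketch: prune features that are almost perfectly correlated with kept ones.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
base = rng.normal(size=200)
df = pd.DataFrame({
    "f1": base,
    "f2": base * 2.0 + 0.01 * rng.normal(size=200),  # near-duplicate of f1
    "f3": rng.normal(size=200),                      # independent signal
})

corr = df.corr().abs()
keep = []
for col in df.columns:
    # Keep a feature only if it is not redundant with anything already kept.
    if all(corr.loc[col, k] < 0.95 for k in keep):
        keep.append(col)
```

Even this greedy pass shrinks a correlated feature set substantially, which directly attacks the training instability and prediction cost mentioned above.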
So what should we do about these problems? First, in terms of infrastructure, we need the best system we can have — from a big data engineering perspective, we should have the best solution available — but that alone does not solve the problem, so we also need machine learning solutions: feature selection, with the many techniques available for it, or automatic and even fully automatic feature engineering. The fifth challenge we faced is scalability. To give you a sense of the kind of data we deal with: one of the telcos we cooperate with has 50 million subscribers and produces a few terabytes of data a month, millions of records; moving to bigger markets like Indonesia or India, a telco can have up to maybe 250 million subscribers, so you can imagine the size of the data we have to deal with. To be honest, I don't work in that part of the field, so if you are interested in this topic you can come to our booth and talk to our big data engineers. The sixth challenge we faced is reliability. A typical approach to evaluating machine learning models is k-fold cross-validation: from the training set, we split the data into training and validation sets, and we do that for, say, 10 different subsets of the data; we can even hold out a test set that we do not touch at all during the whole training process, including the fine-tuning of the parameters. But is that good enough in our situation? The thing is, we need to convince not only our team internally, but also our partners — the banks, the financial institutions — that our credit score is reliable, so we have to run an intensive evaluation process together with our partners.
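The k-fold procedure described above can be sketched with scikit-learn: the data is split into k folds, the model is trained on k-1 of them and validated on the held-out fold, rotating so every example is validated exactly once. The data here is synthetic and purely illustrative:

```python
# Sketch: 10-fold cross-validation as a more honest estimate than one split.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))       # synthetic features
y = (X[:, 0] > 0).astype(int)       # synthetic labels

# One score per fold; the mean averages out the luck of a single split.
scores = cross_val_score(LogisticRegression(), X, y, cv=10)
mean_score = scores.mean()
```

The spread of `scores` across folds is itself informative: a model whose fold scores vary wildly is exactly the kind of unstable model a bank partner will not trust.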
The seventh challenge we faced was concept drift. For many reasons, the world is dynamic, and behavior — especially in telco data, the way we make calls or send messages — differs depending on many factors like culture, background, country, or demographic group; we faced this very seriously when we moved to India. Behavior also changes over time because of many factors: seasonality, social or commercial events — you may have a festival coming up — or even changes in personal circumstances. How do we deal with this concept drift? We analyze the bias and variance in the data and in its distribution, and our experience is that if the concept drift is high, we use a low-variance model, and the other way around: if the concept drift is low, we can use a low-bias model. To summarize my message, there are a few lessons we have learned throughout this whole process. Real data is always complex and noisy, and the golden rule — even though I am from a machine learning background — is always garbage in, garbage out: even if your machine learning model is very good, if your data, your features, and your labels are not clean enough, you won't get good results. That's why data cleansing and feature engineering have been crucial in our journey. The third lesson is that we should not use machine learning models as black boxes. I work as a research scientist, but I also sometimes take part in recruitment, and I can see that a lot of people now put in their CVs that they are data scientists, with a lot of keywords in the resume saying they can do this and that — but we can figure out, after maybe a half-hour talk, that all they do is import from scikit-learn, fit the data, and train the models. If we ask any further questions, they can't answer — and if they can't answer those questions, how can they fine-tune the models, how can they modify them if the model is not working well with that
data? So the rule is: we need someone who really knows what is under the hood. Even though we still use model libraries to save time, when we need to, we can still jump in and modify the models — that is a lesson we learned heavily through our journey. On evaluation, another lesson is that evaluation should not be limited to a simple train-test split; we need to keep testing dynamically with the new data we get, because real data is highly dynamic. The philosophy we implement at Trusting Social is that we always hire the best person for the job, who knows what they need to do. Given the challenges we faced, why does Trusting Social work, and how could we get through this journey? We have different teams, and I am proud of the team members we have — I can say we have the best person for each particular job. We have a business team that comes up with the business problems and the partnerships with our partners; the data analytics team has a very deep understanding of the data and comes up with the sets of features that work well with the models; the machine learning team can control and manage advanced machine learning models, and we are working in cutting-edge areas of the machine learning community; the big data engineering team is very good at data governance and scalability; and finally, the software engineering team delivers high-quality systems and products. To give you a bit of context on what we are working on now, I will go through a few directions we are pursuing at the moment in machine learning and deep learning. We work on graph analytics, because one of the important sources in our data is the graph — the social graph, the contact graph between customers. We also work on representation learning and a few other topics in deep learning, like attention and transformer networks, and deep generative
We also work on transfer learning, that is, how to transfer a model from one domain to another, and on computer vision and NLP. In terms of products and the problems we solve, we work on a few problems in finance, like risk assessment, fraud detection, and face recognition and identification for know-your-customer purposes. Finally, we are also working on chatbots and robo-advisors. That concludes my talk today; if you are interested, you can talk to me offline or send me an email. And now it is time for questions.

[Audience] I have a couple of questions, actually. You talked about behavioral data: what are examples of behavioral data, and how do you go about collecting it? And let me finish with the FICO scores we were talking about; it is a generic question. How do these FICO scores tie back to economic indicators? Do you also track the distribution of FICO scores over, say, the population of a country, against something like GDP per capita? There is a social responsibility question here as well: if I have a low FICO score, what does that mean for me? It is just a score, so if it does not tie back to an economic indicator, or if you are not tracking that, I want to know how you track that kind of thing as well.

[Speaker] The first question you asked was about behavioral data, and we had that kind of discussion inside our team throughout the process as well. The thing is, if you look at financial problems, if we limit the data to financial history alone, we may not cover other parts of the consumers' data. Financial data is somehow more static than the actual situation of the person. For example, if you have a person's history for the last two years and he has not applied for any loan in the last year, then you miss the gap
between his last loan and now. With behavioral data, though, we can capture the dynamic behavior of the consumer more accurately. In our particular solution, we partner with telcos to assess creditworthiness using telco data. Does that answer your first question?

The second question is about the FICO score. FICO assesses creditworthiness mainly using financial behavior, so if a person has never applied for a loan and has no financial history, FICO can still come up with a score, but it cannot give new customers a very high score. If you come to a bank and apply for the first time, and you do not have much information to provide, the loan the bank offers you will likely carry a high interest rate to cover the risk it faces. With our data and our credit score, we also run back-testing with the banks to verify the accuracy of our models. So even if you have no credit history, we can still give you a better score if your behavior is similar to that of people who had good credit scores in the past. That is how it works. Does that satisfy your questions?

[Audience] Can you hear me? I think you will run into privacy laws here, because you have 50 million people's private transactions with whatever services they are using. I understand the technical mechanisms, and I assume you exercise a lot of judgment and are stewarding the data properly, but are you running the risk of triggering some privacy laws?

[Speaker] Every time we enter a new market, we have to take the privacy laws very seriously. In some countries the law was not very clear, so we had to negotiate not only with the telcos but also with the government agencies about privacy law, to make sure we comply. And the reality is that the telco is going to record your data anyway, whether you know it or not. Yes, but the
way we do it is that we never take the data out of their data center. The rule is: however the telco works, we work in the same way. That is our philosophy for handling data of this size.

[Audience] What is your current data engineering stack and infrastructure?

[Speaker] To be honest, I do not work in that area, so if you want to dig into it, you can talk to our very talented data engineers at our booth. I am focused on the machine learning part, so they handle those kinds of issues for us, and to be honest, I have a great deal of trust in them on that front as well.

[Host] Are there any other questions? If there are no more questions, we can break for lunch. It is 1:21 right now; we will take a 45-minute break and come back by 2:05.
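On the point from the Q&A about scoring customers with no credit history by similarity to past customers: a minimal nearest-neighbour sketch of that idea is below. This is purely illustrative, with made-up behavioral features and outcomes; it is not Trusted Social's model, whose details were not disclosed in the talk.

```python
import numpy as np

def knn_credit_score(behavior, past_behavior, past_outcomes, k=5):
    """Score a no-history customer as the repayment rate of the k most
    similar past customers (Euclidean distance over behavioral features).
    Toy sketch only; real scorecards are far more involved."""
    d = np.linalg.norm(past_behavior - behavior, axis=1)
    nearest = np.argsort(d)[:k]
    return float(past_outcomes[nearest].mean())   # fraction who repaid

# Toy data: two well-separated behavioral clusters.
rng = np.random.default_rng(2)
good = rng.normal(0.8, 0.05, (50, 3))   # customers who repaid
bad = rng.normal(0.2, 0.05, (50, 3))    # customers who defaulted
past = np.vstack([good, bad])
outcomes = np.array([1] * 50 + [0] * 50)
print(knn_credit_score(np.full(3, 0.75), past, outcomes))  # near the repaid cluster
```

A new applicant whose behavioral features sit near the cluster of past customers who repaid gets a high score despite having no credit history, which is the mechanism the speaker described.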