Thank you everyone. Good afternoon. I'm a little nervous right now because I looked at all the various talks at this conference and mine is probably going to have the, you know, embarrassing misfortune of being the least technical talk, so I apologize for that. But the content of this talk is based on all the various learnings that we've gone through at Flipkart, and it's my very sincere hope that sharing this allows you to rethink or validate some of the data strategy in your various organizations. I strongly believe that some of the patterns I'll be talking about are things that organizations will have to disproportionately focus on. I've got, I think, six to eight patterns, and every pattern in there is a subject which any serious data-minded organization should focus on. We ourselves are focusing on them, and I'll be annotating these patterns with examples from Flipkart throughout the course of this talk. Another quick note: I'm not a PPT guy. I'm basically an engineer. I'm horrible at PPT. So my slides are going to be very minimal. Don't focus as much on the slides; those are there to act as my memory pivots. Focus a lot on what I'm actually going to be talking about. All right. So the first pattern that I actually want to talk about is data being a first-class citizen in SDLC. This is going to require a little more explanation, so I'm going to spend some time on this particular pattern. How many of you, I'm presuming that most of you are engineers? Anybody who's not an engineer here? All right. Most of you are engineers then. How many of you, at some point in your career, had to get data by scraping through logs? All right. For those of you who have not yet done that, I'm presuming that you've not coded a lot or not built a lot of systems. I'm sorry.
I'm just kidding. Most of us go through that experience, right? Any thoughts on why we end up doing that? Don't know the code. All right. I'm specifically asking about cases where you had to get access to data; it's not troubleshooting that I'm actually talking about. I'll give you an example from Flipkart. We wanted to know: what are people actually searching for? How do you go about doing that? Scrape through all the various web logs, standard activity logs, figure out the, excuse me, [inaudible audience response]. What are the pages that the user is visiting? Absolutely. So you'll figure out which URLs actually look like the search pages, and you will extract that term from them. That's what the user is actually searching for. That's what we started out with. And one fine day, due to one feature request, we decided that all of our browse pages would actually be powered through search. All right? And all of a sudden, the structure of those URLs changed, and our data was completely wrong. I'll give you an even worse example. We started going big time on this mobile app thing, right? And we started prefetching all the data, because that creates a much better experience. And of our search and browse data, search data was actually not particularly wrong. But browse data was very wrong, because we were doing a lot of prefetching for the most browsed parts of the app. All right? So what's the issue here? Anybody? What's happening here is that the guy who's consuming the data is very disconnected from the guy who's actually producing the data, not just in terms of roles, but very often in terms of chronological order. We think about data very retrospectively. We think about data when the need arises. It's only then that you start groping for that particular data. If you're lucky, you find it in logs.
If you're unlucky, you find partial data in logs and you don't even know that it's actually partial data. Okay? And if you're particularly lucky, you know that you're not logging the data, so at least you don't have incorrect information. All right? We at Flipkart realized that things have got to be better than that. One of the beliefs that we hold very dear is that for companies, particularly for internet companies, the only true intellectual property that we have is our data. It's not scale. It's not infrastructure. Those are very critical things, but those are not your truest IP. Your truest IP is the data that you have. Does that thought process reflect when we are building our systems? How many of you think about scale when you're designing your systems? All of you, I'm presuming. Otherwise, your managers would notice you're not raising your hand. How many of you think about quality when you are building your systems? Right? You think about unit tests, you think about all of those things, right? How many of you think about what data points my system should generate when you are actually building the system? All right. I see a few hands. Great. You guys are on the cutting edge of technology. So what happens is we tend to think about data retrospectively. Why am I saying this? Because we tend to extract data a lot lower in the stack, right? There's a database. We've got CDC, change data capture, and things like that. And you're continuously pulling data from your databases or from your log files and generating a warehouse, a data lake, and all of that kind of stuff, right? You are fundamentally thinking about data retrospectively. You're not thinking about data further up in the stack, where the workflows are happening. All right?
Imagine if every time a search happened, the search system fired off an event saying that user such-and-such did a search at this particular time and this many results were found. Okay? I'll give you another example. In a warehouse where we keep all of our various products, a particular good being moved from this shelf to that shelf is an activity that happened. Something mutated in our business process. It needs to be captured as such. Recognizing all of this a lot more upfront in your SDLC, thinking about it when designing your product roadmap, when designing your systems, and when doing production management: that's what I mean by data as a first-class citizen in SDLC. You would be thinking about data as a first-class citizen if you have data alerts saying that the amount of data being generated by my system has all of a sudden dropped by 10%. Okay? You would be thinking about data as a first-class citizen if you've got unit tests around data. All right? So data being a first-class citizen in the thought process of building systems, this is going to be an increasingly important thing for organizations to do. Organizations that do this are the organizations that will be much better prepared for success than the others. The second pattern: scale of data. All of us know that data is exploding, right? We recognize that and we try to build for that scale. But as engineers, as good engineers actually, there's a trade-off that I'm going to talk about. We fundamentally try to strike the right balance between time to market and the amount of investment that you make in making your product more mature, correct? Otherwise, you're over-engineering, correct? You guys, I'm hoping for good reasons, get concerned about over-engineering, right? The trouble is that the parameters of that balancing act are fundamentally different for data.
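(Editor's sketch: the first-class-citizen idea above, a business event fired at the moment a search happens, plus an alert on a sudden drop in event volume, could look roughly like the following. The event fields, the list-based sink, and the 10% threshold are illustrative assumptions, not Flipkart's actual schema or pipeline.)

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class SearchEvent:
    """A business event emitted at the moment a search happens."""
    user_id: str
    query: str
    result_count: int
    timestamp: float

def emit(event: SearchEvent, sink: list) -> None:
    # In production the sink would be a message bus; a list stands in here.
    sink.append(json.dumps(asdict(event)))

def volume_alert(current_count: int, baseline_count: int, threshold: float = 0.10) -> bool:
    """Alert when event volume drops by more than `threshold` versus a baseline."""
    if baseline_count == 0:
        return False
    drop = (baseline_count - current_count) / baseline_count
    return drop > threshold

# Usage: emit one event, then check the volume alert against a baseline.
sink = []
emit(SearchEvent("u123", "running shoes", 42, time.time()), sink)
assert volume_alert(current_count=850, baseline_count=1000)      # 15% drop: alert fires
assert not volume_alert(current_count=950, baseline_count=1000)  # 5% drop: within tolerance
```

The point is that the event is designed when the system is designed, not reverse-engineered from URL logs later.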
If your business is growing like this, your data is exploding like that. And engineers typically, when we are building systems day in, day out, are unable to differentiate between these two rates of growth. We apply the same parameters of incremental hacking, refactoring every three months and things like that, in both cases. Big mistake. Very, very quickly, you will hit a wall that is very difficult to overcome. At Flipkart, at one point in time, we had every system dealing with its data in its own different way, and almost all of them hit a wall very quickly, in about six months. That is all that it took. And of course, Flipkart's growth is very quick itself, but that also meant that the scale of data growth was that much higher. All right. Data systems is one area where I strongly believe that we must over-index. We must think about scale a lot more than what the people around you are actually talking about. If your manager tells you that we will see data growth by 10 times, imagine that that number is actually 100 times. Okay? And you guys will thank me for that, I can guarantee you. All right. Over-engineer data systems. I mean that with a pinch of salt, because over-engineering is not exactly a good word. But think about data scale in a very disproportionate way as compared to your regular system scaling. So this is point number two, but very often missed. Pattern number three: metadata, discovery, and relationship inference. I think it was Yagnik in the first talk, right? He was talking about metadata. Beautiful concept. What we realized is, we had all of these various systems at Flipkart pumping data into our central data platform. And because of data being a first-class citizen, not only were we capturing very rich data, but the number of data types, the number of schemas, had exploded.
It had run into the thousands. Okay. One fine day, a particular data scientist, for a model that she was building, wanted to get access to all the various page views that had lasted more than five seconds and had not yielded any conversion. Among these thousands of schemas, how does one figure out what data stream to target? That's where we realized that as you start thinking about data a lot more richly, a lot more deeply, there's a huge proliferation of your data definitions. And thus, you need to invest in some sort of a metadata dictionary, where you are capturing: in the search space, what are the various kinds of data points that actually get generated? Around checkout, what are the various kinds of data points that actually get generated? Purely from an organization standpoint. You start off with that, and you very soon start realizing that there's so much more that you can do. You can figure out dependencies, right? That if my number of searches drops, my number of checkout flows drops as well. So you can infer those causal relationships. That's what I mean by relationship inference. And you can do automated RCAs. If you know what data stream is leading into what, these are some beautiful things that you can actually do on top of that. It's a subject that I can go a lot deeper into, but we are still in the process of building out some of these things. So maybe we'll share out some of this work at the next Fifth Elephant. But this is a very important issue. Start acting on it a lot earlier in the game. If you do it when you already have a very large amount of proliferated data, you will find it a much harder thing to solve. Pattern number four: primary users will be machines, not humans.
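(Editor's sketch: before moving on, the metadata dictionary described under pattern three might start as small as a registry keyed by domain, owner, and tags, so that a data scientist can discover streams instead of grepping thousands of schema definitions. The schema names, domains, and tag vocabulary below are invented for illustration.)

```python
from dataclasses import dataclass, field

@dataclass
class SchemaMeta:
    """Static metadata about one data stream's schema."""
    name: str
    domain: str              # which part of the product ecosystem, e.g. "search", "checkout"
    owner: str               # who to alert when this schema changes
    tags: set = field(default_factory=set)

class MetadataRegistry:
    def __init__(self):
        self._schemas = {}

    def register(self, meta: SchemaMeta) -> None:
        self._schemas[meta.name] = meta

    def find(self, domain=None, tag=None):
        """Discover schemas by domain and/or tag."""
        out = []
        for m in self._schemas.values():
            if domain is not None and m.domain != domain:
                continue
            if tag is not None and tag not in m.tags:
                continue
            out.append(m.name)
        return sorted(out)

# Usage: register a handful of streams, then discover by domain or tag.
registry = MetadataRegistry()
registry.register(SchemaMeta("page_view", "search", "alice", {"pageview", "engagement"}))
registry.register(SchemaMeta("search_query", "search", "bob", {"query"}))
registry.register(SchemaMeta("order_placed", "checkout", "carol", {"conversion"}))

assert registry.find(domain="search") == ["page_view", "search_query"]
assert registry.find(tag="conversion") == ["order_placed"]
```

Dynamic metadata, such as inferred causal relationships between streams, would be written into the same registry by an inference job rather than by hand.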
Quite often we tend to start thinking about data in terms of this report, that report. Those things will continue, but the power of data will come from systemic consumption of data, when it's the machines that are consuming the data, because that's where a lot of the power of data will emerge. Very deep contextual decision making: when the machines are doing that, that's when you are actually making true use of data. This pattern has large implications for your system design. You think about API contracts a lot more often. You think in terms of latencies and design for bounded latencies. It has huge implications for the architecture of your systems, which is one reason why I thought it would be very useful to share this out. The data systems of tomorrow are going to be such that 90% of their work goes towards feeding systems rather than people. Okay, so while innovations on reports etc. will continue, and in fact we have a few talks on that as well, so innovations on human consumption, visualization etc. will continue, this systemic side will completely explode in terms of impact. Pattern number five. This is a natural derivation from pattern number four. Once the machines start consuming, we realized that the more data we had at Flipkart and the more systemic consumption we were allowing, that many more real-time needs started emerging, in a very, very strong way. People started thinking: if this customer is sitting at the airport and opens up the Flipkart app, I want to recommend at that moment, here is a quick read for your flight. People started thinking about things like that. We started thinking about, when in the future we are talking about three- to four-hour delivery times, how do we use real-time traffic information in order to determine whether I can make this promise to the customer or not.
In a city like Bangalore, you know that it's actually very difficult, right? All of this requires real-time processing. So start over-indexing on real-time processing today. This is another pattern that's going to be very, very big. Data systems that are oriented towards real-time processing will have an edge over the systems that aren't. Pattern number six. This is very straightforward. This is about democratizing machine learning: everybody having access to machine learning constructs. It's like in languages, where we went from assembly language to better, easier constructs to deal with. It's the same notion. In the interest of having some sort of a Q&A, I'm going to skip a deeper focus on this. Lastly, deep learning will move out, actually has already started moving out, of the research space into a lot more day-to-day use. In fact, we've got a talk tomorrow morning on how we've applied deep learning in the NLP space to solve some real-world issues at Flipkart. I strongly encourage attending that as well. Deep learning is no longer a research subject alone. It's not black magic, black art alone. It has real-world implications, and if you care about your data strategy, you need to start thinking about your investments in this area very deeply as well. So this is all that I really had. I don't know to what degree I was able to express this in a good way, but I'm very happy to answer any questions on this. I have a system-level question, but about a very different part of your system, speaking as a customer. Do you have any plans to educate delivery people in how street numbers work? The trouble is that street numbers do not work in India. I have a street number and it is no use telling it to a Flipkart delivery guy.
Yeah, see, most of us, and this has been my personal experience, most of us in India do not navigate using street numbers. We navigate using landmarks. That's because, in India, most of the numbering schemes are not topological. They are not defined with a certain topology in mind. Right, it's "opposite the barber shop" or something. Yeah, yeah, something like that, right? But if you don't have a good language in common, giving that whole sequence of instructions can get very difficult. [Inaudible audience comment.] Yeah, see, there are ways that we are trying to approach the problem. One of the things is that geolocation capabilities being available in smartphones are a big help in terms of figuring that out. In fact, towards the same thing, one of the realizations that we've had: we were trying to solve the last-mile problem in an automated way, and we started off with address classification as a mechanism of approaching that. We realized that address classification using the numbering schemes or those phrases in the address performed a lot worse than identifying using landmarks and pathways. That was a lot, lot easier. So I can say that it was almost as if the data proved that you've got to figure things out a lot more in a geographical fashion than in a numbering-scheme fashion when thinking about efficiencies there. Cool. Any questions, folks? Yeah, please. A point you made about metadata and relationship inference. Yeah. Metadata is a static thing compared to relationship inference. And whenever you infer relationships in one round, only in the next round, or maybe periodically, can you change the metadata. You can't do both at the same time, because the metadata is already there before you do the processing. Not entirely.
See, metadata is any description of the data itself. It's the meta part of the data, right? So some of the metadata would be static. Some of the metadata is going to be dynamic. Static metadata, I'll give you a few examples. Which part of the product ecosystem does this schema belong to? Another is, who was the author of the schema? Who needs to be alerted when this particular schema is changed? These are static metadata, which you actually put in. A dynamic metadata example would be: it's been inferred with a 90% probability that the data points against this schema have a causal effect on this other schema. This might be inferred metadata. So you'll have both kinds of metadata, static as well as dynamic. It's not like it's getting overwritten. But the inference is always dynamic over a period of time. The inference is dynamic, yes. So to become metadata, it requires some kind of proof or some stability in that inference. Otherwise, we can't go on changing metadata every time. Absolutely. What's the reason we need to categorize? Or why do we need to think in this direction? Yeah, so the question, just to summarize it for everybody: why do we need to think about human consumption as distinct from systemic consumption? That's the question, right? Human consumption tends to be a lot more exploratory in nature. The kinds of workload patterns that it creates are very different from the kinds of workload patterns that systemic consumption creates. In the case of systemic consumption, there's a model that you created, and the system is pretty much reading from that model. Because you've built that system to read data from your data systems, you know what kind of patterns it will create, and you can significantly optimize for that. And plus, because this systemic data consumption will, in a lot of cases, have a customer touch point, you need to have a very hard bound on latencies.
Because if your data is not bounded on latency, and latency here is not just in terms of time to first byte, but much more in terms of freshness: how old is this data that I'm actually consuming? When systems are consuming it, they might need to make business-critical decisions, as in: if this data is older than such-and-such time, say older than five minutes, I'm not going to use this data because it's no longer useful to me. Okay, so systemic consumption workload patterns are fundamentally different from human ones. For humans, then, maybe you need to give very summarized information. Yes, that's also true. The volume of data and the diversity of data that a machine can process can be a lot, lot more than a human can. I was implicitly talking about that when I was talking about workload patterns, but you're right. Yeah, because whether data is old or not applies to both, whether it's a human or a machine, right? Sure, yeah, I agree. Okay, thanks. Yeah, sorry, we'll come back to you. I have one question. What are the factors that Flipkart has considered in forcing customers to use mobile apps only, rather than desktop applications? This is a technical conference, so I'm going to pass on that question. I would love to discuss that, but offline, at length. So are you discovering innovative customer behavior using mobile apps? We believe so, yes. So with a lot of real-time systems coming up, like you mentioned some use cases. Yeah. How do you differentiate between what needs to be processed in real time and what needs to be processed in batches? So how do you do that? What, come again, what's the real question here? So when to do the processing in batches? What needs to be done in batches and what needs to be processed in real time? I think it's very use case dependent.
There are use cases, say for example, when you're talking about personalizing something based on the actions that the user took in the past one minute; in that case, you're talking about much more real-time personalization. On the other hand, when you're talking about personalization on demographic information, because the gender of a person doesn't change on a minute-by-minute basis, or their spend propensity does not change on a minute-by-minute basis, those kinds of personalization are done in a much more offline fashion. So it's really use case by use case. Okay. Thank you. Hi, this is Sanket here. Part of my question was included in that question. So you said that we should focus more on real-time processing, right? I just want to be more clear on this, and a little curious also: what is Flipkart looking into for real-time processing, and how do you solve it? Okay. So, very good question, actually. These two statements are very different. One is: we are encountering a lot of real-time use cases. The second is: over-index on real-time capability. These are two very distinct statements, right? So we have anecdotally found that real-time use cases are growing in number. Okay. And we have a strong belief that people's mindset, and by people I mean a product manager, an engineer, all of the folks who are involved in building products, our mindset is way too prejudiced by what we have seen in the past. So instead of waiting for those real-time use cases to emerge, build out this capability and then force people to think about real-time use cases. Because if you've done that, you know that the experience that the customers are actually going to go through is going to be significantly better. The customers will be significantly more wowed, and competition will find it more difficult to react to this sort of a situation. That's our belief.
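(Editor's sketch: the speaker's earlier point about freshness bounds, a consuming system refusing data older than, say, five minutes, is a small but concrete piece of this real-time orientation. The five-minute bound and the function names below are illustrative, taken from the example the speaker gave, not from a real Flipkart system.)

```python
import time

# "If this data is older than five minutes, I'm not going to use it."
FRESHNESS_BOUND_SECONDS = 5 * 60

def usable(record_timestamp: float, now: float = None,
           bound: float = FRESHNESS_BOUND_SECONDS) -> bool:
    """A consuming system's freshness gate: reject records older than the bound."""
    now = time.time() if now is None else now
    return (now - record_timestamp) <= bound

# Usage: a downstream system checks freshness before acting on a record.
now = 1_000_000.0
assert usable(now - 120, now=now)      # 2 minutes old: fresh enough to use
assert not usable(now - 600, now=now)  # 10 minutes old: stale, fall back to a default
```

The interesting design consequence is that latency here is a business contract, not just a performance number: the consumer decides what "too old" means for its decision.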
You mentioned writing unit tests around data. Yeah. So can you please elaborate more on that? How should the thought process be when you are writing a unit test around data? So, say for example, you change a particular feature in your product, and you validate that the data points that were supposed to be generated are actually getting generated. That's the unit test part. Then, when you are productionizing this, the way you'll think about quality is: is there any metric that is regressing, that is degenerating, right? In the same fashion, you would be observing, you would be having alerts around: is your data stream volume seeing a dip or not? So it's the same notion. Yeah. I think by real-time processing you mean real-time streaming? Streaming is the most commonly used way of solving for real time, but right now I'm specifically shying away from calling out any implementation detail. Tomorrow, if, say for example, compute capacity grows by such a huge amount that you don't have to rely on streaming and you can do real-time calculations on very, very huge data sets in a non-streaming way, so be it. Hello. Yeah. So what are you using for that? Are you using Kafka, Storm, or storing it in which database? So we actually have a talk coming up, Sid's talk, today afternoon, on joining streaming data sets. All of those questions are better answered there. Hello. You talked about over-engineering, over-instrumenting. I think it's very insightful.
And coming from a traditional data warehousing world, going back 10 years in time, we were explicitly told the primary problem is: give space for OLTP transactions and move the data into a different area. So I think it's a big cultural shift in a way. That's why you see organizations have horizontals which do data management. Yes. You'd have to move away from that mindset and make this mainstream within the organization. I think that's a very radical point of... Oh, it is. It absolutely is. In fact, I'll remind you, because you're talking about that era, you would remember that 70% of that horizontal team's time would go into what is called data sanitization. Right. Yeah. That time would get spent because the guy who was producing that data didn't give data that kind of seriousness. Right. And one thing that you can help with, for organizations which want to leap, is: how do we balance transaction systems with the data? Because over-instrumenting data comes at a cost. So you obviously would have done something with in-memory grids and stuff like that. I'm not trying to take away this time on that, but something that you can allude to would help. You mean in terms of implementation? Yeah, yeah, yeah. So in terms of implementation, there are actually a whole bunch of talks, I think five talks from Flipkart, each one of them in some way or the other talking about what we've done. Those might actually be much better places, and the context would be more relevant for everybody. Sure. Yeah. Thank you. All right, cool. Okay, so I see a few more hands. I really don't want to be standing between people and their lunch. No, we do have time. All right. Okay, here. I had a question on the machine learning part. I wanted to understand how Flipkart uses machine learning to reduce and control returns. To reduce returns, instead of me answering that, I will point you to the data scientist who's actually working on that problem.
Why don't you meet me offline, and I would be very happy to connect you with him. Great. Thank you. Are we thinking about using deep learning now at Flipkart? Where are you? Okay. Yeah. Come again. Are we thinking about using deep learning now for problem solving at Flipkart? We are already using it, actually. The talk tomorrow morning covers some of the work that we've already done. Hello. Yeah, you mentioned one of the patterns saying data as IP, right? When you mention data as IP, are you saying the amount of data you're collecting can be used as IP, or more the process? That's my first question. Amount and richness, as in the quality of that data. I can store petabytes of data which is worthless. So do you also think that, in a general context, IP is a little more like something which you can use over a long period of time? But do you think that is true for this kind of a context where you said data as IP? Because data itself is changing a lot, right? Data is changing a lot. See, the point is, okay, let me elaborate on what I mean by data as an IP. Today I understand better than Samsung why a particular phone sells less or more as against its competition. Why am I able to do that? Because I have access to much more fine-grained data about customer behavior, as in: when they come onto this particular product page, what is the actual product that they end up buying? So that's my IP, okay? At what point in time I start using that in order to make this IP useful is a separate matter, but that doesn't take away from the fact that data is really the true IP that we have. Yeah, one question here. So, keeping privacy concerns apart, are you thinking of doing something like a credit score for each and every user? So every kind of use case becomes available, including that. Hey, hi, I think your talk was pretty insightful and certain points are very interesting. So I have a question around, you mentioned scaling early, right?
So, I want to understand it a little bit more, because currently DevOps processes and cluster management and all of those things have become pretty evolved. It's easier to auto-scale and do all of these operations. But I'm wondering, that's not the kind of scale you're talking about. Are you talking about scale when thinking about over-partitioning your data? I basically want to understand more: when you mentioned the word scale, what exactly do you mean? Sure. See, auto-scaling is going to be constrained by the architectural choices that you've made. Okay. There are very few architectures that scale indefinitely, because there are trade-offs in every kind of architectural decision that you make, right? DevOps is essentially taking care of production management for you within the constraints of the architectural decisions that you have taken. My point about scale was to force all of you to think about a much bigger volume of data than what you might actually be expecting, and to incorporate that in your architectural decisions. Sure. So let's say we are talking about distributed systems, right? So if a distributed system is architected well... so you're saying probably we are not even architecting distributed systems well enough, and the problem is coming to the fore because we are actually dealing with larger volumes of data. Yeah, actually we usually do not architect distributed systems well, for a simple reason, and this is particularly true for startups: there's a balance that you strike between your time to market and the maturity of your engineering system. Doing it the right way takes more time, because you think about all the various ways that things can go wrong, all the various kinds of workload patterns that might hit my system. You think about all of those things. That time may not necessarily be available.
That's the reason why I was talking about that balance that you strike between the two. In the case of data, err in favor of doing it the right way over time to market. That's what I was saying. Last question, please. You spoke about OLTP, OLAP, and what may be useful there. You spoke a little bit about what problems real time is useful for. Can you give examples of when you would use real time over something that requires pre-processing? Okay, firstly, I didn't use the words OLTP, OLAP. I think those are constructs and notions of an earlier era. Regarding the question about when real time and when not real time, like I answered in one of the previous questions, it's really use case dependent. But in general, the approach that we are taking is to force people to think about real time. What would you do as a product manager? What use cases can you come up with if you had this capability? So we are actually flipping the problem there. It's not about a product manager coming and telling you, I had this use case in mind but it requires real-time capability. I'm flipping it around and saying, this real-time capability is available, better than anybody else's. What can you do with it? All right, thanks a lot, folks. That was supposed to be the last question. I have a question here. [Inaudible.] Yeah. So we talked a lot about real-time processing. Can you clarify more on the word real time? How fast, or what do you mean by real time here? Real time is a slightly fuzzy word. When I talk about it, I usually mean sub-minute latency. But at times a real-time system, say a nuclear control system, has to handle things within microseconds of latency. I agree. On a use case by use case basis, you would not be using the word real time. You would be talking in terms of, this is my time bound. You would be talking about that boundedness rather than the words real time. Real time, I'm using as a very broad umbrella term. Okay. One last question.
So, just to take off on the last response you had, which is, you're sort of turning it over to the product management team and saying, hey, we have all this data, we have this capability, why don't you think about things you can do? Yeah. How are they approaching that? How are they thinking about it? So they're actually very excited, and this is a very, very recent step that we have taken in this direction. The excitement is huge, and there are some very good ideas that have begun to emerge. There are things that we are thinking of to amplify that. So we are thinking about data hackathons, where you have this huge pool of data, this huge set of capabilities: what can you do with it? Hackathons specifically focused on data. Those are the kinds of things that we'll be using to nurture that orientation. Cool. Thanks a lot, folks. Yeah.