Okay, so hello everyone. I am Somya. I work as a senior software developer at Walmart Labs, on a team called Customer Backbone, whose primary goal is to make all the customer data available to every team in Walmart. That sounds quite simple, right? So let me complicate it for you. Walmart is the largest retail company in the US, and it has both an online and an offline presence: it has physical stores as well as an e-commerce website, it has a warehouse club called Sam's Club, and it has many acquired businesses like Jet.com and now even Flipkart, which you would have heard of. Millions of customers visit these platforms every day, and they generate tons of data in terms of their search, browsing and purchase history. If everybody in Walmart tried to get this data on their own, there would be a lot of chaos. So, being the good Samaritans that we are, we decided to build a single platform to collect all this data and make it available to everyone, so that they can come in and make queries like "give me all the customers who like Apple products" or "get me all the customers between the ages of 20 and 30 who bought a clothing item in the last month."

Our consumers mostly fall into three categories. One is the marketing team that sends out the promotional emails you would have seen, saying there's this offer on this product and that offer on that one. Then there's the personalization team, which uses user preferences to customize the website for you. And we also have a bunch of very intelligent data scientists working on multiple models, such as which users are most likely to make a purchase in the electronics department, and so on. We also ingest these data models and make them available to everybody in Walmart.

So let me tell you how we did all this. The first question is: how do we store this enormous amount of data? We went for the obvious choice, the Apache Hadoop ecosystem, and I'm sure all of you already know it and have worked with it, so let me mention just the two things we thought were best for us. The first is that it's highly scalable. The amount of data we have can never reside on a single machine, or two or three machines; we need lots of machines, and Hadoop lets us distribute our data, keep it highly available and maintain replicas of it, so we don't have to worry about all those things. The second is that Hive's query language is very easy to learn, so anyone and everyone in Walmart can pick it up and run whatever queries they want on our dataset.

The typical data lifecycle looks like this. We ingest the data from the source and do some basic cleanup, like removal of nulls and blanks. We then merge the customer records from these multiple platforms, identified by things like store ID, website ID and device ID, based on attributes such as phone numbers, email addresses and IP addresses. Once the customers are merged, we may have to fill in some defaults, because a store customer will not have a home address while a website customer will. We also standardize the data, because it comes from so many different platforms: if it's a time-related facet, we convert it to epoch time; if it's a currency amount, we may convert it to a single currency.
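As a rough illustration of that cleanup and standardization step, here is a minimal HiveQL sketch wrapped in Python. The table and column names (raw_orders, clean_orders, order_ts, usd_rate) are hypothetical, and the real jobs cover many more facets; this only shows the shape of the transformation.

```python
# Minimal sketch of the cleanup/standardization step described above.
# Table and column names (raw_orders, clean_orders, order_ts, amount, usd_rate,
# gender) are hypothetical; the real jobs handle many more facets.
STANDARDIZE_HQL = """
INSERT OVERWRITE TABLE clean_orders PARTITION (ds='{ds}')
SELECT
  customer_id,
  -- time-related facets are converted to epoch seconds
  unix_timestamp(order_ts, 'yyyy-MM-dd HH:mm:ss') AS order_epoch,
  -- currency amounts are normalized to a single currency via a rate column
  amount * usd_rate                               AS amount_usd,
  -- missing values get sensible defaults
  coalesce(gender, 'UNKNOWN')                     AS gender
FROM raw_orders
WHERE customer_id IS NOT NULL
  AND trim(customer_id) != ''
  AND ds = '{ds}'
"""


def build_standardize_query(ds):
    """Fill in the partition date for a daily run."""
    return STANDARDIZE_HQL.format(ds=ds)


if __name__ == "__main__":
    print(build_standardize_query("2019-06-01"))
```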
Also, joining in Hive is a very costly affair, while storage is quite cheap. So what we do is denormalize the data: we do all the joins on the back end and serve the joined data, to make the lives of our consumers easy. We also partition this data for the same reason, then run multiple checks to make sure we are consuming the data completely and correctly, and finally publish it.

With this in mind, we could have designed the system like this: take the user identities, create the mappings, join them with all the source data in one step and publish to a final table. But as I said, joining is a costly affair, and joining 20 or 30 sources in a single call means you need lots and lots of resources, and that query will probably run for hours, maybe even days. So a single big join was not something we could do. Instead we added a staging table in between: the data from each source is joined with the staging table one by one, and then finally published to the final table. But even this setup ran for around 20 to 25 hours. So instead of having a single staging table, we added partitions to it, and the data from each source goes to its own partition in the staging table. This parallelization helped us a lot and brought the runtime down from 20 hours to five hours. It also made the developers' lives easier, because previously an integration test required a full day, and now it takes just four or five hours.

One thing that was still very wrong with our entire setup was that we were running all these tasks through crontabs, and believe me, that's not a good option. Crontabs are very difficult to debug and manage, especially when there are multiple people working on the same set of tasks. So we needed a good scheduler; we looked at a bunch of very good schedulers and finally decided to go with Airflow. There are a lot of good things about Airflow, and I don't have the time to discuss all of them, but I'll tell you the few we like the most.

The first thing we liked very much about Airflow is the way you can generate the DAG itself programmatically. Unlike other schedulers, what we do is keep all the metadata for our sources in a MySQL table that Airflow fetches at runtime; it iterates over that metadata and creates the tasks. That means every time we have to add a new source, we don't have to write code for it: we just add the basic metadata to MySQL, and the DAG picks it up automatically.

The other thing is that Airflow is highly scalable. The scheduler is completely independent, and you can have as many workers as you want. We have a common setup for our entire team, with different workers for different kinds of workloads, so each of us has an independent setup that doesn't have to care about the rest of the tasks. This architecture also helps with maintenance: suppose something is wrong with your scheduler and you want to redeploy it, but there's a task already running on a worker that you don't want to stop. This setup lets us deploy the scheduler or the workers independently without affecting anything else.
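To make the metadata-driven DAG generation described above concrete, here is a minimal Airflow 1.x-style sketch. The source_metadata table, its columns (source_name, generate_hql, enabled) and the connection IDs are assumptions; the real DAG has many more task types and dependencies.

```python
# Minimal sketch of generating Airflow tasks from source metadata stored in MySQL.
# The source_metadata table, its columns, and the connection IDs are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.hooks.mysql_hook import MySqlHook
from airflow.operators.hive_operator import HiveOperator

default_args = {"owner": "customer-backbone", "retries": 1}

dag = DAG(
    dag_id="customer_backbone_ingest",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# One metadata row per source; adding a new source means adding a row here,
# not writing new DAG code.
rows = MySqlHook(mysql_conn_id="metadata_db").get_records(
    "SELECT source_name, generate_hql FROM source_metadata WHERE enabled = 1"
)

generate_tasks = []
for source_name, generate_hql in rows:
    generate_tasks.append(
        HiveOperator(
            task_id="generate_{}".format(source_name),
            hql=generate_hql,
            hive_cli_conn_id="hive_default",
            dag=dag,
        )
    )

# A single publish step waits for every per-source generate task.
publish = HiveOperator(
    task_id="publish_final",
    hql="SELECT 1  -- simplified stand-in for the real publish query",
    hive_cli_conn_id="hive_default",
    dag=dag,
)

for task in generate_tasks:
    task >> publish
```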
Airflow also has a very rich UI, which helps with a lot of things like restarting tasks and looking at the logs; all of that is available from the web server itself.

We also did a bunch of other optimizations. We ingest the data incrementally, so if we have already processed something once, we are not going to process it again; that also gives us idempotency. We also used compression: we looked at the ORC format with a bunch of compression codecs, and ORC plus Zlib gave us a very good result. It reduced our storage space by a lot, and it also benefited us where runtime was concerned, making our jobs faster, so it was a very good optimization.

Apart from that, for data reliability, every time a task runs we compare the input data with the output data to make sure we're consuming all the data correctly. We create snapshots of our data, which means rolling back is as simple as deleting a done file. Our system is fault tolerant, meaning we take care of all the edge cases, even the ones that occur only once in a blue moon. And monitoring, in my opinion, is the most important thing you need when running a large-scale application: you need to be aware of things going wrong with your system even before a user has a chance to raise a ticket.

We also used VerdictDB to do some approximate query processing. In very simple words, VerdictDB creates an intelligent sample of your data, runs the query on that sample and gives you a projection for the entire dataset. How does this help us? Suppose you're a marketer who wants to promote the latest iPhone X, and you want to create a segment of users who bought an iPhone in the past, say, six months. Once the data comes in and you look at the segment, the number of users is very small. So you decide, okay, let me get all the users from the last five years. And if that is also too few, then maybe get all the users who bought any Apple product in the last six months. A marketer always has to try a number of filters to get to the final segment they want, and because the dataset is so huge, each segment creation takes around eight to ten minutes. Waiting that long every time you change a filter is not a good idea. In comes VerdictDB: as I said, it creates a sample and runs the queries in a very short time, like five to ten seconds, and gives you an approximate count of how many users match those filters. Once you're happy with the approximate number, you can generate the exact one. VerdictDB claims about 99% accuracy, with roughly a 1% margin of error, which is a good number for an approximate count.

With all those things in mind, what we finally had was this: we create the identities of the users, generating them based on the things I explained earlier. We have a set of generate tasks, one per source, that pull all the data from that source, and another set of tasks that collect metrics from each source. We group the source data by the identities, compare those numbers with the numbers we got from the sources, and then finally publish. With this entire setup, the pipeline runs in three hours.
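Going back to the VerdictDB flow described a moment ago, here is a rough sketch of how those approximate counts might be issued. The run_verdict helper is a hypothetical stand-in for a VerdictDB connection (JDBC or pyverdict), and the scramble statement and table names are illustrative rather than exact syntax, so treat them as assumptions.

```python
# Rough sketch of the approximate counting flow with VerdictDB.
# run_verdict() is a hypothetical stand-in for a VerdictDB connection; the
# scramble DDL and table names are illustrative, not exact syntax.

def run_verdict(sql):
    """Hypothetical helper: send SQL through VerdictDB and return the result."""
    print("verdict> " + " ".join(sql.split()))
    return None


# One-time: ask VerdictDB to build an intelligent sample ("scramble") of the
# big purchases table.
run_verdict("CREATE SCRAMBLE customers.purchases_scramble FROM customers.purchases")

# Interactive loop: tweak filters and get an approximate count back in seconds
# instead of the 8-10 minutes an exact query over the full data takes.
approx_count = run_verdict("""
    SELECT count(DISTINCT customer_id)
    FROM customers.purchases
    WHERE brand = 'Apple'
      AND purchase_date >= date_sub(current_date, 180)
""")

# Once the filters look right, the same query is run directly against the raw
# data (without VerdictDB) to produce the exact segment.
```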
We are consuming data for 500 million customers who are present on the web as well as on 400 mobile devices, generating 1 million daily activities on the site and 24 million daily activities on mobile. And so, that was it. Now on to the most dreaded part of this presentation. Fair warning: I may not be able to answer all of your questions, but I'll try my best.

Thank you, Somya. Can we have an applause, please? Okay, how many of you have questions? Can you please raise your hands? Just one? That's it? Okay, can we get somebody to give him the mic, please?

Hello? Hi. Can you talk a bit more about the reliability checks you mentioned, where you compare the input and the output? How do you do that on the final dataset, given that your pipeline already takes so much time?

We try to make sure that all the reliability checks we do don't affect the runtime. If you see here, the metric collection from the sources happens in parallel with the generate tasks, which means that even if we didn't have those metric tasks, we would still be spending that time on the generate tasks, so this keeps the added time small. Then, once we have the data in the final table, the comparison does take some time, but I think that's necessary, because we do want to make sure we have the correct data. Apart from that, we also have some checks while the generate tasks are running, and again we try to make sure all of these run in parallel so we're not spending a lot of extra time on them.

As for the metrics we collect: the most basic thing we check every day is that the count of customers in each source table is the same as in the final table. Apart from that, we have weekly jobs, and test suites that we run when deploying a new change, that compare the exact values of the facets. Suppose I'm collecting gender for a customer: we make sure we have all the customers, and that their data matches what's in the source. But this is a heavy query, so we run it only once a week when the load is low, or whenever we're making a new change, to make sure the changes are not breaking anything.

This is regarding the cleaning: you mentioned that you have to clean the data before you hand it to the scheduler. How do you automate that?

One thing we do have to do manually is look at the data when we're adding a new source, to make sure there isn't any new kind of value we're not already handling. But otherwise, things like default checks and null checks are automated. Take gender again: there are only three or four values that should ever be in that column, so if we get something else, we ignore it, and nulls and blanks are always present in all of the sources. We also run a basic set of sanity tests before adding a source, to find out all the kinds of values that could be present in it.

Hey, nice talk. Can you talk a bit more about the staging tables?

Okay, so the staging table is basically a kind of replica of the final table. The only difference is that it's partitioned by source, so whenever we get the data from one source, we put it into that source's own partition in the staging table.
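As a rough sketch of that staging layout, the schema might look roughly like the following, assuming hypothetical table and column names; the real table is much wider. Because every source writes only to its own partition, the per-source loads can run in parallel without touching each other's data.

```python
# Rough sketch of the staging layout just described: same columns as the final
# table, but partitioned by source so every source loads into its own partition.
# Table and column names are hypothetical.

STAGING_DDL = """
CREATE TABLE IF NOT EXISTS customer_staging (
  customer_id string,
  email_id    string,
  store_id    string,
  device_id   string,
  gender      string
)
PARTITIONED BY (src string, ds string)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB')
"""

# Each per-source generate task overwrites only its own partition.
LOAD_SOURCE_HQL = """
INSERT OVERWRITE TABLE customer_staging PARTITION (src='{source}', ds='{ds}')
SELECT customer_id, email_id, store_id, device_id, gender
FROM {source_table}
WHERE ds = '{ds}'
"""
```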
And once the data from all the sources is available in the staging table, we group by the identifiers, like store ID and email ID and so on. So the structure is the same as the final table we have at the end; it just has different partitions.

Just to clarify, the joining basically happens in two stages, right? One join from each source into the staging table, and then the final join? Right, right.

Hi, Somya. Thanks for the talk. You said that you're partitioning by source. Each source will have a varying amount of data, so that means you're generating skew in the system. How do you deal with that? Yes. Mostly, since we are getting the data at a customer level, the data is not that skewed. But sometimes we have seen it, for example when we get the data for devices: the number of mobile devices will obviously be larger than the number of actual customers. So we do have some bucketed joins, for example the first time we join the identifiers and the mappings; in that case we try to make sure we have a good distribution. And apart from that, since we are bucketing, we decide the bucketing scheme based on the data that's going to come in: if it's going to be a large set of data, we use more buckets, so that the number of items in a single bucket stays roughly constant.

Hi. I just wanted to know more about how you manage backups. You said you maintain versions of the data, right? In the warehouse, how do you go about keeping a backup of this entire big chunk of data and managing it in versions? I mean, it's a huge task. So Hive, on top of the distributed file system, definitely helps us keep the data replicated across multiple machines; that is one thing. Apart from that, every time we ingest data, we create partitions based on the date, so if I'm adding some data today, it will have a separate partition based on the date as well as other things. As I said, this also helps us with rolling back: if the data in the latest partition is wrong, we just delete that partition, or maybe delete the done file for that partition. And yes, since we have a lot of data, we do need a lot of machines to hold it all, but we don't keep the replicas indefinitely: we keep maybe the last 10 or 15 months, so that if we are rolling back, we have scope to go back three or four steps, but not more than that.

Hi, Somya. Thanks for the talk. I have two questions. One, do you do any kind of enrichment or deduplication in your staging layer, or even after that? And two, what choice of data structure or data model have you made that makes all of this easy? On the first question: we don't enrich the data ourselves, but we do have several models. For example, a store customer won't be providing a lot of data directly; taking gender again as the example, a store customer will not be entering somewhere that they are male or female. So we have data models for that, where the data scientists look at the purchase history and build models on top of it, and we ingest that data and make it available as well.
So we're not doing that enrichment ourselves, but a bunch of data scientists from other teams are working on multiple models to derive more data, probabilistic data, beyond what we already get from the customers themselves.

As for the data structure, we have a very flat structure. Most of the data has a single value or a few values, so it just goes in as a column. We also have some complex data, for example purchase history: a purchase will have multiple attributes, like what items you bought, what department they were in and the date the purchase was made, so we use a struct for that. Even there we don't want too much nesting, because that would again make the queries slow; we try to keep it as flat as possible so that the final queries run in good time.

Hi, I have a question. How many segments do you eventually create? That's the first question. The second question is, how do marketers use these segments? Right now I think we have maybe five or six segments per hour being generated. And the marketers usually use these segments, for example, if there's a new iPhone coming in, to get all the users who showed an interest in Apple products in the past: they might have bought one, they might have browsed one. They use this segment to send out emails saying this new product is available. There's also retargeting: for example, if you look at something today and don't buy it, then tomorrow we might show an ad for it on Facebook or Twitter, to remind you that you can go ahead and buy it.

Do we have more questions? Having an active audience asking you so many interesting questions is the best feedback, isn't it? Thank you for such an amazing, nice first talk of the morning. Can you please give her another round of applause? That was so amazing. Thank you. Thank you, everyone.