Covering Spark Summit 2017, brought to you by Databricks. This is theCUBE, and we're having a great time at Spark Summit 2017. One of our last guests of the day here is Shafak Abdullah, who's the director of data infrastructure at The Honest Company. Shafak, welcome to the show.

Thank you.

Now, I heard about The Honest Company because of the celebrity founder, right? Jessica Alba?

That's correct.

Okay, but how did you end up at the company? Weren't you in a startup before?

That's exactly right. We did a startup called InSnap before we actually came to Honest. The way it happened is that InSnap was about instantaneously building personas using machine learning and a big data stack, and Honest at that time was trying to find someone who could help them with their data challenges. So InSnap was the right piece in terms of its technology and its expertise in big data and machine learning. We built real-time, instantaneous personas to increase engagement and monetization, backed by big data, machine learning, Spark, and state-of-the-art technology. We used that to help Honest become data-driven, to solve their next-generation problem of making products that drive value out of data, understand their customers better, operate the business better, and optimize the business better. That is why they acquired us, and essentially we integrated the technology into their stack, and not only the technology but also the culture, the business processes, and the teams that operate them.

Okay, we're going to dive into some of the technical details about what you're developing, with George, in just a second. But I have to ask: company culture is really important at The Honest Company, right? They're well known for being eco-friendly and socially responsible. What was it like moving from a startup into that company environment?

It was just a natural thing. Of course, Honest was a much bigger startup, four or five years after it was initially created, whereas InSnap was very lean, agile, and much more data-driven. That was the biggest difference. The way we solved it was that they allowed us to create a data organization, called Data Science, which was heading all the data initiatives. Then we worked with other cross-functional teams, with finance, accounting, growth, and sales, to help them understand what their needs were and how to become truly data-driven by driving value out of the data using state-of-the-art technology. So it was a mix of team alignment and cultural change, focused on the business goal and bringing everyone together around it to make the change. And we really enjoyed carrying out this journey of taking Honest from being just descriptive, which is essentially finding what has happened in the data and generating reports for revenue, to becoming more predictive and prescriptive, which is more like advanced analytics and the advisory role that data plays in making decisions around features, the business, and operations.

And George, you've talked to a lot of customers today, and some of the same themes have come up. Do you want to drill down into some of the details?

I'm curious about how you chose the first projects, you know, to get quick wins and to establish credibility.

Yeah, that's actually a very good question.
And basically we focused on the low-hanging fruit, to give us a jump start and to build a reputation, so that we could then take on more advanced, strategic projects. For example, if you went to Honest.com and used the search bar, the search was very flimsy and was not returning good results. We had already built a matching engine, so it was very easy to extend it into a full search engine. That was the first deliverable we thought we could deliver, and we delivered it in under a month and a half or two months, right when we came in. And it was like, hey, these guys just improved our search by 10x or 100x, we're getting many more hits and much better coverage of the search terms. That was the stepping stone.

Another piece we wanted to tackle was, hey, how do we improve Honest's recommendations? That was another project. But before doing that, Honest did not even have a data warehouse it could call an enterprise data warehouse, where you can get all the data in one place, like a data lake. The data was siloed across organizations, and the analysts could not really get the data in one place to mix, match, and analyze it. So that was another big piece we did, and we did it very early on. That was our second big deliverable, even before recommendations: the data warehouse. Basically we plugged Spark in right in the middle to pull the data from the different sources, ship it in, and built this ETL engine which extracted, transformed, and loaded the data into the data warehouse. That broke down those silos and made the data into a cohesive data lake which could be used for driving value, understanding patterns, and especially for machine learning, for analysts, and for all the decision makers.

Was it a data warehouse or was it a data lake? The reason I ask for the distinction is that a data warehouse is usually extremely well curated for navigation and discoverability, whereas a data lake is, as some people say, a little step up from a swamp.

Yeah, that's right. When I say data lake, I say it because we have two kinds of, let's say, data aggregation or data gathering infrastructure. One is backed by Spark and S3, which we call the data lake, where unstructured and structured data, all kinds of data, are mixed and matched. It's not always that easy; sometimes you need to do some transformation on top of the data sitting there in order to really get to the needle in the haystack. The data warehouse is in Redshift, which gets the data from this data lake, or from the Spark ETL engine, and turns it into more metric-driven reports, so it's easily discoverable and is more like what the business requires right now: formal reports where the dimensions and all the attributes are much more well thought out. Whereas the data lake is kind of throwing it all into one place, so that at least we have the data in one place and we can analyze and process it.
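To make that pipeline concrete, here is a minimal sketch of what a Spark ETL job of this shape might look like in PySpark: read raw events from the S3 data lake, curate them against the reporting data model, and load the result into a Redshift table over JDBC. The bucket, table, and column names and the connection details are illustrative placeholders, not Honest's actual schema, and the write assumes a Redshift JDBC driver is available on the cluster.

```python
# Sketch of a Spark ETL job: raw S3 data lake -> curated Redshift warehouse.
# All names and connection details below are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("orders_etl").getOrCreate()

# Extract: raw, semi-structured events land in the data lake as JSON on S3.
raw_orders = spark.read.json("s3a://example-data-lake/raw/orders/*.json")

# Transform: keep the fields the reporting data model needs, fix types,
# and roll line items up to one row per order.
orders = (
    raw_orders
    .filter(F.col("order_id").isNotNull())
    .withColumn("order_ts", F.col("order_timestamp").cast("timestamp"))
    .groupBy("order_id", "customer_id")
    .agg(
        F.sum("line_item_amount").alias("order_total"),
        F.max("order_ts").alias("order_ts"),
    )
)

# Load: append the curated result into the Redshift warehouse over JDBC.
(
    orders.write
    .format("jdbc")
    .option("url", "jdbc:redshift://warehouse.example.com:5439/analytics")
    .option("dbtable", "curated.orders")
    .option("user", "etl_user")
    .option("password", "***")  # supplied from a secrets store in practice
    .mode("append")
    .save()
)
```

In a Databricks environment, a job like this would typically run as a scheduled notebook or job against the same S3 buckets that back the data lake.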
In putting all the data first into the data lake and then essentially refining it into the data warehouse, what did you use to keep track of lineage, and to make sure you knew the truthfulness behind all the data in the data warehouse once it got there?

Yeah, so basically we built a data model on top of S3 and Spark, and we use that data model as the basis, the source of truth, to feed the reports, and that data model is consistent wherever you find it. We want to make sure that the attributes, the dimensions, and anything related to that data model, for the e-commerce as well as the offline platform, are consistent. So we use Spark and S3 to keep that data model consistent, and we also use a bunch of advanced monitoring, so that when we are processing jobs we make sure we don't lose data. We also removed the coupling between the systems by decoupling them, and in the next version we made it event-based streams. That was the general strategy we adopted to make sure we have consistency across the data lake and the data warehouse.
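The "make sure we don't lose data" checks described here could be as simple as reconciling row counts between the lake's data model and the warehouse. Below is a hedged sketch of that idea in PySpark; the S3 paths, table names, Redshift connection, and alerting step are assumptions for illustration, not Honest's actual monitoring setup.

```python
# Sketch of a reconciliation check: compare daily row counts between the
# curated data model in the S3 lake and the Redshift warehouse, and flag
# any day where records appear to have been lost. Names are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake_warehouse_reconciliation").getOrCreate()

# Daily counts from the curated data model in the lake (Parquet on S3).
lake_counts = (
    spark.read.parquet("s3a://example-data-lake/curated/orders/")
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.count("*").alias("lake_rows"))
)

# Daily counts from the warehouse table, read back over JDBC as a subquery.
warehouse_counts = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:redshift://warehouse.example.com:5439/analytics")
    .option("dbtable",
            "(select trunc(order_ts) as order_date, count(*) as wh_rows "
            "from curated.orders group by 1) t")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)

# Any day where the counts diverge is a candidate for reprocessing or alerting.
mismatches = (
    lake_counts.join(warehouse_counts, "order_date", "left")
    .filter(F.col("wh_rows").isNull() | (F.col("lake_rows") != F.col("wh_rows")))
)

if mismatches.count() > 0:
    mismatches.show(truncate=False)  # in practice: fail the job or page on-call
```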
What would be the next step? I mean, now you've significantly enhanced business intelligence and you have this rich repository behind the data warehouse. What would you do next, either with the data in the data warehouse or the data in the data lake?

So we are constantly enriching our data lake, because that needs to be updated all the time, but at the same time we want to connect the business with our metrics and with the insights we derive from the data sitting in the data lake, to help optimize a problem. For example, we are working on sales optimization, operations optimization, and demand and supply planning, in addition to customer insights. We are also working on other strategic projects. For example, instead of just recommending, or predicting LTV or churn, we are trying to be more prescriptive in our analytics, where it takes an advisory role: it looks over all the marketing spend and not only predicts the high-LTV customers but actually allocates budget across the different marketing channels for omni-commerce, for example TV, display ads, all of that. So that's happening as we speak, as we enrich our data lake and generate those reports. Then we also need to circle back with the business folks, the decision makers, to really convince them to use it. That's why we created these cross-functional teams aligned to a business goal, contextually aware teams which know their roles and responsibilities but which can collaborate effectively and produce a result that drives the bottom line.

Yeah, what kind of customer insights were you looking for? I mean, they deliver family products, diapers, to the home, that sort of thing. What kind of customer insights were you looking for, and how's it working?

Yeah, so basically Honest, in order to target customers, needs to better understand what their needs are. So customer insights: for example, the demographics of the customers. In addition, we wanted to see what patterns are common among customers, so that we can recommend products that are being bought by one segment of customers versus another. Those common properties could be related to, say, mothers who have recently had a child, who live in this neighborhood, and who have this kind of income level. So how do we make sure we predict their demand before it actually happens? We need to understand their habits. We need to understand the context behind them: if they are making a search, how many page views they did for this kind of product versus that kind of product, and similar things which enhance our understanding of the customers, so we can put them into different buckets, or segments, and then use those segments for targeting. We already have data about LTV and churn, and the predictive models reveal whether a customer is going to churn for whatever reason. So we know that running a similar campaign for other customers has successfully given us more subscriptions and helped us reduce churn. That is how we target them and optimize our campaigns and promotions. You're also looking at the overall lifestyle of the people who are passionate about the Honest brand, or brands which exhibit similar values: for example, eco-friendly, safe, and trusted products.

Right, okay, we've just got a couple of minutes before we go to the break. This is great stuff, and George, I'll come back to you for a final question in just a moment. In 30 seconds or so, tell us why you selected Databricks. You probably looked at other options, right? Give us a quick why you made the decision.

Absolutely. When we came in at Honest, all they had was a bunch of MySQL developers and very limited big data knowledge. They really needed a jump start to get to that level in a very short time. How is that achievable? We didn't even have a dedicated DevOps engineer on our team. So Databricks helped to bridge that gap by giving us the infrastructure efficiency we needed, spinning up clusters in a hassle-free manner. They also have this notebook feature where we can scale the code and scale the team by reusing boilerplate code. Similarly, different teams have different expertise: for example, data scientists like Python and data engineers like Scala. So the Scala people write functions which can be called by the data science teams in the same notebook, essentially giving them the ability to collaborate effectively. We also needed a tool to give more interactivity and visualization to the data scientists as well as the data engineers, and Databricks has visualization built in, which helps us understand correlations right off the bat, without even importing the data into R or some other external tool and making those charts. So there were a bunch of advantages there that we wanted. And then it has platform APIs, like DBFS, a distributed file system on top of S3, which are cool APIs that again gave us the jump start we needed. In a very short amount of time we built not only the data warehouse but also data-driven products.

Sounds like Databricks has delivered.

Oh yeah.

Awesome. All right, George, just enough time for one more question if you want to throw one in.

This one is kind of technical, but not so much on the technology side: how do you measure attribution between channels in your omnichannel marketing?

That's a very good question. We have this project called Multi-Touch Attribution, and essentially the scope of that project is that we want to give the right weights to the right clicks along the customer's journey to subscription or conversion. So we have a model which uses a bunch of techniques, including weighted and linear regression, to come up with a way of distributing those weights across the different channels.
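One plausible reading of a "weighted and linear regression" attribution model is to regress conversions on per-channel touch counts and treat the normalized, non-negative coefficients as the channel weights. The sketch below shows that idea with Spark ML; the journey table, channel list, and weighting scheme are illustrative assumptions, not the actual Honest model.

```python
# Sketch of regression-based multi-touch attribution: regress conversion on
# per-channel touch counts, then use the normalized positive coefficients as
# channel weights. The table and channel names are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("mta_sketch").getOrCreate()

channels = ["paid_search", "display", "tv", "email", "direct"]

# One row per customer journey: touch counts per channel plus a 0/1 converted flag.
journeys = (
    spark.read.parquet("s3a://example-data-lake/curated/journeys/")
    .select(*channels, "converted")
)

assembler = VectorAssembler(inputCols=channels, outputCol="features")
train = (
    assembler.transform(journeys)
    .withColumn("label", F.col("converted").cast("double"))
)

model = LinearRegression(featuresCol="features", labelCol="label").fit(train)

# Clip negative coefficients to zero and normalize so the weights sum to 1.
raw = [max(c, 0.0) for c in model.coefficients]
total = sum(raw) or 1.0
weights = {ch: w / total for ch, w in zip(channels, raw)}
print(weights)  # share of conversion credit assigned to each channel
```

Weights like these could then feed the budget-allocation step mentioned earlier, splitting marketing spend across channels in proportion to the conversion credit each one earns.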
And then the first problem to solve was that we needed to instrument logging, so that we get all of those clicks and searches into our data lake. That was done beforehand, before starting the MTA project, because we have a bunch of touch points: the customer could be doing a search, calling our sales rep, tracking an order online, or just leaving a cart in a state where the order isn't fulfilled. And now we're also trying to bring offline data in on top of that. We're working on knowing what the customer is doing in store, so that the next version of this MTA gives them a seamless experience, whether in a brick-and-mortar store or online.

Great, well, that's great stuff, Shafak. I wish we had more time. I'll talk to you more after we stop rolling. And thank you for being so honest. We appreciate you being on the show.

Thank you, I really appreciate it.

Thanks so much. Shafak, that was great. All right, and to all of you, thank you so much. We're going to be back in a few moments with the daily wrap-up. You don't want to miss that. Thank you for joining us on theCUBE for Spark Summit 2017.