Once again, we're really, really packed. If you can't find a seat, please, somebody volunteer laps. Apart from that, [inaudible], please raise your hand. For attendance marking, see this gentleman here. All right, so for tonight: who is an honestbee user? Okay, the rest of you, please sign up before you leave the room.

Okay. Now today we have Dat. Prior to joining honestbee, I think he gave a talk previously, and you were at...? [inaudible] It's a small startup. Yeah, so he's given a talk before, and now we're going to listen to his talk about data science at honestbee. So I'll leave you guys to it, right?

So I know it's Monday evening, and thanks so much for all being here and for having me here again. For those who are interested, my previous talk was about data warehousing, data marts, data platforms and data visualization. This talk will be more about practical data science and data science product applications. That will be the first part, and the second part will be more about the data infrastructure that supports data scientists. And of course it's all at honestbee, which is where I work right now.

Before I start, I would like to know whether you are leaning toward data engineering and infrastructure or toward data science. If you are leaning toward engineering, software engineering or data engineering, can you give me a raise of hands? That's like 10% of the people here. Okay, and then the rest would be more data science, right? Okay.

So, a bit of introduction. My name is Dat. I'm currently a lead data scientist at honestbee. I come from a computer science and mathematics background, and I've been in Singapore for about six years now, working more or less in the data field and mostly in startups. You can find me basically everywhere: GitHub, my personal website, LinkedIn, Facebook, Twitter. Feel free to add me; just tell me that you are from this data science meetup.

So, yeah. About 15% of the people here have used honestbee before, but who are the people that know about honestbee?
Okay, so half of the people here. I'll just give a quick introduction to honestbee. There's a video that we have on YouTube that I'm going to show. Sorry. No, I couldn't [inaudible]. Okay.

"Time is a precious commodity these days. From the moment you wake up to the point you fall asleep at night, there are countless tasks and responsibilities bombarding you every step of the way. Take grocery shopping, for example. We all dread doing it, but there's no way of getting around it. We need to eat, drink and have the necessary household items to function normally. Wouldn't it be great if you could delegate grocery shopping to someone else responsible and spend that precious time doing something else you truly enjoy? Well, we've made your dream into a reality. honestbee is a full-service online grocery delivery company that exists for only one reason: to serve its customers to its fullest capacity. With a wide selection of products available from your favorite stores, customers such as yourself can enjoy same-day delivery. On top of that, you don't have to compromise on the quality of your groceries, because our trained shoppers hand-pick products to ensure the highest level of satisfaction. So no longer do you have to change from your PJs into something that you look presentable in. Neither do you have to spend money on petrol or put more miles on that prized car of yours. We'll deliver whatever you need, whenever you need it, at your doorstep. Feel free to give us a try and see what our customers are raving about."

So now that you know a bit about honestbee, I'm just going to do a quick recap. We used to be a full-service online grocery company. Now we do groceries plus laundry as well, since the beginning of the year. We are operating and have offices in Singapore, Hong Kong, Taiwan and Japan, and we just opened offices in Malaysia, the Philippines, Indonesia and Thailand. We have a wide range of supermarkets and boutique stores for you to choose from. We run a kind of Grab- or Uber-like referral scheme where you send a link to your friend, you get $10, and your friend gets $20 for the first purchase. So if anyone here hasn't done it, here's the link that you can use. And please let me know if there are any ideas, anything that you can contribute to honestbee.

So this is about data science; let's talk about data science. At honestbee, from a technical point of view, from the data science approach, we divide our data science problems into four groups.

The first one, which everyone should know, is predictive models: classification and regression. This is what you see the most in Kaggle competitions. You use algorithms like random forest, logistic regression and gradient boosting machines to predict something in the future or classify something that you don't know. For this type of problem we have item availability prediction, which I'll come back to later. We have customer lifetime value and customer profitability grading: we want to know whether a customer is going to be more profitable to us, going to be a very valuable customer in the future, so that we can engage them earlier. And then we also have customer demand forecasting: we want to know, for this day of the week, this week of the month, and whether it's in a promotional period or not, how many customers we are expecting to have.

The next one would be recommendation engines. It's very common in e-commerce business: "people who like this item also like these items", "people who looked at this item also looked at these items", that kind of thing. We will have recommendations on the website and for CRM campaigns as well.

Then the third one would be clustering analysis, more of a data mining thing, where you look at the current data set and you want to make it more useful.
You want to add more data into it. One example would be that you look at a customer's name and try to determine whether it's a female or a male. So we want to do customer segmentation and customer profiling, a 360-degree view of our customers within honestbee.

And then the last part would be operations optimization. honestbee also takes care of delivering the groceries or the laundry, so we want to have some kind of task scheduling, like UberPool or GrabHitch types of problems, where you try to minimize the cost but still keep the fulfillment level. And then, of course, route optimization.

For this talk I'll talk about two things: item availability prediction, and then the item-based recommendation engine that we currently have in house.

So the first one, item availability prediction. If anyone has bought something from honestbee before, you'll have seen that some items are missing sometimes. Why? Basically, the item is not available at the store when the shopper goes to get it. It's a very hard problem: if we wanted to solve it perfectly, we would need a system that updates the stock of every single item in FairPrice and in the boutique stores, so that we know exactly when an item is going to be available, which is something that we don't have.

So why is it important? If you are a customer of honestbee and you go to the website, the more of your items you get, the more fulfilled you feel. If you get everything that you ordered, you'll be happier. And of course, if we can deliver more items, we get more revenue and more profit. So how are we going to do that if we don't know the actual stock of the item at any point in time?
So we just build a predictive model. It's going to be a very simple binary classification. We try to communicate with the customer beforehand: when a customer looks at the website or tries to order something, they will know, not exactly, but with higher confidence, how many items are going to be available.

These are the features that we look at when we build the model. The date of delivery: whether it's Sunday or Monday, whether it's the end of the week or the beginning of the week, it all makes a difference in item availability. Of course product metadata: if it is more of a fresh product, it is harder for it to be available at the end of the day compared to the beginning of the day, for example. And of course the store metadata: the items that you see at, for example, FairPrice Finest compared to FairPrice Xtra are going to be different. We look at external data as well: the weather on the day the item actually gets purchased, whether it's a public holiday, whether there's a promotional period at the store itself. Looking over something along the lines of three to five years, financial data such as the inflation rate is also important. And of course we have the ground truth: we have so many orders from so many customers before, where we know exactly whether an item was in stock or not.

So before I go into the algorithm: who here has heard of XGBoost before? Okay. If you haven't, my advice today is that the first thing you do when you get home is look it up and try it. It's really the most awesome algorithm that you can find right now. It's a tree-based gradient boosting machine. In a sense it's kind of similar to random forest, but it makes the whole thing more efficient and it basically converges faster. It's available in Python, in R and in Julia; you can try it right now.
It's fully open source. It has been a state-of-the-art winning algorithm for a lot of Kaggle data science problems. I don't know how many; here are like four of them that either mainly use XGBoost or use XGBoost solely.

So let's talk about the evaluation metric as well. Why do we need evaluation metrics? We need them because we want to know whether our model actually works or not, actually improves or not. For this particular problem, we chose to go with AUC, that is, area under the curve. There are a few links here. Basically, we chose it because it is not affected by a highly skewed data set. For example, if our normal availability rate is about 90% and you just predict everything to be available, then with accuracy you get 90% correct, but that doesn't mean that your model actually makes sense. If you do that with AUC, your score would be 0.5, which is a failing score. So at the end, if you get something like 0.8 to 1, you can be quite confident in the model that you have.

So what does it actually look like in production at the moment? For every single item we come up with an availability score for the future. It ranges from 0% to 100%. Whenever it goes down to 50%, we put a "low in stock" flag on it, or more accurately a "likely out of stock" flag. So if you see these items online at the moment, try to go for something else. And actually, when we put it online, we saw customers change from getting random items, because they didn't know whether they were going to be out of stock or not, to being more likely to get items that don't have this flag, which in turn increases our fulfillment rate and increases our profit.

And then, the availability score is important. It's also used for other products, such as product ranking. You go into a department.
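The accuracy-versus-AUC point can be checked with a small sketch. This is only an illustration with made-up labels and scores, not honestbee's evaluation code; the function names `accuracy` and `auc` are hypothetical helpers, and AUC is computed directly from its rank-based definition (the chance that a random in-stock item is scored above a random out-of-stock item, with ties counting as half).

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of items whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win, the usual rank-based definition of ROC AUC.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up ground truth: 9 of 10 items were in stock (a 90% availability rate).
labels = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# A "model" that claims every item is certainly available.
always_available = [1.0] * len(labels)

# A model that scores the one missing item lower than all the others.
informative = [0.9, 0.8, 0.95, 0.7, 0.85, 0.9, 0.75, 0.8, 0.9, 0.3]

print(accuracy(labels, always_available), auc(labels, always_available))
print(accuracy(labels, informative), auc(labels, informative))
```

On this toy data the constant predictor reaches 90% accuracy but only 0.5 AUC, while the model that ranks the missing item last reaches an AUC of 1.0, which is the contrast described in the talk.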
You see a list of products there, and if everything with a higher chance of being out of stock is pushed down, you're more likely to get items that are in stock. And the same thing goes for filtering out products that are likely to be out of stock in the recommendation engine that gives recommendations to customers.

So the next product that I'm going to talk about... Yes?

Yeah, that's actually a very good question, because you can order right now and then get it delivered a week later or one day later. For now, we don't care about the date when it gets delivered; we predict everything for the future only. But by doing simply that, we already have a very good AUC score, so there's no point in trying to change the customer flow. Let's go back a little bit: when you order something from honestbee, you don't have to specify the date when you want it to be delivered, which doesn't help us when we try to predict for an exact point in the future. So what we predict is: at any time in the future, what is the chance that it's going to be out of stock?

Yeah, that's also a good question. We don't need it to be real time. We can simply push out the prediction scores at the beginning of every day, and that's it. So we update it every day, and whenever a customer clicks on an item they see it, which is basically a database retrieval instead of a machine learning API that needs to be real time. In a way, it's very practical the way we do it, so we don't need to worry about how to serve a hundred thousand customers at the same time.

So your algorithm is pretty awesome, right? You can predict if it's out of stock. So we tell customers it's out of stock, and people don't buy. So how do you collect new training data for when something is out of stock, and how do you evaluate?
So maybe you could be saying that it's out of stock, but it's not, and therefore you also lose business.

Basically we train on the data every day. Every day, for everyone who buys an item, we have the real truth of that item being in stock, which basically improves the engine by giving it more and more data every day.

That's a good point, but if no one actually buys items that are out of stock anymore, because you're so good, then you don't have any ground truth for items that are out of stock.

There'll be someone who buys it. Yeah, that's a good point. Because sometimes, let's say there's an item and it's the only item that you want and it's low in stock. They basically [inaudible].

Since it's a logical break, can we have everybody there just moving in a bit? I think there's some space behind, right? Sorry, we're just going to squeeze you a bit. Thank you. Thank you.

So for the second... Yeah?

So our engine actually runs on every single country separately, because we don't believe that Singapore data will affect Taiwan data. And again, we have the evaluation score, right? So whenever our engine reaches an evaluation score that is good enough, we put it into production. Otherwise, we wait until we have more data.

That's a very good question. It does matter a little bit, in the sense of the week of the year and the month as well, because you can only order within the next week. And we know that a particular percentage of people actually buy for today or for the next day only. So we still use that data. It's not fully used, but at least more than half of it is.

We don't have that data, but we tried an analysis where we try to see the pattern in the restocking of products, and as far as I can see it's very random sometimes, depending on which store, which day of the week, and whether it's a public holiday as well.

This is important for the future, audience.
Please hold your questions, because it's kind of packed. So for people with seats, I know you're really comfy. Let's finish it off first. Yeah, then we'll hold the questions to the end.

So, recommendation engines. Who has built a recommendation engine before? What algorithm did you use? I use the same thing. Yeah, I'll come back to that algorithm later.

Basically, a recommendation engine answers the question: people who buy something also buy what other thing? Why is it important? Well, for e-commerce companies it's important to give a better user experience. If you save every single customer five seconds when they buy something, it saves so much time. And within those five seconds you can either do something else or buy some more, and that's something that helps the business. It increases the cart size and increases the frequency of customers buying from us, because they like to come back. When you have good recommendations, you feel like your experience is personalized.

And so how do we do it? Collaborative filtering. It's a very well-known algorithm, or more or less a technique, that everyone like Amazon, Netflix and Zara all use. How do we do it? We use the Jaccard index as the algorithm, and we implemented it with Python and pandas.

So let me talk a bit about collaborative filtering. Again, it's a traditional, popular technique used in recommendation engines. The input of this algorithm is basically a user-item matrix.
The items are actually books in this case, and then you see some of the scores. The score can be a continuous value, whether it's one star to five stars, or, when a user watches movies, the percentage of the movie that the customer watched, from zero to a hundred percent. Or in some cases it can just be binary values: whether a customer clicked on it, visited it or purchased it.

From this there are two different methodologies: user-based and item-based recommendation systems.

So let's talk a bit about user-based. Basically, when you see "users like you usually buy these", that is where they use a user-based recommendation system. It kind of works for social networks; it works for taste-like recommendations such as movies on Netflix; it works for fashion; it works for things like Instagram: "people like you also like to follow these people", that kind of thing. All you need to remember here is that the input is the user-item matrix. With user-based recommendation, the output of the algorithm is a user-user matrix, where each score gives you the similarity between one person and another. You also have to keep in mind that the cost of this scales with the number of users. If you have a thousand users, you have to store a thousand times a thousand, that is a million, values. This is the recommendation that you usually see on the user home page, in emails and in in-app notifications.

The other type is item-based.
This is what, in my experience, is used more frequently. Basically it answers the question "users who buy X also buy Y". So it's used for complementary purchases. If you add something to the cart, you see a list of items that people usually buy together. You see it on Amazon; you see it on Lazada and Zalora as well. It sometimes gets used in news suggestions as well. The output of this algorithm is an item-item matrix, and again, its cost scales with the number of items. You usually see this in product-based recommendations: when you browse a product, you see a list of products similar to it. And in cart recommendations as well.

Okay, so the algorithm. It's very simple, actually. Just focus on this part, the gray area. Imagine you want to compare the similarity between item A and item B. The gray or blue area gives us the list of orders with item A, and the orange one gives the list of orders with item B. The intersection between the two circles over the union is the similarity score between A and B. So here it gives you one over three as the similarity score between these two, because there are two orders that have both items, over a total of six orders.

One thing to note is that it's very sensitive to sparse input. Sometimes you have an item-item matrix and a lot of it is empty. There are a few techniques that I want you to look up to solve it. Or the easiest technique would be: okay, I don't care about the bottom 70% of our items, I just care about the top 30%. That's the easiest way to solve it.

On the next slide is the actual code that we have in production right now: about 10 lines of code which implement exactly the Venn diagram that I had on the previous slide. We use Python at honestbee as the main language. We use pandas as a way to manipulate data frames. It's a very well-known library that I think everyone here should know if they use Python. The algorithm is actually not written by me; it's from this link that someone just posted there, and it's pretty nice.

So with this algorithm, what does it look like in production? It will be there soon. Right now you don't see it on our production site because we need to test it, deploy it and everything. But basically the engine can automatically generate suggestions of garlic, onions or other types of vegetables to go with this, even though it doesn't know that the customer is likely to cook something right now.
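The Venn-diagram computation described above can be sketched in a few lines. This is not the production code from the slide, which the talk says is written with pandas; it is a minimal illustration in plain Python sets, and the names `jaccard`, `item_similarities` and the toy `orders` dictionary are all hypothetical. The toy orders are chosen to reproduce the example from the talk: two orders contain both items, out of six orders in the union.

```python
from itertools import combinations


def jaccard(orders_a, orders_b):
    """Jaccard index: size of the intersection over size of the union."""
    union = orders_a | orders_b
    if not union:
        return 0.0
    return len(orders_a & orders_b) / len(union)


def item_similarities(orders):
    """Map each unordered item pair to the Jaccard index of their order sets."""
    # Invert the order -> items mapping into item -> set of order ids.
    orders_by_item = {}
    for order_id, items in orders.items():
        for item in items:
            orders_by_item.setdefault(item, set()).add(order_id)
    return {
        (a, b): jaccard(orders_by_item[a], orders_by_item[b])
        for a, b in combinations(sorted(orders_by_item), 2)
    }


# Hypothetical orders: A is in 4 orders, B is in 4, and 2 orders contain both.
orders = {
    1: {"A"},
    2: {"A"},
    3: {"A", "B"},
    4: {"A", "B"},
    5: {"B"},
    6: {"B"},
}

similarities = item_similarities(orders)
print(similarities[("A", "B")])  # 2 shared orders over 6 orders in the union
```

Because the score only needs each item's set of order ids, the sketch never looks at what the products are, which matches the point that the suggestions are generated purely from historical purchases.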
The engine doesn't really understand what the product is, but it knows what goes with it.

In the next example you see a baby product, and immediately you see baby food to go along with it. Again, it doesn't know that the customer is a parent who has a baby. And then the next one, obviously, is barbecue-style parties: here's the stuff you can have for your barbecue. And again, all the suggestions are made without knowing what the actual product is. Basically everything that's ingested into the engine is the historical purchases of the customers. And this is something that you see in production with the production data. Any questions?

Well, this is item-based recommendation. So whether I log in or I don't log in, when I look at a product, the recommendations that I see are the same. So it helps with the cold-start problem as well, which is the nice thing about it, and this is one of the reasons why most people use it.

Right now what we have is the historical purchases from the users, which is quite nice in the sense of groceries. Because if you go to Amazon or Netflix, you don't watch the show that you watched before. But if you go to honestbee, you buy the milk that you bought last week. You just keep doing it, so it's kind of useful. So we just keep it.

That's a good question. Right now we don't have an engine to determine whether you are Muslim or not. That's a good question; I think I need to tweak the engine a bit. But I think for religion-related problems, I would solve it by engineering: where a customer has stated that he's a Muslim, I just remove those items from the catalogue. It might be solved more easily that way, and it actually satisfies the customer better.

You're looking at the name? Sometimes we know, but yeah, that's something that we actually have.

Sorry? So it's based on the orders, on what customers bought together, instead of knowing whether it's a baby product. So it's actually not similarity between items, but items getting bought together.

Evaluation is one of the biggest problems with recommendation engines. Basically we don't know for sure, but what we can do is, most of the time, A/B test it. We have two different engines running together, or one with the engine and one without the engine, and we see the performance in terms of click-through rate and in terms of revenue. And that's something that we just have to try.

Association rules? I'm not sure what you mean, but we can take it up later.

Sorry? If we have a lot of items, then the matrix is going to be sparse. Yeah.

Basically we try to be practical. We just try A/B testing different algorithms together, and it could be the case that sometimes this algorithm is better, or most of the time this algorithm is better, and then we go with it. And it can be the case that which algorithm is better depends on the type of customer. So it kind of goes to another level, where you have both user-based and item-based recommendation. So basically, whatever works, and whatever works better.

So let's go to data infrastructure. In all the talks that I've given, I always want to talk about data infrastructure, because it's a thing that usually gets overlooked. I've been in Singapore for a while, and of the companies that I've talked with, a lot of them have this common pitfall: you try to start a data science team, and the first hire you have is a data scientist, and then you expect magic from him or from her, and it doesn't happen.
It's not efficient to do it like that. You don't start a data team without a proper data infrastructure, a proper data warehouse, to begin with. So there are two cases: either the guy suffers or the company suffers, because the guy cannot work efficiently. So for this part I'll go briefly over what the data infrastructure at honestbee looks like.

This is actually the infrastructure we have right now. It's quite simple, actually. At the lowest level we have a combination of EC2 spot instances, on-demand servers and reserved servers. Everything is running on Amazon Web Services. We have AWS Auto Scaling to deal with scaling up or scaling down the number of machines we need. On the next level we have Mesos, which is our cluster resource manager. It deals with the CPU, RAM and GPU that you need available. It's widely used by eBay, Apple and a lot of different companies in production right now. And then together with Mesos we have Marathon. Marathon is used a lot for long-running services; it's focused on making everything stable. And at the same time you have Spark, which we use to deal with highly distributed processes that run for like one or two minutes, but that we want to be fast. And then on the top layer, which is the application layer, we have Airflow schedulers, Airflow jobs, APIs, running services and ETL, everything running on a scheduled basis. Everything is set up by our data engineering team, who are standing there if you want to ask any questions.

So the setup we have really helps our team contribution process, really helps us to be more agile. At honestbee we have a shared Git repository between the data engineers, the data scientists and the BI people, where we have auto integration and auto deployment relying on GitHub, Travis, a Docker registry and a bit of Marathon as well.
So basically what it means is that we have data scientists, BI people, data engineers and data application programmers who write our own code and create a pull request on GitHub, and then Travis takes care of everything from unit testing to PEP 8 tests, making sure that the code looks fine. And then after it passes all the reviews and can be merged, we release it. Once it's released, Travis automatically deploys everything to production. What that means is that as an application developer, as a data scientist, I don't need to care about deployment. I don't need to care about scaling up. I don't need to care about, you know, "I need 20 machines right now and then a hundred machines later". Everything is done automatically, it was set up by the data engineers, and it's very important for you to be efficient.

So this is the technology stack that we use on our platform. We have Amazon Web Services: EC2, S3, RDS with Postgres, and Amazon Redshift. On the application layer we use Docker. We have Airflow as well. For those who are interested, Airflow is kind of a job scheduling and management system. If you have used cron before, and then you have like a hundred cron jobs running at the same time, and sometimes this one depends on the other one, and sometimes the other one depends on three different jobs, it's kind of hard to do with cron jobs. So Airflow helps you a lot with that, with an interface as well. For code review, testing and integration, we have GitHub and Travis CI. Then for resource management we have AWS Auto Scaling, Apache Mesos and Marathon as well. And then on the language level we have Python and SQL, mainly.

So again, why are we doing all this? Why do we even bother with these technologies?
Because I think it's important: it automates all the boring tasks and lets you focus on building what you actually want to build. And at the same time, we have a code base that's shared among everyone in the data team. It makes everything transparent. You know who contributes the most, who contributes a lot of features, who actually needs help. And it lets you just go in and say, "I want to review this pull request from this person", and that's it. So the whole thing improves our ownership of the code base and of the whole infrastructure, and most importantly it makes the team much more agile.

So that's the end of the talk. Feel free to ask about data science.

What do you do for those products?

It depends on what you actually want to do with them. Basically, if you want them to have more purchases, you're going to have to boost them up with product ranking.

All right, let's say tomorrow there's Meiji blueberry milk, and you click on blueberry milk, but no one has bought blueberry milk and something else yet. So if I look at blueberry milk, what is recommended to me?

If you only have item-to-item collaborative filtering, it will not recommend you anything. But there's a way for you to fill it, either with the best-selling items in the same category, or you can go to document-based recommendation, where you actually look at the similarity between the two items in terms of the name and the description.

Okay, makes sense. There's another one: how do you evaluate your recommendation engine offline? For example, maybe you could have five recommendation engines, but at any one point in time your traffic can only sustain testing two. So from those five, how do you evaluate and select only the top two?

To be honest, I don't really believe in offline evaluation for recommendation systems.
It's just something that I would rather... Sometimes I would rather roll it out online. It saves us a lot more time to make better recommendations, instead of having it offline and just testing it and playing around with the data until you feel like, okay, it's something that we can use. But of course, maybe it's because I've mostly worked in startups that are one or two years old, so offline evaluation might not work very well for young companies. But if you have companies that are five years old and already have some recommendation running, it might make sense.

That one I can give to my teammate.

Actually, normally we do the data team deployment once a day at around 5 p.m., and then we have one or two more hours to look for anything broken, and we fix it as soon as possible. But normally it should all be reviewed by 4 p.m. and ready to be deployed by 5.

Yeah, I'll answer the last one first. If you are not available there, the shopper or the delivery person will try to call you. If you're not available, they'll do the best they can: leave it there or ask your neighbors to collect it, because most of the time it's fresh items, and either they leave it there or you throw it away. So that's how it works. In terms of operations optimization, we have a team to do that. We have a different data scientist focused on that part. I don't really cover that part right now.

Yeah, every application that we build is on top of the data marts, and the data marts are built on top of the data warehouse. And then the Postgres database we use is more of the delivery or interface layer to other teams. Sometimes it is more convenient for them to connect to a database, so we give them Postgres access; or sometimes it's an API; sometimes it's a file on S3.

Yeah, we'll look at it in the future. We try to build something that works right now, because we don't have anything, right?
So we try to build something that works first, because when you talk about click data or impression data, it kind of explodes in terms of the data set. However many orders you have, you multiply it by a hundred or a thousand and that's how many clicks you have, which is harder in terms of engineering. But our next step, after deploying to production, will be looking at additional data, and then we'll just keep improving it along the way.

It doesn't help right now, because our company is only about one and a half to two years old, so the financial data is kind of the same over the last few years. So it kind of sticks to the date features. Every day you have, like, the STI index, and you just use it as a feature. You plug it into XGBoost and see whether the evaluation improves or not.

Okay, so Dat, you'll be around for a while? Yeah.

We came late, but are there any specific examples of, for example, a conversion increase due to...?

Yeah, so with this one we noticed that customers are less likely to buy items that are marked. At the same time, because they are less likely to buy those items, more items get delivered, so we have more revenue and profit, which I cannot disclose, but it's more like one digit.

Okay, so thank you all. Once again, we're at the end of the year. I don't think we're holding any more meetups; we're going to try. On top of that, I think we have some books from O'Reilly. We're trying to figure out the best way to give them out. Right now, one idea: for those of you who have really good attendance, who have been on time and stuff, we're thinking of giving them to that group, but it's still a very sizable number. So give us some time while we work out some of the logistics. So that's just some incentive for you guys to come on time and take attendance, right? Apart from that, as you guys can see, we are always looking for a bigger venue. So if anyone has suggestions, or your company wants to sponsor us for a bigger venue, that'd be great. If there are any speakers on any topics, please come look for one of us. We're always happy to hear new stuff. Apart from that, Merry Christmas. Please take your trash on the way out. If the trash doesn't belong to you, please take it out as well. Thank you. And lastly, the data team at honestbee is hiring. Thanks again.