Thank you very much for the introduction, Nilo. I'm going to give my presentation in English, because I really need to practice for PyCon. But as you can see, I speak French, so if you have questions at the end, don't hesitate to ask them in whichever language makes you most comfortable, as long as it's French or English. There we go. So, good evening, everyone. My name is Françoise Provencher. I'm a Data Science Technical Lead at Shopify, and today I'm going to talk about what we do. Yes, I said Shopify; we're not a music streaming company. We make commerce better for everyone. What does that mean? I think we're mostly known for our e-commerce platform, but you can use Shopify to sell online or in person, whether you have a brick-and-mortar retail shop or you just want to sell online.

If you already know Shopify, you might be thinking: Françoise, what are you doing at a Python meetup? I thought you were all Ruby on Rails. Well, you're right: Shopify is one of the biggest and oldest Ruby on Rails applications, and most of our engineering department uses it on a daily basis. Most of them, but not all of them, because in data science and engineering there are at least 150 people who use Python on a daily basis. Our whole data warehouse is powered by PySpark. At PyCon, I will share the stage with my colleague Chris Fournier, who's a data engineer at Shopify, and he will talk a bit more about that. The talks will probably be taped, I hope, so you can see them in a couple of weeks when they are released; today I'm going to talk more about the data science aspects.

And it's not only about the tools, it's also about the community. I'm personally one of the organizers of PyLadies Montreal. My colleague at Shopify, Catherine, is the organizer of PyLadies Toronto. The sound guy tonight works with me, and we also host the Ottawa Python meetup. Even the guy who organized PyCon US in Montreal works at Shopify. So the people in this organization really care about the community, and we're really present; that's why we're sponsoring PyCon this year as well. Also, the library that Tristan talked about, Lifelines, was created by one of our co-workers. We use it internally, but it's also available externally, and we have a couple of other libraries that are open sourced. People also contribute to Python libraries in their free time.

I'm going to give a quick overview of what we do on a daily basis as data scientists, because "data scientist" is a really broad term. One of the things we do is ETL: Extract, Transform, Load. It's interesting, because in some companies you have ETL engineers, and the data scientists only touch the data once it's cleaned and available, whereas we are also responsible for collecting and cleaning the data. In the ETL, we write the PySpark jobs that transform the data with the business rules, clean it, and put it in the front room of the database for everyone in the company to use. I really like this, because when I'm building my models I know exactly how the data was cleaned, and I'm also empowered to go look at the code, because everyone in our job knows PySpark. So when something is wonky, it's easy to deep dive and figure out what it is. As data scientists, we also do A/B experiments.
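To give a flavor of what such a job looks like, here is a minimal PySpark sketch of the idea; the paths, table, and business rules are hypothetical, not Shopify's actual pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Minimal sketch of an ETL-style cleaning job (hypothetical schema and paths).
spark = SparkSession.builder.appName("clean_orders").getOrCreate()

raw = spark.read.parquet("/data/raw/orders")

cleaned = (
    raw
    .filter(F.col("total_price") >= 0)                   # business rule: drop bad rows
    .withColumn("currency", F.upper(F.col("currency")))  # normalize a messy field
    .dropDuplicates(["order_id"])                        # one row per order
)

# Publish the cleaned table to the "front room" for everyone in the company.
cleaned.write.mode("overwrite").parquet("/data/front_room/orders")
```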
So that is something you do when you want to figure out: is feature A better than feature B? That can be done, say, before you release a new version of something, or when you're just really curious and want to know whether putting a little padlock icon makes people feel more secure during checkout. This is a screenshot of what it looks like; the whole back end is in Python. I'm only showing the conversion rates here, but we also have a survival analysis version, and that uses Lifelines.

We also have a team that is completely dedicated to building recommender systems. If you're a merchant, you need to find the best theme or the best apps, and we will recommend them to you. So we do personalization of the product via recommender systems. The data scientists are not just working for the business, giving insights to the business; we are also software developers, and we build those products. Another example of a data product would be product classification. We say internally that our data products really help us create more value for other data products. For instance, I have a t-shirt shop, so I just put it there. Let's say you're selling t-shirts: there's no way Shopify knows you're selling t-shirts, because you haven't told us, right? We have to infer it from the title, the description, and the image, and we do machine learning on that to build a dataset of all the products at Shopify and what kinds of products they are. Based on that, we can give better recommendations to the merchants themselves, whether that's a recommendation of where to market their products, for instance, or how to price them.

Today I'm going to talk mostly about exploratory data analysis, because that's a huge part of what I personally do these days. I'm now part of the systems research team, a mixed-methods team; I'm the only data scientist on it, the others are UX researchers, and I do a lot of exploratory data analysis.

But before I go on, I just want to check in with you: are you comfortable in your chair? Because I'm 5'2", and when I sit on a conference chair like this, usually my feet don't really touch the ground. It's super uncomfortable; after ten minutes I'm squirming around trying to get comfortable. A chair like this would fit most people, but not all people. Luckily, since I spend at least eight hours a day sitting on a chair, mine looks more like this. It comes in three different sizes; I have the small. My feet touch the ground, and there are something like twelve different adjustments you can make on a chair like that. It really takes into account that I'm a bit different from someone else. My point here is that one size does not fit all. If you want to create a great product, you have to remember that your users are all a little bit different from each other, and you need a mental model to think about them. So next time you buy software, do you want software that looks like this, or software that looks like this? My point is that there's a lot of value in segmenting a user base, and that's what I'm going to talk about for most of the talk. I'm just going to take a little sip, because it's a long talk.

So this is our user base. Let's segment it. We can maybe cut it in two: people with long hair on one side, people with short hair on the other. If you're a shampoo company, that might make sense.
But maybe it's something like this: people with glasses, people wearing hats. I'm not sure exactly what the best way to segment all those users is, and the point is that there isn't a single right answer, either. So I'm going to walk you through a methodology that multiple people have used at my company, and it has worked pretty well.

The purpose of our segmentation is to create a mental model for the product team to think about entrepreneurs, because our users at Shopify are entrepreneurs. The idea is to find two to five groups using a method called clustering. Why two to five? Because we want to create this mental model: if we have 150 groups, the meetings with the product managers will never end. You want something people can keep in their head at all times. Those groups also need to be big enough. What I mean is that we don't care that much about 20 people in a corner when we have half a million people over here; they must be big enough to make sense, but also small enough to bring some color to the way we want to understand our users. And the most important point is that they need to be interpretable: you can throw as many features and as many clustering algorithms as you want at your data, but the end goal is really this mental model, and if it's not interpretable, you miss the mark. We'll see that interpretation is actually a very important part of the process. Also, as in Tristan's talk before, the workflow is real, but the data is fake.

The tool I'll be using for this is called Jupyter Notebook. I really love it; if you've never used it before, check it out. It runs in your browser, and you can use Markdown to do some text editing and take notes, which is great if you're a scientist, because then you have all your methodology along with your code. You have your code, and when you execute it, the output is printed right below. So you have everything in one place, and it's called a notebook. Check it out.

Step zero, because in Python we start counting at zero: brainstorm features with your colleagues. That's super important, because garbage in, garbage out. And what are the features? Let's say I want to segment my user base. I'll talk, say, with a designer and ask them what's on their mind about the product, and maybe they'll say something like: I don't know if I should put this element up here or over there, or whether this thing should be drag-and-drop, or whether people are okay with writing code, or whether we should make it easier to add different variants on the product page, and things like that. Cool, you've talked to one designer. Maybe also talk to the product managers, because they will be your clients in the end, so they have pretty strong opinions too.
If you have UX researchers in your organization, they are a great source of information, especially because they have anecdotes from talking to a lot of merchants. They've done interviews, they've done surveys, they may already have a ton of CSV files that aren't even in your database that you can reuse, and they have a lot of empathy for the user. If you're in a small company and you don't necessarily have UX researchers, the people in customer support talk with end users the most; they have a really good idea of what is easy to do, what is not easy to do, and the different types of people they interact with. At this stage of a project, the risk is all the unknown unknowns, and talking with people is how you figure those out; then they all become knowns, basically.

Step one: get data from multiple sources. As I said, you might already have survey data floating around in a CSV file, and you might have a colleague who has already written an ETL job that dumps a very nice dataset in the front room of the database. But you might also have petabytes of data on HDFS that you want to exploit. Once you have your grocery list of features, you can go and pull them all together. Just import pandas. At Shopify we have something called Starscream, which is basically our ETL pipeline, and its read method lets me read a folder on HDFS and get it as a PySpark DataFrame. With pandas I can read from a CSV, which is where I got my survey data. We also have this very cool thing the data engineers made: a Python command you can pass a string of SQL, and it returns a pandas DataFrame, which is super useful. So you use the read method to get the PySpark DataFrame and do your heavy manipulation there. Say I'm grouping by shop ID: the data itself might be too big to fit in memory, but since it's in PySpark, my cluster takes care of it; then I aggregate, get the result back as a pandas DataFrame, and join everything together. A few years ago this wasn't possible where I work, but now we have all this cool tooling, so in a few lines of code I have all the data I want in one big dataset, in this case a pandas DataFrame.

Step two is basic cleaning. Removing rows: my survey was localized, and I just want to keep the rows of users who answered in English. Removing columns: a lot of questions have an "Other, please specify" field, and I don't necessarily want those for my clustering, so I remove them. And, most importantly, dealing with missing values. This is where you need to use your judgment. In my case I just replace missing values with "not answered", because that's what they mean in my survey; in your application it might be okay to replace all the NaNs with zero, for instance, or maybe it's not, and you should remove those rows instead. You have to know your data to know what's appropriate here, but in my case, I replace them with "not answered".
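Sketching the load-join-clean flow described above: Starscream's real API isn't shown in the talk, so `read` and `presto_query` below are stand-ins for the internal helpers, and the paths, SQL, and column names are made up.

```python
import pandas as pd
import pyspark.sql.functions as F

# Hypothetical stand-ins for the internal tooling described in the talk.
from starscream import read            # returns a PySpark DataFrame from HDFS
from data_tools import presto_query    # runs SQL, returns a pandas DataFrame

# Step 1: pull features from multiple sources.
survey = pd.read_csv("survey_results.csv")                  # survey data in a CSV
apps = presto_query("SELECT shop_id, app_count FROM apps")  # SQL to pandas

orders = read("/data/front_room/orders")                    # big data on HDFS
order_features = (
    orders.groupBy("shop_id")                               # heavy lifting on the cluster
          .agg(F.count("*").alias("order_count"))
          .toPandas()                                       # small aggregate fits in memory
)

# Join everything into one big pandas DataFrame.
df = survey.merge(apps, on="shop_id").merge(order_features, on="shop_id")

# Step 2: basic cleaning.
df = df[df["survey_language"] == "en"]                      # keep English answers
df = df.drop(columns=[c for c in df.columns if "other_specify" in c])
df = df.fillna("not answered")                              # judgment call, as noted above
```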
Step three: do more feature engineering, because so far you only have the features your database basically gave you, and a lot of the time a good feature is an interaction between two other features. For instance, I talked about variants and products. Let's say my product is a cardigan like this one, and the variants might be five different sizes and five different colors, versus someone who sells something else, maybe microphones, where there's only one of a kind, or maybe two because of different voltages or something like that. Maybe the needs of people who sell a lot of variants are different, because the product complexity is not the same. Maybe that's a hypothesis I have and I want this feature; it's pretty easy, you just add it. I find that a ratio, say of variants to products, would be an interesting feature to have. So keep that in mind and think about which ratios would be interesting features. Other ideas: bucketing, especially if you have categorical data you can put into buckets; differences, especially with time, a bit like what Tristan said before: maybe you're not that interested in the dates when people started and stopped doing something, but in the duration between those two events; and logarithms, because sometimes things just scale logarithmically. The perception of sound, for instance, is logarithmic, so if you have raw acoustic pressure, you might want to take its logarithm instead.

Cool. Now you have all kinds of numbers in there, and some of those numbers might be really big while others are really small, so if you want your clustering algorithm to work, you need to put everything in a nice range. One of the first things I like to do with categorical data, say "do you have a retail location?" with maybe three different answers, is one-hot encoding, also called dummy variables. It depends on which community you're in: my colleagues call it one-hot encoding, but in pandas it's actually called get_dummies. It creates three different columns of ones and zeros, depending on whether the user has that attribute. That's pretty cool, because now I can feed this to an algorithm, since it's numbers, whereas in the original form it would not have worked. Then, if you have numbers that are not between 0 and 1, for instance the number of apps a store installed, with values between 5 and 15, you want to scale them to be between 0 and 1. scikit-learn has a lot of different scalers; I like MinMaxScaler, because I like things to be between 0 and 1, but you might prefer another one depending on your application.

Now, dimensionality reduction. If you've ever done machine learning, you've heard of the curse of dimensionality. I recently did some clustering on my survey data with something like 100 different dimensions, and in high dimensions your Euclidean distances start to lose meaning, because in 100 dimensions everything looks the same. So you really have to reduce to fewer dimensions. Something people do very often is PCA, principal component analysis. Here I've plotted the number of components against cumulative explained variance, which is what people usually look at.
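A hedged sketch of these preprocessing steps, using the pandas and scikit-learn tools named above; the column names are invented for illustration, and the date columns are assumed to already be datetimes.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA

# Step 3: feature engineering (hypothetical column names).
df["variants_per_product"] = df["variant_count"] / df["product_count"]         # ratio
df["log_gmv"] = np.log1p(df["gmv"])                                            # logarithm
df["days_active"] = (df["last_order_date"] - df["first_order_date"]).dt.days  # difference

# One-hot encoding (pandas calls it get_dummies): one 0/1 column per answer.
df = pd.get_dummies(df, columns=["retail_location"])

# Scale numeric features into [0, 1] so no single feature dominates distances.
numeric_cols = ["variants_per_product", "log_gmv", "days_active", "app_count"]
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])

# Dimensionality reduction: look at cumulative explained variance per component.
features = df.select_dtypes(include=["number", "bool"]).astype(float)
pca = PCA().fit(features)
print(np.cumsum(pca.explained_variance_ratio_)[:10])  # first 10 components
```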
Basically, what it shows is that you don't necessarily need to keep all of your features: you get most of the variance with far fewer components. But I also like to dig into this a little and see whether, with just two components, just two dimensions, I can get away with a pretty nice clustering. And then we'll use k-means on this.

If you've done the classic Coursera-style course, the very classic example is the iris dataset; that's maybe what you have in mind when you think "let's cluster something". It looks like this: you have three classes that are pretty well separated from one another. Maybe not the two in the middle, but there's one that's pretty different, and you can apply a couple of different ways of measuring how good your clustering is, such as the silhouette score, the Calinski-Harabasz score, or the within-cluster sum of squares, and it will all look good; you'll always get the nice elbow or the nice peak. But what I've seen in the wild looks more like this. People are on a really interesting spectrum of behaviors, of tastes, of personalities. You don't get stereotypes like "big heavy metal guy"; it doesn't work like that with people. You don't necessarily have big, separate clusters; what you have is a spectrum, this whole pie, and your goal is just to slice it in the way that makes the most sense. That doesn't mean there's one correct way to slice the pie. That's why I put this quote here: the goal is that the objects in a group should be similar to one another and different from the objects in other groups, but in many applications, clusters are not well separated from one another.

If you've never done machine learning and you thought the people doing it were super geniuses: they just know these three lines. The hard part about machine learning is everything I said before, preparing the data and doing the feature engineering, and how you're going to interpret it afterwards, all the knowledge around it. The implementation in Python, if you're doing simple stuff like k-means, is really just that.

Now I have my clusters, and it looks like this: my one-hot-encoded features (I only kept three, otherwise it wouldn't fit here) and my cluster number. The number is pretty meaningless; it's just one of the colors we saw before, and that makes interpretation a really important part. The way I found works really well for us is something my colleague calls indexing. What we mean by that: take your clustered data, group it by cluster, and take the mean; then subtract the overall mean and normalize. Everything at zero means "this is normal for my group". Closer to one means your group is over-indexing: doing this thing more than the others. At minus one, your group is doing this thing less than the other groups. Depending on your application, you can use the median instead; for me the mean works really well, because all my data is between zero and one and I don't have a lot of outliers. The idea is to pick your baseline, whether that's the median or the mean, and center everyone there. You don't even have to squeeze everything between minus one and one, but I like to, because it's easier for me to interpret. That's the main idea. Also, something very, very cool that I learned recently: you can apply styles to your DataFrames, which makes my life much easier. Here I'm just saying that if the value is smaller than -0.5, it will be red, and if it's larger than 0.5, it will be green. That just makes my life a bit easier when I'm trying to interpret my data.
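Put together, the "three lines" of k-means plus the indexing and styling tricks might look like the sketch below; `df` is the scaled feature table from before, four clusters is an illustrative choice, and the exact normalization is an assumption, since the talk doesn't pin down the formula.

```python
from sklearn.cluster import KMeans

# The famous "three lines": fit k-means and assign a cluster to each user.
kmeans = KMeans(n_clusters=4, random_state=42)
df["cluster"] = kmeans.fit_predict(df)

# "Indexing": compare each cluster's mean to the overall baseline (the mean
# here; the median is another reasonable baseline, as mentioned above).
baseline = df.drop(columns="cluster").mean()
deviation = df.groupby("cluster").mean() - baseline

# One way to land in [-1, 1]: scale each feature by its largest deviation.
# This normalization is an assumption, not the talk's exact formula.
index = deviation / deviation.abs().max()

# Style the table for reading at a glance: red below -0.5, green above 0.5.
def highlight(value):
    if value < -0.5:
        return "color: red"
    if value > 0.5:
        return "color: green"
    return ""

index.style.applymap(highlight)  # renders as a colored table in a Jupyter notebook
```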
When I do it, I usually have maybe a hundred different features, and it looks like this. This is really cool, because now I can see at a glance that, say, cluster one is a cluster of merchants for whom this is a full-time business: they have a physical store, which means they're retailers; they had business experience before; and they have at least one employee. To me this paints a picture; I have this image in my head of my neighborhood retailer with their physical store. If I look at cluster zero, their business is not full time, so they're part-timers; they don't have a physical store; they do have an online store, so they're online only; they use social media channels; they don't really have prior business experience; and they don't have an employee. That looks more like someone with a hobby that they want to turn into profit with an online store, something like that. So I'll look at something like this, take notes, and try to draw a picture around it. If it totally makes sense, I go back and talk about it with my co-workers; if it totally doesn't make sense, I go back and rethink my features.

And step nine... oh, did I skip a step? Darn. Tell people. When you're in a role where you're doing research, you could do all that research and never tell anyone, but then nothing will change, and that's not how you have impact. You can't just write on your CV "I've done ten clustering projects" if you've never presented them to anyone and they never changed the way the business works. As a data scientist, you really need to get out of your introverted self, get out there, talk to people, and really make a change in your organization, or whatever project you're working on, by communicating that data. Sometimes communicating that data means showing a pie chart. I hope you're not using pie charts, but if that works for influencing people, do it. In our case (I've blurred the sensitive parts) we can make a kind of persona card like this. Next time I talk with a product manager, I can say: this product will impact brick-and-mortar businesses, and this is what we mean by brick-and-mortar businesses: people whose main focus is their in-store experience, who aren't really worrying about their online store; what they need to succeed; what they struggle with; what their goals are; and some key metrics at the top, which we've blurred here. That's just one way to do it, but so far it works pretty well. Also, if you have a channel in your company for giving talks, maybe a Friday with a beer, get in front of everyone and present this kind of thing, or just book a meeting with the appropriate people. You need to get the word out.
In conclusion, doing this kind of segmentation uses a great deal of domain knowledge, which is why we talk to everyone in the first place, and also some pretty darn cool data science skills. But there's also something I maybe haven't said explicitly so far, though you've probably sensed it in all the cool tools I use: I'm empowered to do this kind of work at this scale of data because I work with a team of data engineers. For instance, my IPython/Jupyter notebook runs on a cluster that is maintained by someone else, which is amazing, because in my previous job I maintained my own cluster, and that took a lot of time when things went wrong. I really like having a team and people to lean on to take care of those kinds of challenges, and they also build awesome tools, like the Presto query helper I can use to query the database. So that's it for me, thank you. I can take questions in French or in English.

Q: I'm really surprised. I don't know if you want to talk about this, but you reduced your whole dimensionality down to two, and that only gave you something like 10% of the explained variance in two dimensions?

A: Yes, you're talking about the PCA. Are you familiar with PCA? Yes? So, what I do, and I didn't necessarily have time to go through the whole process here, is use things like the silhouette score to see whether there's structure in that number of dimensions that I can also see with my eyes, and I'll try different numbers of PCA components, or amounts of explained variance if you prefer, and see whether the interpretation changes. Sometimes the interpretation doesn't change, so it really depends. What I've noticed is that when I haven't kept enough components, you can start getting very bizarre clusters. Sometimes it's things like people who use Twitter, which for the merchants we have isn't really relevant, so maybe I have a problem with my features in the first place, maybe that feature shouldn't be there, or maybe I have too many features. I actually like rejecting some of them, because it acts as noise reduction. As I said, in this kind of work the interpretation is really my metric, because the other typical metrics aren't necessarily conclusive on their own.
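A small sketch of that kind of check: sweep the number of PCA components, cluster, and look at the silhouette score alongside your own reading of the clusters. The feature matrix `X` and the choice of four clusters are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# Try different numbers of PCA components and see whether structure holds up.
# X is the scaled feature matrix from earlier (assumed); k=4 is illustrative.
for n_components in (2, 5, 10, 20):
    reduced = PCA(n_components=n_components).fit_transform(X)
    labels = KMeans(n_clusters=4, random_state=42).fit_predict(reduced)
    score = silhouette_score(reduced, labels)
    print(f"{n_components} components: silhouette = {score:.3f}")

# The score is a hint, not the verdict: the talk's point is that the
# interpretation of the resulting clusters is the metric that matters.
```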
Q: I see that after you have your clusters, you re-center and normalize. Doesn't that hide the variance between the different clusters? If you squeeze everything between minus one and one, and one cluster has more variance than the others, isn't that information lost?

A: Yes, it depends on your application, and there are different kinds of scalers. K-means assumes that the variance is the same for all your clusters, so it makes sense to rescale that way. But as I said, with this kind of thing there isn't only one right way to do it. The point is to have an informed way of building segments, rather than just assuming, say, "retailers" versus "online people"; when we use this kind of procedure, we find things that are more nuanced and richer to act on. In fact, the persona cards are also used to recruit people for interviews, to make sure we get a mix. I have a feeling that people who do a lot of machine learning would say these aren't really clusters, but the point is to get to a place where you can group people in a way that makes sense, and in practice it works.

Q: Could you train a classifier on top of this and use it to make decisions on other data?

A: Yes, exactly. Survey data is difficult or expensive to obtain, so you end up with a small dataset of very rich data, and you can use other signals from the database to predict. For example, we recently had a survey where people self-labeled their business model. Now I have a dataset I can use to cluster and then to predict; but in that case I have labels, so I won't just use clustering, which is an unsupervised method. I've also used other machine learning methods.
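As a sketch of that pattern: label a small, rich survey sample (by self-report or by clustering), then train a supervised model on signals that exist for everyone, and extend the labels to the whole base. All names here are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier

# survey_df: small, rich survey sample with a label per respondent
# (e.g. a self-reported business model, or a cluster assignment).
# signal_cols: features that exist for *every* shop in the database.
signal_cols = ["order_count", "app_count", "has_retail_location"]  # hypothetical

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(survey_df[signal_cols], survey_df["segment"])

# Extend the segmentation to the full user base, where no survey answers exist.
all_shops["predicted_segment"] = clf.predict(all_shops[signal_cols])
```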
Q: How big is the dataset that you run this clustering on?

A: For the user base, we have more than 600,000 users, so if we were to run clustering on everyone, that would be a lot of rows. This works in memory; for things like recommender systems, I wouldn't do it in a notebook, just in memory with pandas. There are machine learning libraries that work well at that scale, and I have a lot of colleagues who work on that. Any more questions? Yes, one last one.

Q: This is exploratory analysis. I have a question about step zero: do you have an example that took you from step zero all the way to production without looping back? How do you actually go from exploratory analysis, in a Jupyter notebook that's very practical for doing your analysis, to a production pipeline that runs every day?

A: That's a very good question, an excellent question. Do I have a good example of that... When I talked about features, I maybe should have been more explicit that I meant features that are fed as input to my machine learning model; I wasn't talking about features in the software sense. But it's still a super good question. In this particular case, I'd say no, because we've mostly used this work for things like recruiting people for surveys. We also make hardware, and when we want people to test it, we want to be sure it's not just the same kind of people who get the hardware in their hands; we want it to be a bit more balanced, so we'd use this for that, for example. But I've done other projects, not necessarily clustering people like this, that are in production, and I'd say it's quite a job. There's a big delta of energy between something that works on a sample of data in pandas and something that runs every day and never fails, because when you put something in production you have to pay attention to a lot of other things. Your model can start making bad predictions, so you have to be able to monitor it, and at that point it takes a whole team; if I go on vacation, someone else has to be able to take over. So we do have machine learning pipelines in production, but it's really more work than what I showed here; this was more about how to get to a prototype. The goal of what I do is to put it in production too, and in fact colleagues of mine have shipped this kind of thing: I built a prototype, we put it in production, and it still took a good while, because we had to build all the monitoring around it. That's super important; you don't want your algorithm to start doing silly things in production. But now that it's done, all the ones we build after it will be simpler. If there are no other questions: thank you very much!