All right. So: bar insights from cleaning and labeling a cocktail data set. I've been working on a personal project, and I thought that in terms of being able to learn and do data science properly, we would want to have the data... Sorry about that, I'll just start again.

I have a personal project, and it's basically to take a cocktail and determine what season of the year is the best season for it to be drunk. I'm supposed to do this with some screens... I'll just move on. The idea, as I said, is to clean the data and label the data. But the focus of what I want to do in this presentation is to highlight the use of the various areas of data science: domain expertise, math, and computer science.

For anybody who has drunk a cocktail, has seen a cocktail list, or has wanted to prepare a cocktail, there are some basic things that are important to know about them. The first thing is the ingredient list: it tells you what the ingredients are and how much of each you need. Most recipes give you the method. A recipe will tell you what glass to use, maybe, and the size of it, and potentially the data sets themselves give you a category: something like shots, or classics, or coolers, for example.

Here we have a representation of what a data set might look like. As we can see, it's in its raw form here, so it's very much like the list of ingredients that we would see in a recipe: you have a measurement and the ingredient associated with it. But for data science, and for being able to use this in tables, this isn't a very useful format.
For the ingredients themselves, you would want to be able to say, give me a column, and have the associated ingredient under that column, rather than have two separate columns (measurement and ingredient) representing the same thing.

Another thing to take note of when looking at this is that the ingredients themselves aren't standardized. Some ingredients are, for example, a brand: Hennessy, Jägermeister. Other ingredients here are just generic things: coffee-flavored brandy, apple brandy. So we have to find a way to take those very different representations and convert them into that standardized column. You want your columns, for example, to be things like rum, holding the volume of that liquid in the cocktail that you're making.

As with any data set, each one has its own idiosyncrasies. In the one I worked with here, a single cell in the ingredient column would sometimes contain multiple ingredients.
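
As a rough illustration of the standardization described above, here is a minimal Python sketch. The rows, the `STANDARD_NAME` mapping, and the helper names `parse_ounces` and `to_wide` are all made up for this example and are not from the project. The sketch also assumes every measurement looks like `<number> oz`, which the real data set does not guarantee.

```python
from collections import defaultdict

# Hypothetical raw rows (cocktail, measurement, ingredient), mimicking the
# "measurement next to ingredient" layout of a raw recipe list.
RAW_ROWS = [
    ("Drink A", "1 oz", "Hennessy"),
    ("Drink A", "0.5 oz", "Coffee-flavored brandy"),
    ("Drink B", "1.5 oz", "Apple brandy"),
    ("Drink B", "1 oz", "Light rum"),
]

# Assumed mapping from brand or specific names to a standardized column.
# A real mapping would be built and reviewed with domain expertise.
STANDARD_NAME = {
    "hennessy": "brandy",
    "coffee-flavored brandy": "brandy",
    "apple brandy": "brandy",
    "light rum": "rum",
}


def parse_ounces(measure):
    """Turn a string such as '1.5 oz' into a float; other units are rejected."""
    amount, unit = measure.split()
    if unit != "oz":
        raise ValueError(f"unsupported unit: {unit!r}")
    return float(amount)


def to_wide(rows):
    """Build one 'column' per standardized ingredient, holding its volume."""
    wide = defaultdict(lambda: defaultdict(float))
    for cocktail, measure, ingredient in rows:
        key = ingredient.strip().lower()
        # Unknown names fall through unchanged so they can be reviewed later.
        column = STANDARD_NAME.get(key, key)
        wide[cocktail][column] += parse_ounces(measure)
    return {name: dict(cols) for name, cols in wide.items()}


WIDE = to_wide(RAW_ROWS)
```

Here `WIDE["Drink A"]` ends up with a single `brandy` entry that sums both brandy-like ingredients. Whether a flavored brandy really belongs in the same column as a plain one is exactly the kind of decision that needs a domain expert.
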
For example, a row would say we want a measurement of one ounce, and the ingredient would be light rum and dark rum, but in the same cell, and you need to be able to handle that properly when mapping to the columns with the right amount of volume. In this particular data set as well, some of the data was placed incorrectly: they actually put the measurement in the ingredient column and vice versa, which was just annoying.

In terms of taking the data in the raw form that it was in and converting it into a more amenable format, computer science, of course, is a skill that you need to have. The actual coding to get the data from the format we started with to the intended format is a given. But it's also important to have another type of expertise, and that is domain expertise. You could know how to convert from point A to point B, but you don't even know what point B is without that domain expertise.

Let me give you some examples of what that might entail. As we saw with the brands and the types of spirits, which we wanted standardized, we needed to normalize and categorize them. Categorization in particular was important for things like liqueurs. If anybody knows liqueurs, they're flavored alcohols: things like Cointreau or Disaronno have both flavor and alcohol. One of the things I absolutely needed to do you can see here.
There's citrus juice and citrus spirit, and I needed to be able to distinguish between the two, because citrus has a flavor profile within cocktails; it plays a specific role. If we want to know what's a good time of year to drink a specific cocktail, how citrus plays a role in that would be a factor. So splitting it up into juice and spirit is important there, and domain expertise is critical for something like that.

Another aspect where you need domain expertise is data validation. Some of the recipes within this data set called for something like one ounce of Angostura bitters. If anybody knows bitters and spirits, you tend to put just a dash or a shake of it, so there's no way you'd put a whole ounce of bitters in a cocktail. Knowing that was important for making the appropriate adjustment so that volumes, etc., made sense.

In addition to taking the raw data and moving it into the columns that we saw, the specific spirits, alcoholic or not, it's important to make feature vectors. Now, what are feature vectors? Feature vectors are, plainly, some representation of, or some properties about, the data that you've extracted into a column by itself. In this example here we have the percentage of alcohol within the volume of the cocktail, and we have whether bubbles are present in it, so things like champagne, sparkling wine, or soda would contribute to that. Then citrus, the other flavors, wine, herbs. It was important to identify which attributes of the drink would play into being able to label the data, as factors. When we extract our features, the labeling is what you're targeting, so you are identifying
initial features in the data that you think would be relevant to the associated output, or labels. And of course we could use our computer science knowledge to meld and mesh these columns with the primary data that we have, the ingredients, essentially.

Once we take our time and establish what aspects of the cocktails are useful, we could create some heuristics and say, you know, if I have a tall drink with very low alcohol, it might be something that's good to drink in summer. Similarly, something short, a bit more alcoholic, probably a bit sweet or heavy, you may want to have in winter. So you could create some rules around how to get your label.

That's what I did here, and the data that we see here has that: to the far right are the labels based on the heuristics alone. That's important because you, as the domain expert, would come in and say, well, this is the thing that makes sense. You know your data. You understand what you want to relate. When you take your time to build that validation around the heuristics, you are actually establishing ground truth. It's so important to establish that ground truth, because anything after that is based on what the data tells you, and that's important, but you need to make sense of what is there.

As we can see, with the heuristics you get some of the data labeled; you most likely won't get all of the data labeled. So you may want to employ some techniques to help automate labeling in accordance with the heuristics that you've defined. Here's where we turn to machine learning. We could cluster, that is, group similar data points together. Similar cocktails, Collins and coolers for example, have similar properties.
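
The feature extraction and the heuristic rules described above could look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the recipes, the `ASSUMED_ABV` table, the thresholds (6 oz, 4 oz, 10%, 20%), and the function names `features` and `heuristic_label`. The talk does not give the actual rules, and the "sweet or heavy" part of the winter heuristic is left out to keep the example short.

```python
# Assumed alcohol-by-volume per standardized ingredient (illustrative values).
# Ingredients missing from the table are treated as non-alcoholic.
ASSUMED_ABV = {"rum": 0.40, "brandy": 0.40, "cream_liqueur": 0.17}

# Ingredients assumed to contribute bubbles.
BUBBLY = {"soda", "sparkling_wine"}

# Hypothetical recipes in the standardized form: ingredient -> ounces.
RECIPES = {
    "Tall rum soda": {"rum": 1.0, "soda": 6.0, "lime_juice": 1.0},
    "Short brandy cream": {"brandy": 2.0, "cream_liqueur": 1.0},
    "In-between": {"rum": 1.5, "lime_juice": 1.0, "soda": 2.5},
}


def features(recipe):
    """Extract a small feature vector from one recipe."""
    total = sum(recipe.values())
    alcohol = sum(vol * ASSUMED_ABV.get(name, 0.0) for name, vol in recipe.items())
    bubbles = sum(vol for name, vol in recipe.items() if name in BUBBLY)
    return {
        "total_oz": total,            # how tall or short the drink is
        "alcohol": alcohol / total,   # fraction of alcohol in the drink
        "bubbles": bubbles / total,   # fraction of bubbly ingredients
    }


def heuristic_label(f):
    """Return 'summer', 'winter', or None when no rule applies."""
    if f["total_oz"] >= 6.0 and f["alcohol"] <= 0.10:
        return "summer"   # tall drink with very low alcohol
    if f["total_oz"] <= 4.0 and f["alcohol"] >= 0.20:
        return "winter"   # short drink, a bit more alcoholic
    return None           # left unlabeled on purpose


LABELS = {name: heuristic_label(features(r)) for name, r in RECIPES.items()}
```

The point of returning `None` is that the rules only cover the cases the domain expert is sure about, which is why some of the data stays unlabeled after this step.
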
So you may put those in the same group, or at least you would hope that your clustering algorithm would put them in the same group. Once you do that, you could establish: these are the ones the clustering puts in this group, and that group has a label. The cocktails themselves would have similar properties, and so could also fall into the appropriate label groups. So you could fill in the blanks, essentially, or employ techniques to fill in the blanks, to label the rest of them. Things like Bermuda Rose, in this case, would have benefited from that technique.

One of the things that I wanted you all to get out of this talk is that, even in this simple aspect of labeling the data and cleaning the data, domain expertise is absolutely key. You can't make appropriate or useful decisions about data validity without it, and even in the process of extracting true value from the data, it's important to check against the domain expertise. Of course the machine learning knowledge is useful, but again, it's somewhat secondary to that domain expertise. So if you aren't the domain expert yourself, get one.

Validate early and often. Refinement is required. Take your features: if you think you have a feature and you think it works, you may still need to change it, or find some other features that are more appropriate to the data set or even to the models that you're employing. The same goes for the labels. One of the things that I strongly contemplated with the project is whether the labels are right. I mean, you could have a cocktail for breakfast, lunch, or dinner. You could have cocktails with different courses in a meal, right?
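
Going back to the fill-in-the-blanks idea for a moment, here is one way it might be sketched in Python. This is not the project's code: the tiny k-means below, its evenly spaced starting centroids, the majority-vote rule, and the sample points are all assumptions chosen to keep the example self-contained. In practice a library clustering implementation would be a more sensible choice.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Very small k-means; returns the cluster index of every point."""
    n = len(points)
    # Simplification: start from evenly spaced points instead of random ones.
    centroids = [points[i * n // k] for i in range(k)]
    assignment = [0] * n
    for _ in range(iterations):
        assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment


def fill_labels(labels, assignment):
    """Give unlabeled points the majority label of their cluster, if any."""
    filled = list(labels)
    for c in set(assignment):
        known = [l for l, a in zip(labels, assignment) if a == c and l is not None]
        if not known:
            continue  # no ground truth in this cluster, so leave it alone
        majority = Counter(known).most_common(1)[0][0]
        for i, a in enumerate(assignment):
            if a == c and filled[i] is None:
                filled[i] = majority
    return filled


# Hypothetical (alcohol fraction, bubbles fraction) features and heuristic labels.
POINTS = [(0.05, 0.75), (0.07, 0.70), (0.32, 0.0), (0.30, 0.05)]
HEURISTIC = ["summer", None, "winter", None]

ASSIGNMENT = kmeans(POINTS, k=2)
FILLED = fill_labels(HEURISTIC, ASSIGNMENT)
```

Clusters with no heuristic label at all are left unlabeled rather than guessed, so the ground truth from the heuristics stays the only source of labels.
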
So there are different ways to break down how you might decide to label, or associate the input data of the cocktails to the output.

And finally, just to say that the tools that you have can only take you so far. It's important to use the insights from the data in a sensible way. While my project is meant to predict what time of year a particular cocktail is nice to be drunk, because I like to make cocktails myself, I can add to that list. Another factor that was interesting for me is being able to plan my ingredient list, right? Of course, I'll have the cocktails and I'll know the ingredients, and therefore I can have some sort of inventory planner. With that as the goal in mind, you could plan how you put the different tooling in place to take advantage of the insights. So while it may not even be a model, I could get value from it, and that's an important aspect for your data projects as well.

I hope that this talk gave you some insights into what you might need to do when starting your own data projects: cleaning the data, labeling the data. But of course, that's the boring part. You want to start to make models and stuff. And maybe now a question session. Thank you very much.