Good morning everyone. I'm Pooja Tripathi, a senior PMT (product manager, technical) at Amazon, Alexa to be more specific, and I'm here to talk about ML data labeling as a product. Before we jump into what ML data labeling as a product means, I'd first like to discuss the impact of data labeling on ML model development.

As we know, data labeling means a human looks at data that will be fed to a model to train it. The more accurately the human annotates or labels that data, the better the model's accuracy will be, because the model is actually trained on that labeled data. The quality of the labeled data is therefore one of the prime factors determining the quality of your ML model, so it becomes imperative to improve the quality of data labeling when you want to improve model accuracy.

Now, when I think of data labeling as a product manager in this space, the first question that comes up is: who is the customer? The customers here are definitely the scientists, the modelers who are building models and sending you data to label. Your customers might also be finance partners, who are looking to you to lower the cost of data labeling while you maintain the same quality for the scientists. In my view those two are the primary customers of ML data labeling, although there are a lot of other stakeholders in this pipeline as well.

The second question is: what is data labeling as a product, and how should we think about it? As any product manager would, we start with the customer, and as we discussed, the first customer is the scientists sending us data for labeling. When we think about KPIs for labeled data, cost is one of the most important KPIs a product manager in this space controls.
We cannot increase the cost of labeled data beyond a certain point, because past that point it starts giving us limited ROI from the model. As models grow we actually want to train on more and more data, and if the cost of labeled data keeps increasing, we cannot afford more of it. The second KPI is the quality of labeled data, which, as I've discussed, is extremely important. So the challenge here is maintaining a very high quality of labeled data while reducing the cost, and that's the tension any product manager in this space goes through, me being one of them.

So how do I actually think about lowering the cost of labeled data? The way to think about it is: who labels the data? Humans. So the easiest idea is to send the data to countries where the cost of labor is cheaper and get it annotated there. Sounds easy, but it's very difficult to implement, because as you start on it there will be a lot of issues. First, there are language issues even if you send the data to a country that speaks the same language. For example, if you send data from the US for annotation in India, annotators in India might understand English, but there can be nativeness issues and accent issues, and you will find that the quality of the labeled data is actually lower. Second are the regulations themselves. There are a lot of regulations, GDPR being one of them, that don't allow us to transfer data across regions. Both of these will impede you from sending the data to a lower-cost country where the cost of labor is lower.

So if that is the limiting factor, how do we still reduce the cost of data labeling? There's another way: use active learning models to find out which data actually needs to go to the model to improve its performance.
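One common form of active learning is uncertainty sampling: label only the units the current model is least confident about. The sketch below is a minimal illustration; the toy probabilities and the 20 percent budget are my assumptions, not a description of any production labeling pipeline.

```python
import numpy as np

def select_for_labeling(probs, budget_frac=0.2):
    """Pick the units the current model is least sure about.

    `probs` are the model's positive-class probabilities on the
    unlabeled pool; only the most uncertain `budget_frac` share
    is queued for human annotators.
    """
    probs = np.asarray(probs, dtype=float)
    uncertainty = -np.abs(probs - 0.5)        # closer to 0.5 = less confident
    budget = max(1, int(budget_frac * len(probs)))
    return np.argsort(uncertainty)[-budget:]  # indices of the least confident

# Pool of 10 units: the model is already confident about most of them.
probs = [0.97, 0.02, 0.51, 0.99, 0.48, 0.01, 0.95, 0.03, 0.88, 0.10]
picked = select_for_labeling(probs)
print(sorted(int(i) for i in picked))  # only the two borderline units
```

Instead of labeling all 10 units, only the two borderline ones go to humans, which is exactly the 100-versus-20 saving described in the talk.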
Maybe previously we were training the model on 100 units of data, but of those 100 units, 80 are just repetitive: we have already trained the model on data like that before and seen that it performs well. Why send those 80 units for labeling again and retrain the model on them? Why not send only the 20 units that matter? That's where active learning models come in very handy, and it's an industry practice now: modelers and scientists across geographies first use active learning models to identify which subset of the data will give the maximum ROI if the new model is trained on it, and they only send that data for labeling.

There's a third approach as well: use existing ML models, and even if they cannot do the entire data labeling, have them partially label the data. That helps a lot in saving human labeling time. As an example, suppose we need to annotate each unit with two answers: first, was this utterance completed successfully, and second, what was the user's objective. If a model can predict the first part, that saves a lot of time for the human, who only has to answer the second question, and since it's an ML model, its quality will continue to get better and better.

So industry-wide, the best solutions to lower cost have been to use active learning or to use existing ML models for partial labeling. Both of these help you reduce the cost of data labeling, either by reducing the volume of data to be labeled or by partially labeling the data through ML.

That brings me to the second point: how to measure the quality of labeled data, and once you measure it, how to improve it. Measuring the quality of labeled data is not straightforward.
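The partial pre-labeling idea above can be sketched as follows. The field names and the stub model are my assumptions for illustration, not a real annotation schema or classifier.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LabelTask:
    utterance: str
    completed_successfully: Optional[bool] = None  # pre-filled by the model
    user_objective: Optional[str] = None           # left for the human

def pre_label(tasks, model_predict):
    """Let the model answer question 1, so humans only answer question 2."""
    for task in tasks:
        task.completed_successfully = model_predict(task.utterance)
    return tasks

# Stub standing in for a real "was this utterance completed successfully?"
# classifier; the rule here is purely illustrative.
stub_model = lambda u: not u.endswith("?")

tasks = pre_label([LabelTask("play some jazz"),
                   LabelTask("what did you say?")], stub_model)
# Humans now only fill in `user_objective` for each task.
```

The key design point is that the human workload per unit is roughly halved, and as the classifier improves, its pre-labels need less and less correction.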
The best way to do it is this: you have humans annotating or labeling the data; send the same data, or a subset of it, to SMEs (subject matter experts) to annotate as well, then see how many disagreements you find in that subset between the humans and the SMEs, and based on that, take a call on what the quality looks like. As an example, if a human annotated 100 units, we can take three of those units and send them to subject matter experts, then compare the human's labels against the SMEs'. The more agreement between the human and the SMEs, the better: in this case, if all three units labeled by the human match the SMEs', quality is 100 percent; if only two match, it's 67 percent (two out of three); if one matches, 33 percent; and if none match, 0 percent. This way we can measure the quality of the subset and extrapolate it to the entire sample.

It has one issue, though. What if the SME is not actually an expert? There's a chance the SME ended up labeling the data incorrectly. For such situations, we have to continuously audit the data labeled by the SMEs as well, to ensure there is no gap in the SME labels, so that we accurately measure the quality of the humans.

Once we are able to measure the quality of the data, the second effort is to improve that quality. To improve quality, the first lever is the human who is actually labeling the data for you: train that human more and more. How? Ideally the human is labeling the data based on some rules, so simplify those rules, make them as intuitive and as simple as possible, so that humans make fewer mistakes. Second, audit the quality of the human-labeled data on a daily basis and look for trend errors.
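The SME audit math described above can be sketched as a small agreement calculation. The label values are illustrative.

```python
def audit_quality(human_labels, sme_labels):
    """Fraction of audited units where the human agreed with the SME."""
    if not human_labels or len(human_labels) != len(sme_labels):
        raise ValueError("need equal-length, non-empty label lists")
    matches = sum(h == s for h, s in zip(human_labels, sme_labels))
    return matches / len(human_labels)

# Three of 100 units were re-labeled by SMEs; two of the three agreed,
# so we estimate roughly 67 percent quality for the whole batch.
print(round(audit_quality(["yes", "no", "yes"],
                          ["yes", "no", "no"]) * 100))  # 67
```

With only three audited units the estimate is very coarse; in practice a larger audit sample narrows the confidence interval on the extrapolated quality.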
So suppose I, Pooja, am labeling data, and every day somebody audits my quality and sees that I keep making mistakes on a similar type of data, or keep making a similar type of mistake. Then hyper-train me on that set of utterances, that set of data, to improve my quality. That's another very well proven strategy to improve the quality of labeled data.

And the third lever is to use ML models again. In this case, the ML models help you predict whether the quality of the data labeled by the first human is good or bad. If the ML model says the quality is likely not correct, based on multiple parameters, then route that data to another human or an SME for re-labeling. That way you can still control quality. You're not controlling quality at the root of it, because you're just correcting a defect in the system, but to the end user you're still delivering quality data.

So those are the three main levers used in the industry to actually improve labeled-data quality. That brings me to the end of this presentation. I'm happy to answer more questions if you have any, feel free to reach out, and with this I'm going to stop the presentation. Thank you.
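The model-based quality gate described above can be sketched as a simple routing step. The scorer and the 0.8 threshold are my assumptions for illustration, not a real quality model.

```python
def route_labels(units, quality_score, threshold=0.8):
    """Split labeled units into accepted vs. needs-relabeling queues."""
    accepted, relabel = [], []
    for unit in units:
        if quality_score(unit) >= threshold:
            accepted.append(unit)       # ships to the training set
        else:
            relabel.append(unit)        # goes back to another human or SME
    return accepted, relabel

# Stub quality model: here it just reads a precomputed confidence score;
# a real system would score the label against multiple parameters.
score = lambda unit: unit["quality_confidence"]

units = [{"id": 1, "quality_confidence": 0.95},
         {"id": 2, "quality_confidence": 0.55}]
accepted, relabel = route_labels(units, score)
# Unit 1 ships to the training set; unit 2 is re-queued for an SME.
```

As the talk notes, this gate corrects defects rather than preventing them, but it keeps low-quality labels out of what the end user (the model) actually trains on.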