Hello everyone, welcome to the Metrics for Innovative AI Products webinar. My name is Reza Madad, and as an AI product lead, I have had the opportunity to apply various metrics across the lifecycle of several AI products. In this webinar, I will share my experience with metrics in creating, assessing, and enhancing AI products. We will start with an overview of why metrics matter and how they serve as a quantitative measure of AI performance and its impact on business outcomes. We will then navigate the world of AI products, classifying them based on their underlying technology and use cases, explore key AI-related metrics for each category, and discuss how to select and define these metrics. Throughout this webinar, we will discuss the balance between technical and business metrics, explore the role of the human in the loop in AI products, and see how metrics inform crucial decisions, particularly in the early stages of AI product development. So without further ado, let's dive in and explore how we can leverage metrics to drive our AI products to success. Why are metrics important? Metrics allow us to objectively measure the performance of AI models, letting us know whether they are functioning as intended and helping us identify areas for improvement. Metrics connect the AI model's performance to business objectives: they help us understand how improvements in a model's accuracy or speed can lead to enhanced customer satisfaction or increased revenue. Based on metrics, we as product leads decide to launch or sometimes sunset a product. Metrics provide a way to track the progress of AI products over time, which is crucial for monitoring improvements or recognizing when a model's performance might be declining. Metrics also provide a standard way to compare different models or algorithms, helping us decide which one is best suited for a particular application or task.
As product managers, we work with science and engineering teams, and there will always be prioritization questions about which model or algorithm to pick. Metrics are really helpful in those cases. Now, we have learned about the importance of metrics, but there are many ways to categorize AI products. For the purpose of this webinar, and specifically for the topic of metrics, I categorize AI products into text-, audio-, and visual-based AI products. This categorization is also helpful when you are focusing on the type of data the AI model needs to be trained on or deal with. Let's go through the first one, text-based AI products. These products primarily interact with textual data, using technologies such as natural language processing to understand, generate, and extract meaning from human language. Examples include chatbots for customer support, sentiment analysis tools for social media monitoring, text generation software for creating written content, and automated translators for written text. Audio-based AI products are those designed to understand and interact with audio data, especially human speech. They utilize technologies like automatic speech recognition (ASR) and text-to-speech (TTS). Examples include voice assistants like Amazon's Alexa, Google Assistant, or Apple's Siri; transcription services that convert speech to text; and audio analytics tools for analyzing call center data. Visual-based AI products work with images and videos, whether static or dynamic metadata of images and videos. They leverage computer vision technology, which includes image recognition, object detection, and segmentation.
Examples of these products are facial recognition systems for security, autonomous vehicles that use real-time video data for navigation, image and video editing tools in media and entertainment, and medical diagnostic tools that analyze medical images or videos. Again, we can categorize AI products in many ways, but I personally find this a great way to categorize them in terms of metrics, training data, and applications. So let's go through the metrics for each of these AI product categories. Let's start with the typical metrics: accuracy, precision, recall, and F1 score. I doubt anyone can say they have been an AI product lead working with a science team and never used these metrics; I believe every one of you has worked with them and is familiar with them. These are fundamental metrics for assessing the performance of AI products, and they help us evaluate a model's ability to correctly classify or predict outcomes. They are especially important for tasks like sentiment analysis, classification, or information extraction. But in addition to these typical metrics, I believe we should use more use-case-specific metrics for our products. That's why we need to go through the three categories one by one. Let's start with text-based AI products. These are the key metrics I think we can use. Word error rate (WER) is used in text transcription or text generation to measure the rate of errors in the output text compared to the original. The BLEU (bilingual evaluation understudy) score is used for tasks like machine translation or text generation, where we want to compare the AI-generated text to a human-generated reference text. Another popular translation metric is HTER, human-targeted translation error rate.
This metric serves as a proxy for a human translator's effort to post-edit the machine translation. Another metric is ROUGE, Recall-Oriented Understudy for Gisting Evaluation, an evaluation metric for text summarization and machine translation. We also have METEOR, the Metric for Evaluation of Translation with Explicit ORdering, which evaluates machine translation using both precision and recall. One of my favorites for machine translation is BERTScore. It focuses on the similarity between reference and generated text using contextual embeddings, and is used for generative AI models that try to create text that is meaningful and close to the reference corpus. The last one on this list is prosodic alignment (PA): the accuracy of the placement of phrase boundaries, considering the prosodic structure of the original input. Now let's move to key metrics for audio-based products. We have word error rate here as well; for voice recognition systems it measures the rate of errors made in transcribing speech to text. We have speaker identification and labeling error, which measures the system's ability to correctly verify a speaker's identity. MUSHRA is a methodology for conducting a test to evaluate the perceived quality of machine-generated audio or voice. Mean opinion score (MOS) is used for any system generating or modifying audio; it is a subjective quality metric often used in telecommunications to assess audio quality. Word timing error is important: this metric measures the deviation of the predicted start and end times for words or phonemes in an ASR system from the original audio file. And the last one for audio-based products is verbosity and speaking rate. Verbosity measures the difference in duration between the synthetic audio and the original audio, which I would say is a very useful metric in auto-dubbing while maintaining an appropriate speaking rate.
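To make word error rate concrete, here is a minimal, illustrative Python sketch that computes WER as the word-level edit distance (substitutions + insertions + deletions) divided by the number of reference words. The example sentences are made up, and real pipelines would typically use established tooling such as jiwer or SCTK rather than hand-rolled code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via Levenshtein distance over word tokens.

    Illustrative sketch only; assumes a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution

    return dp[len(ref)][len(hyp)] / len(ref)


# One dropped word out of six reference words -> WER of 1/6
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

The same function applies whether the hypothesis comes from a transcription system or a text generation model, which is why WER shows up in both the text-based and audio-based lists.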
How about key metrics for visual-based products? We have active speaker detection, which measures the accuracy of detecting the active speaker in a scene for image or video manipulation. Intersection over union (IoU) is used for object detection tasks and measures the overlap between the predicted and ground truth bounding boxes. Mean average precision (mAP) is also used in object detection and segmentation; it provides a combined measure of the model's precision and recall across different thresholds. The structural similarity index (SSIM) is a perception-based metric often used in tasks such as image quality, compression, and restoration assessment. And last but not least is peak signal-to-noise ratio (PSNR), which is widely used for measuring the quality of reconstruction in image compression. Okay, now we have several metrics for our AI products, but what is the process of metric selection? Before we can select the right metrics, it is crucial to understand the specific business problem and objectives the AI product is supposed to address. Different AI products serve different purposes, so their relevant metrics can be very different. The metrics must be closely tied to the specific task the AI model is performing. For example, precision and recall are widely used in classification tasks, while mean absolute error might be more appropriate for a regression task. We should also ensure that the chosen metrics align with stakeholders' expectations. For example, in a customer-facing AI product, user-experience metrics like speed and accuracy might be considered key. In many cases, improving one metric can lead to a decline in another, like precision versus recall. Understanding these trade-offs is critical in choosing which metrics to prioritize. But what if the existing metrics don't fit very well with our innovative AI products?
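As a concrete illustration of one of these visual metrics, here is a minimal IoU sketch for axis-aligned bounding boxes given as (x1, y1, x2, y2) corner coordinates; the boxes in the example are hypothetical.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union for two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle; width/height clamp to zero if boxes don't intersect
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)


# 5x5 overlap of two 10x10 boxes: 25 / (100 + 100 - 25) ≈ 0.143
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
```

In practice, a detection counts as correct when its IoU with a ground truth box exceeds a chosen threshold (0.5 is a common convention), and mAP is then computed on top of those matches.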
I believe most of you have been in that situation: you are working on an innovative, disruptive product, especially in the AI and machine learning field, and you realize there is no great existing metric that fits your product. For these types of AI products, in most cases you should define new metrics rather than selecting from existing metrics that don't really fit. I believe we as product managers should be creative in defining the right metrics that fit well with our product and its specific use cases. Now let's go through human in the loop in the workflow of some of these AI products. There are AI products that are not end-to-end automated and need some human intervention, which we call human in the loop (HITL). For these products, you are working with AI-generated metadata and have a human in the loop to review and correct each step of the AI-generated metadata in your workflow. Let's go through one example together. The AI product I brought as an example generates a translated script of an audio file. Let's say we receive audio in an original language, English for example, as the input for our AI product, and the output is a translated script in a target language like French. Going through the workflow from left to right, with ASR we transcribe the original audio. But as we all know, the AI-generated script may not be 100% accurate, so there will be some issues with it, and we need to have it post-edited by transcribers. We call the transcribers here the human in the loop. Then we go through machine translation to translate the post-edited English script to French. Before sending to end users, the translation must be post-edited by human translators, again as human in the loop. As you see in this example, there are at least two rounds of human in the loop in our workflow.
The reason I say at least two is that there is often some QA involved as well; the workflow is not always linear, and in many industries there is a lot of back and forth. But let's say just two rounds of human in the loop in our workflow to achieve 100% accuracy in turning English audio into a French script. Considering this example, for these types of products we should consider the speed and cost of the whole workflow end to end and ask ourselves, as product managers, this critical question: are we increasing the total timeline and cost of the workflow by leveraging AI, or are we improving them? If you only measure the performance of your models in ASR and machine translation, you may call it a successful product. But when you add the post-editing time and cost associated with your human in the loop for each round, you may find that you are not saving time or cost in the overall workflow compared to the traditional workflow or status quo. So it is very important to measure, for example, the time and cost of the whole workflow, including your AI and machine learning performance and the post-editing effort. Another important factor in these products is the adoption rate, by the human in the loop, of the workflow and any tools used within it. These would be the post-editing tools you are offering to your human in the loop within your workflow. We should remember that our human-in-the-loop workforces are not our end users; they are working on our AI model outputs. So we should always work closely with them and understand their pain points and preferences in terms of the post-editing journey, the tooling they prefer, and the features they would like to see in that tooling.
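The end-to-end comparison described above comes down to simple arithmetic, sketched below. All hours and rates are hypothetical placeholders, purely to show that the AI-assisted path should only be judged after the HITL post-editing rounds are counted, not on model performance alone.

```python
def total_cost(steps) -> float:
    """Sum (hours, hourly_rate) pairs into a single cost figure."""
    return sum(hours * rate for hours, rate in steps)


# Fully manual baseline for one hour of source audio (hypothetical numbers)
manual = total_cost([
    (4.0, 50),   # human transcription from scratch
    (6.0, 60),   # human translation from scratch
])

# AI-assisted workflow: ASR + MT inference plus two HITL post-editing rounds
ai_assisted = total_cost([
    (0.1, 20),   # ASR inference (compute cost expressed per hour)
    (1.5, 50),   # transcriber post-edits the AI transcript
    (0.1, 20),   # machine translation inference
    (2.5, 60),   # translator post-edits the machine translation
])

print(f"manual: ${manual:.0f}, AI-assisted: ${ai_assisted:.0f}")
```

With these made-up numbers the AI-assisted path wins, but note how sensitive the outcome is to the post-editing hours: if poor model quality doubled them, the savings would largely evaporate, which is exactly why the whole-workflow metric matters.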
We also need to make sure that human-in-the-loop workforces, especially for global AI products that will be used by millions of users around the world, are adopting our workflow, including our editing tools, across different territories and industries. For instance, your product, workflow, and tooling might be adopted really well by French translators but not so well by Spanish or German translators. Each community, country, or territory has its own pain points, journey, preferences, and culture, and that needs to be considered throughout adoption. Okay, as the great product managers all of you are, we should consider both AI-specific and business metrics. While AI-specific metrics give you insight into model performance, business metrics tell you how the model is impacting the business. AI performance metrics provide an understanding of how well the AI model is performing on a technical level. Business metrics, on the other hand, like user engagement, churn rate, and conversion rate, are key performance indicators (KPIs) that show how well the product is meeting business objectives. AI metrics should align with business goals. For example, if a business goal is to improve customer satisfaction, an AI metric could be the accuracy of customer sentiment analysis. Improving AI performance doesn't always lead to better business outcomes; sometimes pushing for a higher AI metric can negatively affect user experience and consequently the business metrics. User experience is key to successful AI products. Metrics related to the speed, accuracy, and utility of the AI product are all important for ensuring user satisfaction, which in turn drives positive business outcomes. AI product development should be an iterative process.
As the product is used over time, the data collected can provide insight into the relationship between AI performance and business metrics, which can inform refinement of both the AI model and the metric selection. Regular monitoring and evaluation of both AI and business metrics over time is crucial to understanding their trends and relationship. How should we make decisions when our AI product is in its earliest stage? Let's say you are a product manager joining a startup, or an organization working on new initiatives, and the AI initiative you are leading is at an early stage. How can you make decisions with metrics? The first step is selecting appropriate metrics that align with our business objectives and product goals. At the start of an AI product's life cycle, it is important to establish baseline metrics. These baselines provide a reference point to measure progress and improvements over time. I should say that not all metrics are created equal. Some will be more important than others depending on the specific goals of your product, so you should prioritize your metrics and decide which ones are the key drivers for your decisions. We should use metrics to drive the iterative development process: continually test and measure the product against these metrics and make adjustments based on the results. As you gather data and learn more about how your product is used, you should validate whether your chosen metrics are indeed the right ones. Do they accurately reflect user needs and product performance? If the answer is no, you need to go ahead and adjust them. In the earliest stages, the product might perform exceptionally well on certain metrics, but that could be due to overfitting, so make sure to use techniques like cross-validation to ensure the model generalizes well.
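As a small illustration of the cross-validation idea mentioned above, here is a minimal k-fold index split in pure Python: each sample is held out exactly once, which helps catch a model that only looks good because of one lucky train/test split. A real project would typically use scikit-learn's KFold rather than this sketch.

```python
def k_fold_indices(n_samples: int, k: int):
    """Yield (train_indices, test_indices) for each of k folds."""
    # Distribute any remainder across the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        test_set = set(test)
        train = [i for i in range(n_samples) if i not in test_set]
        yield train, test
        start += size


# Each of the 10 samples lands in exactly one test fold
for train, test in k_fold_indices(10, 5):
    print(test)
```

In practice you would train and evaluate the model once per fold and report the mean and spread of the metric across folds, rather than a single number from one split.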
As a product leader, there will be times when you have to make a very difficult but necessary decision based on the metrics and data. This could include pivoting the product direction or, in some cases, sunsetting a product. I know these decisions might not always be easy or popular, and some of you who just started your journey in product management won't want to sunset a product you have been working on for a couple of years. But believe me, they are crucial for the overall success and viability of the business. The key is to be transparent and highly communicative within your org, with your stakeholders and also customers, about why these decisions are being made, and to make sure these decisions are rooted in solid metric-driven reasoning. Metrics are your friend to support and defend you in these situations. One thing before we wrap up: remember, metrics are not a one-time setup. They need to be reassessed regularly, especially in the earliest stages of a product. Cool, let's wrap up this webinar with our key takeaways. In this webinar, we have walked through some key insights into metrics for AI products. We have explored various metrics associated with AI products, each with its own implications for product performance and business outcomes. From text- to audio- to visual-based AI products, the diversity of metrics highlights the wide range of ways we can assess and improve these innovative products. Aligning metrics with specific product and business objectives is crucial, and we have seen how a thoughtful, strategic approach to metric selection can provide valuable insights into product performance and drive key improvements. Lastly, we discussed the critical role of metrics in guiding decision-making, particularly in the earliest stages of an AI product's life cycle.
We explored how metric-driven decisions can lead to better product outcomes and highlighted the need for product leaders to be prepared to make challenging decisions based on these metrics. Thank you all for joining me in this webinar exploring metrics for innovative AI products. I hope you have found it helpful in guiding your own journey with AI product management. Please feel free to reach out on LinkedIn if you have any questions, and I wish you all the best.