Today we are going to discuss how we cover search relevancy testing in machine learning models. I am Vinayaka, and I work as a QA at ThoughtWorks. You can reach me over these channels; this is for your information. In today's topic, we will see what search relevancy is. It is a widely used functionality across search engines and retail applications. We will see what ML is and how we solve the search relevancy problem using ML. We will see what the challenges in ML testing are, which testing techniques I used in my earlier project, and which other techniques are available. And there is one of the most asked questions nowadays: if you want to test an ML application, how do you become an ML tester? That is a fun, interesting thing, and we will go through it as well.

So first, let's understand what search relevancy is. If you go to any search engine or any e-commerce application, you find a search box where you can enter your query. You expect the results that are displayed to be accurate, and the best results should come at the top, not on the second page or the third page. That is what search relevancy is about. Let's understand it with an example. Here I have an e-commerce application with two users searching for similar products, but with different queries. One user is searching for formal shoes. You can see that only formal shoes are displayed, and they are in a sensible order, so this looks good. The other user is searching for sneakers and gets a mix of results: there are sports shoes, one of the formal shoes has come up, and in the last row you have sneakers and sports shoes as well.
These are not good relevance results. As a user, you expect something like the left-side one, which is the correct one. That is what a user wants to see, not the right-side one, which is how the sneakers results are displayed. If you want to understand how we test this, first we need to understand how the search engine or your services understand these queries. So let's see how search relevancy works in the background.

Consider a user searching for a query like "Nike blue color running shoes for men". This is a pretty big one. The user has a specific item he wants to buy: he is searching particularly for blue shoes from the brand Nike, and your system should understand this. This search query gives us multiple pieces of information. It has a brand, it has a color, one of the query terms is misspelled and needs auto-correction, it has an article type (he specifically wants running shoes), and he is searching for men's shoes.

Consider the example of going to a footwear showroom and asking the salesperson to show you shoes matching this query: blue Nike running shoes. He may have only two or three pairs of blue shoes to show you. What the salesperson does is improvise. For blue running shoes, instead of Nike, he may say: I have Adidas, I have Puma, or I have some other brand you may be interested in. He shows relevant products. A search engine is the digital equivalent, so our system should do the same. How do we go about this? First, we understand the query, and then we do substitution for that query. For example: the user is searching for Nike.
He may be interested in Puma or Adidas also. He is searching for running shoes; he may use casual shoes also. He is searching for blue shoes; maybe white would interest him too. This is how we substitute the query, so you can show multiple products and the user has more options to select from. Then we provide this expanded search query to the engine. It takes care of many other things and returns the products in a ranked order, where the best results should come at the top, not somewhere on the 30th or 40th page. The user sees the good products right at the start, so the chances of him buying are higher. This is how search relevance works.

So what is the machine learning part in this? Look at a recommendation service. How do you recommend a product to a user? You should know his browsing history and what he is interested in. You may have millions of products in your catalog, but you should know which products are bought more and which products users are interested in, so you can boost those products to the top. And you need this query understanding and query substitution. These are not typical programming things. You cannot solve them with a typical program; you need more intelligence in your services so you can predict better results.

For example, this is one architecture for search relevancy. When a user searches for a query, it goes to the query intent engine. There is some semantic knowledge, and we perform some actions on it. We have query classifiers and ranking models, which are used for solving these problems. We use the product clickstream data: we segregate it, classify it, and do multiple extractions on it. Then we use some of the learning-to-rank algorithms. And we take the user reviews to see which products have good reviews, so we can show those products at the top.
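The query understanding and substitution steps described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the speaker's system: the lookup tables, the misspelling, and the function names `understand_query` and `expand_query` are hypothetical, and a real service would learn these mappings from data rather than hard-code them.

```python
# Hypothetical lookup tables; a production system would learn these from data.
KNOWN_BRANDS = {"nike", "adidas", "puma"}
KNOWN_COLORS = {"blue", "white", "black"}
KNOWN_TYPES = {"running shoes", "casual shoes", "formal shoes"}
SPELLING_FIXES = {"runing": "running"}  # assumed misspelling, for illustration
SUBSTITUTES = {
    "nike": ["adidas", "puma"],
    "running shoes": ["casual shoes"],
    "blue": ["white"],
}


def understand_query(query: str) -> dict:
    """Extract brand, color, article type and gender from a free-text query."""
    words = [SPELLING_FIXES.get(w, w) for w in query.lower().split()]
    text = " ".join(words)
    return {
        "brand": next((b for b in KNOWN_BRANDS if b in words), None),
        "color": next((c for c in KNOWN_COLORS if c in words), None),
        "article_type": next((t for t in KNOWN_TYPES if t in text), None),
        "gender": "men" if "men" in words else None,
    }


def expand_query(intent: dict) -> dict:
    """Keep the requested value first, then append related substitutes."""
    expanded = {}
    for field, value in intent.items():
        if value is None:
            continue
        expanded[field] = [value] + SUBSTITUTES.get(value, [])
    return expanded


intent = understand_query("Nike blue color runing shoes for men")
expanded = expand_query(intent)
```

Keeping the original value first in each list reflects the idea that the exact match should still rank above the substitutes.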
So this is a different thing from building a normal API or a login application. It has more intelligence in it, and this is the part where we apply machine learning. As we saw, the system has to understand the query better, and that is where we introduce the machine learning concept. Now let's see how we can test this application. For example, consider a learning-to-rank algorithm: if I am using product clickstream data to train my model, how do we test it? How do we go about it?

Let's begin with machine learning. What is machine learning? The program should have logical, natural thinking, because there is no fixed set of input queries. The input queries change from user to user, and different users word their queries differently. For example, some users search for running shoes, some search for sports shoes, and some search for marathon shoes. These are different, but they relate to a common solution, so we need a program that can understand this. That is the logical thinking.

In machine learning, we go about learning in different ways, like supervised and unsupervised learning. Supervised learning takes care of things like regression or classification. Unsupervised learning takes care of clustering, where I don't know what the output is, but I group the data. Many people have the assumption that an ML model means just an algorithm. That is not right. An ML model contains data, features, and many other things. And when we say ML model testing, people assume the developer takes care of it while training his model: first he trains the model, then he tests it with a different data set, and then we can say the model is correct. No, it is not actually so.
That testing is part of ML model development itself. Then comes the verification: as an end user, or as a user other than the developer, how can I test whether my model is working correctly?

Let's understand the basics of how an ML model gets built. There is a user who is training the model. I get the input data. After building the model, I take part of the data and feed it to my model as training data, and I train the model. Then I use some other data, which is called test data, and I check how my model performs on it. If it needs any tuning, I change my model according to the results. After I meet the basic success parameters, I deploy to production, where users can see the predictions. If the predictions are wrong, I have to go back and train the model again. This process of verifying with both my training data and my testing data is what I refer to as cross validation.

So if you are asked to do QA for this, how do you go about it? This is not like a normal program, where I code, deploy, test, and then push to production. I have to use my training data and test whether the model is working properly. If it works properly with the training data, then I use my test data and check whether the model works fine with new data. Only if it does, I push it to production. This is a different process altogether from the normal SDLC.

So what are the things I need to consider when testing this? For example, the ML model should predict correct results, and it should give relevant results. It should be robust and secure.
Robust means: if your model is under stress, with a huge number of input queries, or in difficult conditions, how is it going to work? What is the data you are using for testing? What predictions is it giving? In our case the output is just products from the catalog, which you can see. But if it is an image classifier or something similar, what type of data is it providing? How efficient is it? How do you say that your machine learning model is not biased? One of the trustworthiness issues with an ML model is that it is not always right, and it tends to be biased. So how do you say your model is fair to all input queries and not biased toward one thing? One more thing we consider is interpretability. For a given input query, if your model shows some output, how can you manually say that the result it has shown is correct? You should be able to interpret it.

So if I have to achieve all of this, what are the challenges in my testing? The challenge we normally see with machine learning models is that we call them non-testable. Why do we say non-testable? Machine learning models are used to find solutions for problems that have no known answer. For a given input query, like the example we saw, "Nike running shoes for men", what is the output? We can only predict it. We don't know the exact output. Take any other machine learning example, like image search: how do you search for an image in your whole catalog? You can't state the exact output. That is what we call non-testable: we don't know the expected output. This comes under the test oracle problem. There is no test oracle defined for my test case. I only know the input I give.
I don't know what my output should be. We often don't know the algorithm implementation either. We just do cross validation for it, but we don't know what future data is going to come in. For normal software, we go for code coverage: we can say each line of my code or each class is tested. But in machine learning, it is about the data. How do you say which input data you have verified, when you don't know the future data? Then there is fault tolerance: under which conditions is your machine learning model going to break? You have to verify it with invalid or unexpected data also. How do you establish model fairness? This is again about bias. And how do you deploy this to production? These are the challenges when you are testing an ML model.

So let's see how we can solve these challenges. If we can solve them, we can verify that the ML model under test is actually correct and worth deploying to production. First, we will understand the ML life cycle. How do we build an ML model? We first go with the business requirements, and then we collect the data. The data is the most important thing in building an ML model. I gather all the data from my data lake. I extract the data and segregate it into different tables or CSVs or whatever is suitable. From those, I pick the feature set that I require. Then I select the algorithm I need to use for training for my application. Then I train my model and see how it works with the training data. And then I deploy. This is the life cycle. So what are the things we are mainly dealing with here? The first three things we see are all about the data.
As a tester, I know how to test data. In most other applications, not only in ML, we have used data in many places, so we know how to handle it. Another important thing is the features; features play a very important role in ML models. Then, how do we go about testing the algorithms? How do you verify that training has been done properly? And then there are the backend services. These are the different components we interact with, so it becomes easier for me to know how to apply QA concepts here. As a normal QA, the data part and the backend services part are easy for me. I work daily on backend services like APIs, or your model may write some data to an Elasticsearch index or Solr, so that is easy. The only unknown parts for me are the features and the algorithms. My model mainly interacts with the data, the features, and the algorithms, so we will see how we can test these.

Let's consider the example of the data I used for my model training. I gather browsing history data: what the user navigates to in my retail application. This is the data we actually collect. This is sample data from a big data set with many fields. It has the user ID, the search term, the page URL, which product the user clicked, and the product list we showed to the user. It has where he was taken after clicking, whether he added the product to the cart, and the date and time. It provides a lot of log data for me. This is really useful, but the problem with this data set is that a lot of information is blank. For example, in search term, I don't know what all the search terms are. In post referral, I don't know where the user is going.
And then, in search term, the user is searching for chargers, but the page URL that is shown is for men's watches. So this is not sanitized or filtered data. The first thing for me is to verify whether the data coming to me is correct, so I have to do some testing on that. After filtering and verification, we gather data like this. This now looks like simple, correct data: it has the search term, the clicked product, whether the user added it to the cart, and the date and time of the click. This is easy and meaningful for me.

So how can I say that the data I got from there to here is correct? There is a transformation involved: from this big data lake, we have gathered only the data we need. So I have to do some quality checks here. Why do I need quality checks? The data you source should contain all the data, not only a small sample of it. Whether the user has added to cart or not, I should still get the row, because the user has clicked on the product and there are impressions. This is very important, because bias may start being introduced from here itself. If you don't have the whole data, your model will not understand the whole user history or how users search. That is why the data part is important.

What other tests do I do on top of this data? First, as I showed, I check whether the data is sanitized and that I don't have any adversarial data. In this example, we may be showing some ads, but those should not come into the data log, and I should not take them for training purposes. And the big data we use comes to us in batches.
Once every two hours, the data log gets pushed into our system. Is it complete? While pushing the data, did the complete data get pushed? That data will be gigabytes in size, so the process should complete successfully. If the data transfer is not complete, I will not get the full data for training purposes. Is it the correct data when we map it against the data lake? Is there any irrelevant data that is poisoning it?

There is one more part to this: what is the data I am using to train my model? There are many rows and many columns here, which I have got from the logs, but I typically don't need all of this data for training. So how do we select the data for training the model? That is where features come in. In machine learning, features are a very important thing. We can think of them as the measurable variables in your training data. Each column you see in your training data can be called a feature. For my model training, I used the search term, the product IDs, the impressions I got, the clicks, and how many times the product was added to the cart. This helps in analyzing what data I am feeding into my model for training. Without the product ID, it would be unknown which product ID we are boosting for which search term. For example, we have multiple search terms, but which product ID has more clicks? Is it the product ID in the first row? There is one more product ID, and it has more impressions and more clicks.
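The step from raw log rows to the feature columns listed above (search term, product ID, impressions, clicks, add-to-cart count), together with a basic data quality check, could look roughly like this. It is a minimal sketch: the row field names such as `shown_product_ids` and `clicked_product_id` are assumptions for illustration, not the actual log schema from the talk.

```python
from collections import defaultdict


def build_features(log_rows):
    """Aggregate raw log rows into per (search term, product) feature counts."""
    totals = defaultdict(lambda: {"impressions": 0, "clicks": 0, "add_to_cart": 0})
    for row in log_rows:
        term = (row.get("search_term") or "").strip().lower()
        if not term:
            continue  # rows with a blank search term are not usable for training
        for product_id in row.get("shown_product_ids", []):
            totals[(term, product_id)]["impressions"] += 1
        clicked = row.get("clicked_product_id")
        if clicked:
            totals[(term, clicked)]["clicks"] += 1
            if row.get("added_to_cart"):
                totals[(term, clicked)]["add_to_cart"] += 1
    return dict(totals)


def check_features(features):
    """Return a list of human-readable problems found in the aggregated data."""
    problems = []
    for (term, product_id), f in features.items():
        if f["clicks"] > f["impressions"]:
            problems.append(f"{term}/{product_id}: clicks exceed impressions")
        if f["add_to_cart"] > f["clicks"]:
            problems.append(f"{term}/{product_id}: add-to-cart exceeds clicks")
    return problems


# Hypothetical sample rows in the assumed schema.
rows = [
    {"search_term": "sneakers", "shown_product_ids": ["p1", "p2"],
     "clicked_product_id": "p1", "added_to_cart": True},
    {"search_term": "Sneakers", "shown_product_ids": ["p1", "p2"],
     "clicked_product_id": "p1", "added_to_cart": False},
    {"search_term": "", "shown_product_ids": ["p9"],
     "clicked_product_id": "p9", "added_to_cart": False},
]
features = build_features(rows)
problems = check_features(features)
```

A check like `clicks <= impressions` is the kind of invariant a QA can assert on every incoming batch before the data is used for training.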
So this is why the features for model training become very important. How do you select these features? How do you say whether the features selected for your model training are correct? To verify this, application knowledge alone is not enough; you should have domain knowledge also. If, in place of add to cart, I say price, it doesn't make sense. The price may vary day by day, so it is not a good feature for training my model. It is a variable component across products.

Apart from features, we saw that algorithms are one more part that needs QA. What is the challenge with testing algorithms? We have come across very simple algorithms like bubble sort for testing, but machine learning algorithms are more complex than those. For example, for relevancy we use multiple algorithms like LambdaMART, MART, and PageRank. If you are asked to test that type of algorithm, how do you go about it? The implementation details may sometimes be unknown, because we may be using the algorithm from some other source, like a library that I just include and then test. This is not the easy part of testing. Algorithms are difficult; they are mathematical models, and I may not know the implementation details. How do I test this?

One case is when our own team has implemented the algorithm. Then I know the implementation details, and I can check whether each and every function works correctly. I can use TDD: first I write the tests, then I write my function, and I check whether all my test cases pass. I can use unit tests against the functions to see whether they work correctly. There is one more part to it. After you write the algorithm, there are some metrics you can compare or measure against. For example, we use learning-to-rank algorithms for our model training.
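Ranking metrics of the kind used with learning-to-rank models can be computed directly from a ranked list of relevance labels. Below is a minimal sketch of two common ones, NDCG (normalized discounted cumulative gain) and MRR (mean reciprocal rank). It assumes graded relevance labels where higher is better and uses the linear-gain form of DCG; some implementations use an exponential gain instead, so scores may differ from a given library.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain of the first k results (linear gain)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


def mrr(result_lists):
    """Mean of 1/rank of the first relevant result over several queries."""
    total = 0.0
    for relevances in result_lists:
        rank = next((i + 1 for i, r in enumerate(relevances) if r > 0), None)
        if rank is not None:
            total += 1.0 / rank
    return total / len(result_lists) if result_lists else 0.0


# Relevance labels in the order the engine returned the products.
perfect_order = ndcg([3, 2, 1], k=3)
worst_order = ndcg([1, 2, 3], k=3)
```

Because these metrics only need the ranked output and relevance labels, they can be measured without knowing the implementation details of the ranking algorithm.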
When you use learning-to-rank algorithms, there are metrics you can measure to say whether the model you have built using the algorithm is correct or not. For example, NDCG (normalized discounted cumulative gain), mean reciprocal rank, precision, or accuracy: using this algorithm, how many true results has the model provided? How many false negatives are there? How many positives are there? So there is no need to worry that I don't know the algorithm implementation details. I can use other techniques or component metrics to say whether this algorithm suits me or not.

So we have seen the data, the feature part, and the algorithms. That is testing at the component level. If I have to test the system as a whole, what testing techniques can I use? There are different testing techniques for ML models, and it depends on what type of model I have built, for example deep learning or neural networks. But let's go through some of the common techniques that can be applied across different types of models.

One of the most widely used techniques for testing ML models is metamorphic testing. Metamorphic testing is about relations: how can you relate one output of a function to another? We use metamorphic testing to solve the test oracle problem, where I know the input I am giving but I don't know the expected output of my program. If I change the input of one run in a known way, the output should change in a known way too, and I can check that relation instead of an exact value.
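A metamorphic relation of the kind introduced above can be checked without knowing any exact expected output. The sketch below shows the classic relation sin(x) = sin(π − x), then two relations for a search function. Here `toy_search` is only a hypothetical stand-in for the model under test, and the two search relations are assumptions chosen for illustration, not relations taken from the talk.

```python
import math


def check_sine_relation(x, tol=1e-9):
    """Relation: sin(x) must equal sin(pi - x) for any x."""
    return math.isclose(math.sin(x), math.sin(math.pi - x), abs_tol=tol)


def toy_search(query, catalog):
    """Stand-in for the system under test: rank titles by matched query words."""
    words = set(query.lower().split())
    scored = [(len(words & set(title.lower().split())), title) for title in catalog]
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [title for _, title in ranked]


def check_word_order_relation(search, query, catalog):
    """Relation: reordering the query words should not change the results."""
    reordered = " ".join(reversed(query.split()))
    return search(query, catalog) == search(reordered, catalog)


def check_irrelevant_item_relation(search, query, catalog, irrelevant_title):
    """Relation: adding an unrelated product should not disturb the ranking."""
    before = search(query, catalog)
    after = search(query, catalog + [irrelevant_title])
    return [t for t in after if t != irrelevant_title] == before


catalog = [
    "Nike blue running shoes",
    "Adidas white running shoes",
    "Leather formal shoes",
]
sine_ok = all(check_sine_relation(x) for x in (0.0, 0.5, 1.0, 2.5))
order_ok = check_word_order_relation(toy_search, "blue running shoes", catalog)
stable_ok = check_irrelevant_item_relation(
    toy_search, "blue running shoes", catalog, "Steel water bottle"
)
```

Each check compares two runs of the same system against each other, which is what makes the approach usable when no test oracle exists.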
sin(x) is always equal to sin(π − x). If x changes, sin(x) changes and sin(π − x) changes too, and the two values should still match. So we do transformations: instead of x, I pass some other value. That is one transformation, and there are multiple transformations I can use.

Then there is one of the most widely used examples: predicting the life expectancy of a person using a machine learning model. You can consider his weight, his height, whether he has any diseases, whether he smokes or has other habits, and whether he is physically fit. With all this information, if the person has a smoking habit, the model should reduce the predicted life expectancy by some x percent. These are the relations we form. And there are different types of transformations, like coarse-grained data transformation and fine-grained data transformation. The idea is to change the data you feed to your model, for example by adding some constant to it, and see how the output changes.

Another technique we use is dual coding. In most ML models we build, we pick the algorithms or models based on our experience. If I have worked on two or three projects and I understand the problem we are trying to solve, I can say: this one suits, we can go ahead with it. But what if the problem is a little different? Say I pick LambdaMART as my algorithm. I train and deploy the model, and it shows good results. But how can I say whether PageRank might give better results for me? So I will develop and try my model using the PageRank algorithm as well.
I see how it behaves. After developing these models, we compare the results across them. For example, here we have used multiple algorithms and trained a model with each. We check their output using A/B testing, or I can use other ways, like diverting traffic to the different models. I can see how each performs in training and how it performs in testing. I calculate different metrics and choose the one that suits me best. So this is one concept: going ahead with multiple models that I can compare.

There is one more technique we use, called precision and recall. Out of the total predicted results, how many are actually correct? Those are true positives. How many are false positives? How many are false negatives? Precision is the true positives divided by the true positives plus the false positives, and recall is the true positives divided by the true positives plus the false negatives. These give a percentage that tells me how accurate my model is.

There is one more concept that we already know from normal programming, which is called mutation testing. What do we do in normal software? We have a program, we introduce some faults into it, and then we execute our test cases on that faulty program. We check what happens with the test cases; they should fail. The same concept can be applied to an ML model. Instead of the program, we can mutate the data. I have the original data. I introduce some faulty data to make it mutant data. Now I feed this to my model for prediction and see what happens.
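The data mutation idea described above can be sketched as follows. This is only an illustration under assumptions: `score_product` is a hypothetical stand-in for the model under test (a simple click-through-rate score), and the three mutations are examples chosen for this sketch. A mutant counts as "killed" when the model either rejects it or produces a different output than for the original row.

```python
import copy


def score_product(row):
    """Stand-in for the model under test: click-through rate as a score."""
    if not row["search_term"].strip():
        raise ValueError("blank search term")
    if row["impressions"] <= 0 or not 0 <= row["clicks"] <= row["impressions"]:
        raise ValueError("inconsistent click counts")
    return row["clicks"] / row["impressions"]


# Each mutation injects one kind of fault into a copy of the original row.
MUTATIONS = {
    "blank_search_term": lambda r: r.update(search_term="  "),
    "negative_clicks": lambda r: r.update(clicks=-5),
    "clicks_exceed_impressions": lambda r: r.update(clicks=r["impressions"] + 1),
}


def run_mutation_test(model, original):
    """Feed every mutant to the model and record whether it reacted."""
    baseline = model(original)
    outcomes = {}
    for name, mutate in MUTATIONS.items():
        mutant = copy.deepcopy(original)
        mutate(mutant)
        try:
            reacted = model(mutant) != baseline
        except ValueError:
            reacted = True  # rejecting faulty input also counts as a reaction
        outcomes[name] = "killed" if reacted else "survived"
    return outcomes


original_row = {"search_term": "sneakers", "impressions": 100, "clicks": 12}
outcomes = run_mutation_test(score_product, original_row)
```

A mutant that survives points at faulty data the model silently accepts, which is a candidate for an extra validation rule or test case.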
For the faulty or not-so-correct data, I observe how my model reacts. This is the concept we use in mutation testing.

There are many other techniques. For example, for generating input, I can use fuzzing and search-based test input generation, so I apply different methodologies to the input data I use for testing. If I have a complex algorithm predicting the output, I can use linear regression or some other simple method as a reference and say that my results should always be better than that. So I can create a benchmark using linear regression or any other simple method.

One more problem is test adequacy. I have a huge model, and I test as much as is possible with the code, but data is a different matter. So instead of testing only with sample data, I take real user data, feed it to the model, and check how my model behaves.

There is one more technique, called DeepXplore, which is a white-box testing approach used for deep learning networks such as those in autonomous vehicles. Think of it like this: you are given a Tesla, and you have the responsibility to test that the autopilot function in the car works fine. How do you go about it? Apart from this, you can test with different data sets. In the training phase, have multiple data sets with different categories and test with those.

So after all this, let me differentiate between what traditional testing looks like and what ML testing looks like.
In traditional testing, what I normally test is only the code, but in ML I have to verify both the data and the code. In normal programs, once I deploy the code, it is fixed; it doesn't change. But in ML, the data coming into my system may change every day, and the behavior may keep changing, so I have to verify it continuously. One more important thing is the test oracle. For the normal programs we write, we know the expected output. In ML, I don't know the output, and we don't have data coverage or test coverage for it either. False positives in traditional testing are very rare, and if one comes up, we raise it as a bug and fix it. But in ML, you will see false positives most of the time. And normal programs get tested by developers and QAs. Here, data scientists may get involved, and developers and QAs should also test.

This brings us to one of the questions we generally think about. I am a normal QA tester who tests software. If I have to test an ML model, how do I get that proficiency? How do I gather the skills for testing ML models? The first and foremost thing for a QA, I think, is that you need to understand the domain. You should know what the problem is. Basic knowledge of programming, maybe Python, is a good start. I would also say: go through some of the ML courses. There is a lot of mathematical terminology that you may already understand but that is named differently, so it helps. Knowing how ML models are developed will help too. And ML is not only about the models. You should understand the data, so data science knowledge is also good to have. And do some sample projects on Kaggle. We are all very well versed with GitHub.
For ML, people use Kaggle. It provides a platform and data, and you can see how many other people have solved the same problem. One more very useful resource for me while working on this was scholarly articles. We feel they are very theoretical and boring, but they actually provide very good insights into each model and how the authors arrived at their results. So this should help in your journey of testing ML. These are some of the references I used through my journey, so maybe they will be useful for you also. I think I am done with my content. If you have any questions, we can go ahead with them.

Thanks a lot, Vinayaka, for the session. It was so much learning, and so nice to see how you did it and to be able to learn and implement it in our work. That is a great effort. Thank you so much for coming and sharing with us. We have very good feedback and questions coming in, if you can see them in the discuss tab in the Q&A.

Yeah, I'm going through that.

How do you manage to provision data for training and testing the model? Is it synthetically generated or from production? There are two things here. Training and testing data, in the sense of building the model, is what the developer uses for training, and that is different. When you are verifying the model as QA, I suggest you use production data, which is real data, so you can verify whether your model is working correctly or not.

Is your search validation different than model validation? Are your automation results integrated, and do they act as feedback to the model to retrain it? No, automation results are not used for retraining the model. The reason is that your automation always gives the same thing, but when my model is getting trained, it should be trained with real data.
No automation results are used here.
Is it possible to implement ML where the data is increasing with time? Initially it is zero, after one month it is 6,000.
Yes, the data comes in over time. My model gets data every day, and we retrain the model based on how it is performing. If the relevance score, that is the NDCG score, drops too low, we take the model and train it with new relevant data.
Is it only for e-commerce websites?
I used an e-commerce website just for the purpose of understanding. Search relevancy matters even for the Google search engine, even for Yahoo. I will give you one example. It may not be a bug as such, but it is a relevancy bug. Go to Google and search for "ML model". Just "ML model", and see what results are shown in the image section. Go to Yahoo, search for the same "ML model", and see what is shown in the images. Google shows ML model diagrams, but Yahoo shows Mercedes-Benz ML-class cars. So that is where it actually becomes interesting.
Great session. What resources would you suggest to beginners?
Yeah, I think I have given them in the references. I will share the PPTs, so maybe you can go ahead with those.
Do you compare multiple algorithms to decide which algorithm is giving better results?
Yeah, we do the DL coding. We have used PageRank. We have used LambdaMART for learning to rank. Yeah, we used multiple.
Is it supervised or unsupervised? How did you decide which one to use?
It depends on your problem statement. We used supervised because this problem is about grouping items into known categories or predicting them. Unsupervised is for where you don't know what you don't know, actually. It is like this, right? For example, you have been given a basket of fruits. Let's say it has some 20 fruits in it. If you know the name of each fruit, you can tag it with that name. If your model doesn't know the names, right? It just clusters them.
Whichever fruits it cannot name but which look similar, it puts into the same category. So that is why we used supervised.
What is the success rate seen when the application is deployed to production?
This is more specific to the client, but look at the conversion rate. Before we used this model, the conversion rate was about 2 to 3%. After deploying this model, we have seen about 9 to 10% sales conversions. So yeah, this helps.
The details shared about ML models are used by developers. Did you develop another model to test the developers' model?
No, I haven't developed any other model.
How is this different from data wrangling?
Sorry, I don't understand what data wrangling is.
I think our time is up. So if there are any questions, feel free to message me.
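Editor's sketch: in the Q&A the speaker says the model is retrained when the NDCG score drops too low. The talk does not show how that check was implemented, so the following is only a minimal illustration of the idea, not the project's code. It assumes graded relevance labels (0 = irrelevant, 3 = highly relevant) listed in the order the results were shown, uses the linear-gain form of DCG, and the function names, the `threshold` of 0.8, and the cutoff `k` are all hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain (linear gain) of labels in ranked order."""
    # Position 0 is the top result; its discount is log2(2) = 1.
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances))


def ndcg(relevances, k=10):
    """NDCG@k: DCG of the shown ranking divided by DCG of the ideal ranking."""
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    if ideal_dcg == 0:
        return 0.0  # no relevant result at all for this query
    return dcg(relevances[:k]) / ideal_dcg


def needs_retraining(judged_queries, threshold=0.8, k=10):
    """Return (flag, mean_ndcg) for one day's judged queries.

    judged_queries: one list of relevance labels per query.
    threshold: hypothetical cut-off below which retraining is triggered.
    """
    if not judged_queries:
        raise ValueError("no judged queries to evaluate")
    scores = [ndcg(labels, k) for labels in judged_queries]
    mean_score = sum(scores) / len(scores)
    return mean_score < threshold, mean_score


if __name__ == "__main__":
    # Two queries: one ranked perfectly, one with the best item in third place.
    today = [[3, 2, 1, 0], [0, 0, 3]]
    flag, score = needs_retraining(today)
    print(f"mean NDCG = {score:.2f}, retrain = {flag}")
```

Run daily against a sample of production queries with fresh relevance labels, a check like this gives a continuous signal instead of a fixed expected output, which fits the point made in the talk that ML behavior has to be verified continuously.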