Hi everyone, thanks for joining us today. We are going to talk about automation test failure classification using machine learning. This is Anuradha Kondori; I have been working as a software engineer in test with Expedia Group for more than two years now, and I'm joined today by my colleague Kirtika, who is a senior ops and traffic analyst with Expedia Group. Today we will talk about how machine learning can help us categorize automation errors into multiple categories.

Let's get started with the introduction, and first with the problem statement. When we look at automation failures, we are not always confident that they can be directly reported as actual bugs, and I can say this from my experience of almost five years in testing. This doesn't necessarily mean the automation test cases are flaky; failures can have several other causes, such as environment stability issues, unavailability of inventory, setup issues, runtime issues, or even automation issues like broken locators. Because automation failures can have these other causes, teams tend to ignore the results altogether, defeating the whole purpose of having automation in place. Engineers also spend a lot of time analyzing failures to determine whether they are actual bugs, and when the number of failures is high it becomes quite difficult to root-cause each and every one. These days, with microservices architectures and CI/CD pipelines that build on almost every commit, analyzing the failures of every build to determine whether they are actual bugs is harder still. So this is the problem we are targeting: we want to reduce the time an engineer has to spend looking into each and every failure to determine whether it is an actual bug.

Now let's talk about the complexity of the problem at hand. Different teams may use multiple frameworks, implemented in different programming languages, depending on their particular requirements, and the automation failure messages can look quite different depending on the framework used. Another factor is the test specification methodology, that is, the way test scenarios are defined; one of the prominent methodologies in recent times is the BDD style of defining scenarios, and this also shapes how the final failure message looks. A third factor is customized failure messages at multiple levels of the automation stack, which again cause the messages to differ.

A given automation failure can also arise from multiple layers. The first layer is the automation test code itself; below that are the libraries with which the automation was built; below that is the framework used to build the automation; below that, only in the case of UI automation test cases, is the Selenium WebDriver; and at the bottom is the programming language used to build the test cases. Depending on the layer from which the failure arises, the failure message can look quite different, and some level of customization of the message is possible at each of these layers.
So you can see that it is practically very difficult to find a common pattern among the failure messages that could help us analyze them or gain insights.

Now let's talk about the capabilities we would want from a solution to this problem. Obviously, the solution should automatically detect patterns among the failure messages and provide insights, or a categorization. As we discussed, different automation frameworks may be used, so the solution should remain framework-agnostic and be able to categorize failures from any of them; it should also be extensible to any future frameworks teams might adopt depending on their requirements. Next, an automation failure message can contain two parts: the failure description or summary, and the stack trace. One could argue for using only the failure description as input to the solution and ignoring the stack trace, but from our experience with the failures we have seen within Expedia, the description is sometimes unavailable, and the stack trace sometimes contains useful information for understanding the failure. For these reasons we decided to use the entire failure message, including the stack trace, as input, so the solution must be able to work with this entire, unstructured message. Next, the solution should be able to adapt itself to any new failures arising from the frameworks it already covers. And last but not least, the solution should be highly reliable, because reliability determines the confidence with which teams will act on its analysis of the automation failures. Keeping all of this in mind, we decided to use machine learning as our solution and to let it provide the predictions. My colleague Kirtika will now talk about the solution in more detail.

Hi everyone, this is Kirtika. I've been with Expedia for the past two years, working as a senior ops and traffic analyst. Anuradha has described the problem we had and the outline we created to understand it. With the problem statement and that analysis, we identified how we could solve the problem: we created a solution outline, and through deeper analysis we found that this problem is a good fit for machine learning, and specifically for classification, because we know our data very well and we know what we want from it. Broadly, the test error messages fall into four classes. The first is inventory: whenever some item is unavailable, an inventory error shows up. Then there is automation, which Anuradha talked about earlier, for example a non-existent element on the page. Then there is environment: many downstream and backing services run to power the UI, and if something goes wrong with those services, an environment error shows up.
The fourth class is product: the errors a customer would actually face on the final UI of the Expedia site.

Before moving ahead with ML, we had to plan accordingly, because as you can understand there are multiple steps to solving a problem with ML; we can't just throw the problem at it. The first and most important step of the solution outline is to collect errors and classify them manually, to provide a source of truth for the model. Then we have to identify suitable algorithms; there are many available for problems like this, such as text classification and categorization algorithms. Then we build a proof of concept, so that we can eventually grow the solution outline into a product. After that we create an end-to-end machine learning flow and automate everything, because right now this is a very manual process in which a tester has to go in and select the category to which a particular error message belongs. Then we keep adding more pre-classified data and keep fine-tuning; this step is very important for any machine learning setup, because if the data gets stale, the model becomes more prone to bias and overfitting. Finally, at the very end, we consume the model. Expedia is a very big organization producing a lot of data, so we want to make the solution robust and scalable enough for the whole company to consume the ML model.

In the very first step, data collection, you can see on the screen what one of these error messages looks like. One of the test frameworks pulls the message and puts it onto a dashboard, and a tester has to go there, understand the message, and manually select the category it falls into. The dashboard is backed by MongoDB, so we collect our data from MongoDB and classify it manually to provide the source of truth, that is, to create our training data.

Next comes data preparation, and this is how our data looks. It doesn't have many dimensions: it is essentially two-dimensional data with a multi-class target. One column is the error message; the other is the category the error message falls into. Then we do some data cleaning. The messages contain a lot of punctuation and HTML tags, but we need to identify exactly what a message is about. For example, for inventory, if something is out of stock, or the app is not able to fetch the inventory for a particular item, there is a particular error message thrown by the app; but in the running tests we get the whole stack trace. We will deal with the full stack trace as the model matures, but for the proof of concept we removed all the extra punctuation and tags and kept only the relevant plain-text message.
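To make that cleaning step concrete, here is a minimal Python sketch of this kind of normalization; the function and the sample message are illustrative, not the actual Expedia pipeline:

```python
# A minimal sketch of the cleaning described above (hypothetical function;
# the real pipeline parses framework-specific result files).
import re

HTML_TAG = re.compile(r"<[^>]+>")          # strip markup that leaks into messages
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]")  # drop punctuation, keep plain text

def clean_error_message(raw: str) -> str:
    """Reduce a raw failure message to the plain text used for training."""
    text = HTML_TAG.sub(" ", raw)
    text = NON_ALNUM.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

sample = "<b>Error:</b> Element #searchButton not found!!!"
print(clean_error_message(sample))  # -> "error element searchbutton not found"
```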
Next we evaluated multiple algorithms on the cleaned data. As you can see on the screen, we tested several, such as Naive Bayes and support vector machines, and we initially settled on a support vector machine with term-frequency features, which gave us an accuracy of 72 percent. That was good enough for a proof of concept, but as you know it is not good enough for production, and it is not reliable for an engineer: a human doing this job regularly gets it right around 95 or 96 percent of the time, so why would he or she use our product? So we had to identify why the accuracy was not up to the mark and why it was not reliable.

Having evaluated the model, we then evaluated the data to see what was wrong with it. It turned out the data was not in a very balanced shape: a few categories had very few examples while others had a huge number. For example, product does not have many errors, whereas automation, or some internal error, shows up much more often. So we identified a data challenge, and with it a risk of bias and overfitting in the model. To fine-tune the model and clean up the data, we used multiple techniques: cost-sensitive learning, synthetic samples, tree-based algorithms, and semi-supervised learning. I'll go through them one by one. Cost-sensitive learning means assigning costs to misclassifications; the model can assign those costs itself as it classifies, which comes under fine-tuning, but to improve the data side we also assigned costs to the most telling words in an error message. Expedia is a very mature platform and we have already seen many errors, so we already know what the patterns are. Next is synthetic samples: whenever the class counts don't match up, we generate synthetic examples to fill out the under-represented categories so that all the categories are levelled up. Then there are tree-based models and sampling; these are really machine learning modelling techniques, but we used them to balance the data by adding a feedback loop. And finally semi-supervised learning: so far we had been doing purely supervised learning, but in the end we found that to tackle the complete message we had to understand the patterns within it, for example to find a similarity factor between two messages or two data points, and for that we had to turn to semi-supervised learning.
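To make a couple of these ideas concrete, here is a rough scikit-learn sketch of the kind of baseline comparison described above, with class weighting as one simple form of cost-sensitive learning. The variables `messages` and `labels` are assumed to hold the cleaned texts and their manual categories; the 72 percent figure comes from our data, not from this sketch:

```python
# Illustrative baseline comparison (scikit-learn), assuming `messages` is
# a list of cleaned error strings and `labels` their manual categories.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

candidates = {
    # plain term-frequency features feeding each candidate classifier
    "naive_bayes": make_pipeline(CountVectorizer(), MultinomialNB()),
    # class_weight="balanced" is one simple form of cost-sensitive
    # learning: errors on rare classes (e.g. product) are weighted up
    "linear_svm": make_pipeline(CountVectorizer(),
                                LinearSVC(class_weight="balanced")),
}

for name, model in candidates.items():
    scores = cross_val_score(model, messages, labels, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```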
With the data balanced, we moved forward with model tuning. As our technique matured, we ended up using model ensembling, where we train several models, combine them, break them apart, and take the result, iterating multiple times to get our results. We used Random Forests, a voting classifier, and so on, but in the end we stopped at XGBoost, where the accuracy was very promising and good enough for production. So currently we use semi-supervised learning for data cleaning and for the imbalance problem, and an XGBoost classifier to classify the errors, which gives us better recall and precision values. This final fine-tuning gave us an accuracy score of 97 percent and a precision of around 69 percent, where the 69 percent precision is for the product category; we have also broken down the accuracy per category. The most important part is the hyperparameter values we identified and assigned during fine-tuning, for example a maximum tree depth of 13, plus what the learning rate should be, what the feature-selection percentage should be, and so on. There are multiple things that come under fine-tuning, and it is a continuous process: we keep feeding in more data, and new patterns and new errors keep arriving as new applications are continuously developed, so these parameters are prone to change, including for the sake of improvement. Our final metrics, with a 30 percent test split and two different seeds, give us an accuracy of 97 percent, which is very promising and good enough to put into production.
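As a rough illustration, here is a sketch of an XGBoost classifier wired up this way. The maximum depth of 13 and the 30 percent split come from the talk; the learning rate, column-sampling fraction, feature choice, and variable names (`messages`, `labels`) are placeholders:

```python
# A sketch of the fine-tuned classifier; max_depth=13 and the 30 percent
# split come from the talk, the rest are placeholders (`messages`/`labels`
# are the cleaned texts and manual categories from earlier).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from xgboost import XGBClassifier

le = LabelEncoder()
X = TfidfVectorizer().fit_transform(messages)  # TF-IDF as a stand-in feature set
y = le.fit_transform(labels)                   # encode the four categories
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)     # 30 percent held out

clf = XGBClassifier(
    max_depth=13,          # tree depth quoted in the talk
    learning_rate=0.1,     # placeholder; the tuned value is not given
    colsample_bytree=0.8,  # placeholder for the "feature selection percent"
)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test),
                            target_names=le.classes_))  # per-class precision/recall
```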
So that is why our next step was to deploy the model and set up an endpoint so that everybody can use it. That part will be taken up by my colleague Anuradha. Anuradha, over to you.

Thanks, Kirtika. Everyone, we are now going to talk about model deployment and endpoint setup. We want our trained model to be exposed as a service behind an API endpoint: the error message is given as input to the API endpoint, the endpoint talks to the trained and deployed model, gets back the inference or prediction, and returns it as the API response. We expose the model as a service for simplicity of usage, because the model can then be invoked for a prediction from any piece of code.

Now let's talk in detail about how we achieved this. The primary technology we used to train, deploy, and expose the model is Amazon SageMaker. SageMaker allows us to put our model code into a custom container that can be trained and run within the SageMaker environment. The first step is to create the Docker image with our model code. SageMaker provides a default container structure that already has a Flask service acting as the web app and an nginx server; all we need to do is put in our model training and inference code and build the Docker image. Once the image is built, an invocation endpoint gets exposed within it. SageMaker also provides training and inference scripts with which we can test locally; these scripts mimic the way SageMaker calls the training and inference process in the actual SageMaker environment.

Once we have built our image and tested the training and inference parts, the next step is to push the model image to Amazon ECR, the Elastic Container Registry. Next we make our training data available to SageMaker; we chose Amazon S3 for this purpose, so we put the training data into S3. Then we can create the SageMaker training job. The inputs SageMaker needs are: the full path to the model Docker image we pushed to ECR, along with any tag names useful for uniquely identifying that image; the training data path in S3; and the desired location where the trained model artifacts should be stored. With these inputs, once the training job is created, SageMaker runs the training and the trained model artifacts are pushed to S3. These artifacts are nothing but a pickle file, that is, a serialized form of the trained model. Once the trained model artifacts are available in S3, we can create the SageMaker endpoint. The prerequisite for invoking this endpoint is that it can be invoked only from within a SageMaker runtime environment, so as the next step we test the endpoint within the SageMaker runtime to see whether we can fetch predictions from the deployed model.

Next we want to make the model available as a public endpoint, and we achieved this using AWS API Gateway and AWS Lambda. The Lambda has the code to create the SageMaker runtime session and invoke the SageMaker endpoint with the error-message input it receives from API Gateway; the SageMaker endpoint in turn talks to the trained model and returns the inference to the Lambda. The Lambda appends this prediction to the error message that was received as input and sends it back to API Gateway, which returns it as the API response. Here you can see an example API response: this is the input error message that was received, and here is the "environment" prediction that was populated using the model endpoint.
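Here is a minimal sketch of that Lambda piece, assuming a plain-text payload; the endpoint name and payload shape are hypothetical, as the talk does not show the actual handler:

```python
# A minimal sketch of the Lambda described above; the endpoint name and
# payload shape are hypothetical.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

def lambda_handler(event, context):
    error_message = event["error_message"]       # forwarded by API Gateway
    response = runtime.invoke_endpoint(
        EndpointName="failure-classifier",       # hypothetical endpoint name
        ContentType="text/plain",
        Body=error_message.encode("utf-8"),
    )
    prediction = response["Body"].read().decode("utf-8")
    # append the prediction to the input message before returning it
    return {
        "statusCode": 200,
        "body": json.dumps({"error_message": error_message,
                            "prediction": prediction}),
    }
```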
Now that we have the model endpoint, let's talk about its usage within Expedia. Within Expedia, the automation test cases run as part of a CI/CD pipeline set up using Jenkins, and for every test-case run a test record gets created with all the information about that run, which is pushed to Elasticsearch and can be visualized using Kibana. Let's look at a sample snapshot of the test-run data from Kibana. In this particular view I have queried only for scenarios that failed. You can see the scenario name field; the status, which here is obviously "failed"; the line of business for which the scenario was executed; the error message or failure message; a field called the page failure message, which we will talk about in a bit; the framework field, indicating the framework with which the particular test case was built; and finally the prediction field, which is the field populated for each test-case run from the model endpoint we created.

Now let's talk in detail about how these predictions get populated, that is, how the predictions are fetched from the model endpoint. For every test-case run, depending on the framework with which the automation was built, a result file gets generated in either JSON or HTML format, and a processing module gets invoked. This processing module is responsible for getting all the details about the particular test-case run and creating a JSON record, which is finally pushed to Elasticsearch. When a scenario fails, the processing module gets the failure message or error message from the result file. There is also another, optional input called the page failure message. On Expedia site pages (again, this applies only to UI automation test cases), whenever there is a back-end process or back-end service failure, a custom failure message is sometimes displayed to the user; this is what we call the page failure message. For UI automation test cases, whenever this page failure message was available and included in the training data, we saw the model perform really well, as it could gain certain insights from this custom message taken from the UI itself. For this reason, in the inference or prediction path as well, the processing module fetches this page failure message from the browser for the particular test-case run whenever it is available, prepends it to the beginning of the error message, and sends the whole message as input to the model API. It then gets the prediction back from the model API, appends it to the JSON record that was already created, and stores the whole record in Elasticsearch.

Within Expedia, these are the frameworks currently in use, and for each of them we have processing modules implemented that fetch predictions from the model endpoint and populate them for test-case failures: the first framework is ScalaTest, implemented in the Scala programming language; the second is Nightwatch.js, implemented in JavaScript; and the third is Cucumber, whose step definitions are implemented in Ruby.
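A minimal sketch of what such a processing module's failure path could look like; the API URL, index name, and field names are hypothetical, and real modules would parse their framework-specific result files:

```python
# A minimal sketch of a processing module's failure path (API URL, index
# name, and field names are hypothetical).
import requests
from elasticsearch import Elasticsearch

MODEL_API = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/classify"  # hypothetical
es = Elasticsearch("http://localhost:9200")  # wherever the test records live

def process_failure(record: dict, page_failure_message: str = "") -> None:
    """Classify a failed scenario's message and index the enriched record."""
    message = record["error_message"]
    if page_failure_message:
        # prepend the custom UI failure text, as described above
        message = f"{page_failure_message} {message}"
    resp = requests.post(MODEL_API, json={"error_message": message})
    record["prediction"] = resp.json()["prediction"]  # assumed response shape
    es.index(index="test-runs", document=record)
```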
Now that we have talked about the model endpoint setup and usage, let's look at some prediction-category examples from Expedia test-case failures. The first category is automation; as we discussed earlier, automation represents locator issues, setup issues, or runtime issues. Please also note that in these examples, for simplicity, I have included only the failure description or summary, not the stack trace. Let's go through the examples from left to right and top to bottom, in that order. The first example says a NullPointerException was thrown, which is a runtime issue, and hence it is classified as automation. The second says an element is not clickable, which could be a locator issue, and is categorized as automation. The third says the automation tried to access a method on a null element, again an automation issue. The fourth says a particular value was not found in a select list in the UI, again an automation issue. The fifth says the automation was unable to locate a particular element, or the object was unknown, again an automation issue. And in the last example we can see that the session was terminated because of an exception in the Selenium WebDriver; this is an example of a setup issue, again an automation categorization.

Now let's look at the next category, environment. As we discussed earlier, automation usually runs on pre-production or test environments, and these can have certain stability issues, such as capacity issues or back-end services performing unstably, which can cause automation failures; this is what we expect to be categorized as environment. The first example says the automation is not on the desired page, which is the hotel search page; it calls this a runtime error, but it is an environment stability issue. The second example says that even after multiple retries an element was not found, and it is categorized as environment. As you can see here, there can be a certain amount of overlap in the failure descriptions: two messages might look very similar but get categorized differently. This depends entirely on the training data fed to the model; the model may have picked up certain insights or keywords from the particular failure message indicating that it belongs to a particular category. Continuing with the examples: here it says navigation to the trips URL failed, again called a runtime error but likely an environment stability issue. The next says the failure message itself was malformed, again an environment issue. The fifth says the required room type expected by the automation is missing from the hotel infosite or details page, again an environment issue. And the last again says the automation is not on the expected page.

Now let's look at the next category, inventory. As we discussed, automation runs on pre-production and test environments, and the inventory within Expedia test environments is quite different from the production inventory. Sometimes the items are not available in the test environment as required by the automation while it is running, which causes failures; these are expected to be categorized as inventory failures. Let's look at some examples. The first says the search-results section itself is not visible, definitely an inventory issue. The second says the hotel search section element is not available, again an inventory issue. In these two examples, what you see prefixed with "alert error" are actually the page failure messages we talked about earlier, coming from the UI.
Continuing with the inventory examples: the next says that rooms are not available for the particular criteria the automation is looking for, clearly an inventory issue. The next says there are no results on the page; although it is called a runtime error, it is on the flights page, so basically the inventory is not available. The next says it is not able to find a particular trip that is required, again an inventory issue. And the final example is that the flight search is not working, again an inventory issue.

Now let's look at the final category, product, which is expected to represent the actual bugs that should be reported. The first example says certain expected sections are not as expected on the HSR page, the hotel search results page, and it is hence categorized as product, that is, a bug. In this example and some of the following ones, the messages are custom error messages coming from the automation layer or from libraries the automation uses. Here it says a particular section is not working as expected, categorized as product. The next says a particular condition is false, along with a message explaining why the verification failed. The fourth says an expected message is not displayed on the page. The fifth is termed an assertion-failed error and says a particular option is not in the expected condition. And the final example says the search dates on the UI are not as the automation expects, again categorized as product.

Now that we have seen the examples, based on the current data we have in Elasticsearch, this is how the distribution of predictions looks for the automation test failures: most of the predictions are in the environment category, which accounts for around 65 to 70 percent of the failures; next comes the automation category, around 45 to 55 percent; next the product category, which denotes the actual bugs, around 35 to 40 percent; and last is inventory, around 15 to 20 percent. Now my colleague Kirtika will talk about the future work we could do in this space. Kirtika, over to you.

As for the future work: so far we have only worked with the test results focused on the product, but there are multiple logging mechanisms in use across our platform. For example, there are the JavaScript errors produced on the page and fetched by the Selenium WebDriver; those errors could be used to further train the model, to further identify and classify where the platform is failing and what the error actually is. Then there are the exceptions in Splunk: all the transaction and event logs are logged in Splunk, so we are thinking of fetching data from Splunk as well and working with it. Then there are Haystack traces; Haystack is an Expedia-built tool, now open source, for tracing and trend analysis, and it surfaces and logs failures, success rates, and anomalies. We also plan to fetch these anomalies and failures, put them into our data, and see how the model and the automation perform with them.
So basically our idea is to use all the data points from across the platform, wherever errors and failure messages are being produced and wherever the environment is failing; we want to fetch everything and make the solution more scalable, more robust, more global, I would say. That is all for the future work, and we are now open to your questions. Thank you for listening; this is all we needed to present. I have seen many questions coming in, and I know you are interested; this is an interesting topic, and it was just as interesting for us to implement. So now let's handle your questions. Manoj?

Sure. That was a pretty interesting talk, as we have been seeing on the panel. The most upvoted question I see here is: how much data did you use to classify and train your model initially?

Kirtika, can I take the question? Yeah, sure. Okay, so initially we started with very little data, because, as Kirtika mentioned, the pre-classification is a manual process and we had to rely on the teams to provide the classifications. We started with roughly 1,000 to 2,000 rows, and the last set we had was around 3,000 rows. But then, as Kirtika described, we brought in semi-supervised learning: we took all the unlabeled rows, that is, all the failure messages without a categorization, of which we have lots, used a promising model to categorize them, and fed them back as training data, thus creating more data for the model to work with. This gave us a significant improvement in performance. And I would like to add that after all the automation we have done, we are now getting more than 60,000 data points per minute for training.
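A minimal sketch of that pseudo-labelling step, assuming a probabilistic classifier such as the XGBoost model sketched earlier; the confidence threshold and variable names are illustrative:

```python
# Pseudo-labelling sketch: let a trained model label unlabelled failures,
# keeping only its high-confidence predictions as new training data.
def pseudo_label(model, X_unlabelled, threshold=0.9):
    """Return confidently predicted rows and their pseudo-labels."""
    proba = model.predict_proba(X_unlabelled)   # class probabilities per row
    confident = proba.max(axis=1) >= threshold  # keep high-confidence rows only
    return X_unlabelled[confident], proba[confident].argmax(axis=1)

# Usage sketch, reusing names from the earlier training code:
# X_conf, y_conf = pseudo_label(clf, X_unlabelled)
# clf.fit(scipy.sparse.vstack([X_train, X_conf]),
#         numpy.concatenate([y_train, y_conf]))
```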
Very interesting. The next question I see is from Abraham; he is asking: with such a tool, do you still use some manual analysis? If so, when, and how do you know when to use manual analysis versus fully trusting the tool? Do you miss bugs, and what ratio do you see?

Kirtika, shall I take it? Yeah, sure. Okay. Precision and recall are the scores we currently rely on, and the model is doing quite well, but as we said, we want to highlight that we are using this to reduce the time an engineer has to spend, not to completely replace the engineer. In some cases, even though the model gives its initial analysis or prediction, the engineer might still have to spend some time checking whether the category should be different; even when something is categorized as product, which means an actual bug, it might not really be a bug. So some manual analysis is still needed, and we take all such manually corrected rows, where an engineer says the predicted value is wrong and sets it to the right value, and feed them back to the model so that the metrics and performance can keep improving. Coming to the other part of the question, whether it misses bugs: so far we are seeing that the categorization for product is good; it is around 35 to 40 percent of the data. A few intermittent issues could slip through and be categorized as environment or automation, but apart from that, I think it is not missing most of the bugs.

Great. I will take one last question, as we are nearly out of time. The other questions are mostly about internal details, and the most common one I see is: is this shareable? Do you have plans to open-source it, or is it already open source?

It is not currently open-sourced. We were actually planning to do that before the presentation, but we were on a time crunch, so we are planning to open-source it soon, with all the documentation required for the setup. So yes, it will probably be open-sourced soon.

That is really great news to hear. All right, thank you very much, Anuradha and Kirtika, for your time and for such an interesting talk.