Welcome everyone to this session, "What To Do When Tests Fail," by Tarun Narula and Sandeep Yadav, and we are glad they could join us today.

I am Tarun. I am working as a manager at Naukri.com. At least everyone from India will know about Naukri.com; for those who don't, it is India's number one job site and is part of Info Edge, the parent company which also owns other sites like 99acres and Jeevansathi. On a personal front, I have worked on Selenium since 2012 and have also worked on API and app automation.

Hi everyone, I am Sandeep Yadav. I am working as a lead testing analyst at Naukri. I have overall experience of around six years in the manual and automation field, and I have worked on both web-based and mobile-based projects.

Okay, so again, welcome to our session, where we will be talking about what to do when tests fail. We have given our intro already, so I am Tarun and with me is Sandeep. We will look at some common causes of test failures and how to prevent them. We will also go over some of the best practices for functional automation. Lastly, we will cover some techniques for getting to know about failing tests at the earliest, and discuss what we can do to fix them faster. Okay, so let's get started. Over to you, Sandeep.

Thanks, Tarun. So let's discuss in detail the problem statement of test failures and their impact. Here we will discuss what a test failure is and what a flaky test is. Let's first discuss test failure: if any test does not perform as per the requirement due to the occurrence of a defect, it results in a test case failure. A flaky test, on the other hand, is a test that both passes and fails periodically without any changes in your code. There could be many reasons due to which tests can be flaky, some of which are concurrency, dynamic content, and infrastructure issues. We will look into these in depth in further slides.

Now let's talk about the impact of test failures. Test failure analysis is a time-taking process, so if the number of test failures is high, there is a probability that failure analysis is not done every single time. This ultimately reduces the stability of the product and also has a negative impact on the trust in automation. These tests can also prove to be quite costly, since they often require engineers to re-trigger an entire build on CI and waste a lot of time waiting for a new build to complete successfully. Test failures impact not only the feature being developed but also the time, money, and trust related to it: they increase the test creation cost and the test execution cost, and they impact the business, as there are frequent delays in releases due to these failed test cases.

Test flakiness impacts not only small companies but also the big ones. Let's consider the example of Google, as you can see in this image. There was a publication released by Google in which they collected a large sample of internal test results over a month and uncovered some interesting insights: around 84% of their transitions from pass to fail were due to flaky tests, and only 1.23% of tests ever found a real breakage. Almost 16% of their 4.2 million tests have some level of flakiness, and they spent around 2 to 16% of their compute resources just re-running these flaky tests.
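To make that definition concrete, here is a minimal sketch of what a flaky test often looks like in practice. Everything in it is hypothetical (the URL and the locator are made up), assuming Java with Selenium and TestNG: a fixed sleep stands in for a real synchronization point, so the same test passes on a fast run and fails on a slow one, with no change in the code.

    // Hypothetical sketch of a flaky test: it passes or fails depending on
    // how fast the page happens to load, with no change in app or test code.
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.testng.Assert;
    import org.testng.annotations.Test;

    public class FlakyTestExample {

        @Test
        public void searchResultsShouldAppear() throws InterruptedException {
            WebDriver driver = new ChromeDriver();
            try {
                driver.get("https://www.example.com/search?q=selenium"); // made-up URL
                // Flaky part: a fixed sleep assumes results always load in 2 seconds.
                // On a slow network the assertion fails; on a fast one it passes.
                Thread.sleep(2000);
                Assert.assertFalse(driver.findElements(By.cssSelector(".result")).isEmpty(),
                        "Expected at least one search result");
            } finally {
                driver.quit();
            }
        }
    }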
Now let's discuss some real issues which can get ignored due to flaky tests. The first is the application not behaving properly, which actually leads to the test case failure. Suppose there is a deployment-related issue in one out of two app servers hosting a page. If the request gets directed to the first server, the page loads properly; but if the request goes to the second server, the user sees a 500 Internal Server Error page. This might get caught by our automation script, but there is a high chance that the issue will be ignored if we assume the test to be flaky.

Moving further, sometimes issues related to APIs can also get ignored due to flaky tests. For example, due to an unexpected database change, an API sometimes returns a different response than the expected one, which results in false assert conditions and makes your test case fail. Another case is when your back-end server is very slow or takes a long time to process the API request: it may lead to a request timeout, a 408, which will also result in a test case failure.

Now, because of lack of time, we sometimes ignore test run failures. As you can see in this image, because of limited time we start ignoring test run failures. Automation testers also may not have enough time for coding, which results in the code being unstable, and when they execute that code it ends up with lots of test case failures. This leads to rising dissatisfaction among them and impacts their productivity. And as a result of so many automated tests failing, we start testing the application manually and stop making use of automation, which basically defeats the entire purpose of doing automation. Right, so thanks, Sandeep.

Okay, so now let's see some causes behind why tests fail and discuss how to prevent them from failing. First in the list is the most obvious one: not keeping our automation updated. There are new features being developed all the time, right, and they need to be added into automation. Apart from that, the existing cases need to be updated according to the changed flows. In this example that you see in front, we have a new widget which gets added on the Naukri PWA homepage. Apart from adding this flow to our Selenium test suite, we will also need to run the existing cases to see if they need to be modified or not.

Next, failures due to browser upgrades. Suppose you're running a Chrome version, say 81 or 83, and it runs fine: all of your cases get executed and no issues are faced. But one fine day your browser gets upgraded to a newer version. This needs to be taken care of in our automation, otherwise it might start failing. A common exception that you see in this case is SessionNotCreatedException. What you can do here is update your driver version, for example the version of your ChromeDriver or GeckoDriver. These driver versions are tied to specific browser versions, like we see here for Chrome, where each Chrome version has a corresponding ChromeDriver; and similarly for Firefox, there is a minimum recommended Firefox version mentioned for each GeckoDriver.
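If your framework allows adding a dependency, one common way to avoid this class of failures is the open-source WebDriverManager library, which resolves and caches a driver binary matching the locally installed browser at runtime. A minimal sketch, assuming the io.github.bonigarcia:webdrivermanager dependency is on the classpath (this is our illustration, not something shown on the slides):

    // Minimal sketch using the open-source WebDriverManager library to
    // resolve a driver binary matching the installed browser, instead of
    // hard-coding a chromedriver path that breaks when Chrome auto-updates.
    import io.github.bonigarcia.wdm.WebDriverManager;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;

    public class DriverFactory {

        public static WebDriver createChromeDriver() {
            // Detects the installed Chrome version, downloads/caches the
            // matching ChromeDriver, and sets the driver system property.
            WebDriverManager.chromedriver().setup();
            return new ChromeDriver();
        }
    }

With this, a browser auto-update no longer strands your suite on an outdated driver binary.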
Next in line are failures due to poorly written locators, and first we are talking about not using IDs. This might happen due to a framework-level restriction which mandates using only XPaths, or one particular locator type, to keep the framework generic. Finding elements by ID is both fast and stable, so if you have one, you should always use it and build your framework accordingly. Like this Amazon page we see here: it has an ID for the email input box, so we should use that. But practically not all elements have an ID, correct? So if possible you can ask your dev team to add them; it is easy to do, and it's a good practice to assign IDs to elements. Make use of name, CSS, or XPath locators only in case getting the IDs added is not feasible or practical in that scenario.

Next, we are talking again about poorly written locators, but this time about absolute XPaths. In my opinion it is one of the worst things you can do to your tests. Let's take an example for that. (Sorry to interrupt. Yeah, done.) So let's take an example: the image you see here is of Naukri's job search result page, where I have searched for some Selenium WebDriver jobs. Here, each job listing is represented by an article tag in the HTML DOM. Let's try finding an XPath for the first job that appears on the page. The absolute XPath you see on the screen starts at the root html node and goes down till the article node, so without doubt it can easily break in case any element is inserted anywhere in between. This is very fragile.

We should always prefer relative XPaths over absolute ones, as they are more robust. Let's have a look at some examples. The first relative XPath is one suggested by a common browser add-on that we can use to find XPaths. It is good, but you can see that it can still break, as it has references to intermediate elements between the root and our article node. So let's try to improve it ourselves. The second relative XPath makes use of only the article node and will point to the first job on the page even if the search happens on a different keyword than "Selenium WebDriver". The learning here is not to use absolute XPaths at all, and also to cross-check whether the XPath generated by any tool is good enough or not.

Okay, so now let's talk about some common mistakes that we make while automating. First is not covering all the application states. In our example here we see two snapshots of the Naukri home page. The page on the right looks identical to the left, apart from this "Selenium WebDriver" job search that I just did, which is appearing on the home page as a recent search. In case I don't consider this scenario while coding my tests, some of them might pass but others might fail. The reason behind that can be XPaths changing for sections like "jobs by top brands" or "jobs by domain": some cases might pass, for which the XPath is still correct, but others might start failing. So you should always have good knowledge of all the flows and possible conditions of your app, and you should write your scripts accordingly.

The next common mistake, due to which tests fail or appear flaky, is writing dependent tests. It is just like playing with domino blocks that you have lined up: one failure is bound to impact them all. So instead, try writing independent tests, to make your scripts more robust and less prone to failures.
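As a small illustration of that domino problem, here is a hedged TestNG sketch with hypothetical class and flow names: instead of chaining tests with dependsOnMethods, each test builds its own state in @BeforeMethod, so one failure cannot knock over the rest.

    // Hedged sketch: dependent vs independent tests in TestNG (names are hypothetical).
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;
    import org.testng.annotations.AfterMethod;
    import org.testng.annotations.BeforeMethod;
    import org.testng.annotations.Test;

    public class IndependentTestsExample {

        private WebDriver driver;

        // Anti-pattern: if login() fails, applyForJob() is skipped too,
        // and one failure knocks over the whole domino line.
        //   @Test public void login() { ... }
        //   @Test(dependsOnMethods = "login") public void applyForJob() { ... }

        // Instead, give every test its own fresh state.
        @BeforeMethod
        public void setUp() {
            driver = new ChromeDriver();
            driver.get("https://www.example.com/login"); // made-up URL
            // ... log in with a dedicated test user here ...
        }

        @Test
        public void applyForJob() {
            // ... exercises only the job-apply flow ...
        }

        @Test
        public void updateProfile() {
            // ... exercises only the profile flow, unaffected by applyForJob ...
        }

        @AfterMethod
        public void tearDown() {
            driver.quit();
        }
    }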
Third is not using waits properly. Waits are necessary to prevent failures, especially when the network is slow, the server is responding slowly, or there are differences between browsers. There is a famous meme on browsers that fits our topic pretty well. On a side note, do try out the Edge browser from Microsoft; it is definitely better than IE and is based on Chromium. Let's move ahead. A major reason for tests being flaky is using sleep instead of wait. You should avoid sleep, because you can never be sure when the page is going to finish loading and be in the expected state. A sleep will result in your test passing sometimes and failing at other times, again making your test appear flaky. Instead, you should use an implicit wait to set a timeout at the driver level, or make use of explicit waits to check for a particular condition to become true. Using these waits correctly will help you achieve stable test results. (A small wait sketch appears later, in the section on fixing failures faster.)

Okay, so tests can also fail due to differences in execution environments. You should always keep in mind the environment you are going to run your tests on. For example, your test or staging environments can be slow due to different hardware configurations, or they can even be faster, so you should definitely make use of waits here. Similarly, the automation execution infra where you run your automation can have slower machines than your laptop, and network conditions also might be different. Issues due to network latency can especially be true if you are executing cases on a cloud testing provider like BrowserStack, Sauce Labs, and so on. BrowserStack has a good article on their blog about this; you can go through it for more understanding of how latency impacts cloud testing.

Moving to failures due to parallel execution. It is no doubt a great feature to have, everyone will agree, right? But you need to remember some points. First, the hardware configuration of machines: as we discussed, they might be slower, and there is always a limit up to which you will be able to run your tests concurrently, beyond which your tests might actually start failing more. You have to consider that hardware limit. Software configurations can also impact your tests, such as different browser versions, different OSes, or maybe different Java versions from your local system, or even across the grid machines. So try having the same hardware and software configurations across all your machines.

Another very important point is that you need to take care of data sharing between tests running in parallel. For example, your tests might be using the same user credentials across tests. This is fine when tests run sequentially, but once you are running tests concurrently, you need to check whether your application supports multiple sessions at the same time or not. Another case can be race conditions, where multiple tests are making changes on the same page and validating the entered info at the same time. These tests will start failing and again appear flaky.
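Here is a minimal sketch of two habits that help with this, assuming TestNG-style parallel execution and hypothetical names: one WebDriver per worker thread via ThreadLocal, and unique test data per thread so concurrent tests do not collide on the same account.

    // Minimal sketch for parallel-safe runs: one WebDriver per thread, and
    // unique test data per thread so concurrent tests don't share one account.
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.chrome.ChromeDriver;

    public class ParallelSafety {

        // Each worker thread gets its own driver instance.
        private static final ThreadLocal<WebDriver> DRIVER =
                ThreadLocal.withInitial(ChromeDriver::new);

        public static WebDriver driver() {
            return DRIVER.get();
        }

        // Unique credentials per thread and run avoid two tests logging in
        // and editing the same profile at the same time.
        public static String uniqueUserEmail() {
            return "autouser+" + Thread.currentThread().getId()
                    + "-" + System.currentTimeMillis() + "@example.com";
        }

        public static void quitDriver() {
            DRIVER.get().quit();
            DRIVER.remove();
        }
    }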
Okay, so we have looked at some common causes of test failures and how we can prevent them. Now we will be talking about some of the best practices that will help us reduce test failures even further.

The first best practice is knowing how Selenium works, in and out. There are some great sites available on the internet, and you can always go through those, but I highly recommend going through the official documentation at selenium.dev. Also, you can have a look at the Selenium code on GitHub for a more detailed understanding of any Selenium class or function that you are using. And in case you think you are facing a Selenium-level bug, you can look for any existing issues under GitHub issues.

This next one is pretty common: you should always use source control. It will help you manage better, especially when multiple people are working on the same code base. For that, you can make use of Git or SVN.

Next, we are talking about knowing what to automate and where to automate. Let's first see what to automate. Generally speaking, the more repetitive the test, the higher the probability that it is a good candidate for automation. Look at business-critical parts, tests requiring runs on a wide variety of data, or flows that are tedious to do manually, and that too repetitively: you can automate those. Also, remember that tests are not the only candidates; you can always think of generating test data for manual testing, so that automation helps out manual testing by creating data quickly. These are also great candidates for your automation.

Coming to where to automate: here we see on the screen the famous testing pyramid, which shows us that unit tests are fast and UI tests are slow. Similarly, unit tests have a lower cost attached to them, in terms of early detection and the lesser effort required for fixing, when compared to UI tests. So our aim should be to have a good chunk of our tests at the unit or service/API levels. I recommend learning more about this topic on martinfowler.com or alisterbscott.com; those are great learning resources for this, and in general too.

Another important practice is not jumping to the coding part directly. Understand more about your app's functionality and think about what will be the best way to design your scripts. As the image here says, you only get out what you put in; don't expect more until you do more, and that is very true in the case of automation.

Next, we are talking about retrying our failed tests. I'm not sure that this is a best practice as such, but it can still be useful in some scenarios, such as when your job fails due to an environment setup issue or some similar condition. The first option is retrying failed tests immediately, as and when they fail; for this you can make use of IRetryAnalyzer if you are using TestNG. Or you can rerun all your failed tests at once by using testng-failed.xml. For Cucumber, a similar rerun option is available. Or you can go ahead and implement a custom solution based on your needs. One that we have tried out is shown here: we have our Jenkins job X, which is made up of Selenium suites. Once this job X finishes, it triggers a child job that we call Rebuild Checker. The purpose of the Rebuild Checker job is to check whether job X's failures exceeded a defined threshold percentage, say 95 or 100 percent, as can happen when the environment config goes wrong or something like that. If the failure rate is below the threshold, we do nothing; otherwise, we open the URL for job X again using Selenium and click on Rebuild. Here we have used a threshold percentage for this example, but basically you can play around with this and build a custom solution for rerunning your tests as per your requirement.
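For reference, a minimal sketch of the first option, assuming TestNG (the class name RetryOnce is hypothetical). You attach it to a test with @Test(retryAnalyzer = RetryOnce.class):

    // Minimal IRetryAnalyzer sketch for TestNG: retry a failed test once.
    import org.testng.IRetryAnalyzer;
    import org.testng.ITestResult;

    public class RetryOnce implements IRetryAnalyzer {

        private static final int MAX_RETRIES = 1; // retry only once
        private int attempts = 0;

        @Override
        public boolean retry(ITestResult result) {
            // Returning true tells TestNG to re-run the failed test.
            if (attempts < MAX_RETRIES) {
                attempts++;
                return true;
            }
            return false;
        }
    }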
Great, so we have reached our next section. Over to Sandeep. Thank you so much, Tarun, and thank you for sharing the causes of test failures and how to prevent them. So guys, in this section we will discuss those techniques and tools which report immediately, at their interface, any condition that is likely to indicate a failure.

Before moving further, let's first discuss what static code analysis is. It is basically a method of debugging by examining source code before a program is run, and it's done by analyzing the code against a set, or multiple sets, of coding rules. We are discussing here some static code analysis tools which can be used based on your requirement. The first one is PMD. What PMD does is identify potential problems, mainly things like duplicate code, unused variables, empty catch blocks, unnecessary object creation, and so on. As you can see in this image, the lower one shows duplicate code with 40 warnings from one analysis; this comes from the PMD tool.

The next tool is Checkstyle. It analyzes source code and looks to improve the coding standard by traversing the abstract syntax tree it generates from the source. It verifies the source code for coding conventions like headers, imports, whitespace, formatting, etc. In the same image you can see Checkstyle reporting 158 warnings, of which 9 are new ones and 45 are fixed ones. There are some other static code analysis tools as well that you can pick based on your requirement and technology stack, like PVS-Studio. So static code analysis tools help with failing fast by flagging problematic code early and turning it into code which is good to use.
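To make this concrete, here is a small hypothetical Java snippet of our own containing the kinds of findings PMD typically reports:

    // Hypothetical snippet showing the kind of code PMD typically flags.
    import java.io.FileInputStream;
    import java.io.IOException;

    public class PmdFindingsExample {

        public void readConfig(String path) {
            int unused = 42; // flagged: unused local variable
            try {
                FileInputStream in = new FileInputStream(path);
                in.close();
            } catch (IOException e) {
                // flagged: empty catch block silently swallows the failure
            }
            String s = new String("config"); // flagged: unnecessary object creation
        }
    }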
Now moving further, we will discuss running tests on under-development builds. Automation should be started as early as possible and run as often as needed. The earlier you start automation in the lifecycle, the better it will be for your project: it will help you catch issues early, and additionally you will get to know of any impact on your existing automation scripts. Bugs detected early are a lot cheaper to fix than those discovered later in the development lifecycle. Let me share an example: in our organization we have a set of nightly builds which get executed every night. The entire automation suite is run on the new build, and it finds failures at an early stage. We can also easily decide whether we want to start automating against that build or wait for a stable one.

Moving further, we are discussing triggering Slack notifications with the failure details. I hope everyone knows about Slack; still, let me tell you that Slack is a communication platform used in many organizations, where you and your team can ask questions, share updates, and stay in the loop. You can share notifications of test case failures on a Slack channel by following these steps. First, you have to add the Jenkins CI service in Slack; the first image is the icon of the Jenkins CI service which you add in Slack. Then you have to add the Slack Notification plugin in Jenkins. And then you configure the global Slack notifier settings in your Jenkins job according to your need. Now you can trigger a notification on Slack; it depends on you whether you want to trigger it on job pass, fail, or skipped-case conditions.

Now let's see the image. What is happening in this image is that an admin runs a job through Jenkins which consists of high-priority, or you can say premium, test cases. After the completion of the job, Jenkins finds out the status and sends it to the Slack channel. Using this, it becomes easier for the stakeholders, and for you, to know the updated status of these test cases, whether they passed or failed. In the last image you can see the notification received on Slack with the failure and success counts.

Similarly, if you don't want to trigger Slack notifications using Jenkins, you can simply integrate incoming webhooks in your test class. What are incoming webhooks? They are a simple way to post messages from an external source into your Slack. As you can see, the first image here is of incoming webhooks. Here is a code snippet which you can use in your test class to get a notification on Slack in case of any test case failure or pass; it depends on you whether you want the notification on failure or on pass. After adding the incoming webhook in your Slack, it will create a webhook URL which you can use in your code. As you can see in the second image, it is setting RestAssured.baseURI equal to the webhook URL. You can easily create the webhook by just going to your Slack portal and adding an incoming webhook to the channel. The last image shows the notification which I receive in case of failure, with the exception in it. You can customize this kind of notification as per your need, with whatever text you want.
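Since the snippet itself is only on the slide, here is a hedged reconstruction of the idea: posting a failure message to a Slack incoming webhook with REST Assured. The webhook URL below is a placeholder; use the one Slack generates for your channel.

    // Minimal sketch: post a test-failure message to a Slack incoming
    // webhook using REST Assured.
    import static io.restassured.RestAssured.given;

    public class SlackNotifier {

        private static final String WEBHOOK_URL =
                "https://hooks.slack.com/services/XXX/YYY/ZZZ"; // placeholder

        public static void notifyFailure(String testName, String error) {
            String payload = String.format(
                    "{\"text\":\"Test failed: %s\\nException: %s\"}", testName, error);
            given()
                    .contentType("application/json")
                    .body(payload)
            .when()
                    .post(WEBHOOK_URL)
            .then()
                    .statusCode(200); // Slack returns 200 OK on success
        }
    }

You would call notifyFailure from your failure handling, for example a TestNG listener's onTestFailure hook.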
Now, like Slack, you can also trigger a mail notification for high-priority test cases in case they fail. For this, you can just use the JavaMail JAR in your project, or instead add it as a dependency in your pom.xml file. Here is a code snippet which I have pasted; it is used for sending the mail, where you have to pass the subject, the message, and the recipient email IDs. Along with this code snippet you also have to add SMTP authentication, which I have not shown here. So whenever a test case fails, it will notify the stakeholders, or whomever you want to send the email to, and it will be easy for them to figure out which test case failed at an early stage.

Similar to the email and Slack notifications discussed earlier, you can also implement an SMS notification service for test case failures. For SMS notifications you must have an SMS gateway service provider, which is essential for sending the SMS. There are many SMS service providers, a few of which are listed here in the slide, like ValueFirst, Way2SMS, and Textlocal. In our organization we are using the ValueFirst SMS service provider for sharing the status of user profiles or the applied status of jobs with our users, so we can use the same service in our scripts easily. Here is a code snippet which I am using. (Hello, am I audible? Yeah. Sorry, there was an issue on my line. Can you move the slide back for me please? Yeah.) So here I was saying that here is a code snippet which you can use for sending the SMS: you have to pass the phone numbers and a customized message to wherever you want to send your SMS.

So till now we discussed the fail-faster part. Now let's discuss how fast we can fix these failed test cases. In this category, the first item is understanding some common error messages. It is good to have some knowledge of Selenium exceptions in advance, so that it becomes easy for you to find the root cause of a test case failure and fix it quickly. In this slide we focus on some of the most common exceptions in Selenium. The first one is StaleElementReferenceException, which basically occurs when the referenced element is no longer present in the DOM. For fixing this, we can re-create the object instance in case of any page changes. The second exception is SessionNotFoundException, which occurs when the WebDriver performs an action immediately after quitting the browser. For this, you have to make sure that no driver action code comes after quitting it. The third exception I am discussing here is TimeoutException. It occurs when a command did not complete in enough time, that is, the element did not display within the specified time. It can be fixed by increasing the implicit wait time. There are some other exceptions, like NoSuchElementException, NoSuchFrameException, NoAlertPresentException, and NoSuchWindowException. These exceptions may be due to a slow network, so for these you can make use of explicit waits.
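This is the wait sketch promised earlier: a minimal example of an explicit wait in place of a fixed sleep, with a hypothetical locator. (The constructor shown takes seconds, as in Selenium 3; in Selenium 4 it takes a java.time.Duration instead.)

    // Minimal sketch: an explicit wait instead of a fixed sleep, to avoid
    // TimeoutException and NoSuchElementException on slow pages or networks.
    import org.openqa.selenium.By;
    import org.openqa.selenium.WebDriver;
    import org.openqa.selenium.WebElement;
    import org.openqa.selenium.support.ui.ExpectedConditions;
    import org.openqa.selenium.support.ui.WebDriverWait;

    public class WaitExample {

        public WebElement waitForFirstJobListing(WebDriver driver) {
            // Polls for up to 10 seconds until the condition is true,
            // instead of sleeping for a fixed, hoped-for duration.
            WebDriverWait wait = new WebDriverWait(driver, 10);
            return wait.until(
                    ExpectedConditions.visibilityOfElementLocated(By.cssSelector("article")));
        }
    }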
Now moving further: visualization of test execution over time. It is beneficial to record details of all your test executions and build a dashboard over them for analyzing the execution data over a period of time. This will provide you insights such as which cases tend to fail more or are flaky, what the common exceptions are due to which your scripts fail, and how long your test cases took to execute. Let's take an example: there is a test which took only a few seconds to finish in one run and nearly a minute longer in the next run. Technically both runs passed, with green status, but the duration of the long-running test can be an indicator that something was not correct. You can make use of this kind of dashboard to figure out such issues. You can build this visualization of your test executions using tools like Kibana and Elasticsearch. There are a few links mentioned in the slide where you can go for more details: the first link is to an Elastic blog post, and the second is to a Selenium Conference 2018 talk based on visualization.

Next, follow coding standards, so that your code can be easily maintained and anyone can understand it at any point of time. Why does this need occur? Because multiple people might be working on the code in parallel, and without standards they might not understand it.

Next is reporting for automation testing. Reporting basically not only makes you aware of the status of your automation runs but also helps you in finding out the root cause of your failures. There are many reporting frameworks which you can use, like the TestNG report. Here is a simple image which I posted: it's a TestNG report, and what it basically shows us is the number of test cases passed and failed, with their execution times.

The next one is Extent Reports. As compared to the TestNG report, Extent Reports has more features and capabilities: it is easier to read the exceptions and the details about the execution. It's a customized HTML report, and the UI is better than the TestNG report. It can be easily integrated with the TestNG, JUnit, and Cucumber frameworks. Similar to Extent Reports, there is one more report you can use, which is the Allure report. It's also an open-source framework which integrates easily with TestNG and JUnit, and it provides some additional annotations which you can use, like @Severity, @Step, @Attachment, and @Link.

The next thing is better logging. By using logging, you will be able to save a lot of debugging time, and it will also help in maintaining the consistency of the code. As shown in this image, there is an ExtentTest method which is used to create logs in the class, and in the report, the second image, you can see the logs, which can be helpful for debugging the test cases whenever you need.

The next thing is using screenshots and video recording. If you take a screenshot, it will be helpful for you to figure out where a failed test case failed. Here's a code snippet: it shows how to create a reference via the TakesScreenshot interface; you can then call the method to capture a screenshot, copy it to the location where you want to save it, and call this method in case of a test case failure.
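The exact snippet is on the slide, so here is a hedged sketch of the same idea: cast the driver to TakesScreenshot, capture the screenshot as a file, and copy it to a folder of your choice (the folder name here is made up). You would call this from your failure handling, for example an @AfterMethod or a TestNG listener.

    // Minimal sketch: capture a screenshot on failure via TakesScreenshot.
    import java.io.File;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardCopyOption;
    import org.openqa.selenium.OutputType;
    import org.openqa.selenium.TakesScreenshot;
    import org.openqa.selenium.WebDriver;

    public class ScreenshotUtil {

        public static void captureOnFailure(WebDriver driver, String testName) {
            // Cast the driver to TakesScreenshot and grab the screenshot as a file.
            File src = ((TakesScreenshot) driver).getScreenshotAs(OutputType.FILE);
            Path dest = Paths.get("screenshots", testName + ".png"); // made-up folder
            try {
                Files.createDirectories(dest.getParent());
                Files.copy(src.toPath(), dest, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                System.err.println("Could not save screenshot: " + e.getMessage());
            }
        }
    }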
The next thing we are discussing here is recording. Using screenshots for your Selenium tests when they fail is good, but there are some limitations: for example, from a snapshot it is hard to know exactly what happened before the failure occurred. Capturing video of your test execution helps you track the defect easily in case any failure occurs, as it shows what went wrong during your test execution. For recording you can use two tools: the first one is FFmpeg and the second one is ATU Test Recorder. The links are provided in the slides, so you can go through them to know more about these video recording tools.

Okay, so we are almost done and the time is also up, so let's quickly recap what we talked about. We discussed test failures and their impact, and the causes of test failures and how to prevent them, along with some of the best practices. We also discussed how to know about failing tests at the earliest and what we can do to fix them faster. Hoping that this session was useful and has given you a good overview of what to do when tests fail. That is it from us; we can take some questions and answers if there is still some time.

Yeah, so thank you, Tarun and Sandeep. Since we have only two questions, I think we can take them. Sure. Okay, so just give me one second. The first question is: how can we fetch the browser console logs in case of any exception? Okay, so there is a utility that is available; you can integrate DevTools as well, or you can wait for Selenium 4. I guess there is a way to fetch all the DevTools info from there, so you can make use of that. Exactly. Thank you.

Okay, and one more question: let's say while running the test suite, a couple of tests fail. When should we retry, immediately or just after completion of the entire suite, and what should the retry count be? Okay, so this is a very common question that many people will have. Retrying, as I said, might not be the best practice to go after, but it might be useful in some scenarios, like if you have some config failures or environment failures, or cases where you feel that something went wrong for the entire job; there, rerunning at the suite level can be a good option. Rerunning at the test level you can think of only in case you're sure that the tests are not flaky as such, because otherwise you will ignore some of the real app issues when you retry. For example, as Sandeep discussed earlier, if your app gives a 500 the first time and the test works fine when you retry it, the test will pass, but your app is still not okay. Yeah, that makes sense.

Just one more thing: the question also asked what the retry count should be. Basically, it depends on your execution. If you set the retry count to one and the test then works correctly, the original failure was maybe due to an application or environment problem, and it is totally up to you how many times you want to retry. Right, it is purely practical. For our cases, what we do is retry only once, and only for the tests that we are sure are not flaky. If a test is okay after one retry, then it is good; otherwise we don't retry further. So it depends on you.

Okay, so we had only those two questions. So thank you, Sandeep and Tarun, for sharing your experience today. Thank you, and sorry for the technical disturbances that we had on the call. Thank you everyone for attending.