So welcome, everyone, to The Joy of Green Builds, presented by Rajdeep Verma. Thank you, thank you Sejal, and thanks everyone for joining me here today. So before starting, let's set some expectations. The topic today is the joy of green builds, and I assume that everyone who has joined — 44 people now — has at some point in your life written an automated test using, let's say, Selenium or Appium or whatever tool you are using. I also assume that you have run those tests on your CI server, and that when you ran them there, you have seen them fail. And by failing I mean you have seen red builds, and you have felt the pain of red builds.

Before jumping onto the slides — I don't want to jump in yet, because once I'm on the slides I won't be able to see you in the chat window — can you tell me the reasons why a build goes red, or why a test fails? Anyone? Hello? Yeah, I think I lost you for a while. But yeah, can you tell me the reasons why you see red builds, why tests fail? What could be the top reasons? I'm watching the discussion window. So: a bug, flakiness due to app changes, unexpected interruptions, test data, dependencies, loading issues, flaky tests, locator changes, data, network issues, load time, environment issues — yeah, that's a good one — timeout issues, flaky. So I heard this word flakiness a lot here, and then network and environment issues as well, and somewhere I also found locators. All right, that gives me good data, and from your answers I can assume you are all well versed with what a flaky test is and the various reasons for it.

So obviously there are many reasons for red builds. The major one is flaky tests due to things like timeouts and poor locator choices. I think this is freezing a bit. — We are able to hear and see you properly, so no worries about that. — Okay, sorry, my window actually froze. — That's okay. — Yeah, so let's continue. Things like timeouts, poor locator choices, et cetera — these are valid points. But in this talk I'm not going to talk about them at all, because there are various other talks happening in other tracks which specifically target what to do when a test fails for reasons like poor locators and other flaky reasons. They will suggest how to build robust locators and how to use implicit waits — basically all the things that are within our boundary and can be fixed, I will not talk about.

What I'll talk about in this talk is mostly tests failing for reasons which are out of our control. Some of the reasons mentioned by you are environment issues and infrastructure issues. What do we do when such things happen? For example, what happens if you are running a test and the browser crashes, or the Android or iOS device you are running on loses its connection? How do we avoid red builds in such cases? That will be the main part of this talk.

So let's jump into the slides. I hope you can see my slides — Sejal, can you quickly say yes or no? — Yes, Rajdeep. — All right, perfect. So before jumping in, let's refresh some maths. If I have a coin here — (Rajdeep, please turn on your camera as well. — Oh, sorry. — You can place it on the right side. — Oh, okay. — Yeah, good to go.) — okay, so if I have a coin and I toss it. Yeah.
What is the probability that I'll get tails on top? Can someone tell me? It's very easy. Yeah, 50%. Someone said 50%, that's perfect. The way we reach that is: a coin has two sides, a tail and a head, and out of the two, the probability of getting any one outcome is one out of two, which is 0.5 — in other words, 50%.

Now let's take a slightly more difficult one. What if I toss two coins together? What is the probability of getting tails on both coins? The reasoning is very similar. If I toss two coins together, there are four possibilities: two tails, two heads, head–tail, and tail–head. So the probability of getting two tails, which is one occurrence out of four, is one out of four, which is 0.25. I asked about two coins; I could ask what happens if I throw 10 coins. Probability theory has a very good rule for that, the product rule: the probability of two independent events happening together is equal to the product of their individual probabilities. So if I toss one coin, the probability of getting tails is 0.5, and the same for the other coin; if I toss them together, the product rule says the probability is 0.25.

Now let's apply it to our tests. Imagine I have a test which is 99% stable — that means it passes 99 times out of 100 runs — and I've got a few more such tests, let's say five in total. If I run all five tests together, what are the chances that I will get a green build, as in all the tests pass? Using the product rule, it is 0.99 raised to the power 5, which is about 0.95. In other words, 95% of the time you will get a green build, which is amazing. However, you will never live with five tests. Over time, your tests will grow from 5 to 50 to 500 to thousands. And what happens when they grow? With five tests you get a green build 95% of the time. For 50 it reduces to about 60%. For 100 it reduces to 36%. And what if you have tests in the magnitude of, say, 1,000 or 2,000? With 1,400 tests you get approximately zero chance of a green build — percentage-wise, it's essentially zero.

And this person is me. My name is Rajdeep, and I've been doing testing for more than 10 years now — actually 12 years. I am a contributor to some open source tools like Calabash and Appium; if you happen to use the Espresso driver in Appium, you are probably using a lot of my code there. I work at a company called Bumble in the UK, so currently I'm sitting in London — it's 7 o'clock in the morning here. At Bumble we have two main products, the Badoo and Bumble applications. Our mission is to offer the most equitable, inclusive and empowering way to connect people, and we have users all around the world. My responsibility at Bumble is to keep our tests fast and robust, and in this talk I will share the approach we follow to deliver on that responsibility.

This is how today's agenda looks. First, we will talk about UI tests at Bumble and the main challenges with UI tests, which are speed and reliability. Then I'll talk about parallel testing approaches, and then we'll jump into queues, workers and jobs, test failure segregation, and taking the right decision. At the end I will show you a demonstration, and then we'll have some time for questions and answers. So first, let's talk about UI tests, and about speed and reliability.
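(As a quick aside: the product-rule numbers above are easy to reproduce with a few lines of Ruby. This snippet is purely illustrative and is not from the slides; the 0.99 per-test stability is the same assumption used in the talk.)

```ruby
# Illustrative only: reproduce the product-rule numbers from the talk.
# Assumes every test passes independently 99% of the time.
STABILITY = 0.99

[5, 50, 100, 500, 1_400].each do |n|
  p_green = STABILITY**n
  puts format('%5d tests -> P(green build) = %.6f', n, p_green)
end
# => 5 tests ~0.95, 50 ~0.61, 100 ~0.37, 1400 ~0.0000008
```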
So while the topic is the joy of green builds, I'll actually talk about the joy of quick green builds. How do you get quick green builds? The answer is: delete all your tests. That's it, that's the best way, and we all should do it. And with this advice, I conclude my talk. Thank you. Obviously, I was kidding — don't do it. I might have to speak for 45 minutes here. And by the way, don't take my advice seriously: you might get carried away, delete all your tests and get yourself fired, and I don't take any responsibility for that. So, anyway.

Again, talking about the challenges of speed and reliability: we have our test pyramid and it tells us what to do. This is the ideal test pyramid, and the expectation for UI tests is that there should be as few of them as possible. The reality, in many cases, is different. Instead we get this kind of pyramid, which I call the garbage pyramid, because most of the time it's a heap full of UI tests. Our situation at Bumble is not very different: we have 1,400 tests as of now, and as I showed you in the first slides, 1,400 tests means almost zero probability of getting a green build, even with 99% stable tests.

While I'm not particularly proud of our 1,400 tests, there are reasons for everything. There is legacy code which was not designed in a way that lets us write unit tests, and we have only just started introducing integration tests — there's still a lot of work to be done there. So 1,400 tests are a bit too much for the UI layer, but they help us prevent bugs, and for that very reason we are not going to delete them any time soon. We actually once thought about it: we didn't run them for some time and we felt completely helpless. So we have to use them, and they are very useful, until we grow the other layers of the pyramid. But if we keep all these tests, there are two major challenges to deal with, and they are speed and reliability.

Over the years we have applied various approaches, tools and techniques to our test automation, and ultimately we have managed some very comfortable metrics. Here they are: from git push to tests finished is approximately 20 minutes, and our builds are frequently green. I'm not claiming they are always green — sometimes they are red — but going from a probability of essentially zero to getting green builds multiple times is a big achievement for us. And when the builds are green, there's a sense of joy and a sense of confidence among our developers as well: they actually go and look at the test logs to check that their code hasn't broken anything. So a green build is very important for us in that sense.

Here are the toolsets we use. We automate both web and mobile. For web we use Selenium; for mobile we use Calabash and Appium; and we use Cucumber for BDD — to be honest, we use it simply as a testing tool: we write scenarios and it's the mechanism for structuring our tests. We use Ruby as the programming language, and on top of it we have our own test runner called Parallel Cucumber, a queue-based test distribution system which is a bit smarter and a bit more configurable than traditional test runners.

A little comparison between web and mobile: Selenium tests are a bit faster and more stable for us, and the reason is that Selenium is more mature than Appium.
Selenium probably has fewer bugs than Appium, and Selenium has fewer variables in the chain. In the case of Selenium you have, say, your client code and then ChromeDriver, and that's it. In the case of Appium you have client code which talks to a Node.js server, which talks to a driver, which talks to an HTTP server running inside your mobile device. There are a lot of links in that chain, so there is a lot of extra chance of something going wrong with Appium, and the chain also makes things a bit slower. That's why we get more test failures on the mobile platforms — Android and iOS — than on web, and in the examples I show you I will talk mostly about mobile devices.

So let's get back to the point of speed and stability. Both can be achieved using parallel testing. You might think that parallelisation only helps us achieve speed, but in fact it can also be used to achieve stability, and we'll see how. Four years ago we started building our Android farm with real devices; currently we use an emulator farm, and in one build — which is 1,400 tests — they run on almost 200 emulators in parallel. We have a lot of heavy Linux machines, each with 30 emulators, at our disposal, so devices are not a problem. With this in mind, let's continue with the further examples.

What we did four years ago was simple division. For example, say you have 15 tests and three devices — Alpha, Bravo, Charlie — you can divide the tests equally among them, five tests each. Now assume Alpha got five long-running tests and Bravo got five short ones. What happens is Bravo finishes first, a few minutes later Charlie finishes, and a few minutes after that Alpha finishes — so Bravo's and Charlie's time is wasted. This mechanism was achieved using a gem called parallel_calabash, which is now deprecated, for good reason, because there are better options which we have since developed. The advantages: there was an improvement in runtime, and if one device becomes faulty, at most five tests are lost, which is better than losing all the tests when running on a single thread — so there is a small stability component there, nothing very exciting. The downside is the waste of resources: two devices sit idle, and whoever finishes quickly stays idle, which is very sad. And still no green builds, because even if only five tests fail, the build is not green. Even if one test fails, it's red.

So how do we get green builds then? The answer most of you have pulled out of your toolchain at some point in your life: rerun. Just rerun, and it magically solves the problem of flaky tests. But is it as helpful as it seems? Well, sometimes you do get green builds because of it, but it is also problematic — let's see. Rerun helps us get green builds. However, if you look at it from the speed side, genuine failures are run twice. For example, if you have a bug in the application for which a test is failing, and you blindly rerun that failure, it will definitely fail again — so you have run a test that had to fail anyway two times, which is not good.
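Concretely, the "just rerun" step that many pipelines bolt on is little more than this — a rough sketch, not our actual pipeline; run_tests and report are made-up helper names:

```ruby
# Hypothetical helpers: run_tests returns the list of failed tests,
# report publishes the final result. This is the whole "strategy".
def run_with_blind_rerun(tests)
  failed = run_tests(tests)                       # first pass
  failed = run_tests(failed) unless failed.empty? # blind second pass
  report(failed)                                  # green if the retry "fixed" it
end
# A genuine product bug burns the runtime twice and still fails,
# and if the whole environment is down the rerun fails as well.
```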
And if the infrastructure is bad, there is no point in rerunning either. Say there is a network issue today, some service is down, and you are running all the tests — in our case, 1,400 of them — and all of them fail because of the network issue. We don't know when the network will be bad; it could be a flaky network. So say out of 1,400 tests, 200 fail, and then we rerun, and out of those 200, 100 more fail. There is no point, because 100 failures are a lot — nobody will go and check 100 failures. If they are of the magnitude of five or six failures, people are still okay to check them. So there is no point in rerunning if the infrastructure is bad.

But are these the only two problems with rerun? It seems not. Rerun has a bigger problem: rerun swallows bugs. How? Imagine you have an intermittent bug in your application — for example, on Android you get a crash which is not always reproducible and only happens in certain circumstances. You run the test, and in the first run the application crashes. You just rerun it, and this time it passes, because it's an intermittent thing. What has happened is you have ended up swallowing a bug, which goes unreported — and that is even more dangerous than red builds.

So how do we solve this problem? The answer is iteration two: workers and queues. Just to explain what workers and queues are, I'll give you an example. Imagine you have six tests and you put them in a queue. It's very similar to a supermarket: when you want to buy a product you stand in a queue to check out, and there are people doing the billing at the counters — those people can be called workers. As soon as a worker is free, the next person in the queue goes to check out, and that next person is like a test. So imagine a queue of six tests, and three workers: Alpha, Bravo, Charlie. In our case the workers can be real devices, or browsers, or threads which run the tests. In a queue-based system, each worker pops a test from the queue, so they get one test each. After some time, say Bravo finishes its test and there's an outcome — passed — so it gets a new test, which also passes, and it gets another one. Alpha finishes its test and gets another one; Charlie in the meantime finishes its test; Alpha's test passes. So it's not a simple division: the tests are popped from the queue on a need basis.

There are some popular test distribution mechanisms that work this way. If you are using Java with JUnit, the Maven Surefire plugin distributes tests in a queue-based fashion. TestNG has out-of-the-box support for parallel testing, which internally works on similar queues. And if you are using Java and Appium, there is a tool called Appium Test Distribution, built specifically for mobile by Sai and Srinivasan, who actually ran the workshop yesterday. They might be around here — you can catch them if you are interested in this.
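Stripped to its bones, that kind of queue-based distribution is roughly the following — a minimal in-process sketch for illustration only, using Ruby's built-in Queue rather than any of the tools just mentioned:

```ruby
# Minimal illustration of queue-based distribution.
# Real systems use Redis, Surefire forks, TestNG threads, etc.
tests = Queue.new
%w[test_1 test_2 test_3 test_4 test_5 test_6].each { |t| tests << t }

workers = %w[alpha bravo charlie].map do |name|
  Thread.new do
    loop do
      test = begin
        tests.pop(true)             # non-blocking pop of the next test
      rescue ThreadError
        break                       # queue is empty, this worker is done
      end
      puts "#{name} runs #{test}"   # stand-in for actually executing it
    end
  end
end

workers.each(&:join)
```

The key property is that a fast worker simply pops more tests instead of sitting idle.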
Even though these tools exist, we still don't use them, because of a fundamental problem with the way they distribute tests. In a plain queue-based system, the queue doesn't care what happens in the infrastructure — the queue and the workers are independent of your test run. Let's see the issue. You've got six tests, three devices, one test on each device. Say Bravo's test finishes and it asks for a new one, test number four. While test four is executing, Bravo gets sick in the meantime — say its memory is full. So test four fails. As soon as there's an outcome, the queue pops a new test, test five, which also fails, because Bravo's memory is still full. The queue doesn't know anything about the outcome beyond "I gave it a test, it failed or passed", so it gives out another test, which will also definitely fail. What's happening is that Bravo is swallowing all the tests and making them fail, and that's a very big problem with the traditional queue-based systems I described earlier. By the way, this is not just a problem with queues — even with simple division you have exactly the same problem. The thing is, queues can actually help us solve it. How?

Say we've got six tests and three devices here: Alpha, Bravo, Charlie. Each of them gets a test. Bravo's test finishes, Bravo gets a new one, and then Bravo gets sick, so test four fails. What we expect is that test four should go back into the queue. And that's where we need a little bit of smartness: there should be some kind of communication between our queue-based system and our tests, and I'll talk about that later in my demo as well. What should happen next is that once we detect Bravo is sick, Bravo is taken out of the equation — killed permanently, so it cannot take more tests. Then say Alpha finishes its test and gets a new one; the test on Alpha is green, the test on Charlie is also green, Charlie gets a new test, Alpha gets a new test, and both of them pass. Now, if you observe the picture, all of the tests are green. What we lost is a device, which we killed. Even though that happened — a test had failed along the way — because of this smart mechanism, all of our tests are green. Basically, the picture is green. And I'll show you a quick demo of this in real time. Let me exit my slides.

What I have here is a framework with three scenarios — I won't call them tests because they don't actually test anything, they are just scenarios for demonstration purposes. These are scenarios one, two and three. What they do is simply launch the application, type the name of a superhero character, and then count from one to ten. By the way, can someone from the organisers tell me whether you can see my screen or not? I just want to make sure. So we have three scenarios and three devices, and I will run all these scenarios on these three devices. These are Ruby tests and I run them using rake. If you don't know it, rake is basically a build tool like Maven — in Ruby, the counterpart is rake.
And parallel is a rake task, which is like, you could say, a Maven target. Don't worry too much about what this is — the code will be shared, it's already on GitHub. So when I run bundle exec rake parallel, we start a Redis server, which is basically our queue — Redis is an in-memory store we use as a queue-based database. Then three workers are started, and you can see the three workers get assigned tests, so the tests should start shortly. Something is very slow about them, give me a moment. Okay, there you go. It shouldn't be this slow — maybe the machine is overloaded right now. Yeah, indeed, it's just taking a lot of time to do things here.

So you see here they are saying "must pass, must pass". You see Shaktimaan here — I kill Shaktimaan's emulator, so I expect the scenario related to Shaktimaan to fail. The other ones, Batman and Hulk, have passed. And I actually killed Shaktimaan — Shaktimaan is a superhero who was doing his job and I killed him. What I expect here is: because I simulated a condition where the test was running and I killed the emulator underneath it, you can see the test has actually been re-queued and restarted on another emulator. Now let's do one more naughty thing: I kill this one as well. Let's say our infrastructure is failing, flaky. What I'm simulating is that we lost the connection to the emulator while the test was running, before we had an outcome. The same situation could be your device memory filling up, or the device losing its Wi-Fi connection, or a flaky device cable — that happens quite a lot with real devices. And you see Shaktimaan has reincarnated — he is a superhero who cannot be killed, that's why he's my favourite character. (Sorry to interrupt you, Rajdeep, but we have only 10 minutes left. — Yeah, that's fine, it's almost done.) So even after killing the device twice, you will see all my tests have passed. It reports three tests passed, but in reality there were five runs for those three tests: on worker zero a test failed once, and on worker two it failed once again — failed, got re-queued; failed again, got re-queued. What this shows is that this build could have gone red, but because of our smart queue-based system, it became green.

The way it works is this. Every device we have goes through a health check, which is a custom script that checks whether the device has enough memory, has an internet connection, is reachable, and a number of other things. It can be written by the person using the system, and it's provided in the form of a hook. If the worker is healthy, it dequeues a test and runs it. The test can have two outcomes, pass or fail. If the test passes, the pass is reported and we go back to the health check; if the health check passes, we take another test, and we keep following this loop. However, if the test fails, we check whether it failed because of an infrastructure issue or some flaky third-party issue — we have to do reason segregation there. If it didn't fail for any such reason, we go back to the health check and take another test as usual. But if it failed because of a faulty device, our test framework has a mechanism to communicate to the health check that "I failed, and I suspect it's because the device is not reachable". In that case, the next health check will mark the worker as unhealthy, and an unhealthy worker gets killed — that device is taken out of the equation. The whole per-worker loop looks roughly like the sketch below.
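(A rough sketch only. The helper names — healthy?, kill_worker, run, report, mark_unhealthy — are illustrative, not Parallel Cucumber's real API, and InfraError is the custom exception class explained in a moment.)

```ruby
# One worker's life in the "smart" queue: health check before every job,
# reason segregation after a failure, re-queue on infra problems,
# and kill the worker if the device itself is sick.
def worker_loop(worker, queue)
  loop do
    unless healthy?(worker)       # e.g. adb reachable, package manager
      kill_worker(worker)         # answering, enough memory, battery ok
      break                       # device is taken out of the equation
    end

    test = begin
      queue.pop(true)             # take the next test, if any is left
    rescue ThreadError
      break
    end

    begin
      run(test, on: worker)
      report(test, :passed)
    rescue InfraError => e        # infrastructure / device problem:
      queue << test               # give the test another chance elsewhere
      mark_unhealthy(worker, e)   # next health check will kill this worker
    rescue StandardError => e     # genuine failure: do NOT re-queue
      report(test, :failed, e)
    end
  end
end
```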
So that is basically the flow. Yeah, let's move on. All of this is implemented in Parallel Cucumber, which we actively maintain here at Bumble. After switching to such a system, there is basically a massive improvement in runtime. If a worker becomes faulty, it is killed and its test is returned to the queue. There is no re-queueing of genuine failures — if we detect that a test is failing for a valid reason, we don't re-queue it. And there is no waste of healthy resources, because all tests finish at almost the same time. And we get lots of green builds.

How this happens is, as I told you, through communication between the test framework and our queue-based system, and that happens using some hooks. Let me go through them one by one. In the before-workers hook, we can have logic related to sorting of the tests — for example, order them from longest running to shortest, so that when we run them, all workers finish at around the same time. Also, if we want to repeat one test multiple times, we can put it into the queue multiple times, which basically means the same test runs on several workers, and we can analyse flaky tests using that mechanism. Then there is the worker health check, which as I said is a custom script, provided by the user or by the test framework to the queue mechanism — it can be anything, we can write custom code there. Then the job gets executed and the test runs, and in the after-job hook we decide further actions: whether we want to re-queue the test or not, and what else to do. On failure, we can do lots of things. If it's a flaky test, we can possibly re-queue it. If it's a business failure, we should not re-queue it. If the app itself is flaky, we should never re-queue it. And if the infrastructure is flaky, we should re-queue the test, perhaps multiple times — as you saw in my example, the test was re-queued two times.

To do this, we need to know what the reason for the failure was, and for that we need a reason-segregation mechanism, and the reason needs to be communicated back to our smart test runner. That can be done in this way. Usually we have a method called install_app, which does an adb install on the device — I'm talking about Android here; for iOS it would be the equivalent install command. If this step fails, it usually raises a generic runtime error from whatever library is underneath. What we should do instead is check the result: if it includes something like INSTALL_FAILED_INSUFFICIENT_STORAGE, we raise a custom exception, say DeviceStorageError. And this custom exception can be a child class of InfraError, which is also a custom error, which in turn inherits from RuntimeError, the built-in exception. The advantage of doing this is that the exception becomes a good channel of communication — roughly as in the sketch below.
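(A sketch of that idea. The class and method names are illustrative rather than our real code; the only thing taken verbatim from real life is adb's INSTALL_FAILED_INSUFFICIENT_STORAGE message. report, mark_unhealthy and notify_slack are hypothetical reporting helpers.)

```ruby
# Anything under InfraError means "the environment failed, not the product".
class InfraError         < RuntimeError; end
class DeviceStorageError < InfraError;   end
class ThirdPartyError    < InfraError;   end

def install_app(apk, device)
  result = `adb -s #{device} install -r #{apk}`   # shell out to adb
  if result.include?('INSTALL_FAILED_INSUFFICIENT_STORAGE')
    raise DeviceStorageError, "no space left on #{device}"
  end
  raise InfraError, "install failed on #{device}: #{result}" unless result.include?('Success')
end

# The after-job hook then branches on the exception class.
def after_job(test, error, queue, worker)
  case error
  when nil                then report(test, :passed)
  when DeviceStorageError then mark_unhealthy(worker, error); queue << test
  when ThirdPartyError    then notify_slack(error);           queue << test
  when InfraError         then queue << test                  # other infra flakiness
  else                         report(test, :failed, error)   # genuine bug: no re-queue
  end
end
```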
In the after-job hook, we can check which exception caused the scenario to fail. If it's a business reason, we say, okay, it's a genuine failure, we won't re-queue it, and we report the result to QA and the developers, or to some dashboard. If it's a third-party service error, we can detect that: if a third party is down, the failure gives us a specific reason and we raise a third-party exception. Based on that exception we can decide that this third party is down, report it to Slack, and also report it to Kibana to track how flaky it is over time, so that we can present this data to those third-party developers and show them how buggy their service is. And then we re-queue the scenario. And there is the device storage error which we spoke about earlier: if we see such an error, we mark the device as sick so that it's taken out of our ecosystem, and then we re-queue the test.

To summarise, there are some characteristics of smart queue-based test distribution. Re-run infra failures only, and re-run business failures only conditionally. If you are going to design something like this on your own, in whatever language you are using, these are the things to keep in mind. The worker health check is very important, before each job. The setup and teardown worker hooks are needed for things like starting Appium before running your tests. There should also be a facility to terminate the build if failures exceed a threshold: in our case, when we run 1,400 tests and there are more than, say, 40 or 50 failures, we terminate the build right then and there, because that is already too many failures for anyone to handle and we don't want to waste time running all the remaining tests. Sorting tests by weight is also very important: if you start running the longest tests first and the shortest last, you aim to have all the workers finish at almost the same time, which is a further optimisation of resource utilisation. Repeating a single test multiple times is there, obviously, to find out whether a test is flaky or not. And then backup workers: what might happen is that you have very few tests, say only three, and 100 workers, and all three tests happen to run on faulty workers and just never pass. There should be some mechanism for them to move to other workers, so keep some backup workers beyond the ones you have started. That's another optimisation.
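Two of those points — sorting by weight and the failure threshold — are tiny pieces of code in practice. Roughly, and again with made-up helper names:

```ruby
# Sort by "weight": enqueue the longest-running tests first so every
# worker finishes at roughly the same time. history is a hypothetical
# { test_name => seconds } map built from previous runs.
def sorted_by_weight(tests, history)
  tests.sort_by { |t| -history.fetch(t, 0) }   # unknown tests go last
end

# Circuit breaker: once failures pass a threshold nobody will triage,
# finishing the remaining tests is just wasted time.
MAX_FAILURES = 50

def record_failure!(counter)
  counter[:failures] += 1
  abort('Too many failures, terminating the build early') if counter[:failures] > MAX_FAILURES
end
```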
So yeah, that's pretty much it from my side. Thank you very much. These are some links: the Parallel Cucumber repository and my demo repository — you can clone them if you want. And yeah, if you have any questions, you can ask me. Thank you very much. Can you hear me? — Yes, Rajdeep. We have a few questions — can you see them in the Q&A tab? — Q&A... we have to click on the discuss button... yes, yes, I see it. So let's take this one: what parameters define a sick device?

A sick device can be defined on whatever parameters you choose, as I said. For example, you detect that the package manager of an Android device is not responding: the device is still visible in adb devices, but the package manager is not responding, so we can call it sick. It's up to you how you want to define a sick device — if its memory is full, or the battery percentage is low, you can define it as sick. It's a custom check. Another example could be not a device but a machine running a browser under WebDriver — that happens quite often too. The machine on which the browser is running sometimes goes down while the test is running. What happens to the test running on that machine at that time is that it comes back as a failure, and the reason usually communicated is "we could not communicate to the browser". If we detect that a test was not able to communicate to the browser, we can call that a sick device, or a sick browser, and those tests get re-queued. I hope that answers the question. Thank you. Thank you everyone for joining me today.