Hi everyone. I am Sayed, and I work for Eureka. Today we are going to talk about a few operational challenges that become unmanageable when you run a large-scale grid and node combination. We have done a lot of executions on that kind of setup, and over the years we came across many problems, kept finding solutions, and kept improving. Today we will cover a few of our major challenges with logging, and the operational challenges with the node and grid setup.
Let me ask a few questions. How many of you know Selenium Grid? How many of you use Selenium Grid to run your tests? Half of you. How many of you use Selenium Grid extensively, so that you totally rely on it? How many test runs happen per day in your setup? What is the max count, approximately? 300,000? Over the grid? Nice. Okay.
For the few who don't know Selenium Grid, we will just touch on what it is. Since most of you already said you are using it, I just want to go over it briefly. It is effectively a load balancer that distributes tests concurrently across various platforms. The main advantage of Selenium Grid is that you reduce the time it takes to run your test cases: the grid speeds up your tests by load balancing. As he said, 1,000 runs happening a day, all balanced on the Selenium Grid side.
Over the last few years, we have been executing thousands of runs every day to meet the needs of our customers, LOBs and developers. We execute parallel runs and cross-browser testing. We have done a lot of this over the years, and operationally there are many practical challenges with debugging. Developers and LOBs schedule runs overnight through CI tools, and the practical issue is the debugging. What if the run fails? Every developer wants to see their logs.
Operationally, the nodes ideally sit under the control of the operations team, so distributing logs is highly challenging. The very first challenge we encounter is that the clients, the developers, the users who actually execute a test run, never know which node their run went to and on which node it was executed. While it is running you may be able to see it on the grid console, but after execution there is no way to track it. The only tracking point is capturing the Selenium session ID in the developer's logs and passing that session ID to the operations team to get the respective logs. That is the only way.
If developers don't want to capture the Selenium session ID in their logging, the other alternative is for each developer to create a unique run ID, a test ID, and share that ID with the operations team. We also need to make a small change in the Selenium code to capture the run ID and print it in the logs, so that the operations team can search for it and share the results with the developers. That's the other alternative.
That becomes more challenging if you have to support thousands of developers and thousands of test cases. We may get 100 requests a day. And it becomes practically very difficult to handle if you have a log rotation policy, because all of this generates a huge amount of logging. If you have log rotation, purging and archival policies, and a developer comes back after a week asking for their run status or run log, it is very difficult to search for it and parse it. It is very time consuming and resource intensive. It's a practical challenge that we have faced. If they can print the run ID in the logs, that is the only way to get it. But again, that's challenging.
And in order to have this test ID, the unique ID, you need to make a small code change. As we know, we have different capabilities. Before we launch any test, we set the capabilities.
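To make the unique run ID idea concrete, here is a minimal Python sketch of generating one ID per execution and carrying it alongside the usual capabilities. The capability name `testId` and the helper `build_capabilities` are assumptions for illustration only; the talk does not show the actual name or code, and newer W3C-style sessions expect custom capability names to carry a vendor prefix (a name containing a colon).

```python
import uuid


def build_capabilities(browser_name: str, platform: str = "ANY") -> dict:
    """Return a capabilities dict that carries a unique ID for this run.

    'testId' is a hypothetical custom capability name, not a standard one.
    """
    return {
        "browserName": browser_name,
        "platform": platform,
        # A fresh random ID per execution; the developer records this value
        # and hands it to the operations team when asking for logs.
        "testId": f"run-{uuid.uuid4().hex}",
    }


if __name__ == "__main__":
    caps = build_capabilities("chrome")
    # Printing the ID on the client side gives the developer something to quote later.
    print(f"Scheduled run with testId={caps['testId']}")
```

The grid side still has to print that value into its own logs, which is the small code change the speakers mention; that part is not shown here.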
In the same way, we need to have a capability by the name test ID, where you set the unique test ID for every execution. We are displaying the log anyway, and in the log you can uniquely identify the test ID. You can see the highlighted one. And the second challenge we were facing is multiple sessions in the same log.
Is this what you are logging with your own log script, or is it what the Selenium node logs? This is our own logging system, actually. Server log. We have created the code. This is not an actual same number. This is not what the client is logging. Yes, this is a node. Node logs. What are the log files you are talking about? The node logs also come back to where your server is running from. It all depends on how you started your node. If you started your node and if you needed it, the logging is in a different file. The file will be reshared in a shape or something. You will get a log, say a stack trace, to your program. Actually there is a different case of log. Even if you have a driver's test you log in. Let's say a program. So this is a normal user node log.
And the second challenge: multiple sessions, same log. Let's say the client is executing multiple test runs. All executions go into the same log file. You can see in the highlighted part where the client is executing two test runs, and you can see the logs overlap. You cannot tell which request belongs to which execution. Debugging that log is a time-consuming process.
For this we have two ideal solutions. The first one: if you can afford it, dedicate a fresh VM to every execution and run only one node on it, with a single capability. Once the test execution completes, have a process that renames the log file. A unique session ID is generated anyway, so use that session ID as the file name, and save it to your location.
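The rename-and-save step for a finished run could look roughly like the Python sketch below. The function name `archive_node_log`, the `.log` suffix, and the directory layout are assumptions for illustration; the talk does not show the speakers' actual process.

```python
import shutil
import tempfile
from pathlib import Path


def archive_node_log(node_log: Path, session_id: str, shared_dir: Path) -> Path:
    """Move a finished node log into shared_dir, named after the session ID."""
    shared_dir.mkdir(parents=True, exist_ok=True)
    target = shared_dir / f"{session_id}.log"
    if target.exists():
        # Refuse to overwrite: one session ID should map to exactly one log file.
        raise FileExistsError(f"log for session {session_id} is already archived")
    shutil.move(str(node_log), str(target))
    return target


if __name__ == "__main__":
    # Demo with throwaway files; the session ID below is a made-up placeholder.
    with tempfile.TemporaryDirectory() as tmp:
        node_log = Path(tmp) / "node.log"
        node_log.write_text("placeholder log content\n")
        archived = archive_node_log(node_log, "example-session-id", Path(tmp) / "shared")
        print(f"archived to {archived}")
```

Because the node handles only one execution before the rename, every line in the archived file belongs to that one session, which is what removes the overlap problem.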
Then move the log file into a folder that is accessible to developers. The reason for making the log file accessible to developers is this: let's say tomorrow I have an issue with a certain test suite and I want to debug it. Using the test ID, I take the log file, which is saved under the session ID anyway, right? Then I check for the specific timestamp and the specific command.
The next option: say you cannot afford a fresh VM for each test execution and want to host multiple test executions on one VM. You can still have a single session running on each node, restricting the node to only one session. The only difference between these two solutions is that in the first one your log file is named with the unique session ID, and in the second it is named with the test ID. It all depends on how you log your execution log file. And the same thing applies: store the log file in a folder that is accessible to developers. To track issues, we have a common folder system where developers can access the log files.
And this is the third challenge. Selenium executor logs by default print only the time, not the date. Agree? Yeah, you can see it on the left-hand side. You can see only the timestamp, and you can see the request and response coming from the Selenium node. Right? So for this, we enhanced the logging system. We included the date and time. You can see the date here; the timestamp Selenium provides anyway. And you can see the request being executed on the Selenium node and the response coming back from the node. The main advantage of this enhanced logging is better readability when you track and trace issues.
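As a small illustration of why the date matters, the Python sketch below formats the same two events with a time-only prefix and with a date-and-time prefix. This is not the speakers' change to Selenium's logging; the function names and the message text are placeholders.

```python
from datetime import datetime


def time_only(when: datetime, message: str) -> str:
    """Prefix a message with the time of day only."""
    return f"{when:%H:%M:%S} {message}"


def date_and_time(when: datetime, message: str) -> str:
    """Prefix a message with both the date and the time."""
    return f"{when:%Y-%m-%d %H:%M:%S} {message}"


if __name__ == "__main__":
    # Two events at the same time of day, one day apart.
    monday = datetime(2015, 9, 7, 23, 15, 0)
    tuesday = datetime(2015, 9, 8, 23, 15, 0)

    # Time-only entries are indistinguishable once logs from several days are combined.
    print(time_only(monday, "request received"))    # 23:15:00 request received
    print(time_only(tuesday, "request received"))   # 23:15:00 request received

    # With the date included, each entry can be placed on the right day.
    print(date_and_time(monday, "request received"))   # 2015-09-07 23:15:00 request received
    print(date_and_time(tuesday, "request received"))  # 2015-09-08 23:15:00 request received
```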
Again, as the operations lead, we get requests every day from developers or LOBs saying that their test completed but there are no screenshots, or that tests are taking much longer than usual, or that tests sometimes closed abruptly, aborted suddenly, without any stack trace. And sometimes they have other issues: it's performing slowly, moving much, much slower than usual. When this is escalated to the operations team, they have a hard time understanding what exactly happened, or is happening, during the execution of that particular test run. Unless they have strong system-level or node-level metrics, it is very difficult for them to debug it. They need to know the system memory, the system CPU, the JVM memory, how the IE driver is performing, how the Chrome driver is performing, how many IEs are launched. Even if you restrict a node to one session, there might be a session that already closed abruptly and left the IE driver open as a process in the task manager. So until the operations team has this information, it is very difficult for them to debug and come back with an answer.
So we have to have a mechanism to capture this in real time, with frequent polls. What is the system CPU, system memory and JVM memory? How much is the JVM taking? How much memory and CPU are the IE drivers and browsers taking? Similarly, from the node end, we can capture the command, target, values, request and response, the capture time, the node name, the node OS, and other attributes of the system and XMLs, and persist everything in the database. Now we have everything in the database.
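A minimal Python sketch of that idea follows, using SQLite as a stand-in database. The table names, column names, and the `sampler` callback are all assumptions for illustration; the talk does not describe the speakers' schema, and a real sampler would read CPU, memory, JVM and browser-process figures from the node itself.

```python
import sqlite3
import time
from typing import Callable, Dict, Optional

# Hypothetical schema: one table for polled node metrics, one for executed commands.
SCHEMA = """
CREATE TABLE IF NOT EXISTS node_metrics (
    sampled_at REAL NOT NULL, node_name TEXT NOT NULL,
    cpu_percent REAL, memory_percent REAL, jvm_heap_mb REAL, browser_processes INTEGER
);
CREATE TABLE IF NOT EXISTS commands (
    executed_at REAL NOT NULL, node_name TEXT NOT NULL, test_id TEXT NOT NULL,
    command TEXT NOT NULL, target TEXT, value TEXT, duration_ms REAL
);
"""


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the metrics database and make sure both tables exist."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_sample(conn: sqlite3.Connection, node_name: str, sample: Dict[str, float],
                  sampled_at: Optional[float] = None) -> None:
    """Persist one metrics sample; keys missing from the sample are stored as NULL."""
    conn.execute(
        "INSERT INTO node_metrics VALUES (?, ?, ?, ?, ?, ?)",
        (time.time() if sampled_at is None else sampled_at, node_name,
         sample.get("cpu_percent"), sample.get("memory_percent"),
         sample.get("jvm_heap_mb"), sample.get("browser_processes")),
    )
    conn.commit()


def record_command(conn: sqlite3.Connection, node_name: str, test_id: str, command: str,
                   target: str, value: str, duration_ms: float,
                   executed_at: Optional[float] = None) -> None:
    """Persist one executed command together with its test ID and duration."""
    conn.execute(
        "INSERT INTO commands VALUES (?, ?, ?, ?, ?, ?, ?)",
        (time.time() if executed_at is None else executed_at, node_name, test_id,
         command, target, value, duration_ms),
    )
    conn.commit()


def poll(conn: sqlite3.Connection, node_name: str, sampler: Callable[[], Dict[str, float]],
         interval_s: float, count: int) -> None:
    """Call the sampler 'count' times, 'interval_s' seconds apart, storing each result."""
    for _ in range(count):
        record_sample(conn, node_name, sampler())
        time.sleep(interval_s)


def metrics_near(conn: sqlite3.Connection, node_name: str, executed_at: float):
    """Return the metrics row on this node sampled closest in time to a command."""
    return conn.execute(
        "SELECT sampled_at, cpu_percent, memory_percent, jvm_heap_mb, browser_processes "
        "FROM node_metrics WHERE node_name = ? ORDER BY ABS(sampled_at - ?) LIMIT 1",
        (node_name, executed_at),
    ).fetchone()


if __name__ == "__main__":
    store = open_store()
    # Placeholder sampler returning fixed values, only to exercise the polling loop.
    poll(store, "node-1", lambda: {"cpu_percent": 12.5, "memory_percent": 40.0}, 0.0, 3)
    record_command(store, "node-1", "run-demo", "click", "id=submit", "", 120.0)
    print(metrics_near(store, "node-1", time.time()))
```

Keeping commands and metrics in separate tables keyed by node name and time is one simple way to answer "what was the node doing when this command ran" after the raw logs have been purged.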
So even if we don't have the logs tomorrow, the ops team can correlate the data with what exactly happened during your execution, and they can see graphs or a pictorial representation of the data in reports.
I just want to add to Ragu: specifically with the IE driver, even though you use driver.quit, sometimes you see a Windows pop-up asking whether you really want to close this program. Have you ever faced this issue? Right, right, usually, right. In that situation it leaves the thread open in the task manager. If you schedule another IE session, the session will be invoked and the session ID will be created, but nothing will progress there. It's not only a modal, it's a crash message. IE crashed. You can see the message that the executable has, you can see the message, and at the bottom you will see the Close program option, right? So I'm not wrong. If you run heavy. Do you really want to close the program? It's not that kind of prompt. Close the program, I mean, I said it's the Close program option. The process. The IE driver process.
Similarly, there are any number of other parameters in the test cases, right? While executing, if Java memory has already reached its peak and is not responsive, that has to be tracked. If Chrome is already hung and you schedule another IE, that may cause further hangs. So suppose you have all this information collected through metrics and stored in the database. Now we can run analytics to see what exactly happened during that time. We can give this data to the developers for their own debugging, and the operations team can relax for some time.
Open this. Say I scheduled a run like this, something like searching for a book or a CD on Amazon, and executed it. The run happened one week back. Now a developer comes back from vacation and asks the operations team, what happened to my run? It took longer, or I didn't get the screenshot, or it failed.
Then they can immediately open the report and see which command was actually executed, when it was executed, and how long it took. And if they click on any of the commands, it gives them the other system metrics, like how much CPU and memory was in use, and other driver or JVM information during that time. It is very helpful if you do extensive executions. This information helps you tell the difference between AUT failures and system failures. In many situations it may be an AUT issue: the AUT crashed, or the AUT threw a 500 error. These issues are easy to debug if you know whether the problem is with the AUT or with the system.
And these are the list of files where, as I said, we have enhanced the logging system, right? These are the list of files. And the last file is there to include the date format. We have our own specific date format which is appended to the timestamp.
There is further analysis too, like if we know which browser is available. If I'm scheduling a run, we can throw a message to the user saying, okay, there are already 10 IE 8 runs sitting there, why don't you pick Chrome? If you pick something that is unavailable right now, it may take 36 minutes for your run to be picked up if you schedule it. Why don't you pick another available browser? This can be calculated if you have control on the execution side. How many nodes, how many browsers or how many capabilities are free at this point?
Yes, please. That's good. But if you want to build a generic framework, any developer, you don't need to bother about the execution in there. Like how the commands are transformed. If at all you want to build a keyword driven where Excel. Let me raise my question. My question is what is law for the law? Yes. So basically said, if you don't use me, a client standard doesn't give me the date. Yes.
So I'm guessing that the guy who actually plays the role in this thing, whatever is available, is on the case in the smart matter. That's correct. So this means, so basically being able to find it in more than one jar is only the last part. And then it's been all alone. Yes, yes, we agree. That would be another alternative, but we have tried this option. That's true. That's correct. If at all you're asking, it allows you to do that. That's correct. But is it only at runtime or is it post-execution? No. Runtime, right?
My operations concern is post-execution, after one week. We launch at the first. Logs get rotated. What do we need to do? That's my main concern. That's the main frustration from my operations point of view. Developers schedule their runs through CI, and they run every day. They come back after a week and ask, where is my log? They query it. They do whatever they want to do. But there's no log, because the log is already purged. It's a huge log: with thousands and thousands of runs a day, it goes into GBs. We have to purge it every day. That's where we came up with this other solution, to pull only the required information out of the entire GBs of log. Maybe you need at least 120 lines. We take the information as command, value, request and response, and persist it in the database. In the database, maybe I have four lines. That will do, right? That's all you need as a developer to see where exactly it failed. You don't need to look through all 10,000 lines. This is more practical. In my case, developers won't sit in front of their systems, execute runs and debug them. Everyone, right? So this gives you a permanent solution for post-execution debugging.
Okay. So the third question I had was this: you guys collect a lot of information about the JVM, the memory being used and all of that, right? Earlier today there was another session which was talking about self-healing capabilities of the grid. I mean, within my company...
Can you speak a little bit louder? I wasn't able to... So earlier today there was another session on something called Selenium Grid Extras, right? Which talks about building some sort of auto-healing abilities into the grid. I come from PayPal, and I have also worked on building some self-healing capabilities into the grid. So I'm just trying to understand: if a person has a mechanism where, after N number of runs, they recycle a node or recycle the VM itself, do you think all of these monitoring systems that you have...
This is the second point I was talking about, right? Right. That's where we want to go. No, I mean, you were showing some statistics about the JVM usage and all of that, right? So eventually you probably get to this point. Do you think all of that information would still be required?
Still required, yes. Suppose you do session recycling after maybe 100 sessions, right? If you run a heavy browser-load application, 100 sessions within one minute is enough to crash IE browsers and enough to reach the JVM peak. It depends on how much load you put on the node. It is not about how frequently you recycle it. What matters is how much load the node or the JVM is carrying at the end. And again, this is for debugging purposes. Does the JVM really matter? Basically it lets developers think in separate directions. Should I worry about my AUT changes? Should I worry about my script changes? Should I worry about an environmental issue? If it is an environmental issue, let me open the system. It's their issue. That's what it gives them.
Thank you. Any other questions? Okay. Should we wrap up? Thank you. Thank you, everyone.