And yes, so let's get started with the session. The session is about the WebDriver connector for Botium, and she's going to talk about it as a tool for testing conversational UIs. So over to you — the stage is yours. Thank you. Let me share my screen. How many of you have heard about Tay? Tay was a Microsoft bot released on Twitter in March 2016. It was designed to have conversations with Twitter users and learn to mimic a human by copying their speech patterns. It was supposed to engage with people between 18 and 24 years of age. So let's see what she learned and how she performed. She started with a very sweet hello-world message, and within a few minutes she became a fan of Hitler. Some time after that, she also wanted to teach Mexico a lesson by building a wall, and soon after that we saw her turn racist. So what actually happened? Enough damage done — she was shut down within 16 hours of her launch. And trust me, I've handpicked a few of the nicer tweets; you can search online for more of what happened. Next, we'll look at a CNN news bot designed to send you news updates. It does not seem to unsubscribe the user: the user requested to unsubscribe, the bot said yes, I did, but the next day it continued to send the alerts. So what failed? Did it not understand? No, it did understand — it said "I unsubscribed you" — but there was probably some integration failure in the back end. Another example: this bot was designed to give weather information for a particular location. Does it perform well? It does not. In the first half, it did not understand what "umbrella" means, and in the second half, when asked how the weather looks in Brooklyn, New York this weekend, it failed to understand what a weekend is. Straightforward test cases, aren't they? So now, let us understand how these are different from other apps.
How are they different, and why do they behave differently? They're self-learning systems. Most bots are built using natural language processing, machine learning and deep learning algorithms, and these systems are under constant training and improvement. So the tests that you write today, with their expected outcomes, may change in the next iteration. Having a non-deterministic component in the system under test makes testing useless as soon as you are not able to tell whether a failure is a defect or an improvement. And when using a bot — whether it's a voice bot or a text interface, a chatbot — there is no interaction barrier for users, compared to a traditional web or mobile app. A UI has predefined interactions: the flow, the links, the buttons, the text, the forms are all guiding you through the conversation to get your job done. But with bots, it is really unpredictable, and the user is free to give any input and start from anywhere, so the bot has to handle this gracefully. Add to that the non-deterministic user inputs: human language, unique ways of texting, emojis — even today we have apps where you can create personalized emojis. Think about a scenario where I communicate with a bot using my personalized emoji: how does it understand what I'm trying to say, or what my mood is? And there are tons of possibilities around phrases, jargon, typos and local languages. So test coverage is really challenging here. 100% test coverage? I don't think so. Next, this is very specific to voice bots, like Alexa: there are at least seven and a half billion humans out there.
And when I say that, that means there are seven and a half billion different voices and voice textures. For a UI, it doesn't matter who fills in the form, who gives the input, who calls the API or clicks the button. But for a voice bot, it does matter whose voice is in action. So before I jump to how to strategize this testing, I want to look at how a bot works in a nutshell, so that I understand the different components of the system — maybe that will help me strategize even better. So let's have a look. This is, in a nutshell, the very high-level abstract architecture that any bot would have. There is a user input; it may come through any platform, either voice or text. In the case of voice, speech-to-text libraries are used to convert the voice to text, and we are heavily dependent on them. Once the input is received, it is given to the NLP layer; the NLP layer processes the input and may talk to databases, internal APIs, external APIs and so on. Now, what does the NLP layer do? NLP is essentially a way to convert the user's speech or text into structured data, which is then used to choose a relevant answer: how should I respond, and what is the user's intent? To do this, it tokenizes — it breaks the input into a series of words, and each piece represents a different value to the application. Next, it does sentiment analysis: it studies the user's experience and transfers the inquiry to a human whenever necessary. Let's say the user is very upset with the service, or very angry — the bot should not keep the conversation going the way it usually does, but direct it to a sales rep or a human so that it can be handled properly.
Otherwise, it might annoy or irritate the user even more. Then comes normalization, where it tries to find typos and spelling mistakes and corrects them, without altering the intent of what the user is trying to say. Then entity recognition: it tries to find, among those words, the categories of words — the name of a particular product or place, an address, a phone number, a zip code, whatever information the bot is designed to look for and process. It also searches for subjects, verbs, objects, common phrases and nouns, so that it can find related phrases and better understand what the user is intending to do. So let's also understand this with a quick example and learn a few terminologies, because I'm going to use all of them in the later part. In this example, let us say I'm interacting with a bot which helps me with airline bookings, and I want to book tickets from Mumbai to London — that is my input. This entire sentence is called an utterance. The bot then tries to figure out the intent: what is the intent here? Booking tickets is the user's intent. It also extracts the entities: Mumbai and London are the entities. So now I have everything I need. If I want an analogy to understand it better: the bot is trying to work out which method it needs to call and what parameters that method requires, and whether it can get all of those parameters from this sentence. So if you compare it to a method, bookTicket might be the method, taking two parameters, source and destination. That is what the bot is trying to extract — and yes, it gets what it needs. So what do you think? How should we test this? How should we strategize testing around this?
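The parse described above can be sketched as a structured result. This is an illustrative shape only, not the exact output format of any particular NLP engine:

```json
{
  "utterance": "Book tickets from Mumbai to London",
  "intent": "book_tickets",
  "confidence": 0.94,
  "entities": [
    { "type": "source", "value": "Mumbai" },
    { "type": "destination", "value": "London" }
  ]
}
```

The intent maps to the method in the analogy, and the entities are its parameters; if an entity is missing, the bot has to ask a follow-up question to fill it in.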
So now I welcome you all to the session where I'm going to discuss exactly this: how I strategize using Botium, how Botium helps in testing these conversational AIs, how we can automate, and how we can decide whether a specific build can be taken to production or not. I'll take a minute to introduce myself. I'm working with ThoughtWorks as a senior QA consultant, and I'm passionate about testing, working on solutions and strategizing things around testing. As part of my QA journey, I got an opportunity to work on and strategize testing for a chatbot app, and that is the experience and learning I'm here to share with you all today — I hope it helps you get started. I'll keep it pretty simple, so let's begin straight away. What is Botium? Botium is an open-source set of tools which helps us test bots as well as NLP engines, and we'll see that in a minute. Here, in a nutshell, are the components of Botium. There is Botium Core, which specifically helps you connect to platforms — wherever your bot is available, whether it is Skype, Slack, Telegram, web or WhatsApp; these are all platforms it supports. It has connectors for various NLP engines like Dialogflow, IBM Watson, Microsoft LUIS and so on — there are tons of them. Then there is Botium CLI, a command-line tool to work with all the components I just mentioned, to run tests and generate reports. It also supports various test runners like Mocha, Jasmine, Jest and so on. And there is a UI component called Botium Box, where you can do exactly the same things we do with scripts and the CLI from a user interface. We will see that in a minute. And here is where I'm going to do a demo.
Yes, so this is my directory where I have all my test cases. I'm quickly going to run them using the Botium CLI; Mocha is just a test runner which helps me run them and generate the reports. Let's start the run. In this case, I'm running it against Chrome — and let me tell you, it is not going to pop up any browser; it runs in headless mode. You can see the first test case has already started. What was the first test case? The account balance check; we will look at it in a minute. I'm sure everybody is confused and has no clue what exactly I'm doing, but we'll go through it in a minute. While this runs a couple of tests for me, I'll quickly show you reports from a run I did earlier. These are the test cases I ran, and I see two test cases failing. If I look at the first, it says the "transfer money to savings account" regression failed. The user says $500 is to be transferred to my savings account, and the expected response was along the lines of: alright, so you are transferring $500 from your account to your savings account — is that right? But it failed: the bot says, sure, transfer from which account? So it does not have the account information; something went wrong, and we'll have to see why that happened. The second test case fails where the user is trying to get a picture: the bot did not understand and responded with something other than what the user was expecting. All the other test cases seem to be passing. Okay — confused, not sure what I did? No problem. Let's deep-dive into everything I did, see what I ran, how it worked, and how we can achieve this with any bot. So how does this setup look?
It is a Node application, so you have to have Node installed on your machine, and then you install the Botium CLI — this is the command you can see on the screen. Then you run botium-cli init. This initializes the project and adds two files: one is botium.json, and the second one is a convo file. And then the third command runs the test cases that you have written and displays nice reports which you can use for further analysis. But let's have a look at what the configuration files are doing: what is this botium.json file, and what is its importance? If I have to compare it: when I run my test cases on a particular browser using Selenium, or on a particular device using Appium, I need to give capabilities, and based on those capabilities it understands where it should connect and where the tests need to be executed. In the same way, in the botium.json file I describe in what context the bot runs: where it should run, which bot it should connect to and how, and which platform I want to run against. The next set of files are the convo files — these are your test cases. So let's see how to write these test cases. I can write test cases in simple text, in Notepad — I don't need any IDE for that. If I love programming, I can write them in JavaScript. If I'm comfortable using Excel sheets, I can use an Excel sheet or a CSV file. So let's see how a test case looks. Before that example, I'll quickly show a sample botium.json file. These are just your desired capabilities: project name, container mode, Dialogflow project ID, client email, private key, use-intent flag.
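As a rough sketch, a botium.json for a Dialogflow agent might look like the following. The capability names follow the Botium Dialogflow connector's documentation, but treat the exact keys as assumptions to verify against your version; the values are placeholders for your own service-account credentials:

```json
{
  "botium": {
    "Capabilities": {
      "PROJECTNAME": "Airline Booking Bot",
      "CONTAINERMODE": "dialogflow",
      "DIALOGFLOW_PROJECT_ID": "your-gcp-project-id",
      "DIALOGFLOW_CLIENT_EMAIL": "service-account@your-gcp-project-id.iam.gserviceaccount.com",
      "DIALOGFLOW_PRIVATE_KEY": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
      "DIALOGFLOW_USE_INTENT": true
    }
  }
}
```

Switching the bot under test is then mostly a matter of swapping CONTAINERMODE and the connector-specific keys, while the convo files stay the same.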
So where did I get this information from? You can check with your developers — they can help you with the project ID, client email and private key. If not, I'm going to add a link in my references where you can see exactly how to capture these details for your Dialogflow agent. Now, how do I write a test case? This is a sample test case where I'm just greeting my bot — just a text file written in Notepad. The first line describes the test case name, TC01 greeting, and this name will help you in reporting. The next line is #me, followed by "hi". Everything that follows #me is input given to the bot — the user's input. And all the sentences, or all the utterances — now that we know the terminology — that follow #bot are what you expect back from the bot. So in this test case, the user says hi and expects to hear back: hello, good morning, how can I help you? Now give it a thought: if I run this test case in my CI/CD pipeline for every build, will it pass every time? Hopefully not — if I run the build in the morning, the bot says good morning, but in the afternoon I'd want to hear good afternoon, in the evening good evening, and so on. So this match is going to fail and my test case would be marked as failed, but that is not a real failure, right? How can I fix this? In this case I'll use something called an utterance file: I can put all my utterances in that file and use the utterance file as the input from the user as well as for the bot's expected responses. What this does is substitute in every single utterance placed in the file — and the same goes for the bot side.
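Written out, the greeting convo described above is just this plain-text file (the exact expected sentence is illustrative):

```text
TC01 - Greeting

#me
hi

#bot
Hello, good morning! How can I help you?
```

The first line is the convo name used in reports; #me and #bot sections alternate to describe the conversation turn by turn.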
In this case, there are three inputs in my hello utterance file: hello, hi and hey. And the expectation is still: hello, good morning, how can I help you? What this is doing, if I give an analogy with Selenium, is defining my data sets — this is how I parameterize a test case. I have three different test inputs and I want to run the same test case with each of them, so effectively these are three test cases. And the expectation — you can think of the line in the bot section as an assertion; as it stands, it is a completely hard assertion. How can I improve this further? I'll show you an example. I have an utterance file for the payment due date input, and the bot side also has an utterance file. The input file lists the different utterances I want to use: payment-due-date-input is the name of the file, and the lines after it say things like: when is my payment due on my card, when is the payment due, check the due date for the payment on my card, and so on. And on the expected-output side, I can list all the acceptable responses in another utterance file and reference it in the bot section. So what Botium does is, for each input line — say, when is the payment due on my card — it runs the test case with the user giving that input, and it looks for any matching response from the output file. That way I can cover all the different kinds of responses my bot can give. Clear? So just to summarize: an utterance file in my #me section represents my data sets, similar to parameterization, and an utterance file in my #bot section makes the assertion match any of the responses listed in that file. So let's move further. Next: how do I interact with user interface elements?
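Putting the two together, the data-driven version might look like this: one utterance file per side, referenced by name from the convo. The file names and expected texts are illustrative, and the "File:" separators below are just annotation — in practice each block lives in its own file, with the first line naming the utterance set:

```text
File: hello_input.utterances.txt
HELLO_INPUT
hello
hi
hey

File: greeting_output.utterances.txt
GREETING_OUTPUT
Hello, good morning! How can I help you?
Hello, good afternoon! How can I help you?
Hello, good evening! How can I help you?

File: tc01_greeting.convo.txt
TC01 - Greeting

#me
HELLO_INPUT

#bot
GREETING_OUTPUT
```

Each input line expands into its own test run, and the bot assertion now passes if any of the listed responses matches — which fixes the morning/afternoon/evening failure.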
Some bots will give you certain links and ask you to click on those links or buttons so that you can choose between options, and also certain media files — pictures, emojis, any of that. So how do you test this? Again, in the #me section I might say: send me a small pizza. The bot returns buttons, and the names of the buttons are kids, normal and family; as media, it also gives me pictures of those pizzas. So how do I choose a particular button, or click on that media, link or image? I can use keywords like BUTTON, give the name of the button, and that helps me interact with any of the UI elements my bot presents. Next: so far I have tested particular scenarios, but now I want to run end-to-end conversations. Let us say the intent of the user is to transfer money from account A to account B, or check the balance, or check the due date of a payment — I want to have that conversation end to end, and I'll have those convo files for it. The step after that is running them on different setups: across Slack, Skype, WhatsApp and the web, and — if I'm supporting multiple engines — against Dialogflow, against IBM Watson, just as an example. How do I do that? I write a botium.json file for each and run against it. Either I can set up a local grid — in the previous example, it was a local grid that I ran on — or I can use cloud infrastructure like BrowserStack or Sauce Labs and give my credentials there. I'm just going to share one sample file for local. In this local file I specify which connector I want to run — in this case it is WebdriverIO, because I want to run my test cases against a bot which is running as part of my web application, or maybe against a desktop browser.
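A sketch of such a convo using Botium's UI-element keywords — BUTTONS and MEDIA as asserters in the #bot section, and BUTTON in the #me section to click one. Treat the exact keyword spellings and arguments as assumptions to check against the Botium docs for your version:

```text
TC05 - Order a pizza

#me
send me a small pizza

#bot
BUTTONS kids|normal|family
MEDIA pizza_small.png

#me
BUTTON normal

#bot
Great, one normal pizza coming up!
```

The asserters check that the bot rendered the expected buttons and image, and the BUTTON line simulates the user tapping one of them instead of typing text.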
Or I want to run it across mobile devices — how do I do that? There is another example I can show for particular device combinations. In this case, I'm running it on a Samsung Galaxy S6, and I can keep adding capabilities so it runs on all of those devices. So this is how I can just write my test cases in text files, write my configuration in botium.json files, and that's it — I don't really need to write any code. Switching back: so what's next? I have my tests in place, my CI/CD pipeline is in place, I run my tests on every build. But how do I decide whether I should take this build and ship it to production or not? I have to consider that the models are self-learning, because of which my test cases may fail for various reasons. So I should have good checks in place, and a way to analyze a failing test case and determine whether it reflects an improvement or an actual failure. How do I validate these failures? Let's take an analogy. Let us say I am teaching my three-year-old boy two English alphabet letters: A for Apple and B for Ball. I use these images to teach him, I train him for a week, and he learns — he says he's got it. Now, time to test. I ask: hey son, what is this? He says: mama, this is an apple. I say: what? Apple? No, this is not an apple. Okay, let me check if he at least recognizes an apple correctly. What is this? He says: sorry, I don't understand. What is this? I don't know. So where is the problem? I taught him for a week; he said yes, I learned, I know A for Apple, B for Ball. But when I showed him a picture of a ball, he called it an apple, and a picture of an apple he's not able to recognize at all. So where is the problem?
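A rough botium.json sketch for the WebdriverIO connector, pointed at a chatbot embedded in a web page. The capability names follow the Botium WebdriverIO connector, but verify the exact keys against your version; the URL and element selectors are placeholders:

```json
{
  "botium": {
    "Capabilities": {
      "PROJECTNAME": "Banking Bot - Web",
      "CONTAINERMODE": "webdriverio",
      "WEBDRIVERIO_URL": "https://example.com/chatbot",
      "WEBDRIVERIO_OPTIONS": {
        "desiredCapabilities": {
          "browserName": "chrome",
          "chromeOptions": { "args": ["--headless"] }
        }
      },
      "WEBDRIVERIO_INPUT_ELEMENT": "//input[@id='chat-input']",
      "WEBDRIVERIO_OUTPUT_ELEMENT": "//div[@class='bot-message']"
    }
  }
}
```

For a mobile run, the desiredCapabilities block would instead carry Appium-style fields such as deviceName ("Samsung Galaxy S6") and platformName ("Android"), optionally pointing at a cloud grid like BrowserStack or Sauce Labs with your credentials.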
Is it about his learning ability or memorization? Or do I have to improve my teaching skills? The problem is with my teaching, nothing to do with him. So something needs to be improved for sure — and of course, it is my mistake in this case: I did not train him with enough different pictures, and he learned just "red circle" as an apple, and nothing else. So I'll have to fix this. What I'll do to fix it is collect a variety of data, more and more samples, so he understands how to identify an apple and how to identify a ball. Your bot is just the same: it is only as intelligent as the underlying data. It is not going to do anything out of the box for you, so data matters a lot. So let's see how I can define metrics which can actually point me down the right path: should I improve my training data, or should I look at the model and see whether it has learned something wrong? How do I do that? With a confusion matrix. It looks very confusing, so let's see what it is. The rows here tell you the values predicted by the bot — if I say I want to book a ticket, whether it predicted that correctly or not — and the columns are the actual expected outputs: "book tickets" should map to the booking-tickets intent, not to hello. The cells classify into true positives, false positives, true negatives and false negatives — it might be confusing in the moment, but it is just classifying correct and incorrect predictions and telling me the error types. A true positive tells me the intent was predicted correctly, and a true negative tells me it correctly predicted "not booking".
A false positive tells me it incorrectly identified something as the booking-ticket intent, and a false negative means it incorrectly predicted "not booking". We'll see an example — it is pretty confusing, I know, because even I took a lot of time when I was learning these things. And there are a couple more metrics that we calculate based on the confusion matrix. So first I'll get the confusion matrix and look at it, and then we look at these metrics. Accuracy tells me, out of all the classes, how many were predicted correctly. Precision: out of all the predicted positive classes, how many were correctly identified. Recall: out of all the actual positive classes, how many did we predict correctly — and it should be as high as possible. There is a slight difference between all of these, as you can see in the formulae. And the F-measure balances recall and precision at the same time, taking the harmonic mean instead of the arithmetic mean. But don't worry, you do not have to calculate all of this by hand — we'll see how the tooling does it, and we'll also look at a small example. I've collected a few runs: some nice-looking results and also some ugly-looking results. Let's see. And this is going to be a continuous process for you, every build, just the way you do build sign-offs looking at various metrics even today. Okay, so let's quickly jump in and see. Here I go. This is the Botium Box that I was talking about, the UI component — I wanted a UI where we can visualize things, so for this demo I'm using Botium Box. Everything that we did with the botium.json file — I don't need to write it by hand here; I can simply generate those files in the UI. And it can connect to chatbots built on a lot of bot technologies.
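To make the formulas concrete, here is a small sketch in JavaScript (Botium's own ecosystem is Node.js) computing the four metrics from raw confusion-matrix counts for a single intent. The function name and the example counts are illustrative, not part of any Botium API:

```javascript
// Compute accuracy, precision, recall and F1 for one intent,
// given binary confusion-matrix counts:
//   tp: utterances correctly predicted as this intent
//   fp: other utterances wrongly predicted as this intent
//   fn: utterances of this intent predicted as something else
//   tn: everything else, correctly left alone
function intentMetrics(tp, fp, fn, tn) {
  const accuracy = (tp + tn) / (tp + fp + fn + tn);
  const precision = tp / (tp + fp);
  const recall = tp / (tp + fn);
  // F1 is the harmonic mean of precision and recall,
  // not the arithmetic mean
  const f1 = (2 * precision * recall) / (precision + recall);
  return { accuracy, precision, recall, f1 };
}

// Example: 13 of 15 "book ticket" utterances predicted correctly,
// 1 other utterance wrongly predicted as "book ticket",
// 85 utterances correctly classified as other intents.
const m = intentMetrics(13, 1, 2, 85);
console.log(m); // precision ≈ 0.93, recall ≈ 0.87, f1 ≈ 0.90
```

A perfect diagonal confusion matrix gives 1.0 for all four numbers; the interesting cases are exactly the off-diagonal ones, where precision and recall start to diverge.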
Let me show you quickly — you might be wondering if it only has samples. No, it supports Alexa, Dialogflow, Botpress, Rasa, IBM Watson and so on, and you can choose between them. You can set up all the device combinations you want to run against, go to settings and connect to your providers. You can write all the test cases and view them in text format, you can connect to Git and pull your test cases, and you can view all the results here. Yesterday I was running a couple of samples, so let us first look at a very good-looking sample: how should a confusion matrix actually look? Let us see this example, and I'll explain. If you remember, the rows tell me what was predicted and the columns what was supposed to be predicted. These are the intents, like ask-time. Ask-time had 43 utterances in my test, and it correctly matched all 43 against ask-time only. So if your confusion matrix has this diagonal, with everything green, that means the model has predicted all your intents accurately — there are no false predictions. And if you look at banking-account-balance-check, there were 23 utterances, and all 23 were predicted correctly by the model. So this is all good. I can also see the numbers that we were talking about: the F1 score, precision, recall, accuracy — everything is fine, it's all one. Nothing looks bad here: my confusion matrix looks good, my metrics look good. I also want to see whether there are any risks — and yes, there is something in red. If I quickly click on one and go and see what these are: these are my intents, and these are the prediction confidences.
And I recommend that you have at least 15 to 20 utterances for each intent you're trying to test, so that you get better results in terms of training. You can look at all of these and take a call on whether you want to improve the training or not. Now let us see some really ugly results — I want to show those as well, because that is what is more important to understand. Here is the confusion matrix — you see these outliers? What do they mean? How do I read this? Take banking-account-earning-check: there were 15 utterances, and 13 were identified correctly. But there was one utterance incorrectly identified as banking-account-balance-check, and another utterance incorrectly identified as a different intent altogether. So something went wrong here — we might want to look at the training data; that means I'll have to work on my data. I also look at my accuracy and predictions: I go to this section, and these are pretty low. The F1 score is pretty low, recall and accuracy are pretty, pretty low. So that means I really have to work on my model. Okay? So this was just an example, and after this I want to summarize what we've learned. We'll have to do manual testing for sure — and by manual testing, what matters is test case writing: how do we write test cases to make sure we cover all the paths? We'll have to write test cases around seven different categories. Personality: does the bot have a name, does it introduce itself and tell the user about its capabilities, so the user knows how to interact with it? Intelligence: how does it handle multi-level or multi-step conversations? Because remember, there is no UI — I can give my source first, the dates later, then the number of tickets: multi-level, multi-step conversations.
Also addresses: I can give the pin code first, then my area, then my building name and so on — how does the bot understand this? Write test cases around that. Error management: if the bot is not designed to do something — if I ask Poncho, a weather bot, for the ratings of a particular movie, it does not understand, because it is not designed for that — how does it handle that gracefully? And understanding different utterances, emojis, media — you'll have to have a bunch of test cases on that side too. You'll also have to test speed and accuracy; accuracy matters a lot. Remember that "selenium" has two different meanings based on context: selenium for a chemist means something completely different than for an IT person, so in context, the bot has to understand what I'm talking about. Speed: on a UI I get a loader, but a bot has to make sure I understand it is processing and will get back to me, rather than staying silent. And then navigation: in a UI, I can go back to the previous menu, make changes and come back; in a bot, you do not have that. So if I have to go back in the conversation — I say I want to book tickets, then I say, hey, hold on, I want to check train tickets, not air tickets — does it understand that? Does it switch between train tickets and airline tickets, or does it only keep the later request? These are all the kinds of test cases you have to come up with. There are also a lot of data sets readily available — for banking, retail, healthcare and so on — that you can utilize in your testing strategy, alongside more specific tests for the use cases of your own application. The next thing we have to do is automation testing. We saw Botium: you can take all of these test cases that you have written in text.
You can use the very same test cases and run them on Botium — it runs seamlessly, we saw that. The next step is analyzing the metrics; we saw that as well — which metrics I need to capture, and how I can do the decision-making. And then the last thing is crowd testing. Crowd testing is important because it helps you hand-pick people of different ages, genders and professions, so that you cover a variety of ways of conversing, see how your bot performs, get real data, and gather more references for your training. It also covers different markets and locations, which is an added advantage. And do not forget your roots: API testing, performance testing, security testing, database testing — all of the testing that we do today still applies to bots. There is just an NLP layer sitting in between, but the bot is still interacting with various endpoints, services, databases and whatnot. So make sure you have those test cases too — remember the CNN bot: it did its job, it understood, all good, but somewhere in the back end the integration must have failed to process the unsubscribe, and that's the reason it kept sending the alerts. So I'll stop here — I'm done. I hope this was helpful for you, and that you can get started with testing any NLP-based projects you have.