Hi, my name is George Stark, and I am a Distinguished Engineer and Level 3 Data Scientist certified by the Open Group at IBM Corporation. Maureen, Neeraj, and I put this workshop together for the San Antonio event, and we've edited it based on that experience for this webinar, so we hope you all enjoy. Maureen. Thank you, George. My name is Maureen Norton, and I am the Global Data Scientist Profession Leader at IBM. As George mentioned, we have been working with the Open Group to create the Data Scientist certification that is currently available. This is an experiential workshop that we are trying for the first time in virtual mode, and we are very excited to have you here. So we look forward to this event. Neeraj. Hi, good morning, good evening. My name is Neeraj, and I've been with IBM for over 15 years; currently I look after the data science practice for IBM Cloud client experience. And like Maureen and George mentioned, this is a hands-on workshop that we refined after the San Antonio event, so I'm really looking forward to it today. Thank you. Great. So I'd like to get us started. First of all, welcome. I hope this digital-first event finds you all safe wherever you are in the world. The intent of today's session is to provide you with insight into how data science and AI can really be a game changer and, most certainly, inspire your work. I'll take you through an example of how one company's data scientists were able to predict how happy, or not, their customers are. Imagine that: not just a survey and a report on customer satisfaction after the fact, but the ability to actually predict it and do something about it to change the outcome. This is a detailed example that Neeraj will lead us through a little further on, but the intent of it is to get you thinking about the types of challenges you have at work and that your company is facing, and how the kinds of insights we'll share with you today could be applied to challenges you're currently facing. If we go to the next chart, you'll note that the workshop had some setup instructions to complete prior to the event. If for some reason you're not able to get your machine set up, you're certainly welcome to stay and listen; however, the experience won't be quite what we hoped for you, because getting your hands into the code, building a notebook, running algorithms, and really working with the data is the experiential part. So let's take a look at the agenda for today; let's go to the next chart. We are going to get started by introducing some key concepts that you're going to need, an introduction to predictive analytics and machine learning solutions, and creating a project. Because this workshop is lengthy, we're going to build in some breaks for everybody. And then we're going to go into the hands-on experiential journey. As I mentioned, this is about a net promoter score, a measure of satisfaction, and we're going to go through each of the steps you see outlined there. Just as a data scientist would, you're going to go through the business understanding, data understanding, data preparation, actually doing some modeling, evaluating the model, and some discussion and exercise on deployment. So in this coming session, we're hoping to take you through all of that.
So when you come out of it, you really have a good feel for the possibilities, for what you could do in your own workplace. Let's go to the next chart. I want to start this off by saying, let's talk about data, and let's use the chat for this. Think about it: is there a source of data that you can think of or know about that would really have information about any topic you can imagine, anywhere in the world, at any given time? If you have some thoughts on what that kind of data source might be, put them in the chat. Of course, data is pivotal to everything we're going to be doing today, so I wanted to get everybody's mindset into thinking about that. And if we go to the next chart, I'll give you one that I certainly had in mind. I see some good things coming into the chat there, and news articles are certainly right. The one I'm going to talk about today is Twitter, because if you think about it, this is a very powerful platform; people all over the planet tweet about everything. When Twitter was starting out, they sometimes came under scrutiny for their business model, because their revenue came from advertising alone. As they started thinking about the wealth of data that they had, they wanted to figure out a way to broaden their business model and perhaps monetize some of it, because it's really a gold mine of data, and they thought they could help businesses with it. In fact, I was at a Think conference IBM had a few years back, and the CEO of Twitter was there. It was right after an agreement with IBM was launched, and he was on stage talking about how he was so very confident that no matter what your business or where in the world you are, he had data that could help. And then he described one day when he was sitting in his office in front of his large window, pretty much feeling on top of the world with this newfound sense of the powerful data that he had, when a CEO came in for a meeting with him, asking to make a deal for some of Twitter's data. He leaned back in his chair very confidently, saying, of course I can help; tell me a little bit about your business. So the visitor proudly told him he owned a commercial fryer business, you know, those large fryers in fast food restaurants that make the French fries or crisps or different recipes like that, that so many people across the world love. And the CEO of Twitter sat back for a minute, thinking, really perplexed; he thought he might have finally met a company where he couldn't imagine how Twitter data could help. People just don't tweet about commercial fryers that they happen to walk by in a restaurant and see. So he was about ready to just say, look, I don't know how I can help you; people don't talk about commercial fryers. However, the other CEO said, perhaps not, but, if we glimpse that next chart, they do talk about soggy fries. So, next chart. And who would have ever thought: if you've ever been on Twitter, or know someone who has complained about soggy fries after going out to eat, that is actually very valuable data. So the CEO of the commercial fryer company explained: if somebody's complaining about soggy fries, and we have the geospatial data to know where it is, that tells me that there is something wrong with the commercial fryer; it needs some kind of maintenance. And he would want to be able to send out people to fix that proactively. Or, sometimes even better yet, he could determine that it wasn't even his company's commercial fryers.
It was a competitor's, and then they would have information and data, a very strong lead, to go try to sell one of their own commercial fryers or provide some kind of maintenance. So, I always love that example, because it shows that if you really think about creative ways of using data, you're going to see that there's a wealth of ways you can attack different problems and how you could go about it. That was one of the creative sources. But let's think about some other sources that, if we go to the next chart, could also be very interesting types of data. I think some of the people in the chat have really come up with some good ones here; Google searches and news websites, I'm seeing a lot of those. And I want to mention one other one, which is again very common and pervasive data, if you go to the next chart: weather data. Weather data is extremely valuable if you think about it, to transportation companies who are managing routes, to event planners, to seasonal businesses, to retail. We've all heard the stories about how, when it's raining, the price of umbrellas goes up and they're right in the front of the store. There's all kinds of ways that people are using the weather app in a very creative way. And in this COVID-19 pandemic that we are all living through right now, the weather app is also providing a level of insight into COVID-19 at very local levels. So if you have that app, and it's not in every geography yet but is planned to be rolled out, if you're in the US and Canada, I believe you can look in the lower right-hand corner of your weather app and see a red COVID-19 button. That will give you information at a pretty detailed level about what is going on in your current area. So I'd like to right now turn it over to George Stark to take it from here. George. Thank you, Maureen. Maureen mentioned a couple of data sources, and folks in the chat and also in the Q&A provided some interesting ideas. I'd like to introduce you to four very common types of data science models. The first one is generally considered a risk assessment model. Essentially what you want to do here is screen some set of actors for threats: credit card fraud is very common, passenger screening for airlines or trains, the ability to purchase something. One of the folks in the chat mentioned the reliability of the news source, and certainly fake-versus-real news is a type of data science model that you can build, predicting the probability that a news article is fake or real based on a training set of data that you have. The next one is quality or defect prediction. We use data science models to predict the number of defects in a product, whether that's code or castings of materials, how various compounds in chemical plants are made, how ATM machines fail, how servers fail. We have a lot of quality and defect prediction models in the data science space. Then there's business value and customer satisfaction, and this is the one we're going to talk about today especially: Neeraj and you all are going to walk through an experiment on how to build a data science model to predict customer satisfaction, which customers are likely to be happy with your product, and which ones will be promoters and champions of your product. And anytime you need to predict a price, a cost, or a value, certainly, you can build a data science model in order to do that.
In fact, Zillow and Redfin and realtor.com all have data science pricing models for homes. So if you're looking for a new home and you go out there, you'll actually see the asking price, and below that there will be an estimated price based on the machine learning algorithms of those various realty sites. We've actually built several models here to predict the number of issues we're going to see when a product is released, and to figure out what the probability of a major incident or a downtime-causing incident would be. So all of these things are very common uses of data science. And I'm sure there are, you know, millions more categories, but these are the big four that I've been familiar with and have built models of. Next slide. The key to being a good data scientist is matching the analytic approach to the problem that you're trying to solve. A lot of times with data science, people get very caught up in the notion of machine learning, as if machine learning is the only way to do things. Well, that's not exactly true. It's certainly a great way to do a lot of things, but the decision tree on this slide can help you figure out additional approaches that can solve your problem. If you're trying to predict an amount, say, sales tax revenue for a state government, you may want to use a time series model, a linear regression model, or a decision tree model. If you're trying to predict an event, such as a server failure or a customer satisfaction score, then you would want to look at logistic regressions or neural networks. If you're trying to put articles or websites into a class, you would use some sort of clustering algorithm. So basically this chart helps you match the solution to the kind of problem that you're trying to solve. I personally believe that every good data science model has three components. It has a data creation component, whereby you use natural language processing to classify text into categories that can be used as variables. It has a modeling component, which would be your regression equation or your decision tree or your neural network. And it has a simulation component, and the simulation component is what's used to test changes in those variables in order to make recommendations. So you have a data creation piece, you have a model piece, and you have a recommendation engine. Most really good data science models have all three of those components. This is the framework that we're going to use today. It comes from what's called CRISP-DM, which is a life cycle model for data science. It starts out with business understanding and data understanding phases, then goes into a data preparation phase. Once you're in the data preparation phase, you may iterate back to business and data understanding to make sure that you really know how the variables are created and what they actually mean to the business. From the data preparation phase, you go to a modeling and evaluation phase, and in this case, you're going to test a bunch of different machine learning algorithms. In fact, in the piece that Neeraj is walking you through today, in the work that you're going to do yourselves in your Jupyter notebooks, you will test four or five different models on the data, and you will select the best model based on very well-defined success criteria.
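As a rough illustration of that test-and-select step, here is a minimal sketch using scikit-learn, the library the notebook relies on. The four candidate models, the ROC AUC criterion, and the synthetic data standing in for the prepared NPS features are illustrative assumptions, not the workshop's exact code:

```python
# A minimal sketch of "test several models, keep the best," assuming a prepared
# feature matrix X and a binary promoter/non-promoter target y. Synthetic data
# stands in here so the sketch runs on its own.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=5),
    "random forest": RandomForestClassifier(n_estimators=200),
    "gradient boosting": GradientBoostingClassifier(),
}

# Score each candidate against one well-defined success criterion (ROC AUC)
# using 5-fold cross-validation, then keep the winner.
scores = {name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores)
print("best model:", best)
```

In the actual lab, the feature matrix and target come from the NPS case data, and the success criterion is whatever the team agreed on during business understanding.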
And then the last phase is deployment, which is actually putting the model you've just built into production so that it runs on a regular basis. One of the most important parts of deployment is being able to monitor the model's accuracy. Models tend to drift over time and often need to be recalibrated, so this is something that you should plan for during the deployment phase. The roadmap that you see here is very useful for scheduling and planning a project, and you can easily build milestones from it that can be shared with upper management. Next slide. So there are a lot of machine learning tools available in the market; we put six of them on this slide. It's important for you to know that we are going to use IBM Watson Studio today, and that's not a comment on whether it's the best solution or how it compares to the other five. It's just the one we know how to use, and it's free, so that's what we're going to use. You're certainly welcome and encouraged to try all of the solutions to find the one that you prefer when building your machine learning models. So we're going to do a quick poll of the audience. Simon, can we poll the audience? You should see in the polling section on your screen there are four options; go ahead and select what you want to learn through this workshop. All right, very nice. We see that a little over half of the folks on the call are interested in understanding the process to build a data science model, and that's good; that's certainly what we hope you get out of the course. Neeraj, you're up. Great, hi everyone, and thank you. The results look exciting; it says, okay, no one wants to be dangerous. Okay, thank you. The next step, as we are stepping into the workshop, is to see if we have the environment ready and to be in a position to open the notebook. You would see that I have actually posted two links, and there have been some questions around whether the presentation slides would be available; they are all available on GitHub at the first link. The link that you're going to use right now loads the notebook into the environment. So if you all want to copy this notebook link, which is there in the chat, that's step one. Then, if you are in Watson Studio, we have to add a project and click on notebook, and I'm going to do that along with you, so we can follow the steps here. Let me do that; I'll open a browser here. Hopefully you're logged in to Watson Studio; you have an option here that says add a project. Then you click here on notebook, like the instructions say. You can choose the runtime, which is Python 3.6 Free. Then just name it; I've named quite a few already, so I'll say test 8, but you can just say maybe NPS prediction. I'm putting one more number on because I may have existing ones. Once you have done this, then you go to From URL, NPS prediction test 8 from URL, and paste the URL link here, the one that's in the chat. So you see the link is here, and then you click on create. It'll take a few minutes and it will get created; it's setting up the runtime. And this is the notebook which we are going to use in all the phases that you saw in the roadmap, step by step. Hopefully by the end of it you see each step, you see the code, you run it, and you see how we go end to end in this use case. So we have a break planned here, George, right?
So we can actually use that if you have not been able to follow. So then we have like 10 minutes now as a break, is that right, George? We take 10 minutes now. Yes. Yeah. And we can help you and answer the questions in the chat; use the Q&A area for the questions, it's much easier to answer that way. Yeah. So I'll repeat the steps. If you look at the GitHub link, let me show you that, maybe that will help. So this is how we have organized the GitHub, if you can see here. Prior to the workshop, we actually sent some pre-work; these are the workshop setup instructions, this one presentation we shared. To be able to move forward and have the best experience, I hope you have gone through the steps of setting up an environment; that presentation is there in the GitHub. The precursor for us to be able to create the notebook is that you all would have to go to the dashboard and see that you have the services listed, which are Watson Studio, Machine Learning, and Cloud Object Storage. If all that is set up, then you can create a notebook with the steps that we were just looking at, and it will open the notebook. Let me see some questions out there. Okay, let's see the questions. What's the URL for Watson? Okay, it's there in the instructions; see the FAQs also. Okay, follow the instructions, loading the notebook; okay, that's a good thing. Provided before the workshop, like the creation of the notebook you've just done. Okay, so I'll repeat the steps in case you have missed out. Once you're logged into Watson Studio out here, recreate the steps. Once you are able to open Watson Studio, you will see an option over here to create a project. Once you create a project, you will see an option for a notebook. And then over here, it's going to be the free runtime, then the link, and then the name, and pretty much once you type it, you can click on create out here. Oh, sorry. Okay. Yeah, let me share it again. Okay. So this is the page. Okay, we've got a few dones, that's good. Okay, so this is the assets page, then notebook, if you missed that step. Then there's the From URL option and the name, and then you can type the link here and then click on create, and that will open the notebook out here. Once it opens, in some cases, if for any reason you get a stuck page, you can refresh the page, and that should open the notebook like this; you can see it on my screen. Yep, you're right. If in case it's stuck, then, for whatever reason, you can just refresh the page, and you should have this screen and the notebook, like you see on my screen right now. And not to worry if you're getting certain errors for any reason; I will run the code on the screen to show you, step by step, what happens at each step, and the material will be with you after the workshop to try, so you can definitely resolve it later. We'll definitely try to do it together, and if for some reason we're getting issues, we will still have the material to practice with, but at least we can focus on looking at and following the key steps we're going to go through together. Yep. So let me know whenever you want to start. So this is what we just did. It looks like a lot of folks are still having an error.
Let's go through it one more time very slowly, and just try refreshing your browser and see; you may actually have to clear your cache. Yeah, yeah. So the key steps here: we were here, then B, then we went to the URL. Okay, I think we posted the URL in the chat; I can put it in the chat or in the Q&A. I'm not able to see the Q&A right now, because maybe I'm screen sharing; let me see where it is. Okay, I found it. I can send it to all, one second. Okay, so I'm going to post it to all; hopefully you get it. One second. Okay, so this is the link, and someone is asking for the other one; yeah, I sent both the links, see if you're able to get them in the Q&A. I also posted these links in the chat as well as in the Q&A. George, are you able to see them? Yes. So, Neeraj, a question that's being asked is: do I need to follow these steps even when I completed the pre-work? Would I be able to follow these steps? No, do I need to, even if I did the pre-work? Yeah, you do, because in the pre-work you just loaded a test notebook, and now we have the real notebook that we're going to use for this workshop; that was just a test exercise. So you have to follow the same steps, but when it's asking for the URL, you just have to use the new link over here, so you have the new notebook that we are going to go through today. So yeah, you will have to use the new link and create a new notebook, to answer that. Okay, great. Yeah, there was a little confusion about the different notebooks. Yeah, and I would suggest, since I see some of them have been using the Bitly link, I would recommend using the link that I am posting in the chat. So people who have been saying they're using Bitly: use the full link, and that should work for you like magic, yeah. Yeah, this is the link you should be using. So Max says, yeah, he figured out the error: unlock the notebook from the dashboard, and that should get it working. That's good. Thanks, Max. It seems like, Max, you're able to open the notebook, right? That's what I see here. Okay, working for Max, good. Looks like it's working for Martin as well. Awesome, okay, that's nice. Okay, I see. Okay, got it. Okay. M Young says, yeah, got it working. Awesome, Claudia. Yeah, working. Okay, Raju working. Awesome. Okay, let's wait for just another few minutes, George, what do you think? Just another few minutes, yeah. Well, Gunther is still having an issue; it says cannot read property language info. Yeah, I think that's because you're using the Bitly link; you will have to use the full link, and that should be working for you. Yeah, that should help you out on that one. Okay, I think 'no resources available' would be the error you would get if you haven't followed the setup instructions. Yeah, so you don't have the storage. Basically, you have to go to the instructions, go to the dashboard, and make sure you have these three services available. As George mentioned, possibly you don't have the object storage set up, and that's one of the steps for you to set up the system and infrastructure. That's the reason you're getting the 'no resources available' issue out here. So, from the document there, the same way you added machine learning and studio, you'll have to just add the object storage, and you should be up and running. Okay, that's good. Pushkar is done. Good. That's a good thing.
Claudia, like you mentioned, make sure you use the Python 3.6 Free version, like I've highlighted here in step C. I've seen that in the rush we sometimes just click next; make sure you choose Python 3.6 Free. Thanks for the pointer on that, Claudia. Balaji done. Robert, okay. So, Robert, you're not getting the space for the URL because there are three options on the top: blank, from file, and from URL. You'll have to click From URL, and that's when the link field will come up. I think that's the reason you're not getting it out there. Okay, yeah. Siji and Vivek are helping folks; that's good, good collaboration and helping each other, thank you. Okay, you can leave the old notebook; Ravi, that shouldn't be a problem. But even if you want to remove it, you can pretty much do that when you go there. And this is the reason, I think, one of the questions was 'why am I not seeing the URL option': it's because you are on blank, and you will have to change it to From URL, and that's when you will see this. So you will have to flip between these options: go to From URL, then put the link here, then make it the free runtime, and that's when it will be up and running for you. So those are the steps. And 'how do I delete the old notebook?' Yeah, if you go to the assets, there are three dots at the end; you can select the file and delete it if you are done with it. Let me show you that part. Okay, Sanjeev has a good setup, awesome. Okay, congratulations. Awesome. Okay, that's a good thing. Okay, that's fine. Okay. Morath is done. Okay, you don't see the notebook code, any notebook? Okay, that's surprising. Manoj, maybe, are you using the old link for the notebook? You should use the one which I just posted in the chat; that should give a filled notebook like the one you're seeing right now. Maybe you put in the test link that we shared in the instructions; that could be the reason. Okay, working for me. Okay. Awesome. Okay. Yeah, it'll be working. Sammy working, Richard working. Okay. Yeah, and that's the fun of a technical workshop, you know, working together. That's right. Yeah, because everybody knows that this is a challenge to do all this virtually; we normally would be walking around the room trying to help people out. So thanks, everybody, for your patience in going through it. Yeah, yeah. And if you've really been asking for a real experience, I think this is a real experience, because back in the day when I started, or when anyone starts in data science, setting up the infrastructure is the first thing. What we have is a pretty step-by-step guide, but in general it takes everyone time to learn, test, do it, set it up. It's a journey on its own, and this is pretty much an example of that; it's about getting through it, solving it, and moving on. But yeah, this is also part of the journey, and that's why it's hands-on for all of us. Okay. So we've got quite a few yeses now. And I think, yeah, it's a good one to have your girlfriend and mother test it. Yeah, I made my wife test this for me. That's a good one. Okay. Okay, 'no resources available', I think it's the same thing: you have to set up an object storage; go to the instructions and set up the environment first. That's the reason you're getting the resources error, Shivani. Yep. You have to choose 3.6, and there is a free version out there, as it's in the slides.
3.6, and that's the free one. Awesome. So this is what we have to use: Free, 3.6. Okay, cool. Good for Dan. Okay, Neeraj, let's see if we can take the next step; I think we have lots of folks set up, so that's good. Awesome. Okay, so, yeah, go ahead, George; I'm closing my screen share. So, hopefully you all have figured out how to create and set up the data science environment on the IBM Cloud, and you've got that roadmap that we talked about, business understanding, data understanding, modeling, and deployment, in your head. So, next slide. Here's a little scenario that we've seen played out a few times in a lot of different data science projects. Basically, the CIO is having a meeting with his direct reports and he says, you know, we've been having issues with client satisfaction and I need some ideas on how to improve it. And the chief marketing officer says, you know, customers give us their money, but fans give us their hearts, and we can sell more. And the CFO says, I agree; a 2% increase in customer retention has the same effect as decreasing our costs by 10%, so we need to get more fans. And the CIO says, you know, we've done some surveys and about 44% of our customers say our experience is very bland; we need to figure out how to fix this. And the chief data officer says, you know, the net promoter score is probably a better measure of satisfaction than our current survey. And then the CIO says, yeah, you're right; how about we create a proof of concept to predict the net promoter score? And this is what our project is today. Our project is: can we build a data science model to predict the net promoter score, and then go back to our fans based on that net promoter score and see how we can make our experience less bland? So that's what we're going to try to do with the data science model that Neeraj is going to walk you through today: we're going to try to predict the net promoter score. Next slide. The next couple of slides are sort of background on what net promoter score is and how it's calculated; maybe you guys already know this, but we want to make sure the background is there. So in 2019, our ACME company here supported 500,000 cases created on multiple platforms. The net promoter survey response rate was 15%. 60% of the cases were non-promoters, meaning they gave a score between zero and eight on the net promoter survey, and 40% were promoters, which means they gave a score of nine or 10 on the customer satisfaction index. Next slide. So the net promoter score is becoming an industry standard for customer loyalty. Essentially, it is a zero-to-10 scale, and the question is very simple: would you recommend our product or our company or our service to a friend or a colleague? Basically, a score between zero and six means no, a score of seven or eight is a maybe, and a nine or a 10 is a promoter. The nines and 10s are the people we really want to understand: what makes our product tick. And we want to figure out how we can move the sevens and eights to nines and 10s, and how we can move some of the zero-through-sixes to sevens and eights. So the way we calculate the net promoter score is: we group the nines and 10s together and sum the number of responses. We group the sevens and eights together, and they just sort of sit there; they go no further in the calculation. Then you group the zero-through-sixes together, and your net promoter score is the percent of promoters minus the percent of detractors.
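As a minimal sketch of that arithmetic in Python, here is the calculation over a hypothetical list of 0-to-10 survey scores that matches the slide's example:

```python
# Minimal sketch of the NPS arithmetic over a list of 0-10 survey scores.
def net_promoter_score(scores):
    promoters = sum(1 for s in scores if s >= 9)   # 9s and 10s
    detractors = sum(1 for s in scores if s <= 6)  # 0 through 6
    # 7s and 8s (passives) count in the total but go no further
    return 100 * (promoters - detractors) / len(scores)

# Hypothetical responses mirroring the slide: 1,000 surveys with
# 500 promoters, 100 passives, and 400 detractors.
scores = [10] * 500 + [7] * 100 + [3] * 400
print(net_promoter_score(scores))  # 50% - 40% = 10
```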
So in this case, that's 500 out of a thousand, or 50 percent, minus the percent of detractors, in this case 400 out of a thousand, or 40 percent. So your net promoter score in the example that we have here is 10. Net promoter score can range from minus 100 to plus 100; anything around zero is considered normal, and anything 70 and above is considered exceptional. So really what we're trying to do is figure out how we can get our net promoter score to 70 or above. Next slide. The other thing that we notice is that the net promoter score varies a lot: it varies by industry and it varies by product mix. These slides are from the US consumer 2019 net promoter benchmarks. Basically, within an industry, you see the leaders are out scoring in the 70s and 80s, and the laggards are down around the teens. We want to be able to predict the net promoter score and figure out the effect it's going to have on our finances and on our product mix. Next slide. Our goal for this project, and for this workshop that we're doing here, is to improve our net promoter score by identifying the non-promoters ahead of time and proactively addressing issues. So what are we going to do? We're going to pull in a bunch of historical net promoter score data. We're going to build a machine learning model to identify the most important features that lead to net promoter scores. And then we're going to create a capability to share the top candidates for the non-promoters and build action plans to address their issues. Does that make sense, what we're trying to do? We're trying to build a machine learning model that allows us to predict the detractors of our service. Next slide. So here's an exercise for you, for your business and for your environment. We have developed this template that says: as a role, I would like to improve or reduce a KPI for a particular scope of work by an amount in a time frame. So, for example, you might say: as a CIO, I would like to increase the availability of my mid-range server environment by 8% in the next six months. And it's this sort of sentence that will drive your machine learning and AI workshops. So basically, take a few minutes here and try to write down, for your business and your environment, what might be a statement that you would want to see happen. Go ahead and put that in the chat once you write up a couple of sentences, and we want to see how these use cases can then be moved into data science solutions. Another advantage of putting your ideas in the chat is that you may find and connect with people there who are interested in doing a similar thing, so it may be an opportunity to collaborate with other folks who are participating here. There was one question in the Q&A; somebody was having trouble locating the chat. So, everybody: if you are having trouble bringing up the chat panel, you should be able to see buttons at the bottom of your screen, and the third one from the left is a red X with three dots; the one right next to that is a blue button with a comment bubble in the middle. If you click on that, it should bring up the chat panel on your screen. Yep. And there was a question here on direction: the direction can be that you want to improve, you want to reduce, you want to increase or decrease.
And the other part is what you're trying to accomplish; that's what the direction refers to. I think that's the question that came up. Oh, yeah. Okay. So, Sanjeev wrote an excellent sentence: improve employee retention for his department by 10%. These kinds of sentences can lead us to data science models, right? It's a really powerful template for framing, which is always critical when you're trying to solve something: to put it really concisely into that form. Yeah. People are putting in some really good ones here. Yes. This is great. Yeah. Another experience I want to share: last week I was at another conference where we had one-on-one expert sessions, a learning conference. During the interviews there, some of the folks asked me, how do we know what problem we are going to solve? Some people had done tons of data science and AI courses, but they were like, okay, where do I apply it? And the statements you're putting together here are exactly that: whether you come from data science or not, these are the problem statements that need to be translated into what eventually becomes a data science project of its own. So I just wanted to share those kinds of questions, to underline the importance of what you're writing right now. This is where the whole project begins: with this one-liner where somebody has a problem or somebody has a goal, and then it gets translated into a data science project. Yeah. I certainly like that one. These are fantastic; it shows great insight by all these participants, and the breadth of different problems that people want to tackle. Okay, these are really fantastic: online shops; reduce the churn, interesting; wow, reducing security incidents. This is an interesting one: as an advisor to government, I'd like to make COVID precaution compliance more effective by 10%. Yeah, that would actually be a pretty interesting model to build. Okay. Okay, Neeraj, those are pretty good examples. Yeah. Okay, so thank you, everyone. These are some great insights, and as we put them in the chat, everyone is seeing the variety of problems which are out there. Then let's keep going with the flow, and that's the next one, George. Yeah, so hopefully at this point we understand NPS and we understand what we're going to try to do in this workshop. We have some tools now, as far as the roadmap for building a model, and we're able to assess the situation with a clear business statement. So Neeraj is going to walk you through actually building our NPS model now. And just so everybody can follow: you probably noticed at the top of the screen the breadcrumbs, right? We just completed that business understanding, so we'll be going into data understanding, and you can see at a glance where we are in the process; we're just trying to make the roadmap easier. Yeah. Okay. And there was just one more point I wanted to emphasize here: by now we're done with an example of business understanding. The goal of this workshop is, yes, we look at NPS as an example, but by the end of the workshop you will all have your own examples, the way you saw with the first question.
There'll be more follow-up questions as we go along, so keep some of your examples ready. We would want to work together and see, by the end of the workshop, how you have a roadmap for your own project, while we are following NPS as the example here. So, yeah, I'm excited: we have a 200-plus task force for the CIO who has asked us to build a proof of concept for this problem, and that's what we are up to right now. So now, as you saw, we did the business understanding part. We got back to the CIO and said, okay, these are our findings; we assessed the situation, we looked at the benchmarks, and we looked at some of the poster children, meaning, looking at the benchmarks, who we want to be like in the industry, which companies we admire that are already where we want to be on the NPS scale. Then we did all the work: we finalized our objective, presented to the CIO, and got a sign-off on the project. This is a good way to approach it. Now let's get into more details. Okay: the problem is agreed, the stakeholders signed off, your project got sponsored. Now what are the next steps? The next step is working on the data. So, what data are we going to use for this problem statement? In the example of NPS, we looked at multiple dimensions, and these are the data sets you will find on GitHub; we'll go through that also. The first step was: what variety of data are we going to consider in this scenario? When we looked at that, we found a lot of time-based variables, or features, as you also call them in our context. Like, when the customer reaches out to us, what day of the week do they call us, working days or weekends? What time do they call us, prime hours or non-prime hours, in the global context? Then, how long have we had that account? And how long does it take for us to get back to them and provide a meaningful update? For example, if you send me a message as a customer and I provide tech support, how long it took me to respond to your message is the meaningful update time; and as the conversation grows, it becomes an average meaningful update time. So we can look at those as one example of time features. Some more examples in this scenario are where we are getting these cases from, in terms of all the issues the customers are reporting, by geography: country, and overall what geography or region they belong to. And how much are customers spending with us in this kind of use case for the company? Is there a monthly recurring amount they are spending with us? Is there a lifetime spend, do we know about that? Another good set of features to consider. And the fourth kind of data features we're going to be looking at are sentiments and emotions. This part is about, let's say, when you have a case and you are typing in your problem statement, in the case or through the email, however you're reaching out. What we're doing here is using Watson APIs; some of those are preconfigured models.
So we call the APIs to calculate these sentiment and emotion scores, and we're able to observe this for the company. Those are some more features: you'll find sentiment runs from minus one to plus one, values close to plus one being positive and values close to minus one being negative. And then you can look for emotions: when the customers type, are they showing anger, disgust, fear, joy, or sadness? What are they really talking about? And then there are many other features that we will attribute in this use case to begin with. One is, once the case comes up, how many queues does it go to, how many times does it get transferred, as an example. Do they have any support plans with us in this company? Is there a specific account type? There can be different categories of account type, and tiers: are there different tiers? Then, what severity was given to the case when they approached us; what technology does the case they reported belong to; and who reported the case: did we create a case for them, identifying it proactively, or did the customer log in and create the ticket themselves? That would be the user. And what source do they come from? There are a hundred ways: they can call us, they can chat with us, or, as a customer, for example, in this scenario there could be an open form to submit a request. So there can be multiple sources through which they reach out to us. And tribe and catalog are again more of an extension of technology, how some of these areas get grouped. So this was a pretty exhaustive list of features that we started with. And seriously, if you look at it, it takes a while to be able to understand which features can play an important role; you will come up with this list after multiple interviews with your business experts. That's why the first phase is around business understanding and the second is about data understanding, because once you start getting the data, it's about revisiting that and seeing the values, and that's going to be pretty much the next step. So these are the kinds of variables we are going to factor in for the proof of concept today, as you can see here. Then the point is, let's assume you're going to get these data sources from multiple databases or multiple different places. Once you have that, the next immediate step in the data understanding phase, after you have spoken with everyone, looked at the data, and cleaned it up, is that you do a simple audit on the data and see if you're really convinced by the data, that you would want to move forward and feed it into the model. You want to have a comfort level with the data set that you're dealing with and a hold on it: you can understand and explain what each feature means and what the values are, and you can look at the correlations, you can look at the missing values. So that is pretty much the next step.
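As a rough sketch of how those sentiment and emotion scores can be pulled for a case, here is what a call to the Watson Natural Language Understanding service might look like from Python. The API key, service URL, version string, and case text below are placeholders, and details may differ by SDK version; this is an illustration of the idea, not the workshop's exact code:

```python
# Sketch: scoring case text for sentiment and emotion with Watson NLU.
# Credentials, service URL, and version string below are placeholders.
from ibm_watson import NaturalLanguageUnderstandingV1
from ibm_watson.natural_language_understanding_v1 import (
    Features, SentimentOptions, EmotionOptions)
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

nlu = NaturalLanguageUnderstandingV1(
    version="2020-08-01",
    authenticator=IAMAuthenticator("YOUR_API_KEY"))
nlu.set_service_url("YOUR_SERVICE_URL")

case_text = "The portal keeps timing out and nobody has responded for two days."
result = nlu.analyze(
    text=case_text,
    features=Features(sentiment=SentimentOptions(), emotion=EmotionOptions())
).get_result()

print(result["sentiment"]["document"]["score"])  # roughly -1 (negative) to +1
print(result["emotion"]["document"]["emotion"])  # anger, disgust, fear, joy, sadness
```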
So, to summarize, the goal in the data understanding phase is: first, get an exhaustive list of features that you think can impact the customer experience, in this case, and you saw the examples we found. The second thing is gathering the data, and I do want to tell you that the first three phases are the most time-consuming, as everyone says. And the next thing is getting a hold on the data set. Now, it should be less talking, like we promised, and more hands-on. So this is the step: the first lab for us. Those of you who have the notebooks open can run them, and I'll tell you which steps to run; for the others, I am going to share my screen so you can follow along. The first step: we are assuming some of you have never used a notebook, which is fine; we'll just spend a few seconds on this. Then we will load the most important packages, verify the versions, and then we will load and read the files, explore the data, and perform a quality audit. That's what we are going to accomplish in the next lab. If you have the notebook open, this is an example of a notebook, and we have a brief introduction if you want to read it; we've pretty much talked about it. The key thing I want to mention about notebooks is that there are three types of cells, and mainly two are the ones that matter here: code cells and markdown cells. If you see, this cell is a markdown cell, where we can write some explanations, so anyone who's reading the notebook now or later can read the comments and understand what each cell is meant for. And the code cells are the ones you can run, like these ones; these are the cells where you perform an operation. These are the ones we are going to be using in this workshop. Can you increase the cursor? Can you please make your cursor bigger? I'll have to check that out; maybe I'll check in the next break and try to change it. Now, the first thing I would want all of us to do is select the introduction-to-notebook cell; there will be no output on this cell, as you can see, and click on run. You can also press Shift+Enter, for those of you who are shortcut people out there. Then you click on the next cell. You don't see any output because it's markdown. The next step is to load all the packages. There are multiple packages; I'll brief you on some of those, and I've kept a lot of readings for you if you want to take this to the next level in terms of your learning. We are using NumPy, which is a very well-known package used for scientific computing. We are going to use pandas, which is a heavily used data manipulation and analysis package. We use csv to deal with the CSV files, and datetime: whenever we have a date and time, like the example where I have features like day of week, which day they called or which time of the day they called, this package plays a very important role in those steps. With pandas-profiling we are going to do the quality auditing. And there are many more; one out here I should highlight is the Watson one, because we are going to be using Watson services.
And then there is scikit-learn; that's what we are going to be using, an open source package, for machine learning. And now it's time: you can select this cell and click run. Once you click on run, you will see an asterisk; that means it's loading all the packages into your notebook. Once you do that, the next step is that you can check the versions: okay, my packages got installed, let me see what versions they are. Sometimes it will happen that some of you take these notebooks and use them in different environments and you get some errors, so that is a good time to validate whether the errors are due to package version differences, or there could be some other reasons too. This step could potentially help you troubleshoot and run these in different environments. So what we've done so far: we loaded the packages and we saw their versions. Now we are actually going to read the file. There are multiple ways to read a file, but for today I can read it directly from my GitHub; this is the file name. So now it's about reading it. The file got read and stored in the NPS data frame. Once that's done, the next step here is exploring the data quality. This step may take some time, and pandas-profiling in general, as an open package, has been a little notorious, from what I've seen. So if for any reason you get some errors, like I got many before, it's not a Watson Studio thing; it's about a lot of precursors in the package dependencies, so it took me a while to troubleshoot. If you see any problems in running pandas-profiling by any chance, I have loaded a sample copy of the report in the GitHub, and I'll also show you an example of what you get out here when you run this. Now, if you remember the slide, what we talked about is: when you get the data, what are the key things we should be looking at? You've got the data set: how many features you have, what features those are, what type those features are, how many distinct values there are, which sometimes translates into the technical term 'high cardinality'; then the correlations and missing values; and the end goal is to be able to continue to revisit this data and improve upon it. So let's look at some examples here. With this one line of code, you can get the whole data quality report out here, as you can see. Some of the things we are going to look at: currently in the data set we have 58 variables and about 3,900 observations; 41 are numeric and 17 are categorical. And if some of you have not been able to run it, we'll take a short break and help you out on that, but let's follow the flow, so you don't miss these steps when you have to look for them. So now you've seen the variables, the observations, and what types they are. One interesting part about this report is that it gives you pretty much upfront warnings about your data: these are all your variables, the way we looked at the data set, and it tells you what percentage of the values are missing. And you can also find in this one that some of them are getting rejected because they are pretty much redundant in the model.
What it's saying is that the feature 'description' and the disgust score that we calculated over here in our data set are highly correlated, so we pretty much don't need both of them; even if we keep just one of them, it should help us move forward. That's another example. Now, if you go further down, it gives you a very interesting high-level report to look at. And the most interesting part is that now you can start auditing your data. Okay, account type: I have three types of accounts; does it look good? Maybe there are some blanks, if there are. For this workshop, we pretty much fixed up the data a bit, as demo data for you. But you see there are three types of accounts, and this is the distribution; does it sound fair? Maybe you would want to ask the business: what does it mean, what are these three kinds of accounts about, what do you think about that? Then you look at, say, meaningful communication, and these are the scores on anger. You can pretty much start looking at which variables are well populated, how many times the cases get reassigned; you can start studying these parameters one by one, and the questions come up: these are missing values; is there a way to populate these missing values? Once you have thoroughly audited your data in this scenario, one of the best parts, I would say, is if you look here in this report at the missing values: it tells you exactly, okay, you had around 58 features in this scenario, and given those 58 features, how well populated the data set is for you. So this is another way of looking at it. The goal of this is that, given the reality of every data science problem, the data is always just a starting point, and as you mature, you continue to build upon it. So it's not bad, I would say; we can definitely do a lot to continue to build upon this data set, but this is how you would start looking at it. And then, as you move forward, some of these may not be as clearly sorted, so you can look at the correlation. I know it's pretty micro right now, but as you reduce the features and rerun it, again and again, you can start looking at the correlation matrix and see which variables are of more importance and which are of less, and then make some good decisions about which features make sense to keep and which variables you should just let go. So you can do a lot, and if you realize it, this was all done with just one line of code: you looked at the variables, you looked at the correlations, you looked at the missing values, and then you can look at the sample data out here. So that is what we accomplished and looked at right now. The goal was to look at these points, and the end goal is to continue to revisit this and improve our data quality, because that is of the essence. As you've heard a lot of times, modeling is garbage in, garbage out: the better the data, the better the input we'll get. So those are the steps. Okay, and if you're not able to run it, that's fine; like I said, profiling can be temperamental.
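For reference, the read-and-profile step the lab performs amounts to only a few lines. Here is a minimal sketch; the raw GitHub URL and the 'case_created' date column name are placeholders, and depending on your pandas-profiling version the exact import or call may differ slightly:

```python
# Sketch: read the NPS cases file from GitHub and produce the data quality report.
# The URL and the 'case_created' column below are placeholders for the workshop's file.
import pandas as pd
import pandas_profiling

url = "https://raw.githubusercontent.com/<org>/<repo>/master/nps_cases.csv"
nps = pd.read_csv(url, parse_dates=["case_created"])

# A derived time feature of the kind discussed earlier (day of week of the case)
nps["day_of_week"] = nps["case_created"].dt.day_name()

# One line: variables, types, distinct values, correlations, missing values
pandas_profiling.ProfileReport(nps)
```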
If you read the GitHub page of the pandas-profiling package, you'll see the kind of notoriety I was talking about. If that is the case for you, I have attached a PDF in the GitHub; you can look at that, and that step doesn't have any dependencies on the following steps, so you don't have to really worry about it; it's an independent step for us, and, like you're saying, it takes a while to run. So once we're done, if you're able to look into this one: these are the lab instructions out here, and we'll pause for a few minutes; hopefully you're able to try some of these steps, and let's see if you have any questions. And Maureen, George, did I miss anything that you think we should have mentioned or highlighted? No, I don't think so. I think we'll just watch the chat and Q&A here to see whether everybody has been able to follow. Yeah, I just think you went really fast. Okay. Please put your questions in the Q&A and we'll try to answer them. Yeah, sure, and we'll ask Neeraj to do things over again. So this is the one. Okay, let me see some of those. 'I have an error.' Okay, let me see the GitHub quickly. Okay, some of you are saying you don't find the quality report; let me see if the report is there in the GitHub, just for your reference. Okay, maybe it's not; I will do it now. Okay, so I just put a new file in the GitHub and you should be able to see it: I added a file which is the data quality report. This is the one, data quality report; when you open that, I can open the example just for your reference. This is how it looks. The same thing we were looking at: you have 58 features, 3,912 observations, some of the warnings. Yep. You get some of these warnings and you can study each one of them within that. And yeah, that's how you can look at this one. Now let me see some chat questions. Well, I'd like to point out, Neeraj, one of the really important things here: this is messy data. Yeah. And every data set that you get as a data scientist is going to be messy. So don't ever expect 0% missing values; don't ever expect no correlation in your data. It just doesn't happen. So be able to deal with the messiness of the data. Okay. So I have been using the annotation tool, like the input Wilfred shared; I hope that allows you to see my cursor better, right? Okay, then I can't select. Okay. Let's see what other questions we have. Okay. Thank you. 'Steps again.' Okay, so can I show the steps again? Yes, why not; let's look at that. Let me revisit this. What we really did, if you're still seeing my screen: we talked about the features which we are considering for the problem; the CIO gave us a problem to solve as a proof of concept, and these are the features we've got at hand. Once we got the features, we used the code to be able to run the quality check on the data set. We are using pandas-profiling, one of the packages we loaded. And the key things we have been looking at are the number of observations and features, what types of features those are, and how distinct they are, what kinds of distinct values they have. Then we look at the correlation. Then we look at the missing values.
And then the end goal is the way we just talked about. Here's the example I was showing: hopefully, once you sorted out the errors on pandas-profiling and were able to run this part, this is how the report looks, as you can see here, all with just one line of code. You can see the overview tab, where you get the overview of your data set; this is where you see the variables; this is the correlation aspect; and you can see the missing values and a sample of the data set. So this is all about getting comfortable and getting a handle on your data before you move on and start preparing it for the next step. I hope that helps; I got a request to repeat some of these steps. Okay, that was from Rajeshri; you're welcome, Rajeshri. Let me see the Q&A also. Any other questions? "Can you post the GitHub report name?" It's data quality report.pdf, if you want to see it; that's a question from Santosh. "How do you find the report?" If you refresh the GitHub, you should be able to find it. Okay. And there are some errors in pandas-profiling. I mean, that's our life in data science: getting errors, reading Stack Overflow, finding out how to solve them, and reaching out to the GitHub of the packages. That's real life, and welcome to the world; if you're getting these errors, you're in a good state, I would say. Not to worry. So, Santosh, I hope I answered your question. Then let me revisit the chat quickly and see if anything else is there before we move on to the next steps. One second. Okay, chat. Got it: "I'll ignore the error in 3.2 and look for your data quality report." That's good, Dan, thank you very much. Okay, so I'll go back to the notebook, summarize what we did, and then we'll move on to the next steps. What we have done here: we already loaded the packages, and then you have the exploration part. And if you're getting an error sometimes, it's worth trying a restart. Once you've restarted the kernel, you can go to the cell you want to run and use Run All Above, just to run everything together again and see if that helps. If not, at least you're able to see how it's working. So that's this part. Let's see, any other quick questions? Okay, so hopefully we're okay; can I get a few yeses in the chat if you're able to follow the instructions, or at least follow along? Okay, thank you. Thank you very much for the validation. So with this step, let's continue to move on; so far we did the lab for the first section. "Now a cell is taking too much time." Yeah, welcome to the world; that's the reality of how it works, and it depends on the environment you have, so upgrading your environment may help. So like I mentioned, it's a journey we are taking together in this workshop, like we started with.
So in the business understanding phase, you all wrote your statements on what kind of problems you want to solve, because this one is just an example. What we would want is that when you go back, you solve your own problems and collaborate with the people around you. Whether or not you are a data scientist, you should be in a position to have the vocabulary, have a plan, put it forward, and say: I need these kinds of resources to get this going. So the first step we did was to put up a problem statement, and we looked at some very interesting problem statements together. The next step: you saw the kind of data set we looked at, time-based, geography-based, sentiment- and emotion-based, a very broad variety of data we factored in. So the next exercise I would like to request from all of you is this: now that you have a problem statement, how about you consider what data set you are going to gather? Say your CIO has told you, okay, go ahead, I like the proposal you shared. Now, what is the data set? You're going to go back to your data team, or the other teams you have, so what kind of data sets would you factor in for the predictions you're going to make? That's where we're looking forward to hearing more and sharing our knowledge. So put your thoughts in the chat, and let's see what you have here. The goal of this next step: first we looked at the project, and now we want to elaborate it and say, okay, given my problem statement, what are the data sets or variables I'm going to factor in in this scenario? Any interesting examples? This was our data set, like you saw; what do you think your data set would look like for the problems you're trying to solve? Okay, that's interesting: Aaron is looking to improve the perception of honesty, using unsolicited comments from social media and press about the company. Okay. Then traffic on an IoT network: a preferred data set of traffic data with source IP, destination IP, and port numbers. Very interesting. Smitha, threats also, okay. Is it like a configuration management database, or what does MBB stand for, just for everyone's benefit? Okay, got it: in application operations, the past history of system logs across the infra and app servers, and past major incidents. Okay. And Sage, for subscriber churn rate improvement: variables such as minutes of use and frequency across the subscription life cycle. Awesome, Smitha, that's good; you're saying you got the profile report after restarting, so that's good to know. I see somebody commented, "My company's data started in silos, and they're not connected to each other," and that's certainly a very common challenge. "What kind of precondition activities would you expect the company to do before undertaking this kind of project?" We can certainly relate; many, many companies are having this siloed-data kind of problem.
And many have put in place chief data officers to really get governance around the data and create the platform that enables data scientists to access data more easily. So there are certainly a lot of creative steps companies are taking to address that. Certainly, Maureen, one of the things we run into a lot, and what we have to do, is put in place a data dictionary document and then find ways to export the components of each of the data sets into a common data lake that we can then model on. Yeah, exactly. In the Q&A, someone is waiting for a reply on the GitHub report name. Okay. I'm just looking for a final few questions, if you have any, or I think we'll move on to the next steps from here. Yeah, I think we can go ahead and move on. Sure. Okay. "Found my notebook and project again in Watson Studio." Awesome, John. Okay, so let's get started again, and thank you for sharing all your cases and the kinds of data you would consider. That opens up the doors: this is just one problem you're looking at, and there are tons of new problems to solve; that's what it highlights. So hopefully you have a project by now, while you've been following the steps, and you've been able to think through at least the data set you would factor in; if not now, at least build the thought process and continue to think about it, because the goal is to learn and apply. That's the focus of this workshop: you take it back and go back with a plan. And if you look at what we have accomplished so far, I really want to congratulate those of you who have been able to follow through; if not, at least you've been able to follow along in terms of understanding. What you've accomplished, which you should all be very proud of: one, you've set up the data science environment; you've learned about the roadmap to build a machine learning system, the one George highlighted with all the steps of CRISP-DM as a methodology; we looked together at how we assess the situation in the NPS scenario, looked at the methodology, how NPS works, the benchmark, defined the objectives of the project, and got a sign-off from the CIO. Then we looked at the notebook, loaded the packages, verified the versions, explored the data set, looked at its quality, and looked at which data sets we're going to factor into our problem area. So with this, we have a fair understanding of the business and, hopefully, the data. Now the time comes when, okay, you've got the data set with you; we have to prepare it for the modeling exercise. And the idea is that, just as you use a different language when you talk to a newborn child, the modeling techniques will not be able to absorb the data the way it is right now. We have to get the data into a form that can be understood and read by the model, and there are three important concepts when it comes to preparing data for modeling: the first is feature extraction, and then scaling and selection. Extraction is about cases like this: say we had two types of users, a customer, and an employee or technician who created the ticket for us.
That was one of our features. Currently it may be one column, with "customer" or "technician" written as text, but when you one-hot encode it, it becomes 0s and 1s, and that is the matrix the machine will be able to understand. That's what one-hot encoding is in this example. Now, that's not the only technique; there are many others listed below, and if you're interested, I've given hyperlinks to some very interesting readings, some of my favorites, on each of these topics, with step-by-step explanations of what they are and how they work. For the workshop today, to keep everyone engaged and show you how the end-to-end cycle works, one-hot encoding is the feature extraction technique we use. Then the scaling part. The variables may have a wide variety of scales in the values they contain. For example, the revenue or spend a customer is making may be in millions, but the severity of a ticket is on a scale of one to four. The scales are widespread, and if you just feed in the data as is, the model will not handle that well; monthly recurring revenue is on a different scale than annual, too. So we have to get all the data onto a similar scale, and there are multiple techniques for feature scaling. The one we are using here is the most famous one, MinMaxScaler, which puts the data on a scale of 0 to 1; there are other techniques, like StandardScaler, RobustScaler, or Normalizer. And the next step is selection. If you remember the quality audit report: if some features have 90%, or 99%, missing values, we don't need those features in our model, because as of today they're not adding any value. At the same time, if a feature has just one value, there is no variation in it for the model to factor in. Those are two examples of how you might let go of some features, or decide whether to keep them, and you can find them when you audit your data quality report, where we looked at all the different aspects. There are definitely more advanced techniques, which I've listed below, that you can use to make those decisions. So: make the data transformation using one-hot encoding, then do the scaling, and once the scaling is done, decide which features to keep and which to let go; we'll look at what that looks like in the lab as the next step, and there's a small sketch of these three ideas just below.
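A minimal sketch of extraction, scaling, and selection on a toy ticket table; the column names and values are made up for illustration, not taken from the workshop data set.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical ticket data echoing the examples above: a categorical
# creator type, a dollar spend in the millions, a 1-4 severity, and a
# constant column to illustrate selection.
tickets = pd.DataFrame({
    "user_type":    ["customer", "technician", "customer", "technician"],
    "annual_spend": [1_200_000.0, 85_000.0, 430_000.0, 2_600_000.0],
    "severity":     [1, 4, 2, 3],
    "region_code":  ["NA", "NA", "NA", "NA"],   # single value: no variation
})

# Extraction: one-hot encode the text column into 0/1 indicator columns.
encoded = pd.get_dummies(tickets, columns=["user_type"])

# Scaling: bring the very different ranges onto a common 0-to-1 scale.
num_cols = ["annual_spend", "severity"]
encoded[num_cols] = MinMaxScaler().fit_transform(encoded[num_cols])

# Selection: drop any column with only one distinct value.
encoded = encoded.loc[:, encoded.nunique() > 1]
print(encoded)
```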
So with this introduction to data preparation, these are the three lab steps we're all going to attempt: given the data set you have already read in, first the feature extraction part, then feature scaling, and then feature selection. Let me go back to the notebook so we can run these steps together; I'm going to share my screen. So this is the step, if you can all see my screen. What this currently highlighted cell is saying is that this is feature extraction. If you look, there are 58 features out there, and some of them are continuous features with numerical values, so we put all of those together as numerical. Some of the features are categorical, as you can see: the support plan, the user type, the severity. And some are categorical but, if you remember one of the concepts we touched on, with high cardinality: they have too many distinct values, so we separate them out and we're going to hash them. Hashing is another technique we're using. Once you do this part, this is step 6.1 that you're taking. When I run it, it converts the data, so let me run that with you: cell seven got done, then I ran eight, and you can see this part. The data got hashed with these commands. So I separated everything into continuous, categorical, and categorical with high cardinality; I left the continuous variables as is and did nothing with them, converted the categorical ones into dummy, or one-hot encoded, variables, and applied hashing to the high-cardinality categorical ones, which you can read more about in the links I've shared. That's what this step is doing, and once we run these two cells we're pretty much done with the extraction part. Now, the next step is the scaling part. For that, we take the target variable and separate it out, because we don't want to touch it: we assign the entire table of values to X while letting go of the target variable, which is likelihood to recommend. So we have a table X with everything excluding the target. Once we do that, we make sure everything in X is a numeric value, so we turn everything into numeric. I can run this with you so you can see: that's run, and this part is run. Then, like you saw in the data quality report, there are still some blank values; how do you deal with blank or missing values? Once you've decided a feature has enough values and you still want to retain it, you can fill the blanks with the mean, or you could use the median; in this case I'm using the mean. Then there are the pre-existing packages in scikit-learn, like MinMaxScaler, which came in when we loaded the packages in the first cell, and I'm saying: do the min-max for all these columns, scale them between 0 and 1. This step converts the data, scales it, and you see this is the output (a condensed sketch of these cells follows below).
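A condensed sketch of those cells on a tiny stand-in frame; the column names are hypothetical, and the workshop notebook remains the source of truth for the actual field list and hash size.

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Toy frame standing in for the 58-feature case data; names are illustrative.
df = pd.DataFrame({
    "likelihood_to_recommend": [9, 3, 7, 10],                  # target
    "severity": [1, 4, 2, None],                               # continuous, one blank
    "support_plan": ["basic", "premium", "basic", "premium"],  # categorical
    "city": ["Austin", "Pune", "Dublin", "Tokyo"],             # high cardinality
})

# Extraction: dummies for the low-cardinality categorical, hashing for the
# high-cardinality one (fixed number of output columns instead of thousands).
dummies = pd.get_dummies(df[["support_plan"]])
hasher = FeatureHasher(n_features=4, input_type="string")
hashed = pd.DataFrame(hasher.transform([[c] for c in df["city"]]).toarray())

# Separate the target, then assemble a fully numeric X.
y = df["likelihood_to_recommend"]
X = pd.concat([df[["severity"]], dummies, hashed], axis=1)
X.columns = X.columns.astype(str)

# Fill blanks with the column mean, then scale everything to 0-1.
X_scaled = MinMaxScaler().fit_transform(SimpleImputer(strategy="mean").fit_transform(X))
print(X_scaled)
```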
So now pretty much the entire data set you're seeing here is scaled to 0-1, and that's what we accomplished in this step of the lab; I hope you're able to run these steps, and we'll take a short break after this to make sure you can. The next step is feature selection. There are tons of techniques for feature selection, but we're using one just to get you all started, and there are more readings, like I said. What we're doing here: we had 58 columns, and we're saying, okay, we don't want all 58; for now, just give us the top 30 to run the model. So we take the top 30 using the Pearson correlation, which measures the correlation between an input variable and the target variable. We check the correlation and, based on the strength of the relationship, whether it's positive or negative, as it's closer to plus one or minus one, we keep the 30 features that are most important for the model. That's what this step does; let me run this part. So this ran, and then this one; let's run it. We got the top 30 out here, and once we get them, we store them, because we're going to call them again; we store them as top 30 so we know these are the top features we're going to use to train our model. The point to remember here is that Pearson correlation is just one approach; there are many others, like the ones listed here, so we can attempt others and make a decision about which features to keep and which to let go. And there is a very interesting blog, I think the second link, that talks about how you can apply all of them at once, compare the results, and do a scoring; a very interesting blog that I found. So this is one way to do it, and a quick sketch of this selection step follows below.
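A minimal sketch of top-k selection by Pearson correlation, on random stand-in data; in the lab, X is the scaled feature table and the target is the likelihood-to-recommend score.

```python
import numpy as np
import pandas as pd

# Random stand-ins for the 58 scaled features and the target.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.random((200, 58)), columns=[f"feature_{i}" for i in range(58)])
y = pd.Series(rng.integers(0, 11, 200), name="likelihood_to_recommend")

# Pearson correlation of each feature with the target (Series.corr is
# Pearson by default); rank by absolute value so strong negatives count too.
correlations = X.apply(lambda col: col.corr(y))
top30 = correlations.abs().sort_values(ascending=False).head(30).index.tolist()

# Keep only the 30 strongest features for modeling.
X_top30 = X[top30]
print(top30[:5])
```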
Let me check the chat now and see if you're able to follow these steps. George, Maureen, any pointers you wanted to make on this one? Nothing for me, Neeraj; I think you did it well. Okay, and I think you're right, we should take a break, and during the break we can handle any additional questions. What do you think, a 10-minute break now? Yeah, that sounds good. All right, we'll reconvene in 10 minutes. For the next step, the document is there in the GitHub: once you're done, you get started with Watson Studio, create an empty notebook, and follow the steps here, a new project, then the access tokens; it's all in the instructions in the GitHub, so you're pretty much there, and there are a few more steps left to continue the journey. Okay, so let's go back. Okay, John, cool, awesome; now you should be able to pick it up, John, since you're able to edit it, that's awesome. So this one is done. "Ran the feature cell but got no output." No, you shouldn't have received any output there, because there is no print command; it's just processing, so don't expect output from that step. The same goes for cells nine and ten: no output. You proceed to scaling and follow the steps; that's where I do a print, so that's when you'll see output. So yes, you can start the scaling part; that's what you move to next. Awesome, thank you for the confirmation, good job. Okay, Aaron, issues are clear if you're skipping 3.2, okay. And yes, you can run these in Anaconda for sure; it should work for you with no problem; I use Anaconda also. The transformation takes a few seconds in general because the data set is not big, so maybe just interrupt the kernel, restart, skip the pandas-profiling if it's not working, and run the rest of it with Run All Above. Yes, we are going to cover the deployment part; that's next on the tabs you see at the top: we'll look at the modeling part next, then evaluation, then deployment, today. Okay everyone, let's move on with our journey. So our CIO is very happy with the team working together on this proof of concept, and as a next step we report our findings: we did the business understanding, we did the data understanding, and in the lab we extracted the features, scaled them, and selected them. The next exercise is this: given what you just looked at on the data preparation side, the challenges and the code errors included, what are your thoughts, based on the projects you've been doing or the new ones you're learning about, on what steps you would use to extract, scale, and select when preparing your own data? We just want to hear some of your thoughts so everyone gets exposed to your pointers. So maybe drop them in the chat; this is the third exercise for today, where we want to hear: given that you have a project plan and you've thought about some data sets, how are you going to prepare the data for the model to consume, and what challenges do you foresee? There are some instructions out here. Thanks, and let's wait another few minutes on this one, then we'll move on. Any reflections on extraction, scaling, or selection, given your data, if you were to revisit that part? There's a question on bias, and I remember in one of our last conferences you highlighted an example of bias in pictures; there's a question on how you deal with bias. Yeah, I'm answering the ones in the Q&A section, so I'll get to the chat after. Okay: "we need to scale the variable," okay. Actually, I think we could keep going. Yeah, interesting. Okay, sure, let's do that. So with this, as you see, we've been able to look at that part. Okay, cool.
So now we have the data, and the next thing is to start the modeling and evaluation part. One of the first questions, like George highlighted in the first part of the session with the overall framework on choosing the right approach: you have the problem, you have the data, but you have to decide which machine learning algorithm you're going to select. This is a very simple example, and I've put up one of my favorite readings for understanding things and keeping them simple. There are two broad approaches, supervised and unsupervised. Supervised means there is some target we're trying to predict from pre-labeled, categorical or numerical, data; in our case, the customer data is pre-labeled. If you are trying to predict a category, like the color of the socks in this picture, or in our case the customer satisfaction experience in NPS, that's a classification example. If you're trying to predict a number, like the house price examples you all brought up, that's regression. For unsupervised: if you're trying to cluster similar things together, like similar clothing into a stack, or foods, that's clustering; and association is pretty much the Amazon example, which item goes with which, and I'm sure you've heard the beer-and-diapers story about how they're always placed together. So with this, the next quick exercise: given these examples, where do you think they fit? Where would spam filtering go? Post your responses in the chat: what kind of technique would you use for spam filtering? Okay, classification. And fraud detection, just think about it; and segmentation; think for a minute about what class of problem you'd treat each one as. House price; topic models for similar-document search; how to place products on the shelves. Okay: fraud filtering, house price regression, association; awesome. That's great, thank you for these inputs. Moving on, just to show you the answer sheet: you're pretty much right, and you can see the examples and how they map; the problem we're working on today is a classification problem, as you can see here. Thank you for the inputs, and let's keep going; a small sketch of this mapping is below.
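For reference, a hedged restatement of that answer sheet as a small script; the mapping mirrors the discussion above rather than any official taxonomy.

```python
# The exercise examples mapped to problem classes, as discussed above.
answer_sheet = {
    "spam filtering":            "classification (supervised, category target)",
    "fraud detection":           "classification (supervised, category target)",
    "NPS customer satisfaction": "classification (supervised, category target)",
    "house price prediction":    "regression (supervised, numeric target)",
    "customer segmentation":     "clustering (unsupervised, similar groups)",
    "products placed together":  "association (unsupervised, co-occurrence)",
}
for problem, kind in answer_sheet.items():
    print(f"{problem:28s} -> {kind}")
```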
Now, as you head toward the model, it becomes more and more important to understand what metrics or measurements you're going to look at once you have the model, to decide which one is better than another. The key vocabulary you'll see coming up here is the confusion matrix, which describes how you measure the performance of a classification model when you have a testing data set: you divide your data into train and test, train on one part, and test on the other. You see these metrics, and they really help you understand how well your model performed; there are further aspects beyond that too. George, would you like to share some more insight on this, with some of your examples? Yeah. So basically, when you think about these things, you can't just use one measure of goodness of fit for your model. Accuracy is an overall measure of how good you are at identifying the true positives and the true negatives, and the problem is that it assumes it's equally costly to make an error in one direction or the other. In fact, you can have 90% accuracy and still have an absolutely terrible model, or have 70% accuracy and a great model, depending on whether the true negatives or the true positives are more important, and how much it costs when you make a mistake and produce a false negative or a false positive. Precision is another important measure: you divide the total number of correctly classified positive examples by the total number of predicted positive examples, that is, the true positives plus the false positives. So precision is telling us that if we identify something as positive, it probably is positive, because we have a small number of false positives; precision is measuring the quality of our positive predictions. Recall is the total number of correctly classified positives divided by the total number of actual positives, so high recall indicates that our class is correctly recognized; in other words, we have a small number of false negatives. When you have high recall and low precision, most of the actual positives are being caught, but many of the examples you flag as positive are wrong. When you have low recall and high precision, you're missing a lot of the positive ones, but those you do predict as positive are indeed positive; this is one we use a lot when trying to predict a major incident or a catastrophe, something really bad, because you want to be sure you're correct. The F-measure on this slide combines precision and recall: it's their harmonic mean rather than their arithmetic mean, so it punishes extreme values more. The F-measure is the key one if you want a single, good, general measure of how good your model is, and we like to see an F-measure above 0.3; when we're in that range, we think we have a pretty good model, assuming our accuracy is reasonable as well.
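A minimal sketch of these metrics with scikit-learn; the labels are made up purely to show the calls.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy ground truth and predictions, only to illustrate the metric calls.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))   # rows = actual, columns = predicted
print(accuracy_score(y_true, y_pred))     # (TP + TN) / all predictions
print(precision_score(y_true, y_pred))    # TP / (TP + FP): few false alarms
print(recall_score(y_true, y_pred))       # TP / (TP + FN): few misses
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall
```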
Yep, great, thank you, George. So with that as the baseline for measuring model performance and knowing which metrics we'll look at, here is an example of what you're going to see when you get to the lab: we ran multiple models trying to predict the customer experience, looked at accuracy, precision, recall, and F1 score for all of them, and found that in our scenario one model performed best on those scores, so you use that model. So far in the modeling and evaluation phase, we first decided which models to try, then decided which metrics to use to measure their performance, and third, we evaluate and decide which model to go ahead with, save it, and deploy it. But there is one more important concept I want to highlight: whenever we talk in this language of data science, a stakeholder can be turned off, because not every stakeholder you work with is interested in the back-end part, how you built the model or what the metrics were. What they want to see is the big picture. This is one demonstration of this project, the use case we're taking to our CIO, to explain as a team how our model works. What it's really saying is that we take the time-based features, the geography features or variables, the dollar-based features of the customer, and the real-time insights in terms of sentiments and emotions like you saw, and based on that we predict where the customer's satisfaction stands. If you look at the bottom part here, this is the life journey of a case: the earlier we catch a case with a high prediction of the customer being unhappy, the better the chances we can address the customer's issues up front, rather than letting the temperament escalate and the signals climb until they finally rate us down. It's all about making a prediction as early as possible while the cases, emails, and requests from the customer are still open, getting it into the right hands, and helping the team make the right decision. So this is one possible way to explain, without the statistics, what's happening in the back end and to tell the story to your stakeholders, and that's what we plan to take to the CIO who gave us this proof of concept.
Now, the next part of the lab: we have the data, we know what the metrics and evaluation are, so we're going to split our data into training and test sets, measure model performance, and evaluate and select the model. If you open the notebook, there is a section that does the splitting: we take all the top 30 features and do a 70/30 split, and this part of the code splits the data into two pieces, as you can see; I just ran it, though maybe I should have run the last step I missed first. So the data got split here, and now you measure model performance. There are multiple algorithms listed down here; just run all of them, and once you do, you see the precision, recall, and F1 score, and the confusion matrix, for each. Once that's done, the next step is to see all of them together: with these all run, you can see the performance, and when you run "evaluate and select model," it gives you an overall comparison, a final table that tells you how each model stands in the overall context of how you're comparing them. You can also see each model's precision, recall, and F1 values, make sense of them, and decide which model you're going to choose in this context. So this part is done: we split the data, measured the performance, and evaluated and selected the model here. Let me see if there are any questions at this point. While you think of questions, the next part is taking your journey ahead from where you started: you had the business understanding and the data understanding, you saw the data preparation challenges, and now the point is, for the problem you decided to work on, what modeling technique do you think you could use, given the examples I shared? That's the next exercise for you to think through; feel free to share your inputs in the chat. "You don't have your slides in slide show; you've got the whole PowerPoint up there." Sorry, say again? "The slides, yeah, but they're not showing in slide show; you're showing the whole PowerPoint app." Oh, okay, I can put it in slide show mode since I'm switching; can you see it now? "Thank you, yeah. Easier to see, thanks." Sure. Okay, so let's see if there are any questions or inputs. The problems we started with at the beginning, when we asked what the potential problem statements would be: now it's time to think through whether those are classification problems, regression problems, association problems, or clustering problems; what do you see? Yeah, maybe we could make another version of this notebook; as of now the graph shows accuracy, and in the next version we could definitely consider that feedback and add a similar one for precision and recall too. That's good feedback, thanks, Raju. A compact sketch of this split-train-compare loop is below.
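A compact sketch of the split-train-compare loop from this lab, on synthetic data; the two algorithms are stand-ins, not the notebook's full list.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 30 selected features and the NPS target.
X, y = make_classification(n_samples=500, n_features=30, random_state=42)

# 70/30 split, as in the lab.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

# Fit a few candidate algorithms and compare them on the held-out test set.
for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
    model.fit(X_train, y_train)
    print(type(model).__name__)
    print(classification_report(y_test, model.predict(X_test)))  # precision/recall/F1
```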
Okay, we'll wait another two minutes; if there are no more questions or inputs, we'll move on to the next step. Okay, so let's keep going. Hopefully you're making notes, because the whole plan and the thought is that by the time you're done, you have a plan of action and a proposal for your CIOs that you'll work on with your leaders. So, summarizing what we've accomplished, and again, you should all congratulate yourselves, you've come a long way and made the best use of the time and the situation we're in: in this step we split the data into training and testing ourselves (you can also hold out a validation set, but we haven't touched on that concept here), then you selected the top-performing model, then you evaluated the metrics, and then we looked at one potential way to make things simple and demonstrate the whole working of the model to your stakeholders. And with this, we move to our last phase, the deployment phase, and that's where the key is now; someone asked in the chat whether we'd look at deployment, and we are going to do that. So how does the solution really work? These are some examples, mock-ups, you can see here. You're getting data from multiple sources: right now you have one single file, but honestly that's never the reality; you pretty much have data coming from multiple databases, so you have to create a data pipeline and a mechanism for how you absorb it. In the solution we're proposing to our CIO, after the journey we've gone through, this is what our deployed solution looks like. One: we have data coming from multiple databases that we absorb by linking directly from our notebook. Second: we score our predictions in batch or in real time, and we can define what real time means: is it daily, weekly, monthly, every hour, or whenever an event happens? It's your definition of real time. And when that happens, how do we embed the solution into the systems people already use, so they can absorb it? That's the solution part. Now, here's an example of how the solution may look, another mock-up I wanted you to see, showing how it gets integrated into the system. If the team uses Tableau, or Cognos, or other solutions like Power BI, the point is to put your solution where the things they use most often live. Once you're able to make the predictions, this is the final output the CIO and the organization would see: a global heat map showing where the predictions are the deepest, where customers potentially have challenges and the predictions are coming up.
It can show the spend and exactly what range or quadrant the customers fall into, and the chart below gives an example per transaction, whether it's a case or an email: for each transaction, it tells you the prediction, the likelihood of this case being unhappy. So a case that was opened a day or two ago and is still being worked on: the algorithms we just created will detect the challenges in that case based on historical trends and behaviors, and highlight that, given the pattern it has followed, there is a high likelihood that when the customer gets a survey, they will rate us down; so maybe we do something up front and ahead of time. That's the goal here, and this is the kind of wireframe for the CIO that we should all look at for how we'll integrate it as a solution. Then the next part is saving the model, deploying it, and running a prediction against it; that's the next lab for us, and the most exciting part, since most of the work has already been done. So if you can see my screen out here: you can't run this cell up front; you have to fill in this information, the API key and the instance ID. How do you get those? If you see the instructions out here, it's on slide 20: when you did the setup, you went to your service credentials, opened the machine learning service, and created a credential like in this step, and there was guidance to save those credentials because you'd use them; this is where the API key and the instance ID come from, and you can auto-generate, add, and save them here. Once you take the API key and instance ID from there, and I'm hoping you're able to paste them in here, you run it; I'm not running this one because I'd have to copy mine in, but this is an already-run notebook I prepared before the workshop. Once you do that, this step stores the model, and you can see it's already run. Then this is the step where you deploy the model on Watson Machine Learning; your infrastructure is already there to support you in this step, and your model deployment is ready: you can call this model through Python, or other languages you can think of, by simply making a call to it. So that part is taken care of here. Then the final part is making the predictions. One thing you always have to remember is that the data you're running the prediction against has to match your input data source: the same features, because you've recorded certain patterns and you're trying to run new data through to see whether those patterns repeat and how likely the customer is to be unhappy. So in this step I'm taking the same fields we trained our model on, and this function was created to give us the final predictions. You can see we're running it in batches: this part calls out the top 10 cases and runs predictions against them.
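The notebook does this scoring through the deployed Watson Machine Learning endpoint; as a library-agnostic sketch of the same idea, here is the batch-plus-threshold flow against a local scikit-learn model, continuing from the comparison sketch above (reusing its `model` and `X_test`), with 0.95 as an illustrative cutoff rather than the workshop's prescribed value.

```python
import numpy as np

# Batch of new cases to score; the first 10 held-out rows stand in for
# the "top 10 open cases". They must carry the same features the model
# was trained on.
X_new = X_test[:10]

# Probability that each case belongs to class 0, i.e. "unhappy".
proba_unhappy = model.predict_proba(X_new)[:, 0]

# With thousands of open cases, apply a confidence threshold so the team
# focuses only on the riskiest ones.
flagged = np.where(proba_unhappy >= 0.95)[0]
print(f"{len(flagged)} of {len(X_new)} cases flagged for proactive outreach")
```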
Here you can see the indexes of the observations, and at the end you see the probability: a target of 0 means the customer is potentially unhappy, and we're getting the probability of how likely they are to be unhappy. Because if you're dealing with thousands or hundreds of thousands of cases or requests being opened every day, the goal is that we can't look at them all; that's where we put thresholds on the probability, maybe 95 percent confidence and above, where we really know things are going to go wrong, and that reduces the number of observations so you can focus. And that takes us back to the step here: this is the final output when you integrate this into your dashboard, whatever you have, Power BI, Tableau, or maybe a new form created for you; this is how it would potentially get integrated, and now people can start using it. That's the whole goal, and that sums up the lab. I've added some snapshots here, but there's an exercise I think I skipped, so the question for all of you is: given that you've gone through this journey, and we showed you one option we would consider for consuming the output of the model and making it visible so the business can act on it, how would you create your own journey, and how would your business consume this information? That's the exercise here. Given the time, I'll leave some room for the chat, and I'm happy to talk more and answer questions. Okay, cool, I see it worked for Finn. So that's the exercise part, and to summarize what we did in this step: generate ideas on how to consume the predictions and integrate the solution into the business systems; that's what we looked at as an example here. I'll be available in the chat to answer questions, but with that, Maureen, handing it over to you; that's what I see as the summation of the workshop. Okay, terrific, and on to the acknowledgments: we want to acknowledge all the people who were helping behind the scenes to answer questions and put this work together. So thank you all for hanging in there, joining, and interacting so much in the chat; we know this is not an easy thing to go through. We wanted to keep it experiential, and hopefully you got a good sense of the projects you could work on, what it's really like, and the things data scientists have to deal with on an ongoing basis. Thank you very much.