So we're back, and we're about to start the data science workshop. I'll go ahead and introduce the speakers who'll be conducting the workshop. First is Niraj Madan, a data scientist with IBM with over 15 years of experience in data science and strategy consulting; he's currently leading the data science practice for IBM Cloud client experience. Next is Maureen Norton, chief analytics officer and marketing intelligence professional at IBM; Maureen is the global data scientist profession lead, helping to grow the skills and expertise of data scientists. And finally we have Upkar Lidder, a data scientist with IBM. He's an IBM data science and AI developer advocate with 16 years of experience in IT, including team management, functional and technical leadership roles, with deep experience in full-stack technology. So, Niraj, Maureen, and Upkar, welcome, and I'll turn it over to you.

Thank you so much. Niraj, I believe you're going to share the screen. Yes, that's right. Okay, then we can get started.

Wonderful. Welcome, everybody. I hope you were able to join us for the earlier part of the data science day at the Open Group event; there were some wonderful presentations, and I think we're going to be building on some of those concepts here. Let me welcome you to the data science workshop. We're hoping this will give you an experiential journey with data, but the real point is to inspire your work. We'll use an example, but what we really want is that as we go through this session, you think of the kinds of problems that would apply in your own work environment, so you can start thinking through issues you might have and get assistance from the folks here on the call, as well as from colleagues in the workshop who may have expertise they can share in the chat.

So let's go to the next chart, and we'll talk a little about how we're going to structure the workshop today. We're getting started, of course, at 9:15 Pacific time, and we had sent out pre-workshop setup instructions, so hopefully everybody has had a chance to start getting set up; we'll have support available for anyone who had difficulty with that. Next chart, please.

So here's what we're going to do in terms of... I'm wondering if you could speak a little louder; we're having a bit of trouble with the volume. Oh, okay, sorry about that; let me switch headphones. Yeah, you already sound better. Thanks, Maggie. Also, Maureen, while you're doing that: Niraj, I'm not sure if there's an issue with your display, but what we're seeing is a fairly thin slide with a lot of black space to the right and left of it. Yeah, let me do that once again; I assume your colleagues are seeing the same thing I am. Is it better now? Yes, that's much better, thank you. That's great.

Okay. So what we're going to do today: we'll have a getting-started session where we go through some expectations, then a brief introduction to data science, talking about predictive analytics and machine learning solutions. And as I mentioned, this is experiential, so we really want to create a project. We'll then go into an example that has applicability across various industries, and that is a net promoter score example.
So we'll dive into a bit of that and take you through each of the steps that are critical to creating your project: business understanding (always start with the business problem), data understanding, data preparation, modeling, evaluation, and deployment. We're going to go through that version of the life cycle so you'll have a better sense of what it will take to implement a project in your own organization. Next chart, please.

All right. The first thing is, let's talk about data. If you were able to join us in the earlier sessions, there were some excellent insights presented by Seth O'Brien, Beth Rudin, and Rauder Osterventsch, because they each shared something really relevant to this workshop: the importance of data, of data governance, and of really thinking about your sources of data. We want to make sure that as we build systems, we build them in such a way that they'll be considered trustworthy, and that we manage them so there isn't bias and that sort of thing; there are all kinds of ways we can do that. We're going to touch on that only lightly; right now, let's think about data.

If you have a business problem (and again, always start with the business problem you're trying to solve), you can get pretty innovative about the types and sources of data that may be able to help you. Some sources of data have information that could help you with pretty much any topic, anywhere, anytime, and that is what I wanted to use as one of our first examples: is there a source of data that has its finger on the pulse of what people think at any given moment in time, anywhere in the world, on any given topic? If you go to the next chart, most of you will probably not be too surprised to see that such a source of data is Twitter. It is a wealth of information about any topic you might want to research. Again, I'm using this as one example of a source of data that maybe you aren't considering.

The CEO of Twitter, a few years back, was speaking at an IBM conference, and he told a compelling story that has stuck with me ever since. He said he was feeling very confident about all the data they had at their fingertips, that they could help any business that wanted to tap into it; it was a terrific resource. Then he described a rather humbling experience: one day, the CEO of a commercial fryer company came in and explained that he wanted to access Twitter's data to help his business. As he explained what his business was, Twitter's CEO was very perplexed about how they could actually help him. He was a commercial fryer manufacturer: when you go into a fast-food restaurant where they're making french fries, or crisps, depending on where in the world you live, there are these large commercial fryers where they dip the food in the oil; those are the machines he manufactured. And the CEO of Twitter was leaning back in his chair thinking, this is probably the first business I've run into where I don't have anything that could help. But the innovative manufacturer said, no, actually, you do have something that could help me tremendously.

The Twitter CEO said, surely nobody tweets about the commercial fryer as they go in and get some fast food; nobody comes out and says, gee, what a great commercial fryer they had in that place. And the manufacturer said, no, they don't do that, but what they do is tweet about things they're not happy about, and in particular, on the next chart: soggy fries. Nobody likes soggy fries; they're supposed to be crisp, fresh, and hot. But if the commercial fryer isn't working as it should, you can end up with what you see right here, and people definitely will tweet about it. They tweet about a lot of things, including something as seemingly insignificant as soggy fries. Looking at data through a very innovative lens, what this manufacturer knew is that he could use geospatial data to figure out where the complaints were coming from and then map out whether those were his commercial fryers, in which case he'd know they needed maintenance and could dispatch a technician to fix them, or a competitor's fryers, in which case he could use that information to make sales calls and try to sell his own. So he was monetizing a kind of data that, I'd bet, if you've ever tweeted about soggy fries, you never imagined could be turned into value.

If we go to the next chart, I wanted to also talk about other types of data that can be used to drive deeper insights, and I'm sure all of you have ideas; feel free to engage with each other in the chat about different types of data you could add to your existing data sets to provide deeper insights. A lot of times, innovation comes from putting two things together that just haven't been put together before. A classic example is suitcases and wheels: at one point somebody got a patent for the idea of taking a basic suitcase and putting wheels on it, and now it's probably hard to buy a suitcase without wheels. That was putting two things together in a new and innovative way that enabled a whole new solution. Likewise, we can think about data that way, and another example, on the next chart, is weather data. There is a wealth of information in weather data; retailers use it quite frequently to determine inventory levels. An obvious example: in the winter, in a very cold, snowy area, you're going to want, as a retailer, to stock up on supplies and things that will be more in demand at that point. But weather data can be mined and used in a lot of different ways to augment other data and gain additional insights. So I really want to get you to think broadly and innovatively about the data sources you can use in light of the business problem you're trying to solve. The real key point is: always start with the business problem, then seek out the sources of information that will be most helpful to you, and then get some innovation going, putting those wheels on the luggage, if you will, so you can drive those deeper insights. I'm now going to turn it over to Niraj to take us into the next steps.

Thank you, Maureen, and great examples with the Twitter and weather data.
Today's workshop is interesting, in a way, because it involves all of us. The goal we've set for ourselves is that by the end of this workshop you have a set of templates, a plug-and-play structure if you will, where you can pick up a problem and fill in the blanks, and that becomes your own journey in data science; that's why we're calling it a real journey and running it as a workshop. Throughout the workshop, we'll have a few exercises where we ask for your inputs, and we'll share templates you can reuse afterwards for any AI product you're developing from scratch, or at whatever phase of a project you're already at. Everything is posted on GitHub, and we'll go through that. With that, thank you, Maureen.

Moving on: we've just seen two examples, and there are tons and tons of problems around us that can be solved using data; we talked about Twitter and weather. But broadly, from the time we start interacting with our stakeholders or working with the business, when problems are thrown at us, we generally need a framework to bucket what class of problem we're dealing with before we go for the deep dive. Broadly, in data science, there are four classes of problems: risk, quality, customer satisfaction, and anything around price, cost, and value.

Take risk as an example. It can range from passengers being screened for threats at the airport, to credit card transactions, to loan approvals; anywhere any sort of risk is involved, those problems can be classified into the first bucket, risk assessment. Looking at a lot of pictures and building a model to tell the fake ones from the real ones, and the same for news, videos, and photos: such problems also fall into this first bucket.

The second bucket is around quality and defects, largely in manufacturing setups, and the weather data example fits well here too, for instance around ATM machines failing. That reminds me: one of my colleagues worked on a problem of how often those machines should be refilled, and weather data played a very important role in it. With servers, multiple variables can help determine whether it's a hard drive or some other component that leads to a server failure, and so on. And castings, like the toy car you see out there: in a real scenario, there's a defect, and which variables caused it? Any of these problems around quality and defects go into that second bucket.

And the third bucket is around business value and customer satisfaction, and that's the bucket we'll be focusing on today.
Net promoter score is a very widely used term, but we're not going to make any assumptions in today's session about whether you've heard of it or not; we'll cover everything from scratch in this workshop, starting with what NPS is and how it's calculated. My point is that customer satisfaction, or value, is another bucket altogether. And with house prices at an all-time high and stocks picking up again over the last day or two, anything around price, cost, or value falls under the fourth bucket: rental estimates, the value of a house, all those kinds of things. So broadly there are four classes of problems around which the majority of projects can be defined. The reason I'm explaining this is that whenever you're dealing with a problem, you should first think about what class of problem you're dealing with, because that's where you start before going to the next level.

Once you've identified the problem you're dealing with, the next question is: what is the right analytics approach to apply to it? We go step by step: we've found which bucket it belongs to, and then we get a little more technical and ask what analytic approach we're going to apply to solve it. It's a fairly exhaustive chart, but from the perspective of what we're covering today, here's how you read it. First we ask: in the scenario I'm trying to solve for my business or my customer, am I predicting an amount, a value, like a house price? If so, it goes into the class of problems addressed by forecasting, linear regression, or decision trees. If I'm not predicting an amount, am I predicting an event: is the machine going to fail or not, is the server going to fail or not, is the customer going to be happy or unhappy? For those kinds of event problems, there's another set of algorithms we can apply. And if it's neither of those, if it's very exploratory analysis, then we go into approaches like clustering. The reason I'm sharing this is that it's a starting point; we need a framework to think through and navigate. These are just a few of the examples, and there are many other algorithms, but this is how we get started solving a data science problem from start to end, and that's the goal of this workshop: by the end of it, you'll have run the code with us, from reading the data to checking its quality to the next steps we'll talk about. So far we've covered two steps: first looking at the bigger buckets, and then looking at the analytics approaches available to us. If you go for the deep dive, I'd say the large class of problems we deal with today falls in the top bucket; some more advanced problems would be in the second bucket at the bottom, which I've included for your reference.
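To make that decision chart concrete, here is a minimal sketch in Python of how the three branches map to starting algorithms in scikit-learn. This is not from the workshop notebook; the function name and the particular estimators are illustrative choices, not the only options:

```python
# Map the chart's questions to a starting algorithm (illustrative only).
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

def pick_estimator(predicting_amount: bool, predicting_event: bool):
    """Turn the 'what am I predicting?' questions into a first model to try."""
    if predicting_amount:              # e.g. a house price
        return LinearRegression()      # a decision tree regressor also fits here
    if predicting_event:               # e.g. will the machine fail or not?
        return LogisticRegression()    # any classifier could stand in
    return KMeans(n_clusters=3)        # exploratory analysis -> clustering

print(pick_estimator(predicting_amount=True, predicting_event=False))
```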
Now, once we have that, and this is one of my favorite charts, here is a framework that helps you start to end: how a data science project goes, where you start, where it ends, and what the key deliverables are. What I have often seen is that the majority of organizations follow agile methodology, and it becomes a very stand-up-oriented culture where you report what you're doing; but data science projects can be very research-driven, where you say, okay, I'm going to spend some time reading, and it becomes very hard to quantify what you're trying to accomplish at every stage of the project, every week or every month. That is where these frameworks play a very important role in communicating with the business and our stakeholders, because people can't just wait six months or a year on the promise that some magic will happen, an algorithm will be created, and the predictions will appear. It has become ever more important to tell your stakeholders what stage of the project you're at and what value it is bringing, and to be able to document that.

What this chart explains is the key phases of a project, and it comes from two sources. One is CRISP-DM, one of the top data mining methodologies in the industry; the other is a book I had my hands on by Sebastian Raschka. The top part is from CRISP-DM and the bottom part is from Raschka's book; they're pretty much communicating the same thing, just in different ways of looking at the journey of a project.

Here's how we start whenever we're working on a project from scratch. The first thing is to assess the situation: what situation am I in, what problem am I solving? We'll take an example for each of these areas as we go through the case study, but to get a broader understanding: the first part is always to assess the situation. Then we look at the methodology, and what methods I'm going to use, like in the last example I was mentioning. Then it's about benchmarks, because it's always important to see whether anybody has already done what you're trying to accomplish; if you'd be the first to do it, that's also a good thing to know, but if benchmarks are available, it's always good to have them. Once you've done the groundwork, it's about objectives. Then we start looking at features; "features" can sound like a technical term, but features are simply the data variables we're going to look at. And once we have the data, it's all about checking the quality, because as we all know, if the data is not reliable, the output is not going to be reliable. That covers the business and data understanding phases. Then we move on to data prep: the machine will never understand the data as we have it, so we have to encode it in a form the machine can understand, and that is where steps like extraction, scaling, and selection come into play; we're going to do that hands-on today in our notebook.
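As a preview of those steps through to a first model, here is a minimal, self-contained sketch of the scale/select/model flow with a train/test split. It uses synthetic data rather than the workshop's case data; the notebook itself does this with the real NPS features:

```python
# Minimal sketch: raw data -> train/test split -> scale -> select -> model.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the workshop data
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# Raw data -> training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scaling -> feature selection -> model, chained so the exact same
# transformations are applied again to new data at prediction time
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=5)),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)                      # let the algorithm learn
print("accuracy:", model.score(X_test, y_test))  # evaluate on unseen data
```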
Once the data is in a form the machine can read and understand, that is when we select and create a model and evaluate it. And then it's demo time: we create a demonstration, because nobody from the business is actually interested in what goes on in the backend, how you built the algorithm and all that; what they want is for you to explain what your model is doing, and this case study will give you some examples of how to keep the information simple enough for the business to consume. The next step after that is integrating it into production so everybody can start using your work and its outputs.

In other words, in the parallel references that come from the book, it's about creating a path: raw data, training and testing data sets, let the algorithm learn, a final model, and then new data on which you make predictions. And oftentimes people will say, okay, so this model is useless; to which we generally answer with one of my favorite quotes, the statistician George Box's line that essentially all models are wrong, but some are useful. A model is always an assistant; it can never be 100 percent accurate, but something is better than nothing, and we are always working with probabilities. So this is a framework, a roadmap for building any machine learning solution from a concept to a product; keep it in mind, start filling in the blanks, and we'll look at an example of where we started.

There are many, many tools available out there, from IBM Watson Studio to Google Cloud, Azure, SageMaker, and many other SaaS offerings. Because we work on Watson Studio and there's a free trial, we're using that, but you can follow along in Watson Studio for now and later use the same notebook on other platforms with some minor tweaks, and that should work for you.

With this: is the chat enabled for everyone to post? Yes, everybody can post to the chat channel. So I have a question for all of you, just so we don't end up mismatching expectations by the end of the workshop. I wish I could run a poll, but if you post your responses in the chat, that'll be very helpful for us to see and to stay on track with your expectations. The question is: what is your expectation from this workshop? An option 3, okay... thank you, let's see. Wow, okay. Just give it a... yeah, go ahead.
I was going to ask, well, can you hear me? Yes, I can hear you. Perfect, I was just testing my mic, but I'll answer the question too. No, that's good; we're looking for all of you to talk as well, so it's good that you could test it.

Okay, thank you for sharing. Summarizing these options, and the reason I asked: the majority of you are saying you want to understand the journey of a data science project; yes, we will accomplish that. Effectively communicating with and leading a data science team, learning the jargon of how a project goes: absolutely, that's our goal. But anybody who's leaning toward the first and fourth options, I think those are a little beyond what we can accomplish in the next one and a half to two hours we have; I hope I've communicated that. Thank you for your responses. With this, I'll hand it over to Upkar. We shared the instructions for getting your system set up beforehand, but we'll still walk you through them quickly in case somebody hasn't been able to follow along; the recording will be posted later, but our goal is that you all try the steps we follow in this workshop, which is why it's hands-on. So with this, over to you, Upkar.

Okay, let me share my screen. By the way, I was reading the responses, and it's interesting that everybody assumed the list is one-based and not zero-based, because the programmer in me first asked: is this zero, one, two, three, or one, two, three, four? Okay, hopefully you can see my screen. Yes. Okay, great; I think you ended here.

All right. First of all, I hope everybody is able to follow along and actually do the workshop with us. Niraj and I created another PDF; actually, let me go to the GitHub repo. This is Niraj's GitHub repo, so if you go here, and if you like the workshop, do star it and fork it. There are a couple of different files here. The first is the notebook itself that we want to explore together; hopefully you get to import it into your Watson Studio environment today as well. More importantly, there's the set of setup instructions we sent beforehand. If you weren't able to go through them before, that's fine; you can do them now. I'll go through the steps real quick in the next five to seven minutes, and if you watch the screen, it's pretty easy to follow along. And then there's also the actual PDF of the presentation here for you to look at afterwards.

Okay, so if you open the setup instructions, we're going to do a couple of things. First is creating an IBM Cloud account; this PDF has a link on page 12, this one here. If I open that in a new tab, you'll see I'm already logged in, so it's asking me to log out in order to create a new account, but if you're not logged into your IBM Cloud account, you should be able to use it. The cool thing about this URL is that it gives you a special account where you don't have to enter your credit card number; if you were to go to the public cloud.ibm.com URL, it would ask you for a credit card, even though everything we use today is free, or has a free tier, a Lite tier I should say, and it would not charge you anything.
So use that URL to create your account. Then we will create three services today: first, Watson Studio; second, Watson Machine Learning; and third, Cloud Object Storage. Let me go ahead and start creating those, and as I go through them, I'll explain what each one is for.

I'm already logged into my IBM Cloud account, so cloud.ibm.com takes me to this dashboard. There's a lot to talk about here, so let's just start creating the services and I'll explain the pages as I go through them. The dashboard is where you see announcements, your resources, what you've used so far, and any open tickets; it's the overall high-level view of your account on IBM Cloud. The other important page here is the catalog: if you click on Catalog at the top right, it shows you all the different services offered on IBM Cloud, and there are plenty of ways to browse them, by category on the left, or filtered by different attributes; you can look at all the IBM-provided free services, for example.

All right, so like I said, we need to create three services today. The first one is Watson Studio, so I'm going to type that in the bar here, or you can also search the catalog, which takes you to the same thing. Watson Studio, click on that, and you'll see the plan is automatically picked. Uh-oh, let me actually go back and do this again; the problem is I just had all these services created, and you can have only one free, or Lite, tier at a time, so I want to make sure I deleted the ones I created before. I did, but just to be safe, I'm going to log out and log back in; give me a second. I do have a lot of different accounts with random emails, so let me log into this one and we'll start creating the services again.

All right, so the first thing I want to create is Watson Studio. Is the screen and text okay, or do you want me to enlarge it? I think enlarged will work. Also, just to check: how many of you have been able to log in so far, so you can follow along with us? Post a yes in the chat if you're logged into this website. Okay, I see a lot of yeses; that's good, thank you.

All right, a couple of things to notice here. First, the Lite tier is picked for you; that's enough for our purposes today. The second thing to notice is the location. Whatever location you pick here will be used in two places: one, when we create the other service, we want to create it in the same location as Watson Studio; and two, in the notebook itself there's a place where we ask you to put in the location of the service. For me it's US South; for you it might look a little different, but make sure you first take note of this region, and secondly, that the next service you create is in the same region as Watson Studio. Agree to the license terms and click Create. This usually takes a couple of seconds; there you go, it says Watson Studio in Cloud Pak for Data has been created. Great. The second service I want to create is Machine Learning; type that at the top, and it's the same process. Machine Learning also has a Lite tier, and you also have to pick a location; I picked US South for Watson Studio, so I'll pick US South for Machine Learning as well.
Agree to the conditions and create the service; that's the second service created. The third and last service we want to create is Cloud Object Storage. If you've worked with S3 before, this is an S3-compatible object storage service on IBM Cloud. When you create it, it will not ask you to pick a region, because Cloud Object Storage is cross-region; when you actually create a bucket inside the service, then it asks you to pick a region, but we don't have to do that for our workshop today. So pick IBM Cloud, and the Lite tier is selected again. By the way, in all of these services I'm accepting the default name; you can rename your service if you like. Then click Create. So again, just to reiterate: three services, Watson Studio, Watson Machine Learning, and Cloud Object Storage, and you can let us know in the chat once you've created all three or if you run into problems.

This one is taking a little longer to provision, so let me open another tab and go back to cloud.ibm.com; I want to show you something else while that comes up. If you ever get lost on one of the many pages of this cloud dashboard, the navigation on the left-hand side is pretty handy, and there's something called the resource list. If I click on that, it shows me all the resources I've created in my account, and if I drill down further into Software and Services, you'll see this is the Machine Learning service I just created, this is the Watson Studio service I just created, and finally, this is the Cloud Object Storage service, which has now been provisioned.

Okay, the next step, once you've created these services, is to open Watson Studio: click on Watson Studio and launch it, Launch in Cloud Pak for Data. Once this finishes, we'll create a project; everything in Watson Studio is driven by projects. On the main page you can see I don't have any projects yet, because I just created the service, so let me click on New Project. All of these steps, by the way, have screenshots in the PDF, so if I'm going too fast, you can always go back to the PDF and follow along. On this page I'll pick Create an Empty Project, and we have to give it a name; I'm going to call it "workshop", and you can call yours whatever you want. You can see it already picked the Cloud Object Storage that I created in the previous step; if you forgot to create that, or if you don't have one, this screen gives you a way to create a new Cloud Object Storage instance, but it's easier to create the services beforehand and then create the new project in Watson Studio.

All right, it's creating a new project, and the next step after that will be to associate my Watson Machine Learning service with the project, so that all of the notebooks in the project can use that service to create and deploy models. So now I've created this project, and within it there are a couple of different tabs. Overview is the overview of your project, and Assets is what we're interested in: you can add different kinds of assets to the project, and you can load different kinds of files, for example CSV data files; all of those files get stored in Cloud Object Storage, which you can see on the right-hand side here.
But there are other things you can add as well; this Add to Project button lists everything you can add to your project. We'll come back to it and add the notebook that Niraj and I have written for you. Before we do that, let's add the machine learning service. To do so (let me enlarge this a little), go to Settings, where there's a section called Associated Services, currently empty. You say Add a Service, then Watson, and it should list the machine learning service that was created earlier. If you don't see it, make sure you have the correct filter selected at the top right; it might be that the region you picked before isn't selected here, so you won't see anything, and this is where it's important that your machine learning service be in the same region as your Watson Studio service. I'm going to click on Associate Service, and that's done; now Watson Studio knows about my machine learning service, which is important, and once you're done, you'll see it listed here as well.

All right, so now we've done all the prep work: we have three services, we've linked Watson Studio with Machine Learning, and we have a project in Watson Studio. Now we'll add the notebook as an asset to our project. Currently there are no assets here, so I'll click Add to Project and then Notebook. There are different ways to add a notebook: you can start with a blank notebook, of course; you can get a notebook from a file on local disk; or, the third option, which we'll use, you can get the notebook from the GitHub repository Niraj has prepared for us. Let me grab the URL; you added it in the chat, but I'll get it from here, it's the same URL. I'm going to copy the link address and paste it in. The name of the notebook is NPS, though you can name it whatever you want, and in the Notebook URL field I'll put the whole URL Niraj posted in the chat. You can leave the runtime as the default; there are paid options as well if you need more power to run your notebooks, and, as it says at the bottom right, there are GPU-enabled notebooks available too, but for us, the free runtime they provide is good enough. That's it; click Create, and it imports the notebook from GitHub into Watson Studio as an asset in my project. Once that's done, I'll pass it back to Niraj and we can get on with the agenda.

Let me also go back here. So we went through setting up our environment, creating the services, and creating a project, and then we added a notebook; the URL is also here, by the way. Those are the steps I wanted to cover. Let's go back, and there we go: the notebook has been imported into my environment, and you can see it's using the Python 3.8 runtime here. If you've used Jupyter notebooks before, this is a typical Jupyter notebook, but if you haven't: these things are called cells, and there are a couple of different ways to run them and a couple of different things you can put in them. The first cell contains markdown, as you can see here. To run a cell, you can either use the Run button, so if I click Run it runs that cell (there's nothing to compute, since it's markdown), or you can press Shift+Enter if you prefer the keyboard.
One more thing to notice once you get to the code: mine already has numbers next to the cells, but while a cell is running, that number changes to a star, as you can see. If you see a star, it just means something is happening and the cell is being processed; once it's done, it changes back to a number. The numbers are sequential, so if I run three cells and then go back to the first cell and run it again, you'll see a four there; don't get confused by that if this is your first time using Jupyter notebooks. Okay, Niraj, is there anything else we want to cover at this point? I think the IAM credentials, that's one more.

Okay, good, thank you for reminding me. As I said before, we're using scikit-learn to train our model, and Watson Machine Learning not to train but to deploy our model online as a service, so that somebody can use it via REST endpoints. In order to use Watson Machine Learning, somewhere down in this notebook (I won't go there now) you need credentials, an API key. To grab that API key, the screenshots in the PDF document explain this as well, but let me show you real quick how to do it; I'm going to close this. On your IBM Cloud console screen, go to Manage and then Access (IAM). By the way, this used to be done differently; if you've attended this workshop before, they've changed it to use IAM credentials for all of the services in your account. Under IAM you click on API Keys, and you can see I generated one earlier this morning, but you can generate a new one; you cannot reuse an existing key, because once you generate a key, its value is hidden from you afterwards. Let me show you: click this button to generate a new key, I'll call it "workshop key", and click Create. This is the key here; I'm showing it to you only because I'm going to delete it later, but do not share your private keys with anybody else. Once I leave this screen, the key is gone, so click Download and save it somewhere on your desktop so you can go back to it later. Make note of this key: in the notebook, as we go through it, you'll see it asks you for this key later on, in order to use the machine learning service with the SDK. All right, Niraj, back to you.

Okay, so with this, we have a short five or ten minute break, but we'll stay here on the bridge to see whether you're all set up; if you're facing any difficulty, we can help you with the steps. So the break starts now; we'll be back in ten minutes and get started on the next steps, but we're here if anybody has questions. Is that okay, Upkar? Yeah, perfect.

Okay, I'm getting some thumbs up, thank you. And if you're all set up and have this notebook open on your computers, give us some yeses so we know you're following along. So, Tim, which step got you confused? If you want to elaborate, you can come on audio or use the chat, whatever you prefer, if you want to talk more about it, and I can try to help.
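While responses come in: for reference, here is a minimal sketch of where that API key ends up later in the notebook, assuming the ibm-watson-machine-learning Python SDK. The URL must match the region you picked (US South here), and the key string is a placeholder, not a real credential:

```python
# Hedged sketch: authenticating the Watson Machine Learning SDK with an IAM
# API key. Usage follows the ibm-watson-machine-learning library; the key
# value below is a placeholder for the one you just downloaded.
from ibm_watson_machine_learning import APIClient

wml_credentials = {
    "url": "https://us-south.ml.cloud.ibm.com",  # must match your service region
    "apikey": "PASTE-YOUR-IAM-API-KEY-HERE",     # never commit or share this
}

client = APIClient(wml_credentials)

# Quick sanity check that the client initialized against the service
print(client.version)
```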
Okay, associated services? I'll show you that step; let me share my screen. If you can see my screen: you're already in Watson Studio, so these are the projects, and I'm assuming you've already created one; I created one with the name NPS. Once you have the project, you'll see these tabs: Overview, Assets, Environments, Jobs, Access, and Settings. Click on Settings and scroll further down; I've already done this, but what you have to do here is click Add Services, then Watson, and you'll find the Watson Machine Learning service there; put a tick mark on it and there'll be an option to associate it, so just click Associate. Does that help? Are you able to follow along? Awesome, thank you for the confirmation. Cool.

And if you have questions, feel free to send them to everyone, because others may have similar questions, so that may help them too; but it's your preference, that's all fine. Any other questions, or does anyone need help? Please call out.

The way to verify you're good to go: if you go to Settings, you should have Watson Machine Learning here, and if you go to the resource list, you'll find WML, Object Storage, and Watson Studio, the three services we created, in your resources. And once you have the API key, just save it somewhere; as we go through the notebook, there's going to be a place where we paste that API key, so just store it for now and we'll use it then. Thank you for the question.

After creating the services? Okay, so once you've created the services, were you able to create the API key? The next step after that is creating an API key, so let me share my screen for one second. If you see my screen: first, go to the resource list; the way to verify you have everything set up is to check that you have three services: one, Object Storage (if you don't have it, you can add it), and two more, which are Machine Learning and Watson Studio. You already have these three services showing up in resources? Okay, cool. Then go to Manage at the top and click Access (IAM), and once you've done that, on the left there's an option that says API Keys; there, click Create an IBM Cloud API Key, and just save it, because we're going to use it later. Once you've done that, it shows something like this, and you can save it on your system. Done?

Okay, so once you've done that, go back to resources, and where it says Watson Studio, open it; it says Launch in IBM Cloud Pak for Data, and it's going to open up. Are you with me so far? Once you have that, you have to create a project; it says Create a Project, so click on that, if you're on this screen with me so far. Okay, cool. Click Create a Project, this screen opens, and click Create an Empty Project; then put in a name, any name, say NPS Workshop, and you'll see an Object Storage instance automatically get associated; then click Create.

Niraj, you could also remind people that if they have questions and need help, they can raise their hand; it's down at the bottom, next to the smiley face. Oh yeah, that's right; you can all use that too, and there's a Q&A as well as raise-a-hand. Thank you for that reminder.

I'm trying to add a Watson service, but it's not allowing me; I can see it listed, but I can't associate it with the project. Okay, Judy, I think I know what's going on there; let's finish this one and then I'll tell you.

So once you're done creating the project (and I think I'm going to start the session now, since hopefully you've been able to follow along up to here), you'll see the new project appear; I have a new project here, like I just created. Once you have the new project, click Add to Project, then Notebook, and in the notebook dialog click From URL; type a name for the notebook, NPS or NPS Workshop, paste the URL I shared in the chat over here, and click Create. Once you do that, you'll land pretty much on this page, which opens the notebook. Awesome, thank you. Are you able to follow along so far? Are you good? Okay, great. Let's get started; I know we're pretty much on time, and thank you for following along.

Where can I find the project? You have to create a blank project, and the notebook link is pasted in this chat; when you create a new notebook, that's where you'll use it. So basically: you create an empty project, and within that, you open a notebook from the URL. Awesome, thank you.

Okay, so let's get started now. Recapping what we've done so far: it's been just a few clicks, and you have your infrastructure ready for any machine learning project, end to end; if we'd each done it standalone, it would have taken ages. We have set up the machine learning service in our virtual infrastructure, we've set up the object storage, we have Watson Studio with the Python notebook we'll be using, and we have our API credentials, which can be used centrally to authenticate.

So now, moving on to the next step: how do these projects get going? A lot of times, at conferences, speaker series, and mentoring sessions, people ask: how do I find a project? I love data science, I've heard a lot about it, but I don't have a problem statement to work on.
lot about it I don't have a problem statement to work on so I just wanted to share the journey that we took so in our scenario that you know one one of the days I think like this whole case study is inspired from some of the work I have been doing with the business and so we were with our CIO and you know I was sitting with them and like we called upon like a working session with our leadership and then CIO mentioned about you know kind of you know kind of in this example a great concerns with client satisfaction because we always want to be the best and then the CMO kind of brought up a point that hey I mean it's always you know customers give their money fan gives their hearts so we really need their hearts is what we should aim for and then CFO brought up another financial point there that you know hey I mean rather than you know customer retention is more important you know for us to decrease the cost so we that's very very important and then the then the you know operate you know CIO mentioned about that you know currently in this example or I mean if we would see like 44% of the customers for this you know for this kind of you know case study 44% of the customer experience has been planned and and and then I think kind of let's say I was also part of the conversation I brought up a point that when we talk about customer experience NPS is an industry standard to measure the customer experience and the point I brought up was that how about we use NPS as a measure for satisfaction and we see how we can further improve upon that and then possibly what we could do you know and then we could actually start predicting the customer experience ahead of time and bringing in the right intervention is the case study that we kind of you know coined in that conversation and the CIO was like okay right so you all are a team right now and let's assume this in this workshop we all are like the member of that the core team and we all are asked to go back and build a proof of concept to predict the NPS net promoter score which equates to customer experience so that's how this whole project or any of the case studies get started you know there is a sponsor there are some support management and then you start building your team and I have a like you know how many I mean 60 80 people out there as my extended team here to build the proof of concept together you know out here and then so basically I mean and and that's where if you remember we looked at the whole you know the structure of a project like a framework so when we anytime we start a project we assess the situation what is the situation currently so let's say this is the company and they support 500 cases or 500 interactions on multiple platforms with the customer and then their response rate I mean if you send them a survey the response rate was 15% and and 60% of the responses were non promoters and 40 were promoters and in this example you can see that you know the scale is 0 to 10 and we combined like the detractor and passive into one is not happy because if it's passive also we consider them not happy and you know and then the promoter is like a positive they're like very happy about what you know the services they're getting from us so this was like assessing the situation and any of the problems that we generally get into we have to see the scale of the problem how big problem are we solving and a page like this being able to you know fill in the blanks and these numbers is always very helpful in this situation then once we know what I 
Now, once we know the situation: NPS may be new terminology for some of you, so it's always important to document how it's calculated. In this case we've all experienced it, I'm sure, because we all get a question at the end of our interactions with any chatbot or any company's customer service department: on a scale of 0 to 10, how likely are you to recommend the company to a friend or colleague? You can see how those ratings roll up at an aggregate level: say there are 1,000 total responses; you can see how many are promoters, how many passives, how many detractors. The formula for calculating NPS is the percentage of promoters minus the percentage of detractors, and the score can range from minus 100 to plus 100; in this example the score is 10. That's how NPS is calculated, and I've also included a link to a calculator for your reference.

Now, once we've assessed the situation and know the methodology, it's also very important, as I mentioned in the earlier scenario, to look for poster children, industry standards: has somebody already done this? If they can do it, why can't we? In this example, the left chart shows the industry benchmarks, which industry has what NPS score on that scale of minus 100 to plus 100, and on the right side you can see some of the poster children of NPS, like Southwest, one of my favorites, and Apple, all of them premium on customer service, along with the others shown there. So if you have to set a benchmark or a target for yourself, this can serve as a baseline in context.

Once we've done that, the next step is to articulate your goal, approach, and desired results. In our scenario, the goal is to improve NPS by identifying the potential non-promoters ahead of time and being proactive in addressing their issues. The approach: consume the historical data, use ML and AI to set up an algorithm, and use it to predict, among the cases being worked on, where customers are going to rate us down, so we can address their issues more proactively; we're talking about large volumes, and it's not possible to look at every case proactively. And finally, deliver the UI or tool that makes these predictions available to the team so they can consume them. So that was the business objective.

Now, as I said, this is hands-on, and what we aim for is that by the end of this workshop you go away with a problem statement of your own, using this template and filling it in, maybe building a project and making more partnerships. So here is an exercise I'd like to request: please put your inputs in the chat.
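Before the exercise, to pin down the formula from a moment ago, here is a minimal sketch of computing NPS from raw 0-to-10 survey ratings; the ratings list is made up for illustration and chosen to reproduce the example score of 10:

```python
# Minimal sketch of the NPS calculation described above.
# Promoters rate 9-10, passives 7-8, detractors 0-6;
# NPS = % promoters - % detractors, a score from -100 to +100.
def net_promoter_score(ratings):
    total = len(ratings)
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / total

# Illustrative data: 450 promoters, 200 passives, 350 detractors
ratings = [9] * 450 + [7] * 200 + [3] * 350
print(net_promoter_score(ratings))  # -> 10.0, matching the example's score of 10
```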
The template reads: As a <role> (your role could be end user, developer, CIO, CTO, CEO, full-stack developer, whatever work you do), I would like to improve / reduce / increase / decrease <target variable>. In our case we're increasing NPS, so for me the target variable is NPS, customer satisfaction; for you it could be fraud, risk, effort, price, productivity, or revenue, anything from that list. Then: for <department X, or the scope of the area you focus on>, by <an amount, a value or a percentage>, within <a time frame: three months, two weeks, the next year>. Think about it and then list your thoughts in the chat so people can see some variety in the possibilities and how problems vary across all of your experiences. It'll take about two minutes; if you want to think it through with this template and share some examples in the chat, that would help greatly, because I'm sure when you signed up for the workshop you had a problem in mind, something you've been looking at and saying, hey, I can't solve this. So this is the time; you can anonymize or simplify the problem, but if you're comfortable, please share it in the chat so we can all look at it. Maureen and Upkar, is there anything you want to add here? Not just yet; I'll watch the chat.

While you're thinking through more examples: the goal is that you each have your own problem statement and use this workshop as a medium to apply it. We'll give it another minute and then move on to the hands-on part. Anyone else still thinking, wearing your creative hats, about what problems you want to solve in your business? "Improve the NPS for customer sat by 10%": great, thank you. Thank you for these inputs, and keep posting them so everybody can see what different kinds of problems come up: if you do some testing, you may want to reduce test bugs; if you work in a manufacturing unit, there could be tons of examples. So keep thinking about the problems and keep posting them so everybody can see them.

Moving on from here: this is a very good reference template for all of us to start documenting the opportunity. So far, we've built a structure where we can assess the situation and understand the methodology and benchmarks, and hopefully you now have a structure for defining a business objective for the problem you're trying to solve.

Now the next part is that we start looking at the data. Once we're done with the first phase of the project, the business understanding, the next question is: where is the data? What are we going to use to train an algorithm, and how do we get started? That's where features play a very important role. In the work we're doing right now in this workshop, we looked at multiple kinds of features: time-based, location-based, money-based, and sentiment and emotions. Time-based could be: what day of the week do they call us, weekday or weekend; what time window do they call in, prime hours or off-hours (something really must be troubling them if they're calling off-hours in their time zone); what's the age of the account; how many meaningful updates are they making.
Then, what's the location — which location are they calling us from? What's the spend — lifetime plan spend, recurring spend? And for each conversation they have with us, what are their sentiments and emotions — like the Twitter example Maureen took us through, but in real time: in the chat, when they put updates in the cases, we can calculate the sentiments and emotions and detect which direction the whole case is going. Then there are many other features: how many times the case has been reassigned from one team to another — if it's been assigned five, six, ten times, that's a problem, nobody is taking ownership — plus support plan, account type, severity, and many more. So this is the kind of data we looked at in our scenario. The point is, as soon as you have the data, the next very important thing is to start looking at its quality. Is the data really reliable? Reliability can be assessed in multiple ways. For example: how many observations are there — is there enough data to work with? Are there too many distinct values in a column? At the end of the day we are trying to train an algorithm on patterns, and if every row has a different value it's very hard to build a machine learning model, so we need to bucket those values or find some creative way to handle them. If there are heavy correlations between features, which ones do we keep and which do we drop, given that they represent the same information? If there are too many missing values, do we impute them or drop those records? The end goal is to have the best data we can get out of the information we've received; a rough sketch of these quick quality checks follows below. With that step, if we all have the notebooks open, let's open the notebook — now the work becomes more hands-on, with the context that we have a business problem, we have the data, and we want to understand the data better. Upkar already talked about the basics of notebooks. To run the first cell, just click on it and run it — it's a markdown cell, so you won't see any output; it's just documentation so you can follow along. This one is an index of the notebook; structuring the notebook this way is a practice we generally follow. Then this section describes the notebook, and then we load the packages and verify the versions. Loading packages in Python is a bit like appliances at home: we have a microwave, a washer, a dryer, and each does its own job. In the same way, the Python ecosystem has libraries we can call upon — a visualization library, a machine learning library — and just as we go to the microwave and say "heat the food," we initiate these libraries and avoid writing thousands of lines of code. That's the analogy for step two, where we load the libraries we are going to use in this project. Just select the cell and click Run.
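As a hedged illustration of the quality checks just described — the file name is a placeholder, since the workshop notebook reads its data directly from GitHub:

```python
import pandas as pd

# Hypothetical path; the workshop notebook reads its CSV directly from GitHub.
df = pd.read_csv("cases.csv")

# Is there enough data, and which columns are incomplete?
print(df.shape)              # number of observations and columns
print(df.isnull().sum())     # missing values per column

# Columns where almost every row is distinct are hard for a model to learn from.
print(df.nunique().sort_values(ascending=False))

# Heavily correlated numeric features may carry the same information;
# keep one and consider dropping the other.
print(df.select_dtypes("number").corr())
```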
Then you can click "check versions" in this step. Next we have to load the file — the data that's been made available to us. In this step we are reading the data directly from GitHub and displaying it; again, select the cell and click Run. Once you have done that, there are two visualization packages: one is dataprep and the other is pandas-profiling. You can run this one and this one; it takes a little while because it's installing. Let me show you what the output looks like. With just one line of command you get statistics over your entire data set and can inspect it: you can see this data set has 58 variables and this many rows, how many missing cells there are, how many duplicate rows, how many columns are categorical, numerical, or geography-related, and a lot more. You can then start inspecting the data further: you can see histograms, the distributions of all the variables. For example, what is the support tier representation? You can see the majority belong to one support tier and it goes down from there to support tier three — so tier three must be a premium plan, with fewer users. That's one way to learn about the data. Then the severity of the cases: most are severity four and very few are in three, two, and one, so four is the common severity, and the scale runs down to one being the most urgent issues. So with one line of command you can pretty much see your entire data, and wherever something looks like a problem you can dig in — you can see prime time versus non-prime time, whether the severity was changed during the course of helping the customer, and so on; it's endless. You can see sentiment too: on a scale of minus one to plus one, all the customer comments have been classified by sentiment. There are multiple variables, like I showed you on the last slide, and this is how we inspect the data; if something looks wrong, that's the time to start questioning the data and going back to the data engineering team or the provider. The report also shows correlations: NPS is the final variable we are trying to predict, and across all the other data points we've captured, we can see the correlation of each with the target or outcome we are looking at. There's also the pandas-profiling version; it generally takes a little while, so I usually skip it live, but its outputs are very interesting too, and you should try it out later, I would say. A minimal sketch of generating these one-line reports is below.
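A hedged sketch of the two one-line profiling reports mentioned above, assuming the dataprep and pandas-profiling packages (the latter has since been renamed ydata-profiling); `df` is the case data loaded earlier:

```python
# dataprep: one-line EDA report over the whole DataFrame.
from dataprep.eda import create_report
create_report(df)

# pandas-profiling: a similar report, slower to generate.
from pandas_profiling import ProfileReport
ProfileReport(df, title="NPS data profile")
```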
For example, pandas-profiling is pretty similar to the dataprep report we just looked at, but the views are different. It shows similar information — I put a snapshot in the presentation — plus some warnings, for instance about high cardinality, columns with very distinct values, and so on. And this is one of my favorite charts: it visualizes the missing values across all the columns and shows how much of each column is populated. In one snapshot you can see which columns are partial and need to be filled up, so you can focus your work on that information. Hopefully you've been able to try these steps — can I get a few yeses if you were able to run these two on your computer? Anyone? Awesome, thank you very much. That brings us to the next part of the workshop. As you've been following along with your own problem statement: we've got our data set for the stakeholder in this workshop project, but for the problem statement you selected with your own stakeholder, now is the time to start thinking about what data set you would gather. Make your own notes, so that by the end of the workshop you have a charter you can take to your business and start a similar project, if that helps your outcomes. So far we've had an introduction to the notebook, loaded the packages, verified the versions, explored the data set, and performed a quick quality check on our data. With this, Upkar, I'll hand it over to you.

Perfect, thank you. Give me one second — I'm running all the cells that you already ran, so we can pick up from where you left off. Almost there. All right, let me share my screen. Can you see it? Yes. So, as Niraj took us through section three: one way to quickly run all the cells is to go to the Cell menu and choose Run All Above — if you haven't been running them one by one, this is a quick way to catch up. You can see it's running; the star next to a cell means it's processing that cell right now. Let me go back to the PowerPoint, and hopefully by the time we come back it will have finished running. There are a lot of cells in the notebook — you can see it's a pretty long notebook — but there are a couple of things I'd like to point out that we had to do to clean up the data and make it ready to create the model, and ready to be consumed by the model. Niraj mentioned the different kinds of data we have in this data set, and there's a concept called feature extraction, part of which is dealing with categorical values. Niraj showed you this diagram; let's look at one of the categorical columns, where the values are tier one, tier two, tier three, and tier four. As opposed to a continuous value, this is a categorical value.
Now the problem is that the models you create do not understand categorical values like this; they only understand numerical values. So there are different techniques to convert these categorical values into something the framework can use to build a better model. What we use is called one-hot encoding: essentially, you take the actual values of the column and convert them into additional columns. In this case, support tiers one, two, three, and four would each become a column, and for every row you'd have a zero or a one indicating whether that category is absent or present. Hopefully that makes sense — that's the technique we used, and there are nice methods to help us do it. There are other techniques as well, and Niraj has put some links at the bottom of the slide for you to read up on; each one could probably take a lecture to go through in detail, so we'll leave that for now. For example, there's the hashing encoder. We have just four categories here, so one-hot encoding gives us four additional columns; but if a column has a lot of categories, you'd get hundreds or thousands of columns, which makes for a very big data set and causes more problems. The hashing encoder deals with that, and we use it for one of the columns as well — I'll show you in a second. Then there's the label encoder, which would simply label the categories as zero, one, two, three: instead of creating columns, each row carries one of those numbers indicating which tier the value has. The problem with that, of course, is that the model might give more weight to something labeled four than to something labeled one. So again, you have to go through this, understand your data, and pick the feature-extraction technique for your categorical types that makes the most sense for your data. Same thing with feature scaling. The example I give, taking a different problem: say you're predicting the price of a house based on a number of features or attributes, like the number of bathrooms and the square footage. Both of those affect the price, but the ranges are quite different: the number of bathrooms might range from one to four, while the square footage might go from 400 to maybe 2,000 square feet. The model might give different weight to the different columns purely based on how big or small the numerical values are. To get around that we do feature scaling, and again there are different techniques; we use the min-max scaler to normalize the data so the model doesn't prefer one column over another. The third thing we demonstrate in the notebook is feature selection, also called dimensionality reduction, because, as you can imagine, after feature extraction you might end up with hundreds or thousands of columns — input features — for your model.
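A hedged sketch of the two encoding approaches named above — pd.get_dummies for one-hot (which is what the notebook uses later) and, as one possible hashing implementation, the HashingEncoder from the category_encoders package; the column name and n_components value are illustrative:

```python
import pandas as pd
import category_encoders as ce

# Illustrative frame with a small categorical column.
df = pd.DataFrame({"support_tier": ["tier 1", "tier 2", "tier 4", "tier 2"]})

# One-hot encoding: each category becomes its own 0/1 column.
one_hot = pd.get_dummies(df["support_tier"], prefix="tier")
print(one_hot)

# Hashing encoding: caps the number of output columns, which helps
# when a feature has hundreds or thousands of distinct categories.
hashed = ce.HashingEncoder(n_components=8).fit_transform(df[["support_tier"]])
print(hashed)
```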
That might be too big for the model, or it might take too much time to train on. More importantly, different features might contribute the same information to the prediction, which is a sort of duplication. So in this notebook we use correlation as a way to see how the input features relate to the target variable: we rank the features by their correlation with the target, drop the ones that tell us little about it or that merely duplicate other features, and keep the rest. That's just one example of how correlation can be used to reduce the number of features used to train the model. Again, there's a lot to discuss here, and you can look at the links at the bottom for the details of each of these techniques, but I wanted to point out that there are cells in the notebook that perform each of these functions. Let me go back to the notebook and run them. The first thing we do in section four is identify the columns that are numerical — those are good to go. Then we have the categorical columns, on which we'll do one-hot encoding, and the highly categorical columns, on which we'll do hash encoding. (I added this next cell later on, so let me delete it.) Here is where we actually do the one-hot encoding, using the method called get_dummies, and here we do the hash encoding by applying the hasher. At this point, if I look at the columns of nps_select, you'll see it now has all these different columns, like tribe_level_2 — all created by the one-hot encoding we used here. Similarly, then we do feature scaling: as I said, we use the min-max scaler, and that's what's happening here. Now you can see all the values in the different columns range from zero to one; all the data has been normalized for creating better models. And lastly, we use the correlation method as I explained, and pick the top 30 features based on the correlation coefficient that comes back from a utility method we created. So again: feature extraction, which is dealing with categorical variables; feature scaling, which is normalizing our data; and correlation, in this case, to do dimensionality reduction and pick the top 30 features to create our model. Roughly, the selection step looks like the sketch below.
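A hedged sketch of a correlation-based top-k selection helper of the shape just described; the notebook's actual utility method may differ, and `nps_df` and the target column name "NPS" are placeholders:

```python
import pandas as pd

def top_k_by_correlation(df: pd.DataFrame, target: str, k: int = 30) -> list:
    """Rank features by absolute correlation with the target, keep the top k."""
    corr = df.select_dtypes("number").corr()[target].drop(target)
    return corr.abs().sort_values(ascending=False).head(k).index.tolist()

# Hypothetical usage: nps_df holds the encoded, scaled features plus the target.
# selected = top_k_by_correlation(nps_df, target="NPS", k=30)
```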
All right, let me go back — I think Niraj has a quiz for all of you. Niraj, do you want to explain what's going on here? It's no longer a quiz. Oh, okay — never mind, the answer's on the screen. No, that's fine. I often wondered about a, b, c and x, y, z — who coined those letters, and why not e, f, g? It turns out the convention goes back to 1637 and Descartes's La Géométrie; that's where the use of these letters for these purposes comes from. Thanks, Upkar. Yeah, my bad that the answer showed — I'm a big Linux fan, and the slides didn't show up nicely in the open-source alternative to PowerPoint on Linux, so I'm using the PDF version. And there's another one that comes up very often that I think not many people would know, even though we use it every day: when you're looking at the input variables you use an uppercase X, and for the target variable a lowercase y. There's a whole background to that — it's about a vector versus a matrix, and it comes from linear algebra. If somebody asks you why we use capital X and small y, it's simply how the terms are denoted in linear algebra: the response is a vector, so it's lowercase y, while the input is a matrix with multiple columns, so it's uppercase X. That's why we have capital X and small y when we train and test the model. I always wondered why we couldn't use something plainer, but there is a reason behind it. Over to you, Upkar. Maybe X just has a huge ego and y is very humble.

All right. With that, you should be able to run the cells in sections four, five, and six of the notebook. The whole idea is that instead of having you write this code yourself, notebooks make it easy to run the existing code and explore as you go along — for example, that one line I added to see the different columns. Feel free to add cells as you go and run them to see what happens; this whole process is supposed to be really iterative, experimental, and exploratory. Niraj, back to you.

Okay, sure — I'll share my screen. We've come a long way, so great job, all of you. We've now been able to extract, scale, and select the features from the data, which moves us one more step forward, heading toward the next step: training a model. There's a nice way to look at this — it comes from a very interesting book, and I've posted the link below — a chart similar to the analytics approach we looked at earlier, maybe a simpler version. When you're doing classical machine learning, it's either supervised or unsupervised — sorry, is someone saying something? Hi Niraj, I don't see your screen — it might just be me, though. One second. Oh, you see it now? Okay, that was just me then. Let me share it again — can you see it now? Yes, it's showing up now. Okay — this is the last slide I was on; I was just recapping what we've accomplished so far and the additional things we learned. The next topic is algorithm selection, and like I was saying, it's pretty similar to the analytics approach we talked about. The starting point is the problem I'm solving: is it supervised or unsupervised? If it's supervised, I ask: am I predicting a category or a number? If I'm predicting a category — customer happy or unhappy, fraud or not fraud, threat or not a threat, risk yes or no — it's a classification problem, and that's our case right now. If I'm predicting a number, like a house price, it's regression.
Any time we are predicting revenue, for example, it would probably also fall under regression. Then there's unsupervised, where we don't want a specific answer but we want to see the patterns in the data — that falls under clustering. There are also times we have to reduce the dimensions — that's another technique — and then there's association, which is very common: I'm sure you've all seen Amazon or any website tell you, when you buy a product, "you may also buy this one," because other folks bought them together. Once you're choosing the right technique, there are many examples to relate to: in supervised learning, classification could be spam filtering — spam versus ham — or fraud detection; regression could be stock prices or house prices; clustering could be customer segmentation; dimensionality reduction comes up in topic modeling or similar-document search, for example searching legal documents for relevant laws; and association is used to decide product placement on the shelf — what goes with what — and to suggest what else you might buy. With this, Upkar, over to you.

Hopefully you can hear me and see my screen — yes? Perfect, thanks Niraj. The next thing we want to look at is how you evaluate the different models you come up with. You'll see in the notebook, when we get back to it, that in these cases you never work with just one model: you always want to benchmark against something — which might even be flipping a coin — and then come up with a couple of different models, tweaking the hyperparameters to see which one comes out best. There's no one right answer, and there are different ways to evaluate a model. For a classification model, which is what we are building, you look at something called a confusion matrix, which yields a couple of different metrics. The confusion matrix — as the name suggests, it really is confusing, and I need to go back to this chart myself to remember what's going on. The way to read it: the columns along the top are what your model predicted, and the rows on the left are what actually happened. You're putting numbers against four cases: when the model predicted no, was it actually no or actually yes? And when the model predicted yes, was it actually no or actually yes? From these come the terms: a true negative means the model predicted no and it was actually no; a true positive means the model predicted yes and it was actually yes; a false negative means the model predicted no but it was actually yes; and a false positive is the other way around. As you can tell, it does get really confusing. But essentially you're looking at the metrics that come out of the confusion matrix: accuracy, precision, recall, and the F1 score, which is a combination of precision and recall. Accuracy, as it says here, is true positives plus true negatives divided by the total; intuitively, it's asking how many of the cases your model predicted correctly out of all the cases.
So that's accuracy. Precision, on the other hand: of all the cases we predicted as positive, how many are actually positive? And recall: of all the actually positive cases, how many did we predict correctly? Depending on the case you're looking at, one of these may matter more than the others. For example, in a medical case where you're predicting whether somebody has cancer, it may be more important to avoid false negatives — better not to let somebody slip through the cracks by calling them negative when they're actually positive. So, depending on the use case, you look at the confusion matrix and pick the metric that is most important or makes the most sense for that use case. Again, we've provided some links at the bottom if you want to read more about these things.

Okay, so in our case we went through and created a whole bunch of different models; the methods to create them are provided in scikit-learn. As you'll see in the notebook, to get to the point of creating the models you first gather your data, clean it up, do feature extraction, normalization and scaling, and then dimensionality reduction. Once you have data that can be used to create the model, the model-creation step itself is actually not that bad. So back in our notebook, let's run through a couple of steps. You ran the feature selection, so we have the top 30 attributes or features we'll use to create our model. The next step is to split your data set into train and test. The reason is that you use the training data to train your model or models, and you use the test data to evaluate them. It is crucial that your model never sees the test data during training — that would be cheating. There are other concepts here too, like overfitting and underfitting, that we won't go into detail on. Let's run this cell; it just splits the data into training and testing. We've also created a utility function, calculate_metrics, that takes the true y — the target variable — and the predictions, the actual versus the predicted, and gives back that confusion matrix you saw on the slide. Let's run that; now we have this method. The first thing you see here is that I'm importing logistic regression — that was one of the models on the slide. We create an instance of it and pass in our training features and target variable to create the model. You'll see this pattern used a lot in scikit-learn: you create an object for the model, then you call the fit method, which essentially trains the model, and then you call the predict method to actually use the model and get a result back. So let's run through all of these: logistic regression, then support vector machine, k-neighbors classifier, decision trees, random forest — which is an ensemble of decision trees — and gradient boosting classifier, all provided by scikit-learn.
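A hedged sketch of the split-fit-predict-evaluate pattern just walked through; X and y stand for the 30 selected features and the 0/1 target from the earlier cells, and parameters like test_size are illustrative, not necessarily the notebook's values:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

# X: DataFrame of the selected features; y: 0/1 target (from earlier cells).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)        # fit = train the model
y_pred = model.predict(X_test)     # predict = use the trained model

print(confusion_matrix(y_test, y_pred))
print("accuracy: ", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))
print("f1:       ", f1_score(y_test, y_pred))
```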
And let's run this last ensemble as well. All right. Once you run through these, the utility method we created gives back the precision, the recall, and the F1 score — you want the F1 score to be as close to one as possible — and then we get our confusion matrix. To read it, you'd go back to that chart: it's a two-by-two matrix, and so is this one, and you want the true negatives and true positives — this diagonal — to be as high as possible. So you get some information from these cells, but the next one kind of puts it all together. If I run this cell, you see a chart with the accuracy for each model — in this case we've decided accuracy is the most important criterion for evaluating the models — and you can see the Gaussian model comes out on top, followed by logistic regression. The first few are so close here that, in a case like this, I might go back to the confusion matrix and look at some of the other metrics to see how else I can compare them.

All right, let me go back to the charts — I think that ends this section. Niraj, do you want to talk to this slide? Niraj, can you hear me? I think maybe we'll cover this after the model is saved; it would be pretty much the last step. Okay, gotcha — let's keep going. So let's go back to the notebook and see what's next. Now we have a model — we've picked one based on the evaluation criteria, which in our case was accuracy. The next step is to deploy this model on IBM Cloud using the Watson Machine Learning service. If you remember, we had you copy and store that API key somewhere during setup — we need it now. I stored mine in a file, so I'm going to copy that key and paste it here. The second thing is to add your location. If you remember, I was in Dallas, so my location is us-south; if you were in London, it would be eu-gb, et cetera. So go back to your service, see what location you're in, and put that here. Now I'll run the cell — all it's doing is creating a JSON object, a dictionary, with the API key and the URL. And if I output the URL — oops, I can't use that notation; this is where I have to see how good my Python skills are — here it is: the URL it created based on the us-south location I provided. The next cell installs the Watson Machine Learning client, which takes a couple of seconds, and once that's done I use the APIClient method, passing in my credentials, to see if I can authenticate successfully. I don't get any errors, which is a good sign, meaning I did authenticate. If you get an error here, go back and make sure you have the right API key and the right location in this cell. There's one more step that requires going back to Watson Studio. Now, I've closed my Watson Studio, which is fine — let's go back to cloud.ibm.com, and I'll go to my resources.
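A hedged sketch of the credentials-and-authentication step, assuming the ibm-watson-machine-learning client package; the key string is a placeholder, and the URL shown is the us-south one mentioned above:

```python
# Assumes: pip install ibm-watson-machine-learning
from ibm_watson_machine_learning import APIClient

# The region determines the URL: us-south for Dallas, eu-gb for London, etc.
wml_credentials = {
    "apikey": "PASTE-YOUR-API-KEY-HERE",   # the key you stored during setup
    "url": "https://us-south.ml.cloud.ibm.com",
}

client = APIClient(wml_credentials)  # raises an error if authentication fails
print(client.version)
```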
And this might actually be helpful for those of you who also closed it by mistake. Go to your resource list, then services and software, and launch Watson Studio; on that page I'll say launch in IBM Cloud. You might land in your project — let me open mine. If you remember, this is where the whole thing started: we were in our project and we added the notebook. What we want to do next is create a deployment space. If I go back to the notebook, you'll see it says "create a new deployment space," and there's a link — I could have gone directly to that link to create it, but I want to show you how to get to it from the project. If you're already in your project, click the menu and you'll see Deployments, then View all spaces. Earlier today I already created a space called "production," so let me create another one: I'll create a space called "workshop deployment." It asks you for a machine learning service — this is the one I created before — so I pick that and click Create. This takes a couple of seconds again; alternatively, I could have clicked the link in the notebook and it would have taken me to the same page, so feel free to use that. Once the space is created, I need to get its unique identifier and add it to the notebook. Let's wait for this to finish — and let us know in the chat if you're not able to follow, and I can repeat some of these steps, but we are almost there. So it created a new space for me called "workshop deployment," and I need to get the GUID: if I go to the Manage tab, here's my space GUID. You'll also see it in the URL at the top — that's another way to get it. Let me copy it, go back to the notebook, and paste it here. Again, all I've done so far is: put in my API key and my location, authenticated with the Watson Machine Learning service, created a new space — this is the part where you have to go back to the UI — and grabbed the GUID. Let me run this cell. The next one lists all my spaces, and you can see the one I just created, "workshop deployment," along with the one that was already there from before. Then I set the default space ID to the new space ID I just obtained — it says success, that's good. The next step is creating some metadata around the model that I want Watson Machine Learning to save along with the model itself. You can see here I'm telling it I'm using scikit-learn, which version of scikit-learn, and Python 3.8 — if you picked a different runtime when you started the notebook, you'd have to change this, but hopefully you stuck with the default. Then there's the method repository.store_model: this is what stores my model, and you can see I'm storing the model called clf10. If we scroll all the way up, clf10 is the gradient boosting classifier — that's the one I picked to deploy to IBM Cloud. So going back down, I'm saying: hey client, store this model, clf10, with this metadata, and I also provide the training data — the input features and the target variable. That creates a lineage between the model and its training data, so in the future I can always go back and see, for any given model, what data was used to create it and what metadata was associated with it.
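A hedged sketch of the store step; the space GUID, the software-spec name "default_py3.8", and the model-type string are assumptions that would need to match your actual runtime, and clf10 stands for the gradient boosting classifier picked above:

```python
# The GUID copied from the space's Manage tab.
client.set.default_space("YOUR-SPACE-GUID")

# Spec name is an assumption; it must match the notebook's Python runtime.
software_spec_uid = client.software_specifications.get_id_by_name("default_py3.8")

metadata = {
    client.repository.ModelMetaNames.NAME: "NPS gradient boosting model",
    client.repository.ModelMetaNames.TYPE: "scikit-learn_0.23",  # assumed version
    client.repository.ModelMetaNames.SOFTWARE_SPEC_UID: software_spec_uid,
}

# Passing the training data records lineage between the model
# and the data it was trained on.
stored_model = client.repository.store_model(
    model=clf10, meta_props=metadata,
    training_data=X_train, training_target=y_train,
)
model_uid = client.repository.get_model_uid(stored_model)
```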
I'll hit Shift+Enter to run the cell, and it's storing my model on IBM Cloud. That cell ran successfully, and the next one just fetches back what was stored, to make sure it worked: you can see the model, the metadata associated with it, the input fields or features the model uses, and when it was created. Okay, so we now have a model stored on IBM Cloud. The next thing is to deploy it, and we do that with a method on the client: client.deployments.create. I'll run this and it creates a deployment — you'll see "initializing," and if it runs successfully you'll see a success message at the bottom. There we go: successfully finished deployment. Now, if I go back to my project — workshop project, which is what I'm working with today — let's see... it's not here; I thought it would show up under assets. It shows up in the deployment space instead: go back to the deployment spaces, open the "workshop deployment" space I just created, and under Deployments you can see the deployment we just made at 11:11 a.m. (it's 11:12 now). Going back to the notebook: we stored the model first, then we created a deployment, and the next few cells just get the details of the deployment to confirm it's there. Now this part is important: once you have a deployment, how do you actually use it? Watson Machine Learning gives you a URL — an endpoint — and this is what you use to run predictions against the model: you make HTTP POST requests against this URL and you get predictions back. That's what section 12 does: I get my scoring endpoint, which is this URL here, then I prep my payload, which has the input fields and the values, and finally I use the score method, client.deployments.score. You can see there's a pattern here: client.repository.store_model, client.deployments.create, and then client.deployments.score to actually score against the deployment.
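A hedged sketch of the deploy-and-score steps; the deployment name and the field names in the payload are placeholders standing in for the notebook's actual 30 input features:

```python
# Create an online deployment for the stored model, then score against it.
deploy_meta = {
    client.deployments.ConfigurationMetaNames.NAME: "NPS model deployment",
    client.deployments.ConfigurationMetaNames.ONLINE: {},
}
deployment = client.deployments.create(model_uid, meta_props=deploy_meta)
deployment_uid = client.deployments.get_uid(deployment)

# The payload lists the input feature names and one or more rows of values;
# the two fields here are placeholders for the selected features.
payload = {"input_data": [{
    "fields": ["feature_1", "feature_2"],
    "values": [[0.42, 0.0]],
}]}

predictions = client.deployments.score(deployment_uid, payload)
print(predictions)   # predicted class plus per-class probabilities
```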
If I run that, you can see I sent it one data point and it came back with a prediction and a probability: it thinks this is class zero, with a probability of 0.67, while the probability of the other class is 0.32. In the last two cells we created another utility method to do batch processing: so far we were sending one row at a time, but with this method you can send in a whole data frame, which is helpful. You can see at the end we're sending the X_train data — which is kind of cheating, because we're sending training data back in, but we didn't have any other data on hand to demonstrate the predictions — and we're sending the first 16 rows, zero through fifteen, and outputting the results. Once this runs, you can see all the input variables we sent, and at the end of each row the predicted target class — a zero or a one, promoter or detractor — plus the probability score for each class. In your business problem, you would then decide on a probability threshold that works for you: if the confidence score is only 50 percent, maybe that's not good enough. You need to figure out what that threshold is for your case, and there are different ways to do that in classification problems as well.

I know that was a lot, Niraj, so I'll pass it back to you — but just to recap real quick: we created ten models; we looked at the confusion matrix and at accuracy and precision to figure out which model to deploy; once we picked that model, we used Watson Machine Learning — we authenticated with it using the API credentials you had stored earlier, created a deployment space where our model will live, got the GUID of the deployment space, used one method to create the deployment and another to score against it — and at the end there's just a utility function so we can score multiple inputs at once instead of one row at a time. A lot to go through, but hopefully, once you run through the cells yourselves — later on, or if you're doing it now — it will make more sense. Niraj, I'll stop sharing and hand it back to you.

Yep, thank you, Upkar. So we've taken one more step: we split the data into train and test, selected the model based on the metrics we looked at, evaluated the performance metrics, and now we have a way to demonstrate it — that's the one piece we haven't covered yet. We're at the point where we have the model, the algorithm, and our code, but to be able to stand in front of the business — honestly, they are not going to look at our notebooks — we have to explain what we're doing in a very simple way. In our scenario we built this for our stakeholder, the CIO, and we told him: we looked at a bunch of time-based variables, geography-based variables, dollar-based variables, and the real-time sentiments and emotions from the customer's inputs.
Based on those, we are predicting which direction a case is heading. On top of that there's another layer: the journey of a case, from the moment the customer starts interacting with us, can run from day one up to day n, and we keep running this model on those interactions continuously — effectively in real time, by our definition — and keep updating the prediction. This is one chart that lets the stakeholder understand how the overall algorithm works, because we would never get the opportunity to open our notebooks and algorithms with them; it's a demonstration slide for them. Having done that, the next thing to look at is the proposed solution end to end — there are times you'll have to put one up. It's not really a full architecture, but it shows how the solution works: we collect data from multiple data sources, score it in real time using Watson and the other components we talked about, and then make the results available somewhere people can actually use them — in any ticketing tool, like ServiceNow or Salesforce, or in your own UI, depending on your design. And finally, here is one possible proposed UI — I'm just sharing it as an example. You can see some filters at the top, then a view showing the geography where most of the detractor or unhappy-customer interactions are sitting right now, and then a view by revenue. Sometimes it's the heavily paying customers; then again, everybody is important — sometimes the startups matter a lot too, because even if they don't have great revenue yet, they're very important from a growth perspective. Then, which team is looking at them: is it a growth team, a mature-accounts team? Different teams are aligned to different scopes of work, so some can focus on the startup organizations that are unhappy. Earlier we were looking at encoded data, but here we reverse-engineer it back to the original values and tell the stakeholder: these are the key fields, and this is the likelihood of the customer being unhappy. And like Upkar was saying, given we have 500,000 cases a year, nobody would have time to look at hundreds of cases, so we can apply a threshold — say, predictions with a confidence of 95 percent and above — which yields maybe 10, 20, 50, or 100 cases, based on the capacity and time people can give to a project like this, to study the selected cases and then work with those customers to address the problem. So we can use a threshold, and this is the kind of UI that would come out of it. And like I said, ideas are easy, execution is everything — that's from John Doerr, one of my favorite quotes, along with another one here. That pretty much sums it up; we've talked through all of this so far.
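As a hedged illustration of the 95-percent threshold idea, a tiny sketch of filtering batch predictions down to the few cases worth a human look; the frame and column names are made up for the example:

```python
import pandas as pd

# Hypothetical scored output: predicted class and probability of "detractor".
scored = pd.DataFrame({
    "case_id": [101, 102, 103],
    "prediction": [1, 1, 0],
    "prob_detractor": [0.97, 0.62, 0.08],
})

# Only surface cases flagged as detractors with at least 95% confidence.
to_review = scored[(scored["prediction"] == 1) & (scored["prob_detractor"] >= 0.95)]
print(to_review)
```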
With this, here's another exercise for all of you: how do you plan to consume the output of your models, and what metrics would you use as a reference? We're now at the point where we've also talked about how to consume the predictions and how we integrated the solution — per the instructions from the CIO, into their ticketing system, so they can start consuming the information. That pretty much summarizes the workshop, so with this I'd like to hand the session over to Maureen.

Okay, thank you very much. I hope all of you enjoyed thinking through how you could apply this to a business problem that you have. I want to thank Niraj and Upkar for taking us through this in such an experiential way. We would love your feedback — let us know in the chat if this is something you'd like to see different versions of at future conferences — and we look forward to having you join the data science community of interest and the LinkedIn group. With that, I'll turn it back over to The Open Group — let's see, is John or Maggie or Jim here?

Here we go. Thank you, Maureen, Upkar, and Niraj, for your presentation and the workshop — I'm sure the audience enjoyed it. For those still in attendance, thank you for attending; we hope you enjoyed the data science day at our international virtual event. We do make videos of these presentations available, so look for those — it will probably be the week after next before they get posted to The Open Group YouTube channel — and you'll probably see a follow-up email from The Open Group between now and then with links to those things. If you have an interest in joining The Open Group, you can reach out to us at member services at opengroup.org, and we're happy to answer any questions you might have or get you involved in whatever part of The Open Group you want to engage in. So thank you all, have a great day, and we'll see you at our next quarterly event.