So welcome, everyone! Today we're going to go through this 90-minute workshop. We already ran this workshop at another Open Data Science Conference in Boston, and we also did it at different meetups, traveling through, I'd say, mostly Europe and the US. So it's nice to be in India. Basically, we're going to go through the following sessions. First: how many of you have used KNIME before? How many of you have heard of KNIME before? Okay, that's better. Alright, before we talk about analytics and automated machine learning and all the things we did with KNIME, we'd better understand the tool first, right? So I'd like to start with a demo of the tool. Then we'll have a brief talk, about 20 minutes, on automated machine learning, guided automation, and guided analytics. And then we'll go through the exercises. You can do them on your laptop, and I will walk through them, adding node by node, because as you will see, with KNIME it's all about nodes. We are going to build a workflow that creates a user interface to automate machine learning, and at the end I will show a solution deployed on an Amazon Web Services EC2 instance.

Okay, so let's start with a quick introduction. The best way to show KNIME, as I was saying, is a live demo. So let's switch to KNIME, and this is how KNIME looks. You can see it's Eclipse-based, and in the middle you can see this workflow; I've zoomed out a bit. The idea is that you have a set of nodes, and each node defines an operation in an analytics process in KNIME, okay? So let's see a bit more of how this works. You have a certain number of nodes: we can start with a File Reader node, then attach a Color Manager to it, and so on, and then you go on with the analysis. If I want to execute the first node, I can right-click and execute it, and as you can see, the node went from yellow to green. That means it's executed. What does "executed" mean? That you can open the output of this node and see the actual data. We're working with census data here: we have all these different subjects, with age, working class, and so on. So when you do data analysis with KNIME, you can execute all those nodes in turn. In this case we are training a decision tree: we partition the data between train and test, the train set is used to train the decision tree, and the test set goes here, takes the trained model, and makes predictions with it. And at any given point we can always see what the data looks like. For example, here it's colored now, but most importantly we have the predictions, appended to the table. This makes everything more transparent, and at each step you can use these annotations to describe what's happening. Many people like this, because in the end all these complex analyses have to be understandable by everyone, and it's nice to make data science more open this way.

But let's build a new workflow, because this one was already done. We have many examples like this; you can download them and play with them. But what if you want to start from scratch? This means we need to go to our KNIME Explorer. The KNIME Explorer is where you have all your workflows, your actual data, and so on. So let's create a new workflow: we right-click here, select New KNIME Workflow, and I'm going to call it Bangalore. And now, as you can see, the workflow is empty.
There is nothing in it, and we want to add some data. So how do we add data? We go here again. There is a CSV in the data folder on my PC, and I can add it just by dragging and dropping it. You can see that the configuration panel opens automatically, and it automatically recognizes the CSV. So now I can click OK, and you see the node is yellow because it's ready to be executed. I right-click and execute, and we can see the data that has just been imported into KNIME. It's the same data set as before.

But now we want to do something else. Let's do some manipulation, some aggregation. I want to group rows. How do I group rows? Do you know SQL? How many of you know SQL, can you raise your hand? Many know SQL here, right? The GROUP BY operation. So we want to do a group-by. We can go here to the node repository, where all the available nodes are listed, and type in "group by", and we find a node called GroupBy. Okay, for me it was easy; I knew what I was looking for.

(Question from the audience about file formats.) Yeah, the File Reader node. We also have a CSV Reader node, but the File Reader is the most flexible one; if it doesn't work, then you can use another node. It would take me a while to list them all: definitely most tabular formats, comma-separated, tab-separated, whatever, Excel files, and we handle Parquet files as well. You know what, I'm going to show you something else. We'll see this properly later, but you can go to hub.knime.com and type in, for example, "parquet", and there is the Parquet Reader node, which I can import straight into my workflow. Because this is a website listing all the nodes, and not just that: it also lists workflows, like this workflow by Tobias, a co-worker of mine in Berlin. He made a recommendation workflow using association rules, and there is probably a Parquet Reader node in there. It's all there; you can search. Because honestly, if we start with the question "do we have that?", I'm kind of tight on time with all the things I want to show you today. So let's keep going.

So, how do we aggregate the data? I can type in "group by" and go to the node's page, here in my Google Chrome (it could be Safari, Firefox, whatever), where you can see the whole node description. And let's add this node: a new functionality we have since June is that you can drag and drop nodes directly from the web browser, so it's even easier to browse through all the functionality of KNIME and pull it into your Analytics Platform. Now we want to do some aggregation: for example, by working class, we want the average age. As you can see, I'm just clicking around in the configuration panel of the node, setting up the aggregation. I press OK, and the node is yellow because it's ready to execute again. I right-click, execute, and now I can open this table. Let me zoom a bit: you can see, for each working class, the average age. Okay, this is great, we did some aggregation, but there is a better way to understand this kind of result, right?
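(If it helps to see the same operation as code: this is roughly what that GroupBy node computes, as a pandas sketch. The file path and the column names "workclass" and "age" are my assumptions about this census file.)

```python
import pandas as pd

# Read the same census data we just imported in KNIME
# (hypothetical path and column names; adjust to your copy of the file).
df = pd.read_csv("adult.csv")

# What the GroupBy node is configured to do: group by working class
# and aggregate age with the mean.
avg_age = df.groupby("workclass")["age"].mean().reset_index()
print(avg_age)
```

Same GROUP BY idea as in SQL, just done with a node instead of code.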
It's called data visualization. So what if we add a bar chart to this? Same thing: we type in "bar chart" (you can also double-click and it will be connected), and we select that we want, by working class, the average age. And as you can see, we get a bar chart, and right away we can see that people in the "Without-pay" class are young on average. That's not surprising, right? Younger people usually are not working yet. This is just a simple example of visualizing data, but I want to show you something we'll come back to later: how to combine different visualizations together, and the interactivity you get with JavaScript. I don't know if you've heard of Plotly, but we can use, for example, the Violin Plot node, which uses the Plotly library, and connect it to the same source of data, like this. So now, same thing: we right-click to configure, or double-click; the node dialog comes up, and we select the incoming data. Say I'm interested in age, and we want to group by working class. So in this case we get a different kind of visualization, all gray for now because we haven't set colors yet, with a violin plot for each working class, showing the distribution of ages. And here's something interesting: let me just color them real fast. We can add nodes onto a connection, like this, and you can see it's inserted in between. And so we can color the working classes with this color scale, and the plot updates.

That's how it works to explore things. But we're going to talk a lot about components, and the idea is that you can select a set of nodes, right-click on them, and say Create Component. If we do so (and the nodes are going to be reset, sure), we'll call it "age viz", I don't know. The nodes are gone now, and we have this light gray node instead. What does that mean? The nodes are still here, just inside the component, and inside this component we have those four nodes. We can decide the layout of their views using this panel here, and, for example, put them next to each other. And the interesting thing in the end is the interactivity, which can be enabled with "enable highlighting" on the aggregation nodes. This means that in just a few drag-and-drop operations, I created a visual interface that is interactive, in the sense that I can, for example, select a row in the violin plot only (I want to see the people that are, I don't know, in the private sector), and you can see the selection in the other visualization, and you can also add more distributions like this. And this was done really quickly. Again, this is all free: you download the software, you install it, and you can do this kind of exploration on your data.

The point is: okay, maybe not this exact dashboard, but sometimes you want to send this same dashboard to your boss. And he doesn't want to install KNIME or anything; he just wants to play with this interactive thing here. Right? How do we do that? Well, that's a different story; that's about KNIME Server, and the idea is the following.
We need to take the workflow and deploy it to a KNIME Server, so that it's remotely accessible and can be accessed via our WebPortal. So how do we do that? We go here (you see, this is the KNIME Server for the learnathon) and I log in. Once I'm logged in, I need to save this workflow: I reset the nodes, save it, close it, and then I deploy it to the server here. So now we are deploying this workflow to the server, and it appears here, on the learnathon KNIME Server. So how do I access it? Now it's just a link: I can use this shortcut here, "Open in WebPortal". This is my web browser, we are on learnathon.knime.com, and you can now access it, start it, and we have basically the same thing. Now everyone with credentials, of course, can access this dashboard.

All right, so that's it, I'm done demoing the tool. Let's talk now about the full topic. If you have any questions about the tool, maybe ask now, because next I'm going to talk more about actual automated machine learning. Any questions? Yeah: if you are a Java developer, please feel welcome to add more nodes. We have community extensions, so do that, please. (Question about licensing.) Yeah, I mean, to deploy it on a WebPortal like that: whenever you move something to KNIME Server, the server needs a license, right? But you could ask your boss to install KNIME, import the workflow, and open the dashboard locally, and that would still be free. Of course, it's not as ideal as just sending out a link. (Question about seeing the code.) No, the code is not fetchable. You can insert Python pieces of code and R pieces of code, but the code that is running underneath is not easily accessible. I'm pretty sure it is accessible somehow, but it's not part of the user experience we were designing for.

All right. So, guided analytics: the learnathon is about building those applications that automate machine learning. How many of you have seen this already? Can you raise your hand? No one? Okay, this is CRISP-DM: the Cross-Industry Standard Process for Data Mining (now people say data science as well; there's a Wikipedia page). The idea is that you go through these steps when building your machine learning model, or data mining model if you want, and whenever something goes wrong, you go back. It's an iterative process. This can be done with Python and R, but you're always in that loop, right? You prepare the data, you fix something, you go back, you evaluate, it doesn't work, you try again, and so on. When you do it with KNIME it's still really time-consuming, because the data science problem is still there; we don't make that part easier. But the idea is that, in any case, a workflow gets created. So when the data scientist is using KNIME, we're exposing this workflow to the business analyst. And this workflow is, let's say, great: it works. In this case we load some data, we clean and transform it, we create the model, we visualize it, and we deploy it. Done. Well, that is not always the case, right? There's always something that can go wrong.
So what happens then is that the business analyst, who maybe doesn't build his own workflow or his own Jupyter notebook, is always asking for something else. What if you add this kind of feature? What if you add this other Excel file? Sometimes really important things, sometimes really trivial things, but that's okay: he is the one who really knows the business problem, so it's fine that he asks the data scientist to change something. It's just time-consuming: every time, the data scientist needs to go back into KNIME and change certain pieces of the workflow.

The way we thought to tackle this is similar to what I showed you before, with one more step. The idea is that we use components, and those components are all interactive. This means the user can upload his own Excel file, decide the number of trees for the random forest to be trained, and so on: a really customizable process that builds the analytics on the fly. The data scientist is still there; you just also make sure that this workflow is accessible and interactive for the business user, all right? So how do you build those workflows? KNIME is about nodes: you can build an interactive component whenever you have widget nodes, or those JavaScript view nodes (the Plotly node and the bar chart I showed you before are examples). Those are the main nodes we're going to use in today's workshop. For example, some data is getting blended, and the user always wants to swap in his own Excel file: so we add a File Upload widget on top of the Excel Reader node, and this empowers the user to decide which Excel file to use.

(Question about model versioning and retraining.) Yeah, we version the workflow, which in a way versions the model too. You can schedule the retraining of the model and the update of the workflow; you can model all of these situations using the versioning and so on. Yes, but I mean, you would need to model that by building a workflow able to detect such things, trigger certain executions, and update certain pieces of the workflow. It's a bit like coding, just that you're using nodes to build the sequencing and trigger the events, right? Here we are just talking about the fact that the front end is much more easily exposable to the user. That is the focus of today's talk: we have a certain sequence of operations and we can build a front end for it really easily. Then the back end can be as complex as you want, even, for example, calling a Python node on data that is fetched from somewhere else through a RESTful API, and then running this other branch of the workflow that retrains the model. It's all like that. So the point is that today we are going to focus on this possibility of automating the creation of models, and on seeing how we can deploy them, okay?

Okay, so the idea is that we can create a component by combining all those different nodes, and they automatically use the interactivity of the JavaScript nodes to make a really complex view. Once the component is created, we can open the interactive view in KNIME by right-clicking on it, and this browser view pops up with the layout we decided. And you can always save the settings, the interactions, the rows you selected, by clicking "Apply" and "Close" here in the bottom corner.
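(One way to think about components: a component is like a function, and each widget in its view is one argument the business user can set without touching the workflow. A toy sketch of that idea, where the file path, target column, and defaults are all hypothetical:)

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# A component behaves like a function: every widget in its view maps to
# one argument the business user can set without editing the workflow.
def automl_component(excel_path: str, target: str, n_trees: int = 100):
    data = pd.read_excel(excel_path)                   # File Upload widget
    y = data[target]                                   # Column Selection widget
    X = pd.get_dummies(data.drop(columns=[target]))    # remaining features
    model = RandomForestClassifier(n_estimators=n_trees)  # Integer widget
    return model.fit(X, y)

# The business user "calls" it by filling in the web form, e.g.:
# automl_component("my_data.xlsx", target="income", n_trees=200)
```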
The idea is that you can, for example, adjust the layout, as I showed you during the demo, to make really your own kind of visualization. But to put it on the server, you first need to connect to the server. If you have KNIME open, you can follow these steps to connect to the server we are going to use today, okay? In the KNIME Explorer there is this button at the top; when you select it, you can add a new server. So you press New, and you type in https://learnathon.knime.com, and then you put in the credentials. If you need credentials, raise your hand and Catherine will bring you some. So, you want credentials? Yeah? Where is Catherine? Okay. We can also do it later, but if you have this ready, it's easier, right? That way we are ready for the section afterwards, because many people got lost last year. Anyway, it's pretty easy: the credentials will connect you to an Amazon Web Services system where you can deploy workflows, and you will be able to select the folder with your username, because we do not want you to deploy your workflow into another user's folder, all right? And, okay, while people are still connecting: again, https://learnathon.knime.com, then you put the credentials, then you test the connection, then you press OK, then you close, and then you're done. You need to go to the KNIME Explorer, in the top left corner of KNIME, and select this icon at the top. Oh, yeah, once more: go to the KNIME Explorer, the white button, then Add New, and then you add those credentials there. You need one? Yeah? Okay, so, Catherine. No, you're good? Okay, cool.

All right, so once you're connected, you find the workflow you want to deploy (we are going to do that later): you right-click, Deploy to Server, and you will see this window with all the different user folders. You select the one with your username, and then you can move your workflow there. Once the workflow is up there, you can right-click on it again, Open in WebPortal, and then we have these Next and Back buttons, so we go through all the views, ignoring the back end, okay?

There is one more thing I would like to show you: you can enable all these interactions in the configuration panels of those JavaScript nodes. You have "enable selection", "publish selection", and so on. In this example, the parallel coordinates plot is subscribing to the selection and filter events of the bar chart and the range slider, and you can also use the parallel coordinates plot to publish a selection to the other two charts.

Okay, now you know everything you need to understand what guided analytics is. Guided analytics is about deploying these interactive systems that the business analyst can use to make decisions: to easily get information out of data and decide something that is important for the business, or whatever other situation he needs to look into, okay? And guided automation is a special instance of guided analytics: the idea of using this framework to intelligently automate machine learning.
What do we mean by intelligently? We mean not just full automation, where you click Go and the automation runs; instead, wherever there is an important question that makes sense to ask, you can ask it: do you want to use this strategy or that other strategy, right? We give you a way to really customize your AutoML process, okay? So: can we automate the machine learning cycle? Yes, we know we can. We can do grid search, Bayesian search, random search for hyperparameters; we can do feature selection, feature engineering, all of these things. And is it useful to automate the machine learning cycle? That depends, right? There are situations where you deploy a workflow that can easily automate a really easy machine learning problem, a simple classification task. That is possible. Is it always possible? No. Sometimes you really need a data scientist to check that you're not overfitting, because the problem is actually really hard. So, it depends, all right?

But if we want a customized way to automate machine learning, this is more or less the idea. We can start from different sources: a list of data files, a database, a cloud. We take all this data, we blend it in KNIME, and then we can ask the human questions. Ask the user: what do you want to predict? Do you want supervised or unsupervised? Do you want some parameter optimization? Do you want this extra step of feature engineering? And the user can decide, go back, change his mind, and so on. Once all those parameters are ready, we perform the machine learning automation. What does that mean? We automatically train all those models, using computational power that is maybe better than just our laptop. I mean, if your laptop is super strong, you can use it, right? But sometimes we want to use things like Spark or some distributed environment. And once that is done, we get a dashboard, and we can interpret the models using machine learning interpretability techniques.

Okay, so let's go into the details of what is actually happening here, because that is the real point. We take the data and prepare it according to what the model needs: if we need normalization, we do it; if we need some encoding of categorical features into numerical ones, we do it, because we know which model the user selected. Then we can also do some feature engineering, which means creating new features from the existing ones: dimensionality reduction, or, for example, simple exponential or logarithmic functions over numerical features, or combinations of them (take one feature divided by another). But then you have so many features and so many hyperparameters to try, and that's really the optimization part: you try different sets of features and different sets of hyperparameters until you get the best ones. And here it depends which optimization strategy you want to use: grid search, or Bayesian search, or random search, and so on. There are different strategies you can implement here. Once you have found (at least within the time and computational power you have) the best set of features and hyperparameters, we use them to retrain the model completely, with the settings we defined. And then we evaluate it.
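(To make that optimization step concrete, here is the search written as a bare loop: one training run per hyperparameter combination, keep the best, retrain with the winning settings. This is a sketch with scikit-learn and assumed file and column names, not the KNIME implementation, which trains H2O models via sub-workflows:)

```python
import pandas as pd
from itertools import product
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Hypothetical data prep, standing in for what the user configured upstream.
df = pd.read_csv("adult.csv")
y = (df["income"].str.strip() == ">50K").astype(int)
X = pd.get_dummies(df.drop(columns=["income"]))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Grid search as a plain loop: one training run per combination, keep the best.
grid = {"n_estimators": [50, 100, 200], "max_depth": [5, 10, None]}
best_score, best_params = -1.0, None
for n, d in product(grid["n_estimators"], grid["max_depth"]):
    score = cross_val_score(
        RandomForestClassifier(n_estimators=n, max_depth=d, random_state=0),
        X_train, y_train, scoring="accuracy",
    ).mean()
    if score > best_score:
        best_score, best_params = score, {"n_estimators": n, "max_depth": d}

# Retrain completely with the winning settings.
final_model = RandomForestClassifier(**best_params, random_state=0).fit(X_train, y_train)
```

Random search is the same loop over randomly sampled combinations; Bayesian search replaces the loop with a model of which combination to try next.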
For the evaluation: if it's a classification task we compute the accuracy; otherwise R-squared, error measures, and so on. And then we finally build a dashboard with all the results. Okay, this can look scary, but we're going to do it tonight; at least I'm going to do it, if you do not have a laptop, because that's actually today's exercise. Usually this takes some time and we ask people to divide into groups, but we can go through group one first and then group two. Group one is about building the initial interface that asks the user all the questions: which models do you want to train, which column do you want to predict, which data do you want to use, and so on. And then you have a second dashboard at the end, after all the models are trained, that instead shows the performance results: the accuracy, the AUC, the time that was required to train a model, the time that was required to score it, and so on.

Okay, so just to make the use case a bit clearer: we have this data set, the census data set. We have certain columns here, like age, working class, and so on, and we want to predict whether these persons are making more than $50,000 a year or not. So, whether they're rich or poor, sort of. The group one challenge is then about this part, the part that comes before the whole automation. It's about creating an interface that can read the data, select the target variable, select the machine learning models that we want to train, and exclude certain columns from model training. This group is really helpful if you're not too confident about what comes out of a machine learning process and you first want to get comfortable creating such a user interface. How does it work? Well, if you went to the URL, or if you got one of the USB sticks, you can import this workflow into KNIME. You can see here on the side we have this Guided Analytics Learnathon folder, and this is group one. When you open group one, you will see something like this, and if you want to start working on it, you right-click, go to Component and then Open, and you will see all the different steps you need to follow, adding to this workflow node by node, to build the user interface, okay?

And let's also look at group two for a second before we get into this. Group two is instead about the dashboard that comes at the end. We're going to visualize the performance of the models, the time it took to score them, create some download buttons so you can actually download the predictions and the model that was trained, optionally something to inspect the predictions, and, most interestingly, something we can use to interpret the models. How many of you have heard of partial dependence, or Shapley values? Okay, so real quickly: the idea is to use visualizations, quite interactive ones, to understand, treating the model as a black box, what it's actually doing with a single feature or another, okay? All right, so the idea is that we go to this link, find the Guided Analytics Learnathon, import it, and start working on it. Since many of you didn't bring a laptop, I'm going to go step by step, as slowly as I can given the time I have. Who needs a USB stick to access the workflows because they don't have access to this URL?
If you could please give the sticks back afterwards, for the next workshop, that would be super. Just saying. Okay, so once you import this file... how many of you imported the exercise? Can you raise your hand if you imported the exercise? All right, so I guess everyone is answering emails right now. The idea is: File > Import KNIME Workflow, you browse, you select the file you need, which is the Guided Analytics Learnathon; you'll find it on the USB sticks and at the Dropbox URL. Once it's imported, you have those exercises available. We also need to install some KNIME extensions, because you probably do not have all of them. You go to File > Install KNIME Extensions and type in those extensions to install them: the data generation extension, the H2O machine learning integration, and so on. Those are all extensions for the nodes we need in today's workshop. If you want to just install them all at once, you can open the solution workflow with all the required nodes: KNIME will detect which extensions you're missing, you just hit Next, Next, Next, and they install automatically from the update site.

Okay, before we actually go through the exercise: I would like to invite you to our KNIME Fall Summit in Austin, if you're able to attend. It's about seeing other people using KNIME for actual use cases, like in their business, and also trainings where we teach you more about how to use this tool. If you are able to come, check this coupon with a 10%-off promo code; maybe you want to take a picture of this slide. Something else you might want to keep is this coupon for a free download of the KNIME Beginner's Luck book. By going through this book, you can learn the basics of KNIME. We usually charge for it; you just add it to the basket, apply the code, and it becomes free. Okay, awesome.

So I will now start the exercises. We're going to start with group one and build the first user interface. I will go through some of the tasks, because there are many of them, and once I'm done, I will show the final solution directly on the WebPortal. If you already want to play with a guided analytics solution, you can just go to your web browser. Even if you don't have KNIME installed, it doesn't matter: you type learnathon.knime.com/knime into your web browser (on your phone it will maybe look weird, but it should still work), and once you're there, you log in with the credentials on the piece of paper we gave you before. So, I don't know, user 33 is in here somewhere. You can go in, access the workflows there, execute the solution, and play with it.

All right, so let's start by adding some nodes here. The first step here says: inspect the output of the File Reader and use the Data Explorer node. So how do we do that? We execute the File Reader node, and we attach the Data Explorer node to its output. This node is now ready to be executed; we right-click and execute again. Once it becomes green, we can open the interactive view, and when we do so, we see the distribution of each different column in our data set. What's interesting is that if we press this button here at the top, you can see the layout panel, and you can add the Data Explorer view to the component's interface, like here at the bottom, for example.
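(Under the hood, that Data Explorer view is essentially per-column summary statistics. In script form, roughly, with a hypothetical file path:)

```python
import pandas as pd

df = pd.read_csv("adult.csv")  # hypothetical path to the census file

# Numeric columns: min / max / mean / quartiles, the stats behind the histograms.
print(df.describe())

# String columns: the category distribution behind each bar chart.
for col in df.select_dtypes(include="object").columns:
    print(df[col].value_counts(), end="\n\n")
```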
So now, if you go back to this workflow, you can right-click the component and select Execute and Open View. When you do this, you will see the title I set up for you, and the same view, now part of the user interface. So we added the first piece of the user interface. Then we go back into the component, and the second step says: add the Column Selection widget node. How do we do that? We go to the node repository here (with the full interface it would be down here), type in "column selection", add the Column Selection widget node, and connect it to the File Reader node. But the step also says: add a Column Filter node in front, because we only want to train classification models. So what are we doing now? We are adding a widget node that we can configure, and the idea is that here we type in the label "Select target", because this will be the target of our machine learning model. So how does this work? If I right-click the Column Selection node and select Execute and Open View, I see "Select target". But the problem is that if I select, for example, age or education-num, we are no longer training a classification model; we're training a regression model. So how do we get rid of those columns that are numerical? We need a Column Filter node. We go back to the node repository, type in "column filter", drag it in (you can hover over the connection you want to add it to), press OK, and then we filter by type to keep only string columns: we go to the type selection, select String, and you can see that now only string columns pass through, and everything that is not a string is left out. This means that if we execute this node, right-click, execute, and check the output of the Column Filter, no numerical features are there anymore. So the column selection drop-down will offer only the string columns, which can be used as targets for a classification model, right? Okay, so this was the first part. Of course, we can also add the column selection to the user interface, and now it sits next to the file upload node.

Okay, so what else can we add? We go back to the annotation. Maybe I can make it a bit bigger, let's make it 150%... maybe that's too big. Okay, so we can see: let the user select the machine learning models to train, with recommended settings. Okay: a Multiple Selection widget. It's always the same: we type in "multiple selection widget", add it to the workflow, and here we need to decide which models we want to train. So we are going to have a list of check boxes, vertical, and we type in the values: decision tree, then random forest, then generalized linear model (I'm reading them from down here, if you can see), and then deep learning. And by default, we are going to train all of them, so by default I select all of them. And here at the top I set the label "Select models". Okay? So how does this Multiple Selection widget look now? A list of check boxes, right? And if I now tick only decision tree, because I want to train only the decision tree, and I go here and say Close and Apply... whoa, sorry, I didn't save the settings again. Then I go here and select Close and Apply, and the output of this node is only "decision tree".
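(Downstream, the workflow dispatches on exactly those strings. In script form, it's roughly a dictionary lookup. This is a sketch: the workshop actually trains H2O models, so these scikit-learn learners are my stand-ins.)

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# The strings coming out of the Multiple Selection widget, mapped to learners.
MODELS = {
    "decision tree": DecisionTreeClassifier(),
    "random forest": RandomForestClassifier(),
    "generalized linear model": LogisticRegression(max_iter=1000),
    "deep learning": MLPClassifier(),
}

selected = ["decision tree"]                # what the user ticked in the view
to_train = {name: MODELS[name] for name in selected}
```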
Okay. And we can take the information about the target variable and add it here in the same way. So this is the first part. Then we want to output all of this. How do we output all of this? We take away this dummy node that is just a placeholder, and we decide that the models to be trained go out here, and the data we want to use goes out here, right? So we connect the multiple selection here as well. And if I step outside of this and open the view of the component, I get all the forms: the data upload (which I will show you on the example server), the target we want to select, and the models we want to train.

Okay, so how does it work on the server? If I go directly to the server right now and execute the final solution... so, now it's loading. It's the same workflow I just deployed, the full solution that already has all the components there; you can execute it too if you want. You can see that we have the possibility to change the data that is being uploaded. We are going to select income as the prediction class, and we want to train all the models. And here we can inspect the columns we want to use. There is a problem when you train models in this fashion and hand it to just anyone: maybe they don't think about it too much, they just say all features are good, put everything in there, right? But sometimes you have columns that contain the ground truth, so you're leaking the ground truth into one of the input features. For example, if you go here and look carefully, you'll find that income itself is among the columns that would be used by the model. This means you'd basically be using income to predict income, which doesn't make sense. So we want to exclude that column, with the check box here. And you can exclude many other columns too. Okay, so let's exclude a bunch of them for now, so that the computation is faster.

All right. And here is the data exploration part, where we combine all those visualization nodes together. You can see it's interactive: for example, I can select capital gain here at the top, and make this selection. How do we build this interface? If you go back here, you should see "Group1.Exploration (optional)". You can right-click, go to Component, and then Open. You go inside, and again there are steps to fill in. The first part is to add the Histogram node. We can go to hub.knime.com, type in "histogram", and we find the Histogram node. We don't even need to open its page; I can just drag and drop it, and we've added the Histogram to our workflow. Now, the visualization with all the data points would take too much time; for the parallel coordinates plot especially, more than about 2,500 lines just does not work, so we sample the rows there. For the histogram, we can use the whole data set. So we configure the histogram for age, and the number of bins should be 20. We select 20 bins, execute, and we can open the view: that's the single histogram for age. And the idea is that we don't need to build the entire interface. We can add a Parallel Coordinates Plot: I drag it in, I connect it, and for this one we can maybe keep only some of the columns, say only the numerical ones. But the problem is that, like before, this plot is all black now, right?
So it's not very informative, and we want to add the Color Manager node. We go here and just type "color", we find the Color Manager node, and we add it to the workflow. Okay. In the Color Manager we can decide to color by age, for example with a gradient: we select age, set the minimum value for age to, let's say, something really pale like this, and then a really dark color, I don't know, for the people who are 90 years old. So now each person will be a line, and the color reflects the age. And now we can combine them, in the sense that we can have the histogram here and, below it, the parallel coordinates plot.

Something I would like to show you today that is a bit advanced: going back here, you can have a single page where you select, for example, a bin in the histogram and see the selection reflected in the parallel coordinates plot, right? But what if we want to nest this view inside the other view? We can just copy this component node and paste it. You see, I'm making a copy: Ctrl-X and Ctrl-V, just like with text. I go inside the other component, the one we were building before, and paste it. And now I just give it the data it needs, right? And this component is now nested inside the other: we have group one's view, with the other view nested below. So now when I go back and open this view, I don't just have the settings from before; I'm also nesting the data exploration part below, with a simple copy-paste. This is also interesting for making really complex views out of many different pieces.

Okay, so I think that's enough for group one. What if we already have trained models and we want to visualize the results? That's group two, right? So if I go back here, one step earlier: we already did this part. Now we've trained some models and we want to build the other dashboard, the results one. How do I do that? I open KNIME again and go back to my workflow. Are there any questions about group one before we keep going?

(Question about connection types.) Oh, yeah. When a workflow has a black connection, that is data, always data. When you have the colored ones, that is a flow variable. A flow variable is something you can use to change the settings of another node. Let's go and see it: in group one you can see the Column Selection node. The column selection selects the target, and this target is the name of a column. If I look at the output here, I don't see data; I just see the column selection, the class, because this is the setting that was selected: the class to predict, okay? So the idea is that this information flows along, and later on it will be used to train the machine learning model with this setting, because it's the output of this user interface.

Okay, so if we go now to the second part, which is group two, we already have the models trained. Here we trained a set of models: the random forest, the decision tree, the deep learning, the generalized linear model. Each of them has a different accuracy. We store the actual model object in each row, in this port-object column, and we also store some other information, like the time it took to train them, to score them, and the area under the curve. Okay?
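(That results table is easy to picture in script form: one row per model, with timings and metrics. A sketch that reuses the hypothetical train/test split and the MODELS dict from the earlier sketches; real KNIME keeps H2O model objects in a port-object column:)

```python
import time
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score

rows = []
for name, model in MODELS.items():
    t0 = time.perf_counter()
    model.fit(X_train, y_train)                    # training time
    train_s = time.perf_counter() - t0

    t0 = time.perf_counter()
    proba = model.predict_proba(X_test)[:, 1]      # scoring time
    score_s = time.perf_counter() - t0

    rows.append({
        "model": name,
        "accuracy": accuracy_score(y_test, proba > 0.5),
        "AUC": roc_auc_score(y_test, proba),
        "train time (s)": round(train_s, 2),
        "score time (s)": round(score_s, 2),
        "model object": model,                     # the port-object analogue
    })

results = pd.DataFrame(rows)
print(results)
```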
On the other input we have the predictions: a table with the data, and attached to it the predictions of all four different models. And now we want to decide which model is best, right? So we bring all this information in here, and again there are steps to fill in. The first one says: add a bar chart to compare the different accuracies. So I type in "bar chart", I drag it in, I connect it to the first input port, and then I open the bar chart settings and say that for each model I want to visualize the accuracy and the area under the curve. Right? So when you visualize this, it shows that the best model looks to be the random forest, which has the highest area under the curve compared to all the others.

But where is this area under the curve coming from? The ROC curve. So can we create the actual ROC curves, and how do we do that? We go to the node repository and type in "ROC". So we have this ROC Curve node here. We take it, we drag it in, and now we need the predictions on the validation set, so we take this port with all the predictions and attach it to the ROC Curve node, which is red because it needs to be configured. So we right-click the node, open the configuration, and the first thing it asks is: which column is the ground truth, the actual data we measure the performance against? I have a column here that says ground truth. Then: what is the positive class, the one whose rates we plot? We use "above 50K"; the curves are built from the probability of being above 50K. And then we need the prediction columns, which are stored here with the names of the models, so I can just select them. And now we have a way to see the actual ROC curve of each model, and you can see that the green one is the random forest, and it's the best one. Okay.

So this is the first part, but there is another part, about interpreting the models. The random forest is the best, and we want to interpret this random forest model, right? So I'm going to show you that since June we have a new extension, the machine learning interpretability extension, and this also shows the idea of shared components. Here in the repository we don't only have workflows; we also have templates. Those templates are components, just like the one we're creating, that have been saved. So if we want to explain predictions, you can drag and drop this one: it's not a node from the repository, but when we drag and drop it, it actually looks like a node. It's a component that has been saved. We can go inside it; we cannot change anything, but it's basically a set of nodes combined together. And what does this set of nodes do? It asks the user for some settings, computes the partial dependence, a violin plot, all kinds of visualizations that I created for you to interpret this model. Okay, so let's connect it and see how it works. We take the first input and connect it to this port with all the models... sorry, all the predictions. Now, this is a component, but it acts like a node. What does this mean? If I open the configuration, this component has a configuration dialog too, and it asks us how many predictions we want to explain. We're going to compute Shapley values, and Shapley values are a bit time-consuming, because of creating all those coalitions, right?
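(To give you a feel for why the coalitions are expensive, here is a toy Monte-Carlo sketch of the sampling idea behind Shapley values. This is not KNIME's implementation; it assumes a numeric feature matrix and an sklearn-style predict_proba:)

```python
import numpy as np

def shapley_value(model, x, background, j, n_iter=200, seed=0):
    """Monte-Carlo Shapley estimate for feature j of instance x: the average
    change in predicted probability when j switches from a random background
    value to x's value, across randomly sampled coalitions."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_iter):
        z = background[rng.integers(len(background))]   # random background row
        coalition = rng.random(len(x)) < 0.5            # random feature subset
        coalition[j] = False                            # j is handled explicitly
        base = np.where(coalition, x, z)                # coalition from x, rest from z
        with_j, without_j = base.copy(), base.copy()
        with_j[j] = x[j]
        without_j[j] = z[j]
        total += (model.predict_proba(with_j.reshape(1, -1))[0, 1]
                  - model.predict_proba(without_j.reshape(1, -1))[0, 1])
    return total / n_iter
```

Every iteration costs two model predictions, per feature, per explained row; that's why explaining many rows takes a while.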
I'm going to show you an example with more of them later, but for now we just compute one explanation, at random, so that it's fast. And the class that we want to explain is "above 50K", the positive class. All right? So we execute this, and now we can see it's actually executing this part: taking the random forest, which is an H2O model, computing H2O predictions inside the Shapley values loop, with all those different coalitions, and then visualizing the result. Wait a second. Let me close this. Okay.

While this is running, I can show you the model that we were training before. Okay, so this is actually on the server, and let's already start computing all the predictions; then we can also see this single explanation being explained. So we upload the data, select the target (I have the four models here), and then, I think I already did this with you before, we exclude those columns, and now we can go to the bottom and select Next. If you go to learnathon.knime.com you can do the same. Okay. So in this case, what is actually happening? It's executing. If you do not care about workflows, you just open your link, drink a coffee, and wait, because we are training four different models and optimizing them. But since we are here to learn about workflows in KNIME, we can connect to the server and... oh, actually, this is done. Let's see the single explanation. Okay. So this is open: a single explanation, made of five Shapley values, and you can also see the partial dependence curve, in this case for age. You can change the feature in the partial dependence plot, for example to education-num, and you also have a surrogate decision tree below. Okay. But this is only for one prediction.

The other job is still executing, so we can go here to the Guided Analytics Learnathon and open this one. Oh, many of you have started executing this. Look how many jobs! Okay, this one is actually executing; I think it's mine. Okay. So now we see that the part that trains the models has already been computed. So how does this look? Let me make it bigger. Whoa. So we have all the settings. What are the settings? Let's see: here are the settings of the models being trained, and here the predictions. Then we preprocess them, and here is where the actual training of the models takes place. We can open it, and we see that the first step is to split between train and test. Then we take the training set and we normalize it and remove outliers, if requested; in this case it was the default, but you can expose that parameter as well. And then we apply to the test data the normalization, and whatever strategy was fitted on the training data. And then we compute the model hyperparameter optimization. What does this mean? If we go inside, we have an actual workflow that is called each time, training a different model with a different hyperparameter combination. Then we compute a final accuracy for each hyperparameter combination, and we keep the one that gives the best accuracy. Then we can go back here and go on to the feature selection part, where we create many features and evaluate all of them; and then we go to the end here, and we can retrain the optimized models with all the settings that have been saved.
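(And since the feature creation step went by quickly: the generated features in there are simple transforms of existing columns. Roughly, in script form; the column names and the exact transforms are my assumptions:)

```python
import numpy as np
import pandas as pd

df = pd.read_csv("adult.csv")  # hypothetical path, as before

def engineer_features(data: pd.DataFrame, numeric_cols) -> pd.DataFrame:
    out = data.copy()
    for col in numeric_cols:
        out[f"log_{col}"] = np.log1p(out[col].clip(lower=0))   # logarithmic transform
        out[f"exp_{col}"] = np.exp(out[col] / out[col].max())  # bounded exponential
    for a in numeric_cols:                                     # pairwise ratios
        for b in numeric_cols:
            if a != b:
                out[f"{a}_per_{b}"] = out[a] / out[b].replace(0, np.nan)
    return out

engineered = engineer_features(df, ["age", "education-num", "hours-per-week"])
```

The feature selection step then tries subsets of these generated columns, just like the hyperparameter loop tries parameter combinations.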
So if you go in here, you can see that we are actually retraining many models. We exposed just four options, but you can see here there are many more; I'm just reusing a piece of another workflow. If I open this now, it's still executing, but the idea is that at the end we'll see a final dashboard with all the results. Who executed this final solution, anyone, on the WebPortal? Somebody did, because I've seen many jobs. Okay, so this is executing, and the idea is that it's taking some time because we are explaining a bit more than just one single instance with Shapley values, and that takes a while even though we are now using a huge instance on Amazon Web Services. Okay, so let's go back and see where the model actually is. Oh, you see, it's still in group two; it's probably executing the part where we explain the predictions. If we open it, we can see it's iterating through the iterations of the Shapley loop. How does Shapley work? We take all the features the model is using, and for each of them we try different subsets of features: coalitions.

(Question: can you put your own model inside the Shapley loop?) Yeah, you mean this Shapley loop? Exactly: you can have a Python snippet in there and compute predictions, and you need to tell the Shapley loop which predictions were made with which coalitions; then it computes all those probability differences across the coalitions and so on. (Question about exporting views.) Well, yes: there is a setting in the preferences, and if you go there, you can enable a debug option so that every time you open a view in the browser, it saves the HTML to your temp folder somewhere, and you can fetch it from there. It's a bit of a hack, but you can do it. (Question about Python libraries.) You can code in Python in a snippet: you have this Python Snippet node and you can code in there, and it depends on the Python environment you installed that communicates with KNIME, which packages you have available there. Usually people use a conda environment and install their stuff there.

Okay, this should be almost done, and while it's still running, let's try to refresh this. Okay. So before this computes, I would like to show you some final slides; we're almost at time, all right. So, this was a really, really simple application that we created, with a really simple user interface, because in a short workshop we cannot do the full application in just a few minutes, right? But I did build the full one with a co-worker, and I would like to share it with you with a little bit of context: it's actually something a data scientist can give to the business users, and they can use it without his help, all right? The way it works, you have way more than just two pages; you have actually one, two, three, four. You can see this flow chart at the top, and you go through these different settings, and I would like to show them to you. So this is the workflow; as you can see, it's a bit bigger, and the first part is of course the uploading. I mean, this is how it works, right? We upload the data, a sequence of interactions, training of the models, and then the dashboard. So let's start with the first part, uploading the data set: when you open the first view, you see this page where you select the data you want to upload. Then we select the target, which is fairly similar to what we did so far.
And then there is a part that is much more interactive, because you have a range slider with which you can filter the columns: columns that are too correlated with the target (that would be suspicious), columns with too many missing values (90% missing values is not a column you want to use), or maybe a column that is always constant, right? So in this part you have this really interactive framework to take away stinky columns right away. And then we have a way to select the models, and also to decide whether we want to remove outliers or not.

Now there is an important step, because at this point there is branching, something a bit more complex that we didn't cover today. The idea is that you cannot force the user through every interaction; you can also make some of those interactions optional. So you can ask the user: do you want to define exactly how the hyperparameter optimization and the feature engineering will take place, or do you want to skip that part because it's too much? And if they say "no, I want to skip it", then default settings are used, in the lower branch, because those are set automatically. In the top branch, instead, we ask questions like: from what number of trees to what number of trees do you want to search the hyperparameters of the random forest? Or, in a deep learning situation, how many layers do you want to test, from one hidden layer to ten hidden layers, and so on. You can ask those kinds of questions in this part, and then, for the feature engineering, questions like: do you want dimensionality reduction computed on top of the features you create? And the user can try settings, keep the ones that work in memory, and so on. Okay.

So: this is the range slider, where you can select how many trees, how many hyperparameter values, you want to test. And here instead is the check box list where you select which feature transformations to apply; in this case you can see simple transformations, like the exponential function or the logarithmic function. You just apply them. And then the execution settings, which we didn't really talk about, but they're really important: do you want to execute this locally, or, like in this case, on the local KNIME Server, or do you want to ship this execution to Spark, for example? Because sometimes you have a really massive training run that will take some time, so you should ask this question to the user: are you ready to use your really powerful Spark cluster, or whatever you have? In this case some options are grayed out because they are not available. Okay.

And then we have the part where we compute all the training and validation, and, surprise surprise, this exact part is what I showed you before: the core part with the hyperparameter optimization, the feature selection, the feature engineering, the retraining of the models. I just copied it from this workflow, and this is what's cool about KNIME: you can download things, copy and paste them into your workflow, and reuse them. Whenever someone shares a workflow that someone else already used, you can include it in your own. And then we have a really fancy dashboard with all the results of all those different models that have been trained, and you can select the models that you want to download.
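(Going back to that column-filtering step for a second: the "stinky columns" logic is compact in script form, dropping columns that leak the target, are mostly missing, or never change. A sketch; the thresholds stand in for the range-slider values:)

```python
import pandas as pd

def drop_suspect_columns(df: pd.DataFrame, target: str,
                         max_missing: float = 0.9,
                         max_corr: float = 0.95) -> pd.DataFrame:
    y = df[target].astype("category").cat.codes   # numeric encoding of the target
    keep = []
    for col in df.columns.drop(target):
        s = df[col]
        too_missing = s.isna().mean() > max_missing        # mostly missing
        constant = s.nunique(dropna=True) <= 1             # never changes
        leaky = (pd.api.types.is_numeric_dtype(s)
                 and abs(s.corr(y)) > max_corr)            # suspiciously correlated
        if not (too_missing or constant or leaky):
            keep.append(col)
    return df[keep + [target]]
```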
So, if you want to try this complex workflow, with this nice guide on the side with instructions and this flowchart on top: it's publicly available on our example server and also on the Hub. Just search "guided automation" in the search bar and you can find it, download it, use it. This is something you can really start working with, and maybe change so it works best for you. This is Simon; I did this workflow with him. And if you want to publish an updated version that works better for you, feel free to do so; you can add your own workflows to the Hub as well.

Okay, so let's see the solution, that should be... (Question about private sharing.) Yeah, we are working on a way to publish workflows internally within a company, but for now, when you publish something on the Hub, it's publicly available, sort of like when you publish on GitHub, right? Do we have private repositories? I think they're working on that. And of course, that is totally optional, totally optional; it's for people who really like to share and get feedback, and have people say thank you to them, and so on. (Question about cloud integrations.) AWS, definitely; we have S3 nodes and so on. Databricks, no, we don't. You know, sometimes it works out to make a partnership like that, sometimes it doesn't.

All right, so this should be done any second now... what, sorry? Yeah, so this server is private: you have your own server on your own cloud service, whether that's Amazon Web Services, Azure, Google Cloud, it doesn't matter, or on premise, and there you have your own KNIME Server where you deploy your own workflows, and you can even set permissions for different teams within the company and so on. The idea of the Hub is instead that we have a public place where people can save their workflows. For now it's all public, but think of it this way: there is someone on the forum right now working to take the same exact workflow I showed you before and change it to make it work for regression, so everywhere there is a classification learner node, they place a regression learner node. It's a lot of work, but if they do it and publish it online, then everyone can use it as well, right? That's the idea of community and open source: usually it's connected to code, but we can have the same idea with workflows.

So, yeah, let's see if it's almost done, because I think many of us are executing this workflow and it has slowed down. In the end, Shapley is something we added in June, and computing Shapley values can be expensive, and on this instance, if everyone is computing Shapley values, it slows down a bit, but it should be done soon. (Question about Tableau.) We have a Tableau extension with Tableau nodes, so if at some point you want to send your data to a Tableau Server in order to build your own interactive dashboard there, you're able to do it. (Question about web data.) Yeah, there are some popular workflows on the Hub about getting stuff from the web, you know, scraping HTML, or fetching tweets and processing them. And yeah, that's possible: you take the ETL flows, you put them on a KNIME Server, then you set up REST calls, and with some fancy setup you can trigger them via REST. Yeah, you can do that. (Question about SQL.) Yeah, you can do this: you can let the user input SQL in a text box, take that string, feed it into a SQL node, and perform the SQL query. And that would be cool; do that, publish it on the Hub, and you get many kudos from the entire KNIME community.

So, I'm not sure exactly what's going on with that job, so let me run it locally real fast to show you.
idea behind this, let me run it locally real fast so I can show you. This was the example from before, and now let's set it to two instances, and I'm re-executing. The idea of Shapley values is that you can compute them, for two predictions in this case, but you could have 10 or 100, depending on how much time we have to wait. So you compute Shapley values, and in this view we also regularize, so that we keep at most 5 Shapley values per prediction. For each prediction we compute the difference between that prediction and the average prediction, and then we explain this difference using Shapley values. We're going to say, for example: the predicted probability of being above 50K is 80%, but on average it's 50%, so we need to explain this additional 30%. To create this explanation of the 30%, we're going to say that age contributed 5% towards the 80%, and so on. So you can explain a single prediction like this, and when we want to explain more of them, we can aggregate the explanations, for example by computing the average.

So now this is executed; let's open the view. The idea is that you have here two explanations, if you take the average, but you could also have 100, it doesn't matter. We can see that for age the value is minus 0.1. If I select this particular explanation, I can also see the partial dependence. What is partial dependence? You take a feature (five minutes? okay), so you take a feature, in this case age, you change age, and you see how the prediction changes while all the other features are held fixed. So this person is predicted to be below 50K, because it's red, and I encoded red as below 50K. The age of the person is around 20, that's this red dot here, and we see that if we increase the age, the probability goes up and down but is more or less increasing. And if I look at the actual Shapley value for age, it shows that age is not contributing towards this person being predicted rich; it's contributing a negative value.

But what about something else? For example, here we have hours per week. We can go here and select hours per week, and here also select the explanation for hours per week. We see that this explanation still has a negative value, but not as negative as before, because it's about minus 0.115; also, the hours per week are not that many, this young person is working 40 hours per week. The point is that you can go feature by feature. Right now we are just looking at one prediction; when we select two instead, we see the average of those two, and we can see both curves in the partial dependence on the left. I wanted to show you this with at least 10, so it would be more interactive.

What do we have here at the bottom? Let me scroll down a bit. Here we have a decision tree, and it's a surrogate decision tree: you take all the predictions of the random forest, and on those predictions you overfit a decision tree, so you have a single tree that is doing everything the random forest does. Sometimes this helps to explain a prediction: say the random forest predicted above 50K; the decision tree can tell you, oh, let me zoom here a bit, because education number is greater than 12, and then we scroll down, and no, it's actually less than 12.5, and then we have this final leaf. I don't know, we would need to look into the data.
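If you want to reproduce the Shapley part outside of NIME, the shap package does essentially the same bookkeeping: it splits each prediction's distance from the average prediction into per-feature contributions. A minimal sketch, assuming a fitted scikit-learn random forest called model and a test DataFrame X_test; note that the layout of shap_values varies between shap versions:

```python
import numpy as np
import shap

# TreeExplainer computes (approximate) Shapley values for tree ensembles
explainer = shap.TreeExplainer(model)

# Explain two rows, like the two instances in the view. Depending on the
# shap version this is a list with one array per class or a 3-D array;
# here we assume the list form and look at the positive (">50K") class.
shap_values = explainer.shap_values(X_test.iloc[:2])
contrib = shap_values[1][0]                # contributions for row 0
base = explainer.expected_value[1]         # the average prediction

print("average prediction:", base)
print("this prediction   :", base + contrib.sum())

# Keep only the 5 largest contributions, like the regularized view
top5 = np.argsort(np.abs(contrib))[::-1][:5]
for i in top5:
    print(X_test.columns[i], round(float(contrib[i]), 3))
```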
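The partial dependence part (sweep one feature, watch the prediction move) is also a few lines in scikit-learn. A sketch with the same assumed model and X_test; kind="both" is close to what the composite view shows:

```python
import matplotlib.pyplot as plt
from sklearn.inspection import PartialDependenceDisplay

# For each test row, vary "age" while keeping that row's other feature
# values fixed (the individual curves), and also draw their average.
PartialDependenceDisplay.from_estimator(
    model, X_test, features=["age"], kind="both")
plt.show()
```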
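And the surrogate tree at the bottom is just a tree trained on the black-box model's own predictions instead of the true labels. A minimal sketch, again assuming a fitted random forest model and a numeric training frame X_train:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Train the surrogate on what the forest predicts, not on the labels,
# so the single tree mimics the forest's decision surface. Leaving the
# tree unpruned lets it "overfit" the forest, as described above;
# cap max_depth if you want rules short enough to read.
rf_predictions = model.predict(X_train)
surrogate = DecisionTreeClassifier().fit(X_train, rf_predictions)

# Prints readable rules such as "education-num <= 12.5"
print(export_text(surrogate, feature_names=list(X_train.columns)))
```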
But the idea is that you get a decision tree that partitions the data more or less the way the random forest does. And again, of course, it's a lot of going back and forth between the partial dependence, the Shapley values, and the decision tree, but, for example, if you select all the instances here, they get selected, and you can see how they behave in the other views. I created this in a day, just by dropping things together and creating this component that you can reuse, so it's really up to you how to create this interactivity and make the model you trained interpretable.

Okay, I'm still hoping that this is... okay, well, when this is done, come visit me at the booth and I will show it to you. Thank you, guys. One last thing: we have our socials, sure, but if you want to create a NIME meetup group, please go ahead and create it, and we can help you. We have them all over the world, and if you want to check whether there is one in your city, you can go on meetup.com and find the meetup you want to join, based on where you live. And yeah, thank you, that's it.