I'm Vincenzo Tursi and I work for KNIME. I'm really excited to be here: it's my first time in India and my first ODSC conference, so it's really great. Today I would like to introduce a project I've been working on with my team since the beginning of this year, and it's about a teacher bot. The idea we came up with was this: we wanted to give new KNIME users the right resource material for their questions. So suppose you are a new KNIME user; I can imagine you have tons of questions, right? "How do I access this database?", "How do I process this data?", and so on and so forth. So we said: what about creating an application that, in an automatic manner, returns the right links to the training materials we provide on our website? And that is why we decided to develop Emil, the teacher bot.

I mentioned KNIME here. Just raise your hand: does anyone here already know about KNIME? A few? So, just to give you a little introduction to KNIME Analytics Platform: it's an open-source solution with a graphical user interface that basically allows you to cover the whole data science life cycle, starting from reading data from any kind of data source, then doing the ETL processing and applying machine learning models. You can use it to create reports and also to productionize a model in a production environment. At the moment there are more than 2,000 native nodes. The nodes are those blocks that you see in this slide, and each node performs a specific task; so when you connect nodes together, you are actually developing a workflow. From the educational perspective, I also work in the Evangelist team at KNIME. We need to educate users about the tool, but we have another objective as well, which is to help users solve data science-related problems.
And users use the KNIME Forum for this too, to ask questions about how to use a specific feature of the tool, but also about how to solve a specific data science-related problem. So, what is a bot? A bot is a piece of software designed to automate the kinds of tasks that you would usually do on your own, or would ask someone else to do. Focusing on the positive bots that are out there: we have search bots, teaching bots, communication bots, personal assistant bots, data and developer bots, and team bots. And the idea behind Emil was exactly this one: basically, to help people new to KNIME Analytics Platform find answers to their initial questions from the large amount of technical tutorials that we have online.

This is more or less how we would build such a bot. First we start with a user interface, of course: we have to let the user ask a question through that interface. Then the bot has to understand the question, so there will be text processing and NLP functions that handle the whole text processing procedure. Then we need a machine learning model to associate the right group of tutorials with the question that has been asked. Then we need a user interface again, to return the answer. And finally, this is not mandatory, but we might also want a feedback mechanism to state whether the answer was of any help or not.

This is the UI that we built for the teacher bot, for Emil. We used a web application; in this case it's a feature of the enterprise solution that we have at KNIME, but you can build this application with whatever tool you like. This is the first web page: it has the KNIME logo, the Emil picture, some greetings, and then two fields that allow the user to write a summary of the question and then the question itself.
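Stripped of the KNIME specifics, the pipeline just described (question in, text processing, category prediction, resource links out) can be sketched in a few lines of Python. Everything here is an invented placeholder for illustration: the categories, keywords, and URLs are not KNIME's actual taxonomy, and the two functions stand in for the real NLP and machine learning steps described later.

```python
# Illustrative stand-ins only: invented categories, keywords, and URLs.
RESOURCES = {
    "data access": ["https://example.com/blog/connect-to-databases"],
    "installation": ["https://example.com/docs/install-guide"],
}

KEYWORD_TO_CATEGORY = {
    "database": "data access",
    "connect": "data access",
    "install": "installation",
}

def extract_keywords(question):
    # Stand-in for the real NLP step (cleaning, lemmatization,
    # keyword extraction): lowercase and keep only known terms.
    tokens = question.lower().replace("?", "").split()
    return [t for t in tokens if t in KEYWORD_TO_CATEGORY]

def predict_category(keywords):
    # Stand-in for the trained classifier: majority vote over keywords.
    votes = {}
    for k in keywords:
        cat = KEYWORD_TO_CATEGORY[k]
        votes[cat] = votes.get(cat, 0) + 1
    return max(votes, key=votes.get) if votes else "something else"

def answer(question):
    category = predict_category(extract_keywords(question))
    # Unmatched questions would be forwarded to the support team instead.
    return RESOURCES.get(category, [])
```

The point of the sketch is only the shape of the pipeline: each stage can be swapped for something smarter without changing the overall flow.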
The aim, in the end, is to provide the right resources to the user, right? So we provided a second web page where we link all the tutorials that might be helpful for the user. The user can then click on these links and see whether they are useful or not. For example, in this case we have a blog post that shows how to blend data from six different databases. But of course, to reach this point we needed another step: we added a page where we ask the user to give feedback on the category that was predicted by the classification algorithm, and this page then lets us link the resources related to that category.

So that was everything on the web portal side, the application. And this is the workflow. I know that most of you have probably never seen a KNIME workflow, and I don't want to focus on the details; I just want to point you to these blocks here in light gray. We call them wrapped metanodes, and each contains a subflow, a group of nodes, that visualizes a web page on the web portal. For example, the first wrapped metanode displays the web page where the user writes the question. Then we have a part related to the understanding, the brain. Then we have a second wrapped metanode where we ask the user to select the category, which we then use to match the resources related to that category; this is another part of Emil's brain. Then we have the "suggest resources" wrapped metanode that shows the links to the technical material on the web portal, and then we have a thank-you page, a goodbye. When all this is done, we update the datasets with two different tables. In the middle we have some IF-switch conditions.
So for example, if on the select-category page the user selects the option "something else", or "this is a bug", or "this is an announcement", Emil redirects to a thank-you page and the question is sent to our support team. We have another switch here in the middle of the workflow: if the category doesn't match any of the resources available online, this means we can't actually provide any link to the user, so even in this case we send an email to our support team and thank the user. And the last IF switch handles the case where we do provide resources to the user, but the user doesn't find those links useful: the user can click a button, and an email is sent to our support team.

Okay, so we have seen that there are two phases where the teacher bot has to understand and make predictions, right? This is the understanding part, where we collect the question and have to understand it. We do that with text processing, or text mining. We transform the question into a document and run the whole text processing phase: punctuation erasure, case conversion, all the cleaning steps. We tag the terms in the text, then do some filtering, and also lemmatization. The last step is to extract the most meaningful words from the question, and here we can use a keyword extraction algorithm; in this case we decided to use the chi-square keyword extraction algorithm. After this point, we have the list of keywords for the question that has been asked. The next step is to transform these keywords into a document encoded with numbers: we need a table of numbers in order to apply our model.

So in the end, the goals for Emil's brain were these. We started with a question, and then the main idea was to provide, as I said, specific resources, technical resources, right?
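To make the keyword-extraction step concrete, here is a minimal pure-Python sketch of the chi-square idea: score each term by the chi-square statistic of a 2x2 contingency table and keep the top-scoring terms. Note this is a simplified, supervised variant (terms scored against class labels), not a reimplementation of KNIME's Chi-square keyword extractor node, and the cleaning and "lemmatization" steps are crude approximations; all example documents below are invented.

```python
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "to", "how", "do", "i", "can", "in", "my", "on", "from"}

def preprocess(text):
    # The cleaning steps from the talk: punctuation erasure, case
    # conversion, stop-word filtering. Real lemmatization is approximated
    # by a crude plural-stripping rule, just to keep the sketch runnable.
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

def chi_square_keywords(docs, labels, top_n=2):
    # Chi-square association between each term and each class, from a
    # 2x2 contingency table of document counts; only positive
    # associations (term over-represented in the class) are kept.
    n = len(docs)
    term_docs = defaultdict(set)
    for i, doc in enumerate(docs):
        for t in set(preprocess(doc)):
            term_docs[t].add(i)
    keywords = {}
    for cls in set(labels):
        in_cls = {i for i, l in enumerate(labels) if l == cls}
        scored = []
        for term, present in term_docs.items():
            a = len(present & in_cls)      # term present, this class
            b = len(present) - a           # term present, other classes
            c = len(in_cls) - a            # term absent, this class
            d = n - a - b - c              # term absent, other classes
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            chi2 = n * (a * d - b * c) ** 2 / denom if denom and a * d > b * c else 0.0
            scored.append((chi2, term))
        keywords[cls] = [t for s, t in sorted(scored, reverse=True)[:top_n] if s > 0]
    return keywords
```

On a toy corpus of four questions labeled "installation" and "data access", this picks out "install" and "database" as the top keywords for the respective classes.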
These resources had to be available online. The first task was to identify the areas of expertise, and this was done via a machine learning approach: identifying the category. The second objective was to identify the list of the most relevant articles, and this was done with a similarity search approach; I will come back to this in a minute.

There were some parts of this project that were not that easy, let's say. First of all, defining the training problem; I will come back to this. Second, building a class ontology, because we had a list of technical materials but we didn't have a class system for the model that we wanted to train. And third, creating a labeled dataset. In a perfect world, what would we have? A dataset that we can somehow preprocess and then split into a training set and a test set; the training set is used to train our model, the test set to test it, then we score the model, et cetera. But in the real world, what do we have? A dataset with tons of missing values, and sometimes we don't have classes, we don't have labels for each row in the dataset. So we can't actually train any supervised algorithm on that, unless we use, for example, a clustering algorithm, or a similarity search approach. That is something we can do in a first attempt at the analysis, but at some point we have to find a class system, right?

So this was the dataset: we were using data from the KNIME Forum, from 2013 to 2017, and those were just questions. Anyone can go on the KNIME Forum, start a new thread, and ask a question there. We wanted to match these questions with the resources that we have available online. The only problem is that we had 5,000 questions in our dataset, and there were more than 400 resources available online.
No machine learning model can really manage that kind of problem, right? There are too many resources compared to the number of data rows, or questions, that we have available. So the first thing we did was use a similarity search approach to match the keywords extracted from the questions in the KNIME Forum dataset against the online resources. But the results here were suboptimal, so we had to come up with another idea. We thought, okay, maybe an active learning cycle might help us find a better class system. But even in this case we had to rethink the problem: there was no solution, because there were too many resources and we couldn't match all of them with the available questions. So the issues here were multiple: the selection and training of a machine learning model, the definition of the areas of expertise (in this case, the class system), how to set up the active learning framework, and the implementation of a similarity search procedure.

About ontologies: there is a big research area here. For medicine, the easiest example is the ontology of anatomy classes; another one, for biology, is the ontology of interrelated animal classes. For data science, the ontology corresponds to the class system, right? So we thought, okay, maybe we can create a high-level class system from the resources that are available online. And we said: we have an e-learning course on the KNIME website, let's start from its classes. We came up with seven classes here: installation, data access, ETL, mining, control, deployment, and data visualization. Those are the topics covered in our e-learning course.
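The similarity search matching mentioned above can be sketched very simply: represent both the question and each resource as a bag of keywords, and rank resources by cosine similarity. The resource titles and keyword lists below are invented placeholders standing in for the 400+ real tutorials.

```python
import math
from collections import Counter

# Hypothetical resources: each title stands in for a real tutorial,
# represented by the keywords extracted from it.
RESOURCES = {
    "Blending six databases": ["blend", "database", "sql"],
    "Installing KNIME": ["install", "setup", "windows"],
    "Building reports": ["report", "chart", "export"],
}

def cosine(a, b):
    # Cosine similarity between two keyword-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def most_similar(question_keywords, top_n=1):
    # Rank all resources against the question's keywords.
    q = Counter(question_keywords)
    ranked = sorted(RESOURCES,
                    key=lambda r: cosine(q, Counter(RESOURCES[r])),
                    reverse=True)
    return ranked[:top_n]
```

With hundreds of resources and only a few keywords per question, most similarities end up near zero, which is one way to see why this approach alone gave suboptimal results.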
Then we said, okay, wait: if you look at our website, we have different use cases. We have material about the text processing extension, big data-related topics, server-related topics, image processing-related topics, and also reporting. So we ended up with seven plus six, 13 classes. But we said, okay, this is probably still not enough, because looking at the questions in the dataset we saw that there were also questions related to development, to the different integrations that KNIME provides, to how to optimize KNIME workflows, to problems and use cases in life science, plus new announcements, bugs, and legal matters. So in the end we decided to create a class system with 20 different classes: we reduced the problem from 400 resources to 20 classes.

As I said, we started with a training set of forum questions, and the first attempt was to use a similarity search criterion to match the classes with the questions; we said that this result was suboptimal. So then we said, okay, we can use active learning for this. We train our model and extract the most uncertain predictions; a teammate or I label some of these uncertain predictions; then we do the relabeling and extend the class label to the closest predictions. We can iterate on this cycle, and that is active learning: train the model again, extract the most uncertain predictions, relabel them, extend the class labels to the closest predictions, and repeat until the dataset doesn't change much anymore, so that the labels for the rows in our dataset are more or less fixed. Okay, so this is the active learning cycle, or in general, how Emil has been built.
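Putting the pieces of the cycle together (initial labels, model training, uncertainty-based selection, manual relabeling, 1-nearest-neighbor propagation, iterate to a fixed point), here is a runnable toy version. Everything is deliberately simplified: the data are 1-D numbers, a per-class centroid stands in for the random forest, the "margin" between the two closest centroids stands in for the difference between top class probabilities, and an `oracle` function stands in for the human doing the relabeling.

```python
def train(data, labels):
    # Stand-in model: one centroid per class (the talk used a random forest).
    groups = {}
    for x, y in zip(data, labels):
        if y is not None:
            groups.setdefault(y, []).append(x)
    return {y: sum(xs) / len(xs) for y, xs in groups.items()}

def most_uncertain(model, data, labels, frac=0.1):
    # The 10% of unlabeled rows with the smallest margin between the two
    # closest class centroids -- the analogue of a small difference
    # between the top class probabilities.
    def margin(x):
        d = sorted(abs(x - c) for c in model.values())
        return d[1] - d[0] if len(d) > 1 else 1.0
    unlabeled = sorted((i for i, y in enumerate(labels) if y is None),
                       key=lambda i: margin(data[i]))
    return unlabeled[:max(1, int(frac * len(data)))]

def propagate_to_closest(data, labels):
    # Extend labels to unlabeled rows via 1-nearest neighbor (k = 1).
    labeled = [i for i, y in enumerate(labels) if y is not None]
    return [y if y is not None
            else labels[min(labeled, key=lambda j: abs(data[i] - data[j]))]
            for i, y in enumerate(labels)]

def active_learning(data, labels, oracle, max_iter=5):
    # Iterate until the labels stop changing (a fixed dataset).
    previous = None
    while labels != previous and max_iter > 0:
        previous, max_iter = list(labels), max_iter - 1
        model = train(data, labels)
        for i in most_uncertain(model, data, labels):
            labels[i] = oracle(data[i])      # manual relabeling step
        labels = propagate_to_closest(data, labels)
    return labels
```

The loop structure is the interesting part; in the real project each stand-in is replaced by the corresponding workflow piece.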
So we have the user interface part, then we have the ontology and the model: we train our model, and we have the active learning cycle in there. Looking at this active learning cycle in detail, as we have seen, we have an initial labeling phase, a training set, and model training; then we choose the subset to be labeled, and here the active learning cycle starts. The initial labeling was based on distance, and for model training we used a random forest. Going into the details: the subset chosen for relabeling was the 10% most uncertain predictions, where uncertainty was the difference between the top three class probabilities for each prediction. We first labeled the predicted classes; the questions falling into the "something else" option, the category available on the choose-category web page, we labeled manually with a specific category. Then we extend this category to the closest predictions with a k-nearest neighbor with k equal to 1. We basically keep going until we reach a fixed version of the dataset.

And this is the evolution of the dataset, the changes in the training set. In the first iteration we used only the similarity approach: we used the classes of the ontology that we defined, and the questions from the forum. With this similarity approach, most of the questions ended up assigned to installation, which is the most prominent topic when getting started with KNIME, right? In the second iteration, when we started using active learning and the machine learning approach, we saw that the most prominent topic was ETL. And this makes sense: we are talking about the data science life cycle, and ETL is the most prominent topic, right?
In the third iteration, the one depicted here as AL iteration two, the dataset didn't change that much: there were some minor changes in the smallest topics, but ETL and some others were already fixed. So we decided not to continue with the active learning cycle. And here we have a bar chart that shows, across the different phases, which topics, which class labels, were most prominent.

About the evolution of the answers: we saw that there was actually no improvement in accuracy. At the beginning this seemed kind of weird, but in the end we saw an improvement in the links that were provided to the users based on the question asked. This means that accuracy is not always meaningful for the problem you are working on; also, since we were working with 20 different classes, the accuracy couldn't be that high.

We built a bunch of workflows to make this whole application work, and everything worked out pretty well, but we also decided to reuse some of the components across different workflows. We did that with microservices. One of the main advantages of microservices is being able to reuse smaller parts of the workflows, the components of the project you are working on, and in this way make it more efficient. In KNIME we have a way to create microservices. At the beginning we used metanode templates: again, those are subflows that we can reuse. We create templates, and then we can just drag and drop them onto the canvas and reuse them. To turn this into a microservice, we just replaced the metanode template with the so-called Call Remote Workflow node.
What this node does, basically, is call the execution of another workflow that is stored in our KNIME Explorer, okay? So in the end we had a number of workflows and a number of microservices that were called through this Call Remote Workflow node. And as you can see in this slide, here we have an example of a microservice. When you call an external workflow, a microservice, you always communicate with JSON: you always pass a JSON payload. The workflow takes a JSON as input, transforms it into a table, a KNIME table, something that KNIME can handle, then does all the processing or analysis, in this case a prediction, and transforms everything back into JSON, so that the Call Remote Workflow node gets the results back.

What I tried to show in this presentation was how to create a basic bot; the text processing involved in understanding the question asked by the user, as well as the keyword extraction; how to build an ontology, which is a really important task when your dataset has no class system; how to assign labels with an active learning cycle; and how to convert reusable subflows into microservices.

What did we learn during this project? Well, we learned that the KNIME Forum is often used as an educational tool: people often ask "how can I do this?", "can I connect to this or that?", et cetera. That keyword extraction is a plus with respect to keyword search; there should be a separation between those two topics. And that you should readjust the class system from time to time: in the end we settled on 20 classes, right? But if you look at the dataset in more detail, maybe you find that there are more classes to take into account for better predictions.
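The JSON-in / JSON-out contract of a called workflow can be sketched as a plain function: parse the payload into rows, run the processing step, and serialize the result back to JSON. The payload shape (a `"rows"` list of question objects) and the trivial `predict` function are invented for illustration.

```python
import json

def predict(row):
    # Placeholder for the model applied inside the called workflow.
    return "data access" if "database" in row["question"] else "something else"

def microservice(payload):
    rows = json.loads(payload)["rows"]        # 1. JSON -> table
    for row in rows:
        row["category"] = predict(row)        # 2. processing / prediction
    return json.dumps({"rows": rows})         # 3. table -> JSON
```

The caller never sees the internal table representation; as long as both sides agree on the JSON shape, the implementation behind the service can change freely, which is exactly what makes the pieces reusable.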
So it's better, when you have new data coming in, to readjust your class system. And of course we learned that accuracy is not everything; it's not the most important thing to look at. Also, from the questions we found on the forum, we said, okay, we probably have to create new educational material for data visualization, and maybe a new blog post on how to optimize KNIME workflows.

Okay, so there are several ways to extend this project. First of all, with word embeddings: in our case we used a one-hot encoding transformation, but the Document Vector node leads to a big and sparse feature space, and we could improve this by training a vector representation with word2vec. Another extension we could add is using the Keras integration for an LSTM; KNIME integrates Keras as well, so this might be another implementation. We could also investigate the role of the parameters that we used: the 10% uncertainty threshold, the k equal to 1 for the k-nearest neighbor, or other functions we may have missed that might give better predictions. Another thing we could do is add speech recognition, and maybe add the YouTube videos that we have on our KNIME TV channel. All of this might be helpful to extend our project.

Okay, so all of this has been published as a series of blog posts; if you are interested in this topic, you can find out more on the KNIME website. We also published the workflows that we have been working on, and they are all available on a public server that you can access directly from KNIME Analytics Platform. And we also have a white paper: if you go to the white papers section, you can find the white paper on Emil the teacher bot. Okay, thank you very much. Thank you.