Hello. This presentation is titled "How to Simplify Machine Learning Workflow Specifications". This is a useR! 2020 conference lightning talk. My name is Anton Antonov. I am a senior research scientist at Accendo Data LLC. Everything presented in this movie can be found in my GitHub account.

So what is this about? This presentation is about the rapid specification of machine learning workflows using natural language commands. Machine learning workflows are the things we want to automate, and in this presentation we demonstrate that with natural language interfaces to machine learning algorithms.

As a motivation, consider the following scenario. We want to create a conversational agent, or a bunch of conversational agents, that help data scientists and machine learning practitioners quickly create first, initial versions of different data science and machine learning workflows, for different programming languages and related packages. We expect those initial versions of programming code to be tweaked further in order to produce the desired outcomes in the application area of interest. The workflows considered in this presentation are quantile regression, latent semantic analysis, and recommendations.

My first example is going to be based on this data. This is a data frame with temperature data. You see we have two columns, time and temperature, and the data is plotted here. I am going to show the first quantile regression workflow, based on the data frame I showed earlier. There are three natural language commands here, and I am going to recite them: "Create from dfTemperatureData." "Compute quantile regression with 12 knots and probabilities 0.25, 0.5, and 0.75." "Show date list plot with date origin 1900-01-01." You can see that the quantile regression did find those curves at the requested probabilities.
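To make that concrete, here is a minimal sketch of the kind of R pipeline those three commands are translated into. It assumes the QRMon-R package (loaded here under the assumed package name QRMon) together with magrittr; the function names QRMonUnit, QRMonQuantileRegression, and QRMonPlot approximate that package's API, and the exact plotting arguments for the date axis are omitted because I am not certain of their names.

    # Hedged sketch of the generated pipeline for the first quantile regression example.
    # Assumes the QRMon-R package and magrittr are installed; dfTemperatureData is the
    # two-column time/temperature data frame shown in the talk.
    library(magrittr)
    library(QRMon)                                       # assumed package name for QRMon-R

    qrObj <-
      QRMonUnit(dfTemperatureData) %>%                   # "create from dfTemperatureData"
      QRMonQuantileRegression(
        df = 12,                                         # 12 knots (degrees of freedom)
        probabilities = c(0.25, 0.5, 0.75)) %>%          # requested probabilities
      QRMonPlot()                                        # date-list plot; date-origin option not shown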
So how is this done? For each given machine learning domain, like quantile regression or latent semantic analysis, we create two types of domain-specific languages. The first one is a software monad; think of a package that provides pipelines in a programming language. The second one is a domain-specific language that is a subset of a spoken language, like English, Chinese, Spanish, or something like that. These two domain-specific languages are combined: the natural language command DSL is translated into the programming pipeline DSL. By executing those translations we interpret commands of the spoken DSL into machine learning computational results. We assume that there is a separate system that converts speech audio into text.

This diagram rephrases what I just explained. We have the problem domain. We have developed software code; these are the yellow rectangles on the left. We have developed the grammar parsers of the natural language DSL; these are the green rectangles on the right. At some point we combine them. We can make single-line interpreters, or we can make more complicated interpreters. At some point we can hook up a device like Alexa or Google Home and make a full-blown conversational agent.

Grammar parsers. For each type of workflow a specialized domain-specific translation module is developed in the programming language Raku. Raku is also known as Perl 6. Each Raku module has grammars for parsing sequences of natural language commands of a certain spoken domain-specific language. That module also has the ability to translate the parsing results into corresponding software-monad code. Different programming languages and packages can be the target of that translation. At this point I have implemented DSL translators to Python, R, and Wolfram Language.

The quantile regression workflow translation step shown here is based on a sequence of three natural language commands. I am going to recite them: "Create from dfTemperatureData." "Compute quantile regression with knots 12 and probabilities 0.05 and 0.95." "Find outliers." You can see that from these three commands we produced these lines of code. The first one, QRMonUnit, creates the monad object. The second one does the quantile regression, with degrees of freedom 12, which corresponds to the knots being specified, and with probabilities that correspond to the probabilities in the natural language command. Then we find the outliers and do an outliers plot. The generated command lines are pasted together and given to the function parse. Parse produces an expression, and that expression is evaluated. Then we get the desired result: the points in blue are the top outliers and the points in red are the bottom outliers, which is what was requested by that sequence of commands.

My second example here is a latent semantic analysis workflow. It is based on Shakespeare's play Hamlet; in that play, every play part is a document. I have five commands, and I am going to recite them: "Create from text Hamlet." "Make document term matrix with automatic stop words and without stemming." "Apply LSI functions: global weight function IDF, local term weight function none, normalizer function cosine." "Extract 12 topics using method SVD, max steps 120, and min number of documents per term 2." "Show thesaurus table for ghost and grave." You can see that with these five commands we again generate a monad object: we make the document term matrix; we apply the term weight functions IDF, none, and cosine as specified; we extract 12 topics with the method SVD and max steps 120; and then we echo the statistical thesaurus for the words ghost and grave.

In this slide, again, we paste the programming lines together, give them to parse, and the result of parse is given to eval. We get this table, in which the leftmost column is the search term. This is the statistical thesaurus table which was requested in the last command, for the words ghost and grave. If we know what the play Hamlet is about, we can see that the related statistical thesaurus entries make sense: a word like ghost is very close to father, stage, and Christ, and grave is close to tongue and asleep. They apparently have a lot to do with Hamlet.

Next is a recommender workflow demonstration. I am going to use the Titanic data; you can see a sample of it here. I have five columns. The first column is the passenger ID. Passenger class is the second column. The third column is passenger age. Passenger sex is the fourth column. Passenger survival is the fifth column. Here is the summary. I want to make a recommender and make a profile recommendation with that recommender. Here is the sequence of natural language commands, and I am going to recite them: "Create from dfTitanic." "Apply the LSI functions inverse document frequency, term frequency, and cosine." "Compute the top six recommendations for the profile female==1 and 30==1." "Extend recommendations with dfTitanic." "Show pipeline value." These generated commands are directly evaluated.
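The parse-and-evaluate mechanism mentioned above is plain base R: the translated pipeline arrives as a character string, and eval(parse(text = ...)) runs it. Below is a minimal sketch of that pattern, using the recommender example as the stand-in for the translator's output. The SMRMon-style function names inside the string (SMRMonCreate, SMRMonApplyTermWeightFunctions, SMRMonRecommendByProfile, SMRMonJoinAcross, SMRMonEchoValue) and their arguments are my approximation of the generated code, not a verified rendering of the SMRMon-R API.

    # Sketch of evaluating translator output in R. The string stands in for what the
    # Raku DSL translator would return for the five Titanic recommender commands.
    library(magrittr)
    library(SMRMon)                                                      # assumed package name for SMRMon-R

    generatedCode <- paste(
      'SMRMonCreate(dfTitanic) %>%',                                     # "create from dfTitanic"
      'SMRMonApplyTermWeightFunctions("IDF", "TermFrequency", "Cosine") %>%',
      'SMRMonRecommendByProfile(profile = c("female", "30"), nrecs = 6) %>%',  # top 6 for the profile
      'SMRMonJoinAcross(data = dfTitanic) %>%',                          # "extend recommendations with dfTitanic"
      'SMRMonEchoValue()',                                               # "show pipeline value"
      sep = "\n")

    # The base R mechanism described in the talk: paste the generated lines,
    # parse them into an expression, then evaluate it.
    result <- eval(parse(text = generatedCode))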
We can see the result here. We have six recommendations, and all of them adhere to the specified profile: they are about females of age 30.

Two interesting questions. One is the handling of misspellings. This is the same series of recommender workflow commands I had on the previous slide, but I misspelled some of the words: "apply", "recommendations", and "profile" are misspelled. You can see that the Raku module actually communicated this and handled it. Although these are misspellings, the words are recognized because they belong to certain stencils of commands, and the interpretation proceeds anyway. So we get the results we expect.

Another interesting question is: can this approach be applied to other computational workflows? Of course it can; it follows exactly the same pattern. Since useR! 2020 was not held in person because of COVID-19, it seems fitting to choose this example. This is a specification of an epidemiology modeling workflow. I am not going to recite the commands, but basically we use a certain type of well-known compartmental epidemiological model. We assign different initial conditions and different rates. The model implies the usage of certain differential equations, so we assign initial conditions and parameters for those differential equations, simulate for 240 days, and plot the population results. You can see the generated code here, which corresponds to the commands given above, and the plot after evaluation has the typical patterns we expect from compartmental epidemiological simulations.

The last slide has references to all the packages used in this presentation. Thank you for your attention.