The title of this presentation is "DataSailr package enables intuitive row-by-row data manipulation". Hello, I'm Toshi. I'm developing a new package called DataSailr, which improves the data cleaning experience in R. It's developed on GitHub.

Before I talk about the details of this package, I'm going to talk about why I started to develop it. Usually, a data analysis workflow is composed of several steps, such as making an analysis plan, cleaning data, statistical analysis, and interpreting the results. I think most of you agree with these steps. In R, the boundary between data cleaning and statistical analysis can be vague, because R is a general-purpose programming language and has a lot of flexibility. So when you clean data, you can also do a t-test at the same time. And even while you are doing regression analysis, you can manipulate the data at the same time. So it's really flexible. However, I would like to focus on data cleaning when I do data cleaning. So I started to develop a new tool focused on data manipulation, which is called DataSailr.

What is DataSailr like? DataSailr does row-by-row data processing using Sailr script, a scripting language designed for this package. DataSailr does not replace R, of course. DataSailr just manipulates data using Sailr script.

How to install? As usual from CRAN: install.packages("datasailr"), and this installs it on your system. Actually, I'm still submitting this package to CRAN. So when you watch this video, if it's still not available, please visit this website. I'm hosting this website, and the packages are available there.

How to use? DataSailr has the sail function. We pass a data frame and a Sailr script to this function. The df argument accepts data compatible with a data frame. The code argument accepts a Sailr script as an R string. So we pass the data frame and the Sailr script, and the function returns the processed data frame.
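The shape of this call (a table and per-row instructions in, a processed table out) can be sketched in plain Python. This is only an illustration of the idea, not DataSailr's API: the names `process_rows` and `double_x`, the column `x`, and the input values are all hypothetical.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]

def process_rows(rows: List[Row], step: Callable[[Row], None]) -> List[Row]:
    """Apply `step` to a copy of each row and collect the results as a new table."""
    processed = []
    for row in rows:
        new_row = dict(row)      # copy, so the input table is not modified
        step(new_row)            # per-row instructions may add or update columns
        processed.append(new_row)
    return processed

def double_x(row: Row) -> None:
    """Hypothetical per-row step: store twice the value of x in a new column."""
    row["x_doubled"] = row["x"] * 2

table = [{"x": 1.5}, {"x": 4.0}]          # made-up input data
result = process_rows(table, double_x)
print(result)   # [{'x': 1.5, 'x_doubled': 3.0}, {'x': 4.0, 'x_doubled': 8.0}]
```

The caller only describes what happens to one row; the loop over rows lives in one place, which is the division of labour the talk describes.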
So this is a diagram of it: we pass a data frame and a Sailr script, DataSailr runs the calculation for each row, and it outputs the resulting data frame.

Then let's try DataSailr. In this example, I'm going to use R's iris dataset. Only these two columns, Petal.Length and Sepal.Length, will be used, and I'm going to call this iris-based data frame irisX. Let's try to divide this column by this column. How can we do it? Yes, by passing the irisX dataset and a Sailr script. Maybe you can imagine what this Sailr script does. Petal.Length divided by Sepal.Length is assigned to the petal-sepal ratio, which results in a new column, the petal-sepal ratio. This is called a variable, but variable names correspond to column names, including new column names. When we use a variable with a new column name, the new column is created.

This example shows the basic concepts of DataSailr. First, variable names correspond to dataset column names. This reduces the amount of typing, like non-standard evaluation in R. And this idea is familiar to statisticians, because they call columns variables. Second, a Sailr script just instructs how each row, or each record, is processed. So we can focus on how a single row is processed, and DataSailr applies it to every row.

I'm going to talk more about DataSailr functionality. From here, I'm using another dataset called mtcarsX. It's based on the mtcars dataset, extracting only one column, hp. So mtcarsX has only one column. mtcars originally has row names, so mtcarsX also keeps the row names. The functionality I'm going to cover is: variables and variable assignment, types, operators, built-in functions, if-else control flow, missing values, and regular expressions.

First, variables and variable assignment. As shown in the previous slides, each variable corresponds to a column name. For example, the hp variable corresponds to the hp column.
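The idea that variable names are column names, and that a script held in a string is applied to every row, can be imitated in a few lines of Python. This is a toy analogue under stated assumptions, not how DataSailr is implemented: `run_script` is a hypothetical helper, the script is Python rather than Sailr script, and the column names are rewritten with underscores because dots are not valid in Python identifiers. The two rows are the first two rows of iris.

```python
def run_script(rows, script):
    """Run `script` once per row, treating column names as variables."""
    program = compile(script, "<row script>", "exec")   # compile once, reuse for every row
    processed = []
    for row in rows:
        variables = dict(row)                            # column name -> this row's value
        exec(program, {"__builtins__": {}}, variables)   # assignments land in `variables`
        processed.append(variables)                      # new names become new columns
    return processed

# First two rows of R's iris dataset (lengths only), with Python-friendly names.
iris_x = [
    {"sepal_length": 5.1, "petal_length": 1.4},
    {"sepal_length": 4.9, "petal_length": 1.4},
]

result = run_script(iris_x, "petal_sepal_ratio = petal_length / sepal_length")
for row in result:
    print(row)
```

Because `exec` runs arbitrary code, this sketch is only suitable for scripts you wrote yourself; a dedicated script language avoids that problem by design.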
So a variable name corresponds to a column name, and I will use "variable" and "column name" interchangeably.

Assigning to a new column name. h.power does not originally exist in the mtcarsX dataset, so assigning to it creates a new column. In this case, h.power is a new column name and the new column is created. And this line assigns the value of hp times 2 to hp. Assigning to an existing column name updates the column, so in this case hp is updated with the original value of hp times 2. Let's look at the result. Yes, h.power holds the original hp values, and hp is updated with hp times 2.

Next, types. In Sailr script there are only three types: int, double, and string. Compatible types are converted between R and Sailr as follows. If the column of the data frame is an integer vector, it's treated as int in Sailr. If the column is a double vector, it's treated as double. And if the column is a character vector, it's treated as string. A Boolean vector is treated as 0 or 1, and a factor is treated as string. So you can use int, double, and string in Sailr script, and you can see the result here.

About operators. The assignment operator and arithmetic operators can be used: addition, subtraction, multiplication, division, and power. There are also comparison operators: equal to, greater than, less than, greater than or equal to, and less than or equal to. And a regular expression operator can be used as well. Here you can see hp times 2, this variable is halved, this one is a power, and this one is a square root. You can see the results of these operators.

There are also built-in functions. Strings are usually manipulated using built-in functions, and these functions start with str_. There are other kinds of functions too, which I'm not showing here. You can refer to the documentation.
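As a rough Python analogue of the two string operations shown on this slide (taking characters by position counted from 1, and turning a number into a string), here is a minimal sketch. The helper names `substring_1based` and `number_to_string` are hypothetical stand-ins chosen for this example, not Sailr's built-in functions; the row names and the hp value are taken from mtcars.

```python
def substring_1based(text: str, start: int, end: int) -> str:
    """Return characters start..end of text, counting from 1, both ends included."""
    return text[start - 1:end]   # Python slices count from 0 and exclude the end

def number_to_string(value) -> str:
    """Convert a number to its string form."""
    return str(value)

row_names = ["Mazda RX4", "Datsun 710", "Hornet Sportabout"]   # mtcars row names
first_three = [substring_1based(name, 1, 3) for name in row_names]
print(first_three)             # ['Maz', 'Dat', 'Hor']
print(number_to_string(110))   # 110
```

The index shift in `substring_1based` is the main thing to watch when moving between R-style 1-based positions and Python's 0-based slices.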
So in this example, this function is str_subset. I haven't talked about this variable yet, _rowname_. This special variable represents the row name. What this function does is take index 1 to index 3 of this string, so it extracts the first three letters of the row name, and this is assigned to the first-three column. num_to_str converts a number to a string. In this table, left-aligned means it's a string, so this column was converted to string successfully.

You can also use if-else control flow. As usual, the structure "if condition, statements; else if condition, statements; else, statements" can be used, and the else clauses are optional. In this example, if hp is larger than 145, power is assigned "high". And if hp is between 0 and 145, power is assigned "low". This value, 175, is larger than 145, so the power column has "high". In the other cases, the power column has "low".

You can also use missing values. A dot represents a numeric missing value, and an empty string represents a string missing value. So a dot means a missing value in Sailr: when you compare it with a numeric NA from R, it returns true. And when you compare an empty string with NA in a character vector from R, it returns true. Here is an example. I'm creating a new data frame containing a missing value. If x equals dot, the numeric missing value, x_na is assigned 1. Otherwise x_na is assigned 0. In this example, when x is NA, x_na has 1, and in the other cases it has 0. So it's working.

And lastly, regular expressions. Now, this is a really powerful feature. A regular expression can be written like this: re, slash, the pattern, and slash. This is a regular expression literal in Sailr script.
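For readers less familiar with regular expression groups, here is a small Python sketch of matching a pattern against row names and pulling out the part captured by parentheses. It uses Python's `re` module, not Sailr syntax. The pattern and the `matched` column name are assumptions made for this example; the row names come from mtcars, and an empty string stands in for a missing string value, as in the talk.

```python
import re

# Assumed pattern: row names starting with Merc or Porsche.
# The parentheses form capture group 1.
GERMANY = re.compile(r"^(Merc|Porsche)")

def classify(row_name: str) -> dict:
    """Return the columns derived from one row name."""
    row = {"country": "", "matched": ""}      # empty string = missing string value
    match = GERMANY.search(row_name)
    if match:                                  # the pattern matched this row name
        row["country"] = "Germany"
        row["matched"] = match.group(1)        # text captured by the first group
    return row

for name in ["Merc 240D", "Porsche 914-2", "Mazda RX4"]:
    print(name, classify(name))
```

Only the first two names match, so only those rows receive a country and a captured prefix; the third row keeps its empty strings.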
And also, equal tilde (=~) is the regular expression matching operator, which evaluates to a Boolean and should be used in an if condition. In this case, I'm creating a new regular expression pattern, and it's assigned to the germany variable. The germany variable, which represents this regular expression, is matched against the row name. If it's true, that is, if it matched, the country column gets "Germany". And this function is also executed. What is this function? It's a back-reference function. This pattern includes a grouping, this part in parentheses, which is called a group in regular expressions. The rexp_matched function can extract the matched strings. In this case, when the row name is a Mercedes, Merc is matched and the Merc part is extracted from the string, like this. So if the row name has Merc, the Merc part is extracted, and if it's Porsche, the Porsche part is extracted. You can see that this is working.

I'll also talk a little bit about DataSailr internals. The Sailr script is parsed into virtual machine code. The data frame is split into rows, each row is processed on the virtual machine using that virtual machine code, and the result is returned. These purple parts are the DataSailr package itself. These green and orange parts are implemented in the Sailr C/C++ library, which was created especially for this package.

Thank you for watching this video. This package is still in development, so feedback is welcome. This is a really difficult time, so I hope everyone is safe. See you in the future.