Okay, welcome to this session on using R at scale on database data, part of the R Consortium's R database webinar series. Many thanks to the R Consortium for hosting this event and to Eliza Trevino for supporting us today. In this session, we look at enabling R to work on database data at scale using available open source packages, and then at how Oracle Machine Learning for R makes it easy to do this with data in Oracle Database while simplifying solution deployment for applications. If you have questions during the session, please include them in the Q&A and we'll address them either with a text response in real time or at the end of the session.

So let's first introduce ourselves. I'm Mark Hornick, Senior Director for Oracle Machine Learning Product Management. I've been involved with the R Consortium since its founding and with R more generally for nearly 15 years, and I've focused on in-database machine learning since Oracle's acquisition of the company Thinking Machines in 1999. So, happy to be here to talk to you about Oracle technologies. And Sherry, please introduce yourself. Hi, everyone. I'm Sherry LaMonica. I work with Mark in Oracle Machine Learning Product Management at Oracle, and I have been supporting customers using R for almost 20 years now. Happy to be here. Great, thank you, Sherry.

So for those of you who may not be familiar with Oracle as a company, a few brief comments. Oracle is a provider of enterprise cloud and on-premises applications and infrastructure, as well as enterprise software. You probably know us best from our flagship Oracle Database, which supports what we call a converged database management system, integrating technologies like machine learning, graph, and spatial within the same database. On Oracle Cloud Infrastructure, we introduced Autonomous Database, which automates infrastructure, database, and other data center operations, taking care of database administration so you don't have to spend time and resources tuning your database, applying patches, or updating software, among other features.

Now, how does Oracle support the R community? Well, we've been a member of the R Consortium since its inception in 2015, with board membership from 2015 to 2020 and current participation on the marketing committee. To help our users work more seamlessly with their database data, we introduced Oracle Machine Learning for R, and with OML4R, Oracle continues to enhance the ability of our users to take advantage of powerful database capabilities from the convenience of R. We also contribute open source R packages and participate in R Consortium working groups.

Now for our agenda: we'll start with some background on using R with databases, the issues users encounter for scalability and performance, sometimes viewed as two sides of the same coin, and the tools commonly used to interact with databases. We'll then look at using OML4R to access and manipulate database data, use in-database algorithms from a native R API, and run user-defined R functions from SQL and REST. Sherry will be demonstrating aspects of each of these capabilities as we go through, and I'll finish with a brief overview of complementary Oracle Machine Learning components.

For some background on interacting with databases from R, we might ask why this is important. Perhaps the most obvious reason is that databases store and manage data, typically with a focus on performance, scalability, and security.
And since R uses data for statistical and other analyses, it's a natural combination. Of course, SQL is the most common language of databases, supporting operations to create, retrieve, update, and delete data, and to join and transform data from multiple tables, typically using SQL queries. So when users need data, they may end up importing it from Excel or CSV files, which might have been extracted from databases by IT professionals, especially in larger enterprises. Some R users may have programmatic access to pull data to the R client and then push results back to the database.

But what are some of the issues with these approaches? Well, we know that files may be large, resulting in data load latency or network data transfer latency, and files can have size limitations as well. By definition, data snapshots are immediately obsolete and often require refreshing. Data errors may require going back to the source for correction, which means re-retrieving the data. And because we're pulling data, we can lose control over data security. In addition, data privacy laws can even require keeping data in source systems or locations.

Now, when talking about separate analytical engines, we might refer to traditional R engine and database interaction. Analyzing data using separate analytical engines may require enterprise users to request data extracts, load the resulting flat file data into memory, and then push results back to the data repository. There's usually a human intermediary for data extraction from enterprise systems, and many times the needed data may require a few iterations to get right. Other users may be able to extract data programmatically through various connectivity packages; for Oracle Database, this may be ROracle. However, problems arise for larger datasets given network bandwidth constraints.

Overall, the challenges include data access latency and the need for a paradigm shift, that is, changing language to access and manipulate data. If you're an R user, you may not want to have to switch to SQL to specify the data of interest or the transformations needed on that data. Perhaps even more critical are memory limitations, with the need for the full dataset to reside in memory for analysis. There's also the issue of single-threaded packages, where even if you have multiple processors available, the software can't take advantage of them. And when data is pulled from the database, there may also be issues of backup, recovery, and security. Finally, how do we get solutions to production deployment? There's the need to spawn R engines, load functions and data, and enable applications and dashboards to access results.

So let's look into this, but first let's do a quick poll. Which of the following pain points do you experience? Please select all that apply: data access latency, moving large-volume data, changing data access language or paradigm shift, memory limitations, single-threaded functionality, issues around backup, recovery and security, or getting solutions to production deployment. And what we're seeing here is great responses, wow. Memory limitations is coming in at 75%, moving large-volume data at close to 70%, getting solutions to production deployment about a third of you, data access latency over 40%, and the other issues are each close to about 20%.
So yeah, that's really good feedback, and it's important to have an appreciation for what our users are encountering. Alright, thanks for those responses. We'll first talk about data movement and its impact on scalability. To pull data from the database, we have to get it to the R engine. And while the visual here is drip, drip, drip to represent moving large-volume data, users can typically move smaller amounts of data reasonably well, but there are still other issues. As data gets large, we have to consider disk and memory: pulling data to the client may exhaust local memory or even disk storage. And this can be costly as bits travel across the wire, especially if cloud vendors charge egress fees to move data from one environment to another. We also need to consider data backup, recovery, and security. Ideally, we could leave data where it lives and perform processing locally.

So can programmatic access help address some of this? Well, there are R packages that provide programmatic access to database data from R using an explicit connection object. Users can query and manipulate database data, leveraging the database as a high performance compute engine. The R community developed packages like RODBC and RJDBC, which provide standard interfaces for interaction with the database. ROracle is an Oracle Database-optimized package that, like RJDBC, provides a DBI-compliant API, and OML4R uses ROracle for its own database connectivity. There are a few other packages of note. RSQL generates and runs SQL queries against the underlying database without having to explicitly write SQL code. Then we have dplyr, part of the tidyverse, which provides a grammar of data manipulation with a consistent set of verbs to address common data manipulation requirements. This is complemented by dbplyr, a database backend for dplyr that allows treating remote database tables as if they were in-memory data frames by automatically converting dplyr code into SQL.

Some of the benefits of these packages for in-database performance are the ability to take advantage of column indexes, query optimization, parallelism, and disk-level partitioning provided by the underlying RDBMS. However, there are some limitations. Obviously, users are limited to the interfaces provided by each connectivity package, but also, data must be pulled to the client to use native R syntax or third-party packages, and of course that data must fit in memory. Also, these interfaces don't include integrated data-parallel processing support, which needs to be explicitly programmed using other packages at the client.

So for our second poll, let's find out which of the following interfaces you might be using today for database interaction: RODBC, RJDBC, ROracle, RSQL, dplyr, dbplyr. We're coming in at about 75% leveraging ROracle. We have RODBC, we have dbplyr and dplyr, of course, and a couple of others. If there are other systems or tools you're using, please put them in the chat; we'd be interested in hearing what they are. Okay, now we're at 44% for dplyr, 17% for dbplyr, and a good set of responses. Okay, thank you very much.

So what do some of these interfaces look like? Let's take the ROracle package, a DBI-compliant interface. It's open source and available for download.
In this example, we connect to an Oracle database and fetch rows using a query. We get a driver object, connect to the database, and invoke dbSendQuery, which takes a connection object and a SQL query as arguments. And this is what we're talking about when we say some interfaces involve writing SQL; this is both powerful and limiting at the same time, depending on the user's skill set. Now we can fetch the data from the database to the client, but we can also write data frame data to the database, so when we're finished, we have a table in the connected database schema.

Let's consider an example using dbplyr, where we want to compute the mean sepal length for each species in the iris dataset. We first load the dplyr package and connect to the database. Here we show creating a database table with the iris dataset, and then getting what is essentially a proxy object, called iris2, to use in other dplyr calls. Now we can lazily generate a query to group by species and summarize the mean sepal length, and then arrange the result in descending order. On the right-hand side, we can see the query that is produced for the result. And by invoking collect, we actually run the query and get back our result.

So this gets us pretty far in being able to manipulate data in the database and leverage the database as a high performance compute engine. But there are other considerations. Let's take another look at memory and scalability. We can write data to the database and operate on it remotely using a connectivity package. But if we want to use local functionality or third-party packages that don't automatically get mapped to SQL, we need to load data into R memory. And as we've seen, this can be a non-starter: most R packages expect the data to fit in memory, such as when visualizing data or building or using machine learning models. A typical approach to get around this is to chunk the data, process one chunk at a time, and aggregate the partial results where possible or necessary. And this likely involves manual coding.

So we may have addressed scalability, but what about performance? We still have to process each chunk serially and effectively add the time to aggregate the result. If we want performance as well, we can spawn multiple R engines to process chunks in parallel. If we spawn multiple R engines, we can expect to see elapsed time reduced significantly, perhaps by as much as a factor of the number of R engines used. Now, this data parallelism approach is often referred to as embarrassingly parallel, since it's relatively easy to implement.

So what are some use cases that can benefit from data parallelism? Perhaps the most obvious one is scoring data at scale, which is inherently an embarrassingly parallel problem: you've built a machine learning model but want to do large-scale batch scoring. Other ways to leverage data parallelism are computing statistics or generating plots on a per-customer or per-region basis. Model building can also benefit by using different data subsets to create ensembles. In other situations, there are more task-parallel use cases, such as Monte Carlo simulations run on data that's randomly generated or selected. Now, with parallelism, we can improve performance by performing multiple operations concurrently, as we've seen. Like most things, achieving parallelism can range from simple to complex.
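To make the two examples described above concrete, here's a minimal sketch of what that code might look like. The connection details, table names, and query are placeholders rather than the exact slide contents, and the dbplyr portion assumes a DBI connection with an Oracle SQL translation available:

```r
# ROracle: connect, query, and write a data frame (connection details are placeholders)
library(ROracle)
drv <- dbDriver("Oracle")
con <- dbConnect(drv, username = "rquser", password = "pwd", dbname = "mydb")
res <- dbSendQuery(con, "SELECT * FROM ONTIME_S WHERE DEST = 'SFO'")
dat <- fetch(res, n = -1)              # pull all matching rows to the client
dbWriteTable(con, "IRIS_TABLE", iris)  # write a data frame back as a database table

# dbplyr: mean sepal length per species, computed in the database
library(dplyr)
iris2 <- tbl(con, "IRIS_TABLE")        # proxy object for the remote table
qry <- iris2 %>%
  group_by(Species) %>%
  summarize(mean_sl = mean(Sepal.Length, na.rm = TRUE)) %>%
  arrange(desc(mean_sl))
show_query(qry)                        # inspect the SQL that dbplyr generates
result <- collect(qry)                 # run the query and bring the result back
```

The key point of the dbplyr half is the laziness: nothing runs in the database until collect (or a similar retrieval) is invoked.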
Now, on the simple end of the spectrum is data parallelism, where the basic algorithm is run on data in parallel using multiple threads or processes across data partitions. Typically on the complex side is individual algorithm parallelism. Many algorithms are implemented with serial processing in mind, and to gain the benefit of parallelism, some require significant redesign. Others, by their very nature, simply can't be parallelized. We should note that scalability is not just about the size of the data, but also about being able to handle larger data volumes in a reasonable time frame to support enterprise needs.

Now, to relate that to R packages: some packages are single-threaded, so no parallelism is enabled or supported. Some are designed with parallelism in mind, but may rely on specific parallel packages and infrastructure. A third option is to leverage optimized math libraries, so that no algorithm redesign is involved. For this last case, we can transparently boost performance with hardware-optimized libraries. Replacing the underlying math library allows us to transparently substitute optimized math routines for those provided with the core R engine. For example, Intel's MKL, AMD's math libraries, or similar libraries can reduce elapsed time by speeding up operations on, for example, matrices. This enables some degree of parallelism without requiring a top-level algorithm redesign, although it may not yield the best possible results. As shown in the figure here, hardware-specific libraries can optimize runtime. Looking at this benchmark result, we see that using an optimized library can have dramatic performance implications, even when running single-threaded, such as for matrix multiplication. If we add more threads, we can do better, but also notice that not all operations see an improvement.

So let's take a specific example of how we might use data parallelism to make predictions using R. To start, we'll build a naive Bayes model to predict species using the iris dataset. After connecting to the database and pulling the data into memory, we define and run a function that builds the model and then saves that model to disk. We can implement data-parallel processing using the parallel and foreach packages. In this example, we want to score data using the model just created, so we'll connect to the database and pull the data into memory. Then, to enable data parallelism, we partition it into some number of groups; here we chose 15, with 10 rows in each. On larger datasets, imagine each partition consisting of many thousands of rows, or even more, though this depends in part on the cost of producing each score and the available memory. Now, although the iris dataset has only 150 rows and five columns, which easily fit in memory, consider the scalability implications at each stage if we had 150 million rows to score.

Continuing, we load our model from disk, and we'll first illustrate data parallelism using mclapply from the parallel package. We invoke mclapply with our partitioned data and scoring function, choosing five cores to use. Since the result comes back as a list, we first rbind the results before writing those scores to a database table. Alternatively, we could iterate over the list of data frames and insert them into the table incrementally to conserve memory. Next, we compute a confusion matrix of the actual and predicted values and write the batch scores to the database.
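As a rough sketch of this pattern, here's how the partitioned scoring with mclapply might look, assuming an e1071 naive Bayes model saved to disk and an open ROracle connection named con. File, table, and object names are placeholders:

```r
library(e1071)
library(parallel)

# Build a naive Bayes model once and save it to disk (placeholder file name)
nb_mod <- naiveBayes(Species ~ ., data = iris)
saveRDS(nb_mod, "nb_model.rds")

# Partition the data into 15 groups of 10 rows each
parts <- split(iris, rep(1:15, each = 10))

# Load the model and score each partition in parallel on 5 cores
mod <- readRDS("nb_model.rds")
score_fun <- function(df) {
  data.frame(df, PREDICTION = predict(mod, df))
}
scored_list <- mclapply(parts, score_fun, mc.cores = 5)

# Combine the partial results, check the confusion matrix, and write back
scored <- do.call(rbind, scored_list)
table(actual = scored$Species, predicted = scored$PREDICTION)
# dbWriteTable(con, "IRIS_SCORES", scored)   # assumes an open ROracle connection `con`
```

Note that mclapply relies on forking, so this particular sketch runs in parallel on Linux and macOS but falls back to serial behavior on Windows.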
We can do something similar using the foreach package, but here we first explicitly make and then register the cluster, and when finished, stop the cluster to free up resources. So what scalability concerns do we have with this approach? Well, we need to pull the data from the database to the client and then partition it. The partitioned data actually consumes significantly more memory than the original data, so we need to factor this into any solution. And since our model is stored in a flat file, we need to ensure our production process has access to it. We also need to factor in backup, recovery, and security for not only our data, but the model and scripts as well. All the processing occurred on the client machine, so it needs to be sized accordingly. Next, the result is returned as a list of data frames that we rbind together, and if this result is large, additional memory considerations again come into play. And in this scenario, we also write the results back to the database.

But there's more to the story than just using an R API to move data in and out of the database, or transparently manipulating data in a database with generated SQL as though it were a data frame. What other aspects should we consider? Well, we want to perform broader statistical analysis without data movement, which is really important for scalability, and we want to manipulate database data using standard R syntax and functions. We know packages like dplyr provide a grammar of data manipulation, but could we use native R syntax and functions to manipulate and explore data as well? How about building machine learning models and scoring data without data movement? If we're using commonly available open source algorithms, we have to move data from its source and load it into memory before we can build a model, and this can significantly limit scalability. There's also the need to use native R models at scale by leveraging built-in data parallelism. As we've seen, we can do this using the foreach and parallel packages, but there are alternatives that avoid moving data to the client. And there's the question of how to invoke user-defined R functions for deployment using SQL and REST APIs without having to explicitly manage R engines or code for data parallelism.

So this is where OML4R comes into the picture. The objectives we have for it are to empower R users with in-database features and functionality so they can take advantage of the database even more as a high performance computing environment, and to minimize data movement and scale solutions. Through an R interface, you can leverage in-database machine learning algorithms as well as store and manage R objects, like machine learning models, directly in the database, avoiding the need to manage flat files. You can also integrate results from user-defined R functions into applications and dashboards, and do this from languages like SQL and REST that are common to many applications. And we can eliminate the need to explicitly provision R engines for solution deployment.

So how do we do this? Well, we use proxy objects and overloaded functions to take advantage of database column indexes, query optimization, parallelism, and table-level partitioning. The use of proxy objects and overloaded functions can greatly improve performance and scalability for database table operations, since the data is not being brought to the client. And there are several ways to get proxy objects.
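For the foreach variant mentioned at the top of this section, a minimal sketch, reusing the partitioned data (parts) and model (mod) from the previous sketch, might look like this:

```r
library(foreach)
library(doParallel)

# Explicitly create and register a cluster of 5 workers
cl <- makeCluster(5)
registerDoParallel(cl)

# Score each partition in parallel; .combine = rbind stitches the pieces together.
# `parts` and `mod` come from the earlier sketch; when run at the top level,
# foreach exports the referenced objects to the workers.
scored <- foreach(df = parts, .combine = rbind, .packages = "e1071") %dopar% {
  data.frame(df, PREDICTION = predict(mod, df))
}

# Stop the cluster to free resources when finished
stopCluster(cl)
```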
We can create a table from an R data frame using ore.create, specifying the data frame, in this case iris, and the table name we wish to use. We can also create temporary tables that are automatically reclaimed when the database connection ends using ore.push. If the table already exists in the database, the most common use case, we can use the function ore.sync and specify the table name we wish to access. Once we have a proxy object, we can use it to invoke familiar R functions like dim, head, and summary. So whether we have 150 rows, like iris, or 150 million rows, the computation occurs in the database, and there's no need to worry about getting that data from the database to the client to compute the summary statistics. We also have other overloaded functions, including those that implement the dplyr interface.

So let's consider an example involving the overloaded aggregate function. We first get a proxy object for a table called ONTIME_S and aggregate to count the number of flights to each destination. This request effectively results in a query with a GROUP BY on the DEST column. Running this function, we get back an ore.frame object, and it's only when we go to retrieve the result that we incur the computation cost. Invoking the head function computes the result and retrieves the first few rows. Note that results may be large themselves, so rather than pulling the entire result to the client, requiring both time and memory, assuming it can fit in memory of course, the result remains in the database with a proxy object returned for subsequent use.

Now, the datastore for R object persistence is another capability, for when you have R objects that you want to store in the database rather than in flat files. This centralizes the objects for a deployed solution and keeps the content secure, and it also takes advantage of database backup and recovery. Using ore.save, users can save R objects, including OML4R proxy objects, in the database as a named entity, and then using ore.load, load them back into R memory across R and database sessions. Datastores are also used for passing non-scalar arguments, like machine learning model objects, to user-defined functions for use in embedded execution, which we'll discuss in a few minutes. A datastore can contain multiple R objects, making it very convenient for passing both scalar and non-scalar objects to user-defined functions.

So let's see a brief demonstration of OML4R for data exploration and preparation, including dplyr. Sherry, would you please take us through that? You're looking at the Oracle Machine Learning Notebooks interface, which is a Zeppelin-based interface included with Oracle Autonomous Database. After your database is provisioned, OML Notebooks is ready; there's nothing to configure, and it's available for you to use. OML Notebooks supports several different languages: R, but also Python, SQL, PL/SQL, and Markdown, as you'll see. In a notebook, you can have paragraphs with R, Python, SQL, or PL/SQL code, and you can choose the one that best suits your use case. We're starting here by downloading and activating a conda environment that our admin created, which contains some additional libraries we're going to be using in our scripts. In this case, the environment name is myrenv, and it has ggplot2, forecast, and e1071.
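A minimal sketch of the proxy-object and datastore workflow just described is shown below. The ONTIME_S table and DEST column come from the example; the other object and datastore names are placeholders, and in OML Notebooks the database connection is already established for you:

```r
library(ORE)                         # OML4R (Oracle R Enterprise) client packages

# Create a persistent table from a data frame, or a temporary one with ore.push
ore.create(iris, table = "IRIS")
iris_tmp <- ore.push(iris)

# Get proxy objects for existing tables
ore.sync(table = "ONTIME_S")
ot <- ore.get("ONTIME_S")

# Overloaded functions run in the database; no data is pulled to the client
dim(ot)
head(ot)
summary(ot)

# Count flights per destination; the result is an ore.frame proxy until retrieved
cnt <- aggregate(ot$DEST, by = list(ot$DEST), FUN = length)
head(cnt)                            # invoking head() triggers the query

# Persist R objects in a database datastore instead of flat files
lst <- list(a = 1:10, b = letters)
ore.save(iris, lst, name = "ds1", overwrite = TRUE)
ore.datastore()                      # list available datastores
# ore.load("ds1")                    # load the objects back into the R session
```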
Oracle Machine Learning for R contains a small subset of third-party packages that are dependencies of the product, but you can use conda environments to install additional packages. So we're loading our libraries and then starting with our data preparation and exploration. The first thing we're doing is syncing some tables. When we sync these tables, we're not pulling them locally; we're using the ore.sync command to create a pointer, a proxy object. We have a 1 million row table, a 10 million row table, and a 100 million row table that we'll be using for this demo: SD1M, SD10M, and SD100M. Then I can use ore.ls to list the available proxy objects, and look at the data query slot to see the underlying query for that proxy object. The number at the front here is the temporary object name, but that's our underlying query.

Then, as Mark mentioned, I can use common R syntax like colnames, here on the 100 million row table, to get the column names, and head to get the first five rows of the table. The z.show is a Zeppelin function for formatting output. I can run summary on a proxy object; here I'm showing this for the 1 million, 10 million, and 100 million row proxy objects. And then aggregate: I'm returning the count by group, by occupation. But as Mark said, when I run these base R functions, the computations are happening inside the database, not in your local R engine. Then I do column and row filtering: I select customer ID, education, and occupation from the 100 million row dataset using column indexes, and then I filter rows where the education variable is equal to 10 and the occupation is equal to Farming, returning the first three rows.

One of the features in Oracle Machine Learning for R, as Mark mentioned, is that we've overloaded the dplyr functions to run in the database through the OREdplyr package, and you have to load this package explicitly. So we load the OREdplyr package. I already have a table for the iris data, so I'm syncing that and creating a proxy object, and then I'm using summarize to return the mean of sepal length by the species category. You can use OREdplyr to select columns, exclude columns, and filter rows, just like you can with open source dplyr. We can also use merge to join tables, and the OREdplyr left join.

And then with respect to the datastore: we can store objects in the database datastore. They can be R objects or OML4R objects, and they're saved using the ore.save function, which is analogous to the open source save function. I'm saving them in the database and can load them later; we'll show that in the embedded execution section. Here I'm saving an R data frame, iris, and then a list to the same datastore. I can overwrite the contents of my datastore, or I can append a new object to it, and then use the ore.datastore function to take a peek inside the datastore. There we can see the different objects that I have in there, and I can select a datastore to show its contents. Okay, back to you, Mark.

Great, thank you, Sherry. So let's continue with building machine learning models and scoring data at scale.
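As a rough sketch of the OREdplyr portion of this demo, written without pipes and with table and column names patterned on the description above (they may differ from the actual demo tables):

```r
library(OREdplyr)                    # must be loaded explicitly; overloads dplyr verbs for ore.frames

ore.sync(table = "IRIS")
IRIS <- ore.get("IRIS")

# Mean sepal length by species, computed in the database
summarize(group_by(IRIS, Species), MEAN_SL = mean(Sepal.Length))

# Select columns and filter rows, just like open source dplyr
sub <- filter(select(IRIS, Species, Petal.Length), Petal.Length > 5)
head(sub, 3)
```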
And this is where you want to leverage familiar algorithms designed, or redesigned, for performance and scalability. There's no movement of data to other engines; the algorithm code is actually in the kernel database software, beneath the security layer. The algorithms have been redesigned for distributed parallelism and scalability across cluster nodes, along with optimized memory utilization: data is brought into memory incrementally as needed, model objects are cached to be shared across queries, and they even rely on the database memory manager for efficient allocation in a multi-user environment. Now, on Exadata, for those of you familiar with that Oracle hardware platform, scoring can take advantage of storage-tier optimizations. And all of this is available from an R interface.

Now, there are many factors that affect model building performance and scalability. Data volume is perhaps the most obvious, but data movement and loading also need to be factored in and can be a real bottleneck if you need to move data to and from the database. The choice of algorithm, as we see in this plot, can have vastly different computational demands, as can the algorithm implementation, as we've been discussing, and of course the number of concurrent users and the system load. The plot here shows near-linear scalability for model building with data ranging from 100 million to 800 million rows on Autonomous Database, with relatively quick performance times.

So what are some other characteristics of in-database machine learning algorithms and models? Well, first, the resulting models are first-class objects in the database, stored and managed in the user schema. You can control access by granting and revoking permissions, audit user actions, and export and import machine learning models across databases, which can cross the cloud and on-premises boundary as well. On the right, we're showing the set of machine learning techniques and corresponding algorithms exposed through an R interface. These algorithms benefit from algorithm-specific automatic data preparation, where required transformations can be automatically determined, applied, and maintained with the model. There are other features, but I'll just mention explanatory prediction details, where each score can include the top N factors that most contribute to each individual prediction. So Sherry, let's see a brief demonstration of the OML4R modeling capabilities.

Okay. As Mark mentioned, the in-database machine learning algorithms are exposed through R APIs, and this is one of them: ore.odmGLM is the interface to the in-database GLM algorithm. As you can see here, we're using R syntax to build the model, but the model is being built in the database environment, and the in-database machine learning models are first-class objects in the database, which means you can control access by granting and revoking permissions, import and export models across databases, and audit user actions. Here I'm building a regression model to predict years of residence in my 100 million row dataset. The function itself has its own set of hyperparameters and arguments that you can supply right here in the API, but you can also pass additional hyperparameters that are available in Oracle Machine Learning for SQL, on the in-database side, through a settings list, and then pass that list of parameters to the odm.settings argument when you're building the model.
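A minimal sketch of what building such a regression model might look like. The table name (SD100M), predictor columns, and the example OML4SQL setting are assumptions based on the description, not the exact demo code:

```r
ore.sync(table = "SD100M")
SD100M <- ore.get("SD100M")

# Additional in-database (OML4SQL) settings can be passed via odm.settings
settings <- list(GLMS_SOLVER = "GLMS_SOLVER_QR")   # example setting only

# Build a GLM regression model in the database to predict years of residence
glm_mod <- ore.odmGLM(YRS_RESIDENCE ~ AGE + EDUCATION + OCCUPATION,
                      data = SD100M,
                      odm.settings = settings)

summary(glm_mod)      # model summary, including coefficients and statistics
```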
So I'm doing that here, and then I can use the settings function to display the model settings, display the model coefficients by getting a summary of the model, and display the statistics as a data frame. Another example is building an in-database random forest to predict affinity card. In this case I'm persisting the model; well, the model is in the database, but I'm giving it a name, RF classification model. ore.odmRF is the R interface to the in-database random forest algorithm. I'm creating that model object, called mod, and then I can use the predict function in R to make predictions with that model, returning them as an R object. Or, on the SQL side, using the SQL interpreter here in Zeppelin, I can make predictions using that same model from SQL and show the prediction details, so we can see why the predictions were made. And back to you, Mark.

Thank you, Sherry. So the last piece we want to talk about is simplifying solution deployment using the database environment and what we'll call embedded execution. This is also relevant for scalability, given the option of data-parallel and task-parallel invocation supported by the underlying system, independent of the client R engine. You can manage and invoke user-defined R functions in the database environment, which will dynamically spawn and manage R engines. You can use R, SQL, and, on Autonomous Database, REST interfaces, and the system automatically loads the data into your user-defined R function. You can also use additional third-party R packages to augment database functionality, keeping in mind that the performance and scalability characteristics of a given package remain the same; this is where data-parallel or task-parallel invocation can facilitate its use. When returning values, you can return structured data and images, which can come back as a row set or as a JSON or XML string. On Autonomous Database, you can also invoke UDFs asynchronously from SQL and REST, because sometimes you don't want your client waiting for the function to complete.

Here's a visualization of this. Suppose we have multiple packages installed in the environment, and we want to write a function that leverages, perhaps, kernlab. We write our user-defined function and store it in the R script repository. Now let's use the REST or SQL interfaces to invoke that function. The database environment will spawn multiple R engines, if that's what we request, and it loads the function and the necessary data. The results that are created, which can be structured or image results, are put back in the database for subsequent use. And finally, the R engines are automatically cleaned up.

So Sherry, please take us through a demonstration of OML4R embedded execution. Here in an R paragraph in my Zeppelin notebook, I'm creating a user-defined function, and I'm loading forecast and ggplot2. This is just an example; there was a question in the Q&A about using standard open source ML modeling packages, and you could use those here as well. I'm creating a function called multi.forecast. Now, you don't have to create a function, you could just run the code in standalone R as well, but I'm creating a function so that I can save it to the script repository and run it from SQL and REST later. But first, I'm going to show you the standard R API for embedded R execution.
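Going back to the random forest example at the start of this section, a minimal sketch might look like the following. The table and column names are assumptions, and the step in the demo that assigns a user-chosen name to the persisted model is omitted here; from SQL, the model can then be used with Oracle's PREDICTION operator as described:

```r
ore.sync(table = "SD100M")
SD100M <- ore.get("SD100M")

# Build an in-database random forest classifier
mod <- ore.odmRF(AFFINITY_CARD ~ ., data = SD100M)

# Score in R; the prediction runs in the database and a proxy result is returned
pred <- predict(mod, SD100M, supplemental.cols = "AFFINITY_CARD")
head(pred)

# From SQL, a named in-database model can be scored with the PREDICTION operator,
# e.g. (assuming the model name used in the demo):
#   SELECT PREDICTION(RF_CLASSIFICATION_MODEL USING *) FROM SD100M;
```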
So I create my UDF, multi.forecast, and then, just to show you that you can run this in R and get a result, this function returns both an image and data. You can also run this through the R interface for embedded R execution. This is one of those APIs, ore.doEval. As you can see, on the left the R function is being run in the client R environment, but ore.doEval is run in the database environment, and the database is managing and spawning those R engines. ore.doEval is not a parallel-enabled function; this is just to show you that you can run a function in the database environment and return data and an image.

So now I'm saving this multi.forecast UDF to the script repository. I'm saving it to a script with the same name, overwriting any script with that name if it already exists, and the reason I'm doing this is that when I use the SQL and REST APIs for embedded R execution, I'll actually be calling the script from the script repository. We'll show that in a minute.

Here's another R UDF that builds a naive Bayes model. In this case, what it's doing is saving the model to a datastore with a name that I provide; the datastore name is ds1. I'm creating a script that builds the naive Bayes model using the iris proxy object, and then I can get the datastore summary for ds1. Now, on the left we have a scoring function. What the scoring function does is load the model that I just saved in the datastore, using the ore.load function, and then do the scoring. So I'm saving my scoring function to the script repository, overwriting any script with the same name if it exists. And now that it exists in the script repository, I can run the scoring UDF from the R API for embedded R execution. In this case we're using ore.rowApply with the iris proxy object, calling the naive Bayes scoring script from the script repository and providing the datastore name, ds1. ore.rowApply runs in a parallel, row-wise fashion, so I'm specifying 30 rows at a time with five parallel engines, and then specifying the structure of the output, which is going to be a data frame. And again, this is running in the database environment; it's not running in your local R session.

In addition, as Mark mentioned, besides the R API for embedded R execution, we also have SQL and REST APIs for embedded R execution. One prerequisite, like I mentioned, is that you need to have your function saved as a script in the script repository. You also need to get a token: your admin will need to add you to the database access control list, and then you exchange your username and password for a token, which is good for an hour. There's a procedure that we've included in this notebook that will get the token for you. This paragraph is a PL/SQL function that we've created so you can get that token, and then we save that token to a table. It looks long and complicated, but really all you need to do is insert your own URL and your own OML username and password; then your token is saved to a table, we pass it to an authentication script, and you're ready to use the SQL API.
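As a rough sketch of the embedded R execution flow just described, with simplified placeholder UDF bodies, assumed script names, and the datastore name from the demo:

```r
# ore.doEval runs a function once in a database-spawned R engine (not parallel), e.g.
#   ore.doEval(FUN.NAME = "multi.forecast")

# A simple UDF that builds a naive Bayes model and saves it to a datastore
build_nb <- function(dat, dsname) {
  library(e1071)
  mod <- naiveBayes(Species ~ ., data = dat)
  ore.save(mod, name = dsname, overwrite = TRUE)
  TRUE
}

# A scoring UDF that loads the model from the datastore and predicts
score_nb <- function(dat, dsname) {
  library(e1071)
  ore.load(dsname)                   # restores the saved object `mod`
  data.frame(SPECIES    = as.character(dat$Species),
             PREDICTION = as.character(predict(mod, dat)))
}

# Store both functions in the R script repository (overwrite if they exist)
ore.scriptCreate("build.nb.model", build_nb, overwrite = TRUE)
ore.scriptCreate("score.nb.model", score_nb, overwrite = TRUE)

# Run the build once in the database, then score 30 rows at a time across
# 5 database-spawned R engines; IRIS is the proxy object for the iris table
ore.tableApply(IRIS, FUN.NAME = "build.nb.model", dsname = "ds1")
res <- ore.rowApply(IRIS, FUN.NAME = "score.nb.model", dsname = "ds1",
                    rows = 30, parallel = 5,
                    FUN.VALUE = data.frame(SPECIES    = character(0),
                                           PREDICTION = character(0)))
head(res)
```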
So here on the left, we're running that user-defined function, the forecast function, through the SQL API for embedded R execution, using rqEval2, which is the SQL API equivalent of ore.doEval. We set the graphics flag to true because we want to return an image along with the data, and the output format to PNG for the same reason. The script owner is set to NULL because my own user owns the script; I created it and saved it. But if another user had shared a script with me, which you can do through grant options, then I would put the script owner's name here. Then I give the script name, and this environment name is the conda environment where the packages I'm using in my UDF reside; if you're not using third-party packages, you wouldn't need the environment name. The result is both data and a portion of the PNG bytes for the image, which we've truncated here just for the demo.

Then on the right, we're running a UDF using rqTableEval2, the SQL API equivalent of ore.tableApply. This is where we're building the model using the proxy object iris, providing the datastore name and the script name in the parameter list. That function just returned TRUE because it was saving the model to the datastore. We're returning it as XML; you can return images, XML, or structured output, which you'll see in a moment.

Next, we're going to score our naive Bayes model using rqRowEval2, the SQL API equivalent of ore.rowApply. The main difference from table apply is that it runs in parallel, so we set the parallel flag to true and the service level to medium. One thing to know is that Autonomous Database provides parallelism through these service levels: low means no parallelism, and medium and high provide medium and higher levels of parallelism. So we chose the medium service level. I'm specifying structured output; I want a data frame that returns species and prediction. Then I'm scoring 30 rows at a time in parallel, again giving the script name and environment name, and this is the result.

And then, if I wanted to run the same scripts from REST, I could use cURL or another REST client like Postman; there are a few out there. I'm showing cURL here, which you can't run in the notebook. This is just to show you that we have the do-eval endpoint, which is the equivalent of ore.doEval, running the multi.forecast script. You have to pass the environment name for the packages, and you get an image back; your token would have been saved to a variable, and then you call the do-eval endpoint. And then with row apply, we're going to score the naive Bayes model via REST, calling that script from the script repository, and in the parameters list we have the table name, the number of rows we want to score at a time, the datastore name, the parallel flag, and the service level, which is medium. Okay, back to you, Mark.
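As an illustrative sketch only: the exact endpoint paths and JSON field names should be verified against the OML4R REST API documentation, and the host, environment, and script names below are placeholders. The general shape of the REST flow described above is to exchange credentials for a token and then call the embedded-execution endpoint:

```bash
# Exchange OML username/password for an access token (host is a placeholder)
curl -X POST --header 'Content-Type: application/json' \
  -d '{"grant_type":"password","username":"OMLUSER","password":"****"}' \
  "https://<oml-service-host>/omlusers/api/oauth2/v1/token"

# Invoke the multi.forecast script via the do-eval endpoint, passing the conda
# environment name so forecast/ggplot2 are available (field names assumed)
curl -X POST --header "Authorization: Bearer $token" \
  --header 'Content-Type: application/json' \
  -d '{"envName":"myrenv","graphicsFlag":true,"service":"MEDIUM"}' \
  "https://<oml-service-host>/oml/api/r-scripts/v1/do-eval/multi.forecast"
```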
But you can also install those packages on the client R instance, of course, and then use them with embedded R execution from the R and SQL interfaces with Oracle Database, as we've seen. Specifically on Autonomous Database, we support third-party packages as a newer feature, where a user with administrator privileges creates a conda environment with the desired packages and uploads the environment to object storage. We require special privileges to address security in enterprise environments. The available package repositories are from Anaconda, as shown here. Once an environment is created, a user downloads and activates it using the OML Notebooks conda interpreter, OML Notebooks being the built-in notebook environment for Autonomous Database that Sherry was demonstrating. And you can specify the conda environment when invoking embedded R execution from the R, SQL, or REST interfaces.

This brings us to summarizing the broader context in which OML4R exists. We have SQL and Python interfaces as well. The SQL API is the foundation for the in-database algorithms and is leveraged by OML4R. OML4Py has similar functionality to OML4R, but with the addition of automated machine learning, or AutoML. On Autonomous Database, you've seen OML Notebooks, but there's also a no-code AutoML user interface, where the resulting models can be immediately deployed for use with SQL queries or deployed to OML Services for real-time scoring using REST endpoints. And then there's the original OML user interface, Oracle Data Miner, which is a SQL Developer extension.

So in summary, we've seen how you can use R to access and manipulate database data through a variety of connectivity packages, with benefits from using the database as a high performance compute engine. With OML4R, we saw how we can further leverage the database as an HPC environment for data exploration, data preparation, and machine learning modeling. By empowering data scientists and R users with in-database machine learning algorithms from an R API, we gain scalability and performance, in part by eliminating data movement. We can also easily deploy machine learning models and invoke R UDFs with system-provided data parallelism and task parallelism. And by operating in the database, users benefit from RDBMS backup, recovery, and security, so we don't have to address these separately in our applications. Further, the OML4R interface is included with your Autonomous Database instances and Oracle Database licenses. The key point is that if you have your data in, or accessible through, an Oracle database, you can readily take advantage of this functionality.

You can explore OML4R on an Autonomous Database instance from our free tier, using over 35 OML4R-specific notebooks, and these same notebooks are available from our script repository. For more information, check out our OML web page, blog, and GitHub repository. The repository contains not only R examples but also code examples and notebooks for the other OML APIs as well. We also have OML office hours webinars that you're welcome to join, and you can try out this functionality on our free tier and explore workshops through Oracle LiveLabs, where we've just added a new lab for OML4R in the OML Fundamentals workshop. So thanks for joining the session, and if there are any additional questions, we're happy to take them now. One of the questions that came in had to do with Shiny apps for pulling data from the database.
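A rough sketch of the conda-based workflow described at the start of this section, using the OML Notebooks conda interpreter. The environment name and exact command options are assumptions; check the Autonomous Database documentation for the supported syntax:

```text
%conda
# Admin (with the required privilege): create an environment containing R packages
# and upload it to object storage for other users
create -n myrenv r-ggplot2 r-forecast r-e1071
upload myrenv --description 'R packages for the OML4R demo'

%conda
# Any OML user: download and activate the environment in a notebook session
download myrenv
activate myrenv
```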
And I think what we're looking at is how we can do that directly from the database environment. Sherry, did you have any additional comments on that? I was showing the Zeppelin notebook; you can use that with the R interpreter, so you don't actually need a Shiny app for that. There's a URL you log into with your OML username and password, and OML Notebooks is right there; you can just open the notebooks and start using them immediately. Thank you. Any additional questions? All right, well, with that, thank you for joining us, and we look forward to seeing you at the next R database webinar. Take care. Thank you.