Welcome to the Singapore SQL PASS chapter meetup for December, the last one for the year. I am Nagraj Venkatesan, coming up with another session today, on R and SQL Server. This session is probably the first one outside my comfort zone. All along I have been a database administrator dealing with SQL Server installations; this is the first session where I am stepping out of my comfort zone, trying something different, learning something different, and talking about something outside SQL Server after many, many years. And if you are wondering why a database administrator has to read about R: SQL Server 2016, as most of you know, has integrated R and SQL Server, giving you the ability to run R scripts within SQL Server. As I explore R, to be honest, I am kicking myself for not doing this earlier. It is such a fascinating language, and combined with SQL Server it becomes a terrific combo. I really mean it; it is a wonderful combination to have the most popular statistical programming language together with one of the most popular data platforms. So what we are going to see in the next 45 minutes to an hour is, point number one, how R actually integrates with SQL Server. Now, how many database administrators do I have out here like me? Just one? Do I have a room full of data scientists, then? That scares me a lot, because you will be bored to death: I am going to talk about the fundamentals of R and SQL Server. R looked at from a database administrator's angle, that is the theme of the session. Hopefully, by the end of the session, I will have motivated more database administrators to learn R, which I think is the future, the way things are going to move. So, a few database administrators, and the rest developers? How many of you know R in this room? Any R experts, any data scientists? Okay. Developers? Enthusiasts like me? The thing is, I do not know what category you belong to, but that's okay, fine. Very good. Let's get started. So before getting into R: what is data science? This I just copied from Wikipedia; concentrate on the words in red there: extract knowledge from data using statistics, machine learning and so on. In short, data science is about these three words: extract knowledge from data. Okay, just to lighten things up, can somebody tell me the difference between knowledge and data? What you gain as insight? Good, in some sense, yes. What we are actually doing in data science is this: we have a large volume of data, and it is just data until you analyze it and get some useful information which you can apply. If not, it is just data, right? Where R fits into this picture is that it is the programming language you use to do that analysis. For example, let's say I have the last 100 years of rainfall and weather information for Singapore in one huge CSV file or in one database. It is just data. If I use algorithms to predict what the rainfall is going to be for the next month, then it is useful to me, right? R helps you a lot in this job of analyzing the data and making the prediction. Now let's get a little more detailed about the process of data science. First, you have a business problem. Mostly it comes from the CEO or the topmost management.
Let's take up the same weather example again. Imagine the meteorological department. Management comes up to them and says: okay, I want to know how much rainfall we are expecting to get in Singapore next year, so that I can plan my reservoir capacity and so on. When we have this question, it immediately gets passed to the data scientist. Any data scientists in this room? Good. These are expensive, very smart people, you know, mostly with a PhD degree and a very deep understanding of statistics and mathematics. More often than not, barring a few exceptions, they don't deal with IT infrastructure; they are more like very geeky mathematicians of sorts who apply algorithms and do analysis. So the problem goes into the hands of a data scientist. What happens then is that he tries to turn it into a data science problem, just like the thing we were seeing right now. The boss is asking to predict next year's rainfall. So what are the things I need? Can I make it into a data science problem? Certainly yes: if I have the data, I can start doing my analysis. What kind of data do I need? I would probably need the last 10 or 15 years of weather information. What was the temperature? What was the rainfall? The humidity? The pressure? How was the pollution, was there haze, how many trees were there, how many buildings were built? All these kinds of information: lots and lots of data related to weather and rainfall. So the next thing he will do is ask the IT department: I need this data, can you share it with me as a CSV, or whatever format he is comfortable with? Once the DBA, or more likely the DBAs and the developers, pass the data to him, he will start using R or one of the statistical tools to analyze it. What I mean by analysis is that he starts looking at the data to see: okay, these are the parameters most likely to be influencing the rainfall, or whatever it is I would like to predict. What are the factors most likely to influence it? He tries to understand the data. If you are a developer, you can think of it like this: you get a problem from your boss, and before you start coding, you first try to understand the problem and whatever approaches you could take. That's what happens in the exploratory analysis phase. Then comes the most important part: defining a model. What is a model? The model phase is where the data scientist actually picks the algorithm he would like to apply to this problem. He knows that, say, these 30 columns are most likely going to influence my result, and he picks the algorithm which he believes is most likely to give him a solution. For example, he'll feed three-year-old data into that algorithm as test data. With three-year-old data, we already know how much rainfall there was; we have the result already. So you feed the data in to test and evaluate how well it predicts last year's rainfall. You see: okay, last year this was the actual rainfall, my algorithm is saying this, so what is the difference? How accurate is my algorithm, my model?
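To make that evaluation step concrete, here is a minimal sketch in plain R, assuming a hypothetical data frame called weather with rainfall, year and a few predictor columns; all of these names are purely illustrative:

# Split into training data and a held-out test year
train <- subset(weather, year <  2016)
test  <- subset(weather, year == 2016)

# Fit a simple linear model: rainfall as a function of a few candidate predictors
fit <- lm(rainfall ~ temperature + humidity + pressure, data = train)

# Predict the held-out year and compare against the known, already-recorded result
test$predicted <- predict(fit, newdata = test)
mean(abs(test$predicted - test$rainfall))   # average error: how far off is the model?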
So he defines this process and, depending upon the problem, if the model is giving satisfactory results, he goes ahead with that particular model, and then it is rolled out. Again, the DBA, the developers and the other application teams are involved, and finally it's rolled out to production. Once the model is rolled out, in some systems, as the data grows it is also fed back into the analysis, so that it becomes a continuous learning process of sorts: yesterday's rainfall is fed into the system so it can be used to predict tomorrow's or next month's rainfall, something like that. So do we have an understanding of what data science people do? Okay. One important thing: I have marked who the people involved in this are. If you notice, it's data scientists, DBAs and devs. Now, if the DBA or the developer starts learning R, starts understanding R, do you imagine how powerful that person can get? Because he can cut across the entire process, meaning right from the start to the end he has an understanding of what can be done, and over a period of time he can participate in the end-to-end process. Why I personally feel, and this is my opinion, that DBAs should start learning R is this: the way IT and the overall data platform scene is moving, the tasks that are done the same way every time, repetitive tasks like installations, configuring backups, configuring monitoring, are mostly getting automated. What you will have over a period of time is more analysis and data analytics kinds of work. It's not that the traditional skills are not needed, but those alone, five years down the line, may not be enough for you to sustain yourself in the market. So it's extremely important to widen your skills, and picking up a language like R is, in my opinion, something every DBA should start looking at, for a few reasons I will explain as we go through this session. What are the common applications? Imagine a cancer laboratory. Healthcare is one sector where data science plays a massive role. In a cancer laboratory you have a patient's data coming in, and you want to know whether that person has cancer or not. You have all kinds of information about what symptoms he has, and you have R and algorithms which actually look into those symptoms. In very, very layman's terms, for database people, you can think of the symptoms as columns. You actually want to find out: columns A, B, C and D have had values like this for the last 10,000 rows; for the next row, if these are the values, what will column E be? That's one of the things we are trying to do. So imagine a case like cancer, or whether a person is likely to have a heart attack. Let's say you have the data of his age, his weight, his blood pressure, his cholesterol, all this information, and you have the data of people who have had heart attacks. Then you can use these algorithms to predict: there is a 70% chance of you having a heart attack if you go on at this rate for the next two years. In healthcare, usually what they say is that the algorithms need to be a lot more accurate: when you are saying a person does not have cancer, make sure you get it right.
Next, fraudulent transactions, or how stock prices are going to move: you can apply it very well there. And customer behaviour, like in retail markets: whether a person is going to buy the product or not, given his past choices, his nature, the way he spends. In my opinion, name any industry and there is scope to apply these concepts. Even one of the simplest examples can be a spam filter: an email comes in, you want to figure out whether it is spam or not. Again, it's a bunch of parameters, and you want to say yes or no. So far, do we understand what data science is all about? Okay. What is R? I have already explained a few times: it is an open source language, and that's one of the key things about it. There are, as they say, some 2.5 million users, and it has been around a long time; its roots go back to the S language of the 1970s. What Microsoft R is all about is that Microsoft adapted open source R, to be specific a particular version, 3.3.2. Microsoft R is also open, meaning you can add your own packages to it. They adapted that particular version and made some really good enhancements. One of the pain points... actually, let me ask the gentleman over there who works with R: what are the pain points you would commonly face using traditional R? Slow? Okay, good. One of the major pain points in traditional R is the computation. I keep speaking about algorithms, right? For an algorithm to run, it runs on the R client, just like our Management Studio runs on a client machine. What happens when it runs on the client tool? You are limited to whatever resources you have on your client, and more importantly, R does its compute entirely in memory. So if your client machine has a limited amount of memory, and I have seen many developers complain about this, they are unable to compute simply because of the lack of resources on that particular client machine. The first and foremost thing about Microsoft R is that you can leverage the server's resources: the compute, meaning the work of running the algorithm, happens on the server, just like when we run queries, so you take advantage of the massive resources of a big database server. Next, and again, you defeat the memory limits which are traditionally there with the R client. If we are using SQL Server as the data source and we put our R service and the SQL Server on the same database server, then fetching the data from the database into R involves no network traffic; it's a direct, straight transfer, which is again a massive gain compared with traditional R. And then parallelism: traditional R, running from the client, is single threaded, whereas when you put it on the server, like SQL Server, it has the ability to use parallel processes to do the compute. Another pain point, explained to me by a few developer friends who actually use R, is porting the application over to production. With traditional R it is not that easy to deploy the solution; more often than not you have to convert the R code into some other programming language to actually turn it into running production code. Some people say doing that can take months.
This, again, can be avoided now that you have the ability to place your R code within SQL Server stored procedures, which makes it a much, much easier task to put your code inside stored procs instead of developing another layer to deploy it to production. Yes, sure, shoot. Of course. The question is: Microsoft took a certain version of R and kept it open source, so how does it evolve? With patches, actually; it does get upgraded. Beyond that, if you want to add your own packages, you can obviously plug them in as well. I won't be covering how a custom-made package can be deployed, but it's not much of a task; you can always port a particular package over and push it in. Previously there was a company called Revolution Analytics, which had developed something called ScaleR. Microsoft took over that company; their packages have been adopted and further improved. What effectively happens is that on the database server, once you install R Services, you get two services: one is the R service, which is called the Launchpad, and then you have your usual age-old SQL Server service as well. When you install Microsoft R, you also get something called the Microsoft R Client, which is like a console where you can directly run R commands. Even on that client, you can set the compute context to the server, meaning you can say: run all my R commands inside the server, not on my client (I will show a small sketch of this in a moment). Once you set that and run some commands, the request is received by the R service, the data is fetched from the DB, and the result is thrown back to the end user. This screen looks familiar: it is the installation screen. What's different from a typical SQL Server installation? Too small, is it? Oh, yeah. First, let me show this screenshot. So you see "R Services (In-Database)": once you check this component, it gives you the ability to call R commands from your SQL stored procedures. One thing: even if you say, okay, I don't want to learn R as a database administrator, you should at least know this part, because moving forward, probably three years down the line, you may be asked to deploy R code onto your DB server, so at least you should know how to install R on a database server. There is nothing fancy about setting up R; it's pretty simple. So you need to enable R Services, that's one. There is this other component called R Server (Standalone), which is not the focus of this session; it is mainly for when you decide to run just Microsoft R Server alone, without the database service. And when you pick R Services, you get the SQL Server Launchpad, which is responsible for the communication between the database server and R; you can think of it as the R service which does the job behind the scenes. Another screen is regarding the license, the GNU license: because this is based on open source components, you have to accept the GNU license to use this component.
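Here is that compute-context idea as a minimal sketch in R, as you would run it from the Microsoft R Client; the server name, database and share directory are hypothetical placeholders:

library(RevoScaleR)

# Connection to the SQL Server that should do the heavy lifting
connStr <- "Driver=SQL Server;Server=MYSERVER;Database=MyDB;Trusted_Connection=True"

# Define a compute context that pushes execution into SQL Server
cc <- RxInSqlServer(connectionString = connStr, shareDir = "C:/temp/rshare")

# From here on, rx* functions run on the server, not on this client machine
rxSetComputeContext(cc)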
One more point is about installation: by default, it goes and downloads the R packages from the internet. So if you are on a server which does not have internet access and you need to install R there, you have to download the packages offline, store them in a certain location, and use that location for the installation. There is enough material out there on how to do an offline installation. Okay, the focus of this session, again, is to show how you can start calling R commands from SQL Server stored procedures. Me being a DBA, after I learned these concepts, just a little bit of them, I had enough ideas to implement this in my day-to-day job. I maintain about 120 database servers, and I am so fascinated by this language that I want to apply it as a database administrator; I have enough use cases where I feel I can do a better job in my office itself using R. A couple of scenarios I will show today as demos. My intention behind this session is hopefully to motivate some of you to think along those lines and use it innovatively on whatever projects you are working on. So, how does it work? Let's say you have a stored proc and you want to run an R command along with a SQL query. You make the call using something called sp_execute_external_script, which you can think of as similar to our xp_cmdshell; that is the procedure used to run R commands. The call goes to the R service, launchpad.exe, which internally spawns something called satellite processes to run the R code and return the results to the DB server. For each SQL account that runs R, the satellite processes run under a different Windows user account, so that you have a good layer of security when you are calling out to a Windows process. Meaning: if account A executes an R command, internally his commands run under one Windows account; account B will get a different account. It is not one single account running all the satellite processes. Okay, so, just like xp_cmdshell, you need to enable external script execution; I have already enabled it on my machine. And for non-sysadmin accounts, you need to grant the permission separately, again similar to what you do for xp_cmdshell. This is a very, very simple, basic Hello World call. Programmers will know: when you start with a new language, the first thing you do is try to print Hello World, and this is something like that. What it says is: the language is R, the input data set is the input query, the output data set is what we plan to print, and the result set here describes the result we are going to return. Just a simple Hello World, written out as code below.
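For reference, enabling the feature and the Hello World call look roughly like this (a minimal sketch; the shape of the call is what matters):

-- One-time setup: allow external scripts, similar to enabling xp_cmdshell
EXEC sp_configure 'external scripts enabled', 1;
RECONFIGURE WITH OVERRIDE;

-- Hello World: the R script simply passes the input query through as its output
EXEC sp_execute_external_script
    @language     = N'R',
    @script       = N'OutputDataSet <- InputDataSet;',
    @input_data_1 = N'SELECT ''Hello World'' AS greeting'
WITH RESULT SETS ((greeting NVARCHAR(20)));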
We will be looking at slightly more complex R scripts in the subsequent slides; I would not call them complex from a data scientist's view, but for someone new, slightly more complex. Before we move on, any questions so far? Yes, please. Okay, shall we get into a little bit of R? The first algorithm I would like to take up is probably one of the simplest algorithms you would actually use in R, which is called logistic regression. What logistic regression does is this: let's say we have a certain bunch of columns and we want to predict the value of another column, but it allows you to predict only zero or one. Usually you use it for pass-or-fail cases. One of the classic examples on the R site is the number of people who survived the Titanic: of the people who travelled on the Titanic, how many survived? You can use logistic regression to find out, okay, first class passengers survived, second class passengers didn't survive, first class males survived, and so on. Again, you can use it for credit card fraud, because there we are just predicting whether a transaction is fraudulent or genuine. Let's say you have the data of where the transaction is coming from, how frequent the transactions are, the transaction amount, the usual transaction value, the account limit, all this kind of information; then a new transaction comes in and you just want to figure out whether it is going to be fraudulent or not, yes or no. Those kinds of problems can be addressed with this algorithm. Okay, now let's get into the slightly more familiar territory of a typical DBA table, and see how this algorithm can be used in a DBA's day-to-day life. How do you monitor your database server? How do you track its workload? One common method is to have tables where you record, at a certain frequency, the queries that run against your database server, along with how much CPU each took, how much memory, what the wait type was, how long it waited, whether there was any blocking, where it came from, all that kind of information. In the case of a performance problem, what would a DBA like to figure out? Whether a query is expensive or not, right? Let's see whether we can apply R to this. Excuse me, I didn't select something. Okay, let's say I have this table. I have simplified it for a demo of this size, so we have just a few selected columns: CPU time, duration in minutes, logical reads, and the query text. And from past experience I have labelled the queries myself. This is the training table: based on my past experience, I have marked the queries that actually caused a problem on my server, in a column called expensive_query. If the value is 1, the query is expensive; if the value is 0, it is not. So here, because I know the data fairly well, I go straight into the model development phase of the data science process. I am going to build a model using logistic regression, build it on this training data, and see how helpful it is for deciding whether a query is really expensive or not. So this is the way you pass serious queries into the R script and use them with R. First you define the query as a string, and then comes rxLogit, the Revolution ScaleR function for logistic regression. Here you specify which column you are going to predict, and here I specify the factors that influence it, because we are DBAs and we obviously know that CPU, memory, reads and the like are the things that determine whether my query is expensive, right? So I pass those parameters. Data scientists, don't laugh: this is a very, very simple model that I am showing, and yes, me being a beginner in R, I know only simple models.
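Pieced together, the model-building step looks roughly like this. It is a sketch: the table and column names (dbo.QueryWorkload, dbo.QueryModels, expensive_query and so on) are hypothetical stand-ins for the demo's actual objects:

DECLARE @model VARBINARY(MAX);

EXEC sp_execute_external_script
    @language     = N'R',
    @script       = N'
        # Logistic regression: expensive_query (0/1) as a function of the cost columns
        logitModel <- rxLogit(expensive_query ~ cpu_time + duration_min + logical_reads,
                              data = InputDataSet)
        # Serialize the model so it can be handed back as varbinary(max)
        model <- serialize(logitModel, connection = NULL)',
    @input_data_1 = N'SELECT cpu_time, duration_min, logical_reads, expensive_query
                      FROM dbo.QueryWorkload',
    @params       = N'@model VARBINARY(MAX) OUTPUT',
    @model        = @model OUTPUT;

-- Keep the trained model in a table so it can be reused for predictions later
INSERT INTO dbo.QueryModels (model_name, model)
VALUES (N'expensive_query_logit', @model);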
What I am doing out here is storing the model inside a table: I build the model and store it in a table, so that I can use that particular value later to do my predictions. Basically, the trained model goes inside the table; the data type just needs to be varbinary(max), that's it. Okay, so I have created the model. Now let's go and actually do some prediction. Let's say I want to look at which queries were expensive over these few days, from July 16th to 19th, 2015, and I would like to leave out backup queries and sleeping sessions, so that I only look at the serious workload. Again, we define the query, and then we pick up the model that we built just now. Unserializing is simply opening the stored model back up; you can ignore the details. The most important point is rxPredict, where we say: this is the input, this is the score, and predVarNames names the column that will carry the probability of the value being 1 or 0. It gives me a probability saying whether the query is going to be expensive or not. Here I pass the input query, and I collect the output in this variable. Let's go ahead and run this. Okay, it says it's done. By the way, I'm running on a very, very small laptop with 8 GB of RAM, not meant for these kinds of calculations, so it takes a while. Now, let me first run this part of the query. I have stored the result in a temporary table so that it is easier for us to query and see what came back. The important point is that I asked it to return the column called score; this is the output of the algorithm, and I am also pulling the input columns so that we can compare our results. Okay, now let's run this quickly. I have multiplied the last column by 100 so that it reads like a percentage. So whatever is close to 100 is an expensive query. Let's look at the ones above 90, meaning the algorithm is telling me: I am 90% or more sure that this particular query is indeed expensive. Let's see whether, as database administrators, just by looking at the query, we would agree. So these are the queries it says are expensive. Look at the reads out here, which are significant, and especially look at the ones scored at 100%: the reads are significantly high, the CPU time is high, the time taken is one or two minutes, and so on. And look how little time it took me to come up with something with a certain amount of intelligence in it, instead of randomly eyeballing columns and trying to pick something: it hardly took me about 10 minutes to build this, plus a little effort to learn what the terms mean and what the weird errors are. Having a friend who understands R fairly well makes it a little easier, but it is not that hard. When I showed this to my developer friends, they equally loved the idea, because usually they have their data as text files, and if they want to change something there, it is painful. Here, if I don't like this model, I go ahead, delete that row and plug another algorithm in; if I want to change my training data set, I modify my query. In 10 minutes I can make changes.
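The prediction step, in the same hypothetical schema as before (collected_at is an assumed timestamp column), looks roughly like this:

DECLARE @model VARBINARY(MAX) =
    (SELECT model FROM dbo.QueryModels WHERE model_name = N'expensive_query_logit');

EXEC sp_execute_external_script
    @language     = N'R',
    @script       = N'
        # Reopen the stored model and score the incoming rows
        logitModel <- unserialize(as.raw(model))
        scores <- rxPredict(logitModel, data = InputDataSet, predVarNames = "score")
        # Return the input columns alongside the probability for easy comparison
        OutputDataSet <- cbind(InputDataSet, scores)',
    @input_data_1 = N'SELECT cpu_time, duration_min, logical_reads
                      FROM dbo.QueryWorkload
                      WHERE collected_at >= ''20150716'' AND collected_at < ''20150720''',
    @params       = N'@model VARBINARY(MAX)',
    @model        = @model
WITH RESULT SETS ((cpu_time INT, duration_min FLOAT, logical_reads BIGINT, score FLOAT));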
Imagine using a CSV and doing those things: it makes it a lot more painful to build something like this. Any comments? Basically, I am interested in learning. I have talked to two or three people about this, and what they told me as well is: yes, it is harder to meddle with CSVs than to play around with these columns; it is much, much easier to do these kinds of things here. That is one part of the story. The other part of the story: if you are a regular at this meetup, I myself have given sessions on all the other fascinating things in SQL Server 2016. What that means is that my query can take advantage of SQL Server 2016's powers, and my R takes advantage of Microsoft R's powers; together it becomes a superb combo of sorts. Imagine reading CSV files, imagine loading billions of rows as CSVs and then trying to do the processing. Instead, you have a clustered index, or a clustered columnstore, fetching the data for you from the database, which makes the process several times faster, not just one or two times; in my opinion it can make the process 30 to 40 times faster. And when you have the data as tables in these formats, you can make it a continuously growing system; it is so much easier to have your real-time data flowing into your training model, predict, pass the results back into the training set, and so on. Doing these manipulations becomes a lot easier if you know how to query. At the same time, I would also like to say that building simple models is fairly easy, but to build really serious models you need a very good understanding of R and of statistics. In my opinion, and from whatever I have heard from the data scientists I have met, it is not something you can just go and pick up in six months and declare, I am a data scientist. No way: there are people who hold PhD degrees and have spent so much time on statistics and all these things to get that skill. But if you start exploring this, you get a mix of both worlds, which is very competitive. Probably you spend about a year or so on R; there are interesting courses on Coursera. Have you heard of Coursera? Yes? So you can take up one or two courses there and start learning a bit of statistics and algorithms, not to compete with the data scientists, but because if you invest a certain amount of your time, it can be a terrific skill for you; at the very least, you will know a few algorithms which you can use in your office to do things much more efficiently. Any comments or questions? Yes, please. How many from the R user group are out here? Anyone? You are the only one, and I have already asked you for many comments. Your point is that, as a base, it is more important that you actually understand the algorithms, and that this is only a tool in some sense? Yes, in a way: it is important to understand what the algorithm is actually doing, because this is just a tool, just like C++; which algorithm you want to apply is something you need a certain grounding for. There is no point in just using the tool blindly, absolutely. And you are asking: when developing the model, how in the first place do you verify that a model is a valid model, since it is garbage in, garbage out? Yes, absolutely; that is the point I was driving at previously as well. For what I am showing, you have to go back to the first or second slide, where I was explaining the data science process.
You evaluate what you have actually done: you take the data and evaluate, over and over again, whether the algorithm and your overall system are able to predict. And yes, the ROC curve: you need that evaluation step when you define the model. I mentioned it at the bottom of the slide: you have to test what you trained, using an ROC curve, to find out how well it does. Absolutely, there is no denying the fact that there is lots and lots of iteration, and a lot of time is spent in those two areas, evaluating the model many more times to figure out whether it is working or not. But as I said, you should start getting yourself exposed to this; there is no point in saying it is too complex for me to learn and never taking the dive. If you start building the understanding, you gain the ability to participate in the entire life cycle of the data science process. Yes, of course; as I stated in the expectations for this session, it is more from the DBA's angle, and obviously you cannot wrap the entire data science process into one single slide. I am giving a 12,000-foot overview of the whole thing; the target audience is not data science experts, it is more for people getting started with the concept. You are asking about running this from SQL Server versus from Visual Studio or RStudio? Fundamentally there is not much difference: as long as you set the compute context, even from Visual Studio you can have the execution run on the server. As far as developing a model or playing around with the data goes, there is not much of a difference, but doing it from Management Studio you get the power to put the entire code into a single stored proc and deploy it much more easily; that's about it. Yes? You are saying: we are supporting a few of our algorithm people and analysts, and they run from Visual Studio or from RStudio, by their own preference. To get the data for their testing while developing, they use the RODBC library, so they create that connection inside the R script. And you found that you can send only one input data set as a parameter to this external script, which means you have to use that connection inside the R script for anything more. So does it mean that, when we run the external script from a procedure as I showed, if we use that connection back to SQL Server, will it still get the advantage of the input data set, will it be optimized? To repeat the answer: as long as you set the compute context, there is something in ScaleR where you say that the execution has to happen on the server, on the R side; as long as you set that, it does not matter where you call from, the compute runs there (a sketch of one alternative to RODBC follows below). One more question, on resource allocation: which one should I give priority to, since you are running both the DB instance and the R services on one box? Give me one minute; I will come to that shortly.
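On that single-input-data-set question: one pattern (my own suggestion, not something shown in the demo) is to pull any additional data inside the R script through a ScaleR data source rather than RODBC, so that it works naturally with the compute context. The connection string and table name here are hypothetical:

library(RevoScaleR)

connStr <- "Driver=SQL Server;Server=MYSERVER;Database=MyDB;Trusted_Connection=True"

# A second data set, fetched inside the R script via a ScaleR data source
extraData <- RxSqlServerData(connectionString = connStr,
                             sqlQuery = "SELECT * FROM dbo.ReferenceTable")

# rx* functions accept the data source directly; for example, summarize it
rxSummary(~ ., data = extraData)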
Let's go for the next demo, called linear regression. What this algorithm does is similar to the last one: it also predicts based on certain columns, but here, beyond zero or one, you can predict the actual value you want to figure out. Another use of this algorithm is that you can also see how strong the relationship is between these columns and the column you plan to predict: does it have a very strong relationship, does it influence it much or not? A typical use case is advertisements, or the relationship of an advertisement campaign to sales: after the advertisement, do you think sales improved, does the campaign have any relationship to the way sales are progressing, and so on. What we can do over here, again more from a DBA angle of sorts, is server and database growth: I want to predict how much my database will grow after a certain period. Let's see how we can do that. I have this table called tb_growth, which has the size, server name, database name, date of growth and so on; it tells me 312 GB, this is the growth, etc. Now, like last time, we are going to build a model, train it, and then, as usual, try to come up with an estimate. Here I am passing one month of data, January 2016, where the size is recorded once per day. To quickly go through it: rxLinMod is the function I am using this time, and again, similar to last time, size, server name, database name and date are the columns which I think influence the result and need to be taken into account (the demo pieced together appears below). I actually have to perform a similar task in my day-to-day office work, where I have to do capacity planning; I think most database administrators have to submit a projection for next year. I take an extremely naive, simple approach of just taking the last six-month average and projecting it out. You would be surprised: there are many typical monitoring tools which take the last three-month average and project 30-day, 60-day and 90-day growth from it. Instead, applying something more sophisticated like this gives you much more realistic numbers than a blind projection from a 30-day or 60-day average. So over here I say which month I want to evaluate, February 2016, and I pick a date that has already passed, so that we get to compare. The syntax is pretty similar: I am using the same rxPredict function to predict; that particular function is used for prediction across multiple algorithms. It gives me the predicted size, and because these are old months, we can compare the existing size and the predicted size. It is slightly off the mark, around 10 to 20 percent off, but still certainly better than what I am currently doing with a three-month or six-month average, which is just a blind guess of sorts; this one takes into account how the numbers actually vary and then applies that. Yes, of course, as I was explaining earlier, a little bit of effort is needed to understand how to use it, but as a database administrator you can think of many use cases where you can apply this: performance monitoring, capacity planning. You can use it to analyze query regressions, which queries started performing poorly; in SQL Server 2016 you have the Query Store, which gives you reports on query regressions, and even if you have an older version of SQL Server, you can port your query performance data over to a SQL Server 2016 instance and use these algorithms to quickly figure out which queries regressed. Error log and audit log abnormalities too: you can even use this to analyze abnormal entries. For us database administrators, when you have a massive volume of error log to analyze, it becomes a lot harder to do it manually; you can use this for that as well.
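Here is the growth demo pieced together as a sketch. The tb_growth column names (growth_date, size_gb, db_name) are my guesses at the demo's schema, and the date maths treats the date as a simple numeric trend:

-- Train a linear model on January 2016 growth data and project February 2016
EXEC sp_execute_external_script
    @language     = N'R',
    @script       = N'
        # Treat the date as a numeric trend variable for the regression
        InputDataSet$day_num <- as.numeric(as.Date(InputDataSet$growth_date))
        linModel <- rxLinMod(size_gb ~ day_num, data = InputDataSet)

        # Build the dates we want an estimate for, then score them
        future <- data.frame(day_num = as.numeric(seq(as.Date("2016-02-01"),
                                                      as.Date("2016-02-28"), by = "day")))
        pred <- rxPredict(linModel, data = future, predVarNames = "predicted_gb")
        OutputDataSet <- cbind(future, pred)',
    @input_data_1 = N'SELECT growth_date, size_gb
                      FROM dbo.tb_growth
                      WHERE growth_date >= ''20160101'' AND growth_date < ''20160201''
                        AND [db_name] = N''MyDatabase'''
WITH RESULT SETS ((day_num FLOAT, predicted_gb FLOAT));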
Another interesting use case: somebody at this year's PASS conference, one Joe Sack, actually explained an interesting one: when a DBA goes on holiday, what are the chances of a problem occurring on the database server? Interesting, right? So if you are creative, you can do wonders with this. Once you start getting yourself exposed to it, you start doing many, many things: you can analyze Query Store data and performance data, do capacity planning, error log analysis, lots and lots of wonderful things. You are asking about a GUI mode? As of now, you cut and paste your R commands right into the stored proc, so there are no graphs there. But you can obviously look at graphs and histograms: in the previous session, where the other gentleman was giving an introduction to R, he ran R commands on their own to get graphs and histograms, on the R console or the R client. When you are plugging it into SQL Server, this is the way you do it. Okay, and as you asked earlier: how do you actually control the resources? One of the key things about R, again, is that it is a memory-hungry technology, so a key question is how you size your memory for R. For launchpad.exe, there is a config file under the installation folder where you can set the memory percentage; by default it is set to just 20% of the physical memory available on the server. You need to size it carefully: allocate enough for your SQL Server, and adjust this cap so that you have enough for both things to run. Another interesting enhancement is in Resource Governor: you can use something called an external pool, with which you can cap the memory allocated outside the SQL Server instance. You can use that to control how much memory each user account gets for R outside SQL Server as well: Launchpad controls it at the R service level, and Resource Governor lets you do it even at a user level (I will show a small sketch of the Resource Governor piece below). R is suited to large servers, because you want to do heavy-duty processing, you want to process large volumes of data, and you want to use columnstore and all these concepts at the database level: so bigger RAM, faster machines, and carefully sized memory. As I said during the initial slides, once you install it you get a group of R user accounts; grant your permissions wisely, do not over-grant, and do not provide any additional permissions that are not needed for R execution. The number of distinct user sessions for R is limited to 20 by default; you can configure that, but leave it unless you really need to. By distinct sessions I mean the number of distinct SQL accounts which can execute R sessions in parallel is limited to 20. And set the Launchpad service to start automatically, similar to the Agent service: Launchpad is dependent on SQL Server, and when SQL Server is stopped, Launchpad can also stop, so you would like to set it to automatic. Those are my references. Any questions? Yes, this is based on the Revolution Analytics packages, as I mentioned before.
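The Resource Governor piece mentioned above, as a minimal sketch; the pool name and the percentage caps are arbitrary examples:

-- Cap resources used by external (R) processes outside the SQL Server instance
CREATE EXTERNAL RESOURCE POOL rPool
WITH (MAX_CPU_PERCENT = 50, MAX_MEMORY_PERCENT = 40);

-- Route sessions through a workload group that uses the external pool
CREATE WORKLOAD GROUP rGroup
WITH (IMPORTANCE = MEDIUM)
USING "default", EXTERNAL "rPool";

ALTER RESOURCE GOVERNOR RECONFIGURE;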
Before more questions, I actually want to narrate a little story. One of my key references for this presentation was a free e-book released by Microsoft on data science with SQL Server; the author is Wee Hyong Tok. Last August I attended the SQL Server Geeks conference, where one of my Microsoft friends introduced me to this gentleman, Wee Hyong Tok. He is a very big expert in data science. He came up to me and we spoke two lines: do you use data science? No, I don't have much exposure. Do you use this? No, I just maintain servers. Now, when I read this book: oh my god, I was talking to someone so fascinating, the guy who wrote this book. I have been reading this book for the last two months; out of it I understand about 30% of what he has written, and out of that I am presenting 10%. It's like you meet Tom Cruise without having seen a single English film, he is an actor in Hollywood, and then you go and watch a big movie: oh, this is the great data scientist whom I met; why didn't I read this book earlier, so that I could have asked him some interesting questions? That was the feeling I was going through; I was kicking myself. I met this gentleman who wrote this book, he holds a PhD from NUS, but I didn't get to have an interesting conversation even though I met him at the time. So start learning, start exploring. It is not 900 pages; it is just 70 to 80 pages, probably, once you cut off the foreword and all those things. Comparing this with Analysis Services? I would still probably vote for R, simply because Analysis Services involves a lot of processing: you have to build your cubes, do the design, and so on, whereas this one is a lot lighter. Actually, no: each one has its own pros and cons. Analysis Services is targeted at a different kind of workload, more on classification and so on, where this one is more on prediction and what you want to forecast. Any more questions, folks? Sorry if you were super bored by this presentation, which was more on the basics; I hope you understood the focus of it. Yes, of course, we have just recently seen a case of real-life statistics getting it wrong. Absolutely, it does keep going wrong, but statistics is in demand: you want to study everything, you want to predict everything. It is just like astrology: you know it is going to fail, but you still flick open the paper to see what your zodiac sign says. It is not going to be true all the time, but at least learn how to use it. Anything else, folks? Yes: if your workload is more of a non-SQL workload, say you just have to do some text-based processing and you don't have much database work, you can install R Server standalone as well; if you have a SQL Server Enterprise license, you are allowed to install R Server separately, and you can have your SQL Server separately. On Standard Edition, basic analytics is allowed, meaning there are certain limitations in terms of parallelism, but the fundamental packages you can still use on Standard Edition as well. Yes, of course, shoot: does it only work with R? There are different versions of R, right: there is the R from CRAN, and this is specifically Microsoft R. What we are talking about here is strictly Microsoft R, the version that has been developed based on R 3.3.2. You can certainly import your CRAN package into this one, but there are some fundamental differences. As I said, there are certain packages which are built in, and you can port your own package in, that's one. And the way you use the functions differs: over here you see the Rx functions, rxPredict, rxLinMod and so on, whereas in plain R you would just see something like lm and a plain predict; these are the Revolution Analytics versions of sorts. If that's all, thanks a lot, people. Any further questions I will take up offline. Thanks.