And he's going to be talking about Postgres and then an extension to it, PL/R, right? Correct. Okay, cool. So he's been on the board of the Postgres organization. He's done a lot for Postgres, and he's also worked on the infrastructure part of it, so I did not do him justice, but he can tell you more about it himself. All right, thanks. I'm also the VP of engineering for a company called Crunchy Data, mostly US and North America focused, but very heavily into security and Postgres, containers and Postgres, and Cloud Foundry. In any case, I've been using Postgres now for about 20 years, contributing actively since around early 2000, and I became a committer in 2003. So there's a fair number of features in Postgres; if you use Postgres, you probably use features that I've written. I wrote the ability to have a function return rows and columns instead of a single value. They're called set-returning functions. I wrote that, and a number of features that are closely associated with it. The one that is almost a trivial feature, but is probably used more than anything else that I wrote, is something called generate_series. If you've ever used generate_series, it shows up all the time. That's something I wrote. Anyway, I'm here to talk, as the last two speakers mentioned, about how Postgres is very extensible, and one of the benefits of that is we can combine Postgres with other pieces of software, in particular if they're open source. So what I'm going to be talking about here is PL/R, which is a procedural language handler that connects Postgres to R. How many people here, assuming you're showing up, at least have some idea what R is? A few people? Well, Microsoft folks should know it, right? This was from school; I used R in my statistics class.
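As a quick illustration of the set-returning-function idea, here's generate_series yielding rows rather than a single value (the dates are just made up for illustration):

```sql
-- generate_series is itself a set-returning function:
-- it produces a column of rows, not one scalar.
SELECT * FROM generate_series(1, 5);        -- five rows: 1 through 5

-- It also works over timestamps with a step interval:
SELECT g.day::date
FROM generate_series('2024-01-01'::timestamp,
                     '2024-01-07'::timestamp,
                     '1 day') AS g(day);    -- one row per day
```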
Yeah, well, Microsoft actually bought Revolution Analytics, which was a commercial company built on top of R, and now R is integrated with SQL Server. But I did it first, a long time earlier, in 2003. So ever since 2003, Postgres has had a connector to R. It's kind of a challenge to do this talk in 25 minutes, although there's nobody in this room after me, so I could theoretically go long, but I'll still go through it quickly. This is CRAN, basically the site for the R project that hosts all of the packages. The coolest thing about R is, if you notice right here, there are over 12,000 extensions to R. So basically any kind of analytics you can envision is available in R, and in many cases the most recently developed, cutting-edge analytics are implemented in R before they're implemented anywhere else, to the point where even SAS, the large commercial analytics engine that's been around forever, since the mainframe days, has an extender that will connect to R for that reason. Oracle has an extender these days that will connect to R, and Microsoft SQL Server does too. R is just a phenomenal language. On this page there's a link, and I tried to load this ahead of time so it wouldn't have to depend on the internet, to Task Views. With 12,000 packages it's kind of daunting to figure out which ones you want to use. The Task Views page is actually pretty cool because it organizes these packages by topic. For instance, if you're interested in machine learning, you can drill into that right there and you get a whole page that describes all the various R packages related to machine learning. So that's a bit of an aside on R, but R is extremely powerful, and that's why I created PL/R in the first place. Another aside here, if you're not familiar with it: postgresql.org is the main Postgres website.
The documentation for Postgres is right there in the middle. No internet connection, see, that's why I cached it. The documentation for Postgres is awesome; if you actually render it as a PDF, it's like 4,000 pages, with lots and lots of detail on how to run Postgres. Unfortunately, if you don't know the term you're looking for, sometimes finding what you need in there can be difficult, but we can talk about that separately. The other thing is that PL/R itself is on GitHub; postgres-plr/plr is the repository. It moved there a while ago. It used to be under my own account, and it's now in a more generic organization account. Also, on my website, joeconway.com, there's a full set of documentation. The documentation does come with the source code in SGML form, so you can build your own PDF or your own HTML if you want, but it's also available there. We do compile Windows DLLs whenever we have a new release of PL/R. Building PL/R for Windows is a major challenge; maybe you guys could help me with that someday. It's a major pain in the rear. But if you need Windows DLLs, you'll find them in the releases on GitHub. We build them each time. So now back to the actual talk. The first question is: why would you want to do this? Why would you want to do your analytics in R inside of the Postgres database? One of the first things I always come to is that you want to leverage people's knowledge and skills. What do I mean by that? Statistics and math is a very distinct specialty from web development. You really want your analytics to be developed by your statistics and math people, and the language that many of them use is R. So I'm going to show you a few slides; I've got a very limited set of features I'm going to highlight. One of them is the ability to write your analytics in pure R, pulling from a Postgres database, and then cut and paste that into a PL/R function. It also lets you leverage the hardware.
This laptop, although people always give me a hard time about it being so big, has 32 gigs of RAM. It's pretty powerful. But I might have a server that's got 48 cores and a terabyte of RAM, so you may want to do your analytics on a machine like that rather than on some kind of workstation. The other thing is that you're doing the analytics closer to the data and only sending back the result. One of the things about PL/R, since it runs inside of Postgres: to step back for a moment, when a client connects to Postgres, Postgres forks a new process, called a backend process. It's a dedicated process that handles that connection. With PL/R, the R library is running right inside that same process and has direct access, through Postgres's internal APIs, to the data. So you're pulling the data directly out of the database, doing the analytics, and you may crunch a billion rows of data and produce the answer, which is the number 42. Do you want to send a billion rows across to a workstation to calculate 42, or do you just want to send the 42? It's much better to just send the 42. How many people here understand the reference to 42? Okay, good. Just wanted to make sure. The other thing is consistency of analysis. On the one hand, your statistics people, your math people, may be doing analysis of the data, but eventually you want them to say: all right, this is the set of calculations I want to make on the data. You don't want Analyst A doing it a little differently from Analyst B, who's doing it a little differently from Analyst C, and potentially producing different results.
What you really want is your analysts to say, this is the analysis we want to do; you've got QA people and other folks vetting that it produces the right result; and then you can create a function on your Postgres database server, and now it's kind of locked away where people can't mess with it. It just does its job the same way every time. There's also abstraction of complexity. Some piece of analytics might be 200 lines of code, but I'm going to give it a function name, and now my web developer can just say SELECT * FROM function_name with some arguments, rather than staring at 200 lines of code. And as I said earlier, the ecosystem around R is just phenomenal, so this gives you direct access from inside of Postgres to all of that power. In terms of cons, there are different perspectives on what counts as a con. If you're a Postgres user and you can do the equivalent thing in native Postgres, say an average or a standard deviation, for which Postgres comes with built-in aggregates, the native Postgres version is going to run faster than your PL/R version, even though you could do it in PL/R. So don't do that. Really, the idea here is that PL/R lets you do things that you couldn't otherwise do in Postgres. And if your database person is going to be doing the writing, then it's a new language for them, but as I mentioned earlier, ideally you have someone who's already familiar with the language writing the analytics. From the standpoint of an R user, it's more challenging to debug your code once it's been created as a PL/R function, just because it's an environment they're not used to.
The other thing I didn't really mention, and I probably should add it to this slide at some point: because the R interpreter itself does not have a quote-unquote safe mode, meaning it is able to access files in the file system and whatnot, the PL/R language is what's called, in Postgres parlance, an untrusted language. What that means is that only a superuser can create a PL/R function. Now, the superuser creates the function and can grant execute access to someone who's not a superuser. So you don't have to worry about anyone using the functions having superuser rights, but the person creating the function is a superuser. That's another downside. And you probably don't have your statistics person holding a superuser login to your database, right? So basically the workflow should be: they debug their code client side, then cut and paste it into a PL/R function. This slide shows you, if you're familiar with R, what it looks like to create a function in R. You assign to a function name this keyword, function, with the arguments, and then there's a function body. In Postgres, if you're familiar with creating PL/pgSQL functions or even plain SQL functions, you probably recognize this syntax: CREATE OR REPLACE FUNCTION with a name. You can have either just argument types or named arguments, and there are other options; you can have defaults and things like that. But what it all boils down to is that this function body here is exactly the same as the function body there. So here's a more concrete example. The very first thing you have to do, if you're going to use PL/R, is make sure the binaries are installed on the operating system. There are PL/R packages available from Debian, and there are PL/R RPMs available from the PostgreSQL Development Group YUM repository.
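To make that parallel concrete, here's a minimal sketch; the function name and body are made up for illustration, not from the slides. The point is that the R body moves between the two forms unchanged:

```sql
-- Client-side R version:
--   overpaid <- function(sal) {
--     sal > 100000
--   }
--
-- The identical body wrapped as a PL/R function. With a named
-- argument, PL/R exposes it to the R code under the same name.
CREATE OR REPLACE FUNCTION overpaid(sal float8) RETURNS boolean AS $$
  sal > 100000
$$ LANGUAGE plr;
```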
I don't think Red Hat actually ships them, but the PostgreSQL Development Group does. So you can install pre-built binaries, or you can build it from source. Once that's done, you need to create the extension inside the database that you want to use it in. Then again, this is exactly the same as what we had before: CREATE OR REPLACE FUNCTION. I don't know if you're familiar with Postgres syntax, but this OUT on the arguments basically means that instead of supplying arguments into the function, these are the arguments I want to come out of the function. In this case the body is some R that says: give me the letters a, b, and c and the numbers one, two, three, and create a data frame, which is the equivalent in R of a table. And when you return that from PL/R, using a return type of SETOF RECORD, PL/R effectively converts it into a virtual table. So now when I select from this function, I get two columns and three rows with that data. That's the quick and dirty version. Now I'm going to go into a small number of features I wanted to highlight, because there's not time to cover nearly everything. The first one is RPostgreSQL compatibility. RPostgreSQL is one of those packages available on CRAN. If you're programming in R on your desktop, this library lets you connect to a Postgres database, pull data out, and do your analysis on it. One of the things I built into PL/R is compatibility with the names of the functions in that package. So all of the functions that are normally used in R to set up and connect to a database, run a query, and get back the results, exactly the same names, are usable in PL/R. So now suppose I wrote this in R. This is an R function, and as an aside, this is a traveling salesman problem.
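A minimal sketch of what that slide describes; the function name here is my own placeholder:

```sql
CREATE EXTENSION plr;

-- OUT parameters declare the columns coming out of the function;
-- the R body builds a data.frame, R's equivalent of a table.
CREATE OR REPLACE FUNCTION r_letters(OUT letter text, OUT num integer)
RETURNS SETOF RECORD AS $$
  data.frame(letter = c('a', 'b', 'c'), num = 1:3)
$$ LANGUAGE plr;

-- Returns a virtual table: two columns, three rows.
SELECT * FROM r_letters();
```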
How many people are familiar with the traveling salesman problem? Just a few. So I pulled it up on Wikipedia. Basically, say I have a traveling salesman and he's going to go to 50 different cities across the US. I know what the cities are and where they're located, but how do I determine the shortest path I could take while still visiting all 50 cities exactly once? That's what's known as an NP-hard problem, meaning getting the exact answer would basically take until the sun burns out, billions of years from now. But R gives you a library that will calculate at least a reasonably good solution in a reasonable amount of time. That's what this is doing here. This library, TSP, calculates a solution to the traveling salesman problem. In any case, what I'm illustrating here is: I'm connecting to a Postgres database, I'm setting up a string which is my SQL statement, I'm running that query and assigning the output to a variable, and then I'm doing my calculation for the traveling salesman problem. But notice on this next slide, this is the same function in PL/R. Because of that RPostgreSQL compatibility, the only thing that really changes is the decoration around the body of the function. Everything else is exactly the same. It's just cut and paste. And then, just to complete the example, when you run it in R you get this answer, and from PL/R you get the exact same answer. You can actually control how many digits of precision you get out of R, but basically it's giving you the same number because it's the same library doing the same calculation. Now, one of the other things I wanted to highlight about PL/R is its use in creating a custom aggregate function. Again, Postgres is built pretty much from the ground up with the idea that it should be extensible.
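A sketch of the shape of that function, with a made-up cities table; the dbDriver/dbConnect/dbSendQuery/fetch names are the RPostgreSQL-compatible ones PL/R provides, and TSP/solve_TSP come from the TSP package on CRAN:

```sql
CREATE OR REPLACE FUNCTION tsp_tour()
RETURNS integer[] AS $$
  library(TSP)
  # RPostgreSQL-style calls; inside PL/R these run against
  # the current backend's own connection.
  drv  <- dbDriver("PostgreSQL")
  conn <- dbConnect(drv)
  sql  <- "SELECT id, lat, lon FROM cities"
  res  <- dbSendQuery(conn, sql)
  cities <- fetch(res, n = -1)
  dbClearResult(res)
  dbDisconnect(conn)
  # Distance matrix over the coordinates, then a heuristic tour.
  tour <- solve_TSP(TSP(dist(cities[, c("lat", "lon")])))
  # Return the city ids in visiting order.
  cities$id[as.integer(tour)]
$$ LANGUAGE plr;
```

The same body, minus the CREATE FUNCTION decoration, is what you would run client side in plain R against the real RPostgreSQL package.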
It even gets to the point where your aggregates can be extended by creating custom aggregates. The way it works in Postgres is: say you've got a group of values. Does everyone here know pretty much what a SQL aggregate is? An example would be average. When you're doing an average of some column grouped by some other field, then for all the rows where that field is the same, I'm going to collect data at each of those rows, and at the end I'm going to spit out one row which is equal to the average. The way that actually works is kind of like a map-reduce. There's a state transition function that gets run for each row in the group. In the case of average, all it does is collect the sum and the count. That value gets passed as an argument to the next call of the transition function; it's like an invisible argument. So when you have avg(column_a), when that gets called, it actually has another argument that you don't see, where the result from the last call is sent in as well. That allows you, throughout the group, to keep summing and keep incrementing the count. Then at the end of the group there's an optional final function that runs. The final function can do whatever you want, but in the case of average, you take sum divided by count and you get the average. So you can build custom ones, and you can also have an initial condition, because the first row in the group obviously doesn't have a return value from the one before it. That's what the initial condition is used for. In this example, I'm creating a PL/R function called r_quartile, and there's a built-in function in R called quantile, which I'm going to use with four bins. So I'm effectively creating a quartile, getting five values back for my data.
And that's actually something you can do natively in the most recent versions of Postgres, but these slides are from a few years ago, and up until then it was effectively impossible without some really custom code or something written in C. So I can create a custom aggregate using this syntax here: CREATE AGGREGATE quartile. I'm saying that it can accept anyelement, and I'm going to use the state transition type anyarray. What this means is that this aggregate can take an integer and collect a group of integers, or take a float and collect a group of floats, or take a numeric and collect a group of numerics. So this will actually work, as written, with multiple data types. When I run this thing, it's basically going to build up an array. The state function in this case is array_append, and the final function is the PL/R function. array_append just says: each time I get a new row in the group, append it to an array. At the end of the group I have an array of all the values, I feed that into the PL/R function, and it spits out my quartile values. So that's what these values look like, but it's even more interesting when you look at them visually. You can graph them, and this is just one way to do it, called a box plot. It's a common way to do this sort of thing. The data I used for this example came from real data in a real factory where I used to work, and the interesting thing you can see here is that these middle lines are basically the median. These are four different workstations producing the same part, and this gives you an idea of the spread of the data. You can see right off the bat that this one is getting a very different median value for this particular measurement than these two are. So you might want to go look at that workstation and figure out what's wrong with it.
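Putting those pieces together, roughly what the slides show; the table and column names are placeholders of mine:

```sql
-- Final function: hand the accumulated array to R's quantile(),
-- which with probs 0, .25, .5, .75, 1 yields five quartile values.
CREATE OR REPLACE FUNCTION r_quartile(anyarray)
RETURNS anyarray AS $$
  quantile(arg1, probs = seq(0, 1, 0.25), names = FALSE)
$$ LANGUAGE plr;

-- Polymorphic aggregate: array_append is the state transition
-- function, collecting each row of the group into an array.
CREATE AGGREGATE quartile (anyelement) (
  sfunc     = array_append,
  stype     = anyarray,
  finalfunc = r_quartile,
  initcond  = '{}'
);

-- e.g. spread of a measurement per workstation:
SELECT workstation, quartile(measurement)
FROM factory_data
GROUP BY workstation;
```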
I don't know how I'm doing on time. Yeah, like four or five minutes. Four or five minutes, and technically you've got like ten including questions. Okay, well, I'll try to speed up a little bit. So now I've got two more examples, and there's no one after me. Yeah, there's no one after you. So if you guys don't want to walk out on me, I'll just keep talking. This is another example, one where I'm going to return an object from R as what's called the Postgres bytea data type, which is essentially binary data. I can take the object as it exists in R and return it directly. The nice thing about that is the object in R could actually be a chart or graph. So in this example, I'm going to show how I can get high-low-close data from Yahoo for some stock symbol and plot it with Bollinger bands and volume. If you've ever been into trading, you've probably heard of those things; you might not have otherwise. It does require some extra packages, and just to be complete, the slide shows how you would install those extra packages in R. By the way, you would want to do this either as the postgres OS user or as the root OS user, to make sure those packages are available for the entire machine, or at least for the postgres user on that machine. So in this example, I'm creating a PL/R function that takes a stock symbol as an argument. I load up the libraries that I need. These two lines basically set up a buffer in memory to take the graphic I'm going to output. This one line goes out to the internet, grabs the last two years of high-low-close data for that stock symbol, and assigns it for me. This one line produces a chart, and then these two lines harvest that chart out of memory and return it. So you notice, as I said a minute ago, I'm returning it as bytea, which is just the binary.
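The shape of that function is roughly the following sketch. I'm assuming the quantmod package for the chart and the cairoDevice/RGtk2 route for the in-memory buffer, so treat the exact calls as illustrative rather than as the slide's literal code:

```sql
CREATE OR REPLACE FUNCTION plot_stock(sym text)
RETURNS bytea AS $$
  library(quantmod)
  library(cairoDevice)
  library(RGtk2)
  # Set up an in-memory drawing surface instead of a screen device.
  pixmap <- gdkPixmapNew(w = 500, h = 500, depth = 24)
  asCairoDevice(pixmap)
  # Fetch recent high-low-close data from Yahoo and chart it
  # with Bollinger bands and volume overlays.
  getSymbols(sym, src = "yahoo")
  chartSeries(get(sym), name = sym, TA = "addBBands();addVo()")
  # Harvest the rendered chart out of memory as JPEG bytes.
  buf <- gdkPixbufGetFromDrawable(NULL, pixmap,
                                  pixmap$getColormap(),
                                  0, 0, 0, 0, 500, 500)
  gdkPixbufSaveToBufferv(buf, "jpeg",
                         character(0), character(0))$buffer
$$ LANGUAGE plr;
```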
So I'm just going to create a chart in memory and return it from my function. Now, as an aside: many times on database servers you're running headless, so you don't have a graphics system, or at least older versions of R needed a graphics system in order to render a chart. You can work around that by using a virtual frame buffer. If you install Xvfb and run it like this, this example would work if you wanted to try it yourself. And then this is just a very short PHP script that connects to Postgres and runs my function. There's another function PL/R provides, called get raw, because when you return an image as an R object, there's some R decoration around the actual JPEG or PNG, and this function strips that off. Then it basically just returns the image to the browser. And this is what you get. So in a dozen lines of PL/R code and a dozen lines of PHP code, you get a chart showing you all kinds of good information about this stock. Now, one more example I'll go through. This is a real-life example. I got contacted at one point by a researcher at, I think it was, the University of New Mexico. He was doing research on seismic events, earthquakes. He was complaining that something was too slow, which I ended up fixing for him in PL/R. So that's now fixed. Basically, a seismic event is stored as an array of floats with something like 16,000 elements. Originally, if you passed in an array with 16,000 elements, PL/R was converting each one of those, one at a time, to a float inside of R, even though it was already a float in Postgres. In fact, for one-dimensional arrays, R and Postgres store the floats in exactly the same way. So to speed it up, there's now an optimization that recognizes the situation and just does a memcpy. Now it's lightning fast, instead of converting 16,000 elements one at a time.
In any case, this is to show you: if you were to import a thousand seismic events of this size, this is just a test using PL/pgSQL. It reads the file, creates an array of 16,000 elements as a Postgres array data type, and stores it. It takes like 37 seconds. But if instead I do this, reading the file inside of R using R's ability to read a CSV file and then returning the data as an R vector directly to Postgres and storing it in a table, it only takes 12 seconds. So it's significantly faster. Now I can take that stored value, pass it as an argument to a function, and, pretty much exactly like the stock example, set up to capture the image, run plot, and grab the image. This is what you get as output; this is what this particular earthquake looked like. But what's more interesting is that I can do other kinds of analysis on this data now that I've got it in this form. There's something called a fast Fourier transform. You run that on waveform-type data and you can calculate what's called a power spectrum. Again, this is a very similar example; it's just a half dozen lines of R code doing the bulk of the work. And I get this diagram. What this is showing me: these are the frequencies inside of that waveform, and this is the amount of energy at each frequency, so you can see which particular frequency has the peak energy. The reason that's important, if you're familiar at all with mechanical engineering, is that every structure, like a building, has what's called a resonant frequency. If you push something or vibrate it at its resonant frequency, those vibrations get amplified. And so when you push something like a bridge or a building at its resonant frequency, it'll swing harder and harder until it just falls over.
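The FFT step can be sketched like this; the function, table, and column names are my own placeholders:

```sql
-- Power spectrum of a stored waveform: squared magnitude of the
-- FFT, keeping the first half (a real signal's spectrum is symmetric).
CREATE OR REPLACE FUNCTION power_spectrum(waveform float8[])
RETURNS float8[] AS $$
  s <- fft(waveform)
  Mod(s[1:(length(s) %/% 2)])^2
$$ LANGUAGE plr;

-- e.g. SELECT power_spectrum(wave) FROM seismic_events WHERE id = 1;
```

Because of the one-dimensional float array optimization described above, the 16,000-element array passes into R as a memcpy rather than element by element.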
So what you don't want to do is design your buildings and your bridges with a resonant frequency that matches where all that energy is. Anyway, that's my quick overview. I think we have a few minutes for questions, if there are any. Can you lock R down? Like, do you want to make sure your data's safe? R does not have a safe mode, and I've had this conversation with a bunch of different people. AWS, for instance, won't make PL/R available on RDS because it's an untrusted language. I've talked to the guys who run RDS; a guy named Grant McAllister is one of the original folks and kind of runs that team, and we've kicked around different ways we could solve the problem. It's not easy, just because you'd have to do some major hacking in R itself to fix it. So most of these functions just return data, only doing a select. Right. So does that mean that if the one running it doesn't have permission to write, they can only query, or could they write? Remember what I said: to create the function you have to be a superuser. So what you want to do is make sure, before you create a function, that someone has actually studied what it does and made sure it doesn't do something nefarious, like reading a file it shouldn't read on the file system, or writing a file it shouldn't write. Presumably your DBA has to be your last line of defense if you want to use an untrusted language. And really, there are only a few trusted languages. PL/pgSQL, which comes with Postgres, is trusted; it cannot do any file system manipulation. There's another extension called PL/V8, which does JavaScript using the V8 engine; that's also a trusted language. And then PL/Perl has a trusted version, because Perl actually supports a sandbox mode where the Perl engine itself has been made safe. But R doesn't have that mode.
So if someone really needed it badly enough, it could be solved. Like I said, you could go into the R engine, find the right spot in the code, and basically put in a hook. Then PL/R could register a callback on that hook, and you could implement something like a whitelist saying only these R calls are allowed. That's one way it could be solved. There's also PL/Container from, I think, Greenplum, which allows you to do this. Yeah, actually the guy who implemented that is Dave Cramer, who works for Crunchy, the company I'm at. He worked for Pivotal at the time he wrote it. PL/Container basically lets you spin up a container that'll run, I think it's primarily meant for R, but I think you can run other things inside it too. So it spins up, does some analytics, and then collapses. There's a lot of overhead in doing that, so it's useful only if you're working on massive data sets in something like Greenplum, where that time is negligible compared to the overall time of the analysis. All right, well, thank you. Great, on to the next speaker.