So today I'm going to discuss analytics in Postgres, and I'm going to talk about it not just from the standpoint of using Postgres as a data store, but of using Postgres as part of your analytics infrastructure, really as your engine for running predictive models.

The traditional way analytics has been done is with a separation of the model and the data. You use the data to build the model, but the model is coded in Python or what have you, or frequently in whatever language the application itself is written in. The execution model for predictive analytics is basically a cycle: pull some data together from the database or whatever data sources you're using, perform some computation with that data, make a decision, and then possibly repeat the process as you step through a decision tree. If you notice, half of the formula I just described is "gather data". That's a data problem, and frequently the data half is the majority of the cost you incur when running your models in production. So the main idea here is this: instead of bringing the data to the model, why don't we bring the model to the data? Network bandwidth and network infrastructure are among the highest costs in any data center, and they're one of your big performance bottlenecks. So instead of shuffling a whole bunch of data to your model, run the model where the data lives.

I'm going to do this in the form of a case study. Don't worry, it's not going to be some deep-dive advertisement for my client, but I did want to describe this from the standpoint of a system I've actually built, a real-world application and not something that's just theoretical. To give you some background on my client: they're a financial company, a near-prime lender. The key thing you need to understand about that is the FICO score all of us worry about ("ooh, do I have a 720?"). Near-prime borrowers, people in the low 600s and high 500s, are not quite as concerned about what their FICO score is, and that means the FICO score is not a good predictor of whether they're going to repay their loan. So somebody doing lending in that environment basically has to build their own version of a FICO score: they have to look at all the information about people applying for credit and decide on the fly whether this person is a good credit risk or not. Having the ability to track and control their analytic process is absolutely critical, especially with online lending where there's no collateral involved. They're not a car lender or anything like that; these are strictly installment loans, but you can think of it kind of like a credit card. If somebody comes in and you agree to give them a loan, and in fact they were somebody who was intentionally trying to defraud you, you can have millions of dollars waltz right out the door that you're never going to see again, and obviously that's something a company cannot tolerate for long.
The other aspect of this is that, as a lender, they operate in a regulated industry, so they have to be able to demonstrate things to regulators such as "we are not making lending decisions based on gender" and things of that nature. Again, the repeatability and trustworthiness of their analytics becomes very important, so they can show regulators what they want to see and not end up in the conversation of "what do you mean you can't show me how you made this decision? What do you mean you don't know what model you used?" In addition to that, they obviously need to be able to protect sensitive data. They also need to be able to rapidly evolve the analytics models, because as somebody comes up with a new way to try and defraud them, they need to respond very rapidly. And finally, performance is a consideration. A lot of the leads they have coming in are bought through an affiliate purchase program, so a lead will come in and they'll have a number of seconds to decide whether to spend the money to purchase that lead or not. Those lead purchase costs can be very expensive, so by the time you get around to purchasing the lead, you want to have a really good idea as to whether this is a good potential customer. They have to do real-time analytics as leads are coming in.

So let's talk about the traditional workflow in more detail, or as I like to call it, death by a thousand paper cuts, except that today is the era of big data, so it's not just a thousand paper cuts. Traditionally, the way analytics models are developed is that you pull a bunch of data from a source system into some kind of analysis environment, whether it's Python or SAS or what have you (everybody's got their own flavor), or maybe you do some amount of searching for patterns on the source data system itself. You find the correlations, you build the model, you train the model with a subset of your captured data, and then you validate the model's performance on the remainder of the data. So if you have a data set of, say, a hundred thousand loans and ten percent of them defaulted, you want to be able to identify what it is about the ten percent that defaulted. You would run a sample through the model of, say, twenty percent of the defaulted loans and twenty percent of the non-defaulted loans, use that for training, and then run the model against the remainder of the data set to see whether the model does a better job of predicting or not. That's the analytics side.

Then you get to the development side, or as I call it, "throw it over the wall". The model is thrown over to the developers, the developers complain about the code, possibly mock the language the analysts are using ("why aren't you using Rails? that's the cool language"), and the model gets completely recoded. Now you have to test the new code, or you could just throw it out in production and cross your fingers, because that's always a good plan. And finally, after this entire process has run its course, the analytics folks get to see the real results of the model running in production against real data. One of the issues here is that this process can take weeks to months, depending on how efficient the organization is.

As you can see, there are some problems with this. First of all, data wrangling is not the same as analytics; those are two separate problems here. And the ability to track models with this system is extremely limited.
Basically, the only traceability on the models themselves is whatever gets checked into the version control system for the application. The analytics department may or may not be tracking their side of things, what they've done with the models. Tracking the execution, the actual performance of the models in real-life use, is ad hoc; it's pretty much whatever gets implemented for each individual model, and a lot of times there's no cohesive system in the application. There's a lead model, and then there's a credit limit model, and then there's a loan amount model, and maybe there are models for who knows what else, and they're all separate. They don't know about each other and there's no commonality to the code. Finally, this us-versus-them dynamic leads to bad morale and massive cycle times. Now, to be fair, I'm describing some of the worst of what the situation can get into, and hopefully none of you are stuck in that kind of environment. But even if it's not that bad, this disconnect you run into between the analytical team and the development team, this "we'll hand you the code and then you have to get it into production", is not a useful paradigm and it adds a lot of inefficiency.

So the solution we created for Kyrios is based on recognizing that there are basically four separate challenges in predictive analytics: gathering the data to construct the model, constructing the model itself, the model execution phase, and the data you need to execute the model. The commonality here is that you've got two things that are data and two things that are code. So how we approached this is we said, let's look at these two problems each on their own, but from a cohesive standpoint. We'll recognize the data problems as data problems and the code problems as code problems, and use the right tools in each case, but we'll do it cohesively instead of "ooh, the data is here and the code's over there."

For the data, you want to make it as easily accessible as practical, because inaccessible data makes the job of the analytics team building the model more difficult. Ultimately, the data they choose to use, the data points they say have statistical significance, need to be readily available for the production model to make use of. If those are scattered all over the place, buried in XML documents or what have you, that impacts the quality and the performance of your production models. So instead of taking a point-by-point approach of "let's expose this data point; oh, now they want that one, so we'll expose that one too", we wanted to make as much of the data available as practical.

There are two main data sources that Kyrios deals with. One is third-party reports, and generally speaking those come in the form of XML: you make a call to a web service and here's your gigantic XML document back. The other is the metrics on the lending portfolio itself. The problem with the XML is that hand-coded XPath stinks. Have any of you ever dealt with that? Yeah, all these sheepish grins saying "yes, that was a horrible time in my life, I felt dirty hand-coding all that stuff." I've seen this working with other analytics teams: they've got expressions that stretch this long, and then they need that for a hundred different data points, so it's huge expression after huge expression after huge expression, and trying to maintain that is a complete nightmare, let alone the pain of actually creating the thing in the first place.

So we said, JSON is the cool new thing, we'll follow the herd into JSON. But it doesn't take much thinking to realize that while the JSON operators in Postgres make it nicer, and it's not as bad as the XML, who wants to hand-code all of that either? So really, what we did right from the get-go is we said, look, everybody knows SQL, and SQL makes this easy. Yes, these are NoSQL, unstructured JSON documents, but each document's shape doesn't change, so we can easily take the JSON and turn it into a set of views that represent the hierarchy inherent in the JSON document. If your top level has keys a, b, and c, and each of those is an object, you have view a with its fields, view b with its fields, and view c with its fields. Now, when you're dealing with it from a SQL standpoint, all you need to know is how to join things together, and everybody knows how to do that. The result is that the data is accessible via SQL and very easy to deal with.
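To make that concrete, here is a minimal sketch of the idea. All of the schema, table, and field names here are hypothetical, invented for illustration; the real Kyrios reports and views obviously look different.

```sql
-- Hypothetical raw storage for a third-party report, one JSONB document per call.
CREATE SCHEMA IF NOT EXISTS third_party;

CREATE TABLE third_party.credit_report (
    report_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    identity_id bigint NOT NULL,
    received_at timestamptz NOT NULL DEFAULT now(),
    doc         jsonb NOT NULL
);

-- One view per top-level object in the document, so nobody ever hand-codes
-- a path expression twice: the JSON operators live here and only here.
CREATE VIEW third_party.report_applicant AS
SELECT report_id,
       identity_id,
       doc -> 'applicant' ->> 'first_name'                AS first_name,
       doc -> 'applicant' ->> 'last_name'                 AS last_name,
       (doc -> 'applicant' ->> 'stated_income')::numeric  AS stated_income
  FROM third_party.credit_report;

-- Repeated sub-objects (an array of tradelines, say) become one row per element.
CREATE VIEW third_party.report_tradeline AS
SELECT r.report_id,
       r.identity_id,
       (t ->> 'balance')::numeric        AS balance,
       (t ->> 'months_delinquent')::int  AS months_delinquent
  FROM third_party.credit_report AS r
 CROSS JOIN LATERAL jsonb_array_elements(r.doc -> 'tradelines') AS t;
```

From the analysts' side, joining report_applicant to report_tradeline on report_id is all they ever need to know; the path expressions are written once and maintained in one place.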
The portfolio metrics are a different problem. With the third-party XML you've got a well-defined structure, and the third-party vendors say "hey, you pay us five bucks, here's the data we're going to give you"; they've already identified for you the kind of data you probably want to look at. With the portfolio metrics, it's up to you to decide which things in the collection of data you've captured from your clients could be useful predictors of different credit risk behaviors. Some of these are going to be kind of obvious. People who get the loan and then we never hear from them again: obviously that's bad, so "people who never pay" is one metric we want. Do people pay on time? That's maybe important, maybe not, and this is one of the places where the FICO score is not so handy, because the FICO score is based on predicting whether you're going to be late on a payment. Most lenders don't really care that much if you're late; they care if you don't pay. A lot of them, and they won't say this, and maybe I shouldn't say this since I've mentioned who my client is (and I'm not saying my client does this), but Discover and certain other folks, when you miss your payment, they get to charge you the $35 late fee. Do you think they're crying into their glass over that? No. As long as you pay back what you owe them, they're going to be happy.

So what this boils down to is that you need to identify which metrics are going to be useful. Some of that identification can be done strictly as part of your analytics modeling process; that is part of predictive analytics, where you look at a whole pile of data and let the machine do the hard work of identifying "hey, I found this pattern over here, it seems to be important." Some of it is just going to be driven by "look, we need to measure portfolio performance." Part of that, for a lender, is being able to accumulate a loan loss reserve: as loans go through their life cycle, you need to be able to predict which loans are okay and which ones carry a risk that you're not going to see the rest of the money, so you can start planning financially for that chunk of money not coming back.
Once you identify these useful metrics, you want to code them cohesively, and this is an area where I think a lot of companies do a really not-so-great job. They look at things like tracking portfolio performance as "oh, hey, that's the responsibility of the BI team", or it's the responsibility of the finance team, and then maybe finance has to go talk to the BI team, and then the BI team might try to complain to the coders that it's hard to get at this data, and somewhere along that chain things just break down. What you really want to do is come at this from a holistic approach: metrics that are going to be useful for predicting things are probably metrics you should be tracking and reporting on anyway. If you now have two uses, your business reporting and your predictive analytics, that are both saying "hey, we need the same data point", well, let's just agree this data point is important and worry about it all the way through the data life cycle.

That starts with the customer interaction on the website: how do we collect information in such a way that we can do analysis on it? That covers everything from how you collect customer addresses, whether you sanitize that data, and whether you keep the original unsanitized data, because sometimes you can glean information from things like whether they typed in all caps or not. That may be a predictor; I honestly don't know, but those are the kinds of things you need to worry about on the front side. By doing that, when you're collecting data down the road it's not an enormous nightmare to measure these things; it's not a problem for your BI team and it's not a problem for your analytics team.

Finally, there's the performance aspect of collecting the portfolio metrics. Certainly in the finance industry, for portfolio metrics, you generally don't need to worry about that too much up front. Have any of you heard the axiom that the first rule of performance tuning is "don't"? That's what this is about. What you don't want to do is get all worried about "oh, well, if we don't materialize it and do this and that and the other thing, then it won't perform well", and then you end up doing nothing, and now to calculate that metric you have to run a bunch of horrible SQL by hand, which, by the way, is going to perform horribly anyway. So focus more on getting the metrics defined and implemented, and where there's a performance problem you can go back and address it at that point, once you've built some infrastructure and you know "yes, this is the metric we want, it's a metric we need, and by the way we need this report to run in less than ten minutes, and we use it for the lead purchase so we need to be able to pull the data in less than 500 milliseconds." I'll show a small sketch of what I mean by "define it first, tune it later" in a moment.

The really important part of all this, though, is that everything I just talked about is built on the concept of bringing the analytics folks, the business intelligence folks, and development together to define this stuff and work on it together. That doesn't mean all of a sudden you can't create a single data point without having fifteen people in a conference room, but you want that cross-communication, and most importantly you want a feeling of shared ownership, so that it's not this throw-it-over-the-wall mentality.
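Here's the small sketch I promised, again with made-up names: the metric starts life as an ordinary view that both the BI reports and the models can read, and only if it is later measured to be too slow do you reach for something like a materialized view.

```sql
CREATE SCHEMA IF NOT EXISTS portfolio;

-- Hypothetical payment history, one row per scheduled payment.
CREATE TABLE portfolio.loan_payment (
    loan_id   bigint        NOT NULL,
    due_date  date          NOT NULL,
    paid_date date,                          -- NULL means never paid
    amount    numeric(12,2) NOT NULL
);

-- The metric itself: just a view, defined once, shared by reporting and models.
CREATE VIEW portfolio.loan_delinquency AS
SELECT loan_id,
       count(*) FILTER (WHERE paid_date IS NULL AND due_date < current_date) AS missed_payments,
       count(*) FILTER (WHERE paid_date > due_date)                          AS late_payments,
       coalesce(sum(amount) FILTER (WHERE paid_date IS NULL), 0)             AS outstanding_amount
  FROM portfolio.loan_payment
 GROUP BY loan_id;

-- Only once this is a measured bottleneck do you trade freshness for speed,
-- for example with a materialized view refreshed on a schedule:
-- CREATE MATERIALIZED VIEW portfolio.loan_delinquency_mat AS
--     SELECT * FROM portfolio.loan_delinquency;
-- REFRESH MATERIALIZED VIEW portfolio.loan_delinquency_mat;
```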
So the second aspect of this is the code portion; this is the stuff I know you're all waiting for. Creating and training models is frequently a computationally difficult task. Executing the models usually is not; there are exceptions, depending on what you're doing, but frequently, when executing the model is expensive, it's not the CPU cost, it's not the computation itself, it's getting access to the data the model needs. If you do have a model that is very computationally intensive to execute (do any of you work with predictive analytics for image processing or image recognition? one, okay), or that needs GPUs, or if it's simply too expensive to run inside the database, you can consider handling that piece outside the database.

Obviously you want to avoid duplication of code, and traditionally this is a big problem, because your analytics folks like to use languages like R and Python, and developers like to use things like Rails (that used to be hot; now it's more Node), and you've got this dichotomy that doesn't really mix. By running the model code inside Postgres, you can let the analytics folks create the code in the language they're comfortable with, because Postgres supports R, it supports Python, and it supports a bunch of other procedural languages, and now you completely eliminate the step of "throw it over the wall and rewrite it from scratch." You want to support that, and you want to make the model development environment as similar to production as practical.

So what we did with Kyrios is that every model has two database functions, and that's really all there is, code-wise, to each model. One function is responsible for pulling all the data together. Because we've got the XML information separated out into a series of views, the data gathering functions are really simple: it's just a SQL statement that pulls these fields from this view, joins to that view, pulls these other fields, joins to another view, and brings it all together into a compound data structure, a Postgres composite type. The execution function, which is responsible for running the model, can then be coded in any language you want, and the only restriction on it is that it accepts a compound data type and it returns a compound data type. The reason for doing that is it allows us to predefine the structure for each model, for both its input and its output. The framework then has these compound data types that it knows about, and it can just store them in a table. So every time a model executes, the framework inserts a record into a table that says "here were the inputs, here's the output." Two fields, that's it, you're done; the framework doesn't need to know anything more about it. Obviously, to use the output you will need to understand what the structure of that compound type is: here's the credit score portion, here's the margin-of-error indicator, here are any warning flags we saw, whatever you want to put in there.

Yes, it is a structure. Compound types in Postgres: the syntax is basically CREATE TYPE, give it a name, parentheses, and now you essentially have a table definition. It doesn't have the full support (you can't declare foreign keys and all that), but you say "here's a field name and its data type, here's a field name and its data type." In fact, Postgres under the covers creates one of these types every time you create a table, so if you say SELECT * FROM pg_class and you get those rows back, you can actually treat each of those rows as a single value and store it inside another table.
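To show the shape of that, here is a rough sketch using the hypothetical views from earlier. The model schema, type names, and function signature are all my invention for illustration, not the client's actual framework; the point is only that a model's input and output are each a composite type, and the data gathering function is nothing but SQL that fills the input type.

```sql
CREATE SCHEMA IF NOT EXISTS model;

-- Composite types: essentially table definitions without the table.
CREATE TYPE model.lead_purchase_input AS (
    identity_id      bigint,
    stated_income    numeric,
    tradeline_count  int,
    delinquent_count int
);

CREATE TYPE model.lead_purchase_output AS (
    score           numeric,
    margin_of_error numeric,
    warning_flags   text[]
);

-- The data gathering function is plain SQL over the views described earlier:
-- join what you need and hand back a single composite value.
CREATE FUNCTION model.gather_lead_purchase(p_identity_id bigint)
RETURNS model.lead_purchase_input
LANGUAGE sql STABLE AS $$
    SELECT a.identity_id,
           a.stated_income,
           count(t.report_id)::int,
           (count(*) FILTER (WHERE t.months_delinquent > 0))::int
      FROM third_party.report_applicant a
      LEFT JOIN third_party.report_tradeline t USING (report_id)
     WHERE a.identity_id = p_identity_id
     GROUP BY a.identity_id, a.stated_income;
$$;
```

The execution function's only contract is that it takes a lead_purchase_input and returns a lead_purchase_output; what language it's written in is up to whoever writes the model.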
You could even treat the entire record set as an array of pg_class composites. If you wanted to code that, you would just declare a field whose type is pg_class followed by square brackets; that tells Postgres the type name is pg_class, it goes and looks it up and says "oh, pg_class, that's a composite data type, I know what that is", and the square brackets just indicate that it's an array.

So the model code that Kyrios uses is typically a Python function; the model itself ends up coded as a single Python function. If you're familiar with Python, it has this nice ability where you can define a function within a function, so for the more complicated models you'll have a top-level function that may define additional functions, but at the end of the day the model itself is nothing but pure Python. The only constraint on that Python function is that it accepts a single dictionary, which is the equivalent of the Postgres compound data type, and it returns a dictionary, which will then get turned back into a Postgres compound data type. The database function, the PL/Python function, is then nothing but a simple wrapper around the Python function. (I'll sketch what that wrapper can look like in a moment.) This means that when doing model development, the analytics folks' job is easy: once they know the structure of the dictionary they're getting, they can produce test data, they can do whatever they want, and as long as they return the right dictionary, that's their interface. They can go off and do their thing, and they don't have to say "hey, mister developer, I need this." Let them do their work and let the database people do their work.

Now, as I said, one of the key constraints here is that you do have to know what these compound data types look like, and this is where you want to get the people working together again on the data problem: "here's the data it looks like we need for this particular model." By doing that in advance, you get the collaboration going again. You have analytics folks who aren't necessarily database experts, and instead of them crafting some horrific "this is how we'll pull the data" query, you get them involved early with the database folks: "oh, you need this? well, hey, we'll do this thing over here and it'll make it a lot better." And by having this separation of the code, integrating the code into the database becomes very easy, and in fact there are a couple of ways we can do it. As models are being developed, we can create the data structures in the production system and put just the data structures in production, not the models, just the structures, along with the support for how you gather the data for each structure. The analytics folks can then take those data structures, and since we can also create the data gathering function, they can use it on a Postgres replica to say "okay, give me the actual data set", and they get a data set that can be pulled directly into Python. They can take a Python console, like a Jupyter notebook, connect it to the replica, and get the data streaming back in exactly the same format that will be used in production, and then they can do whatever they need to, analytics-wise, operating in the tool set and environment they're comfortable and productive in. They don't need to go back and forth with DBAs or developers asking "hey, can you push a new version of this function?"
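Here's the rough sketch of the wrapper idea I mentioned, continuing with the hypothetical types from above and assuming the plpython3u extension is available. The scorecard logic is a toy stand-in, not anything the client actually runs; the point is the shape: a pure Python function, dict in and dict out, with the PL/Python function as a thin shell around it.

```sql
CREATE EXTENSION IF NOT EXISTS plpython3u;

CREATE FUNCTION model.execute_lead_purchase(inp model.lead_purchase_input)
RETURNS model.lead_purchase_output
LANGUAGE plpython3u AS $$
# The model proper: plain Python, a dict in and a dict out. This exact
# function can be developed and tested in a notebook against data pulled
# from a replica, then dropped in here unchanged.
def lead_purchase_model(row):
    flags = []
    score = 600.0
    if row["stated_income"] is not None and row["stated_income"] < 20000:
        score -= 40
        flags.append("low_income")
    if row["tradeline_count"]:
        score -= 100.0 * (row["delinquent_count"] or 0) / row["tradeline_count"]
    return {"score": score, "margin_of_error": 25.0, "warning_flags": flags}

# PL/Python passes the composite argument in as a dict and converts a
# returned dict back into the declared composite return type.
return lead_purchase_model(inp)
$$;
```

Running a model end to end is then one line of SQL, for example: SELECT * FROM model.execute_lead_purchase(model.gather_lead_purchase(12345));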
The other option we have is to give them just a snapshot and say "here's how you create the Python function", and now you can just go run your model: you create your model, you stick it in a PL/Python function, and you say "okay, here's the loan portfolio, select the Python function, go", and that's how they can do the model verification. The third option, because of this design, is that you can actually put a non-production model into your production environment. You have to be careful, obviously, that you don't start accidentally calling it from the web front end so that all of a sudden bad things are happening, but you can do this to support your development and your model validation efforts: "okay, here's the model, we think it's in its final form, let's load it in production." We won't use it yet, but once it's in production, on our streaming replica we can select the model function against all these loans, basically run the model against the existing portfolio and see how it performs. It's all the same code base; there's no recoding here, there's no magic, it's the same exact framework working from within the database.

As for tracking, the way this framework works is that the actual function source code, for both the data gathering function and the model function, is captured in the framework, so every time you run a model you have pointers back to the exact source code. There is no question, no "well, let me see what git looked like on that day; oh wait, we were doing a deployment at the same time, so I'm not sure which one we used." For every single execution you know exactly the code that was in place. Because of the composite types, the input and the output get recorded with each individual model execution as well, so every time you run the model, you know everything; there are no questions. And by the way, the table is not updateable and it's not deleteable, so once the model executes you have a permanent, indelible record in Postgres. Well, a superuser could go in and mess with it, but short of that, this is a record that's not going away, and you can show it to regulators and give them the confidence that you know what you're doing, that this isn't guesswork; when you provide data to the regulator, they know it's the correct data.

Now, all this stuff I've been talking about sounds kind of complicated and hairy, so what does it look like in practice? Well, that's client code, so please don't share it. I would love to open source this, and I am going to be talking to my client about potentially doing so, but obviously this is a competitive advantage for them, so I don't know that I will be able to open source the framework itself. But I have given you the valuable insight here today of the idea: this is how you can build a system that executes this stuff from within Postgres itself. And on the code simplicity aspect, that really is all you need to do to run the model. For those of you who are not familiar with it, there's the kind of funny thing of NULL being cast to a data type: that data type is a table that stores the actual model executions. Every major version of the model that makes major changes to the data structures does get a new table created, but we're using Postgres table inheritance, so when you want to say "just show me every model run we've ever done", that's easy: you just query model.model_execution (I think that's the table name), and it will show you every single model execution that's ever happened. And if you want the details for a particular kind of model, say the lead purchase model, you query the lead purchase table and it's got the results. So by casting NULL to the data type of the table, that tells the function "hey, we're running this model", and the function goes and finds the details about the model itself. Then we just have a single surrogate key called identity_id that we use to identify individual people. The identity_id gets passed to the data collection function, so when it's pulling data out of the database it knows "hey, it's Bob, we're running the model for Bob, so go get Bob's data." It hands Bob's data to the model function we were told about earlier, the model function provides its result, and the model run ultimately provides the actual record that gets logged for the model execution.
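Since I can't show the client code, here is a hypothetical sketch of the shape I've just described: a parent execution log, one inherited child table per model, and a framework entry point that works out which model you mean from a NULL cast to that child table's row type. run_model and all of the names here are mine, invented for illustration, not the real framework.

```sql
-- Parent log: one row per model execution, whatever the model.
CREATE TABLE model.model_execution (
    execution_id bigint GENERATED ALWAYS AS IDENTITY,
    executed_at  timestamptz NOT NULL DEFAULT now(),
    identity_id  bigint NOT NULL
);

-- One child table per major model version; the composite input and output
-- columns are what make the child specific to its model.
CREATE TABLE model.lead_purchase (
    input  model.lead_purchase_input,
    output model.lead_purchase_output
) INHERITS (model.model_execution);

-- Insert-only in practice: application roles only ever get INSERT and SELECT,
-- e.g. GRANT INSERT, SELECT ON model.lead_purchase TO app_role;  (hypothetical role)

-- Hypothetical framework entry point; the NULL::row-type argument exists
-- only to tell the framework which model (and which log table) to use:
--   SELECT model.run_model(NULL::model.lead_purchase, 12345);
-- The parent shows every execution of every model:
--   SELECT executed_at, identity_id FROM model.model_execution;
-- The child shows the full record for one kind of model:
--   SELECT (output).score, (output).warning_flags FROM model.lead_purchase;
```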
So that's what the code looks like. Of course, the other question, especially for real-time analytics, is performance. Using a very simple, really a do-nothing, data gather function and a do-nothing model execution function (these are just the test functions we use for unit testing the framework), you can see that the two functions take less than a millisecond to run. The important number is the "other" line, which indicates that the overhead for executing the framework itself is just under seven milliseconds. So this idea that you can't run analytics in Postgres? I don't buy it. Unfortunately, I didn't think ahead of time to run these tests against some of the production models, but in reality, using the real models, we're still seeing sub-second performance, so this is something that will easily scale out to quite significant workloads in production without causing a performance problem.

So, key takeaways: use your data tools for the data and use your code tools for the code. This doesn't mean keeping the teams separated; you want the teams collaborating together, but you want them to be productive, so if your model folks want Python, let them use Python in its native environment, or R, and let the data folks use their data tools. And don't ignore this idea of using Postgres as an actual code execution platform. A more traditional way to do this would be to say, "we hate all these things, most of those bad things about analytics, so we're going to build analytics as a service, we'll do service-oriented architecture, you'll make a call to the service and the service will do all this stuff." The problem with that is that the service is still going to be shuffling data in and out of your database, so it's better, but you still have this data movement that becomes a significant performance problem. So use Postgres as a platform, Postgres as a service, or Postgres providing a service, and as I said, get the teams working together. We've got about ten minutes for questions.

Yes? Yes. So the question was: because we're tracking everything that's involved with the model execution, that's the input data, the output data, and the code itself, has that created a data size problem?
Not yet, because they're not in production yet. From a code standpoint, though, we don't store a copy of the code every single time; the code gets normalized. When we go to execute the function, we pull the source code for the function out of the Postgres catalogs, we compute a hash of it, and we use the hash to look up in a table whether we already know about this code. Actually, just to be paranoid, because the hash is the Postgres four-byte integer hash function, there will be collisions, so we don't use the hash as the key; we just use the hash as an easy, fast way to do the lookup. We grab the hash, use it to find the candidate function, and then do a comparison of the actual code. That's pretty fast because it's just a memory compare; from a C standpoint it's just running through two arrays at the assembly level. And then finally we say "okay, here's the surrogate key" and record that with the execution.

When it comes to the data itself, the nice thing about the Postgres compound data type is that it uses the same mechanisms that are used for defining a table. If you want to know the internal specifics of a compound data type, you run the exact same catalog queries you would run to ask "what are the contents of this table?" So that information is stored one time, and the storage for the compound type itself is just a Postgres internal format that's essentially a stream of bytes; it's the same as if you had created a table and said "I want to store this and this and this", and however that table would end up on disk, that's what gets stored in the compound data type. So the only disadvantage from a data size standpoint is that we are making a copy of all the input data. In our case, at least right now, all of that data already exists in Postgres. The data gather function, by the way, because we have PL/Python and what have you, could go call external services from the database; we're not doing that, but we could. Since we do have all that data already in Postgres, we are now making a second copy for every model execution, and that was an intentional design decision. Going back to the regulated industry thing: yes, we could have set things up so the data is pulled from here and here and here and the source tables are immutable, but we just said, you know what, it's not worth it in this case. The bytes are cheap enough; we just make a copy and store it. The important thing for model executions is that in your production environment, when you're running the models, that table is basically insert-only. If you were to refer back to the table, you'd be referring back to an execution that happened extremely recently, so it's all going to be in cache.

So no, we are not doing that yet; really, we're early enough that we don't yet have the data sets to do the model training on, so right now the models are basically "okay, we think this is a good starting point", and then we'll start collecting data. If we were to do that, we have two choices for running the model as if it were in production. We could put the model in production, not call it from production code, and then once it's in production run the models by hand on the production database, which is not a good idea.
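To sketch the source-code normalization from a moment ago (again with hypothetical names; the real framework differs in its details): the source comes out of pg_proc, the four-byte hash is only a fast way to find candidate rows, and the surrogate key is what gets recorded alongside each execution.

```sql
-- Every distinct version of a gather or model function we've ever executed.
CREATE TABLE model.function_source (
    source_id   bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    source_hash int  NOT NULL,   -- Postgres four-byte hash: collisions happen,
    source_text text NOT NULL    -- so the text itself is what has to match
);
CREATE INDEX ON model.function_source (source_hash);

-- Conceptually, at execution time the framework does something like:
--   1. Pull the current source out of the catalogs:
--        SELECT prosrc FROM pg_proc
--         WHERE oid = 'model.execute_lead_purchase(model.lead_purchase_input)'::regprocedure;
--   2. Compute hashtext(prosrc) and look for a row with the same hash
--      AND the same source_text; the hash alone is never trusted.
--   3. Insert a new row only if the code is genuinely new, and record
--      source_id with the execution, so every run points back at its exact source.
```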
The second option we have is that if we create a snapshot, or if we have a logical replication standby, we could run those model functions there and actually go create those records. I think it will be a matter of what's convenient when we first start doing model training; we may well do that, because it may just be the easiest thing to do: "oh hey, look, here's the code in production, and there's your overhead."

Yes? Yes. And everybody, please understand, this was developed with a specific use case for a specific client in mind; we're obviously very focused on the regulated-industry portion of this as well. So should you just blindly go take this model and implement it where you work? Well, unless you work at another financial company, maybe not. And yes, if you're talking about a huge chunk of data, no, you're not going to want to do that. My experience in the financial industry is that typically the model input data is only going to be maybe a third of the database in its raw form, and in its compacted, composite-type form I don't think it's going to be terribly significant.

Right, right. So the question is: if your analytics folks are using something that there is not a procedural language for, could you still use this concept? The answer is yes. You mentioned shell; there actually is a PL/sh, so you could fire up a shell from within the model execution. The one downside is that I wouldn't necessarily recommend shell for this. If you want to do that, what I would suggest is that you treat that portion of the model as a standalone service. You use one of the procedural languages, like PL/Python or PL/Perl (two good choices), or maybe PL/V8, and you use it to call out to that service: you pass the data to the service, you get the results back, and for the rest of it you can do what I described.

So, I'm being told we are out of time. I think next is the lightning talk; oh yes, next are the sponsor talks. I will stick around here for another four minutes, but then we all need to be next door. Thank you very much.