So, welcome to the Data and Analytics track here at RailsConf. I'm Barrett Clark, and we're going to talk about making data dance. Or maybe you could think of this as harnessing heretical anti-patterns for fun and profit. I'm excited to see so many people here in the audience who also want to play with data, or are excited about data. I'm super excited about data. And I promise, this is a bold claim, but I promise there'll be at least one nugget in this whole mess that you'll be able to take and use. A little bit about me. This is basically my resume, where I've been for the last 15 or so years. I started out, well, I started longer ago than that, but I spent a long time at AOL in a market research group, and in production we used Berkeley DB, yes, really, Berkeley DB, Informix, Sybase, two versions of Sybase actually, MySQL, Postgres, and SQL Server. And the last four of those we did with Rails, which was super exciting. Then I went to GeoForce, which is an asset tracking company that mostly works in the oil and gas space. So lots of geo data, really interesting stuff. There we used Postgres, PostGIS, and Redis. Right now I'm at Sabre Labs; I've been there for two and a half years. We're an emerging tech incubator within Sabre, which is a big company that I'm sure a few of you actually have not heard of. If you know the brand Travelocity, we just sold that, so I actually don't have anything to do with that anymore. We use Postgres, PostGIS when I need to do GIS stuff, Redis, and occasionally MySQL. What do all these things have in common? What's so great about me, right? I've seen a lot of data, a lot of really interesting data, and a lot of it. And I've generally played with it with Ruby. Not always, but generally. And then, again generally, I've played with it with Rails. Again, not always. You can do some really useful stuff with plain old Ruby.
This talk is kind of based on a series of blog posts that I did a while back in Postgres Weekly, the weekly newsletter; they picked up a handful of them. I'm super excited about data, y'all. So ActiveRecord: it's great. It handles most of your data access needs. It's a great place to start. But I maintain that you're going to eventually outgrow it. And maybe even outgrow Rails, at least in the monorail, or what do we call it now, integrated, I don't know. But that's a conversation for later. We're not going to talk about that. Having said that, Rails is still our go-to tool in the lab, especially when we need to build a back end. This isn't a talk about ActiveRecord, though. This is a talk about raw SQL. Now, where I go to raw SQL is kind of in these two areas. First, you can build up complicated joins. You can do joins in ActiveRecord with scopes. You can chain scopes together. And that's great. It also starts to get kind of ugly, right? And so I find myself, rather than figuring out how all these things fit together, and then you've got this thing from this scope that's referenced in this other scope, and that's weird, I'd rather just write the query. And then you've got deeper database functionality. ActiveRecord maps, you know, all of your CRUD functionality, and that's great. But if I want to use something in the database that's a little more robust or a little more complex or unique to that database, I'm going to dig into the query, into the SQL. And also, really, if I'm going to do PostGIS, the RGeo gem is fantastic, but I know PostGIS, I know those functions. And so I don't really want to learn a new API for this other thing that I already know. I use it some, but I also write the queries. So I know you're thinking, Barrett, dude, what about database portability, right? Everybody here is all freaked out, right? Okay, raise your hand if you've ever changed databases in an established project. Okay, fantastic. How about in the last year, have you done it?
Okay, so, some of you know it just doesn't work like that. Database portability is a pipe dream. And maybe y'all have had a different experience than I have. I've done it a handful of times, and it really sucks. Even beyond just the basic query interface that ActiveRecord gives you, the data types are different. These are different databases. There are different data types. And even if you have, like, that really cool MySQL-to-Postgres script, you still have to do a lot of work to migrate that data. It takes a lot of effort. And then, back at AOL, we switched versions of Sybase. So we were within the same database, right? And even that was a lot of work. We did all the prep, migrated the data, tested everything, and it was great. We still spent a month fighting fires. In part because of the DBA, but that's a different story. We still spent a month fighting fires. Database portability just doesn't work like that. So I say, you chose whatever database, the Postgres database, for a reason; let's use it. Now, it may not have been you who chose the database, but let's just assume that you or that other person are not a total moron. I'm comfortable with that assumption today on this stage. Okay, some caveats. The more work you put on the database, the more it can become a single point of failure. So pay attention to your connections. Pay attention to your connection pool, your concurrent connections. And if you have a query that takes longer than 500 milliseconds, you're doing it wrong, something is bad. And probably you don't want it to spend even that much time in the database. Half a second is an insane amount of time in the database. And we're gonna talk about how to deal with that. I'll also point out, you need to be benchmarking your queries against production data, because they're gonna behave a lot differently in your local dev database with a couple of hundred records than they do in your production database with a couple of million records.
Cool, so let's get our hands dirty. Window functions. This is gonna be the foundation of what we're gonna talk about today. Has anybody here played with window functions in Postgres or Oracle? Okay, great, so a few of you know this is way cool stuff. The documentation says a window function performs a calculation across a set of table rows that are somehow related to the current row. So we're gonna do a calculation across some rows related to the current row. As far as documentation goes, the Postgres documentation, I think, is pretty good. A calculation across a set of rows related to the current row. It's not bad. I think I can simplify it: you're gonna fold data from other rows into this row. Okay, so there are probably a dozen, maybe more, window functions. These five window functions are my jam: lead and lag, first value, last value, and row number. So let's take a look. Given some cheesy or fruity sample data here, we're gonna play with these functions. So, lead and lag are up first. We're gonna take the lead ID and the lag ID from fruits. Note that OVER, open paren, close paren, friend. That's where you define the window. We're not doing anything special in this particular case. We're just saying the table is the window. So we're gonna take the next ID and the previous ID for that given row. So you see for the first row, the next ID is two. There is no previous ID, so it's null. And then on and on; for the last row, there is no next ID, so it's null, and the previous ID is nine. So we're doing a calculation across some set of rows that are related to that current row. So let's talk about how to define that. We need to partition our fruit. So here we're gonna partition by the value of fruit, and we're gonna order it by ID. So all occurrences of the similar fruits are gonna be bunched together. We're gonna chunk the table up by fruits. So we take the first value of the ID across the fruits. For apple, we have four. The first ID is one.
There's one banana, and he's hanging in there at 10. And then we've got five pears. The first pear happens at four. If we were to look at the last value, apple would be a five, banana still a 10, and pear is a nine. You see those IDs bounce around? That doesn't matter. We've rearranged the data by fruit. We can round out my list of top five with row number. So we see each row's position within its window. And you'll notice each window function gets its own partition defined. They can be different if you wanted to do that. They're not related at all. So for each row, it's computed across the rows that are within that partition. So apple: one, two, three, four. One banana. Five pears, one through five. Great. Now let's look at a practical example. We're gonna make a chat app. So here you see we have some messages. We've got a couple of different rooms. And we want to make this app behave kinda like the Messages app on your phone. We're not gonna call out the person's name with each message. So how are we gonna do that? We could pull down all the data and we could loop through it and do all the stuff in the client, but that's a lot of work. So instead we're gonna just let the database tell us what's the next ID. We're gonna break the data up by room, because we've got two rooms, or n number of rooms. So for each room, we're gonna order it by room ID, which isn't strictly necessary, but when I'm looking at the data, it just makes it easier to look at if it's ordered. And then we're gonna order it by message ID. So this allows us to hide that name banner for the second and third message, because we see it's the same user ID. So we've got some user who's about to go to the airport, and then they just landed, and wow. And then somebody else comes into that room and says, hey. So the second and third message in there doesn't have the person's name. And then meanwhile, in some other room, somebody's waiting for an Uber.
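Since the slide queries aren't reproduced in this transcript, here's a minimal sketch of those window functions. It's illustrated with SQLite (3.25+, which also supports window functions) through Python's stdlib driver rather than Postgres, and the fruits table here is my own stand-in for the slide's sample data:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE fruits (id INTEGER PRIMARY KEY, fruit TEXT)")
db.executemany("INSERT INTO fruits (id, fruit) VALUES (?, ?)",
               [(1, "apple"), (2, "pear"), (3, "apple"),
                (4, "pear"), (5, "apple"), (6, "banana")])

# LEAD/LAG: fold the next and previous id into each row.
# (The talk uses a bare OVER (); ORDER BY id here keeps the output deterministic.)
leadlag = db.execute("""
    SELECT id,
           LEAD(id) OVER (ORDER BY id) AS next_id,
           LAG(id)  OVER (ORDER BY id) AS prev_id
    FROM fruits
    ORDER BY id
""").fetchall()
print(leadlag[0])  # (1, 2, None): the first row has no previous id

# PARTITION BY: chunk the table up by fruit, ordered by id within each chunk.
rows = db.execute("""
    SELECT id, fruit,
           FIRST_VALUE(id) OVER (PARTITION BY fruit ORDER BY id) AS first_id,
           ROW_NUMBER()    OVER (PARTITION BY fruit ORDER BY id) AS pos
    FROM fruits
    ORDER BY fruit, id
""").fetchall()
for r in rows:
    print(r)
```

The same SQL runs unchanged against Postgres; note how each window function gets its own OVER clause, so the partitions can differ per function.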
And then as a bonus pro tip, this doesn't have anything to do with window functions, but you can use positional field numbers for your ORDER BY and your GROUP BY. And this is especially handy if you have a calculated field, like a window function or a count or something, so you don't have to re-specify that calculation. Great. So this is RailsConf; how do we do this in Rails? This is how you would execute that raw SQL in ActiveRecord. You could also do find_by_sql, but I just prefer ActiveRecord::Base.connection.execute. So we've got a calculated field in this query that's not gonna show up when you inspect the object, because the class doesn't know about it, but it's there. Now, some of you may be thinking, well, this is a little weird, I'm uncomfortable with how anti-pattern that is. I've had that conversation before; that's fine. You don't have to use ActiveRecord. You can just use the pg gem directly, or if you were just using plain old Ruby, you can talk to a database with plain old Ruby; that is a thing. And it might look something like this. So whereas with ActiveRecord you got an array of message objects, directly with the pg gem you just get an array of field values. So you get an array of arrays with the data. As a reminder, models don't have to be backed by ActiveRecord; you don't have to inherit from ActiveRecord. You don't even have to persist the data in the database. We can just put business logic in a class and call it a model; that's totally fine. Great, so now we can answer questions like, when did something change? Like, when did somebody leave? How long did each thing last? How long were people in which places? We can do enter and exit events. And you could totally pull all of that transactional data down and crunch it in your client. But that's a lot of objects, a lot of ActiveRecord objects, or a lot of stuff to send across the wire to your JavaScript framework.
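The positional-field trick and the arrays-of-field-values result shape can both be sketched quickly. Again this uses Python's stdlib SQLite driver as a stand-in (the pg gem behaves analogously, handing back plain rows rather than model objects), and the messages table is a hypothetical one modeled on the chat example:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, room_id INTEGER, body TEXT)")
db.executemany("INSERT INTO messages (room_id, body) VALUES (?, ?)",
               [(1, "about to board"), (1, "just landed"), (2, "waiting on an Uber")])

# GROUP BY 1 / ORDER BY 2 refer to the first and second output fields,
# so the calculated field (the count) isn't re-specified.
rows = db.execute("""
    SELECT room_id, COUNT(*) AS message_count
    FROM messages
    GROUP BY 1
    ORDER BY 2 DESC
""").fetchall()
print(rows)  # [(1, 2), (2, 1)]: plain tuples of field values, no model objects
```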
But to do this, to answer these questions, we need more sophisticated queries. So we're gonna use subqueries. You can subquery to filter and group. So you can wrap a query with an outer query to filter and group it. You can subquery for a field value. That's really cool; we're gonna talk about that. And you can also do a subquery in a join. We're not gonna talk about that, but it is a thing and you can do it. So let's start with our base query. For a given phone, pull in the next minor field. We're looking at data that comes in from a Bluetooth beacon, an iBeacon. So we're doing indoor location stuff here. And this is gonna help us see how a phone moves around in an area covered by beacons. Again, that ORDER BY phone there isn't strictly necessary, but it just makes it easier to see. So we're gonna partition the data by phone. So for each of the phones, we'll have all of their readings, and they'll be in order by the reading timestamp. And then we're gonna select some data out. We're trying to answer the question of when did something change? So we're gonna wrap that query in another query, and then we can say when that minor and the next minor were different. So now we have our change events. And we name our subqueries, so it's all that AS beacon_readings. Yo dawg, I heard you like queries, so I put a query in your query so you can query while you query. So now we can see how people move around in the building, where they move from and to, and how often those things occur. And the data looks something like this. I'll give you a second to digest that. Great. Really, this is what the data looks like. So I used D3 to make a chord diagram to show the connections and the intensity of those connections. So this is the power of raw SQL. Don't be afraid of it. Okay, let's refactor. When you're building up queries and nesting them and nesting them and nesting them, it can get unwieldy. So we have an answer to that: common table expressions.
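The change-event pattern described above can be sketched like this: a window-function query wrapped in an outer query that keeps only the rows where the value and the next value differ. SQLite via Python's stdlib again stands in for Postgres, and the beacon_readings table and its columns follow the talk loosely (my own simplified schema):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE beacon_readings
              (id INTEGER PRIMARY KEY, phone TEXT, minor INTEGER)""")
db.executemany("INSERT INTO beacon_readings (phone, minor) VALUES (?, ?)",
               [("a", 1), ("a", 1), ("a", 2), ("a", 2), ("a", 3)])

# Inner query: for each phone, fold in the next minor.
# Outer query: keep only the rows where the minor changed.
changes = db.execute("""
    SELECT phone, minor, next_minor FROM (
        SELECT phone, minor,
               LEAD(minor) OVER (PARTITION BY phone ORDER BY id) AS next_minor
        FROM beacon_readings
    ) AS readings
    WHERE minor <> next_minor
    ORDER BY minor
""").fetchall()
print(changes)  # the two transitions: 1 -> 2 and 2 -> 3
```

Note the last reading per phone drops out naturally: its next_minor is NULL, and a NULL comparison is never true.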
So here was our original query with the outer query. We rewrite this with WITH. It's also known as a WITH query, because, well, we have WITH. We're still naming our subquery and all that stuff. And then you see the bottom half: we just select from that. And the query essentially becomes like a named variable. Now, this case isn't really all that spectacular, right? That's not really that different. The power of common table expressions comes in something like this, where you can build up bite-size pieces of the query. The specifics of this query really aren't that important. It does some cool stuff. I like it a lot. What we're doing here and all that stuff is we're just building up time series data and putting it in hourly buckets. We're doing that so we can look at this heat map and see how long people were in which places. Again, just using D3. Now you may have noticed, hey, wait a minute, we've got this apparent reporting app and we're using this base query all over the place. What's going on? We shouldn't be repeating ourselves. Yeah, sometimes you may have a query that gets used a lot in a lot of different places. And that's cool. We can fix that problem. We can clean that up with what's called a view: a stored query in the database. This is not a stored procedure. It's just a query that's saved in the database. And it looks like a read-only table. So to create it, we do CREATE OR REPLACE. That OR REPLACE is optional, but it's a super good idea. If you were to rerun it, it won't complain; it'll just overwrite it. So we've got this, and we can totally put this in an ActiveRecord migration, which is a good idea. If you're doing this in a Rails app, you wanna do this. It's not a reversible migration, obviously, so we have to define the up and the down. And that execute is hanging out bare there because within an ActiveRecord migration, it understands that this is ActiveRecord::Base.connection.execute.
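The WITH-query refactor looks like this in a sketch: the nested subquery is pulled out into a named common table expression, which SQLite (used here through Python's stdlib, as a stand-in for Postgres) also supports. Table and column names are my own simplification of the talk's:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE beacon_readings
              (id INTEGER PRIMARY KEY, phone TEXT, minor INTEGER)""")
db.executemany("INSERT INTO beacon_readings (phone, minor) VALUES (?, ?)",
               [("a", 1), ("a", 2), ("a", 2), ("a", 3)])

# Same change-event query as before, but the subquery becomes a named,
# variable-like piece up top, and the bottom half just selects from it.
changes = db.execute("""
    WITH readings AS (
        SELECT phone, minor,
               LEAD(minor) OVER (PARTITION BY phone ORDER BY id) AS next_minor
        FROM beacon_readings
    )
    SELECT phone, minor, next_minor
    FROM readings
    WHERE minor <> next_minor
    ORDER BY minor
""").fetchall()
print(changes)
```

With several WITH clauses chained, each bite-size piece can build on the previous one, which is where this style really pays off.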
And we're gonna just pipe a heredoc into that. And that strip_heredoc is a cool trick. When you're looking at the log output, it's not hanging out, like, way over on the right. It's gonna strip all the leading spaces, and so it'll look reasonable in your log output. And then we drop the view with DROP VIEW IF EXISTS. Again, IF EXISTS is optional, but it'll save you from having an error if you did something dumb. So now we have this; we're gonna just select from that view, and it's much cleaner, much easier to look at, and we can use it all over the place. But now you're saying, Barrett, we've got business logic in our database, and that's a super bad idea, right? Yeah, I know, I've had that religious battle too. I fought that religious battle too. I'm not saying put all of your business logic in the database. I'm not saying start there. I'm just saying think about it. Talk it over with your coworkers or your peers. The point is that you can decide where your code runs and why. Let's go faster. My query is too slow. Like I was saying, 500 or 600 milliseconds and unicorns are crying. So let's do the heavy lifting in a separate process, and we can do that with a materialized view. So whereas before we had a view, where the query was stored in the database and when you execute it, it goes and fetches the results for you in real time, with a materialized view we're gonna take the definition of the query and materialize the results from it. We're gonna store the results from the query ahead of time, and it can then be indexed, which is great. So we're gonna store the query definition and the data in the database. So we have CREATE MATERIALIZED VIEW. We name it. And I would reckon this may not be the canonical Postgres naming convention. That's okay. Just have some convention. Be consistent. And we're gonna just select out of that view.
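The view's up/down pair can be sketched like this. One caveat on the stand-in: SQLite (via Python's stdlib here) has no CREATE OR REPLACE VIEW, so the sketch drops first; in Postgres you'd use CREATE OR REPLACE VIEW and DROP VIEW IF EXISTS exactly as the talk describes. The view and table names are mine:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE beacon_readings (id INTEGER PRIMARY KEY, phone TEXT, minor INTEGER)")
db.executemany("INSERT INTO beacon_readings (phone, minor) VALUES (?, ?)",
               [("a", 1), ("a", 0), ("b", 2)])

# "up": store the query in the database; it then reads like a read-only table.
db.execute("DROP VIEW IF EXISTS active_readings")
db.execute("""CREATE VIEW active_readings AS
              SELECT id, phone, minor FROM beacon_readings WHERE minor <> 0""")

rows = db.execute("SELECT phone, minor FROM active_readings ORDER BY id").fetchall()
print(rows)  # [('a', 1), ('b', 2)]

# "down": drop the view; IF EXISTS saves you from an error on a rerun.
db.execute("DROP VIEW IF EXISTS active_readings")
```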
We're really gonna select the most recent 24 hours, because in this particular case we have a reporting app and we're just gonna show the most recent data. And then you can see we have indexes on this materialized view, which is good. Again, you can put this in a migration. I would recommend that. The indexes will drop for you automatically. You don't have to specify dropping those. When you drop the materialized view, or when you drop a table, it'll drop its indexes for you. Cool, so now we have got some fresh hot data. But when you materialize that view, that's a one-shot thing. You have to go update it. You have to tell the database to update that. So we have REFRESH MATERIALIZED VIEW. And brand new in Postgres 9.4, which dropped in December, we can do that concurrently, which is good, because otherwise it would basically take a table lock. Now, I'm not saying it's gonna be fast if you try to select from it while it's updating, but you can do it. And it doesn't have to wait for all the selects to clear. So we can update our data now and everything's great. If you're hosting on Heroku, you can use this, well, you can use this rake task anyway, but if you're hosting on Heroku, use a simple rake task and you can update it as often as every 10 minutes using the scheduler plugin. If you need to update it more frequently than that, Heroku's got your back on that too; that's just a different solution. So we're gonna just refresh the materialized view, and we've always got up-to-date, reasonably up-to-date, data. Now, while we're talking about rake tasks and Heroku, here's a rake task that I wrote to pull down your production data and load it into your local database. So then I can benchmark my queries against real data. I can run my app against real data. This might be a security problem for you in your situation, so maybe you can't do that. And that's fine, but maybe this is helpful for you. It's in a gist, and you're welcome to fork it or clone it or whatever.
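SQLite has no materialized views, so the stand-in sketch below emulates what Postgres is doing for you: materialize the view's results into a table, index it, and "refresh" by rebuilding. It's the same store-ahead-of-time-then-REFRESH MATERIALIZED VIEW idea, with my own names throughout, not the Postgres syntax itself:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, phone TEXT, minor INTEGER)")
db.executemany("INSERT INTO readings (phone, minor) VALUES (?, ?)",
               [("a", 1), ("a", 2), ("b", 3)])

def refresh():
    # One-shot materialization: store the query's results ahead of time.
    # Dropping the table also drops its index, mirroring the talk's point.
    db.execute("DROP TABLE IF EXISTS readings_rollup")
    db.execute("""CREATE TABLE readings_rollup AS
                  SELECT phone, COUNT(*) AS readings_count
                  FROM readings GROUP BY phone""")
    # Materialized results can be indexed, unlike a plain view.
    db.execute("CREATE INDEX idx_rollup_phone ON readings_rollup (phone)")

refresh()
db.execute("INSERT INTO readings (phone, minor) VALUES ('a', 4)")
stale = db.execute("SELECT readings_count FROM readings_rollup WHERE phone = 'a'").fetchone()
refresh()  # like REFRESH MATERIALIZED VIEW: data is stale until you ask
fresh = db.execute("SELECT readings_count FROM readings_rollup WHERE phone = 'a'").fetchone()
print(stale, fresh)  # (2,) (3,)
```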
There's also postgres.heroku.com. Somewhere down kind of in the middle of that page, they have a little report where you can see your slow queries, and I think it puts them in order and shows you on average how bad or good they are. All right, let's keep dancing with data, shall we? We're gonna subquery for a field value. This is super cool. So this query here is gonna do a look-ahead. The inner query is gonna do a look-ahead to see what the reading ID is for, basically, an exit event. When did somebody cross the threshold that I call exit? Which is minor zero. So you see, we're selecting from that materialized view twice, and we're using the main query to filter the inner query. You can do that. So I just want the next occurrence where the minor is zero but the ID is greater than that current record's. And we're doing that filtering in there, which is why I couldn't do it with a window function. So then we can build up the sunburst chart to see all the places where people went and how often those paths overlap. Again, just D3. Cool. Let's talk about some other useful things while we're here. If you're the kind of person who likes to read documentation to see what cool things exist, then by all means, check out Ruby's Enumerable. There's lots of good stuff in there. Also, look at Postgres data types. There's an array data type, where you can take some other data type and put it in an array of things. And that's really handy. I'll tell you why here in a second. There's also a date range and a timestamp range data type. So if you were paying close attention, one of those earlier queries did some time boxing, where we wanted to look at something where the start date was whatever and the end date was whatever. You can use this date range type, where it has a start and end date as a single field, a range. And it has its own operators, so you can ask, give me all the records that exist within this date, this point in time.
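The subquery-for-a-field-value look-ahead can be sketched like this: for each reading, a correlated inner query fetches the ID of the next exit event (minor zero) that comes after it, which is exactly the filtered look-ahead a plain window function couldn't do. Same SQLite-via-Python stand-in, my own simplified table in place of the materialized view:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, phone TEXT, minor INTEGER)")
db.executemany("INSERT INTO readings (phone, minor) VALUES (?, ?)",
               [("a", 1), ("a", 2), ("a", 0), ("a", 3), ("a", 0)])

# The main query filters the inner query: for each non-exit reading,
# pull the id of the next exit (minor = 0) with a greater id.
rows = db.execute("""
    SELECT r.id, r.minor,
           (SELECT MIN(x.id) FROM readings x
            WHERE x.phone = r.phone AND x.minor = 0 AND x.id > r.id) AS exit_id
    FROM readings r
    WHERE r.minor <> 0
    ORDER BY r.id
""").fetchall()
print(rows)  # [(1, 1, 3), (2, 2, 3), (4, 3, 5)]
```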
And those range fields understand that, and they have their own indexes. So that's really handy. There's a JSON data type, and then in 9.4 we got JSONB, JSON just stored in binary instead of as text. Indexes are a little more efficient with JSONB also. So if you're familiar with the hstore data type, this is like that, but actually useful. You can do, like, a nested hierarchy, and you can burrow down into it. And with Postgres, you can index anything that you can select from, also. So, like I said, JSONB: more efficient indexes. So you could pull out all of the values for a field as an array, or you could pull out all the keys as an array. And then Ruby or JavaScript or whatever your language is understands how to iterate on the collection, so you're all set. Then we have a UUID data type. So if you're storing UUIDs in a character or character varying field, maybe you want to look at this. It has its own index type, and you get for free validation that it's actually a UUID instead of some goofy thing. So you see there at the bottom a nod to that edge guides Ruby on Rails site. It has some good information on how to use, I think, all of those in ActiveRecord and in Rails. Future you will thank you. These are good things. So, just kind of general how-do-I-Postgres. If you're using OS X, you have Postgres installed on it, but I couldn't tell you what version it is. I haven't ever used it. You can install Postgres and PostGIS with Homebrew. If you're on Linux, it's super easy with either the yum or the apt repositories. Or you could just download Heroku's Postgres.app, and you get PostGIS and all its dependencies for free. There are a lot of dependencies for PostGIS. It takes kind of a while to install. The current version of Postgres is 9.4. Within, I think it was, two weeks of Postgres 9.4 dropping, Heroku was already supporting it, which is pretty amazing. So, database tools, like a GUI interface. There's pgAdmin 3. These are for OS X.
I think pgAdmin 3 is multi-platform. It's free. The key bindings, at least on OS X, are crazy. You've also got Navicat; they've got several different offerings and free trials. And command-line psql. I use all three of those. And if you're not familiar with the EXPLAIN ANALYZE command, you need to be familiar with the EXPLAIN ANALYZE command, so you can run your query against actual data and see where your hotspots are, or you can take that output and copy it into that website and see your hotspots. That's pretty cool. Okay, so to recap: it's okay to write SQL. Really, it's okay. We're gonna be cool. It's okay. When I refactor, this is kind of the order I go in. Step zero is just get over myself and write something. Get out of my way and just start. And if I see that I've started to kind of make a mess of things and it's getting a little crazy, then maybe I'll reformat that with a common table expression. If I have something that I'm using a lot, I'll put that in a view. If I need to squeeze some more performance out, then you've got the materialized view. These are just tools that you have in your toolkit, and if you didn't know about them, now you do. Again, I'm not saying do everything in the database, but it's not a terrible thing when you decide that that's what you want to do. Just expand your toolbox. Insert 'The More You Know' shooting across the screen here. Postgres is awesome. I'm still finding stuff that is just amazing to me, and I can't believe it's in there. So thank you.