 Okay, my name is James Marka and my talk today is called that sequel looks pretty complicated where the tests I was going to name it can't test that Because I had a really bad thing. I had like visions of MC Hammer and so on but I forget how to test it so that would be a lie Fonce a little small but anyway database. They're super useful. I'm a user of databases. I'm not a DPA. I tend to abuse them and they get really messy and It's becomes like the preserve proverbial spaghetti code that your parents warned you not to do so my motivation for this talk and for My work that I've been doing in the last Several months is a cleanup and test my sequel code So just give context. I want to talk about you know the problem that I had That that led to the disastrously long complex code If you look here, I think the mouse works. Yeah These little orange or yellow dots are detectors along highway. This is a snapshot of Fresno in California and So these we know these points. These are detectors. We don't know anything about the highways We're working for the California Department of Transportation. They didn't have really information on the roads I don't know why that's sort of puzzled us, but so we used Open-street maps. I'm just distilling this down to you know the details of what our problem is what we want to do is snap each one of these little detectors onto a roadway and What we need to find out is how long the segment is that? This detectors response before so we think about a car going along a highway if it crosses the detector you want to say Okay, this is a half mile segment of highway that I just counted a car on so then when you can say I've got 10 cars passing a half mile That's five vehicle miles traveled and that's what everything is all about in transportation counting vehicle miles traveled The problem of course is that these detectors come in and out of service over time So if one downstream goes out, I need to know okay redo this this query and say you know This one is is now dead. So this detector effective length goes halfway to the next one down the line So I need to have some sort of dynamic way to go in there and find out how long are the segments so My colleague and I wrote a lot of code This doesn't look that bad right here except that it's really long and There's about four or five files worth of stuff like this and this is just one little bit of it Basically, what we had to do is take open-street map clean it up Pull out the freeways and open-street map get the freeways from Caltrans data get the directions right Go fix some open-street map files that were wrong and do all kinds of crazy stuff So it was a start at the beginning work all the way through and at the end we have a pretty good output But and and it works pretty well But it's really ugly. So like my favorite Italian cookies Brutti. My morning Brutti means ugly in Italian and those are ugly cookies So that my talk objectives here to give a real-world examples of using PG tap and sketch Which is what I've been using to try to bring order to the chaos I have two Examples I'm going to use the primary one is cleaning up after a minor minor sequel disaster because It just lended itself better this talk and I'm gonna talk a little bit about modularizing and testing that giant sequel statement and it's related stuff, but When I was looking at pulling together the talk that was actually kind of tedious because it's just sort of okay Cut this move it over here clean it up So my hope is that others can learn from my mistakes So my name is James. I said I'm a transportation researcher at UC Irvine My friend and I are also starting off our own consulting firm because our research money is tending to dry up Not because we do bad research. It's just the way it goes politics and stuff I've been using post-credits since 2001 or so. I Needed a geo-enabled database to track some incoming GPS data. We're getting from Cars we had a little detector box we stuck in cars with a cellular digital packet data modem So it was really slow About as slow as the old dial-up modems and really flaky, but kind of cool back in 2001 Just want to acknowledge Caltrans both dry ass eyes what they currently call themselves with Department of Research and Innovation and District 12 and the California Air Resources Board who funds a lot of our other work So the context that I'm going to talk about today is the California vehicle activity database We have to estimate from data vehicle movements on state roads and highways This is for a resources board. They want to know how many trucks on the road How many cars and so on so they can get an estimate on emissions So again back recall how long is that segment of roadway if I can measure you know trucks on that roadway? I know I've got you know a vehicle mile of trucks so to speak We've got our input to our sources of 10,000 inductive loop detectors 100 plus way in motion stations. There's I think about 180 total 150 actually care about but of those only about 110 actually work And then we have loop detector counts from sorry a pneumatic tube detector So this is what a loop detector looks like you've all seen them and this is a bad picture I apologize, but they're actually six detectors right here three in this lane and three in this lane There's a little faint circle so as you drive over one of these detectors if you've done any sort of electrical engineering If you pass a current through a loop You get a B field going up a magnetic field So when you run a car over that B field you induce reverse current and that triggers a little voltage that you can measure on the side Of the road so the way these things work They say there's a car over my head and there's not a car so it's zero one binary counting. They're really robust I mean highways are incredibly abusive environments. You've got trucks pounding on these things You got wild heat swings and temperature so they have to be robust people have been talking about putting out Let's put video cameras. Let's put other stuff And they're all more expensive and don't work as well or as long and these don't work really well very well at all But they're still better than anything else out there, but it's still 60s level technology So the way they count the way they use these they aggregate up the data every 30 seconds So you get a count of cars in 30 seconds and they also have the amount of time that's been occupied for that 30 seconds So a traffic jam a car is stopped over detector even occupancy of one Because it's 100% of the time occupied this is the other technology I'm using the way in motion Detector station so there's two loops here so you can get a speed trap going on first loop triggers time at second loop triggers That's how fast the things go in and then you've got left and right bending plates So you can measure the weight on each axle and also by counting the actual hits You can get the measure of the distance between axles because you know the speed of the truck These are really accurate do pretty well They tend to go out of calibration and then Caltrans would just shut it down for a month while they recalibrate it The problem with these as far as I'm concerned as a data guy Caltrans isn't doesn't care about cars. So they throw out all the cars They only track trucks which drives me crazy because you know, I want to do some modeling work I said, well, I have no baseline for cars. They're like, well, oh, well, we don't want that This is the lowest tech technology putting out pneumatic tubes again These are done on the rural roads or actually any street. You'll count them Every street in the United States has to be counted every three years or so by official decree And so you'll see pneumatic tubes out on the road. You'll drive over them That's part of the highway performance monitoring system project So that that's also kind of it's there. I'm not gonna talk about it now. But anyway So the first project I'm gonna talk about is a sequel disaster. This is gonna talk about the problem and then how I fixed it Trying to go quickly through this so in 2003 I was asked to put up a map of Detector points so someone one of the other professors could display His safety results so I figured it's pretty easy I put the the loop detective data into postgres I already knew how to do that make points I knew how to do that put him up on a map Which I knew how to do and then do a little clicky clicky thing on on the website to load up this safety data The only problem, of course, they wanted me to do it in Java and which was really horrible, but whatever So I'm gonna talk about the first two. This is what the data looks like. I apologize. It's too small to read in the back I'm sure but there's basically IDs and the most important thing that I cared about was there's latitude and longitude That's it was in a certain projection if anybody knows anything about Maps and stuff there are projections you can use so post GIS is awesome Let me just say I want to transform from one projection to another There's a huge table that you get in your database when you load up post GIS that has all the SRIDs You can possibly worry about and if you need a different one you can define it using You know a standard way of defining a projection so The original data came in 4269. I wanted to convert it to 4326 Which is what you need for Google Maps. Otherwise you get a weird shift And then I need to join tables. So this is kind of what it looks like I didn't really need to put two different Geometry point tables. I probably could have stuck it in there nowadays But back when I did this post GIS had a limitation. You can only have one geometry item per table So I said alright, I'll just put it as an external thing. I read a book that said there's a thing called normalized tables I said cool. This is normalized So I have some Perl code to load it up. I see I have a create an ID populate the metadata table and then I create a geometry ID populate the geometry table This is a function I call but it just calls out to the the post GIS functions Then I populate the metadata in between At the time I didn't know how to do this all in one go. So I said well, there's only 20,000 odd Who cares? I'll just do them one at a time So I decided you know get one row of metadata do all these steps and each step is you know going back and forth the database It's slow, but I didn't care about speed. I cared about being able to make sure it did the right thing I didn't know how to test at the time. I had no idea what testing was at the time other than well, it works see That was my extent of my testing So then I did this transform project Project geometry to 4 3 2 6 and that code looks like this So this is an example right here of what the the post GIS functions are So I selected as text the output of the as the transform function And that text came back to Pearl so I could you know move around with it I didn't want to get a binary object back from the database and then down here I say all right, let's send that back the database into the new projection geometry table and then With that new geometry I can create the joint table So years go by 2003 to 2010 my daughters are growing up and the Calvated Project starts and The the client wants way in motion stations thrown in this mix. I said great same idea You know same an approach, but I'm going to use those same Geom points tables to save the data and this is kind of important and also the amount of time involved It's kind of important because you know you forget these things over the years. So This is I don't see anything wrong with this myself Maybe you guys out there if you're good at database of that stupid and naive, but So I've got a joint table here pointing at my geometry points table and same thing for the other points over here So this is VDS points of win points both point to just generic points that get created in this job points tables Same approach do it in Pearl Years go by my daughters grew up some more and we're back in 2015 And the Calvated Project ending so this year. I'm delivering it to the client So it's time to transfer the data over there But the table is too big to download. I said well, why don't we just learn by doing and So my counterpart is you know running through my code downloading the data from California Getting my code. I'm cleaning up my pro code as I go because you know from back in 2010 and sometimes older 2003 and so on So he can do all the stuff and he learns how to use it himself and that he becomes the expert at ARB for running this project So What to do we go back and forth running our code running pearl running JavaScript Etc. I doing whatever we had to do and then I got to the Wim date apart and I said all right We processed all of our VDS data We have all this other stuff and I was really getting sick of dealing with my old pearl code so as you know, it's only 150 sites and I just copy this stuff over there I had a test in one of my JavaScript functions that that actually exercised R and R is a statistical processing language, but so it had to go in there and make sure that the R code worked So I had mocked up a Couple of tables and I said why don't stick all the real data for for the way in motion stations so I did and has some real other stuff in there so I could create a little Imputation tests that would say if it ran and finished then we're all good to go So I said to him well, why don't you just go into this file over here see those two sequel files? Just you know psql dash whatever dash f that file and then you'll load up all the Wim data So all the Wim data meant you know this Wim stations table all these other tables that you can't see and also I included these Wim points tables and that was not really smart because now these Wim points tables Pointed to GIDs that are in my database not his so he had created a whole bunch of new points in a completely different order and He actually had no Wim data at all on that table, but I had perfectly valid GIDs pointing to completely wrong geometries, so we went along for a couple days and then We were doing something else for the Wim data I said well Why are those here results are weird? Why is also weird and I looked at some of the detectors said this detector is down here in Ventura County I know what that is because I drive there when I go to a mother-in-law's house And it's pointing to a detector in District 4, which is San Francisco is being the closest loop detector So you know my distance calculation is either the horribly wrong or something's going on And I figured out what the problem was I had done this so Broke the join table, and I had to fix it So I've been using PG tap to try to clean up that other giant sequel stuff I said well hey, this is a perfect application for it, and actually works really well for this talk, too So I had to prove that was a problem first You know PG tap says yes these things fail and then when I was all done with it I mean we say yeah it worked and so I could ship it off to him and have him Proved to himself said it worked, and then I use sketch to organize the fix so PG tap was done by David Wheeler who is at this conference, and I've already said thank you to him because he's awesome this stuff is great Highly recommend everybody using it all the time for everything So tap is a test anything protocol and PG tap lets you run tests in your database So it actually will put an extension in your database For the test I'm running actually puts it extension in only during the duration of the the test itself So from beginning to a rollback block So it rolls back that extension if it's in there, sorry not in there already And it lets you just sort of throw some database code in there run the tests, and then it wires up You know the usual test things you expect like you know is it okay does it pass etc So it's a lot like the pearl test anything protocol type functions if you ever programmed in pearl The basic purpose of PG tap is to say either this is okay it passed or this is not okay. It failed That's it. That's all it wants to do So I there's there's a simple test script you can run You just make a little sequel script and then run it with PG prove or we can use X unit style test functions But I really don't know anything about that. So I don't do that. I just use a simple test script with PG prove It has also advanced features You can integrate with any tap a testing system. There's one in JavaScript. There's obviously a pearl one I think that's where it started, but they're in pretty much any language you want to use They'll be in there And it has X unit style usage which again I don't really know what that is and it's compatible with integration servers like Hudson and Travis and so on which again I don't know what those are, but it's pretty cool. You can see little icons on github that say test passing I think that's how they do it. I don't know. I need to look into that So my testing needs I need to make sure these database tables exist. You know those five that I talked about Verify the primary keys work for relationships are there and that each women's station has a geometry and that the geometry actually matches the metadata Is that I knew I had all those five tables. I know the women's stations all work, right? I don't really need to check those things, but I'm about to mess with the database So I want to make sure my tests don't reveal if I break anything so One last thing I run it with PG prove So just to make sure this is clear in case it's on the internet So CPAN M dash dash pseudo tap parser source handle. That's in the PG tap documentation You don't need to I'm sorry that the fonts are just a little smaller and I expect them to be but CPAN M is CPAN minus Has anybody wrote program on pearl anymore? so CPAN is The grand daddy of all package managers package managers CPAN M is CPAN minus. It's a little bit smaller and Easier to handle so I use that PG tap so testing tables There's two approaches here. So these are the kinds of functions you might use to test a scheme or a table. There's has schema and Those checks whether or not a database has something so in this case if the database Table whim stations is in the scheme of public then the test passes if there is no database Wim station, sorry table whim stations the test fails The R variants are more restrictive So they things they do things like check that all the tables are exactly this list you said and All the schemas in the database are exactly the list you said so they're not so useful to me because You know if I if I'm hacking around a database and I create a new schema to just have some other things in it I don't want to clutter the main public or any my other schemas then this test will start failing but if you're a DBA or if you're You know running stuff for other people and you want to make sure they don't mess with things and break important stuff Then you can run this and say hey, you know at the end of the day someone created a new schemas Let's delete that And teach them bad user I Actually had some so at UCI I'm I'm the guy in the department who runs these stupid database things and sets up the websites because you know we're transportation guys and So I actually had to manage the database for a little while for some people I put it on a server I didn't care about because it was I knew it's gonna get host and they were just shoving all Kinds of junk in there and I wish I'd known about this because I would have said why are you putting things in public? What are all these tables and put it in your own schema? But I couldn't turn off access to public because I didn't really want to do that to them that would have been rude Had I known I would have been able to use this So I want to do a quick demo of the some of the basic tests here Hopefully this all works out So I have a Database that exists and I'm going to do PG prove I already did this for for the talk because hey I actually have a little bit of a mocha JavaScript program that as part of the scaffolding for the test creates a database So I also have a teardown script after by just deleted the teardown part And it just creates databases filled with data that is all broken it wrong So anyway PG prove Dash D actually don't have it up there. That looks good. Okay, so What are you trying to test here test basics so Want to show this so this I had an extra I had one of the tables are Commands in there because I want to show that it fails. I have other tables than these Four or five listed tables here. So if you go into emacs, you can change that Delete this tables are apart or comment it out Oops and now the test should pass There we go. So there's lots of other tests as well. I think we're doing okay on time. I think I just Completely jumped around in my click the wrong thing That's reasonably close So this table relationship functions as well There's a is primary key has primary key is foreign key has foreign key indexes are and so on These are my primary key tests. I don't think it's important to go through them Same thing with indexes and for the indexes actually do use the indexes are variant because in that case It's my table. I want to make sure I have all exactly those indexes I don't want to have extra indexes laying around So I was using that for that test It just when it goes database-wide it begins to get a little bit much for me So then the foreign keys are tested that the fk ok is is a super cool function So there's all these things that you do you have a primary key do you have foreign keys and so on but this one wraps everything up And one so it says, you know the first line there Define your foreign key schema a foreign key table foreign key columns and then your primary Key schema table and columns so it's basically says does this primary key point to a valid foreign key in another table? So that's great. That's exactly what I want to run, right? So these are the tests that I have the fk ok My WIM station site has to point to and be pointed from the WIM points site Same thing with geometry points You know that in both projections and so on so I can run that test And it should fail because Stuff is broken right now in that database. So I don't think it's fk's which one is it because it is fk's Yep, so it failed two of the five subtests because of course it doesn't have The the geometries don't exist and that that one table Because that's why I built this database to say okay do things fail So I'll click on this and try not to click on that there You're not limited to simple tests with PG tap you can do lots of complicated things It doesn't just say a or b is it there is a bad you can also say result equal and I use resist all the time You can put arbitrarily complicated sequel and a and b and then those The results are compared row by row There are some caveats there because it Executes the sequel you have to use prepare statements usually you can put quotes around it too But usually that gets too messy you'll have a quote inside of a quote and then I always forget how to escape quotes and sequel statements So it also does item by item comparison So if you select a huge table of things you want to make sure it's sorted in exactly the same order as What you expect your other side to be because it's going to compare one by one And if you just rely on the order it's going to fail or it might pass for a while and then fail later some from you know Because the ordering changed So for example, you can say do I have enough entries does every Select the site number from the WIM stations I'll get a list of site numbers and then I select the WIM ID from the WIM points They're not called the same thing and that's just Because of an idiot back when I wrote this but so site numbers match up with WIM IDs all the way down If that test pass and of course you do the same thing for both projections