I'm Aja Hammerly. I'm Thagomizer, or ThagomizerRB, on all of the Internet. I've tweeted a link to my slides and they're also up at my website, thagomizer.com, if you want to follow along at home or in your seat. I like dinosaurs and I work at Substantial. We are a digital product studio with offices in San Francisco and Seattle. We are hiring for our San Francisco office: Ruby developers, iOS developers, and product designers. If you're curious about that, come talk to me. I would love to chat with you about it.

So I do actually really like dinosaurs. The thagomizer is the spiky part at the end of a stegosaurus's tail, and I'm giving out dinosaurs to folks who teach me something, or ask me an interesting question, or challenge me to a Rubik's cube-off. So come find me after the talk.

So I've wanted to give this talk for a very, very long time. All the projects I've worked on have created gigabytes or terabytes of data, and I realized about this time last year that if I wanted to see this talk, I was going to have to give it myself. So here's what I have. For many of us, physical clutter is a fact of life. No matter how much we work to keep it under control, all things tend toward chaos. I am one of those people, and I bet some of you are too.

My first real job was setting up hard drive arrays like this one, and we were warned to be very, very careful with the 20 gigabyte drives because they were very expensive. Now you can buy a 20 gigabyte flash drive at Walmart and it fits in your pocket. The moral of this little story is: storage is cheap. And cheap storage encourages us to accumulate digital clutter. This talk is about how to eliminate digital clutter and use storage to your advantage.

I was worried that this talk was going to end up being a 30-minute rant, or a long list of "this is how not to do it" stories. I was talking to my friends about this, and someone suggested that I add an example, thereby turning my rant into a case study. So for a case study I did actually need an example, and I was having a really hard time choosing one. Luckily, life provided an example, and it is perfect.

Like y'all, I get a lot of email from recruiters. And as I was really getting stressed out about this talk, about six weeks ago, I got the best spam yet. It was from Office Space. Like, Office Space! So I went to lunch with Aaron Patterson and Ryan Davis, because Seattle.rb, and I told them about this email. I was just, you know, vibrating with excitement: I got recruiter spam from Office Space, guys. And Aaron goes, "Well, did you forward it to Recruiter Spam?" And I had my light bulb moment. I could use the Recruiter Spam database as the case study for my talk. It is relevant. It is interesting. It is somewhat amusing. And best of all, it is free.

For those who aren't familiar with Recruiter Spam, it's one of Aaron Patterson's projects. If you don't know who Aaron Patterson is, I'm talking about tenderlove. Recruiter Spam captures data about who spams developers with recruiting emails. Here's how it works. You sign up at recruiterspam.com (I encourage you all to do this, because this conference network is actually pretty good) and you register your email address. Then, when you get a recruiter spam, you forward it to spam@recruiterspam.com. Aaron's program parses the email and links it with your account, and also with the recruiter who sent it. Under the hood, that receive-and-link step looks something like the sketch below.
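(I haven't seen Aaron's actual code, so this is only a minimal sketch of that flow using the mail gem. The model names, the extract_original_sender helper, and the domain-based recruiter matching are all my assumptions for illustration, not how Recruiter Spam necessarily does it.)

```ruby
require 'mail'

def handle_forwarded_spam(raw_email)
  mail = Mail.read_from_string(raw_email)

  # The forwarder should be one of our registered users.
  person = Person.find_by(email: mail.from.first)
  return unless person

  # In the real app the recruiter has to be dug out of the forwarded
  # body; extract_original_sender is a hypothetical stand-in for that.
  sender    = extract_original_sender(mail)
  recruiter = Recruiter.find_or_create_by(domain: sender.split('@').last)

  Message.create!(
    person:    person,
    recruiter: recruiter,
    subject:   mail.subject,
    raw_body:  mail.body.decoded
  )
end
```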
So then you can log in later and see these great visualizations about how much recruiter spam you're getting. Are you getting more than most people? Who is the spammiest recruiter of all time? I'm going to use this as my example, and I'm going to come back to it again and again throughout the talk.

So, a quick overview. This talk has three parts. I'm calling them guidelines, and in each section I'm going to give some tips and tricks, share some of my experiences, and then show how to apply those tips and tricks to the Recruiter Spam website.

Guideline number one: keep useful data, and only the useful data. What counts as data? Well, let's start with the obvious: the stuff that's in your database. That was easy. And I really want to call out that it's all of your databases. Your primary database, of course, but do you have a secondary database? A data warehouse? A reporting database? All of that's data too.

Your logs are also data. And by your logs I mean your app logs, your DB logs, your email logs, your error logs, and your server logs. All of your logs are data.

Emails can be data. Raise your hand if you work at a company that emails your own employees with copies of all the emails that you send out to customers. Yeah, I'm seeing some hands; probably fewer here than I've seen in other places, but lots of folks do this. A lot of places also store all of their outgoing emails somewhere. All of that becomes data, unmanageable data, rather rapidly. Customer feedback: a lot of places email every single piece of customer feedback to every single employee. I work on a project that does that. More data.

Clickstream data. Raise your hand if you're familiar with the term clickstream; I'm actually kind of curious. Okay, about half of you. For those who aren't: clickstream data is when a website or an app tracks where people are mousing and what they are clicking on during their session, so that you can figure out where people are trying to navigate, any problems they might be having, and maybe improve your navigation or your UI or other parts of your application based on that. I worked at a place that kept about a gigabyte of clickstream data a day and never looked at it. So clickstream data: totally data.

Your backups can also be data. Anything you use from an outside service can be data: Google Analytics, New Relic, A/B testing services, email testing services, Salesforce. Basically, anything and everything that you have a lot of is data.

So back to my point: keep useful data. Well, what data is useful? The data you use is useful (going with the obvious answers here). If you use your logs to diagnose a problem, that's useful data. If you look at the data from your outside services often, that's useful data. I log into New Relic nearly every day to look at how our site's doing, so that's useful for me. If you haven't looked at the data in years, it's probably not useful.

Useful data is also relevant. Some examples of things that aren't relevant: clickstream data from seven UI revisions ago, not relevant anymore. Server logs from six months ago, most likely also not relevant anymore. Customer prefs from someone who hasn't logged in in 10 years, also likely not relevant anymore. Are they even gonna remember their password to your site? Probably not. And a really key point I want everyone to take home is that data has an expiration. That expiration can even be enforced by a scheduled job; see the sketch below.
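(A minimal sketch of such a retention sweep, assuming hypothetical ActiveRecord models and made-up cutoffs. The point is that the policy lives in code and runs on a schedule, cron or Heroku Scheduler or whatever, not in someone's head.)

```ruby
class DataExpirationJob
  # Hypothetical models and retention windows, for illustration only.
  RETENTION = {
    ClickstreamEvent => 90.days,  # stale once the UI has changed
    ServerLogEntry   => 6.months, # nobody debugs with logs this old
    OutgoingEmail    => 1.year
  }.freeze

  def run
    RETENTION.each do |model, max_age|
      # Delete everything older than the window for that model.
      model.where('created_at < ?', max_age.ago).delete_all
    end
  end
end
```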
Is the stuff you're keeping around past its prime? Is it actually useful?

Another thing that can make data useful or not useful: some data is only useful in aggregate. Take page views. Knowing that at 7:57.5, Susie in Des Moines, Iowa rendered your page and it took 15 milliseconds is probably not a useful data point in and of itself. But knowing how many people tried to render your page in that hour, that could be useful. The aggregation makes it useful. Likewise, knowing the mean time it took to render your page in that hour might also be useful.

Another category of useful data is data that you will use soon. And by soon, I mean it is on the schedule. It is no more than one release out, and you know that you are going to use the data post-release. You're going to actually have a use for that feature.

I wish I didn't have to include this, but it's a reality for most of us: sometimes the data that's useful is the data for CYA. This is the stuff you have to keep and hope nobody will ever look at, but you need to keep it around for reasons like Sarbanes-Oxley, HIPAA, and COPPA. Any sort of financial data: you should really be keeping your financial data, and preferably you should be keeping it in an immutable state. It would make me happy if more people did that.

Whenever you talk about removing data, you're going to have a bunch of people go, "Well, what if?" What if we're going to do data mining on this data in three years? What if we actually do need to look at the server logs from nine months ago? If someone's doing this, they're being a what-if fairy. And your job, as someone who believes in the conscious cultivation of data, is to destroy them. If you are saving data for some unknown future, for some what-if, by the time you get there that data is not going to be relevant anymore. I virtually guarantee that; see all the previous slides up to this point. It's okay to not deal with the what-ifs right now.

So how do we apply this to the Recruiter Spam website? Let's start with the database; that's where most of the data lives. Here's the schema. The raw emails come in and they're stored in the messages table. The messages are parsed and put into the parsed messages table, and those parsed messages are linked to a recruiter. And here's a rough count of how many records are in each table. You'll notice that this is not actually a giant pile of data; this is a rather small pile of data. Luckily, case studies don't have to be big.

If you look at it carefully, you'll notice that most of the tables are either essential (we need people, that's the basic user account) or they have relatively few records. So the two tables that stood out to me initially were the raw messages and the parsed messages tables. We've got a big discrepancy there, about 1,500 records. Are the messages that are raw, but that we haven't been able to parse, useful? Probably not. We should probably get rid of them. Likewise, does it make sense to keep the raw body and the parsed body? In the raw table we're actually keeping the text body and the HTML body and the plain body. We've got the same information in three or four places, and it probably doesn't make sense to keep it there. It's wasted space, and it makes it harder to understand what's going on. So I'd probably get rid of those as well. Also, we have this experiments table down here that has one record, is not connected to anything, and as near as I can tell is not being used, so we should probably kill that off too. Taken together, that cleanup might look something like the migration sketched below.
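(Here's that cleanup as a rough Rails migration. The column and table names are my guesses from reading the schema, so treat this as a sketch rather than something to run as-is.)

```ruby
class RemoveRedundantData < ActiveRecord::Migration
  def up
    # Raw messages that never parsed into anything aren't earning their keep.
    execute <<-SQL
      DELETE FROM messages
      WHERE id NOT IN (SELECT message_id FROM parsed_messages)
    SQL

    # The same body text is stored three or four ways; keep one copy.
    remove_column :messages, :html_body
    remove_column :messages, :plain_body

    # One orphaned record, no associations: goodbye.
    drop_table :experiments
  end
end
```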
Recruiter Spam doesn't have a lot of the stuff that fits in the other data categories, like logs and emails. It's on Heroku, and Heroku deals with managing the logs and log rotation. It doesn't send any emails, and the emails it receives are stored in the database in the way we just discussed. And there are no outside services in use, so we don't have to worry about that. So there's actually not much more to clean up to make our data useful here. That's how we can apply the "keep useful data" idea to Recruiter Spam. On to guideline two.

Once you have useful data, you need to make it usable. Usability is something that's pretty much discussed only by front-end folks, and I think the back-end folks need to get on the usability bandwagon. It's gonna look different, but usability is important to everyone.

The first thing we need to do is make our data accessible. If your data is accessible, no one's gonna wonder "how do I get to X?" more than once or twice. We need to make it easy for anyone in your organization to get out what they want, and not have to bug you to get it.

One way to do that is to make it centralized: all of your data of a specific type in one place. The classic example of this is log aggregation. There are tools you can pay for (Splunk, Loggly, Logstash, Logentries) to help get all of your log data in one place and make it easy to access by anyone who needs it. If you don't wanna pay for a service, and lots of folks don't, it's really easy to write a script that uses scp or rsync to pull all the logs onto a centralized server every couple of hours, something like the sketch below. I worked at a place that did this, and we put all the logs on a server inside our corporate network, not on the production machines. And it was great. If I needed to debug something, it was a single command from my terminal, and I didn't have to have production credentials or log into production to do it.
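(A sketch of that pull-everything-to-one-box script in Ruby. The hostnames and paths are invented; cron it every couple of hours on the log server.)

```ruby
#!/usr/bin/env ruby
HOSTS    = %w[web1.example.com web2.example.com worker1.example.com]
LOG_DIR  = '/var/log/myapp'
DEST_DIR = '/srv/logs'

HOSTS.each do |host|
  dest = File.join(DEST_DIR, host)
  Dir.mkdir(dest) unless Dir.exist?(dest)
  # -a preserves timestamps and permissions, -z compresses over the wire
  system('rsync', '-az', "#{host}:#{LOG_DIR}/", "#{dest}/") or
    warn "rsync from #{host} failed"
end
```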
Likewise, usable data is searchable. Everyone should be able to answer a simple question like "what errors were logged between 3 a.m. and 5 a.m. yesterday?" or "who upgraded their account in the last week?" And ideally they should be able to do it without bugging a developer. That'd be awesome.

Usable data is also idiot-proof. A lot of people get twitchy about accessing your data because they're afraid of ruining everything. So make it so they can't mess stuff up. Make things read-only. On one project we worked on, we restored the production DB to an internal test machine every time we took a backup, so every six hours, and we slapped a really, really ugly UI on the front of it that allowed ad hoc queries. And it was great. Anyone could ask any question and build any ad hoc report without any risk of affecting the production database. And we used it a lot.

Your data should also be comprehensible. You can do that by cleaning up cryptic column names. For those who are using services or have legacy data stores where you only get eight characters: clean that up when you give your data to other people. If you use a database that lets you annotate columns, do it. Explain what this is and what it's trying to show. Everything needs units. I have a scientific background, and data without units makes me kind of twitchy. Specifically, things like: are the load times in your logs in milliseconds or seconds? It probably depends on what version of Rails you're using, yeah. What time zone are your timestamps in? If they're in UTC, does it make sense to convert everything to local time? Does it make things easier on your employees? You should explain things. If something's confusing and you can't make it less confusing, add a legend or a key. You don't want folks asking you, "Hey, what does that mean?", because it gets annoying. Worse, you don't want folks assuming they understand it, understanding it wrong, and then drawing false conclusions.

Finally, usable data is cleansed. That means you've pulled all the test data out. I talked to someone about this, and he works on a team where every single table has a Boolean column that says "this is test data." It makes it really easy to pull all that stuff out when they're doing their analytics. If you need to anonymize your data, get that set up so it's automatic. Everything should be automatic, so that anyone can do it.

So how does this apply to Recruiter Spam? Well, honestly, the Recruiter Spam data is not very usable right now. Getting a copy of the production data is a pain: it's running on Heroku, it's in Postgres, I don't normally run Postgres on my local machine, and I can never remember how to get data out of Heroku without having to look it up. Also, getting the app running locally is a problem. Here's the Gemfile. You probably can't see it very well, but the important thing to take home is that it's running a branch of Rails that doesn't exist anymore, because this is one of Aaron Patterson's projects, and he was running a test branch of Rails to exercise a use case and see if it actually worked. I still don't have the tests running, and I've been working with this on and off for about two months.

So, some improvements I would make. Write a script to make restoring the production backup easier, so that I can do it in a single line, because I'm going to forget how and I'd rather just write it down; see the sketch below. Likewise, fix the Gemfile so that I can actually use dbconsole. I hacked some stuff together and got to this point, but the tests are still failing and I still can't run it locally. Also, I found some of the models, and their naming, confusing; a couple sentences of comments in the schema file or in the model, or maybe renaming things, would be much better. There's a model called addresses, and it took me 45 minutes to figure out that it held the email addresses people signed up with, because the name just didn't make sense to me.
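(Here's roughly the restore script I have in mind, leaning on the Heroku pg:backups commands and pg_restore. The app and database names are placeholders, and the exact CLI invocations may differ depending on your toolbelt version.)

```ruby
#!/usr/bin/env ruby
APP      = 'recruiterspam'              # placeholder app name
LOCAL_DB = 'recruiterspam_development'  # placeholder local database

system("heroku pg:backups:capture --app #{APP}")  or abort 'capture failed'
system("heroku pg:backups:download --app #{APP}") or abort 'download failed'
# --clean drops existing objects first; --no-owner sidesteps role mismatches
system("pg_restore --clean --no-owner -d #{LOCAL_DB} latest.dump")
File.delete('latest.dump') if File.exist?('latest.dump')
```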
So, onwards: guideline three. This is gonna get a little mathy; feel free to do something else if math is not your thing. Use your data. Dammit, use your data to drive improvement. You probably hear this from lots of people. I'm gonna say it again because it needs to be said again. Justify your decisions: every decision you make should be justified from data that you have. Data is cheap, we all have too much of it, so put it to use for you. Use tools to make ad hoc reports for the folks who don't like looking at giant piles of numbers; see my talk here last year for ideas on how to do that. And challenge your assumptions. I make assumptions all the time about how things work, like "clearly this is a bandwidth issue," but I've learned through bad experiences that I should challenge those with data.

So when someone says to me something like "people have trouble inviting folks to our site, so clearly we should move the invite button slightly to the left," or "most users only log in three times a week, so we don't need to handle that particular case," I'm like, well, we have the data; let's actually go challenge that. I end up saying "that makes great sense, but let's prove it" a lot. And when I do that, I end up tweeting things like this: yay for data, data proved my intuition wrong. Followed, about 45 minutes later, by tweets like this. This is what happens when you don't understand your data because your data is not usable. That's what bit me here. You can also use your data to come up with experiments to run, or, if you're pushing a new feature, you can build in reports so that you actually know whether the feature worked, instead of making assumptions and just kind of guessing.

So this is where the fun part happens with Recruiter Spam. How can we apply all of these lessons about using our data? Well, Aaron was really nice to let me use this data set, and I'd like to help him out and make the site better. I was really bugged by the fact that we had those 1,500 unparsable emails, and I'd like to make it so there are fewer of them. So what can we do to reduce the number of unparsable emails? Well, I don't actually know what feature to add, so let's look at the data and see if we can find some trends.

What I did is I just went into the Rails console and printed out all of the subjects for all of the unparsable emails. Yes, this looks a little bit like The Matrix, but if you take a step back and squint, you'll notice there are a lot of unprintable characters there. There are also a lot of non-English characters. If you get a little closer, as the folks in the front row might notice, there are a lot of naughty things in there about enhancements you might be getting emails about, and a fair number of production error reports and stack traces mixed in. All of those are probably not email from recruiters, and if they are, they're probably not email from tech recruiters. So maybe we can get rid of those. And just for comparison, here are the subjects for the parsed emails. If you look at this, the words startup, senior, and engineer are all over the place, and those are really good ways to detect whether something is actually recruiter spam. So perhaps a good feature to add would be detecting what is just plain spam, the kind that Gmail's spam filter or any other email spam filter would hide from you, as opposed to recruiter spam, and just not bothering to put those in the database at all. Flat-out reject them. I also noticed while I was doing this that one person has sent three quarters of the unparsable emails. So maybe his account should just be turned off.

Another way you can use data to improve your process is to start with a question. One of the questions I had really early on was: which months have the most spam? And again, I'm going real low-tech here: Rails console again. We're gonna grab all the messages, group them by the month they were created in, sort those, and then take the counts. In this case the counts end up in a hash where the key is the number of the month, like 1 for January, and the value is the number of emails received in that month. In code, it's something like the snippet below.
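(Reconstructed from that description; the model name is my assumption from the schema, and this is console-grade code, not production code.)

```ruby
# Group every message by the month it arrived, then count per month.
counts = Message.all
                .group_by { |m| m.created_at.month }
                .sort
                .map { |month, messages| [month, messages.size] }
                .to_h
# => { 1 => ..., 2 => ..., ..., 12 => ... }

# Ryan's trick from the next step: tab-join the values and
# paste them straight into Numbers for a quick table.
puts counts.values.join("\t")
```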
And I was showing this to Ryan, and he pointed out that if I just joined all the values together with a tab, I could paste it into Numbers and get a pretty table that looks like this. But tables aren't very interesting, because again: giant pile of numbers, and we are pattern-matching humans. So we make a chart. And if you look at this, you'll see that March has the most recruiter spam, very closely followed by February. And there's my answer to my question. It really did not take me more than a couple of minutes, minus some formatting to make things sort of pretty.

I talked about this talk to a bunch of folks, and I was at a poker table with a guy who is a data scientist for a national lab, and he's like, "Tell them they have to start with a model. You need a model. You can't just go look at your data, try to ferret out trends, and assume that that's how the world works. That's not how statistics is done." So let's start with a model and see how we can use this technique. Here's one of my assumptions about how the recruiting world works: recruiting budgets and targets are set quarterly.

All of the stats and data science classes I've taken have said the first thing you have to do is draw a picture. So, same picture we just had. It's a picture. And I'm gonna augment it slightly: that line is the mean across all the months. You can see that a couple of the months are over the mean, specifically February through May, and then July and October. To dig into this more deeply, we're gonna have to do statistics.

Statistics. If you haven't taken stats, I highly recommend you at least get a basic understanding of it. You can detect a lot of, well, frankly, bullshit that's going on in the world if you just have a basic understanding, and it's really, really useful. It is one thing I wish I had not slept through as much as I did in college. And one of the things I did learn when I wasn't sleeping is that when you get data, the first thing you need to do is get some basic descriptive statistics.

So let's do that in Ruby. I'm gonna use the statsample package. This is part of SciRuby. SciRuby would love more contributors, and I hope to dig into it more after I'm done with this talk, because it's a little rough around the edges but it's really powerful once you figure it out. (So, statistics apparently makes the building very unhappy with us. I'm gonna have to go through this really quickly, so if you get confused, come find me later and we will go through it at a slightly more leisurely pace.) I'm gonna take that same counts hash we looked at earlier, take its values, the number of emails received in each month, and make them into a scale vector, which is a numeric vector in statsample. And if you've ever used MATLAB, R, or Mathematica, you know there's this summary method you can call that prints out cool stuff. In statsample, this is the cool stuff: it tells us that we've got 12 values, all 12 of them are valid, we have a median of 533, and so on and so forth. And you know, this seems about okay, pretty cool, pretty understandable. The standard deviation's a little high. The whole thing looks something like the sketch below.
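(A reconstruction of those few console lines, using the statsample API of that era; newer versions have since moved toward daru, so to_scale may be deprecated where you are.)

```ruby
require 'statsample'

# Turn the monthly counts into a statsample "scale" (numeric) vector.
vector = counts.values.to_scale

# Rough equivalent of summary() in MATLAB/R/Mathematica: n, valid n,
# median, mean, standard deviation, and friends.
puts vector.summary

vector.mean  # average spam per month
vector.sd    # the standard deviation that looked a little high
```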
But we need to get back to our actual question, our actual model. So here's my model again: recruiting budgets and targets are set quarterly. We're gonna do some inferential statistics, and for inferential statistics you need a testable statement to make into a hypothesis, and "budgets are set quarterly" is not testable. So here's my first attempt at making it testable: there is more recruiter spam at the beginning of the quarter than in the rest of the quarter. That's getting closer to something that can be quantified, but let's get a little more specific: the first month of a quarter has more spam than the other months. Okay, even more specific; I'm gonna get a little mathy now. The average spam count in months 1, 4, 7, and 10 (that's January, April, July, and October, the first months of the quarters) is greater than the average spam count in the other months.

So now we're gonna get hardcore mathy, and the first thing you need to know about mathematicians is that they like to abbreviate things. So MSC is now "mean spam count," and I break out the Greek: MSC(m ∈ {1, 4, 7, 10}) > MSC(m ∉ {1, 4, 7, 10}). If you don't speak math, this says the mean spam count for all months in the set {1, 4, 7, 10} is greater than the mean spam count for all months not in that set. This is going to be my hypothesis, specifically my alternative hypothesis if you are a stats person. And if you're a stats person, you're also going, "That's a comparison of two means." If you're not a stats person, you're going, "Why do they care about that?" The reason is that there's a lot of published, relied-upon tests and algorithms for doing statistics, and several of them deal with the comparison of two means. All "comparison of two means" is saying is: you've got two averages, and you're trying to figure out whether they came from the same data set or from different data sets. In this case, our two data sets are our two collections of email, and we're gonna do a Mann-Whitney U test.

Why are we doing that? Because my stats book said so. Here's a secret about stats: you can usually go to any stats book and follow a basic flow chart (I have this many pieces of data, I have this many means, I don't know if they're normal) and it says "use this algorithm." That's exactly what I did. The important thing about Mann-Whitney, and the reason I chose it, is that it makes no assumptions about the underlying normality of the data. If you don't know what normality means, it means "bell-curvy." And I have no idea if this data is bell-curvy, so I chose an algorithm that doesn't need it to be.

So, back to statsample. We're gonna make a vector F that holds the counts for all of the months at the beginning of a quarter. There's a crazy mod trick there: basically, anything that's 1 mod 3 is the beginning of a quarter, and if you do the math it works out to 1, 4, 7, and 10. Then I make a vector R for all the months that are not at the beginning of a quarter. And then I say, "Hey statsample, give me a Mann-Whitney test," and I instantiate the Mann-Whitney test object with my two vectors. Again, there's a handy-dandy summary method that prints out something that looks like this. If you know the Mann-Whitney test, you can see some interesting data at the top, but the real thing you care about is that P value. The whole test looks something like the sketch below.
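(Again reconstructed from the description, with the same era-specific statsample API and the counts hash from before.)

```ruby
# Months that open a quarter are exactly the ones that are 1 mod 3:
# 1, 4, 7, 10. Split the counts hash on that.
first_months = counts.select { |month, _| month % 3 == 1 }
rest_months  = counts.reject { |month, _| month % 3 == 1 }

f = first_months.values.to_scale
r = rest_months.values.to_scale

test = Statsample::Test::UMannWhitney.new(f, r)
puts test.summary  # includes the U statistic and that all-important p-value
```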
I'm going to hand-wave this in a very large way because I've got very little time: basically, the P value is the percentage chance that nothing interesting is happening. And by "nothing interesting is happening," in this case I mean that my two data sets are not actually all that different. And 93% is a really, really big P value. So the chance that anything interesting is happening here is pretty much nonexistent. So I was wrong. My model was wrong, and data is awesome. That is the moral of this story.

So why is that? Why was I wrong? Well, my model is incorrect. Why is my model incorrect? Short answer: the first half of the year. In the first half of the year, crazy things happen. In the second half of the year, it actually looks like the first month of each quarter stands out. And I did rerun this test on just the second half of the year, and that P value was significant: there is a difference for the months July through December. But in the first half of the year, you have your brand-new 2014 budget that you really wanna spend before it gets taken away, you have campus recruiting, and you have all sorts of other reasons why the first half of the year is gonna be different.

So, I'm almost out of time. Real quick, some other ideas for Recruiter Spam, things that I'm going to test. Some recruiters are spammier than others, and by spammier I mean they send the exact same email to the exact same people over and over and over again, or everyone in a given Ruby Brigade, like Seattle, gets the same email at the same time. Another hypothesis: all recruiter emails are basically the same. You could test that by doing some distance metrics on the language used: "Hi, I am recruiting for a company I will not tell you about that is innovative slash startup slash stealth, and they use [list of technologies that everyone says they use]." And you could also use this data to figure out which technologies are hot in which regions, if you could figure out where a given person or a given recruiter was from.

So thank you to the organizers for letting me give the last five minutes of this talk, and also for throwing this fantastic conference, which I've enjoyed attending three times. Here's my contact info again. Got any source?