I like it when people tweet during my talk because it gives me something to do when I'm coming down from all the adrenaline. The slides are up for this talk at thagomizer.com, thanks to conference Wi-Fi that actually works. And I like dinosaurs a lot.

I work at Google on Google Cloud Platform. The real quick pitch, because they paid for me to come out here: we have VMs, we have container-running infrastructure, we have storage, we have all sorts of cool things, and I'm really excited about all the machine learning and big data stuff we've been pushing. If you have questions, or a website that needs hosting, come talk to me, or hit me up on Twitter or email. I'm thagomizer at google.com. And because I work for a rather large company, I have to have this slide: Lawyer Cat says that all code in this talk is copyright Google and licensed Apache v2, because lawyers.

So now that all that's over, I actually get to my content. Big data. Yes, I just made that joke. There's a third joke that I can't make due to copyright rules, but if you know it, you can laugh at this point.

Data is at the heart of many of our best-loved applications. I've been in tech for somewhere between 15 and 20 years now, and I've seen how data and recommendations, even just the data on maps, help make my life better. I'm not from New York; I'm from Seattle. One of the things that makes my day better every single day I commute is an app that knows where all the buses are and says, yo, your bus is 10 minutes late. Great, I'll just sit here and wait and read my book. And as I talk to startups and go to events, I hear more and more folks saying "data is part of our story," and I'm like, great.

Big data appears to be everywhere now, and one of the reasons I personally think that is, is because storage is cheap. So this is the part where I get to tell my back-in-the-day story. My very first tech job was racking servers and hard drives for a tech company in Seattle. The hard drives I was racking were 10 gigabyte hard drives. Those weren't the biggest back then, but they were pretty big, and we were filling entire six-foot-tall racks with them. Now I can go to Target and buy the same amount of storage on a flash drive. Storage has gotten super cheap, super quickly.

But I don't actually see that many Rubyists doing big data, and I think one of the reasons is that it's intimidating. I avoided big data for a long time because I found it intimidating, and the primary reason was: oh my God, statistics. If you go read a book about data mining or business analysis or machine learning, and I have lots of them, you get slammed with a bunch of formulas and Greek letters that look approximately like this. Even if you're determined and well-educated, that isn't exactly light reading, and it isn't something you can dive into right away. I also think people are intimidated by machine learning because there seems to be this feeling in the tech community that you need a PhD to do machine learning. I'm going to say right now that you don't need a PhD to do machine learning, but that's the title of another talk that I've proposed to conferences. That is not this talk.
This talk is about exploratory data analysis. This talk is about building dashboards that might make your data more accessible. This talk is about enabling you to do cool things with the giant piles of data that I'm sure you're currently storing.

And if I'm going to do a data analysis talk, I need a data set, so I'm going to use the RubyGems download data. If you go to rubygems.org, in the bottom right there's a little link that says "Data." I encourage you to start looking for this on a lot of websites, because more and more sites are offering some subset of their data as publicly and freely available to do whatever you find interesting with. I've pulled data sets from at least three or four websites I frequent in the last couple of months.

A quick overview: the folks behind rubygems.org supply two data sets. One is a weekly dump of their Postgres database, and the other is their Redis data. I'm going to use the Postgres data because it has the information that answers the questions I'm most interested in. So let's go over a little of what's in that data set.

The main table is rubygems; it lists all the gems. Here's the schema for that table (it actually is visible). If I'm going to do analysis, the slug isn't particularly interesting, so I'm going to ignore it for now. And there are that many records in the table: about 125,000 RubyGems out there.

The next interesting table to me is gem_downloads. Here's the schema: it's primarily keys, foreign keys, and a raw number of downloads for that particular version of that particular gem. Really simple. There are about 900,000 rows in that table, which makes sense, because most gems have multiple versions.

The dependencies table, it's getting a little more exciting now. Here's the schema: a primary key; requirements, which is the actual dependency; the rubygem_id and version_id, which are your foreign keys; and scope, which is runtime versus development. Important thing to know if you're writing gems: make sure you get your scopes right. Then created_at and updated_at, our favorite Rails timestamp fields, and unresolved_name, which turned out not to be particularly interesting for the questions I was asking, so I'm going to ignore it. This table has 3.5-ish million rows.

Link sets: if you go to a gem's page, in the nav on the right there's a list of links — home, wiki, that kind of stuff. That's the linksets table, and there are about 125,000 records there as well.

And finally, versions. This is where the cool stuff is. That's the schema. It's huge, there are lots of columns, but I'm going to walk through the things that are potentially interesting. There's the primary key, the rubygem_id, and number, which is the actual version number. There's a pile of dates: yanked_at, built_at, updated_at, created_at. There's some cool stuff we could do there with analysis over time or identifying trends. The platform, the licenses, the required Ruby version, and the required RubyGems version are also probably pretty interesting. I'm actually curious about which authors are most prolific, and I'm happy to tell you the answer to that question, but it didn't end up making it into my talk. I was also curious about how the prerelease and latest booleans were being used. So that's a summary of the columns I found interesting.
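For reference, here's a rough sketch of the subset of the dump I'll actually touch. The table names match what rubygems.org ships; the exact column lists are my reconstruction from the descriptions above, and the row counts are the approximate figures just quoted.

```ruby
# Approximate shape of the rubygems.org Postgres dump, limited to the
# tables and columns discussed in this talk. Column lists are
# reconstructed from the talk; row counts are rough.
INTERESTING_TABLES = {
  "rubygems"      => %w[id name created_at updated_at],             # ~125K rows
  "gem_downloads" => %w[id rubygem_id version_id count],            # ~900K rows
  "dependencies"  => %w[id requirements scope rubygem_id version_id
                        created_at updated_at],                     # ~3.5M rows
  "linksets"      => %w[id rubygem_id home wiki],                   # ~125K rows
  "versions"      => %w[id rubygem_id number platform licenses
                        required_ruby_version required_rubygems_version
                        prerelease latest built_at yanked_at
                        created_at updated_at full_name metadata],  # ~750K rows
}
```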
I also pulled in full_name and metadata, just to make my results easier to work with, and because I wanted to play with an hstore column in BigQuery. There are 750-ish thousand records in the versions table.

So now that I have data, the next thing I need to do is start asking questions. It turns out that asking good questions of data is actually pretty hard, but I have some advice if you're trying to figure this step out.

One: use your domain knowledge. I know that occasionally I run across gems that didn't get the gemspec right for development versus runtime dependencies. If I see a gem that is not a minitest plugin that has a runtime dependency on minitest, I get a little suspicious. Likewise for Hoe. Likewise for Jeweler — people are still using Jeweler. I can use that domain knowledge to come up with good questions. I'm also guessing that a lot of gems depend on the JSON gems, because pretty much every project I've ever written uses at least one of them. So maybe there's an interesting question there.

The other thing — and this is where I get a little stats nerd on you — is that if you're going to do any sort of formal statistical analysis, you need an actual hypothesis: a statement. You're going to decide whether that statement has a high probability of being true by comparing it to what's called the null hypothesis, which is — and if you're actually a statistician, I apologize right now, I'm going to hand-wave vigorously — a fancy way of saying "nothing interesting is happening here." That is your stats lesson. I will put stats aside for now.

Some examples of hypotheses I could test. The gem with the most downloads is Rails — seems like a totally reasonable hypothesis. Minitest is more popular than RSpec — I know those are fighting words. Gems released in the last year require Ruby 2.0 — we've been on the 2.x series for a little while now, so I'd imagine a fair number of gems released in the last year require it. Rails 3 is still more popular than Rails 4. And one more, partly because I was writing these slides while it was beautiful outside: fewer gems are released during the summer.

I want to make it clear that this data set is not large. By a lot of people's standards, this is not actually a big data data set, but it's large-ish, and it's relevant to the audience, so I'm going with it.

I'm going to use a tool called BigQuery to do my data analysis. BigQuery is designed for big data sets. What is it? It's a non-relational, non-indexed data analysis tool. It was built by Google, and it's part of Google Cloud Platform. Why was it built? To search and analyze logs at very, very large scale, and that's part of why it's non-indexed. How does it work? I don't know. I actually don't know. I know a little bit about it: I know that it can parallelize some stuff, and I have a hypothesis that if there's stuff being parallelized, there's probably a MapReduce step in there somewhere. But I know it's more complicated than just that, and that it involves specialized data structures. It turns out I don't need to know how it works to know that I like it. I love BigQuery. And why do I love it? Because it supports standard SQL, so I don't have to learn some crazy query language to get my queries done.
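To make that concrete, here's a minimal sketch of running plain SQL against BigQuery from Ruby. It assumes the gcloud gem of this talk's era (the API has since been renamed google-cloud-bigquery) with project and credentials set via environment variables; the table is BigQuery's public Shakespeare sample, and the query is in the legacy SQL dialect BigQuery used at the time.

```ruby
require "gcloud"

# Connect using project/credentials from the environment.
bigquery = Gcloud.new.bigquery

# Ordinary SQL, run against BigQuery's public Shakespeare sample table.
data = bigquery.query <<-SQL
  SELECT word, word_count
  FROM [publicdata:samples.shakespeare]
  ORDER BY word_count DESC
  LIMIT 5
SQL

data.each do |row|
  puts "#{row['word']}: #{row['word_count']}"
end
```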
It's standard SQL, plus some tool-specific extensions. Most databases I've used are the same way: here's your ANSI SQL, and here's the other really cool stuff we wanted to build on top of it. BigQuery is just the same.

It's freakishly fast. Freakishly fast. Querying hundreds of millions of rows takes seconds, and that includes aggregation, cross-table queries with joins, cross-database queries, all sorts of crazy stuff. And it scales to handle super large data sets. I've personally used it with data sets on the order of hundreds of gigabytes — I'm going to demo that shortly — and I know folks who've queried several terabytes with it. It also turns out to be just complex enough to do things like sliding-window analysis, running totals, multiple levels of aggregation, and nested queries. And it can gracefully handle data that isn't consistently well-formed, partly because of its heritage: it came from log data, and log data tends to be messy at times. The fact that it can handle missing fields and that kind of thing is really nice.

No one believes me until I do this, so this is when I do a live demo, and those are always slightly terrifying. So I happen to have — and this has shifted in the half hour since I brought this demo up; we're going to just leave it as it is — the Hacker News data set, as of October 2015. This is all of the stories on Hacker News. Can someone give me a word that you think is going to be in the title of a Hacker News story? Please remember the code of conduct before you start shouting out. What? Bitcoin. That's a great one.

So we're going to run the query — let me figure out which screen it's on — there it is. Awesome. We processed 3.91 gigabytes to do that, there were 760 records found, and you can actually see what the scores of those were. And here's the second interesting part: I'm going to do an aggregation and figure out the average score of all stories with "Bitcoin" in the title. That took a second and a half: the average score was 9.4. The average score of all stories with "hiring" in the title was not that high. So that's my demo. I'm going to hit Control-P and good things happen.

A quick vocabulary lesson so you can search for stuff and figure out answers. A data set in BigQuery is what you would call a database: it's a collection of tables. A table is a bunch of records structured in some regular way.

So now I need to get the data into BigQuery, and I'm going to do that in two different ways, to show you that two different ways are possible. The first is streaming: pushing records into BigQuery and having them added to your data set in close to real time. You might use this for clickstream analysis — are people still doing clickstream analysis? — or log data, or anything else you need available quickly.

I'm going to use the gcloud gem to connect to Google Cloud and BigQuery. It's written by Rubyists, for Rubyists — Mike Moore and Chris, whose last name I'm forgetting, who are in Salt Lake, work on it. It's great. It's hand-coded, not auto-generated, so it feels natural. I'm also using the pg gem for Postgres. Some basic requires — require "pg", require "gcloud" — and set some environment variables so I can connect to my account. If you've used a web service, this should look familiar.
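Put together, the streaming import looks roughly like this. It's a sketch, not the exact slide code: it assumes the gcloud gem's API of the time (Gcloud.new, Dataset#create_table with a schema block, Table#insert), and the dataset name, column list, and schema are reconstructed from the walkthrough that follows.

```ruby
require "pg"
require "gcloud"

# Anything prefixed bq_ is BigQuery, per the naming convention below.
gcloud     = Gcloud.new            # project and credentials come from env vars
bq         = gcloud.bigquery
bq_dataset = bq.dataset "rubygems"

# Create the gems table unless it's already there (hence the or-equals).
# The schema block reads a lot like a Rails migration.
bq_table   = bq_dataset.table "gems"
bq_table ||= bq_dataset.create_table "gems" do |schema|
  schema.integer   "id",   mode: :required
  schema.string    "name", mode: :required
  schema.timestamp "created_at"
  schema.timestamp "updated_at"
end

# Postgres side: a local restore of the rubygems.org dump.
pg = PG.connect dbname: "rubygems"

# The columns to pull; we need this list again to build each row hash.
columns = %w[id name created_at updated_at]

# Stream every row into BigQuery. zip pairs each column name with that
# row's values, and Hash[] turns the pairs into the hash BigQuery wants:
#   columns.zip(values) #=> [["id", 1], ["name", "rake"], ...]
#   Hash[pairs]         #=> { "id" => 1, "name" => "rake", ... }
pg.exec("SELECT #{columns.join(', ')} FROM rubygems") do |result|
  result.each_row do |values|
    bq_table.insert Hash[columns.zip(values)]
  end
end
```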
I'm creating a new Gcloud object, accessing BigQuery from that, and then grabbing my rubygems data set. A quick point of syntax: because I'm dealing with two different databases and moving data between them, it got confusing, so anything prefixed with bq_ is BigQuery. Then I connect to Postgres, to my database named rubygems.

Now I need to make a table in BigQuery. I'm using or-equals in case the table was already there. I tell the BigQuery data set to create a table called gems, because I'm doing the gem data first, and I pass in a block, and that block specifies the schema. If you look at it, it looks an awful lot like a Rails migration, and that's because this was written by Rubyists, for Rubyists: why not specify a schema in a way you already know? And this creates the table. That's all I need — there are only four fields on the gems table that I'm interested in, so that works out. I also keep a list of the columns I'm going to pull — the id, the name, and the timestamps — because I need it later.

And then here's the fun part: that's the entirety of the code that does the import. The first line does a select from the rubygems table in Postgres, which gives me the result set, and I run a block on it: go through each row in that result. And here's the actual cool part: I make a hash out of each row, and I insert that into BigQuery. When I showed that first line to a bunch of my coworkers who write Ruby every day, they were like, what are you smoking? So, a quick digression into zip and Hash.

Say you have two arrays, one of keys and one of values. You can use the wonderful Enumerable method zip to make an array of pairs by taking elements pairwise from both arrays. And if those words didn't make sense, it's OK, I have an animation: it takes key one and value one, key two and value two, and pairs them up, which gives you something like this. That's great, but why would you do that? Because Hash has the square-bracket class method, which creates a new hash from something hashable: a flat array with an even number of elements, or an array of key-value pairs like the one I just made. So if you take this and run that, you end up with this. If you happen to have an array of keys and an array of values, you can turn them into a hash with this trick. That's what I'm doing on that first line: I take the columns array that has the keys, and the values from Postgres, and hash them together, because BigQuery wants a hash — a dictionary of key-value pairs.

So that's streaming. Batch turns out to be easier. This is for stuff that isn't time-sensitive. I know folks who pull their log files in once a week or once a day this way, maybe from a recent outage. You could also pull in customer data; it doesn't need to be real time. There's a list of supported formats: CSV, JSON, and Avro. I'm going to use CSV because I like it and it's straightforward. And I'm going to use the CSV library that's built into Ruby, because the CSV spec is surprisingly complicated if you actually go read it.

So, requires. I'm requiring gcloud because I'm going to upload the file to Google Cloud Storage — you have to put the file someplace BigQuery can read it, and that was the easiest option for me. Connect to Postgres. I've got a columns array again; I'm doing the dependencies table this time.
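The whole batch flow — dump to CSV, then push the file up to Cloud Storage — fits in a few lines. Again, a sketch against the gcloud gem of the era; the bucket name is made up, and the column list follows the dependencies schema described earlier.

```ruby
require "csv"
require "pg"
require "gcloud"

pg = PG.connect dbname: "rubygems"

columns = %w[id requirements scope rubygem_id version_id created_at updated_at]

# Dump the dependencies table to a local CSV, one row at a time.
CSV.open("dependencies.csv", "w") do |csv|
  pg.exec("SELECT #{columns.join(', ')} FROM dependencies") do |result|
    result.each_row { |row| csv << row }
  end
end

# Upload the CSV somewhere BigQuery can read it: a Cloud Storage bucket.
gcloud  = Gcloud.new
storage = gcloud.storage
bucket  = storage.bucket "my-rubygems-data"   # hypothetical bucket name
bucket.create_file "dependencies.csv", "dependencies.csv"
```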
I'm going to build my query using that columns array, since I'm not using all of the columns and it's easier than doing a select star. Then I have a CSV.open and a Postgres exec: go through every row, shove it into the CSV file. It's that simple. Then I just create a new storage object, grab a bucket, and write a file. If you've worked with any sort of cloud storage, buckets and files should be things you're already familiar with. It's really not that complicated.

So I have the file uploaded; now I need to import it. You can do this from the gem, you can do it at the command line — there's a command-line tool — or you can do it through the UI, which is what I do, because I really hope I'm not doing this more than once or twice, so it's not worth automating for me. It's kind of hard to see, but I'm picking the source file location, telling it where to pull the file from; I'm saying what table it should put the data in; and then I'm specifying a schema again, same thing: id is an integer and it's not nullable, and so forth, all the way down. There are also options for how many header rows to exclude, so if your CSV has a header you can skip it; what your field delimiter is; and how many errors are allowed. If you know your data is noisy or messy or gross, you can say, eh, there are 10 million rows in here, I'm OK if I lose 1,000 of them. You can also allow jagged rows, so if you have optional fields at the end, it supports that. And note that it takes care of things like parsing dates for timestamp columns automatically.

So what now? Now the fun part: let's play with the actual data. Starting simple: I stated that Rails has the most downloads. You can also phrase that as, which gem has the most downloads? Anyone want to hazard a guess? That was on my list, too.

So here's the query. This is straight-up SQL: select the name and a count, from the downloads table, joining with the gems table so I can get the gem name, order by count descending so the highest downloads come first, limit to the top five. Rake, rack, multi_json, json, and bundler. If I'd been thinking, I would have at least thought of bundler and rake. But the downloads table is tracked by version, so maybe Rails has had a lot of versions and its downloads are getting distributed. That's OK, I can take care of that, too: same query as before, but I add a sum over the counts, and I group by the name, so all the rows for the same gem get counted together. It's the same list again with slightly bigger numbers, but you can see that rake has 214 million downloads. So, numbers.

So how many downloads does Rails have? Same query again, but adding a where clause to limit it to Rails. It's up there — probably in the top 20 — but it's not in the top five.

Now, I made another slightly fighting-words statement earlier: minitest is more popular than RSpec. Same type of query, but I'm using a having clause this time — again, all standard SQL — saying include the row if the name is minitest or rspec. Minitest is winning right now. And just so you know, all the queries I've shown so far took five seconds or less.

Next hypothesis: gems released in the last year require Ruby greater than 2. This one is kind of crazy. Again, it's a standard select and a standard count star; it's the where clause that's a little crazy.
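For reference, here are reconstructions of the queries just walked through, plus the date trick and the version-extraction regex explained next. These are sketches in BigQuery's legacy SQL dialect of the time (hence the square-bracket table names); the dataset and table names follow the import above, and the exact column names are my best guess from the schema discussion.

```ruby
require "gcloud"

bigquery = Gcloud.new.bigquery

# Total downloads per gem. Downloads are tracked per version, so SUM plus
# GROUP BY collapses all of a gem's versions together.
top_gems = bigquery.query <<-SQL
  SELECT gems.name AS name, SUM(gem_downloads.count) AS downloads
  FROM [rubygems.gem_downloads] AS gem_downloads
  JOIN [rubygems.gems] AS gems ON gems.id = gem_downloads.rubygem_id
  GROUP BY name
  ORDER BY downloads DESC
  LIMIT 5
SQL
top_gems.each { |row| puts "#{row['name']}: #{row['downloads']}" }

# Minitest versus RSpec: same aggregation, with HAVING to keep only the
# two gems we care about after grouping.
fighting_words = bigquery.query <<-SQL
  SELECT gems.name AS name, SUM(gem_downloads.count) AS downloads
  FROM [rubygems.gem_downloads] AS gem_downloads
  JOIN [rubygems.gems] AS gems ON gems.id = gem_downloads.rubygem_id
  GROUP BY name
  HAVING name IN ('minitest', 'rspec')
SQL

# required_ruby_version for gems created in the last year. DATE_ADD with
# -1 YEAR means no hard-coded date, so the saved query stays current.
last_year = bigquery.query <<-SQL
  SELECT required_ruby_version, COUNT(*) AS gems
  FROM [rubygems.versions]
  WHERE created_at > DATE_ADD(CURRENT_TIMESTAMP(), -1, 'YEAR')
  GROUP BY required_ruby_version
  ORDER BY gems DESC
SQL

# Downloads per Rails major version. REGEXP_EXTRACT grabs the leading
# digit of the version number so we can group on it.
rails_majors = bigquery.query <<-SQL
  SELECT REGEXP_EXTRACT(versions.number, '^([0-9])') AS major,
         SUM(gem_downloads.count) AS downloads
  FROM [rubygems.gem_downloads] AS gem_downloads
  JOIN [rubygems.versions] AS versions ON versions.id = gem_downloads.version_id
  JOIN [rubygems.gems] AS gems ON gems.id = versions.rubygem_id
  WHERE gems.name = 'rails'
  GROUP BY major
SQL
```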
I'm taking the current timestamp and adding negative one years to it, and saying that created_at needs to be greater than that. That's kind of backwards logic for saying "created in the last year," but it means the query stays current: I saved it off, my coworkers can use it, and it will always be up to date because I didn't hard-code the year.

So I get this back. Most of the gems released in the last year say they require any version of Ruby greater than or equal to zero. I'm going to say right now, I'm a bit suspicious of that. Then there's 1.9.3, and then a variety of versions of 2.0. This isn't exactly answering my question, because I was asking about greater than 2 — we'll come back to that — but first we need some more complex queries.

So, I said Rails 3 is more popular than Rails 4. A different way of saying that: I think Rails 3 has more downloads than any other Rails major version. And that mess is the query that figures it out, but it's mostly standard SQL. There are two joins now, because I have to join versions and downloads and the actual gems table, and there's a where clause to pick out the gem named Rails. But there's a cool line right there — that line is super neat, so I'm going to dig into it. I take the number column from the versions table, which is the version number, and run it against a regex that grabs the first digit of the version. I call that "major," and then I group by it, which gives me this. There are some very, very large numbers there, and this is across all versions. This query took about 12 seconds. I find it hard to read, so I did some division, which ended up there. So Rails 3 is still more popular than all the versions of Rails 4 combined.

Let's come back to that last question: gems released in the last year require Ruby greater than 2. That earlier query was the original version; now I'm going to use the same REGEXP_EXTRACT trick. The regex got a little more complicated, because I need to match some combination of greater-thans, or the squiggle (~>), or equals, in front of a digit. I call that "version," I do the same timestamp trick, and that gives me this. Still, most gems are requiring a version in the zero range — greater than or equal to zero — which seems suspect. But at least I was able to collapse 2.0.0, 2.0, and 2.0.1 all down into some variant of two. I still have ">= 2" and "~> 2" and "> 2" as separate buckets, and I could probably write a better regex that would collapse those as well, but it's a good start.

I'm pretty sure I'm almost out of time, so I wanted to say thank you. I have large quantities of dinosaur stickers, stylin' Google Cloud Platform sunglasses that might be helpful on our boat later today, and a giant pile of Google Cloud Platform stickers as well, so please come get free stuff from me so I don't have to take it back to Seattle. Thank you all.