Hopefully you can hear me, can't you? Sorry, am I not on? I can't tell. Oh, and I'm also disappointed I didn't hear about Dash, because I was keen to. But actually it's sort of nice that these two talks sit together, because we're coming at the problem from exactly opposite ends. We just saw a nice high-level view — here's really big data, here's what's going on, things are happening — and I'm coming at it from the other direction. The data we're going to look at today is not actually big. Instead, I'm going to look at a small problem as a way to get some practical experience with a particular tool that will scale up to big data. A lot of these big data talks give you the big overview; I'm going to do the reverse, but in return, you'll be able to leave here, get online, and immediately start doing some analysis using the things I'm going to show you. It's directly and immediately applicable, if maybe not quite so exotic and exciting.

I also want to take a second and ask your indulgence to point out that I'm really pleased to be able to speak to you here today. And I mean that, because right after Kiwi PyCon last year in Dunedin — for those of you who were there — almost immediately afterwards I became incredibly, dangerously sick. I wrapped up, put away the stuff from PyCon, because I was helping out with the conference, and then spent the next month in hospital and the next two, two and a half months at home recuperating. So I'm very much not taking it for granted that I'm literally able to be here speaking to you today. When I say I'm happy to be here, I'm really happy to be here. Some of you might have picked up on the fact that I've just set you up, because if you don't like the talk, you're going to feel a little bit guilty. But that's the way it is.

So, I'm Tom and I work for Runaway Play, which is a mobile game studio in Dunedin. But I want to start not by talking about my work at Runaway, but by thinking back to a job I had previously. Several years ago I was working in the United States for a large television network, on the delivery of online video. In a setting like that — and probably in a lot of the settings where you're working, this is also true — we were collecting a lot of data. We were logging all kinds of stuff. And everyone on the team knew that that data contained material we could use to gather valuable insight about how our application was performing and what our users were doing, and that we should be able to look at that data and find concrete things we could do to improve the performance and usability of our app. But for the most part we didn't do that. And we didn't do it for the reason we don't do a lot of things in this industry: it really came down to time. In this case, in particular, computing time. We did do some analysis of our log data, but basically we settled for getting the data we had to have, and just running those reports every day took a couple of hours.
When it takes that long to run your reports, you have less time to begin with, but you also just don't have the kind of feedback loop that's going to lead you to experiment and play with your data and ask, what kind of insights could I derive from this? Now, today, at Runaway, we're logging a lot of data. It's a mobile game studio, but we've got data coming from our API servers and data coming from our mobile application analytics packages — all kinds of data coming in. And unlike what I was doing before, we're actually starting to look at that data and gain insights from it. More than that, we actually play around with the data a little, play what-if, and see what we can figure out from it. And there are concrete things we do with that. For example, right now we're in the process of internationalizing a lot of our games, and the decision about which language we should do next comes from looking at our analytics, getting some insight, and saying, ah, I think this language would be the next one to go after. Things like that. We're able to do this because we're able to do things quickly, and also cheaply, because like most organizations we're very resource constrained. And BigQuery is an example of a tool that lets us do this quickly and cheaply.

So what's BigQuery? It's a thing from Google, part of their Google Cloud Platform suite of products. Its strength is that it lets you quickly query large sets of data. That's what it's for. And you query it using SQL, which a lot of us are already familiar with. So when I say you'll be able to leave here and immediately go analyze some data, for a lot of us that's definitely true, because you can already write the SQL you need for the queries. From those two things you might say, well, then it's just a big DBMS. And while strictly speaking that's probably true, that's not really how you'd ideally use it. You could, but don't. If what you need is a DBMS, use a proper DBMS. The workflow in BigQuery is more likely that you generate your data somewhere else, on some other platform, then import it in batches into BigQuery, run your queries and whatnot, and gain your insight that way.

There are about three ways you can get at it. It's got a web console, and we'll see there are a couple of things you have to do in the web console. There are some CLI tools that Google distributes — written in Python, interestingly enough — so a lot of the basic day-to-day maintenance you can do with the CLI. More importantly for this talk, it's got a remote API, and you can get at that remote API using libraries available in pretty much all the common languages people are using today, including Python. Now, Google says their Python libraries are currently in a beta state, but I find in practice they're totally stable and ready to use. I've played with them in both Python 2 and 3, and everything seems good. And because Google uses Python for their own stuff, it's not surprising that those libraries are in a good state. BigQuery, of course, also integrates with a lot of other tools.
That's how I came to it: the mobile analytics package we're running on the games is Google's Firebase, so, unsurprisingly, the data from that analytics package imports directly into BigQuery for me to analyze. It plays well with other things on the way in, and it plays well with other things on the way out. And, importantly for us, it plays well with Python, which means you can run your analysis in BigQuery and then perhaps use nice Python data visualization tools, things like that. And we can all enjoy it.

So here's the thing. It was earlier this year — in fact, not that long before the call for proposals for this conference came out — that I started messing around with BigQuery, because, again, I started using an analytics package that made it worth my time to figure it out. If I'm going to figure out some new technology, I completely fail at learning things in the abstract. I just like to have a problem to solve; I work through how to use the tool to solve the problem, and that way I learn about the tool. And that's what we're going to do in this talk: we're just going to work through a problem, a nice, modest problem that we can all make some sense of and hopefully have a little bit of fun with.

So here's our plan. We're going to create something called a project, and we'll see right away what that is. Within our project, we'll set up our data set and tables. We'll prepare our data and load it into BigQuery, and then we can run some queries and see how that works. And then we'll be tidy Kiwis and clean up afterward. That also relates to the fact that BigQuery is a tool you have to pay for, and there are a few little tricks we'll learn here that help keep the cost under control.

So first, what's the problem? Sometime in the last year or so, I heard about this idea of Twitter ratios. A lot of us in various industries are looking at social media, and with social media people talk about engagement, but you can't tell whether engagement is a good or a bad thing. Somewhere out of that, this idea of the Twitter ratio emerged. If you're interested in a little bit more about it, the linked article there in Esquire describes it in more detail, but here's the basic idea. Here's a tweet that I put out there, and we can see that it has one reply and one retweet. So if we look at the ratio of replies to retweets, the Twitter ratio for this tweet is one. And in general, that would be a pretty good Twitter ratio; we're OK with that. We're OK with it because the assumption behind this is that a retweet generally corresponds to agreement or some sort of positive reaction to my tweet, whereas a reply is more likely to correspond to disagreement or some other negative reaction. Maybe somebody wasn't too excited about PyCon last year and replied to it. Maybe they did, maybe they didn't. Whether that assumption is valid or not, let's look at the data and see what we can figure out. But that's what we're looking at: replies over retweets.
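Just to pin down the number we'll be computing throughout, here's the ratio as a tiny sketch. The guard for zero retweets is my own addition, and it comes up again a bit later in the talk:

```python
def twitter_ratio(replies, retweets):
    """Replies divided by retweets; higher is (supposedly) a worse reception."""
    if retweets == 0:
        return None  # undefined; the data set used later never hits this case
    return replies / retweets

print(twitter_ratio(1, 1))  # the example tweet above: 1.0, a respectable ratio
```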
There's a problem here, though. My own Twitter timeline is what we might call boring. In particular, I had to go back to the 9th of September 2016 to find a tweet of mine with enough replies and retweets to give a non-degenerate Twitter ratio; I just don't have that many tweets on my timeline that get replies and retweets. So we're not going to get a lot of insight from looking at my Twitter timeline — and I mean that in more ways than one. I need a better collection.

So what do I want? I want a collection of tweets, and there are a few things I want from it. I'd like it to be a fairly large collection, so that we get at least a little sense of the bigness of big data. I'd like those tweets to have had a large audience, so that a lot of people had a chance to respond — to reply or to retweet — and we can actually look at some sensible Twitter ratios. And related to that, I'd really like it if those tweets were likely to elicit a strong response from the people who saw them, so that we get some reaction, and this exploration of the Twitter ratio might be more meaningful. So I've got to suss out a source of tweets that satisfies those properties. OK, problem solved. Not hard at all.

What I did was harvest data on a collection of Donald Trump's tweets. I collected — actually I think it's 31,047 tweets — that he excreted between March 2009, when he opened his account, and December 2017, when I finished putting the set together. For each tweet I collected the ID (Twitter's ID), a timestamp so we can see when it happened, the text of the tweet, and the number of replies and retweets, because obviously I need those to calculate the ratio. I excluded some tweets because they didn't have text, just because I was too lazy to make a more flexible schema. It's just an example here, after all.

OK, so we've got our data; let's get to work on the problem. The first step is to set up a project. This is just a Google Cloud Platform project — nothing exciting going on here — but just so you know what you need to do. For whatever reason, Google doesn't make it super easy to figure out: I want to do BigQuery, where do I start? Here's where you start. You go to console.cloud.google.com — you'll have to log in with a Google account — and there's a little project menu where you set up a new project. The project is just a big container for a bunch of cloud platform resources; in particular, it lets us use BigQuery inside it. You're going to have to enable billing for it, because this is a service for a fee. However, there's a pretty generous cap. For example, for all the preparation I did for this talk, I never incurred any fee, and for the stuff I actually do at work, I've yet to incur a fee larger than $10 a month. So you can play with this quite a bit without worrying about it too much.

Another important thing we've got to do in the Google console is set up a service account. I just go over to the user management section, create a new service account, and give it the privileges it needs — I made it a BigQuery admin, which is a little bit of overkill, but it's a small project. That gives me a set of credentials that I download in JSON format, and we're going to see in just a minute how I use that to connect my client to BigQuery. Once that's done, I don't need the web console anymore. I've got nothing against the web console, but from here on out we can do everything we want to do in Python. So the first thing I need to do is connect to BigQuery. How am I going to do that? I need the Google Cloud library, so I just have to pip install google-cloud-bigquery. Once that's installed, I go ahead and set up a BigQuery client, and I just give it the file name of the JSON credentials file I downloaded on the last slide. And now I've got a client that's connected and able to interact with the remote BigQuery API.
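As a minimal sketch of that connection step — assuming the downloaded service-account key is saved as credentials.json (a placeholder name):

```python
# pip install google-cloud-bigquery
from google.cloud import bigquery

# Point the client at the service-account key downloaded from the console.
client = bigquery.Client.from_service_account_json("credentials.json")

print(client.project)  # the project this client is bound to
```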
The first thing I'm going to do with that client is set up a data set. A data set — they just didn't want to reuse the word database, but it's a database — is a collection of tables and other resources. There's some metadata you can put on it and things like that; basically, it's just a place to keep your tables. You have to do the little two-step you see here: I create a data set reference and give it a name, then from the reference I create a data set object, and then I turn around and use my client and call create_dataset. That's where the remote API call actually happens — anywhere you see client-dot-something, that's where the remote API call is happening. And now I've got the data set up on the Google Cloud platform.

The whole point of that is so that I can set up a table. All right, now I'm going to set up the table, and — I'm hinting here that there's more than one way to do it — what are we going to do? The same kind of pattern: I set up a reference, I create a table, and I can specify my schema. We've got about the same kinds of column types you'd have on other relational databases; not quite so many, not quite as much variety, more just some plain vanilla data types. So I specify what my schema is, then tell my client to go and create the table.

Now the next thing I need to do is load some data into that table. I can load in a number of formats, but we're going to use JSON. Here's a quick sample of some of the JSON I collected: one JSON object for each tweet, separated by newlines, with fields that correspond to the columns we set up before. You need to put in a little bit of effort here to make sure BigQuery can correctly parse things and get the right data types. The one place I find it particularly finicky is date/time fields: get your timestamp correctly formatted, otherwise BigQuery is going to insist it's a string.

So there's what the data looks like. The thing is, that data is actually pretty self-describing, so in practice I don't generally bother specifying a table schema. Instead, I just set up the table and give it the same name, and — here's a pattern we're going to see a few times — I create a job config. Because I'm going to load data, this is a load job config, and it's just a way to specify options on the job. The option I'm going to specify is that my data is newline-delimited JSON. So far, this is the same as if I'd done what I did on the previous slide. But this next bit, that autodetect, is saying: I didn't bother to specify a schema, so just auto-detect the schema from the structure of my JSON data. And in practice, I find that works fine. If I had specified the schema the way we did on the previous slide, I'd just not set autodetect and load up my data.
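Here's a hedged sketch of those steps with the google-cloud-bigquery client. The data set name, table name, and column names are placeholders of mine, not necessarily the ones on the slides, and the exact constructors have shifted a little between library versions:

```python
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("credentials.json")

# The reference/object two-step described above, then the remote call.
# (Newer library versions spell the reference bigquery.DatasetReference.)
dataset_ref = client.dataset("tweet_ratio")
dataset = bigquery.Dataset(dataset_ref)
dataset = client.create_dataset(dataset)   # client.* = remote API call

# Option 1: create the table with an explicit schema.
schema = [
    bigquery.SchemaField("id", "INTEGER"),
    bigquery.SchemaField("created_at", "TIMESTAMP"),
    bigquery.SchemaField("text", "STRING"),
    bigquery.SchemaField("replies", "INTEGER"),
    bigquery.SchemaField("retweets", "INTEGER"),
]
table_ref = dataset_ref.table("tweets")
table = client.create_table(bigquery.Table(table_ref, schema=schema))

# Option 2: skip the explicit schema and let the load job auto-detect it.
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.autodetect = True
```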
OK, so I've got a job config. I open my data set, my JSON file, in binary mode, and then I call this load_table_from_file method, give it the file handle, tell it what table I want the data to go into, and pass in my job config options, and it's going to load that data up. Of course, this is going to take a while, depending on the size of your data. The job actually executes asynchronously, so I can go on and do some other things. It's not particularly fast; I'll just let it go. And of course things can go wrong — as I learned the hard way, you've got to get your timestamps right — so you might have to check the results of your job to make sure things loaded correctly.

Now that we've done the work to be sure that we can, and I don't see any point in pausing to reflect on whether or not we should, let's just go ahead and look at some tweets. The first obvious question to ask is: let's just go find the tweet with the worst Twitter ratio. No big trick there. Set up my client, set up a SQL query string — and this is just SQL. The only little thing to know is what I've highlighted: how I specify the table, because that's slightly different from what you see in some other contexts, including even some other BigQuery contexts. With the Python library, the table name is enclosed in backticks, and it's projectname.datasetname.tablename. Then I go off and have my BigQuery client run the query, and I get back a query job object.

So again, remember, all these API calls are happening asynchronously, and although BigQuery is fast, it's not instantaneous, so I'm not going to get the results back right away; I get back this job object. The reason I bother to print the job state there is just to demonstrate that while I'm waiting on the results, I can query the job state. I expect that when we run this, the job state should come back and say it's running: I've just fired off the job, I print the state, the job's running, and I'm waiting on the results to come back. I collect the results with query_job.result(), but that's going to block if the job isn't done yet, so if I don't want to block, I need to handle that somewhere in my code. In this particular example, I didn't have anything better to do with my time, so I'll just wait for the result to come back. I get an iterator, so I can iterate over the rows of results, but my query is going to return exactly one row, one result, so I'll just immediately turn it into a list, and then the first and only item on the list is the result: the tweet with the worst Twitter ratio of everything in the timeline.

I'm sure you're all excited to see that. No, don't laugh yet; there's an issue. It turns out that about 20 minutes before Danny came to me yesterday and said, hey Tom, could you do your talk early, I noticed that the tweet was wrong. It didn't match; there's no way it was right. It turns out there's a bug somewhere in how I collected the data, and I've got some tweets where it's underreporting the number of retweets. So we were going to have a funny moment here — it was going to be special — and it didn't happen. I'm not going to bother to show you the tweet yet, but it's OK, because I'm just going to make you wait a little bit longer. We're going to do some more queries, and they'll actually let us get around this problem. So sorry, but bear with me.
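Roughly what that load and that first query look like in code — a sketch, continuing with the placeholder project, data set, and column names from the earlier snippets:

```python
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("credentials.json")
table_ref = client.dataset("tweet_ratio").table("tweets")

job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.autodetect = True

# Load the newline-delimited JSON file into the table (asynchronous job).
with open("tweets.json", "rb") as f:
    load_job = client.load_table_from_file(f, table_ref, job_config=job_config)
load_job.result()  # blocks until the load finishes; raises if it failed

# Find the tweet with the worst ratio. Note the backtick-quoted
# project.dataset.table name. Dividing directly assumes retweets is never
# zero (discussed below); SAFE_DIVIDE would be the cautious choice.
sql = """
    SELECT text, replies, retweets, replies / retweets AS ratio
    FROM `my-project.tweet_ratio.tweets`
    ORDER BY ratio DESC
    LIMIT 1
"""
query_job = client.query(sql)
print(query_job.state)             # likely 'RUNNING' just after firing it off

rows = list(query_job.result())    # result() blocks until the job is done
worst = rows[0]
print(worst.ratio, worst.text)
```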
There is an issue, however, if you were paying attention. You might say, wait a minute — replies over retweets: what happens when the number of retweets is zero? We can't divide by zero. That's clearly going to be a problem, Tom, and you definitely haven't addressed it. Well, in fact, I didn't have to: he has never tweeted something that somebody hasn't retweeted, so I just didn't have to worry about that case with this data.

OK. Another issue, though, if we just stop and think about this for a minute — and this was the insight we were going to derive from the tweet that I didn't show you, but we can still figure it out. If this Twitter ratio thing really works, you really need to have a decent number of replies and retweets to kind of smooth out the data, because clearly not every reply is going to be disagreement and things like that. In my first example tweet, where I had one retweet and one reply, that reply probably wasn't somebody saying, no, Tom, you're an idiot, or something like that. Probably wasn't. So really, when we look at our data, we should find some threshold value and say, let's only consider tweets that have a certain number of replies and retweets. But what's the right threshold for this data? And this is a nice bit of the story: when you've got tools that let you do this kind of thing quickly, you'll just experiment and try things out. That's exactly what we'll do, and while we're at it, we can see how to use a named parameter. As an aside, finding that threshold value is also going to get us around the problem with my data set, because once we set the right threshold, it excludes my problematic tweets.

So let's go try to figure out a good threshold value. The first thing I do is set up the query string, and I want to try different threshold values in there, so I put a named parameter in the query, prefixed with an at symbol, and I populate it later when I run the query. To populate it, I've got to do a bit more work. I've got a list of candidate threshold values here, between 5 and 1,000, and let's try to figure out how high we have to go before it works. For each one of those values, I loop through and create a scalar query parameter; I name it threshold and set its value. Then I set up a job config — this time it's a query job config, because I'm going to use it to pass options to a query job — and in particular I give it a list of parameters; in my case, a list with one parameter that it can use to populate the query string. Then I go ahead and fire off my query using the client.query call. And here, again, I exploit the asynchronous nature: those calls aren't going to come back for a while anyway, so I just fire each one off and stuff the job object in a list, and after I've fired off all the jobs, I go back through the list and accumulate the results. That way I get my results back a little bit faster.
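Roughly what that experiment looks like in code — a sketch; the particular query and the candidate values are my own stand-ins for whatever was on the slides:

```python
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("credentials.json")

# How many tweets survive each candidate threshold? @threshold is the
# named parameter referenced in the SQL.
sql = """
    SELECT COUNT(*) AS n
    FROM `my-project.tweet_ratio.tweets`
    WHERE replies >= @threshold AND retweets >= @threshold
"""

candidates = [5, 10, 50, 100, 500, 1000]
jobs = []
for value in candidates:
    job_config = bigquery.QueryJobConfig()
    job_config.query_parameters = [
        bigquery.ScalarQueryParameter("threshold", "INT64", value)
    ]
    # Fire all the jobs off first; they run asynchronously on BigQuery's side.
    jobs.append((value, client.query(sql, job_config=job_config)))

# Then collect the results, blocking on each job in turn.
for value, job in jobs:
    row = list(job.result())[0]
    print(value, row.n)
```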
So I ran this, and I found that around 500 worked out to be a good threshold value; we could start to draw some interesting conclusions with the threshold at 500 — at least 500 replies and at least 500 retweets — and it smooths out the issue I had earlier. So now — and I'm pretty sure this is the slide I expect — I can show you the Trump tweet that we can say has the worst Twitter ratio. I don't know if I mentioned it earlier, but what's considered a poor Twitter ratio? Generally, if it's greater than two, we'd think it's a bad ratio. This one is around 9.4 — roughly 9,400 replies to 1,000 retweets.

So this has a bad Twitter ratio, but it doesn't really seem to satisfy our hypothesis, because, after all, it's inviting replies. I'm not going to go through all those replies to see whether they're disagreement or just sensible responses, but they're probably just sensible responses. So actually, the first insight we've got is that this Twitter ratio thing clearly doesn't always work. But of course, that wasn't too surprising. And if you're feeling disappointed, don't be, because if you look at the next-highest Twitter ratio, now Trump's going to Trump for us: we get him saying something stupid and offensive, and we've got a poor Twitter ratio. So, oh, it seems to work. So what's the insight we have? Twitter ratios clearly don't always hold up, but they do seem to hold up sometimes. If we went through Trump's Twitter timeline and looked at the high-ratio tweets, there are some gems in there. Gems.

But anyway, we're just sort of cherry picking — or we seem to be just cherry picking the worst tweets. What if we look at these things in aggregate and answer some questions that way? We can do that. Like, what's his average Twitter ratio across all the tweets? Across all the tweets that are over my threshold, because that's important. So we'll go ahead and do that. This is an example of something where, on a less speedy platform, I might have calculated some intermediate results and saved them to make things like this faster, but there's no need to do that here. I'll go ahead and calculate the average and get my result back. And because average is a column in the result set coming back, when I ultimately want to get that value, I can just say result.average, and there's the result.
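The aggregate query is the same pattern as before; a sketch, again with my own placeholder names:

```python
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("credentials.json")

sql = """
    SELECT AVG(replies / retweets) AS average
    FROM `my-project.tweet_ratio.tweets`
    WHERE replies >= @threshold AND retweets >= @threshold
"""
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = [
    bigquery.ScalarQueryParameter("threshold", "INT64", 500)
]

result = list(client.query(sql, job_config=job_config).result())[0]
print(result.average)   # result columns come back as attributes on the row
```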
So it turns out his average ratio is somewhere around 0.56. Is that good or bad? Well, actually, I can't tell you, because I haven't analyzed anybody else's Twitter timeline to compare; I don't know what Obama's average Twitter ratio is. But it does lead me to a question: is that good or bad? Well, think about this. He's been tweeting since 2009, and while he's the same person, we don't think the same thing of him over all that time. So it might be interesting to look at his own history year by year and see if we can find out something from that. Now, I'm not going to bother showing you the code for the year-by-year calculation, because it's actually very similar to the code you've already seen, but all the code's on GitHub, so you can go play with it. But in particular, let's look at his average Twitter ratio year by year. In 2009 and 2010, he didn't have any tweets that got over that threshold value of 500 replies or 500 retweets, so I don't have an average for those years. Then it kind of bounces around in this range, 0.4 to 0.6, somewhere in there — doing OK. And then in 2017, the ratio blows up. So clearly, some bizarre and inexplicable thing happened somewhere right here, around maybe the end of 2016, beginning of 2017. I'm not being ironic: a bizarre and inexplicable thing happened right about there that causes the Twitter ratio to shoot up.

So that's kind of interesting. But another question occurs to me when I look at this: OK, but we think of a bad Twitter ratio as a ratio of more than two, so how often does he get a tweet with a bad ratio? If we break that down year by year, the pattern seems to hold up. He's having a very good year, if you think of it that way. Again, some things happened right in here that we might want to explore further, or we might not. However, it's also important to remember that we're only looking at tweets that have at least, what, 500 replies and 500 retweets — there are only 6,380 of them — and only 2% of the total tweets we looked at have this high Twitter ratio. I don't really know what we can conclude based on that, and I'm not too excited to find out. But it does actually kind of hold up anecdotally with what people say, which is that mostly he cruises along on Twitter, has a good time, doesn't cause any trouble, and it's just that small percentage of tweets that really seem to cause the trouble.

So, OK, we've played around with our data a little bit — and I want to emphasize that we played around with the data a little bit. I like it because, again, I've got a platform where I can quickly get my results, get my feedback, and see what's next. Nothing particularly hard about this; no rocket science — no rockets, not particularly any science. We can do it.

A couple of other little fine points here. One: notice how I kept doing these year-by-year calculations. Now, I still have a small data set, so it's not a big deal, but on the subject of cost control, BigQuery bills you for your queries based on how many bytes it has to scan to answer them. So you'd like to structure your data and your queries in such a way as to minimize the bytes scanned. If I'm going to do a lot of year-by-year things, I should break the data down into a table for each year, so each particular query only scans one year at a time. Every query I did to produce each year's result there scanned the whole table; if I'd broken it down into a table per year, each query would have scanned a smaller fraction and racked up less cost. But, again, I was working with a small enough data set that I wasn't concerned about that. So the first cost control measure says: if you're going to access your data in chunks like that, then split your data up into tables along those chunks. Or, there's actually a way you can take a table and partition it, to create virtual boundaries inside your data so that you only scan a particular partition.

Another part of cost control comes now that I'm done with the data. The other way I'll rack up charges is by storing the data on BigQuery. In this case, that's easy: I've already got the data in a JSON file locally, so I don't need to keep it on BigQuery, and I'm just going to delete the whole data set. If I ever want to redo the queries, I can just go back to the beginning and start again. Deleting is just the create-data-set step in reverse: take the create step and call delete instead of create, and the data set is gone. Now the data is off BigQuery and I'm not racking up any storage charges.
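Both of those cost-control moves are only a couple of lines each. A sketch, with the usual caveat that the exact options vary a bit between library versions:

```python
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json("credentials.json")
dataset_ref = client.dataset("tweet_ratio")

# Cost control 1: a date-partitioned table. Queries that filter on the
# partitioning column only scan (and bill for) the partitions they touch.
schema = [
    bigquery.SchemaField("created_at", "TIMESTAMP"),
    bigquery.SchemaField("text", "STRING"),
    bigquery.SchemaField("replies", "INTEGER"),
    bigquery.SchemaField("retweets", "INTEGER"),
]
table = bigquery.Table(dataset_ref.table("tweets_partitioned"), schema=schema)
table.time_partitioning = bigquery.TimePartitioning(field="created_at")
table = client.create_table(table)

# Cost control 2: drop the whole data set once you have your own copy of the
# data, so you stop paying for storage. delete_contents drops any tables
# still inside it.
client.delete_dataset(dataset_ref, delete_contents=True)
```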
And that's fine because I've already got a copy of the data — but what if I don't? There are a couple of scenarios where that could happen. One: all I did here was read the data and analyze it, but I could have done operations that created and saved new data in some tables, and what if I want that data? Or, in the case where I'm exporting data from another tool into BigQuery, and then I'm done with it on BigQuery but I don't have my own copy of the data and I'd like to get it back out — how am I going to do that? I've got a couple of options here.

I can go back to the web console and just pick the tables and download them as JSON files again. And that's fine, but if you have a lot of tables, that's going to get annoying really, really fast, so I'd like to not do that. But there's not an easy way to download your tables directly from BigQuery otherwise. However, there is a way out. There's such a thing as an extract job, where you extract your table as a JSON or other-format file into Google Cloud Storage. Now, that doesn't solve your problem by itself, because now you're paying for it in Google Cloud Storage instead of BigQuery — except that with Google Cloud Storage it's trivial to write a Python script to just download all those files over HTTP. In particular, it would actually be quite easy to write an all-purpose Python program that exports all your tables to Google Cloud Storage, downloads all those files from Google Cloud Storage, and then deletes all the stuff you've got up there so that you're not storing anything. That takes care of it nice and easily.
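A sketch of that export-and-download round trip, under an assumed bucket name and file pattern; the google-cloud-storage package is a separate install, and the service account would also need access to the bucket:

```python
# pip install google-cloud-storage
from google.cloud import bigquery, storage

client = bigquery.Client.from_service_account_json("credentials.json")

# Extract the table to newline-delimited JSON files in a Cloud Storage bucket.
job_config = bigquery.ExtractJobConfig()
job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
extract_job = client.extract_table(
    client.dataset("tweet_ratio").table("tweets"),
    "gs://my-export-bucket/tweets-*.json",   # bucket name is a placeholder
    job_config=job_config,
)
extract_job.result()  # wait for the extract to finish

# Pull the exported files down over HTTP, then clean up the bucket.
gcs = storage.Client.from_service_account_json("credentials.json")
for blob in gcs.bucket("my-export-bucket").list_blobs(prefix="tweets-"):
    blob.download_to_filename(blob.name)
    blob.delete()
```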
Well, OK. So hopefully we've had a little bit of fun here, and I've given you something you can actually go out, like I say, and put to work. Because I'm sure that most of you, like me, are sitting in some situation where you actually are sitting on some data that needs to be analyzed, and what you need is a tool that makes doing so easy and accessible. This one seems to satisfy that. The data set and all the code, including some of the code that I told you I didn't show you, is all on GitHub, so you can go have a look at that. I'd like to fix the problem with the data set, and when I do, I'll post a second set with the fix on there in case you want it — both as a matter of principle, and also because I'm sure there are one or two more good conference talks that could come out of that batch of data, whether I give them or somebody else does. The data's there. All right, thanks.

I don't know if we have time for questions, but there's tea afterwards, so we have time that way. So we have time for some questions, if anyone has any.

In your examples, you were manually writing the SQL queries. Do you know if the client also has a more Pythonic, ORM-type interface?

It does not. If you think about it, ORM interfaces generally handle all your routine cases, but when you want to write some detailed query — even when I'm using an ORM library — at some point I have to drop to SQL, and that's the typical use case here. There's nothing stopping any of us from adapting one; it definitely could be done. But I don't think there's a lot of use for it.

When you were looking at the year-by-year data, you said you had to run the query multiple times, once for each year. Is that because there is no GROUP BY equivalent?

Yeah, I could have done something like that. But again, there's an object lesson there, which is that I didn't go out of my way to optimize the SQL, because I'm going to get my results back quickly anyway. And I think this is actually a pretty realistic development trade-off in a lot of settings: I could spend time writing a smarter piece of code, but in the end it would be a net time loss for me. Pretty much, if you can do it in SQL, you can do it.

If you were going to normalize your data, say by the number of followers or the change in followers through time, where in your workflow would you incorporate that?

I'm going to punt and say I'm not sure where in the workflow I would incorporate that — whether it's something I would handle when constructing my data set, before I loaded it up, or somewhere else. Yeah, I'm just going to punt and say I really don't know; I'll have to play with that more.

I didn't hear you very well, so maybe you can repeat that? — Can you compare this tool with other alternatives we already have in terms of performance, like how long it takes to run the SQL against other tools — things like Hadoop and Hive and the various NoSQL setups we already have?

Yeah, I mean, certainly we could do that kind of comparison. I think, A, to do that meaningfully I'd need a much larger set of data, and moreover, in my particular case, I'm not that motivated to, because I have other reasons for using BigQuery — in particular, its integration with the other tools I'm using. So that sounds like a perfectly worthwhile analysis to do, but I haven't done it.