Hi, I'm Garth Griffin, co-founder and CTO of Gigasheet. In this talk, we're going to explore a question that I find very interesting: what would you do with a billion rows? First, I'll give my answer to that question by showing a demo of how easy it is to use Gigasheet to explore a billion-row dataset. I'll also talk about some other technologies that are out there for tackling big CSV files. I'll share some metrics about the sizes of files that Gigasheet users have uploaded, and at the end, I'll describe five real-life examples of large CSV files from Gigasheet users in the wild, ranging from 104 million rows up to just over 1 billion rows. If analyzing large data files is part of what you do, I hope you'll consider giving Gigasheet a try. It's free, and you can sign up at gigasheet.com.

So what would you do with a billion rows? We'll do the demo first and see Gigasheet's answer to that question. Here I am in the Gigasheet web application, and I've already uploaded a sample file of synthetic network connection data. It's not real data, it's synthetic, but it is a very large CSV. Depending on your internet connection, just getting the data in can be a time-consuming step, but I've already done that, so the file is already ingested into Gigasheet. The cool thing is, I haven't tuned Gigasheet for this file at all. You just drop your file in, it works on any schema, and you don't need to specify anything.

So let's open it up. Already we have a nice result, which is that you can actually see this file. It has a billion rows, but we're just getting one page of data here, looking at the very top of the file. Depending on the tools you're familiar with, even getting this far can be challenging on these very, very large files. But here we are. So we open up our file, and what do you want to do with the billion rows? Probably the same thing as with a smaller dataset: you want to understand something about what's going on in the data. Let's suppose I'm interested in understanding the high-volume hosts on my network, something like that. We'll follow the generic analysis steps of summarize, filter out the noise, and drill in.

The first summary is just to see the data and see the columns. I'm interested in high-volume hosts, so let's do a filter on the bytes per connection, and let's say I want at least 500,000. This is going to go back to the Gigasheet servers, apply that filter, and bring us back results. Now we're filtering only to high-volume entries, high-volume pairs of source and destination IP, and you can see this cut the data down a lot. My result set is now much smaller, so we can probably take some more sophisticated analytical steps on it. A great way to summarize data when you're trying to understand a particular column of interest is to group by that column. So I'm going to do a group on source IP. What this does is aggregate all the rows in my filtered result by the unique values in that column. We can see that across my filtered result set there are three unique source IPs, and they all have a similar volume. I probably want to do another aggregation here and look at the number of unique destinations. So let's add that as another aggregation. Okay, so it looks like they each are only talking to one destination.
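For reference, the same filter-then-group-then-aggregate flow from the demo could be sketched in pandas along these lines. This is only a minimal sketch: the file name and the column names (source_ip, dest_ip, bytes) are assumptions for illustration, and it assumes the data has already been cut down to something that fits in memory.

```python
# Rough pandas equivalent of the demo's filter -> group -> aggregate flow.
# Column and file names are hypothetical; the real file has a billion rows,
# so this assumes you already have a sample that fits in memory.
import pandas as pd

df = pd.read_csv("connections_sample.csv")  # hypothetical sample of the data

# Filter: keep only high-volume connections (>= 500,000 bytes).
high_volume = df[df["bytes"] >= 500_000]

# Group by source IP, summing bytes and counting unique destinations.
summary = high_volume.groupby("source_ip").agg(
    total_bytes=("bytes", "sum"),
    unique_destinations=("dest_ip", "nunique"),
)
print(summary)
```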
So this is a great result. Trying to understand my network traffic, I've found three high-volume hosts, each talking to one IP address. And you could imagine going on from here: I can sort this data, I can chart this data, there's a lot of functionality here. But we've now seen how that analytical flow works in Gigasheet. I started with a huge dataset, filtered it down, did a grouping to get a summary of my filtered results, and then I'm continuing on with my analysis. To give a quick tour of some of the other features: there are functions for cleaning up your data, similar to the kind of thing you might do in Excel with formulas, except here you just point and click. There are also other functions you can apply, including adding interesting third-party data; for example, I can add threat data to my IP addresses, things like that. But that's basically the Gigasheet answer: you drop the file in, open it up, and take these analytical steps to filter, group, sort, and understand your data.

So why is it fun to talk about a billion rows? It's fun because a billion is a big number (thanks here to The Social Network movie). Another way to think about how big a billion is: if you have a thousand rice grains, that's a quarter of a cup; you can cook with that, you can make dinner with that. A million rice grains is three large bags; you could imagine buying that, pretty conceivable. A billion rice grains is three and a half shipping containers, just full of rice. There are a couple of other examples of the same idea on this slide, but the point is, it's a big number. We have just three short words for thousand, million, and billion, but each one is a thousand times bigger than the last. It gets really big.

Despite it being such a big number, huge CSV files are increasingly common. Here are a few popular websites where people are talking about their victories and their difficulties working with these huge files: one person with a hundred-gigabyte CSV file, somebody really pleased that they were able to work with million-row datasets, and some poor person trying to read a six-gigabyte CSV file in pandas and, unsurprisingly, running out of memory. These files are out there; you see them. So here are some tools that you might like to use if you work with data. If you try to handle a billion rows in these tools, you'll hit varying levels of pain, and it can be quite difficult. Starting on the right with the most painful: Excel is not going to help you with a billion rows. Grep on the command line is also going to be pretty tough. Moving leftward, with pandas you might be able to do something. And for the databases, if you're willing to spend the time tuning Postgres, I think you can work with a billion rows in Postgres. But if you just fire these up and hope for the best, SQLite and Mongo are going to have a hard time, and pandas is going to have a hard time.
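One common workaround, if you do want to poke at a file like that in pandas without running out of memory, is to stream it in chunks and keep only what you need. Another minimal sketch, again with hypothetical file and column names matching the earlier network-data example:

```python
# Stream a huge CSV through pandas in chunks so it never has to fit in memory.
# File and column names are hypothetical.
import pandas as pd

matches = []
for chunk in pd.read_csv("huge_connections.csv", chunksize=1_000_000):
    # Keep only the high-volume rows from each million-row chunk.
    matches.append(chunk[chunk["bytes"] >= 500_000])

high_volume = pd.concat(matches, ignore_index=True)
print(f"{len(high_volume)} high-volume rows kept")
```

Even this only covers the filter step; the grouping and anything interactive would still take more work, which is part of the pain these posts are describing.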
There certainly are solutions. If you're willing to pay, and/or spend some engineering effort to get things working, there are solutions, and I've listed some of the more popular ones here. So there are answers out there, but the point is that if you have a billion-row file and you just want to pop it open and see what's going on, it can be quite difficult.

That's why we made Gigasheet. We're the spreadsheet built for big data, that's our motto, and we're on a mission to make insight more accessible. 70 billion rows have been processed through our servers in the last 90 days, and it's not just CSV files: it's CSV files, JSON files, pcap files, lots of other formats. My background is in data science and I struggled with this myself. Huge files are pretty common, and it's difficult to do even the basic things we want to do once the data sizes get larger. That's what we're trying to solve at this company, and it gives us an interesting perspective to share, because we see a lot of data come through our front door. So that's what I'm going to share a little bit about next.

This first chart is a histogram of the CSV files uploaded to our platform, by row count, summed over our past six months of uploads. The bins are 10x each time, so the bottom axis is effectively a logarithmic scale. At the far right we have the rarefied air of one billion rows: exactly one person uploaded a billion-row file to our platform in that period. Moving down a bucket, 100 million rows is a little more common, with 102 files uploaded. But interestingly, most of our files are not that big; they're under 100,000 rows, which compared to a billion feels quite small. Actually, 100,000 is still kind of a big number; if you try to work with that in Google Sheets or some other online tool running in your browser, it will probably struggle. But compared to a billion rows it doesn't seem that big. Still, most of the files our users upload are under 100,000 rows.

Another interesting thing we can see in our data is that users are self-selecting smaller rows, in terms of byte count, on the larger files, which makes sense. Forget about the file contents for a moment: just the absolute size of a file on disk can add to the pain if you're paying for storage or trying to transmit the data somewhere. What we see in this chart is bytes per row on the vertical axis (not total file size, but per-row size), and along the bottom, the log base 10 of the file's row count, so basically the same buckets as before: 100,000 rows on the left and a billion on the right. This is just a random sample of our files, but you can see that on the left there are some files with very large bytes-per-row values, and as you move to the right there are fewer and fewer of them. Basically, you could say users are being more choosy with their columns at the larger row counts.
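To make that bucketing concrete: the histogram bins are just powers of ten of the row count, and the second chart plots bytes per row against that same scale. A tiny sketch of how you might compute both from file metadata; the field names and the sample values are made up purely for illustration:

```python
# Bucket files by order of magnitude of their row count,
# and compute bytes-per-row for the second chart.
# The metadata fields and values below are hypothetical.
import math

files = [
    {"rows": 52_000, "size_bytes": 9_400_000},
    {"rows": 104_000_000, "size_bytes": 3_200_000_000},
    {"rows": 1_000_000_000, "size_bytes": 18_000_000_000},
]

for f in files:
    bucket = math.floor(math.log10(f["rows"]))   # 5 -> 100k bucket, 9 -> billion bucket
    bytes_per_row = f["size_bytes"] / f["rows"]  # vertical axis of the second chart
    print(f"10^{bucket} rows bucket, {bytes_per_row:.1f} bytes/row")
```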
So now that we've seen the overview, let's check out a few examples. I'm going to share some examples sampled from files with 100 million or more rows that were uploaded to Gigasheet in the past six months. We get a file of this size every few days, so again, it's not the majority of our files, but we do get them regularly. Among them is that one file with a billion rows, and we'll look at it last. I'm not going to show the actual data, since that's private user data. I'm going to talk about the metadata, like the file name and the column names, things we can see without looking at the actual user rows.

First example: sensor data, uploaded in February, with 104 million rows and eight columns. The column headers suggest we have some kind of measurements related to a road, 104 million measurements. The user joined this with another file using a cross-file lookup, and then filtered down to the matches that were in both files. That lookup operation probably means they were joining in something about the roads themselves, maybe some other township information. It's a nice example because with sensors, if you're taking regular repeated readings, the data size adds up pretty quickly.

Our second example is financial data, uploaded in January, with 142 million rows and five columns. A quick look at this data shows it's the very standard tick-data format: a company ticker symbol in the leftmost column, and then the open, high, low, and close prices as floats in the other four columns. The file name ended with "_00", so it could actually have been just one of several files in a batch. In any case, this definitely looks like financial tick data, and it looks like it's from the UK. The user didn't do much with it; they just wanted to open it up and look around, so maybe they had a company of interest or a time period of interest, something like that. But financial data, especially tick data where you get a line for every transaction that happens, can be very high volume; it adds up pretty quickly.

Third example: domain abuse. This is a nice example because technical, machine-generated data is another arena where you see high volumes. This file is from January of 2023, with 419 million rows and seven columns. Based on the file name and the headers, it looks like a list of potentially abusive domain names. The user did a lot of work with this file; they took a lot of actions, filtering and grouping and sorting, all these different analysis steps. So I think they were probably doing some kind of investigation, maybe comparing their own network traffic to this reference file, something like that. Technical data like this is a nice example of where you can get pretty big data volumes.

Fourth example: healthcare data. I want to note this is not patient data; it's pricing data related to health insurance in the United States, uploaded in January of 2023 with 870 million rows (we're getting pretty big now) and 21 columns with a mix of column types. The headers included some things that look like United States healthcare acronyms, HCPCS and CPT, which are acronyms for procedure codes. So this looks like some mix of what you can get done at different hospitals, along with pricing. One hypothesis is that this is related to a law passed in the United States in 2022 relating to transparency in healthcare pricing. We have seen some other data files, mostly JSON files, related to the same kind of healthcare pricing transparency, and this looks like another one of those. And it is a very, very large file at 870 million rows.
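A quick aside before the last example: the cross-file lookup in that first sensor example, where you pull in columns from a second file on a shared key and then keep only the rows that matched, is essentially an inner join. Here's a minimal sketch of the same idea in pandas; the file names and the road_id key are made up for illustration:

```python
# Sketch of a cross-file lookup expressed as an inner join in pandas.
# File names and the shared "road_id" key are hypothetical.
import pandas as pd

readings = pd.read_csv("road_sensor_readings.csv")  # the large measurements file
roads = pd.read_csv("road_reference.csv")            # the smaller lookup file

# how="inner" keeps only rows whose road_id appears in both files,
# which mirrors "filter down to matches in both files."
joined = readings.merge(roads, on="road_id", how="inner")
print(f"{len(joined)} matched rows")
```

Of course, at 104 million rows that read_csv call is exactly the kind of step that starts to hurt, which is why doing the lookup in a tool built for that size is appealing.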
Our last example, so here it is: the billion-row file, uploaded to Gigasheet in October of 2022, with three columns. It's a nice example of a very high row count going along with a small column count: a zip code, another zip code, and a float. Based on the headers and the data contents, it looks to be the distance in miles between pairs of United States postal codes, using some kind of U.S. Census designation called ZCTA-5, which I don't know much about. Interestingly, there are only 41,683 zip codes in the United States, according to Google, and squaring that number doesn't land you on a billion: every ordered pair would be roughly 1.7 billion rows, and every unique unordered pair only about 870 million. So whatever these codes are, it's probably not exactly the standard zip-code list; maybe it's some sub-designation, or maybe it's not even a zip code at all but some other kind of taxonomy. But it looked to be the cross product of every possible pair of what looked like zip codes. Probably the user was trying to do something like find the hundred closest regions to a point of interest, something like that. You might wonder, why not just use Google Maps? Why are they opening this thing up? But if you want a list of matches, that's quite hard to get out of a search-based tool like Google Maps, so they probably wanted to do some kind of filter like that. In any case, it's a nice example of one way to get really big data: do a cross product. When you multiply two lists that are already pretty long, it's easy to end up with a very large file, and that might be what happened here. But anyway: a billion rows of distances between geographic points in the United States.

So I hope you've enjoyed hearing about these files that are getting uploaded into our system here at Gigasheet: the overview and these few examples. If you work with datasets and you have some data, we'd love for you to give it a try. It's free; sign up at gigasheet.com. Or, if you don't want to sign up, we also have some public datasets that you can try out, and you can find those by searching "Gigasheet data community" in your favorite search engine. That's all for now. Thanks for watching.