Welcome to another edition of RCE. You can find us online, with our entire back catalogue, at rce-cast.com. Also be sure to hop over to iTunes and leave us a review. Reviews raise our visibility on iTunes and get the word out about what we're doing. I also have here again Jeff Squyres from Cisco Systems, one of the authors of Open MPI. Jeff, thanks again for your time.

Hey Brock, no problem.

Cool. So today we have a project that's a little different from what we've done before, but definitely related to high-performance computing, because it deals with the inputs and the outputs of what we do.

Right. So here we're going to learn about reproducibility, data transport, publishing data, sharing data. This is pretty exciting. So we have Joseph. Joseph, why don't you take a moment to introduce yourself?

Yeah, thanks for having me. My name is Joseph Paul Cohen. I'm the founder of Academic Torrents, the founder and director of the Institute for Reproducible Research, and a deep learning researcher when I'm not working on this stuff.

So Joseph, what institution are you affiliated with today?

I'm beginning a postdoctoral fellowship at the University of Montreal, in the Mila lab with Yoshua Bengio.

So what is Academic Torrents?

All right. In the big picture, it's a legal infrastructure for academics to use BitTorrent. It also comes with a suite of tools that make the tracker easier to use, and some code to help you work with it. And it's an index of data sets and other academic materials, one repo you can check for that data set you're looking for. As a project, we offer free technical support to people using it, and there's a community that has built up of donated storage and bandwidth that flows around the data sets to make sure whatever is in demand is distributed globally.

Okay, so give us the 30-second description for those of our listeners who aren't necessarily familiar with what a torrent is. And how is this different?

All right. Normally when you download a file, you click a link and download it directly from the single server where the data resides. With a torrent, you first download a skeleton of the file, known as the .torrent file. The skeleton gives you hashes of the actual data, so you can have the hashes for a terabyte of data stored in one or two megabytes. Why do we care about those hashes? Because now you can ask anyone in the world for the data, they'll send you whatever pieces of it they have, and you can verify that the pieces they sent are actually from the original file just by comparing them against the skeleton. You don't have to know or trust the sender; you can still verify that you got the authentic data that was meant to be sent to you. That lets you ask the world for the data, have 40 to 4,000 people sending you pieces of one gigantic file, verify that each piece is authentic, and reassemble the original. And it ends up being much faster than downloading from a single source. It's also more robust, because data gets corrupted. If it gets corrupted on a single source, you have to go find a backup, and it's not easy for someone to find a new URL for a file that they trust.

But if you're downloading a torrent, you have the hashes, and the hashes are all you have to trust. Wherever the data sits today, you don't have to trust that host; you can verify that the data they serve is the real data you were looking for and the real data you were meant to download.

Okay, so these hashes, where each hash represents a chunk of the data and we expect no hash collisions, are what guarantee that the piece I'm getting from user B, and this is all happening transparently, is actually the data, and they haven't been able to modify it and run one past me. I know I have the original copy.

Yes, exactly. If you receive some bad data, or data that was corrupted in transit, your client automatically computes the hash on it and decides: that's bad data, or that's the correct data. This also turns out to be a benefit for data at rest, sitting on your storage arrays, your NAS and SAN systems. If you have, say, four terabytes of data, the probability that one of those bits will flip or some piece will get corrupted goes up. Because you have the torrent file with these hashes, you can verify that the data on your own machine is the same data you originally downloaded. And if not, with the torrent infrastructure you can re-download just the one piece that got corrupted. So for data people send you, you can verify that it's correct, and for your own data sitting on disk, you can verify that it's correct.
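To make the verification idea concrete, here is a minimal sketch for a single-file torrent, assuming you have already parsed the piece length and the list of 20-byte SHA-1 piece hashes out of the .torrent skeleton:

```python
import hashlib

def verify_pieces(path, piece_length, expected_hashes):
    """Compare a file on disk against the SHA-1 piece hashes from a
    .torrent skeleton; return the indices of corrupt pieces."""
    bad = []
    with open(path, "rb") as f:
        for index, expected in enumerate(expected_hashes):
            piece = f.read(piece_length)  # the final piece may be shorter
            if hashlib.sha1(piece).digest() != expected:
                bad.append(index)  # only these pieces need re-downloading
    return bad
```

A real client runs essentially this check on every piece it receives from a peer, and on every piece it finds on disk during a recheck.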
So for moving all this data around, doesn't that mean the people I'm getting these chunks from have to be running some sort of server or something?

We actually work with two different paradigms of BitTorrent. There's the basic BitTorrent protocol, where, yes, you need a BitTorrent client hosting the data to serve it over the BitTorrent protocol. But we also encourage and work with people to use the concept of an HTTP seed, which we integrate nicely with the GUI on the site so you can manage the data you're trying to host. These HTTP seeds reside on a regular HTTP server. So maybe your computer science department's web server, or any web server, can hold your data and act as one of the sources. We merge those together, using parts of the BitTorrent specification, to make sure your data can be downloaded from anywhere. If someone is downloading from some location and there are no BitTorrent peers around, that person simply downloads from the HTTP server where the data originally lived. But as more people start downloading, the peer-to-peer BitTorrent side builds up, which gives you a scalable way to distribute the data while relying on both HTTP web servers and BitTorrent clients.
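In the torrent metadata, an HTTP seed is just an extra field alongside the tracker and the piece hashes. One common form is the url-list key from the web-seeding extension (BEP 19); decoded, the structure looks roughly like this sketch, where the field names follow the BitTorrent spec and the values are illustrative placeholders:

```python
torrent_metadata = {
    "announce": "http://academictorrents.com/announce.php",  # the tracker
    "url-list": [
        # plain HTTP mirror(s), e.g. a department web server; clients fall
        # back to these when no BitTorrent peers are around
        "http://cs.example.edu/datasets/mydata/",
    ],
    "info": {
        "name": "mydata",
        "piece length": 262144,  # bytes per piece
        "pieces": b"<20-byte SHA-1 digests, one per piece, concatenated>",
        "length": 157286400,  # total size, for a single-file torrent
    },
}
```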
Now I want to jump back a little bit. You also mentioned there's a legal framework aspect to this. Can you describe that?

So basically, this is purely for hosting data legally. Before this, and this was one of the motivations, I'd seen people who needed to transmit a data set put it on some kind of pirate infrastructure, on a system where the data set sits right next to a movie. And people have a lot of concerns about those movies being on the same sites as their data. This way, you can go to your boss and say, we're going to put this data on Academic Torrents, and we stipulate that everything on the site is legal to share, and it's specifically designed to aid researchers. It's a kind of protection, if you will, when dealing with your bosses or the other researchers you're working with. You don't have to go to the IT department and justify why you need access to The Pirate Bay to download ImageNet just because it's really big. Academic Torrents provides the clean image that everything is legal, and your IT department doesn't need to freak out that all their HPC servers are contacting The Pirate Bay or some other BitTorrent index or tracker. Having a fully legal infrastructure makes it easier for people to adopt this protocol and reap the benefits without having to deal with any of the negative aspects.

That kind of leads into: what types of data do you distribute via Academic Torrents?

We had two categories for a very long time, and we just added a third, so there are papers, data sets, and courses now. Data sets were a huge win; people love using this for data sets, and almost all the traffic is from data sets. Papers were an original direction, and there are papers there, but so far people aren't adopting this for papers too much. And then there are courses, meaning video courses. So if you're going to be on a flight and you'd like to download all the OpenCourseWare video about physics, you can do that, get on the plane, and have all the files stored offline. Even though these videos are already freely available online, some people would much prefer to download them rather than stream. So data sets and courses have been pretty popular, especially when they get big. When you have a big data set like ImageNet, which is 150 gigabytes, it takes forever to download over HTTP directly. Using the peer-to-peer model speeds that up, because you can have a fast connection and download from multiple peers at once. That's the allure of downloading large data sets through the system. There are also data sets that are really popular but don't seem to have a home anywhere else, like the Netflix data set. It's not on any academic website anywhere, but it is on Academic Torrents, and people are downloading it. A lot of the data sets people want are related to deep learning, so there are a lot of image data sets, whether it's the Tiny Images data set, which is about 400 gigabytes, or CIFAR-10 and CIFAR-100. All the seminal data sets that are used in tons of papers are on this platform, and people are downloading them.

So you gave some examples of data sets. What are some of the most popular, and I looked around the site a little bit, or most notorious data sets that you're hosting on there?

One of the most notable ones that I find useful is a virtual machine image for a course that's run at UMass Boston.
The professor distributes their virtual machine image through BitTorrent. I knew this professor, and he knew I was working on Academic Torrents. He told me he'd been trying to distribute this VM from his website by just sending out the HTTP link, and almost all his students emailed back saying, why don't you host this as a torrent? It's taking so long to download. So that was a nice personal story. For the past few semesters now, he's been distributing this VM, a 3-gigabyte VM, and every semester each student downloads it maybe once or twice, so call it 300 downloads per semester. It immensely helps the students, and you can watch, when the course begins, that torrent spike to the most popular thing on the site at that moment, a buzz of people downloading it. So that's a very notable file.

So this is obviously working well for data sets and things like that. Do you do any curation? Do you have people approving data sets being added, or is this an open index where anyone can create a new torrent and attach it to the tracker?

Anyone can sign up. But if you want to upload a data set, you need to have an .edu email address, or we'll have to approve you, and you have to tell us that you're an academic in some context, because the site is geared specifically towards helping academics share their data. We've had issues where things were obviously out of line; if someone uploads Mission Impossible to the site, we have to find that and disable it. So there is curation, but it's very light. We haven't had problems recently with people uploading things that are obviously not legal, and generally the curation is for those blatantly illegal things, plus helping people get their data orchestrated with the system. Sometimes we'll see people upload data and we'll help them set it up properly so they don't have any issues with people downloading it.

This does strike me, though: as your project gets bigger and more popular, you get more uploads, and things will get more gray, as opposed to the simple black and white of, oh, that's a Hollywood movie, we should not allow that, versus, this is a chart of stars in the galaxy, we should obviously allow that. How are you forecasting to handle that? Is there a committee or a council? Who decides whether data is academic and therefore worthy to go here, or not academic and shouldn't be hosted?

It's been pretty clear-cut in every situation so far. We have restrictions on who can put files on the site, and that pretty much makes these problems go away. But we do have a board of directors of the Institute for Reproducible Research, which is me and two other people, and if anything were weird, or we couldn't make the decision, we'd have a discussion, have a meeting about it, and figure out our trajectory. But we haven't had any issues that really made for a difficult decision.

I ask this in the context of some of the more notable data sets you have there. For example, you have the Hillary Clinton emails data set. In some ways that's an incredibly valuable historical record and a sociological record.
There are all kinds of things you could study with that. But looked at through another lens, someone might say, well, that's just a political statement, that's not academic work at all. And with the Enron emails you could make similar claims, depending on who you are and what your biases are. The issue could be gray. How do you handle these things? I mean, you clearly already made the decision on these ones; they are published, they are available, and so on. So how did that go?

Well, the Enron email data set is an old, well-studied data set in academia. And the line between that and the Hillary Clinton emails: if they're public emails, then they fall in line with what the Enron email data set has been historically. It's a studied piece of the public record of history. So if there's academic interest in it, then it qualifies as a data set that someone can use the service for.

Okay, so let's talk about quantity now. I guess there are two different numbers. How much data does the tracker disseminate in any given year, and how many individual pieces of data are you tracking?

The total amount indexed, which we keep as a live counter on the homepage, is 15.75 terabytes as I'm looking at it right now. In total, we have served almost 798 terabytes over the lifetime of Academic Torrents, and that keeps going up. A while ago the average was about one terabyte a day of traffic, and now we might be over 1.5, even two; it seems to be increasing. The traffic gets bigger every day, and it's a surprising amount whenever I look at it.

Okay, so that's a truckload of data there. Do you ever prune any data? Does anything ever expire, or do you require people to re-up their data once a year or something, to make sure it's not stale and useless?

No. For keeping the data fresh on Academic Torrents, you always need someone hosting it. Sometimes people will upload data, seed it for a week or so, and then just disappear, and so does their data, because no one ever wanted to copy it. That's the democratic process of hosting data, which is how we envision Academic Torrents being sustainable forever: the community decides which data exists forever. There are a lot of volunteers who seem to comb Academic Torrents, reading the latest uploads, and then decide whether to put something on their hosting infrastructure or not. Some people have automatic downloads set up, but other people curate it themselves, and we like that angle; it's up to them. If you're donating bandwidth and storage to Academic Torrents, you make these decisions. We don't want you to download everything, because that would be too much. Some data sets are gigantic and no one wants to host them. But also, not all the data has to be available to everyone all the time. There's one specific data set, I think the whole thing is seven terabytes or so, where all the seeds are only turned on when that professor wants to share the data with someone. I'll watch this happen: someone tries to download it, and I'll see that and think, that person's not going to get the data, because there are no seeds.
And then I'll see all the seeds come online; that person must have emailed the professor, who turned on all their hosting, and the data gets transferred over once using BitTorrent. Then all the seeds shut off again. So it's a tool they're using where the indices are all there on the site, but the data might not be hosted all the time. You can try to download something, reach out to the person who uploaded it, and say, hey, can you share your data? And then they'll host it again.

Okay, well, that's a fair point. Actually, that's pretty fascinating; I've never heard of torrents being used that way. But how do I find that? Because you said something interesting in there: I tried to download it, it wasn't there, so I emailed the professor. How do I find that data set, and then find the owner to contact them?

So specifically, that data set was probably linked from some paper that person was reading. Generally you wouldn't find that data set in the search, because no one is seeding it. The search is crafted so that it uses the popularity of data sets to return what's more relevant to your query. When you search Academic Torrents, what you're seeing is an algorithmically curated list of what's available and what's popular. Data sets that aren't seeded or hosted often get pushed down the list, and the stuff that's popular, and probably what you're searching for, pops up to the top.

So what if I want to use something like Academic Torrents? It sounds like you don't do seeding, but as a data provider who wants to put data out there, could I pay you, or is there an easy way to use one of the cloud providers to seed the data I want to publish?

We actually have a donated BitTorrent CDN from Whatbox. They have graciously donated a gigantic server, and we curate what's on it. A lot of other people simply donate the seeding. For data sets we'd like to help people host, we'll throw them on our CDN as long as we have space, so not the ones that are far too big, but the ones of reasonable size where we can help out. That's how we manage those donations. Alternatively, you can find an initial hosting location anywhere. We craft the torrent to be embedded with an HTTP URL that works as a backup. You can get cheap HTTP hosting from any provider that gives you unlimited storage. It doesn't have to be fast, which is the benefit: you can buy a cheap web server somewhere, put your data on it, and it'll be horribly slow if you download from it directly. But we help you embed that cheap hosting into a torrent, so you get the speed advantage of peers while only having to host the data on a really cheap provider.
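As a sketch of that authoring step, here is roughly how you could build a torrent with an embedded HTTP backup yourself, assuming the rasterbar libtorrent Python bindings; the paths, tracker, and mirror URL are placeholders:

```python
import libtorrent as lt

fs = lt.file_storage()
lt.add_files(fs, "mydata")  # everything under the mydata/ directory
t = lt.create_torrent(fs)
t.add_tracker("http://academictorrents.com/announce.php")
t.add_url_seed("http://cheap-host.example.com/mydata/")  # slow HTTP fallback
lt.set_piece_hashes(t, ".")  # hash pieces, relative to mydata's parent dir
with open("mydata.torrent", "wb") as f:
    f.write(lt.bencode(t.generate()))
```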
There are also other places that let you host your data for free. We have an integration with Google Drive: when you go to upload, if the file is small enough, and small here means around 30 megabytes, so it's pretty limiting, you can upload your data directly through the upload page on Academic Torrents. It will put your data on Google Drive, share it in just the right way, embed the URL inside the torrent, and take care of all the magic configuration that we've figured out works. So you can take your data, upload it directly to Academic Torrents, and people can download it without you having to host anything, because Google Drive is storing your data and it's being distributed through Academic Torrents. That's one solution we've figured out. We wanted to extend it to other services, but we haven't found partners willing to take that paid step with us to provide paid hosting, so we just rely on the free Google Drive accounts for it.

You said a CDN in there, from Whatbox. Can you define what that is?

People could also call it a seedbox. It's just a server somewhere, I think ours is in the Netherlands, with a really fast internet connection and a bunch of storage, and you can go to a web page, upload torrent files, and they'll download onto that box. So it's just a BitTorrent server located somewhere. The terminology CDN fits because it matches what people think a CDN does, which is what it does: it delivers your content, but using BitTorrent.

Now, what if I'm providing a piece of data, say I'm somebody who not only publishes data but actually collects it, and it's updated on a regular basis? Is there a way for me, as the original creator, to say, sorry, it's slightly different now, put out a new seed, and everybody who had the old version gets updated? Is there any concept of versioning here, or is every version its own seed?

Generally we like the principle that a specific entry dictates a specific set of files, with those bits in a specific order. But BitTorrent has this amazing property. Say you have a folder with 100 files in it that you downloaded via BitTorrent. You can remove the torrent from your client, and the data is just sitting on your drive. Now say someone publishes a new version of that data with just one file changed. You can download the new torrent file and point your torrent client at the old data. It will look at the hashes of the pieces of all those files, find that just one file changed, and download only that one file. So that gives us versioning at a very low level, a bit level. It ends up working really well in specific situations. It's not so friendly to use, but for gigantic amounts of data it's definitely worth it. When you have 150 gigabytes of data and only one gigabyte changed, it's much easier to have your client verify which data you already have versus what's new than to download all the data again.
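With libtorrent's Python bindings, that update flow can be as simple as the sketch below, with file names as placeholders: add the new torrent with the save path pointed at the old copy, and the client hashes what is already on disk and fetches only the pieces that do not match.

```python
import libtorrent as lt

ses = lt.session()
info = lt.torrent_info("dataset-v2.torrent")  # metadata for the new version
# ./dataset still holds the old version; on add, libtorrent rechecks it
# piece by piece and downloads only the pieces whose hashes changed
handle = ses.add_torrent({"ti": info, "save_path": "./dataset"})
```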
Okay, but this makes it sound like this is really intended to be a distribution mechanism, not necessarily a storage mechanism. If I'm generating all this data, I wouldn't use Academic Torrents to store and access it; I'd use it to say, all right, here's all the data I've generated, and now lots of other people can get it. Am I getting the right sense here?

I think the way to think about it is that it's a tool. You can't really rely on the volunteer hosting that exists in the community; that's good for making your downloads go really fast, because the data is hosted all over the world. But if you're really concerned that you want your data available forever, and you'd like it to be easily downloaded, I think this platform is perfect for that. I actually use it for my research data, because it makes moving around and managing big data easy. If you want to store some data, you can set up multiple hosting locations all over the world yourself, so you don't have to rely on anyone else. You take your data, put it on one machine, hook a BitTorrent client up to it, and that data is now hosted. Then, say you want to host it somewhere else: you set up a server there, run a BitTorrent client, and replicate the data from the original location using BitTorrent. Now you have a redundant copy, and you control both locations. If one of them goes down, it doesn't matter; you have the second backup location, integrated with all these great file-checking capabilities, so you know it's still a valid source of the data. So you can set up your own global distribution network for your own data. If you think about it as a tool in that sense, you can use it to make your data resilient to disaster anywhere. Especially if you work with multiple research groups and each group hosts the important data sets it cares about: if one location has a break-in, or their server breaks, that data exists on machines in all the other labs, so there's no need to worry.

So you also mentioned the Institute for Reproducible Research. Can you give us a little more background on your involvement and what the goal is?

Yeah. The goal of the Institute for Reproducible Research is to make research more accessible and to empower researchers to move research forward. It's embodied in two projects now. There's academictorrents.com, which has been around for a while and serves its goal very well, and a new project, shortscience.org, which aims to make research papers more accessible. Research papers are often confusing, so it's hard to understand what the significance of a paper was, and to evaluate it and think about it. A lot of times people write blog posts about research papers that explain what the paper's contribution to the scientific community was. Otherwise you'd have to talk to some leader in the field who knows what the seminal work of that field was. As a new researcher, it's hard to get a feel for what research is important and how to understand it, because research papers can be confusing. There's a lot going on, so it's hard to know what the takeaways from a paper should be, and whether you really got them all, whether you truly understood the paper or the research. ShortScience is like a journal club for the world. You can look up any paper that exists, since we tie into tons of databases, and see the summaries people have written for it. Currently we have about 300 summaries. That lets people get into a specific field much more easily.
Right now it's really targeted towards machine learning, because the people currently using it are all in machine learning, and because they're there, more people use it to study machine learning, so it's growing in that field. But that's another arm of making research more reproducible: making it easier to understand, not just for the general public, but for other researchers. A paper comes out a month ago; what's so crazy about it, why do people keep talking about it? You can read it, but maybe the nomenclature is difficult. So we have two arms making research more reproducible: making it easier to get data sets so you can run those experiments again, and making it easier to understand research papers.

So how do you sustain all of this? Where do you get the resources and the funding for the servers, the people, the time, and all these kinds of things?

For a while, and still to this day, it's mostly based on my contributions to the project; I've been working on it for a long time. The costs are really low. There's a lot of engineering work that would cost a lot of money but is simply donated. And the actual server infrastructure, the way we organize everything, is very cheap, especially since we don't host data. We make 15 terabytes available, but we don't host that data; we just host the index of it. So the overhead costs are low, which makes it easy to sustain through our own funds. We've had donations from people, which don't actually cover the costs, but the costs aren't significant. We recently gained non-profit status; the application had been going for a while, and about three weeks ago we officially became a non-profit. That opens up a whole realm of grants to apply for, and I think that's a route we're going to take, although the founders have been too distracted by other things to devote a lot of time to writing grants. As long as the system is working at minimal cost, we can sustain it. It would be great to pursue grant funding in the future, to have more impact in the field, maybe host more data, maybe have better integrations with universities and labs. We've been gearing up to move in that direction. But the costs are not bad, so it's not terribly inconvenient to run this thing.

So you mentioned integration with universities. What about integration with software? There are DOIs. Do any of the common tools out there, like Hydra or Fedora, have the ability to have a DOI point to a torrent living on Academic Torrents for a supporting data set?

This was one of the initial questions we got when we launched the site. I looked into DOIs, and they're actually expensive, so with almost no money we didn't even deal with getting them for every torrent. And we have info hashes, which replace the need for a DOI; it's like a cryptographic DOI. The info hash identifies that exact piece of data perfectly, so we didn't really need to go through that. People have actually put these info hashes in their research papers to signify which data on Academic Torrents they're pointing to. So it hasn't really been a need.
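Concretely, the info hash is the SHA-1 digest of the bencoded info dictionary inside the .torrent file, so anyone can recompute it from the metadata alone. A minimal sketch, assuming the third-party bencodepy library:

```python
import hashlib
import bencodepy  # third-party bencode codec

with open("dataset.torrent", "rb") as f:
    metadata = bencodepy.decode(f.read())
# SHA-1 over the re-encoded info dict gives the 40-hex-character info hash,
# which pins the exact bytes of the data set it describes
info_hash = hashlib.sha1(bencodepy.encode(metadata[b"info"])).hexdigest()
print(info_hash)
```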
We've also integrated a bit with other software. A while ago, we released a utility called ATDown. It's a downloader specifically meant to run on high-performance computers. It's written in pure Java, so it doesn't require any special packages that would need to be compiled on an HPC system. You probably can't email your system admin and say, I'd like you to install libtorrent so I can use BitTorrent on your cluster, but the cluster probably already has a module for Java. That was the goal, to make it easier, and at the time I needed it to download aerial images on a cluster. So we worked on a pure Java BitTorrent client integrated with Academic Torrents: you type in an info hash, it looks that hash up on Academic Torrents, and it downloads the file right into a directory on your cluster or wherever you are. It also downloads collections of files. We have an interesting collections system on Academic Torrents: torrents can be grouped into collections, so you can have a collection for deep learning or a collection for ImageNet. If you want to download all the ImageNet data, even though it's spread across multiple torrents, you can download the entire collection. And this helps people mirroring data: if you have a BitTorrent client and you'd like to mirror all the data related to deep learning, or computer vision, or all the course video lectures on Academic Torrents, there's a collection for each of these, and you can grab an RSS feed for each one and download it with whatever client you want. So we've worked to make it easier to integrate with other software that exists, either by matching existing pieces of software or by writing our own, like ATDown.

In the future, and we're actually starting to work on this with a graduate software engineering course, we want to integrate Academic Torrents downloading into languages like Python and R. A lot of the time, when you look at some code example, say a neural network that classifies the 10 labels of CIFAR-10, demoing some network with some tool, the first lines of the demo script go and fetch the data, unzip it into folders, and check whether it's already there so it doesn't re-download it, before going on to the actual scientific piece, which is showing how the network works. With this new utility, we could have it say, just get this file, and it will take care of downloading it, verifying that it's the correct file, and, if it has already been downloaded, checking whether the correct data is still there and whether any of it is missing. So it makes reproducible research even easier, fitting into the life cycle of a person demoing code on a seminal data set and making sure the data they were using stays available, because if the code is popular, its users end up mirroring the data. Often people put data on some web server run by their university, and when they leave, their account goes away, the data is gone, and that source code demo can never work again. But if it was a popular piece of code and people had mirrored the data, it would still be available, and the original code published with the paper would still work, if they had been using this new tool we want to develop, a Python and R package for Academic Torrents.
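As a sketch of what that could look like, with the package and function names invented here for illustration (this is the envisioned interface, not a released API):

```python
import academictorrents as at  # hypothetical package name

# Fetch a data set by its info hash. If the files already sit in the
# datastore and verify against the piece hashes, nothing is re-downloaded;
# if a piece is missing or corrupt, only that piece is fetched again.
path = at.get("<info hash from the data set's page>", datastore="./data")
print("data set ready at", path)
```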
All right, so what is the most surprising or unanticipated use of Academic Torrents that you've seen? One of the cool things we get to talk about with our guests on the show is that they design some cool piece of technology and make it available to people, and then the amazing creativity of the human race finds surprising and unanticipated uses for it. What have you seen with Academic Torrents?

I hadn't thought about that. We did not expect course videos, course lecture videos, to be as popular as they are; that sprung out of this. And it makes sense: if you're on a plane, or traveling without a good, fast internet connection, and you want to watch a set of course lectures, it makes sense to download the entire class with all the videos to your laptop. So that's one that comes to mind. I'm sure there are others. Here's another one: somebody uploaded all the images from a museum. The museum had an API where you could view their collections online, but you had to use their website and jump through hoops. So someone wrote a scraper, scraped all this public image data into a folder, and shared it on Academic Torrents. I thought that was really neat, because that museum's collection was now globally distributed, so everyone could experience that museum, immortalized in a torrent file full of all the images from its collections. That inspired me, in a way, to think: this is taking data and making it so much more accessible to everyone.

So how can someone contribute to Academic Torrents, either via development or data sets?

I think the biggest contribution is putting data sets on Academic Torrents and integrating that with the way research is done in your area, bringing it into the life cycle of how you do research with data. That's the most useful way. A lot of people also choose to donate bandwidth by hosting the data sets they like and want to share. We make this easy with collections: you can just subscribe to a collection curated by someone else, and your BitTorrent client will automatically download all the new things uploaded into that collection. And I think that's the preferred way of donating bandwidth and storage to us, because it puts you in control. We're all about democratizing how the site is used.
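Most mainstream BitTorrent clients can watch an RSS feed directly, or you can script it yourself. A minimal sketch, assuming the third-party feedparser library and an illustrative collection feed URL:

```python
import feedparser  # third-party RSS/Atom parser

# Poll a curated collection's feed and hand new torrents to your client
feed = feedparser.parse(
    "https://academictorrents.com/collection/computer-vision?rss=1")
for entry in feed.entries:
    print(entry.title, entry.link)  # e.g. pass the link to your client's API
```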
We're also a non-profit now, so donating money is definitely a thing, and you can write it off on your taxes as a charitable donation. For code, we have a GitHub repo with a lot of projects people have worked on in the past, which have a lot of bugs and are totally undocumented, so there's a lot of code there. If you email us and tell us what your specialty is: right now we're looking at making interfaces for Python and R, so if you're interested in that, we'd love for you to work on it with us and the graduate students who should be joining us later this semester or next. We also have a desktop client that we've worked on a tiny amount. There are a lot of started projects we'd love to talk to people about, wherever their specialty is. Wherever they dream BitTorrent could be in research, we're willing to talk about it and how that can integrate. So on the code side, we're open to working with people on all these things; take a look at our GitHub. But the biggest thing is putting data on the site. That's the most important thing: making the data available to be mirrored globally. Once you put a data set up, the community decides whether it wants to mirror it, and usually when people want to mirror a data set, it ends up hosted at at least 30 locations globally. So that is the best thing to do.

Okay, Joseph, thank you very much for your time. You can find Academic Torrents at academictorrents.com. And once again, thanks for your time.

Thank you. Thanks for having me.