My name is Matias. I'm filling in for my co-worker Carissa because she's feeling sick today, so unfortunately you're gonna have to make do with me. All right, I think it's turned on now. I was saying I'm filling in for my friend Carissa, who's also my co-worker. We both work on the Dat project and she's feeling ill today. So this is gonna be a little bit improvised, which is probably gonna make it even better.

My name is Matias. I go by mafintosh online, if you're on IRC or any of those things; that's how you can get in touch with me. Anyways, I'm gonna talk about the Dat project and I'm gonna try to do some demos also. My computer didn't work with this adapter, because I have one of those new fancy computers that doesn't work with anything anymore, so Jeremy was kind enough to lend me his computer. That way you also know that this is actually gonna be live demos and not just cheating, which is great.

So how many in here have heard about Dat before? Okay, that's pretty good. We've done a good job doing that. I work mostly on the technical stuff, but we're only three people, so we all tend to work on everything. Basically, Dat is an open source tool for sharing data and versioning data, and what that means you'll hopefully get a little bit better understanding of after this talk.

We're a non-profit project primarily funded by the Alfred P. Sloan Foundation, which is great because they pay me. I like that. I like getting paid. Which also means that we are completely non-profit and everything is completely open source. We try to do as many things as possible in the open. We tend to do all our meetings on live YouTube Hangouts and stuff like that.
So if you ever want to get involved, just, you know, go on our IRC. It's very easy to get involved. Dat is a three-person team currently, and we currently maintain more than 800 modules on npm. I usually put this stat up here because I maintain 400 of them, which is, like, the most, so that's important to know. Fun fact about npm is that npm has around, I think, 200,000 modules. So this is a percentage, and since you're all data scientists, you will realize that it's around half a percent of 200,000. So if you ever install 200 modules, there's probably a good chance you installed one of ours. So you're welcome.

So like I said, Dat is a peer-to-peer file sharing network. It's all written in JavaScript, because JavaScript is awesome, and it mostly works in the browser also, because it's written in JavaScript. It's primarily built for sharing datasets and versioning datasets. We talked about this a lot, about making tools that allow you to move your data to the code instead of moving your code to the data. And we actually changed that a bit, because data is just files, so it's all about just moving files to the code. And actually it's a little bit better than that, because it's mostly about only moving the files you need.

In data science we tend to have these huge datasets, you know, millions of files, and they're all really big. But you might run an analysis pipeline that only uses a couple of them, or only parts of them. So we try to make tools that allow you to do that, all in a peer-to-peer fashion, so you don't need to host anything anywhere. Basically, you can have a whole huge network of people helping each other out and sharing the data, similar to BitTorrent if you're more familiar with that. We share a lot of the same ideas behind BitTorrent. We just try to apply them to science instead of movie sharing, which is awesome.

So what does that look like? Well, right now it's a command line tool.
It's really simple because it only has two commands. You can install Dat from npm. I should have put a slide in there that told you how to do that, but I just did these slides ten minutes ago, so I'm gonna tell you now: you can do npm install dat. That's how you do it. Once you get the tool, you just need to run dat link and then you need to give it a file, and I can try doing that real quick just to show you that it works. So hopefully it's done installing.

So it doesn't show over there. I didn't realize that, sorry about that. We're on it. It's not my computer. Yes, you can get it from npm. You just do npm install dat, and that's also how you get updates: when you want to get updates, you just run npm install dat again, which takes like a couple of minutes, and then you have it on your computer. We're also working on a desktop app, if you're not a big fan of command line tools, that will allow you to just click a button instead of running a command.

Awesome. Thank you, Jeremy. So is this big enough? Can people kind of see what's happening here? Cool. So you would run npm install dat, you need that, and you'll get the tool. I'm not gonna do that because it takes a couple of minutes, and I'm not gonna waste time on it. So I did it before. I cheated a little bit. It's fine, I'm allowed to do that once in a while. See if I can find my demo. Oh, it's here.
Cool. And then if you want to share a file, and I have some cool files in here, I have one that's a CSV file, you just do dat link and you put in the file, and that will traverse the file and give you this Dat link that you can just copy. See if I can figure out how to copy on Jeremy's computer. And then you can just give this to a friend, they run the dat command with it, and it'll automatically find the other person and start downloading data. There are also APIs for only getting partial data. So it's a really, really simple tool, actually, and this link has enough information to verify the content, to make sure they're actually getting the right content. It's like a cryptographic proof of the content and stuff like that. So that's the primary use case.

See if I can figure out how to get back to this. Yeah, so, you know, you link and you get data. That's basically it. The cool thing about this, at least I think, is that it's actually really simple to use, because there's only two commands. So if you get the command wrong, it's probably the other one.

But there are more things we want out of a tool like this. For example, if I share a dataset and I update that dataset, I don't want to download the entire thing again. That's something we take for granted, if you're familiar with source code, when using a tool like Git. So we spent a lot of work trying to make diffing better, so if you download two similar datasets, you only download the diff between these two datasets, right? And that's a little bit tricky, because datasets are just files, and files can be anything, right?

So I'm gonna get a little bit technical, because that's what I like to do. How do you do it? Well, the way you normally do this is that if you have some way of dividing a file into chunks, it's a lot easier to reason about, and a chunk is just a partial file, right?
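To make the earlier point about the link being a cryptographic proof of the content a bit more concrete, here is a minimal Python sketch. It is not Dat's actual link format: the `make_link` and `verify` helpers and the choice of SHA-256 are assumptions made purely for illustration. The idea shown is only that a link derived from the content lets the downloader check what it received.

```python
import hashlib

def make_link(content: bytes) -> str:
    """Derive a hex 'link' from the content itself (hypothetical format)."""
    return hashlib.sha256(content).hexdigest()

def verify(link: str, received: bytes) -> bool:
    """The downloader recomputes the hash and compares it with the link."""
    return make_link(received) == link

original = b"time,magnitude\n2016-05-01,4.2\n"
link = make_link(original)

print(link[:16] + "...")
print(verify(link, original))            # content arrived untouched
print(verify(link, original + b"x"))     # content was altered on the way
```

Because the link is computed from the bytes, anyone holding only the link can reject data that was corrupted or swapped by a peer.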
So if you're familiar with how Git works, Git cheats, because it's all about picking a good chunk, and in Git, if you have some source code, a really good chunk in source code is a line, right? Because that's kind of the natural delimiter when we write source code. You write something, you put a newline in there, you write a bunch of other lines. So Git just acknowledges this and divides everything into lines and says every line is a chunk, and that's the thing I'll deal with independently. So if I change a line, in this example I changed the second line, and I then look at all the chunks and compare the two versions of this file, three out of four lines are the same, right? That's pretty good for a diff, because it means that if we have a system that's a little bit smart, it can figure out that I only need to sync that one line. So this is a really good model. It's really simple and it works really well. If you've ever used Git, you've probably tried it out, and it works super awesome.

There's only one problem. It only works for text files, and not everything is a text file. If you've ever tried to put a binary file into Git, you've probably realized that this doesn't work that well. So we use this technique called Rabin fingerprinting, which is almost the same idea, except that instead of just using newlines you use some cryptographic magic to find something better than a newline. But it's the same idea: you find some sort of natural delimiter in a file. So it kind of works like this. It's an algorithm. We didn't even invent it. It's something like from the 70s.
I think. There are cool papers out there for it. You have an algorithm that kind of scans through a file, and it will give you chunks based on the actual content. It'll look at the content, do a sliding window kind of thing, and once in a while it'll find a delimiter it likes, based on some parameters that are not coupled to the actual file content, and say, oh, this is a good chunk, this is a good chunk. And not all the chunks are the same size. Some are big, some are small, but you can tweak it to be around some size that you like.

The really cool thing about this is that if you were to insert something in the middle of the file and then run the scan through it again, it will produce the same chunks on both sides of the insert, which is actually really cool. The only things that might change are the neighboring chunks. In this example that's the, I guess that's kind of orange colored, I'm not sure what to call that color. Beige? Yeah, beige, pretty good. So if you use Rabin, only the beige things would change, and that's the only thing you would then need to sync. It's on npm, if you're interested in that. There's a Rabin chunker that Max wrote, called rabin.
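The sliding-window idea described above can be sketched in a few lines of Python. This is a simplified stand-in, not Rabin fingerprinting itself and not the code in the rabin module: it recomputes a CRC32 over the window at every position instead of updating a polynomial hash incrementally, and `WINDOW` and `DIVISOR` are made-up parameters. What it does show is the property from the talk: after an insert in the middle of a file, only the chunks near the insert change.

```python
import random
import zlib

WINDOW = 16    # bytes the sliding window looks at (assumed value)
DIVISOR = 64   # a cut happens on roughly 1 in 64 positions (assumed value)

def chunk(data: bytes) -> list:
    """Split data wherever a hash of the last WINDOW bytes hits a chosen pattern."""
    chunks, start = [], 0
    for end in range(WINDOW, len(data) + 1):
        # The cut decision depends only on the bytes inside the window,
        # never on the absolute position in the file.
        if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # whatever is left after the last cut
    return chunks

# A pseudo-random "binary file" and a copy with a few bytes inserted mid-file.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
insert = b"NEW BYTES!"
edited = original[:10_000] + insert + original[10_000:]

before = chunk(original)
after = chunk(edited)
known = set(before)
changed = [c for c in after if c not in known]

print(len(before), "chunks before,", len(after), "chunks after,",
      len(changed), "chunks that would need to be synced")
```

Since a cut only depends on the bytes inside the window, every window that does not overlap the inserted bytes makes the same decision as before, so the number of changed chunks is bounded by the window size plus the insert length no matter how large the file is.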
You can try that rabin module out. A really cool thing about this technique is that if you have two independent files that are similar, it allows you to figure that out without declaring that they're similar, just by running them through a Rabin chunker. Then you'll notice that most of the chunks are the same. So it's actually a really powerful technique that works really well for syncing files.

We also have this notion that if you ever download a file, you should only download it once. If somebody else shares the same file, you shouldn't have to download it again. Similar to that, using something like a Rabin chunker, we also only want to download parts of a file once. So if a person is sharing a file and her friend is sharing a similar file, you don't want to download the similar parts twice, right? And we use this cool technique for that called Merkle trees. It's my favorite slide. I'm not going into detail today, but it's a really cool technique where you get all these things for free, right? And that's also on npm.

Anyways, so that's the core part of it. How much time do I have left? Okay, cool. That's a lot of time. That's awesome. So I can make a lot of figures now. So, you know, that's the theory. If you're interested in hearing more about how it actually works, low-level, come talk to me. I won't stop talking about it, so you'll have to go away at some point, which is fine. I won't judge you. But I thought it's more fun to show some demos instead, because that's probably a better way to understand it. So see if I can go to Jeremy's browser here. It's awesome.
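For readers who want a rough picture of the Merkle tree idea mentioned above, here is a small Python sketch. It is a generic binary hash tree plus a hash-keyed chunk store, written for illustration only; the actual tree layout used in Dat and hyperdrive may well differ, and names such as `merkle_root` and `store` are invented here.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(chunks: list) -> bytes:
    """Hash every chunk, then hash pairs of hashes until one root remains."""
    if not chunks:
        return h(b"")
    level = [h(c) for c in chunks]
    while len(level) > 1:
        if len(level) % 2:            # odd count: pair the last hash with itself
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

# Two similar files, already split into chunks (three of four are shared).
file_a = [b"chunk-1", b"chunk-2", b"chunk-3", b"chunk-4"]
file_b = [b"chunk-1", b"chunk-2", b"CHANGED", b"chunk-4"]

store = {}        # chunk hash -> chunk, shared across both files
downloads = 0
for c in file_a + file_b:
    key = h(c)
    if key not in store:              # only fetch chunks never seen before
        store[key] = c
        downloads += 1

print(downloads)                                   # 5 fetches instead of 8
print(merkle_root(file_a) != merkle_root(file_b))  # the roots tell the files apart
```

The single root hash identifies a whole file, while the per-chunk hashes underneath it make it possible to notice that a chunk is already present locally and skip downloading it again.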
So as I said in the beginning, this is all written in JavaScript, and people ask, you know, why JavaScript compared to Python? Python is big in data science, and Python is awesome, but a really cool thing about JavaScript is that it runs in the browser, right? And by running something in the browser you get a programming platform, or a runtime, that's basically installed everywhere, and you can just tell anybody to click on a link. So I thought, for these demos, I'm just gonna make Dat run in the browser. So I did that and made a website. See if I can remember the URL. Don't judge me on the UI, because it's pretty awesome. It's all about being minimalist and only communicating what you want to communicate.

So Dat is built on this low-level component we call hyperdrive. I thought it was kind of a cool name, because I like to call things hyper, because, you know, hyperlinks and stuff. And then somebody told me, I didn't even realize this, that hyperdrive is the thing in Star Wars that never works, which I guess is a good analogy for my tool anyway.

So this is just Dat running in the browser, and like I said, Dat has those two commands, one called link and one called fetch. When you drag and drop files here, it'll basically just do a link in the browser. I'm just gonna open the console here so I can see what's going on. Cool. So I put some files on Jeremy's computer, so what I can do is just take a file.
I'm gonna take a picture and just drag it here, and it adds it. That's the file, and that's the version of Dat it's using, and my cool demo here will give me a link back. That's the Dat link, but in the browser, and I can actually just send this link to a friend of mine. Hopefully the network gods will allow me to show you this. So the two browsers will find each other using a discovery mechanism we have, and then they'll start sharing with each other, and all the transfers are actually happening peer-to-peer. So it's all peer-to-peer in the browser, but it could just as well be Node, or a desktop app, or bindings written for any language you want. The protocol is actually pretty straightforward and simple if you have done anything with distributed systems, which I guess is a cop-out answer. It's like, it's super easy if you understand it completely.

And in my demo, I can click the picture and I can watch the picture in the browser, which is awesome. The cool thing is that I can also showcase diffing a little bit, in a very primitive scenario. So if I reload this page: I have this folder here where I have the same picture a bunch of times, which is a really trivial diff, you could argue, but anyways, it's still a diff. So what happens if I share all these files, all these hundreds of pictures that are the same? That's actually kind of funny, because the main bottleneck in the browser is actually just adding files. That's kind of slow in the browser because it has to do the hashing.

So if I add a bunch of files and I get this link, I paste it in here. It's all about understanding Jeremy's keyboard shortcuts, I guess. So if I load this archive instead, see if it loads. In the hex string, oh, I put a thing at the end because I copied it wrong. I'm sorry about that. At least that's what I think I did. Let's see. This is why it's good to have 13 minutes to do your demos. Things don't work the way you hope they work.
Anyways, there you go. See if I accidentally changed the link. I did, no. So it finds the peer again, and then hopefully it will start downloading that list of data. Maybe. Even if it doesn't, it's probably okay. Just so you know, I'm running this over the mobile connection on my phone, and that goes to Denmark. I'm from Denmark, I didn't say that. So, you know, the network doesn't always work the way you want it to work.

I don't know if you can see it here, but it just listed all the files, and it told me that I'm downloading at 20 megabytes per second, because the download speed here is actually telling me the data transfer, not the network transfer. Because of that diffing technique, it figures out that almost all the files are the same, they are all the same, and it only transfers them once, but the data transfer tells me that I actually got 20 megabytes of data per second. And you get the same effect even if the files are not the same but similar, right? So it's a really powerful technique, and the really cool thing I like about it is that it's actually really user friendly, because users don't need to know whether the files are different or similar. The system will just figure it out. So from a user's point of view, you just sync files, which is awesome, or datasets.

So let's try to actually sync a dataset, because that's what we're all here for. This is CSVconf, so I need to sync a CSV dataset, and I have one here. I can't even remember what it is. I think it's earthquake data for a month, and it's around 100 megabytes. This is my progress UI, where it tells me how far it has gotten into the file. It could probably use a little bit of work. So this is the CSV file added, and if I sync this file, it's pretty cool. I'm sorry, I should show this: this is a bigger file, right? This is a hundred megabytes.
It could be a gigabyte or even a terabyte. It will still work. I just don't want to add a terabyte in the browser, because then we'll stand here for like 10 minutes waiting for it to hash, which is not so fun. But if we wait a couple of seconds for these two peers to find each other, we'll try something different: we'll try to only partially sync the file. The cool thing is that if I'm only interested in the first part of the CSV file, I can just click this now and actually only load the first part of the CSV file while it's syncing the rest in the background, and I could keep doing that. This is a pretty trivial example where I'm just reading the first part of the file, but you can imagine cooler examples where you actually know which part of the CSV file you're interested in, or you're doing a distributed search on it.

Yeah, go ahead. I don't know what discrete means in this sense. Yeah, so it's all random access. This is just because it's the easiest thing to show in a demo. It's the first chunk, but you can get any chunk of data in the file. You can get the last one or the first one. So to showcase this quickly, you can do a cool thing where you add a movie instead, which is not a dataset, but, you know. See if that works. I add a movie instead, because movies are interesting: movies give us a sort of visual thing that we can actually seek in, right? Oh, it's the wrong link. Sorry about that. My UX here is not so great, because I don't update the link the second time. So this is a hundred-something megabyte movie. It'll be done in a couple of seconds. The cool thing about something like a movie, or even a lot of other datasets, is that it's inherently seekable, right?
You can start watching a movie before it's done downloading, but you can even start watching the last parts of a movie before it's done downloading. You can apply the same technique to datasets: assuming you have an index somewhere, and you know that you're only interested in the last part of the dataset, you can actually get just that. It's all about finding a way of representing that choice to your users, in my opinion, which is also non-trivial.

So let's try this one more time. See if this works. It's my final demo, in case somebody is watching the time. Okay, that's pretty good. So now I can just keep going forever. So if I take this movie, hopefully it will start playing. I have a tendency of always picking a movie with a lot of blackness in the beginning. So the movie's playing, and if I scroll to the seek bar here, I can seek into the middle of the movie. I ran into a very dramatic part of the movie there. Because the system is just dealing with files, the video player tells the data synchronization to only sync that part of the movie. So if you think about this not in terms of movies but in terms of datasets, it becomes super interesting. You can have a data pipeline, and if your data pipeline does random access at a high level, our system will sync just the data it needs. And it doesn't matter if it's a gigabyte of data or a terabyte of data or even a petabyte of data, the technique still works.

So, yeah, that's basically it, I think. Let me go back to the slides real quick. So, some links here. There'll be some slides online. Take one question before... no time. Okay. I'm gonna take questions at lunch, if anybody is interested in hearing more. Thank you.
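As a rough illustration of the partial sync shown in the CSV and movie demos, here is a Python sketch of reading a byte range by fetching only the chunks that overlap it. Everything in it is hypothetical: the `RemoteFile` class stands in for a peer, and the tiny fixed `CHUNK_SIZE` replaces the content-defined chunks described earlier so that the arithmetic stays easy to follow.

```python
CHUNK_SIZE = 4  # tiny fixed-size chunks to keep the demo readable (assumption)

class RemoteFile:
    """Stands in for a peer; records which chunks were actually requested."""

    def __init__(self, data: bytes):
        self._chunks = [data[i:i + CHUNK_SIZE]
                        for i in range(0, len(data), CHUNK_SIZE)]
        self.requested = []

    def fetch_chunk(self, index: int) -> bytes:
        self.requested.append(index)
        return self._chunks[index]

def read_range(remote: RemoteFile, start: int, length: int) -> bytes:
    """Read bytes [start, start + length) by fetching only overlapping chunks."""
    if length <= 0:
        return b""
    first = start // CHUNK_SIZE
    last = (start + length - 1) // CHUNK_SIZE
    data = b"".join(remote.fetch_chunk(i) for i in range(first, last + 1))
    offset = start - first * CHUNK_SIZE
    return data[offset:offset + length]

remote = RemoteFile(b"0123456789abcdefghijklmnopqrstuvwxyz")
tail = read_range(remote, 30, 6)   # "seek" straight to the end of the file

print(tail)               # b'uvwxyz'
print(remote.requested)   # [7, 8]: only 2 of the 9 chunks were fetched
```

A video player seeking to the middle of a movie and a data pipeline reading one slice of a large CSV both reduce to this kind of range read, which is why the size of the whole file matters so little.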