...excited to see the new BIDS space for the first time. There will be plenty of time for questions afterwards, but also feel free to shout out in the middle, in context. We can keep a casual atmosphere; I don't mind being interrupted or getting random questions from the back.

So Dat is a project that we work on. It's an open source project, and it's kind of a weird one, because it's an open source project that the three of us currently work on, and we partner with active research scientists and labs on pilot projects. Our goal is to focus on science, and specifically on reproducibility in science. That's a big initiative of our primary funder, the Alfred P. Sloan Foundation. They basically approached us and said, hey, that looks really cool, have you considered focusing on science use cases? And they said that if we didn't focus on the science use cases, those would lag behind, and they'd rather see us prioritize work in the sciences. I said that sounds like a really amazing idea. So we've just hit our one-year landmark of working specifically on science, with reproducibility in science and data as our first priority. Like I mentioned, some people think we're a startup; we're just an open source project building tools to make doing data science and sharing datasets easier. The three of us are here: Mathias came all the way from Denmark, and Chris is local, so you can definitely talk to us later and we'd love to show you some code.

The project started about a year and a half ago, originally as a tabular data sharing service for mostly open government data, which is my background from Code for America. I got into this space in 2008, when President Obama basically told all the cities and agencies in the nation to open up their data and see what happens. What happened was that a lot of weird file formats got put on FTP sites, and people had to figure out tools to parse the data and turn it into APIs. Then I got really interested in not just opening data one way, but also allowing data to be collaborated on. For instance, if there's a dataset of all the city park boundaries and I find a part of a park with my GPS tracker that isn't on the map, wouldn't it be really cool if I could send a patch to my local GIS department and say, hey, whoever collected your data, it's a little bit out of date, here's a pull request, and they could merge it into their branch? As I encountered working with cities through the Code for America program, there's a lot of risk aversion, a lot of questions about data accuracy, and a lot of "we hire people to go out with rigorous processes and approved GPS hardware, we don't want to take data from just anybody." So there are a lot of challenges, but I think there's also a lot of opportunity, and the opportunity far outweighs the downsides.
Then, back in August, we released our first big milestone after we started working with scientific datasets: our alpha version. That alpha added large file support, because as we moved from government data, which is mostly database dumps and tabular data, we started working with scientific datasets that are what we call hybrid datasets: tabular data and metadata, but also really big files, files you can't convert to generic formats because they're domain specific. Since then we've been working really hard on collaboration features on top of that system, and that's what we're mostly going to talk about today: our beta release. Hopefully sometime later this year we'll put out the thing we'll never break again, the 1.0, and then it'll be ready for really, really cool stuff. But right now we're still at the point where we're looking for people who want to try out new tools; we're not going for mass adoption yet, just to frame who our audience is.

Like I mentioned, reproducible research is a big buzzword in open science these days. There are quotes flying around like "half of research can't be replicated," which I think is a pretty high number. Really the question is how we collaborate on research. Replication is one thing: just getting somebody else's code or data to run, or making the same chart they put in their paper. The more interesting thing is collaboration, even collaboration on the same team: how do you get another researcher's stack to function? What we focus on is just the data part; we're not on the publishing side. When you have a dataset and you run a model on top of it, somebody else should be able to get the exact same version of the data and actually prove that they're using the same data you used. And if they find errors in the data, they should be able to really easily report them or send you a patch. With source code there are some really advanced tools for this, but for data it's still a pretty manual ecosystem.

So for example (let me zoom in my slides a little bit), when you have a file, you can use a generic file sharing service like email or Dropbox, or even check your files into GitHub, as a way to send them to somebody over the internet. You put your data there, a collaborator downloads it, or somebody trying to reproduce your results downloads it. We're using green here to show that they've now modified the data; they've changed your data and diverged. What happens? They email you directly and ask what you think of the change they've made. And what tools do you use to actually diff it? What if it's a really weird file format? It's probably a lot of manual opening in Excel, or dumping it into a database and running a SQL query to explore it. So that's one problem: it's just really hard to compare versions. And it compounds when it becomes a bottleneck on the maintainer, who has to repeat that manual process with tons of collaborators. This is a problem that's very familiar to me as a full-time open source developer: if you have a lot of modifications coming in, it's just madness out there. So that's basically our assessment of the current workflows.

We have a system, Dat, which right now is a command line tool. In much the same way that Git
codified what it means to collaborate on source code — it went from "I opened a text file in my editor, typed some characters, and sent you the file through email or some other channel" — we are trying to add a language for collaboration on datasets. For instance, we have the concept of a push. The idea is: what if BIDS had a server that could replicate dats, that you could clone dats from and push dats to? You could push your dataset up to a central repository that you host, and it's just a command: you give it an endpoint, over whatever protocol you like, HTTP or SSH, and it syncs the two sides. If you've made changes to a dataset, it figures out what has changed and sends only the diff up, so it's an intelligent replication system that doesn't have to transfer the entire thing every time.

Then, when you have your data in a final form, you can publish a paper that refers to the data, and in the paper you can actually include a URL with the hash of the version of the data you published with, the version that produced the charts in the paper. This is one of our goals: to let you say not just "I used this dataset," but "I used this dataset at this version," and that version can be provably replicated by your peers. A lot of datasets change. If you just say "this was based on census data from the State of California in 2011," what if the state released five updates to that data over the course of the year, and you forgot to note which version you used? Maybe when they released those updates they overwrote the files and removed the old versions from their FTP site. It's really hard to get at old links on the web once they die, so being able to cryptographically hash things and know the fingerprint of the file you need is really important. That's a key part of our system.

You can also roll back to any point in time. It's a similar concept to Git, but we implement it on top of the large data files and the commit graph that we support. You can clone a dataset at a version, pull new changes and go forward in time, or go back in time, and at whatever point you check out, all the operations done to the dat store are relative to that point in time, so you can kind of time travel. For instance, if somebody had uploaded some alignment files from a DNA sequencing run, you can go back to the exact version you know they used and check out the file at that point.

So the collaboration flow we have in mind is: you publish a paper that refers to a file, or a version of a dataset, at a hash. Somebody else can clone that exact version, and now they have a copy of your data at the same version. But what if it's been a year since you published your results and your dataset? They can pull the changes: they originally got the data at the version you published, but maybe they're feeling adventurous and want to see if you've made any updates, and they can update the dataset without re-downloading the whole thing again.
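As a rough sketch of that flow — the server URL, the hashes, and the exact flag spellings here are made up for illustration, so don't take them literally:

```
# Sketch only: hostname, hashes, and flag spellings are illustrative.

# Push your local repository to a server you host (over HTTP or SSH):
dat push http://data.example.org/sloan-survey

# A reader of your paper clones the repository...
dat clone http://data.example.org/sloan-survey
cd sloan-survey

# ...and checks out the exact version hash cited in the paper, so they can
# prove they're looking at the same data you published with:
dat checkout 8c3f9a2e

# Later they can pull any updates you've made since publication,
# without re-downloading the whole thing:
dat pull
```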
Or, if somebody else out there in the world has forked your dataset — maybe another researcher took the data, found some errors, and published their version on their own server — they can try different versions from different people, and compare different branches, different layers of the dataset, from different remotes.

One major philosophy we have is that you should be able to merge datasets without the user interface, so to speak, or the application, halting: re-downloading the data has to finish first, and after that you decide how you want to merge it. One major problem with data is that there's no default merge strategy, so we don't ship one; it's up to you to pick. We just try to make it really easy to pick the common ones, and you can also implement your own merge algorithm if you want.

So this person has now pulled the new copies of the data from the original repository, maybe including some fixes you made to your dataset after you published, and now maybe they want to add a new file, or add some new data to your study's dataset. We have an interface where you can either add a file, just as an attachment, or import a table of tabular data. The difference is that each table row is versioned independently, whereas a file is versioned as a whole. You can put a CSV in either way: if you treat it as a file, then whenever you edit any row of the CSV, that's a new version of the entire file; if you treat it as tabular data, you can independently version individual rows. It's a little bit nuanced which one is right for which case, but we've found that files and rows, as our two fundamental data types, are flexible enough for a lot of datasets. I'll show a quick sketch of the difference in a second.

So a theoretical situation could be: they've downloaded your data and made sure they're in sync with your latest version. (These are example data files from different pilot projects we've worked with — sorry, I should have made a note.) This SAM file: you might have a dataset in dat called "sams," and those are aligned DNA sequences, sequence alignments. When you run a big job on a supercomputer, it spits out these files, and they're like the precious gold that the supercomputer has mined out of your data, the thing you want to keep and version. A lot of people want to take these files and make them available to everybody else on the team, or, because supercomputers are a scarce resource, publish them on the web or share them with other researchers. We've seen a lot of people who just want to take the assets from the end of their analysis pipeline, throw them in a dat, and make them available in a way that exposes multiple versions of the files. Maybe they run their alignment pipeline five times over the course of their study and want to give people the choice of which of those five versions to download. Then they might add some species metadata as a dataset and share it. I'll talk more about the datasets API, but basically a dat repository is made up of many datasets: when you clone a repository, it clones all the datasets inside it, and you can think of datasets as just subfolders.
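Here's that quick sketch of files versus rows — the file names are made up, and the exact commands and prompts may differ from what's shown; the point is just the two data types:

```
# Sketch only: file names are illustrative, exact flags may differ.

# Attached as a file ("blob"): any edit produces a new version of the whole file.
dat add alignments/sample-01.sam

# Imported as tabular data: each row gets its own key and is versioned
# independently, so a one-row fix doesn't duplicate the whole table.
dat add species-metadata.csv   # dat guesses CSV/JSON and offers to import it as rows
```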
For example, one of our use cases is astronomy: a repository might have an images folder — raw photos from telescopes — and another folder with the metadata: when the photos were taken, what the telescope attributes were, what objects were detected in the photos. You want both the raw data and the tabular metadata, so you'd have a dataset for each, and you can have an unlimited number of datasets. So this person could push up their changes — oh, sorry, Rick.

Getting into the details a bit: a dataset is basically a two-dimensional table, but it's also an arbitrary binary store, so you can define your own schema. The schema format we use by default is Protocol Buffers, a network interchange format from Google. It's really nice because it's compact and supports every integer precision and width — you could use 64-bit floats for a column — so if you really know your dataset and you want it to be both accurate when it gets replicated to somebody, not coerced into the wrong type, and as small as possible, you can go in and define your own schema and say, per column, what type it should be. I can show an example schema in a minute, but think of it as a two-dimensional table with rows and columns, and the schema is for that whole table. If you have two different data files with maybe two different schemas, you can just use two datasets — it's one schema per dataset — and you can configure how many datasets a repository has. But the idea is that the entire repository syncs all at once: you wouldn't push just one dataset at a time, you push the whole repository. At first we actually had it so there was only one dataset per repository — we were trying to keep the design as minimal as possible — but a very common use case was people with two different schemas, or a bunch of images plus a bunch of tables, and it just wasn't convenient to have five dat repositories and to have to type dat pull five times. We found some nice ways to keep the design simple but still support heterogeneous data within the same repository.
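Here's the kind of example schema I mentioned — the message and column names are purely illustrative, just to show the idea of a compact per-dataset Protocol Buffers schema:

```
# Illustrative only: message and field names are made up.
cat > telescope_readings.proto <<'EOF'
syntax = "proto2";

message Reading {
  required string object_id   = 1;
  required double right_asc   = 2;  // 64-bit float, not coerced on replication
  required double declination = 3;
  optional sint32 band        = 4;  // compact varint encoding
}
EOF
```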
Before I get into merging: at this point, does anybody have questions, or clarifying questions, about anything we've covered so far? Okay.

So this is where it gets really exciting for me personally. None of this is a new concept in the abstract — it's new in the sense that it's for data, but it's stuff you'd expect to have been built already — but the idea of merging data is where the collaboration features come in. When you actually push to somebody, if they've made changes to the same rows you've made changes to, you have to figure out how to merge them. It's easy with source code because it's just text diffing — every programming language is just text, which is really convenient when you're writing source control for text — but for data version control, we need ways to let people define their own merge strategies. Our goal is to enable an ecosystem of really interesting merging tools. For instance, with GIS data: say you made an edit to a park boundary, and somebody else made an edit to the same park boundary, and theirs got merged first, so there's a conflict. Wouldn't it be really cool if somebody had an open source tool where you could open up the two versions, fix the polygons in a graphical editor, hit merge, and it merges into the database? That's one example of the kind of ecosystem goals we have: we want to build really nice low-level tools in a really accessible open source way, and try to get really intelligent people to open-source custom merge algorithms, really nice UIs, and long-tail data format and merging support.

So, for example, here's a three-command workflow. If you pull from somebody who has forked your dataset and made new changes to it, you'll see that you have a new head of the dataset — very similar to Git. Now you have your original head and a new head (these are hashes, though you could also name them), and internally dat keeps a graph: what has happened is that they made changes that are different from your changes, because you both made changes to the same version. Imagine I clone the dataset onto my laptop and turn my wifi off, somebody else clones it onto their laptop and turns their wifi off, and while we're both offline we edit the same row: I update my name to capital-M "Max" and they update it to lowercase-m "max." We've made two changes to the same piece of data. When we come back online, whoever pushes first — whoever gets pulled first by the owner — theirs becomes the new head, but I have a different head, so there's a conflict. After I've pulled a conflict, there are two heads, and in dat you can take the two heads and it'll calculate what has changed in the dataset and what the merge conflicts are: either something new was added, or something has a conflict — there are two new values and you have to pick one. This design is intended to be piped through whatever commands you want to build. For example, imagine you have a dataset of parcel data — address data for a region — and there's a conflict where somebody resized a polygon of a building footprint, another person did a different polygon resize, now they're overlapping and the zip code metadata got messed up. You can write a custom Python script that figures out which polygon is in which zip code and fixes the zip code data — some really long-tail merge that no UI designer could ever come up with; sometimes you need to do it in code. So you pipe the diff through your merge function and then back into a dat merge operation, which saves the merged data back into dat. After that you've merged the data and there's now one head again.
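A sketch of that pipe — the diff and merge subcommand spellings, the hashes, and the script name are all illustrative, but the shape is: conflicts stream out, go through your own code, and stream back in:

```
# Sketch only: subcommand names, flags, hashes, and script are illustrative.

# Two heads exist after pulling a fork; diff them, pipe the conflicts through
# a custom script that decides (say) which polygon and zip code wins, and
# feed the result back in as the merged version:
dat diff 7d1ac0 91b4e2 | python fix_zipcodes.py | dat merge 7d1ac0 91b4e2 -

# After this the graph has a single head again.
```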
If you're familiar with Git: when you pull and there are a bunch of new branches, there might be ten new heads, and it's a similar design here, since we're also a graph like Git. You don't have to merge everything right away; you only merge the stuff you want to merge. The idea is that you can pull down different versions of the data that people have made edits to, compare the ones you're interested in seeing the difference in, write your own custom diffing tools that give you insight into what they've done to the dataset, and then, if you want, merge it in. The really cool thing about this design is that, because everything is tracked, if you don't want to merge their changes, they can just go publish it themselves — upload it to data.gov or wherever they want. Anything they want to do with the dataset, they're free to do, and there isn't just one central host of the data anymore.

For me, a lot of the time it's a city: there's an open dataset on an FTP site in a really raw form, and there's a lot of duplicated effort. People have to download the data and then spend a day munging it to get it to a point where it's workable, and then they finally get to the actual analysis; but they spent so much time on the munging that they never publish those munging tools as anything reusable. The best you can do is publish a script somewhere on GitHub and hope somebody else finds it. It's a lot of trouble to upload a whole new copy of the dataset — you have to host a full duplicate on your own server; you can't just host a diff — so the next person who comes along has to duplicate all that effort, because there's no way to merge contributions to the dataset upstream, fix errors, and improve the original upstream dataset, or to host better upstream datasets. I actually think a really interesting use case would be wrapping legacy systems in interfaces that you can just clone from and push to, just for the sake of making the data easier to get in the first place, because today people have to use arcane tools to access some datasets.

There are also some analysis tools we have in mind. Has anybody used Unix pipes, just for piping things around on the command line? We're huge fans of the Unix philosophy, and we envision a lot of really cool things: you can import a dataset — maybe a tab-separated file — specify what your primary key is, pipe it through a Python script that cleans up the data, operating on each row and returning a new version of that row, and then pipe it back into dat. You can filter everything through your own scripting tools and automate away the munging tasks, and you can pipe dat into anything that can be piped to. We support three output formats: like I mentioned, we store the data in a binary representation, so if you have your own custom schema defined — really precise astronomical units or something — you can pipe that raw binary data into whatever program you want, or you can use CSV or JSON just for convenience. (If you have a custom binary schema and you convert it to CSV, it'll probably have to be base64 encoded or something.)
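A sketch of that Unix-pipe style — the flag spellings and the cleanup script are illustrative, not the literal interface:

```
# Sketch only: flag spellings and script names are illustrative.

# Import a tab-separated file, declaring which column is the primary key:
dat add observations.tsv --key=station_id

# Stream every row out as JSON, clean it up in a small Python script,
# and write the corrected rows back in as a new version:
dat cat --format=json | python clean_rows.py | dat add -

# The same data can also come out as CSV, or as the raw binary
# (Protocol Buffers) encoding if you've defined a custom schema:
dat cat --format=csv > observations_clean.csv
```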
[Audience question] Sorry, can you repeat that? Good question. By default it will cat all the datasets, one after another, but you can also add an argument to pick which one you want. If it's a file, we haven't decided yet whether it will print the binary file to standard out, or just print a piece of JSON, or a CSV row, that points at the file so you can use your own tools to read it. In the next couple of slides I'll show you an example of how the folder structure is laid out.

This last one I think is really exciting too. Docker is a really cool tool for making code that's really hard to compile — entire environments that are really hard to replicate — replicable, and we've been doing a lot of experiments to make sure that all the work we're doing can be really easily used with tools like Docker. The high-level idea is that dat versions the data — it can get the data onto your machine and stream it down at the right version — and then you can pipe it through Docker, which gets the code at the exact right version, so you can do your analysis in a reproducible way on both the code side and the data side. All of these tools are really new, so we're super interested in feedback: what you like about Docker, what you don't like about it, what you use for virtualization, whether you use Vagrant, how you get your code to run on other people's machines — because that's a huge problem. The code part isn't in our main wheelhouse — we're really focused on data reproducibility — but we see a really important tie-in between reproducible code and data together, and we just want to make sure we're integrating into that ecosystem really well. I can show some more examples of the Docker stuff later if people are interested.

So here's a high-level example of what a dataset looks like in dat. Like I mentioned, a lot of datasets are very heterogeneous: you might have different binary file types, different tabular data files, serialized data dumps from different analysis tools, and multiple tables. Our design takes into account that landscape of wild and wacky data formats. The first thing you do is add the data into dat, which imports it into a version-controlled data store on disk, in a new folder. So you'd run dat init — well, you have to dat init first; that makes an empty data store — and then you can dat add all the files you want to version control. We also have a file for configuration data; for instance, if you're pulling from somebody, you can put in custom remotes so you don't have to type the URL every time. We also want to support some really interesting plugins: by default the transports are things like HTTP or SSH, but we want to make it super easy to install and configure plugins that do things like connect over a peer-to-peer network — "anybody seeding this dataset on BitTorrent, connect me to ten clients so I can maximize download speed and don't have to wait as long" — or "I have this data in a Dropbox account or an S3 bucket; connect to that and download it into the dat store directly."
So we can kind of automate away the differences between all these file storage systems into just a dat pull command — that's the goal. Then, when you add files: if you're adding rows to a dataset, they get imported into a database inside the .dat folder. We use a database called LevelDB by default, an embeddable C++ data store originally built for Google's BigTable system and later ported to work in Google Chrome as the database Chrome uses for its on-disk storage. It's just a really simple key-value store that does its job well, and we use it as our on-disk key-value store. Anything that isn't tabular in nature gets stored on disk in what we call a blob store, and a blob store can be anything you can put files into and get files out of. The default blob store is the file system, but you can also configure blob stores like, as I mentioned, S3 or BitTorrent — anything you can imagine uploading files to, downloading from, and removing files from. So you could add both table data and a file to the same dat, and it will store each one appropriately; it kind of guesses — we have some heuristics to detect CSV or JSON and ask, do you want to import this as tables, or add it as a static attachment? (I'll sketch what that workflow looks like in a second.)

[Audience question] Oh yeah — there's one dataset that we just call "default" right now, and we're open to feedback on whether there should be a default one. For everything else you specify a dataset name, so you can say dat datasets create census and it will create a new subfolder for the census data. The default one is kind of like the main branch in Git. We're trying to figure out exactly what the first-run experience should be, because it would be really nice if you could just dat init and then dat add your data and have it create the dataset for you, but we also don't want to be too magic. Funnily enough, a lot of the hardest parts of the project have been trying to keep things simple but intuitive; the API design of a command line tool is surprisingly difficult. There might be a graph database and all these crazy data structures and algorithms inside, but the hardest part is sometimes the design, the user experience.

So you might be thinking: if I'm just adding a couple of files, why not just use a Git repository? In what case will dat really be useful? A lot of the datasets we've encountered in our last year of working with different research teams have been really big — you might have a hundred thousand versions of a file, or forty terabytes of compressed images, or just really big data dumps — and there are some flaws in the design of Git that prevent us from just reusing it (we've stolen all the parts we like). Sometimes you just have a ton of data, and you also don't want to download all of it before you can start working with it; you want to pick and choose the parts you want. So, going back to the example from before: if you just keep adding tables and keep adding data, you'll eventually hit a point where there's not a good tool for keeping track of a bunch of large files, especially if you want to host them on your own local university system or something like that.
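Here's that sketch of a heterogeneous repository with a couple of datasets — the names are made up, and the exact dataset-selection syntax is illustrative; the point is that blobs and tables live side by side and the whole repository replicates as one unit:

```
# Sketch only: names and flags are illustrative.
dat init

# Two datasets in one repository, each with its own schema:
dat datasets create images     # raw telescope frames, stored as blobs
dat datasets create metadata   # tabular data about each frame

dat add frames/scan-0001.fits --dataset=images
dat add exposures.csv         --dataset=metadata

# One push replicates the whole repository, all datasets at once:
dat push http://data.example.org/sky-survey
```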
So this is one of the datasets we worked with in our alpha. It's a really, really cool project — actually, I should have put these slides in a different order, let me go back — Trillian (trillianverse.org), an initiative out of Ohio State. They work with the Sloan Digital Sky Survey full-sky scan, and one of the problems they encountered is that people do these multi-spectral sky-scan analyses, these data releases, and they upload data from their telescopes in one band; but if you're writing a model, you want as much band coverage as possible, and unfortunately the data is in different places, in different formats, with different coverage. So they want to first get all the data and normalize it, and then put it into a big distributed computation system where you can run your model against everybody's known data and it fills in the parts of the sky you need on the fly. They're building a distributed computation system for doing astronomy research, and we started working with them very early on, just seeing what their data challenges were.

This is one of the things we immediately ran into. Caltech releases, on their IPAC server, a big data release every few years, and this is just the CSV data. What they do is slice the visible sky into a bunch of declination bands — this one covers about eight degrees of the sky, then another slice covers another few degrees — and they slice it so that there are about fifteen and a half million detected objects per slice. Of course, if you look at the galactic plane you have more data, so you get an interesting distribution, but there's a ton of stuff they find when they look really closely at the sky: fifteen million times eighteen objects get detected when they do a scan. These files are around seven to nine gigabytes of text each, with three hundred columns and fifteen million detected objects per file. They basically take a bunch of pictures of the sky, try to find all the light points, and at every light point they register that they found something and record as much data as they can. And this is just one study, and just the tabular data — there are also forty terabytes of images, the raw photos, available as well.

One of the things we were doing: for a lot of test purposes, rather than download the entire dataset, people will just use what's called Stripe 82, which is an interesting patch of the sky. But there's no way to download Stripe 82 from this data release, because it's spread across a bunch of these files, so you have to download a bunch of files, import the part you need, and disregard all the parts you don't. And that's just the tabular data; there's a whole separate process you need for the
image data, because it's in an astronomy format called FITS, and you need to do patchwork quilting on FITS bounding boxes, and it's really complicated. So we were given the challenge: can you make this easier, so that if we want to provision a server that's going to index one part of the sky, we can just dat clone it and only download the files we need — easily get the metadata for the whole dataset, and then fetch only the parts we need through basic shell scripting, without having to do all the parsing and all the downloading ourselves? That's one of the projects that's ongoing, but definitely check out the Trillian initiative; it's a really interesting project.

And this is what the dat format looks like on disk. This would be a theoretical study on Scandinavian population data — we have a Danish guy on our team, so we use Scandinavia as our test case. This whole thing would be a dat repository: you'd have your README, or your Python scripts to analyze the data, or whatever other stuff you want, and then these two folders are datasets in dat. At some point the author of the dataset created a dataset called "sweden" and another called "denmark," and they could be from different sources — one from the Swedish government, one from the Danish government. The idea with this example is that you might have cloned this from GitHub and it doesn't have the data in it: it just has the folder structure, and each dataset folder has maybe some more code and a dat.json file. That dat.json can hold the actual remote — the thing that gives dat the ability to clone the data from the data source. That could mean going to the Swedish government, downloading the data, running it through a pipeline, and importing it into dat, or it could mean connecting to somebody else's dat who has already done all of that and just downloading the right version. So you've downloaded the skeleton file structure without the data in it, because it's easier to just share the metadata, and then you run dat pull and dat does the job of getting the data that's configured. This is one of our distribution strategies: you publish your dat configuration to source control sites, people download it, run dat pull, and it goes and fills in all the data. And before you pull everything, you could also configure it to pull only the stuff you want — maybe download all the rows in the dataset but not all the files, or write a custom script that tells dat to fetch only the files in a certain region based on metadata in the dataset. We want to be flexible about letting you download what you need instead of having to download everything. So after you've pulled, there's now a .dat folder in here, and maybe the Danish dataset has populated a table called "population" and a table called "census" — two tables that were pulled down.
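A sketch of that skeleton — the folder names and the dat.json fields are illustrative, since the real config keys may differ; the point is that the skeleton carries enough for dat pull to go fetch the actual data:

```
# Sketch of the skeleton layout (dat.json field names are illustrative):
#
#   scandinavia-study/
#     README.md
#     analyze.py
#     denmark/
#       dat.json        <- where to pull the Danish data from
#     sweden/
#       dat.json        <- where to pull the Swedish data from

cat denmark/dat.json
# {
#   "name": "denmark",
#   "remote": "http://data.example.org/denmark-census"
# }

dat pull    # fills in the data the config points at
```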
[Audience question] Sorry, go ahead. So the question was whether you can specify the subsets you might be interested in. There's a primary key on every dataset, so depending on what the key is you can replicate a key range, and if there are any branches available you can replicate those. But one of the things we're still trying to figure out how to support is filtering on multiple columns — say, give me all the data that's in Argentina, that's also trees of this species, that's also been cut down in the last five years. We don't have anything in the beta for those more complex filters, so it's kind of up to the dataset publisher to split their data into datasets at the right level of granularity and anticipate how people are going to want to consume it. We're really interested in figuring out how to do the more complex multi-column stuff, but our philosophy with dat is that it's not something you use to query the data; it's the data sherpa that moves data between systems, does the interop between maybe a legacy system and a pipeline, does the networking on the data, replicates it, and makes sure you got the right version. That's what we've been focusing on for the alpha and the beta. A lot of people say, "but if I could do SQL, then I could do this really cool filter," and we're kind of saying, well, maybe we can figure something out, but we don't want to add a bunch of bloat at this point. I think a multi-column filter on a clone is a really interesting challenge, so definitely — maybe we can talk afterwards and brainstorm.

[Audience question] Yeah, exactly — unless they have parsed their data as a table into the row system, as custom-encoded rows. If it's just a file they've added, we don't touch the file; it stays the raw file that they added. But if you add data as rows, it gets encoded in a schema so that we can independently version the rows. If you import JSON or CSV, it will come out as JSON or CSV; but if you import binary rows with a custom encoding, we also replicate that schema, so we can decode it on the other end. So I guess we support JSON rows — non-binary rows — and we also support custom encodings for rows, any binary schema you can think of for tabular data, and we support binary files. The difference between rows and files: think of it like a SQL database — you wouldn't want to put a five-gigabyte file in a SQL row, but it's fine to put a million rows inside a SQL table. And you can also take the entire SQL database and dump it as a file in dat, so you can distribute your SQL database multiple ways; it's almost a data modeling problem.

Oh, sure — I can show you a basic example if I can get to my terminal. So I make a new directory called data, and I've just created an empty dat inside of there, and then I dat add — oops, dat beta add — I have a CSV of earthquake data from NOAA, so if I add that into dat, it imports it; I'm adding it as a table, since it's a CSV and our default operation when we see a CSV is to assume you're importing a bunch of rows. Then I can dat cat, which is how we cat out the data, and it will start streaming a bunch of these objects; I'll just show you one of them.
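Roughly, that demo boils down to this — the file name and the printed object are illustrative, and the field names in the output are the kind of thing you'd see rather than the exact format:

```
# Sketch only: file name and output shape are illustrative.
mkdir data && cd data
dat init                   # creates an empty .dat store
dat add earthquakes.csv    # a CSV, so it's imported as rows by default
dat cat | head -n 1        # stream the rows back out; one might look like:
# {"key":"c1b2...","version":"9f4e...","value":{"time":"2014-03-10","mag":"4.7", ...}}
```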
So this is what we store from the CSV. The CSV has actually been converted into a Protocol Buffers object on disk, but we can serialize it back out as JSON; we convert everything to Protocol Buffers when we store it, just to normalize everything, and if you don't give it a custom schema we just assume it's JSON. So we store what type of data it is; what the key was — since I didn't specify a key and the dataset didn't have one, it gave me a UUID, basically; then the version hash of the data; and then the original data. If it were a file and I got the file out, it would just be a binary stream on the way out.

And I think this is another huge thing for us as open source developers — I mentioned that the ecosystem is really important. Everything you do in dat right now is just files and tables and the command line, so hopefully, since it's just command line tools and streams and Unix-y, we'll see a lot of really cool extensions for all the stuff we don't have time to support and implement. It would be really awesome if we had Microsoft SQL Server 2007 support, but I don't happen to have one lying around to test with, so hopefully somebody contributes a SQL Server data connector.

The main technology we use in dat is Node.js, and we picked it because it's a cross-platform tool for network I/O, which is exactly what our problem is. Early on we realized a lot of researchers use Windows, Mac, and Linux, so we need to make sure it runs everywhere. We want it to run in the browser, for cloning datasets directly into things like IPython notebooks and doing all your analysis locally — we just want to make sure it works everywhere — and a lot of work in the last five or ten years has gone into JavaScript virtual machines to make them cross-platform and super fast. For the task of moving your data around, we've found Node to be a really, really awesome tool, and it also has a really vibrant open source ecosystem of plugins: there are a hundred and twenty thousand packages in its repository, and almost any format under the sun has a parser. For the data crunching side, JavaScript isn't the best ecosystem or the best tool for heavy CPU data analysis, but luckily dat doesn't do any of that — it just pipes the data to the thing that does. So we hope that people with really interesting analysis problems who want to simplify the data access problem can hook into our command line API, and there's also an experimental REST API we worked on in the alpha, if you want to stand up a server and expose the data over the network. If you're interested in that, I'd love to get feedback on it.

Oh yeah — here are a couple of examples of stuff we've built using dat; hopefully you can see it okay. I'll start with the top half. This is actually a scripting config language we've been experimenting with that's basically a Makefile, but in a way that can run on Windows too — you can think of it as just a Makefile — and we're really interested in pipeline tools in general. Everything in dat is streaming, so it can start processing, start spitting data out, as soon as you start the clone; it doesn't have to wait for the whole thing to finish.
Another really interesting part of streaming is that if the network connection dies halfway through, you can just start it again and it will continue where it left off. So if you're downloading a forty-terabyte dataset and it dies halfway through, you can just restart it. The pipeline tools are really important because you want to compose all the pieces of your analysis pipeline so that dat is either at the beginning or the end: you're either getting data out of dat or putting data into dat. Things like Makefiles, bash scripting, or just using Python are all valid approaches; we've just been trying to keep our examples as simple as possible.

This is a bioinformatics pipeline that one of the researchers we work with wrote. He has a tool called Bionode — actually, on the table back there are some Bionode stickers, the double-helix hexagons. Bionode basically wraps some really horrible XML APIs from NCBI and lets you get the data out as JSON. He's defined a pipeline called "reads," and the first thing it does is download a search from NCBI — he's searching for all the datasets that match a particular taxon — and then he forks: this is the kind of thing that's hard to do in bash. It spawns four pipelines and takes the output of the first command and pipes it into all four, so the data flows from NCBI and gets split into four channels. For example, the samples pipeline does a bunch of data parsing, then he has another search that he does and pipes into a dat, and after those four pipelines are done they all pipe back into a dat — I think this is an older version, so this would be a dat add at the end. He can run the fetch command or the reads command — they're just aliases — and it goes and searches, parses a bunch of the data, does kind of a group-and-apply, and dumps the output into a dat. That's the first step of his research: he just wants a local, versioned copy of the search data for the datasets he's working with in the study, so he doesn't have to rely on the NCBI server. Actually, during his research last year, on a Monday he went to use the NCBI API and it worked, and then on a Wednesday he went to use it again and they had broken it — they had changed a bunch of the keys and he had to change his code — so he's been really cautious since then about keeping exact versioned copies of all the data he gets out of NCBI, so he can reproduce everything; if he depends on systems that are out of his control, they can change and break. Then down here he has another set of pipelines that run different tools — at the end, I think, he's piping it into a graph database, converting it to triples. So he can take his entire pipeline, and when it says "run bionode-ncbi," that's the equivalent of typing bionode-ncbi on the command line, so you can spawn processes and pipe them together. Like I mentioned, we're really interested in pipelining tools — we don't want to focus on them as our primary use case, but if you have pipelining tools you like, we'd love to integrate with them; we try to keep our own really simple ones in the meantime.
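A heavily simplified sketch of that kind of pipeline — the search term, the parsing script, and the dat flags are made up, and the real pipeline fans out into four parallel branches rather than one:

```
# Sketch only: search term, script, and flags are illustrative.

# Query NCBI for everything matching a taxon, parse/clean the results,
# and keep a versioned local copy in dat so the study doesn't depend on
# NCBI's API staying stable:
bionode-ncbi search sra "Guillardia theta" \
  | python parse_samples.py \
  | dat add -
```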
Another example that we really like — and this is why we like to use Docker for things; this might be a little hard to read. This is the command being run: I'm basically saying docker run, and I mount my hard drive into the Docker container, so I'm letting Docker access the directory I'm in, and then I just pass it some arguments: run this samtools tview command. That's a tool people use to look at sequenced DNA data on the command line, basically for peeking at the data — they might get some sequences back from a job and they want to jump to an offset in the genome and just see what the data is there. Normally you have to compile these tools yourself; some of them are custom tools that aren't cross-platform — I don't think they usually run on Mac, though I could be wrong — but with this you can just have a cross-platform container, using Docker, that runs it, and you can even pipe data into something like this. I'll sketch the rough shape of that command in a second.

Our researcher Bruno Vieira at Queen Mary University, who does a lot of the bioinformatics work with dat and runs the Bionode project, recently got his university to buy him a new sequencing machine with 500 gigabytes of RAM — the most RAM I've heard of in an individual researcher's computer, but that's what they do in the genome sequencing world. He has this huge machine, basically a mini supercomputer, and he got it because the IT department's supercomputer stack had too old a Linux kernel to run Docker — it was from 2007, and they said they wouldn't be able to upgrade to the 2012-era kernel he needed for a long time, so it was easier for them to just buy him a new machine. He got a new machine just so he could run Docker, so he doesn't have to mess up the environment on the supercomputer and can run nice, isolated, sandboxed images. And he really, really likes Docker because it's the only virtualization-style system he can use that can actually use his 500 gigs of RAM: it's not true virtualization, it's a lightweight sandboxing that can still use the beefy resources he has, whereas if he were to use something like VirtualBox, the memory would be virtualized and it would kind of defeat the purpose of having this super-high-horsepower machine. So, like I mentioned, we just think reproducible code environments are the peanut butter to the chocolate of the reproducible data stuff we want to do.
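Here's that rough sketch of the samtools-in-Docker invocation — the image name, file names, and region are made up for illustration:

```
# Sketch only: image, file names, and region are illustrative.
# Mount the current directory into the container and run samtools' tview
# pager on an alignment, jumping straight to a region of interest:
docker run -it -v "$(pwd)":/data biodocker/samtools \
  samtools tview -p chr1:100000 /data/alignment.bam /data/reference.fa
```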
So with that, I'm going to open it up for questions. We have question bingo — if you don't know what to ask, you can pick a thing off the bingo card and we'll go into more detail. We have about half an hour for questions — is that right? — or twenty-five minutes. Thank you very much for listening. I don't know what the bingo rules should actually be.

[Question: how do you zoom like that on macOS?] Oh — this used to be on by default on the Mac, but there's a setting in System Preferences — man, I don't remember where it is now — so that when you hold down Control and scroll like you're on a web page, it zooms like this. I use it so much. You can also do the inverted screen if you're reading at night... anyway.

[Question: how do you get a cat emoji in your terminal?] You edit your bash profile and export a PS1 with the emoji in it — the Mac terminal supports it as of 10.9, I believe. Oh, that's a fun one.

[Question] Mmm — so the question was: say you're in R or Python and you want to access the dataset at a particular version, but maybe one process needs to work with different versions concurrently, without mutating the data on disk — is that accurate? It's an interesting question, because there are all these really cool new file systems that do copy-on-write: if you have a ten-gigabyte file and you want a backup of it, you don't have to find another ten gigs of drive space; the file system only copies the parts you edit and shares the unchanged parts — that's why it's called copy-on-write. The dumb way to do it, if you're on an old-school file system, is to just make a copy of the file. But we have two "mad science" projects, as we call them. One is a FUSE layer you can mount, where we implement copy-on-write in user-space-file-system land, so you can run your computation against our virtual file system, which is smarter and doesn't take up as much space if space is a constraint. If space isn't a constraint, you can just check out multiple versions of the file into different folders on your real hard drive — yeah, if you have a thousand versions of a file, you'd need them in a thousand different folders or a thousand different file names, whereas we can create effectively infinite virtual file systems. Or, and we're really interested in this, some operating systems support it natively: mainline Linux now has OverlayFS, and there are other file systems like ZFS or Btrfs that implement copy-on-write, and those are really exciting for us to support where they're available — they're not available on Mac and Windows. So it's tricky to do intelligently, but we want to support it where we can, and FUSE seems to be a nice middle ground: we've been able to make FUSE cross-platform and performant; it just requires users to install the FUSE plugin, so it can't ship out of the box. But for really big files it's fundamentally a file system limitation.

[Question] Right, right — so, we have experimental Python and R libraries that we're working on; for the Python one, actually, the woman sitting to your left, Karissa, is the author. The idea with those is that we've been designing what the ideal integration from, say, Python or R into dat looks like, and there are a few different ways to do it: you can spawn a dat process, you can go through some sort of API that we expose over a network service, or you can access dat's data store directly — if you implement some of the APIs we use to read our data store, you can go straight at it if you need really high-performance access. But the easiest way is just spawning a dat process that gives you the thing you need and piping it into your Python context.
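That "just spawn a dat process" option can be as simple as a pipe — here's a sketch, where the format flag and the tiny inline script are illustrative:

```
# Sketch only: the --format flag is illustrative.
# Stream rows out of dat as JSON and read them straight into a Python session:
dat cat --format=json | python -c '
import sys, json
rows = [json.loads(line) for line in sys.stdin if line.strip()]
print(len(rows), "rows loaded")
'
```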
Or we can expose it as a binary protocol over the network, to a local TCP socket or a named Unix socket or something like that. So there are lots of different approaches, and we should talk afterwards to figure out what your requirements are to inform our development. The Python and R libraries are ones we want to launch with the beta, in a week or two — this month — as nice canonical examples of how we imagine people using dat from within a scripting language without having to go to the command line all the time.

[Question] So the question is: how do you integrate dependencies in software with versions of data, and how do you make sure different versions of software aren't polluting different versions of the data? One of the downsides of something like Docker is that a lot of the time somebody has something that depends on Ubuntu as the base machine they need to virtualize, and another person has one that depends on a later version of Ubuntu, or maybe Debian or Solaris, or different versions of Python on the same OS. You can use up a lot of disk space really quickly, because it's like installing an operating system every time; if two people happen to use the same version of Ubuntu, you don't have to reinstall the same base image, but there are a lot of distributions out there. One of the problems we see with dependency management is that there are kind of two ways to do it: you can make the programmer fix it, or you can take up more space and make the problem go away by throwing disk space at it. We're trying to come up with a way that doesn't use as much space but also doesn't put somebody in dependency hell, so to speak. In JavaScript they take the use-more-disk-space approach, because they're little tiny JavaScript programs, but in a lot of languages with incompatible runtimes, containers are really appealing, because you can have a hundred different versions of Python installed on the same machine. They use a layered file system: the lowest layer is the same core — the Linux kernel, or the Ubuntu image — and then you have two different layers on top, one with Python 2 and one with Python 3, and it can boot up all the layers. We have a similar approach to layering versions and branches of data in dat. Mathias here actually has some really impressive demos of booting a virtual machine over the network and only downloading the files you need to run it: if you have a one-gigabyte image just to be able to boot Python, you don't need to touch every file inside Linux — we can lazily fetch the files over the network using a kind of union file system, so you only end up using about a hundred megabytes to boot the Python process you need. You could do it over a 3G connection in about thirty seconds; you don't have to wait for the entire image to sync. So a lot of our algorithms are about streaming, about booting things as quickly as possible, and about not downloading the entire thing when you only need part of it. But I agree that code and data — if we can figure out a good way to make them coexist, that's a really interesting ecosystem problem that I think the data science community is taking on.

[Question from the back] Oh, sure.
In the back? Oh, sure. So the question was: how do you hook up a data source to dat without making it a manual job every time? It really depends on the data source. With SQL databases I've tried to make streaming export hoses that can dump the data whenever it changes; some databases support that and some don't. For the ones that do, the simplest thing you can imagine is a cron job that every night at midnight dumps whatever was edited that day into dat. In certain cases, like the Oakland crime data set that Rick has extensive experience with, they just put a new CSV file on an FTP server every eight days or something, and there's no way to know what changed; it's a human-edited file, so they don't just append new records at the bottom of the file, they could have edited rows in Excel in the middle of it. You end up having to parse the entire file every time and only import what isn't already in dat. So the hardest thing with importing data into dat is figuring out what your update strategy is, because your data has to have an identifier that's unique within that data set. If your data is just first name and last name and has no ID in the source, then we can either generate a random ID for you, or we can hash the data in the row and use that as the ID, or you can pick which columns you want to represent the unique fingerprint of the row; and if your source data already has IDs, you can just tell dat which column is the ID column (there's a rough sketch of the hashing idea after this answer). One of the big problems with existing systems is that a lot of databases are designed to do really complex queries; they're not designed to distribute themselves to users as snapshots. I wish more databases had full-featured cloning built into them. So a lot of the time the strategy ends up being really naive: just brute-force dump the whole thing every day. The good news is you only have to do that once at the beginning; once it's in dat, everything is smart and can be diffed and synced and streamed. So I envision a lot of the first tier being tools that dump databases out, either into a custom schema in dat, or into JSON, or just as attached files. Through that process you're basically converting your data into a version-control format, and once it's in version control we can be a lot more intelligent about it.

Sure. So the question was: in a system where data is being merged in from multiple sources, how do you ensure the quality of the data and how do you combat pollution of the data? We're thinking about a lot of these issues around centralized data sources that want to have control over the official quality. What I really like about open source today and the tools around it is that they're designed around the idea that if you won't take my pull request, there's nothing stopping me from forking: as long as it's openly licensed, I can fork it and have my version, and people can use that instead. It's kind of an open market for whoever wants to scratch their own itch, which is a thing you hear a lot in open source. And with data, we want the same thing to be possible when we're versioning this stuff.
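Going back to the import question for a moment: when the source rows have no natural ID, one option mentioned above is hashing the row contents, or a chosen subset of columns, and using that as the identifier. This is a minimal sketch of that idea; the file name, column names, and exact hash are illustrative, and dat's own scheme may differ.

```python
# Sketch of deriving a stable row identifier by hashing row contents, the
# fallback described above when the source data has no ID column. The file
# name, column names, and exact hash are illustrative; dat's own scheme
# may differ.
import csv
import hashlib

def row_id(row, key_columns=None):
    """Hash the whole row, or just the columns chosen as its fingerprint."""
    cols = key_columns or sorted(row)
    fingerprint = "\x00".join(f"{c}={row[c]}" for c in cols)
    return hashlib.sha256(fingerprint.encode("utf-8")).hexdigest()

with open("crime.csv", newline="") as f:                # e.g. the nightly CSV dump
    for row in csv.DictReader(f):
        rid = row_id(row, key_columns=["case_number"])  # or omit to hash every column
        # upsert into dat keyed on rid: rows that did not change produce the
        # same id, so re-importing the whole file every day stays idempotent
```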
So built into our protocols is the ability to sign things and have a trust system. For instance, we're thinking of a scheme where you give yourself a key pair and every piece of data you put into dat is signed with your key, and then we can cryptographically prove on the other end that the data came from the person you trust. You can then build systems that actually prove that when you receive data from someone it hasn't been man-in-the-middle attacked, that there wasn't an evil peer inserting bad data into people's databases. At the very least we can offer, at the protocol level, the ability to build a trust network, but I think there's a people problem on top of that. A lot of people hold really tightly onto their data and might never take data from other people, but I hope that by making this stuff more accessible, and making it easier to publish dissenting or alternate versions of the data, it lets the analyst decide whether they want to trust the official source or combine other sources, choosing whom they trust on a per-source basis. As long as the system itself can prove that the data you got is the data you intended to get, it'll be really interesting to see whether a data-forking phenomenon happens or not. But who knows. Back here?

Okay, so the question was, on the topic of the command line experience, or just the user experience of the tool: do we have any process for improving the actual command line experience? We haven't formalized our process, and I don't think we've even fully figured it out yet. A lot of it comes from the open source ecosystems we're part of: we try to make things installable in less than a few minutes. We actually made a demo for the dat alpha, for instance. We built this web workshop because we realized that terminals are scary, and if people haven't used terminals they're ten times scarier; asking people to open up their terminal, when they might have had a friend tell them to never open the terminal, is a big ask. Behind the scenes it's basically Docker plus a lot of fancy frontend. When you open up this address, anybody can go here; we wrote it for a workshop, and it doesn't reflect the new stuff in the beta that we mentioned, we haven't updated it quite yet, but it still has the basic dat features from the alpha. When you go to this page you get redirected to a new session, and what you get is a terminal here on the left, and it's a real terminal: you're actually logged in as the root user on a sandboxed computer, so you have your own machine to play with that already has dat installed on it. Then we have workshop content over here on the right that you can follow along with; the tutorial gives you the first dat command you should try, and you go over to the terminal and type it in. You can even create files, and they show up here, and you can edit them without having to use vim or Emacs and things people don't know how to use intuitively. So I could make a little file with, say, a name and an age: Max, and I'm not going to reveal my age, so instead I'll do name and
hair color, and I'll say Max and blonde. Then I can go back over here and say, okay, I want a dat: make a new empty dat, and then import my data into it. I'll just leave the defaults and import foo as a CSV. So in seconds people can go to their browser, it works in Firefox and Chrome, and they can try out dat without having to install anything, and if they really like it, hopefully they're motivated to figure out how to install it on their own machine. If you close the page, we shut down the session, so we were able to do this at basically no cost to us, and when we run a workshop we don't have to spend half an hour helping people install things. So making the tools super easy to install is the biggest issue with the command line, along with giving people a good first-run experience; those are the two biggest things we're focusing on. Any last questions? In the front, I guess.

Oh, sure. The question was to clarify the use of open source versus free software. Everything here, I should mention, is licensed under BSD, because I represent the East Bay. The BSD license allows everything to be used freely and modified in any case; that's the license we chose for our software. We can talk about any other open source versus free software questions afterwards, but basically everything is openly licensed, and I'd love to hear your opinion after. And then behind?

So, sustainability in terms of our organization. Okay, so the question was: we have an open source strategy, but how are we going to stay around? It's exciting to announce, well, I guess it was announced last week, that we got two more years of funding from the Sloan Foundation, so we're at least going to be here until 2017. And we're hoping that, like any good open source tool, we'll have people contribute to it who aren't us; the goal is for dat itself to be contributed to by the organizations who use it, and that's the end goal. In terms of what we end up doing, I would just love to keep working with scientists, get back into doing open government data work, and continue being in the field of more openness and more reproducibility. All right, so thanks everybody for coming, and I think the video is available online, right?