Alright, good afternoon. Great to see you here in Warsaw, a beautiful place. I will talk about data pipelines, quite broadly, about many different kinds of data pipelines. I am here representing partially Baylor College of Medicine (there are two institutions: Baylor University and Baylor College of Medicine) and also DataJoint Neuro, the company we started about two years ago to help neuroscience labs and multi-lab collaboratives build data pipelines.

Just for context, one project we have been involved in that shows some of the challenges of running a large, distributed, multi-investigator, multi-modality experiment is called MICrONS, standing for Machine Intelligence from Cortical Networks. Part of it involves three groups working at Baylor College of Medicine acquiring two-photon calcium data from mice in great detail: a full cubic millimeter, using two-photon and three-photon imaging. Every neuron in that full cubic millimeter is fully characterized in terms of its visual response properties. Then the same animal goes to the Allen Institute, where electron microscopy is conducted and that volume is completely imaged: every detail, every vesicle, every membrane. Then the data go to Princeton University, where the anatomy data are segmented, and finally the entire data set is brought together, so that for each cell we know its anatomy and we know what it did when the animal was still looking at pictures.

To give you some of the challenges: one is complexity. The mouse is imaged multiple times, in multiple sessions. In each session about 27,000 cells are recorded, and over multiple days, in the full cubic millimeter volume, you have about 100,000 neurons. After you show all the movies, you end up with about one million neuron-hours from one animal. And this is not the only thing that is done: the animal is also behaving. It is whisking, moving its eyes, running, viewing images, and all of these modalities are combined.

So there is a lot of data, and data quality is affected by many factors. The animal is just living its life; it is not trying to get a Nature paper out. It can fall asleep, and it is not necessarily doing what you ask it to do. If you are recording for hours and hours, you have to keep track of the state of the animal from physiological and behavioral readouts; if the eyes are closed, obviously the data are not good.

A lot of algorithms, for example for image segmentation, are designed in a little sandbox with one single data set. But in real life, when you have to deal with many different aspects, you have to control your entire pipeline very precisely, so that you have all the controls to rule out the bad data and make it work.

The images are also very large. This is another recent result, posted on bioRxiv from our lab: with this complexity and this size you find new patterns that you could not see before, simply because sometimes more is not just more, it is different. And then you need extreme flexibility. This is a fraction of the data pipeline, and data pipelines don't stay fixed: we continually improve different stages, for example motion correction or segmentation, and then the entire pipeline needs to redo its work.
This is, for example, the Spikefinder challenge, which fairly recently, last year, ran a competition of different calcium spike detection algorithms. And we want to work efficiently so we can do new kinds of experiments; we cannot afford to wait for days for data to be reprocessed and re-analyzed. In this paper, called the inception loop, inference about brain function is made in the following way: the activity of the brain in response to stimuli is recorded on day one, and a neural network is trained to predict the output of the brain. Then on day two, or overnight, the network is used to generate images, new types of stimuli, by running the network backwards, to figure out test stimuli that will prove we understood something about the brain. Some of them are called maximally exciting images: basically, using the neural network to produce the image that we predict will be most exciting to the nervous system. To do this you have to be very efficient. You have to make sure that as you run experiments, and run many of them per day, all the data are processed and ready to go for the next experiment on the following day.

So that's the context. What do we need to do this? When we talk about data sharing, there are different kinds of data sharing. The FAIR principles primarily talk about the kind of data sharing that happens at the end: when the study is done, you package your data set and you share it.

The data sharing I want to talk about is data pipeline sharing: live data, not data sets but data pipelines, live evolving data as people work together at the same time. For that, FAIR is just the tip of the iceberg. You need things like concurrent access and efficient searches, so the data need to be indexed; data integrity and consistency, which we'll talk a little more about; speed; and precise queries, because these are distributed teams and they need to access just the data they need, so you don't have to send data sets around and you can get just the piece of information that you need.

For that we have a framework called DataJoint that supports data pipelines. It is a general tool for creating shared computational data pipelines; it is quite general, but we happen to use it for neuroscience data pipelines, and to DataJoint you can connect a lot of interfaces and applications. We are working with computational relational databases, which basically means we combine a schema, the rules of data integrity, with computation as part of the model and workflow. The main advantage of DataJoint, and the main claim to significance here, is that we can take students who are not computer scientists and, by reducing things down to the principles, teach students and postdocs to become database programmers and database experts for computation.
So basically, if you know about relational databases, data are represented as tables, and each table is not just a spreadsheet: each table has a specific meaning, corresponding to something in your experiment. For example, one table might be called mouse and it contains mice; the next could be a session, which is a recording session; then a neuron; and there could be spike detection parameters, or algorithms that produce spikes, and activity statistics, things of that nature.

Data integrity is something else to consider: a lot of neuroscientists work with data as data repositories or data collections, but not with databases, and the concepts of data integrity are not familiar; we mentioned some of them, like identification. Some of these concepts need to be taught and institutionalized. For example, entity integrity is the guarantee that what you represent corresponds to what is in the real world and that the representation remains unique, so that when you work together and you go to a specific mouse and update something about it, you updated the right animal and everybody sees it. Referential integrity and compositional integrity are an inherent part of the process.

If you know relational databases, then in common parlance people think relational means SQL. There is no SQL involved in DataJoint. In fact, SQL was intended to be a simple language, but it turns out to be very complex because it does not make certain simplifying assumptions; it turns out that by making a few simple assumptions you can simplify things significantly and create a language for interacting with data that is conceptually much clearer. This is just an example: if you have done any databases, there is a kind of hello-world example in databases, the university registration database, and this example is from that world. With DataJoint, the query "give me all the currently enrolled students" is simply the student table restricted by enrollment in the current term; in SQL you would have to learn how to write those kinds of queries, which usually takes a semester. With DataJoint we can teach students who are neuroscientists, as part of their job, to work efficiently with shared structured data.

Each node is represented as a Python class or a MATLAB class, interchangeably, so that you can work from MATLAB and Python, and you can define computations. The data structure comes first: the definition specifies how the node fits into the pipeline, and then the make method is the computation that the node performs. It fetches data from upstream in the pipeline, performs the computation, and inserts the result into itself. That is basically what DataJoint is: a language for programming databases and computation. The architecture underneath is a relational database, MySQL for the most part, plus, to store the bulk data, a file system or S3 on Amazon or compatible services; MinIO, for example, is a great program that you can install on premises, acts as S3, and works amazingly well.

I wrote the first version of DataJoint for my own experiments about ten years ago. It spread through the lab and then to other labs, and with MICrONS people started hearing about it, so we started a company to help labs that need extra support. Everything is open source and free, and a lot of labs have adopted it, but some, especially multi-lab collaborations, often need support, so we started the company.
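To make the preceding description concrete (tables as classes, a definition string, and a make method that fetches from upstream, computes, and inserts into itself), here is a minimal sketch in Python of what such a pipeline declaration might look like. It is not taken from the talk or from any of the pipelines mentioned; the schema name, table names, and attributes are illustrative stand-ins, and it assumes a DataJoint installation with database credentials already configured.

```python
import datajoint as dj

schema = dj.schema('demo_pipeline')   # assumed schema name


@schema
class Mouse(dj.Manual):
    definition = """
    mouse_id      : int              # unique animal id
    ---
    date_of_birth : date
    sex           : enum('M', 'F', 'U')
    """


@schema
class Session(dj.Manual):
    definition = """
    -> Mouse                         # foreign key: referential integrity to Mouse
    session       : smallint         # session number within this animal
    ---
    session_date  : date
    """


@schema
class Neuron(dj.Manual):
    definition = """
    -> Session
    neuron_id     : int
    ---
    activity      : longblob         # e.g. a 1-D NumPy array with the calcium trace
    """


@schema
class ActivityStatistics(dj.Computed):
    definition = """
    -> Neuron
    ---
    mean_activity : float
    max_activity  : float
    """

    def make(self, key):
        # fetch from upstream in the pipeline, compute, insert the result into itself
        activity = (Neuron & key).fetch1('activity')
        self.insert1(dict(key,
                          mean_activity=float(activity.mean()),
                          max_activity=float(activity.max())))


# compute the statistics for every neuron that does not have them yet
ActivityStatistics.populate()
```

In DataJoint, populate() fills in whatever downstream results are missing, and deleting upstream entries cascades to their dependents, which is one way the "entire pipeline redoes its work" pattern mentioned earlier is typically handled.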
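The query style mentioned above (restriction instead of hand-written SQL) can be sketched in the same way. The university tables named in the comment (Student, Enroll, CurrentTerm) are assumptions based on the talk's description; the executable lines use only the illustrative tables declared in the previous sketch.

```python
# The university "hello world" query from the talk would read roughly as
#     currently_enrolled = Student & (Enroll & CurrentTerm)
# (table names assumed; the equivalent SQL would be considerably longer)

# With the illustrative pipeline declared above, the same operators compose.

# restriction (&): all sessions recorded from female mice
female_sessions = Session & (Mouse & 'sex = "F"')

# join (*): pair each session with the activity statistics computed from it
sessions_with_stats = Session * ActivityStatistics

# projection (.proj): keep the primary key plus only the chosen attributes
session_dates = Session.proj('session_date')
```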
Now we have this group, with software engineers, data engineers, and IT engineers to help with some of these more demanding projects. These are the labs that we currently know to be using DataJoint, but we don't know all of them, because it is just a free, open-source tool. The highlighted ones are the ones we work with directly, either in blue as an academic collaboration from Baylor College of Medicine or in green as contracts.

This type of approach allows a separation of labor in which the data engineering is hidden: you deal with a higher-level representation of your data pipeline, this logical structure with computations defined for it, and then all the parallel computing, all the things that need to happen to make it efficient, happen in the background, set up by the people who know how to do it. This is the separation of mindsets: the scientist deals with the conceptual questions; the data scientist creates the logic of the experiment, the logic of the data pipeline, the logic of computations and statistical tests, but does not need to know how the cloud works or how to convert and migrate data.

Some projects evolve into large projects, but some projects are now put together as large collaborations from the start. The International Brain Lab is a great example, where 21 leading labs were brought together to work on a common problem and needed an effective common data infrastructure, so our team joined the project to help and become that bridge. This is just a fragment from their pipeline; theirs is open source, and you can see it at the GitHub link or just google the International Brain Lab. Several other projects of similar scale are currently in progress. One of them, as an example, is called the Mesoscale Activity Map, funded by the Simons Foundation and involving four labs: BCM, Janelia Farm, Stanford, and NYU. We set up an Amazon cloud pipeline for them, but their data scientists define the logic of the pipeline and the computations and run them, and Argonne National Laboratory provided massive storage, so the shared data are accessible to everybody.

This is my summary. djneuro.io is the company website; I will be running a demo here, and we have a little booth where we can show things. A lot of great things are happening; this is a great time to be a data scientist or a neuroscientist, and ideally both. Thank you.

Yes, thank you very much. Questions? Rick?

Getting my exercise today. Suppose instead of coming from raw SQL someone is coming from another direction; let's say I've used Django, so I'm a Django developer. What would I be looking forward to in DataJoint? I can imagine one thing might be that you've thought about the quantities and volumes of large-scale data that someone just running a normal web server wouldn't have, so maybe Django's ORM wouldn't be optimized for that, but I imagine you have some other thoughts on that as well about the advantages of DataJoint, if you'd like to share those.

That's right. There is a family of concepts, programs or libraries, called ORMs, object-relational mappings.
The idea is that programmers are used to working more with objects and the object model, which usually has an object with its properties, and you access the data that way. A relational database works very differently: the idea of a relational database is that you work with data as sets. You don't iterate through sets; you have set operations, whereas in the object model you usually iterate through properties and links and follow them. So there is a mismatch, actually called the object-relational impedance mismatch, and a lot of people have tried to patch over it. DataJoint does not make that trade-off. It is fully relational, so you work with data as sets, and there is no such mapping; the operations are fully relational, things like joins. You don't tinker with individual properties; you work with data at a much higher level of abstraction than you would with Django, for example. Django is designed more specifically for web development rather than for working with scientific data, and we work with scientific data. One thing that databases are not very good at by themselves, and that DataJoint helps solve, is working with scientific data like arrays: massive numerical arrays, bulk data. We had to put a lot of features in to make that seamless, so when you fetch data it comes back as a NumPy array. Working with DataJoint you never work with files; there is no concept of files or file formats or anything of that sort. When you fetch data by a specific criterion, you get a numerical array in MATLAB or Python that contains exactly the data you asked for, and because of the flexible query language you can get data in combinations that are specific to your analysis. You can say: give me all the tuned cells from these three animals that have this genetic marker, and you don't have to get the entire data set; you just get those pieces of data in, say, one struct array, or an array of dictionaries in Python (see the sketch at the end of this transcript). Was that...?

Okay, any further questions? Yes, and I would already like to ask the other two speakers to quickly come to the front. We are a bit behind schedule, but I think we have time for a few concluding words instead of the panel discussion.

In this project, you also have a client which exports data in NWB format, right?

Oh, this is more like planned. Yes, we have clients who export to NWB; it is work in progress, but we have good demos for it. The way we work with NWB is primarily as the output of the pipeline. In the pipeline itself people work with the database, and then for the final product, which we sometimes refer to as the golden record, once you are done with your project, or with a part of the project, you package a data set for people to analyze.

But do you see a sufficient amount of commonality between projects that you could provide some unification or harmonization, even within NWB, with custom classes and all that?

Yes. NWB and DataJoint solve very different, orthogonal problems, and we work with NWB in teams. We work with a lot of teams; some of them use NWB, some don't.
For those who do: for example, we were recently completing a project with Loren Frank's lab at UCSF, who use NWB for high-performance computing. They have electrophysiology recordings that go on for days, so they need to be able to index into arrays and select just parts of them; this is something they had already developed. They also adopted DataJoint for their data management, because it has all the search capabilities, all the queries, so they want to have both. We work with them, and we had a little project to integrate the two, so that when they fetch data from DataJoint they get an object that is actually capable of indexing into an NWB file. That is also transparent: when users work with it they don't need to open files, they just get an NWB object back out of a DataJoint query and can work with it right in memory.

And that is also open source?

That is also open source, yes. It is coming out; right now it is getting tested. That is for his lab.
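As a closing illustration of the fetch behavior described in this Q&A (data coming back as in-memory arrays and dictionaries rather than files), here is a hedged sketch that builds on the illustrative pipeline declared earlier. The restrictions stand in for the "tuned cells from these three animals with this genetic marker" example; a real pipeline would restrict by its own tuning and genotype tables, and the NWB-backed fetch objects from the Frank lab integration are not shown, since that interface was still being tested at the time of the talk.

```python
# restrict to three specific animals; keys propagate down the pipeline,
# so the same restriction works on any downstream table
cells = ActivityStatistics & 'mouse_id in (101, 102, 103)' & (Mouse & 'sex = "F"')

# fetch returns the data directly in memory; no files or file formats involved
mean_rates = cells.fetch('mean_activity')   # 1-D NumPy array
records = cells.fetch(as_dict=True)         # list of dictionaries, one per entry

# fetch1 returns exactly one matching entry, e.g. a single calcium trace
trace = (Neuron & dict(mouse_id=101, session=1, neuron_id=7)).fetch1('activity')
```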