Today, I'll tell you about what I'll call a case study in reproducible data science. My idea is that this kind of process can be applied to many different fields of science. So if you come from neuroscience, this will be interesting because I'm going to be talking about brains. But if you come from some other field and you're interested in thinking about how you might do reproducible data science in whatever field you're in, you can think about how this would apply to many other kinds of data. I'll talk about a process that has, in my opinion, three different parts. The first is really about measuring, modeling, and explaining, and has to do with the substantive knowledge of the particular system that you're looking at, whether it be stars, brains, or really small things. The second is about evaluating what you've learned about the system: if you create models of your data, you want to know how well those models explain the data that you've measured, how well they account for the phenomena. And the third has to do with reproducibility and sustainability. Some aspects of this are social engineering, and some are software engineering: how do you make this kind of effort something that other people can build upon once you've got results? How can others take those results and continue the research? I'll talk specifically about the problem of assessing the network connectivity that we can measure in the living human brain. So let me show you what that might look like. Here's an estimate of the connections in a single human brain, measured while this person was alive, and still is alive. We refer to the sum total of all the connections of the human brain as the human connectome, in analogy to a genome or something of that sort.
And what we're looking at here, these lines, these tubes, are estimates of the trajectories of big bundles of fibers: the axons that are the projections of neurons through the white matter of the brain. The cortex is made of gray matter, which covers the outside; that's where all of the cell bodies of the neurons that make up our brain lie. And then there's a big mass of what we call white matter beneath the gray matter, where the trajectories of the connections between the different parts of the brain go. So these colorful bundles of spaghetti that you're looking at are the branches that form the network of connections that is our brain. And we want to measure those using magnetic resonance imaging. In fact, MRI is probably the only way that we can measure that in a living human right now. MRI can be used to measure many different kinds of things. You can measure some kind of estimate of neural activity using what's called functional MRI. You can measure the anatomy of the brain, the structure that I showed you on the outside there, using what we call structural MRI. And there are many other things you can measure with MRI. In today's talk I'll be focusing on measurements of connectivity using what we call diffusion MRI. I'll go into a little more detail about how we make this measurement in just a minute. But first I'll try to impress upon you that we think the white matter matters: it's actually important to make these measurements. It's been accepted for a long time that there are many different kinds of disorders that come from disconnectivity. Things go wrong somehow in these connections in the white matter, and that might lead to, for example, an inability to read or an inability to identify objects.
This was appreciated way back in the 19th century by the neurologists whose pictures are up here. And the reason is that these are not just passive cables that transmit information. The white matter actually changes with development, it adapts with learning, it affects our cognition and our behavior, and it is affected by various diseases. And if you measure different individuals, you might be able to explain a substantial amount of variance in how they behave by looking at differences in their white matter. So it's pretty important to measure. The main principle on which diffusion MRI relies is the fact that the tissue structure itself restricts the diffusion of water. If we look at a cell, say something spherical like this diagram of a cell, and we look at the water molecules, these molecules can freely diffuse inside the cell, but their diffusion is restricted by the boundaries, the membranes of the cell. We rely on measurements of this microscopic diffusion in diffusion MRI. When water is inside something spherical like this, it diffuses at the same rate in all directions, and we refer to that as isotropic diffusion: the same in all directions. But if we're looking at an axon, one of these connections between the different neurons in the brain, which will be on the order of maybe 1 to 10 microns in size, a water molecule inside one of these axons is restricted to diffuse more along the length of the axon, along the length of the connection, than across its boundaries. Diffusion in this case is anisotropic. And if there are more of these axons together in a bundle, and if this bundle has some kind of insulating material, called myelin, covering the axons, then water outside of the cells will also be restricted in its diffusion.
So if we measure along the length of one of these bundles, we'll find anisotropic diffusion, and that's what we measure in diffusion MRI. For example, here is a brain in a postmortem dissection. What's happened here is that the part of the brain that was covering these bundles was removed in the dissection, uncovering these large bundles of axons. You can see that they fasciculate into fairly large tubes. These tubes are not individual axons; they are very large collections of axons that are bundled together into fascicles. Along the length of these fascicles, for the reasons I showed you above, water both inside and outside of the axons will show anisotropic diffusion along the length of the bundle rather than across it. Yes? [Audience question] So the question was: what about diffusion around dendrites, dendro-dendritic connections? Dendrites are mostly present in the gray matter, where the cell bodies themselves are. And in the gray matter, at the resolution we're currently usually measuring at, we don't really see a lot of anisotropic diffusion. So it's not really a good method for measuring very small parts of neurons in the gray matter. Rather, it's pretty good for measuring anisotropic diffusion along the length of a white matter connection, in the white matter itself and not in the cortex. A quick follow-up? [Audience: what about gap junctions, electrical junctions?] Yes, that is correct: we cannot measure connections so well in dense places where there are a lot of cell bodies. Here, we're measuring these big fascicles and the properties of these big fascicles. Okay, let me just tell you briefly how diffusion MRI works. This is the measurement itself. Here the scale is about a centimeter.
This is a slice through the brain; this is the front of the brain, this is the back. What I'm showing you are different images that are acquired as a magnetic field gradient is applied across the sample, in different orientations. Each image in this movie corresponds to an image acquired when the gradient was applied in a different orientation. What you can see is that the intensity of the image changes: as we change the gradient, we're probing the diffusion in many different directions. For example, if you look right here, this part of the brain is called the corpus callosum. It's a big bundle of fibers that connects the two sides of the brain. And you can see that as we rotate this gradient around, this part of the brain becomes brighter and darker. That indicates to us that there's more diffusion in some directions than in others; in other words, the diffusion in this location is anisotropic. We can process these images to tell us where the anisotropy is, and this processing is what I'll tell you about next. So we make these measurements and we have measurements of the diffusion signal as the gradients are applied in many different directions. For example, here we have the measurements in one single volume, about 2 millimeters by 2 millimeters by 2 millimeters, a little cube that we call a voxel. In that voxel, we've made the measurement applying these diffusion gradients in 150 different directions. So these are 150 different dots, indicating the distance of diffusion we've calculated in each one of these directions in this one voxel. You can see that this is anisotropic: there is more diffusion along this axis than along that one. And one way to model this is to estimate the parameters of a three-dimensional Gaussian distribution of diffusion.
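To make this model concrete, here is a minimal numpy sketch of how such a three-dimensional Gaussian (a diffusion tensor) can be fit to multi-direction measurements with log-linear least squares. The b-value, diffusivities, and the 150 synthetic directions are illustrative assumptions, not the actual acquisition, and in practice one would use a diffusion MRI library rather than this hand-rolled fit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ground-truth tensor: strong diffusion along x, as inside a fiber bundle
D_true = np.diag([1.5e-3, 0.3e-3, 0.3e-3])   # diffusivities in mm^2/s (assumed values)
b = 1000.0                                    # b-value in s/mm^2 (assumed)
S0 = 1.0                                      # non-diffusion-weighted signal

# 150 gradient directions on the unit sphere
g = rng.normal(size=(150, 3))
g /= np.linalg.norm(g, axis=1, keepdims=True)

# Simulate the measurement: S(g) = S0 * exp(-b * g^T D g)
S = S0 * np.exp(-b * np.einsum('ij,jk,ik->i', g, D_true, g))

# Log-linear least squares for the 6 unique elements of the symmetric tensor
X = -b * np.column_stack([g[:, 0]**2, g[:, 1]**2, g[:, 2]**2,
                          2 * g[:, 0] * g[:, 1],
                          2 * g[:, 0] * g[:, 2],
                          2 * g[:, 1] * g[:, 2]])
d, *_ = np.linalg.lstsq(X, np.log(S / S0), rcond=None)
D_fit = np.array([[d[0], d[3], d[4]],
                  [d[3], d[1], d[5]],
                  [d[4], d[5], d[2]]])
```

On noiseless simulated data like this, the fit recovers the generating tensor; with real, noisy data the same linear system is solved in every voxel.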
We can draw that out as an ellipsoid. We take this and fit a model with very few parameters that is an estimate of the surface of this diffusion. It looks a little bit like an ellipsoid; it is an ellipsoid. It's elongated along the axis of greatest diffusion in that particular voxel, so it tells us along which direction there is the most diffusion. And that indicates to us in what direction, in that particular voxel, there might be a bundle of these axons, one of these big fascicles. So that's our model of diffusion. This model is quite useful for several reasons. The first is that we can estimate physical quantities of the tissue itself in that location. Here, we take all these parameters and estimate the mean diffusivity across all directions. This is, again, a horizontal slice through the brain, and you can see there are these holes in here. These are the ventricles, water-filled spaces inside our brain. In the ventricles there's more diffusivity than, say, in the white matter here. So that tells us a little about the physical properties of the tissue in different parts of the white matter. The second kind of quantity we can estimate is how much anisotropy there is. This is called fractional anisotropy, a quantity that varies between 0 and 1. When it's 0, there's pretty much the same amount of diffusion in all directions. When it's 1, there's really a lot of diffusion in one direction and very little in the others. And as you can see, in big bundles where there are these big connections through the white matter, the fractional anisotropy is very high. And the third thing that I indicated before is the principal diffusion direction: in which direction is this ellipsoid longest? This principal diffusion direction is mapped here as an RGB map.
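The three quantities just described all fall out of the eigendecomposition of the fitted tensor: mean diffusivity is the mean of the eigenvalues, fractional anisotropy is their normalized spread, and the principal diffusion direction is the eigenvector of the largest eigenvalue. Here is a small sketch; the example tensor values are made up for demonstration.

```python
import numpy as np

def tensor_scalars(D):
    """Mean diffusivity, fractional anisotropy, and principal diffusion
    direction from a symmetric 3x3 diffusion tensor."""
    evals, evecs = np.linalg.eigh(D)     # eigenvalues in ascending order
    md = evals.mean()
    # FA: normalized standard deviation of the eigenvalues, bounded in [0, 1]
    fa = np.sqrt(1.5 * np.sum((evals - md) ** 2) / np.sum(evals ** 2))
    pdd = evecs[:, -1]                   # eigenvector of the largest eigenvalue
    return md, fa, pdd

# Anisotropic "single fiber" tensor, elongated along x (illustrative values)
D = np.diag([1.5e-3, 0.3e-3, 0.3e-3])
md, fa, pdd = tensor_scalars(D)
rgb = np.abs(pdd)   # direction-encoded color: |x| -> red, |y| -> green, |z| -> blue
```

For an isotropic tensor the same function returns FA of zero, which matches the intuition that a sphere has no preferred direction.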
So we've transformed the x, y, z coordinates of the vector along which this ellipsoid points into r, g, and b coordinates. Red marks places where the diffusion is elongated along the left-right direction, green along the front-to-back direction of the brain, and blue up-down. So here, for example, there are bundles that go up and down to the cortex, and this is probably what we call the corticospinal tract. These are big bundles of connections between the cortex and the spine that transmit information about which muscles to move; this is the primary voluntary motor output of the brain. So that's this blue thing here: it's likely the corticospinal tract. We can then take the information that we extract from this principal diffusion direction and do what we call fiber tracking. This is a way of identifying neural connections in living human brains. So, for example, let's zoom in. Here we have, again, a slice through the brain, one of these FA maps, the fractional anisotropy map. Now zoom in on one part, and I'll plot each tensor, with its principal diffusion direction mapped to this RGB code, on top of the FA map. You can see here, for example, that the FA is high and the tensors are pointing right to left. That's because there's more diffusion along right-to-left than along front-to-back, and this is where we know the corpus callosum is, the big bundle of fibers that connects the two hemispheres. In another place, there might be more diffusion front-to-back and less right-to-left. What we can do is propagate a streamline through the vectors defined by these tensors and, in this manner, estimate a connection. Doing that many times over, we get something like this. So here is, again, the corpus callosum.
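Streamline propagation can be sketched very simply: starting from a seed point, repeatedly step a short distance along the local principal diffusion direction. This toy version uses a constant left-right vector field standing in for a real tensor volume, plus Euler integration with a step size I picked for illustration; real tractography algorithms add stopping criteria (for example, low FA) and curvature constraints.

```python
import numpy as np

def track(seed, direction_field, step=0.5, n_steps=20):
    """Deterministic streamline tracking by Euler integration: step along
    the principal diffusion direction at each successive position.
    `direction_field(p)` returns a unit vector at position p."""
    points = [np.asarray(seed, dtype=float)]
    v_prev = direction_field(points[0])
    for _ in range(n_steps):
        v = direction_field(points[-1])
        # Principal directions are sign-ambiguous; keep a consistent orientation
        if np.dot(v, v_prev) < 0:
            v = -v
        points.append(points[-1] + step * v)
        v_prev = v
    return np.array(points)

# Toy field: every position's principal direction points left-right (along x),
# mimicking a bundle like the corpus callosum
field = lambda p: np.array([1.0, 0.0, 0.0])
sl = track([0.0, 0.0, 0.0], field)
```

Seeding this procedure at many voxels and following the vector field in both directions is what produces the dense sets of streamlines shown on the slide.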
And this tells us not only that there may be a connection between this part of the brain and another part on the other side, but also what the trajectory of these connections through the white matter is, what their shape is. We might be able to probe the tissue properties at different locations along this connection. It gives us a lot of information about the white matter and these connections, so it's a very interesting and valuable thing to do. That's all great; so what is the problem? Well, the problem starts with the fact that there are many different algorithms to do this. I showed you one model, but there are many ways to model the white matter, the directions of fibers within one particular voxel, and also many different algorithms to do this tractography process. And these algorithms can produce very different results from each other. For example, we're looking here at the corticospinal tract that I just told you about, and another fasciculus called the arcuate fasciculus, this blue thing here. The corticospinal tract goes up and down from the cortex to the spine; the arcuate connects two different parts of the cortex itself. If you look at the results of algorithm one, you see smooth-looking bundles that don't intersect at all, whereas if you take another algorithm, you might see many rather tortuous trajectories that intersect quite a bit. And the question is: what is actually in the brain? Is it this thing or that thing? Not only that, these different algorithms tell us quite a different story about the connectivity. This, again, is the first algorithm. If you look at the endpoints of the arcuate fasciculus on the surface of the cortex, you might see quite different endpoints with one algorithm than with the other. So this seems like a pretty serious problem, and the question is: how can we tell?
Well, it's true that one of the major challenges we face in MRI is evaluating and validating the different algorithms that we use. It's also true that different features of the white matter are emphasized by different measurements. So we think it's important to perform what we would call in vivo validation. What does that mean? It simply means we want to evaluate models of the white matter with respect to the data that you actually measured. Some people propose that it might be a good idea to, for example, make these measurements in some animal, then sacrifice the animal and perform some kind of tracer experiment. The problem with that is that tracers are also not without error. And you have already collected this data; it might be better to validate it in vivo, on the subject on which you performed the experiment. We'll talk a little bit about how we propose to do that. One of the tools that we use is cross-validation, and it's a rather simple idea. If you make two measurements, you can do one simple thing, which is to ask: how well does the measurement repeat itself? What is the test-retest reliability? So this is the data in one voxel across 150 different directions of diffusion gradients. We can compare it to the same voxel, the same location in the brain, measured again with the same 150 directions, and ask how reliably the measurement repeats itself across subsequent measurements. We can compare that to another estimate: we take this data, fit a model to it, and use the model to try to predict the subsequent data. This is called cross-validation. And we can compare test-retest reliability to how well the model does in cross-validation. From this we can calculate what we call the relative root-mean-square error: the error in prediction from the model itself, relative to the error in prediction from test-retest.
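The relative error measure just described can be written in a few lines. This sketch uses a synthetic signal and a noise level of my own choosing, just to illustrate the idea that a good model should predict a repeat measurement better than the first measurement itself does.

```python
import numpy as np

def rrmse(model_prediction, test, retest):
    """Relative RMSE: the model's prediction error on a repeat measurement,
    divided by the test-retest error. A value below 1 means the model
    predicts the repeat better than simply repeating the measurement does."""
    rmse_model = np.sqrt(np.mean((model_prediction - retest) ** 2))
    rmse_retest = np.sqrt(np.mean((test - retest) ** 2))
    return rmse_model / rmse_retest

rng = np.random.default_rng(1)
signal = np.sin(np.linspace(0, np.pi, 150))      # "true" signal across 150 directions
test = signal + rng.normal(0, 0.1, 150)          # first measurement (noisy)
retest = signal + rng.normal(0, 0.1, 150)        # repeat measurement (new noise)
good_model = signal                              # a model that captures the signal
```

A model that captures the underlying signal scores below 1 because its prediction carries only one measurement's worth of noise, while the test-retest comparison carries two.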
And if the relative RMSE is smaller than one, that means the model we fit predicts better than test-retest reliability, and we think of it as a pretty good model of that particular voxel. So let's see what that looks like. DTI is often used to refer to this diffusion tensor, the very simple ellipsoid model that I told you about before. What I've mapped here, again on a horizontal slice (we have three slices here, but this is the horizontal plane), is pretty much the one part of the brain that is not fit very well by the diffusion tensor model. Overall, it fits very well: the relative RMSE is smaller than one in about 97% of the voxels in the white matter in a typical measurement. So it's pretty good. But there are some places where the relative RMSE goes above one, and I would note in particular this location. This is no mystery to us; we know that this is a location where several different fiber bundles intersect. This is where the corticospinal tract and the corpus callosum, going across, intersect, and there's another bundle, the superior longitudinal fasciculus, that goes front to back in this same location. The tensor might be a good model for an individual fiber bundle going through a voxel, but it's not a good model for a combination of different fiber bundles. So, a different kind of model. These models have appeared in the literature several times; people have proposed these kinds of things repeatedly, and I've noted here the various places where we found that people have talked about this. You can model what we call crossing fibers as a combination of these tensors. It's simply a combination.
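Here is a small numpy illustration of why crossing fibers break the single-tensor model: a signal generated by two perpendicular fiber populations cannot be captured by one tensor, while a combination of two tensors reproduces it. The directions, b-value, and 50/50 mixing weights are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
b, S0 = 1000.0, 1.0                      # assumed b-value and baseline signal
g = rng.normal(size=(150, 3))
g /= np.linalg.norm(g, axis=1, keepdims=True)

# Two fiber populations crossing at 90 degrees: one along x, one along y
D1 = np.diag([1.5e-3, 0.3e-3, 0.3e-3])
D2 = np.diag([0.3e-3, 1.5e-3, 0.3e-3])
qf = lambda D: np.einsum('ij,jk,ik->i', g, D, g)   # quadratic form per direction
S_cross = 0.5 * S0 * np.exp(-b * qf(D1)) + 0.5 * S0 * np.exp(-b * qf(D2))

def fit_tensor(S):
    """Log-linear least-squares fit of a single diffusion tensor."""
    X = -b * np.column_stack([g[:, 0]**2, g[:, 1]**2, g[:, 2]**2,
                              2*g[:, 0]*g[:, 1], 2*g[:, 0]*g[:, 2],
                              2*g[:, 1]*g[:, 2]])
    d, *_ = np.linalg.lstsq(X, np.log(S / S0), rcond=None)
    return np.array([[d[0], d[3], d[4]],
                     [d[3], d[1], d[5]],
                     [d[4], d[5], d[2]]])

# Best single tensor still leaves systematic error on the crossing signal
D_single = fit_tensor(S_cross)
S_pred = S0 * np.exp(-b * qf(D_single))
rmse_single = np.sqrt(np.mean((S_pred - S_cross) ** 2))

# A two-tensor combination with the right components reproduces it exactly
S_mix = 0.5 * S0 * np.exp(-b * qf(D1)) + 0.5 * S0 * np.exp(-b * qf(D2))
rmse_mix = np.sqrt(np.mean((S_mix - S_cross) ** 2))
```

This is the same logic as the fit-comparison on the slide: where bundles cross, the multi-tensor combination wins.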
And we find that this kind of multi-fiber model fits the data better than the tensor model in most of the voxels in the brain. In particular, it solves the problem we had in this location where we know there are crossing fibers: accounting for these crossing fibers helps us fit the data better. So that's pretty good. The second step, and I should say this is work that was led by my colleague Franco Pestilli, who is now at Indiana University and was a postdoc in the same lab that I'm in at Stanford, was to try to account not only for the data in single voxels, one at a time, but for the entire trajectory of a big bundle, maybe an entire connectome together. Just as the single ellipsoid model is a model of the white matter, of the signal in a single location, these bundles are also a model of the white matter. Typically what we do is take the diffusion signal and generate these tracks in the way I showed you before: we estimate tensors and propagate streamlines. But you can also do it the other way around. You can use a forward model to try to predict the diffusion signal, and then cross-validate against the diffusion signal. Here's how that would work. Say there's one fiber here. You put in the prediction from a tensor in one location, then you go along and add the prediction from another location, and so on. Then you do that for another fiber: you add in the prediction from each of the individual tensors along the length of this thing. And you set up a big matrix. For each one of the fibers, you put in the signal that would have been predicted along the entire trajectory, in every one of the voxels through which the fiber, or streamline, passes. Each column here corresponds to one of these streamlines.
And you can see that there are some locations, say here in the intersection, where both of these streamlines pass through the same voxel, so there's a prediction from both of them. But some streamlines pass through only some voxels, and they don't have anything to say about the voxels through which they don't pass. So that's how we set up this matrix, and it's pretty large, because we get a lot of these streamlines. We then try to fit this matrix to the data that we actually collected. We set this up as a big set of linear equations, and for that reason we call this method linear fascicle evaluation, or LiFE. So here it is, the LiFE equation. It boils down to the kind of general linear model that you probably know from many other settings, and you can solve for these weights. We solve for non-negative weights, because you can't have a negative streamline in there, using a non-negative least-squares procedure. Then we can use cross-validation: having estimated the betas, the weights, beta-hat, we multiply back through the matrix to get a prediction, and we can calculate the root-mean-square error relative to the signal. That tells us how well, once we've fit these weights, this set of connections together can predict the signal in another measurement. So it works; it's great, and I'll show you soon some results from this procedure. But I'd just like to point out that it comes with some challenges. In particular, you end up with a matrix that's pretty large: on the order of millions of columns and hundreds of millions of rows. It's pretty sparse, but still, it can be pretty challenging to solve, and it can be pretty challenging even to set up in the first place.
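A toy version of this setup can be sketched as follows. The matrix dimensions here are tiny (the real matrices have millions of columns and are sparse), the streamline predictions are made-up numbers just to show the structure, and the solver is a simple projected-gradient stand-in for the specialized non-negative least-squares routines actually used.

```python
import numpy as np

def nnls_pg(M, y, n_iter=5000):
    """Tiny projected-gradient solver for min ||M w - y||^2 subject to w >= 0.
    A stand-in for large-scale sparse NNLS solvers, for illustration only."""
    w = np.zeros(M.shape[1])
    step = 1.0 / np.linalg.norm(M, 2) ** 2   # step from the largest singular value
    for _ in range(n_iter):
        w = np.maximum(0.0, w - step * M.T @ (M @ w - y))
    return w

# Toy "LiFE" matrix: 6 voxels (rows), 3 candidate streamlines (columns).
# Column j holds streamline j's predicted signal in each voxel it crosses,
# and zero in voxels it does not pass through.
M = np.array([[1.0, 0.0, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.5],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# The "measured" signal is generated by streamlines 1 and 2 only;
# streamline 3 is a spurious candidate and should get weight ~0
w_true = np.array([0.7, 0.4, 0.0])
y = M @ w_true
w_hat = nnls_pg(M, y)          # beta-hat: non-negative streamline weights
y_pred = M @ w_hat             # multiply back through the matrix to predict
```

The interesting part is that the fit assigns near-zero weight to the spurious streamline, which is exactly how LiFE prunes candidate connectomes.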
It can be pretty challenging to solve these kinds of problems, so that's an interesting challenge in itself. But it has its benefits. For example, comparing the two models we talked about before, you can see that the error in prediction using a tensor is larger than if you use the multi-fiber model, even at the scale of large fiber connections. So it actually tells us which of these models of the white matter we should prefer, and which model of connectivity with the cortex we should prefer among the two. An interesting side story to this is a project that was led by Jason Yeatman, who was a graduate student in the lab and is now at UW (so I'm following him up to Seattle), together with Kevin Weiner, who is also a researcher at Stanford. They were interested in this part of the brain. In some work Jason did a few years ago, he discovered this interesting boundary that goes up and down in the occipital cortex, in the back of the brain. Digging back in the literature, they noticed that people had actually found this bundle before; this is late 19th-century anatomy. People had found this bundle, but it later disappeared from the literature. One of the reasons it disappeared, which is an interesting piece of the sociology of science, is that this fellow had an entirely different theory about how the white matter should be organized. He had a theory that long projection connections between different parts of the brain should only go front to back. And so, in his textbook that came out the year he died, he literally covered up the occipital cortex, with the vertical occipital fasciculus in it. It pretty much disappeared from the scientific literature, appearing only a very few times, until Jason found it in his data and, looking back at the literature, realized that people in the 19th century had identified this bundle already.
Okay. So we collected more data on this, in work led by Hiromasa Takemura, another postdoc in the lab. We have this estimate of the bundles that we say are the vertical occipital fasciculus. How well would we do in fitting the data if we assumed they weren't there at all? So we do a virtual lesion: we take away all these bundles and ask how well we can fit the data without them. We can compare two error distributions: one where we've lesioned, where we've removed the vertical occipital fasciculus, and one where the vertical occipital fasciculus is there. We find that there's a large difference in the error: you have a smaller error if you assume that the vertical occipital fasciculus is there. This is a validation, within an individual person, of the existence of this anatomical structure. We can say this anatomical structure exists, we think it's here, and here's the error on this estimate. Okay, so it has benefits. Now I'll move on to an even darker side of this, which is that it's often very hard to reproduce these kinds of experiments. We know this situation from other scientific fields: the data are quite expensive and complicated to collect, and you spend a lot of time and a lot of money collecting them. The analysis requires multiple computational steps, some of them rather complicated. The new algorithms that people write are in-house code, or computational black boxes that you download from somewhere without really knowing what they do. This is not unique to MRI, as I said, but it is a big problem. So the question is: what do we do to make this kind of research more reproducible and sustainable? By sustainability I mean that people can continue each other's work.
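The virtual-lesion idea can be illustrated with the same kind of linear model: fit the data with the candidate bundle's columns included, then again with them deleted, and compare the error distributions. For brevity this sketch uses unconstrained least squares on synthetic data of my own invention, rather than the non-negative fit on real diffusion signals.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy connectome matrix: 40 voxels, 8 streamlines; the last 3 columns play
# the role of the candidate bundle (e.g., the vertical occipital fasciculus)
M = np.abs(rng.normal(size=(40, 8)))
w_true = np.array([0.5, 0.8, 0.3, 0.6, 0.4, 0.7, 0.9, 0.5])
y = M @ w_true + rng.normal(0, 0.05, 40)   # "measured" signal with noise

def fit_rmse(M, y):
    """Least-squares fit; returns overall RMSE and per-voxel absolute errors."""
    w, *_ = np.linalg.lstsq(M, y, rcond=None)
    r = M @ w - y
    return np.sqrt(np.mean(r ** 2)), np.abs(r)

rmse_full, err_full = fit_rmse(M, y)            # bundle included
rmse_lesion, err_lesion = fit_rmse(M[:, :5], y) # virtual lesion: bundle removed
```

If the bundle carries real signal, the lesioned fit is systematically worse, and comparing the two per-voxel error distributions quantifies the evidence for the bundle's existence in that individual.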
Once we understand that we can find, say, the vertical occipital fasciculus in an individual, somebody else might want to find the vertical occipital fasciculus in their own measurements, and maybe say something about the properties of the white matter inside the vertical occipital fasciculus in a patient, or in some other individuals. And reproducibility has a specific meaning here, which we think stems from this nice quote from Buckheit and Donoho: if we write an article about a particular computational result, somebody else should be able to download that result, run it, and examine it; the article should come with the entire software environment, the code, and the data that produced the result. So in the articles that we wrote based on these analyses, what we're trying to do is, first of all, make the data available, sharing it through the Stanford library, and then share the analysis code itself on GitHub. In at least one of these papers, every figure and every conclusion is supported by an individual IPython notebook, which looks something like this. Here, for example, is test-retest reliability in one particular voxel over three different kinds of measurements; I haven't really gone into the details of that. So somebody can clone a GitHub repository, download the data, install the dependencies, and then run this. This is very nice, but it's not particularly sustainable, in the sense that it requires maintenance, right? Code rots, as we all know. Code that ran a year ago, we don't know whether it will run tomorrow. It requires a larger community effort to produce this kind of sustainability. So I'm going to tell you about nipy, which is an ecosystem for reproducible neuroimaging.
The idea of this ecosystem is that we have the Python language, and people have written very good tools for general scientific computing based on the Python language; building upon that, we've built a series of tools specifically for neuroimaging research in Python, including time-series analysis and libraries to deal with file formats. For this talk it's particularly interesting to focus on DIPY, diffusion imaging in Python, a particular part of that galaxy, which I'll talk about a little more now. The principles behind this community effort are that it's free and open source. We try to make it as thoroughly tested as we can. It goes through fairly rigorous code review on GitHub, through pull requests, et cetera. That makes it a fairly natural place to collaborate around different ideas in neuroimaging: you can propose different ideas for how to do neuroimaging analysis in this kind of environment, because it has the feeling of the kind of peer review that you want scientific ideas to undergo. And then we try to make it properly explained, so that others can pick it up and start using it. I'll show a little of what I mean by properly explained. How do I do that, though? Let me see. Ah, there. Let's see what happens if I do that. There we go. Okay, so properly explained means, among other things, that we try to teach people how to use this software. In particular, we have a whole system that we took from the developers of PyMVPA, which is a library for multivariate pattern analysis in MRI. We borrowed from them the idea that you could write (where is my mouse? here it is) these little scripts that include code and explanation together. So here, for example, is this linear fascicle evaluation, now implemented in DIPY. Yes? [Audience: Can you make it bigger?] Oh, bigger, yes. Assuming I can get my mouse there. There it is.
This includes both the code and a little explanation of what we're doing here. It includes a full code example, including how we evaluate the error in some candidate connectome. Here are the weights, the fiber weights that I told you about before. Here are distributions of errors in that part of the brain: how well we're doing before this fitting procedure and after. It has the full example here. So somebody can go and download the full source code: once you've installed DIPY, you can take this file, download it as well, and just run the entire example. I think that's a good way to explain to people how to use the code, and it gives them the first step they need if they want to continue the research that we've done. Now, how do I get out of here? Okay. I'll close you. I wonder what happened to my presentation. No displays. One still wonders. There you go. All right. Now, how do we know whether we're doing well? Are people actually using our software? One way is to check how many people downloaded the software, but that's not a great way to assess it, because people might download it and then never use it again, and different people might download it more than once. One proxy, which was actually proposed by Jake in his blog post about why Python is the last language you will ever have to learn, is to look at the number of cumulative contributors to the library, as sort of an extreme lower bound on the number of users. Jake noted in that blog post that as these projects here, the core projects of the scientific Python universe, started using GitHub around the end of 2010, the number of developers shot up. I've extended that analysis to now, and you can see that Python is still the last language you will ever have to learn.
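Counting cumulative contributors is straightforward once you have a commit log. This sketch works from a small hypothetical author/date list; in practice you might parse the output of `git log` to build it.

```python
import datetime

# Hypothetical commit log as (author, date) pairs, e.g. parsed from
# `git log --format='%aN|%ad'` (the names and dates here are invented)
log = [
    ("alice", datetime.date(2010, 3, 1)),
    ("bob",   datetime.date(2010, 9, 15)),
    ("alice", datetime.date(2011, 1, 5)),
    ("carol", datetime.date(2011, 6, 20)),
    ("dan",   datetime.date(2013, 2, 2)),
]

def cumulative_contributors(log):
    """Number of distinct authors seen up to each commit, in date order:
    a (very) conservative lower bound on the number of users over time."""
    seen, curve = set(), []
    for author, date in sorted(log, key=lambda entry: entry[1]):
        seen.add(author)
        curve.append((date, len(seen)))
    return curve

curve = cumulative_contributors(log)
```

Plotting `curve` for each project gives exactly the kind of contributor-growth chart discussed above.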
I also wrote a blog post about this myself, and one thing I added to the plot is IPython. You can see that the IPython curve starts to shoot up before these other curves do. So it's not only GitHub; it's worth mentioning that IPython moved to GitHub earlier than these other projects, maybe dragging them along. You can speculate about all of this; it's sort of data science about data science, which might be interesting to some people here. Anyway, we can do the same thing with DIPY. The numbers here are smaller, but you can see that with about a two-year lag behind these other projects, around 2013, there started to be a large influx of new developers, and that gives us some kind of lower bound on how many users there are: maybe about 30 people. Some of them have gone on to do other things since, but it looks pretty good; it keeps going up. That's me here in 2011, that little bump there, and I think Stefan is in here somewhere too. Okay, finally I'll talk about one more thing about sustainable, reproducible research, which is how we get this into the way we do science on a day-to-day basis. I'll tell you a little bit about a project we have at Stanford called the Project on Scientific Transparency. The idea is that reproducibility requires transparency; the two are almost interchangeable. In this project there are two aspects of reproducibility being covered. The first is data sharing, and I can't take any credit for this: Gunnar Schaefer, Bob Dougherty, and others at the Center for Cognitive and Neurobiological Imaging at Stanford have developed a system called NIMS, which stands for Neurobiological Image Management System.
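The cumulative-contributors proxy is easy to compute yourself. A minimal sketch, assuming you have already extracted (author, year) pairs from something like `git log --format='%an|%ad'`:

```python
def cumulative_contributors(commits):
    """Map year -> number of distinct authors seen up to and including it.

    `commits` is a list of (author, year) pairs, which in practice you might
    parse out of `git log --format='%an|%ad'` for the repository in question.
    """
    seen = set()
    by_year = {}
    for author, year in sorted(commits, key=lambda c: c[1]):
        seen.add(author)
        by_year[year] = len(seen)
    return by_year

# A toy commit log. The curve can only ever go up, which is what makes it a
# (very conservative) lower bound on the number of users.
log = [("alice", 2009), ("alice", 2010), ("bob", 2010),
       ("carol", 2011), ("bob", 2012), ("dan", 2013)]
print(cumulative_contributors(log))
# → {2009: 1, 2010: 2, 2011: 3, 2012: 3, 2013: 4}
```

Plotting these counts per project over time gives exactly the kind of curves described above.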
The idea is pretty simple. In contrast to other databases, where you run an experiment, collect some data, come to some results, and at the end of the life cycle of the experiment go to some website and upload your data, this system takes your data as it comes off the instrument. It's your first point of entry, at the very beginning of the process: put the data in the place where scientists actually get their data at the start of their work. The archiving, and then ultimately the sharing, of the data comes with a fairly flexible system of permissions on different pieces of the data, so that people can share particular pieces of the data they've used to do their science, either with specific collaborators or with the entire world. The other aspect we're thinking about is how we can do this for analysis as well. That becomes very complex and interesting; it's something we're thinking about a lot now, and I hope to keep thinking about in the future. A fundamental piece of the puzzle is, of course, the Jupyter project and everything around it. Another thing we're experimenting with is Docker containers: how we encapsulate pieces of our analysis into something that can very easily be downloaded and used in multiple different configurations. Maybe if your data is very, very large, you want to bring your analysis to your data rather than the other way around, wherever that data may be. This has a lot of potential to facilitate those kinds of patterns. Okay, so that was what I had to say. I'll try to bring this back to all of you doing neuroscience experiments, or thinking about other kinds of science. I hope I showed you a process where we go from the measurement of some very specific physical quantity in the brain.
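The flexible per-dataset permissions described for NIMS might be modeled roughly like this. The class and method names here are hypothetical illustrations, not NIMS's actual API:

```python
class Dataset:
    """Toy per-dataset permissions: owner, named collaborators, or everyone."""

    def __init__(self, owner):
        self.owner = owner
        self.shared_with = set()  # specific collaborators
        self.public = False       # "share with the entire world"

    def share_with(self, user):
        self.shared_with.add(user)

    def make_public(self):
        # The "check box" from the talk: data already archived in the system
        # simply gets flipped to world-readable.
        self.public = True

    def can_read(self, user):
        return self.public or user == self.owner or user in self.shared_with

scan = Dataset(owner="researcher")
scan.share_with("collaborator")
print(scan.can_read("collaborator"), scan.can_read("anonymous"))  # True False
scan.make_public()
print(scan.can_read("anonymous"))  # True
```

The design point is that because the data is archived in the system from the start, sharing is a permission flip rather than a separate upload step at the end of a project.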
We estimate a model of something we're interested in, whether it be brains or other things, and use statistics to know that we've got the right model of that thing. We also want others to be able to take what we've done and continue the work afterwards. We don't want it to just be a PDF on somebody's hard drive; it needs to be something that can be sustained. Finally, I'd like to thank my collaborators on all this work. Brian Wandell is the head of the VISTA Lab, where I'm currently a postdoc. I mentioned Franco, and I mentioned Jason as well, who did some of this work. I have a very cool and interesting collaboration with Charles Zheng and Yuval Benjamini over in the stats department on some of these things. These are folks with whom I've been working on nipy-related stuff; I should mention that Fernando got me into this a few years back, when I was a graduate student here, and roped me in. These are the folks who started DIPY and have helped work on it; maybe if you do diffusion imaging, you would like to help out yourself. This is the team around the Project on Scientific Transparency. Also, we have generous funding from all kinds of places. I'd be happy to discuss this now if anyone wants. Thanks.

We have a full half hour for discussion questions. We welcome you to ask anything. We are recording this, so I will repeat the questions. Any questions? Yes.

One of the things you said about the standard way of sharing data: you just upload it somewhere, and maybe in a year it doesn't work. You said you can have this website where you put explanations and data. So in two years, when the software updates, do you update the code? What is the solution on the software side?

So there are two different parts to this.
One is the software part, and what I've learned from my own experience working on this project over the last few years is that if you work by yourself on a software project that's supposed to do your entire analysis, everything from the raw data, it's kind of hopeless, because you won't be able to maintain it. You need to collaborate with a community of people so that the code doesn't rot, and we work a lot to keep that code from rotting. Matthew, sitting back there, is very active in pushing the idea that we need to make sure this code works not only on my machine today, but also on some Windows machine that I've never even seen. So we have build bots that check every commit to the repository: does it run on all these different platforms? That way we have some confidence that somebody else who's interested in doing this analysis, or in examining my results, doesn't have the experience of downloading the code, trying to get started, finding it broken, and just moving on.

Yes, please. A major issue I wanted to bring up: I've been working on questions around public funding of science, NSF-style programs, and I wanted to address something in your talk as a result of that. There are two projects: your scientific transparency project, and also a project called the Berkeley Initiative for Transparency in the Social Sciences, which actually has some Stanford people on board. I think this is exactly the problem: we've now got two different standards of openness already. The Stanford people on board include people like John Ioannidis, who published that notorious paper on why most published research findings are false, and it is actually a serious paper. The question, then, would be: would your group be open to a dialogue with his group? This is particularly relevant since fMRI has been notorious ever since Ed Vul's paper, the voodoo correlations paper.

Okay, I'll try to repeat that. The question was,
and I'll pick out the part I'm interested in: there are these different initiatives of people trying to make science more reproducible, so why are we so split up into different groups? Part of it is sociological; we're each doing our own particular thing. The NIMS system that I presented was developed originally to scratch our own particular itch. People at Stanford, even before I came, had an instrument arriving and producing all this data, and they didn't want to follow the usual trajectory, which is that you put the instrument in place, all of a sudden you get a deluge of data, and only then do you start figuring out where to put it. Instead, we wanted to preempt that by putting in place a system that would still be useful even years later. I don't know if that addresses all of what you said, but certainly we know of each other, and there is dialogue; we're not ignoring each other. We've definitely gone to the project they call METRICS over at Stanford; we've talked to those people. It's not like we're in different universes.

On the fMRI stuff: have you followed up on the testing that you did in that analysis?
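As background for this exchange, the "voodoo correlations" problem raised earlier, selecting a voxel and then evaluating its correlation on the same data, can be demonstrated with a toy numpy simulation. Everything here is made-up noise data, not the analyses under discussion:

```python
import numpy as np

rng = np.random.default_rng(42)
n_trials, n_voxels = 50, 1000
behavior = rng.normal(size=n_trials)
voxels = rng.normal(size=(n_trials, n_voxels))  # pure noise: no real effect

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

# Double dipping: pick the best-correlated voxel and report that same number.
all_r = np.array([corr(behavior, voxels[:, v]) for v in range(n_voxels)])
best = int(np.argmax(all_r))
print(f"double-dipped r = {all_r[best]:.2f}")  # looks striking despite pure noise

# Cross-validation: select the voxel on one half, evaluate on the held-out half.
half = n_trials // 2
train_r = np.array([corr(behavior[:half], voxels[:half, v])
                    for v in range(n_voxels)])
best_cv = int(np.argmax(train_r))
test_r = corr(behavior[half:], voxels[half:, best_cv])
print(f"cross-validated r = {test_r:.2f}")  # typically near zero on noise
```

Because selection and evaluation use the same trials in the first case, the maximum over a thousand noise correlations looks like a real effect; holding out half the trials removes that bias.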
Yes. So you're asking about a paper published a few years ago where people did what might be called double dipping into their data: they detected some kind of correlation, went to that particular spot, plotted the correlation, and said, look how striking this correlation is. The cross-validation approach that I presented here deals with that kind of issue completely. The fMRI field has definitely learned the value of this kind of approach, where you try to separate your noise from your model by doing cross-validation.

Yes. The question, if I understand it correctly, is how quantitative this measurement is. If I took me and measured me in Berkeley, then took me to Stanford and measured there, then traveled to Japan and did the measurement again, how similar would the measurements come out, in terms of, for example, an estimate of diffusivity somewhere in the brain? This measure of diffusivity that I was talking about is not quantitative in that sense; it depends on the calibration of the scanner and other issues. For estimates of trajectories, some of them will be reproducible and some of them will not; that's measurement noise. There are MRI measurements that are quantitative, though. Briefly on the screen there was a picture of my colleague Aviv Mezer, who developed a method to calibrate and quantify the density of tissue in different parts of the brain. That measurement repeats very nicely, up to some level of noise, when you take an individual from one location to another. So it depends a little bit on which measurement you're doing; diffusion is problematic in that sense. It sounds like it's not... yes, the question was: so it's not ready for the clinic, then? And the answer is, well, it's been used in the clinic for
some uses, and it's very useful there. If you suspect that somebody has an infarct somewhere in their brain, a stroke, doing a diffusion scan is a pretty good way to assess that, by looking at the scan and seeing whether there's a big spot of higher diffusion somewhere. You don't need to know whether a value is 3 or 3.2 to be able to do that. But was that the question you were asking? Yes.

The question is: if you have a database with the raw data, and you do some analysis on that raw data, do you put the processed data back into the database as well? And further, if that processing changes over the years, how do you deal with that and propagate it, so that the analysis can be redone to produce those results again and again? Those are both really hard problems. The first is hard because, as we've experienced, one interesting thing is that the data we collect is not actually very large; the raw data sizes are not huge. But then you start processing it. For example, you take the raw diffusion data, measured in cubes 2 mm on a side for the entire brain, so you have some size of data, and then you start running this tracking process on it, producing a million different streamlines, and all of a sudden you have a huge file on your hands. So if I have this database, meant for holding results and data, do I allow files of any size to go in? I don't actually know what the answer to that question should be. The second question is also very hard: what happens when you change your processing? Do you update the analysis? Here's the worst-case scenario: I did some experiment today and fit some model to the data; two years from now I discover that I could do it better, not faster but more accurately. Do I update the paper that I published before? I would say probably not; that's science, you keep getting it more and more
accurate. So it depends exactly what changed. If it was an error, if the code was just wrong and produced some result that wasn't really there, then we want to know that and we want to update things. But I don't think there's one answer to that question; it depends a little bit. Does that make sense?

Yes. The question is: what about analysis that includes some stochastic element, something probabilistic? The tracking, for example, can be done probabilistically. Do you want to store that kind of analysis, or not? Well, you can store the random seed, and then it would be reproducible; that would be one way to deal with it, but it doesn't quite answer the question. I think you would want to store the results of a probabilistic analysis. You did some analysis, it had some stochastic element in it, and you came to some results. Given that stochasticity, it would actually be kind of great if somebody else went and reran it with their own random seed to see whether the result still holds up. If the conclusion that you came to, rather than the exact result up to some digits, depends on the choice of random seed, that's something you want to know. So I'd say yes, store that thing, and then have others run it again, reproduce it, and see what the stochasticity did. I hope that makes sense.

Yes, I have a question about the data coming in from the machine. I like the idea of putting the repository responsibility a lot earlier in the data life cycle. Are you finding people from other fields asking if they could copy that model?

So the question is about the database model where the data goes straight from the instrument into the database, and sort of
scientists get their data from there, rather than putting their data there, and the question is whether scientists from other fields have shown interest in this. That's a good question. I defer to others on the team to answer it; I don't actually know whether there has been interest, but it obviously has application in many different fields. I will say that the scientists who use this system on a day-to-day basis, and I've been around them for a while now, don't think of it as a system for sharing data. They just think of it as the place they go to get their data when they want to do their analysis. And that has huge benefits, because then, when you go to them and mandate sharing somehow, you can tell them: you're already set up to do that; you don't need to do anything; you just need to click this check box that makes the data public. I think that has huge benefits for people.

Yes, please. The question is, and there might be others here who could also answer it, how prevalent is this attitude: I've worked on this thing, it's pretty valuable, I think, and I don't want to share it; I want to sell it, or I want to use it again before I make it available, since I've collected this extremely complicated, very large data set involving many subjects. And it's true for code as well as data: I've just published the paper, so should I make the code and the data available, or do I want to write five more papers on this data set before I do? That attitude is prevalent; I've seen it around. And to some degree I actually respect it: people's copyright to their code, people's rights over their data. I respect that to some degree, though it's not clear exactly where you put the limit. If people make certain scientific claims about the brain, for example, paying for the experiments with taxpayer funds, I'm not so sure where the
line exactly crosses. But still, I would like for people not to be demotivated from doing cool things just because of this feeling that they have to share. That said, I think there's additional value that can be elicited from the act of sharing. The typical thing, for me, as somebody who makes a lot of errors, is that if I share my code, others will come along and make it better for me. I like that; it saves me a lot of time.

Yes, Matthew, back there. It seems, at least to me, quite difficult to imagine going back. Do you have experience, do you see people do open source and then stop? So the question is: have I seen people do open source and then say, you know what, this project is pretty cool, I think I'm not going to tell anyone about it? Yeah, sure, I've seen that. I'll say more about that: I've learned a lot from doing open source software development myself, from sharing code. I've gotten so much value out of it that I don't entirely understand when people come to me and say this kind of thing. I say, well, I respect your attitude, but you're kind of shooting yourself in the foot; it's probably not as valuable as you think it is if you're going to do that.

Thanks, that reminded me of something. The question is: what obstacles are we facing in making research more reproducible, and what should we do? That reminded me that I wanted to mention a really interesting article that came out just this week, by Roger Peng and Jeff Leek, about reproducibility and replicability. Those are two different things. Reproducible means: I give you my code, you run it on your computer, and you get the same figure that I got. Replicability means you go and do the same experiment on somebody else. You take some new subject, one we haven't measured the vertical occipital fasciculus in, which earlier I needed
to explain, and we find that the vertical occipital fasciculus is right there in that person too. That's replicability, and that's actually what we aim for: replicability of our results. A reproducible result can be just plain wrong: if you have a bug in your code, it's reproducible; that bug will come up every time. And they point to something that I think maybe answers your question, which is education. You want scientists to be sufficiently sophisticated in looking at the data that they can evaluate how well they're doing in terms of reproducibility, and I think that's really important. After reading this article I sent it around in the lab and we had a bit of a discussion, and we had disagreements among ourselves. As a group, we all think reproducibility, and obviously replicability, are really important, and we're not really sure what the right thing to do is. If the incentive structure is all wrong, should we make the principal investigators, the powerful people in the field, feel more strongly about this? Because right now they don't really need to; success doesn't really depend on caring about these things. Or, and this is my feeling, should we go to the young ones and convince them that they don't want to be in a field where you can do an experiment, publish it, and then somebody else comes along and it just doesn't hold up? That's my feeling. It's particularly strange in the psychology department: a lot of researchers come from a background in psychology, and psychologists, myself included, rarely get trained in computational research. And I shouldn't single out psychology; many scientists don't get trained in computational practices. I think efforts like Software Carpentry obviously play an important role in raising the level a little bit. I don't think
there are structural impediments; there's nothing like, oh, this kind of research could never be reproducible. It's complicated: there are all these technical things that need to be figured out, like the earlier question about building a really complex pipeline out of several steps and making sure that you always have the same steps in place. But that's just a technical issue; I don't think it's a structural, fundamental one. Thanks.
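The seed-storage suggestion from the discussion, store the seed so a stochastic analysis reruns exactly, then rerun with fresh seeds to check that the conclusion rather than the digits survives, can be sketched as follows. The "analysis" here is a stand-in, not real probabilistic tractography:

```python
import numpy as np

def noisy_estimate(seed):
    """Stand-in for a stochastic analysis; the true underlying effect is 1.0."""
    rng = np.random.default_rng(seed)
    sample = 1.0 + rng.normal(0, 0.1, size=1000)
    return sample.mean()

# Storing the seed makes the exact numbers reproducible bit for bit...
assert noisy_estimate(seed=0) == noisy_estimate(seed=0)

# ...while rerunning with other seeds checks the conclusion, not the digits.
estimates = [noisy_estimate(seed=s) for s in range(10)]
print(all(e > 0.9 for e in estimates))  # → True: the conclusion survives reseeding
```

If the conclusion flipped under reseeding, that would itself be a result worth recording, which is the point made in the answer above.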