Very excited again to introduce the third keynote of the conference. As I've been mentioning, with the keynotes we're trying to broaden our view, broaden our perspective a little bit. So we heard about OER and copyright issues first, then yesterday about open access to research, and today we're going to hear about open science from John. John's bio, as you can imagine, is very long. He currently works at Sage Bionetworks, where they build tools and policies that help networks of people who have their own health data share it, and help networks of people who like to analyze health data engage in that activity as well. Before coming to Sage, he was at the Berkman Center for Internet and Society, at the World Wide Web Consortium (that would be fun to hear about), at the U.S. House of Representatives, and with Science Commons. He's done all kinds of interesting things on the advocacy side and on many fronts. You're in for a real treat. I assume that many of you don't know much, if anything at all, about open science, because, again, it's not something that we talk about, and it's just going to be great to listen to what he has to say, think about how to connect it into our work, and broaden our minds, or maybe we could say open our minds a little further, to other open work that's going on in the field. So please help me welcome John to the stage. John.

So here's the 30 seconds of awkward time-filling while I plug in. I like to fill that by saying: would the organizers of the conference please stand up? Besides David, who else is an organizer? Can we give them a hand, please? Okay, so if this works well, I'll be talking in a second. Let's see. I don't have my login screen. There's nothing like the first talk on a Friday morning at the end of a long conference. So I hope that I am interesting and engaging enough to keep you awake. And I didn't really know what to talk about, because I haven't worked in sort of OER ever.
I've worked near OER, when I was at Creative Commons. But most of what I do, as David alluded to, is science-based. I wanted to try to connect the science to the education, because I feel like we often are creating open silos, whether we realize it or not. The OER group doesn't talk to the open science group, which doesn't talk to the open access group, except inside the advocacy organizations or the tools organizations that connect them. And that's a shame, because what's happening to us in science affects education, and what's happening in education affects us in science, because you don't become a scientist without going through an educational process. And both of these are affected by the broader culture. So I wanted to start by talking about the impact of prediction culture on science and how I think that's going to affect education.

The classic quote here is attributed to Yogi Berra, but when you dig into it, it's actually really interesting. This goes back a couple hundred years to Denmark; it's an old Danish phrase. It's been attributed to Yogi Berra, who did say it. And the amazing thing about it is how quickly you can find the origin of a phrase like this using the internet, using Google. In many ways that's possible because all of the old books that contain this phrase are in the public domain, so they have been scanned and I can rapidly analyze them. Nothing, of course, from the last 25 or so years is available in that context. So I can only track it to the late 1960s from an etymology perspective. But it's a great comment. And it was really true: it was really hard to make predictions about the future. But it's increasingly the case that... do we lose anything here? I'm plugged in. I swear I'd have slides. Do we have any ideas? Do you want to just unplug and plug back in? Let's see, check the input terminal. I might not be connected to it. I was just getting warmed up. There you go, there we go. Alright.
There you go, just fire it up. Okay, there we go. Thank you, David. I was just getting warmed up. So the thing about predictions is that predictions are increasingly accurate. Weather is the easiest example. But when you think about it, it's predictions about ourselves that are the most accurate predictions we can make right now. As an example, I was thinking about this yesterday when I was putting this together: every single website I went to was trying to sell me Tylenol. And not only that, Tylenol PM too. I've been traveling pretty much every day for the last two weeks. So it's not like they know me; they literally know the kinds of things I type into Google. The ability to index my e-mail, in which I complain to Carolina about how bad my back hurts from all the flights I've been taking, or in which I complain to my family about how I'm having trouble sleeping because I'm jet-lagged, the ability to mine that allows them to make increasingly accurate predictions about what I'm going to do.

And if you want to see a visualization of that: this is the n-gram for the appearance of the phrase "data mining" in books over the last 200 or so years. It's a phrase that essentially never existed, and it has literally exploded, in books, in journals, and so forth. And that is why we can make increasingly accurate predictions about me or about you: because we have this aggregated information about ourselves, and we can use mathematical tools to mine it to make predictions. In particular, what we're talking about is probability. The kind of mining that we're doing is not sort of proof-based. It asks: what's the probability that John needs Tylenol, based on all the prior information we've got about John? And the tools for this are not limited to advertising. They're basic mathematical tools.
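The kind of probabilistic guess described here can be sketched with a toy naive-Bayes calculation. Everything below is illustrative: the signals, the prior, and the likelihood numbers are invented, not drawn from any real ad system.

```python
# Toy naive-Bayes-style sketch of "what's the probability John needs
# Tylenol, given what's been mined about him?". All numbers are invented.

def posterior(prior, likelihoods):
    """Combine a prior probability with per-signal likelihood pairs.

    likelihoods: list of (P(signal | needs it), P(signal | doesn't)).
    """
    odds = prior / (1 - prior)
    for p_yes, p_no in likelihoods:
        odds *= p_yes / p_no          # multiply in each likelihood ratio
    return odds / (1 + odds)          # convert odds back to a probability

# Hypothetical mined signals: a back-pain email, sleep-aid searches,
# two weeks of daily flights on the calendar.
signals = [(0.8, 0.1), (0.7, 0.2), (0.6, 0.3)]
p = posterior(prior=0.05, likelihoods=signals)
```

Starting from a 5% base rate, the three signals together push the estimate to roughly 75%, which is the whole trick: each mined scrap of behavior multiplies the odds.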
They're available to anyone who's got enough data about a large enough group of individuals, or a large enough subject matter, to use them. I work mainly in the life sciences, and it's an academic field that is increasingly data-driven and less narrative-driven. Because biology really was a story-based discipline, despite all of the trappings, up until about 25 years ago or so. The data is overcoming that, and biology is rapidly turning into a prediction science, a probability-based science. And that comes from the cost of data as much as anything. We like to talk about data liberation, data sharing. When data is expensive, people don't like to share it; if data is cheap, it becomes increasingly smart to share. So the cost of the data is an incredibly important factor. And you see the impact of this in sequencing: the cost of this started to be tracked in 2001, when we announced the reference genome. That's where this graph begins. And somewhere around 2008 we crossed the line on Moore's law, where the cost of generating this stuff dropped faster than the cost of computing, and that's exactly the time at which you see a company like 23andMe come onto the scene and provide a consumer service for something that used to be the exclusive province of a credentialed, titled PhD scientist at a major research institution. So suddenly I have the power to get something that only a scientist had the power to give me in the past. And indeed, they wouldn't give it to me.

It's not just genomes. This is a startup company called Science Exchange; I have no stake in this company. They were a company that said, you know, we could do an eBay-style marketplace for all the overbuilt core facilities at American land-grant research universities. During the prior decade, during the real estate boom, there was a lot of easy money going around in the sciences.
A lot of people bought gigantic machines to do research, and now that the NIH budgets are flat or dropping, those machines aren't getting used. So they're starting to farm them out. And if you look at this: microarrays. I started a company based on microarrays about 15 years ago, and it's insane that they're $100 per sample. DNA sequencing, $2.50 per sample. So I tested this. I said, what happens if I want to take 500 women and see if they have the BRCA1 and BRCA2 breast cancer genes? How much will it cost me? And I got it down to $200 per sample, including biobanking and shipping. And no one ever asked me for a credential, except whether or not my credit card had enough room on it. And I would call your attention in particular to the last one, which is bioinformatics. Bioinformatics is the analysis of this data. It was the hottest job discipline in biology in the last decade. $50 an hour. My plumber costs $100 an hour.

I raise all of this because when you can generate data this cheaply, and the tools for analyzing it are being democratized by the advertising culture, the cost of this just drops and drops and drops. And that's coming to basically every academic discipline that has any impact from data. So I tried to think of a field that was as far away from biology as I could, and I chose archaeology. This is Open Context. It's run by the awesome Eric Kansa in Berkeley; Eric Kansa is a dear friend. The big circles represent 150,000 records of archived archaeological data. Obviously it's clustered around the Middle East and North Africa, because that's the cradle of archaeology in many ways. But you see it popping up elsewhere. And even the field of etymology, which we started out with in this quote, has been changed by the fact that I can basically apply large-scale data analytics to finding the origin of a string of text. So it's easy to think, oh, that's biology.
It's not coming to me. I think it is. Because everything is text, if you're going to be a deconstructionist. And if everything can be indexed by Google as text, then basically every field has a data wave coming at it, and we're going to have to deal with it. And just to really freak you out: these are all the open-source sensors you can buy for under $30. There are about 150 listed on this page; I could have chosen any number of vendors. Basically anything that you want to measure is increasingly measurable. Anything you want to index is increasingly indexable. We can have a debate at another time about whether that's good or not. But that is what's happening.

So probability is going to be the coin of academic disciplines. The ability to take this data and do probabilistic analysis on it is already differentiating the top biology labs from the other biology labs. It is already differentiating the top text analyzers from the other text analyzers. And the tools that do this are increasingly free, because it makes a ton of sense, if you're Google or Amazon or whoever, to make sure that you're growing as many people who can do analysis for you as possible. And so in the search to sell us sweaters and Tylenol, these tools are going to be made available to science, and they're going to be made available to archaeology, and they're going to be made available to teachers and students in a way that has never ever existed. And that's weird. It changes the way that we know what we know, because we used to think we knew something and it was stable for a while. We would say the central dogma of molecular biology, the way that genes and proteins interact with each other, that's a stable fact. If you think about Latour and his theory of the way that we know things, what you wanted was something that was so well known that people associated it with you automatically, like Watson and Crick and DNA.
But the stability of what you know changes in a probability culture, because the probability literally changes every time you give it new information. So the stability of what we know in all of these data-driven sciences is dropping. And this is just a graph of a derivative, a rate of change. The rate of change is increasing: the rate at which what we know changes increases as we add data, because literally every day what we think we know shifts as we feed more data to the model. This is a way of life in technology. It's increasingly a way of life in biology. And it's changing the educational cultures of those fields in a way that I think is interesting. It changes the way that groups come together and start to know things. And that's going to really change the way that we need to train those people and communicate what we know to those people.

What we're learning in biology is that the pedagogy for biology is completely failing. It's completely failing to teach people how to live in this kind of world. The average scientist who's getting her first NIH grant is getting it at 39 to 42 years old. That's for the first R01. That means she's at least 10 years out of graduate school, which means she exited graduate school before Facebook and before iPhones. And the concept of the pedagogy in biology is that you're done, right? You get your PhD. There's no continuing education in biology. Your PhD means you're credentialed. You get measured on a couple of publications over a multi-year period. You don't have to learn anything more; we gave you a PhD. And it's creating a total failure in the ability of the people who have been trained in biology to deal with the reality of the data flood that's coming at biology. So, we talk about platforms, or the platform of the future, for the life sciences or for education in general. A lot of times we'll talk about a journal publishing platform or a textbook publishing platform.
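That daily shift in "what we know" can be made concrete with a tiny Bayesian updating sketch: a Beta-Binomial belief about some claimed effect, recomputed each time a new batch of data arrives. The batches and numbers below are invented purely for illustration.

```python
# Sketch of knowledge instability in a probability culture: a
# Beta-Binomial belief about some effect, updated as data batches
# arrive. Illustrative numbers only.

def update(alpha, beta, successes, failures):
    """Conjugate update of a Beta(alpha, beta) belief."""
    return alpha + successes, beta + failures

alpha, beta = 1, 1                     # flat prior: we claim nothing yet
estimates = []
for s, f in [(8, 2), (3, 7), (9, 1)]:  # three days of new observations
    alpha, beta = update(alpha, beta, s, f)
    estimates.append(alpha / (alpha + beta))  # today's "known" value
```

The estimate drifts from 0.75 to about 0.55 to about 0.66: the "fact" is never done, it just reflects the data seen so far, which is the instability the talk is describing.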
When we look at tech, which, again, is not the perfect metaphor (there are a lot of jerks in tech), we talk about multi-sided platforms, where the network effects make it better both for the buyer and for the seller. eBay is a great example of this. Uber, Airbnb: all these things get better for each side as the platform gets larger. And, you know, I've deleted Uber for a variety of reasons this week, but the more drivers there are, the better it is for a rider, and the more riders there are, the better it is for a driver. The thing about all of these markets is that they're rental economies, not sharing economies. If they don't do any political empowerment, they're terrible from a labor perspective. They're sort of contractor and renter economies. And what's funny is that most of them are created in a way that tries to segregate the buyers and the sellers. Uber doesn't want you to be both a driver and a rider; the drivers aren't the riders. This is totally different from the way that we, say, text, in that we're both receiving and sending text messages. We are both sides of the market in something like a text platform. And my fear is that education pedagogy around the sciences is moving towards this model, where you're either a buyer or a seller of information, but almost never both, as opposed to something more like a texting model where you do both: you change on a daily basis depending on where you are and what you do. And that's a bummer, because these markets are better than the status quo, because the status quo is terrible. In textbooks, the status quo is to print it. And that's terrible. And so a buyer-seller, two-sided market in textbooks is better than the status quo. But it would be a real shame if our goal was to be better than terrible. That doesn't strike me as a high enough aim.
And so when you look at something like Connexions as an open multi-sided platform, or OER generally as an open multi-sided platform (I know that Connexions has become OpenStax and CNX; I'm using it just as an example here): these are examples of open multi-sided platforms where you're both a contributor and a consumer, where you're allowed to flexibly shift based on whether you're an educator on any given day. Wikipedia: you can be a reader and an editor, though I would argue far too few of us act as editors. And what we do at Sage is try to create these sorts of open multi-sided platforms where you can be both a data provider and a data analyzer.

And luckily there are some green shoots in the open movement we can point at that start to get at this. One is Wikipedia Nearby. The fundamental thing about these platforms is that they get bigger the more people use them, but in the open movement we don't always focus on attracting users. We focus on the number of assets licensed on a regular basis, not the number of people adding. And we make it hard for people to very seamlessly come in and add one image to the commons. Wikipedia Nearby is a great example. It says: hey, you're near the Washington Monument, and Wikipedia could use a great photograph of the Washington Monument. Would you like to take one? The new Knight Foundation-funded Creative Commons project called The List is another great example of this. Small, easy ways that you can become a contributor to the cultural commons increase the number of users of the commons. And they change people from only being on the buy side, as a Wikipedia viewer, to being a Wikipedia contributor or a Creative Commons contributor. There's nothing wrong with celebrating the number of assets licensed; I do it all the time.
But when you think about the strategies that we use to build the commons, or the open educational movement, or the open access movement, or the open science movement, we have to attract users who get value from being on both sides of the deal. Because whether it's open or closed, these multi-sided platforms get more valuable the more people participate in them. And so a huge part of this has to do with engaging people to think of themselves as part of the market. We don't think about it that way. My point is that I think we should, because I think it changes our strategies, and I think that those Wikipedia and Creative Commons apps are really good examples of lightweight strategies for getting people into our market. And it's a better market. It's more fair, it's more moral, if you're on both sides.

I've spent years asking whether or not a given asset is open. I can tell you there are people in the world who disagree radically with me on my answers to this question sometimes. But I think it's the wrong question strategically. And that's why things like the Budapest Open Access Initiative declaration are so important: they let us draw a bright line around any given asset. But it changes the strategy for how we bring someone in. The difference is, we say that asset might be under a Creative Commons Attribution-NonCommercial-NoDerivs license, but the fact that you as a user wanted to join us is something to celebrate. And that's a very different strategy. And unfortunately, some of the people on the open access side, when someone licenses an asset that way, don't necessarily agree with me that we should welcome them into the movement. There's a sense that we should change the definitions instead. The definitions are vital for saying whether or not an asset is open. But the fact that someone has joined the movement is something to celebrate, because that grows the market, that grows the platform.
But I really think that if we want to get out of the box we've put ourselves in, we have to connect to people who don't agree with us philosophically. Selling the open philosophy is really important. It's something I believe in. It's something I live. But if we're going to scale this, we can look at open source software as an example. It didn't scale because of the philosophy of freedom or openness. It scaled because, methodologically and economically, it was more valuable than closed content. The Apache web server was an economic weapon for IBM to leverage against Microsoft. It wasn't a philosophy thing. But then it didn't matter from a philosophy perspective. This is sort of Yochai Benkler's great insight: it doesn't matter why you're open. It's actually better if you have a diversity of incentives instead of just the philosophy. That's what lets you scale. A diversity of incentives is what makes you robust against attack.

And so, to scale this outside of the open box, our argument at Sage has been that we have to make the argument that it's more valuable to be open than it is to be closed. The thing that you make available is more valuable if it's open than if it's closed. A good way to teach people this rather easily is to say: think about searching Google, and then think about searching Google Scholar. Think about the value you get from searching Google as opposed to the value you get from searching Google Scholar. So rather than "is it open or not," the question that we try to ask is: does it create more value in the search version? Because the argument that engages a person in using that content is so much easier. The argument engaging someone as a donor of the content is so much easier, because you say: by being open, your data is more valuable than if you keep it closed. And you show them, by making sure there's a network of users. I work a lot with people who have rare diseases, and this is the argument.
They don't want to be open because of philosophy. They're afraid they're going to die and no one's ever going to look at their data. And openness is a methodology that gets their data in front of people. That's a victory that's so much easier to win than trying to talk about the philosophy of remix or the philosophy of openness. And those are all things I care enormously about and will spend hours talking about. But most of my time I'm not talking to people who agree with me. And this is the big lesson I've learned. And it requires practice change. This is the difference in selling a philosophy versus selling a value. You're trying to say to someone: if you want this value, you've got to change the way you live, you've got to change the way you work, or you won't get it. Open is actually less relevant than the practice change.

So this is just a picture of our website and my boss, Stephen Friend. Stephen had a road-to-Damascus moment six years ago, when he was the vice president of global cancer research at Merck, saying: if we're going to actually figure out why genetic variations affect individual health outcomes, we need so many genomes available and open that it's never going to happen inside a pharmaceutical company. He actually tried to pitch it at Merck. If you go to our website, we archived the presentation he delivered at Merck six years ago arguing for a genetic commons. It's an internal presentation; we got permission to open it up. And they basically said, well, how much is it going to cost? And he said, $500 million. This being Merck, that didn't bother them. They said, how many years of competitive advantage do we get? He said, well, three, probably, because if it works anyone can copy the method. They said, that's fine, you can go start a non-profit. And they gave us all the assets, and a bunch of the staff came out with him. But the core idea at Sage is this concept that the data is increasing in volume, decreasing in price, increasing in velocity.
But if you want to go from that to wisdom: wisdom would be, you should take this drug, and you'll feel better, and you won't die. That would be wisdom. That's something we can do with knowledge, but we're not going to get there if we just apply Google's analytics to the data. If you run Google's analytics on a clinical data system, it comes back with things like: diabetes is connected to glucose. Well, we know that. That is a relatively stable piece of knowledge. So you need theory and you need experience to analyze the data. Some of that theory is biology theory, but some of that experience is advertising-analytics experience. So the question was: can we build an open multi-sided network that connects people who are good at the experience of doing data analytics, people who are good at theory, and people who have their own data or are willing to generate their own data? But it takes a total practice change to start collaborating this way.

So the first practice change that we had to convince people to make is that your lab is not the natural unit of science. That sounds like a very obvious statement, but the reality is that we fund as if labs are the natural unit of science. We measure as if labs are the natural unit of science. That's where the paper comes from. A really collaborative paper will have two or three labs on it, especially in the life sciences. Only in the context of something like the Human Genome Project do you get a truly multi-lab paper. But over the last five years there's been a change in federal funding practice towards collaborations and consortia. There's even been talk of "consortium fatigue" setting in at the NIH, because this has been so trendy. We thought to ourselves: this is an opportunity. These consortia are required by the terms of their funding to work with each other. Not to be open, but to work with each other, to be open among themselves. We figured they probably don't do it very well. Because they don't know how to.
They haven't been trained. So we identified a couple. This is The Cancer Genome Atlas, which is what TCGA stands for. It's a $100 million project of the National Cancer Institute and the National Institutes of Health. The idea is to create an atlas of the genomes of the most common kinds of cancer tumors. You're looking at almost 250 scientists from 28 institutions who are required by the terms of their funding to work together. We said: how do they share their data? We looked at it. They've got a data portal. What does it actually look like? It's an FTP archive. If you don't like that, they also have an HTTP version. If I wanted access to this and I wasn't part of the TCGA, I would petition. They were pretty good about giving access; they're not out there trying to keep people out of it. It just didn't occur to them that there was a totally different practice that they could take on. And they were having all sorts of problems, with things like: the file I'm working on is a different version than the one that David's been working on. You had massive statistical and semantic clashes in the analysis that blocked the consortium from analyzing the data collaboratively.

The problem isn't that they're open or closed. It's that people aren't writing things down. There's all sorts of tacit stuff that happens in data analysis that the pedagogy doesn't teach you it's important to write down. If you change the biological protocol, you need to write that down. There's a major science practice problem here: if you're going to normalize and deduplicate a data set, you need to write that down, because otherwise the people at the other labs don't understand why the data set is different than it was yesterday, or how the data set is different. It's like sports: you want to be able to rewind the tape, analyze what happened, break it down. We were lucky. Stephen, my boss, knows all these people; our staff know these people. They're our colleagues and they're our friends.
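The write-it-down practice being described is essentially a provenance graph over data files: every derived file records what it came from, who produced it, when, and by what method. Here is a minimal sketch of that idea; the structure and field names are hypothetical, not Sage's actual Synapse data model.

```python
# Hypothetical provenance-graph sketch: each node records the operation
# that produced it, so any result can be rewound back to raw inputs.

from dataclasses import dataclass, field

@dataclass
class DataNode:
    name: str
    operation: str = "raw"            # "raw", "merge", "normalize", ...
    performed_by: str = ""
    date: str = ""
    method_id: str = ""               # e.g. a DOI for the method used
    parents: list = field(default_factory=list)

def lineage(node):
    """Rewind the tape: every step that led to this file, oldest first."""
    steps = []
    for parent in node.parents:
        steps.extend(lineage(parent))
    steps.append(f"{node.operation}: {node.name}")
    return steps

raw_a = DataNode("tumor_batch_1.tsv")
raw_b = DataNode("tumor_batch_2.tsv")
merged = DataNode("merged.tsv", "merge", "jdoe", "2013-11-01",
                  "doi:10.0000/example.merge", [raw_a, raw_b])
normalized = DataNode("normalized.tsv", "normalize", "asmith",
                      "2013-11-02", "doi:10.0000/example.norm", [merged])
```

Walking `lineage(normalized)` reproduces the "who touched it, and when, and how" trail the talk describes, which is exactly what a prediction's consumer needs in order to trust it.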
So we agreed that we would spend six months doing bi-weekly phone calls with more than 200 people, getting them used to the practice change. The guy, Larson, who did this is like a saint, and it's a shame that we can't recognize him as a hero of open science more easily, but the person who sits and chats on these phone calls is the person who makes this possible. And this is the sort of stuff that comes out of it. This is an example of a data set inside the consortium. As you see, it's been re-annotated, it's been deduplicated. You probably can't see it very easily, but there are actually four different versions of the data set, with complete file history and annotations. This is sort of like version control in GitHub, but the difference is that you're not editing the file. You're not editing the data the way you edit software. The data is the data; the genomes don't get edited. That's what came off the machine. What changes is the way that we process and interpret them. So we have to capture a graph, which is what you see over on the right, of all the different things that have operated on it. At the top you have older, at the bottom you have newer, and you can see that we took two files and merged them. You can see the names of the people who merged them and the dates they merged them. Everything gets a DOI, including the methods, because the methods are what give us the probability of some prediction about science. And so what you want is this: if you have a prediction that says John is 85% likely to respond positively to Tylenol PM, I want to be able to rewind the videotape and figure out why you think that, and I want to see at every stage who touched it and where, because each of those things affects the probability's accuracy and the confidence I can have in it. Because the flip side of probability is confidence. You can say it's 86% probable that John will die if he takes Tylenol, and I would say: I don't have a lot of confidence in that. I want to be able to go back
and prove why. So the practice change required to do this is enormous. If you're used to just analyzing your data in the wild west and not tracking any of this, this is miserable. People encounter our software sometimes for the first time and they're like, this sucks. And we're like, well, we can have a debate about whether our software sucks or not; we're a non-profit, and it's not exactly our specialty to make it beautiful. But when we dig in, it's almost always the practice change that sucks, not the software. And distinguishing those two things is really important. When you focus on value, that's much easier than when you're saying my software is good because it's open, because the value comes from changing your practice.

And what's nice also is that you start to get proof points. When we brought this practice to this sort of weird amalgamated mass of researchers and data that was really slow getting publications out, we're still trying to quantify exactly how big the jump was, but I can tell you: before they started working with us, they had fewer than five papers submitted. Within nine months of working with us, they had 18 papers accepted. And this was one group in the broader Cancer Genome Atlas. There are 14 or so individual working groups besides the Pan-Cancer consortium that we brought in, and every single working group has joined our process without any additional recruiting, because of the structure of the way that we framed this. Everyone else now says: I want the productivity boost that came from the open methods, which were required because of our collaboration system. They're not in because they believe in openness. But they're now in a framework where, every time they publish a paper, they punch a button and all that provenance comes alive and becomes public. The value brought them in.

Another example: so now we've said it's not just solo labs, it's communities. But those are communities that were required to work together. So can we bring together a community that wasn't required to work
together with the same argument? The idea was: let's find a place where there are competing papers claiming the same knowledge creation. This is from colon cancer. There are papers in four different major publications, in four different stacks of science-journal corporations or non-profits, each claiming to have come up with the genetic subtypes of colon cancer. Not surprisingly, each paper comes up with a different canonical subtyping for colon cancer. These would be the mutations that distinguish colon cancers. In theory, all four of them should arrive at the same result, but each of them uses different math, and each of them had a different population, so each of them had a radically different subtype. And as we kept digging, we found multiple more papers on top of these four most famous ones. So we said to them: this is how you built your analysis. Each of you has a different subtype; some of you have the green dominating, some of you have the blue dominating; these represent different kinds and sets of mutations. Wouldn't it be nice if each of you could run your math on the sum total of everyone's data? This is the value. It's not the idea that you should share your stuff for no reason. It's: if you share, you can run your math on all six. Aren't you sure yours is best? And after we got the first four, we had nine more groups approach us to join the community, because they didn't want to get left out. So each of the groups now gets to run their algorithms on all 13 data sets, and then we get to wrangle together a consensus subtype that gets published, one that has the highest probability of surviving over a longer period of time, because it has the largest data set and it has all the eyes from all 13 of the groups looking at it. Because the knowledge from any one of those papers was going to keep getting knocked down by the knowledge from each of the other papers, since the way that we publish doesn't ask, is it true or not; it just asks, is it true enough. So we're like, okay, this is cool. We
can go from solo labs to communities, whether they're required to or not. And we're going to be announcing like 10 more communities in the new year; I can tell you that this is starting to explode in a way that looks like a network effect. You know, the hard part is making sure that everyone who joins learns how to do the practice change at a deep enough level to take advantage of it. I'm not sure we know how to scale that yet; that's a pedagogy problem that we might want to talk about, right? But we don't just want to appeal to the scientific elite. The groups I've been showing you are on the theory side of the data-knowledge continuum, not the experience side. So we've done challenges, or competitions, that get people who are outside of the system participating. There was already a group called the DREAM project, which was a joint project of IBM and Columbia that built a really good community of biological data analysts. They spent eight years running challenges in sort of reverse engineering analytic methods in the life sciences. And we said, you know, let's work with them, let's get a training set on breast cancer. So we got some data that was already out there from Oslo and from Oxford, we got the Avon Foundation to generate the exact same data types on 500 women who had never been profiled openly, and we said: can you predict the odds that five years after successful treatment with chemotherapy, a woman relapses with breast cancer? Can we predict that probability? Google gave us free compute space, so no one had to sort of work on storage or pay for processing, because that's a real barrier to entry, especially in the developing world. And then we got Science Translational Medicine, not an open publisher, to agree to publish the winner, with the contest counting as peer review. So there's no cash, but the winner gets a guaranteed publication in a top-level journal as a prize. And we said, if you want to be on the leaderboard, you've got to share your code, because we didn't
want this to be a contest based on the skill level of the programmers. So what was interesting is, we required code sharing from the beginning, and no one shared their code. And then we said, no, we're serious, and no one shared their code. And then we said, okay, the first one who gets to the top of the leaderboard using someone else's code gets 500 bucks, and the first one whose code is used by someone to get to the top of the leaderboard, they also get 500 bucks, and also, by the way, we're not going to let you win if you don't share your code. And within nine days of putting this in, the average accuracy of the models was improved a thousand percent. It's incredible, the value you get. And the other benefit of this is, you know, with the winning model we got the cover of Science Translational Medicine. We didn't just get an article; we got a cover, an editorial, and a methods article as well as the actual science article. And this is much more accurate than the previous models were. But what's interesting from the pedagogy perspective is, because of the code sharing requirement to make them all better, there's an entire suite of algorithmic approaches. Some of those models are going to get better as more data comes online; some of those models are going to get worse as more data comes online. And the entire suite of tools is now available to everyone to take on, to attack the way that we combine genetic information and health data to predict the odds of cancer relapse. And of course the winner was not a biologist. The winner was the lab that invented the mp3 codec, at Columbia, and no one wanted to hear their theory. They had this theory that there were metagenes that went across all cancers. It was a very philosophical argument, a very semantic, ontological argument, and no one believed them; none of the scientists wanted to hear it. So this is an example of practice change. We both have to get the people who are the experts, who are the academy, to work together, and we have to design new practices that
bring in people who aren't part of the academy. And that's tough, that's tough to scale. We spent probably four and a half years getting no traction. When you're below the line on the network effect, it's awful. You're convinced that you're doing the wrong thing; you have to be somewhat insane to keep going. But the evidence starts to mount, and so we've had 25% user growth, data analyst growth, for eight straight quarters. If we were a company, we'd be raising a gigantic round of venture capital right now. But the idea is that having an open platform is really important, because if we do this well enough, companies will come in and do it too. And when you have an open player in the market, it changes the entire market. Browsers are different because Firefox exists. Encyclopedias are different because Wikipedia exists. The markets are more moral, and they're just less assholey, pardon my French. So these practice changes are really important when you're talking about the scientists themselves, but it's not just the practitioners who have to change. I would argue that we, those of us who advocate for open, have to change our own practice as well. We've focused a lot on the question of how do we govern open: definitions, manifestos, declarations. I've written a few. But when you talk about platforms, the business school texts say there are three other things you have to think about for platforms. How many sides? Are you a buyer or a seller, or are you both at all times? And the price, and price is non-trivial even in things that are free. Everything I'm talking about here, we provide as free services. The code's in GitHub, but we run them as free-beer services on Amazon, because knowing how to configure an open cloud service is very different than giving someone an installable that runs on their computer. So you can stand up everything I've shown you, you can stand up our Synapse platform; everything we do is open sourced, it's in GitHub. But the skill in knowing how to actually run those is not very widely
distributed. So we do have to think about price. In an open platform you want the price to always be commodity based or cost-recovery based. But I think the big question is platform design, because we haven't really made an attempt to design the way that software gets designed. There are entire classes of people who do nothing but focus on how much white space is on your phone, what your font is, how to use design to engage you and draw you in so you spend hours on the platform that you're on, whether it's Twitter or Facebook or whatever. And we don't really focus on design as a community, and I think, if we're going to be a platform, we do have to think about design. And the reason that I believe that is, you look at something like the iPhone. The Apple designers didn't sit down and say, let's design a closed ecosystem. Their preliminary bias was to be closed, because that's most people's preliminary bias, but their idea was, let's build the best damn phone we can for the user, let's build a phone that gives the user the most value. And they had money, they had all these other things, but that focus on value to the user, I think, is actually the differentiator. And the good news is that open platforms are so much more ethical and engaging, because you're not just a buyer or a seller; you're a participant, you're a citizen, you're a member. So if we actually start to embrace this idea, not "let's build an open ecosystem just for the sake of it" but "let's make it better for the user," we have an innate design advantage, because we're not trying to make people fail to read the terms; we want people to read the terms. We're not trying to lock people into the easiest thing; we're trying to lock them into something that lifts them up. And that's what design does. Design says the user is the most important thing, not the asset. And the asset is important, the licenses are vital, I believe that fundamentally. I'm considered irrational by a lot of people in my insistence on CC BY and CC0, but the reason I believe in
those things is because I think they prioritize the user. When I'm giving content, BY and zero give me more value; they create more potential value. And if we design right, we can embed open into systems that have never even thought about it, because we've only really touched the surface of what can be open. We've only really touched the surface of what we can crack open. Copyright is in many ways easier, because at least there's a horrible, internationally standardized regime that makes it possible to have a standardized set of non-horrible tools. But I've been spending the last couple of years working on informed consent, which is not a system that's ever been open. This is, as far as I can tell, the earliest surviving informed consent document. It's from a series of yellow fever experiments in Cuba, a little over 100 years ago, where we literally asked people to risk their lives, to get bitten by mosquitoes, to see if mosquitoes transmitted the disease instead of what were called fomites. It was a thing that everybody knew: everybody knew that diseases like yellow fever were caused by bad air coming off of people's clothing, and everybody knew it wasn't mosquitoes. That was a stable knowledge continuum for about 100 years, until someone said, you know what, let's put people into a room with mosquitoes, and put people into a room with dirty clothes, and see who gets sick, right? And so they consented people for this, which was miraculous at the time. They actually asked you to sign a document saying that you could die; that was the risk, the benefit was to science, and also they would get paid. And the concept of informed consent is that you, as an individual, should get to judge the benefit-risk ratio and make an unbiased decision as to whether or not you want to join. Where we are now is, these forms are written by doctors, they're reviewed by lawyers, and then on top of that they're edited by committees, and you get 18-page documents full of liability text that are handed to patients when they're
enrolling into a study. And they say, if you don't sign this as it is, you can't get in. It's a complete lack of agency, and for the most part you're panicked. You've got a disease, you have a chance to take a drug that might save your life: you sign the form. You have a rare disease, you want to get into a long-term study: you sign the form. And these forms default into the concept of no public access, in order to protect your privacy, which is often to protect the liability of the institution or the competitive advantages of the person gathering the data. The data are never shared. This is kind of tragic. We already have incredibly bad participation in clinical studies in this country: less than 5% of people with cancer ever get into a study, and that's just an easy one. So we have a membership problem, and we have a scale problem. The largest longitudinal study of Parkinson's disease in the world is under 2,000 people; for comparison, the emotional contagion study had 689,000 people, and when Google A/B tests, they test on Utah or Nevada. But our clinical studies consider it okay to claim knowledge with less than 2,000 people, while making sure that those data sets can never be recombined. And so the value to the participant in the clinical study is actually really low. The paper isn't open access, the data doesn't get reused or recombined, there's no recognition, and they don't have any agency or participation. So I wanted to break this open. I said, this is a great place to do open, it makes sense. But I was lucky: I was working at a non-profit that was run by a guy who was an interaction designer when I was getting this idea. And so I spent a year and a half doing what's called persona creation and interaction work. This is the sort of thing you do with the top guy at Cooper Design, who had done it for a lot of large multinational companies. And what you do is, you say: who's going to use it?
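To make the persona idea concrete, here is a minimal sketch of a persona as a design artifact. The names and goals come from the examples in this talk; the code structure itself is purely illustrative, not Sage's actual tooling:

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    """A named, fictional-but-concrete user that designers reason about."""
    name: str
    backstory: str
    goals: list = field(default_factory=list)

# Two personas who would have to sign the same consent form together.
participant = Persona(
    name="Tim Boylan",
    backstory="Dying patient; converted to data sharing at the end of his life",
    goals=["have his data used after he dies"],
)
clinician = Persona(
    name="attending physician",
    backstory="Explains study documents to patients every day",
    goals=["be an authoritative figure in front of her patients",
           "increase efficiency"],
)

# The design question: one open document, two disjoint sets of goals.
shared_goals = set(participant.goals) & set(clinician.goals)
print(shared_goals)  # empty: the consent design has to serve both anyway
```

The point of the exercise, as described in the talk, is empathy: once the users have names and goals, you stop designing "an open consent form" in the abstract and start designing for the people who will actually sign it.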
And you give them names, you create photographs, you give them backstories, and you imagine what it's like for them. It's an empathy-based process more than anything, because once you create the names you're not thinking abstractly, "we need to create an open consent form"; you say, for Tim Boylan, he's dead, and he had this conversion at the end of his life, so his data needs to be usable after he dies. Or we have a doctor on the right, and you say her goal is to be an authoritative figure in front of her patients and to increase efficiency. They have different goals, but these are the people who would have signed a consent form together to share that data. And so the concept is: what's a product design that's open, that creates sharing, that changes the way these two people interact? And that's not a contract; it's a new legal document that would not be considered sufficient anywhere except in the legal world. So what we came up with was a visual design language. We said, what are the icons that would represent what a clinical study looks like? So if you have cancer, one of the big questions we ask is, are you depressed, and how are you sleeping? So we came up with a visual, iconographic scale for those things, because these are key elements of informed consent; they would be in that 18-page document. Every document tells you the study tasks, every document tells you the risks and the benefits, every one tells you what kind of data you're going to collect, and every one tells you how it's going to be shared, which is usually "we won't." And we said, you know what, we're going to actually create a couple of studies. So we're running a study in Parkinson's disease that starts in the first quarter, and we're running a study in post-chemotherapy cognitive impact that also starts in the first quarter. And so rather than create things that are open as a standard for the users, we said we're going to only build the things that we need to use in the studies, and we're going to back out all the assets into an open source toolkit. So
everything we do starts with a user. And then the beauty of being open source is that it's not just making things, it's curating things that are already open. So we went to the Noun Project, we went to Open Clip Art, Creative Commons BY and public domain tools that are available, and we said, there are literally thousands of icons that are representative of the kinds of actions that take place in a clinical study, but no one's ever collected them in one place. So that's another part of what we've done: we've gone out to the commons and pulled in as many of these things as we can. The nouns and the verbs are good, but when you spend time working with design, as they say, you also need sentences. There are some concepts that are more complex than you can convey with just an icon. An icon's good for DNA or a medical record, but this concept that we're going to separate your identity from your data, that's an animation; we're going to distribute your data for reuse, that's an animation. So we're providing those as well, and again, all of this is CC BY licensed or public domain. Storyboard layouts: not everyone who wants to stand up one of these longitudinal studies knows how to lay out for a mobile device, so everything we've done with every designer is going to be made available. It's not just the end product, it's all the methods. This is a practice change for me as an open, free culture person: to say literally every method has to be open. It can't just be the end product, because if we just hand it to you, how do you know how to make your own? We put these together into stories that create informed consent. So if you think of the human-readable layer of a Creative Commons license, this is the human-readable layer of our informed consent agreement. It says we're going to ask you to tap on the phone. This is actually for our Parkinson's study, and I'll explain it real quick. The insight was: today you go to the doctor for Parkinson's, to get into that 2,000-person cohort, and the doctor has a set of
scales that they rate you on. How is your walking, on a scale of one to five? How strong are you from a tremor perspective, on a scale of one to five? How stable is your voice? These are very subjective, opinion-based ratings the doctor comes up with, and they happen four times a year, and of course you do your best when you're in front of the doctor. And what we realized was, this needs to be done on a phone, in a data-driven way. So we're going to ask you to take a survey that says, how is your walking today on a scale of one to five? But as soon as you rate it, we're going to send you a notification: okay, holding your phone in your right hand, take ten steps forward and ten steps back. And during that phase we record all of the sensors on your phone, your gyroscopes, your GPS, your accelerometer, so we can get a data-driven measure of how you walk. Same when you say how's your hand tremor: as soon as you answer that, hold the phone stable for ten seconds, and we record the tremor of your hand. We ask you to say "ah" into the microphone for ten seconds, and we get your voice tremor. We get this beautiful long-term analysis of the data, and the cost changes radically. So we've gotten approval from the ethics board for doing all 100,000 people in the study, so even if we only get half, it's a 25x increase in the data. And not only that, every participant gets their data back, and the data that's available is openly reusable. And so all this architecture I'm showing you creates that same set of freedoms: the data goes back to the participant, and the data goes into the commons as you generate it. Now, if you want to take that out, if you want to take out the patient empowerment and sharing, you can, but we all know the importance of default settings. So we're also providing reference implementations for clinical documents. We spent hours, weeks, writing protocols, doing the paperwork, laying out the forms that you need to apply to an IRB, which is the group that decides whether or not what you're doing is ethical, all
the way down to the web templates and assets. So literally everything it takes to stand up a clinical study on a phone is going to be made available as an open source tool, with a bias towards patient data return and the commons. You can take that out, but you're going to have to do the work to do it. If you want to use it the easiest possible way, the way that it's designed cohesively, you're going to wind up being open whether you want to or not, because the methods are better for you. And then the other piece is, there's another reason to make this available as a fully open tool, to not put in a no-derivatives or noncommercial license, and that's because we may be running these two studies on phones and tablets, but there are a lot of other contexts where someone does informed consent. In the U.S. you get informed consent when you go in for treatment or for payment, and there's no reason that you couldn't create a printed, human-readable overlay to those things to increase engaged informedness. There's no reason you couldn't build this for a doctor to use in the room with a patient. But we can't do all that; we can create more value for everyone if we let everyone else do that. I can also tell you, as a product creator, I've spent a lot of years as a preacher and an advocate, and now that I'm actually making a product, it's really enabling to go, that's a fantastic idea, you should do that. It's so much better than having to do it yourself, it's just wonderful. So this is an example of how an open method creates more value. The data is reusable, the data goes back to the participant, and if you're a clinician, you go from, oh my god, how am I going to explain this freaking terrible document to people, to thinking about how do I build a process that makes people my partners. I've been stunned at the lack of negative response to this, and it's because the field is so weak that almost any design-based approach would have made it better. By making the first design-based approach be an open design-based approach, we can
sneak open in, in a way that scales, and we create more value. And it's not just economic value. I would argue that putting patients in the center of their own study has an economic value, but it also has a social value, a moral value, a scientific value, and hopefully an educational value, because the people who participate in this Parkinson's study are going to learn a lot about their Parkinson's by being in the study. And it's not going to be a passive thing where they get scared because they're going to Dr. Google; they're part of the educational process, they're part of the knowledge creation process, in a way that they couldn't be if we did it any other way. And so, I'm almost done. This is a graph of what a Bayesian tree looks like, and what we do when we do probability-based learning is we add more data to refine the model. So what we're going to do in Parkinson's is refine the model that says why it is that some people get sick faster than other people. But what we know about that is a lot less stable than it used to be, and it's going to keep getting less and less stable as we go. And if we're locked into a way of teaching scientists that just takes books and puts them onto computers, the pedagogy is not going to keep up with that, because the book is basically a container for knowledge that we think is stable. If this is what we have to teach with, there's no way that the pedagogy for science is going to keep up with the methods that I just showed you. Because whether those methods are open or closed, those are the methods. We have a choice about the platforms and sort of the morals and the ecosystems, but the methods are the methods: until someone invents better ways of analyzing large-scale data, the methods are going to be probability, inference, and prediction. And if you can't rapidly feed the changes into your educational system, we are all basically screwed. And so, the right to reuse. We talk about the right to reuse in the OER definitions, in the open access
definitions, the open knowledge definitions. But if you don't have the right to reuse, you don't have the right to be current. You don't have the right to keep up with the probability-based models as they change; you don't have the right to get better. You are stuck with knowledge that practitioners in the field consider outdated. A great example of this is in radiology. A very underpowered study said that really high doses of radiation on lung tumors were better than low doses; this was right at the absolute tolerance limit of what human bodies can take in terms of radiation. The study was done on less than 50 people, it was published in a very high-impact journal, and it's become the standard of care in a lot of places, although it has never been replicated, and it's been demonstrated to kill people faster than the old way. Because the knowledge was propagated through the traditional channels, and it can't be updated effectively, it's literally blocking human beings from being healthier. So the right to reuse is a really important right, because it lets you get better. And really, in the end, it's the right to create new value, because it's really important to create new value. It's not just about putting something out and being satisfied that it's done; it's saying that the ability to create new things that are better, that are more current, that's what value is. And we can't let the conversation about value be dominated by economics. Economics is important, jobs are important, but the moral impact of being a participant in the systems that surround you cannot be left behind. It's really important. And when you engage people in a system where they have the ability to create value and receive value, and that's not just economic value, that's knowledge value, that's social value, that's moral value, they want to be a part of it. And I think that's what we need to argue for in open science and open access, and I hope that you in open education work with us, because I'm terrified that we're going to wind up with lock-in on
systems that make it impossible to take advantage of these tools. Thank you.

It doesn't count when she stands up, because we're married. So we have time for two questions, three questions. Anyone? Come on, I saw you nodding; some of you have questions. All right, Cable.

So, John, you talked about what we need to do in terms of open pedagogy, rethinking learning. How do we, in a world of academic freedom, where every professor has the right to do exactly what she or he wants, and we never can or should infringe on that, how do you encourage cooperation, sharing not only of content but of practices, new ways of teaching? How do we share in real time, both the data and the rest? How do you switch that culture that we have in higher education, or education generally?

I mean, I don't have a great answer. There's got to be a pedagogy of collaboration; there's got to be a method for collaborating that is abstract enough to be taught. I'm just not sure it's been created. But this is the sort of thing where my bias is, if we could figure out a pedagogy, a curriculum for collaborating, then open educational resources are the best way to distribute that across the globe as quickly as possible. We teach scientists now, and I have my biases towards science, that's where I have my experience, we teach scientists data analysis. Scientists have to go through a six-hour ethics course if they're going to work on human subjects. We have all of these things that are a part of the core first-year graduate curriculum for being a scientist. It's ridiculous that they don't have a week where they learn how to collaborate effectively using the internet. And it strikes me that that's the easiest thing to do, and that's the first thing that I would want to do. It would be so much easier for us at Sage if people knew that they needed to collaborate over the internet; then we wouldn't have to teach them why to use our platform, or even some of the basic practices, like take notes. It's incredible how we don't teach people to be collaborative. That's the
first thing. In terms of the professors, realistically, we can hack the incentive system a little bit, which is that the people who are collaborating this way are generating far more papers than the people who aren't collaborating this way. That's been the best way we've gotten people in: not by convincing them, but by having them be totally afraid that they're going to miss out. So those are my two things: cultivate fear of missing out whenever possible. But I would love it if anyone wants to work with us at Sage on doing a regression analysis, or taking back bearings, on how people are collaborating using our system, so that we can quantify how they do it and how useful it is. I would love to convert that into a curriculum that we could make available as an OER, because I know a lot of PhD institutions that are desperate to teach their scientists not to be left behind, but I don't know anyone who's devoting the resources to figuring out how to teach them that.

So, just to repeat the question: when she talks to social scientists, they're satisfied that because they're dealing with human subjects, they don't have to share. And that is very true; that's the standard of practice right now in anything that touches humans, anywhere except large companies, because of course if you're a company, you're not restricted. If you want to publish the results of your human subjects research, you need consent; if you don't want to publish, you don't. We have really weird law right now. So part of the rationale for a toolkit for informed consent was that consent is not restricted to the biological sciences or to the health and medical sciences. And my experience with the IRBs, these are the ethics boards that review research protocols, is that they are not necessarily anti-sharing. It's more that they've never seen how you could share the data in a way that was both driving science and had some privacy protections left in it. And this is a place where, for example, adherence to open definitions
doesn't help, right? The vast majority of the data we're collecting at Sage doesn't have any IP constraints on it; it has no copyright or data protection constraints. But we do attach data use agreements, which are essentially Hippocratic oaths: I agree, although I might have the power to re-identify, not to use it; I agree not to harm people; I agree not to share the data with people who haven't also signed the contract. And then the data lives in sort of a mildly secure repository where you can get at it, and inside there you have a lot of freedom. And that's, I think, the first step towards getting where we really need to get, which is to have the data be liquid if people have consented to it. The practice change in the social sciences, I think, is going to come from the researchers who get more value out of sharing their data. We were lucky in the life sciences that the federal mandates are requiring collaboration, so we were able to get at those people who had to collaborate and help them collaborate effectively, and then the other scientists started fearing that they weren't getting the same benefit, so they joined in. Similarly, with the consent work, we found people who already wanted this. We found patient groups who desperately wanted to have large-scale data that could be endlessly analyzed, and we're working with them to go forward. We found a few clinicians who were really enlightened and wanted to go there. And so what I would advise is: if you can find a few people in a given social science discipline who want this, making it unbelievably easy for them to do it, and demonstrating the quantitative benefit they got as researchers, is how you'll create that sort of competitive desire to catch up. A motivated faculty member with a really good plan can get things through IRBs, but if that faculty member isn't motivated and has to write their entire plan themselves, they're just going to keep doing it the way they always have.

Okay, we're done. Thank you very much.