My name is Liz Langdon-Gray. I'm the executive director of the Harvard Data Science Initiative. Before we start, a few housekeeping reminders. The talk today is being both webcast live and recorded for posterity, and it will be available after the talk on the Berkman Klein Center website and on the Harvard Data Science Initiative website. If you are so inclined, you can tweet at us at @BKCHarvard and @Harvard_Data.

It's a pleasure for me to welcome you to today's talk and a distinct privilege for me to introduce our speaker, Christine Borgman. Professor Borgman is Distinguished Professor and Presidential Chair in Information Studies at the University of California, Los Angeles. She is a fellow of both the American Association for the Advancement of Science and the Association for Computing Machinery. And she is the Harvard Data Science Initiative's very first visiting faculty fellow; we are thrilled to have her here for the month of October. Professor Borgman is also the author of three books published by the MIT Press. Her most recent book, Big Data, Little Data, No Data: Scholarship in the Networked World, is available here today; we have a representative from the Harvard Coop, and the book is $25. She has graciously agreed to sign copies after the talk.

I can think of no better speaker to deliver the first joint talk hosted by the Berkman Klein Center and the Harvard Data Science Initiative. Researchers at Harvard and across academia are embracing the opportunities afforded by the convergence of new technologies and the availability of vast amounts of data. These data may come from sources as diverse as electronic medical records, financial transactions, and billions upon billions of social media posts every day. They may also come from universities themselves: from the research data produced in our labs and libraries, and from the administrative data that accumulates from the day-to-day running of a university. Researchers are developing the theory, algorithms, and systems that will allow us to draw meaning from these data, and they are using data to shed new light on old questions and to tackle new questions that demand our attention. Importantly, these efforts are underpinned by a commitment to pioneering the ethical use of data, from devising new ways to promote algorithmic fairness to maintaining individual privacy. This commitment is reflected in the missions of both the Harvard Data Science Initiative and the Berkman Klein Center. As organizations, we are partnering to advance the analysis and interpretation of data and the deployment of that analysis for the social good.

And I'm going to read the next part because it's got lots of names in it. Professor Borgman's talk today is a timely exploration of the challenges that we are facing. The talk is based on a new article in the Berkeley Technology Law Journal and draws from work presented at the 10th annual privacy lecture hosted by the Berkeley Center for Law and Technology in November of 2017. Respondents at that lecture were Professor Erwin Chemerinsky, dean of the UC Berkeley School of Law, and Professor Katie Shilton of the University of Maryland. Following Professor Borgman's talk, there will be ample opportunity for questions; I'll ask that you hold your questions until then. And with that, I'd like to yield the podium to Professor Borgman. Thank you.

Thank you, Liz. It's wonderful to be here. And are we on, Ellen? No? We were on a second ago. Are we on now? How's that?
Good, all right. Anyway, it's great to be working with both the Berkman Klein Center and the Data Science Initiative. And as you will note, my first affiliation here is the Center for Astrophysics as well, so I span many fields. I've been working on building information systems and studying information systems for most of my career, and the privacy interests have been a sideline for much of that career. In the last five years or so, they've really converged. They converged in that Berkeley lecture last year, in a course on privacy and information technology that I taught at UCLA, and now in this paper, which came out last week, even though the digital object identifier does not yet resolve; it is up on the website. So that's our first little tech joy.

I'm going to take a very high-level, broad sweep over a number of issues and assume that this audience is pretty familiar with fundamental concepts of privacy, information technology, and policy. We can come back to some of them if I'm a little too high-level as we go, because this world is changing pretty quickly. Universities have moved into a really data-rich world. Universities are collecting data at far faster rates than most faculty or most administrators have any idea. These are very rich data; they're assets; they're valuable. But they are valuable not only to the universities themselves: many third parties see value in these data, and the brokering with third parties is part of what's complicating a lot of the issues that universities face. So we collect them for research and teaching and for services, but at the same time we need to be very aware of the public trust that is a responsibility of universities. I'm going to talk about the privacy, the academic freedom, and some of the stewardship and governance issues that are involved.

So what does this privacy frontier look like? The move toward open data is much of what has launched us into the opportunities and the challenges that we're facing today. Now, those of us who do empirical research live and breathe by this set of rules and policies; most people in law schools are not in the business of getting grants, but those of us who need grants to collect data live by these policies. How many of you are familiar with these open data policies? Okay, so a few and a half, which sort of tells me what the range of the audience is. These policies have been evolving for 15 to 20 years. The current state of things in most countries is that when you apply for a grant from a public funding agency, and increasingly from private funding agencies as well, you have to submit a data management plan that says how you're going to make your research data available, how you're going to manage them, and how you're going to maintain them for the long term. You're also taking on some responsibility at the end of the grant; and even at the end of every publication, you're expected to release the data sets that go with those publications. This is the current data practice in the sciences, the social sciences, and parts of the humanities as well, at least for anyone who runs on grant funding. In the book, which came out in 2015, I analyzed a number of these policy documents, asking what the drivers are behind these policies, and this is generally what I found.
Reproducing research comes up very high on the list, and that's a controversial topic, but the idea is that the gold standard of science is to make data transparent and available, to show your work in a way that other people can contest it and reuse it, see if you're right, and look for new findings. The second driver is one that you also hear quite often: if taxpayer money is going into the grant, the public should have access to the products of the grant. Thirdly, you want to leverage investments; and then, more generally, there's advancing research and innovation. The European policy really pushes on that last point quite a bit.

We've also got changes in the way people do their research and deposit it, and you can deposit in many places, starting with Dataverse, which Mercè runs right here. You can put it in the National Institutes of Health databases if it's biomedical; you can put it in Dryad if it's ecological. There are many domain-specific places to put data. If it's more generic data, if it doesn't have a home, and there's loads of orphan data out there, those data can go in places like Dataverse or university institutional repositories. You're expected to provide enough documentation, often including software, codebooks, and code, so that other people can retrieve the data, reuse them, and interpret them over the long term.

We have this set of practices, policies, and principles. Mercè is among the et al. there; she can talk about the FAIR principles for the rest of the day, or the year, probably, and about what's involved. But the idea is that you want to release your data in ways that make them findable, accessible, interoperable, and reusable. The rub is that very few, if any, of these policies actually define data. I spent an entire chapter of the book just on the question of what are data. You can get into epistemological and phenomenological discussions very, very quickly when you ask people what their data are and start to see what is actually subject to these regulations. Do you have to save the specimens? Do you have to save the spreadsheets? Do you only have to report what's in the journal article? How much of the software do you need to release? Do you need to release what's at the end of the pipeline, or the entire astronomy pipeline, and so on? There are many different levels of what can be data, and they vary. We see all over the different parts of our studies, because we're working with astronomy, seismology, ecology, oceanography, and so on, that one person's signal is somebody else's noise. So what are data is one of the most challenging parts of what these policies apply to.

Also changing is the nature of scholarly publication and the way in which we disseminate our knowledge. We have recently celebrated the 350th anniversary of the first English-language journal, the Philosophical Transactions of the Royal Society, which is still publishing after 350 years. These are what are known as formal publications in library talk. They last a long time; libraries keep them; they're cataloged and indexable. These are the ones that are findable and accessible; whether they're interoperable or reusable may be another matter. Now, librarians also think in terms of grey literature, and this is where the grey data in the title comes from. On the left is a whole taxonomy, or typology, of some of the different kinds of grey literature. These are documents of various forms.
They might be videos; they might be whatever; they may be the only documentation of the evidence on something. They don't show up in library catalogs; they might be in file folders; they might be indexed; they might be on microfilm. These are hard to capture, but people try to capture them, and notice that data sets are one of the things that have been considered grey literature for all these years. Now these are getting picked up in Figshare and SlideShare and Zenodo and YouTube and all kinds of other places. So this sort of grey data is a growing and messy category, much of which is not findable or accessible, much less interoperable or reusable.

That takes us to what I'm calling grey data, as an analogy to grey literature. This is the idea that we are just generating massive kinds of data, sometimes called the data exhaust that comes off these systems, and a great deal of it just falls between the cracks. It's not governed by any of the data protection policies that are the usual way you think about privacy. FERPA has a giant loophole, which is that universities can pretty much determine what legitimate educational use means. Similarly, if things don't get covered by the IRB, if they're not human subjects, then people go and say: is this human subjects? I'm putting cameras in the hallway of the engineering building because I'm testing some new vision algorithms; is that human subjects research? If the IRB says no, a lot of people say, fine, then I can do whatever I want with it. The University of California is very concerned about protecting these data, really leaning toward the privacy and governance side, but in a whole lot of universities this is the wild west: if it doesn't fall under specific rules, you can pretty much do whatever you want with it, and that's opening us up to all kinds of interesting challenges today.

The publications, the grey data, and the research data all then become different kinds of networks. They're much more useful when you can start to combine them, model them, merge them, aggregate them, integrate them. The value is not so much in the individual data point; it's in the power of bringing these things together. So this is also where we are with these new models.

Now, universities have many responsibilities in using these data and managing them, and I'm going to walk us through privacy, academic and intellectual freedom, and stewardship and governance, and then take us through some of the uses and misuses of these data that we're facing. This one comes from UCOP, the University of California Office of the President; several of you here have had some affiliation with UC. I was one of the three faculty on this four-year project, which brought together the provost, the chief legal counsel, and other people at the very top to deal with some of the governance and the institution's responsibility: how do you think about privacy and information security together at something as massive as the entire University of California, something at scale? This is where we ended up, and it's been very useful for laying out the governance models.
The law makes a distinction between informational privacy and autonomy privacy, and we made that distinction, not exactly along the legal lines, but roughly. What most people were thinking of at the surface was a narrow definition of privacy: keeping things like your student ID confidential, the things that would fall under HIPAA or PII. Autonomy privacy, which is very important to us in the university, is different: can you do your research without being observed? Is the classroom a safe space to have difficult conversations about hard topics that you don't really want broadcast? Can you set rules of governance that say this is a safe space that's not going to get tweeted, so we can have hard discussions and have them closed and facing inwards? That's where the autonomy comes in. And then information security needs to protect all of this. These distinctions are very important in thinking about data governance and what we need to accomplish here.

So informational privacy is the narrow one: we need to protect the obvious PII, personally identifiable information about people, and keep it confidential. But at the same time, we've got autonomy privacy: can we protect this space, and can we protect people's research in progress? One of the groups that we're working with at UCLA has been able to conduct their research and reanalyze their data for almost 25 years now. They were not required to release these data with every single paper. They have been continually reanalyzing, getting new data, bringing them together, completely legitimately, and they did not have to work completely in the open. It's very different to say "you can see everything I'm doing while I'm doing it," as opposed to "no data before its time": make it available at the end of the publication process. We would like to be able to do these things without surveillance, and that's where academic freedom comes into autonomy privacy.

Okay, so privacy is a messy concept. Many books have been written, books by Solove and Nissenbaum, and Chemerinsky has some very good articles, just trying to define what privacy means in the first place. So we won't try to pin that down too much. Similarly, academic freedom is a large and messy concept, but this is a nice formulation from Donald Kennedy, who was president of Stanford when I was a graduate student there: we need the freedom to do the kind of research that we feel is important for society, but we also have a responsibility to publish those results and make them available. Those are two sides of a coin, which gets us to the stewardship and governance issues.

So we want to protect privacy, both informational privacy and autonomy privacy. We want to be concerned about academic freedom, keep our infrastructure secure, manage our data in fair ways, and have governance principles and processes. Now, how many of these things are in the job description of your average data scientist or statistician? This is way broader than any one job description that exists. Most of these things are not readily covered in any job description, and they don't fall under any one dean or any one vice chancellor or vice president either. They're very scattered. The uses of data are highly decentralized, and the governance is highly decentralized, if people are even thinking about bringing them together.
So that's what gets us to the privacy frontier of these uses and misuses. I'll talk about public records requests and how they fit in, some of the cyber-risk and data breach issues universities are facing, and then come back to data management and infrastructure.

Reuse is what we have been studying in depth, in a number of different disciplines, in recent years. And to us, that's sort of the highest calling: can you produce data and make them truly interoperable and reusable in ways that yield new findings? That should be the long term, where we want to go. Reproducibility would be a subset of that, but it's a contentious one, and one person's reproducibility is somebody else's epistemological disagreement. People are coming from very different places, different models, different ways of thinking about a problem. This was a survey done by Nature: half the people surveyed said yes, there's a real reproducibility crisis; others said no. It depends on the field and on how you ask the question, but it's, again, a messy problem.

So how do we respond to this? How many of you have ever had somebody attempt to reproduce your research? The question then is how one feels about that. It's supposed to be the gold standard of science. This particular faculty member responded with a lawsuit; it's a very famous lawsuit from about a year ago. Here the reaction was: how dare you try to reproduce my work and then publish this critique. He sued for $10 million, which has really brought a lot of these things that tend to stay sub rosa within universities up into the public sphere and view. This is another one, the cover of the New York Times Magazine from last year. Now, this is partly a Harvard case; I know what was public, and many of you may know some of the backstory. But the attempts at reproducibility are also bringing out trolls, in not a good way, and women are more trolled than men. Some of what's been written about this says this is a trolling case, where men using the same methods as she was using did not receive the kind of scrutiny that she did. That's the public story; many of you will know a backstory. But the point is that these data and attempts at reproducibility can also be weaponized in ways that were not part of that larger set of goals around why we want open data.

Who owns your data at Harvard? Do you own your data? Do the Overseers own the data? Does the Harvard board own them? The University of California claims owners... Right, and that's why we're telling you to use governance. Okay, and let me hold the comments to the end, too, if I may. What she said is that nobody would really declare that data are owned. The University of California has claimed ownership, but it rests on a 1958 clause about owning laboratory notebooks, and that's all cited and discussed in the Berkeley Technology Law Journal paper. The notion of ownership doesn't usually come up until there's a fight. And here's a very public fight between the University of California and the University of Southern California, where you've got many millions of dollars of research funding at stake; this also turned out to have National Institutes of Health funding on one side and a large pharma company on the other. So the question becomes: who owns these data? Who owns this database when a professor moves from one university to another and wants to take the data sets with him or her?
This is where the rubber hits the road on ownership, governance, fair use, things like non-exclusive licenses and agreements, Creative Commons; it all comes down to use. There are many cases like this; I'm just giving you some high-level ones.

Student data is one of the most contentious and most complicated areas. This is where you've either got big concerns for privacy, or you've got a wild west and a lot of contention within universities. I chaired a task force for the National Science Foundation; the report came out 10 years ago. We built privacy into it, taking a very systemic view, kindergarten through graduate school through lifelong learning, and thinking about data protection and data collection at the same time. And yet the headlines you see say the future is being able to do data collection at scale and test out educational innovations in different ways. But this is not the way you want to be in the news. This, again, is a very famous case where, instead of keeping an attendance sheet, they turned on the surveillance cameras in the classroom, collecting data without students' knowledge and without professors' knowledge, and then counted with algorithms how many seats were filled. Okay, nice use of algorithms, but not a privacy-protecting, intellectual-freedom-protecting, autonomy-protecting way of going about it. So again, we've got stories in the Chronicle of Higher Education and many other places. If you really love to collect data, this sounds like a really cool idea, but is this how we want to be collecting data in our classrooms? Do we really want to be putting sensors on our students, seeing how attentive they are, and building models about them?

Now, to the library view of the world again: libraries are very much based on intellectual freedom and academic freedom, at least in the US. It's not true around the world, but this idea of the right to read anonymously is embedded in the whole Freedom to Read statement of the American Library Association. And then Julie Cohen famously codified that in a law review article in 1996. It's actually been quite a while now since that was fleshed out, because by that point we were already seeing the threats to academic freedom from publishers and other parties wanting to track what people are reading. Library catalogs have been specifically built not to track, which is why you don't get recommendation systems on library catalogs: a do-not-track ethic is very much built into them, which is not what's built into the publisher model. Universities are also in a very difficult situation just making contracts with some of these publishers, who are taking analytics at a much finer grain than you might want to think. I point you to a very good article from Cliff Lynch, again from a year or so ago, on what's happening with that trade-off. So the things that you're reading are being tracked, even in a university environment, more than most of us would like. "Improve your recommendations, give us your address book," and so on and so forth: libraries don't like to do that.

This is one that I assume is very well known to people in this room: anonymizing the data is not just insufficient, it's no longer possible. Latanya Sweeney, now here at Harvard, famously showed in 2000, going on 20 years ago, that 87% of all Americans could be uniquely identified with three pieces of information: ZIP code, birth date, and sex.
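To make the quasi-identifier arithmetic concrete, here is a minimal sketch in Python, an illustration rather than anything from the talk. The records, field names, and the uniqueness_rate helper are all hypothetical; the check simply counts how many rows are unique on their ZIP, birth date, and sex combination.

```python
# Minimal sketch (illustrative, not from the talk): measuring how many
# records in a "de-identified" table are unique on their quasi-identifiers.
from collections import Counter

def uniqueness_rate(records, quasi_identifiers):
    """Return the fraction of records whose combination of
    quasi-identifier values appears exactly once in the data set."""
    key = lambda rec: tuple(rec[field] for field in quasi_identifiers)
    counts = Counter(key(rec) for rec in records)
    unique = sum(1 for rec in records if counts[key(rec)] == 1)
    return unique / len(records)

# Hypothetical records: no names, yet every row here is re-identifiable,
# because the ZIP + birth date + sex combination singles each person out.
records = [
    {"zip": "02138", "birth_date": "1970-03-15", "sex": "F", "dx": "..."},
    {"zip": "02138", "birth_date": "1982-11-02", "sex": "M", "dx": "..."},
    {"zip": "90095", "birth_date": "1961-07-30", "sex": "F", "dx": "..."},
]

print(uniqueness_rate(records, ["zip", "birth_date", "sex"]))  # -> 1.0
```

Sweeney's result is the population-scale version of this arithmetic: with roughly 42,000 ZIP codes, tens of thousands of possible birth dates, and two sexes, the number of possible combinations far exceeds the US population, so most combinations are occupied by at most one person.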
And it's gotten way worse now: because there's so much data out there, it's very easy to re-identify almost anything with data. So we've got to have other ways of governance. That's the end of this high-level survey; now let's turn to public records requests.

Universities do a lot with public records requests. There's a famous book where Jon Wiener of UC Irvine spent about 20 years filing FOIA requests to get the FBI files on John Lennon, and then this important book came out. So universities want to use these tools where they can. FOIA is the federal mechanism; there are public records requests at the state level. Now, what happens when people come and start trolling the university to get access to faculty emails and faculty research data? Harvard is a private university, and so is MIT; this university is not subject to state-level public records requests, whereas UMass is. In Southern California, the University of California is subject to them, but the University of Southern California and Caltech are not. So that puts public universities in an even more precarious situation in dealing with some of these requests. This is a well-known one: Michael Mann, of the famous hockey stick graph of climate change. The trolls went after him; the politics went after him. And often the goal here is to get somebody's data, cherry-pick the data, draw different kinds of conclusions, and then push them out in different ways. There are many cases like this. The University of California put a task force together and came out with a policy in 2012, which has been adopted by many other universities since, saying: here is academic freedom; we are legally required to respond to public records requests, and we do, but here's how we treat them, and here's how we very carefully constrain what is appropriate and what is not. This was published, and it gave a lot of guidance for other universities to work with.

So the freedom and the access are a challenge, but so are the data breaches. I've spent the last three years as vice chair and then chair of the UC-wide academic computing and communications committee, which put me on the University of California president's cyber-risk governance committee. So I've spent a lot of time in Oakland dealing UC-wide with this set of issues. The University of California, and certainly Harvard, MIT, you name the major university, is a target of foreign governments. There is very rich intellectual property here. The bigger the institution, the more data, the bigger the target; there are lots of people coming after these universities. This is just the .edu domain; I pulled this one yesterday: 25 million records breached, and these are only the ones that met the legal threshold of having to report, so this is probably the tip of the iceberg. Universities are very big targets for data breaches, and protecting the borders is quite the challenge.

When I taught the privacy course last year, I gave every student a different major breach to investigate: see what you can actually find out about it. Here are two of their favorites. One is the Target attack, which came in through the heating and air conditioning system; it's a matter of hitting the weakest link and getting in. These are not built as secure systems, and even four years ago it was thought that at least 55,000 such systems were internet-connected. So before you put a Nest in your house or in your building, think about that as a vector. The other favorite was the baby cams, which have been one of the major vectors for getting in.
These internet-of-things devices were not built with security in mind, and there are many ways to get in. So data stewardship is a big, big challenge for universities. It's a challenge for every corporation and every organization, but particularly for universities, because of the way we run. We want to be open by design. We want our collaborators, some of whom are in countries that are attacking us for our intellectual property at the same time, and we need to let them into our systems to work together on data. And yet we've got to figure out what's a penetration attack and what's truly a collaborative process, and this is messy. We can't just plug the systems into the cloud with an ethernet cable, nice as that might be. The diagram in the middle is much more what it is: tying these systems together and migrating them often really takes a hammer to get all those pieces together. And it's the graduate students and the postdocs who do most of the data collection, write most of the algorithms, and do the data management. They are not security experts, and we should not expect them to be; it's beyond the job description. It's not that the corporate world is a perfect case of data protection, but we are a special case because of the trend toward openness and the need to manage these systems.

So this, again, is part of what I coined in the Berkeley paper. We've got this long-running aphorism in privacy and security: if you can't protect it, don't collect it. But universities are already collecting it. So if you collect it, you should take responsibility for it, and we're not yet thinking enough about how we take that responsibility. We've got open by design and open data. Privacy by design, again, has been around for decades; it's honored much more in the breach in terms of really being built into systems, but it's something we need to be thinking about more.

Similarly, one of the things I was asked to talk to the cyber-risk governance committee about was: how can we get the faculty to stop leaking data? Now wait a minute, let's talk about who's leaking. Is it a people problem or is it a systems problem? When I go to do course grades and I need one student's ID number, or say I need to write a reference letter, what I get is a data dump: an Excel spreadsheet of 500 students at once. Is that my leakiness, or is that the system dumping data onto my laptop that I don't want on my laptop? So where is the responsibility? Let's turn this around: is anybody in this room a certified records manager? Do you have any in your labs? Faculty are not going to be certified records managers. How many of you know what the regular records retention cycle is? One person. Seven years. Seven years is the usual, but there are a lot of nuances; again, this is another area of professional practice which is only slowly coming into the data world. Many kinds of records are legally supposed to be purged every seven years, but if you don't purge them and you keep them, they are then available for legal discovery. Just because you were supposed to purge something, if you kept it, it can become discoverable. So stewardship cuts in different ways. For some kinds of data, being responsible means keeping them indefinitely; for other kinds, it means purging them on a regular basis. Many, many judgment calls here. So, to wrap up the challenge that we're facing: we want to be promoting responsible data practices.
Build this into the way we educate students, faculty, and staff; rethink a lot of job descriptions; rethink responsibility; and recognize the very distributed nature of responsibility for data around campuses, because there are a lot of opportunities here, but a lot of risks for the community as well. Open data: we want to release and reuse them, and think about the collection, the collaborations, and the publications.

So we have plenty of time for questions. Let me end with this set of big takeaway points, which I hope I've made in the last half hour or so; these are all explored in great depth in a very long and detailed law journal article with many, many footnotes, which will let you dive even deeper. So these are the points. The data are assets to the university, and other people are seeing value in them as well. The University of California has been approached by large Silicon Valley companies who would like to make a deal with us for access to all of the patient records of all five of our medical campuses. This is a very difficult conversation; people have different ideas of how you should manage this, and it happens on a big scale, and on a little scale, every day. Think about privacy in context: when is it informational privacy, and when is it autonomy privacy? And again, stewardship in context: sometimes good stewardship means keeping things indefinitely; other times it means purging them on a cycle; there's a lot of professional judgment there. Open data: we would like to be able to reuse them for new knowledge, but they expose us to risks as well. Security: the bigger the data pile, the bigger the target; aggregation gets us more power, but it creates even more privacy challenges. And lastly, data governance, and there are provenance issues in here too: we need to move away from ownership and toward governance. That's really what we've been trying to do, and you'll see much more of the word governance around: what principles and processes we have, and how we govern these data. Okay, so lots of people were involved, and let's leave some time for questions. Liz, did you want to handle the questions?

Sure, that sounds good. First of all, thank you. We have two runners with microphones. I know we can hear each other pretty well across the room, but for the sake of the recording, please wait until somebody gets to you with a microphone before you ask your question. And if possible, try to state your question as a question instead of a point. Who'd like to go first? Fran? Excuse me. Right here, other side, there you go.

Hi. Fantastic talk, Chris. I have two questions. Let's look at it from something quote-unquote manageable, which is the university itself: administration, faculty, staff, students. Number one, what do you think the biggest priorities are for universities to attack? And number two, the practical question: if we all want to go do something about this tomorrow, what's the low-hanging fruit, and what would you recommend?

Okay. I think the priorities are, first off, recognizing that a problem exists, which is news to a lot of people, and thinking in terms of governance rather than ownership, and of who should be responsible.
Priorities should be looking for balancing tests to maintain the open-by-design approach, because if we turn this into a military environment where you lock everything down, you'll never be able to share data with your collaborators, your students, or anybody else. So you've got to find that balance. Universities do have to be secure, difficult though that sometimes is, but they also have to be open. Working on those balancing tests, those are the things I'd call priorities. The low-hanging fruit is to start the governance conversation and to get these discussions into the first dorm talks for the freshmen and the first introductions for the graduate students. If we can get people to think about their data as assets that they can mine and exploit, but for which they also need to take responsibility, that would be the low-hanging fruit to start with: get people educated and thinking about that trade-off, because they'll protect their data better if they think they've actually got something valuable.

I wonder if I can follow on from that question for just a minute. The University of California system is arguably one of the only institutions more decentralized than Harvard. How did you think about implementation across a decentralized system like that, and what were the points of entry at each of the campuses? Vice provosts for research? Data management folks?

Matt can speak specifically to the UC-wide Privacy and Information Security Initiative; we did several things, and we also co-chaired that data governance task force, which had some other recommendations. We have a privacy and data protection board at UCLA that we founded in about 2004 or 2005; we were very early in thinking about these things. And it turned out to be a very useful process, as a sounding board and for building institutional memory, because people would come up with a problem, and you would have a panel of faculty and staff who had seen it before and could say, here are some of the things to think about. People walking in don't know the starting point of the OECD principles, notice and consent, and so on and so forth, so it's about educating them from the beginning. We asked for some equivalent to be established on all 10 campuses. We asked for somebody to be a privacy officer on each of the 10 campuses; it didn't have to be a newly created position, you just have to assign responsibility for it. And then we tried to find ways to coordinate that conversation across the 10 campuses. All of those recommendations were accepted by Mark Yudof, who was then president of UC; the report was later delivered to Janet Napolitano and endorsed again; and then we endorsed it through the Senate, and it's been adopted up and down. And we tried to write those documents, both the PISI report and the data governance report, in ways that could be adopted by other universities; we tried not to make them too UC-specific. They're yours. Okay. Thank you.

Alyssa needs a microphone.

Thanks, Chris. I have a question about the library comment you made, that the catalog is purposely not personalized to you and that they're not tracking you. I'm sure many of us in this room use search tools outside the library for things that wind up being scholarly research, and it's amazing what can get recommended to you, especially on YouTube for some reason. But anyway, you're losing something by not doing that.
So are there libraries that let you do an opt-in, rather than going and searching Amazon and then coming back to the library catalog?

Yes. University libraries have been concerned about exactly this for a long time. Libraries have never wanted to default to tracking, but opt-in works up to a degree, and UC has experimented with some opt-in approaches to recommendations. The difficulty is the scaling problem: if only 5% of the people opt in, there's not enough of a basis there to build a real recommender system. But it's a constant balance and trade-off, because libraries are afraid that everybody is going to Google first, and then Google is the way back into the library systems.

So it's being tracked anyway.

It's already being tracked, yes, you're exactly right. So should the university do the tracking and then protect the trail? I didn't want to say that. Okay. It's a hard conversation, as Peter well knows. Yes. Okay, go ahead, and then Prima, yeah, please.

In discussing autonomy privacy, you talked about class discussions, and as you know, there are a lot of online courses these days, and some online courses incorporate third-party discussion sites like Slack. If you use tools like Slack, the discussions are archived on Slack, as well as instructors being able to download the data. So I wonder whether there are any recent discussions about class discussions online.

Yes, that came up early on in these systems, actually with one that preceded Slack; we were calling it the Piazza problem for a while. Shadow networks are appearing, where faculty will find a tool that they like, and they like it better than what's on Blackboard or Canvas or Moodle or whatever is local. It's free, so they use it and require their students to use it, and there's no contract with the university. And what happens with "free" is that the company gets the data; in that case, they were marketing the data to potential employers, and students, by enrolling in the course, had a choice: either stay in the course and give their data to this company, or don't enroll in the course. That's where the shadow comes in. So the universities are trying to get some of these companies to come to the table and make a university-wide contract that says, yes, we'll work with you, but we get to keep the data and protect them. Right now you see us building that into the purchasing agreements, where the data protection rules go; so there are different ways to embed it. The Slack channel is a great example: a really cool real-world tool, and yet Slack then has all of those data. Canvas, similarly, in making contracts: can you protect the data within that, or are they going to run the analytics and sell them back to you later? Whose server does this sit on? So part of what we're trying to do is simply promote awareness. None of this stuff is free; you're paying for it with data. And then how do you get the data back and use them as assets for the university? Prima was here, yeah. And then Xiao-Li, yeah.

Yeah, so I have a question about the ownership of data, or the claim of actually being able to control those assets. In Europe, for instance, we have these sui generis rights on databases.
And the argument goes that because we have these sui generis rights, people can actually disclose the data, because they know that no one can extensively and substantially reuse or reproduce them, and therefore there is more disclosure. On the flip side, because we have these sui generis rights, even if the data are open and available, nobody actually knows to what extent they can reuse them, because it's very hard to determine whether or not the sui generis rights exist. And so if there is not an actual open data license, no one will dare to reuse the data. So I just wanted to know: how do you see these sui generis rights, as helping or as hindering the scientific community?

The database rights have been well studied in terms of what they cover for openness. There was a study of that which I reported, either in this book or, I think, in the previous book, Scholarship in the Digital Age, about 10 years ago, looking at whether the right really promoted openness; and at least at that point, it was not promoting openness. But what you raise is a much bigger problem, which is around licensing. It's very hard to know who owns any one of these data sets. When you start merging them and combining licenses, it gets even harder to track the provenance of who's able to use any of them. So the European sui generis database right creates one set of problems, but it's a simple case of a much larger set of issues of licensing, ownership, and control, and of what happens when you start to merge data. If you really want to reuse data, you want to bring them together from multiple places, every one of which has a different set of rules associated with it. It's messy. Xiao-Li? And I avoided GDPR, which we can also talk about, but that's another can of worms, okay.

So, speaking of messy, I want to ask a general question about the systematic thinking behind dealing with these conflicting goals, where one is open access, trying to be as open and transparent as possible, and the other is protecting privacy. I want to give a specific example. I knew someone, and actually, after he told me this, I never talked to him again, seriously. This is a well-respected researcher who told me that in his articles he tried to write down as little detail as possible. For example: don't say which version of the data he's using, so that if somebody challenges him, he can say, well, the answer is different because I used a different version. It clearly was unethical behavior. He would also describe a soft model instead of a hard model, in the sense of just verbally saying "I used a linear model" without telling you exactly what he did; that's how he protects himself. I think he started by protecting himself because he felt it was just too much work to answer these challenges, but now it's become a habit, which I think is really terrible. When those things happen, people are hiding behind privacy. What is the systematic thinking in these policies that tries to put at least some safeguard against that kind of abuse of privacy protection?

That's chapter eight of this book. Thank you.
Chapter eight deals with how much of the question is sharing and reuse versus getting credit for things and exposing yourself. We're back to the question of signal and noise, and of what are really data in the first place. You can meet the letter of the law without the spirit of the law very, very easily. Plenty of people have told us: sure, you can have my data, because I know you won't be able to do anything with them. A data dump is a data dump; garbage in, garbage out; needle in a haystack, you name it. Just because data are open does not mean they are findable, accessible, interoperable, or reusable, and different fields have dealt with this in different ways.

Astronomy, which we have spent a lot of time studying, and which is also why I'm spending so much time here at the Center for Astrophysics, is one where the data from the telescope are expected to be open after some embargo period. The telescope may run a pipeline of processing to reduce and clean those data, and some people will take the data at the end of the official pipeline and use them, and then you can sort of know what happened. But lots of people will take a more raw version and reprocess it in multiple pipelines, and then they won't publish, and are generally not expected to publish, the pipeline processing per se. And the pipelines change versions very, very quickly; capturing all the other things that change is almost impossible. They're using things like Jupyter notebooks to try to get some of the trails out of it. But drawing a circle around what are data and what are not data, what needs to be released and what does not, is next to impossible. There are lots of judgment calls, and how we make them varies very idiosyncratically. We'll interview you about your data practices and you might say, everybody does it this way; and we go next door and get a completely different story that also ends with, everybody does it this way. Because they don't really talk to each other that much; they have very localized ways of handling the data. So this is why the FAIR principles are really kind of a holy grail; implementing them in a truly long-term, reusable way is something else. Mercè.

I always have questions, but I don't know if anybody else has. Okay. We're just staying in the middle here; it'd be nice to get to the back. I'm happy to ask questions, I just thought I was holding the mic. So, following up on this, what do you think about preregistration? One of the answers I was going to give is that maybe journals or publishers should also take some responsibility here: if somebody publishes something where the methods are not well defined and the data and the code are not available, the journal has a responsibility for publishing that too. And at least within some fields this has been addressed this way, right? The journals themselves, through their data policies, require more specificity, sometimes even reproducibility. But the other part is, I was wondering what you think about preregistration.

How many people know about this whole idea of preregistering your hypotheses? Okay, just a small number of you. This is what comes up in the biological sciences and somewhat in the social sciences, in social psychology. Social psychology is the field that is really having the trolling and the battles right now, which is amazing.
You could write soap operas around some of this stuff that's going on. So the idea is to find a way to avoid cherry-picking. How many people know what p-hacking is? Okay, so that's another one. P-hacking is cherry-picking the data so that you get something that has a statistically significant finding, and then you publish that. You've got a big set of data, you find what's relevant, and then you start to, as we call it, salami-slice the data and do lots of little papers out of it. The supposed way around this is to publicly register your hypotheses before you do the data collection, and then you can check later: did I do that or not? To me, that sounds like Robert Merton and the very, very formalized notion of research as a cookbook, as opposed to something where there are all kinds of dead ends and paths and explorations. Alyssa has this phenomenal example of the undiscovered: when you are looking for one thing, you trip over something completely different, completely amazing things, like the expanding universe when you thought it was a contracting universe. Registering the hypothesis would have taken us way off in the wrong direction. So I think it forces a formalization on something that is heavily exploratory. But that's also the difficulty of documenting data well enough to reuse them: it's all these tiny little details that don't get written down. The methods sections in Nature and Science give you about this much space to put the method details in, and there's no reward. It's like Xiao-Li's question, too: there's no reward for giving the massive detail so that somebody else might be able to exploit your data for things you didn't see. There's a real mismatch of incentives and rewards here, and that's a lot of what that whole chapter eight is about, if you want to read it. Yeah, back of the room. Julian, see how well I train my students here; I'll ask him. Back here, okay.

I'm not a scientist, but I was over at MGH, and they were discussing a clinical survey of patients in the field to gather biological data and their genomic DNA. One of the things that dissuades people, when they're being canvassed as prospective subjects, is that they want to know where these data go, how secure they are, and which third parties are going to share them. That's not only researchers, but private firms, government, and other agencies.

Right, so then...

Can you give me any reassurance on that?

Okay, that's a huge governance problem, and certainly the people running clinical trials, at Harvard and everywhere else, are wrestling with this. On the one hand, do you get a consent form for this study and this study only, which means that you can never go back? So if you see something interesting later, or something that might be beneficial to that patient later, you may not be able to go back to it. The alternative is a completely open consent, which patients are often reluctant to give. Then you've got a direction being pushed by Sage Bionetworks and John Wilbanks, particularly for things like orphan diseases, where there are small patient groups that really want their data to be used in new ways, and they're very happy to get the data out there, because it might be just a handful of people; they're trying to push a different kind of consent form.
So again, it's really contextual, about just how much you want to do, because going back and finding patients again and re-consenting them, as they say, over the long term is problematic. But there are also layers and degrees of accessibility, and data enclaves that control how much access you get; there are different ways of protecting these data too. Is that reassuring at all? Okay, good.

They're sort of in a position where they can make up a lot of their own governance. It's kind of a patchwork; there's no uniform broad statute covering all of it.

Well, there's not.

Like in the EU or other places.

Well, there are broad statutes around human subjects; they come out of Health and Human Services and come down to all the universities. But there's not a lot of coordination among the IRBs, the institutional review boards, on individual campuses, and those are made up of people who generally are methodological experts. They're not necessarily data science experts, and certainly not security experts for the most part. So you can take the same study to three different IRBs and get very different answers about what is appropriate use and protection and so on. I think it's probably more consistent in medicine than in the social sciences, which is what I'm more familiar with. And there's a question here at the back too. And I think there's one over here.

I was curious: obviously there's been a lot of work in the computer science field on algorithms for providing a particular type of privacy, what's called differential privacy, and for a particular type of data release from a database, I think we understand that quite well. What I'm wondering is, if you think about the theory and the algorithmic work, and then about the practice and the deployment of these systems, how much work do we still need to do on that particular narrow question to get the systems to even be part of the conversation that universities are having, or should be having, so that these techniques can really be leveraged and help to answer parts of these questions? Only narrow slices, but I'm just wondering how far from that we are at the moment.

Right. Cynthia Dwork gave an excellent presentation last week on differential privacy, and her context is more around really big data, the Facebook and Amazon kind of data, or big patient records, and how you filter out of that. So certainly we need to move that science forward. But let's turn it around and think about a case on this campus. Somebody does a study of campus climate: how do you feel about diversity and how you're being treated, and so on and so forth? These are very sensitive topics. And if they're done for internal management purposes, the IRBs don't touch them; they say these are not IRB issues because the results are not going to get published. And then somebody comes back a year later and says, we've got some really cool data here, we want to publish them. This happens all the time. It was asserted to be an internal survey, not subject to IRB; then the IRB comes back and says, okay, you can reuse the data in this way. But by that point, highly sensitive data had been collected that probably shouldn't have been collected in the first place, and then you're trying to publish and filter them later. We're not talking about a scale, or a kind of use, where you could probably put these algorithms to work.
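For readers who have not seen differential privacy before, here is a minimal sketch of the kind of mechanism the questioner is referring to: the standard Laplace mechanism applied to a counting query. This is an illustration rather than anything presented in the talk; the dp_count helper, the survey records, and the epsilon value are all hypothetical.

```python
# Minimal sketch of the Laplace mechanism behind differential privacy:
# answer a counting query with noise calibrated so that no single
# person's record changes the published answer by much.
import random

def dp_count(records, predicate, epsilon):
    """Differentially private count. A count has sensitivity 1
    (adding or removing one record changes it by at most 1),
    so Laplace noise with scale 1/epsilon suffices."""
    true_count = sum(1 for record in records if predicate(record))
    scale = 1.0 / epsilon
    # The difference of two i.i.d. exponential draws is Laplace(0, scale).
    noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
    return true_count + noise

# Hypothetical campus-climate survey; smaller epsilon means more privacy
# and more noise in the released statistic.
survey = [{"feels_respected": random.random() < 0.7} for _ in range(1000)]
print(dp_count(survey, lambda r: r["feels_respected"], epsilon=0.5))
```

The speaker's point stands: the noise scale depends on epsilon rather than on the number of respondents, so on a survey of a few hundred people the noise needed for a strong guarantee can swamp the signal, which is one reason these techniques fit Facebook-scale data better than a small campus-climate study.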
So a whole lot of this, I think, is much more subject to governance, and to people thinking sensitively about what issues are at stake and whose privacy is at stake, than to purely technological solutions. If we're talking about things like what's around the edge of the network: when UCLA had a breach at the medical center a couple of years ago, at a level that had to be reported, we don't even know whether anything was actually breached, but it reached the legal threshold for reporting. The first move by the president of UC was to put monitoring equipment on the network edges of all 10 campuses, and then to hide that monitoring equipment under legal privilege and not tell the faculty. So the Berkeley faculty went to the New York Times pretty quickly when they found out, okay. Again, this is not how you want to do governance, but the firestorm that resulted led to some very good and important conversations that got people to think about these things. Among the questions: what is happening? What's being collected? How long is it being held? Who has access to it? Are there data there that Chinese students worry might get back to their governments? Is there material from classrooms? Is there material from my research? What's being tracked? And UC put such protections into the big contract, which is now with FireEye, that it looks a lot more like GDPR. It's not as close as some of us would like it to be, but what they had to do for UC they are now using to work with Europe, because the European protections are much stronger. So again, it's about asking the questions, and the faculty got very strongly engaged in asking them. Some of those things you can run algorithms on, and there are plenty of algorithms being run there, but it is up close and personal where the real challenges are. I think we have time for one more question. Okay, there's a gentleman back here who's been waiting, I think. And then Alyssa, if there's time, okay.

Hi, I'm currently doing research on issues of data ownership and such as they relate to connected vehicles. So I'm kind of wondering how much of the push in academia is toward the desire for good practices, and how much is backed by the law. I'm just trying to figure out, I mean, I see some overlap and I'll be looking into it, but I'm wondering where the pushes are coming from.

Where is which push coming from?

For better data protection practices, better data governance, et cetera. I've been trying to look up what laws might apply to the situation, especially in regards to terms of use, terms and conditions.

Apologies, this is a little outside the scope of the talk. In our experience, the push inside is coming a lot from the faculty, but also from the privacy officer, and we have legal counsel. Kent Wada is our privacy officer and has been for a long time, and Amy Blum, the campus counsel, has been very privacy-focused, so we have worked extremely closely with them; whereas a lot of campus counsels are much more about protecting the university from lawsuits, rather than dealing directly with faculty and students and thinking about how we keep the campus a safe space.
So in our case, I see it as faculty and student and administrative driven. But a lot of this is wild west. We started realizing that we're collecting all kinds of data that really are usable under FERPA and that don't fall under these areas of governance, and we asked: what are the principles, what are the practices, who's going to do this? And we need to do it in a lightweight way, because IRBs are a very heavyweight solution: they take loads of time, lots of administrative overhead, they have to meet every week. We can't have something that heavy-handed; nobody's willing to do that. So what are lighter-weight solutions through which we can get this governance? There's no simple answer, and I don't have easy answers for where the laws are, but the fact that there aren't a lot of laws is what's making people feel they have free rein to do this modeling. We've got universities, and again, this is in the Berkeley paper, who are saying: let's just label the students red light, green light, yellow light, and tell them whether they should stay in this major or not; let's compare their Facebook feed to what their ID card tells us; are they doing their homework at two o'clock in the morning or two in the afternoon; are they eating pizza or are they eating vegan food; how are they moving around campus; how can we model these students' behavior? I find that absolutely terrifying, but a lot of universities are doing this. I think if enough of these scary scenarios hit the front pages, more people will pay attention, but let's try to prevent them instead.

And I think that's a good wrap-up: let's be good stewards of the data that we're collecting; let's treat them as assets and responsibilities. The public trust in universities really is at stake. As Larry Bacow said on Friday, the trust is at stake, and we need to invest in the universities in very responsible ways so that we continue to embody that trust. I think this is an area where we can do that, and we should be doing that. Thank you.

Thank you. Thank you.