Good afternoon, everyone, and thank you for coming to the Berkman Tuesday lunch series. I'm Peter Hirtle, a fellow at the Berkman Center, and I am delighted to be able to introduce to you Christine Borgman, Presidential Chair in Information Studies at UCLA. Christine, of course, is one of the brightest and most important figures in all of information science. Thinking back on it, I can't count how many times I have gone to her work for insight into what we were just talking about: Internet naming systems and their history, search, information retrieval, and most recently data. I was thinking just in terms of citation, and how one cites data, and she's been doing an awful lot of that, but there are much bigger issues now, especially with her new book, Big Data, Little Data, No Data: Scholarship in the Networked World, published by MIT Press this January, with a blurb on the back from Jonathan Zittrain telling us that this is an invaluable guide to harnessing the power of data while remaining sensitive to its misuses. So it's terrific that we can pull Christine away from the Dataverse meeting that's going on at Harvard this week and have her talk to us about data. Now, this is a Berkman lunch, for your information. We sometimes go around and introduce ourselves if we're across the street, but not when we're over here, because there are just too many people.
This is being webcast and is being recorded. Christine's going to talk for about 25 or 30 minutes about data, what it is and what to do with it, and then we'll open up for questions. We'll have microphones coming around, so we ask that you use those for the webcast. You should know that the webcast is going to be stored, archived (I hate that word), let's say preserved, at Harvard. I suppose it becomes data at that point, and what will the University do with it? Well, maybe we'll ask Chris about whether it's data or not. So if you have any concerns about that, you know, go from there. Let me see, is there anything else? Oh, and we're using the hashtag #berkman for this lunch and talk, so those of you who are viewing on the web, if you have questions, please send them in to #berkman and we'll pick them up and ask them in the questions section. With no further ado, Chris.

Thank you, Peter. It's a great pleasure to be here and to be back, and it's also a great honor for Berkman to have gotten Peter, who has been my go-to guy on copyright and policy around data and scholarly communication for several decades. We got him to guest speak to my class at UCLA by Skype just a couple of weeks ago, so it's a long-running conversation amidst a cadre of people here. Berkman is an ideal place to talk about these issues because we at UCLA are in the stages of trying to do a report on data governance, on what the policies, the practices, and the principles should be, and I have my notebook to take notes from all of you as well. But I want first to start by framing a bit from my own research and the book, to give you a bit of the big picture, and then get down to some of the real challenges that are facing UCLA and Harvard and actually all of higher ed. We had an earlier discussion when I was here in October, with a smaller group over at the Berkman Center. So this is, as Peter said, the book, which is a large framing of
thinking about what are data in scholarship and how we deal with sharing them, reusing them, citing them, and giving people credit for them. The middle part of the book is a set of case studies walking through the diversity of data and applications in the sciences, the social sciences, and the humanities, and I close with questions of archiving, stewardship, curation, and what to keep and why, because no one's claiming that everything should be kept. But we are certainly at a stage right now where we are drowning in data, and in many respects it does look like salt water: we're swimming in it, we don't know quite what to do with it, we don't know what the risks are, and we don't know how we can get the power out of all those data that we would like to have. So this is the short overview. I'm going to talk for a couple of minutes just about what are data in the first place, and then the policy issues about the data we collect, principally for our own research but also about students and about faculty, and then the data we collect about ourselves, much of which is falling through the cracks. It's a chance for those of us who think about data and think about scholarly communication to put our energies into real action on policy for universities before things get any messier.

So what are data? Big data is the buzzword. I thought I was over it, other people are over it, but you still have to kind of explain what people might think big data are. This tight three-part definition comes from about 2001, from the Gartner Group, and it said data can be big in at least three respects. They can be big in terms of the volume; in terms of the variety, how heterogeneous they are, and you certainly see that across different scholarly areas; or in terms of the velocity, the rate at which they're coming at you. Then we added big in terms of value and big in terms of veracity, and we ran out of these and went on to a lot of other ones. But you can see that even at that level people don't agree on what it means to be big; there are
many dimensions that might be big. The long tail metaphor is also popular. This is even more reductionist: it suggests you can take all data down to just two dimensions, and in scholarship that would mean we've got a small number of researchers, like the astronomers and the high energy physicists, who have a lot of whatever it is, and then many people who have less of whatever we think that is.

We've also got definitions around open data, and the most basic one is this idea that data are open if they're free, if you don't have to pay money for them and they don't have a lot of license restrictions. These are some of the things that Peter and I talk about: the nature of licensing, whether data are even subject to copyright, what might be licensable, and whatnot. But here the difficulty is that being open, being free, doesn't necessarily get you toward anything particularly useful. Making the haystack bigger does not make the needle easier to find. So there's some open data, and a whole lot of open data is about that useful. If you don't have metadata, you don't have provenance, you don't have software, you don't have codebooks, you don't have something that is really very useful to anyone at all.

Now this definition from the OECD, which has been around since 2007, is a far higher bar, but as you can see it's a huge number of parameters. The odds of making data open so that you satisfy all of those are, well, nigh unto impossible. If you're going to make the data have legal conformity to whatever jurisdiction you're in, it may not have legal conformity to some other jurisdiction with which you are collaborating, for example. Do you want to be interoperable with your immediate community, or interoperable with people in another discipline that you're working with? We run into this all the time. Quality, sustainability, and so on. So it's pretty messy, and this is where I ended up. Usually when I'm talking just from the book or the larger research we're doing, I'll spend a lot of time on
examples, but today I won't, but rather say that to think about what data are you really have to take a much more epistemological approach. It's not stuff, it's not bright shiny objects; data are in the eye of the beholder. So this is where I said they're representations; it's when you use them as evidence. Certainly many of you who are lawyers or lawyers-to-be understand the notion of evidence. I mean, think about just eyewitness testimony and how fuzzy that is in terms of the differences of who saw what, and when something becomes data. So we've got all of this contestation around it, and many groups coming up to deal with it. We've now got the Research Data Alliance, which was founded in late 2012 and already has members in, I think, 60 countries around the world. The meetings, which are twice a year, are getting up to five or six hundred people, and this is the first time that you've had a big forum where people dealing with astronomy data, with survey data, with humanities data all come together in one forum. So we're starting to get more of a conversation, but the precondition to research data sharing without borders is that researchers actually will share their data, and that turns out to be the hardest one of all. And again, that's what our research is about. We've spent the last 10 or 15 years following people around, primarily in the sciences. We've spent a lot of time here at Harvard in astronomy at the CfA, and we've been following the Sloan Digital Sky Survey and the Large Synoptic Survey Telescope, working in some bioscience areas, physical science areas, embedded network sensing, and so on, and looking at why it's so hard. The incentives aren't really there. The amount of effort it takes to make those data useful to somebody else, to get value out of them that you weren't able to get yourself, is not really there. The ownership is often a huge barrier. People have no idea what the legal status of data is, so the data simply never leave the lab, never leave the office, rather than try to
confront all of those. So dealing with the governance issues kind of starts with this fuzziness of people not agreeing on what are data in the first place. These are some of the kinds of data that we're concerned about governing. There are data collected by our community, obviously the research data, but there are also all kinds of university analytics that we're collecting for teaching and learning, so they cross that border: sometimes they're collected about our community, and there are certainly ones collected by our community. For the policy and management responses, we've got the mandates of funders and journals, and I'll get to that slide next. Research data management services: I've got a slide on that, and Harvard is way ahead of the game on dealing with some of those issues. Release and retention practices: those are things that librarians and archivists think about, and there are certainly some in other legal areas. How long do you have to keep something? Do you keep it a year, do you keep it five years? What's the evidentiary status of it? People who never wanted to think about those things are starting to have to think about them in doing their own research. And then the laws and policies. Certainly things that fall under human subjects are fairly clear. Some things fall under open records laws, and faculty in particular are realizing the hard way that your email may be available under open records laws, your research data may be available under open records laws. You often get clashes between these, and the university lawyers are now getting very busy explaining to people how they should separate their personal email, their university email, and so on. And then we've got HIPAA in medical records, FERPA in educational records, and PII, personally identifiable information, which is a particular legal category, and I think that may be state by state. I think the California PII is probably different from the Massachusetts PII and from the federal PII. But a lot of what
we're concerned about is things that just fall between the cracks. They're risky data, they're not governed, and people don't think about what they might be up against in working with them.

The open access policies cover the publications and the data from your research. How many people in this room have filed a data management plan with a funding agency? John, okay, centuries ago. It's only been a requirement for a few years, but being a data guy you were ahead of the game on this. So this may not be quite the world that all of you will be walking into, but if you haven't taken your intellectual property courses, do; they will serve you no matter what area of law you go into. My husband got his law degree in Texas, where everyone had to take oil and gas law to pass the Texas bar. He wishes he had taken intellectual property law instead; it would have done him much more good along the way. So this is just kind of a balance of some of these things, and it's really changing in the US right now. The office out of the federal executive branch, the Office of Science and Technology Policy, John Holdren's office, issued a directive in February of 2013 that required all US federal agencies that fund a hundred million dollars or more of research and development to make available their publications, their research reports, and their data. But it didn't say how they were supposed to do it, so each agency, the Department of Energy, the National Institutes of Health, the Institute of Museum and Library Services, and so on and so forth, is coming up with its own plan of how to do it. So you are going to have different requirements depending on who you get your funding from: whether it's acceptable to put data into Dataverse here at Harvard, whether to put your publications into DASH here at Harvard, whether you can put them into SSRN, the Social Science Research Network, what the different publishers will allow, what the embargo periods are. All of this will be coming down on those of you who publish and those of you who are releasing data
funded under any kind of public granting, but also if you're working for, say, one of the DOE labs, you're going to be under these as well. The Australians are the only ones I've found that have put these into their code of conduct, so in the code that you sign to do research and get a grant, good data management is part of that. It's sort of implicit in many other ways, but the point is that funding agencies around the world, including the Chinese, are going in this direction and starting to enforce it.

Dataverse, which I'm speaking at tomorrow, is having its first really big community meeting. It started out of IQSS, the Institute for Quantitative Social Science here, and you can now put your own Dataverse on your personal website, or you can do it at the university level. This is Harvard's response, which is now being picked up in many other places, and it gives you a workspace where you can put your data as you're doing your research. You can keep it to yourself, you can share it with your collaborators, you can open it in stages, roll it out, release it to the journal as you wish, as the requirements go. They're really working directly with journals and partnering in some very interesting ways, so watch this space if you're interested in that set of issues, or come on to the meeting tomorrow.

Now here's where it gets messier and messier, and there are fewer and fewer people working on it yet, but I'm hoping the people in this room will help us sort some of these things out: the data that we collect about our community. This top part has the Asilomar conference, and again I have a slide in a moment on that. Some of the people at Berkman were involved in that Asilomar conference. Any of them here today? Okay, no, so we'll go on to that. How many of you are enrolled students at Harvard? Only a few. How many are summer interns? Okay. Others, do you have an ID card, a magnetic card that you tap and use for an ID? Okay. Do you know how much data
that is collecting? All kinds of data. That is every place you've been. That's your data card, that's your library card, that's your internet services card, it may be your health care card, it's your dorm card. And you combine that with the course management systems: the university knows every time you went to the website, they know if you did your readings, if you downloaded them, they know if you submitted all your assignments on time. If you are somebody who always puts them in late, we can see that pattern, we can see if you're in trouble. Some universities are making decisions using those kinds of data, and one of them is even doing red light, green light, yellow light: do we steer this student out of this major, do we steer the student to health services, do we steer the student to mental health services? What do we do with this student based on these data analytics that we are getting from these systems? I had this conversation with my niece, who just finished her freshman year at the University of Illinois and is going into an IT specialization in business, and I said, Elizabeth, what are you learning in your business IT classes about how to manage this? Just think about what's being collected. And her eyes got wider and wider. It didn't even occur to her, the trace that she was leaving as she moved around the Urbana-Champaign campus. So students are starting to think about this, faculty are starting to think about this, and universities are starting to partner with outside agencies now; that's one of the things that brought it up. So we're working primarily from the two use cases of the student records and the faculty records. UCLA finally is automating the academic personnel promotion system. Now, you'd say UCLA has been a leader in technology in other areas, but there have been so many fights over who gets to collect what and what they're going to do with it that it's been a heavily paper-based system, and crates of paper go across the campus every time people go
up for promotion and review and merit. To make a database out of this has sort of an unstoppable momentum, although a lot of people are trying to throw as many bricks in front of it as they possibly can. But if you keep throwing bricks you're not going to accomplish much, because this is a train that's left the station. We need to think about how we're going to govern it and who's going to have some say about it. We're definitely getting the publications, your grants, your teaching evaluations. This came up already: teaching evaluations are supposed to be used to improve teaching, not to judge you for promotion and tenure. Should those data go from the Office of Instructional Development into the academic personnel system directly? Some people think yes, some people think no. But again, the university has a lovely rich set of information on faculty as well as on the students around the campus.

So this is the Asilomar conference that I mentioned, which we talked about when I was here in October: starting to think about this big array of data and how it should be managed, what the values are, what the criteria are, drawing upon the Belmont principles and looking at the Common Rule, the kinds of things we deal with for human subjects. There are certainly principles there to guide this. But as universities say, what are we going to do with all this stuff, companies are seeing this as a business model, showing up and saying, oh, we'll run it for you, and then you don't know what happens when those data leave the campus. We're trying to deal with that. And EPIC, in full disclosure, I'm on the board of directors of EPIC, and one of EPIC's big pushes is student privacy rights, and it has come up with a Student Privacy Bill of Rights. So again, there are people working on this front.

The bibliometrics is the part that's really fun, and it's an area that I've been working on for a long time. I just finished, actually
the papers were due last night, as we're on quarters, a PhD seminar on scholarly communication and bibliometrics, really teaching it as a research methods course to doctoral students in information studies, to think about the quality of data and what you can collect. So this is a snap from the book bibliography, which is also up as an open Zotero file, and that's a clip from a law review. Very different kinds of formats. How many different bibliographic formats do you think are out there and widely used? How many do you think? Sure, ten to the third, so about a thousand or so. Okay, got it. It's higher than that. Who uses Zotero? Okay, look at the Zotero style sheets: 7,500. Every journal, every publisher. EndNote, just as many; Mendeley, just as many. You are never going to get clean data out of this. Even if people were good and meticulous about their bibliographies at the end of papers, you are not going to get clean data, because the algorithms cannot deal with 7,500 different formats in which those page numbers might occur, if there even are page numbers, and if people got all the middle initials right, which they never do.

So here's just a quick snapshot from my personnel review last summer. My resume lists about 200 publications. Web of Science found 145 of them, Scopus found 77 of them, and Google found 380 of them. That's because Google is picking up all the versions, and they're picking up the slides and the YouTube videos. They will pick this up as a publication if the metadata is properly done, which, being a good cataloger, I know you'll do, Peter. But look at the difference in the h-index, look at the different numbers of citations, and this is about a year ago. This is what happens. This morning we were over at the Center for Astrophysics looking at what the Astrophysics Data System does: fabulous statistics, but only within the ADS. So these are closed universes of what they pick up,
but we are using them to map scholarship. We're making beautiful pictures of how information flows around the world, across disciplines, across countries. We're making policy and funding decisions based on these. And then when you've got open access policies, and the University of California one is very similar to the Harvard one, we're buying commercial services to go scrape data from publishers' websites and others around the world to feed it in, to respond to those federal policies about what we're supposed to be keeping and the embargo rules. So these are being used, and once people have data they like to use it in ways that you don't always know. This is what we were talking about this morning. Now we have altmetrics, which is a company on top of a phrase being used, and the head of it, whom I was talking to last week, readily admits that altmetrics are neither alt nor metrics. You know, they're alternatives to citations, and these are really indicators as opposed to metrics or true measures. So this is one by Alyssa Goodman in astronomy and Alberto Pepe, who was my doctoral grad, and I was on it. This was a Radcliffe seminar a couple of years ago. 22,000 views, so very popular, but three citations as of this morning. Google Scholar found 13 citations to 24 versions of this document, which we also grabbed this morning. So these are just not going to be good data, and yet they are being used, and very large companies are coming to universities, taking these data, and selling them back to you, so deans and directors and department chairs can make decisions about people.

So that's how we ended up here. For my sins of complaining that this was a real problem and we need to think about it, the response is: good, you fix it. So this is Kent Wada, who is UCLA's chief privacy officer and chief information security officer, and he and I have been making trouble together for quite a while now. This is just the rough set of questions that we're trying to deal with: how
should UCLA collect, organize, and use the research analytics, and who should have access to them, both within the university and in partnership with others? We were asked to deal with the governance principles and the governance processes: what are we going to do on the ground? This task force was jointly charged, and this is very important, jointly charged by the Executive Vice Chancellor and Provost and the chair of the Academic Senate, so the faculty and administration are working together to say how we should govern this. We have members from the faculty and members from people running operational systems on this board. We've been working on this since fall, and we're already a couple of months overdue on the report, because every time we get close to the report it gets harder. And that's why you're going to help us figure this out.

This set of principles came first out of the UCLA Privacy and Data Protection Board, again a joint board, and then they went up to this UC-wide privacy and information security initiative, and this is now part of UC-wide information policy. We've actually gotten a privacy and data protection board and a chief privacy officer on all 10 campuses just in the last year; that was a major, major breakthrough. But this is an important distinction that we made, and we hadn't seen other people make it before, between autonomy privacy and information privacy. This board was charged by then-President Mark Yudof, who approved the report, and Janet Napolitano has now approved the report and put it into practice. It asked us to look at privacy and information security. The starting point was your usual kind of driver's license and credit card information, but we quickly became concerned about this much broader array of data that are being collected, and said we want to be concerned about the ability of individuals to conduct activities without
observation, and about information privacy, protecting information about individuals; and security spans all of these. So this is a principle we've found very useful and we're sticking to. That's pretty straightforward: the Code of Fair Information Practices and so on. We are building for the governance, so these are the goals. We want to resolve legitimate disagreements, because it's pretty obvious that even when you get eight people in a room on this task force, the people building the personnel management system want one thing and the people from the institutional review board want something else, so we're kind of playing it all out across this room, and we've got the campus chief legal counsel on it also. So we want to resolve these; we want to promote transparency and open discussion; we also want to reduce the risks of breaches and such and leverage our structures. And we want to look at data held by the campus that goes beyond just things covered by FERPA, HIPAA, PII laws, and so on. And we've got these competing things. What we're really finding is people will come to the IRB and say, is it okay to do this? And the IRB says, are you going to publish it? And they say, no, it's just for evaluation. And the famous Harvard study of turning on the cameras in the classrooms, which I assume everybody here knows about? Oh yeah, okay. That's one of the cases we've looked at. People were not informed in advance, you did not have consent, people didn't know what was being done, and apparently they went to the IRB and the IRB said it's not an IRB problem. And if the IRB says it's not us, there's no place else to go. At most campuses, including ours, there's no place else to go. So part of it is, where do you go if it's not IRB but you've got potentially sensitive data that falls through?

So these are some of the triggers. This is what Kent and I have just been hacking together in the last week or two: when does this process start to take
hold? It's definitely about decision-making, so it's data about people: if we're collecting data without people's knowledge or consent, and people have no idea how much data actually is being collected about them; when they're used in new ways, so if you gave consent for one kind of use and you want to reuse the data for something else, we think we should trigger these policies; and when you combine new data, in particular if you want to take data from multiple places, which is the usual big data problem, and combine them for analytics. We're also seeing that if the data are going to stay completely within UCLA, we've got pretty good governance structures in things like the Information Technology Planning Board and the Privacy Board. But when we want to partner with these various companies that want to come in and sell us services, and we want to keep our values on it and not let them have data that they could use for unintended purposes later, that's where it gets really, really sticky. And it also hits the fan when you hit other universities, so when UCLA and Harvard partner on something, should we kick these policies in at that stage?

So this is roughly the model. It's got a couple of charts here. There's the EVC and Provost, the board that was set up, and this is who's on it right now. The voting members are a mix of faculty and administrators, we've got students on it, and then we have particular people, and we've got Audit and Advisory Services on it too. And then we've got this dotted line to the IT Planning Board and Academic Senate and this oversight committee, so this is where some of the decision making comes in. The privacy officer deals with the training and awareness and some data use questions and implementing these policies across the UC system. So we end up with the chief privacy officer as the triage point, but Kent has a few too many jobs already. I mean, he can't really triage all of this stuff, but to sort of ask: is it IRB, is it not IRB, where should it
go? And so this is the set of discussion questions that I ended up with, figuring if anybody could help us figure these things out it would be this group. So I'm hoping we can talk about: is this the right problem to be addressing, the uses of data that are not covered by the obvious sets of policies that universities have in place? And how should we scope it? Every time we think we've got some flow, then the question is what's inside the box and what's outside the box. You don't want to shut down the entire university and say nobody can touch data until they come talk to Kent Wada; that's not going to work either. So do we scope it by who the data are about, by what they're going to be used for, by the agency collecting the data, by the partners? What are the criteria, and what are really workable governance processes? So if you can help us think through some of those, I will take all those back to UCLA and we'll write up the final report and give it back to you. All right, let me stop there, and I'm going to take notes.

Okay, so surely that's provoked some discussion. I heard about four different books in that talk, given the range of things, not to mention the questions you have at the end. I could start things off, but I'll throw it open to the floor first.

I'm curious if you've seen evidence of students or other people altering their behavior once they know about these data collections, like for universities that look at students' browsing habits alongside their academic performance. I wonder if you've seen evidence of students maybe downloading papers that they won't read just so the system will show that they downloaded them, stuff like that.

No doubt, no doubt. It's very hard to track those things, of course. We certainly know in citation behavior that people will have certain obligatory cites that they will make to other people, and they will cite their lab mates and others to
kind of pump up their h-indexes. But any time you take things down to one indicator or one index, you're just asking for it to be gamed, and people will game these numbers, and they'll choose whichever one looks best. There's a wonderful paper out of Mexico a couple of years ago where this group created a bunch of fake papers and just put them up for Google Scholar to index, because Google Scholar is actually a very small project, and they say it's just algorithms, where the publishers' ones do much more editorial work and data cleaning. And sure enough, the paper says, day one, day seven, you could see the index pick them up. Then they started citing these papers, and they watched their h-index just climb across these fake papers. So basically they proved how easy it is to game it electronically. Now, they couldn't get those papers into a legitimate journal indexed by Web of Science, but they could get them into Google Scholar, which is simply a network mechanism. So yes, people definitely do; this has long been known. And if you're interested in the bibliometrics, there's a group at Leiden in the Netherlands, CWTS in Dutch. Paul Wouters, W-O-U-T-E-R-S, is the head of that, and there's a Leiden Manifesto that was in Nature recently, because Europe has gone even more mad than the U.S.
over citation indicators, and they're dealing with the ethics of and responses to those.

Thank you. My name is Tim Finnell's; I'm a medical student from Leiden, the Netherlands. My question is: to what extent can students say no to data collection and still attend a university?

So can students say no and still attend a university? Probably nigh unto impossible, at least under the current mechanisms. The best watchword I heard lately was "let's make our systems information radiant." OK, is that not a great phrase? So we're impelled to build systems that throw off data. I tried doing without a UCLA Bruin card for a few years; I was like the last holdout, just out of objections over these. And after a while, you can't register for classes, you can't use the library, you can't use the health services if you don't have one of these. Now, certainly the data directive and the data policy in Europe are different from U.S. policy. On the one hand, you may have more protections as students in Europe, I don't know, but certainly as faculty you do not, because under these national plans for the kind of metrics to compare countries, and things like the EU funding, they have agreed on something called CRIS, research information systems, a coordinated research information system. Look at euroCRIS, C-R-I-S, and you will see they've come up with standards and metrics to compare universities all across Europe, and faculty have to submit every new publication, every new artwork, and so on as part of their terms of employment. So the metrics are even more intense in Europe in some respects than in the U.S. But for the student ones, I don't know what the degree of protection is. Opting out, as far as we can tell, is nigh unto impossible.
Hi, I'm Saul Tannenbaum, I'm a local blogger. I come at this as a sort of former practitioner, having spent a whole lot of my career doing scientific computing support for researchers and then moving into university infrastructure and dealing with these sorts of data, and having walked away a number of years ago for a whole bunch of reasons related to this. My introduction to the hard part of this field was a 45-minute conference call among top-tier universities trying to understand what the university's obligation was about retaining data about dead faculty. It was a deep and complex discussion, and that was sort of my introduction to the cuckoo house that this is. Having watched this, I go immediately to your last discussion question, what are workable governance processes, and sort of generalize that to: what are workable university governance policies? My observation was that a university's ability to grapple with these questions was partly based on how sophisticated the faculty and staff were at even understanding this. So one question I have is, at UCLA, how good is your board? Do they actually understand these issues? And the other part was whether there are actually workable university governance processes about things in general, whether it's faculty promotion, who owns the room scheduling, athletic fields, I mean physical objects, etc. And if there were fights over those things, then it seemed hopeless for the less tangible objects.

OK, well, let me deal with that in a couple of ways. One is that most of these are policy problems masquerading as technology problems. Some of them, like the dead faculty problem, show up as an IT problem when it lands on the desk of some poor low-level person, and the dean or the widow or widower wants access to the dead spouse's email account and wants the university logon to have
the full access to the journals and all the other services that went with the deceased person's account. And then the poor IT person brings this, after a while and sort of in tears at some point, to the privacy board. This was one of the first things the Privacy and Data Protection Board dealt with: to realize that it was definitely a masquerade, and that this poor person was in the wrong place to be making those decisions, because you had tuition issues, you had registrar issues, you had all kinds of other things up here. So this is where, speaking from having spent 30 years in the University of California, I think we have a more functioning system than most places. It's slow, it's not a fast-moving system, and the deans have far less power than at Harvard and the faculty have far more power than at Harvard. It's a very deliberative process, but it means that we can get these kinds of partnerships going, where we can get the faculty and the people with the money in their pockets in the same room to hear each other. But even at that, UCLA is the only one of the ten campuses that has the joint IT planning board, which has been in place for 15 years, and the Privacy and Data Protection Board, which has been in place for ten years. This is our tenth anniversary, and we've been doing a number of things around that. Those ten years have meant that we've built up a lot of institutional memory, and we've been trying to share that institutional memory around the ten campuses, which is also why we pushed for trying to get similar structures. That's why you want something that is a process, because otherwise it's tabula rasa every single time these things come up. And when you can say, oh, we've seen the dead faculty problem before, we've seen the email problem before, we've seen the graduate student who didn't finish the PhD and can't pay the fees, and the dean who says let's just waive them and give them full privileges, which is a complete violation of the
other accounts that we have with the library and so on and so forth, we've seen this enough times that we know what's an IT problem, what's a privacy problem, what's an IRB problem, what's a data governance problem, so we can begin to shuttle it and move it in different ways. But I haven't seen other people doing it, and we don't have all the answers; this is why we're looking for them.

OK, there's at least three more.

Hi, thanks for your talk. My name is Jess; I'm a postdoc at Microsoft Research. I have a comment and then a question. On your penultimate question, about appropriate criteria or values: to me, legibility is the most important thing. This is just stealing from Daniel Solove (I don't know how to pronounce his last name), but what's truly disturbing about the NSA collection of data, or agency collections of data, is that we don't know when it's being collected, or to what end, or how it's being stored. But if those processes become legible, then you can at least contest the process. That's his thing about the Kafkaesque metaphors of privacy. So when I see governance questions, it's more compelling to me if I can say, here's the reason we're gathering it, or these things are legible. So that's the value that I'm most interested in. But then, as someone who's coming from industry and going into an academic gig next, moving between those things, my question is: do academic institutions have a higher burden, an ethical burden? Are they obligated to be more ethical in a way that we don't necessarily expect of industry researchers or government researchers? What is your sense of whether we're putting a higher ethical requirement on them, and should it be that way? And just for a second I'll digress and say that it's so nice to go through the equivalent of
an IRB process at Microsoft Research. It's just the most sane thing you've ever experienced: you sit down with a lawyer, and they sort of walk you through it, and it's like, oh, this is great, this is so much better than the crazy arcane systems that I've used at many different institutions when partnering with collaborators. If it were just this way, an earnest conversation with a lawyer about potential legal problems, rather than how can I game the IRB system so they won't ask too many questions about my research process and I can get my work done. So, bringing that back: what ethical obligations do we put on academic institutions that we don't put on other institutions, and is that a good or bad thing?

Excellent. OK, so first off, I was really telegraphic about the sets of principles. We're definitely building not only on your basic Common Rule, beneficence, justice, and so on, but on the Code of Fair Information Practices, which goes back to the 1960s and has to do with the transparency of data collection: you should not have secret data collection systems, people should know, people should have the ability to inspect and respond, and you should have informed consent, and so on. So that's very definitely a layer that's built in here. As for the higher status: I've worked with Microsoft Research for about the last 10 years too, and they have certainly become much easier to deal with. The first time I dealt with MSR, there was this 10-page thing and we needed 10 lawyers in a room; later they said, let's just make it open source and a one-page agreement, and then they became so much easier to deal with. But there's a later stage of that, we think, and maybe we're even more concerned about the values as a public university: that the university needs to be a protected space, where students can go into a classroom and assume that they are
not going to be filmed by security cameras, and that they have some equivalent of Chatham House rules: that they should have the ability, in a protected space, to exercise new ideas and not feel like they're going to be quoted or tweeted on everything they say. That is not the set of assumptions that you have in most for-profit business situations. Certainly we want them to get smart about this, so that they don't assume that the protected space of the university goes with them in a little bubble wherever they go. But the nature of education and learning should at least be that experimental, without being creeped out about being tracked on everything, which is why we're so concerned. And because of the way these systems are built, it is virtually impossible to opt out, back to Tim's question before. But the other side of it, and one of the reasons that we've got all these university analytics people on this data governance board, is that we have to report all kinds of statistics to the state of California and to the U.S. federal government, and Harvard has to provide all kinds of analytics around time to degree and diversity and this, that, and the other thing. So there's some amount of analytics you simply are legally required to collect; students cannot opt out of them, faculty cannot opt out of them. And given that you can't opt out and you have to submit them, how are you going to govern these responsibly? So that's the conversation.

There was a question waiting over here, I think. Here, and then here, I think.

Thank you so much for the talk. The first part of your talk was about interoperability and the legibility of data sets for different researchers in different disciplines, and you listed a bunch of government regulations and organizations around that, and cross-border data. I was wondering whether the scientific community at your university is
concerned about replication of experiments, and if so, what are the initiatives that you think are the best ones underway now to permit access to data to replicate experiments? Or what would you suggest to solve that problem, if it's not yet being done well?

UCLA is in a particularly delicate situation at the moment, if you're following the Michael LaCour case. Ah, well: LaCour, and you will find it very quickly. The paper by a UCLA graduate student was retracted from Science, yes. OK, so we are having very exciting times on campus. It made a great end-of-term discussion in my data curation policy class, I can tell you. Yes, we are duly concerned about these things. I have perhaps a different take on the replication and reproducibility problem, which is that for the notions of replication and reproducibility and verifiability and inspectability and legibility and transparency, everyone in this room has a different idea of what they mean. Scientifically, it's the epistemological problem of just what it means to reproduce something. Do you go back out in the field? Do you go back to the data set? The replication data set that LaCour did file was the cleaned data, and that satisfied the requirements of Science, which is as rigorous as anyone. There's a very famous set of studies in social studies of science by Harry Collins about gravitational waves, where the one group only believed the experiments that confirmed them and the other group only believed the experiments that failed to find them. So again you get into deep arguments about what reproducibility means and what you trust in methodological issues. With replicability, there are certain ways you can get to some threshold. We want openness, we want transparency, but if you get too demanding about it, you're going to slow the progress of science, and it's back to the gaming problem: people will file a spreadsheet with unlabeled
rows and columns (we find them all over the place, because that's how they do their work), and that will satisfy the letter of the law. And the LaCour case came out because somebody said, oh, this is just too exciting a finding, let's see if we can do it too, and then started looking closely at the data and following the footnotes. So it wasn't clear that anything in the standard way that science is done could have found it. If somebody is clever enough and fast-talking enough to fool that many people for that long, it's pretty amazing. And we don't want to hang him until he's had his fair trial. There were a lot of checks and balances that it went through, but it came out once it was in the public and enough people were able to have access to it. So this is another concern here: you don't want to be so risk-averse that you lock everything down. That's also where we started on the UC-wide privacy and information security work. The compliance people, the ones coming from the medical side, said, oh, this is a compliance issue, we'll take care of it for you. And the faculty said, no, you won't, because you will lock down every bit of data; we will not even be able to share a spreadsheet with a student without filling out 47 forms if we let it be a compliance issue. We need to do our research; we need a free flow of dialogue, but we need checks and balances and governance. So this is why it's a question of what the triggers are, when you do this, and how risk-averse you want to be. And yes, it's three books in here, but I'm hoping you'll help us figure it out. OK, here.

Hi. I ran a social listening program at a major bank here in Boston, and I am now an incoming faculty member at Syracuse University, at the Newhouse School; I'll be teaching social media there. So I am painfully aware of how effectively you can monitor external social media. In regard to the scope of data governance, I would kind of divvy it up between
what's within your university firewall system and what's external and publicly available information, and then just think about how we are able to actually track students' and faculty's external social media use, what we can do with that, and when we can use it, and have some processes in place. I'm also just curious to see if you've had any experience with that kind of external social media, Twitter, Facebook.

We became aware of this in the very early MySpace days. One of our doctoral students was one of the original MySpace people, so we had sort of an inside view of what was going on. We treated it as a teachable moment at the initiation in the dorms for incoming freshmen: be careful what you say, because it will be with you forever; and getting them to be more sensitized to the whole process. The University of California has had a we-do-not-monitor policy in place for 20-some years; it was revisiting it that led to this. So we are very careful. And being in Los Angeles, guess who's a very convenient target for the recording industry and the movie industry? We have had these really knock-down, drag-out, over-my-dead-body kinds of conversations. They will come to campus and say, oh, it's so much work to monitor for copyright violations, let us just attach to your system and we will monitor for you. And the UC-wide electronic communications policy has protected us, and we've said, sorry, we can't do that. And then they said, well, why don't you go to the UC Office of the President and change the policy for us? There was so much pressure that that was part of why we had to revisit these things in an age of information-radiant systems. We follow the law: if Hollywood finds a violation, we have to trace the IP address and we have to send a nasty letter, but we absolutely do not let them attach to our systems. Traffic shaping is the most that we do. But there
are other universities, some of which we know the names of, that will use these course management systems to say when people are in trouble academically, and then they will look at their Facebook feeds and ask, is the student making enough friends, and then put a red, yellow, or green on them, which we find appalling. But other universities, which I won't name on a webcast, are apparently doing some of this. So it's back to how high a standard you want the university to have. Other questions?

Just a quick question on your privacy and information security policy: are all the policy provisions the same for students, faculty, and staff, or do those different populations have different protections?

This UC-wide policy is more of a principles document. It doesn't get down to very low-level policy, and the campuses, the ten campuses around California, have a fair amount of autonomy in how they implement it. So some of them are divided. This is also why I'm asking this question here about how we should divide these up. Things like FERPA, the Family Educational Rights and Privacy Act (if you don't know FERPA, it's the student records law), are clearly defined by their subjects; academic personnel policy is clearly by subjects. But there are a lot of other things, like that Bruin card and whatever the Harvard equivalent is, that cut across. And we even asked, can we distinguish between students, faculty, and staff? Well, a substantial portion of the students are also employees, so you can't get a clean line there either. So we're having a very hard time dividing people by status. Now, we tried to say that in our committee we would not deal with, say, patients at the medical center, figuring most of that is covered by HIPAA anyway. So we're particularly concerned with kinds of data that are not covered under the obvious, well-known policies, because it's these
things slipping through the cracks, where people go to the IRB and it says it's not an IRB issue, it's not a HIPAA issue, it's not quite PII, but it is definitely sensitive. So we're thinking about it in terms of: what are the risks, do we have consent, what could it be used for, what are the values? Sharing things like student analytics with outside companies, or even with other universities: how do we want to handle that? Because there are more and more efforts at partnerships coming up. So it's not easy, if you can help us. And there are the links again. We're trying to work in the open: there's the link to the data governance work, and there's also a short paper on the privacy and information security work that Kent and Jim and I wrote in a new book that EPIC just published on the future of privacy.

So, Chris, in a couple of places in your slides you talked about data collected by UCLA. But universities are outsourcing all their IT to commercial outfits. I know at Cornell, student email is being handled through Gmail, and many of the videos for MOOCs are being done through YouTube. So isn't university-collected data really becoming sort of moot, and the question is how to interact with the commercial world and set the terms that we want to have?

That's certainly a big piece of it: the outsourcing, and what should we do ourselves, what can we reasonably do ourselves and what not, and what are the trade-offs in doing that. But yes, that cat's out of the bag; there's all this kind of outsourcing. Certainly it matters what kind of contract you write with Google around Gmail. We fought the Gmail train for a long time and then lost, because the Bruin OnLine servers were dying, and they said, do we rebuild them or do we take this very attractive deal? The previous administration had said, once you let go of that infrastructure, you will never be able to get
the internal interoperability back again, because then you've hived off that piece of the infrastructure, and building these other pieces is only going to get harder. But then a later generation came along, and the deal was just too good to pass up, so there we are, dealing with those. So yes, it's how do you govern that. But the other side of legibility is just plain getting people to care. The faculty are so upset about this academic personnel system that they really care right now, so this is a point where we can get them sensitized. But with a whole lot of these things, it's not until somebody's data gets breached, or somebody realizes what kind of dossier exists on them, that it comes up. So we want to sensitize people, we want them to be more active, and we'd like them to deal with it before they end up on the front pages of not only the Harvard Crimson but the New York Times, and become a case study for other universities of how not to handle these things. OK, well-intentioned processes, but there you go.

So, any last question? Well, thank you so much for a stimulating afternoon. What you've heard today is exactly what you would find if you go to Big Data, Little Data, No Data: big issues, a tremendous breadth of knowledge, all presented in a way that's very accessible and very readable, with terrific case studies and plenty to take away with you when you're done. So please join me in thanking Chris for her presentation today.

Thank you, and thanks to all of you. I have several pages of notes to take home.