I've been at Microsoft about 13 years; the last two and a half or three have been in the Microsoft Research organization. If Microsoft overall is a corporation of about 100,000 people, the research division is about 900 people, mostly PhDs in computer science. About half of those are located in Redmond, and half are located around the world in research sites and innovation centers. Whereas most of those researchers are working on longer-term projects, five-, seven-, ten-year projects, our team is much more applied and much more focused: we are looking at time frames of 12, 18, or 24 months for getting projects out there. How we do that, and hence the name External Research, is that we don't do the research ourselves; we fund and partner with researchers outside our company. So we'll work with academia, with government agencies, with non-profits, and in some cases with commercial publishers, to develop technologies and functionality that we think will benefit the community overall; if there's some role Microsoft can play in facilitating that process, that's what we do. That's enough of the commercial talk; the rest will be observations I've made over three years of working in this space. I'd also like to introduce Alex Wade, my colleague from Microsoft Research. If the questions get really tough, I'm going to give them to him. The themes I'd like to touch on over the course of my talk are, obviously, the tsunami of data that is starting to overwhelm scholarship and the daunting issues that raises for the scholarly communication life cycle; and the concept of moving upstream: how librarians, publishers, authors, and individual researchers can begin to move upstream, take more control over the process, and thereby make it better further downstream.
I will also look at how the functionality we need can be surfaced out of the life cycle and integrated into existing tools and existing workflows. One of the huge themes our group likes to pursue in the scholarly communication space is enabling semantic computing; I'm going to talk about what that means and give some examples we've noticed in the ecosystem. One of the largest themes I'll touch on is the provision of services, which I think is the future of scholarly communication. We'll be talking about tools for data analysis and collaboration, and finally some interesting developments related to preservation and the capture and delivery of provenance. I'm going to talk about the potential of cloud services, and lastly I'm going to make some comments that might get me fired when I go back home to Redmond, but hopefully not. I'm not going to spend a lot of time on the data issue itself; I think we're all aware of it. Data-driven science, data-driven research, is operating at an unbelievable scale. We can talk about the numbers, but we literally haven't begun to calculate it; we can envision it, but we haven't really done enough with it. It's a scale problem that we can theorize about, but I don't think we've tackled it at the scale we're going to need to. That said, computing is stepping up both to create the problem and to help us solve it. We're seeing massive data sets, which raises issues around the federation of those data sets, their integration, and the collaboration of scholars across them. The evolution of many-core and multi-core computing is, as I said, contributing to the problem, but it is also going to be the way we address it.
Then there's the potential of the power of the client and the cloud, and how we'll be able to access this data, this information, anywhere at any time. These are the underpinnings of a lot of the solutions I'll be speaking about today. To characterize the situation as it is now: there is the ongoing need to collect this data. Scientists have been collecting data and, in some ways, delivering it as part of the scholarly communication life cycle. But I don't think the rest of the workflow, over the last 300 or 400 years, or even the last 20 or 30, has been able to take good advantage of that. What we are starting to see is data processing, analysis, and visualization; we're still in the very, very early days of this. Then there's archiving: this is traditionally the role of archives and librarians, it's been very paper-based, and the mind reels at how we're going to address it going forward. But this is, as we would say, a real opportunity. These are issues, these are challenges, but these are also the ways we need to solve and address the problem. Again, to characterize the issue, I'm going to touch on a project and share some statistics from Life Under Your Feet. (I'm sorry, I'll move over to the side here.) The Life Under Your Feet project at Johns Hopkins is, again, a drop in the bucket; it's just one example of a research project, and there are arguably tens of thousands of these around the world, but it gives you an idea: roughly 200 wireless sensor nodes, 10 sensors each, and something over 200 million measurements per year over a three-year project. How do they capture this? How do they maintain it? How do they curate it? How do they turn around and compare it with other data sets around the world? How do they share that information?
This is a fantastic project, one of many that Microsoft Research is funding to work exactly on data sharing, data collaboration, and data curation. But it gives you an idea of the magnitude of what we're looking at. This raises an issue that Joe Hellerstein, a fantastic blogger whose blog I highly recommend, refers to as the commoditization of massive data analysis. I've highlighted some of the concepts on the slide, but I like his point that we haven't even arrived at the industrial revolution of data yet. We're in such early days; as much as we think we're advanced and have all these tools and resources, we're at the very, very early stages. As you can see on the slide, I ask a couple of rhetorical questions here. I'm not going to answer them, I wish I could, but I can guarantee the answer is going to be yes; on the cloud question, the answer is going to be yes. To put this problem in the context of what it looks like to your typical e-scientist or e-researcher: you've got data coming from experiments and instruments, data coming from pre-existing archives, searches of the literature, and data from simulations that might be based on the first two. And then you've got all the problems that follow. The amazing thing is that right now the individual research scientist or the individual lab is having to take all of that on, while on the other end you might have the library, the archive, or whoever in the university or the organization is in charge of the repository trying to receive it at some other point. It's an incredibly complex workflow, and one that has grown expansively in the last 20 years.
We've all tackled bits and parts of it, and we've arrived at a handful of best practices, which we're going to be chatting about here. But there is an immense amount of opportunity, and I will stress that word again, in this space. Now, I mentioned the concept of moving upstream. I'm going to use the word opportunity about 400 times, I think, but this represents opportunity for us, and I mean scholars, researchers, corporations, open access publishers, librarians, and archivists. There is an immense amount of work that needs to be done, but it's an exciting time, because there is really no baseline; there is only the opportunity to establish one. So let me give you some examples of what I mean by moving upstream. There's an organizing metaphor I'd like to refer to: the research life cycle, or what I'll call the scholarly communication life cycle. There's the idea of collecting your data, doing your research, doing your analysis; then moving to a phase of authoring, where you've got your thesis and you write it down; then the need to publish and disseminate it. You might be blogging about your research results, writing an article or a book, or giving a paper at a professional conference, but somehow you make those first two steps public in a way that shares them with your colleagues in the domain. And then, ideally, you move to a phase where you store it, archive it, and preserve it for future generations and for others. So think of these as the four core steps of this life cycle. I'm going to add two other concepts: the need to collaborate, within and across those four core steps; and the need for discovery, the need to search and find information within and across them. I list this as a kind of organizing metaphor.
Traditionally, from a library, archives, or curation point of view, the archival end is typically where, I won't say where the process starts, but where libraries and archives spend most of their time. Where we need to really think about data curation is further back up that life cycle; that's why I mentioned the moving-upstream concept. We need to be addressing a lot of these problems much earlier on. My observation from working with leading-edge innovators in eScience research is that the most forward-thinking people in this space are the ones who realize that and are working very hard at data capture and data curation at the very early stages: worrying about how to capture the information, how to share it, the protocols, and scripting all of that very carefully. Those are the people we're trying to work with in terms of best practices. Then there's making sure it moves throughout the entire life cycle, rather than operating in silos. The trick there is integration: how do we, as we move from step to step, take not only the data but also the metadata and the provenance and ship it all along? And this is another point I'll make later in the talk: it's very problematic. There are domains that are very cohesive and very non-proprietary about their research; there are others that are very proprietary, where establishing data curation standards and protocols can be very difficult. But it's incumbent on the data curators at whatever stage they may be, maybe publishers, maybe librarians, maybe people in the labs, to integrate and to move things from system to system based on those protocols and on interoperability. So the interest from my company's perspective is: how can we facilitate the move beyond what we think of as 400 or 500 years of the scholarly article?
How can we move from these very static summaries to much richer information vehicles? As you can see here, the pace of science is picking up and the status quo is being challenged; there are innovators and there are people dragging their feet, but we maintain, and I'll provide examples from others in the ecosystem, that there is an immense amount available, which is why I'm pointing at what lies below the surface. Imagine that the container that is the scientific article had the ability to facilitate reproducible science: when you get a paper, you also get the data, and you get a machine-readable methodology, so that you can reproduce that methodology against your own data set, or share your data set with someone else and let them run the methodology. Interactive data: the ability to pull out a table, change a figure, easily insert it back in, and ship the document off, or do all of this on the web without thinking about a specific document, whether a PDF, a Word document, or something online. The ability to collaborate in real time with other researchers; dynamic documents that literally change on the fly; the ability for reputation and influence to be captured, passed on, and shared. These are all themes we're starting to see in examples online; certain commercial journals and open access journals are all starting to dabble with this. A couple of specific points I'd like to call out: on the commercial side, we're seeing interesting work from Elsevier, which has been running its Article of the Future competition and also issues a Grand Challenge every year. These are producing some fascinating ideas.
I'll be making this slide deck available, and these are live links if you're not familiar with them. These competitions are about two years old, I think, and every year they get a fair amount of attention. They're obviously very scientifically focused, because it's Elsevier, but they show amazing ways of working within different domains to demonstrate how collaboration and dissemination of information can work. Recently, probably about a month ago, as you might have heard, the Public Library of Science (PLoS) started a new channel for delivering information that they're calling Currents, with a specific focus on the H1N1 flu. What they're doing, as you can see here, does not go through in-depth peer review. The whole point is not to drag things out through a 6-, 12-, or 18-month peer review process, but to have a handful of experts in the field review a submission very quickly and get it out as fast as possible. This is a partnership between NIH and PLoS, published on Google Knol, and I think it's a fascinating thing. A lot of journals would blanch at the concept, because this is very important scientific information that isn't receiving full peer review. And it's not that different from Nature Precedings, a preprint service that Nature Publishing Group has been offering for two and a half or three years. These are interesting because they probably represent the future: you're starting to see a chink in the armor of the peer review system. People are starting to say it's more important to get information out; maybe it's not the 99 or 100 percent solution, but if it advances science, let's get it out there as quickly as possible.
Another interesting thing announced over the last six or nine months that has garnered a great deal of interest is Google Wave. I haven't spent enough time on it myself, but watching the blogs and the people we work with, there's a lot of interest in what it means. Time will tell whether we see adoption of Google Wave, but the opportunities for scholarly communication have been highlighted, and some of the leaders and innovators in the field have jumped on it and on what it could mean for change, especially in the commercial publishing world. Lastly, another part of the chain: Mendeley, which I believe is PC-based, and Papers, which is a Mac-based application. Many people have referred to Mendeley, and Papers as well, as a kind of iTunes for academic papers: a way to download and store what is basically your own repository of papers on your desktop, which then recommends, "if you're interested in these papers...", not unlike Amazon's recommender service. And this is amazing, as you'll see: 60,000 people have already signed up, four million scientific papers have been uploaded, and that number is doubling about every 10 weeks. This is a small company we chatted with a couple of times, probably six months ago, and they've seen some amazing growth. So you're seeing peer review scaled back a little, papers being distributed in different ways, not unlike arXiv.org, new paradigms for how information is transferred within and across academic domains.
That raises the question: how are we going to see the world of scholarly publishing as we've known it for several hundred years change? In a world where open access takes a larger role and the idea of paid, walled content starts to break down, how will different entities continue to derive value out of that content? It's not going to be the content itself; it's going to be the services added on top of it, and there are many ways we'll see that happen. The benefit is that we're now finally in a world where research is easily shareable. We're not talking about 100 pages of tables of statistics in the back of a journal from 1800; we can now send a spreadsheet or download something from a web service. So the data is now easily shareable, and while interoperability is still a bit of an issue, I think we're going to tackle it. For an interesting scenario related to data sharing, many of you might be familiar with the Sloan Digital Sky Survey.
This is fantastic work by astronomers at many institutions and organizations around the world. They've got three terabytes of fully public information from 13 institutions, some 300 million objects, et cetera: an amazing job of pulling data from different sources and formats, integrating it, and making it available in a consumable form. On top of that, the amazing thing is the usage: 350 million web hits in six years, and, let's round up, a million distinct users, versus a worldwide population of professional astronomers of perhaps 10,000. So when you put this data out there in a consumable form, you see much broader interest from outside the domain, which is fascinating to me. The greatest example I can point to is Galaxy Zoo, which is basically the Mechanical Turk concept applied to this data: making it available and having people drill down into it and help classify different aspects of it. Since it launched, I think about 18 or 24 months ago, Galaxy Zoo has received a lot of publicity, and there has been a lot of public access to this information, with over 100,000 people participating. The story you may have heard is about, I believe, a Dutch homemaker who was looking at this, found a unique object, and flagged it. It immediately got escalated to some of the most senior people involved, and it was referred to as a blue object of a kind they had never seen before. Based on that, several months ago they actually turned the Hubble telescope to look at that specific object, and sure enough, this woman had identified an object that had never before been discovered. It's an amazing demonstration of sharing data and of the possibilities for citizen science: doing something that those 10,000 astronomers might not have been able to do for many, many years. Obviously, there are concerns
with data sharing. The astronomy community is at the forefront in terms of making this information available beyond its domain, but it's an immense task: integrating the data and ensuring interoperability among the data sets; annotating it, so that people can say "wait, we need to look into this," or "this is an anomaly," or "who did this?", which touches on the next point, provenance and the quality of the data, annotating at a high level and perhaps even at the individual atomic data point; exporting and publishing it in agreed formats, mapping back to the interoperability issue; and then security: how do you share this, when do you share it, and with whom? These are all issues that in many domains have stopped people from sharing data and have forced communities to become proprietary. But I stress: I would look at the same list and say these are exactly the opportunity areas. Whether it be libraries, institutions, or commercial publishers, the people who can address these problems in clever ways are going to be the market differentiators moving forward, and so these are exactly the areas where I would love to see institutions, scholarly societies, and libraries work in an open and interoperable way. Other services: data analysis. I'm listing a handful of names here: Swivel, which is independent; IBM's Many Eyes; Google's Gapminder; Metaweb's Freebase; and there are many others. Some of these are on a different scale, but the idea behind all of them is that they are open, web-based services. Some have baseline services available for free, and if you want premium services you subscribe at a different level. But the idea is that you can take multiple
data sets and combine them. As one example, I'd encourage you to go to Swivel and look at the public data sets people have loaded up; again, there are provenance and security issues, but for the ones that are there, the data mashups you can create with the free tools are amazing. If you want to load your own data and run it against itself or share it with your colleagues, these free services are now available. To me, these are fantastic tools that the academic community could be using, or using as models: maybe you don't want to use these exact services, but you might want to build services in a very similar vein. I personally have only used Many Eyes, for textual analysis, and it's a fantastic, fantastic tool; I'd recommend checking it out if you haven't already. Based on these tools, we're really starting to see the publishing ecosystem shift, and we're seeing some of the commercial publishers investigating; things like the Elsevier Grand Challenge are starting to say, "can you give us some ideas, and maybe we'll incubate and pursue them." We're definitely seeing publishers think about these sorts of services. I'm really interested in the parallel with another world I'm very familiar with, software: with open source, IBM and Red Hat have built businesses not by selling the code but by selling services on top of the code. I think that's a model a lot of publishers are going to be looking at and adopting: the content can be free, but the services are where the value, the differentiation, the remuneration might be. The other thing is that this enables very rapid prototyping, especially if you're not focused on
the peer review process as we've traditionally known it. The ability to build analytical services, visualization services, et cetera allows very rapid prototyping and lets very unique things happen. We can envision, and it's already starting to happen, days when repositories won't hold simply the full text of research papers but will include and incorporate the supporting material around those papers: data, images, maybe software, simulations, et cetera. We need to evolve away from the traditional concept of the paper and think about all of the content in and around the paper and what it could represent. I'm stating the obvious on the last point, but to make all of this happen, enhanced interoperability protocols are very, very necessary. A couple of examples to illustrate the point: many of you are very familiar with data.gov. It was fantastic to see, shortly after the change of administration, a push within the government to say, "we've been collecting this data and making it available to ourselves for a long time; there's an onus on us to turn around and make it available to all the taxpayers." It's been fantastic to watch data.gov take off, and, not unlike the Swivel concept, we're starting to see some really solid baseline services delivered this way as well. Another idea I like quite a bit, which I'm not sure folks are familiar with, is worldwidescience.org. This is an amalgamation of government agencies and other bodies around the world that are responsible for maintaining scientific and technical information on behalf of their countries. The types of groups participating from the United States would be the Department of Energy or NSF, et cetera; let's see if they have the number here, it's probably somewhere around 60 or 70
different countries that are participating. They are pulling together all of their scientific information, in a kind of UN of scientific data, and providing a federated search across it. I think this is fantastic because they're not going through publishers: it could be the British Library, individual agencies, or the scientific and technical information agencies of Korea and Russia, et cetera, getting together and doing this regardless of borders and regardless of language. It's a fantastic model: certain institutions say "we'll worry about the data curation," others say "we'll worry about the crawling and the indexing," and they divide and conquer, again in a UN-like fashion. A lot has happened in this space literally over the last two to three years, so this is a model I think is worth emulating. Now, the concept of enabling semantic computing is an interesting one. The tough thing is that people have been talking about semantics and semantic computing for years, and we're still talking about it; we're still not really feeling the benefit of it. I'm one of the people who has jumped the chasm: I definitely feel there's going to be extreme benefit for us in the long run, and a lot of the efforts within our research group at Microsoft are focused on enabling this. But it still might be a while before the rank and file are able to enjoy the benefits of what we're going to see in this space. Pointing to another blogger: Cameron Neylon from the UK has a blog called Science in the Open, and I love the highlights here, where he says the laboratory record is reduced to a feed which describes the relationships between samples, procedures, and data, stressing the point that, yes, the data is great, but the next
step is understanding the relationships: the relationships between the data, and between the data and the paper, the methodology, and the co-authors, and the expansion of linked data out from the individual atomic data points. He points out that what it also requires is good plugins, applications, and services to help people generate the lab record feed, along with a minimal, sensible way of describing the relationships. Again, here's the opportunity: stepping away from simply the paper and starting to think about the data and how we can enable that. The whole promise this represents, and I'll probably skip ahead to the next slide, is the ability to tag and identify these relationships and let machines generate intelligence from them: connections that our minds will not be able to capture, but that we'll be able to see via high-performance computing. I want to stress that there's a distinction between semantic technologies and the Semantic Web. The Semantic Web is one of many tools at our disposal, one manifestation, one channel; there are semantics-based solutions that have to be operationalized at much lower levels. So how do we start to arrive? How do we take advantage of these things?
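To make Neylon's "feed of relationships" idea concrete, here is a minimal sketch, in plain Python rather than any semantic-web toolkit, of a lab record reduced to subject-predicate-object triples. All of the sample, procedure, and dataset identifiers below are invented for illustration; real linked-data systems would use URIs and a triple store.

```python
# A lab record as a feed of (subject, predicate, object) triples.
# Every identifier here is hypothetical.
triples = [
    ("sample:soil-042",  "producedBy",  "procedure:core-extraction"),
    ("dataset:moisture", "derivedFrom", "sample:soil-042"),
    ("dataset:moisture", "analyzedBy",  "procedure:regression-v2"),
    ("paper:2009-003",   "reports",     "dataset:moisture"),
]

def related(entity, triples):
    """Return every (predicate, other-entity) pair touching `entity`."""
    out = []
    for s, p, o in triples:
        if s == entity:
            out.append((p, o))
        elif o == entity:
            out.append((p, s))
    return out

# A machine (or a curation service) can traverse the record without
# reading any paper: everything one hop away from the moisture dataset.
print(related("dataset:moisture", triples))
```

Once relationships are published this way, the traversal above is something a service can run across thousands of lab records at once, which is exactly the machine-generated connection-making described here.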
So yes, there's the idea of leveraging collective intelligence. At a very simple level there are recommender systems: Last.fm, or, when you're buying something on Amazon, "other people who bought this also bought..." That's brain-dead in a lot of ways in comparison to the promise of semantic computing. There are applications of these recommender systems in scholarship, things like BioMed Central's Faculty of 1000, but by comparison these are very manual; not literally manual, but still based on peer review or fairly basic social networking. Then there is the idea of automated correlation of scientific data and smart composition of services and functionality: things we would never have been able to compute or think of on our own. If we're able to describe the relationships between things and let the computers make the connections, we're going to be in a much better place when these things are realized. The other point I'd like to stress is that we're not all going to have supercomputers under our desks to do this. In the near term, I envision, and believe very strongly, that as an academic community we're going to be leveraging cloud computing more and more. We're seeing a lot of usage around the world: scientists would all like massive computing power, and a lot of them are starting to use things like Amazon EC2 and Amazon S3 to store their information or do their calculations, and I think that model is going to ramp up quite a bit. So, reiterating the last point: imagine a world where all data is linked. This is the linkeddata.org diagram, an ever-evolving picture of linked data; it's very much worth a visit to that site.
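The "people who bought this also bought" pattern mentioned above reduces, at its simplest, to co-occurrence counting. A minimal sketch, with invented paper IDs and user libraries, of the kind of item-based recommendation a service like Mendeley might start from; production systems add normalization, popularity damping, and much more:

```python
from collections import Counter

# Hypothetical user libraries: which papers each researcher has saved.
libraries = {
    "alice": {"paper_A", "paper_B", "paper_C"},
    "bob":   {"paper_A", "paper_B"},
    "carol": {"paper_B", "paper_D"},
}

def also_read(paper, libraries, top_n=2):
    """Rank papers by how often they co-occur with `paper` in libraries."""
    counts = Counter()
    for papers in libraries.values():
        if paper in papers:
            for other in papers - {paper}:
                counts[other] += 1
    return [p for p, _ in counts.most_common(top_n)]

# paper_B co-occurs with paper_A in two libraries, paper_C in one.
print(also_read("paper_A", libraries))
```

The contrast drawn in the talk is that this only counts clicks and purchases; a semantics-aware system would instead follow declared relationships between data, methods, and papers.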
There's an immense amount of potential there. When we have the ability to demonstrate and catalog these networks, and when everything is stored, processed, and analyzed in the cloud, you can imagine the kinds of services that can be generated almost automatically: knowledge management, knowledge discovery, visualization, storage services, notification. You can easily start to capitalize on this, especially if you're not focused on the storage and the compute; if you can essentially outsource that and focus on delivering services, it's a much greater value-add for our respective roles, whether as authors, scientists, librarians, or archivists. When we talk about cloud computing, just as a point of distinction, many people have broken it into three categories; this framing is actually from Tim O'Reilly. There's the very baseline infrastructure, utility computing; moving up a level, there's the platform, platform as a service, with examples like salesforce.com, Google's App Engine, and many things in that area; and finally, the thing we're all most familiar with, software as end-user applications: things like Facebook, search, Amazon, et cetera. Those are the three levels, and as I mentioned earlier, most of us in the scholarly communication or research world would love to not worry about the lower two and spend our time at the application level. So this is the opportunity space. It's a fairly obvious concept, but the point is that we don't all have to have a supercomputer sitting under our desks. There are many reasons that are going to lead us to cloud computing: it takes a huge chunk of time, resources, and effort off the table and lets us think about much higher-value services.
Certainly the cloud landscape is still developing, but to get an idea of the services that are already out there, services we may not even think of as cloud services: there's Flickr and SmugMug for photos, and for video, obviously, we know about YouTube. How many people are familiar with SciVee? A handful, four or five. SciVee is basically YouTube for scientists. It's free; it was actually NSF-funded and came out of UC San Diego. The idea is: I've got a 75-page physics paper and I really don't want to read it. I'm exaggerating, but instead of reading the abstract or the entire paper, why don't I go and watch the author who actually did the work give a five-, seven-, or ten-minute video that explains the concepts and maybe actually demonstrates them? There's another similar thing called JoVE, the Journal of Visualized Experiments. It's an amazing use of video that sits as a kind of layer above the article, and it could also save a lot of money on traveling to conferences if you just want to take a look at these things online. I'll jump back to SmugMug for a second: how many people are familiar with SmugMug?
Only three or four. The amazing thing about SmugMug, which is like Flickr, is that they own no infrastructure. They are completely utilizing Amazon S3 and EC2, so they have zero IT infrastructure and have built their entire business on top of Amazon. That's a risk, but I think in a lot of ways they're a very forward-thinking company: they're saying, we're just going to take that part of the equation off the table and worry about doing more value-added services. Other examples include SlideShare for presentations and Google Docs for word processing and spreadsheets. A lot of these end-user tools are already available in the cloud, but in a certain way they only go so far; they're maybe not the last mile for what scholars and researchers need. There's a gulf there that is, again, an opportunity for people to step in, take advantage of these resources, and build compelling, value-added services in the middle. I mentioned Amazon's S3 and EC2. I'm going to skip over the DuraCloud project because I'll have a slide on that in a moment, but the idea is that we could start leveraging these resources to do the hard aspects of archiving and preservation, things that have heretofore been very difficult and have kind of prevented us from doing it. These things are starting to get a lot easier. Then there's the idea of new business models developing in this space: service provision could be a new path to sustainability for some journals or some societies. I'm also interested to see the NSF DataNet solicitation. They're actually in round two, doing a second round of solicitations. These are $20 million, four-year grants that are being given not to one institution but to coalitions of institutions that come together.
In the first round, Johns Hopkins with probably about eight or ten partners got one of the awards, and I'm trying to remember whether the other went to Oak Ridge and the University of New Mexico, but those were the first two awardees. The idea behind these is to say: all of the problems I've been talking about around collecting data, storing data, and preserving it so that scientific data stays available. NSF said, we realize this is a problem, and we're going to get groups of institutions, not just individual institutions, together with a specific mandate: to create sustainable digital preservation solutions, and to turn around and make them available. So they have to actually produce something, not just a prototype, but something that can be shared with the community. NSF is in the second round of awards for that at present. To me it's a very interesting evolution in thinking: these aren't based around developing new data; it's about taking existing data, and data moving forward, and asking how we preserve it and ensure it's going to be around. And the mandate is not just "it's a four-year grant and then your funding runs out." There's a stated goal that these have to be sustainable, and you have to build a business model for how they will continue after the initial four-year grant. Then, preservation and provenance. I mentioned the DuraCloud project. For those of you familiar with DSpace and Fedora: those were two separate open-source repository software packages, and the organizations behind them merged into a single entity about six months ago, now referred to as DuraSpace. They are initiating a project called DuraCloud. The idea is that at the top level you might have hundreds, thousands,
and in fact we're probably talking somewhere between two and three thousand institutional or centralized repositories around the world. Several here on the Harvard campus, several over at MIT, et cetera, but two or three thousand of these worldwide. Right now they don't have a very good backup, a plan B or plan C, in terms of how they're preserving that data. Some are experimenting, but not in a way that could scale or address the problem in a much larger, more organized fashion. So what the DuraSpace team did was go to the Mellon Foundation and say: we would like to build a scalable, large-scale business around preserving information for the institutional repositories at these academic institutions. What DuraCloud will be doing is providing this durable storage service layer. Let's take the DASH repository here at Harvard: DASH could go to DuraCloud and say, I would like to sign a service-level agreement with you; I will keep a local store of my data, but I will store a backup or a mirror with you. Then what DuraCloud does is turn around and go to HP, Microsoft, Google, Amazon, whoever, and sign agreements so that they can distribute and share multiple copies across different locations, et cetera. So there are two agreements going on, but I think it's fantastic; this is the best application I can think of of cloud computing for preservation, and at large scale. I know the DuraCloud people are talking with everyone. We hope to be one of the providers behind the scenes as well, but I just think this is a brilliant approach. And obviously, institutions will be paying for this service, so it will be funded in that way, and these services will attract that kind of ecosystem as well.
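The two-agreement model described here (the repository signs with DuraCloud, and DuraCloud signs with the storage providers) boils down to a mediation layer that writes each object to several backends and can later audit the copies. A toy sketch with in-memory stand-ins for the providers; the class and provider names are hypothetical, and this is not the actual DuraCloud API:

```python
import hashlib

class DurableStore:
    """Toy mediation layer: mirror each object to every provider, verify by checksum."""

    def __init__(self, providers: dict[str, dict]):
        self.providers = providers  # provider name -> simple key/value "cloud" store

    def put(self, key: str, data: bytes) -> str:
        """Store the object with every provider; return its checksum for later audits."""
        digest = hashlib.sha256(data).hexdigest()
        for store in self.providers.values():
            store[key] = data
        return digest

    def audit(self, key: str, digest: str) -> dict[str, bool]:
        """Check whether every provider still holds an intact copy."""
        return {
            name: hashlib.sha256(store.get(key, b"")).hexdigest() == digest
            for name, store in self.providers.items()
        }

# Hypothetical repository depositing one item, mirrored to three providers.
clouds = {"amazon": {}, "azure": {}, "rackspace": {}}
store = DurableStore(clouds)
checksum = store.put("dash/article-42.pdf", b"%PDF-1.4 ...")
print(store.audit("dash/article-42.pdf", checksum))  # all True

# Simulate silent corruption at one provider; the audit catches it.
clouds["azure"]["dash/article-42.pdf"] = b"corrupted"
print(store.audit("dash/article-42.pdf", checksum))  # azure is now False
```

The value of the middle layer is exactly this separation: the repository deals with one service-level agreement, while replication, checksumming, and provider churn are handled behind the scenes.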
Another blogger, John Wilbanks, on his Science Commons blog, raises an interesting point that I will be closing on: we need more than computers, software, routers, and fiber to share scientific information more efficiently. We need a legal and policy infrastructure that supports, better yet rewards, sharing. The point I close on here (and I'd encourage you to read more of his blog) is what I hope doesn't get me fired back in Redmond: at the end of the day, it's not exclusively software. Yes, I've been talking about the opportunity, et cetera, and I wish I could say I've got the answer, here it is, X, Y, Z. But at the end of the day it's about making sure that people are participating in this: that we encourage them, that we provide incentives and rewards, that we perhaps revisit what peer review looks like, or what the tenure process looks like. These are things that have to change, that have to evolve, and it's going to be different in every domain and from institution to institution. As much as I'm a software provider, and as much as I'd love for there to be some code we could write that would address the situation, at the end it's a sociological issue that has to be addressed as well. It's a key part of the equation. So I will point you, this is the last tiny bit of a commercial, to a website that describes some of the work we're doing at Microsoft Research: research.microsoft.com. And lastly I'll share my email contact and a website about the specific area we work in. So I'd like to open it up to questions. There might be questions coming in from the Twitter feed, but thank you very much. We can certainly make the slide deck available as well. Great, okay, thanks. Oh yeah. Well, again, some of it's happening already.
I had a couple of examples, Elsevier and PLoS, very interesting things. I look at PLoS, the Public Library of Science, as a very forward-thinking organization, a leading-edge innovator, but the domains have to latch on and start taking it and running with it. Another example is Google Wave: fascinating potential, but it's been tossed out there and it remains to be seen whether communities are going to adopt it. In a lot of ways, you look at domains like astronomy or physics or chemistry, very guild-like, kind of old school, and you have some innovators and some laggards, and it's going to be interesting to see when a tipping point happens in certain areas. I feel like the onus is on the people in this room, the people in these institutions, to say: how do we lead them across that chasm? What are the compelling rewards or incentives or services we can provide to say, don't do it the old way, do it the new way? So I wish I knew, I wish I could say when, but I think it's going to vary from domain to domain and from institution to institution. Yes?

[Audience question, partially inaudible, about what will come down from the consumer space versus from institutions like NSF and its cyberinfrastructure programs.]

Yeah, so to quickly characterize the question: if we look at the consumer space, what implications or examples could we start seeing bleed over into the academic or public sector? I think it's an amazing space to watch. I tried to list a couple of examples, like SciVee, like SlideShare, et cetera. There are cases like SciVee where you can say: okay, yes, there's YouTube, but how do we build a special application of that for scientists? We see a little of that. And again, I think the consumer space is fantastic.
It's interesting: we used to think of the business space as influencing the consumer space, and we've seen a complete reversal of that. We're seeing a lot of things bubble up from the consumer side and get applied to the academic world, or to the private-sector world. I'm trying to think of other examples I've seen. Again, from my perspective, it all depends on when something gets adopted by the leading people in the field, the innovators in the field, and whether they can jump that chasm and make it more broadly useful or broadly applicable. So it varies. There are certainly examples of it, SmugMug, et cetera. But yes.

So I do want to ask a question. When proselytizing internally at Microsoft, what do you see as the issues that need to be overcome, if any? And how's it going?

Yes, oh yeah. So whereas 50 percent of my job is talking externally, or I should characterize that as talking some and trying to listen more, to your point, the rest of our job is going back inside the company and acting as a conduit, an advocate, for the higher education space overall. Whether that's the teaching, learning, and education space or the research and scholarly communication space, myself, Alex, and others in the External Research group are literally your advocates, turning around to make the case, the business case, within the company. That's our job inside the company. And what we've done over the three years our specific group has existed is exactly that: come out to venues like this, sit in meetings, attend academic conferences, listen, and hear what the pain points are.
We find out what the innovators are doing, whether it's on our software, on open-source software, or on competitive software, and then turn around, go back to the product groups, and say: guys, to your point, look what these leading people are doing, look what's happening in the consumer space, and look at what we can do inside Word, with SQL, with SharePoint, et cetera.

Oh, okay, that's a whole different presentation. We had another meeting on campus where we did a little bit of that this morning. So yes, we are building; we've probably built six or eight add-ins for Microsoft Office that facilitate chemical drawing, semantic markup, writing academic articles, and the consumption of ontologies in journals, or I should say in Word documents. We built a tool on top of SharePoint that facilitates collaboration among scientists and academics. We built a repository software platform that allows people to build repositories. I'd recommend checking out the website, microsoft.com/scholarlycomm, and that will take you to the list of the projects we've produced. And separate from our group, Microsoft has now set up an open-source foundation, and that really unlocks a lot of things that were blocking Microsoft in the past. We're actually able to contribute to open-source projects now; we just contributed 25,000 lines to the Linux kernel for driver support. So there are things that, even where people at Microsoft had philosophically come around, we could not do before, and now we're able to do them. There's a lot more that could be said, but you probably get the idea. Yes.
It's more a comment than a question. I'm at the Institute for Quantitative Social Science, and we have built an end-user application that addresses the concerns you raised about data sharing, and in addition it adds data analysis, controlled vocabularies, mostly for social science, and support for social network data. It would be very interesting to show you, because it goes through all the issues you've raised in this area; we've tried to respond to them one by one. And I think what has been less discussed is how people actually adopt it afterwards, right? It's not all in the software. We provide this as a service; the software is open source and free to adopt, but the main thing that has helped adoption among researchers around the world has been letting them keep ownership of the data, or at least providing the perception of ownership. So they create their own site, they upload their data there, we take care of all the archiving and provide all the services to their site, and they can basically build a Facebook of research data.

That sounds like a fantastic best practice. I definitely want to talk to you, and I'd like to have a slide about the work you're doing. It's fantastic. Oh, please, let's follow up. Is it called Dataverse?

Yes, Dataverse, and the lead developers are here.

Okay, fantastic. Other questions? Yes.

Many of your examples were data sets; we're in sciences with data sets. But data gets used and papers get written, and I was wondering how published texts are going to keep up with data sets. Elsevier is doing something, and things are going to be developed there, but now it sounds like you are developing things for written texts, for actually authoring the text?

Yes. There are many ways of thinking about it.
A lot of the work our group is doing asks: how can we take documents that are traditionally written as a Word file and evolve that? How can we capture it in XML? How can we embed data sets, images, et cetera, in the document and send it around? How can we semantically mark it up and include more information? That's one idea, and these are some of the experiments we're doing. Some of the other things I've talked about say: let's not worry about PDFs, let's not worry about Word docs. How can we do it on the web? Things like Google Wave, or just representation in HTML. There are groups saying we don't need to think about these old containers anymore; we can do it in a virtual way on the web. Those are some of the other areas of exploration, and we want to be able to look at and observe all of them.

Yes. Are you addressing that?

Yes, in part. Again, we're taking the tools people are most familiar with, which is one approach: in certain areas, Word or PowerPoint or Excel is a very common tool for authoring or capturing information. In those areas, we're trying to build ways to allow scholars and researchers to define relationships, to import ontologies, to mark up the document, and to share that, maybe with a repository or a database or a publisher, when they submit it. So yes, we are doing that. There are other examples in our work: the research-output repository platform we built takes advantage of this, doing the semantic calculations to pull out and document those relationships. So there are examples of what we're doing, as well as other organizations, and partnering with other organizations. Yeah, I'll stress that: as much as I'd say, woo, it'd be great if you did everything on Microsoft,
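As a concrete illustration of the "capture it in XML, embed the data, mark it up semantically" idea, here is a minimal sketch using Python's standard library. The element names and ontology URI are invented for illustration and are not any particular publisher's schema:

```python
import xml.etree.ElementTree as ET

# Build a small article with machine-readable metadata and an embedded data set.
article = ET.Element("article")
meta = ET.SubElement(article, "metadata")
ET.SubElement(meta, "title").text = "Boiling points of common solvents"
# Semantic annotation: link a term in the text to an ontology concept.
ET.SubElement(meta, "term", {
    "label": "ethanol",
    "ontologyRef": "http://example.org/chem#ethanol",  # hypothetical URI
})
# The data set travels inside the document instead of as an opaque attachment.
dataset = ET.SubElement(article, "dataset", {"units": "celsius"})
for name, bp in [("ethanol", "78.4"), ("acetone", "56.1")]:
    ET.SubElement(dataset, "measurement", {"compound": name}).text = bp

xml_text = ET.tostring(article, encoding="unicode")
print(xml_text)

# A downstream service (repository, publisher) can query it without guessing.
parsed = ET.fromstring(xml_text)
print(parsed.find("dataset").get("units"))                       # celsius
print([m.get("compound") for m in parsed.iter("measurement")])   # the compounds
```

The point is that once structure and semantics live in the document itself, a repository or publisher's pipeline can extract the data set, the ontology links, and the units mechanically, with no human re-curation step.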
we know that's not going to happen. So our main focus is to say: when and if you use Microsoft software, we want to make sure it's not a hindrance but facilitates your interaction, and that if you put data into it, you get the same data or better data out of the process. That's the kind of effort we're making at Microsoft Research.

I have a question. What are the latest developments in how data is being tagged or marked up semantically, automatically, on the fly as it's generated? A lot of the time there seems to be this curation step: scientists take an Excel spreadsheet or whatever it is, put it into the computer, and eventually it ends up tagged. But it seems like, before this really takes off, the machine doing the measurement has to know how to tag it. What's going on with that?

I can give that one to Alex; it's an excellent question. One idea: there are machines that are essentially blogging, able to put their readings out as they come off the instrument and feed them into a certain place where they're tagged automatically. But I don't have enough background there myself. As for what we're doing within Microsoft, I can talk a little, and I'd point to Alex and mention Chem4Word: we have built an add-in that facilitates chemical authoring and semantic markup for chemistry. Yeah, and that's definitely part of it; it's within the authoring environment.
The two examples I think are most interesting right now (there are multiple examples of this): the chemistry group at Southampton University has all of their instruments measuring environmental variables and machine outputs, so every single thing that somebody used to have to record in a lab notebook is being recorded for them and output directly to blogs. When you go to record your experiment, you say, I was in this room at this time, and you're effectively creating a data mashup between your lab notebook and what the room is telling you went on.

Is that where Cameron Neylon is? Yes, exactly.

When you move to the automatic step, there's a group at the Oxford e-Research Centre doing work in the cancer research space that allows researchers to semantically annotate within the Excel environment. So instead of just having a column and a row where you may be able to tell it's a temperature but not whether it's Fahrenheit or Celsius, or not know which instrument recorded this particular stream of data, the underlying XML of that Excel spreadsheet is annotated, pointing to an ontology that says: this experiment was used, this is how the data was gathered, these are the units it's in. Then when you upload it, there's no further processing; if it goes into another format or another system, all of that is carried along.

The units problem really seems like a tricky part. And that's able to work? Fine, thanks very much.

Another example to look at, in terms of what I'd call best practice in data curation, is fluxnet.org. That's something like a hundred different organizations worldwide that are all capturing flux data from around the world and working in a cooperative single framework to capture that information and share it.
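The units problem raised here, not knowing whether a bare number is Fahrenheit or Celsius, is exactly what column-level annotation addresses. A dictionary-based sketch of the idea (field names and identifiers are hypothetical, not the actual Excel add-in's format):

```python
# Each column carries its own annotation instead of relying on the header alone.
column = {
    "values": [36.6, 37.0, 38.2],
    "annotation": {
        "quantity": "body temperature",
        "unit": "celsius",
        "ontologyRef": "http://example.org/units#celsius",  # hypothetical URI
        "instrument": "thermometer-A7",                     # hypothetical ID
    },
}

def to_kelvin(col: dict) -> list[float]:
    """Convert to a canonical unit; possible only because the unit is explicit."""
    unit = col["annotation"]["unit"]
    if unit == "celsius":
        return [round(v + 273.15, 2) for v in col["values"]]
    if unit == "fahrenheit":
        return [round((v - 32) * 5 / 9 + 273.15, 2) for v in col["values"]]
    raise ValueError(f"unknown unit: {unit}")

print(to_kelvin(column))  # [309.75, 310.15, 311.35]
```

With the annotation attached at authoring time, a downstream system can normalize and merge the data with no further human processing, which is the "carried along" property described above.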
A great story about non-proprietary sharing of information, curation across data sets, et cetera. Yes?

So we're increasingly blurring the lines between what I traditionally consider scholarly information and a wide range of other kinds of information, or maybe expanding the definition of what we think of as scholarly information. The traditional role of libraries, since there are a few of us in the room, is to select, collect, and preserve what we determine or think is scholarly information. What's your impression of the role of libraries as that world expands and we're a little swamped?

For full disclosure, my history is actually as an academic librarian; I was at Columbia University for years before they got me to Microsoft. But to your point, in my opinion, the safe world of collecting books and collecting scholarly journals has been done. We figured it out. Yes, we can work out how to get from 98 percent to 99 or 100 percent, but that's not the problem, and that's not where we're going to make the difference moving forward. It's data curation. It's provision of services. It's working directly with publishers, or going around publishers and working directly with the scholars, and saying, let's do it.

Actually, even where it has been done, it still takes an enormous amount of energy and money to do it.

Absolutely, exactly. No argument, but I would argue that that is taken as read. It still has to happen; it needs to happen behind the scenes. Where we're going to make the most impact is all of what has traditionally been either gray literature or the stuff that's been underneath a desk or on the shelves of the scientists. That's the stuff that's unique and that needs the brain power. What librarians and archivists have done for centuries, we've figured out; we need to apply those skills to a new set of data. And that's my opinion.
And I'm also a librarian, so I can be heretical here as well. What is a librarian? You can divide the work into the curatorial and preservation aspects, or you can look inside the actual scholarship and at furthering scholarship. Those are two separate avenues, and extremely difficult areas to tackle. But to this upstream analogy: I think the big opportunity, and the big place the profession needs to go, is to get into the laboratory, to be with the researcher. When applications are being written to solve a problem, a lot of the time they're not being written with an eye to how the information comes out of the system. So scientists create problems for themselves that then need to be solved. Upstream, they create a solution that makes sense for other scientists, but maybe it won't cross a domain, or won't make the data easily available to other populations that might be interested in it. And I think that's a value that librarians have been able to provide. Yes.

Great talk, thank you very much. I want to probe a little at what you alluded to at the end, the sociological shift. I've worked with scientists for more than ten years, and my impression is that it's not generally a sharing culture. It was even in The Onion a few months ago, a headline along the lines of "selfish scientists won't share data," and it wasn't too far off the mark.

That's right.

So I'm wondering if it's going to take a long time. That may be true of the older generation, and the generational line is blurring with social networks they can grow into, but I wonder if it's going to take a generation for all this to sort itself out, with younger people coming in who are used to blogs, used to communicating a lot, used to being not very private.

But we can't wait.
But, for example, two market researchers may not want to share things between themselves because they're competing with their products. I can see some cultures where it makes sense and other cultures where there's going to be a ton of resistance.

Yeah. Again, I wish I had a canonical answer for you. The answer is that there's a broad scale, and it's going to vary from domain to domain; it's going to vary even within a domain. There are going to be innovators and there are going to be laggards. Are we going to have to wait a generation? In some cases, yes. In other cases, I don't think so. I've witnessed tremendous shifts in the last five or ten years. Let's pick open-source software. Five years ago, there's no way the work we're doing now would have been supported within Microsoft. Microsoft started to say: wait a minute, we need to think about markets in a different way; we need to engage with different communities in different ways. To the point that we've now created an open-source foundation, and there are parts of the company generating a lot of open source. If you had asked me ten years ago whether Microsoft would be thinking that way, I would have said never. And some of the most proprietary-thinking people in the company have now completely shifted. So that kind of opportunity, that kind of potential reward, is the incentive we need to be able to demonstrate to scientists: okay, you've been very proprietary about your data, but here are examples where sharing vastly improved a colleague's career, or some other situation, or helps you.
I mean, the reason people don't share their data is that they run an experiment and in their mind they have six papers coming out of it, and as soon as you share your data, you're opening it up to the world and potentially scooping yourself. If the reward system were such that sharing your data is also recognized as part of your contribution, there would be more incentive to do it. The other side, which I think you're addressing, is that if it's incredibly hard, no one wants to do it. If there are so many roadblocks and so many hurdles to sharing that data or making it talk to other data, then they're definitely not going to do it. But if you can create systems that facilitate that and make it ridiculously easy, or even five percent easier than it is now, that might be the incentive.

We also try to provide an incentive to the scientists who share data through us: one way researchers get credit for what they do is the number of publications they have and how much they've been cited. So if the data itself can be cited, that's an additional incentive.

Sounds like we should all be talking to her. Yes.

I'd go back to, and I think this is related to your question, the upstream data services and the role of libraries. Let's imagine I give you the beautiful young mind of a librarian and we can mold it any way we like. What skills do you put in there? We all hear that this is necessary, but is it a computer scientist we're talking about? Some hybrid of information expert, computer scientist, and scientist?

It is quite diverse, and it's very interesting. I'm not going to be able to say here are the 15 things you need, but I'll talk about it. I mentioned the DataNet solicitation that NSF put out.
That's basically pulling together scientists, pulling together librarians and archivists, and in some cases pulling together businesses. And it occurred to me that it could be amazing if you could teach a class on what those groups are doing. I'm actually affiliated with the Information School at the University of Washington, and I've told them: I don't think we need to be training librarians the way we have been. We need to start thinking about the service orientation, not reference interviews, but how to build value-added services on top of the data. And literally, to your point, it's getting them some classes over in the business school. It's exposure to policy and legal issues in other areas. It's sitting down with the informatics groups, cheminformatics or bioinformatics, and building multidisciplinary partnerships across departments. I think that's the way it has to go; everything is bleeding into multidisciplinary work. But as a librarian, I get a little frustrated when I go into a place and it's not a librarian doing the work but a chemist. The chemist who got his PhD in chemistry is now doing work that I would traditionally expect a librarian or an information professional to be doing. I feel that's a failing of our profession, where the scientific domains have had to spill over into things librarians have traditionally done. The onus is on the information schools to reclaim that, in partnership, not in a competitive way, but to say: we're expert at this, and now we need to do it in your area as well. There's a lot that librarians have learned over a couple of centuries that we need to apply in those other areas. We have time for one more question. Yes.
I just want to follow up on something you've talked about: Microsoft's internal policy shift toward putting more and more information about your application code and application design online and sharing it with others. Have you seen the advantages or benefits of that yet?

I wish I could say something like, yeah, we've seen Word usage spike or SharePoint licenses jump, but no. The reason we're doing this, and specifically the reason Microsoft Research is doing this work: most academic institutions around the world have already licensed Microsoft software. It's already paid for. The institution has already bought it, and it may not be used, by various departments or for various reasons. What we're trying to say is: if you didn't know this functionality exists, it does. Or, if it doesn't exist or isn't meeting your needs, let us help make sure it does. We can either build it or take your feedback to the product group and ensure that functionality gets added. So fundamentally we're not saying, hey, this has been a great benefit for Microsoft. We're saying, we want you to have the benefit, because your institutions have literally already purchased the software; we want to make sure you deploy it, have a positive experience, and get your job done. And again, as I said a little earlier, it's not that we want everyone to run, well, yes, that'd be ideal, we'd love everyone to run Microsoft software all the time, but we know that's probably not a reality.
Specifically, when we get into building specific applications, we want to make sure that within that ecosystem Microsoft is interoperable, that you can get data in and out, and that we can play friendly and be part of that ecosystem in a positive fashion. Okay, thank you, Evie.