I'm going to talk about the Olive project at Carnegie Mellon, and then, time permitting, we'll have a conversation and try to answer any questions. I'm not a technical wizard in the context of the Olive project; our computer scientists are the wizards. So I've cannibalised a presentation from one of my colleagues, and please bear that in mind when your questions become unduly technical. So let's get the show on the road and talk about the impetus behind software curation.

Those of us in libraries have a long history of success, to a greater or lesser extent, in archiving static content. In recent years, of course, the question turns to executable content. Those of my generation probably think of executable content as the games with which we've grown up, and a yearning from time to time to be able to go back and play Duke Nukem as though it was 1992 all over again. That remains an interesting case study, a use case, for software curation. But I think in today's institutional environment there are many application-specific drivers behind the notion of software curation.

I take as a starting point the momentum we're seeing towards open science. My reckoning is that when I've got three books about open science sitting on my desk, or at least on my laptop, there's enough momentum behind it to suggest that the open science movement deserves our attention. And certainly when the New York Times and The Guardian start writing about open science, it really is time to take note. The software execution aspect of this becomes quite important when we look at issues around, for example, scientific reproducibility. We've seen a lot of interest picking up in the more popular side of the media in our ability to reproduce the results generated in the scientific laboratory, to demonstrate either that good science was followed or that the conclusions arrived at by the researchers can indeed be supported by the data behind them.

We had a last-minute move from Keynote into PowerPoint, and those of you who use Keynote will sympathise with me that my animation here doesn't work. What I was hoping to demonstrate was the shareable-knowledge circle growing in size to reflect the open science movement: an environment in which the amount of knowledge that is shareable globally grows to become a larger percentage of the useful knowledge that is out there.

We in libraries have spent the past 20 years converting material to digital form, with established standards and protocols for preservation and accessibility. My first digital project of the sort was the Gertrude Bell archive, which was one of the first JISC-funded projects of its kind in the UK.
We started that more than 20 years ago, converting manuscripts and photographs in the library in which I then worked into a very elegant mid-90s digital archive. We built it, and whilst there was a lot of interest in the technical approach behind it, frankly, for the next couple of decades nobody really came to it, and I think it was largely forgotten except by those of us who returned sentimentally to look at how we did things in those days. A couple of years ago a book about Gertrude Bell came out with a photograph from that archive on the cover. Those of you interested in the history of the age might like to know that on the camels, on the far left as I look at the screen, is Winston Churchill, then Gertrude Bell, then Lawrence of Arabia, and a couple of functionaries sitting next to them. Indeed, the back cover acknowledges the archive as the source of that cover image, which in itself was nice and created a bit of a buzz. This year it became even more interesting when Nicole Kidman starred as Gertrude Bell in the film of the book, and all of a sudden that digital project from 20 years ago became a very hot topic. I think this is, in part, the issue we face with executable content: software generated 20 years ago may languish and then, for some reason, completely unpredictably, become of interest, and we need to work out how we can make it accessible.

In addition to converting analogue media to digital, we in libraries have a track record of curating born-digital content, and we see at least some libraries picking up the challenge of managing social media products such as tweets and webpages, in this case at the British Library. We see the policy drivers behind data management: the OECD has been championing this for some time, and increasingly the federal funding agencies around the world are putting in place data management mandates, in this case the NSF one that came out last month. Another swathe of drivers is the movement around citizen science, which has picked up a lot of traction, PLOS in this case talking about the volume of crowdsourced data being generated by citizen scientists. Inside the academy we increasingly see journals, in this case PLOS ONE, providing access to data sets. In this case the data is stored in Figshare; when you download the data and try to look at it, it opens up in a Word document. How sure can we be that the Word document will still be accessible in 20 years' time, even though the file hopefully will still be available in Figshare, or in Figshare's successor, or in the Internet Archive? We know that data comes in many shapes and sizes and is increasingly being provided through mega-archives like Figshare, Research Data Australia and the UK Data Archive. So there's lots of data out there; how can we be sure that we will be able to access those data sets and open them in years to come?

An interesting case study bears this out: the Reinhart and Rogoff economics paper, which came out, I think, in 2010 and inspired many governments to implement austerity measures. About a year after that paper came out, a grad student and his supervisors realised that there had been an error in the Excel spreadsheet, which actually undermined the whole study. That was fine, because the Excel spreadsheet was still accessible; it was only a year after it had been created. But if people were going back 20 years in history, would they be able to open up that Excel spreadsheet and show that there was an error that had, if you believe the Bloomberg hype, changed the course of history?
So that is the problem statement, in a sense. Inside the academy, in the context of reproducible science and the drive towards data management, we need certainty that we can gain access to the data sets into the future. Another element, which I don't have a slide for, is the increasing number of journals, like SoftwareX from Elsevier and PLOS Computational Biology (I do have a slide on that at the end), which are encouraging computer scientists in particular to deposit their software in journal archives. And again, that same problem statement: in 20 years' time, with contemporary hardware, will we be able to execute software created 20 years earlier, and can we plan ahead to ensure that executable content remains accessible? So that is the problem statement. I'm going to hand over to Euan, who will talk about the Yale approach to handling that, and then I'll come back to say a few words about Olive.

Thanks, Keith. So I'm Euan Cochrane. I'm the Digital Preservation Manager at Yale University Library. I'm going to give a little bit more background before I start talking about what we're actually doing at Yale. I think one of the most important reasons why I see software curation as important, which Keith hasn't really covered in detail yet, is software-dependent content. He did talk about research data, and that's a very good example of it. This slide here is just a graph showing the usage of different operating systems over the last 13 years or so, and what it shows, really, is just that there has been a lot of change. So if you had some content that required software that relied on one of those operating systems, you're going to be in a bit of trouble now, because it's probably already obsolete from an access perspective, from a regular use perspective. And that's important because old software is often required to authentically render old content: to get access to the information in the same form, or to the same information, that was available when the original software was in regular use.

This is an example from the Forestry Research Institute of New Zealand. It's a research paper that they published that covers tree growth over time. The original, on the left, is WordPerfect running in Windows 95, and the one on the right is the same file opened in LibreOffice in Windows Vista, which was current when this comparison was done. Here's another example. I really like this example because of a little technical detail. What you're seeing on the left is again something from the Forestry Research Institute. It's also about tree growth; it's a bit earlier. It was done in WordStar and it has some equations in it. And the interesting thing is that the author of the paper used the line above to render the exponents in the equations: squared, cubed, what have you. That's not a feature of the format or of the software; it's not like they turned on bold on the text. So you can't go through and programmatically look for this without some sort of artificial-intelligence-type approach to it. What that means is you can't really automate migrating these things between different pieces of software and different file formats, because you can't inspect the result and confirm that the content has actually been preserved over time. And what you see in the modern software is that the equation changes, because the formatting, even the font size, can make a difference to that.
So what I'm getting at, really, is that there's a great need to preserve and curate software in order to maintain access to software-dependent content. To do that, we need to curate and preserve operating systems, software applications, and all the fonts, scripts, plugins and other dependencies that are needed to support access to that content. And in addition to all of that, there are another couple of areas that I think are really important when it comes to software curation. One is maintaining access to whole desktop environments, because there will sometimes be content that requires a very specific setup of software to be accessed. An example we've seen is famous users' desktops, where Emory University took a snapshot of Salman Rushdie's desktop and provided access to that via emulation. I think there ought to be a lot more of that going on. We're doing a little bit of it at Yale: the Manuscripts and Archives unit and the Beinecke Rare Book and Manuscript Library are taking authors' desktops and taking snapshots of them. They do a bit of redaction, because the authors often don't want people seeing every single file on there, but they keep the original for the most part and only provide access to a redacted version. But there could be a lot more of this. There's an area that I'd love to see some work in: capturing the desktops of famous software developers. We're really missing access to basically the tools of today, the tools that are used to build this digital world, and no one seems to be going out and capturing them. I'd love to see that happening. And then we also need to curate pre-configured disk images with sets of software installed on them. When you're doing emulation, you need to set up a virtual hard drive, install software on it, and then use that to access content. Those disk images aren't really being systematically configured, preserved and curated over time.

So, software curation: how do we go about doing it? Well, the main thing that we use, really, is emulation or virtualisation. Emulation, to give it a bit of a definition, is when, on a new computer, you recreate in software the hardware of an old computer, and then you run the old software that was compatible with that old hardware on the emulated version of the hardware. Usually the computer you're doing the emulation on has completely different hardware to the one that you're emulating, or quite different hardware at least. Many emulators were originally developed to run video games: there were a lot of homebrew groups that got together and built things like Nintendo emulators so they could play their games. There's a lot of passion that went into these, and still does. But emulation has also been used a lot in industrial and scientific settings. If you've got a piece of equipment that you need to keep running for a long period of time, and it relies on some software, and then the hardware that the software is running on dies, what are you going to do? Often the easiest thing to do is emulate the hardware that the software was running on, connect the big machines to the new hardware that's running the emulated version of the old hardware, and keep it going. Another area where emulation is used a lot is in mobile phone application development. If you're building something for an Android device like this, these normally run on completely different hardware to your PC.
And you want to do your development on your PC, because it's much faster and easier; you've got big screens to develop on and so on. So the software development kits that are provided for building phone software come with emulators. The Android development kit comes with QEMU, which is one of the emulators we'll both be talking about in a while.

One other point I thought it would be useful to make by way of background is the difference between virtualisation and emulation. Virtualisation is basically the same as emulation, except that the recreation of the hardware in software is done on hardware that is compatible with it, and it's generally used to run multiple instances of an operating system on a single set of hardware. So you might have a server with 48 CPU cores and a whole bunch of RAM, and you separate that out into 48 different operating systems running at the same time. But they're normally compatible with that underlying hardware, and often some of the code running on those operating systems will be running directly on the underlying hardware. Whereas with emulation it's all done in software, so there's a layer in between that and the hardware. It means that the hardware doing emulation needs to be much more powerful than the hardware doing virtualisation, because it has to recreate the entire old machine in software. What virtualisation is good for in the digital preservation world is bridging the gap between the departure of recently obsolete hardware and the arrival of hardware that's powerful enough to do the emulation completely in software. So I see there definitely being a role for both. I think over time we'll end up having to move to emulation, but in that interim gap, where you're no longer running, say, Windows natively just to keep some software going, and you don't yet have a system that's powerful enough to emulate it, you run it in virtualisation. You can still build the disk image and get it into your workflows, in time for when the emulator is available and you can move it across.

Another thing we need to do to enable software curation is to make available documentation of all these different components, so that we can do what I think is really necessary in this space, which is collaborate. To enable collaboration we basically need standard ways of talking to each other about the things that we're dealing with, and for the most part we don't have these. We need persistent, unique identifiers for software. We need them for virtual hard drives, so that we can exchange those hard drives and share them: it doesn't make a lot of sense for everyone to go around creating standard Windows 95 hard drive images with just Windows 95 installed on them, and if we want to be able to talk to each other about something being compatible with a given disk image, we need to be able to identify those things. We also need to be able to identify configured sets of hardware, so if you've got a Windows 95 era PC or a Mac OS 7 era computer configured as an emulated environment, we need to be able to talk about what that means and identify it over time. And basically we don't have most of that now. The Internet Archive, NIST (the National Institute of Standards and Technology, I want to say) and PRONOM at the UK National Archives are all doing some good work around software documentation. But for identifiers for disk images and for configured environments, there's basically a gap at the moment.
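While that gap persists, one conceivable stopgap, offered here as a minimal sketch rather than an existing standard, is to derive a content-based identifier for a disk image by hashing its bytes, and to wrap that in a small descriptive record. The field names and the sha256: scheme below are illustrative assumptions, not any agreed convention.

```python
import hashlib
import json

def disk_image_record(path, os_name, software):
    """Build a minimal descriptive record for a curated disk image.

    A SHA-256 digest of the image contents serves as a candidate
    unique identifier: two byte-identical images always share it.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read 1 MiB at a time
            digest.update(chunk)
    return {
        "id": "sha256:" + digest.hexdigest(),  # content-based identifier
        "image_file": path,
        "operating_system": os_name,           # e.g. "Windows 95"
        "installed_software": software,        # e.g. ["WordPerfect 6.1"]
    }

# Illustrative usage; the file name is hypothetical.
print(json.dumps(disk_image_record("win95.img", "Windows 95", []), indent=2))
```

Two byte-identical images would always get the same identifier, which is enough to let institutions check they are talking about the same Windows 95 image; it says nothing, of course, about identifying the configured hardware side of the problem.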
So how does this actually work when you're doing it on a day-to-day basis? With emulation, what you really need to do to begin with is configure your emulator, or configure your emulator environment. That's like building a computer from scratch: you take your memory, your CPU, your audio card, your video, your network and your hard drive, and you put them all together and connect them up. With modern emulators you can do this through a graphical interface, just by clicking. You can say: I want a Pentium-era PC with this kind of hardware, this audio card and so on. That can then be saved as a configuration file, just as metadata. You also attach a virtual hard drive, a virtual disk image, and then you can start installing your software on it: you can boot the machine and install your software. This generally requires specific tools for each emulator, and they can be a bit technically challenging; a lot of librarians and archivists would probably struggle with this step in the process. And when it comes to accessing these things, libraries and archives at the moment, with the exclusion of the great work of the Internet Archive, are generally using dedicated machines in reading rooms to provide access to these environments. They're normally quite restricted, and you need special permission to access a lot of it, often because it's technically challenging. And that's really too hard; it shouldn't be this hard.

So this is where the emulation-as-a-service concept comes in. Emulation as a service is basically about providing remote access to emulated or virtualised environments via a web browser. It abstracts all those complicated configuration issues away from the end user and basically just provides a link: you can click on it and it will boot in your web browser. The implementation that we're using at Yale has a lot of really interesting and useful features. One of them, and this is pretty consistent with all emulation approaches, is that you can choose whether or not to save any changes that a user has made to the environment they're using. So you could have a pre-configured environment made available to an end user; they can do whatever they want in there, delete files, add things, and at the end of the session it all just gets deliberately discarded, so that the next time someone logs in they get back the same environment. You can restrict interactivity where appropriate: if you don't want someone downloading anything from it, or you don't want anyone being able to print from it, you can do that. And that's really quite useful and quite powerful, and something that came up in the discussion this morning, because it means you can get over some of the IP issues that come along with all of this. The emulation-as-a-service software we're using at Yale also allows you to quite quickly create custom environments. You can take a base environment, add something, install some software, and then provide a link to that to somebody else. That's quite powerful: it gets at, or gets towards, the idea of virtual reading rooms. You can create a custom reading room for every patron quite quickly and quite easily.

So, the background for the things we're doing at Yale: basically, this software is all produced by a team at the University of Freiburg in Germany. They've done some really amazing work.
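To make the configuration step above concrete, here is a minimal sketch assuming QEMU, one of the emulators such services support. The machine settings and disk image name are illustrative; QEMU's real -snapshot flag is what provides the discard-changes-at-end-of-session behaviour just described.

```python
import json
import subprocess

# The emulator configuration saved "just as metadata"; fields are illustrative.
config = json.loads("""{
    "binary": "qemu-system-i386",
    "memory_mb": 64,
    "disk_image": "win95.qcow2",
    "discard_changes": true
}""")

cmd = [config["binary"],
       "-m", str(config["memory_mb"]),  # guest RAM in MiB
       "-hda", config["disk_image"],    # attach the virtual hard drive
       "-vnc", ":0"]                    # expose the display for remote access
if config["discard_changes"]:
    cmd.append("-snapshot")  # writes go to temp files, not the disk image

subprocess.run(cmd)  # boots the environment; ends when the guest is closed
```

The point of keeping the configuration as plain metadata is that the same record can be re-read years later to rebuild exactly the same machine, or exchanged between institutions alongside the disk image.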
I started working with the Freiburg team when I was working at the National Archives in New Zealand a few years back. I remember talking about the concept of this with one of the leads of the project while they were in New Zealand doing some brainstorming around it. I don't want to take too much responsibility for it, but I definitely was involved in the early days. When I came to Yale, of course, I brought this along, and we actually had the first installation of the software outside of Germany. We've been testing it, trying to figure out how we want to incorporate it into our workflows, and providing requirements and ongoing development collaboration. And we're planning to implement it in production in the next financial year.

So why would you all be interested in this? Well, as we mentioned, a lot of content can only be properly accessed using emulation tools, and doing this emulation is technically specialised and can be quite challenging. Old software can also be a little difficult for modern users to understand. One of the things we're looking to do in a future iteration of this service is provide layers on top of the interface with pointers on how to use it. Say you're loading up an old WordPerfect file: if you've ever tried the early versions of WordPerfect, they're mostly keyboard-only controls, and modern users just wouldn't get it. They wouldn't know which controls to use, for starters. I certainly don't; I have to look it up every time I try to do it. So if you can provide a kind of layer over the top with some pointers, click here to do that and so on, or even little animations, that's the kind of thing enabled by this approach. The other great benefit, of course, is that users don't need to come into your reading room to access anything: they can access this stuff from anywhere in the world. And, of course, you can maintain control over the content. Users can't copy data in or out of the environment unless you want to authorise them to do so. The one thing that's always going to be a problem with anything remotely accessed is the ability to take a screenshot or a photograph, but that's probably acceptable for most people.

With this particular implementation that the guys from Germany have put together, there is quite a strong separation between the environments, that is, the installed disk image environments, the objects that you might access via them, and the emulators or emulator configurations themselves. And they've got a few different ways of implementing it, including one where everything is local, or you can separate any of the bits out to have some of them done locally and some done remotely. So you can outsource components: all of it, some of it, or none of it, as you wish. You might have the disk images and the content locally, and the emulation provided remotely, so that you don't have to deal with the more technical things; you can just keep the things that you want to keep in-house.

You can also create small derivative environments very easily, with the click of a button. You can start with a base environment, add some files, click a button, and it creates what's called a derivative environment. When it does that, it just saves the delta, the difference between that and the master environment, which is really good, because the master environment comes with a disk image that usually has Windows or Mac OS installed on it, and that's usually quite big.
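That difference-only storage maps directly onto QEMU's qcow2 copy-on-write format. As a minimal sketch (the file names are illustrative), creating a derivative environment amounts to creating a new image backed by the master: writes go to the derivative, while reads of untouched blocks fall through to the master.

```python
import subprocess

master = "win95_office_master.qcow2"  # big: full OS plus applications
derivative = "patron_session.qcow2"   # small: only the changed blocks

# qemu-img's -b option sets the backing file: the new image starts out
# empty and records only the blocks that are written, i.e. the difference.
# (-F names the backing file's format; recent qemu-img versions require it.)
subprocess.run(["qemu-img", "create", "-f", "qcow2",
                "-b", master, "-F", "qcow2", derivative], check=True)

# Boot from the derivative instead of the master; the master stays pristine.
subprocess.run(["qemu-system-i386", "-m", "64", "-hda", derivative])
```

Booting from the derivative leaves the master untouched, which is what makes a cheap custom environment per patron feasible.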
So instead of having to keep the full image for every single new environment, you just keep the little difference, which saves a lot of space. And you can then reuse the standard environments, which might hold all the dependencies that are the same across all sorts of other types of content you want to access over time. The other thing this particular implementation provides is the ability to cite environments. It gives you a unique identifier for every single environment, which you can then provide for citations, and it also provides a link that you can just email to people or add to your catalogues. I blogged about this on the Library of Congress digital preservation blog, with some examples that actually came from patrons who came to us and said: I want to access this, what can we do? You can have a look at that if you want some specific examples.

Here's a diagram of the architecture. I can put you in contact with the team in Germany if you want to talk about the details; I don't have a lot to say about it here, aside from a bit about the developers that I'll get to in a minute. I'm going to try to go a bit quicker, because I've taken up a lot of time.

From a practical perspective, if you're going to implement this particular approach, you still need to do that emulator configuration, though you can also leave that to the experts. What it requires is that you configure the emulator locally and then import the environment you configured into the service. You can do that importation via a web browser interface, where you put the files into a folder and it looks for them; it then allows you to configure it, label it and so on. If you're a librarian or an archivist, the thing you're probably most likely to do is take one of the environments that's already in there, add some files to it, and then provide access to it for somebody else. You can do all of that via a web interface. You can also do the same thing with software. So you could add a CD-ROM to one of the environments that's already there, install the software, do the testing via a web browser, add some metadata about how it went, and then save that custom environment, either to add files to later or to provide access to. What we've been doing is this: we've got all these CD-ROMs in our general collections that no one seems to be doing anything with, so we've been imaging them, and I've started making a lot of them available in the service. It's really easy to do. You just put the file into the right folder, then you can boot it up, install the software or add a link to the desktop, and then provide that to somebody to use. You can also use that same interface to ingest entire disk images that you might have captured of, say, an author's hard drive. And from an access perspective, it's really quite straightforward: if you have a finding aid or a catalogue, you can take the thumbnail and the link, or just the link, and put it in there. Someone clicks on it and it loads. You can also embed an environment into a web page, so it's right there in the middle of the page.

Now, if you're a developer, one of the interesting things about this implementation from Freiburg is that it basically provides an API for a whole bunch of different emulators. They've incorporated, for example, the same emulation software that's used by the Internet Archive, JSMESS; that's included in here, as are a bunch of other emulators: three Apple emulators, I think, plus QEMU and VirtualBox.
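As an illustration of what driving that generic API from a workflow might look like, here is a sketch in Python. The service address, endpoint path and response field are hypothetical stand-ins, not the actual Freiburg API; the point is only that the caller never needs to know which emulator is doing the work.

```python
import json
import urllib.request

BASE = "https://eaas.example.edu/api"  # hypothetical service address

def start_session(environment_id):
    """Ask the (hypothetical) emulation service to boot an environment
    and return a URL that can be opened, or embedded, in a web browser."""
    req = urllib.request.Request(
        f"{BASE}/sessions",
        data=json.dumps({"environment": environment_id}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["browser_url"]  # hypothetical response field

# The caller never touches QEMU, VirtualBox or any other emulator directly:
print(start_session("sha256:abc123..."))  # illustrative environment identifier
```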
Having VirtualBox in there provides a whole lot of really interesting potential: you can put modern environments in there, which then gives you the ability to do the weird kind of virtual reading room things you might want to do. And then you can incorporate all of this into your workflows. If you want to run an emulation as one step in a more complex workflow, you can do it just using this API, and it's a generic API, so you don't need to know the specifics of any particular emulator. You can embed environments into web pages and use them for online exhibits. You can cite them, and there are URIs for everything.

One of the really cool things they've been doing recently is collaborating with the BitCurator project. What they're trying to do, and they've had a lot of success, I believe, is make it so that with basically one click you image the thing, it automatically figures out the configuration for the emulator, configures it, and then attaches the disk image you've created, so you've got it emulated just like that. They've been doing the same with CD-ROM images: it will look in the CD-ROM to see what kind of software and what operating system is probably needed to access it, and pick the right environment with that operating system already installed, so that it can boot automatically. Then all that's left is for the librarian or archivist to install the software from the CD-ROM, and you've got a link to that environment for anyone to use. I'm not going to demo right now; if there's any time at the end I can do that. But I will leave you with this slide, which is the link to the environment in Germany that you can go and try out yourself. The Wi-Fi here is a bit sketchy, so it's probably best if I don't try to demo it now, but if anyone wants to have a look at it later, just come and find me.

Thanks, Euan. I'll turn to Olive now. That's perhaps a nice segue, because one of the differences, although we're all trying to achieve the same thing, between Olive and Euan's approach is that in Olive there is a separation between the location of the virtual machine image and the site at which it is launched and executed. It's demand-paged, and therefore if your internet connection isn't of data centre quality you can still get there, though I still wouldn't want to try it on the Wi-Fi in this hotel. I should have mentioned at the beginning that these slides are already on SlideShare, on my account (CM Keith W), so feel free to grab them if you wish. The Olive project is driven out of the School of Computer Science at Carnegie Mellon and has been funded for the last few years by IMLS and Sloan as a proof of concept, and that foundational work was recognised through an award towards the end of last year from the National Academies' Board on Research Data and Information. We've also been working closely with Ithaka, so Diana, I'm sure, might be able to join a conversation after this presentation; Ithaka came in and provided a lot of guidance on some of the big questions that I'll pick up just at the end. Given the time, I'm also going to whizz through these slides, because all I'd do is read what it says on the screen, so grab them from the slide deck.
The interesting point here is the inspiration from YouTube: what we wanted to do was page on demand. But streaming a virtual machine is not as easy as streaming a YouTube video, and that has been at the absolute centre of the Olive development work. The broad model is very similar to what you've heard from Euan already. The host environment is represented by the boxes at the bottom of the structure: you might, for example, be hosting all of this on a laptop running Linux. Olive has built a caching service, which is the third box. Inside that, the host hardware is virtualised using QEMU, which we've already heard about. Sitting on top of that is the guest environment, so the emulated hardware hosts an old operating system, let's say Windows 3.1 or Mac OS 7 or similar. Inside that sits the old application, which might be an early version of WordPerfect for the sake of argument, and inside that the WordPerfect file, or an Excel spreadsheet, or a specific saved state in a computer game. So that is the broad conceptual structure.

Illustrating that with some examples: here is an American game that I'm told was very popular in schools in the early 90s, where you could plot the plague and how it would wipe us all out, sitting inside an early version of Microsoft Windows. This is a YouTube illustration of going to the Olive website, where the virtual machines are available; there's a link towards the end of the slides to that page, and you'll also see the link there to this YouTube film showing the way you can go in and start some software, if you have the credentials to get in, in this case Microsoft Office 6. Opening it up, and there we go: this will bring back memories for some of you, and some of you will be grateful that you didn't grow up in this era. In a few seconds' time it's going to be a sentimental trip through the early PowerPoint demonstration slides. So this is what PowerPoint used to look like.

[Audience: Do you have trouble finding the installation software?] I guess, given the sort of university that Carnegie Mellon is, it's been quite easy just to tap anybody over the age of about 40 and ask them to open their desk drawer, and you'll find some of the disks. [Audience: So the hoarders are our friends in this case?] Absolutely, and we seem to have a culture on campus that whenever a faculty member retires, their office contents are given to the library, and once you sort through the black bags you can often find some of these old disks. So this is, I think, the game that I mentioned, and here it's illustrating the plague running its course.
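To give a flavour of what paging a virtual machine on demand involves, here is a much-simplified sketch, not Olive's actual implementation: the machine image stays on a remote server, and the client fetches fixed-size chunks over HTTP range requests only when the virtual machine first touches them, caching each one locally. The URL and chunk size are illustrative.

```python
import urllib.request

CHUNK = 256 * 1024  # 256 KiB pages; illustrative

class DemandPagedImage:
    """Read a remote disk image lazily, one chunk at a time."""
    def __init__(self, url):
        self.url = url
        self.cache = {}  # chunk index -> bytes already fetched

    def read(self, offset, length):
        out = b""
        for index in range(offset // CHUNK, (offset + length - 1) // CHUNK + 1):
            if index not in self.cache:  # fetch on first touch only
                start = index * CHUNK
                req = urllib.request.Request(
                    self.url,
                    headers={"Range": f"bytes={start}-{start + CHUNK - 1}"})
                with urllib.request.urlopen(req) as resp:
                    self.cache[index] = resp.read()
            out += self.cache[index]
        skip = offset - (offset // CHUNK) * CHUNK
        return out[skip:skip + length]

image = DemandPagedImage("https://vm.example.edu/win311.img")  # illustrative
boot_sector = image.read(0, 512)  # only the first chunk crosses the network
```

The payoff is the YouTube-like behaviour described above: the guest can start after fetching only the blocks it actually reads, rather than waiting for the whole multi-gigabyte image to download.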
One of the big challenges we face in this environment is the legal issue around software licensing, and that's why a lot of the software we've been able to work with as proof of concept is not available outside the Olive environment; our lawyers are not as generous as Brewster's in steering us in the right direction. We did commission a legal study, which simply said "do not", and then there were 300 pages; it was a bit like library rules for borrowers. And here's a chance to see the early Mosaic browser able to render an old website; but you'll see that when we try to load a contemporary website, in this case the current Coda project site, it doesn't look as nice as its creators would wish it to look. And I think that's the end of that; you really don't want to watch it again, and I've lost the mouse.

Right. So although we've proved that it can be done, it is incredibly time-consuming. We've been working with some of the current programs in PLOS Computational Biology, and these are taking weeks to work with because of the complexity of today's software. It's relatively straightforward to take an early-90s computer game and work with it; a 2015 piece of software looking at genomic data is much more complex, and therefore the technical expertise required to make it work is very demanding. But Satya has this wonderful conception of Olive being the reference library for the nation and the world. I think that he ought to talk to Brewster a bit more carefully about who's going to dominate the world in this space, but there is this sense of being able to look back in time: what would Isaac Newton's model say about today's data? And that's why we are looking at the data being generated, and the software being curated and made available, through today's journals such as SoftwareX and PLOS Computational Biology. You can find more detail about Olive at the project website, and there's a technical report in the School of Computer Science repository which you can also access. Given that we now have five minutes between this moment and drinks, I'm going to stop talking and invite any questions or comments. Thank you all for being here.