We had a little practice session at the beginning, experimenting with the audio; my microphone isn't ideal, and I apologize for that. Unfortunately, even with two headphones and a separate mic, we also weren't able to fix my American accent. But we'll do what we can.

What I'm going to be talking about today, in the spirit of the shift going on in libraries, archives, and museums, is this digital curation shift, if you will. Moving from predominantly analog to predominantly digital collections requires significant shifts in LAM thinking and practices. It involves attention to various forms of digital representation, including some levels that I'll be talking about, which I really think define the work in many ways. And also, very hopefully and luckily, we have a lot of free and open-source tools that can expose, capture, and transform digital information at those levels.

When we get to these levels: if you work in the IT sector, or are even just adjacent to it because you work in digital curation, you've likely encountered a lot of models based on levels and layers. You might have an interface layer built on top of a web server that has a database underneath, and an operating system under that. We're naturally attuned to thinking about levels that build on top of each other. A lot of those are technical and fundamentally important for how the technology operates, but not necessarily the way we think about digital collections. The levels of representation I'm going to talk about briefly are the ones I think matter most to how people want to encounter digital collections, and thus also to how the people responsible for them need to think about what it is they're managing. What are the entities? What are the objects? What are the types of expression they're really trying to reflect over time, through generations of technology?

At the highest level we have aggregations of objects, something that LAMs (libraries, archives, and museums) deal with quite frequently. From an archival perspective (which, I'll apologize, I tend to fall into more than library and museum perspectives, though they're all very much intermingled), we have things like series, record groups, and fonds: different ways to talk about the idea that we might encounter and manage something at a level of aggregation distinct from the objects within it. We can really think of a classification system and a set of objects together as constituting their own level of representation.

We can then think of objects or packages. From the perspective of the OAIS reference model, which I'm sure many of you on this webinar are either far too intimately familiar with or have been trying to avoid for many years of your career, the core idea is that we have information packages: submission information packages, then archival and dissemination information packages. The most significant aspect of that for this discussion is that they are packages and not just bit streams.
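To make the package idea concrete, here is a minimal sketch in the spirit of BagIt-style packaging, not the OAIS model itself. The directory name and metadata fields are hypothetical, and a real workflow would use an established packaging tool or library rather than this hand-rolled version.

```python
import hashlib
import json
from pathlib import Path

def make_package(payload_dir: str, metadata: dict) -> None:
    """Wrap a directory of content files in a minimal BagIt-style package:
    the bit streams stay in the payload directory, while checksums and
    descriptive metadata travel alongside them."""
    root = Path(payload_dir)
    manifest_lines = []
    for f in sorted(root.rglob("*")):
        if f.is_file():
            digest = hashlib.sha256(f.read_bytes()).hexdigest()
            manifest_lines.append(f"{digest}  {f.relative_to(root)}")
    # Fixity information: one checksum per payload file
    (root.parent / "manifest-sha256.txt").write_text("\n".join(manifest_lines))
    # Descriptive/administrative metadata packaged with the content
    (root.parent / "package-info.json").write_text(json.dumps(metadata, indent=2))

# Hypothetical submission package
make_package("sip/payload", {"Source-Organization": "Example Archive",
                             "External-Description": "Web crawl, election 2008"})
```

The point of the sketch is simply that the thing being managed is the payload plus its metadata and fixity information, not the bare bit streams alone.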
In other words, you have something you're trying to manage, but it has a lot of associated metadata with it. The reason I lump objects in with packages is that there's a somewhat murky boundary between the two. If you were trying to preserve a web page, you'd want to make sure that, as an object, it renders and is viewable as a page; but even distinct from the rendering, you want to think about the components as an entity together, whether that's images and spreadsheets or the different parts of a website or web page.

You may then be particularly focused on and interested in the in-application rendering: how was this actually perceived by somebody who used it? How might it have been manipulated and encountered on a particular platform? When I say "in application," that tends to suggest something running on your desktop, but these can very much be application or software environments that are hosted, or running in cloud environments, as well.

Next is the file as seen through the file system. This is core to a lot of the work I've been doing over the past decade, because the file system is where, in this whole stack of technologies, a lot of what an archivist would call provenance and original-order information resides. Rather than just having a file you might download from a website, if you have it within its file system you know things like the timestamps associated with it (modified, accessed, changed, created), the user accounts associated with it and their permissions, and the directory structure it was sitting inside of.
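As a brief illustration of what lives at this level, here is a sketch that reads file-system attributes for a single file. The path is hypothetical, and the `pwd` lookup assumes a Unix-like system. Note the contrast with the checksum at the end, which depends only on the file's own bytes.

```python
import hashlib
import os
import pwd  # Unix-only; maps numeric user IDs to account names
from datetime import datetime, timezone

path = "diskcopy/letters/draft.doc"  # hypothetical file

st = os.stat(path)
# These attributes live in the file system, not in the file's own bytes:
print("modified:", datetime.fromtimestamp(st.st_mtime, tz=timezone.utc))
print("accessed:", datetime.fromtimestamp(st.st_atime, tz=timezone.utc))
print("changed: ", datetime.fromtimestamp(st.st_ctime, tz=timezone.utc))
print("owner:   ", pwd.getpwuid(st.st_uid).pw_name)
print("mode:    ", oct(st.st_mode))

# By contrast, a checksum is computed over the bit stream alone:
# copy the file to new media and the digest stays the same, while
# the attributes printed above can be silently lost.
with open(path, "rb") as f:
    print("sha256:  ", hashlib.sha256(f.read()).hexdigest())
```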
None of these levels are mutually exclusive, of course, but often one takes a stronger focus than another. We may care about the raw bit stream, which is what anyone who has opened a file in a hex editor is really encountering: the bits, chunked out into bytes, that constitute the file itself. That's quite different from what's visible through the file system, because it doesn't carry any of that associated metadata; it's just the bit stream itself.

In many cases we attend to sub-file data structures: things that are important information but don't exist at the file level. They're not something your file system would name as a file; they're something you have to reference within a file, as an important component or possibly even as what constitutes the record or object, in the more abstract sense, that you care about.

A lot of the work that has advanced through libraries, archives, and museums over the past decade, which I've been closely involved with, has given a lot of attention to the bit stream through the input/output equipment. This addresses the fundamental issue of having something on removable media: if you have something on a floppy disk or a hard drive or a flash drive, just getting the data and recovering it off that underlying hardware requires a lot of decisions. Do I want to just pull the files off, looking through that file-system level of representation? Or do I actually want to copy the disk itself, sector by sector, so that I can replicate it as a disk image that could be mounted, accessed directly in an environment, run in an emulator, or any number of other scenarios?
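The sector-by-sector option can be sketched in a few lines. This is only an illustration of the idea, with a hypothetical device path; in practice you would put a write blocker between yourself and the media and use an established imaging tool such as dd, dcfldd, Guymager, or ewfacquire.

```python
import hashlib

def image_device(device: str, out_path: str, chunk: int = 512 * 4096) -> str:
    """Read a device end to end and write the raw sectors to an image
    file, hashing as we go so the copy can be verified later."""
    sha256 = hashlib.sha256()
    with open(device, "rb") as src, open(out_path, "wb") as dst:
        while True:
            block = src.read(chunk)
            if not block:
                break
            dst.write(block)
            sha256.update(block)
    return sha256.hexdigest()

# Hypothetical device node for an attached floppy or flash drive
print(image_device("/dev/sdb", "accession_001.img"))
```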
Even below that, you can have the raw signal stream: the actual analog properties, whether audio signals or magnetic flux transitions, that constitute what the bit stream is before it's interpreted by the machine as ones and zeros. There are, for example, certain voltage levels that drives and other technology will consider to be a one or a zero, but that doesn't mean they're objectively that. And finally, you have the bit stream on the physical medium itself.

So, to very briefly run through these levels with some examples. An aggregation could be something like this output from ContextMiner, a system developed by a project I worked on with several colleagues here at the University of North Carolina a number of years ago. As you can see, we were looking at the 2008 election in the U.S., so that was a good number of years ago. The idea is that ContextMiner goes out to different sources, in this case YouTube, and pulls back results from queries issued on particular topics. Through this view, what you're not seeing is the objects, or their metadata at the level of the granular object. You're seeing aggregate information about what was pulled down: for the query "election 2008," there were this many results so far, and this many results on the last crawl. You can group things that way and navigate them the way you would with an archival finding aid, for example.

Then there can be an object or package, which would be one of these videos collected through YouTube. That includes not only, say, the Flash file that was there at the time, to be rendered and perhaps downloaded as an MPEG or something else that can be stored locally, but also the metadata associated with it: comments, views, the name of the file, the description associated with it.

In many cases the in-application rendering is fundamental to our understanding of, and the meaning associated with, the object. In this case there could be a video from a collection associated with, again, the U.S. election (I apologize for this very U.S.-centric example, but this is what we were working on) that you'd want to be able to encounter in the browser as it would have appeared at the original time.

The file through the file system can generally be encountered in one of two ways: through the command-line interface, which you can see at the top here, or through the graphical user interface a little lower on the screen. In this case the dir command at the C prompt, or whatever other prompt you're looking at in a Windows environment (this would be ls in a Linux environment), has this /a switch, which says: show me all the hidden files, the things that might not be readily visible if you're encountering the system with the normal settings of your operating system. So we can see some of the little traces left behind in the file system.

If we're looking at a file as a raw bit stream, it is easily viewable in a hex editor, where you can see on the left side the hexadecimal representation. Two nibbles make a byte, so the "00" there constitutes one byte value, and on the right you can see what that would be in ASCII, readable if it turns out to be text. This is also what we base checksums on: when libraries, archives, and museums generate cryptographic hashes of files, they're not doing it based on this whole agglomeration of information; they're basing it solely on the bits within the file.

We can also think about sub-file data structures within the Microsoft Office environment. Quite a few contemporary file formats, even though they have a different extension (so this is hidden from the view of the average user), are actually ZIP files, and this has been the case with Microsoft Office files for quite a number of years now. If you open a Word document not in Word but in WinZip or 7-Zip or some other archive application, you can see the internal structure of that file; you're essentially walking a tree, as if you were looking at a file system. In that case you may very well be interested in, say, embedded thumbnails or metadata or images that are embedded in the file, which you can pull out without having to treat everything at the individual file level.
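Here is a small sketch of that, using Python's standard zipfile module against a hypothetical .docx. The thumbnail path follows the usual OOXML convention, which only some Office files actually include.

```python
import zipfile

# A .docx is a ZIP container; the same trick works for .xlsx and .pptx.
with zipfile.ZipFile("report.docx") as zf:  # hypothetical file
    # Walk the internal tree, much like listing a small file system
    for info in zf.infolist():
        print(info.filename, info.file_size)
    # Pull out one sub-file structure without touching the rest:
    # the embedded thumbnail that some Office files carry.
    if "docProps/thumbnail.jpeg" in zf.namelist():
        with zf.open("docProps/thumbnail.jpeg") as thumb, \
             open("thumbnail.jpeg", "wb") as out:
            out.write(thumb.read())
```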
The bit stream through the input/output equipment, as I said earlier, tends to manifest as a disk image, which means reading through all the sectors of the disk and then writing them to a file, or a split set of files, that can be processed later in a variety of ways.

You can look at the raw signal stream through the input/output equipment. This is a KryoFlux, a device I presume some people in this webinar are already familiar with. If you're not, the KryoFlux is essentially a way to extract and recover data from older media, especially floppy disks that might be problematic in some way: there might be some corruption, or there might be issues with the encoding such that your drive doesn't read it correctly. You can see in the top right corner that those lines indicate the actual magnetic flux transitions that constitute the data coming through the wire. So even though we talk about all of this as ones and zeros, ultimately these are physical, analog properties that have to be detected as ones or zeros.

Another example of this that may resonate with some people is using a cassette for digital storage. Back in the days of computing when I came of age, which was around the time of the Commodore 64 in the U.S. and the BBC Micro in the UK, there was a lot of use of tape as a companion to floppy disks for data. Essentially the tape would make squeaky sounds, and those squeaky sounds would be interpreted as ones and zeros. If you're in the position of taking one of these tapes and wanting to move its contents to a contemporary environment, there's essentially a two-stage process: you first have to play back the squeaks in the right way so that they generate those analog audio signals, and then you have to break them down into ones and zeros in some way.

And then finally, ultimately, there are physical properties on the medium itself. In many ways, so much of the work of digital curation, at least the immediate recovery and securing of the content, is about pulling stuff off the medium so we don't have to rely on it. Media are brittle; they become obsolete; they're not going to be readable forever. What you're looking at in this case is a very high-powered microscope image of the surface of a hard drive, but those are the physical properties that ultimately manifest the bits.

So, if we bring this back to the shifts in libraries, archives, and museums that are the focus of this series and of my talk: core LAM functions involve numerous decisions based on various patterns, on commonalities, differences, and contextual relationships. Luckily, when those patterns can be identified algorithmically, software can really assist in the process. I always try to use terms like "computer-assisted" as opposed to "automated," because there are very few core library, archives, and museums functions that can be automated in any meaningful way. We can support the work very thoroughly with software; we can use software for things humans do badly or inefficiently; but ultimately the decisions about what we should tune the systems around, and the decisions to make around their output, are human decisions.

Compared to analog materials, LAM functions are often more iterative, and they rely on data sources and streams shared across the functions. As a quick way of explaining this, we can think about appraisal: the function that archives refer to as appraisal, which is essentially collection development and selection. What should go into a collection to be retained, for its continuing value, over a long period of time? It's very difficult to make appraisal decisions if you're staring at a floppy disk and holding it up to the light. It's not going to tell you too much.
There might be some label on it that gives you some idea of what's there, but essentially, as someone caring for the materials works through these levels of representation, more and more information reveals itself. That can give you more and more ideas about whether this is material that should be retained, in what form it should be retained, what characteristics are essential for its preservation, and what elements should be part of its description and access points. So there are always a lot of iterative and interactive elements in all these functions that libraries, archives, and museums engage in. But because of the symbolically encoded and explicit nature of digital representation, I would contend that this iterative and interactional component of library, archives, and museum workflows tends to define digital curation in a way that wasn't quite the case with analog materials.

I was fortunate to have worked with a number of colleagues at the University of North Carolina and the Educopia Institute, along with partners in other institutions, LYRASIS, and Artefactual, on a project called OSSArcFlow, where we looked at workflows in a diverse set of institutions. Again, this was a very U.S.-centric project because it was funded by a U.S. funding source, so we didn't have international representation, but we did have quite a few different kinds of institutions: public libraries, large research institutions, larger and smaller archives, and state archives. We looked at what their workflows were with born-digital materials when they were using a relatively similar set of software. They were all using some combination of the BitCurator environment (which I'll talk about a bit more), Archivematica, and ArchivesSpace, in different combinations of those three tools, and also alongside other software they were using. In this project we looked at the different steps in their workflows and figured out different ways to represent them that might be useful for sharing, comparing, and building off each other's workflows. Here is one example of a workflow: it shows the steps along the way, at the top the software somebody might use to carry out those steps, and at the bottom who might be involved in them. Breaking the work down that way, I think, can really reveal the places where functional components interact, and where software can reveal and exploit information that often emerges only at particular points in the process.

One of the projects we worked on in the past is the BitCurator project, funded by the Andrew W. Mellon Foundation.
It was a partnership between the School of Information and Library Science and the Maryland Institute for Technology in the Humanities, funded by the Mellon Foundation over a three-year period with two different phases. What we wanted to do was package up, distribute, provide support and documentation for, and provide training around digital forensics software, all of it free and open source, that could be incorporated into LAM workflows and could help support the provision of access to data. Wider public access is something that digital forensics tools don't normally get pointed at: they come out of an industry that isn't trying to share information, but to use it in a very circumscribed set of environments.

The BitCurator environment itself is a specialized Ubuntu Linux distribution. It can be run as a virtual machine. It can also (and this slide shouldn't actually say "coming very soon," because we now have these installation scripts in place) be built from an existing Ubuntu Linux machine, turning it into a BitCurator environment. You can run it as the native operating system on a machine, or on a partition, and there are a lot of individual components that you can run in Linux or, when possible, in a Windows environment.

The steps in the process that the BitCurator environment is really intended to support are acquisition, reporting, redaction, and metadata export, all things I'm very happy to go into in more detail during the discussion. These tools and their associated documentation are maintained, cultivated, and stewarded by the BitCurator Consortium, which has been very active. It's an international body that member institutions pay into, and they then get to set the direction of the consortium and the software it supports. It also runs a number of activities, including a very active user list that is open to anyone outside the consortium.

We then had a project called BitCurator Access, also funded by the Andrew W. Mellon Foundation. This looked at the question: now that a lot of output has been generated in institutions from these digital forensics tools, how can you provide access to it? What should the access points be? What sort of filtering, redaction, and restriction of the content should there be? One of the things we developed out of this project, through a contract with a developer named Greg Jansen, was the BitCurator Access redaction tool, which allows you to take a disk image and then scrub, fill, or fuzz the content you don't want to expose to public view. Scrub means you replace it with zeros; fill means you fill it with some particular hex sequence you choose; and fuzzing, specifically for binary files, tries to break executables, basically for cases where you don't have the rights to distribute that software.

BitCurator NLP was another project that built off the other two, also funded by the Mellon Foundation. This is where we really tried to point a lot of natural language processing software at the same set of tasks: how do we identify things like the names of people, places, and organizations, in a way that allows both the people caring for collections and the people trying to make use of them to better navigate content that can often be too difficult to approach at an individual item level?
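As a generic illustration of that named entity recognition step (not the project's actual code), here is a minimal sketch using spaCy and its standard small English model:

```python
import spacy

# Load a pretrained pipeline; which model you choose, and whether you
# document that choice, matters, as comes up later in the Q&A.
nlp = spacy.load("en_core_web_sm")

text = ("The papers were transferred from the State Archives of North "
        "Carolina in Raleigh to the university in 2018.")

doc = nlp(text)
for ent in doc.ents:
    # e.g. ORG 'State Archives of North Carolina', GPE 'Raleigh'
    print(ent.label_, ent.text)
```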
This is essentially the workflow, at the lowest level in terms of the tools, supported by the work we did. You have a corpus of disk images; you have to have some way of reading into those disk images and seeing their file systems; then you pull the files out in text form so that the natural language processing tools can be applied to them. That could be topic modeling, for example, which is one thing we did some work on, or named entity recognition, which I mentioned previously.

This is the kind of thing you can do with this sort of software. With a disk image placed into a directory on a server running the BitCurator Access software, you can navigate down into the disk image without ever having previously extracted the files. It's just a disk image sitting on a server, and this can be running locally in a reading room, it could be on the public web, or it could just be on the machine of somebody caring for the collection. You can also extract the text from a file, say a PDF (we all know this can be done to varying degrees of success, depending on what the PDF is), and then, on the fly, apply named entity recognition to learn what organizations, people, and places are associated with that collection.

Another possible direction for all of this is topic modeling, which is basically looking at term co-occurrence within a collection and saying: maybe all of these items are somehow related. Say papers were transferred to an archive or a manuscript repository; these are the items related to this particular scholar's personal life, and these are the ones related to her work life. Terms tend to occur in one group or the other, and that can help with that kind of classification.

A project we just recently wrapped up, also funded by the Mellon Foundation, was RATOM, which looked specifically at email. We worked with some really terrific partners at the state archives; a lot of action happens at the state level in the U.S., and Raleigh, North Carolina, which is not too far from us, is where the state capital is. We really tried to incorporate a variety of tools to more efficiently process collections of email and then make them easy to analyze, extract from, and manipulate using some of these other tools.
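The first step, pulling text out of an email archive so the downstream tools have something to work on, can be sketched with Python's standard mailbox module. This is an illustration of the idea, not RATOM's actual code, and the mbox filename is hypothetical:

```python
import mailbox
from email.header import decode_header, make_header

def plain_text_parts(msg):
    """Yield the decoded text/plain bodies of a message, skipping attachments."""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload:
                yield payload.decode(part.get_content_charset() or "utf-8",
                                     errors="replace")

box = mailbox.mbox("accession_042.mbox")  # hypothetical accession
for msg in box:
    subject = str(make_header(decode_header(msg["subject"] or "")))
    body = "\n".join(plain_text_parts(msg))
    # 'body' is now ready to be handed to NER, topic modeling, etc.
    print(subject, len(body))
```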
This is essentially the breakdown of the software. There are various components that pull the data out of an mbox format, or an OST or PST file from a Microsoft environment, and run various processes against the messages, including natural language processing. There's also an iterative archival processing interface that we developed, which allows an archivist to go in and see some of these predefined tags, but then also apply their own tags and indicate where items are in their process. Have they already been identified as records? Have they been identified as having sensitivities that require them to be closed from public review, for example? Here's where you can find more information about the RATOM project.

Finally, in terms of projects our team has been working on: Kam Woods and I, along with our very talented software engineer, have a project funded by the National Science Foundation in the U.S. that is trying to better support review, for sensitivity, of scholarly products and communication: the things that come out of research activities.

What I wanted to conclude with, before we launch into our discussion, is that there are benefits to developing digital curation tools that can be inserted into various places in a specific workflow, rather than forming one very tightly regimented pathway that can only be executed in a particular software environment. We're often dealing with relatively long time horizons, and we all know the technology needs to migrate. Companies go away; their business models change; their licensing agreements change. So we have to be able to build more modular approaches to how we do this sort of work. Decisions, and the sources of data for making them, appear at a variety of steps in these LAM functions, so the functions are especially amenable to a modular, reusable set of tools that can be picked up and used in a variety of settings.

I say this not to privilege the particular projects I just summarized for you, which of course I've been involved in, but to use them as illustrations of a more general approach: breaking the work down to say that this is largely about detecting patterns, and about a set of tasks that interact and flow between the different functions we've always traditionally performed. How can we do that in a way that enables the work to happen more efficiently, but also makes it more transparent?

And that's the final point I'll conclude with. Tools and processes that introduce things like natural language processing and machine learning, which are very much all the rage and under discussion, for very good reason, in libraries, archives, and museums, should be transparent and subject to community review and community contributions. That's really what has driven a lot of the work I've been involved in: not just that these things should be open source, but that they should be developed in a way that allows people to replicate the tasks you've performed and understand why they're seeing the output they're seeing.

And so those are essentially the remarks that I would call introductory to the discussion.
Hopefully. And I'm very happy to see what we come up with in the discussion and the Q&A.

Well, Cal, you've blown us away with that. I think that was an absolutely fantastic introduction, as you say, to what hopefully is going to be a really rich discussion around these issues. So I'd encourage everyone to put questions in. For now, I'm going to kick us off with something. I'm going to take people back to the YouTube video of Hillary Clinton, just to give us a bit of a visual reminder on that one.

I guess I could bring that up while you're asking the question, just so people know.

That would be great. So my question, I suppose, is about cloud computing. That was a 2008 project; cloud computing and particularly the use of third-party platforms like YouTube were with us, but we were only just getting used to them at that point, in terms of them becoming ubiquitous. And now they are ubiquitous. What are the implications of that for digital curation? I mean, we don't get the stuff anymore, right?

Right, right. Yeah. So I would say, for cloud computing, maybe this example isn't so important to have on the screen, but it's an example where the platform is obviously a web interface. The platform involves keeping Flash going; I'm sure many people who are part of this webinar have encountered these issues of how you maintain something that's highly interactive. Not just a website that might be database-driven and interactive in that sense, but something that actually requires a whole set of tools built on top of each other, and that might also require user input to even operate correctly. In that way, I don't think that's a new question, though it's still a challenging one: how do we deal with things that have those software dependencies?

The more fundamental question you're really asking, I think, is about when we're dealing with a cloud platform (and maybe I'll stop sharing this now) in which some of those levels I talked about are completely moved away from the control or view of libraries, archives, and museums, and are essentially built into the stack being provided to you. I would say it's worth revisiting those levels of representation frequently and asking: are these still really relevant? Of course I'm quite biased, because these are levels I came up with, but I think in a cloud computing environment they're still just as important, because we still need to be thinking about the same questions. The cloud is where I'm pulling the content from, and maybe someone else is hosting it, but ultimately I need to know: do I need to get to certain metadata associated with the file that would otherwise be in the file system and is now on the hosting side? Maybe it literally is in a file system there, but it might also be sitting in a database on that platform. And that's where we get attributes like timestamps and file permissions.
So essentially that same level of abstraction is still relevant, in the same way that you might also want to look at the internal content of a file and its bit stream. A Google Doc is probably a very good example here. A Google Doc in its native habitat is a database: a database that has a lot of diffs, a lot of transactional data, keystroke-level data about what edits were made to the file. If we think about it from a preservation perspective, we can say, all right, I'm going to work with Google to maintain that database, as we like to say in the U.S. in terms of the National Archives and Records Administration, until the end of the republic (which hopefully is not within the next few years, but this remains to be seen). In other words, for a very long period of time. Or we can say: essentially what I need to do is pull out of that environment what I think the significant properties are and maintain them in a more static form. In some ways it's artificial to say a Google Doc is the Word document I downloaded from Google, because it really isn't stored that way by Google; it's stored as a database with a bunch of little shredded pieces of transactional data that can be reconstructed in front of you to look like a document.

But the reality is that when it comes to those levels of representation, if I care about the document in terms of its in-application rendering, if I care about it as an object I can recreate, those are essentially behavioral and aesthetic characteristics. This term "significant properties" that we've had for a number of years should still carry forward: we still need to be thinking about what the "it" is that we're trying to maintain. And I don't know if that gets at what you were asking. There are obviously a lot of institutional issues about whether we should or shouldn't rely on various hosting providers, and what the costs of cloud storage are over time, and those are also really fundamental questions. But when it comes to these levels of representation, we still very much have the same issues. If you're looking at the stuff you're caring for in a digital collection, we always need to remove ourselves from the underlying technology and ask: what are the properties I care about? It just happens to be the case that each of these levels might have some of those properties residing within it. And that's why archivists and librarians and museum curators sometimes have to be focused on hex editors and all these things that are, in some ways, below the level at which they would hope the end user would have to care, but that can be fundamental to the attributes we're really trying to reproduce over time.
No, totally. And on those wider questions, I think we've got a couple of questions in here, and I think Muhammad's question will pick up on that. But I just wanted to say: I think that's very interesting, because when you look at the levels, you said, I think, that dropping down to level five is where you start thinking about the concepts of archival provenance and original order. The archivist dealing with an analog collection in some ways says, "Here's some stuff in a box, and I'm now thinking intellectually about its provenance, its original order, and its intellectual content," which pushes you up the levels, up from five. Below that (and thank you, Cal, for sharing that), it may be a conservation issue, but essentially, if it's a piece of paper and I can read it, I'm not really expecting to get involved. I suppose one of the challenges digital raises for us is that we may have to get involved, much more involved, in those lower levels. Though as you say, not always: if you end up with a document you can read, that may be a good thing, and if the alternative is to start a lawsuit with Google, then it may not be practical.

Right. And it's also a very broad sort of royal "we," because it just means that collectively, as professionals, we need to be attending to these levels. It doesn't mean that everyone engaged in the work needs to be opening hex editors and things like that. But I really have found, and this has been my experience both working as a professional in this field and working as a researcher in it, that being able to navigate between these levels and understand their relationships is really fundamental to the work. It really is the records themselves, and their characteristics, that you care about from the perspective of meaning, the human things we care about. If we don't tend to these levels, it's very easy to lose information in an irreparable way and not be able to recover it, because we weren't tending to the different levels.

Yeah, no, absolutely. So Muhammad's question is about the implications of modern database systems in health care. I think it's possibly similar to that Google challenge: they're often quite distributed, and, you know, where is the information? Where does it sit? Has it been outsourced? All those kinds of questions. So I wonder if you had any thoughts about that.

Yeah, and I'd love to hear more about the question, because I don't want to assume the wrong things about what it's asking. I don't know if Muhammad is able to join us.
He'd be very welcome to join and maybe ask the question live, or clarify it. And if not, I'm happy to answer it as I interpret it and then see how well I do. Let's see how we go.

I do think health care records, and their distributed nature, raise another thing that often comes to mind when people look at these levels I tend to obsess about: aren't things so distributed now that it's really not a matter of just pulling things off one disk? If you're thinking of somebody's distributed electronic health record, for example, it's very likely not sitting in one database on one machine where you could pull the drive and then have a hard drive with all their records on it. It's going to be distributed, and when it gets pulled together is really when we again encounter those levels. It might be that the object or package we care about is, in the back end, actually pulling from eight or ten different sources. And that's not so different, I think, from what I was saying about web pages. If we care about a web page and want to be able to preserve it and render it 50 or 100 years into the future, the reality is that the page might have elements drawn from seven or eight different servers across the world: a banner from one place, an image from another. So we really have to think about the separation of these levels. I'm not in any way implying that you just work your way down and ultimately at the bottom there's one disk where everything is stored. But I'm not sure if that really gets at the question.

Obviously there could be things specific to medical records. When it comes to the sensitivities, I know in the U.S. we have a very specific set of identifiers that tend to be discussed as the personally identifying ones; the Social Security number, for example, and most countries have something comparable (though most of them handle theirs much better than the U.S. does with our Social Security numbers). Those might be things you want to protect. But the reality is that because these systems are so distributed and can be re-aggregated, things like natural language processing and machine learning can really help identify what might be sensitive, in a way that isn't just a matter of doing a simple text-string search for things that look like phone numbers.
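To make that contrast concrete, here is a minimal sketch of the kind of simple pattern scan being described, with deliberately U.S.-centric patterns and made-up sample data. The closing comment is the point: surface patterns are easy to catch, re-identification risks are not.

```python
import re

# Simple surface patterns: easy to scan for, but they only catch
# sensitive values that *look* like the pattern.
PATTERNS = {
    "US SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "US phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
    "email":    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan(text: str):
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            yield label, match.group()

sample = "Reach Jane at 919-555-0143; her SSN 078-05-1120 must be redacted."
for hit in scan(sample):
    print(hit)

# What this cannot do: notice that an innocuous-looking combination of
# fields (date of birth + zip code + diagnosis) re-identifies someone
# once distributed records are re-aggregated.
```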
Which is also a powerful thing. Like I said, it's all about patterns: if those are the patterns you can identify, those are the patterns that you identify. Looking for sensitivities tends to be driven by the kinds of patterns we can most easily identify, because that's what the software can do. But it does raise the risk that we overlook things, that we don't properly anonymize someone, whether it's human-subjects research data or anything else, because things can be re-aggregated in such easy ways.

Absolutely. And I suppose the more distributed and disaggregated the system, the more you're going to need powerful tools to conduct that sensitivity review. Obviously scale alone would stop this being done in a non-automated way, but the more distributed it is, the more we need those tools. Which also brings us on to the next question, about transparency around machine learning, and how it can be subject to community review and contributions. I mean, there's possibly a long tradition of opacity within archival work, the mystery of what the archivist actually does, and I hope that maybe one of the things that comes out of digital archives is more transparency about what the archivist actually does.

Yeah. So I'll have a few things to say about that, and I have to offer a very important caveat up front: I'm not a core machine learning researcher. I'm not going to be offering new machine learning algorithms that aren't already in place. A lot of the work we do is much more pragmatic, in the sense that we pull together existing tools, look at use cases that are very important in libraries, archives, and museums, and try to apply them. With that said, in our own work we've tried to look at transparency in a couple of different ways. One is documenting the work we do and making it very clear which tools we're drawing from, so people can say, "Okay, I ran those tools and got the same output," or "I ran them and got different output," and can start making inferences: what does that mean for our work processes? How do we evaluate them? How do we compare them? But also, very importantly, things like the provenance and chain of custody, again from an archival perspective, around what these tools produce. What model was used? When was it run? How was it run? What commands were issued when it was run? Those are things that are relatively easy to capture and reflect, if you think to do so.
And so that's one thing that librarians, archivists, and museum curators have really benefited from in adopting a lot of the digital forensics software: it tends to be meticulous about documenting what tool was used, what command was issued at what time, pointing at what directory, and all these sorts of things, because its output might be brought into court, where every single one of those details could be called into question. So from the research team I work with, that's how we approach transparency.

There is a whole set of other associated issues under active research now, in terms of what's often labeled explainable and understandable AI. We all know that one of the issues with machine learning is that it's essentially a black box: you get an output, and because it's often based on the weightings of networks, there's no way to entirely know why that output came from the input you fed in. So there's a lot of active research on addressing the understandability and explainability of AI, so that we're not just relying on the machine learning and hoping it does the right thing.

Tying it back to this idea of provenance and chain of custody: one of the things we did, which sounds quite trivial, in this sensitivity-review project we're working on now, was to build a utility for people to better choose and document the models they use when they run this machine learning. There are default models; spaCy, for example, which does natural language processing, has certain models that are the default for English. You can have bigger models that run more slowly but have slightly higher F-scores for identifying strings. You can have models in different languages (spaCy supports models in a variety of languages). And you could also train a custom model on your own materials. In our experience there wasn't really a terrific utility for doing that, because the presumption seems to be that whoever uses these tools has one model they're really working with: they plug it in and they're done. In a library, archives, and museum setting, it could very well be that you need to use a diversity of models depending on what kind of content you're working with. And if you don't both select that model explicitly and document which one you used, that can be quite problematic from a transparency perspective, because how do people really know what they're getting when they look at the output? If the system definitively identified this object as this thing, what model was that based on? Was it a model whose training data was based on, say, shopping catalogs, as opposed to the sorts of materials that might be in your collection?
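In that spirit, here is a hedged sketch (not the project's actual utility) of recording enough run metadata to audit or reproduce a spaCy NER run later; the model name is the standard small English pipeline:

```python
import datetime
import hashlib
import json
import sys

import spacy

def run_ner_with_provenance(model_name: str, text: str) -> dict:
    """Run NER and return the entities together with enough metadata
    to reproduce or audit the run later."""
    nlp = spacy.load(model_name)
    doc = nlp(text)
    return {
        "entities": [(e.label_, e.text) for e in doc.ents],
        "provenance": {
            # Which model, which versions, when, and over what input
            "model": nlp.meta.get("lang", "") + "_" + nlp.meta.get("name", ""),
            "model_version": nlp.meta.get("version"),
            "spacy_version": spacy.__version__,
            "python_version": sys.version.split()[0],
            "run_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "input_sha256": hashlib.sha256(text.encode()).hexdigest(),
        },
    }

result = run_ner_with_provenance("en_core_web_sm", "Letter from Raleigh, 1998.")
print(json.dumps(result, indent=2))
```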
So I hope that's not too long-winded an answer. But I think there are ways that those of us (and this maybe gets a little more toward Vicki's question about speaking with people in IT) who, I'll speak for myself, aren't machine learning experts can still make sure this work is transparent and publicly reviewable, while also really pushing the more technical people who are actually developing the artificial intelligence technologies to make sure they're attentive to these issues of understandability and explainability. We should never be in a position where we simply step back and say, "The machine learning seems to be working okay, so I guess we'll trust whatever it gives us." There are examples in the news media practically every day of cases in which machine learning leads to outcomes people wouldn't have intended. We have to really question the assumptions built into it and constantly revisit them.

So let's come on to Vicki's question, and thank you, Cal. Obviously everyone's seen it, but I'll paraphrase you a little if I may, Vicki. This is the question about the fact that IT services departments think in terms of a three-year block, or a three-month block, understandably: they're trying to deliver on time, with technology changing all the time. And I suppose it's about how the archivist communicates that we (I speak as an archivist here; I've had these conversations) need to think in terms of tens or hundreds of years. How do we build that bridge that says, actually, we can do that? I suppose it's just whether you have any tips about that. That, Cal, is what Vicki's asking.

Yes, absolutely. I think there are two primary answers to this, and they're both kind of hard for us, I will say presumptuously, if we're all people who consider ourselves in the LAM world. One is adopting the terminology of those in the IT sector: really understanding the work they do, the words they're using, the decisions they have to make. As my academic advisor, Margaret Hedstrom, conveyed to me many years ago, it's letting them know you feel their pain. Essentially, if you just sound preachy and demanding, it's going to be a very difficult conversation. If you can meet people halfway and explain things in their terms, in terms of interoperability, APIs, availability of source code, you can get somewhere. The repertoire of concepts necessary to have those conversations is often smaller than we might first recognize. I actually do a great deal of teaching in this area, where I try to train people who don't have a lot of background in these terms and concepts to be able to have these conversations. So I think one side of it is really translating things into their terms, and part of that is recognizing that they are working on short time horizons. For somebody working on a three-, five-, or ten-year horizon, what's going to resonate? Well, they understand issues like system migration. They understand interoperability. They understand what an API is. They know what metadata standards are.
You essentially say: these are the things that need to be baked into the system as we're managing it now, as opposed to shaking our finger and saying, "You aren't thinking of a century from now." I'm not for a second saying that Vicki is doing that, but I've had a tendency to sometimes do it myself.

The other side of this is recognizing that the time horizons we insist on ultimately really are short-term time horizons, just all concatenated together. It is impossible to preserve something digitally for 200 years. There are interesting questions about media you can use that are reliable enough to withstand environmental risks over a long period of time, but essentially, because of all these levels of representation, it just isn't possible. Anyone who comes to you and says they have a 200-year solution: you need to ask them a very, very careful set of questions, or just walk away. The reality is that a 200-year solution is a string of five-to-ten-year solutions. And once you start having that conversation with people who work in information technology, I think it's a much more fruitful conversation, one where you recognize they can't promise that something is going to be renderable, understandable, and interpretable in 50 years, because nobody can. What they can do is work with you to make sure you adopt standards, practices, software, and workflows that make it much, much easier to get over that next ten-year hurdle. Ultimately, to use the metaphor, it's kicking the can down the road; that's really what the work is. At least from my perspective, digital preservation is just a series of several-year actions.

So it requires the work of creative people who can have these conversations with IT administrators, and I'm not for a second suggesting it's easy; these are challenging discussions. But those I've really admired and observed in the professions, who do this work well, can really do this translation. They're able to work with the people who are more the programmers and system administrators. They recognize, on that first point I mentioned, that their work is super hard, that they can't drop everything to do your work, because they're trying to make sure there aren't massive security breaches and that people's systems keep running. But they also recognize that we may need to rethink a bit what our demands and requirements are. It's not "Put this somewhere where I can be assured it'll be accessible forever"; it's "Put this somewhere where I can feel confident we'll know how to move it along to new technology when we need to." I don't know if that gets at the question, but that's how I've tried to approach things, and I feel like those who do this successfully tend to approach it that way as well.
So the idea is that actually we're passing on to another archivist We're not we're not actually I'm not going to be looking after it in 200 years time Well, I imagine not and and so yeah, there's that there's always that that that sense of Of passing I think the other thing that I've observed certainly on this is that if I may is that Is that it people worry they worry about that long-term stuff as well They're not they're not immune to worrying about that and actually being able to offer them someone who's who says actually I've got this one I've got that long-term view. I I'm I can take some of that on it's actually something they welcome Actually, it's not necessarily a problem for them. It may be a solution for them Yeah, and there's there's a metaphor that's often used I don't know how much it is in other countries, but certainly in the us of getting to the table Right the sd. You want to be sitting at the table with them? Well, once that you're at the table You also have to kind of guy as you're suggesting offer something as well Well, here are ideas I have for you about how you can do this more easily, right? I think ultimately just to tie back to my long series of decisions kind of idea of digital preservation is that whenever I'm confronted for the question from somebody about what is the best long-term Technical solution to digital preservation. My answer is always hiring passionate creative dedicated people Right, because ultimately they're going to have to long deal with a long series of technical challenges And that's really what you need to be doing, right? Yeah, absolutely So I'm just going to thank you Vicki for a really really interesting question Move on to Duncan's Duncan says fantastic presentation Any thoughts around trying to introduce better access to open source tools within institutions that are bound by strict IT service and procurement policies Yeah We move on to another question I mean, this is so institution-specific, right? Because I mean, it's not only the possibility that your institution might be very wedded to proprietary tools You could also be in an environment. I mean like stanford university just to pick one in the us as an example They've got a culture of sort of commercializing software, right? So there's a lot of attention within an institution like that to if you're developing software What's the market for it? How is it going to be packaged and distributed and used as kind of a product? As opposed to the notion that you're trying to sustain it Internally to to run your operations, right? And so I think there can be a lot of competing pressures based on what kind of institution I will say that in the us the state government level tends to be the most challenging in this way state governments are very Risk averse largely because they're so resource strapped And so there's sort of an irony built into being resource strapped so that you insist on proprietary solutions that might be very problematic and expensive over the long term Because they seem less risky than adopting open source software that you might have to depend on some community and you might have to You know do some more of the development internally and so I'm not I think I'm probably rambling and not really answering the question. I don't know that I have There aren't really specific definitive answers at least in my experience to this. It really depends on What what stories work best with the people you're working with in your institution, right? 
And I think the BitCurator Consortium is just one example. We've had this conversation many, many times, because if you're running a membership-based organization, you need to make sure you can accumulate and retain members. How do those who use the software convince their leadership that they should be paying into this consortium? A lot of that argument is: I'm paying into a community. I could buy proprietary software, pay for a service contract that lasts a year or two or three, and then just hope that at the end of that contract we still want to use the software, it's still affordable to us, it's still supported by that company, and they still offer the services we need to support it. Or I could adopt this open-source software that has a whole community around it, and pay into a membership-based consortium that will provide the kind of support I need: a dedicated help desk, and a forum of peers who use the same software and can give me their advice when they've run into the same technical challenges. So to me it's about translating this into terms that work for procurement people who think procurement has to mean a proprietary system with a price tag and a support contract attached. You can convey those same notions of having support, training, and so on, but not through a single vendor selling you software; it's very often community-based support. And I have my own biases, but I think that's really where we need to be going professionally. Not only philosophically, because of public review and being able to keep control of our own processes, but also because in many cases, in the long term, it's more cost-effective: you're not locked into particular systems. But I'll get back to saying I have no idea as well, because these conversations are very difficult. You can often hit a brick wall, where people just aren't listening to that argument.

And I think it's also worth saying that those kinds of outsourced proprietary solutions are not without risk. Having the world's tightest contract doesn't mean you won't end up, at the end of it, with a phone ringing in an empty warehouse somewhere. So part of it is recognizing that those aren't risk-free solutions: you're buying into a company that could just fall over, and then you can't get anything.

Right. And I'm not meaning to suggest that proprietary software should never be part of any of these workflows; it can play an important role too. But I think especially with really core functions, thinking about how the software can be communally shared is a really, really powerful thing.

Yeah, absolutely. So, we've got only a couple of minutes, so I'm going to put a final question, if I may; there are no more in the chat. We're both archivists, Cal, and our audience is drawn from across the library, information, heritage, and research sectors. Should we be doing more to draw in a wider group?
Should we as archivists be doing more to draw in a wider group, or do archivists have a particular place in this? That's what I'm asking, I suppose.

Yeah, I think it's a great question, and one that I probably can't answer in its entirety. So I'll pull it back again to a local example, the BitCurator environment, which is a platform that pulls in a lot of open-source digital forensics tools, but in a way that is documented and put together to support the workflows most likely to be present in libraries, archives, and museums. And we're not alone in this; there's a whole tradition of reusing software in a lot of ways. But I think it's a good example, because to me, going it alone, basically hiding in a room with a bunch of our colleagues and saying, "Let's build a system from the ground up based on our specs," is often not realistic, in a number of ways. We don't have the resources; that's not how the industry works; and we might not have enough of a customer base to sustain a market if it's just us, in some cases. But there are distinct needs in libraries, archives, and museums, and we have certain conditions we insist on, partially because we have such long time horizons. So I think the answer really sits in the middle of that spectrum. We need to reuse, build from, and interact very closely with people in other sectors, but also insist on certain things that are, to use a hokey business term, our core competencies: things that can't just be outsourced to someone else. There are those human decisions, based on those patterns I was talking about at the beginning of my talk. If those patterns and decisions are distinct to our professional space, we can't cede control of them, because nobody else is ever going to prioritize them the way we do. But if we just try to go it alone and develop all the tools ourselves, I'm pretty sure it's going to fail, because we just don't have the resources, the market, or the full stack of technology to ever really do that. We need to rely on others, and they can rely on us.

Absolutely, Cal. In the same way, we don't maintain our store buildings; we don't maintain the brickwork and the plumbing and things like this. It has that same feel to it, if we were to give it an analog archive or museum analogy.