It's definitely a very tough time, not only here in the US but around the globe, and we want to acknowledge that for our Black colleagues. With everything that's going on, we hope things are going to get better. We stand in solidarity with Black lives, we know that Black lives matter, and we stand with you in whatever you're going through right now, knowing that in the end we're going to see something good. So again, thank you so much for coming. This session is actually a kickoff for us: Angel and I have been putting together a webinar series where we talk about emerging technologies related to archives. Emerging technologies are affecting everything we do, and we want to see how they affect our practices as archivists, but also researchers and everybody who deals with research or with archives. We have more webinars coming, but this first one focuses on artificial intelligence and archives, and we are so pleased and happy to have Dr. Anthea Seles, Secretary General of the International Council on Archives, kicking off our presentation today. One thing we want you to know: if you have questions, there's a Q&A tab, so please type your questions in there. If you have a comment to share with the whole panel, feel free to use the chat, but the Q&A tab is definitely the place for questions. Something else we want to make sure you understand is that we're going to give our presenter time to go through her presentation.
After she's done, there will be time to answer your questions. With that said, I will hand it over to Dr. Seles. Thank you. Hello, everybody. Thank you to Rebecca, Azure, Krista, Jody and Claire, and to the Mellon Foundation, for having me here today to speak on archives and artificial intelligence, especially during this difficult time for our colleagues around the world who are facing discrimination and oppression. ICA stands with you. ICA will support you. Please be assured that we are here, we are listening to your concerns, and we will support you through what is happening. I'm going to share my screen right now; excuse me one second while I get the right piece of information up. Okay, there we go. So, as I said, I'm going to be talking about artificial intelligence in archives today. This research and the topics I'll cover emanate partly from my time as head of digital transfer and digital appraisal and selection at the National Archives UK, where we tested artificial intelligence software, and partly from research I have continued as Secretary General of the International Council on Archives. A brief overview of what I'll be discussing today: what artificial intelligence is; the use of artificial intelligence in government; acknowledging artificial intelligence as evidence and as the archival record of the future; the ethical challenges and the role of the archivist in this space; the impact on information management practice and the implications for using artificial intelligence technologies within our own practice; and the automating of archival practice in terms of appraisal, selection and sensitivity review.
I'll also be touching on access and reuse of born-digital records, and I would say digitized records as well, in research, the automation of that research, and researcher expectations. First, a few definitions of terms I often use in my presentations, just so that you're aware. When I talk about data, I often distinguish between structured and unstructured data. For me, structured data is information that is most often numerical, put in tabular form to enable quantitative analysis; these are things you would find in, say, an Excel document. Unstructured data is information consisting of word-processing documents, PowerPoint presentations, videos, sound recordings and photographs. I also talk about structured and unstructured record-keeping environments. Structured record-keeping environments are environments where records, documents and data are placed in an ordered fashion to allow for retrieval; this can include information management systems or shared drives with a unified classification scheme. Unstructured record-keeping environments are environments where documents and information are not organized; these can comprise a running sequence of documents, shared drives with no unified classification scheme, or any type of information management system where the information is organized in an ad hoc, unstructured fashion. So what is artificial intelligence? Artificial intelligence can be defined in many different ways; from my research, I can find no standard definition of what AI means. Everybody has a different understanding of what artificial intelligence means. There are two categories that I often talk about when I talk about AI, which are supervised and unsupervised.
Supervised learning requires a human to mark up or compile a homogeneous data set to train an algorithm to recognize patterns or terms in the data. This process requires a lot of upfront work, and it also requires you to have some level of understanding of the data set. I'll try to unpack this when I talk about how we applied artificial intelligence technologies to automate the appraisal and selection process at the National Archives UK, and what this means if you try to do it in practice in an archival institution. The second category is unsupervised: the data is loaded into the system without any upfront human intervention, at least by the user, and the system analyzes the data and provides a result. I caveat the term "unsupervised" because, whilst you are putting the information into the system and getting an output at the end, the algorithm itself has been trained on a specific set of data, for a specific reason, or to analyze data in a particular way. So even with unsupervised approaches there is still an element of training that has happened in the back end, and as an archivist, records manager or information manager you need to think about this, because it influences the output you are going to get, even though you are not directly telling the machine from the outset what to do. These are the considerations when we talk about supervised and unsupervised: how much work you have to do up front, and what you need to think about when you start to engage with artificial intelligence in its different forms, because artificial intelligence is really an overarching term. As I just said, artificial intelligence is an all-encompassing label for any activity where a machine or system takes information, structured or unstructured, to predict an outcome.
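To make the supervised/unsupervised distinction concrete, here is a deliberately toy sketch in Python. All function names and data here are hypothetical, invented for illustration; a real project would use a library such as scikit-learn, but the shape of the work is the same: supervised approaches need labelled examples up front, while "unsupervised" approaches still embody design choices made in the back end.

```python
def train_supervised(labelled_docs):
    """Supervised: a human has already tagged each document with a label.
    We 'train' by recording which words appear under each label."""
    model = {}
    for words, label in labelled_docs:
        model.setdefault(label, set()).update(words)
    return model

def classify(model, words):
    """Score a new document against each label's vocabulary; pick the best."""
    scores = {label: len(vocab & set(words)) for label, vocab in model.items()}
    return max(scores, key=scores.get)

def cluster_unsupervised(docs):
    """Unsupervised: no labels are supplied, but the grouping criterion
    (here: 'shares any vocabulary') is itself a human design choice --
    the 'training in the back end' mentioned above."""
    clusters = []
    for words in docs:
        for cluster in clusters:
            if cluster["vocab"] & set(words):
                cluster["docs"].append(words)
                cluster["vocab"].update(words)
                break
        else:
            clusters.append({"vocab": set(words), "docs": [words]})
    return clusters

# Hypothetical labelled training data compiled by a human reviewer.
labelled = [({"budget", "finance"}, "treasury"), ({"school", "pupil"}, "education")]
model = train_supervised(labelled)
print(classify(model, {"finance", "report"}))  # -> treasury
```

The point of the sketch is that in both branches a person decided what counts as "similar", which is exactly why the output needs scrutiny even when no explicit labelling was done.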
Machine learning is a process of training a system to learn how to make a decision using a pre-tagged data set. That is my definition, and it may not align with other definitions that are out there, so in full openness on that. Neural networks mimic, to a degree, how we use our brains to identify patterns and classify information, and they can be trained to accomplish certain tasks. They will never replicate the capabilities of the human brain, but they do allow us to parse through large heterogeneous sets of data to arrive at a conclusion. It is not always clear, when you use neural networks, and particularly deep neural networks (neural networks layered one on top of the other), how the machine arrives at its decision. A classic neural network application we are all now familiar with is facial recognition. In that kind of network, one layer might identify skin color, another the positioning of the eyes, mouth or ears, another hair color; the different layers then speak to each other and compare the input to the output, so the system can say, with a degree of accuracy, that this is, or looks to be, the same person. So that you can see the nesting of the terms (this is just to give you an idea, and it is not 100% precise): artificial intelligence is the overarching term, machine learning is a subset of that, and deep learning, or neural networks, sits within machine learning. I'm going to divide this presentation into three sections. First, I'm going to talk about the impact of artificial intelligence, machine learning and data mining in government. Governments around the world are already using artificial intelligence and machine learning to make policy decisions and to create policy visualizations in order to make decisions at a very high level.
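The layering just described can be sketched as a toy forward pass through a two-layer network. The weights below are hand-picked for illustration, not learned, and the "detector" reading of each neuron is only an analogy to the facial-recognition example; a real network would have many more neurons and learned weights.

```python
def relu(x):
    # A common activation function: pass positive signals, zero out the rest.
    return max(0.0, x)

def layer(inputs, weights, biases):
    """Each neuron sums its weighted inputs plus a bias; layers feed into
    each other, which is the 'stacking' that makes a network 'deep'."""
    return [relu(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Layer 1: two neurons reading three input features (think: crude
# detectors, analogous to the skin-tone / eye-position layers above).
hidden = layer([1.0, 0.5, -0.2],
               weights=[[0.4, 0.3, 0.0], [0.1, -0.2, 0.5]],
               biases=[0.0, 0.1])

# Layer 2: one neuron combining the detectors into a single score.
score = layer(hidden, weights=[[0.6, 0.9]], biases=[-0.1])
print(score)
```

Notice that the final score is several arithmetic steps removed from the raw inputs; with many stacked layers, tracing why a particular score came out becomes genuinely hard, which is the opacity problem the talk raises.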
It brings with it certain challenges and risks which we, as archivists who advise on the creation of records, need to be aware of, because we are often brought in in an advisory capacity, and we need to understand some of the ethical issues that come into play when governments try to use artificial intelligence in policy. Does this blur the role of the archivist to some degree? What is the impact on archivists and on the profession? We also have to unpack whether we are actually invited to the table to have that discussion with decision makers. There is a whole host of issues I'm going to try to unpack today. I'm also going to touch on the use of artificial intelligence in archival processes, drawing heavily on my work at the National Archives UK. What we tested there was off-the-shelf commercial artificial intelligence software, which is very limited in its capacity and in what it can do, but I'm going to try to convey things to think about if you are going to use this type of technology in your own processes and your own analytics, because there is simply too much information now for us as archivists to look at when we consider born-digital and digitized material. I'll try to give you some ideas of things to consider, questions to ask, and how to engage with commercial providers in this space. Then I'm going to talk about making records accessible and readable for research, particularly for automated research that may use AI software or technology. There are considerations we need to think about as archivists, even though we want to be open and transparent and make our information available as much as possible.
There is a big question for us about what happens when we break down the silos we have created through our descriptive practices, and what implications and risks we may be exposing ourselves to by breaking them down. So I'm going to start by talking about the impact of artificial intelligence, machine learning and data mining in government. As I said, decisions are now being made using machine learning and artificial intelligence. These techniques are being used by data scientists or statistical analysis units in government departments, as well as in private corporations. The reason I juxtapose government and private corporations is that there are increasingly more public-private partnerships in government, and AI is no different in this regard. Even outside of archives and artificial intelligence, if we look at the recent SpaceX launch, that was NASA, a government agency, working with a private corporation to reinvigorate the U.S. space program. This is increasingly the approach that a lot of governments are taking. I understand governments are under cost pressure; they don't have a lot of resource to give, and they are increasingly outsourcing a lot of things: IT infrastructures, and this type of work with artificial intelligence. But, watching this as an archivist, I think it poses real questions from an ethical standpoint around transparency and accountability. Government has its own structures, internal and external, via access-to-information law and protection-of-privacy law, that enable us as citizens to hold it to account.
When we get private corporations into the mix, which is increasingly happening around policy visualization and data analysis work because governments don't necessarily have the capacity or training, or don't want to put resource into the process, it creates a whole other kettle of fish that we, as citizens and as archivists, need to be aware of, because these corporations are not bound under the law in the same way to be accountable for what is happening. We only need to look at something like Cambridge Analytica and the impact it had both on the U.S. elections and on the Brexit referendum in the United Kingdom. The only reason we realized what had happened was that there was an internal leak from Cambridge Analytica and it went to the press. As citizens, we do not have the same capability to hold these private corporations to account. So that is something to think about when we talk about AI and government decision making. Coming back to the slide: data science and the ability to mine data is seen as a competitive advantage, so a lot of corporations are starting to use it. Platforms we commonly use, such as Netflix, Google and Facebook, all use some type of artificial intelligence or predictive technology, whether to serve us ads or suggest movies. For governments, it is seen as a way to parse through large volumes of data, both structured and unstructured, to make a decision. There are huge amounts of data sets and documents sitting in different information management systems, in shared drives and in social media, that governments and private corporations need to mine in order to do their work and to make decisions. But it brings with it some challenges that we as archivists need to be aware of across the piece.
So there are challenges with the data science approach and the use of machine learning and artificial intelligence in government decision making. One of the big things I saw when I was working in the UK, and I've seen it elsewhere, is the question of whether the data we are combining is actually supposed to be combined. When I started working at the international level, there was a big push around open data, tied to the Open Government Partnership, and the view was: we just need to get data out there and make sure people have access to it and it's published. That's fine, but one piece of feedback over time, documented by colleagues at University College London who wrote about open data, open government and the integrity of the data set, was that researchers trying to use the data had no ability to understand how the data was compiled or who the sample set was. What was the data compiled for? Was it for, I don't know, cancer research, or was it for monitoring infrastructure? What, then, is the sample set you need in order to start analyzing that data? How was the data compiled? Who did they talk to? There is a whole host of things that were not necessarily published, and the same is equally true within government. Coming back to that: it is difficult to combine data if you don't know the basic infrastructure for creating it. How do you know whether you should combine it with another piece of data? You could be comparing apples and oranges and then saying, "I've analyzed this data and made a decision," but it could be a completely inaccurate and erroneous decision. And without that background information, you are not necessarily able to understand: is the data biased?
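One way to picture the missing context described above is a provenance record that travels with a published data set, so a later user can judge whether two sets can legitimately be combined. This is a hypothetical sketch, not an existing standard; field names and the example data are invented.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetProvenance:
    """Minimum context an archivist might insist accompanies a data set."""
    name: str
    collected_for: str       # the original question the data was meant to answer
    sample_population: str   # who or what was actually measured
    collection_method: str
    transformations: list = field(default_factory=list)

    def record(self, step: str):
        """Log every change: rows dropped, units converted, fields merged."""
        self.transformations.append(step)

# Hypothetical example: an education data set whose sample population
# would make naive combination with another department's data misleading.
edu = DatasetProvenance("school-attainment-2019",
                        collected_for="regional exam results",
                        sample_population="state secondary schools only",
                        collection_method="mandatory school returns")
edu.record("dropped schools with < 30 pupils")
print(edu.transformations)
```

The `transformations` log is the point: as the talk argues, what was taken out of a data set can change the outcome as much as what was left in, so it has to be captured if anyone is later to be held to account.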
Is the data skewing the results to a specific outcome? I know when I worked in government, and elsewhere as well, there was a view among some data scientists that "if you just send us the data, we can combine it, it'll be fine." That is not to say this is the approach of all data scientists or statisticians. What I saw when I worked in government was that there were different levels of rigor around tracking how data was combined, and around tracking which pieces of information may have been taken out of a data set, because pieces of information that are taken out can have a huge impact on the outcome; or if you change something from centimeters to meters, again, it has an impact on the output you are going to get. But there was an approach in some departments of saying, "we can combine a data set from the Department of Education with one from the Department of Transport to arrive at this conclusion," and I thought, wait a second: what were the original data sets compiled for? What is the question that you are asking, that you are seeking to answer?
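The centimeters-versus-meters point can be made concrete with a toy example: combining two "distance" columns without checking their units silently skews the result. The readings below are invented for illustration.

```python
# Two hypothetical sources reporting the same kind of measurement
# in different units -- a classic hazard when merging data sets.
cm_readings = [150, 160, 170]    # recorded in centimeters
m_readings = [1.55, 1.65, 1.70]  # recorded in meters

# Naive combination: the centimeter values dominate and the mean
# is meaningless (on the order of 80, in no consistent unit).
naive_mean = sum(cm_readings + m_readings) / 6

# Correct combination: convert to a common unit first, and document
# the conversion as a tracked transformation of the data set.
converted = [x / 100 for x in cm_readings] + m_readings
correct_mean = sum(converted) / 6  # a plausible value around 1.6 meters

print(naive_mean, correct_mean)
```

A decision built on the naive figure would be confidently wrong, which is exactly why the conversion step, like any change to the data, needs to be recorded rather than applied silently.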
That is often the key question when you talk to any data scientist, decision maker, or anybody doing this work in government: what is the question you are trying to answer, and does this data help you toward it? Simply because it sort of looks like the right data does not necessarily mean it should be combined. And when you do combine it, let's say it is data that you can legitimately put together in order to make a decision or a policy visualization, sometimes you take information out or change different variables; that has to be documented, because again it has an impact on the output, and as archivists we need to be able to track and capture all of this. From an accountability perspective, if it isn't tracked, if we don't have that information, then how does a decision maker, or somebody involved in data interrogation who has to present before a public select committee or before a parliament, explain the impact of the decisions if the evidence for those decisions no longer exists? As archivists, we have to capture that trail, and it is a huge trail; I am not saying at all that it is a small amount of information. But it has to be tracked, and we need to make sure that everything the individuals involved in this type of work feed into a policy visualization or anything else is captured and documented, so that if the decision ever comes into question, the process is there for them to be held to account. A big thing when we talk about artificial intelligence is: is the data biased, and how does that impact the output of the algorithm, and how does that impact what
we see and how we interpret it. Sometimes, when I was working in government, that wasn't always the consideration, and it wasn't necessarily front of mind; it depended on which department you spoke to. But it is definitely something you have to think about, because if you are going to put in information that skews in one direction, how do you compensate for that skew? How do you make sure the output you are getting has precision and recall in terms of the question you are asking? It all comes back to the starting point: what is it you are trying to address, what is the question? And if the data is skewing in a certain direction, and you have all the contextual information about the data, how do you then compensate for it? Archivists have often played a role in advising organizations on the creation and preservation of records and data to ensure their evidentiary value, but my question to the community would be: what advice would we give on the creation and preservation of algorithmic and computational records? I use the term "algorithmic and computational" for a reason. These are not types of records that we as a community have ever had to deal with before. These are records we can never read. It's not like a piece of paper, like the one I have in front of me now, or a notebook, where as a human I can read the words or see the numbers on the page. This is something completely different, and the only way we can test an algorithm and its output is by giving it data. There is an intangibility around these records, and I'm not sure we would currently be able to properly advise on their creation and preservation. I think there is a whole host of work as a community that we
need to do here. Does the archivist play a role in advising on how algorithms and code are created for decision making, and how do we know what to preserve? I absolutely think we have a role here, a role as the advisor, but we need to look at what it is that we need to create and preserve. To give you an idea of how a policy visualization is created in government (and this is just an example): oftentimes there is information in Word documents, or other unstructured information, which is then put in tabular form or combined with other data. This whole process needs to be documented. If we are extracting information, or using an algorithm to pull out information to combine with a data set or to put it in tabular form, that all needs to be documented. Any tabular data we have chosen to use as part of the interrogation process has to be documented; there has to be justification for why information is present, and for whether information has been taken out of, or put into, these tabular sets of data in order to arrive at the endpoint. And again, the question you are asking the machine to answer is absolutely important, as is understanding how the algorithm outputs the decision. Understanding precision and recall is absolutely essential, or else you are going to have a very difficult time assessing whether the output is accurate, and you could be creating policy visualizations or making decisions on inaccurate, skewed or biased information. So you have to be really careful throughout this process and understand what your output is. Algorithms cannot do multi-variable assessment. What I mean by that is that an algorithm can only look at the variables you put into the machine, and how it interprets and outputs them; it cannot take contextual information into account. Let's say you are looking at cancer, or
whatever the variables are that you are analyzing: it is not going to be able to assess somebody's mood or somebody's socioeconomic background, all of the contextual pieces of information that are important as part of the research process. The algorithm is not going to be able to do that. This is not a silver bullet to answer all of our questions, and it is not going to come up with decisions on the fly. We need to be cognizant of the limitations of these tools, and we also need to ensure that this process is documented. Other things that often end up in machine learning and policy visualizations are Twitter or other social media outputs. On the commercial side, in the private sector, social media data is often used to build advertising campaigns; I have been to conferences where they talked about successful campaigns built solely on Twitter or social media data. But in a policy visualization, how was the data scraped from Twitter? Depending on the type of question you ask (and there was research on this a few years ago), if you phrase the question slightly differently, you can sometimes get a completely different output from your Twitter scrapes. So it becomes important, when you do this type of work, to be really precise and to document what question you asked and what your return was. Then there is all the code that goes into this: you put the data in, and then you have to develop the code to produce this type of policy visualization. This is just an example; it's actually a screenshot from the appellate court in France, on compensation in divorce proceedings. But what I'm trying to say is that if this were a policy visualization, we could potentially be deciding where to allocate money in the French judiciary system to,
because we have a certain number of divorce cases in this area and this is where support is needed. But if I have missed out a whole segment of data, if the data is biased, I could potentially be leaving an entire segment of the population without support or funding. This is where knowing the question, having accurate data, and knowing the bias in your data become really important, because these are decisions that impact people's lives. Another use of artificial intelligence in government decision making that I saw in the news, in this case about three years ago in Sky News, a UK news organization, was handwriting analysis to help government catch the gangs behind mass-scale benefits fraud. There are some ethical considerations here from different points of view. There is the archival record: how do we capture this, how do we capture the training data, how do we capture the decisions that are made? And there are ethical issues that we as archivists need to think about. If this becomes standard practice in government and passes into policy, how do we begin to advise, firstly, on what documentation needs to exist to document the training data, and then the subsequent information that is or is not input into the system? What do integrity and accountability look like in this context? What do we preserve, and what is the ethical role of the archivist here? Here are some issues I will flag. Handwriting analysis is not a precise science. There are massive projects out there, Transkribus being one of them, but it is a very difficult type of machine learning, because handwriting varies: handwriting varies over your lifetime, and handwriting
varies according to the individual who is writing. The most successful handwriting-analysis projects I have seen have been laboratory or experimental, mostly based in research institutions and mostly working with a very finite set of data. The issue I have with this particular initiative is that what government is potentially doing here is testing what I would consider nascent algorithms, really untested and inaccurate, and aiming them at some of our most vulnerable in society. When I look at people who claim benefits, these are often people who have legitimate claims. I am not saying that benefits fraud doesn't happen; it does. But when I look at the cost-benefit analysis of how much is actually saved by targeting an experimental algorithm at the most vulnerable people in society, when from a tax perspective you are probably going to recover more money from tax evasion than from this, I have a serious issue from an ethical perspective. And as an archivist, if I were advising on this, I would say "I recommend you do not do this" and would lay out the reasons why, because there is an issue I have, as an archivist and as an individual, with essentially experimenting on the most vulnerable in society in order to get decisions from an algorithm that purportedly save money. So this has some serious considerations. That said, I think we need to be careful as archivists: how do we balance the ethical role that we play as individuals, as human beings, but also within our profession? I have had situations where government departments have gone and done their own
thing. My fallback position at that point, as an archivist, is: that's fine, I have told you, I have advised you, but what I am now going to do is document everything. I am going to capture every decision you have made, including how you trained the algorithm (because training an algorithm is an iterative process), and I am going to make sure it is preserved, because there will come a time when you will be held to account for that decision. It is something we as archivists need to think about. In terms of the use of AI in government decision making, there is definitely this push, and I saw it when I was working in government, this idea that artificial intelligence will somehow solve all of our problems and make our lives easier, that it is a panacea for decision making, which it absolutely is not. It is only a tool, and the tool is only as good as the data you put into it and as good as you train it. And as I said, it cannot do multi-variable assessment; it cannot assess a whole set of variables that sit outside the machine itself but that need to be taken into account. Another example I am sure many of you are aware of is the use of machine learning and artificial intelligence in predicting recidivism rates. This is documented not only in Cathy O'Neil's Weapons of Math Destruction but also in research papers. Some U.S. states were using an algorithm called COMPAS, the Correctional Offender Management Profiling for Alternative Sanctions, created by Northpointe, a private organization, to determine recidivism rates. Some context for the data used to train COMPAS: sentences given to African-American prisoners in federal prison are roughly 20 percent longer than those given to white convicts for similar crimes, and African-Americans represent 13 percent of the population of the United States but
account for 40 percent of the but account sorry for the 40 percent of the prison population so what the issue here was the base training data set was biased and then the company used the algorithm and marketed it then to courts and judiciaries in various states in the united state as a way of calculating recidivism rates so the the propensity or the percentage that an individual is more likely to reoffend the problem was one so not only was the data biased because they essentially were using questionnaires that had been compiled one in the 1990s then they were asking questions and then what impacted as well was that policing one of the questions was do you know an offender or have you ever been convicted or stopped by the police and the problem was a lot of these people that were being targeted by the recidivism algorithm by saying they will reoffend it was that these were individuals who were in low socioeconomic neighborhoods the police and the way that the police were monitoring those neighborhoods was much bigger and for much smaller offenses though so for you know misdemeanor misdemeanor drug offenses versus targeting people for murder so the way that they were policing meant that these individuals that ended up getting marked as high recidivism rate in the training data set were already disproportionately disfavored by the policing system which then reflected itself in the questionnaire which then made it into the data set that trained the machine so this is this was a huge issue then secondary point is that north point because it was a private corporation they has no accountability there so you know they came after the fact and they were actually had a closed what I call the closed algorithm which is they weren't retraining the algorithm so they weren't compensating for the bias so the bias just kept presenting itself presenting itself presenting itself which is a huge issue and so and it's something we need to be mindful of when we're using algorithms in 
government decision making and for policy decisions.

So why should this matter to you, as archivists? Algorithms are the historical documents of tomorrow. Governments need to be held to account if they use these technologies to make decisions that have an impact on the lives of their citizens, and we are responsible for identifying and preserving that information. But that raises some questions. What should we be preserving? This is a huge amount of information; it isn't just a piece of code. Are we preserving all the components that contributed to the training of the algorithm? That includes the documents, the data, the social media information, the tracking of decisions around what was kept, what was left out, what was put into the tabular data and what was potentially taken out, and the iterations of the code, because this work is iterative: if you're the one working on the code, you are constantly retraining the algorithm with different pieces of data to get to the outcome. So as archivists, do we only get the final algorithm? Should we get all the different iterations of the algorithm, and the iterations of the results? Or do we just get the algorithm and the results? These are huge questions, and these are not small pieces of data. This isn't two megabytes; this is terabytes upon terabytes. And if you look at the growing use of artificial intelligence in government, to build policy visualizations or to make decisions using these machines, the issue becomes: which ones do I target? What is the appraisal and selection process here? Is it only the ones that impact citizens' lives? Which ones do we choose, and how do we preserve them? Most archives don't have data centers, so what does this mean for us? What partnerships might we need to consider? Or do we keep only certain components of the process, with pointers saying: this information is kept here; if you want the source training data, or this other piece of information, you have to go there? There are huge questions we still need to ask ourselves about how we're going to preserve this, and we need to ask them now. These are records, and they are being used right now, in government and in private corporations, to make decisions, and we need to develop a strategy to deal with them.

All of this requires us to have the capacity and the skills to advise decision makers in the departments and ministries that are seeking to implement these technologies. The problem is, are we invited to the table? I would say no, we are not. So how do we get invited to the table? For myself, when I was working in UK government, I simply invited myself in: I presented myself and knocked on the door, and knocked, and knocked, until they let me into the meeting. Not everybody is like that, but I think it's about starting that conversation and slowly, incrementally building the case so that we show up at that table. It's not easy; it takes years of work. But once you're there, at least you're sitting there, you're listening, and you can contribute and ensure that the information being captured is accurate. It's also about finding your allies: people who are sympathetic to ensuring that information is properly documented and that training data is as representative and unbiased as possible. There is a whole host of allies out there who can help us
either get the message across or get invited to the table.

Coming back to a previous point, my apologies, I forgot to flip the slide a bit further down: what are the challenges and issues for us when we talk about artificial intelligence and government? The challenge is that we will be responsible for preserving these algorithms in the intermediate and historical archives. It comes back to the question I raised before: how do we identify them, and then how do we capture them? I wouldn't even know where to begin in terms of export. Do we leave it in the government department's system? If we export it, where to, if we don't have the data centers or the infrastructure capacity at the archives? Do we have to create some type of cloud environment, and what are the security risks of that? Again, there is a whole host of issues to unpack here. As I said earlier, we are not currently considered stakeholders when it comes to discussions connected to the development and implementation of AI technologies. So as a profession, it's about finding who your allies are. Is it statisticians? Data scientists? Decision makers? People working in IT? There are more and more artificial intelligence departments or bureaus in different governments, private organizations, and institutions, so how do we build relationships with them, to ensure that our concerns, our messages, and our needs, because these are records, are properly captured and documented?

I would say, however, that we don't currently have the skills or capacity to play our role as trusted advisors on information management questions related to AI records and to ensure their preservation and durability. I don't think this is insurmountable. There are skills out there that we can easily acquire. I took an introductory statistics course on Coursera, and it helped me immensely to understand things like box plots and what they tell you about variability and skew in the data. These are generally accessible courses, they don't cost huge amounts of money, and they let you incrementally build your knowledge of how these technologies work. You don't have to be an expert, I want to be clear about that, but you do need to understand how they work, so that you can begin to advise on ensuring the proper documentation is there, and so that what the archives preserves is as complete a record as possible. We need not only to advise decision makers on the preservation of algorithms but also to understand how to manage the significant ethical challenges posed by AI technologies. These technologies are used right across government, and oftentimes they have an impact on citizens' lives. There are going to be times when we walk into situations where, as individuals and as archivists, we have serious ethical issues with what is happening, and I think there is a lot of discussion we still need to have about how we're going to handle that. I don't necessarily feel that all of our codes of ethics are up to date; I hold my hand up, the ICA code of ethics is not up to date and definitely needs updating. We need to unpack what these technologies mean for us in terms of our practice, as well as how to advise governments, decision makers, and even the
institutions and corporations that we represent on how to use these technologies and how to properly document the decisions they arrive at. It is sometimes difficult to understand how an algorithm arrives at a result or a decision, even if we preserve everything related to that decision. So I think we need to accept what I call the uncertainty principle in algorithms: we could document everything as well as we possibly can and still not always understand how a decision was arrived at. I'll give you an example. About a year, year and a half ago, I read an article in Scientific American about scientists who were training robots on early-childhood language acquisition and learning. The base data was the same and the robots were exposed to the same conditions, yet one robot adapted well and was able to output its basic set of words, while the other almost shut down, and they couldn't figure out why. These are very advanced machine learning algorithms of the kind used in robotics, and even so they couldn't understand why, with the same training data and the same situation, the two robots reacted differently. So there is an uncertainty principle we will need to accept. How lawmakers and others will accept it, I'm not sure, but as archivists it's about preserving and maintaining as complete a record as possible.

Now, the impact of information management practices, which brings me to archival processes and how we use algorithms in our own archival work. Why am I talking about information practices? Because they have a huge impact on our ability to use AI in our own work, and on how we use it. Information management systems, at least in my experience, are not always easy to use, and they can
be quite rigid, which means users will often try to find other, easier ways to file their information. Believe me, I saw this a lot when I worked in government. What I often saw was that a government department would implement an information management system and leave the shared drives open at the same time. Sometimes the rules around the information management system were so rigid, so strict, that people simply would not use it, or would use it for a while and then revert to the shared drive. Sometimes what ended up happening is that we had two types of record-keeping system running at the same time, resulting in incomplete folders and duplication.

In the UK we carried out a study to assess the state of record keeping in government departments and to understand the amount of legacy data they held. By legacy data I mean data that is no longer actively used in the department; it's just sitting there, sometimes taken offline, sometimes sitting in a shared drive somewhere. What we found was quite interesting: for every terabyte of data in an information management system, there were about 25 terabytes in shared drives, and that did not include data or information held in email servers. We also had a huge argument with departments about data, because they didn't feel data was a record. What they thought of as a record was unstructured content: PowerPoint presentations, Word documents, audio-visual material. Data, in their view, was separate and didn't need to come to us. We had to have the discussion: yes, it does; if you're using it to make a decision, then it is part of the record. Once we accounted for the totality of information holdings, including email servers and data sets, it added up to about 1.5 petabytes of data that needed to be appraised and selected. Now, for the IT people out there, 1.5 petabytes may not sound like a big deal, but when an archivist looks at it, it's roughly 1.5 billion Word documents, and there is no way any archivist can look at 1.5 billion Word documents.

Information management teams often didn't know what was contained in their legacy holdings and didn't know which documents or data needed to be preserved. So we were coming into a situation where we potentially had 1.5 billion Word documents, and the information managers, for legacy material that had sometimes been archived for only five years, still didn't know what was in it. They were asking us to help them, because in the United Kingdom appraisal and selection, deciding what to keep and what to throw away, had to be done before records could be transferred to The National Archives. The departments were saying: I have no idea what's in here, and I don't know how to go through it and figure out what to keep and what to throw away. Remember, we had two filing systems existing at the same time, the information management system and the shared drive, and sometimes the shared drive was in such wonderful shape that you just had a running sequence of records with such helpful titles as 'My Docs', 'Misc', and 'Bob's files'. How do you do appraisal and selection in that kind of environment, especially if the teams don't actually know what is in the holdings themselves? The information could also have varying levels of contextual information and limited metadata, and the metadata could be compromised, pardon me, because of previous migrations. This was a massive issue for us when we tried to apply the commercial off-the-shelf software technologies I was talking about earlier, because oftentimes our points of reference within the records are the date,
but the problem is that sometimes we had three different dates: the date of modification, the date actually written in the record itself when you open the Word document, and the date in the file title. Which is the accurate date? Sometimes they would be a year apart, sometimes a few days, sometimes years. And here is what we ended up finding out from some of our departments. In the UK we had a 20-year transfer rule: after 20 years, departments technically had to transfer their records to us. We had a department come to us, and we knew from a survey they had filled out that they held information from around 1995. When we asked where it was, they said: we don't have anything, it starts in 2000, don't worry about it, we've transferred everything up to that point. But wait, you told us in the survey that you had material from '95, so where is it? Then they started going through their records and looking at when they implemented their first information management system. We had planned to run one of the off-the-shelf artificial intelligence tools we were testing, and when they ran it and analyzed the dates, there was a huge spike in 2000. They said: oh yes, that was the year we implemented our new information management system, and I guess all the dates from 1995 migrated over into the date of migration. So I said: then you technically do have information from 1995, and where is it? This also mattered for sensitivity review, because we had to determine whether information was open or closed, and all of that had to be done before transfer to The National Archives. So it created a huge issue for us in determining when information was to be transferred and when information needed to be opened or closed.

We also had issues when trying to identify material relating to a particular historical event. Because the departments didn't know what was in the data, we often asked them to talk to people who used to work there, or to do some research into what the department's major events were at a given point in time, so they could start to train the machine to pull up certain pieces of information against that timeline. Sometimes it worked, depending on the kind of information contained at least in the headers of the documents; sometimes it didn't. So these information management practices created a whole host of issues for us, as did navigating between different information silos: the information management system, the shared drives, the emails, the data sets. It made it really difficult to parse through all of that information and figure out what we needed to capture and transfer.

Volume can greatly complicate the appraisal and selection process, along with the ability of archivists to carry out large-scale evaluations of unstructured data. When we did our tests at National Archives UK, we ran the algorithms, the off-the-shelf machine learning software, on-prem, that is, on-premise, and it created a whole host of issues, because we did not have the compute capacity to go through large volumes of data. Our government departments were in the same position, because they had outsourced their IT services to third parties who were asking ridiculous amounts of money even to implement the most basic software. So we were running up against a Catch-22. There are also issues, if you're approaching this from a government
department point of view, with trying to run your machine learning technology in the cloud, for security reasons and depending on the types of information you're trying to evaluate. When I say security: we tested three pieces of software, and we could not run them in the cloud because of the security clearance of some of the records we were trying to deal with. So it created a huge host of issues. We were asking departments in the UK to figure out what to keep and what to throw away when they didn't know the contents of the material and there was too much of it to look at; we were asking them to figure out what was sensitive and what was not; and because of the volume and the sensitivity, we couldn't run the process in an environment where the machine could get through all of the information fast enough without falling over and failing.

These are things you need to think about if you're going to use this technology in archival practice. Do you know the contents of your data? What is it you're trying to find out about your data? The amount of contextual work you have to do before you even start putting data into the machine is absolutely vital, because if you don't understand the question you're asking, you're not going to get the right outcome. And this creates a huge issue in terms of what becomes the archival record of the future: we could potentially skew that record if we're not careful about how we use and interact with these types of technologies.

Due to the amount of information involved, we then did a second study to start examining off-the-shelf technologies that had machine learning capabilities, for the purposes of assessing their
viability in carrying out appraisal and selection, and that's the link there; I am happy to share this presentation afterwards so you can have a look at the report. So we tested three types of technology for appraisal and selection and, as I said, sensitivity review: figuring out what should be open and what should be closed. In the lead-up to using these machines, we did some analysis. In the UK, any government department that wants to close its records has to provide a justification for the closure, so we looked at the closure applications going to the Lord Chancellor's Advisory Council. What we found was that about 75 percent of the applications concerned personal data, and about 25 percent concerned things like national security and international relations, much meatier subjects. So we decided to use the machines to analyze for personal data, partly because of the sheer proportion of closure applications it represented, and partly because the machines had a greater capability of identifying that kind of information. When we talk about personal data, we're talking about names, which means entity extraction and natural language processing, and things like national insurance numbers or any other type of identification number: regularly expressed patterns, which are generally easy for a machine to find. But it still required training. We had to mark up the data set before we put it in, which meant a huge amount of upfront work by us and the government departments, and departments are normally not happy to do that kind of work, because it requires a huge amount of investment. You mark up the data set, you train the machine, you make sure the machine is properly identifying the information in the data, then you bring in another set of data, run it through, and assess the precision and recall, that is, how accurate the results are. This is not for the faint of heart, and you really do need to plan the process properly.

What we found the machines do well, as I said, is regular expressions and Boolean keyword search, and they can process at scale, provided you have them in the right environment. What these machines do not do well at all is context: they do not understand and cannot infer it. I want to be clear about this. When we looked at material around national security and international relations, the machine really couldn't distinguish it with a level of accuracy, precision, and recall that we felt comfortable with. It was almost a null return, and by null return I mean around 40 to 50 percent. That's not good enough; if you're going to use these machines for that type of work, it means they're not working. Context is so incredibly important in these assessments: a word or a sentence in one document can be not sensitive, and the same word in the same kind of sentence in another document can be sensitive; same words, but the context around them is what matters. A lot of companies were saying, if you do sentiment analysis you'll be able to find this information. No. They were trying to argue that negative sentiment means sensitive, but if I'm looking at a dispatch from an embassy, the sentiment can be perfectly neutral, nothing negative about it, and the information can still be sensitive. So there were, I won't quite say snake-oil salesmen, but there was a lot of over-promising about what these off-the-shelf technologies could do in terms of identifying sensitivity.
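To make the gap concrete between what the machines did well (regular expressions) and what they could not do (context), here is a minimal sketch of the "regularly expressed" class of personal-data screening. It is purely illustrative: the function name and sample text are invented, and the National Insurance number pattern is a simplified approximation, not the software we tested.

```python
import re

# Simplified, illustrative pattern for a UK National Insurance number
# (two letters, six digits, one final letter A-D, with optional spaces).
# Real validation excludes certain prefixes; this sketch does not.
NI_NUMBER = re.compile(r"\b[A-Z]{2}\s?\d{2}\s?\d{2}\s?\d{2}\s?[A-D]\b")

def flag_personal_data(text: str) -> list[str]:
    """Return the regularly-expressed identifiers found in a document.

    Regex only covers the 'easy' class of personal data. Names require
    entity extraction trained on a marked-up sample, and context-dependent
    sensitivity still requires human review.
    """
    return NI_NUMBER.findall(text)

doc = "Please update the record for AB 12 34 56 C before transfer."
print(flag_personal_data(doc))  # ['AB 12 34 56 C']
```

A pattern like this will happily surface every identification number at scale, and will say nothing at all about whether the document containing it is sensitive, which is exactly the gap between regular expressions and contextual judgment.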
In terms of appraisal and selection, it is a balance. The tools are great when you have regular, regularized sets of information that they can process at scale, but there is a point where the human needs to come in: things around context, things around handwriting analysis, understanding the context of a record and of the information, and then making the decision. Machines, as I said earlier, do not do multi-variable assessment very well, especially off-the-shelf software, so be cognizant of that when you engage with these tools. Boolean and keyword searches have their limitations too. You have to understand the context and the content of the records you're appraising and selecting, or you could get too many results or not enough. You have to be really up to snuff about the questions you're asking, the types of records you're dealing with, and when the human is going to have to intervene; you need to be au fait with all of that before you actually start using the machine.

Here is an example of a visualization: it's a date plot looking at document types, containers, presentations, and so on, sized by item. I use this as an example because there could be material in there from 1995, but you have to remember that files from 1995 or 1996 are incredibly small. So if you're using this type of visualization, and these are visualizations the machines generate, you have to be careful and know the right questions to ask the service provider. In particular, you have to understand the date range of the data you're interrogating. If there is material from 1995 in there, and you make an appraisal and selection decision that says we'll analyze anything from 2005 onwards, you're potentially missing the information from 1995 to 2003 because the files are too small for the machine to surface. Sometimes you have to cut up your data depending on what you're doing and how you're appraising and selecting. It's something to be really mindful of, because a lot of organizations we worked with in UK government said, we'll just do appraisal and selection using file size. I said: whoa, stop. What's your date range here, and what are you analyzing? Then let's have a discussion about whether file size is an appropriate measure for an appraisal and selection decision, because formats have changed and file volumes have gone up as storage has become cheaper; there's a whole host of issues to take into account. Also, if I'm expecting material from 1995 and I'm not seeing it, but I'm seeing a spike in 2005, then there are questions we as archivists need to ask: is there something that happened in 2005 that you can tell me about, because it will impact our appraisal and selection decision.

Also, when you're working with these types of platforms and tools, you see at the bottom it says document, container, presentation, image, unrecognized. Well, what's a container? What does that mean? Because that's going to impact your appraisal and selection decision. What's unrecognized? There could be records in there that are actually really important, but the system is only trained to recognize certain file formats. There are very few systems that can
recognize the plethora of file formats that an interrogation tool like DROID, Digital Record Object Identification, which is used at The National Archives, can identify. So there are limits to the types of file formats some of these systems can interrogate, return, and identify. Again, you see at the far end the brown symbol that says 'other document'. What is 'other document'? All of these things will impact your appraisal and selection decision, so you need to be cognizant of how these machines are interrogating the data.

Another thing to be mindful of is what some of these systems will market to archivists and others. We looked primarily at the e-discovery market, because that's where a lot of the machine learning was sitting when we did our tests. They offer clustering, a categorization or clustering of concepts, and often it's done in an unsupervised fashion. I was always really wary of using that as an appraisal and selection technique. What it can do is give you a good understanding of the contents of the records, so that you can then do a more sustained search or more sustained training of the system and get a more accurate result. But I was always a bit wary of what it was returning, because I wasn't entirely clear how it got there: the algorithms are proprietary. You have to remember that with commercial service providers, these algorithms are theirs; that's their copyright, their IP, and they are not willing to let you look under the hood. That creates a number of issues when you want to interrogate or understand the precision and recall you are getting based on the base information you are feeding into the algorithm, so be mindful of that when you engage or work with a third-party commercial software provider. You also need to be mindful that in e-discovery these algorithms were trained for legal discovery, where the goal is the minimum amount needed, sometimes according to a preset search string given or approved by a court: the minimum necessary for the legal proceeding. In archives we have a much broader, much more all-encompassing need to capture as much as possible, and we also need to take into account the accountability on our end for using these types of technologies in the work that we do. So clustering is a tool, a tool to help us understand what's in the records, but I would not use it to make an appraisal and selection decision unless you have a level of comfort with the precision and recall coming back from the system based on the training you've done.

So, the problems and limits we encountered during the testing: a real lack of understanding regarding the content and the context of creation of the data we were analyzing, held in government departments; corruption or alteration of the metadata, because the metadata changed with migrations; a lot of really unmeaningful file titles, which made it very difficult to make an appraisal decision on the file title alone; difficulty in understanding the visualizations generated by the machine, which for me was really about the fact that different off-the-shelf commercial products defined formats, containers, or packages differently, so you had to be very careful about what 'unrecognized' means for product X versus product Y, what that implies, whether you actually need to look at it, how to unpack the terminology, and how the machine got to that point based on the training data you've put in; and finally, a lack of
understanding regarding the reliability of the results and the acceptable level of risk so I because these are black boxes because these are commercial off-the-shelf technologies we cannot necessarily go in crack the code pull the code out and then reach and then modify the code so that we can control the precision and recall there was a limited amount that we could do in terms of understanding how the algorithm was processing the information to get to the endpoint there was also a question on our end of what is acceptable risk so we as a community need to understand that we will never get it 100 percent there was never a 100 precision recall there was always a level of risk that we will potentially lose or not lose that we will potentially omit material that somehow uh will not fall into the parameters of how we train the algorithms or how we design the search parameters and we need to accept that that's going to happen we need to accept that algorithms are not going to uh are not going to find every single instance of what we are looking for and so what does that mean for us and what does that mean when we apply these machines and what does that mean for archival accountability in this context so there there's there's a whole host of questions particularly as a profession we have not been very good at exposing I would argue our practice around appraisal and selection and these machines really do require us a level of rigor a level of accountability and a level of transparency around how we are arriving at the what I considered the archival record of the future we also need to be mindful that we have to retrain these algorithms every single time we use them archival records change and so the parameters of the search the parameters of retrieval will change as well and that means that we essentially is we have to train the system every single time we want to use it for appraisal and selection and a lot of times that we want to use it for a sensitivity review as well 
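The precision and recall figures discussed here can be made concrete with a small sketch. This is an illustrative calculation only: the ten records and their labels are invented for the example, not drawn from the National Archives tests.

```python
# Illustrative precision/recall calculation for an appraisal scenario.
# "1" marks a record a human appraiser judged worth selecting,
# "0" a record to be discarded. All values are invented.

def precision_recall(actual, predicted):
    """Compare human appraisal decisions with a classifier's output."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # how much of what the machine selected was right
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # how much of what should be kept the machine found
    return precision, recall

# Ten records: the human selected six; the machine flagged five,
# four of which agree with the human's judgment.
human   = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
machine = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

p, r = precision_recall(human, machine)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.80 recall=0.67
```

The recall figure is the one that matters most for the "what might we be missing" question raised above: here one third of the records a human would have kept were never surfaced by the machine at all.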
So when I see people trying to identify what they call classifiers for things like sensitivity review, I have questions about how long those will actually be useful, because some of these sensitivities are time-sensitive and some are context-sensitive. The same goes for appraisal and selection: events change, especially if we cannot rely any further on the date. Would I have known five years ago that I would be looking for COVID when material related to the pandemic is eventually transferred? No. So it is an iterative process; it is not static in terms of how we do appraisal and selection using these machines.

We also had to deal with a lot of distrust in the technology and in the results generated by the system. The paper process was seen in government as the gold standard, but paper was never perfect; the issue with paper is simply that its problems are not discoverable. So we have to balance the strengths of the technology with the human, and we need to know at what point the human needs to intervene, when the human capacity for multivariable assessment becomes important. It takes a significant amount of time to train the system, and our departments wanted something much more automated, almost unsupervised. But, as I said, "unsupervised" still has a set of training data behind it; it is just that we were not involved in training the algorithm. I think that presents huge problems, because people think you can just do whatever you want with an algorithm, get a result, and it will be accurate, and that is not the case at all.

In terms of the impact on our profession: automation is really no longer a choice, it is a necessity. However, that does not mean the human, the archivist, is irrelevant in the process. Not at all. These machines are not silver bullets. As I said from the beginning, they cannot do multivariable assessment; they can only assess what you give them to assess. They cannot magically pull data from somewhere else and make a decision you were not expecting; they can only make a decision based on the information you put into them. That is where the archivist still needs to intervene, to say: I need to look at some parts of this data in greater detail.

When I was at The National Archives we developed what I call the funnel technique, because what we found doing our appraisal and selection using these machines was that about 50 percent of the data was duplicate. That is a huge amount. I think you need to distinguish between meaningful and unmeaningful duplication. By that I mean: if you have five copies of the same presentation in one file, you do not need all five; you might need one. But if you have one copy of a presentation in this file and another copy in a different file elsewhere in the system, each may be informing the records around it, and at that point you would want to capture both. The issue with off-the-shelf commercial software is that it often does not make that distinction about meaningful or unmeaningful duplication; it simply tells you that you have an exact bit-for-bit duplicate. So again, it requires a human to go in and look at that, and that is a lot of work. Do not think that simply because you use artificial intelligence, at least for appraisal and selection, the manual intervention somehow goes away; there may still be quite a bit.

So whilst we reduced the funnel by about half, we then went into what we called sensitivity review, where we took what we called the bucket approach. We took the 75 percent related to personal data, put it in one bucket, and let the machine take care of it. There was still risk, and the machine was not going to get it all right, but we felt that was an acceptable risk to take. We then had humans do the other 25 percent and look at things like national security and international relations, to come up, at the end of the process, with the records that would be transferred, ostensibly, to The National Archives. We monitored this process very precisely. Most of our departments, because they had such small sets of data (we were really only assessing material from about 1995 to 2000 or 2001, so the datasets themselves were very small), double-checked everything: although the machine went through it, they checked it all again. And our research, at least when we looked at how well the machines did on personal data in the literature, was showing results of about 90 to 95 percent, which is actually pretty good. But again, it takes time to train the machine to that level of accuracy, and a lot of our departments did not have the resource or the time to invest in that. It is a huge investment that any archives wanting to use this will have to make, and something to be mindful of.

I think the challenge with automating appraisal and selection, along with sensitivity review, is: how do you measure accuracy? What does "good enough" look like here? What are the risks, and what is an acceptable risk appetite? As I said, it is not going to be perfect, so what are you willing to take on as an institution, as an organization, and what is
acceptable, and how do you mitigate that? As I said, it is not going to be a hundred percent; these machines cannot do everything. Another question is how we determine what might be missing. Again, that comes back to your risk appetite: are you happy with the precision and recall?

I think you also need to consider bias. When we talk about bias in the record and the use of artificial intelligence, there is a level at which we have to accept that the record is biased; the record represents a specific point of view at a specific time. As archivists we need to be mindful of that, and of the fact that it may skew the results. Is that okay? And if it is not okay, how do you compensate? And as I said earlier, if we are to be accountable for the decisions we make based on machine outputs, how do we equally hold the machine to account? There is a whole issue of transparency and algorithmic accountability. How do we compensate for changes in the digital record over time, and how do we retune the algorithm? From my perspective, based on my experience, I think we need to retune it every single time we do appraisal and selection. For sensitivity review, it would depend on the legislation and how it changes with regard to personal data and what constitutes personal data, and the same for international relations or national security. If we have a human doing that, then it is a human judgment; that is not to say human judgment is accurate all of the time, but context-specific sensitivities need to be assessed by a human. And we need to be cognizant that if we are dealing with commercial off-the-shelf software, we are dealing with black boxes, and so we risk, and we need to acknowledge this, biasing the historical record and, by proxy, history and our collective memory. So we need to be careful about how we use these technologies moving forward, how we engage intelligently with them, what the right questions are to ask, and how we prepare for using them in our processes. Because I do not think it is a question of if for us; it is a question of when. As I said, automation is no longer a choice.

Picking up the question around ethics and algorithmic accountability: I feel strongly that archival codes of ethics need to be studied and revised, not only in terms of our practice as archivists using these technologies, but also in terms of how we act in an ethical manner, what our ethical stance is, in a government or institutional setting where we may run up against what we feel are unethical uses of these technologies. I still think we are lacking the proper competencies and skills to work with these technologies. I do not think it is insurmountable; it is about taking the time to educate yourself and to do small tests. The tests we did at The National Archives UK used really small datasets, but we learned so much just from those: about keyword searching, about dates, about file formats, and about the way the machines interpreted the information.

There is already a lot of work happening around algorithmic accountability and transparency. I feel strongly that corporations and businesses, as well as government, need to be accountable for how their machines arrive at a result, or must disclose the workings of their algorithms. There is the statement on algorithmic transparency and accountability from the Association for Computing Machinery. There is the Partnership on AI, a partnership between Google, Microsoft, IBM and Facebook to promote AI for social good; I am slightly reticent about that particular partnership. I find it difficult when large corporations like Google, Microsoft, IBM and Facebook are there to promote AI for social good when they potentially stand to gain commercially from using these technologies. There is also the Montreal Declaration, a more recent declaration on the responsible use of AI, and there are the EU regulations and principles around AI usage and personal privacy. So a lot is emerging, but a lot of work still needs to be done, in terms of law and in terms of jurisprudence. I do not think they have caught up; it has left us in a bit of a void, I would say, and the same goes for archival codes of ethics in terms of giving us guidance on how to use these technologies and how to properly engage with them.

[I just wanted to let you know you have 10 minutes until we get to the 30-minute marker.] Great, so, almost done. I am moving now into making records accessible and readable for research. There are two issues for the archival community to consider: the impact of researchers trying to mine archival data, and the digitization of historical data and information. Researchers are starting to use data-mining techniques to parse through large volumes of data. Google Ngram is not necessarily something I want to hold up, but it is a parsing technology that some researchers have used to mine literature, to trace things like stereotypes in literature, and there are many other tools, sometimes bespoke tools, that researchers are using or will begin to use to carry out their research. So there is a question for us as archivists about how much access we wish to allow researchers to public records and data. As I mentioned at the beginning, data mining and machine learning tools break down the silos that are
created by archival description. They can also reveal unknown connections that become sensitive or problematic by virtue of the connections being made. The issue here comes back a little to the question of sensitivity review: mining can surface sensitive information that was not properly reviewed. Information may be public while it sits in its descriptive silos, but the moment a machine breaks those silos down and starts to make connections, sensitivities may actually surface. So there are questions for us, I think, when we start talking about the Internet of Things, networked archival description, or even just breaking down the silos and letting a researcher parse through large amounts of our collections, down to the content of a record. What does this mean for us, what are we willing to allow, and are there different levels of security we may want to apply depending on the records? I do not know; these are questions. Again, it can surface things that were missed during sensitivity review. And once the data is mined and put into a system outside the archives, what else can it be combined with?

So let us not get tunnel vision with AI. There is a danger of focusing too much on the impact on our individual collections. What about linked data? What about semantic data? What will this mean for archives and for opening up our collections? It is not that I think we should not open up our collections; I am not saying that. I think we absolutely need to. But I think we need to be cognizant of the risks of interlinking our different collections and, as I said, of the potential issues that might surface, whether that is sensitivity or any other type of risk. I do not know that we have really thought about it, and again it is a question of whether we are willing to accept that risk.

We also need to consider the impact of future digitization. The repurposing and reuse of archival records and data has had enormous value, but I think we sacrifice much by allowing companies to digitize archival records and data in order that we can get a free copy. I think we need to be savvier about this. We hold vast, and I mean vast, amounts of important data, and as much as we want to make it available, I think we need to be careful. A lot of companies are beginning to realize the value of the data held in historical records, digitizing them and applying OCR as a method of gaining access to large volumes of data to train algorithms. We need to be cognizant that what is free is not always free. We need to start asking ourselves: why is this digitization free? Will this data be used to train an algorithm? Especially when engaging with certain companies, that is a question we need to ask ourselves. What is the company's ethical stance on reuse of the data once they have it? What happens to the data once the digitization is done, and will there be an impact on people's lives?

I will give you a quick example, I know I only have a few minutes left, around the digitization of paper death registrations. A lot of companies have digitized this information. We also know there are algorithms out there making allow-or-deny decisions around insurance and healthcare. Death registrations often contain the cause of death. If, over time, we become able to link a living individual, such as myself, with the different causes of death within a family, and maybe those causes of death are predominantly cardiac, for example, then can a health insurance company start to say: based on the preponderance of this particular medical condition that we are seeing in the death records we have mined, and given that you have now applied for health insurance, we are going to deny you coverage if you try to claim against this particular health issue? This poses issues, but this is data that archives have often made freely available for digitization so that we can get a free copy. I think we need to start really thinking about this in a wider lens of the impact of AI.

So, in summary, and these are a lot of questions, because there is a lot of thinking we still need to do as a profession. On government use of artificial intelligence: what role does the archives and information community have to play in this space, and do we have a role? I think we do. What skills do we have, or what skills do we need, if we are to play a role? What is a record, and how do we capture and preserve it? Who are our partners, and how do we begin to work with them? On machine learning and artificial intelligence in archival practice: what is the accuracy, what risks are we willing to accept, and what is our risk appetite around accuracy? How can we ensure accountability for the decisions we make based on machine learning and artificial intelligence processes? On artificial intelligence, machine learning and research: how much access is too much access when machines are involved, or is it okay just to let the machine parse through everything? What are the right questions to ask when private companies offer us free digitization? And how do researchers want to use our records to carry out digital research, and what implications does that have for the archives?

Just a quick parting thought. Whether using an algorithm, artificial intelligence or machine learning, one thing is certain: if the data being used is flawed, then the insights and the information will be flawed. At the end I have a reference list for further reading, just for information about the
further readings. In these readings I have tried to present a global picture of different points of view on the use of artificial intelligence. I do not subscribe to any one of them; what I have tried to do is present different sides of the spectrum regarding the use of AI. And that is me done.

Thank you, Dr. Celes, for your wonderful presentation. We have quite a few questions, so I am going to try to get through as many of them as possible. The first one: how does artificial intelligence help us in appraisal and disposal work, and what skills and knowledge will archivists need in order to work with artificial intelligence?

I think the use of artificial intelligence is really useful particularly when you are dealing with born-digital records, simply because of the magnitude of the data we are going to be dealing with and the variety of file formats. AI also enables you, if you have silos, to break them down; it allows you to parse through all of those silos. As we get into greater and greater volumes of data, that is where it is going to become incredibly important. It is also about understanding, as I said earlier, how the machine is processing the information, how the machine is rendering that information for us to look at in a visualization, what we need to be mindful of, and what questions we need to think about asking. In terms of skills, I would argue that having a basic understanding of statistics is really, really helpful. As I said, there are free courses on Coursera, excellent courses, and also very inexpensive paid courses on platforms like Skillshare and Coursera that you can go to, learn from, and come away with at least a basic set of information that you can then use to think about how the machine is doing the work. We need to remember, though, that because we are not the designers in the process, because we are taking off-the-shelf commercial software, our ability to truly understand precision and recall is, I would say, impacted. But the ability to understand a little of how the machine is processing really comes from statistical analysis and data science courses; there are a lot of them out there that I think are really helpful. I would say start with a basic stats course and then work your way up, and there are lots of free courses out there. I know the British Library is actually offering training around programming for information professionals, so having a basic understanding of programming is really helpful too. But we do not have to be experts; we just need a basic understanding of what the machine is doing.

Thank you for answering that question. Our next one: one of our guests is wondering whether there is any difference between "data" as a general term and "data" in the archival sense.

Thank you. I do not think so. The reason I use the term "data" is because of the communities and the partners I try to work with. Most of the time they do not understand what an archive is, and they do not understand what a record is. We have a very specific definition of what a record is, whereas when I am working with data scientists, with IT people, even with decision-makers, they get what data is. They often refer to unstructured records, which are what we call records, and to data, which we also call records, simply as data. So I call it data because that is how the community that controls this discussion references what we call records. It is more because I need an "in" with that community, and I have also used it in my presentations because I use this presentation a lot in different contexts.

Oh, you're muted. Thank you so much. All right, our next question: how do neural networks function in artificial intelligence?

I am not an expert on neural networks, but as I understand it, it is the different networks holding different pieces of information, and through the training of the neural network you begin to shape what is essentially the precision and recall of the output. I may not have explained that very well; there is lots of information online on what neural networks do, so I would suggest going online and having a look, because I do not think my definition is very good. It is complex.

Okay, the next question is in two parts. What skills will be required once AI is fully implemented, specifically for archivists, and do we need to educate them, as this might have effects on their behavior? And the second part: how long do you think it will take to keep up the learning process for the machines until we reach a minimal number of errors?

On the first one: for me to look into the future and say what the skills will be once AI is fully implemented in the majority of archival institutions is difficult; I think it will change. What I am advocating now are the skills we need to start engaging: as I said, basic statistics, and some basic data science courses that will help us. What was the second half of the question, sorry?

That's okay, let me go back, it was multiple parts. Let's see, I had resolved that one; let me see if I can re-find it. I am going to move to the next question and then see if we can go back. Okay, no worries, I have it, Azure,
if you... oh, you do. This is Christa. The second part is: how long do you think it will take to keep up the learning processes for the machines until we reach a minimal number of errors?

Yes, I think there are two parts to that. One, it depends on the size of the dataset you are working with, and it depends on the questions you are asking the machine to answer for you. The process is iterative, as I said, and it will also depend on each new dataset you put in. Appraisal and selection is not linear in the way you might expect when engaging with certain AI, where the AI is often given one question, that is the output, and you just keep putting data in to improve the output to that question. Our questions in appraisal and selection vary based on the different records we are dealing with, so it is difficult to say what the optimal precision and recall is, because it will depend on each of those different sets of records.

The next question: if we have machine learning in an EDRMS, do we still need retention schedules?

It depends on what the machine is being asked to do. Are we saying to the machine: I need you to identify these types of records to destroy? The problem is, if we are saying the records have to be in a certain form, the format changes, or the form the records take changes. So I think it depends on how the parameters are defined, and none of this process is static, whether from an EDRMS perspective or an AI perspective, because laws change, classes of records vary, and functions move. So it depends on the context, I would say, and on the types of information you have in your system that you want the AI to parse through. Because if you have already scheduled the information, my question would be: why do you feel you need artificial intelligence if you have already set the retention rules within the system? What is the underlying issue you are trying to solve by applying artificial intelligence in an EDRMS?

Okay, the next question: how many archivists have a sufficient understanding of AI to serve as advisors on ethical issues, or on what to keep?

I would say not huge numbers. We are just starting to use it now. I will be interested to see, at the congress in Abu Dhabi next year, how many papers come up around the use of artificial intelligence in archives. I think it is early days, but I would rather we get ahead of this curve now, while it is early days, and start asking ourselves the questions and having these conversations, than wait until something serious happens.

All right, the next one: is archival infrastructure prepared to document and preserve all these processes, and what kind of archive should do that, a national institute for example? I think there is a typo here.

I think it comes back to the infrastructure question, and that is not an easy situation to resolve, because, as I said, these are huge volumes of data and you would need data centres to preserve them. At this point I am not aware of any archives that has acquired AI, so I would not really be able to answer that question with any level of accuracy; I would just be guessing about what it might look like. And there are questions around decentralized archiving: does every institution need to have a digital repository, or are there better ways we can centralize our services and our storage so that we can do this type of preservation? Maybe there are other allies we have to work with in this space too.

Okay, so the next one: so what you are saying is that we as archivists have to advise governmental bodies and other organizations on how data is structured, combined and processed, and that puts us in the chair of the policymakers and asks for detailed knowledge of, and experience with, the policy subject we are advising on. Is this not in conflict with our goal of establishing an objective documentation of these processes? Can we not better focus on advising on how people document the choices made in structuring, combining and processing data?

I think those are really three questions, so I will try to take it at a high level. When I worked in UK government, this was always something in the back of my mind, and something we discussed with my government departments. As I said, I am simply there to say: this is potentially an issue you are going to want to think about, and these are potentially some of the implications. What the government department chose to do at that point was on them. So I agree that our main role is to ensure the integrity and the documentation of the process. But I do not think it is outside our realm of responsibility to say: we have knowledge about the technology that the policy person will not have, and about the implications of using experimental technology on the most vulnerable in society, coming back to the handwriting analysis example, where I think this particular question was really flagged up. My view is that nothing is lost by us simply saying: you may want to be mindful of this, and there are serious issues in using experimental technology in policy decisions. They may
tell us to go for a walk but at that point then my role as an archivist is to document and is to ensure that the they document their process they document their procedures that we have the data set we have the algorithm because potentially it'll come back to haunt them and at that point all i do is just go well there's the data and that's what we documented and that's what you used and that's how you got to your decision how how you are held accountable or how the government chooses to interrogate that after the fact that's on you all right thank you next question is how we may document preserved algorithm when government use them from private vendors and the intellectual property issues et cetera and i think that's the problem this is the issue i am seeing right now with a lot of the public private partnerships that are emerging in government like i said in the beginning i understand that government does not have the resources to do everything i understand the value in having competition in having different working with different private sector providers but what this then does is that it creates potentially black boxes where we as citizens cannot access i think the way around that if for governments is to build into their contracts with these companies that they must give a level of of accountability for the algorithm it still doesn't subvert the issue around the intellectual property and i think it's for me it's it's a real it's a real concern around using these types of black boxes in policy because as archivists we can never capture that because it's covered under intellectual property because it belongs to a third party service provider and so at that point it's how what how do we document that process without a key piece of evidence which is the algorithm and that's where i think the in the when we were talking about ethics and algorithmic accountability and illegal the sort of dearth of legal cases or precedent that we can reference it's it's a danger it's 
a danger for citizens, and I feel it's a danger for government. There is no easy answer there, but it's something we need to be mindful of.

All right, so the next one is how archivists can advise data scientists when they do not fully understand scripts and algorithms themselves. So, I think you don't necessarily need to have a detailed understanding of scripts and algorithms. What you do need to understand is how they're documenting that process. And maybe I'm coming at this from a position where I've been able to access a lot of education online, and maybe that's not always possible, but I don't think it's outside the realms of possibility. The internet is a wonderful thing, and you can learn a lot on the internet about code and algorithms; I think it's making sure that you go to the right sites, trustworthy sites that are reliable and have authority and integrity. But I would say that really the main key is understanding the process: understanding how they arrive at what they consider the final decision, the final output, and then documenting that. You don't necessarily need to understand how they're changing the actual script or code; what you need to make sure is that you have a system documenting all of that. And there are systems out there that do that for researchers, so it's not unheard of. You just make sure that you've got that piece of information, and that you can take it into the archives and ingest it. There's a whole issue about rendering, how you render artificial intelligence, say for regression modeling or for reuse by researchers, but that's a whole other kettle of fish I won't even touch on. I don't think you need to be an advanced programmer for this; what you need to grasp is what the process is and what needs to exist for its integrity.

So, next question: how can we guarantee AI records can be read in
the next 200 years? We can delete that one; you just answered it.

Next: are governments allowed to take decisions concerning citizens using AI technologies? I thought it was forbidden, e.g., by GDPR. I'm not sure what that means. That's the General Data Protection Regulation, the privacy legislation in the EU. I don't think it's illegal at all; I think they can do it, and they have done it. If I'm not mistaken, the issue under GDPR is around the correction of data that may be used to influence a decision that has an impact on you: you have a right to correct that data. But AI can be used, and is used, in government decision making.

Next question: in my experience in intergovernmental organizations, archivists need to upgrade their skills and get the appropriate IT skills, but with digital information and AI there's a risk that the archivists of today won't be able to function in a couple of years. The big question is: what is ICA doing to ensure the upgrade of archival education to incorporate these current challenges? Well, we're offering a course starting in the fall on managing digital archives. In that we're offering at least some basic tools and tips; well, not basic: it's a whole structured course, developed by two professionals, with different modules on how to set up and maintain your digital archives. So we're starting. If we're talking about the skills needed to do the work I've been describing, I think there's a conversation with the community at large first about what we think those skills look like. I know there are archival programs around the world on computational archival studies that are coming up, so I think we would have to look there. I don't want ICA to be duplicating courses or resources that are available elsewhere. I think there
is a question for us about how we look at what the implications are from an educational point of view, building on what already exists. And then, is it up to us to create new training, or is it up to us to start formulating what we feel the competencies are? Because the competencies will change; we're still early in this process. And all the competencies I've acquired, I've acquired on my own time and with my own resources, so it doesn't necessarily mean that what I think right now is the whole picture. So I think there's a conversation about what we think this looks like from an educational point of view before ICA starts saying we need to do X, Y, and Z. There's a conversation first, and then I think we can start looking to our partners, so that we can start pointing and saying: if you're looking for this skill, go here; if you're looking for that skill, go there. Because I don't think we need to duplicate what already exists. So, like I said, the conversation still needs to happen, but I take the point.

Yeah, so next, a participant asked whether the code is fixed for one agency or could be used with others. I'm not sure what the context of that one is. It's not clear to me either. You can go back and add context in the chat or revise it, and maybe we can come back to that one. Okay, I'm going to move on to the next question: what do you mean by saying that unstructured data is not yet an archive, or maybe not even a record? No, I didn't say that. Unstructured data is just Word documents and the like. When I use the term unstructured data, I just mean, as I said earlier, a way to speak to different communities about what it is that we hold; I never said that it wasn't a record. Okay, thank you for clarifying. That's fine.

Next: has the National Archives UK research on AI and machine learning in archives been applied also to web archiving processes, specifically in creating indexes and tools for access? That I don't know. I don't know what's happened with the research since I've left.
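The accountability answer a few questions back stressed capturing the data set, the algorithm, and the decision process so they can be interrogated after the fact. As a rough sketch only, and not anything described in the webinar itself, such a provenance record might be assembled like this (every field name, model identifier, and value below is invented for illustration):

```python
import hashlib
import json
from datetime import datetime, timezone

def document_decision(dataset_bytes, model_id, model_version,
                      input_record, output_decision):
    """Assemble a minimal provenance record for one algorithm-assisted decision.

    Captures the pieces of evidence the answer above calls for: the data set
    (as a checksum), the algorithm (as an identifier and version), and the
    input and output of the decision itself, with a timestamp.
    """
    return {
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "model_id": model_id,
        "model_version": model_version,
        "input": input_record,
        "output": output_decision,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

# Fictional example: document one screening decision for later ingest.
record = document_decision(
    dataset_bytes=b"training-data-export-2020-06-01",
    model_id="eligibility-screener",
    model_version="1.4.2",
    input_record={"applicant_id": "A-1001", "income": 18000},
    output_decision={"eligible": False, "score": 0.31},
)
print(json.dumps(record, indent=2))
```

In practice the checksum would be computed over the actual training data export, and the record serialized alongside the documentation the speaker describes; the point of the sketch is only that each piece of evidence gets a stable, inspectable field.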
I've been out of the National Archives UK now for two and a half years. I know they're still working on the topic, but I don't know what they have applied or what the focus of their research has been. So I would suggest going to the National Archives UK website and looking at their blog; that's normally where they talk about some of the emerging research they're doing, and seeing how they've reexamined the use of AI in archival practice and whether they've looked at it from a web archiving perspective.

Next question: has anyone heard of, or does anyone know of, any examples of using AI to automate or simplify the filing process, i.e., to encourage users to file their records in the official information management system? I have seen examples that I wouldn't call AI, but they were basic routing scripts, so that when people filed information using certain metadata tags, it pushed the material out into a specific file. What the user saw was a very simplified, almost Google-like search feature: they could type in the information and, provided they had the right keywords and metadata, they could pull the record back up. And when they were creating the record, I can't remember all the parameters of it, but it depended on what file they associated the record with, and it would automatically put the record in there, so they didn't have to save it directly into the file. So there are situations where I've seen that, and absolutely it could help with routing user information. But again, I think you need to be careful, because you'll have to retrain the algorithm at some point: functions change, and the way information is created changes. So you have to make sure that you know when your algorithm needs to be retuned, so that it's
routing the information to the appropriate areas. I think we'll do a couple more questions.

Okay, the next one has multiple parts. This participant wants to know: would you be able to share some information about practices in the UK National Archives and government on managing structured data as records? How does the UK identify, capture, manage, and apply retention and disposition to data, in both transactional applications and analytical ones? Again, I've been away for two and a half years, so my first recommendation would be to check the UK National Archives website to see if they have updated information on the identification of data sets as records. We were acquiring a few data sets right before I left the National Archives UK, and I think we were trying to unpack the preservation process, so we were looking at SIARD, which was developed by the Swiss Federal Archives and which addressed the packaging up and preservation of data sets. In terms of the identification of data sets, at least when I left, I think we were primarily looking at it from the point of view of function: what was the function and use of the data set, what decisions was it influencing, were these substantive decisions, and therefore how do we capture it? And I think, too, it's looking at what previous information we might have captured, say case files that have since been transformed into data sets, and then identifying those and making sure that we were able to bring those in. But I don't know if they've developed a firmer policy on that, or how they're now approaching data set capture and preservation, so check the website; that's the best advice I can give, I'm afraid.

Thank you. So, next question. They said: thank you so much for the informative presentation. They're wondering about the appraisal and selection process without any human participation. Can
it be adequate, and how can AI determine, for example, the historical and cultural value of a record during this process? Is it really possible, given AI's inability to process multivariable assessments? No, it can't; that's where the human comes into play. And sorry, is there more to that question? Yes: can you recommend any articles or books concerning the use of AI and cloud services, for records specialists who are unfamiliar with AI and its possibilities? So, I'm working with researchers right now who are trying to find a publisher for a book on archives, AI, and access, so that's one that I'm aware of. The other books I would point you to would be more general reading, more widely available documentation. In terms of specific articles looking at AI, I have reviewed a few for a couple of journals, but I haven't seen any recently, so I'm afraid I'm not very helpful on that point. But if you're looking for reports, there's the one we did at the National Archives UK; it's dated now, we did it in 2015, so it's five years old and needs some updating. It could be that they have published other reports since then, but that's the one that I'm aware of and was involved in. And if you're looking for general reading around AI, one I found interesting, a bit dire but interesting, is Cathy O'Neil's book Weapons of Math Destruction, because it gives you a sense of some of the issues you need to think about. It really helped me break down what I need to think about from an archives perspective: what the considerations are, what I need to be aware of when I'm training on data, and what I need to avoid.

Thank you. So, next question: what are your recommendations for better preparation to handle AI, ML, and algorithms in the field of archives? Sorry, could you repeat that? Sorry: what are your
recommendations for better preparation to handle AI, ML, and algorithms in the field of archives? So, I think one is training, which I talked about earlier. I think another is that we have to just get our hands dirty. We have to test out some technologies, get some small amounts of data that are manageable, so that we can get a sense of the precision and recall, of how these machines work, of what constraints we need to be aware of, of when the human needs to get involved, and of what the machine won't do for us, or what we just need to be extra vigilant about: the machine might say this is garbage when in reality it is actually a record. So I think it's just teaching ourselves, going out there, training, and working with it; as one of my colleagues said, learning by doing.

So, we're almost coming to the end of our assigned time, but I feel like we need this last question. The question was: what would a code of ethics for archivists look like in this new age of AI, and is it necessarily archivists' responsibility to address the integration of algorithms into government decision making, or is it a responsibility for all of us as citizens and public servants? I think it's a mix. As citizens, we need to hold government to account for the decisions they're making and for these types of partnerships that they're getting involved in, especially if they have an impact on our lives. But I think, too, as archivists we need to update our codes of ethics to adapt to these new types of technologies. And it's how do you make a code of ethics, I'm not finding the right words right now, how do you stabilize it so that you're not revising it every two years because there's a new piece of technology coming out, while still being mindful that
these technologies will have an impact on the work that we do and on the advice that we give, and on how we manage that.

Okay, thank you. This was just a wonderful conversation right to the end, and I would say there is definitely going to be a follow-up, because there's a great deal of interest in this kind of conversation, but also because people are asking, as I go through the questions, things like how much penetration AI is having in archives in the developing world. People are asking a lot of questions about scale. And one of the questions we had, which you tried to answer: CLIR, the Council on Library and Information Resources, also provides some training, so resources are out there. But as you see, it's simply time for us to train ourselves and start practicing, definitely learning by doing.

With that said, I would say thank you to Dr. Anteia. We have all the webinars that are coming, and I want to say thank you to CLIR and the Mellon Foundation for sponsoring this Emerging Technologies and Big Data and Archives webinar series. My name is Dr. Rebecca Bayeck; I'm a postdoc at the Schomburg Center, so I want to say thank you as well to the Schomburg Center for providing me with the opportunity to do this kind of work. And my other colleague joins us from New York University Libraries as well. So with that said, we say thank you to everybody, and we hope we're going to have a follow-up once we've thought it through. And no worries, all the questions are captured; the video recording is going to be uploaded to the CLIR YouTube
channel. And with that said, we say thank you. Thank you a lot, and thank you very much, everybody, for coming. Thank you for having me. Thank you, everybody.
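The "learning by doing" advice in the discussion, testing a small, manageable data set and getting a sense of the precision and recall, can start very simply. As a minimal sketch (the document identifiers and selections below are invented for illustration, not drawn from the webinar):

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for a binary selection task,
    e.g. a machine flagging which documents are records worth keeping.

    predicted: set of item ids the machine selected
    actual:    set of item ids a human appraiser would select
    """
    true_positives = len(predicted & actual)
    # Precision: of everything the machine picked, how much was right?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: of everything that should be kept, how much did it find?
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Invented example: the machine flags 4 documents; a human would keep 5.
machine_picked = {"doc1", "doc2", "doc3", "doc9"}
human_picked = {"doc1", "doc2", "doc3", "doc4", "doc5"}

p, r = precision_recall(machine_picked, human_picked)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.60
```

Low recall here is exactly the danger the speaker flags: the machine calling something garbage that is actually a record. Comparing machine selections against a small human-appraised sample is one concrete way to "get a sense of how these machines work" before trusting them at scale.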