Welcome back, and let's get started on our final segment of the second day of the virtual event at the CNI 2021 fall member meeting. We have two presentations to round out our day. I am really delighted to welcome Riz Ali from the National Security Research Center at Los Alamos. Los Alamos is an amazing place that has produced a lot of innovation that's turned out to be very important for our community over the years. You may recall that Paul Ginsparg's preprint archive started at Los Alamos, and many of you will know the great body of work from Herbert Van de Sompel, who spent time at Los Alamos. Today I am really pleased that we'll be able to learn about another very interesting innovation being done at Los Alamos. As I understand it, and we'll understand it a lot better shortly, it's basically a machine-learning-driven application of AI technology. And with that, I'm just going to say welcome, Riz. Thank you for joining us, and I'll turn it over to you.

Cliff, thank you so much for inviting me to speak. It's truly an honor. I get your emails on a very regular basis, and I'm just thrilled to be able to present this to the CNI community. To give you a little background about myself before I get started on the briefing: I've been the director of the National Security Research Center for the last couple of years, and I'll give you a little bit of history about the NSRC. My background really is in IT and cybersecurity, so the AI work that we did over here is right within my expertise. On a daily basis I'm learning about the archiving and library work that my staff is doing, and it's been a truly wonderful experience trying to introduce these innovative technologies here at Los Alamos. A little bit about us: the NSRC is the Los Alamos classified library. We actually have two different libraries at Los Alamos. One is our unclassified library, which is called the Research Library.
The classified library is the National Security Research Center. The NSRC is by far the larger of the two entities, but only a select number of people are able to access the material, because all of it is classified. The lineage of the organization actually dates back to J. Robert Oppenheimer in 1943, when he formed the technical library as part of the Manhattan Project. Over the course of many decades, many archives and libraries popped up at Los Alamos, and somewhere around the '90s a lot of them started getting consolidated. About three years ago, when I started at the laboratory, there were two major classified libraries still in existence: one held the physical collections and one held the digital collections, and a project was started to consolidate them and form the National Security Research Center. So this center houses 75-plus years of scientific and engineering work on the classified side. Our goal has always been to offer services similar to major university libraries. To that effect, I've got a fully trained staff of librarians, archivists, digitizers, communications specialists, and historians; it's a fairly large staff that supports the operations of the NSRC. Just to give you a little sampling of the types of material we have within our collections: everything from physics to nuclear testing, materials science, environmental science, and lab history, even a topic most people don't know about: the US experimented with nuclear propulsion, and we hold the archives for that. Obviously, with the nation's rush to go to Mars, that's become a pretty popular topic, and we've been getting a lot of information requests on it.
And the volume of material that we have, and this is completely germane to the artificial intelligence work that we've been doing, is enormous; we've tried to quantify exactly how much physical material we have. Everything from aperture cards to videotapes to radiographs, which are X-rays of nuclear weapons, plus physical reports, microfilm, and microfiche; you name it, we have a large collection of it, somewhere on the order of about 14 million pieces of information. That makes us one of the largest libraries within the federal government; it's a fairly extensive collection of items. The vast majority of our collections have not been digitized; we estimate less than 10% has been digitized. And of that, less than 10% has actually been cataloged, indexed, and made available to our researchers. So we're talking about less than 1%, and that's a very unacceptable result after 10 years of working on trying to get this material indexed and made available to our researchers. As I mentioned, less than 10% of the collection is digitized, across all the different materials we hold, including the very large card catalog, which is about 750,000 items in and of itself. We've been trying to digitize this material for about 10 years, and it's been a slow-going process; as many of you have probably realized when you try to digitize large volumes of material, it's not particularly easy. But we did spend about nine months doing a deep-dive analysis to find out exactly where the bottlenecks were and to fix our digitizing issues, because given the amount of material we have, we can't just wait for it to be digitized. We needed some high-speed digitizing labs to try to solve the problem.
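The "less than 1%" figure above is just the two fractions compounded. A quick sketch (the 14 million and the two "less than 10%" figures are from the talk; treating them as exactly 10% is my simplification):

```python
# Compounding the two fractions quoted in the talk: under 10% of the
# collection is digitized, and under 10% of that has been cataloged and
# indexed, so under 1% of the whole collection is actually findable.
total_items = 14_000_000          # estimated physical pieces in the collection
digitized = 0.10 * total_items    # upper bound: "less than 10% digitized"
cataloged = 0.10 * digitized      # upper bound: "less than 10% of that cataloged"

share = cataloged / total_items   # fraction of the whole collection available
print(f"{share:.0%} of the collection is cataloged, at most")
```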
So I ended up visiting a number of different organizations, such as NARA, the George H.W. Bush Presidential Library, the Purdue University library system, some commercial organizations, and DoD organizations like the National Reconnaissance Office. All of them have high-speed digitization operations set up, and we wanted to see if we could apply any of their lessons learned to what we're doing here at Los Alamos, because the old way of digitizing and doing quality checks was just not working for the volume of material that we had. The result of that deep-dive analysis was the standing up of seven brand-new digitization labs: everything from video and audio to microfilm and microfiche, you name it. We started off slow. The first lab we stood up was the video lab, then the audio lab, then the motion picture film digitization lab, followed quickly thereafter by the microfilm and microfiche labs. Standing up new labs was very expensive; I had to build a business case to get them up and operational. And once these labs were up and operational, it actually made the problem of cataloging the material so much more difficult, because of the volume of material now flowing into our digital lake. I call it a digital ocean, it's not even a lake; it's so vast that there's almost no way you can catalog all this material manually. We have two additional labs planned for 2022: one is a small-scale paper digitization lab, since I just don't have the physical space to set up a large-scale lab right now, and I also want to set up an indexing lab to help with some of the indexing projects that we have going on.
The reason for us going down the path of introducing artificial intelligence and machine learning into the library process here is that the rate at which we can catalog the digitized information is very slow, and anybody who's actually done cataloging work knows there's almost no way to speed it up as a manual process. We have a variety of different digital collections with all sorts of different metadata and indices, and for some of them the metadata is just nonexistent. We inherited these collections from old archives and old libraries; some of the archivists and librarians were very meticulous about putting indices and metadata together, and some of them weren't. So it's just a large and difficult-to-use collection of indices and metadata. On the cataloging rate: I sat down for several weeks to figure out exactly what was going on and why the process was so slow. The manual process for our material takes about 10 to 30 minutes per document, depending on how complex the document is, whether it's one sheet or 150 sheets, because the cataloger has to extract the appropriate indexing information and metadata in order to put it into our digital repository. And the analysis I did, and the 2.4 million figure here is just one of our digitized collections, not all of them, showed that with about one and a half full-time-equivalent staff working on it, at the rate they were going it was going to take us over 400 years to catalog. That put me right back in the same realm we were in when we analyzed our digitizing rates: our microfiche collection would have taken us about 100 years to digitize, and microfilm would have taken us about 2,000 years.
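The 400-year estimate is easy to reproduce from the numbers in the talk. A back-of-the-envelope check, where the collection size, the 10-to-30-minute range, and the 1.5 FTE are from the talk, and the roughly 2,000 working hours per FTE-year is my own assumption:

```python
# Back-of-the-envelope check of the cataloging backlog quoted in the talk.
DOCS = 2_400_000                # documents in this one digitized collection
MIN_PER_DOC = 30                # upper end of the 10-30 minute manual range
FTE = 1.5                       # full-time-equivalent catalogers on the task
MIN_PER_FTE_YEAR = 2000 * 60    # assumed ~2,000 working hours per year

total_minutes = DOCS * MIN_PER_DOC
years = total_minutes / (FTE * MIN_PER_FTE_YEAR)
print(f"{years:.0f} years")     # at 30 min/doc this lands right at ~400 years
```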
The cataloging would have taken us 400 years, and with the high-speed digitization labs running, the quantity of digitized material is increasing dramatically on a daily basis. There's no way we could ever find enough people to catalog all this information and bring the material to the researchers who need to use it daily. So we set out to find out exactly what was available in the marketplace. There aren't that many companies doing this type of work, but with my military background, I was able to find some folks in the intelligence community who knew of companies doing almost exactly this type of work for the US intelligence community. Think, for example, of the Osama bin Laden house: when it was raided, the material had to be rapidly digitized, and the digitized material had to be indexed, cataloged, and made searchable in natural language to make it available to intelligence analysts. Obviously Los Alamos is not in the business of raiding terrorist houses, but we do have similar requirements in terms of being able to rapidly digitize, catalog, index, perform OCR, and make material available to researchers. So we used that philosophy, and we did find several companies that did this type of work. The reason we went down the intelligence community route is that we have some specialized security classification rules we need to implement here because of the classified nature of our holdings. It would have taken the commercial companies too long to build those things out; the folks that work with the intelligence community already had this capability as part of their toolkit. That's one of the main reasons we went down that path. I was able to interview about a dozen different companies.
I narrowed it down to one company that did the work we needed, and they brought in some partners to build out their suite. I was able to find some money, about half a million dollars, to do a pilot. The pilot went really well, and we're now in the process of getting the system implemented on our classified networks. The goal really is to provide an integrated search across document repositories, and we want our researchers to be able to research this material themselves. That necessitated that the OCR be performed by this tool, that the data and indexing information be extracted, that all of this be done in an automated fashion, and that a natural language search engine be developed so researchers can find the material. The search shouldn't deliver just pure Boolean results, which might give them hundreds if not thousands of items that are completely irrelevant; it should be a natural language search tool from which they can actually get relevant information. It basically uses AI-enabled engines, and it's multiple engines, not just one, that pull all this information out. The extraction mechanism is AI-based, and there's an AI-based optical character recognition system that actually works on pre-1984 fonts. As some of you may know if you've done OCR work, tools like Adobe's basically focus on computer-generated fonts and have a really difficult time with typewriter fonts; the system we needed had to be able to recognize older fonts that were generated by typewriters. And being able to extract all of this automatically, without human intervention, was key.
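To make the Boolean-versus-relevance distinction concrete, here is a toy sketch, entirely my own illustration and not the vendor's implementation, of how a small hand-built concept map can expand a query and rank documents by concept overlap, so a document can score well even when it never contains the literal query phrase:

```python
# Toy relevance ranking via concept expansion. Every term and link in this
# concept map is an invented placeholder, not part of any real ontology.
CONCEPT_MAP = {
    "nuclear propulsion": {"reactor", "rocket"},
    "reactor": {"nuclear propulsion", "criticality"},
}

def expand(terms):
    """Return the query terms plus every concept the map links them to."""
    expanded = set(terms)
    for t in terms:
        expanded |= CONCEPT_MAP.get(t, set())
    return expanded

def score(doc, terms):
    """Count how many expanded concepts appear in the document text."""
    text = doc.lower()
    return sum(1 for t in expand(terms) if t in text)

docs = [
    "Reactor criticality test report",
    "Memo on cafeteria menu changes",
]
# A Boolean search for the literal phrase "nuclear propulsion" matches
# neither document; concept expansion still ranks the reactor report first.
ranked = sorted(docs, key=lambda d: score(d, {"nuclear propulsion"}), reverse=True)
print(ranked[0])
```

A production system would of course use learned embeddings and a much richer ontology; this only illustrates why curated concept links can substitute for the click data a small user base cannot generate.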
We also wanted natural language processing. I'm sure there isn't a person on the planet who hasn't used Google, and Google uses a natural language search system; they rely on millions of search results, and on people clicking on relevant items, to build those relationships. For this particular tool, we would only have thousands of people, not millions, and maybe only hundreds of search results processed daily. With that low a volume, there wouldn't be enough data for an AI system to automatically build the interrelationships between the data. So the search system actually uses a combination of AI-based natural language processing and a human-built ontology to make the connections between all the data elements, so that relevant search results are turned up for our customers. All of this together is what we've termed Project Titan on the Red. We're hoping this system delivers relevant results for our customers in a relatively short period of time; we're thinking three to five years is the time we've given the system to mature. Right now we're just in the first year of making the implementation happen, so once the ingestion of the data starts, we're fairly confident it's going to drive that 400-plus-year timeline for cataloging all this information down to about a year, maybe two. A lot of it really depends on how many servers and engines we can run simultaneously on our systems. As for the roadmap, we started the technology evaluations at Los Alamos back in 2016. We evaluated commercial products, and most of them didn't really pass our evaluation.
One of them did actually do some of what we wanted, but its classified functionality was kind of questionable: given the types of classification restrictions we have, the vendor would have had to build those out, since they weren't built into the system. The pilot with Titan Technologies, Titan on the Red, ran in 2020. This year we're building it out in the classified environment. I think it's probably going to take another year for us to get a test system up on the classified side, mainly because there are a lot of security issues we have to deal with; it's not really a technology issue. A year after that, we're hoping to expand to additional data sources, if the initial test data source works properly. And then in two to three years we're going to evaluate where to expand it further. This is one of those projects that, as the director of the center, I'm managing personally, because it's a very large project, multimillion dollars per year. But I do have a project manager, Julie Maize, who does a lot of the day-to-day work with the vendors and our internal staff to implement this project. I think that's all I've got, so Cliff, I'll turn it over to you, and if folks have any questions I'm happy to try to answer them.

I see one question online: any chance Titan on the Red would be made available to other institutions? Titan on the Red is a semi-commercial product; semi-commercial because it was actually developed for the intelligence community, but the company has a version of it that they're selling in the commercial space. So if you send myself or Julie an email, we can put you in touch with the folks at that company, and you can have further discussions with them. Okay, then we've got another question, from Alex.
I'd like to know what human-based ontologies you're using; it sounds quite like what we're doing with the hierarchical clustering of scientific research problems that we've extracted, although our timeline is by Christmas. Okay, so for the ontologies: we're taking some commercial ontologies and some open-source ontologies, basically related to science and engineering, and then the human-generated ontologies that we're using are being built in-house. We have a team of two people who are ontologists, university-trained ontologists, and they're building out ontologies specific to the types of work that we do over here. For those of you who know the history, Los Alamos is the birthplace of atomic weapons, so the bulk of the work we're doing here is related to nuclear weapons, and the ontologies we're building are classified ontologies, specific to the data set that we have at Los Alamos.

There's another question: can you say a bit more about the ontology that powers the search? Is the search by concept, or full-text, or a combination? The ontology, and I think we've already talked a little bit about this, is a combination of scientific and engineering ontologies that are available commercially and through open sources, plus the internal, man-made ontology that our staff is working on. The search is actually fairly robust, more in-depth than what you get with Google. You can do a full-text search by typing terms into the search engine, but it also presents you with additional capabilities that graphically show you relationships between the search terms. So it'll say, okay, if you're searching for, let's just say, Oppenheimer,
here's all the material about Oppenheimer, and graphically it'll say, well, Oppenheimer is also connected to this person whose name is also Oppenheimer, who happens to be his brother, and he also happens to be connected to UC Berkeley because he taught at UC Berkeley, so here's some additional material you may want to check out related to UC Berkeley. It's actually a fairly robust system that this team is going to be rolling out.

Okay, the next question: I think you mentioned Palantir; have you tried their interface with library metadata production folks? Anything you can share? Palantir: honestly, we didn't pursue it further than the pilot program we did back in 2017. There would have been too much build-out on the Palantir side for the classified work specific to the types of things we're doing, plus their cost structure was a bit more than we could work with here at Los Alamos. That's when we started looking at other alternatives. As I said, the pilot with Titan started in 2020, but the actual search for an alternative system started in 2018, and that's when I started leveraging some of my connections within the intelligence community to see if there were companies out there that could do similar work.

Another question: are any data visualization features built into the search, other than the graph-related disambiguation you already mentioned? That's the only data visualization tool we've contracted with this company for. I'm sure there are others we could add at a future date. Actually, we're not doing data visualization; we're doing information visualization. Data visualization, at least in the way we use the term over here, means taking the raw data contained within our documents,
say tables and graphs and charts, and visualizing that. We're only doing visualization of the PDF documents themselves and of how they interrelate with other documents within our collections, in terms of the authors or the subject area.

Another question: how many staffers are working on these projects and labs? The total number of people within the NSRC is roughly about 70. For some of the major university libraries, that's a tiny number, but for the number of customers we service, it's actually a very large staff. As far as the digitizing folks themselves, it's about 20 people, and we're in the process of hiring quite a few more, because we've stood up the labs but they're not fully staffed yet, so we're still going through that process. As far as the projects go, for the Titan Technologies Titan on the Red project, we have a fairly large team with our vendor supporting us, I would say probably about 15 to 20 people, and then we've got another, much smaller team within Los Alamos working on implementing it on our systems; I think it's about five or six people on our side over here.

We've got a lot of questions there. Let me jump in with one, if you'll allow me, since we seem to have fielded everything in the chat. I'm just thinking about the scanning and OCR side of this. You have a lot of material going back into the '40s, '50s, and '60s of the last century, which is going to be typewritten for sure, often copies of typewritten manuscripts of not necessarily wonderful quality. And as a bonus problem, it's got mathematics in it.
And at least back in that era, the way I often saw it done is that you would actually handwrite things like integral signs and Greek letters, particularly before you had the IBM Selectric balls that you could swap out for math symbols. How well is the optical character recognition working on that kind of material? That seems really challenging to me.

I have tested this out with typewriter fonts, even the older stuff, and it does recognize the material. There are limitations: when something was typed in the old days, there was a lot of hyphenation, and the system doesn't correct the hyphenation. That's intentional. There's always an ambiguity as to whether a hyphen was intended to be there or is just at the break of a line or the break of a word, so that is not corrected. In terms of formulas, there's actually some massive research going on at different universities to try to detect computer-generated formulas as well as handwritten formulas; that work is still ongoing, and our system is not going to be able to solve that problem, at least in the first revision. As far as handwriting goes, we have quite a bit of material within our collection that has handwriting on it, but from my look at it, the vast majority of what we have, I would say 70 to 90%, is typewritten or in computer-generated fonts. There are documents with handwritten notes in the margins, and there are notebooks that are handwritten, but that material is really in the minority; the majority is typewritten. This system will be able to detect handwritten material at probably version two or version three, because that is part of our game plan to implement, but it's just not in the first release.
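The decision to leave hyphens alone is worth illustrating. A minimal sketch (my own example, not the vendor's code) shows why naively joining end-of-line hyphens is unsafe: it repairs soft line-break hyphens but corrupts genuinely hyphenated terms:

```python
# Why the pipeline leaves end-of-line hyphens uncorrected: a naive join
# fixes "docu-\nment" but also mangles real hyphenated terms like "X-ray".
import re

def naive_dehyphenate(text):
    """Join any word split across a line break by a trailing hyphen."""
    return re.sub(r"-\n(\w)", r"\1", text)

soft = "This docu-\nment describes the test."   # hyphen is a line-break artifact
hard = "The X-\nray film was developed."        # hyphen is part of the word

print(naive_dehyphenate(soft))  # correct: "This document describes the test."
print(naive_dehyphenate(hard))  # wrong:   "The Xray film was developed."
```

Resolving the two cases reliably needs a dictionary or a language model, which is why leaving the raw OCR text unmodified is a defensible default.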
We're going to run the bulk of the material through first before we start going after things like the handwritten material; those would end up becoming side projects, and they're not in the game plan for at least the next couple of years.

Okay, well, thank you. I think we are at time, I fear we are at time, because there are certainly a lot of fascinating aspects to what you're doing, and I'm very grateful to you for coming and sharing it. The scale of this is really pretty breathtaking, and I wish you a lot of success with it. I hope that when you get this rolled out, you'll think about giving us an update on how it's all playing out. Thank you so much for coming and sharing, and I suspect there are folks who will want to follow up on various things with you in the chat. Sure. And again, Cliff, thank you for inviting me. Thanks for coming. We're now going to take a brief pause while we set up for the final project briefing of the day.