Hi, I'm Leila Sterman, and with me is Jason Clark. We've been working on a project that we're going to discuss today: machine learning, text summarization, and optimizing scholarship for citizen audiences and discovery. We worked with our colleagues Justin Shanks and Daniel Aiden, and we are very grateful for their time and expertise on this project. Next slide. So today we're going to discuss the motivation for the research. Then, we did a survey to ground the work that we're doing, and we'll talk about that a bit. And then the meat of the presentation: Jason will talk about summarizing and restructuring scholarship, how we did that, how we thought about it, and what data we used. And then we'll briefly discuss the implications and next steps of this research. Next slide. So, clear communication outside of disciplinary niches is rarely taught or emphasized in higher education. In fact, we train students in complex jargon as part of their indoctrination into a field. At the same time, interdisciplinary research is a sought-after interaction that research offices and institutions believe will increase innovation, productivity, and access to funding. Journal articles and book chapters are the standard, efficient means of communicating within one's field, and yet the static format, the disciplinary jargon, and the presumed audiences of these scholarly outputs may inhibit some of the common measures of success for those outputs. Measures like high citation rates, successful grant applications, broad readership, and acknowledgement of research are contingent on articles that attract and encourage readership from broad audiences. Next slide. We talk a great deal about physical access to materials, yet understanding can be a secondary barrier once the document is in hand. If researchers want to have a large-scale impact, doing the work is not enough; translation needs to be a further step. There's a reason some folks get quoted in national news outlets more than others.
They've done the work to be accessible. Next slide. One barrier that limits researchers' potential to be collaborative is understanding between disciplinary niches. You may be working on the same type of thing but using totally different vocabulary to explain it, and that really limits cross-disciplinary interaction. The barrier is reinforced by a lack of generally understandable language in things that we assume are understandable, like abstracts and summaries. And so we're arguing for a simplified version of those research outputs. Not that they should be dumbed down, or simplified to the detriment of innovation, or misconstrued, the way the news media got a little confused around some of the COVID-19 preprints in the early days, and what that all meant and what a preprint was. But if there are secondary outputs that are simplified for a broad audience, those can actually help the broad reach and the impact of knowledge and work against that kind of journalistic confusion around research. This is not a new concept, but we're taking a new angle at solving it, and we think that libraries are well situated to engage in this type of service. Next slide. To get baseline data on the current practices of our faculty at Montana State, we developed a survey, and we made sure it included a gift card incentive, which we were glad we'd done when one of our respondents said that this was a very challenging survey. So we asked some hard questions. Next slide. Oh, and I will say that the survey questions and responses are available on the slides that CNI is putting out and on GitHub with the rest of our documentation. So, who responded to our survey? We had 59 respondents out of 105 invitations, which is about a 56% response rate, which we're very happy about. And if you can read the slide, you'll see the breakdown.
We had about a third associate professors, then about a quarter each of assistant and full professors, and then a postdoc and two research-appointment professors. We invited researchers very specifically, not as a random sample, across all the tenure-home departments on our campus, and we tried to invite scholars at different career stages. So we were really intentional about trying to get a broad understanding of our scholars. We wanted to know what was important to them and where they currently spent their time so that, as we developed an automated process, we could decrease the time and effort involved in that step of translation and really be useful to our scholars. Next slide. We found (and here's an example of a journal article that we'll come back to in a little bit) that scholars value a broad range of outputs. The journal article is at the top. It surpasses everything. But in a sort of second band of important scholarly outputs, we had books, presentations, invited lectures, book chapters, posters, and proceedings. None of that is surprising. Those are what we expected, but it's good to confirm that. When we asked about measures of success, we also weren't surprised, but our beliefs were confirmed. Assistant professors universally said publications, citations, and grant dollars were the measures of success for them, so promotion and tenure is really driving their output. Associate professors had those same concerns, but then added the number of their graduate students who had successful careers and who successfully completed their graduate studies, and then some recognition and impact. Full professors had all those same concerns and added the success of their undergraduate students and their teaching, and then a few more mentions of broad impacts. They also mentioned intrinsic motivations.
So, things like a sense of personal satisfaction, a positive impact on people's lives, and being sought out as a mentor. One respondent cut right through it and said, "Montana State values the number of journal articles and book chapters and grant dollars that I bring in, where I value making a difference." While individuals have varied motivations that differ from our university's, the specter of tenure and promotion continues to be a central driving factor in the dissemination of research. This also drives the venues and objectives of that research dissemination, so these outputs are aimed at disciplinary peers. Next slide. And when targeting their intended audiences, they expect to do little more than publish. As you can see from these quotes: "I just expect researchers to find my papers." We just assume that they're doing the right thing. We argue that that's not enough. Next slide. Instead of just trusting that publishing in a reputable journal will do the work for you, we advocate for translation. We should treat scholarship as a conversation, especially if you think about it in terms of the information literacy framework. It's important to be speaking the same language to have a fluent conversation. This is, again, an opportunity for libraries to help with this work and to provide wider access to the ideas that will benefit further research and the public. Next slide. Lack of translation inhibits both outreach and discovery. As you can see in this quote from one of our respondents, they valued articles and podcasts as increasing the visibility of work, just to keep up in their own disciplinary niche. Next slide. Many of our survey respondents believe that their research has impact or should have impact. There's a specific audience that they'd like to reach. You can look in our survey data and see, but they'd like to reach more than just the 17 people who read the same journal that they publish in, and yet they don't do the work to get there.
That work is not valued on the promotion and tenure timeline. It's rarely counted, or it's intangible: the connection between that extra translation work and, say, invited speaking roles. So we can help make that work easier, so it's less of a time suck but people still get the benefits. Next slide. When asked what impact they think their research has: on the left, you can see a historian who clearly is very thoughtful and does intentional work to reveal the historical roots of current issues, to bring hidden stories to public light, and who really works to move their work into many spheres. Then you see, on the right, a researcher who says it "moves ideas around within whatever academic field it lands in," which I honestly don't quite understand, and I can assume it's not that effective. So, next slide. When we're looking at that next step, on the left again we have an economist who has a suite of people working with them. It's a very well-resourced department, and they say that their research is crafted into a policy brief and disseminated, then a communications administrator crafts news articles that highlight the topic, and then, as a third layer, other economists present their work to the public. So that's layers of resources that most of us don't have. The reality for many is more like: publish in a reputable journal and then move on to the next thing. Try to get the next grant, write the next paper, teach the next class. So we're hoping to help researchers gain the ideal without needing to hire three extra people to do the work. Most people can't. Next. So, we're aiming to create a tool that allows an entrée into the scholarly conversation. It doesn't replace scholarship, but it increases visibility and accessibility across disciplines. Next slide. We're not working in a vacuum. The tools listed and linked here help scholars translate, text mine, parse metadata from PDFs, and summarize research.
Not all are still maintained, but there's information about all of them. They're all working to enhance the possibilities of the static PDF for human and machine readability, and we're happy to be joining this scholarly conversation. Next slide. In one of the last questions on the survey, researchers responded that they were most interested in an abstract that had been translated "to provide an accessible and readable summary of your work to the general public." This is no easy task, but it's important, as we've seen. And Jason will now explain how we've been working on that task.
So, yeah, the survey gave us lots of information, which is great for us: general ideas about scholarly communication practices, and how disciplines are prioritizing what they say or where they're publishing. But we were also able to gain knowledge about, if they were going to do this work, how they might think about the articles they produce, and where we would look for candidates for summarization or for the work of building a snapshot article. Our entry into this was about interdisciplinary translation, kind of moving between our domains on campus and allowing communication to happen. But going even further, we connected to the idea of snapshots that allow citizens to understand parts of our research and the output of the university, in keeping close to our land-grant mission. So the end of the survey had questions like this: What would you prefer? If you were to create a snapshot article, what kind of output would be most beneficial to you? If we were to look at the sources of your research, where would you have us look? So this was about applying the research from that survey to these new forms of scholarship, and this was really exciting for the team.
So really what we were looking to do, again, given how we prioritize the journal article, was to move from the common expression of the journal article to something else. Our team was really coming at this as: if you were to think through a seed, a seed snapshot article that could eventually lead somebody, through a DOI or through other means, to the actual research itself, what would be the minimum viable unit for expressing that form of scholarship? And again, we didn't work in a vacuum with this. Leila mentioned a number of tools, in various stages, that think about this. But one of our motivations was to think through that minimum viable expression of scholarship. And for us, that was the idea of a nanopublication, which I have on the screen right now. This was a linked data expression for an article, so you could encode an article in a certain way so that it had its primary assertion, the provenance of the article, and the publication info. Coming from that mindset is really where we started, and that led us into the summarization activity. We eventually settled on a web-scale vocabulary called schema.org, which is created and consumed by commercial search engines. They have a form for a scholarly article, which is part of the markup you're seeing on the screen here. So, in addition to the snapshot, we used markup behind the scenes to identify the different pieces of the journal article itself. All of these things, the idea of a snapshot, the minimum viable expression of scholarship, and then how you encode that, were converging in this project. And then you get into how you script and program summarization, the machine processes themselves, and I'll give a little time to this part of the conversation.
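The schema.org encoding described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual script: the function name `snapshot_jsonld`, the choice of properties, and every sample value (including the DOI) are hypothetical placeholders. It assumes the snapshot is described as a `ScholarlyArticle` in JSON-LD, one of the syntaxes schema.org markup can be written in.

```python
import json


def snapshot_jsonld(headline, summary, doi, authors, keywords):
    """Describe a snapshot article as a schema.org ScholarlyArticle (JSON-LD).

    The property selection here is an assumption made for illustration.
    """
    return {
        "@context": "https://schema.org",
        "@type": "ScholarlyArticle",
        "headline": headline,
        "abstract": summary,  # the short, readable summary
        "author": [{"@type": "Person", "name": name} for name in authors],
        "keywords": keywords,
        # Points readers and crawlers back to the original research.
        "sameAs": f"https://doi.org/{doi}",
    }


if __name__ == "__main__":
    doc = snapshot_jsonld(
        headline="Example snapshot title",        # placeholder
        summary="Two or three plain sentences.",  # placeholder
        doi="10.0000/example",                    # placeholder, not a real DOI
        authors=["A. Researcher"],                # placeholder
        keywords=["soil", "wheat"],               # placeholder
    )
    # The resulting JSON would be embedded in the snapshot's HTML page.
    print(json.dumps(doc, indent=2))
```

The `sameAs` link is one way to carry the referral to the original article; an `identifier` or `url` property could serve the same purpose.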
Again, and I'll stay here for a moment, we saw from the survey that our researchers saw value in summarization and in creating different kinds of readability levels for their research. They also didn't always have the full resources and the time, and as Leila pointed out, they're not always incentivized to do this. The team kind of thought about this as a last-mile problem, like the last-mile problem you hear about with the internet. This was the kind of last-mile problem that we were thinking about in terms of research and scholarship outreach, the communication of those concepts from page to person. So there was value here, and we knew that there were routines that we could introduce. That's where these machine processes really helped us move forward with recognizing that value and then expressing that research in a new form. Again, there's always the data mining itself, and I just want to give the group a sense of where we started. We had a STEM corpus, and we also wanted a social science and humanities corpus. So we pulled a number of different sources together, primarily PDFs that we would mine and start this summarization process with. Depending on the source, the OCR was better for some than for others. We did sort of cap it at current research; I think we stayed within the last five years. And these were just samples. We weren't trying to be exhaustive of every piece of research that MSU, Montana State University, has produced in the last five years, but just asking how we get to this proof of concept. So data mining was the first part of the machine process. As far as cleanup, we used the CERMINE Java library. As for all of these scripts, Leila mentioned that we have a GitHub repository linked at the end of the slides. The scripts are available there, so you can see them.
This was a matter of taking that PDF document and moving it into a textual format that could be used for some of the natural language processing, and using some of those models to come up with readability and different kinds of summarization. So the second piece of this was what I just started to mention, which was: okay, you have a set of texts. You have the ability to start to look at language models that can talk about readability levels or evaluate the accessibility of language. You have a set of libraries that can start to parse that language, break it apart, and rank it in terms of what the important sentences are and what the parts of speech are, and then combine that into an abstraction that can be expressed in the snapshot article itself. So this work was really about part-of-speech tagging. We used a library that ranks sentences so that we could understand where those important sentences were, or what the machine model understood as important sentences. We primarily used the Python Natural Language Toolkit. Again, the script is there if people are interested in seeing some of the code behind the scenes. So, we had this idea of a minimal viable expression, what a snapshot could be. We had the machine processes, and then we were going to unite that with the schema.org markup for an HTML expression of the article itself, so that it could be indexed and found and read. In this case, we looked at two particular formats. I mentioned the scholarly article schema.org markup earlier, but there's also news article markup, and as you move to summarization, the abstractive summarization that we pulled out could probably be marked up in either one or both. Again, this is just kind of behind the scenes of how we would connect the expression of that schema vocabulary with the actual text sources and the snapshot.
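The sentence-ranking step described above can be illustrated with a small word-frequency ranker. This is a sketch under stated assumptions, not the project's code: the talk mentions the Python Natural Language Toolkit and a sentence-ranking library, whereas this example uses only the standard library as a stand-in, and the names `rank_sentences`, `split_sentences`, and `STOPWORDS`, along with the tiny stopword list, are made up for illustration.

```python
import re
from collections import Counter

# A deliberately small stopword list, for illustration only.
STOPWORDS = {
    "a", "an", "and", "are", "as", "be", "by", "for", "in", "is", "it",
    "of", "on", "that", "the", "this", "to", "was", "we", "were", "when",
    "with",
}


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    """Lowercased word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def rank_sentences(text, top_n=3):
    """Return the top_n highest-scoring sentences, in document order.

    A sentence's score is the average document-wide frequency of its
    content words, so sentences built from recurring terms rank higher.
    """
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    best = sorted(range(len(sentences)),
                  key=lambda i: score(sentences[i]), reverse=True)[:top_n]
    return [sentences[i] for i in sorted(best)]


if __name__ == "__main__":
    sample = ("Soil microbes affect wheat yield. "
              "We measured soil microbes in wheat fields. "
              "The weather was nice. "
              "Wheat yield rose when soil microbes were diverse.")
    for sentence in rank_sentences(sample, top_n=2):
        print(sentence)
```

Frequency scoring is only one simple way to pick "important" sentences; a production pipeline would more likely rely on a tested tokenizer and ranking library.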
And this was our overall goal: a two-to-three-sentence summary, a readability score of around 60, which is about high school level, and then a snapshot article that was re-envisioned in highly optimized HTML that could be found by commercial search engines. And so it was a matter of moving, and you can kind of see this. I'm just going to call your attention to the opening sentences of this intro, which is using the language of the domain, as it would. The idea was to move from this expression of the article to that idea of a snapshot, and the summary here is where you're getting the movement in readability and an abstraction of nine to ten, or nine to twelve, pages of a science article moved into a summary readable by a wide audience. We were also able to maintain some of the domain language and the important sentences, based on what the model would give us. And then we could prioritize keywords, keywords that it was finding that were part of the narrative and identified as essential. So what this led to was obviously, as Leila was talking about at the top, forms of accessible research, really hitting at translation of those concepts, that kind of last mile of research understanding, and then also connecting it to the places where we think users are finding research. There are issues and opportunities. There always are with the data sources for the articles. That was something I think I said at some point. I had this kind of Bill Murray Groundhog Day moment, if people understand that metaphor: a man caught in a loop of days. I feel like when we come to these text mining projects, the first problem is always the normalization of the data and finding good text sources. So that was no different here. But we were able to get a small set of data for social science, humanities, and STEM articles.
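The readability target mentioned above (a score of around 60) can be checked with a short script. The talk does not say which readability formula was used; this sketch assumes the Flesch Reading Ease score, on which scores in the 60s are conventionally described as plain English. The syllable counter is a rough vowel-group heuristic, and the function names are hypothetical.

```python
import re


def count_syllables(word):
    """Rough heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text):
    """Flesch Reading Ease:
    206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words).
    Higher scores mean easier text.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text needs at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


if __name__ == "__main__":
    summary = "Soil microbes help wheat grow. More kinds of microbes mean more wheat."
    score = flesch_reading_ease(summary)
    # A summary could be flagged for revision when it falls well below the target.
    print(f"Flesch Reading Ease: {score:.1f} (target: about 60)")
```

Because the syllable count is approximate, scores from this sketch will differ somewhat from those of established readability tools.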
So the challenge, I think, that we still need to pull on and understand is how well this model is really working. The summarizations came out pretty well. I would say that from my non-specialist eye and knowledge of the domain, but I would like to confirm it. We need to take the review of the summarizations to the scholars themselves. Again, these are the questions for another conversation with the scholars: is this model performing? Also, maybe we find ways to adjust the reading levels, like if we wanted to dial it back up from a high school level, so we don't lose too much in moving it to a certain level. And I think there are exciting opportunities too. There are new partnerships. We piloted this with scholars' research journal articles, and I think there are more and more ways we could help. We could reach out to our research news, and this model could be applied to helping them abstract some of their research news. I also think that the news writers are people who are particularly suited to help understand and create better summarizations. So that's a partnership that we'll move forward with here. I think there's also some work to be done on how you surface the snapshot article, make sure it has the referral to the original research, and understand the impact and reach of the snapshot article itself. So, I do this a lot. I'm not meant for these Zoom screens. I'm a kinetic speaker. Those were kind of the details of the model and the goals of the snapshot. Leila is going to take us through the overall picture of research implications.
Thanks, Jason. Yeah, so I'm just going to reiterate that we're hoping to position the library as a partner in the research process. Next slide.
You know, I think it's a recurring theme that we're looking to make sure that the library continues to be integrated into the research enterprise, even as that changes very quickly. We want to make sure that we're keeping up and that we develop new services that are innovative and useful. We will do some more work on understanding how and why scholars summarize their work and figuring out how we can make this more efficient and useful. And then the overall goal of all of that is to create interdisciplinary communication that's more effective, that allows research to innovate, and that allows people to do what they're hoping to do and to collaborate across disciplines and, you know, even within disciplines, in new ways that produce scholarship that will change the world. It's a lofty goal and also a very attainable goal, I believe. Next slide. So, we hope to be a part of the conversation of scholarship, and we hope to act as translators within that conversation: to facilitate conversations that may be happening already but that we can help accelerate and improve, and to add visibility and discoverability to research so that it can have the fullest impact possible and benefit our own research and the world. You know, we're a land-grant university, and that's our mission, so helping with that through the production of snapshot articles and optimized summaries is a goal that we feel pretty strongly about. Next slide. We would be happy to talk about this further. Please contact us, or check out the survey information and everything on our GitHub to understand more about what we're doing. And we look forward to continuing this conversation. Thank you.