Go ahead and get started. There's quite a lot to cover. Welcome. I'm Cliff Lynch, I'm the director of CNI, and you are at the last of the project briefings for the first day of week two of our virtual CNI fall 2020 member meeting. This week focuses on the theme of transformational work within organizations and professions. We are recording this and we'll make the recording available after the session. Normally we have closed captioning, but on this particular session we seem to have run into some kind of technical problem with it. We will make sure, however, that the recording gets closed captioned, and I apologize for it not being available for the live session. There is a chat box and you're welcome to use the chat box on an ongoing basis to introduce yourself, share comments, etc. There's also a Q&A tool at the bottom of your screen. We will do all of the Q&A at the end, and Diane Goldenberg-Hart from CNI will moderate the Q&A part of the session. I do just want to remind you that, as of week two, as well as the real-time project briefings that we're offering, there is also a group of pre-recorded videos that we're making available, and I invite you to avail yourself of those. Week two will go on until next Monday, when we'll have a summing-up session in the four o'clock slot. Now, we have three speakers today, all from the University of Cincinnati: Xuemao Wang, who I think is well known to much of the CNI community as a leader and very active member of the coalition, James Lee, and Kristen Burgess. James and Kristen are affiliated with the Digital Scholarship Center that has been set up there. The Digital Scholarship Center, which I had the privilege of serving on an advisory group for, is a really interesting activity that is trying to make all kinds of links and serve in a catalytic role among a whole variety of activities across the University of Cincinnati community. And what you're going to hear today is mostly a report on machine learning driven activities that have been funded out of an Andrew W. Mellon Foundation grant to the Digital Scholarship Center. And that's more than enough for me. Let me turn it over to our speakers. Xuemao will lead off the presentation. And let me just thank all of you for joining us and thank our speakers for being willing to spend some time today filling us in on this interesting and important project. Over to you. Thank you, Cliff. Thanks, everyone, for participating in this session. What we'd like to do is, I'm going to start by setting some of the context, giving you some of the background about our center and the initiatives. Then I'm going to transition to Kristen Burgess. She's the operational manager of our Digital Scholarship Center, and she's going to talk about some of the organizational issues. After that, Kristen is going to transition to James, who is the center director. He's also the associate vice provost for digital scholarship and the associate dean of the library, and he's going to walk you through things in great detail this time. We have a total of about 30 slides prepared about our center's work, and by the end we hope we can leave about 10, if not 15, minutes to engage you with some questions. So our topic today is focused on the next generation of machine learning, which you will learn about in detail from James's presentation.
So what I'd like to start with is this evolution of the library as the research partner, the project catalyst and the digital integrator, with a little bit of focus on the background of this digital integrator role on campus. Traditionally, libraries are viewed as the service provider to the university. Through our journey of transformation, many of our organizations have made progress through efforts in new initiatives such as research data management and digital scholarship endeavors. Libraries have now achieved some standing as the teaching, learning and research partner. However, the library as the source of innovative research design, as the catalyst for inter- and transdisciplinary research, is even more rare. And the library as the leader, as the digital integrator, using some of the cutting-edge digital initiatives such as artificial intelligence and machine learning, which you will learn about from our center's work today, with a human-centric approach, that kind of role is not quite there yet. That's what we focus on at the University of Cincinnati Digital Scholarship Center through the support of the Andrew W. Mellon Foundation. What we are going to report to you today is mostly our first grant from the Mellon Foundation, and we have just been renewed for a second grant; we will talk a little bit about the second grant's approach as well. But through this wonderful support from the Mellon Foundation, our DSC has served as a productive research catalyst in leading transdisciplinary research teams, through the development and implementation of an accessible machine learning platform that has been successfully leveraged by multiple disciplines across the university. Again, to emphasize, our library-based digital scholarship center has begun to change the perception of what the library's emergent roles and mission on campus could be. So we're happy today to share with the CNI community our journey: how we make ourselves closely aligned with where the university priorities are, and how we work collectively in collaboration with faculty from multiple colleges across the university to address research questions through joint creation of project design and the use of cutting-edge machine learning techniques. Those activities and endeavors have redefined the library's relationship with our university's academic mission. For example, the library has now been charged with leading our university's enterprise-wide effort that we call digital integration. About a year ago, in addition to being Dean of the University Libraries, I was named the Vice Provost for Digital Scholarship. And the university senior leadership, about a year and a half ago, selected our Digital Scholarship Center as one of the five anchor teams going into our to-be-built Digital Futures innovation center, which I will touch on a little bit more when we talk about alignment. So by now you may ask: Xuemao, what do you refer to as digital integration, and what is the concept behind it? I want to share with you my definition, what I'm trying to practice here at the University of Cincinnati. The digital integration terminology was not invented by us; it comes from the former senior program officer of the Andrew W. Mellon Foundation, Dr. Don Waters. At the University of Cincinnati, and this is my definition, we refer to digital integration as collaborative efforts that strive to leverage the broad enterprise
digital capacities, capabilities and methodologies to create vast efficiencies to enable, stimulate and accelerate the transformation of teaching, learning, research and community engagement. So that's how we think about digital integration. I mentioned that everything starts with alignment. So now I'm going to share with you a little more high-level information about how we at the library and the digital scholarship center align with where the university is going strategically, and how we make that alignment. A little bit about the University of Cincinnati: we are an R1 research university, the second largest university in the state of Ohio, and we are public; by the way, many people who meet me ask whether the university is private or public. We are a public university, the second largest. Our enrollment is a little over 46,000. We are one of the oldest public universities, with a long history, founded in 1819; last year we just celebrated our bicentennial. And, to go very quickly, we are one of the ARL libraries; we have 10 locations, plus other affiliated locations, in a library system of 13 locations serving the entire university. When I arrived here, I created a strategic plan focused on our vision to become the globally engaged intellectual commons of the university. Of the four pillars, the one you are hearing about today, pursuing digital scholarship, is focused on the digital technology and innovation part. In 2017, the university hired our current president, Neville Pinto, and he came in and created the new strategic direction called Next Lives Here. This is the vision set up for the university: leading urban public universities into a new era of innovation and impact. He keeps emphasizing urban public, and innovation and impact. We took up those concepts, we followed his lead on the strategic platforms and pathways, and we tried to figure out where we align with those platforms: with academic excellence, with the innovation agenda, and with urban impact. After this Next Lives Here strategic direction, each of the senior vice presidents started to create their own approach to alignment, right? The strategic sizing plan is under the provost, and it is for the entire academic house; you can see we are talking about growth over 10 years. We are at about 46,000, and 10 years later we're aiming at 62,000 enrollment. We're not only talking about enrollment growth; we're talking about growth in education. Some of you may know the University of Cincinnati invented co-op education about 110 years ago, so we're talking about Co-op 2.0. We're talking about growth in quality and access, and digital is at the center here. So we looked at this university provost growth strategy very carefully to think about where we can align with the university. Recently, for FY21, I would say this year, after we settled down a little bit from the pandemic, the university provost asked each of the teams to submit new requests for new money that could be available for new investment, but with a very, very clear message to the teams: all the requests have to align with these four thematic areas. And we aligned our Digital Scholarship Center squarely with data science, and somewhat with justice in the digital world, which you will hear about from James in some of the project examples.
We are part of the academic house; we are aligned with the provost, and we are aligned with the VP for Research office as well. On the screen now you see our university's 10-year growth plan for research. We looked at this plan carefully and identified the areas where we can align, and the example I gave you, being recently selected for the innovation center, is one of those examples. On the screen now is the strategy for the university's growth in innovation. We have created a whole new district called the Innovation District on the east side of our campus, and the university recently launched a new initiative to build a new building there. The Digital Futures building is going to be a 180,000-square-foot building with a new concept design that is very outward facing, industry facing, and partnership and engagement facing for research. As I mentioned, our Digital Scholarship Center has been selected as one of the five anchor teams going into this building. So, in closing my part here, alignment is so important, not only to get the new money and the new support; it is how you can really make a difference in the library's new role and new mission on campus. With that, I'm going to transition to Kristen, and she's going to take it from here, talking about the structure of the center. Kristen. Thanks. So while Xuemao gave you the big picture of UC and UC Libraries and our focus on being a digital integrator, I'm going to introduce the digital scholarship center more broadly, with our mission, structure and staffing. And then James will go into more of the technical aspects of our research and integration. So, the digital scholarship center is an academic center; it was created as a joint venture between the university libraries and UC's College of Arts and Sciences. Our core mission is to break silos and cross wires between disciplines on campus and in the community, and, as a sort of digital integrator, Xuemao may use this term as well, to move between cultural and disciplinary silos across the university to create new modes of research and teaching using a variety of different digital methods. So with that, we work at the intersection of data science, the arts and humanities, and the libraries. Next slide, Xuemao. Thanks. So, what we do: we are UC's catalyst for digital methods to transform research and teaching. That's the broad-level description of it; James is going to go into a lot more detail about exactly what we do. But we do this basically using machine learning and data visualization on large unstructured data sets, and a lot of our work also involves bringing together teams from various disciplines that don't always work together. Next slide. So the DSC has three main components; they're the gray nodes or circles that you'll see here. Research is the bottom left one, so that networked research is a big piece. We also have our core services, so digital tools development, faculty development, and teaching workshops, things like that, creating digital skills. And then teaching is the other one, so introductory coursework, workshops, student research and faculty team labs, and then job placement and digital training. So in one of these three ways we interact with any of the different groups that you'll see on this slide, and all the different units or departments that are represented there.
So you'll see they represent lots of different disciplines across the campus, and our goal is to really be a distinctive global leader in transdisciplinary digital research. Next slide. To achieve those three components we rely on our awesome team. We have recently expanded the core Digital Scholarship Center research team to include the UC Libraries Research and Data Services, so we're now all one unit that reports to James in his role as the associate vice provost for digital scholarship, the associate dean of libraries and the director of the digital scholarship center. This partnership is meaningful for several reasons, especially the unique opportunity to strengthen the library's role in the university research ecosystem that the digital scholarship center team is already firmly embedded in through our digital research work and other research collaborations. Next slide. So this is really the who of our team. It doesn't include our affiliated faculty or the project members that we partner with, but it's everybody on our team. In that top row you'll see everybody who is part of the digital scholarship center, and that's really our core research team. They bring a wide range of expertise in areas such as data analytics, visualization, linguistics, library science and humanities research, to name a few, and they work very closely with our affiliated faculty members from across the university. They've built many of the tools and created a lot of the visualizations that James is going to discuss when he takes over. The second row is our research and data services team from UC Libraries that recently joined the DSC. This group provides a lot of those core services that are part of our work, with activities such as consulting on data management plans, conducting workshops on data management, geographic information systems and related tools, the Open Science Framework and ORCID, just to name a few, and they also do our events, such as UC's annual Data Day. Last but definitely not least are our students: we have a phenomenal group of undergraduate and graduate students from a wide range of backgrounds who assist in our research, and we consider that student training and partnership to be a really critical element of our mission. Next slide. So the DSC has had two main phases so far. The first one we can really consider from 2016 until 2020, and that was focused on becoming a full center, focusing on the research life cycle that's shown on this slide and its different phases. This idea was heavily supported by our initial Mellon grant in 2017. The next slide, if you go ahead to the next slide, details a little bit about our second phase, which has just begun and is supported by our second Mellon grant that Xuemao mentioned, and we're incredibly grateful for that. Our current focus is to expand and deepen our existing research, teaching and service relationships, and then to expand the DSC's model. So included in the second phase is the integration of RDS with the DSC to further grow that digital integration vision, and to support our scholars through the entire research life cycle, from ideation and research question development to the dissemination, publication and expansion of the project into further iterations. So with that I will pass it off to James, who can get into more of the details of the projects. Thanks, Kristen, and thanks, Xuemao, for setting it up.
I'm going to stop my video because, as you can see, my video is a little pixelated, but I just wanted to say hello with my face briefly. So, building off what Kristen was just discussing, there are two major ways we've been trying to create an impact on our campus with the DSC, and this is the beginning of the how, how we implemented this vision. One part is technical, certainly: we are definitely using innovations in machine learning and AI to answer research questions, and what the two phases of Mellon support have really enabled us to do is to develop a machine learning platform, a set of tools that adapts machine learning approaches and applies them to any text and image data set for research projects. And "any" often gets questioned, but that's our goal. I mean, it's a bumpy road, but we're trying to make our machine learning logic elastic and broad enough that we can really accommodate and analyze data sets from multiple disciplines. We apply these machine learning methods in a discipline-specific way, meaning that it's not computer scientists telling the subject matter experts how these methods should be applied. We really work with faculty researchers, students and subject matter experts within a field to hear what they need in terms of knowledge, information and insight to be extracted from data sets. There are also questions that cannot be addressed using their conventional methods, and we consider how we can adapt machine learning methods to directly help them, as opposed to simply applying machine learning methods for the sake of it. The second dimension is not technical. And I suppose, with the title of our presentation being about the next generation of machine learning, I think what gets the most surprise is the human aspect, the team science aspect of what we're doing. I think that a lot of what's next generation is finding a way to get human teams to work together around the machine learning approaches. So, as I state here, we assemble teams to really nurture unconventional transdisciplinary research questions and partnerships, those that are not always supported by department tenure protocols or funding bodies or, you know, university administration; we try to take partnerships and research questions that are a little unconventional, at the boundaries of disciplines, and try to nurture them. Next slide please. This slide gives some detail about the organizational mission of the DSC to cross wires and to break silos; at a nuts-and-bolts level, how do we do that? With the merging of the research and data services team, the digital scholarship center and our students, what we're aiming to do is provide a full spectrum of capabilities to our research and teaching partners across campus in the colleges. You know, it's always reductive to have a linear workflow like this, but I think it helps to visualize the different technical steps that we use to work with collaborators. On the far left side you see data management, and this should be familiar to many of us. This is where RDS, research and data services, really shines; I think that our research and data services librarians are very adept, and we're passionate about open science and open data culture.
Using the Open Science Framework, ORCID, things of that nature, and trying to expand them across the research community at our university. There are a lot of different data management plans, certainly, and putting data from projects into our institutional repository as well as other repositories; that's a large part of the infrastructure of research projects. And one point that the RDS team members mentioned to me when I showed them our presentation, which I think was a good reminder, is that RDS often serves as a human introduction to some of the digital methods that we use. AI, machine learning, data visualization and statistics can be kind of forbidding to some faculty members, so the way they put it is that they're sort of a gentle introduction to data culture and data-driven methods. They can explain existing tools out there that faculty or students can use, and if they need to dive a little deeper, they can move to the next step and work with us at the DSC. The data analysis phase is really where the DSC has carved its identity on campus, and it's using machine learning methods and data visualization methods to drive new hybrid types of data analysis; I mention here digital humanities to bioinformatics. We've had a wide range: we've worked with, currently, nine colleges on our campus, from design to law to engineering to the medical college and arts and sciences. So we've really tried to be broad enough, as I mentioned earlier, and elastic enough in our logic to accommodate all of these different fields. At the core of it, I really do feel that machine learning and recent developments in multimodal deep learning are really, really powerful, but they're under-leveraged, and I think they're underused. I think that the advances in machine learning and AI have been largely exploited by several technical fields: computer science, data science, engineering, etc. And I think that if these methods could be expanded to other fields it could be really transformational, but there needs to be a translation step. What we think the translation step oftentimes is for researchers is data visualization. It's a graphical user interface or some kind of point-and-click interface or some kind of interpretive framework where users can explore and analyze data trends, which I'll show you in a little bit. So that's really what we focus on in terms of data analysis, in order to broaden our scope and broaden our network of collaborators: a combination of machine learning, applying recent innovations from the past 10 years in machine learning to different fields and their data sets and their research questions, and displaying and engaging users from a wide range of disciplines using data visualization. I won't go so much into exploratory data analysis versus confirmatory data analysis, but that's an interesting cultural trend, or conflict almost, that we've found: some fields don't really work with qualitative data types, and they don't really engage in exploratory data analysis, which is central to the humanities and social sciences and also to machine learning. So how do we bridge hypothesis-driven projects and faculty trained in that tradition with exploratory data analysis? Finally, digital outcomes and products.
We really try to keep our eyes on the finish line. The faculty and students that we collaborate with all have certain things that they hope to gain and hope to accomplish with us, whether it's publications, presentations, a website or an app, or a grant proposal. That last component is really where our subject matter experts and our collaborators shine; that's where they're really focusing us on the output and what the finish line is. So in these three phases I think our teams encompass the full research and data lifecycle at our university. As I mentioned just before, why data visualization? Our first Mellon grant really emphasized the need for data visualization development. I think there's been a lot of work in digital humanities and digital scholarship out there, but it's not very interpretable, and on the left here I show an example: in the command line you can see the guts of the computer working while it's processing, while it's iterating through a data set in a machine learning process. You can't read this; this is not human interpretable. Humans do not think in terms of one-dimensional strings of numbers and words. So, especially for humanities and social science faculty members, we try to create visualizations that take the results of the AI methods on the left and put them into some kind of spatial or visual format where they can extract meaningful trends. AI is a big, big trend right now in the computer science and data science worlds, but I think, along with explainability and certainly replicability, visualization is a way to open machine learning to new audiences and to gain their critiques and insights. Next slide please. So, as I said, the how dimension of how we accomplish our work is part technical, part human, and I'll address each in turn in the remainder of my presentation. The technical side is our machine learning platform. This is what has been generously funded by the Andrew W. Mellon Foundation: what we call the model of models platform. We found that we were using various digital humanities and digital scholarship tools for our different projects, and we were sort of reinventing the wheel; in 2016 and 2017 we realized that we were doing the same processes over and over. So what we hoped to do was to aggregate and consolidate all of our different methods in the machine learning realm, text mining and natural language processing, into a single platform where we could take any text data set, run it, index it, database it, and have it retrievable and searchable, and then also able to be input into machine learning algorithms. So I'll give an overview of the model of models platform. Next slide please. Model of models, as I said, really consolidates several existing technologies prevalent in digital humanities, digital scholarship and data science, with an added twist, which I'll get to in a second. Underlying the actual machine learning logic are two major families of machine learning strategies to observe, as I mention here, latent patterns in large text data sets. Topic modeling is very familiar in the library world and the digital humanities and information retrieval worlds, and we use LDA, the most popular implementation of it, created by David Blei.
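As a concrete illustration of the "index it, database it, and have it retrievable and searchable" step described above, here is a minimal sketch that uses SQLite's FTS5 full-text index as a stand-in; the documents, query, and indexing layer are assumptions for illustration, not the DSC's actual stack.

```python
# Minimal sketch: index a small text corpus so it is retrievable and searchable
# before being handed to a machine learning step. SQLite FTS5 stands in here
# for whatever indexing layer a platform like this might actually use.
import sqlite3

documents = [
    ("doc1", "Children with asthma were admitted more often during RSV season."),
    ("doc2", "The archive documents industrial pollution along the river."),
    ("doc3", "Patients expressed uncertainty about the proposed treatment plan."),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE docs USING fts5(doc_id, body)")
conn.executemany("INSERT INTO docs VALUES (?, ?)", documents)

# Full-text query: retrieve every document mentioning 'asthma' or 'patients'.
hits = conn.execute(
    "SELECT doc_id, body FROM docs WHERE docs MATCH 'asthma OR patients'"
).fetchall()
for doc_id, body in hits:
    print(doc_id, "->", body)

# The same retrieved rows can then be tokenized and fed to a topic model or classifier.
corpus_for_ml = [body.lower().split() for _, body in hits]
```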
And more recently, the deep learning revolution has led to the popularity of tools like word embeddings; word2vec very famously introduced word embeddings to the NLP world, and more recently bidirectional transformers like BERT are taking hold. The reason I call it model of models is that we don't really hang our hat on any one specific machine learning algorithm or its derivative family. Rather, we can plug in new modules with very classic machine learning methods like topic modeling and LDA, or more recent developments like language models and BERT, and we'll be able to take our data sets and analyze them with new machine learning algorithms and produce new types of models as they come along. So that's one reason we call it a model of models: you can swap out different types of machine learning algorithms, as they progress, for different research questions. What I mention here is that we have really taken advantage of distributed computing, and also cloud computing, to enable us to run parallel models, and that's another reason why we call it a model of models. It aggregates multiple models in parallel to compare word usage within a corpus across those distributed models. Basically, we can use Amazon Web Services or a high performance computing cluster, and we're using Hadoop and Apache Spark and distributed clustering architectures to spread a task across a number of different instances. And what we can do is take six models, for example; our main way of doing this is we run six models in parallel, and then we aggregate them and compare the resulting models and the word usages from the six parallel, independent models, combining them on top of each other. What that allows us to do is two things. One, we can confirm consistent topics across all models: it really boosts the signal if a cluster of words or a topic or a certain trend in the language is showing up at a very high rate of probability in six models in a row, and that gets a sort of boosted score and is weighted differently. However, if there are one-off topics or one-off clusters or trends, those are suppressed. And likewise, underrepresented topics that might not have appeared in a single model representation can be boosted as well. So it's using machine learning and distributed computing to run many replicates at once to increase user confidence; I think a lot of our collaborators were a little wary at first of the brittleness or the variability between models, so this model of models technique has really increased user confidence in the results, but also interpretability, because it emphasizes the highest-probability, most stable topics at the top of your model results and pushes the tail of outliers down. Next slide please. One thing we've been working on a lot, at the end of our first Mellon grant, and are really expanding in our second grant, is validation. How do you validate AI? There's a lot of writing and thinking going on about the replicability crisis in AI, and how deep learning and AI algorithms are this sort of black box that is very mysterious and so brittle that when you put them into the real world, they don't work very well.
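The parallel-replicates idea James describes, run several independent topic models and then trust only the clusters that recur across runs, can be sketched at toy scale as below. This assumes gensim's LDA and a Jaccard overlap of top words as the stability check; the production platform does the equivalent at much larger scale on Spark, Hadoop, cloud, or HPC infrastructure, and the corpus and thresholds here are purely illustrative.

```python
# Toy sketch of the "model of models" idea: train several independent LDA runs,
# then keep only the topics whose top words recur across most runs
# (boosting stable topics, suppressing one-off clusters).
from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [
    "children patients asthma rsv age hospital".split(),
    "archive river pollution industry history".split(),
    "patients treatment uncertainty plan clinical".split(),
    "asthma children hospital admission season".split(),
]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

N_RUNS, N_TOPICS, TOP_N = 6, 3, 5

def top_words(lda, k, n=TOP_N):
    """Set of the n highest-probability words for topic k."""
    return {w for w, _ in lda.show_topic(k, topn=n)}

def jaccard(a, b):
    return len(a & b) / len(a | b)

# Six independent runs, differing only in random seed.
runs = [LdaModel(corpus, id2word=dictionary, num_topics=N_TOPICS,
                 random_state=seed, passes=10) for seed in range(N_RUNS)]

# For each topic in the first run, count how many other runs contain a topic
# with heavily overlapping top words (Jaccard >= 0.5).
for k in range(N_TOPICS):
    ref = top_words(runs[0], k)
    support = sum(
        any(jaccard(ref, top_words(other, j)) >= 0.5 for j in range(N_TOPICS))
        for other in runs[1:]
    )
    status = "stable" if support >= (N_RUNS - 1) * 0.5 else "one-off"
    print(sorted(ref), f"recurs in {support}/{N_RUNS - 1} other runs ->", status)
```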
So we've come up, in our various projects and publications, with various ways to build validation methods into our platform that boost the replicability, the interpretability and the trustability of our work. One is internal: we do internal validation measures. I won't go into all of these, but we use various topic coherence measures, which basically assess the percentage of difference or variability between clusters of words and trends within each of the six parallel runs. And obviously, between the six individual runs, if there's a huge gap and there are wild swings in probability, something is going on, so that's a flag that the topics might not be that coherent. Then there's external validation, and this is where I feel what's so next generation about our work comes in. The algorithms we're using are tried and true, they're trusted and they've been well tested, but they have not really been integrated with human validation. What external validation does, trusting the human and finding the way for humans to enter into the black box and interrogate the black box, is really where we're trying to make a change in machine learning and analytical culture. So, external validation: what we do is we take subject matter experts or collaborators, whether it's the faculty member we're working with and their students or class, and they tag a randomized sample of the model results. I say a randomized 20%; that's the ratio that we normally like to use, but it needs to be a sufficient sample size, as you can imagine, so it has to be a model of over 1,000 documents. So we have a random 20% of the model tagged by subject matter experts who look at the model results and literally annotate and tag different clusters, saying, okay, this seems to be about literature, this seems to be about politics, etc. And then we take a blind human coder panel, we have several collaborators that work with us, their labs and their graduate students, who have no subject matter expertise, and they assess, tag and annotate the same randomized sample. And then you assess the percentage agreement between the subject matter expert human tagging and the blind human coder tagging. There needs to be at least 70% agreement; otherwise, this external validation method shows that the topics are not coherent. So, regarding the replicability crisis, we're really trying to tackle this head on, so that our users and our collaborators trust our methods. We're using the pattern recognition capabilities of machine learning and natural language processing methods as an information retrieval technique. It's not like a genie in a bottle that we rub and then get a model result and say, oh, surely it must be true. We're not interested in classification per se. As an information retrieval technique, the clusters of words and the documents that we retrieve really offer a human user the chance to assess the validity and the robustness of the model results. And, as I say there, the goal is to provide models capable of evaluation by our panel of multiple independent coders.
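The external validation step just described, comparing subject-matter-expert tags with blind coder tags on a randomized sample and requiring at least 70% agreement, reduces to a simple percentage-agreement calculation. A minimal sketch follows; the tags are made up, and in practice the sample would cover roughly 20% of a model of at least 1,000 documents.

```python
# Minimal sketch of the external validation check: compare a subject matter
# expert's cluster tags against a blind coder panel's tags on the same
# randomized sample and require at least 70% agreement.
# These tags are hypothetical.

def percent_agreement(tags_a, tags_b):
    """Fraction of items on which two annotators assigned the same tag."""
    assert len(tags_a) == len(tags_b), "annotators must tag the same sample"
    return sum(a == b for a, b in zip(tags_a, tags_b)) / len(tags_a)

sme_tags   = ["medicine", "politics", "literature", "medicine", "politics"]
coder_tags = ["medicine", "politics", "literature", "medicine", "literature"]

agreement = percent_agreement(sme_tags, coder_tags)
print(f"agreement = {agreement:.0%}")  # 80% in this toy example
if agreement >= 0.70:
    print("topics pass external validation")
else:
    print("topics not coherent enough; revisit the model")
```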
As I say here, part of the reproducibility and replicability stakes that we've set for ourselves is that we run parallel replicates of every model, so it's not just a one-off, n-equals-one model that we base our research on. Depending on the research question and the discipline that we're working in, we run between six and 20 runs in parallel. For some disciplines, like some of the biomedical projects we're working on, the burden of proof is very high, so we need to run about 20 models in parallel to really trust our approach. So that's really the team-based side of what we do: a hybrid machine approach with the human judgment of subject matter experts who verify and tag the model results. And there's now a burgeoning literature from the past three or four years showing that this blend of human judgment and machine learning model outputs outperforms purely machine-based analysis, no matter how optimal or how well honed. I'm not going to get into classification too much today, but suffice it to say that when we do do classification tasks, the hybrid ML approach manifests itself as a semi-supervised learning approach; it's not purely unsupervised, as some deep learning approaches are. Next slide please. So, in building the model of models platform, I described the underlying logic: distributed computing and running several jobs and models in parallel, with the goal of giving our human coders and human subject matter experts something to assess, something they can assess the viability of for their research question. In order to do that, the inputs are different data sets. We have certain data sets that researchers come to us with: they have a small local archive or a small local data set, something that they digitized. But more often than not they're interested in analyzing some of the very large digital archives out there that have been open sourced and digitized in the past decade or two. So here's just the first screen of our model of models page, where the user can select the database that they're querying and analyzing. It's very important for us to make our platform interoperable with the very large data sets out there. I just list here that our platform is compatible with the HathiTrust Research Center Extracted Features data set; JSTOR Data for Research data sets, based on requests that you can put in; Chronicling America; the Text Creation Partnership; and the Harvard open case law project that is trying to open source US case law, as against LexisNexis and Westlaw, basically. On the far other end of the spectrum, on the medical side, there's a lot of text out there: there are the over 30 million abstracts in PubMed and the over 2 million full-text open articles in PubMed Central that we are compatible with; US patent claims back to 2004; and electronic health records in the FHIR data structure, and that gets a little tricky, because those need to be IRB approved and they need to be HIPAA compliant, but some of our medical and clinical projects are interested in analyzing electronic health records. Also, large batches of social media data: Twitter, Instagram images and captions, and Reddit threads. So it's a wide range of data sets of different scales, the largest probably being the PubMed data set with over 30 million articles.
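To make the data-set side concrete, here is an illustrative sketch of pulling a handful of PubMed abstracts through NCBI's public E-utilities API, one of the open corpora named above. This is not the DSC's ingestion pipeline; the endpoint and parameters are the standard, documented E-utilities ones, and the query term is invented.

```python
# Illustrative only: fetch a few PubMed abstracts via NCBI E-utilities,
# as one example of the open corpora a platform like this can draw on.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# 1) Search PubMed for a handful of article IDs on a topic.
search = requests.get(f"{EUTILS}/esearch.fcgi", params={
    "db": "pubmed", "term": "childhood asthma rsv", "retmax": 5, "retmode": "json",
}).json()
pmids = search["esearchresult"]["idlist"]

# 2) Fetch the abstracts for those IDs as plain text.
abstracts = requests.get(f"{EUTILS}/efetch.fcgi", params={
    "db": "pubmed", "id": ",".join(pmids), "rettype": "abstract", "retmode": "text",
}).text

print(f"Fetched {len(pmids)} abstracts, {len(abstracts)} characters of text")
# The returned text can then be cleaned, indexed, and fed to the topic models.
```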
But what we want is for the underlying data structure and metadata to be interoperable with our platform, so users can pick and choose what data set they want to use for the underlying analysis. Small corpora, I get questions about that sometimes. Someone came to me about a year ago saying they had 100 documents that they wanted to use machine learning on, and I said that 100 documents sounds like a human reading problem, so we can sit down and read them with you, but it's not really something suitable for our methods. So this is the output. I'll show you some different outputs that we construct using the model of models method, but you select a data set based on your discipline, and based on your research question we can begin to ask the research questions. The ultimate model is this, and this is really the interface that users see: the information retrieval and assessment page. On the right-hand side here you see the clusters, and the clusters are the aggregated topics, the aggregated clusters of words that have the highest probability of occurring together within a corpus. If you look at cluster number 12, for example, the words children, patients, asthma, RSV and age, these are words that have a disproportionately high probability of co-occurring within the corpus. Each cluster likewise shows words that co-occur a lot within the same word range, within the same sentences and paragraphs. So that's a tabular representation of word clusters. In the center here, this is one of our innovations, something that we've created for data visualization: a vector space model representing these topics in a two-dimensional space, so basically you can assess where the hotspots and clusters of words and language are, and then what the distance or similarity between the clusters may be. The position in the vector space is dictated by word similarity, the assumption being that nodes that are very close together have more similar words in their topics and clusters, and those that are more distant have less similar words. And the edges between the clusters here are determined by document similarity, so it's the number of documents shared between clusters. So you can assess word-level and document-level relationships in this visualization. On the left, for the documents, that's really the information retrieval pane, so it shows all the documents that contribute to each cluster, and it allows users to establish ground truth by reading the articles contributing to each cluster. Next slide please. So we have multiple visualization outputs based on a single underlying method. I won't go into each one of these in detail, but you can see that on the left here, for a certain data set, users can search, filter and explore the underlying texts that they want to input into the model, by time range and by word frequency and word use. One of the visualizations that we use, on the right-hand side here, is word trees. We had a specific request: some of our users, some of our collaborators, wanted to look at a hierarchical structure rather than a network structure. So based on the same underlying model we can produce different visualization outputs, based on the cognitive and analytical needs of our collaborator. Next slide please.
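The cluster map James describes, with node positions driven by word similarity and edges by the number of shared documents, can be approximated with a small network-graph sketch like the one below. The cluster contents and shared-document counts are invented for illustration; the real platform derives them from the aggregated model results.

```python
# Toy sketch of the cluster-map visualization idea: clusters become nodes,
# layout distance is driven by word similarity, and edge thickness is driven
# by how many documents two clusters share.
import networkx as nx
import matplotlib.pyplot as plt

cluster_words = {
    12: {"children", "patients", "asthma", "rsv", "age"},
    7:  {"patients", "treatment", "uncertainty", "clinical", "notes"},
    3:  {"archive", "river", "pollution", "industry", "history"},
}
shared_docs = {(12, 7): 40, (12, 3): 2, (7, 3): 1}  # documents shared between clusters

def jaccard(a, b):
    return len(a & b) / len(a | b)

G = nx.Graph()
for (u, v), docs in shared_docs.items():
    G.add_edge(u, v,
               weight=jaccard(cluster_words[u], cluster_words[v]),  # drives layout
               docs=docs)                                           # drives edge width

pos = nx.spring_layout(G, weight="weight", seed=42)  # word-similar clusters sit closer
widths = [G[u][v]["docs"] / 10 for u, v in G.edges()]
nx.draw(G, pos, with_labels=True, width=widths, node_color="lightgray")
plt.show()
```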
So these are examples, based on the same model, of different outputs: a Sankey diagram, the cluster table and the document retrieval pane for information retrieval on the left. The Sankey diagram allows you to see the ubiquity or rareness of word use across the different clusters, and on the right is a classifier using BERT, a deep learning language model method. So that's a different kind of output altogether that we can produce based on the same method. Next slide please. To wrap up here, I just want to show some of the range of what we're doing. Many of our projects are in the digital humanities, and that's one of our real strengths. An example here is using the model of models for a publication recently showcased in PMLA that I co-authored, on unlocking the Anthropocene archives, using text mining to identify a history of humans' impact on the earth using the model of models method. Next slide please. James. Yes. Could I get the next slide? Oh, I think it's up. Okay, here it is. So, in the digital humanities realm, this is just one example of a recent research project that is forthcoming in Post45, where we were analyzing a digitized audio archive and identifying language use in different archives to show hotspots here. So this is just an example of a different visualization output for audio archives. Next slide please. In the social sciences, we have projects in sociology and journalism and communications. This is one example of analyzing Twitter: using our platform to analyze linguistic and semantic trends within public political discourse on the right wing and the left in the 2016 election, which was recently highlighted in New Media & Society. So a different visualization output, certainly a different data set, based on the needs of our social science collaborators. Next slide please. One area that we're really growing into in the digital humanities realm for our second grant is minimal computing: using mobile devices and creating apps, applying some of our machine learning and data visualization technology in apps, which I won't go into a lot, but it's asking: assuming that you do not have the resources of high performance computing, how can a phone or an iPad use some of our methods? And in closing, I just want to highlight some of the DSC's very unconventional and surprising projects with our academic health center and College of Medicine. What I would say is that these projects really are two-way experiments in disciplinary contact. We have taken on the role of teaching clinicians, STEM faculty and practitioners how to work with qualitative data and how to work in a humanistic or social science mode. Next slide please. For example, with our children's hospital's hospital medicine division, we're looking at clinical uncertainty, looking at language within electronic health records where doctors and nurses waffle or hedge or demonstrate uncertainty, so they can flag and identify: are there signs, are there linguistic tremors, demonstrating this kind of uncertainty within the electronic health records? Next slide please. There's a similar project; for the sake of time I'll pass over this, and you should go to the next slide.
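As an illustrative stand-in for the uncertainty-flagging classifier just described, the sketch below uses an off-the-shelf zero-shot model from Hugging Face transformers to label synthetic clinical-style sentences as hedged or definitive. The actual project would use its own fine-tuned BERT models and IRB-approved, de-identified EHR text; the model name and sentences here are only examples.

```python
# Illustrative sketch: flag sentences in synthetic clinical-style notes that
# hedge or express uncertainty, using a generic zero-shot classifier.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sentences = [
    "The infiltrate may possibly represent early pneumonia; will reassess tomorrow.",
    "Patient was discharged home in stable condition.",
    "Etiology remains unclear; differential includes viral versus reactive process.",
]

for sentence in sentences:
    result = classifier(sentence, candidate_labels=["uncertain or hedged", "definitive"])
    top_label = result["labels"][0]  # labels are returned sorted by score
    print(f"{top_label:>20}: {sentence}")
```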
And this is an example of using our model of models method with the foster care center at our children's hospital to identify some of the social health and wellness issues in foster children who pass through their center, identifying how these children and the clinicians and social workers describe the plight of these patients, and how the current data structures and the current clinical methods do not adequately reflect how they're expressing their condition. Next slide please. And I'll skip over this one for the sake of time. In the second phase we're really expanding into multimodal machine learning methods, text and image, identifying trends in art historical data sets like the cave images on the left, and on Instagram, using our machine learning logic to cluster text and image within data sets. For our research questions, there's been a lot of demand for this among our art historians, artists and design faculty. What we're also doing is expanding our Jupyter notebook pipeline, so we're taking each phase of our machine learning platform and, rather than offering it as one package, extracting different sections for researchers to use in a sort of DIY method for certain slices of our methodology. And the last slide please. So, wrapping it up: here's the location of our current model of models platform, at modelofmodels.io, as well as the Digital Scholarship Center website. Among the second grant objectives, we are going to expand to 15 sub-grants. We're looking for more partners, so if your institution is interested in using the model of models and our Jupyter libraries for these kinds of digital scholarship projects and expanding on your campus, we'd love to work with you. And, of course, as we mentioned, we're very grateful to the Mellon Foundation for their support of all of these activities. Thank you very much. Wow. Thank you. That was an amazing presentation, and the breadth of work that you all are involved in and the possibilities are quite astonishing. Thank you so much for your presentation. I'm going to go right into the questions because we already have several queued up. The first one comes from Nicholas Taylor, who asks: how does the DSC evaluate and populate the pipeline of candidate projects to collaborate on or facilitate? How do you benchmark capacity? James, do you want to take that one? Sure. What I would say is that's partially the goal of digital integration: to separate out tasks, projects and queries that are more service related, that can be addressed by our research and data services team, from those that need more in-depth research collaboration with the DSC. That said, I think the model of models platform and the tools we've created are a means to create more capacity. Initially we were manually creating these tools over and over again, so we wanted to consolidate that effort, and the platform really allows us to get a head start and have a sort of pipeline of analytical capability that we use for each project. Fabulous. Okay, thank you for the question, Nicholas, and thanks for fielding that, James. Next we have an anonymous attendee who asks: do you support long-term dissemination of the visualizations? Do you have any sunsetting policies? Good question. That's an issue that we've really been working on with our research and data services team: in terms of the long term, how do we store, and what's the preservation and repository strategy for, a lot of our results and data sets?
I think right now the best answer we have is that we are preserving and putting into repositories the underlying matrices or the underlying data tables for the visualizations, so people can sort of, quote unquote, rehydrate them down the line. So we're not necessarily preserving the visualizations themselves for the long term, but the underlying data: someone could easily pop them into Excel, and it won't look exactly the same, but they can see the same trends, basically. I think that's the simplest way to do it so far. If you have suggestions I would love to hear more, because that is a problem. If you've got any suggestions, please chat those through. Our next question and comment comes from Don Waters. Don says hello to everyone, terrific presentation, good to see you all. Can you comment more specifically on demand for your services? Do you have to beat the bushes for projects like the ones you've mentioned, or are faculty queuing up to work with you? If they're queuing up, how do you sort out which projects to support? That's a great question, and it relates to the capacity question earlier. We don't have to drum up projects; word of mouth has helped a lot. I think we had several early adopters in different departments and colleges on campus, and they've talked to their colleagues, and then their colleagues have come to us with research requests. What I would say is that we try to phase our projects. Some of our collaborators come and they're ready to go: they have a data set ready, they have students who want to work, they have a sabbatical and they're ready to roll this year. Others are more aspirational: they have an idea or a hunch, they have no idea how to address it, and they've never worked with digital tools at all. So those aspirational projects we coach, and two to three years down the line they might be ready to be a full-blown research project, whereas with those that are ready to go, we can start next semester. We do an initial triage phase to see the maturity and readiness of the research question and the faculty member, and we sort of stagger them semester by semester. I also want to thank Don for asking that question and add one more comment: the visibility of the digital scholarship center on campus has also helped faculty learn about engaging with the digital scholarship center. In addition to the example I gave, being selected as one of the five anchor teams for the Digital Futures building, James and I were actually invited to give a presentation to the board of trustees. All of this has been helped by the provost promoting it and the VP for Research promoting it, so the visibility is pretty high on campus. That's terrific. Thank you so much. Thanks, Don, for the question. Next, Emily Gore asks: can you talk about the computing power setup you have to process this data? You mentioned HPC. Is this work being supported by your campus data center? Could you share information about this? A very good question, and that's a big component of the work we do. What I would say is that some of our computing we do on the cloud, whether it's Microsoft Azure or Amazon Web Services; I would say a lot of our prototyping goes on there. A lot of our jobs that are small enough in scale we can just do in our lab; desktop computers are high powered enough, if you have an eight-core machine, that you could run our pipeline locally.
What I would say for long-term use, and where our platform is actually housed and served, is that it is the campus high performance computing cluster. An interesting story, sort of related to the previous question about whether people come to us or we have to seek them out: our aerospace engineers and our high performance computing team came to me last year and said that they were trying to get an NSF MRI, Major Research Instrumentation, grant to boost the high performance computing capabilities, and the NSF reviewers had said, you're not diverse enough. You're not really university-wide; all you have is about five people in aerospace using this. So you need more use cases. They came to me to say, can the DSC help us craft a case for university-wide application and use cases for high performance computing? So I helped to break the logjam with NSF, ironically. It's a strange instance where I think that digital humanities and digital scholarship can actually assist other fields in surprising ways. They might have limitations, they might be siloed in ways they don't realize, and they don't know that they could actually appeal to a broader audience, and I think we've really tried to do that. That's really interesting. Great. Thank you, Emily. And now we have a question: have you worked with CADRE, which is the Collaborative Archive and Data Research Environment science gateway? I don't believe so; I'm not familiar with that, but I'm sure one of my team members is, and I'll ask them about it. Sounds fascinating. Okay. All right. I have lots of folks thanking you, and, oh, John Dunn has just chatted in the CADRE URL. Let me just chat that out to everyone. Thank you. There we go. And I see we have another question now from Terry Wheeler. You discussed Epic FHIR data. Clicking into your site, I do not see this data set, but it quite likely is private. If this is HIPAA-protected data, do you have a data enclave for this, or do you just store the model? Very, very good question. This has been a tricky area and a real area of learning for us, because these are absolutely the opposite of open data sets; they are really tightly regulated. So what we do is we have an honest broker on our academic health center campus and College of Medicine: it's the biomedical informatics department. They have secure private servers that they use for HIPAA-protected patient data, and they also de-identify all the data, so anything that could be used to extract or infer the identity of the patients, they have protocols for. We have to go through IRB approval for every project, so it's quite arduous, but it's worthwhile. So, yeah, it's not public at all. These results are not public, and we do not house them on our Amazon Web Services or high performance computing cluster; they're on the biomedical informatics secure server. And sometimes, for very, very secure projects, we have to go on site; they don't want the data out of the building. So we have one of our developers actually going to one of their offices to work on it. It's been eye-opening for me, but it's all very worthwhile. So, yeah, that's far on the other end of the spectrum from open data. That is really interesting. Thanks for the question, Terry. And thank you again to our presenters for this fascinating talk. It's really interesting.
We're really looking forward to hearing what's going to develop out of this project, and thanks to our attendees for making time out of your day. I see we're well past time, but we do still have some attendees, so I'm going to go ahead and close down the public portion and shut off the recording. If there are any attendees who would like to continue to hang out with us and maybe have a chat with our presenters, ask some more questions or share your own experiences, please feel free to do so; I'll be happy to unmute you. And with that, I will wish everyone a good rest of your day, and we hope to see you back at CNI tomorrow. Take care, everyone.