So it's my pleasure to introduce you all to Dr. Nathan Kelber, who's the community engagement lead at JSTOR Labs and the director of the Text Analysis Pedagogy Institute. Having served as a public historian, research librarian, and professor, his projects promote technical literacy in data science and the digital humanities. Nathan's going to speak to us today on the topic of why learn text analysis: text and data mining with Constellate. Please join me in welcoming Dr. Nathan Kelber. Wow, actual applause. You don't hear that often with people all in the same room. This is great. Thank you. I appreciate you keeping it very brief so we can focus on the text analysis. So I've got two versions of the slides: one on business and data science and one on digital humanities and libraries. My background is in digital humanities and libraries, I would say, but I do occasionally teach as a professor at Wayne State University in data science and the business school. I was going to ask the people in chat to say where they're from. Maybe they'll give us an idea. Sure. It seems like in this room we have a librarian, an artist, a computer scientist, and an engineer. So that's pretty varied. Oh, boy. OK, you guys are challenging me. We're going to do digital humanities and libraries, but then I'm going to supplement. And actually what I'll do is I will share here the link to the slides for business and data science. And then I am going to actually show, this is for DH and libraries, and we can share those slides, too. I am going to share these on my screen over here. I'll bring the chat over, so we're in good shape there. Yeah, let's present these. All right, so why learn text analysis? I was supposed to talk a lot about Constellate, but I'm a really bad salesperson. I prefer to actually teach people because I think that's more interesting.
And then show you why Constellate is useful and talk about my experiences teaching, and some other people that I've worked with who are now using the Constellate platform, which is now in its beta form, to teach text analysis. But I want to talk a little bit about, for people who are curious about text analysis or natural language processing, why would you even want to learn this? And what's the value in doing this? And how long does it take? These are the kinds of questions that I think are really important as you start on your journey. So this is a very high-level view of text analysis, but I'll leave it open if people want to ask more specific questions as we get deeper into the slides. So I've already shared the links to the slides. I'll talk very briefly about ITHAKA, which is a not-for-profit organization. Sometimes when people talk about Constellate, they call us a vendor because we're associated with JSTOR. And I'm like, we're not a vendor. We're not-for-profit. I've never worked for a for-profit agency, and that's one reason why I'm terrible at selling things. But ITHAKA has four parts, essentially. Artstor, which is sort of like JSTOR, except for images; we supply 2 million plus high-quality images and digital assets. Ithaka S+R, which is like a research arm that libraries often use to make decisions. JSTOR, which you're probably familiar with; that is a not-for-profit digital library of academic journals, books, and primary sources. And Portico, which is a preservation service; it's a dark archive where we preserve mostly journal materials for the long term. And so those are the four different parts. I work in a special wing of JSTOR called JSTOR Labs, and it's kind of like an applied digital humanities think tank. It's the way that I figured out that I can basically be a professor and do digital humanities projects all the time that are new and interesting, and not have to be in committees all the time.
It's my secret professor world. And so we partner with publishers, libraries, and labs to create tools for researchers, teachers, and students that are immediately useful and a little bit magical. Here are some of the things that we've been working on. And when you look at some of these projects, you understand that we're not-for-profit, especially working on things like higher education in prisons with the Mellon Foundation and finding ways to basically help people teach and learn in all sorts of different institutions. The other thing that I'll mention here is the Text Analysis Pedagogy Institute, which I am the director of. It's an NEH-funded institute to help people learn text analysis and also to teach text analysis. Last year was the first year that we did it, in partnership with the University of Virginia, both the library and also the School of Data Science there. In 2022, we're going to run it again in the summer, and it will be in partnership with the University of Arizona. Last year, we brought in over 200 people virtually to take text analysis courses. Maybe I should supply a link to that, because there are open educational resources that we created as a part of the TAP Institute and they are copyleft licensed. So if you want to reuse them or adapt them for your own learning and teaching, it's a great place to start. Stuff like Python basics, introduction to optical character recognition, topic models, pandas, machine learning, visualizing humanities data, text analysis in ancient and medieval languages, and named entity recognition; those were the courses from last year. And right now we're just getting the instructors together, and we'll probably put out a call for participants maybe at the very beginning of 2022. It may be late 2021. So if that's something you're interested in, keep an eye out for that. And it's completely free. It's funded by the NEH.
Normally I do a level set, but I think I've gotten a little bit of a sense of who I'm talking to. It seems to be a fair number of students, actually, as well. Am I reading that in the room? The box is a little small. You have everyone from undergrads to grad students to faculty members here. Great. I always watch what the grad students are doing. People think, look at what the faculty are doing, but I'm like, what are grad students talking about right now? That's how I know whether I'm on the pulse of what's cool right now. If grad students are showing up, then I know, okay, I think I'm doing the right things right now. I'm doing the things that are interesting. So first of all, why would you learn text analysis? Data literacy is really at the heart of research. And this has been true for a long time in the STEM world, but I think it's only recently become true in the digital humanities, kind of over the last 15 years. As humanities computing transitioned into digital humanities, data literacy is really changing the way that we do research and the way that we search, find, and discover information. Data skills are really in demand right now. If you were to look at Glassdoor and figure out what the top position was for the last five years or so, what do you think that position is? It's not English professor, unfortunately. I wish it was. Data scientist? Data scientist, you're absolutely right. I think it fell to number two this year, so I may be a little bit out of date there, but data scientist is the top career on Glassdoor. But everywhere, data skills are in demand, especially in libraries, as they're trying to work with bigger and bigger data and trying to work with faculty that have all sorts of different data needs.
I worked on a project recently, as part of a grant on collections as data, basically looking at how libraries can begin to think of collections as data and to collect data in such a way that it can be used for research. And I think that learning text analysis is also really important because it helps us understand things like algorithmic bias. And here I'm gonna take a more humanities bent. That is, even if you don't expect to become an expert in text analysis, even if you don't wanna deploy machine learning models or whatever, I think it's really important to understand how some of this stuff works, if only because it's already making decisions in your everyday life, whether you're aware of it or not. That is, models are being used to decide things like which neighborhoods will get patrolled more often by police officers, or who gets a particular loan from a bank, or questions about who gets paroled, or who gets surveilled, or whose information is available and whose information is not. So I think that it's really important to understand text analysis and the ways it's being employed so that we can be aware of issues like algorithmic bias. And I always keep a copy of Data Feminism because I reference it so much and I recommend it so much. So I have a quote here from Data Feminism. If you haven't read this, humanities people, and also data science folks, frankly, it's really interesting work, looking at why ethics in data really matter and why we should think about data justice. And I always stress this when I teach data science strategy and leadership: really thinking about issues of ethics and justice when it comes to data. So here's the quote I have: decisions of civic, economic and individual importance are already and increasingly being made by automated systems sifting through large amounts of data.
For example, PredPol, a so-called predictive policing company founded in 2012 by an anthropology professor at the University of California, Los Angeles, has been employed by the city of Los Angeles for nearly a decade to determine which neighborhoods to patrol more heavily and which neighborhoods to mostly ignore. But because PredPol is based on historical crime data, and US policing practices have always disproportionately surveilled and controlled neighborhoods of color, the predictions of where crime will happen in the future look a lot like the racist practices of the past. And this is also discussed quite a bit in the book Weapons of Math Destruction. That's M-A-T-H, Math Destruction. So I recommend checking that book out as well. I've got a second quote by Ruha Benjamin, and this is from Race After Technology. She says, the same sort of algorithmic filtering that inserts more ethnically tailored representations into my feed can also redirect real estate ads away from people like me. This filtering has been used to show higher-paying job ads to men more often than to women, to charge more for standardized test prep courses to people in areas with a high density of Asian residents, and many other forms of coded inequity. The difference is that coded inequity makes discrimination easier, faster and even harder to challenge, because there's not just a racist boss, banker or shopkeeper to report. Instead, the public must hold accountable the very platforms and programmers that legally and often invisibly facilitate what she calls the New Jim Code. And this is the way that racism is written into coded language. And often this takes the form of bias in training data sets.
So those are some reasons why you might wanna learn this, not just from a research perspective, but also just from a general good-citizenry perspective: it's good to have an understanding of the way that these technologies can be used for good and also for evil, and to think critically about how technologies are being deployed in your community, however that is. But if we look at it from a more academic perspective, why would you wanna learn text analysis if you were a humanist, for example? I think it's clear that there are many humanities faculty that will never use text analysis. But I think that's slowly changing over time. And part of it is because the way we do research is fundamentally changing. And here's Dan Cohen at Northeastern. He says, many humanities scholars have been satisfied, perhaps unconsciously, with the use of a limited number of cases or examples to prove a thesis. Shouldn't we ask, like the Victorians, what can we do to be most certain about a theory or interpretation? If we use intuition based on close reading, for instance, is that enough? Should we be worrying that our scholarship might be anecdotally correct but comprehensively wrong? Is one or 10 or a hundred or a thousand books an adequate sample to know the Victorians? What might we do with all the Victorian literature? Not a sample or a few canonical texts, but all of it. And in that realm, I think that humanists are increasingly confronted by the fact that they have data. We're not generally creating data through experimentation or observation. More often than not, we're mining data from historical documents. You name it, we've tried to mine it, from whaling logs to menus to telephone directories. This means that we tend to want different tools than scientists, and also that we have some interesting data wrangling problems. More often than not, the categories that our historical sources use to divide up our data are not the same ones we're interested in analyzing.
So we often have to do some very creative transformations and interpretations. This is a really great piece by Miriam Posner that I recommend when people are starting to think about humanities data and why analyzing it and going through it may be very important. But put very, very simply: if you cannot read, manipulate and interpret data, you're overlooking an incredibly rich source for understanding the human condition. And you need to know what can be discovered in data, especially data at scale, and how data at scale can be used. And, as I've been alluding to as well, how it can be misused. I think that's equally important from a humanities and social justice perspective. Here's Harriett Green speaking particularly about the issue of collections as data. And this mimics my experience. This is for the librarians. This mimics my experience at UNC, where we had these huge collections and they were given to us by a vendor on a bunch of hard drives. And then we kind of put them in the closet because we didn't know what to do with them. They weren't ready, they weren't show ready. And so she says, not all of the data we create or purchase for library collections comes in neat multi-gigabyte packages of curated files. We recently discovered that datasets we had purchased as part of a database licensing negotiation were more shelf ready than machine ready. They currently exist as stacks of hard drives, disks and other bewildering formats sitting on a book cart. How do we provide access to these data collections? And so we're all awash in data. I mean, that's really what this is about. There's tons of data. We don't know how much information is in it. We don't know how to get the good stuff out of it. And when it comes to text analysis in particular, that is the vast majority of the data.
That is, when you go through a data science program, a lot of it will tend to be focused on numerical data: things like linear regression, logistic regression, and working with numbers, which are famously tractable for computers. But less is understood about working with words or strings, working with unstructured data. That is, data that's not in tables or Excel files. How do you take five million emails or five million tweets or five million novels and get something interesting or useful out of them? I think that's a more difficult question. But when we actually look at the makeup of the data that is out there in the wild, the stuff that's being created all the time, it's mostly unstructured text data. There are also things like audio and video, obviously, where people are often trying to get to text materials; they're using transcribers and other types of things to try and get to the text data to analyze things further down the line. And so the question you should ask yourself, if you're thinking about learning this stuff, is first of all: how long is this gonna take me, to learn text analysis? It's kind of like when I was an early modern scholar, I was like, oh, crap, do I need to learn Latin? Am I gonna be in trouble if I don't know Latin? And spoiler alert, I didn't learn Latin, but many of my colleagues did, and they did not learn all of Latin. That is, they only had to learn enough Latin to get by with the types of research they were doing. It's not possible to learn all of text analysis. That's kind of a silly goal, like saying you'll learn all of English or all of history or something like that. You can't do that. And so what you need to do is know: here's the kinds of data that I may have access to, and here's the kinds of questions I'm interested in.
And so you just kind of need to know some of the affordances: what are some of the methods? The problem is most people aren't equipped to answer that question. They don't know the difference between, say, topic modeling and latent Dirichlet allocation. Those are just weird words. And so my goal here today is to introduce some of these opportunities, really about what you could learn and what might be the opportunities for your research. In text analysis, obviously. And so I wanna give you that guiding path, that initial step. But before you can get started, you're gonna need some data, of course. And really you need machine-readable data, machine-readable text. That is, something that can be copied and pasted. You can start with something like PDFs or pictures or something like that, but then you'll have to do some type of optical character recognition to get you to that step where you can copy and paste text. So you need machine-readable text, and since this uses statistical methods, you're gonna need lots of it. And this is the part where a lot of humanists just check out. They're like, well, I don't have that, so that's the end of this talk. And so part of the goal with this Constellate program is: hey, we're a nonprofit that has access to tons of machine-readable text. Maybe we can help people create interesting data sets that they can use, either for doing research or just for exploratory purposes in their discipline, to learn some of these techniques. And who knows, maybe publish a paper or something out of it. But the idea is to really give people all of the tools they need to begin learning these things, right? And so now I'm gonna talk about directions that you could go. And this is really the bulk of what I want you to take from this talk; other than that, you should check out Constellate and it's really cool, but mostly this is the thing that I think is interesting from an educational perspective.
I've got five questions here, and each question has several techniques. This is not exhaustive; this is just a very quick introduction to some of the things that are out there, some of the possibilities that you could explore. And so I'm gonna list the five questions; underneath, you're gonna see the methods. And I'm gonna talk about each method a little bit more in depth, but first I'm gonna introduce these five questions. So the five questions that text analysis can really help you answer are: What are these texts about? And that's probably the most popular question, the question that gets asked the most. Like, I have a bunch of emails, what's in these? What are these about? I have a bunch of novels, what's in here? I have these whaling logs, what are they talking about? So that's, what are these texts about? The second question is, how are these texts connected? The third question is, what emotions are expressed? The fourth question is, what key names can I find? And then the fifth question is, which of these texts are similar? So I'm gonna talk about each one of these methods in just a moment here, but I just wanna introduce the five questions first. So the first question: what are these texts about? There are a number of methods that I would recommend right off the bat that are fairly simple to employ, that will begin getting you thinking about what's inside of these texts. And the first thing, at the beginner level, is just going through the texts and seeing how many times certain words occur: basically word counting, like putting all the words in a big bag and then pulling them out one by one, like Scrabble tiles, and just saying, this word occurs this many times. Sometimes this is called a bag-of-words approach, because you're not really concerned what order the words come in or what part of speech they are or anything like that. Just, how many times does this word occur?
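To make the bag-of-words idea concrete, here is a minimal sketch in plain Python. The tiny stop-word list and the sample sentence are my own illustrative assumptions, not from any real corpus or from the Constellate platform:

```python
from collections import Counter

# A minimal bag-of-words sketch: count word frequencies, dropping a
# few common English function words ("stop words"). Real stop-word
# lists are much longer; this one is just for illustration.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "it"}

def word_counts(text):
    """Return a Counter of lowercased words, minus stop words."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    return Counter(w for w in words if w and w not in STOP_WORDS)

text = "The women of the town met in the town hall. The women voted."
counts = word_counts(text)
print(counts.most_common(3))  # the most frequent non-stop words
```

Word order is thrown away entirely, which is exactly the "Scrabble tiles in a bag" picture: all that survives is how often each word occurs.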
And that's a very good beginner technique, to just get a sense of how many times a word occurs. So you could ask a question like, which of these texts focus on women? And so you could look at the words woman and women and you could see how many times those occur in these texts. Of course, that may not get you all the way there; it's a beginning exploratory technique, but it's a first step in figuring out what the texts are about. Like, what are the top 100 words? And here you may need to take out certain words that are function words that are likely to occur, say, in English; that would be things like the and of. We call these stop words. And so you may need to pull those things out, but the idea here is just to figure out, what are the words that are the most common? And maybe one level beyond that, we can do something like collocation. Collocation is examining where two significant words occur close to one another. They could occur right next to each other, or they could be three words apart, or they could be in the same sentence or the same paragraph, but you're really looking to see where those two things occur in close proximity. And so the example here is, where are women mentioned in relation to home ownership? And so you could be looking at those two different things, with home ownership being a two-word construction. Sometimes we call that a bigram, because it has two parts. And so you would look at where those two things are collocated inside of a text. And you may actually pull those out and begin to close read them, or you may take a different approach, depending on your research needs. Another thing that you can do, at a more intermediate level, is topic analysis or topic modeling.
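The collocation idea just described can also be sketched in a few lines. The window size, the sample sentences, and the choice to match single words rather than bigrams are all simplifying assumptions on my part:

```python
# A toy collocation sketch: find places where two terms occur within
# a small window of words of each other. Real collocation analysis
# would also handle bigrams like "home ownership" and use
# significance tests, which this toy skips.
def collocations(text, term_a, term_b, window=4):
    words = [w.strip(".,").lower() for w in text.split()]
    hits = []
    for i, w in enumerate(words):
        if w == term_a:
            # grab the neighborhood around term_a and check for term_b
            nearby = words[max(0, i - window): i + window + 1]
            if term_b in nearby:
                hits.append(" ".join(nearby))
    return hits

text = "Women discussed home ownership at the meeting. The men left early."
print(collocations(text, "women", "ownership"))
```

The returned snippets are exactly the passages you might then pull out and close read, as mentioned above.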
And this is an unsupervised machine learning technique, and that may sound intimidating, but I can tell you that in the courses and workshops that I teach, we usually get to something like topic modeling within the first month. And that's for people who have never programmed in Python before. And so topic analysis is really asking the question, what are the most frequent topics discussed in this newspaper? And it's going through and trying to group common topics, or groupings of words, together that are in a certain set of documents. Another thing that you can do is significant terms analysis. One form of this is TF-IDF, which is basically what powers search engines; it probably powers your library search. And it's basically finding the significant words within a text. And so you might ask a question like, what language is most significant within these 1970s political speeches that I have? And so it's figuring out, what are the terms that are frequent in a given political speech but infrequent in other political speeches? So when it comes to something like TF-IDF, you're really trying to figure out, what are the words that are common in this particular document that are not really spoken about in other places? And so it's not just looking at term frequency; it's also looking at inverse document frequency. It's trying to figure out, what are the words that are very common here but that you rarely see other places? And those are considered significant terms. The second question: how are these texts connected? Humanists have actually been doing concordances for a very long time. I remember looking these things up in the library in big thick books. As a Shakespeare scholar, a concordance was very useful because you could look up where a certain quote was repeated many times, but it's really asking, where is this word or phrase used in a variety of documents?
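The TF-IDF intuition above can be written out from scratch. This is a bare-bones sketch using the simplest form of the weighting, count times log(N / document frequency); the three toy "speeches" are invented, and real libraries such as scikit-learn apply extra smoothing and normalization:

```python
import math
from collections import Counter

# A from-scratch TF-IDF sketch. tf = how often a term appears in a
# document; idf = log(N / number of documents containing the term).
# A term scores high when it is frequent here but rare elsewhere.
def tf_idf(docs):
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()                      # document frequency per term
    for words in tokenized:
        df.update(set(words))
    scores = []
    for words in tokenized:
        tf = Counter(words)
        scores.append({w: c * math.log(n / df[w]) for w, c in tf.items()})
    return scores

docs = ["peace treaty peace", "economy jobs economy", "peace jobs"]
scores = tf_idf(docs)
# "economy" only appears in the second document, so it outscores
# "jobs", which is shared with another document
print(scores[1])
```

Note that a word appearing in every document gets idf = log(1) = 0, which is exactly how stop words fall out of the significant-terms list without a hand-made stop list.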
And so the example I have here is, which journal articles mention Maya Angelou's phrase, "If you're for the right thing, then you do it without thinking"? So if you're an Angelou scholar, you might be curious what people have said about this particular quote. And if you could bring together all of the places where that quote has been written about, that would be very useful for your research; you would have the entire scholarly record on that particular topic, right? Another thing you can do is a network analysis. And often this is talking more about information and knowledge flows and how people are connected. The famous digital humanities example of this is called the Republic of Letters, but it's really about who's talking with whom and how those communities are connected. And so if you had a bunch of different documents that were connecting different people, whether they're letters, in the case of the Republic of Letters, or emails or tweets, it could be any number of things that are directionally oriented, you could begin to ask questions like, what local communities formed around civil rights in 1963? Or you could ask a similar thing with Twitter data or Facebook data around the George Floyd protests or something like that. So the idea is to generate a network visualization of who's talking the most, whom they're talking with, and where the different pockets of the community are forming. The third question: what is this author feeling? This is actually perhaps more popular in business and data science. Sentiment analysis is really huge in the business and data science world. Anytime you make a phone call and it says, "this call may be recorded for quality assurance purposes"...
What's usually happening in those cases is they're taking the phone call and running it through a transcriber, and actually it's amazing, the transcription quality of the best models now is actually better than human transcribers. And so they transcribe it into text, and then they use sentiment analysis to get a sense of the sentiment. And it could be something basic, like is somebody happy or angry or confused, or it could be more sophisticated. There are different ways of doing this. Sometimes they use weighted models, where they say this word has a positive sentiment and it's weighted this way, and this word has a negative sentiment, and there are all sorts of procedural ways to do it where you can make sure that you're accounting for things like negation: "this is nice" versus "this is not nice". There are even ways to analyze emojis and emoticons, and I wouldn't be surprised if somebody is doing sentiment analysis on memes at this point; that would not surprise me at all. But then there are also more sophisticated approaches where you can train your own model, where you're saying, these are all the people who had negative things to say and these are all the people who had positive things to say, and you train the computer based on those identifications that have been labeled by others. Question four: what key names can I find? And this is really named entity recognition. It's basically listing every example of a kind of entity. It's gonna go through your text and try to figure out, who are all the people being named in this text? What are all the locations? What are all the dollar amounts? What are all the company names? It'll go through your text and try to identify things of interest to you and pull those things out. And so it's an extraction technique, really; it's pulling out the entities that are of interest to you.
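The weighted-lexicon approach to sentiment described a moment ago, including the "nice" versus "not nice" negation problem, can be sketched very simply. The four-word lexicon and the flip-the-sign-after-a-negator rule are toy assumptions; production tools like NLTK's VADER use far larger lexicons and richer rules:

```python
# A minimal weighted-lexicon sentiment sketch with naive negation
# handling. Each lexicon word carries a weight; if the previous word
# is a negator, the weight's sign is flipped.
LEXICON = {"nice": 1.0, "great": 1.5, "awful": -1.5, "angry": -1.0}
NEGATORS = {"not", "never"}

def sentiment(text):
    words = text.lower().strip(".!?").split()
    score = 0.0
    for i, w in enumerate(words):
        if w in LEXICON:
            weight = LEXICON[w]
            # "this is not nice" should count against, not for
            if i > 0 and words[i - 1] in NEGATORS:
                weight = -weight
            score += weight
    return score

print(sentiment("This is nice"))      # positive
print(sentiment("This is not nice"))  # negative
```

The trained-model alternative mentioned above replaces this hand-built lexicon with weights learned from texts that people have already labeled positive or negative.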
And this could be useful if, say, you're looking at your field and you pull out all of the names in all of the proceedings from a particular conference, the latest conference in your field. That would be interesting, right? Because you would get a sense of who people are really talking about in the field. You could get a sense of the trends, or what's the discourse, what's the jargon, what are the terms that people are using? So named entity recognition could help me do that kind of analysis. And the fifth question, the last question: which of these texts are similar? And this is used in a variety of different ways. You can do things like clustering, where you're really asking which of these are most similar in a stylistic way. So there's a field called corpus stylistics. In Shakespeare studies, people ask questions like, is this play closer to a comedy or a tragedy? And they'll figure out how things cluster. And there are different ways to do this clustering. You can do hard clustering, where you draw a hard line between the different clusters, and fuzzy clustering, sometimes called soft clustering, where there's some overlap. So when you do topic modeling with LDA, some of the words that are in different topics can overlap; when you think of a Venn diagram, they can overlap between two different topics. And then there's also hierarchical clustering, which produces dendrograms, basically. They're these tree-like images. I did one on Chaucer's tales in The Canterbury Tales probably 10 years ago, wow.
And it basically starts from the bottom up with all of the tales, and then they form into these groups, and you see which tales are closer to each other in terms of their stylistics. So that's a little bit about clustering. Another way you can approach similarity is with supervised machine learning, where you identify texts that are similar to a certain thing. When you get into the area of supervised machine learning, you're actually using pre-labeled data. And so in this example here, as part of a project I helped start at UNC, which is now looking for a project manager, we basically took the work of Pauli Murray, who wrote this book in 1950 called States' Laws on Race and Color. Don't get me talking about Pauli Murray, because I'll get way too excited. But she wrote this book, and Thurgood Marshall called this book the Bible of the civil rights movement. And basically she was attempting to find racist laws, which she collected in the book. And so we took all of the laws that she discovered and we used them to help train our model, along with other experts: Dr. William Sturkey and Kimber Thomas, who was a CLIR fellow at UNC, and a few other folks. We basically had them label these laws. They looked at these laws from the end of the Civil War through the civil rights movement, the 1860s to the 1960s, and determined whether they thought they were Jim Crow laws or racist laws, and basically put them in a pile: yes, no or maybe. And then we used that to train the machine learning algorithm to go through basically all of the laws. I mean, we scanned all of the volumes, and it took us over a year to put together this corpus of all the laws in North Carolina. Now, Pauli Murray did it, and she didn't read all the laws, but she did it for every state in the country, not just North Carolina, every state in the country, in 1950.
She didn't have Google, and so we were basically trying to discover more racist laws, and we found over a thousand of them on the books. So that's an example of supervised machine learning. Lastly, there's authorship attribution. A common example in digital humanities is the Federalist Papers, but I always mention J.K. Rowling's book, The Cuckoo's Calling, which she wrote under a pen name. Nobody knew who wrote it, but there was a fellow, Patrick Juola, who did this authorship attribution. He basically said, J.K. Rowling wrote this book, and she was forced to come out and admit that she had written the book under a pen name. So authorship attribution is basically using statistical analysis to determine who wrote something. So those are the five questions: What are these texts about? How are these texts connected? What emotions are expressed? What key names can I find? And which of these texts are similar? Now, I know that's kind of a whirlwind of information, but I hope what it does is demystify a little bit some of the very common methods in text analysis and natural language processing that might be used by people in libraries or digital humanities. One area where I'd like to develop some more material: I have a lot of librarians who are interested in training machine learning models for classification, for things like subject headings, basically trying to read through a document and figure out the metadata that should be associated with it, because it's incredibly valuable work and it's incredibly laborious. And so I often get librarians asking me about that, and maybe that's something I'd like to develop in the future, some type of course on that. But there's the link again.
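The core idea of authorship attribution can be sketched very simply: build a stylistic profile for each candidate author from words they can't help using (function words like "the" and "of"), then see which profile the disputed text sits closest to. The frequencies below are invented for illustration; a real analysis, like Juola's, uses many more features and more robust measures such as Burrows's Delta.

```python
import math

# Hypothetical relative frequencies (per 1,000 words) of a few function
# words -- style markers largely independent of topic. Numbers are invented.
candidates = {
    "author_1": {"the": 62.0, "of": 30.0, "and": 25.0, "to": 28.0},
    "author_2": {"the": 48.0, "of": 22.0, "and": 34.0, "to": 20.0},
}
disputed = {"the": 60.5, "of": 29.0, "and": 26.0, "to": 27.5}

def style_distance(a, b):
    """Euclidean distance over shared function-word frequencies."""
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in a))

# Attribute the disputed text to the stylistically closest candidate.
scores = {name: style_distance(disputed, profile)
          for name, profile in candidates.items()}
best = min(scores, key=scores.get)
print(best)  # → author_1
```

The statistics here are just a nearest-profile comparison; the power of the method comes from doing this over hundreds of features and validating against texts of known authorship.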
There's my email address, there's the Twitter, and then a link to constellate.org, which I'll talk about in a moment, because I want to give Constellate its due. I'm supposed to be telling you about this platform that helps people teach and learn these things, but I want to pause because I want to see if folks have questions. Nathan, I have a very practical question. So I'm super excited about the possibilities of text analysis, especially because we have so much OCR and text right here in the library. To go through and systematically learn those, is that what your courses would do, like in early 2022? So I want to clarify a few things. First of all, I've written a series of notebooks, and the things that I teach for Constellate are kind of train-the-trainer type things, very similar to the Carpentries or Programming Historian or things like that. At the Text Analysis Pedagogy Institute, which is an NEH institute that I will run next summer with the University of Arizona, I will teach one course, in Python basics probably, but there are a bunch of different teachers who teach all of these different areas, and we're always looking for feedback about what people are interested in when we recruit the teachers, to try to build out the open educational resources. Because frankly, every university in the world (and I don't think this is hyperbole) either has somebody who is teaching this stuff or is thinking about teaching it, developing materials, and trying to figure out how to teach it. And that seems silly to me. We should all be sharing open educational resources rather than reinventing the wheel at every university in the world. And so the way I see it, we should be forming communities, obviously, but I think that's challenging even at a local scale. Like at the University of Utah, do you know the people who are interested in doing text analysis?
That was very hard for me to figure out at UNC when I was the digital scholarship specialist. It took months and months of work, and when I was done at UNC, I started working at the larger Triangle level with what was called the Triangle Digital Humanities Network, which then expanded beyond the Triangle to become the North Carolina Digital Humanities Network. But it's an incredible amount of work, just the networking work, to figure this stuff out. And so my ideal vision, in an ideal world, is that we would be building a community of people who are sharing open educational resources, not just me teaching people in a teach-the-teachers kind of way, but a community of people sharing educational resources and making it possible to meet the demand. And the demand is huge. When I was at UNC, which was over two years ago now, we would have waiting lists of over 80 people for things like Intro to Python and Intro to R. We had Barbara Rockenbach recently speak at ITHAKA; at the time she was the AUL at Columbia, and now she's the university librarian at Yale. She was saying essentially that they would offer these courses and they would fill up within minutes, and they couldn't meet the demand. The only thing that saved them, weirdly enough, was the pandemic, when they had to go virtual. With virtual teaching, suddenly they could teach huge numbers of students. And so, to give you an example of probably our most successful case in our beta with Constellate: at Northeastern University, they've been teaching these kinds of classes for Intro to Python based on my notebooks. And I was like, you guys have been doing really well with this, how is it going? And they told me last week that they have a waitlist of 324 students for this course. And I was just blown away. I was like, wait, what?
I thought it was crazy that, you know, at the TAP Institute we had 200 people, people literally from all over the world. And the ITHAKA people were coming to me like, should we keep promoting this, or do we want to promote it with this and this? And I'm like, no, no, no, stop. Stop promoting this institute. We can't review this many applications; we don't have the resources to review so many applications. And I was like, shut it down. So I think there's just huge need out there, especially at R1 and R2 institutions. Although at the same time, the smaller institutions are the ones that could use these things the most, because they don't have the IT infrastructure to run their own JupyterHub and facilitate all this teaching. So, yeah, I think there's huge need, but the bigger problem often on campuses is people just don't know where that need is. And this little grouping here is an example of that, where I just go in and ask, who's here? Because I thought it was going to be DH people, but it was computer science people, it was business people. I never know who's going to be in the room. And so a big part of it for any university is just figuring out how much demand there is and where these people are coming from, because they exceed the capacity of any single liaison librarian to meet with all of these people, right? They're in all different disciplines and all different parts of the university. And so it's a huge problem to try to meet demand and build the community out. I know that was a long-winded answer. That was great. Thank you. Do other folks have questions? I can give a brief showing of Constellate. Maybe you can have me back and I'll do a real demo of the platform, but I'll show it a little bit. But I want to see if anybody else has questions. I have a quick question.
How much math do you use with this text analysis? I guess, what are some of the concepts involved? Are there statistics? Statistics are very helpful for understanding the models. I often work with digital humanists, many people who have never coded, and they're often afraid. They're like, math is not my subject. But when you look at study after study of what makes somebody good at learning to program, it's actually language. It's not numeracy, it's literacy: language comprehension and logic. And so I feel like a lot of humanities people get tripped up and think, I can't learn Python, there's going to be a lot of math and this is going to be really difficult. But we start with the very basics, like the very basics. And that's why I start with Python basics and just introducing Jupyter notebooks and things like that. And part of what makes this approachable is the way that Constellate allows people to teach. It allows people to teach without spending a ton of time setting things up. When I was teaching these things at UNC, and when I taught them in other places, it could take a couple of hours or more to get everybody set up, especially if you have a workshop with 20 or 30 participants and half of them have Macs and half have PCs, and they end up having different issues, or somebody has old installations, or there are path issues; there's a million different things that can trip you up when you're just setting up the environment to begin learning. And so the nice thing about Constellate, essentially, is you give your students a link, everybody clicks the link, and it doesn't matter if they're on a Mac, a PC, or Linux, it all goes to the same place. Everything is run in the cloud, and they have an identical environment where they can begin to learn quickly. And so that's ideal for running these workshops.
Now, how much math is really involved? With the Python basics stuff, I don't think it's a ton of math. A lot of it is more like logic and flow control; that's the hardest part, I think. I often explain this with flowcharts and things like that. If you want to understand the mathematics behind some of the algorithms, it can get incredibly complicated, obviously. But at the more basic level, where you're learning to deploy these things, I don't think you need to know a ton of math to begin programming in Python and begin doing natural language processing. A lot of the libraries have already been written, and so really you just need to learn how to write Python code and to deploy these libraries at the appropriate times to do the analysis. Now, as you become more advanced, you may become more curious. For something like TF-IDF, the math is really not that complex. I explain exactly how the math is done in TF-IDF; we do it on calculators outside of the notebook, just punching in the numbers. And so I'm like, you just need these three things. That's all you need; it's not complicated. It's doing it at scale, with millions of books, that makes it seem more complicated. So if you're intimidated by the math, I don't think that should be the case. The thing that should give you more pause is that it takes time to learn these things. You should be strategic about what you want to learn and how much time you want to spend on certain things, because there can be very deep rabbit holes where you're like, well, I want to learn how to do this and this and this. And there's an infinite amount to learn. I mean, this is why we all work in the academic world: because we're addicted to learning. And so what requires more critical thinking is not, can I learn the math, but really, is this the career trajectory that I want?
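To make the calculator-level point concrete: the three things you need for TF-IDF are a term's frequency in a document, the number of documents containing the term, and the total number of documents. Here is a minimal sketch on a toy corpus; note that TF-IDF comes in several variants, and this shows one common weighting, not the only one.

```python
import math

# Toy corpus of three tokenized "documents" (invented for illustration).
corpus = {
    "doc1": "the whale hunts the sea".split(),
    "doc2": "the sea is calm".split(),
    "doc3": "the whale sings".split(),
}

def tf_idf(term, doc_name):
    tf = corpus[doc_name].count(term)                 # 1. term frequency
    df = sum(term in doc for doc in corpus.values())  # 2. document frequency
    n = len(corpus)                                   # 3. number of documents
    return tf * math.log(n / df)                      # one common weighting

print(round(tf_idf("whale", "doc1"), 3))  # → 0.405 (in 2 of 3 docs, so distinctive)
print(round(tf_idf("the", "doc1"), 3))    # → 0.0 (in every doc, so weight is zero)
```

A word like "the" that appears in every document gets a weight of zero, which is exactly why stop words fall out of TF-IDF results naturally.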
Is this the direction I want to go, and why, and how? I think those are more important questions than how much math do I have to do. Other folks? We're only slotted for an hour, right? Yeah, we may have to do a follow-up Constellate training at some point. Sure. I can show you very briefly. Whoops, did I stop sharing? Oh, no, I didn't. So there are two sides to this Constellate platform. One is that we have an ability for people to build data sets. We have this amazing data set builder that allows people (I'm going to clear all the filters) to build data sets based on a variety of content, including JSTOR; Portico, which is that dark archive I was talking about; Chronicling America, which is a newspaper collection created by the Library of Congress; DocSouth, the Documenting the American South collection created by UNC Chapel Hill; and the South Asian Open Archive, which is a bunch of really amazing materials on South Asian culture. And it's really neat because it works in a variety of different languages, so it's cool when you see all the different word clouds in Hindi or Bengali or all the different languages that are part of this collection. The Reveal Digital content is also newspapers; as of right now, it's dissident newspapers from the 1960s to the present, alternative press, radical newspapers. There are also some prison newspapers, which I believe have actually been added in there now. It works somewhat like doing a regular library search, but you're building a data set. You're not just looking for one thing; you're trying to build a whole data set. You can also filter by publication titles, so if there's a particular journal or set of journals, you can build that out. You can also browse the newspapers or journals literally one by one to build your data set out.
And obviously, the bigger your data set, the better for doing text analysis work. It will automatically generate some outputs, some visualizations that you can tailor, and you can save these, share them, download them, and download the data behind them. There's a term frequency viewer, document categories over time, tree maps. This is a preview of the documents that are in here. And then when you're done, you just build the data set. Any data set that you've built, you can also download, up to 25,000 items. And we have ways that we can get you more than 25,000; there are ways to do that. But self-service, you can build a data set of up to 25,000 items. You can download the metadata, and you can download unigrams, bigrams, and trigrams: one-word, two-word, and three-word constructions, basically the word counts for those. For any of these except the JSTOR and Portico data, we're also able to share full text. Because of copyright law, we have to follow the law; we are not able to supply full text for the JSTOR and Portico data. But there's a tremendous amount you can do with just the unigrams, bigrams, and trigrams. And then, speaking of tutorials, these are some of the lessons that I often teach: getting started with Jupyter notebooks; Python basics, a set of three here that go through the basics of Python; metadata; word frequencies; TF-IDF; pandas; working with stop words lists; data set files. Here's sentiment analysis. And sometimes we do topic modeling or tokenizing and things like that. I'll show you in the learning environment real quick, this will be the last thing: you can open this immediately in the Constellate lab and you can run code in the browser.
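For anyone wondering what those downloadable unigram, bigram, and trigram files actually contain, they are just counts of one-, two-, and three-word sequences. A quick sketch of how such counts are produced from a tokenized text (the sample sentence is invented):

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count n-word sequences, like the unigram/bigram/trigram downloads."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the old man and the sea".split()
print(ngram_counts(tokens, 1)[("the",)])        # → 2 (unigram count for "the")
print(ngram_counts(tokens, 2)[("the", "old")])  # → 1 (bigram count)
print(ngram_counts(tokens, 3))                  # all four trigrams, each once
```

Counts like these are enough for word frequency, TF-IDF, and many other analyses, which is why so much can be done even where copyright prevents sharing full text.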
So this is what makes it amazing for teaching and learning: you can essentially build a data set, load it into the Jupyter notebook environment, and run the code live in the environment. You give your students a link, they all click that link, and it leads them into the environment where they can begin doing analysis. And so if you're familiar with something like Binder or Jupyter notebooks, that's essentially what this is: using Jupyter notebooks to help people learn in the cloud. Well, since we may lose people close to two o'clock, I do want to say thank you so much for coming and talking to us. We really appreciate it. I've played with Constellate a little bit, and it's a really great resource. Anyone have any questions for Nathan while we're all still here? So I had a question, Nathan. I'm not familiar with Constellate, so I'd be very curious to see a follow-up, but just playing around with it initially: do you have plans to develop it further so that all the different modes of analysis that you covered in your presentation could be handled through one portal, through Constellate? Certainly you could do all of those things and more if you were to write the code. I haven't written all that code, but the idea is that I will obviously continue to write notebooks and continue to teach. We're also hiring another teacher to come on board, but it's a kind of train-the-trainer model, right? Just like the Carpentries. And so I'll continue to listen to what people say they want, what the most desired things to teach and learn are, and develop those materials. That's true both for Constellate and for the Text Analysis Pedagogy Institute, whose goal is really to develop those open educational resources to make it easier. We do not have notebooks for all of the things that I mentioned in my talk, but I would love for us to get there.
We're still in beta. But yeah, this is the Jupyter notebook environment, and some of these have video materials, some of them have explanations and links. There's a huge glossary; a lot of people say this is their favorite part. I've created this huge glossary in our help documentation of all of the terms in text analysis, and our help documentation (I write most of it) is quite expansive, covering all sorts of different things. So this is not something unresponsive, or something that comes without helpful documentation. We spend a lot of time helping people, and we figure out what the pain points are so that we can iterate. But the idea here is that I can write Python code in here and it will execute in the environment, and so people can learn to do any of these different things by going through one of these notebooks. I'm working right now on a notebook on optical character recognition. This is based on Hannah Jacobs' work for the TAP Institute; she's at Duke University and was also at UNC Chapel Hill. It's a whole notebook that intends to help people learn how to do optical character recognition. But you can see here all the different notebooks that focus on a variety of topics. And of course, like I said, we're trying to expand on that. We're getting ready to hire a second teacher. So I hope that answers the question. We're not at the point where we've done all of the things that I mentioned. And the thing is, there are different ways of doing those things. So I have a sentiment analysis lesson, but there should also be a sentiment analysis using machine learning. And we have topic modeling using latent Dirichlet allocation, but there should also be latent semantic analysis. And so there's no way to do everything.
And part of this, too, is that a lot of these are built off of our data set builder, but people want to analyze other things than what's in our data set builder. So you can bring your own data into the environment and write new notebooks and new techniques. And so what I think we're really trying to facilitate is: how do we create a space where people can write all of these things and share all of these things, both with their students or patrons and also with other instructors and researchers? Just a quick follow-up: is there a compilation anywhere of different syllabi that have used Constellate and its environment, and lesson plans? I don't think there's any such list at this point; we're too early. People are developing on their own, but what we would ideally like to do is gather those all into the platform itself and make it easy for people to discover, rather than having them in different places. So there are people who are writing all of these different things. For example, with the TAP Institute, those are all in different GitHub repositories, right? And so we could, I suppose, start creating a basic list that would attempt to gather all of the different repository links into one webpage. But I think the better way forward is probably for us to continue developing to the point where it's easy for people to share. What we have right now in the environment is that people are able to sign in, and then you're able to modify notebooks and save them in the environment. But we're still working on the mechanisms for how people navigate in that space, save their files, and share their files. And so that sharing component is still very much being developed.
What we have mostly at this point is the ability to create the data sets, run the notebooks, easily bring in data sets, and share the notebooks with links. Quite a lot has been built out, but we don't quite have the part where it's a sharing community of notebooks and open educational resources. So almost all the things you'll see on our website are things that I've either authored or co-authored. But if you go to the TAP Institute, which used Constellate for teaching the institute, you'll see additional things that people have developed. But again, this is an open standard; we're using Jupyter notebooks. So they're not just things that will run on Constellate; they're things that will run on other platforms like Google Colab, MyBinder, or Kaggle. And so our goal has always been to develop things in as open a way as possible, as a nonprofit. Okay, well, I think we are officially out of time, but for everyone here in the room or on Zoom, we will be adding these links to the event page if we don't have them there already. And thank you again, Nathan, for coming and speaking to our group. It made me very excited to go play with Constellate. Awesome, and if you want me to come back and give you a real demo (I told you I'm not a very good salesman), I can show all the ins and outs of the platform. But begin to play with it, and send me an email if you have questions or want more information. And I'd be happy to come back if you want me to do an intro to Jupyter notebooks and intro to Python course. It may be worthwhile; that may help you figure out how many people would be interested in it. Is that something you offer at the library? Not exactly. And I think there are a lot of people at the university interested in that, because when you did your last round of classes, like intro to Python, we had advertised through Digital Matters.
And I think it was a four-week, eight-session course, and we had seven people sign up from the University of Utah. So there's definitely interest on campus in learning these things. That's good, because my network is very weak on the West Coast. So if we're reaching Utah, then that's good. Well, you know, we're not quite a coast, but... I mean, like I said, most of the people I've worked with in my 10-plus years in digital humanities are very East Coast and Midwest, not mountain region. Mountain range. Well, yeah, I mean, I think we'll probably take you up on that, and we really appreciate it. I have so many more questions, but I don't want to keep people longer if they need to head off to class or other things. So... Great, let me know if you want to run another session, and let's spread it wide, because I think what you'll discover is there's more interest than you thought. Yeah, exactly. Thanks, Nathan. Thank you so much. Thank you, Bob. Bye. Bye.