Hi everyone, thanks so much for tuning in and watching this video. My name is Daniel Jacobs, I work at the University of Texas at Austin, and I am the Captioning Services Manager. Today we're going to talk a little bit about our automated transcription, speech-to-text tool. But before we get to that, we're going to talk a little bit about the background context of what we do here at UT.

Actually, five years ago I spoke at the CNI Fall meeting; here's a blurry picture to prove it. Back then we talked about what was a newly established service and how it was a somewhat unique partnership between central campus, disability services, and the libraries. As a shared service, it provided some unique opportunities and challenges. Our main priority was, and still is, to ensure that video content is accessible to our deaf and hard of hearing student communities. On paper, this sounds relatively simple, but prior to the establishment of the service in 2014, many departments and colleges had just been pursuing their own individual solutions with varying degrees of success. It was a challenge to work with all the relevant parties and wrangle the many systems and tools used to create and manage video and audio content at a university of UT's size, with a student body upwards of 50,000.

Housing the service under the libraries was actually a strategic and philosophical decision. It provides the opportunity to offer our services to all colleges and departments regardless of their size or budget, and it aligns philosophically with the library's model of providing access to information, be it for research, instructional needs, or simply the exploration of ideas and intellectual innovation. Not only do video captions and transcription provide critical access for deaf and hard of hearing individuals, they also make it possible to search and discover content that, without a text representation, wouldn't be discoverable. And just as library approaches are changing as technology and campus needs change, we too are looking at new ways to address and keep up with evolving challenges and opportunities.

One such challenge came as a result of the pandemic. This is what a typical scene from campus looked like prior to March of this year: large gatherings, close quarters. And this is what our captioning and transcription work volume growth looked like over the years, from left to right, starting in 2014. You can see steady but slow growth. Then starting in March of this year, as the COVID-19 pandemic ramped up, many of the classes, actually all the classes, ceased in-person meetings and everything shifted online. This is more what campus looked like after that. Some instructors chose to use synchronous streaming live video like Zoom meetings, and others relied more on asynchronous pre-recorded materials, or a combination of both. As a result, this is that same chart, now extended out to the end of November of this year. You can see the sharp peak over on the right, which represents a dramatic increase, starting in the summer and running into the fall semester, in video content that needed to be captioned or transcribed as online class offerings grew.
From the beginning, we knew that machine-generated transcription and captions would hopefully be part of the equation in our workflows, but until recently, probably the last few years, the quality just hadn't been good enough to be useful, at least for our purposes. We started seriously considering and exploring the different offerings from Microsoft, IBM, Amazon, and Google a few years ago, comparing features and accuracy levels side by side. Automated speech recognition is still far from accurate enough to rely on exclusively. The things humans do easily, such as understanding context, subtext, and meaning, or even discerning speech within difficult audio, say with background noise, music, or people talking over each other, are very difficult for even the best AI. For us, handcrafted captioning is still essential and really provides the most accurate output. But the combination of auto-transcription with human review and editing is widely used, and we've found that in many situations it can be faster and more efficient than typing everything out by hand from scratch.

Because we're embedded in the library's IT group, we were fortunate to be able to rely on some excellent support from the software development team. They helped us get a completely custom tool off the ground and running in just a few weeks from when we decided to really make a push. We decided on Amazon: as you can see, AWS Transcribe is the speech-to-text engine. Django is a web framework that is widely used at UT, and we use it as the scaffolding that holds up the web application users interact with to upload files and to view and download the results. And Box, a file sharing service, is used to store and move video and audio to and from (there's a rough code sketch of this pipeline just below).

At first the focus was, like I said, more towards external users: we wanted to empower faculty, staff, and content creators to create their own captions, and we figured that giving them the option of an automated draft to work from would make that process easier. And that's kind of where it's at now. It's a service offered to the entire campus, where anyone with a university ID can log in, upload content, have it transcribed, and download the results. And we're also using it internally, for our student staff to do the same, integrated into our official captioning and transcription workflows.
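To make the Transcribe piece a little more concrete, here is a minimal boto3 sketch of how a backend like this might submit a job and poll its status. It's an illustration under assumptions, not our production code: the job names, bucket path, and helper functions are hypothetical, and the surrounding Django views and Box transfer steps are omitted.

```python
# Minimal sketch of handing a file to AWS Transcribe with boto3.
# Job names, bucket paths, and helpers are hypothetical placeholders.
import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

def start_job(job_name, media_uri, language_code="en-US", max_speakers=None):
    """Start a transcription job; optionally request speaker labels."""
    kwargs = {
        "TranscriptionJobName": job_name,
        "LanguageCode": language_code,         # e.g. "es-US" for Spanish audio
        "Media": {"MediaFileUri": media_uri},  # e.g. "s3://my-bucket/lecture.mp4"
    }
    if max_speakers:
        kwargs["Settings"] = {
            "ShowSpeakerLabels": True,         # the speaker identification option
            "MaxSpeakerLabels": max_speakers,  # more accurate if you say how many voices
        }
    transcribe.start_transcription_job(**kwargs)

def job_status(job_name):
    """Poll the job; returns a status such as IN_PROGRESS, COMPLETED, or FAILED."""
    resp = transcribe.get_transcription_job(TranscriptionJobName=job_name)
    return resp["TranscriptionJob"]["TranscriptionJobStatus"]
```

The ShowSpeakerLabels and MaxSpeakerLabels settings correspond to the speaker identification option you'll see in the demo in a moment.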
Alright, so without further ado, I'm going to go ahead and switch over to screen sharing my browser and give a quick demo of what it actually looks like at this point in time. Alrighty, here we are. This is the job creation screen once you log in. First we'll just give this a short title; we can use anything like the file name or any other identifier. I'll just do CNI. And here we actually have a few different options. I didn't mention this, but we do provide the ability to transcribe into Spanish or English, meaning same-language transcription: if it's Spanish-language speech, it will just be transcribed into Spanish text. There are some other options here, including speaker identification. If you want that, it will do its best to identify when different speakers are talking and label them in sequence, so the first person is speaker one, then two, three, and it tries to keep track of them. If you know how many people are represented in the file, it's more accurate if you tell it, say whether there are just two people or more. That's what that option is for. We'll go ahead and turn this off, since this test video is just one person speaking.

So here I have a demo file. It's one minute long, a video of a professor, with pretty high quality audio. It's actually produced by a department on campus that has a professional studio where they produce online classes, so it's kind of a best case scenario in terms of audio quality and speaker. When you first submit the file, you can see that the status is uploading to the transcription server. After a few seconds or a minute, it will update and go through the various statuses of transcription progress. Actually, I've got the finished file here; I did this earlier, so we can bypass the wait and see what it looks like. The status is completed, and we have the submitted date and some other information. I'll go ahead and play this back so you can get an idea of what it looks like. You can see the transcript down here; it's just the plain text, and you have options for downloading it in various formats. If you download a Word document or a text file, it's just text. If you go for an SRT or VTT file, that is actually a full caption file, timed, segmented, and formatted in a way that's optimized for video captions (there's a rough sketch of that conversion after the demo). And then we can play back the video here and, in theory, see the results: "So with all of this and talking about sociological imagination and applying it, basically what we're trying to get you to do is to step away from your own mind and everything that's familiar to you so you can look at something in a new way." Pretty good. "You might have a reflexive way of thinking about these issues." The main goal of this is to give someone a good start. If accuracy is in the high 90s and you have to change a word or some formatting maybe once out of every ten words, that's not bad. So that's the idea. You can go ahead and download this, for example, as an SRT file, save it, and upload it to YouTube or other video platforms. And coming soon, we're actually going to have the ability to go directly from the app into a separate editing platform where you can make edits and play back the video or audio in real time. So that is the main demonstration of this app. I'll go ahead and stop sharing for now.
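Since the SRT and VTT downloads came up in the demo, here's a simplified sketch of how the raw AWS Transcribe JSON output can be turned into a timed, segmented SRT draft. The segmentation rule here (start a new caption roughly every ten words) is just an assumption for illustration; the app's actual caption formatting logic is more involved.

```python
# Simplified sketch of converting AWS Transcribe JSON output into an SRT draft.
import json

def to_timestamp(seconds):
    """Format a time in seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = int(round(float(seconds) * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

def transcript_to_srt(transcribe_json_path, words_per_caption=10):
    """Group Transcribe word items into simple fixed-length SRT captions."""
    with open(transcribe_json_path) as f:
        items = json.load(f)["results"]["items"]

    captions, current = [], []
    for item in items:
        if item["type"] == "punctuation":
            # Punctuation items carry no timestamps; attach them to the previous word.
            if current:
                current[-1]["text"] += item["alternatives"][0]["content"]
            continue
        current.append({
            "start": item["start_time"],
            "end": item["end_time"],
            "text": item["alternatives"][0]["content"],
        })
        if len(current) >= words_per_caption:
            captions.append(current)
            current = []
    if current:
        captions.append(current)

    lines = []
    for i, cap in enumerate(captions, 1):
        lines.append(str(i))
        lines.append(f"{to_timestamp(cap[0]['start'])} --> {to_timestamp(cap[-1]['end'])}")
        lines.append(" ".join(w["text"] for w in cap))
        lines.append("")
    return "\n".join(lines)
```

The resulting string can be saved with an .srt extension and uploaded to YouTube or other video platforms alongside the video.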
Thanks so much for making it this far. I just wanted to go over a couple of final thoughts related to automated transcription and how we plan to use it here at UT. On the future enhancements side, one thing we really want to take advantage of is the ability of speech-to-text engines, especially cloud-based ones like AWS's, to train models, meaning you give the engine an accurate set of training materials. So if you had a video and created 100% accurate text for it, you could upload that to the system and the system would learn from it. At least that's the idea: for a particular content type, or maybe a particular professor, it could increase the accuracy going forward for that speaker.

Another thing we want to take advantage of is custom word lists. You can provide, I think, up to around a thousand words for a particular model, and if there are synonyms or homonyms that sound similar, the engine will favor the particular spellings, proper nouns, and terminology that you give it. That can be very helpful for STEM material and other technical content where less common terminology is used frequently: you could give it that term list and it would, in theory, be more accurate (there's a rough sketch of this at the end of the transcript). And then, like I said, we want to really enhance the editing tools. I think that's probably one of the biggest pain points, going through and making the corrections. There are ways to make that more efficient, like bulk editing, and just making the process of getting around within the text as easy as possible.

So, again, thanks so much for watching, and stay tuned; hopefully I'll be able to give another update in the future, and maybe it won't take as long as five years. I really appreciate the opportunity to talk about what I think is pretty exciting work we're doing. Feel free to reach out if you have any questions or comments, or if you just want to chat about this topic. I'd love to hear what other institutions are doing and how you're handling what I'm sure is a common problem: the increased workload for creating accessible video and audio. So please feel free to reach out; the contact information will be on screen soon. And thanks so much.
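Here's the rough sketch mentioned above of what a custom word list can look like, using AWS Transcribe's custom vocabulary feature via boto3. The vocabulary name, term list, bucket path, and job details are hypothetical placeholders; this illustrates the feature rather than our implementation.

```python
# Rough sketch (hypothetical names and terms) of a custom word list using
# AWS Transcribe's custom vocabulary feature.
import boto3

transcribe = boto3.client("transcribe", region_name="us-east-1")

# Course-specific terminology we want the engine to favor over
# similar-sounding alternatives; multi-word phrases use hyphens.
stem_terms = ["eigenvalue", "Navier-Stokes", "polymerase-chain-reaction", "CRISPR"]

transcribe.create_vocabulary(
    VocabularyName="bio-301-terms",   # hypothetical vocabulary name
    LanguageCode="en-US",
    Phrases=stem_terms,
)

# Once the vocabulary reaches the READY state, reference it when starting
# a job so those spellings are preferred in the output.
transcribe.start_transcription_job(
    TranscriptionJobName="bio-301-lecture-12",                # hypothetical
    LanguageCode="en-US",
    Media={"MediaFileUri": "s3://my-bucket/lecture-12.mp4"},  # hypothetical
    Settings={"VocabularyName": "bio-301-terms"},
)
```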