Hello everybody and welcome to this session from Michael Shaw from the University of Lincoln. Today Michael is going to be talking to us about what it really takes to create accurately captioned video content. So it's my pleasure to hand over to you now, Michael.

Thank you very much, and a very big thank you to everybody that's joining us. I really do appreciate you taking the time to spend with me today. On screen is a QR code and a short link. If you would prefer to have captions on your own device in your preferred language, we'll be using PowerPoint Live, so you can access captions via the link or the QR code if you'd like to follow along and choose your language. I will leave that on screen for another second or so, and the link will be in the chat section so that you can find it later. I'll begin, because we are tight on time and I have a lot to talk about.

I would like to start this afternoon by talking briefly about our journey at the University of Lincoln so far towards an inclusive and accessible education. It's a subject that I have presented on many a time before, and I've told the story of how over the last two years we've made progress towards supporting staff to deliver an inclusive learning experience through workshops and seminars, support resources, mandatory training for all staff, templates on Blackboard, and the use of Blackboard Ally as well. This mandatory training component was really critical to our progress because it allowed us to present scenarios of genuine accessibility challenges and provide staff with simple, easy-to-implement solutions to those challenges. And we could see that this was starting to have an impact, because we could see culture change when using Ally: the accessibility of files uploaded to Blackboard significantly increased from academic year 19/20 to 20/21. The central column here shows a dramatic increase in the accessibility of files.
We were starting to make the progress we wanted to see: winning over hearts and minds and providing the tools that staff needed to create an inclusive education. Part of that toolkit was our accessibility toolkit. It's freely available online and I'm going to share the link to it at the end of this presentation. This toolkit features some elements that I'd like to delve into today, specifically on creating accessible video content and using captions for inclusive teaching.

Now, when we did training around this, we could see the impact of our accessible video training by reviewing the percentage of Panopto videos where staff had added automatic captions. We saw a significant increase throughout 2020 in staff adding automatic captions to their videos, and at this time this was optional for staff. This data suggests that staff were actively choosing to add captions to their videos. And we know that captions benefit many types of learner, not just those with auditory impairments. It's really important that we move past this concept of captioning being a contingent feature for deaf students only. It's not.

In response to student feedback, and in partnership with the Lincoln Students' Union, we took the decision to enable automatic captioning on all of our teaching and learning videos on Panopto this academic year. So now all of our videos on Panopto are prefaced with a disclaimer that reads: "Automatic captioning has been enabled. These cannot be relied on to be 100% accurate. Please contact your tutor should any automatic captions cause confusion." This message appears for five seconds at the start of every video on Panopto, and to enable it we actually used the copyright notice feature within Panopto. So why are this disclaimer and the 100% accuracy value important?
Well, the accessibility guidelines from the W3C state that automatically generated captions do not meet user needs or accessibility requirements unless they're confirmed to be fully accurate, and they usually need significant editing. There's a failure of the success criteria if captions do not include all of the dialogue, either verbatim or in essence. And this is an incredibly high standard. So how do we go about reaching this level of accuracy?

Automatic captioning is one of the tools in our accessibility toolkit, but it does come with challenges, and I want to take a look at some common challenges that I've been told about with auto captions and discuss each of them today. I'm sure if I was in a room with you right now, I would see some heads nodding because some of these look familiar. To do this, I've taken eight videos from our Panopto library, all published teaching videos edited to five minutes long, from a range of speakers, and I manually scored the accuracy of various captioning solutions. Choosing a manual scoring approach was a conscious decision because I wanted to allow for differences in the way captions handle punctuation and numbers, whether as digits or words. So, whilst punctuation is important, at this stage I chose not to use a text comparison tool, although this did, obviously, slow down the workflow.

So let's begin with this one: "Automatic captions are just unusable." I've been told this. We know that, according to web accessibility criteria, the auto captions will need correcting, but are they unusable? We're going to look at this data in more detail throughout today's talk, but as an overview, we're seeing that caption accuracy across the videos we tested is on average above 80%. There are a couple of outliers that are significantly lower, and we're going to look at those in more detail, so please remember them when we look at averaged data. Most of the data is above 80% accuracy.
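As an aside, the kind of accuracy score described here can be approximated in code. The following is a minimal sketch, not the method used in this study: it aligns a corrected reference transcript against the raw auto captions word by word (the `difflib` alignment and the `caption_accuracy` name are my illustration), ignoring punctuation and casing much as the manual scoring did.

```python
from difflib import SequenceMatcher

def caption_accuracy(reference: str, captions: str) -> float:
    """Percentage of reference-transcript words matched by the auto captions.

    Lower-cases and strips basic punctuation first, so that differences in
    punctuation and casing are not penalised.
    """
    def words(text):
        return [w.strip(".,;:!?\"'").lower() for w in text.split()]

    ref, hyp = words(reference), words(captions)
    # Sum the sizes of all matching word runs between the two transcripts.
    matched = sum(block.size for block in
                  SequenceMatcher(None, ref, hyp).get_matching_blocks())
    return 100.0 * matched / len(ref)

# Example: one substituted word ("captions" heard as "capture") out of seven.
ref = "Automatic captions are a useful starting point."
hyp = "automatic capture are a useful starting point"
print(round(caption_accuracy(ref, hyp), 1))  # one miss in seven words, ~85.7
```

A real evaluation would use a proper word-error-rate tool, but this captures the idea of scoring captions against a known-good transcript.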
And for those of us that were creating video-based learning resources before auto captioning was available, you'll know that 80% is still a lot better than transcribing everything from scratch. So let's take a look at how different software handles automatic captions, first by looking at three video hosting platforms: Microsoft Stream, YouTube and Panopto. Here we have an average caption accuracy of around 85% for Panopto automatic captions, 96% for Stream and 98% for YouTube, with YouTube consistently providing the highest caption accuracy across all the videos and Panopto providing the lowest. We can see our two outliers here showing significantly lower accuracy in Panopto, and that's what drags down the average. So YouTube is providing a very high caption accuracy, and with over 500 hours of video content uploaded to YouTube every minute, which is 720,000 hours uploaded per day, it's not all that surprising that the data set YouTube has access to surpasses any other video platform.

Let's expand this data out, then, to include more voice-to-text services. We're now seeing data from those three video platforms but also CaptionEd, a paid captioning service; PowerPoint Live captioning, which I'm using right now; live captioning in Teams; and two voice-to-text services within Microsoft Word: Dictate, which transcribes your voice in real time, and Transcribe, which works on MP3 files. YouTube is still consistently showing the highest level of accuracy across all the tools, and there is generally only a small variation across the averages. If we visualise this data a little better, we can see the average accuracies across all the technologies tested sit between 84 and 99 percent. So some technologies might provide, on average, a few extra percentage points of accuracy in the automatic captions, but none of them reach 100 percent, and so they still need correcting.
So we need to consider the workflows involved if we wanted to use these different technologies. At the University of Lincoln we use Panopto for all of our teaching and learning video content, and it provides the sort of management and learning tools that we think are important. If we were to use Stream or YouTube to gain that extra few percentage points of accuracy, we'd also have to factor in the time for multiple video uploads and transferring the VTT or SRT caption files back and forth, and we're still having to correct the captions anyway.

Let's move on to take a look at accents. For this work we looked at some videos featuring East Asian, Nigerian, Spanish and Polish accents, as well as what is referred to as received pronunciation, which is generally a home counties, London English accent, much like my formal presentation voice. Looking at this data, we can see that on average, across all the technologies, there is an 8.7 percentage point difference in auto caption accuracy between native English accents and non-native speakers. This is where Panopto provided very low caption accuracy on one Polish-accented speaker, which impacts this average scoring. So overall we do see a decrease in auto caption accuracy with accented speakers, but does this push the accuracy percentages into an unusable category, or are they still a reasonable starting point for corrections?

I wonder if you've heard this from users of captioning technology: "Automatic captions don't work with my technical terminology." We teach academic subjects at an academic level, so it makes sense that we'll be using technical, advanced language to teach our subjects, but to what extent does this impact automatic captioning?
I analysed the scripts of our test videos using the Common European Framework of word complexity, which compares the words in a text to the 10,000 most commonly used words in the English language, and I was looking at teaching videos in the areas of education, medicine, microbiology and business. So here's the data. We can see what looks like a weak trend here, but with a p value of 0.56 this is not a significant correlation between word complexity and caption accuracy. That's because, although we might think we'll be using academic terminology, we'll structure our sentences using common language. In fact, we're teaching, so it's in our best interest to use simple language where we can. Auto captions might not know every single advanced term in your subject area, but I would suggest that captioning AI actually has a larger vocabulary than I do. I'm not a walking dictionary, you know.

So do we need to employ manual or paid solutions? The question we're asking here, then, is: do these paid services provide a significantly higher accuracy? Let's take a look at our data again, and point out where the paid services are: CaptionEd and the Panopto manual transcription service. Both are showing higher accuracy than Panopto auto captions, but neither is reaching 100%, and they're not significantly better than any of the other free services. Looking at the data in a bit more detail, we can see that in some instances the paid services actually gave lower accuracy captions; the manual and paid services came out worse than the automatic services on some videos. Here we can really see the limitation of vocabulary, where AI systems do indeed have a larger vocabulary than manual captioners. We could argue, then, that the only person who reliably knows all of the words, and can reliably produce 100% accurate captions, is the speaker themselves. So is there something else that we're paying for, then? Do these paid services provide any additional benefits?
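For the curious, a correlation check like the word-complexity one (the p value of 0.56 mentioned above) can be sketched in a few lines. The numbers below are invented for illustration; they are not the study's data, and the study reported no significant correlation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed by hand (no SciPy needed)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Made-up example values, one pair per test video:
complexity = [2.1, 3.4, 2.8, 4.0, 3.1, 2.5, 3.7, 2.9]          # % words outside the top 10,000
accuracy = [88.0, 85.5, 90.2, 84.1, 87.3, 91.0, 86.4, 89.0]    # caption accuracy %

r = pearson_r(complexity, accuracy)
print(f"r = {r:.2f}")
# A weak r combined with a high p value (0.56 in the study) means the
# apparent trend between complexity and accuracy is not significant.
```

With only eight videos, a formal significance test on r would be essential before claiming any relationship, which is exactly the point being made here.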
Well, CaptionEd provides a confidence score against its transcripts to help direct and prioritise corrections; however, when we compare the confidence scores to the caption accuracy, there's no significant correlation between the two.

Now, at Lincoln, over the last academic year we've uploaded over 9,500 hours of content to Panopto, and that's excluding student assignments and any videos under a minute. So let's look at the cost of these paid services. The cheapest manual captioning from Panopto is at a rate of £45 per hour, which would amount to around £427,000 per year to caption everything. CaptionEd comes with a licence fee, then a £5 per hour captioning fee, and then further fees based on how many times each transcript is reused, which is a highly unpredictable value and makes it very hard to budget for; it could be anywhere between £50,000 per year up to hundreds of thousands per year. And with each of these large sums of money, we still don't reach 100% accurate captions.

So, we've looked at software, we've compared software, and we've considered some comparisons of accents and technical terminology, and to an extent we've seen that auto captions are pretty good. Yes, they need correcting, but they get us most of the way there. This didn't quite line up with the perceptions of captioning that we've anecdotally seen: the really bad gibberish captions, the complete nonsensical inaccuracies. So is there something that we've missed? I wanted to look at audio quality and consider the impact of noise on captioning. To do this, I added incremental amounts of white noise to a recording and used Panopto auto captions to generate the transcripts, and here we can see the effect of greater than 50% noise completely preventing auto captions from working. Now, I wouldn't expect constant noise across a teaching video, although air conditioning and, you know, background hums would apply; but consider audience noise, coughing, background chatter: they may well
interfere with the audio at this sort of rate and cause auto captions to fail at certain points.

One of the things that struck me most about listening to a range of recordings was the amount of natural reverb from a room, and from different rooms, and I started to wonder what effect this would have on captions. Reverb is influenced by the size, the shape, the temperature, and the materials that the walls, floor and ceiling are made out of, and reverb in a room is measured by a value called RT60, which is the amount of time, usually in seconds, that it takes for sound to decay by 60 decibels. When we speak to an audience they hear the direct sound of our voice, but they also hear the reflected sound of the room, which can accumulate: the more reflected sound, the higher the reverb, and the higher the RT60. Now, if we replace our audience here with a microphone, we can start to see how reverb might introduce problems. Microphones have a critical distance, which is the distance at which the amount of reflected sound equals the amount of direct sound, so we need to be speaking into a microphone at a distance lower than the critical distance, so that the mic hears more of the direct sound than the reflected. This all affects what's called the intelligibility of the audio.

So I went into some teaching spaces and measured the RT60 of some of our teaching rooms. In this relatively small teaching room I measured an RT60 of just below one second, which is actually quite high for a small room; it's because this wall on the left here is actually made of glass, and it adds to the reverb of the room quite significantly. I measured the RT60 of a larger teaching space and found a very similar value, just below one second; although this room is larger, it has an angled ceiling, which is good for audio quality, and much softer wall and floor surfaces. So I did some test recordings using two different microphones, a Samson UB1 and a Blue Yeti, and I
took a look at the caption accuracy. Here in the small teaching space I produced repeat recordings further and further away from the microphone. Moving further from the critical distance, we can see that the mic is starting to detect more reflected sound than direct, and so the intelligibility of the recording is reduced and the caption accuracy is significantly reduced: at two, three, four metres away we're at around 70 percent accuracy and below. In the larger room, with a similar RT60, we see the same effect again. Looking at the average data, you can see that distance from the microphone, as a factor of reverb, causes a significant decrease in auto captioning accuracy, more so than we've seen from different software, accents or terminology. And if reverb and audio quality are affecting the intelligibility of our recording so much that it's having this impact on captioning, then it's also impacting everyone who listens to the recording. These microphones in our teaching spaces are fixed to wooden surfaces that naturally reverberate at low frequencies, and they're often boxed into podiums that create little echo chambers for them, which means that we are really struggling to create accurately auto-captioned teaching content. In case you're wondering, some systems have already figured this out: smart speakers that you might well have in your home can have up to six microphones, so that they can distinguish between the direct sound and the reflected sound to accurately interpret your voice commands. At Lincoln we're starting to invest in these array microphones, which have up to eight microphones built into the unit, and I'm hoping that this will have a significant impact on audio quality.

Now, we've spent some of this session discussing challenges with auto captions, so I want to look at some mitigation options in light of everything that we've seen. Scripting will always produce better recordings: teaching will be more concise, and you'll have already
produced a transcript, so we can kind of sidestep some of these captioning challenges. We should also try to record a well-lit view of the speaker's face to allow for lip reading. We can start to reduce the risk of captioning errors: if we are really concerned with technical terminology, we should be making sure that these words are clearly visible on screen, or in a glossary, for everyone. Audio quality plays a really significant role in caption accuracy, so we should be trying to record the highest quality audio that we can. And finally, prioritising where caption correction needs to take place is a viable approach. It's intentionally at the bottom of my list, because I would rather see accessible video content for everyone than have it be a contingent feature for some, but it may be the only way forwards.

So what does it really take to create accurately captioned video content? Well, for educators producing video content, audio quality is the most significant factor in relation to auto caption accuracy. Given a high quality audio input, most voice-to-text services will return reasonably accurate captions, thus reducing the amount of time needed for correction. And although some services show higher accuracy rates than others, the workload of multiple uploads and management of caption files may not warrant the few extra percentage points in caption accuracy. So with those final thoughts I will end my presentation, and I'll thank you very much for your time and attention today. I appreciate you taking the time to spend with me.

Thank you so much, Michael, that was really, really interesting, with so much research gone into it. The chat has been incredibly active, with many people saying how much captioning has improved since 2020. The issue of technical terminology did come up, which you did address in your talk; however, there was a discussion around small errors. If I could just share with you, here's an example of a small error that might occur. Do you have any
comment on that? It's a fantastic example, and it's almost exactly the same example that I saw. The medical video that I was using as a test sample was about liver functions and diabetes, and hyper- and hypotension, sorry, hyper- and hypoglycaemia, was specifically referenced. You're right, there's a serious risk in getting that error wrong. It's why making it incredibly obvious and visible on screen, and pointing to it, and, you know, gesturing with your hands, gesturing with your mouse, is really crucial to making that point really obvious, because that's the type of error that is high risk and highly probable with auto captions. So yes, it's definitely one to be aware of.

Yeah, absolutely. And following on from that, there's a question from Tara: any advice for mathematical terms or characters? The same thing again, really. It's that sort of Latin terminology. I didn't test any mathematical videos, but microbiology species names came up in my testing; any Latin words, anything that has discrepancy over its pronunciation, which, let's face it, most Latin does anyway, because no one really knows how to pronounce any of it. Again, it's a case of making that very obvious in your teaching style, to really try and make it as crystal clear as possible. And from what I've seen, you know, the data that YouTube has, it's such a vast data set, I would love to have that sort of information to play with, but they can't hit 100% accuracy through AI, through automatic captioning. The one thing that does interest me and inspire me is PowerPoint: although it wasn't the highest accuracy, PowerPoint as a captioning service does have the ability to look at the words in your slide content and use those to try to infer what words you might be saying, so it's the AI that has the best opportunity to learn the kind of terminology you're using, more so than any other technology I've looked at.

Thank you. There's an interesting question here from Matt: does the
data suggest women's voices can be more accurately captioned? So, I had the same question, Matt, and I obviously only looked at a limited data set, but I did deliberately choose an equal number of male and female speakers. If you look at microphone quality, there's a general suggestion that microphones are better at picking up slightly higher frequencies, and that these are slightly more intelligible on recordings, which might suggest, and yes, I'm talking generally, that higher pitched voices would caption more accurately. In my test I did a comparison: there was no significant difference between male and female speakers.

Okay, and we have a question from Caroline, who asks: how important do you think non-spoken captions are, such as sound effects or musical descriptions? This is really interesting, because the W3C criteria state that all sounds need to be included in the captioning, and I don't know how much of that plays a part or how you would go about it. I coughed throughout that presentation; do I need to caption "cough"? Is it significant to the message I'm trying to communicate or not? I would say no. One interesting thing I did notice is that YouTube was the only captioning service I looked at that transcribed the word "um". If you say "um" throughout your speech, most captioning services ignore it; YouTube does put it in, and it's the only one that did. But where I'd like to explore further with captions and audio descriptions is a similar conversation I have with people around the use of alt text, and the context of alt text, because how much you describe is going to affect the message you're trying to convey. If I'm teaching music, I need to very accurately describe that music; but even within that, if I'm teaching the technicalities of music, I need to describe the notation, the intonation, the variation, the tempo, the pitch, whereas if I'm describing the art of music, I might describe the emotions, the feelings. How you would go about that, I don't know;
it's subjective, and it depends on the message you're trying to communicate.

Yeah, fantastic, thank you so much, Michael. Honestly, there are so many more comments and questions that I could ask you, with lots of people saying that they hope you publish these results, and also people asking if they can use your research to show other people, and so on. So I would suggest that these discussions can be had in further detail over on Discord, where you'll share your resources. Yes, and I'd like to say a huge thank you to you, Claire, for chairing the session. I was very much in my presentation head, so I haven't been through all the comments, but I will read through them all now, and I'll be on Discord as well; please do get in touch with me there, or on Twitter. Brilliant, thank you so much for your talk, Michael, and for all your time. Thank you.