All right. So again, let's introduce the next talk, "Accessible Inputs for Readers, Coders and Hackers", the talk by David Williams-King about custom, well, not off-the-shelf but custom accessibility solutions. He will give you some demonstrations, and that includes his own custom-made voice input and eyelid blink system. Here is David Williams-King.

Thank you for the introduction. Let's go ahead and get started. So yeah, I'm talking about accessibility, particularly accessible input for readers, coders and hackers. So what do I mean by accessibility? I mean people that have physical or motor impairments. This could be due to repetitive strain injury, carpal tunnel, all kinds of medical conditions. If you have this type of thing, you probably can't use a normal computer keyboard, computer mouse or even a phone touchscreen. However, technology does allow users to interact with these devices, just using different forms of input. And it's really valuable to these people because being able to interact with a device provides some agency, they can do things on their own, and it provides a means of communication with the outside world. So it's an important problem to look at, and it's one I care about a lot.

Let's talk a bit about me for a moment. I'm a systems security person. I did a PhD in cybersecurity at Columbia. If you're interested in low-level software defenses, you can look that up. And I'm currently the CTO at a startup called Elpha Secure. I started developing medical issues in around 2014, and as a result of that, on an ongoing basis, I can only type a few thousand keystrokes per day. Roughly 15,000 is my maximum. That sounds like a lot, but imagine you're typing at 100 words per minute. That's 500 characters per minute, which means it takes you 30 minutes to hit 15,000 characters. So essentially, I can work like the equivalent of a fast programmer for half an hour.
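The keystroke arithmetic above can be written out as a short Python sketch. The figure of 5 characters per word is implied by the talk's own numbers (100 words per minute giving 500 characters per minute); the function names are illustrative and are not taken from the keystroke-counting program shown on the slide.

```python
# Assumption: 5 characters per word, implied by "100 wpm = 500 chars/min".
CHARS_PER_WORD = 5
DAILY_BUDGET = 15_000  # keystrokes per day, the maximum quoted in the talk


def minutes_until_budget(budget_keystrokes: int, words_per_minute: float) -> float:
    """Minutes of continuous typing before the keystroke budget is used up."""
    chars_per_minute = words_per_minute * CHARS_PER_WORD
    return budget_keystrokes / chars_per_minute


def keystrokes_remaining(used_today: int, budget: int = DAILY_BUDGET) -> int:
    """Keystrokes left today; never negative, even if the budget was exceeded."""
    return max(0, budget - used_today)


if __name__ == "__main__":
    print(minutes_until_budget(DAILY_BUDGET, 100))  # 30.0
    print(keystrokes_remaining(12_500))             # 2500
```

At a slower 50 words per minute the same budget would last an hour, which is why the budget matters more the faster someone types.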
And then after that, I would be unable to use my hands for anything, including preparing food for myself or opening and closing doors and so on. So I have to be very careful about my hand use, and I actually have a little program, which you can see on the slide there, that measures the keystrokes for me so I can tell when I'm going over. So what do I do? Well, I do a lot of pair programming, for sure. I log into the same machine as other people and we work together. I'm also a very heavy user of speech recognition, and I gave a talk about voice coding with speech recognition at the HOPE 11 conference, so you can go check that out if you're interested.

So when I talk about accessible input, I mean different ways that a human can provide input to a computer. Ergonomic keyboards are a simple one. There's speech recognition, and eye tracking or gaze tracking, so you can see where you're looking or where you're pointing your head and maybe use that to replace a mouse. That's head gestures, I suppose. And there's always this distinction between bespoke, custom input mechanisms and somewhat mainstream ones. So I'll give you some examples.

You've probably heard of Stephen Hawking. He's a very famous professor, and he was actually a bit of an extreme case. He was diagnosed with ALS when he was 21, so his physical abilities degraded over the years, because he lived for many decades after that. And he went through many communication mechanisms. Initially, his speech changed so that it was only intelligible to his family and close friends, but he was still able to speak. After that, he would work with a human interpreter and raise his eyebrows to pick various letters. And keep in mind, this is like the 60s or 70s, right? So computers were not really where they are today. Later, he would operate a switch with one hand, just on-off, on-off, kind of like Morse code, and select from a bank of words. And that was around 15 words per minute.
Eventually he was unable to move his hand, so a team of engineers from Intel worked with him. They were trying to do brain scans and all kinds of stuff, but again, this was like in the 80s, so there was not too much they could do. So they basically just created some custom software to detect muscle movements in his cheek. And he used that with predictive words, the same way that a smartphone keyboard will predict which word you want to say next. Stephen Hawking used something similar to that, except instead of swiping on a phone, he was moving his cheek muscles. So that's obviously a sequence of highly customized input mechanisms, very, very specialized for that person.

I also want to talk about someone else, Professor Sang-Mook Lee, whom I've met. That was me when I had more of a beard than I do now. He's a professor at Seoul National University in South Korea, and he's sometimes called the Korean Stephen Hawking, because he's a big advocate for people with disabilities and whatnot. Anyway, you can see a little orange device near his mouth there. It's called a sip-and-puff mouse. He can blow into it and suck air through it and also move it around, and that acts as a mouse cursor on the Android device in front of him. It will move the cursor around and click when he blows air and so on. So that, combined with speech recognition, lets him use mainstream Android hardware, right? He still has access to email apps and web browsers and maps and everything that comes on a normal Android device. So he's way more capable than Stephen Hawking was, because Stephen Hawking could communicate, but just to a person, at a very slow rate, right? Part of this is due to the nature of his injury, but it's also a testament to how far the technology has improved.

So let's talk a little bit about what makes good accessibility. I think performance is very important, right?
You want high accuracy, you don't want typos. You want low latency: I don't want to speak and then five seconds later have words appear. It's too long, especially if I have to make corrections, right? And you want high throughput, which we already talked about. Oh, I forgot to mention, Stephen Hawking had, you know, 15 words per minute. A normal person speaking is 150. So that's a big difference. The higher throughput you can get, the better.

And for input accessibility, and this is not scientific, this is just what I've learned from using many of these systems myself and observing them, I think it's important to get completeness, consistency and customization. For completeness, I mean, can I do any action? So Professor Sang-Mook Lee's orange mouth input device, the sip-and-puff, is quite powerful, but it doesn't let him do every action. For example, for some reason, when he gets an incoming call, the input doesn't work. So he has to call over a person physically to tap the accept call button or the reject call button, which is really annoying, right? If you don't have completeness, you can't be fully independent.

Consistency is very important as well. The same way we develop motor memory, or muscle memory, for a keyboard, you develop memory for any types of patterns that you do. But if the thing you say or the thing you do keeps changing in order to do the same action, that's not good.

And finally, customization. So the learning curve for beginners is important for any accessibility device, but designing for expert use is almost more important, because anyone who uses an accessibility interface becomes an expert at it. The example I like to give is screen readers, like a blind person using a screen reader on a phone. They will crank up the speed at which the speech is being produced. And I actually met someone who made his speech 16 times faster than normal human speech. I could not understand it at all.
It was just a blur of sound to me, but he could understand it perfectly. And that's just because he used it so much that he's become an expert at its use.

Let's analyze ergonomic keyboards just for a moment, because it's fun. You know, they are kind of like a normal keyboard. You'll have a slow pace when you're starting to learn them. But once you're good at it, you have very good accuracy, instantaneous low latency, right? You press the key, the computer receives it immediately. And very high throughput, as high as you are on a regular keyboard. So they're actually fantastic accessibility devices, right? They're completely compatible with regular keyboards. And if all you need is an ergonomic keyboard, then you're in luck, because it's a very good accessibility device.

I'm going to talk about two things: computers, but also Android devices. So let's start with Android devices. Yes, the built-in voice recognition in Android is really incredible. Even though the microphones on the devices aren't great, Google has just collected so much data from so many different sources that they've built better-than-human accuracy for their voice recognition. The voice accessibility interface is kind of so-so. We'll talk about that in a bit. That's the interface where you can control the Android device entirely by voice. For other input mechanisms, you could use a sip-and-puff device, or you could use physical styluses. That's something that I do a lot, actually, because for me, my fingers get sore. And if I can hold a stylus in my hand and kind of not use my fingers, then that's very effective. The Elecom styluses, from a Japanese company, are the lightest I've found, and they don't require a lot of force. The ones at the top there are like 12 grams and the one at the bottom is 4.7 grams, and they require almost no force to use. So very nice.
On the left there, you can see the Android speech recognition is built into the keyboard now, right? You can just press that and start speaking, and it supports different languages and it's very accurate. It's very nice. And actually, when I was working at Google for a bit, I talked to the speech recognition team and I was like, why are you doing on-server speech recognition? You should do it on the devices. But of course, Android devices are all very different and many of them are not very powerful, so they were having trouble getting satisfactory speech recognition on the device. So for a long time, there's some server latency, server lag, right? You do speech recognition and you wait a bit. And then sometime this year, I was just using speech recognition and it became so much faster. I was extremely excited and I looked into it and yeah, on my device at least, they just switched on the on-device speech recognition model. And so now it's incredibly fast and also incredibly accurate. I'm a huge fan of it.

On the right-hand side, we can actually see the Voice Access interface. So this is meant to allow you to use a phone entirely by voice. Again, while I was at Google, I tried the beta version before it was publicly released and I was like, this is pretty bad. Mostly because it lacked completeness. There would be things on the screen that would not be selected. So here we see "show labels", and then I can say four, five, six, whatever, to tap on that thing. But as you can see at the bottom there, there's a Twitter web app link and there's no number on it. So if I want to click on that, I'm out of luck. And this is actually a problem in the design of the accessibility interface. It doesn't expose the full DOM. It exposes only a subset of it. And so an accessibility mechanism can't ever see those other things.
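To make the completeness problem concrete, here is a minimal Python sketch of how a "show labels" overlay might number the elements it is told about. It is an illustration only: the `Element` class and its `exposed` flag are hypothetical and are not part of the Android accessibility API or of Voice Access.

```python
from dataclasses import dataclass


@dataclass
class Element:
    name: str
    clickable: bool
    exposed: bool  # hypothetical: does the accessibility layer report this element?


def assign_labels(elements: list[Element]) -> dict[int, str]:
    """Number every clickable element the overlay can see, starting at 1."""
    labels: dict[int, str] = {}
    for element in elements:
        if element.clickable and element.exposed:
            labels[len(labels) + 1] = element.name
    return labels


def unreachable(elements: list[Element]) -> list[str]:
    """Clickable elements that get no number, so they cannot be tapped by voice."""
    return [e.name for e in elements if e.clickable and not e.exposed]


screen = [
    Element("Reply", clickable=True, exposed=True),
    Element("Retweet", clickable=True, exposed=True),
    Element("Tweet text", clickable=False, exposed=True),
    Element("Twitter web app link", clickable=True, exposed=False),
]

if __name__ == "__main__":
    print(assign_labels(screen))  # {1: 'Reply', 2: 'Retweet'}
    print(unreachable(screen))    # ['Twitter web app link']
```

No amount of cleverness in the overlay can recover the last element, because it never appears in the data the overlay receives. That is why the fix has to come from the interface that exposes the elements.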
And furthermore, the way the Google speech recognition works, they have to establish a new connection every 30 seconds. And if you're in the middle of speaking, it will just throw away whatever you were saying, because it just decided it had to reconnect, which is really unfortunate. They later released that publicly. And then sometime this year they did an update, which is pretty nice. It now has a mouse grid, which solves a lot of the completeness problems. You can use a grid to narrow down somewhere on the screen and then tap there. But the server issues and the expert use are still not good. Like, OK, if I want to do something with a mouse grid, I have to say "mouse grid on", "six", "five", "mouse grid off", and I can't combine those together. So there's a lot of latency and it's not really that fun to use, but better than nothing, absolutely.

I just want to really briefly show you as well that the same feature, being able to select links on a screen, is available on desktops. This is a plugin for Chrome called Vimium, and it's very powerful because you can then combine this with keyboards or other input mechanisms. And this one is complete. They use the entire DOM, and anything you can click on will be highlighted. So very nice.

I just want to give a quick example of me using some of these systems. So I've been trying to learn Japanese, and there's a couple of highly regarded websites for this, but they're not consistent when I use the browser show labels. Like, you know, the label on the thing to press for next page, or "I give up", or whatever, keeps changing. So the letters that are being used keep changing, and that's because of the dynamic way that they're generating the HTML. So not really very useful. What I do instead is I use a program called Anki, and that has very simple shortcuts in its desktop app. One, two, three, four.
So it's nice to use and consistent, and it syncs with an Android app. And then I can use my stylus on the Android device. So it works pretty well. But even so, as you can see from the chart at the bottom there, there are many days when I can't use this, even though I would like to, because I've overused my hands or overused my voice. When I'm using voice recognition all day, every day, I do tend to lose my voice. And as you can see from the graph, sometimes I lose it for a week or two at a time. So, same thing with any accessibility interface: you've got to use many different techniques, and it's never perfect, it's just the best you can do at that moment.

Something else I like to do is read books. I read a lot of books and I love eBook readers. The dedicated e-ink displays, you can read them in sunlight, they last forever battery-wise. Unfortunately, it's hard to add other input mechanisms to them. They don't have microphones or other sensors, and you can't really install custom software on them. But Android-based devices, and there are also eBook reading apps for Android devices, have everything. You can install custom software, and they have microphones and many other sensors. So I made two apps that allow you to read eBooks with an eBook reader.

The first one is Voice Next Page. It's based on my speech recognition engine called Silvius, and it does do server-based recognition. So you have to capture all the audio, use 300 kilobits a second to send it to the server, and recognize things like "next page", "previous page". However, it doesn't cut out every 30 seconds. It keeps going, so that's one win for it, I guess. And it is published in the Play Store. Huge thanks to Sarah Leventhal, who did a lot of the implementation. It's very complicated to make an accessibility app on Android, but we persevered, and it works quite nicely. So I'm going to actually show you an example of Voice Next Page.
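As a rough illustration of the last step, turning recognized text into page turns, here is a minimal Python sketch. It is not the Voice Next Page or Silvius code: the `COMMANDS` table and the action names are hypothetical, and it assumes the recognizer hands back plain text.

```python
# Hypothetical mapping from spoken phrases to page-turn actions.
COMMANDS = {
    ("next", "page"): "page_forward",
    ("previous", "page"): "page_back",
}


def actions_from_transcript(transcript: str) -> list[str]:
    """Scan recognized text left to right and collect page-turn actions.

    Words that are not part of a known phrase are skipped, so ordinary
    speech mixed in with commands does not turn pages.
    """
    cleaned = transcript.lower().replace(",", " ").replace(".", " ")
    words = cleaned.split()
    actions = []
    i = 0
    while i < len(words):
        for phrase, action in COMMANDS.items():
            if tuple(words[i:i + len(phrase)]) == phrase:
                actions.append(action)
                i += len(phrase)
                break
        else:
            i += 1  # no command starts here; move on one word
    return actions


if __name__ == "__main__":
    print(actions_from_transcript("Next page, next page, previous page."))
    # ['page_forward', 'page_forward', 'page_back']
```

Keeping the vocabulary this small is a design choice: with only a few fixed phrases, a less accurate recognizer can still be usable, and the same phrase always triggers the same action.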
Over here on the left-hand side, this is my phone, just captured so that you guys can see it. So here's Voice Next Page. And basically, there's a connection screen. I can see the server's up and running and so on. I just press Start, and then I'll switch to an Android reading app and say "next page", "previous page". I won't speak otherwise, because it will capture everything I'm saying. Next page, next page, previous page. Center, center, foreground, stop listening. So that's a demo of Voice Next Page, and it's extremely helpful. I built it a couple of years ago along with Sarah and I use it a lot. So yeah, you can go ahead and download it if you guys want to try it out.

And the other one is called Blink Next Page. I got this idea from a research paper this year that was studying eyelid gestures. I didn't use any of their code, but it's a great idea. So the way this works is you detect blinks by using the Android camera, and then you can trigger an action like turning pages in an eBook reader. This actually doesn't need any networking. It's able to use the on-device face recognition models from Google. And it is still under development, so it's not on the Play Store yet, but it is working, and please contact me if you want to try it. So just give me one moment to set that demo up here.

The main problem with this current implementation is that it uses two devices. That was easier to implement, and I use two devices anyway, but obviously I want a one-device version if I'm actually gonna use it for anything. So here's how this works. This device I point at my eyes, the other device I put wherever it's convenient to read. Oops, sorry. And if I blink my eyes, the phone will buzz once it detects that I've blinked my eyes, and it will turn the page automatically on the other Android device. Now, I have to blink both my eyes for half a second.
If I wanna go backwards, I can blink just my left eye. And if I wanna go forwards quickly, I can blink my right eye and hold it. Anyway, it does have some false positives. That's why you can go backwards, in case it detects a blink and accidentally flips the page. And lighting is also very important. If I have a light behind me, then this is not gonna be able to identify whether my eyes are open or closed properly. So it has some limitations, but it's very simple to use. So I'm a big fan.

Okay, so that's enough about Android devices. Let's talk very briefly about desktop computers. So if you're gonna use a desktop computer, of course try using that show-labels plugin in a browser. For native apps, you can try Dragon NaturallySpeaking, which is fine if you're just using basic things. But if you're trying to do complicated things, you should definitely use a voice coding system. You could also consider using eye tracking to replace a mouse. Personally, I don't use that. I find it hurts my eyes, but I do use a trackball with very little force and a Wacom tablet. Some people will even scroll up and down by humming, for example, but I don't have that set up.

There's a bunch of nice talks out there on voice coding. The top left is Tavis Rudd's talk from many years ago that got many of us interested. Emily Shea gave a talk there about best practices for voice coding. And then I gave a talk a couple years ago at the HOPE 11 conference, which you can also check out. It's mostly out of date by now, but it's still interesting.

So there are a lot of voice coding systems. The sort of grandfather of them all is Dragonfly. It's become a grammar standard. With Caster, if you're willing to memorize lots of unusual words, you can become much better, much faster than I currently am at voice coding. Aenea is how you originally used Dragon to work on a Linux machine, for example, because Dragon only runs on Windows.
Talon is a closed-source program, but it's very, very powerful and has a big user base, especially for macOS. There are ports now, and Talon used to use Dragon, but it's now using a speech system from Facebook. Silvius is the system that I created. The models are not very accurate, but it's a nice architecture where there's client and server, so it makes it easy to build things like Voice Next Page. So Voice Next Page was using Silvius. And then the most recent one, I think, on this list is Kaldi Active Grammar, which is extremely powerful and extremely customizable, and it's also open source. It works on all platforms, so I really highly recommend that.

So let's talk a bit more about Kaldi Active Grammar. But first, for voice coding, as I've already mentioned, you have to be careful how you use your voice, right? Breathe through your belly, don't tighten your muscles and breathe from your chest, try to speak normally. And I'm not particularly good at this. You'll hear me when I'm speaking commands, my inflection changes. So I do tend to overuse my voice, but yeah, I just have to be conscious of that. The microphone hardware does matter. I do recommend something like a Blue Yeti on a microphone arm that you can pull and put close to your face like this. I'll use this one for my speaking demo.

And the other thing is your grammar is fully customizable. So if you keep saying a word and the system doesn't recognize it, just change it to another word. And it's complete in the sense that you can type any key on the keyboard. And the most important thing for expert use or customizability is that you can do chaining. So with a voice coding system, you can say multiple commands at once, and it's a huge time savings. You'll see what I mean when I give a quick demo. When I do voice coding, I'm a very heavy Vim and tmux user. I've worked with many people before, so I have some cheat sheet information there.
So if you're interested, you can go check that out. But yeah, let's just do a quick demo of voice coding here. Turn this mic on.

Desk left two. Control delta, open new terminal. Charlie delta space slash tango mike papa enter. Command Vim, hotel, hotel point, charlie papa, papa. Enter. India hash word include space langle. India oscar word stream rangle, enter, enter. India noy tango space word main. Nope, mike arch india noy. Len ren space lace enter, enter, race up tab. Word print fox scratch nope. Code standard, charlie oscar, uniform tango space, langle, langle space. Quote sentence, hello voice coding bang. Scratch six delta india, noy golf bang backslash noy quote. Semicolon act, sky fox mike, romeo noy, oscar. Word return space, number zero, semicolon act. Vim save and quit. Golf plus plus space, hotel, hotel tab, minus oscar space, hotel, hotel, enter. Point slash hotel, hotel, enter. Desk right two.

So that's just a quick example of voice coding. You can use it to write any programming language. You can use it to control anything on your desktop. It has a bit of a learning curve, but it's very powerful.

So the creator of Kaldi Active Grammar is also named David. I'm named David too, but that's just a coincidence. And he says of Kaldi Active Grammar: "I haven't typed with a keyboard in many years, and Kaldi Active Grammar is bootstrapped in that I have been developing it entirely using the previous versions of it." So David has a medical condition that means he has very low dexterity, so it's hard for him to use a keyboard. And yeah, he basically got Kaldi Active Grammar working by the skin of his teeth or something, and then continues to develop it using it. I'm a huge fan of the project. I haven't contributed much, but I did give some hardware resources, like GPU and CPU compute, to allow training to happen. But I would also like to show you a video of David using Kaldi Active Grammar, just so you can see it as well.
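The chaining in the demo above, where one utterance such as "charlie delta space slash tango mike papa enter" types a whole command, can be sketched in a few lines of Python. This is a toy decoder, not how Kaldi Active Grammar or Dragonfly are implemented: the word tables are assumptions that combine the standard NATO alphabet with a few of the short substitutes heard in the talk (the spellings "arch", "brav", "char" and "noy" are guesses at spoken words).

```python
import string

NATO = [
    "alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf",
    "hotel", "india", "juliett", "kilo", "lima", "mike", "november",
    "oscar", "papa", "quebec", "romeo", "sierra", "tango", "uniform",
    "victor", "whiskey", "xray", "yankee", "zulu",
]
LETTERS = dict(zip(NATO, string.ascii_lowercase))
# Assumed user customizations: shorter or better-recognized words.
LETTERS.update({"arch": "a", "brav": "b", "char": "c", "noy": "n"})

# A few non-letter keys used in the demo.
KEYS = {"space": " ", "slash": "/", "point": ".", "semicolon": ";", "enter": "\n"}


def decode(utterance: str) -> str:
    """Turn one chained utterance into the text it would type."""
    typed = []
    for word in utterance.lower().split():
        if word in LETTERS:
            typed.append(LETTERS[word])
        elif word in KEYS:
            typed.append(KEYS[word])
        else:
            raise ValueError(f"unknown command word: {word!r}")
    return "".join(typed)


if __name__ == "__main__":
    print(repr(decode("charlie delta space slash tango mike papa enter")))
    # 'cd /tmp\n'
```

Because the tables are plain dictionaries, swapping a word the recognizer keeps mishearing is a one-line change, which is the kind of customization the talk argues expert users need.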
So the other thing about David is that he has a speech impediment, or, I don't know, an accent or whatever. So it's difficult for a normal speech recognition system to understand him. And you might have trouble understanding him here, but you can see in the lower right what the speech system understands that he's saying. Oh, I realized that I do need to switch something in OBS so that you guys can hear it. Sorry, there we go.

Tim, number one, similar comment space. Number one, number one, content line. Control-sharp enter. Dictate text, teamwork assignment. Apostrophe, control enter two times. And the password space. And deal.

Anyway, you get the idea, and hopefully you guys were able to hear that. If not, you can also find this on the website that I'm gonna show you at the end. Oh, one other thing I wanna show you about this is that David has actually set up this humming to scroll, which I think is pretty cool. Of course, I've turned off the OBS audio there, but he's just going "hmm", and it's understanding that and scrolling down. So that's something I'm able to do with my trackball but that he's using his voice for, so pretty cool.

So I'm almost done here. In summary, good input accessibility means you need completeness, consistency, and customization. You need to be able to do any action that you could do with the other input mechanisms. Doing the same input should have the same action. And remember, your users will become experts, so the system needs to be designed for that. For e-book reading, yes, I'm trying to allow anyone to read even if they're experiencing some severe physical or motor impairment, because I think being able to turn the pages and read your favorite books gives you a lot of power. And for speech recognition, yeah, Android speech recognition is very good. Silvius accuracy is not so good, but it's easy to use quickly for experimentation and to make other types of things like Voice Next Page.
And please do check out Kaldi Active Grammar if you have some serious need for voice recognition. Lastly, I put all of this onto a website, voxhub.io, so you can see Voice Next Page, Blink Next Page, Kaldi Active Grammar, and so on, with instructions for how to use it and how to set it up. So please do check that out. And tons of acknowledgments, lots of people have helped me along the way, but I want to especially call out Professor Sang-Mook Lee, who actually invited me to Korea a couple of times to give talks and has been a big inspiration. And of course, David Zurow, who has actually been able to bootstrap into a fully voice coding environment. So that's all I have for today. Thank you very much.

Right. I suppose I'm back on the air. So let me see. I want to remind everyone before we go into the Q&A that you can ask your questions for this talk on IRC. The link is under the video, or you can use Twitter or the Fediverse with the hashtag #rc3two. Again, I'll hold it up here: rc, the number three, t-w-o. And wow, thanks for the talk, David. That was really interesting. I think we have a couple of questions from the Signal Angels. Before that, I just wanted to say I recently spent some time playing with the VoiceOver system in iOS, and that can now actually tell you what is on a photo, which is kind of amazing. Oh, by the way, I can't hear you here on the mobile.

Yeah, sorry, I wasn't saying anything. So I focus mostly on input accessibility, right? Which is, how do you get data to the computer? But there's been huge improvements in the other way around as well, right? The computer doing voiceover.

We have about, let's see, five, six minutes left at least for Q&A. We have a question by Toby Plus Plus. He asks: your Next Page application looks cool. Do you have statistics of how many people use it or found it on the app store?

Not very many. Voice Next Page has so far been advertised only through a little academic poster.
So I've gotten a few people to use it, but I run eight concurrent workers and we've never hit more than that. So not super popular, but I do hope that some people will see it because of this talk and go and check it out.

That's cool. Next question: how error-prone are the speech recognition systems at all? E.g., can you do coding while doing workouts?

So one thing about speech recognition is it's very sensitive to the microphone. And when you're doing it, you do see any mistakes, right? That's the thing about having low latency. You just say something and you watch it and you make sure that it was what you wanted to say. I don't know exactly how many words per minute I can say with voice coding, but I can say it much faster than regular speech. So I would say at least like 200, maybe 300 words per minute. So it's actually a very high bandwidth.

That's pretty awesome. Next question from Pepi, JN, Divos: any advice for software authors to make their stuff more accessible?

There are good web accessibility guidelines. So if you're just making a website or something, I would definitely follow those. They tend to be focused more on people that are blind, because that is, you know, more of an obvious fail. They just can't interact at all with your website. But if Duolingo, for example, had used the same accessibility tag on their next button, then it would always be the same letter for me, and I wouldn't have to say Fox Charlie, Fox Delta, Fox something that changes all the time. So I think consistency is very important, and integrating with any existing accessibility APIs is also very important. Web APIs, Android APIs, and so on. Because, you know, we can't make every program out there voice compatible. We just have to meet in the middle, where they interact at the keyboard layer or the accessibility layer.

Awesome, and Merrick N has a question.
He wonders if these systems use similar approaches, like stenography with mnemonics, or if there are any projects working with that in mind.

A very good question. So the first thing everyone uses is the NATO phonetic alphabet to spell letters, for example. Alpha, Bravo, Charlie. Some people will then substitute shorter words for things that are too long, like November. I use "noy". Sometimes the speech system doesn't understand you. Whenever I said "alpha", Dragon was like, oh, you're saying "offer". So I changed it. It's "arch" for me: arch, brav, char. And also, most of these grammars are in a common grammar format. They are written in Python and they're compatible with Dragonfly. So you can grab a grammar for, I don't know, Aenea, and get it to work with Kaldi Active Grammar with very little effort. I actually have a grammar that works on both Aenea and Kaldi Active Grammar, and that's what I use. So there's a bit of a lingua franca. I guess you can kind of guess what other people are using, but at the same time, there's a lot of customization, you know? Because people change words. They add their own commands. They change words based on what the speech system understands.

LEB asks: is there an online community you can propose for accessibility technologies?

There's an amazing forum for anything related to voice coding. All the developers of new voice coding software are there. Sorry, I just need a drink. So it's a really fantastic resource. I do link to it from voxhub.io. I believe it's at the bottom of the Kaldi Active Grammar page, so you can definitely check that out. For general accessibility, I don't know, I could recommend the accessibility mailing list at Google, but that's only if you work at Google. Other than that, yeah, I think it depends on your community, right? I think if you're looking for web accessibility, you could go for some Mozilla mailing lists and so on.
If you're looking for desktop accessibility, then maybe you could go find some stuff about the Windows speech API and the Windows accessibility API.

All right, one last question from Joe Nielsen: could there be legal issues if you make an e-book into audio? I'm not sure what that refers to.

Yeah, so this is if you're using a screen reader and you try to get it to read out the contents of an e-book, right? Most of the time there are fair use exceptions in copyright law, even in the US, and making a copy for yourself for personal purposes, so that you can access it, is usually considered fair use. If you were trying to commercialize it or make money off of that, or, I don't know, you're a famous streamer and all you do is highlight text and have it read out, then maybe. But I would say that personal use definitely falls under fair use.

All right, so I guess that's it for the talk. I think we're hitting the timing mark really well. Thank you so much, David, for that. That was really, really interesting. I learned a lot. Thanks, everyone, for watching, and stay on. I think there might be some news coming up. Thanks, everyone.