Hi, everyone, and welcome to the afternoon session at AI Village. I am pleased to introduce Kenya Yoshimura and Takahiro Yoshimura on Clairvoyance, concurrent lip reading for the smart masses. Please give them a huge round of applause.

Thank you, everybody. Hi, everyone. Nice to meet you. Today, we're talking about our research on automated lip reading. If you have any questions, please mention us on Twitter or something at any time. Thank you. My name is Takahiro. I'm a security researcher. Hi, my name is Kenya. I'm a researcher, too. We are not native English speakers, so let us apologize upfront, as our English is not so good.

OK. Firstly, let us introduce ourselves. We are a small security firm located in Saitama, Japan. Do you know Saitama? Saitama is a rather peaceful district lying north of Tokyo. As you see, it has great nature. We live in an urban area near Tokyo. There, we do various security-related work, from vulnerability assessments, pentests, code audits, and forensic analysis to some research. We play CTFs, too. We also give talks at conferences.

So, what is lip reading? Lip reading is a technique of reading a speaker's lips to extract what he or she is saying. It is commonly used among hearing-impaired people. Let's see it in action. Oh, it's not moving. Just a moment. Oh, OK. Should I move it a little bit? Yes. You hear no audio, but we have marked the lip area. Lip reading works like this: people watch the lips and extract what the speaker is saying. Let us show another example. It is not an easy task. Not only does it take quite a long time to master, but even professional human lip readers often misread.

Automated lip reading has been a hot area of research these days. As you may know, we have LipNet, which was published by researchers from the University of Oxford and others about three years ago. So, what is LipNet? LipNet is an awesome lip-reading neural network. Researchers have shown it can read words off the lips of English speakers in more than 90% of cases. Like human lip readers, LipNet works by looking at the movements of the lips. More specifically, it extracts images of mouths out of the input frames. For the extraction, it uses a dlib-based face detector and shape predictor. Here are examples of extracted mouth images. Did you notice something? Why are they colored? Because LipNet tends to overfit on grayscale images, since it has rather simple convolutional layers in its front stage.

Well, let's see it in action. First, GRID. GRID is a large, multi-talker, audio-visual sentence corpus to support joint computational-behavioral studies in speech perception. It consists of high-quality audio and facial recordings of 1,000 sentences spoken by each of 34 speakers. You can regard it as a landmark corpus for speech recognition in controlled environments. Now, let's see how well LipNet works. We will be using a pre-trained LipNet in this presentation. As you see, a frame is displayed and recognition is running. It reads: "lay blue by I9, please." Not so bad. In contrast, professional human lip readers mark about a 50% score on average. So far, so good.

Then, let it spin on some real-world recordings. This is a recording. I'm sorry, it's a bit hard to see what it read. But no, he shouldn't be talking such nonsense. It is such a GRID-like sentence. What happened? It might be overfitting, maybe. The vocabulary might be limited, maybe. Anyway, in the real world, people talk with much more variety than in controlled environments.
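By the way, to make the mouth-extraction step we just described concrete, here is a minimal sketch of a dlib-based mouth crop, similar in spirit to LipNet's preprocessing. This is our own illustration, not LipNet's actual code: the predictor model path is an assumption (dlib distributes that model separately), and the 100-by-50 size matches what we target later in this talk.

```python
# Minimal sketch: crop a color mouth region from a frame, LipNet-style.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# The .dat path is an assumption; dlib publishes this model separately.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_mouth(frame, size=(100, 50), pad=10):
    """Return a color mouth crop for the first detected face, or None."""
    faces = detector(frame, 1)  # upsample once to catch smaller faces
    if not faces:
        return None
    landmarks = predictor(frame, faces[0])
    # Points 48-67 of the 68-point model outline the lips.
    pts = np.array([(landmarks.part(i).x, landmarks.part(i).y)
                    for i in range(48, 68)], dtype=np.int32)
    x, y, w, h = cv2.boundingRect(pts)
    crop = frame[max(y - pad, 0):y + h + pad, max(x - pad, 0):x + w + pad]
    return cv2.resize(crop, size)  # keep color; LipNet overfits on grayscale
```

In practice, you would run this per frame and stack the crops into a clip before feeding the network.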
Fed with real-world lips, LipNet gives us false, GRID-like sentences, because updating its vocabulary requires extensive retraining. For easy updating of the vocabulary, we think it is better to decouple sentence generation by emitting phonemes instead of whole sentences. Actually, V2P, a more modern lip-reading model, produces phonemes. So let's say the curse of GRID remains upon us until we undergo a full retraining of it, using something more fun.

Now, we can almost hear you say: why are you using such nonsense, then? Yeah, here's why. We could not get corpora handy. You see, LRW and LRS2, the Lip Reading in the Wild corpora, come with a non-commercial-only policy enforced by the BBC and, interestingly, a long turnaround time. Because we are neither students nor in a research institution, we could not use them. LRS3, the Lip Reading Sentences corpus, has a very different format for alignments. Alignments are precise, timed transcriptions. For example, LRS3 uses seconds for time points, and it does not use the "sil" or "sp" notations. For this reason, we would be forced to convert the format to the one expected by the lip-reading implementation we use; a sketch of such a conversion follows below. On top of that, it contains far, far longer clips than GRID, and we would have to arrange the input format for that. While doing something like this, our time ran out. So please forgive us for using the pre-trained, GRID-bound LipNet for the rest of our session. Of course, we don't stop there. We plan to decouple the lip-reading model from our tool.

Well, enough with excuses. Let's get back to the topic. We want our tool, Clairvoyance, to deal with security cameras. How well does it do? As you know, security camera streams present plenty of challenges: lower image quality, lower frame rates, people tending to look down, and tremendous murmuring. Lower quality: this is mostly because they are small devices, so their optics are limited, and they are designed to record for a long time. Next, people seem to look down: this is because the cameras are often mounted somewhere hard to reach, often somewhere high, looking down. Lastly, people look like they are murmuring all the time: this, too, is because of the angle of the camera, as they are often mounted looking down at people.

Two years ago, Los Clans applied lip reading to security cameras in the Dutch language, and he succeeded in detecting aggression. In his research, a model is used for detecting certain words, such as curses and threats. All right. Clairvoyance uses LipNet anyway. How well does it do? OK, let's see some security camera recordings. You see the rectangle? The rectangle shows the area Clairvoyance is reading. See, it had no problem detecting faces and reading lips, in spite of the very challenges we have described. So far, so good.

But wait, then what is the point of our research? It is concurrency. Honestly, we have not seen a single piece of research applying a model to concurrent lip reading of multiple persons. That is what we did, and today we stand here to show you how we read multiple persons in a concurrent manner. On top of that, we are going to identify speakers so that we can track threads of conversations in a scene.

So, in order to read lips, we must detect where people stand. That is, the faces. We used dlib-based face detection to detect them. As you know, dlib is a quite versatile image-processing library powered by machine-learning techniques.
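Coming back for a moment to the alignment conversion we mentioned: here is a rough sketch of the idea. It rests on our reading of the GRID-style loader, in which align times divided by 1,000 give frame indices at 25 fps and silences appear as "sil" rows; the sample words and timings are made up for illustration.

```python
# Sketch: convert LRS3-style word timings (in seconds) into GRID-style rows.
# Assumption: GRID align times / 1000 == frame index at 25 fps; gaps are "sil".
FPS = 25

def to_grid_units(seconds):
    return int(round(seconds * FPS)) * 1000

def convert(words, clip_end_s):
    """words: list of (word, start_s, end_s). Returns GRID-style align lines."""
    lines, cursor = [], 0.0
    for word, start, end in words:
        if start > cursor:  # fill the gap before the word with silence
            lines.append(f"{to_grid_units(cursor)} {to_grid_units(start)} sil")
        lines.append(f"{to_grid_units(start)} {to_grid_units(end)} {word.lower()}")
        cursor = end
    if clip_end_s > cursor:  # trailing silence up to the end of the clip
        lines.append(f"{to_grid_units(cursor)} {to_grid_units(clip_end_s)} sil")
    return "\n".join(lines)

print(convert([("hello", 0.52, 0.91), ("world", 0.98, 1.40)], 3.0))
```

And since we just introduced dlib-based face detection, here is a minimal sketch of the two detector flavors we compare in a moment. The CNN model filename is the one dlib distributes in its model zoo; treat the whole thing as an illustration rather than Clairvoyance's actual code.

```python
# Sketch: dlib's two detector flavors. HOG is the fast default;
# the CNN one is more robust but much more costly.
import dlib

hog_detector = dlib.get_frontal_face_detector()
cnn_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")

def detect_faces(frame, use_cnn=False):
    """Return plain dlib rectangles from either detector."""
    if use_cnn:
        # The CNN detector returns mmod_rectangles; unwrap to plain rectangles.
        return [d.rect for d in cnn_detector(frame, 1)]
    return hog_detector(frame, 1)
```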
We first thought to utilize something more orthodox, say, vanilla OpenCV, but it did not go well. Detectors based on OpenCV tend to be picky about object alignment, dropping persons who are not standing straight, or something like that. Ones based on dlib, on the other hand, handle cases like this just fine, or are more robust at least. As you know, dlib does HOG-based detection by default, but it can utilize a CNN for more precise detection. Clairvoyance can use it, too, but it comes with a significantly higher cost, about 15 times that of HOG-based detection.

After detecting faces, we keep their features to track their identities. For the facial recognition process, we use the face_recognition library. With it, we can hash facial features into 128-dimensional vectors called encodings. We recognize faces by keeping their encodings and computing distances among them. But unfortunately, our recognition process is not that stable. That is a problem we call tag-swaps: more specifically, the facial tags can jump around. But why do they? We have identified three root causes. The first one is that facial encodings are brittle. By brittle, we mean we cannot get the same encoding for the same face in the same pose. The next one is that the facial detection order is not stable. By not stable, we mean we cannot predict the order in which faces are reported. And the last one is the resemblance threshold: the distance under which the face_recognition library considers two faces to resemble each other enough. It is said that 0.6 is a quite reasonable number. But is it really? Let us show some pictures. Trump and Hillary, both showing around 0.3: the distance between them is about 0.3. Can you believe it? OK. Fixing tag-swaps is not easy, but we believe it is safe to assume faces are not supposed to make sudden and large warps. We need a few more experiments. We include a small sketch of this matching step below.

So, after recognizing faces, we detect their lips. To detect lips, we use the 68-point face landmark predictor that comes with dlib, taking the points from number 48 onward. After detecting the lips, we resize them as needed: our model expects them at 100 by 50, so we target that size. For better recognition, we also considered upsampling them when they are small. For upsampling, we could utilize some super-resolution imaging techniques. SR techniques intend to recreate details. But is SR really helpful? SR is quite expensive, so we needed some experiments to weigh the cost. Let us show some pictures. The center one and the right one are both resized instances of the left one. The center one is done with SR, and the right one with the nearest-neighbor method. Can you spot the difference? We could not. So we concluded that SR is not worth the cost for our use case, and we dropped the idea. To compensate for lower frame rates, we also considered interpolating frames before feeding them to our model. But as far as we know, modern security cameras tend to generate a sufficient frame rate just fine, so we assumed interpolating frames would be superfluous and dropped that idea, too.

After the lips are resized, we send them to the reading process, batch by batch and face by face, for concurrent reading. The reading process is tasked with feeding our model continuously and outputting the reading along with a name and a timestamp. The reading process is driven by coroutines, so reading occurs in FIFO order; a sketch of this pipeline is included below, too.

Well, enough talking. Let's see how it goes. We are reading four persons in parallel. So we have succeeded in reading four talking people concurrently. As a bonus, we release Clairvoyance as free software, free as in freedom. Please feel free to experiment with it and, hopefully, send us pull requests.
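For those who want to experiment, here is a minimal sketch of the identity-tagging step, assuming the face_recognition library. The tolerance value and tag names are illustrative, not the exact numbers Clairvoyance uses; we only know from our pictures that a distance of around 0.3 can occur between different people.

```python
# Sketch: keep identities stable across frames with face_recognition encodings.
import face_recognition
import numpy as np

TOLERANCE = 0.45  # illustrative; the library default of 0.6 felt too loose to us
known_encodings, known_tags = [], []

def tag_faces(frame):
    """Return (tag, location) pairs; register unseen faces under a new tag."""
    locations = face_recognition.face_locations(frame)
    encodings = face_recognition.face_encodings(frame, locations)
    tagged = []
    for loc, enc in zip(locations, encodings):
        if known_encodings:
            distances = face_recognition.face_distance(known_encodings, enc)
            best = int(np.argmin(distances))
            if distances[best] <= TOLERANCE:
                tagged.append((known_tags[best], loc))
                continue
        tag = f"person-{len(known_tags)}"  # new identity
        known_encodings.append(enc)
        known_tags.append(tag)
        tagged.append((tag, loc))
    return tagged
```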
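And here is a minimal sketch of the coroutine-driven, FIFO reading pipeline. Here, `tagged_mouths()` and `model.predict()` are hypothetical stand-ins for the detection, tagging, and LipNet pieces described above; the batch size of 75 frames simply mirrors the length of a GRID clip.

```python
# Sketch: one FIFO queue of mouth crops per tracked face, each consumed by
# a coroutine that feeds the lip-reading model and prints timestamped output.
import asyncio
import time

BATCH = 75  # GRID clips are 75 frames; an illustrative window size

async def reader(tag, queue, model):
    """Consume mouth crops for one face and emit readings in FIFO order."""
    buffer = []
    while True:
        buffer.append(await queue.get())
        if len(buffer) == BATCH:
            text = model.predict(buffer)  # stand-in for the actual LipNet call
            print(f"[{time.strftime('%H:%M:%S')}] {tag}: {text}")
            buffer = []

async def dispatch(frames, model):
    """Route each face's mouth crops into its own queue, one reader per face."""
    queues = {}
    for frame in frames:
        # tagged_mouths() is a hypothetical helper combining the steps above:
        # detect faces, tag identities, crop and resize lips.
        for tag, crop in tagged_mouths(frame):
            if tag not in queues:
                queues[tag] = asyncio.Queue()
                asyncio.create_task(reader(tag, queues[tag], model))
            await queues[tag].put(crop)
        await asyncio.sleep(0)  # yield so the reader coroutines can run
```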
Please mention us on Twitter or something for questions; we'll reply later. OK? Well, that is all for today. Thank you for listening to us.