Today I'm going to show you my project: an eye-tracking application built with deep learning. The problem I'm trying to solve is to address the difficulties of people suffering from complete paralysis, or people who were born with no limbs, no hands, no legs, who basically can't do anything. They can only lie on the bed and stare at the ceiling fan. I'm trying to help them actively control their daily activities using their eyes, because eye movement is, I think, the only movement they have left.

What I built is a VR headset. It's not actually VR, it's a DIY VR-style headset with a camera over here. The camera points directly at one of the eyes, the left eye in this case, and it records video-based data, which is sent to the Raspberry Pi board over here for prediction. The model predicts one of six labels: whether the person is looking left, right, up, down, center, or blinking. That's why we need video-based data, a sequence of images. We then use those labels to control a wheelchair. This one is just a demo: a Raspberry Pi robot car that I built, controlled wirelessly through the predicted labels. There's another application too: we can use a sequence of labels as a passcode, just like digits, to unlock something, say a phone, a garage, or a door.

How do I collect the data? Unfortunately, I couldn't find any public dataset that satisfied my requirements, so I created my own. I use the DIY VR headset to collect the video-based data. Each data point is a sequence of 15 image frames sampled over one second, so it's 15 FPS. The data covers the six labels I mentioned above, and the classes are balanced because I intentionally made them so. I also sampled from five different people; most of it, about 60%, is from my own eyes and the rest is from my friends. Each data-collection session lasts 15 seconds: within those 15 seconds you have to keep looking in one direction or keep doing the same action, and afterwards 15 data points are created. It's a very expensive, very eye-straining process that I would not encourage, and after a long time of data collection I only got about 3,000 data points. Roughly 60% of the data is mine, and the other four people account for the remaining 40%. This is what the data looks like, a sequence. This one is a blink: you can see some of the images are closed and some are open, so it's blinking.

Now the model. I have to make sure the model is small, lightweight, and fast enough to be deployed on the Raspberry Pi board, which is very slow. So I use a convolutional recurrent network. Each image frame is fed into a CNN to extract features, and the sequence of features is fed into a bidirectional LSTM, which predicts the label through a softmax layer. The CNN block contains six convolutional layers, and in each convolutional layer I do convolution, batch normalization, and activation. I don't use any pooling at all. The input image is 64×64×3 at the bottom. Okay.
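As a rough illustration of the architecture described above, a per-frame CNN followed by a bidirectional LSTM and a softmax over the six labels, here is a minimal sketch in Keras. Only the overall shape comes from the talk (six conv layers of conv + batch norm + activation, no pooling, 64×64×3 frames, 15-frame sequences, six classes); the filter counts, strides, and LSTM width are assumptions.

```python
# Minimal sketch of the CNN + bidirectional LSTM classifier described in the talk.
# Filter counts, strides, and the LSTM width are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, IMG_SIZE, NUM_CLASSES = 15, 64, 6  # 15 frames/second, 64x64 RGB, 6 gaze labels

def conv_block(x, filters, stride):
    # conv -> batch norm -> activation, no pooling (strided convs do any downsampling)
    x = layers.Conv2D(filters, 3, strides=stride, padding="same")(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def build_frame_cnn():
    # Six convolutional layers, as in the talk; filter counts are assumed.
    frame = layers.Input((IMG_SIZE, IMG_SIZE, 3))
    x = frame
    for filters, stride in [(16, 1), (16, 2), (32, 1), (32, 2), (64, 1), (64, 2)]:
        x = conv_block(x, filters, stride)
    x = layers.Flatten()(x)  # no pooling anywhere, per the talk
    return models.Model(frame, x, name="frame_cnn")

def build_model():
    seq = layers.Input((NUM_FRAMES, IMG_SIZE, IMG_SIZE, 3))
    feats = layers.TimeDistributed(build_frame_cnn())(seq)   # one feature vector per frame
    x = layers.Bidirectional(layers.LSTM(64))(feats)         # LSTM width is an assumption
    out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    return models.Model(seq, out, name="gaze_crnn")

model = build_model()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

This is a sketch of the general idea, not the speaker's exact network; the real filter counts and recurrent size would have to come from the project itself.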
Now I'm showing you the results on validation and testing, which are surprisingly good. When I train only on my own data, from my own eyes, the accuracy is very high and the loss is very low: nearly 99%, which is quite ridiculous. Note that this is the validation and test accuracy, not the training accuracy. With my data plus my friends' data, about 3,000 data points in total, I get a quite similar result. I also do data augmentation.

So is something wrong? You might say there must be something wrong here: you have very limited data, how can you get such high accuracy? I was very doubtful too, so let's try it out. I have a demo over here. Where's the mouse? Okay. This is a demo where I do real-time testing, controlling the car with my own eyes and nothing else. It's quite impressive: in real-time testing it does make some mistakes, but I'd estimate it's around 90%. This one predicts a sequence of labels coded as a passcode to unlock something. Here I try to replicate this example: I look right, up, right, and left to open the lock, like that. So in real time it works quite impressively, but there is a problem. It does have a problem, so let's go through it.

The problem is that it only works for me. Well, that makes sense if I train it only on my own data, only my own eyes; the demo just now used a model trained on just my eyes. After that I decided to collect my friends' data too. But first of all, it completely guesses at random on other people, on strangers' eyes. It also fails on one of my friends' eyes, even though their eyes are inside the training dataset. It also performs very badly in dark conditions or under unusual lighting, different lights, indoors versus outdoors, and so on. One of my friends has very big eyes, and that does improve things a little, but for the others, who have small eyes, it has no clue at all. It fails on me too: after a while without a haircut, my hair covered my eyebrows and it failed. It is also very sensitive to orientation changes, because the direction you are looking may not align with the direction of your head, and it is the direction of your head that aligns with the camera. Say my head is pointing in this direction, but I'm looking somewhere over here: from the camera's point of view the direction of my eyes has changed, and the model just makes a random guess, which is very bad too.

One interesting property is that the softmax score only goes to extremes, which is very strange, but I think it's related to the problem. Whatever class it picks, whether it's right or wrong, it always assigns a very high score. If it correctly predicts a label, the score for that label is usually 0.999 or even 1.0, and the rest of the labels get something like 10⁻⁸ or 10⁻⁷. Even when it predicts wrongly, say it predicts blinking but the true label is center, the score for blinking is still very high, around 0.999, which doesn't make sense. If it predicts blinking and is wrong, it should have a lower confidence, like 0.5 or so. This is a very weird phenomenon, and I'm still trying to figure it out.
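The passcode idea from the demo, a fixed sequence of gaze labels that unlocks something, can be sketched in a few lines of plain Python. The concrete passcode (right, up, right, left) is the one shown in the demo; the buffering and comparison logic below is an assumed, illustrative implementation, not the speaker's actual code.

```python
from collections import deque

# Illustrative passcode check: unlock when the most recent directional
# predictions match the stored sequence. Ignoring "center" and "blink"
# as resting states is an assumption.
PASSCODE = ["right", "up", "right", "left"]
recent = deque(maxlen=len(PASSCODE))

def on_prediction(label: str) -> bool:
    """Feed in each one-second prediction; return True once the passcode matches."""
    if label in ("center", "blink"):  # assumption: treat these as "no input"
        return False
    recent.append(label)
    return list(recent) == PASSCODE

# Example stream of per-second predictions from the model:
for lab in ["center", "right", "up", "right", "left"]:
    if on_prediction(lab):
        print("Unlocked!")
```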
So the problems I'm trying to pinpoint: it's a lack of data, only 3,000 data points, so severe overfitting. What's more interesting is that it still works well on the validation and test data, which is even weirder. Another problem is that the data points are very similar to each other: there isn't much diversity in the dataset, because I collected the data mostly from my own eyes, and the lighting conditions barely change, sometimes I sit here, sometimes I sit there, not many different places, and I didn't collect anything outdoors. So the dataset doesn't have much diversity. Same point again: each data point is very similar to the others; they are all unique, I checked and really made sure of that, but they are still very similar. And the dataset probably doesn't cover the true distribution, because I only collected my eyes and my friends' eyes, not the billions of other people out there.

So I grouped all the problems I could think of, and it comes down to a real lack of data. I tried to increase the data using different techniques. First, data augmentation: I randomly rotate and translate some of the images, change the brightness, change the contrast, and so on. I also generated new images using a very cool animation tool called UnityEyes from Cambridge University, which can generate an animated version of the eye, as you can see, and make it look around like that. With that I could increase the total number of data points to 6,000. I also used a generative adversarial network to generate fake images for training, but I ended up not including them in the dataset because, as you can see over here on the right, the generated fake images don't look realistic enough to put into the training set.

So what are the solutions? Honestly, I don't know; I'm looking for one. Yeah, that's all. I hope I can find a solution from all of you talented people. Thank you. Do you have any questions? Yeah, any questions or answers? Okay.

An audience member asks: have you tried reusing the early layers of an existing model without training them? There are existing models for estimating the position of the eye, and you could use, for example, a few layers of such a model, so your model isn't trained from scratch. Transfer learning, kind of.

I would try it, but I have a constraint: there are many great models I could use for transfer learning, but they are usually very big, very large, and I want to fit the model onto a Raspberry Pi, which makes that quite impractical. They consume a lot of memory and a lot of computation time, which the Raspberry Pi can't handle. So I didn't do that. Yeah, correct, I also thought about that, but during development I was thinking about production, and I did consider deploying the model on a computer, some powerful machine. But then you still have to transfer the data, the sequence of images, to the server, and that adds latency and may slow the whole thing down. So running the model on the Raspberry Pi may actually be faster than shipping the whole batch of data to a server and getting the result back. So I didn't.
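For what it's worth, the audience suggestion could look roughly like the sketch below: a small pre-trained backbone used as a frozen per-frame feature extractor in front of the same bidirectional LSTM head. MobileNetV2 with a reduced width multiplier is purely my assumption, picked because it is one of the smaller ImageNet backbones; the speaker did not use transfer learning, so this illustrates the suggestion rather than the project.

```python
# Sketch of the transfer-learning suggestion: freeze a small pre-trained backbone
# and train only the recurrent head. MobileNetV2 (alpha=0.35) and the 96x96 input
# size are assumptions, not part of the speaker's project.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_FRAMES, IMG_SIZE, NUM_CLASSES = 15, 96, 6  # frames resized up from 64x64 (assumption)

backbone = tf.keras.applications.MobileNetV2(
    input_shape=(IMG_SIZE, IMG_SIZE, 3),
    alpha=0.35,               # smallest width multiplier, to stay Raspberry-Pi-friendly
    include_top=False,
    pooling="avg",
    weights="imagenet",
)
backbone.trainable = False    # reuse the pre-trained layers without training them

# Real use would also apply tf.keras.applications.mobilenet_v2.preprocess_input
# to each frame before feeding it in.
seq = layers.Input((NUM_FRAMES, IMG_SIZE, IMG_SIZE, 3))
feats = layers.TimeDistributed(backbone)(seq)             # one feature vector per frame
x = layers.Bidirectional(layers.LSTM(64))(feats)
out = layers.Dense(NUM_CLASSES, activation="softmax")(x)
model = models.Model(seq, out, name="gaze_crnn_transfer")
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Whether even a backbone this small fits the Raspberry Pi's memory and latency budget would still have to be measured, which is exactly the speaker's concern.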
I will think about it and try it out to see whether it works better. But another point is that I don't want to make the model big, because it could easily overfit: I don't have much data, and data collection is very expensive.

Another audience member adds: here's a post you can go home and check out. It's from Apple and it's basically about using GANs, so just use a better-quality GAN; Apple had the exact same issue. If anyone's interested, Apple has a blog that presents some of the research papers they're doing internally, and one of their first posts, I think it was actually their first one, was about using GANs to produce realistic synthetic eye images. Anyway, that's a great talk. Thank you.