Sorry, please hold your questions. I'll try to go quickly because I want to leave some time for the demo, although I'm not sure whether it's working or not. So today's topic is clap detection: detecting the clapping sound using an Arduino and a neural network. Usually when we talk about neural networks, we mean big machines with high processing power, but I'm trying to do this on an Arduino and show that it's actually doable.

The motivation is that I need a remote-controlled light, simply because I put the light switch in the wrong place in my room. I want to remedy this mistake, so I'm trying to do some remote control of the light. There were several options. One is the PIR sensor, the obvious one: passive infrared detection, so it activates when you walk nearby. But for this application it's not really usable, because the sensor has to be mounted on a wall facing the person, and a person sitting still and working doesn't trigger it. The sensor also has a limited detection angle, so if I work behind it, it doesn't activate at all. So the PIR sensor is ruled out. There's also a newer type of sensor called micro radar, a very fancy name; I bought several of them. It's supposed to have 360-degree detection, but what I found is that there are metal structures inside the wall that block the signal, so it doesn't work well either. So the only option left for me is sound. There are actually commercial sound detectors with a switch built in, a nice panel that can simply replace your ordinary wall switch. But I felt this should be quite simple, so why can't I make my own? I have so many Arduinos lying around, so I made this particularly simple schematic: just one microphone with an amplifier feeding into an analog input, plus some detection logic. That's quite simple.

Then I went to Google and searched for existing projects people have published. There are tons of them. What they actually do is, in the loop function, just read the analog input, and if the value is above a certain threshold, flip the switch. That's what the majority of the projects do. Does it work? Yes, it can work. But how does it know it's a clap sound rather than somebody speaking nearby, some music, or a slammed door? How does it differentiate? Judging from the code, it doesn't. Going further, there's a very nice YouTube video by a guy called GreatScott. He made a better version: he required two claps to activate something, which minimizes false positives a lot, because you have to clap twice. But I still don't feel it's a very nice solution, because having to clap twice feels a bit weird. Going even further, there's an academic paper, although published by students, that introduces several algorithms based on the same principle: they capture the sound samples, do long-term and short-term averaging of the energy level, calculate the ratio of short-term to long-term energy, and say that above a certain threshold it's a clap, because a clap is a burst of sound over a quieter background. I thought, yeah, this might work; this sounds reasonable. Sketches of both approaches follow below.
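Here is a minimal sketch of what most of those published projects boil down to, assuming an electret microphone module on A0 and a relay on pin 7; the pin choices and the threshold value are my illustrative assumptions, not taken from any particular project:

```cpp
// Naive clap switch, the pattern most hobby projects use:
// read the analog value and toggle when it exceeds a fixed threshold.
const int MIC_PIN = A0;
const int RELAY_PIN = 7;
const int THRESHOLD = 600;   // out of 0..1023; purely a guess

bool lightOn = false;

void setup() {
  pinMode(RELAY_PIN, OUTPUT);
}

void loop() {
  if (analogRead(MIC_PIN) > THRESHOLD) {
    lightOn = !lightOn;
    digitalWrite(RELAY_PIN, lightOn ? HIGH : LOW);
    delay(500);  // crude debounce so one loud sound toggles only once
  }
}
```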
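And a rough sketch of the paper's short-term/long-term energy ratio idea as I understood it; the smoothing factors and the ratio threshold here are invented for illustration:

```cpp
// Track a fast (short-term) and slow (long-term) running average of
// signal energy; flag a clap when their ratio spikes, i.e. a burst
// of sound over a quieter background.
const int MIC_PIN = A0;
float shortAvg = 0, longAvg = 1;       // 1 avoids divide-by-zero at start
const float SHORT_ALPHA = 0.2;         // fast follower
const float LONG_ALPHA  = 0.001;       // slow background estimate
const float RATIO_THRESHOLD = 8.0;     // burst detection level (a guess)

void setup() {}

void loop() {
  int s = analogRead(MIC_PIN) - 512;   // remove DC bias of a biased mic
  float energy = (float)s * s;
  shortAvg = SHORT_ALPHA * energy + (1 - SHORT_ALPHA) * shortAvg;
  longAvg  = LONG_ALPHA  * energy + (1 - LONG_ALPHA)  * longAvg;
  if (shortAvg / longAvg > RATIO_THRESHOLD) {
    // Burst detected -- but is it a clap, a slammed door, or speech?
    // That is exactly the open question with this approach.
  }
}
```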
So I tried to implement this, but the first question is: how do I decide the threshold? Do I need to collect a lot of samples and do my own testing? And whether that threshold is applicable to other environments, I have no guarantee. Meanwhile, I was taking a deep learning course; this was maybe my second month into it. What they teach a lot in a deep learning course is that for a classifier, you can just pump all the data in and let the neural network decide by itself whether it is a clap or not. So I thought, yeah, this is a good choice. Obviously I could capture the sound samples, feed them into a neural network, and let it train, perhaps training on a PC instead of on the microcontroller. But I soon found it's not really doable on a microcontroller like the Arduino, because for collecting sound I need to capture data for 200 milliseconds, and within those 200 milliseconds I get about 7,700 samples. The Arduino definitely doesn't have the memory to store those samples, and streaming them to a PC within 200 milliseconds is almost impossible. Even if it were possible, taking thousands of samples and doing a forward calculation in floating point is also not possible on the Arduino. So I gave up on that idea. Maybe with a Cortex-M4F it would be possible, but I wanted to stick with the Arduino.

The second idea is to extract some features that represent the clapping sound. Then I can train on those features on a PC, and after training, put the network parameters into the Arduino code and do the prediction, the forward calculation, there. That should be doable. So now the question is: what are the features? Let me show the ones I chose. The first is the so-called zero-crossing rate. The top trace is a clapping sound and the bottom is something dropped onto the table. What we can see is that for the clapping sound, within the same period of time, there are a lot of oscillations, whereas for other sounds like voices or music there are very few. It looks like frequency, but because it is not a fixed frequency, most people use the term zero-crossing rate instead. The feature I chose is the time taken for the first 50 zero crossings to be detected: the less time, the more zero crossings you have encountered. So that's one feature, sketched below. However, this feature alone is not enough, because if somebody plays a high-pitched sound, it would be detected as a clap as well.

So I did some more captures. The top row shows clapping sounds, different kinds of claps, and the bottom row shows speech, something dropped, and singing. The blue line is the capture window, which is 200 milliseconds. What we can see is that for the clapping sound there is always a decay, whereas the other sounds may not have one. The dropped object does show a decay, but the zero-crossing rate from before is already able to differentiate it. So how do I turn this decay shape into a feature? What I did is take the 200 milliseconds, split it into 10 segments, and calculate the energy of each segment (a sketch of this follows as well). My hope is that, given all these features together, a neural network can figure out whether the shape is a decay or flat or something like that.
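A sketch of how that zero-crossing feature could be measured on the Arduino, assuming a microphone biased around the ADC midpoint (512); the 200 ms timeout matching the capture window is my addition so the loop cannot hang on silence:

```cpp
// Measure how long it takes to see the first 50 zero crossings.
// Shorter time = higher crossing rate, characteristic of a clap.
unsigned long timeFor50Crossings() {
  const int BIAS = 512;                      // ADC midpoint for a biased mic
  int crossings = 0;
  int prev = analogRead(A0) - BIAS;
  unsigned long start = micros();
  while (crossings < 50 && micros() - start < 200000UL) {
    int cur = analogRead(A0) - BIAS;
    if ((prev < 0) != (cur < 0)) crossings++;  // sign change = crossing
    prev = cur;
  }
  return micros() - start;                   // feature: elapsed time in us
}
```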
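And a sketch of the segment-energy feature. Since the Arduino cannot buffer a full 200 ms window, I've written it to accumulate energy on the fly; the sample counts assume the 38.5 kS/s ADC rate described later in the talk, and the bias value and pin are assumptions:

```cpp
// Decay-shape feature: accumulate signal energy into ten 20 ms bins
// as samples arrive, giving 10 numbers whose profile (decaying vs.
// flat) the network can learn. Assumes the ADC has already been
// configured for ~38.5 kS/s rather than the Arduino default.
const unsigned int SAMPLES_PER_SEG = 770;   // ~20 ms at 38.5 kS/s
float segEnergy[10];

void collectSegmentEnergies() {
  for (int seg = 0; seg < 10; seg++) {
    float sum = 0;
    for (unsigned int i = 0; i < SAMPLES_PER_SEG; i++) {
      int v = analogRead(A0) - 512;         // remove DC bias
      sum += (float)v * v;                  // energy = sum of squares
    }
    segEnergy[seg] = sum;
  }
}
```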
So altogether I have 12 features: one is the zero-crossing time, ten are the energy levels for each 20-millisecond segment, and one more is the pre-trigger level, meaning the background sound level before the first loud samples arrive. I feed these into the neural network and do the training. Some details: feature collection runs on the Arduino, with the ADC set to 38.5 kilosamples per second, which is not the Arduino default sample rate, so I need to write some registers to override the defaults (a sketch of that setup appears further below). I tried different network configurations and found that a small two-layer network actually does quite well. The accuracy is quite high, 96%, although I have very limited samples. The training code is from the course I mentioned; I'm not sure whether I can publish it. Being course training code, everything is written by hand. If I were doing it now, I would use something like TensorFlow, but I only reached that part of the course last week, and this was all done a week or two ago. One more detail: doing this calculation on the Arduino takes 20 milliseconds. So it's 200 milliseconds of sampling plus 20 milliseconds to make the decision; the actual response comes about 220 milliseconds later, which feels quite okay to a human.

So, some conclusions and future work. First, neural networks are definitely the lazy engineer's best friend. Otherwise I would have fallen into the pit of doing an FFT on the signal, extracting the frequency components, fitting the exponential decay, all that kind of thing. With a neural network, everything is just done by the math; I don't need to do any of the really hard work. And I'm an absolute beginner, only two months in, so this is by no means a complete project. I need a lot more samples, and maybe these features are not enough; maybe I need more. There are also still false positives, especially when I play The Amp Hour podcast from a speaker near the microphone: the hosts keep silent for a while and then suddenly speak, "ha", whatever, and it triggers.

So, a little bit of demo. I'm not sure it will work, because the environment here is a little bit different. How do I show this? Okay. It's trained on my hand, so everybody's clapping is a little bit different. Okay, so these are the data, continuously collected; the first value is the zero crossing. Clapping works okay here, but if I talk to it, it's very hard to activate. Okay, that's the finish.

Q: Once it turns on, does it refuse input for a certain amount of time afterwards?
A: Yes, it keeps quiet for several seconds, just a silent time; otherwise, if you keep clapping, it keeps toggling.
Q: With an ESP32 you should be able to get much more. It's a much, much faster processor; this is a lot for an 8-bit part to do.
A: Which is true. [inaudible remark about FPGAs] I can show a bit of the code. It's damn ugly: it's full of parameters for the neural network, and everything in there is calculated from the training.
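As a reconstruction of what that pasted-in network code might look like: the hidden layer size and the sigmoid activation are my assumptions, and the weight arrays are placeholders for the values copied from the PC training run.

```cpp
// Forward pass of a small two-layer network over the 12 features,
// the calculation that takes ~20 ms on the Arduino.
#include <math.h>

const int N_IN = 12, N_HID = 8;           // 12 features; hidden size assumed
float W1[N_HID][N_IN];                    // placeholders: paste trained
float b1[N_HID];                          // values from the PC here
float W2[N_HID];                          // hidden -> single clap output
float b2;

float sigmoidf(float x) { return 1.0f / (1.0f + expf(-x)); }

// Returns the estimated probability that the features describe a clap.
float predictClap(const float *features) {
  float hidden[N_HID];
  for (int j = 0; j < N_HID; j++) {       // layer 1
    float sum = b1[j];
    for (int i = 0; i < N_IN; i++) sum += W1[j][i] * features[i];
    hidden[j] = sigmoidf(sum);
  }
  float out = b2;                         // layer 2
  for (int j = 0; j < N_HID; j++) out += W2[j] * hidden[j];
  return sigmoidf(out);
}
```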
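And the ADC register setup mentioned earlier; this is the standard ATmega328 trick I assume was used: a prescaler of 32 gives a 500 kHz ADC clock, and at 13 cycles per conversion in free-running mode that works out to roughly 38.5 kS/s.

```cpp
// Configure the ATmega328 ADC for ~38.5 kS/s free-running sampling
// on channel A0. Call once from setup().
void setupFastADC() {
  ADMUX  = (1 << REFS0);                  // AVcc reference, channel A0
  ADCSRA = (1 << ADEN)                    // enable ADC
         | (1 << ADSC)                    // start converting
         | (1 << ADATE)                   // auto-trigger (free running)
         | (1 << ADPS2) | (1 << ADPS0);   // prescaler = 32 -> 500 kHz clock
  ADCSRB = 0;                             // trigger source: free running
}

// Read the latest conversion result by polling the interrupt flag.
int readFastADC() {
  while (!(ADCSRA & (1 << ADIF))) {}      // wait for conversion complete
  ADCSRA |= (1 << ADIF);                  // clear flag (write 1 to clear)
  return ADC;                             // 10-bit result
}
```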
So I have these samples; these are the test samples I collected. I can do the training here. It's Python code, and it starts training. Today it's so slow. Okay. So this is the loss function over several iterations, and then these are the parameters again. I just take this output, copy-paste it into the Arduino code, and it just works.

Q: I've got two remarks. There's someone in Finland [inaudible] who synthesizes hand clap sounds [inaudible], so that's definitely something you could do. And Google has a much larger sound database than what you probably have. [inaudible crosstalk] If you look for [inaudible], you can probably find hand claps recorded by others; that could help you train, if you take that.
Q: So you don't even bother to clap, you just push a button?
A: The idea is that I don't want a remote in my pocket. And I mistakenly put my switch just below the light, while the place where I need to turn it on is five meters away. That is the whole motivation for doing this.