Hello everyone. We have Kelvin Tan with us and he'll be talking about Electromechanical Platform with Removable Overlay for Exploring, Tuning and Evaluating Reinforcement Learning Algorithms. Kelvin, the stage is all yours.

Okay, thank you Girish. Good afternoon. As Girish mentioned, this is the title of my talk. I will speak a bit quickly because the original set of slides requires about 40 minutes to complete, so I will move quickly. If you have any questions, we can take them offline or discuss them after the talk. Thank you.

The outline of my talk: first the motivation and purpose, then the journey of the platform development. I'll give you a demonstration, then we will proceed with an overview of reinforcement learning. I'll share some results, then move on to discussion and what I plan to do as follow-up.

The inspiration, or motivation, is basically curiosity. After learning and reading about reinforcement learning, and especially after watching a documentary about Dr. Claude Shannon, I was really intrigued. I wanted to try out the algorithms not just on a virtual platform, but on a physical one.

A bit more about Dr. Shannon. Dr. Shannon is famous for coming up with Shannon's channel capacity theorem. With it, we can calculate the capacity of any given channel, wired or wireless. Scientists and engineers can work out the channel capacity from the bandwidth, and the capacity also depends on the signal-to-noise ratio; signal to noise is an important factor when determining the capacity of a channel. That is what allowed us to move into the information age. Without a clear formula, it would be very difficult to know how thick or how many cables, or how powerful the transmission channel, would need to be, especially in this internet age with fiber optic cables being laid under the sea.

He was also a very interesting inventor, not just a theorist. He built an electromechanical mouse-and-maze platform using 1950s technology. It's very impressive because at the time there were no advanced, powerful computers, yet the mouse was able to explore the maze and navigate to a given position on the mechanical platform. Amazing. Maze solving is now pursued in many undergraduate courses and even secondary school robotics clubs, where students explore the use of a miniature robot to find its way out of a maze. Many algorithms are deployed; they can be quite simple, just some logical program rather than reinforcement learning. You can find out more from the Wikipedia page linked here; I also put the link in the references.

Now, the journey. The work started in September 2019. I got a 3D printer and was experimenting with printing items on it. After doing the basic stuff, I wanted to challenge myself and started building this platform in April 2020. The design, build, test, iterate process started there. In July 2020 I did the coding and testing, and in August 2020 I added further improvements: a power supply to stabilize the power, and Bluetooth connectivity. In March 2021 I started to work on the next level of the prototype. Just wondering where the stage is... sorry about that, my screen froze.
So, the features of the electromechanical reinforcement learning platform: it doesn't require a camera, no imaging is needed. The platform tilts in the commanded direction; I will explain a bit more about this later. Basically, in order to get the ball to move within the platform, the platform needs to be tilted. The ball itself is not self-propelled like a robot; it is driven by gravity in a certain direction based on the tilt of the platform. The algorithm also has added features to automatically detect walls, and the platform has a self-balancing feature. The platform and the computing module communicate via Bluetooth. And the maze itself, whatever sits on the platform, can be removed from or placed onto the stage easily. This allows mazes of different variations and designs, or even mazes immersed in liquid or enclosed, to provide a different challenge, a completely different environment, for the algorithm to learn and explore.

The original maze program is adapted from Eric Delange's source code. He built a very good maze exploration program with many different algorithms that can be deployed, and it runs in a virtual environment. His maze is eight by eight and the agent, the ball, can move in four directions; we'll call these degrees of freedom. The agent, marked red at its current location, can move north, east, south or west. I reprogrammed it to 16 by 16, because we need to account for the obstacles, the walls. In this case the walls are not pre-programmed; they depend on the physical maze itself, so the agent has to discover the barriers or walls by itself as it moves. I also allowed the agent to move diagonally, so it now has eight degrees of freedom. This takes up a lot more computational power.

The platform parts are designed in simple open-source CAD software; I used Tinkercad for the initial prototype. It's simple, but it's quite tedious, and any mistake results in parts not fitting and having to be redesigned and reprinted. This is a photo of the platform, with the mechanical platform being assembled in the middle, and another photo showing the sensor boards and detectors. This is the completed platform, the top part; you can see that it's driven by two servos to control the X-Y tilt.

So I'll give you a demo of a run here. In this demo the control goes over a cable connected to the PC, and at the same time the platform is connected to my mobile phone using Bluetooth. You can see that I am setting up notifications to feed the stage information to my phone over Bluetooth; it sends information like the location of the ball, the agent, the tilt direction, and other details. I put this magnetic ball onto the stage; this is the ball, the agent, being rolled around. The embedded Hall sensors detect the approximate position of the ball, and with that the algorithm can process the information and make decisions.

Okay, so this is the run visualized on the screen. The red dot is the agent, the ball, being rolled around within the environment. The goal is to drive the agent to the exit, which is marked green.
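As a rough illustration of how this state and action space might be represented in code, here is a minimal sketch, assuming cells are (row, column) pairs on the 16-by-16 grid; the names MOVE_DELTAS and propose_move are hypothetical, not taken from the talk's actual code:

```python
# Hypothetical (row, col) offsets for the eight tilt directions (degrees of freedom).
MOVE_DELTAS = {
    "N": (-1, 0), "NE": (-1, 1), "E": (0, 1), "SE": (1, 1),
    "S": (1, 0),  "SW": (1, -1), "W": (0, -1), "NW": (-1, -1),
}

GRID_SIZE = 16  # the maze was expanded from 8x8 to 16x16

def propose_move(cell, action):
    """Cell the ball would reach if no wall blocks the tilt (clipped to the grid)."""
    row, col = cell
    d_row, d_col = MOVE_DELTAS[action]
    return (min(max(row + d_row, 0), GRID_SIZE - 1),
            min(max(col + d_col, 0), GRID_SIZE - 1))

# Example: tilting north-east from cell (5, 5) targets cell (4, 6).
print(propose_move((5, 5), "NE"))
```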
Because we are running the Q-table and SARSA algorithms, the agent doesn't really know where the exit is initially; it is discovering the environment. After a couple of runs it starts to realize that certain positions cannot be reached. For example, when the stage tries to drive north while the ball is at a certain position and the ball doesn't move north, it realizes that maybe there is an obstacle, a wall, blocking the path of the ball. Repeated failed attempts result in the cell next to the agent being marked gray, as a possible obstacle. More failed attempts result in darker gray and eventually black, and once the cell is marked black, the algorithm removes that move as an option for the agent from that position. For example, the agent, the ball, is now here; when the stage is driven north and the ball doesn't go north, there is probably an obstacle here, so it is marked gray, and repeated failures remove the option of even trying to move the ball north by tilting.

A few more attempts here; it's still discovering its way around. The efficiency, the learning effectiveness, depends on the algorithm; I'll share a bit more about this later. As you can see, it is quite a long process. After many runs, most of the walls, the obstacles, have been confirmed, so the markers are all black now. You can see the agent moving back and forth; that is because of the exploration rate. These are all tunable parameters. After many attempts, it finally completes the maze. This shows the replay, just the replay, not the actual movement, because the actual movement is very fast. The algorithm has learned this maze and is able to consistently find its way to the exit.

This is the visualization of the learning process; it's a bit clearer on the next slide. The top chart is the win rate versus episode. A win is when the agent manages to find the exit, so this is the likelihood of it getting to the exit. Initially it's very low: the agent is just moving around randomly, and when it reaches the per-episode movement limit it simply stops and fails. Because it is moving randomly, it is not able to get to the exit in time, but over time it is, and at around 700 episodes it is able to get to the exit consistently.

The cumulative reward is the reward we place on the agent. When the agent, the ball, reaches the exit, it gets a big positive reward, but every movement carries a small negative reward, like an energy drain. We can see that initially it drops steeply, and even around 700 episodes it is still negative, but the rate of descent has stabilized. If we allowed it to run for more episodes, it would eventually recover and the reward would move into positive territory.
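The wall-discovery behaviour described above, with suspected walls shading from gray to black and the move no longer being attempted, could be implemented roughly like this. This is a minimal sketch; the counter, the threshold of three failures, and the function names are assumptions rather than the talk's actual code:

```python
from collections import defaultdict

FAIL_THRESHOLD = 3          # assumed number of failed tilts before a wall is treated as confirmed

failed = defaultdict(int)   # (cell, action) -> tilts that produced no movement (shades of gray)
blocked = set()             # (cell, action) pairs ruled out, i.e. cells shown in black

def record_outcome(cell, action, new_cell):
    """Update wall beliefs after observing how the ball responded to a tilt."""
    if new_cell == cell:                          # the ball did not move at all
        failed[(cell, action)] += 1               # the suspected wall gets a darker gray
        if failed[(cell, action)] >= FAIL_THRESHOLD:
            blocked.add((cell, action))           # mark it black: stop trying this tilt

def allowed_actions(cell, actions):
    """Actions from this cell that have not been ruled out by repeated failures."""
    return [a for a in actions if (cell, a) not in blocked]
```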
So, a bit more on the concept of reinforcement learning. This chart shows the ecosystem of the reinforcement learning agent in relation to its environment. The agent in this case is the ball that is rolling around, and the environment is the space it is allowed to roll in, in this case the maze. So what can the agent do? The agent can move in the eight directions I mentioned, the eight degrees of freedom. It doesn't propel itself but is driven by the tilt of the stage; the result is the same, the agent gets to move. The agent moves within the environment, and the states are the individual positions, the individual cells, within the maze. With each state there is an associated reward. In our case the reward sits only at the exit: the green exit box carries a reward, every other state has none, and movement itself carries a small penalty.

The Bellman equation explains this well; it is the fundamental equation, the groundwork, of reinforcement learning. In compact form it reads V(s) = max_a [ R(s, a) + gamma * V(s') ]. On the left-hand side is the value term V(s), the value of a position, a state, within the environment. On the right-hand side we take the maximum over the actions a of two terms: R(s, a), the reward for taking action a in state s, plus gamma, the discount factor, times V(s'), the value of the next state. In other words, the value of a state is found by checking each of the eight directions the agent can move and taking the best combination of immediate reward and discounted value of the resulting state; if moving north gives the best total, that is the max. Because the value of the next state depends in turn on the states after it, we are not just looking one step ahead but many steps ahead, beyond the immediate reward. With this, the agent can drive itself towards the ultimate reward, which is the exit.

This chart hopefully explains it better. This is the starting cell, and this is a barrier, a wall. In the top left-hand corner we have the exit cell with a value of 100, and next to it there is another wall. Say the agent has moved two cells and is now here. It has three, actually four, paths to choose from: it can move left, up, north-east, or south-east. Which direction provides the best value? We can use the Bellman equation to calculate the value of any state, any cell, in this environment. As mentioned, assume the value at the exit is 100 and calculate backwards. With a discount factor of 10%, every cell that is one step further from the goal has a value 10% lower. So if the exit is 100, the cell one step away is worth 90, the next one 10% less again, 81, and this cell 73. How about the other path, the top one? From here the next cells are 81, 73, 66. And the path to the north-east? The cell the agent is currently in is closer to the exit, and there is no other path to the exit from there, so 10% off 73 gives 66, and this one is 59; here it is also 66.
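To make that worked example concrete, here is a minimal sketch of the backward calculation, assuming a reward of 100 at the exit and a 10% discount per step; the printed values match the rounded numbers quoted above:

```python
GAMMA = 0.9          # discount factor: each step away from the exit is worth 10% less
EXIT_VALUE = 100.0   # reward placed on the exit cell

def cell_value(steps_from_exit):
    """Discounted value of a cell a given number of moves away from the exit."""
    return EXIT_VALUE * GAMMA ** steps_from_exit

for steps in range(6):
    print(steps, round(cell_value(steps)))   # 100, 90, 81, 73, 66, 59
```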
So with this, these are the algorithms we ran: the Q-table algorithm, the SARSA algorithm, Q-table with eligibility trace, SARSA with eligibility trace, and a Q neural network. If you want to find out more about the algorithms, you can look them up in the references. My time is getting short. You can see that all the algorithms are able to train, and the interesting thing is that with the eligibility traces, for both Q-table and SARSA, the learning is faster: they reach a win rate of one more quickly. The neural network seems to be the slowest.

This shows the training time and the number of episodes required to train each algorithm. The first four algorithms train quickly, but the neural network is harder; one thing about the neural network is that it is much more processor-intensive. It does work, though: we can see that it trains and functions well. As I mentioned just now about the reward: after initially going into negative territory, continued attempts yield positive results once the exit has been found.

These are the results for the 16-by-16, eight-degree-of-freedom platform. I have a virtual platform, an emulated platform and a physical platform; this run was done on the emulated platform. Because the neural network is very processor- and time-intensive, I did not include its results here. We can see that all the algorithms work on the platform with 16 by 16 cells and eight degrees of freedom. Over here we have the timing and other details as well; the episode limit here is 1,000, and I can increase the limit so that we can see the details at a finer granularity.

So, discussion: the considerations between the virtual, emulated and physical platforms. For the virtual platform, the speed is basically determined by the power of the CPU and GPU; if we disable the graphics rendering, it can be even faster. But for the emulated platform, and certainly the physical platform, there is physics to consider: the ball can't roll instantly, and it depends on the tilt, so the achievable time is about one second per action. The second consideration is the size of the maze, because of the multiplication of cells: 8 by 8 gives 64 cells, but 16 by 16 gives 256, and we haven't even considered the degrees of freedom; we still have to account for the eight possible movements of the agent, and we also have to think about the tilt angle. One more thing: the agent in the physical environment can roll across multiple cells. In the original virtual platform the agent can only move to the next cell, but on the physical platform it can actually jump or roll several cells, or roll straight into the exit. Because of this, a powerful algorithm like the neural network might be able to figure out shortcuts, which is very interesting and something to be investigated later.

So quickly, why hardware? Hardware is not 100% predictable. If we want to actually test the robustness of the algorithms, we should test them on physical hardware. If you try to model all the intricacies, it takes a lot of work, and even then it may not be accurate. With this setup we can quickly swap hardware platforms under the same algorithm to test the robustness of the algorithm.
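For reference, the tabular updates behind the two basic algorithms compared above look roughly like this. This is a generic textbook sketch, not the code used in the talk; the learning rate, discount factor and the small movement penalty in the usage line are assumed values:

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Q-table update (off-policy): bootstrap on the best action in the next state."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """SARSA update (on-policy): bootstrap on the action the policy actually took next."""
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])

# Tiny usage example with hypothetical cells and a -0.05 movement penalty.
Q = defaultdict(float)   # maps (cell, action) -> estimated value
actions = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
q_learning_update(Q, (5, 5), "NE", -0.05, (4, 6), actions)
```

The eligibility-trace variants add a decaying trace over recently visited state-action pairs, so a reward propagates back over several steps at once, which is consistent with the faster learning observed above.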
Then why not use a camera? Cameras have very high resolution and are very commonly used nowadays. But a camera requires more processing power; that's one thing. The second thing is that it is sensitive to lighting and requires high contrast and careful alignment. If the platform is tilting, the camera has to stay aligned with the platform, so the whole assembly has to move. Because of that, we would have to think about a fixture to mount the camera above the platform, high enough to capture a good view without shadows from the walls, and such a fixture also blocks easy observation of the stage by the viewers.

Hello, Kelvin. Yes? Yeah, Kelvin, we are over time actually. Oh, okay, okay. I'm so sorry about that. For my talk, I will provide the link, or I think PyCon has it already.