First on the list, we have Aaron Friedman, who just finished his Master's in Computer Science in the field of computational geometry. He is a Python developer working at Fabric, formerly known as Common Sense Robotics, involved in system architecture and development, and he's going to be talking about boosting simulation performance with Python. So, Aaron, thank you for joining us.

Thank you for having me. Thank you for the introduction.

So where are you streaming from?

Sorry?

Where are you streaming from?

Oh, okay, I'm from Tel Aviv, Israel.

Nice. Excellent. How's the weather there?

Really hot.

Really hot. It's cool, yeah. Excellent. Well, I will turn this over to you, then.

Okay. Thank you very much. Let me share the screen.

Okay, so hi again, everyone, and thank you for being here. You're probably here because you run some kind of simulation or integration test at your work. Now, how many of you would like to spend less time waiting for them to finish, and have more time for coding, or for fixing bugs if you write any? So I'm glad you're here. Today you will see how you can use the discrete event simulation approach to simulate your system, and how it allows you to simulate hours of your system in minutes or even in seconds.

So before I talk about how we run our simulation, let me tell you what we do at Fabric and what we simulate. Before I show you the video, I need to switch the sharing mode. Okay. We build fulfillment warehouses for online orders. Most of the work is done by robots. We have two types of robots. The first type is called the lift robot. You can see it now in the video. The second type is called the ground robot, which moves on the floor. Together they cooperate and help us fulfill the orders. It works like this: the lift robot takes totes from the shelving units and puts the tote on the ground robot.
The ground robot brings the tote to the picking stations, where the items are picked and later delivered to the customers.

So my name is Aaron. I have worked at Fabric for about four years. Now I mainly focus on the development of this cute robot, but before that I was involved in different areas of the system. One of them is the simulation infrastructure, which I will present to you today. We'll start by seeing why simulations are so important. Then we'll see how to use the discrete event simulation approach and how to do it in Python. Then I'll talk about some challenges we encountered and how we dealt with them. And finally, how to turn the multi-threaded simulation into a distributed, multi-process simulation.

So first, what exactly do we simulate? Usually the term simulation means a tool that imitates the behavior of a system. In our case, that is not exactly so. Let's take a look at this very simplified view of our system. We have the backend, which is pure software. It manages the activity of the system: the orders from clients, the stock, the motion of the robots. It sends commands to the robots and receives telemetry back from them. In the simulation tool that I will talk about, we simulate only the robots. We run the backend just as it runs in production, but instead of communicating with real robots, it communicates with virtual robots. This decoupling of software and hardware is extremely important today, when we all work from home due to the coronavirus and access to the hardware is very limited.

This tool has several more uses. First, it is used as a testing tool: when developers write new code, as long as the code doesn't run on the robot itself, simulation is one of the options for testing it. It is also used as part of our regression testing in the continuous integration system. Also, in a complex system, it's difficult to know how a new algorithm or optimization will affect the KPIs of the system.
So this is the place to evaluate it before running it in production. Again, robots and hardware are very limited and very expensive. This decoupling of software and hardware allows us to run as many simulations as we need in the cloud. We use this tool to evaluate a new warehouse before investing money in construction. We can run the system on a new layout and see what KPIs we can reach and how many robots are needed to reach those KPIs, for example. In simulation it is also very easy to inject failures into the robots, and by that to improve the reliability and robustness of the system. We also have an integration lab in our offices where we can test the code with the robots, but it's not as big as a production warehouse. So simulation is the only place where you can run on big setups before running in production.

We saw what we simulate and what uses we make of this simulation tool. Now let's talk about how we run the simulation. The approach we are using is called discrete event simulation. In this approach, continuous operations are modeled by instantaneous events. For example, if we want to simulate an elevator, then the events can be door opened, elevator arrived, button pressed and so on. The simulation also maintains its own clock, and it immediately moves from one event to the next event, and that's how time can run faster than real time.

In our simulation, we do it a little bit differently. We treat the time as the events. We divide the time into time ticks, and at each time tick we calculate the new state of the robots. We simulate the operations of the robot, which are moving, turning, and passing a tote from one robot to another. Let's take an example. Let's look at the move operation of the robot. Let's say that the robot can move at two meters per second and we choose to have 10 time ticks in a second.
So at first the robot is located at x = 0, and assuming it got a move operation, the next time tick will be at 0.1 seconds and the robot will calculate its new state, which is 20 centimeters. Then the next time tick will be at 0.2 and the robot will calculate the new location, which is 40 centimeters. Notice that in this approach, the robot was never between 20 and 40 centimeters. It immediately moves from one state to the next. In reality, the robot sends telemetry to our backend a few times a second anyway. So the behavior looks the same to the backend, which is discrete anyway, and we don't lose any information by doing so. So this is the idea of discrete event simulation.

Now, to implement it in Python, we use the SimPy library. SimPy is an open source library. It is a framework for discrete event simulation. It's very simple and well documented, and there are a lot of samples on the web. It is also lightweight. I mean that it doesn't try to help you simulate your components. It just gives you the framework to implement the discrete event simulation.

Before we see how to do it in the code, let's understand the idea of SimPy. To understand SimPy, you need to be familiar with three objects. The first one is the environment, the second one is the process, and the third one is the event. The environment is the main object that manages the whole simulation. It has the simulation clock and it has an event queue. A process represents a component we want to simulate. So in this example we have two processes, one for robot zero and one for robot one. At first, each process adds its initial event to the queue. So we have two events, one for robot zero and one for robot one. When we start the environment, it takes the first event from the queue and processes it. So it calculates the new state of the robot. And before it is done, it adds the next event of that robot to the queue.
And then again it takes the first event from the queue, which now belongs to robot one. Again it calculates the new state and adds the next event of robot one to the queue. And then again it will take the next event from the queue. This time it again belongs to robot zero, but now it is at time 0.1. So it will update the clock to 0.1. So this is the very basic idea of how SimPy works.

Now let's see how to do it in the code. In the code you'll see a very simple example of using SimPy. In this example we'll conduct a race of robots. Let's say that a robot can move somewhere between two and four meters per second. Okay, so let's go over the code and then we'll also run it. Here we define that we have three robots in the race. The race is going to last for 30 seconds, and we choose to have two time ticks in a second. So we'll have a time tick every half a second. Here we implement a very simple robot; we only implement the move operation. You can see it is a Python generator. At first the location of every robot is zero. Each iteration of the while loop is a time tick, in which it calculates the new location. To make it interesting, I use the random function. Notice that I pass one and two to the function, because we said that the robot moves between two and four meters per second and we have two time ticks in a second. So it is one to two meters in half a second. The robot will print the simulation time, the robot ID and the new location. And it will tell the environment that the next time it wants to run is in half a second. Here we initialize our environment, we register the SimPy processes in the environment, and run the environment.

Let's now run the race. Remember that the race is 30 seconds long, and of course it is not going to take 30 seconds. I use the time command, which will print the time it took the program to run. As you can see, it lasted less than one second. So I saved each of you about 29 seconds.
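The code shown in the talk is not part of this transcript. As a rough illustration of the environment / process / event-queue idea and of the robot race, here is a minimal sketch that uses only the standard library rather than SimPy. `MiniEnvironment`, `robot` and `run_race` are hypothetical names invented for this sketch, and the seeded random generator is an assumption made so the run is repeatable.

```python
import heapq
import random
from itertools import count

NUM_ROBOTS = 3
RACE_SECONDS = 30
TICKS_PER_SECOND = 2
TICK = 1 / TICKS_PER_SECOND  # half a second of simulated time


class MiniEnvironment:
    """Toy stand-in for a DES environment: a clock plus an event queue."""

    def __init__(self, initial_time=0.0):
        self.now = initial_time
        self._queue = []      # heap of (time, sequence number, process)
        self._seq = count()   # tie-breaker: same-time events keep insertion order

    def process(self, generator):
        # Register a process: its first event is scheduled at the current time.
        heapq.heappush(self._queue, (self.now, next(self._seq), generator))

    def run(self, until):
        while self._queue and self._queue[0][0] < until:
            time, _, proc = heapq.heappop(self._queue)
            self.now = time             # jump straight to the event, no waiting
            try:
                delay = next(proc)      # run one time tick of this process
            except StopIteration:
                continue                # process finished, nothing to reschedule
            heapq.heappush(self._queue, (time + delay, next(self._seq), proc))
        self.now = until


def robot(robot_id, rng, positions):
    """A robot that only implements the move operation, as a generator."""
    location = 0.0
    while True:
        # 2 to 4 m/s at two ticks per second is 1 to 2 m per tick.
        location += rng.uniform(1, 2)
        positions[robot_id] = location
        yield TICK                      # ask to run again half a second later


def run_race(seed=0):
    rng = random.Random(seed)
    env = MiniEnvironment()
    positions = {}
    for robot_id in range(NUM_ROBOTS):
        env.process(robot(robot_id, rng, positions))
    env.run(until=RACE_SECONDS)
    return env.now, positions


if __name__ == "__main__":
    end_time, final_positions = run_race()
    print(f"simulated {end_time} s:", final_positions)
```

In SimPy itself the corresponding pieces are `simpy.Environment()`, `env.process(...)`, `yield env.timeout(...)` and `env.run(until=...)`. The sketch only mimics their roles to show why 30 simulated seconds finish almost instantly: the loop never sleeps, it just advances the clock to the next event.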
Okay, so an important point to be aware of is that all the SimPy processes of the components we simulate run in the same thread. As you could see, it uses Python generators and the environment runs one event at a time. I will talk about it again in a few slides.

So the parameters that affect the runtime of the simulation are, obviously, the number of components: the more components we simulate, the more calculations we have to do, and therefore the slower the simulation. And also the time-tick granularity: the finer the granularity, the more calculations per second, and again the slower the simulation.

The mode I described so far is called as fast as possible. It means that the simulation tries to run as fast as it can and immediately moves from one event to the next event. SimPy also allows you to run in a real-time mode, where it tries to follow real time: it will run an event, and before moving to the next event it waits until the time of the next event comes. Now, why would we want to do this? I mean, at first thought you would think that we always want to run as fast as we can. But you may want to do some manual testing in your system, like REST calls or whatever, or you may also want to combine real hardware with your simulation. So these are good reasons to run in real-time mode.

Adopting the discrete event simulation approach has several benefits. The most obvious one is that it makes development more efficient. When a developer finishes writing code and tests it, they get faster feedback. And you also get a shorter CI. But as you can imagine from the previous slide, when I talked about the parameters that affect the runtime, if we run the simulation with many, many robots, then the simulation may even run slower than real time. But this is still an advantage, because that way the results of the simulation, the KPIs, will still be realistic.
I mean that in every time tick all the robots will get the chance to do their calculations and calculate the new state. And that's how the time will not run too fast. And for the same reason it is also deterministic. It doesn't matter if you run it on a strong machine in the cloud or on your private laptop. The runtime of the simulation may be affected, but the result will be deterministic. It is also agnostic to profiling and debugging: we will still get the same results. Last, using this approach also makes it very easy to simulate any date or any time of the day. You can run the system as if it is the weekend or any special time that is interesting for you. In SimPy, for example, you just need to provide the initial time to the environment, and it is that simple. We wouldn't have panicked before the millennium bug if we had had this approach.

Okay, so far I showed you how to simulate robots using discrete event simulation. But recall that at the beginning I mentioned that we want to run our system, our backend, together with the simulated robots. Our backend is a multi-threaded system. It has several threads that get messages and react to them. The messages can be telemetry from robots, inputs from the user, orders from customers and such. Can you think what the problem is with running the backend together with the robots? The problem is that the robots may run the time too fast, and the backend wouldn't have the time to do its work like it would in real time.

SimPy has support for event-driven processes. But as I mentioned before, all the SimPy processes run in a single thread. So it would change the behavior of our backend, and we already had a similar experience when we used the gevent monkey patch, which makes your threads cooperative and runs the system as if it were one thread.
It did improve the performance of the system, but later we found out that we had some bugs that we couldn't see in simulation. Therefore, SimPy's solution for event-driven processes is not good enough for us. So we came up with our own solution. In simulation, we create another SimPy process which, in every time tick, holds the time until the event-driven threads have done their work. It does this by calling join on the thread's queue; the join function waits until the queue is empty. And that's how you make sure that the event-driven thread has the time to handle the events.

Let's see an example in the code. In this example, we'll have one event-driven thread which listens to a queue, gets a message and prints it to the screen, and a robot which in every time tick sends a message to the event-driven thread. So let's go over the code. This time we'll have one time tick every second. For now, I'll just show you the problem, so ignore this class. We'll see it later. This is a simple implementation of an event-driven thread. What it does is listen to the queue and print the message to the screen, and that's it. Here we implement another simple robot. In each iteration, it adds a message to the event-driven thread's queue, increases the counter, and tells the environment that it will run again in one second. So again, we initialize the environment. Now it's time to see the problem. We use the regular Python queue. We start the event-driven thread. We register the robot in the environment and run it.

So let's run it. Just remember, we are going to run the simulation for 50 seconds. And in each second, the robot will send a message to the event-driven thread, which should print it to the screen. So run it. We don't see any message on the screen, and this is exactly the problem that I described. The robot did send 50 messages, but the event-driven thread didn't have the time to handle them.
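The join-based idea described above (a simulation-side step that holds the time until the event-driven thread has drained its queue) might look roughly like the following. This is a sketch under assumptions, not the code from the talk: it uses a plain tick loop instead of a SimPy process, and `SimClock`, `event_driven_worker` and `run_simulation` are hypothetical names.

```python
import queue
import threading

SIM_SECONDS = 50  # one time tick per simulated second


class SimClock:
    """Hypothetical shared simulation clock (not part of any library)."""

    def __init__(self):
        self.now = 0


def event_driven_worker(inbox, clock, handled):
    # Backend-style thread: blocks on its queue and reacts to each message.
    while True:
        message = inbox.get()
        if message is None:        # sentinel: stop the thread
            inbox.task_done()
            break
        # Record the tick the message was sent at and the tick it was seen at.
        handled.append((message, clock.now))
        inbox.task_done()          # this is what lets inbox.join() return


def run_simulation(hold_time):
    inbox = queue.Queue()
    clock = SimClock()
    handled = []
    worker = threading.Thread(
        target=event_driven_worker, args=(inbox, clock, handled)
    )
    worker.start()

    for tick in range(SIM_SECONDS):
        clock.now = tick
        inbox.put(tick)            # the "robot" sends one message per tick
        if hold_time:
            inbox.join()           # hold the clock until the queue is drained

    inbox.put(None)
    worker.join()
    return handled


if __name__ == "__main__":
    synced = run_simulation(hold_time=True)
    print("late messages with join:", sum(sent != seen for sent, seen in synced))
    free = run_simulation(hold_time=False)
    print("late messages without join:", sum(sent != seen for sent, seen in free))
```

With `hold_time=True`, every message is handled within the tick in which it was sent, because `Queue.join()` blocks until `task_done()` has been called for each queued item. Without it, the tick loop can race ahead of the worker, so messages may be seen at a later simulated time; how many depends on thread scheduling.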
So let's see how we solve it. We inherit from the Python queue, and in simulation we add another SimPy process. In each time tick, it calls join on the queue. And then it again tells the environment that it will run again in one second, in the next time tick. So let's use this queue this time, and run the example again. As you can see, it solved the problem. The use of join gives the event-driven thread the time to handle the messages.

As you could see in the example, the event-driven thread is not really aware of SimPy. And that's what I meant when I said that we run our backend in simulation just as it runs in production. It isn't aware of whether it is in production or in simulation. With the exception that the backend cannot call the default time functionality, because in simulation it is not relevant, right? So we wrapped all that functionality in our own module, and the backend just calls this module. This module knows whether it is simulation or production and calls the right functions. Last, in simulation we print the simulation time in the log, because when you are debugging the simulation, you care more about the simulation time.

Now, eventually, we also moved to microservices, just like everyone else. And again, we wanted the simulation to run just like the system runs in production. So this time we don't want a multi-threaded simulation; we want to distribute the simulation. We said that SimPy doesn't support a multi-threaded simulation, so for sure it doesn't support a multi-process simulation. So we came up with this solution. In simulation, we run another service called the barrier server. The responsibility of this service is to sync the time of the other services, to prevent one service from running faster than the others. All the other services look the same as I described so far, the same as the multi-threaded simulation.
Each one of them has its own local SimPy environment, and all of them pick a shared time tick. It works like this. At the beginning, they initialize SimPy and each service does its work. Once the shared time tick arrives, they send a ready message to the barrier service. The barrier service holds that message until it gets the message from all the other services. Once it has got them all, it sends them the approval, and then they can move to the next time tick. That's how we prevent one service from running faster than the other services. Notice that from the moment a service sends the ready message to the barrier server until it gets the approval, it holds the time and waits for the other services to reach the next time tick too.

So I finished the slides. I think we have a few minutes for questions, and then I'll just sum up.

Awesome. Thank you very much. So I do have a couple of... well, I have one question so far. If anyone has any questions, please post them in the Q&A here on Zoom, or you can also post them in the Parrot track chat room over on Discord, which I'm keeping an eye on as well. Anyway, so Ruud van der Ham asks: are you familiar with the other Python DES package, called salabim?

Yeah, I heard about it. I think it is quite new, maybe from 2017. So when we started, it didn't exist. But anyway, I checked it. It looks pretty similar to SimPy. I think it also uses generators, and it also has the notion of an environment, I think. But anyway, I didn't really try to use salabim, and I also didn't see many comparisons. So if someone is familiar with it, I would like to hear about it on Discord.

I also have another question, actually also from Ruud van der Ham. He asks: how does the messaging with the barrier service work?

So this depends on your implementation, on how your services communicate. In our case, we use a message queue for this. But you can do it with REST or sockets or any other messaging.

All right, excellent. So it looks like that's all I have at the moment.
So once again, thank you very much. And yeah, if anyone wants to chat with Aaron afterwards, he will be in the room for his talk, which I believe is the boosting simulation performance room on Discord. And Aaron, do you want to do a quick recap?

Yeah, of course. Thank you. So what we saw in the talk: we saw just how important simulation is, especially for a hardware company like ours. And discrete event simulation brings some more benefits with it. Again, if you want to do it in Python, you can do it with SimPy. There is also the salabim library. If you want to run the simulation together with your system, then you may suffer from a time leak. You just need to make sure that all the components are tied somehow to the clock. And then finally, the extension of the simulation into a distributed simulation was really straightforward for us. It really took us a couple of days to do it. So that's it. Thank you very much for listening.

Okay, enjoy.
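As a rough companion to the barrier server described in the talk, here is a small single-process sketch of the ready / approval handshake. It is an illustration under assumptions, not Fabric's implementation: threads stand in for the separate services, `queue.Queue` stands in for the message queue, and the service names and all function names are hypothetical.

```python
import queue
import threading

SERVICES = ["orders", "motion", "stock"]  # hypothetical service names
SHARED_TICKS = 5


def barrier_server(ready_box, approval_boxes):
    # For every shared tick: wait for "ready" from all services, then approve.
    for _ in range(SHARED_TICKS):
        waiting = set()
        while waiting != set(approval_boxes):
            waiting.add(ready_box.get())
        for box in approval_boxes.values():
            box.put("go")


def service(name, ready_box, approval_box, progress, lock, gaps):
    for tick in range(SHARED_TICKS):
        with lock:
            progress[name] = tick  # local work for this shared tick goes here
            # Record how far apart the fastest and slowest services are.
            gaps.append(max(progress.values()) - min(progress.values()))
        ready_box.put(name)        # tell the barrier server this tick is done
        approval_box.get()         # hold the local clock until approved


def run_distributed():
    ready_box = queue.Queue()
    approval_boxes = {name: queue.Queue() for name in SERVICES}
    progress = {name: 0 for name in SERVICES}
    lock = threading.Lock()
    gaps = []

    threads = [
        threading.Thread(target=barrier_server, args=(ready_box, approval_boxes))
    ]
    threads += [
        threading.Thread(
            target=service,
            args=(name, ready_box, approval_boxes[name], progress, lock, gaps),
        )
        for name in SERVICES
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return progress, max(gaps)


if __name__ == "__main__":
    final_progress, max_gap = run_distributed()
    print(final_progress, "largest tick gap:", max_gap)
```

Because a service cannot start its next shared tick before the barrier server has heard from everyone, no service ever gets more than one shared tick ahead of another. In a real deployment the two queues would be replaced by whatever transport the services already use, as mentioned in the Q&A.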