There we go. Alright. Awesome. So, welcome everyone. I'm David Vogt. I'm a researcher and PhD student at the Technische Universität Bergakademie Freiberg, which is quite a small university next to Dresden, a small mining university, but that doesn't have to say anything, right? Okay, so the title of the talk is Behavioral Generation for Interactive Virtual Humans, which is somewhat of a fancy title, but what we actually try to do is this: we have a cave environment, and in this cave environment I want to animate virtual characters so that they react to human motions. You want them to be responsive, reactive. You want people to go in there and interact with a virtual agent.

And to bootstrap the animation of a virtual character, what people usually do is imitation learning. In doing so, you create a system that allows the virtual character to observe a human being and then adapt this motion onto its own physiology, right? But over the last couple of years, this methodology has slowly shifted in a new direction, because classic imitation learning only focuses on a single agent. In order to create responsive virtual characters, they need to somehow adopt the behavior of, or at least look at, two persons simultaneously and how they interact with each other. So the question is not only what to imitate and when to imitate, but also: what are the intrinsic details of the interaction? What makes the interaction so special? What are the spatial constraints? What are the spatial relationships? What are the temporal coherences?

Our approach to this is what we call, more or less, interaction learning. We are trying to create methods that allow virtual characters, as well as other synthetic artifacts such as humanoid robots or robotic arms, to engage in two-person interactions. And what we do for that is basically a system composed of three steps. The first step is the task demonstration. We need to somehow show the system: okay, these are the interactions you are able to do with the character. We need to record those interactions. Based on that, we then compute something we call an interaction model, which captures those spatial relationships, those intrinsic fine details. And in the end, a robot or a virtual character is able to interact with the human. What I'm going to focus on is how Blender can be used to realize all those steps. Obviously, we need to do some sort of extension; we have to implement something more. That's not going to work out of the box. Throughout the talk, I want to discuss each of those boxes in more detail.

So let's start with the first one: the task demonstration, meaning that we record two-person interactions. But as you know, there are so many motion capture systems out there, so our requirement basically was to create something that can be used with multiple systems. I guess most of us are using a Kinect One or a Kinect 360 or something. But in our lab, we also have an optical ART tracking system, which is a marker-based motion capture system. The drawback of that system is that it can only record one person at a time, but I wanted to do interactions, so somehow I needed to split the motion capture between two people. What we ended up with is that each person wears eight markers in total. But you can't really record the whole body motion with all the intrinsic details from eight markers without using some sort of inverse kinematics to compute the missing joint angles.
And there is a small plug-in called Bipro IK, which is such a neat add-on. It allows you to reconstruct a character's posture based on only eight markers, or seven, depending on how many you actually have. So you can record eight markers and recompute the whole character posture in the end. The system works as follows: you basically have some sort of middleware which encapsulates a unified motion capture data stream. You can use, say, two Kinect One cameras, which is basically based on the SDK and merges the joint positions in the end. Or you can use one Kinect 360, or you can use the optical ART tracking system. The Biovision motion capture system, which is a markerless motion capture system, can be used as well. Okay. What we do is record all those two-person interactions one at a time. So one demonstration per interaction, that's it. We store everything in an SQL database, and in the end you can retrieve those motions either live or by using a small shared library. It's a shared object that Blender loads dynamically during runtime, so you can access your data.

Okay. So now we are getting more into the scientific part, the math part of the system, which is our interaction model. The requirements for the interaction model are that it needs to somehow capture spatial constraints, spatial relationships. If you imagine yourself high-fiving a virtual agent, you are for sure never going to do the same interaction the exact same way over and over again. Your hand will be at a different place, a different position, with different speeds, different velocities. You have to compensate for that in order to have the virtual character respond to those interactions. So we have a lot of constraints on the interaction model.

What we realized in the past is that human motion intrinsically lives on low-dimensional manifolds, meaning you have quite a lot of redundant joints. If I move my hand, for example, my elbow will move as well. So there is some sort of redundancy, and you can strip off the redundant information by doing dimensionality reduction. In the end we have eight markers per person, and those eight markers span a 24-dimensional space, three Cartesian coordinates each. We recompute the joint angles and the marker rotations with the inverse kinematics solver, so we still have 24 dimensions in the end. But 24 dimensions is quite a lot of data. What you can do is apply dimensionality reduction to compute a more compact, more robust low-dimensional space. In the end, you have a small space that contains 95% of the information with only three or four dimensions. This is lovely.

In this low-dimensional space, each point corresponds to a posture. And the interesting thing is that the same principles of distance apply: if two postures are similar in high-dimensional space, the points in low-dimensional space will be quite near each other. Meaning, if you have two similar postures obtained from the user, you will see them as two close points in low-dimensional space. And this is what we see here. This is the low-dimensional space of the dude wearing the orange t-shirt. All the interactions start at the upright position, so all the interactions start at one point in low-dimensional space, but from there they differ quite drastically. Okay. So a trajectory in low-dimensional space corresponds to a motion, and a point to a posture. But we can harness this further.
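As a rough illustration of this dimensionality reduction step, here is a minimal sketch in Python. The use of PCA from scikit-learn and all variable names are my own assumptions for illustration; the actual pipeline is implemented in MATLAB and may well use a different embedding method.

```python
# Minimal sketch (illustration only): project 24-D marker frames of one person
# onto a low-dimensional space that keeps ~95% of the variance.
import numpy as np
from sklearn.decomposition import PCA

# frames: one recorded demonstration, shape (n_frames, 24),
# i.e. 8 markers x 3 Cartesian coordinates per frame (hypothetical data here).
frames = np.random.rand(500, 24)

pca = PCA(n_components=0.95)          # keep enough components for 95% of the variance
low_dim = pca.fit_transform(frames)   # on real motion data this is typically 3- or 4-D

print(low_dim.shape)                  # each row is a posture, the whole array a motion
new_posture = np.random.rand(1, 24)
point = pca.transform(new_posture)    # a live posture becomes a point in this space
```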
In the end, we want to somehow learn a compact representation of the interaction. We want to use the model to classify, at runtime, which interaction the user is in. What is he actually doing right now? What are the semantics of the interaction? And this is where it becomes mathy. What we can do is compute a kernel density estimate, but the key idea here is that we compute keyframes. Those keyframes are based on a Gaussian mixture model. In the end, you have several keyframes; the trajectories are the interactions and the numbers are the keyframes. So if the user redoes an interaction, he will start moving somehow and you will see a new point appear in low-dimensional space: the current user posture. When the user starts over here, this new posture will be somewhere between keyframes six and seven. So what is the system supposed to do now? The posture is similar to different interactions, like this one, for example: it could be part of a handshake, it could be part of a high five. We don't really know. But what you can do is look at the history. What has the user done in the past? Is he moving up quite fast, like for a high five, or is he rather waiting here? And this can be seen by comparing the keyframes in the past. So if you are now at, let's say, keyframe 15 or something, and you see that the previous keyframes of that interaction have already been passed, then it clearly has to be this interaction. Same goes for keyframe five: if five is the current keyframe of an interaction, and you see the previous ones were three and nine, then you know it has to be this particular interaction.

Okay. So this knowledge, this sequence of keyframes, has to be somehow computed, and for that you can use a hidden Markov model. The hidden Markov model captures the sequences of keyframes in low-dimensional space. That's pretty much it. It allows you, during runtime, to extract the context of the interaction. It allows you to say: okay, the user is now doing this interaction, and based on the posterior state probability, it is quite sure about that.

But still, you want to animate the character in an ongoing interaction. Knowing that the user is in this particular interaction is good, but we still don't know in which part of the interaction he is. So we need to figure out not only the interaction, but also at which frame of the interaction the user currently is. And we still need to somehow optimize the character's response. If I have recorded this high-five interaction, and now during runtime I'm doing this, I require the virtual agent to be responsive to that motion as well. He needs to adapt his motion to that. Adapting those motions can be realized with something called an interaction mesh. An interaction mesh is basically a topology that observes spatial relationships; it's a net. You can compute it using Delaunay triangulation, or you can use our context-dependent interaction meshes; it doesn't really matter. In the end, you have a net, and this net gets deformed. So if we redo the interaction with the virtual character, where the right-hand side is the virtual character and the left-hand side is controlled by the user, the mesh will be different; it will be deformed. And this deformation can be optimized. Okay. So this should be it from a mathematical point of view, without any equations whatsoever.
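To make the interaction mesh idea a bit more concrete, here is a small, simplified sketch: a Delaunay mesh is built over the joints of both characters, and the virtual character's joints are re-solved so that the mesh's Laplacian (shape) coordinates stay close to the recorded reference when the user's joints move. The uniform Laplacian, the soft pinning of the user's points, and all names are my own simplification, not the actual solver or the context-dependent meshes from the talk.

```python
# Minimal sketch (illustration only, not the talk's actual solver):
# build an "interaction mesh" over the joints of both characters and
# re-solve the virtual character's joints so the mesh deforms as little
# as possible when the user's joints move.
import numpy as np
from scipy.spatial import Delaunay

def laplacian_from_edges(n, edges):
    """Uniform graph Laplacian built from the mesh edges."""
    L = np.zeros((n, n))
    for i, j in edges:
        L[i, j] -= 1.0
        L[j, i] -= 1.0
        L[i, i] += 1.0
        L[j, j] += 1.0
    return L

# Reference frame from the recorded demonstration: user joints + character joints.
user_ref = np.random.rand(8, 3)        # hypothetical 8 user markers
char_ref = np.random.rand(8, 3)        # hypothetical 8 character markers
points_ref = np.vstack([user_ref, char_ref])

# Mesh topology via Delaunay tetrahedralization of the reference frame.
tets = Delaunay(points_ref).simplices
edges = {tuple(sorted((a, b))) for t in tets for a in t for b in t if a != b}
L = laplacian_from_edges(len(points_ref), sorted(edges))
delta_ref = L @ points_ref              # Laplacian (shape-describing) coordinates

# At runtime the user's joints have moved; solve for character joints that keep
# the Laplacian coordinates as close as possible to the reference.
user_live = user_ref + 0.05 * np.random.randn(*user_ref.shape)
A = np.vstack([L, np.hstack([np.eye(8), np.zeros((8, 8))])])   # pin the user rows
b = np.vstack([delta_ref, user_live])
points_new, *_ = np.linalg.lstsq(A, b, rcond=None)
char_new = points_new[8:]               # adapted character posture
```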
What you need to know is that this part is written in MATLAB, purely MATLAB. MATLAB allows for really, really fast prototyping. There are a lot of algorithms out there from other research groups, and we wanted to explore those algorithms, we wanted to implement them, and the best way to do so is to go for MATLAB, because it allows for really, really fast prototyping. But then you start prototyping and prototyping and prototyping, you have a lot of classes, a lot of dependencies, and in the end you are still required to somehow rewrite it in C++, which is quite a tedious task. So what we did is we created MATLAB bindings for Blender, so we can use the computation power of MATLAB in Blender via a unified interface. And this is, once again, a shared object that is loaded during runtime.

Okay. So far, so good. We talked about demonstrating those two-person interactions. We talked about how to compute the model, meaning the interaction model composed of the keyframe representation, the hidden Markov model, and the interaction meshes. We have one interaction mesh per frame, so there are thousands of them, and we need to figure out during runtime which one it actually is. And this brings us to the question of what actually happens during runtime.

During runtime, we capture the user's motion once again. As I said before, we project the user's motion into the low-dimensional space and figure out which keyframes are active. We compute the posterior state probability of the hidden Markov model, and the hidden Markov model tells us: okay, this is obviously this interaction. Then this interaction needs to be analyzed further. In the end, we have a pointer to a data set, a motion capture frame from the initial recording that resembles the current situation temporally, contextually, and spatially. This is, in the end, the posture of the virtual character that will be optimized to the new situation. The problem is that this has to be done at 30 frames per second, so there is a lot of computation going on in the background. But now we know what the posture of the user is and what the posture of the virtual character will be, since for each motion capture frame there are character postures for the user as well as for the character, because we recorded the two-person interactions simultaneously.

So how can we actually interact with the agent? We now have a model, we have a process for computing ongoing interactions, we have a model for classifying those interactions. We wanted to create a system that allows you to interact with the virtual character in different scenarios, meaning that you have, for example, a normal laptop computer, a big wall, a head-mounted display, or a cave. Regardless of what you're using, you can still use the same interface. This is possible due to something we call a character pose streaming library, which is a fancy word for broadcasting the final character postures in the end. So, as you see, we haven't used Blender for visualization at all at this point, and we will not use it for that. Blender is purely there for managing data, managing time, which is very important, and managing the broadcast and the connections to all our different servers. So now we are in Blender during runtime: the user does an interaction, does something, we compute against the interaction model, and get the final character posture in Blender.
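To give a flavour of the posterior state probability computation mentioned above, here is a minimal hidden Markov model filtering sketch. The number of states, the transition and emission models, and all values are made up for illustration and are not the talk's actual model.

```python
# Minimal sketch (illustration only): the kind of forward/filtering update a
# hidden Markov model uses to decide, at runtime, which keyframe (and hence
# which interaction) the user is currently in. All numbers here are made up.
import numpy as np

n_states = 4                                   # keyframes as HMM states (hypothetical)
A = np.full((n_states, n_states), 0.05)        # transition probabilities
np.fill_diagonal(A, 0.85)
A[np.arange(n_states - 1), np.arange(1, n_states)] = 0.10   # favour moving forward
A /= A.sum(axis=1, keepdims=True)
pi = np.full(n_states, 1.0 / n_states)         # uniform initial state distribution

def emission_prob(point, means, sigma=0.5):
    """Likelihood of a low-dimensional posture under each keyframe (Gaussian)."""
    d2 = np.sum((means - point) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

keyframe_means = np.random.rand(n_states, 3)   # keyframes in the 3-D latent space
belief = pi.copy()

# Each new frame: predict with the transition model, correct with the emission,
# and renormalise. The posterior tells us which keyframe is most likely right now.
for _ in range(30):
    point = np.random.rand(3)                  # live posture projected to latent space
    belief = (belief @ A) * emission_prob(point, keyframe_means)
    belief /= belief.sum()

print("most likely keyframe:", int(belief.argmax()))
```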
Now the inverse kinematics server comes into play and solves again, since we are only optimizing those eight markers; from them, we reconstruct the final posture. And this posture is then streamed to whatever output device you want. I thought it would be interesting to show you our cave environment rather than a normal display or a head-mounted display, but in the end it's the same. So how can we stream this data to a cave environment? Well, our cave environment is quite big, because we have 24 back-projection televisions and one ground projection for the floor. If you add up all those full HD TVs, including the ground projection, you end up with about 50 megapixels, and this is a huge amount of information you have to calculate. We do this with quite decent hardware, I guess. There are more computers running in the back; there is at least one more for motion capture, our ART server, or my PC if you're using a normal Kinect. And Blender is once again the central part. Blender runs on my computer. The interaction model can be computed on a supercomputer, or on my computer as well, depending on how long you want to wait for the model to be computed. But once it is computed, you can just copy the binary and that's it. Okay. And we've got the motion capture server, and Blender is basically managing everything in between. So right now, we are streaming into our cave environment, which basically means broadcasting, at 30 frames per second, a 96-degree-of-freedom motion capture data stream, all the joint angles for the virtual character, times two, since we have two characters: an agent for the user, the user agent, and the virtual character you are interacting with.

So what does it actually look like in the end? Okay. What you see next are two scenarios. The first scenario is more of a fighting scenario, I guess, and the second one is more casual: high fives, shaking hands and such. So we have the cave environment up and running, the interaction model is computed, and I wear the fancy motion capture suit, or the small markers. And then you can start interacting. Depending on how I move, how fast I move, and where I move, the virtual character will respond. In this example, we have, I think, six or seven interactions in total, and you can basically decide which interaction you want to do. So there are no real semantics going on; you can do whatever you want. The biggest drawback we have at this point is the latency. We have a latency of about 350 milliseconds, which is not fast enough for real-time applications. We are running at at least 25 frames per second, depending on the inverse kinematics server, since the inverse kinematics server accounts for at least 45 percent of the whole computation time.

Okay. What is going to be interesting next is the user high-fiving the virtual agent again. Depending on where he is moving, the virtual character will respond. So if he's high-fiving the virtual agent a bit higher or a bit lower, the virtual character can still respond to those interactions. This is due to the fact that we use the interaction mesh at frame level. So the interaction itself, meaning whether it is a high five or a handshake, depends on the hidden Markov model, but the optimization of the final posture is purely dependent on the interaction mesh.

Okay. To be honest, I'm not a computer graphics guy; I'm a roboticist. So we also have a robotics lab, and our robotics lab is actually quite new.
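For a rough idea of what broadcasting the final character postures could look like, here is a hypothetical sketch. The use of UDP, the port number, and the packing format are my assumptions, not the actual character pose streaming library.

```python
# Hypothetical sketch (UDP, port, and packing format are assumptions, not the
# actual streaming library): broadcast two characters' joint angles,
# 96 degrees of freedom each, at 30 frames per second.
import socket
import struct
import time

BROADCAST_ADDR = ("255.255.255.255", 9000)     # hypothetical port
DOF_PER_CHARACTER = 96
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)

def send_frame(frame_id, user_pose, character_pose):
    """Pack frame id + 2 x 96 joint angles (floats) and broadcast them."""
    payload = struct.pack(
        "<I%df" % (2 * DOF_PER_CHARACTER),
        frame_id, *(list(user_pose) + list(character_pose))
    )
    sock.sendto(payload, BROADCAST_ADDR)

frame_id = 0
while frame_id < 90:                           # ~3 seconds of demo frames
    user_pose = [0.0] * DOF_PER_CHARACTER      # would come from the mocap/IK step
    character_pose = [0.0] * DOF_PER_CHARACTER # would come from the interaction model
    send_frame(frame_id, user_pose, character_pose)
    frame_id += 1
    time.sleep(1.0 / 30.0)                     # 30 frames per second
```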
What we bought recently is one of those UR5 robotic arms. And interestingly, you can use the same motion capture framework, the same Blender instance, the same interaction model in the end. The only things you actually have to change are the inverse kinematics server and the streaming capabilities, meaning that in the end you're not streaming to a virtual environment, you're streaming to a robot. Sure, there's more involved in that: you have to compute joint velocities, joint torques, you have impedance control, torque control; there's way more going on. But in the end, the final posture is computed in exactly the same way.

Okay. Interestingly, you can interact with the robot the same way as with the virtual character. So, once again, the user's motion is captured. These are the interactions you can do with the robot. Now we have some sort of semantics: first you place the object on a stand, then you assemble a tube, part of the tube frame, and then you assemble the final frame. And those interactions can be redone with the robot as well. We can operate the robot faster, but sometimes you have, I don't know, glitches in the inverse kinematics solver, and imagine yourself having a big robot throwing around a big box. So let's play it safe. Okay. So, depending on where I am moving, the robot will move as well. But interestingly, despite not having explicit knowledge of the scene, the robot can still recover from false interaction attempts. If I'm going to grab this object and the hidden Markov model says, oh no, this is not this interaction, this is that interaction, and the robot moves somewhat weirdly, I can just redo the interaction. That's it. That triggers the hidden Markov model once again, and in doing so the robot can recover from those false interaction attempts, and I can continue assembling my things. Which is quite a boring example here, but anyhow.

Okay. What you're going to see next is the interaction mesh once again. So this is the keyframe, and depending on where I move, the robot will follow and continue its motion. And this is, I think, quite cool, because usually you have a robot in a turn-taking scenario: I do something, the robot does something, I do something, the robot does something. What you have here now is a continuous optimization.
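To illustrate the design point that only the inverse kinematics server and the streaming target change between the virtual character and the robot, here is a hypothetical sketch of such a swappable output backend; all class and method names are my own, not the actual system's interfaces.

```python
# Hypothetical sketch of the design point above: the pipeline stays the same and
# only the output backend (virtual character vs. robot) is swapped. All names
# here are my own illustration, not the actual system's interfaces.
from abc import ABC, abstractmethod

class PoseSink(ABC):
    """Anything that can receive the optimized posture for the current frame."""
    @abstractmethod
    def send(self, joint_angles: list[float]) -> None: ...

class CaveCharacterSink(PoseSink):
    def send(self, joint_angles: list[float]) -> None:
        # would broadcast the posture to the cave rendering servers
        print("stream to cave:", len(joint_angles), "DoF")

class RobotArmSink(PoseSink):
    def send(self, joint_angles: list[float]) -> None:
        # would run the robot-specific IK / velocity / torque pipeline first
        print("send to robot controller:", len(joint_angles), "DoF")

def runtime_step(sink: PoseSink) -> None:
    # capture -> project -> HMM posterior -> interaction mesh optimization ...
    optimized_posture = [0.0] * 96          # placeholder for the optimized frame
    sink.send(optimized_posture)

runtime_step(CaveCharacterSink())           # same pipeline, different target
runtime_step(RobotArmSink())
```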