On stage now is David Kim; he will talk about Kinect Fusion. We will have a Q&A session at the end of the talk, with about 10 to 15 minutes reserved for it, so please prepare questions. We also have a signal angel in the room, so the people watching the streams are welcome to ask questions via IRC. So please give it up for David Kim.

Hi, I'm David Kim. I'm a PhD student at Newcastle University in the UK, sponsored by Microsoft Research, where I'm basically doing my PhD research. Today I'm going to present and demonstrate Kinect Fusion. Maybe some of you have seen it on YouTube or at a conference. Kinect Fusion is a real-time 3D scanning system that uses just a Kinect and allows us to capture static and dynamic scenes with a $100 Kinect camera, without modifying the hardware or requiring any external infrastructure. I'm also going to demonstrate how this system can be used for dynamic uses, for example to detect multi-touch user interaction on any surface and to enable augmented reality applications.

Before I jump to the first demonstration of the system, I would like to briefly describe the motivation for this work. In November last year, Microsoft released the Kinect depth camera for the Xbox for $100, where comparable depth cameras previously cost around $10,000. Our group, based in Cambridge, was really excited about this technology because it was the first time a depth camera was available to anyone. So we ended up building prototypes for 3D input and augmented reality displays over the last 18 months, and I'll hopefully be able to present some of them at the end of the talk if we have some time left.

Now to the real motivation for Kinect Fusion. A few members of our group have been working on augmented reality projects where a handheld pico projector should enable user interaction in any environment. But much of the previous work using pico projectors relied on external trackers, such as Vicon trackers, and also required the user to specify the room geometry in order to truly interact with the surfaces. So we looked into tracking solutions in robotics which didn't rely on infrastructure and which could simultaneously localize and map the environment just by using a laser range scanner and a video camera; these systems are called SLAM. But the solutions we found were either too slow or not precise enough, and the geometry was quite rough. At some point one of us suggested using a Kinect camera instead of a laser rangefinder or a video camera, and we found the algorithms we needed to build up a 3D model from a Kinect camera. But we also had problems with computation speed: we had to handle so much data within a couple of milliseconds that the CPU couldn't keep up. To cut a long story short, we ended up using very fast gamer graphics cards and offloading all the computation onto the roughly 500 cores of a graphics card, and then we achieved real-time speeds. And yeah, these are the pictures I forgot to show: this is the initial prototype of the pico projector, which relied on external trackers, and we wanted to get the result shown at the bottom right. So let's just dive into the actual demo. I'll show you the demo first, and then we can talk about the more specific parts of the implementation. Let me quickly switch to the demo. Okay.
The window you see on the top left is the live depth image. Bright pixels indicate points that are closer to the camera, and darker pixels indicate points that are farther away. As you can see, the depth image is quite noisy: there are lots of fluctuations and lots of holes. It also doesn't remember what it has seen; it only sees the current view and doesn't build up a 3D model. What you see at the bottom is the output of our system, the 3D reconstruction. Let me switch to another view.

In our system, we compare the previous depth image with the current one, see how they differ, and calculate the offset transformation between the previous frame and the current frame. We are then able to track the camera's position without any external infrastructure. So when I rotate the camera, you see the yellow frustum rotate with it. When I move it to another side, it always knows where it is relative to the model. And when I sweep the camera, I accumulate more and more data into this volumetric model. It's truly 360 degrees: I can walk around, capture something from behind, and later navigate within this 3D model and dive in. You can see that in this reconstruction the holes are filled, and the more data I integrate, the more detail I get out of the system. Yeah, sure. Wait. Yeah, I'm just struggling with our implementation of the virtual camera. But yeah. Let me get one of you up and scan in a face. Who would like to volunteer to come up? I think the lady should come up, please, come on. Could you come here? Please hold still and choose a comfortable pose. I'll now start to scan you in. So I'll move this away. You can now move freely and look at yourself on the screen. Thank you. Thank you very much. So we have captured her, and she's still visible here in this model. Another benefit of the system is that it also reacts to changes in the scene: when I move this cup to another place, it will slowly disappear from where it was originally. Yeah, and it appears back here. Okay. That was the first demo; I'll jump straight back to the slides.

Before I go into describing all the components, let me briefly describe how Kinect actually works. Some people thought that Kinect is based on time of flight, i.e. that it projects infrared light into the environment and measures the time the light takes to come back. But it's actually much simpler: Kinect is based on a stereo matching algorithm. As you may know, using two video cameras you can reconstruct 3D models, and Kinect works the same way. There's an infrared laser projector which projects a speckle pattern into the environment; maybe some of you have already seen what it looks like. This is what the Kinect camera sees. The Kinect has the infrared laser projector and a color RGB camera, which doesn't do anything for the depth calculation; it's just for augmenting the 3D graphics with color. The IR camera is the camera which actually picks up the dot pattern. So you might think: how is this stereo when it only uses a single infrared camera? The device uses a quite clever trick. The infrared laser projects a random but fixed pattern, so it is like having a camera looking through the lens of the laser projector: a camera sitting where the laser projector is would see exactly the same pattern at any time.
So we can just get rid of the second infrared camera: the infrared laser projector effectively becomes the second camera. So it's quite simple, and the depth calculation happens on the device. Okay, that was a quick introduction to how Kinect works.

I'll now briefly cover some work related to Kinect Fusion, then go through the main system components and describe how they conceptually work, and at the end we have some time reserved to discuss open questions and ideas. When we first saw the results of Kinect Fusion, we realized that a gaming device from the living room could actually compete with a $50,000 handheld 3D scanner. This is some related work; it is slightly more precise, but we found a good tradeoff between speed and quality. There is also a considerable amount of work in the area of SLAM, as I said: simultaneous localization and mapping, in robotics and computer vision. There, a single camera is used to extract features from the motion in the image; the system tracks these features, creates a really sparse point cloud of the scene, and uses it to track the camera within an environment. But these works focus on the tracking aspect of the camera or the robot; they don't really create compelling reconstructions of the scene. Another body of work is in the field of computer graphics, for example using high-quality laser range scanners, light stages, or multiple photographs. But these rely on heavy infrastructure and are not suited for online use; the processing happens offline over a couple of hours, although they work at larger scales and higher accuracy than ours. So Kinect Fusion lies somewhere between the SLAM systems, which are good at tracking but not really good at reconstructing surfaces, and these other works, which are really good at reconstructing surfaces but can't run in real time.

These are the core components of Kinect Fusion. When we get depth data from the Kinect camera, we first project the depth points into 3D space, so the depth measurements become 3D point clouds in our system. We also calculate the surface orientation and store it in a normal map. Then, based on the point clouds of the current image frame and the previous image frame, we compute the relative spatial offset between the frames. Once we know the six-degrees-of-freedom pose of the camera, we can integrate the depth data into a global volumetric data structure. This data structure doesn't only contain the 3D point cloud of my current view; it memorizes everything it has seen previously. Then, when we want to render the model stored in the volumetric grid, we raycast through the volume and create a rendering. We also create a synthetic depth map, which is used to stabilize the tracking. And as a side product of the camera tracking, we also get outliers: points which lie too far apart and which would otherwise degrade the tracking. We use these points later to sense touch, human touch on surfaces, and similar things. I'll come back to these later.

Before I really dive into describing all the system components, a couple of words about the GPU implementation and speed. A CPU in a PC usually has one to six cores, so dual core or quad core, and they run really fast, around 3 gigahertz per core.
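As an aside, here is a minimal sketch of the first pipeline step mentioned above: back-projecting the depth pixels into a 3D point cloud (a vertex map) and estimating a normal map. This is not the actual CUDA/C# implementation; the intrinsics and function names are illustrative assumptions.

```python
import numpy as np

# Illustrative pinhole intrinsics for a 640x480 Kinect-style depth image.
# Real systems use calibrated values; these numbers are assumptions for the sketch.
FX, FY, CX, CY = 585.0, 585.0, 320.0, 240.0

def backproject(depth_m):
    """Turn a depth image (in metres) into one 3D point per pixel (a vertex map)."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - CX) / FX * depth_m
    y = (v - CY) / FY * depth_m
    return np.dstack((x, y, depth_m))            # shape (h, w, 3)

def estimate_normals(verts):
    """Estimate a surface normal per pixel from neighbouring vertices."""
    dx = np.roll(verts, -1, axis=1) - verts       # right neighbour minus self
    dy = np.roll(verts, -1, axis=0) - verts       # lower neighbour minus self
    n = np.cross(dx, dy)
    return n / (np.linalg.norm(n, axis=2, keepdims=True) + 1e-9)

# Usage: verts = backproject(depth); normals = estimate_normals(verts)
```

Each pixel is independent here, which is exactly why this step maps so well onto a GPU.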
In a graphics card, by contrast, there are many more cores, around 500 for example, but they run slower, at 600 to 700 megahertz. Since most of our components are parallelizable, with calculations on individual pixels and individual voxels, we can really make use of the power of the graphics card: we can do calculations on 500 pixels at once in parallel, where a CPU could only handle six at a time. That's the reason why we achieve real-time speed.

Okay, now back to the core components. When we track the camera, we have data from the previous frame and the current frame. On the left, in the green frustum, you see that the camera was more to the left and then moved to the right. But the camera doesn't know that it has moved, so we have to somehow figure out how it moved. To do that, we use an algorithm called ICP. ICP stands for iterative closest point, and it is typically used in 3D scanning applications to align multiple overlapping point clouds. For example, if you have a number of independent scans of different areas of a large object, using ICP you can align them into a bigger point cloud where all the independent scans fit together. But this algorithm has one requirement: it needs these point clouds to already be roughly aligned to each other, and this is usually done manually with a mouse in the 3D scanning program. In our case, the point clouds of the previous frame and the current frame are already roughly aligned, because we have a really fast camera running at 30 frames per second; when we move the camera like this, the positional offset between frames is around five centimeters. So they are already roughly aligned and we can make use of this fast algorithm.

I'll now go through the individual steps of this algorithm. First we associate 3D points from the previous frame with 3D points from the current frame, so we create associations between the two images. We do this really naively by just taking points from the same image positions. When the camera doesn't move, the point at the top left should coincide with the pixel at the top left in the right image, and, for example, the point on the nose should associate with the nose in the other frame. When we move the camera, the association gets corrupted: here we moved the camera slightly to the right and the association isn't right. But that's okay, and I'll show you later why this slight offset between the associations is acceptable. Once we have projectively associated the two point clouds, we check whether the point pairs are actually usable. Some points lie too far apart, for example the point at the bottom right, which associates a point on the T-shirt with a point in the background. For such cases we check whether the distance and the angles are compatible; if they are too far apart, we throw the pair away and mark it as an outlier, because it would degrade the tracking too much. The left image shows how the points of the previous frame and the current frame are associated with each other; it's a top-down view where both surfaces are shown overlapped. In the right image, you see that only some of the points are compatible; some points are too far away or their angles differ too much. A small sketch of this association and compatibility test follows below. After we've checked the compatibility of a point pair, we then try to minimize an energy function. The energy function describes the sum of the squared distances between the points.
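A minimal sketch of that projective association and compatibility check, reusing the vertex and normal maps from the earlier sketch; the 10 cm and 20 degree thresholds and the function name are illustrative assumptions, not the values of the real system:

```python
import numpy as np

DIST_THRESH = 0.10                          # assumed: reject pairs more than ~10 cm apart
ANGLE_THRESH = np.cos(np.radians(20.0))     # assumed: reject pairs whose normals differ by > 20 degrees

def associate_and_filter(prev_verts, prev_normals, cur_verts, cur_normals):
    """Naive projective association: pair up the pixels at the same image
    position in both frames, then keep only pairs whose distance and normal
    angle are compatible. Returns a boolean mask; False entries are outliers."""
    dist_ok = np.linalg.norm(cur_verts - prev_verts, axis=2) < DIST_THRESH
    angle_ok = np.sum(prev_normals * cur_normals, axis=2) > ANGLE_THRESH
    valid = np.isfinite(prev_verts[..., 2]) & np.isfinite(cur_verts[..., 2])
    return dist_ok & angle_ok & valid
```

The rejected pairs are exactly the outliers mentioned earlier, which get reused later for touch sensing.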
To minimize this energy, we solve a linear system that minimizes the disparity between the point clouds globally; we try to find the best fit that minimizes the distances for the whole point cloud. Once we have found this transformation, which describes the offset between the two frames, we apply it to the current image frame and transform it to a new position. In this image you see at the top an association between the previous surface and the current surface, and you can see that the association isn't quite right. But we iterate steps one to five a couple of times, and with every iteration the two surfaces come closer together and we do another round of association between points. By repeating this association with the same depth image a couple of times, the surfaces eventually slide onto each other. We do about five iterations per frame, and it takes about three milliseconds because it's done on the GPU. So that was the camera tracking.

The next part is the integration of the data. Now, when we see a depth image, we know where it belongs spatially: we have a relative spatial reference of the current frame to the previous frame, so we can integrate it into a global model. But instead of using triangles in a polygon mesh, we work with a voxel model. A voxel model is a discrete grid, like a three-dimensional array, in which we store information about the surface. We don't use an occupancy grid, though; we model the surfaces as implicit surfaces. That means we don't store explicit surface information in a voxel; instead, each voxel only stores its distance to the closest surface. The model we use is called the truncated signed distance function, and it was first demonstrated about 15 years ago by Curless and Levoy. They had a similar system, but it worked much slower and at much lower resolution. The benefit of using this model is that we can accumulate data over time; with a polygon mesh that wouldn't be so trivial. By using this implicit structure, we can average data from the previous frame with the current frame and build up a kind of memory, a probability of where surfaces lie. In each voxel we store signed distances, which means that voxels which lie in front of a surface and have direct sight to the camera have positive distance values, and voxels which lie behind the surface have negative values. It's kind of hard to imagine, so I'll show you a diagram. This is a two-dimensional version of the voxel grid, and we have a depth measurement in the shape of a face. Along the ray from the voxel to the camera, we store in each voxel its distance to the surface, as shown by the arrow. This voxel has a negative value because it lies behind the surface; the next voxel has a smaller magnitude because it lies closer to the surface; and we go on like this, so this voxel has a positive value. I'll just quickly step through it: we fill all the voxels with numbers between minus one and one. If we just look at the data structure, we don't see where the surface is; the trick is to look at the zero crossing, the point where the value goes from negative to positive. That's where the surface is. And the real benefit of this data structure is that in the next frame we get similar values in the voxels.
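A minimal sketch of how a single voxel might be fused with one depth observation, including the frame-to-frame averaging described just below; the truncation band and maximum weight are illustrative assumptions:

```python
MU = 0.03            # assumed truncation band in metres
MAX_WEIGHT = 64      # assumed cap on the running-average weight

def update_voxel(tsdf, weight, voxel_cam_z, measured_depth):
    """Fuse one depth measurement into one voxel.
    voxel_cam_z   : depth of the voxel centre along the camera ray (metres)
    measured_depth: depth the Kinect reported for the pixel the voxel projects to
    tsdf, weight  : values currently stored in the voxel (tsdf in [-1, 1]).
    Positive tsdf = in front of the surface, negative = behind it."""
    sdf = measured_depth - voxel_cam_z             # signed distance to the observed surface
    if sdf < -MU:                                   # far behind the surface: occluded, leave untouched
        return tsdf, weight
    d = min(1.0, sdf / MU)                          # truncate to the [-1, 1] band
    new_tsdf = (tsdf * weight + d) / (weight + 1)   # running weighted average over frames
    return new_tsdf, min(weight + 1, MAX_WEIGHT)
```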
We can average those values with the current ones and the ones from the next frame. We end up with lots of numbers, but there will always be only a single zero crossing, a single point where the value changes from positive to negative. That's why we use this structure to represent surfaces.

Now to the last step of the pipeline: we want to render what is stored in the voxel model, and here we do kind of the opposite. From the virtual camera's perspective, we travel along rays through the voxel grid. So here I step from one voxel to the next, one, one, one, along the ray, until I detect a crossing from a positive to a negative value. If I find this crossing point, I render a pixel there. That is how we extract a rendering from the volumetric model. And we don't only render using this method; we also generate a synthetic depth map from the model to compare the current frame against in the tracking step. We did an experiment where we tracked the camera based only on the previous frame and the current frame, and we ended up with this kind of mess: the tracking was smooth, but the camera drifted over time, so it didn't close this circular shape. It drifted away because it lacked an absolute reference point in the virtual model. As soon as we used the synthetic depth map from the model to track the camera, we ended up with this nice model, because we had an absolute reference to the virtual scene. Maybe I couldn't describe it very well, but hopefully I'll be able to clarify it in the Q&A session.

So yeah, I think I've bored you enough, so we'll go on to the more interesting parts: the dynamic interaction with Kinect Fusion. The three concepts of tracking, integrating data, and raycasting are not new, and they're quite easy to understand, actually. We've already heard back from four or five people who have successfully reimplemented Kinect Fusion based on the papers we published this year. So if you also want to reimplement the system, just ask me and I'll send you the papers.

Let's have a look at how Kinect Fusion can be used for more dynamic interaction scenarios. This demo shows how an object can be segmented out from an already scanned scene: we scan in a scene, remove an object, and Kinect Fusion knows which object has been removed. A little bit of theory: we made a small modification to the integration step, which I showed you a couple of slides ago. Here we look at a specific voxel, the one with the green circle, and we compare the model integrated into the volume with the current depth value. When this teapot stays at the same spot, the red depth value should coincide with the blue integrated model. When I remove the teapot, the Kinect will only sense depth values from the table, so there's a big offset between the voxel position and the actual depth measurement. Once we see a significant difference between the live depth measurement and the voxel position, we mark those voxels as a separate object; a small sketch of this check follows below. I'll show you a quick video of how it works. Okay? Here we just scan in a small scene, a table including the teapot. Then at some point we remove the teapot, and based on the algorithm I showed you, it instantly segments out the voxels which no longer match the live depth values. Once we've segmented this object, we can then track it independently.
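A rough sketch of that per-voxel check, reusing the TSDF conventions from the earlier sketch; the thresholds and the helper name are assumptions for illustration:

```python
MU = 0.03          # same assumed truncation band as in the fusion sketch (metres)
SLACK = 3.0        # assumed: how far behind the voxel the live depth must land

def voxel_moved(voxel_cam_z, measured_depth, stored_tsdf):
    """Moved-object test for one voxel. If the fused model says a surface passes
    near this voxel, but the live depth now lands well behind it (the Kinect sees
    the table instead of the removed teapot), flag the voxel as a separate object."""
    surface_was_here = abs(stored_tsdf) < 1.0            # voxel lies inside the truncation band
    depth_now_behind = measured_depth > voxel_cam_z + SLACK * MU
    return surface_was_here and depth_now_behind
```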
Looking at this small example, you can imagine integrating text and other graphics onto real-world objects, and how you can come up with other augmented reality applications. The next demo will actually be a live demonstration. Here I'll show you a particle simulation where virtual particles bounce off real-world geometry. They even get occluded by real objects, because we simulate these particles and calculate the collisions based on the model we have integrated. We also make use of the color camera to overlay the 3D model with color information. Just a second. Okay, sometimes I have to restart the camera, sorry. Okay. So it's the same demo with some color information overlaid. I can also turn off the shading. And then I'll make a better scan of this environment. Once we scan in the environment, we can let virtual balls bounce off the table; they fall off the edge of the table, as you can see here. And the virtual balls on the floor get occluded by the table and by this bag. And I'll be a bit nasty now and... If someone could come up, it might be better. Could you come up? Could you sit on the table? Not on the table, on the chair. Just, yeah. So, wait a second, I haven't thrown any balls yet. Okay, I'm sorry for what I'll do right now. So many windows. Try to shake them off; you can move your arms and shake them off, brush them off your shoulders and get rid of them. It's really hard to see the mouse cursor here. So... Okay. Okay, thank you. It's a really basic demo, but can you imagine having small cars and dinosaurs running around your living room? We weren't really that interested in doing this, so we just had some cheese balls.

The next demo shows the opposite of object segmentation: instead of segmenting an object which was previously scanned and then removed, we now compare the background with objects which get integrated later on. So I'll start another demo. At first sight the demo looks the same; it scans the background. But as soon as I introduce my arm, it becomes another model in this environment, and it also gets tracked separately. Compared to a live Kinect image, it has a much smoother surface, because we do the same thing with the foreground model as we do with the background model: we have two instances of Kinect Fusion running, we separately track the background and the foreground, and then we composite the graphics together later. The most interesting part of this demo is that we can actually sense intersections of my finger with any object in the environment. When I touch this cup, it knows that I'm touching the cup; when I touch this box, it knows I'm touching the box (a rough sketch of this test follows at the end of this section). So what does that mean? It means you don't need to augment any surface; you just need a Kinect camera to turn any surface into an interactive tabletop screen. If you had a projector projecting onto the same space, you could actually draw on the surface without having modified it. In the next demo, I go a step further and start to paint stuff onto objects. A really nice side effect I discovered when I had this mounted in my office is that you can actually see what people have touched. So when someone touches this cup, I know it; it's kind of a fingerprint. So these were the demos; if you want to play with this, you will have a chance in the Q&A session, hopefully. And I showed you this demo, but I haven't really talked about the limitations.
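Before the limitations, here is a rough sketch of the touch test just described: an outlier/foreground point counts as a touch when it lies within about a centimetre of the fused background surface. The grid layout, thresholds, and names here are assumptions for illustration, not the actual implementation.

```python
import numpy as np

MU = 0.03            # assumed truncation band used when the background was fused (metres)
TOUCH_BAND = 0.01    # assumed: a point within ~1 cm of the background counts as a touch

def is_touching(point_world, background_tsdf, voxel_size, origin):
    """Touch test of one foreground point (e.g. a fingertip candidate) against the
    fused background model, stored as a dense TSDF grid with values in [-1, 1]."""
    idx = np.floor((point_world - origin) / voxel_size).astype(int)
    if np.any(idx < 0) or np.any(idx >= background_tsdf.shape):
        return False                                    # point lies outside the scanned volume
    dist_to_surface = abs(background_tsdf[tuple(idx)]) * MU   # convert back to metres
    return dist_to_surface < TOUCH_BAND
```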
So I'll sum the limitations up in this slide. As we are working with a voxel model, it's not really flexible compared to a polygon mesh. Polygon meshes are used in games and in almost all 3D applications, but we're bound to voxels, and that means that most of the surfaces we work with stay static; we can't animate them or move them. And the data structure we have doesn't connect to DirectX or OpenGL, so we can't make use of fancy stuff like geometry shaders or vertex shaders. This is one limitation: we can't properly model deformations. It also doesn't work for large areas. We have this discrete grid, this three-dimensional array, with a fixed resolution, and we can't go outside the box. If I say I want to scan in this area, I can always scan in this area, and this part is allocated in the graphics card's memory; but if I want to go further, it doesn't work. So this is another limiting factor of using voxel data structures. The next downside is more related to the Kinect's abilities: Kinect works in the range of 50 centimeters to 8 meters, so when I want to scan in someone who is sitting at the back, it doesn't work. Unless we use another depth sensor, it's only able to scan things really close to where I am. Another limiting factor of the Kinect is that it doesn't work under sunlight. Kinect operates in the infrared wavelength spectrum, and because the sun also emits infrared, we can't create enough contrast in the sensor image to acquire the depth of surfaces. Another thing is that it requires a really powerful gamer PC: this laptop weighs 7 kilograms and the power supply unit weighs 3 kilograms. Andreas Steinhauser suggested using this with robots or drones which can autonomously navigate within spaces. It would be awesome, because a quadrocopter could just fly around and scan in everything, and it wouldn't hit any surfaces because it knows the environment. But because it relies on a really powerful PC, we can't use it in mobile scenarios, so this is also a limiting factor. And finally, Kinect actually struggles with surfaces which have few 3D features. When I point the Kinect at a wall or at the floor, the tracking algorithm doesn't know how it should align the current frame to the previous frame; it rotates the image and the camera tracking really doesn't work then. That's why I asked Andreas to set up this three-dimensional scene, to create enough 3D features to make it work.

So we've reached the end of the talk. In this talk I presented Kinect Fusion, which generates compelling results just by using an off-the-shelf Kinect camera and a powerful graphics card, and I've also shown how this system can be used in interactive applications like augmented reality. And so, yeah, I've come to the end of my talk and now we can start with the Q&A. David Kim, Kinect Fusion.

Thank you, David Kim. As I announced at the beginning of the talk, we'll have a Q&A session now. I would ask everyone to remain seated, because otherwise there will be too much noise. Everyone who has questions, please line up at the microphones in the aisles; there are two microphones, one in each aisle. Also the people in the front rows, I heard a few of you ask questions during the talk; I would ask you to line up as well, because the people who are watching the streams cannot hear you.
Questions are also possible through the IRC channel, so the people watching the streams can also ask questions; we have a signal angel in the room. I'll start over here. Are you ready for the first question? Please, the first question over here.

Hello there. First of all, I have one really quick question: do multiple Kinects interfere, because you get multiple patterns thrown onto the scene? Yes, they do, because they produce so many dots in the scene that the camera can't figure them out anymore. Have you thought about using multiple normal video cameras instead and calculating the depth information from those images? Because then you could handle larger distances and you could work in daylight. Yeah, we are actually working on that right now; I'm using stereo cameras to feed depth information into Kinect Fusion. Yeah, thank you.

Over there. Yeah, I have two small questions. First, in the beginning you were talking about comparing two frames: you pick points and check the distances and angles and everything. Do you use all the points, or do you choose a subset? So we initially associate every single point; these are around 300,000 points, so we end up with 600,000 points we work with. But we don't use the points which have too much offset, so I guess we use about 95% of all the pixels. Okay, and the second question: when you were showing the foreground versus the background, how does the Kinect know which object is the foreground, just by the distance? Yeah, so when we have scanned in the background and introduce a new foreground object, then, calculating the distance from the camera's perspective, we hit the live depth image before we hit the background model. That's how we can segment it out. When we hit the live depth image first, we integrate that data into the other Kinect Fusion instance, and then we composite them together. Thank you.

Over here, please. You mentioned that there have been some re-implementations of your work. Do you know if any of the groups that have done that have released their code? Yeah, so these were individuals, and I think people from Willow Garage have an implementation. It isn't as fast as ours, but they used the Point Cloud Library, I guess. Okay, one other question: what is the data rate of the data coming in from the Kinect sensor? Because one way to use this for mobile devices might be to downlink that and do the computation at a ground station. So each depth pixel consists of 16 bits, and we actually use 11 bits of a depth pixel. When we use 11 bits at 640 by 480 resolution, we end up with... how much was it, Andreas? 100 megabits per second (640 × 480 pixels × 11 bits × 30 frames per second is roughly 100 megabits per second). So you'd need a really fast wireless link... Potentially that could be done. Yeah, or wireless USB or something. Yeah, thanks.

There is a question on the IRC channel. Yes, the question is: does it work when you move the object rather than the camera? Does it create a 3D object then? Oh yeah, it actually does; I should demonstrate it. I need an object, a good object. So I point it to the sky. Oh. That, okay. Yeah, it works. So... Yeah, I don't want to hold... Do we have much more? Oh, we have more time. We have time for that. So we'll just take another question if that's all right. Just one question: have you considered using sensors to get a better estimate of the initial position of the camera? No, we haven't looked into using acceleration sensors or gyroscopes, but they could be combined.
For example, when the Kinect loses tracking, we could rely on gyroscopes or acceleration sensors for that time. Over there, please.

Hi there. I wonder if you use computer vision algorithms like SIFT features for recognizing objects? We experimented with extracting features from the color image to recognize a scene, so we can recognize scenes that were scanned somewhere else, but it's not part of Kinect Fusion. It would be possible, though, because you have an association between the color image and the depth image. And another question: how do you determine the six-degrees-of-freedom orientation of your camera? The orientation of the camera? Yes. So when we compare the depth images from the previous and the current frame, we don't only calculate the translation; it also gives us the orientation information. So that's in the ICP? Yeah, we use a modified version of ICP where we assume that the angle between frames is really small, almost zero, so we can essentially plug in the small-angle values for the sine and cosine in the transformation matrix. By simplifying these parameters we can calculate things much faster, and it still locks onto the translational offset. It's a bit hard to describe, but the papers have some more details about this. Thank you.

There's another question on IRC? Yes. The question is: you showed that you can see when things are touched. Are you working on something like drawing on a board, so that you track what I draw on a surface? Yeah, actually I forgot to show this. I said the motivation of our work was to use it with augmented projection, and in the end Kinect Fusion became more interesting than the original handheld projector work. But we have a video of it working with the projector. The major problem with the projector was that it had a big offset between the projected image and the tracking. This is the prototype we made: a Kinect with a small projector built into a single housing. We first build up the model, and then we can project graphics onto the surfaces. The graphics stick to specific surfaces because the camera is tracked. And this is the multi-touch application, so you could use a portable projector to paint everywhere. But you can see this big offset: when you slightly tilt the projector, the projection still moves and lags behind a little bit.

And there is a question over there. Thanks for the great demo. If you remove all living material from this room, meaning the plants and the people, what's left will be mostly basic geometric shapes: cylinders, cubes, a few spheres and cones, say something sitting on top of a thin, long cylinder. Have you thought about fitting such primitives, to remove clutter, reduce complexity, and allow for persistent storage and so on and so forth? Yeah, actually we haven't, but it's a good idea. Especially if you want to recognize scenes again, you could do it based on the basic primitive shapes you have already seen; as you said, I see this circle and this box again and then I recognize where I am. So yeah, it would be a great idea.

Another IRC question? Yes. Have efforts been made to use an FPGA rather than the really fast computer, because it could probably do it faster? Well, yeah, I think that's more engineering than what we usually do. We haven't tried it out, but if someone can do it, they should contact Microsoft and get hired. Another question over there, please. Implementation question.
The data structure you're using for storing the volume information: is it a standard octree, and can you locally increase the resolution of that octree? And how do you handle the massive amount of data that you accumulate over one session while still being able to render it? Because I've seen that you don't really drop much frame rate over the course of one scanning session. So we run the integration and the rendering in two different threads. We have a shared memory which contains all the volumetric data; the integration thread always integrates into that memory, and the render thread tries to render as often as possible. And back to your question about using octrees: we have tried it. We didn't succeed in having a GPU-based octree implementation, because it's quite hard to manage a tree data structure in GPU memory. We ended up with a CPU data structure, but it proved to be too slow for our application. With an octree it would be possible to increase the resolution in specific parts and have lower resolution in other parts, and we would also store empty space more efficiently and not use as much memory.

Are there open-source implementations of the algorithms you presented? Sorry? Are there open-source implementations of these algorithms today? Yeah, so we are making efforts to make it open source, but we have to go through lots of steps. I don't know how long it will take, but we are making an effort to make it available. Thank you.

Another question on this side? So we've basically seen this on a one-by-one-meter scene here. How does it go for much smaller objects, let's say a mobile phone of the candy-bar class: scanning that one in detail and then directly printing it on a RepRap? Would that be possible? So you mean the Kinect only sees the mobile phone and nothing else? Yeah, it's a fairly small object and quite detailed; would that work? Yeah, I think the Kinect's resolution is around a couple of millimeters, so it will pick up the buttons on a phone, but I don't know. I think the tracking will have problems, because the noise in the depth image is quite large and it might be larger than the 3D features of such a small mobile phone. Thank you. Thank you.

For those who want to leave, please do so now; we can continue the Q&A session. There are two more questions apparently, maybe more on IRC as well. Those who need to leave because they need to go to another room, do so now. The rest, please stay, because there's still quite some time left until the next talk in this room. So, Alex, I'll accept two questions because I have to catch a flight soon. Okay, I would ask you to ask your question, please.

Hi. Hi. I'm interested in the limitations, or the licenses or patents, that will be taken by your sponsor. Are there any? Or, second question, is it planned to make this platform independent? So all the ideas and patents belong to Microsoft, so I can't do that independently as an independent person, if that was the question. Are the patents taken already, or licenses, or what limitations are put on you because you're sponsored? Oh, well, I think that's too specific; I didn't quite get your question, sorry. Could you rephrase it? I want to ask whether there are limitations on anybody using this in the future, or whether there are already patents or licenses taken, because you're sponsored and it's not purely academic work of your own, it's their funding, as I understood.
So you can reimplement it, because we have published papers describing how everything works, and most of the algorithms are prior work, like ICP and the implicit surface data structure; they're already known techniques. I think we have applied for some patents, but I don't know all the specifics of how the use is limited and how not. And the second question was about platform independence. It's written in CUDA and C#, so if you have CUDA running on other operating systems... I'm not sure whether that includes Linux. So your publications aren't specific to a platform? No, no, they don't make any assumptions about the operating system. Thank you.

One last question, a very quick question please. You said you can't use two Kinects in parallel because the projected patterns would interfere with each other. But is it possible to use two Kinects and just one projector, so that the infrared camera of the second Kinect sees what the projector of the first Kinect projects? So you'd have two passive sensors and one active one? Yeah, but you would have to write your own stereo algorithm for that. In the Kinect, the projector position and the camera position are calibrated to each other, so you can't separate them. But yeah, if you have your own stereo algorithm, I think it might work. And is it possible to get the raw data from the infrared camera out of the Kinect? Yeah, it's possible.

OK, David Kim has to catch a flight. Please, another round of applause. Thank you.