Hi, good afternoon. My name is Shrikar and I work as a developer with TCS Interactive Labs here in Chennai. I've been working on AR and VR for the past three years. Before we begin, how many of you have prior experience with AR? Okay, that's good.

This talk is basically the result of what I've been developing over these years, primarily in Unity. Unity is based on C#, and you have external libraries that help you do AR and VR, and with those you can build a good amount of graphically pleasing stuff. But most of the time, whenever I hit something that isn't possible with Unity, I think of Python, because these are generally, as I'll show you, computer vision problems that have already been solved. With Unity or the Android-side libraries you have to pull in all these external packages and build your own models to do the same thing. So I thought, why not have a library, which I'm calling OPI, that wraps augmented reality together with these machine learning capabilities, so that in the end it makes our job a lot easier when we're developing.

So let me begin. The central idea is that augmented reality is computer vision at its core. I'll show you what augmented reality is, and then explain where and how CV enhances AR and practically makes it possible. The outline: what AR is, where Python comes in, and then a small demo. The demo is based on a GitHub project I came across, I believe by Juan Gallostra. He made a POC where you show an image target and a 3D model gets placed on top of it. I thought that would be a really good starting point for building this library, and for building modules and tools to enhance it, and then I'll discuss where the library needs to be enhanced.

So, what is augmented reality? If you've played games like Pokemon Go, or if you have prior experience, you'll already understand. It comprises three things: there is a real world, on top of that real world I place a virtual object, in this case the Pikachu character, and I also allow the user to interact with these objects. When I throw the ball, in this case with a swipe-up action, the ball goes and you practically catch the Pokemon. Ideally, the character or object you place has fixed and predictable behaviour; it shouldn't, say, fly away, and since this one can't fly you don't have to handle that. In the real world, though, AR has a lot of varying parameters, because the user is typically holding a mobile phone: suppose here is my object, and I somehow have to keep track of it.
My camera angle can change, the scale can change, I can move back or move closer, and to handle all of this there need to be underlying algorithms that provide functionality and make our job as developers a lot easier.

Before we go further: AR can be of two types. One is markerless: I scan for a planar surface, put my object there, and go ahead with that. The other is marker-based: I scan an image target, and once I detect that image, I put a 3D object on top of it and do things with it. The demo I'll show is of the second sort; I wanted to add more features but couldn't get to that point yet, so I did some basic stuff and I'll show you that.

The algorithm inside AR is basically three steps, repeated every frame (there's a rough skeleton of this loop a little further below). First, you update your tracking and environment data; this involves data coming in from the sensors and how the user is moving. If the camera is looking at the scene and my hand is a tracked object, I need to know where my hand is at every point. Second, you process interactions: is the user tapping on the 3D object I've placed, or doing something else that changes the experience? Third, based on these interactions, you update the placed virtual objects.

As I've said, this requires tracking to work well. You have positional tracking in 3D space, plus roll, pitch, and yaw, which represent the rotation; together this is called pose estimation in XR terminology. To handle all of this well you also need a common coordinate system: I shouldn't be computing in Cartesian coordinates while the user's interactions are handled in, say, cylindrical or spherical coordinates, because that would completely mess up the interactions they are trying to do.

Ideally, when I say I want to track something, there should be reference points for that tracking. These reference points are called keypoints: distinctive locations in an image. If I want to describe a keypoint for my laptop, these edges would be pretty good ones, or the table edges, and together they describe the features of this environment. They should be reliable: if I come back a few minutes later, that point should not have gone away. And they should be invariant: if there is a dog in the scene and I put a feature point on it, it keeps moving, its position keeps varying, so that's a bad keypoint to have.

Then there are SLAM algorithms, simultaneous localization and mapping. One thing generally used by libraries like ARCore or ARKit, in one form or another, is BRISK. What it does is take a keypoint, compute a descriptor for it, and let you store those descriptors in a database. When I'm detecting keypoints, I need a sufficient number of them to understand the environment; if I get very few keypoints, which typically happens with shiny surfaces and the like, my tracking will not work well.
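Before going further into keypoints, here is the per-frame loop I described above, as a minimal sketch. This is not from any particular library; I'm only assuming OpenCV for camera access, and the functions `estimate_pose`, `handle_interactions`, and `render_objects` are hypothetical stubs standing in for the real tracking, interaction, and rendering code.

```python
import cv2

# Hypothetical placeholders: in a real app these would be the tracking,
# interaction, and rendering code. They are stubs here only to show the loop.
def estimate_pose(frame):          # step 1: update tracking / environment data
    return None

def handle_interactions(pose):     # step 2: react to taps, swipes, etc.
    pass

def render_objects(frame, pose):   # step 3: redraw virtual objects with the new pose
    return frame

capture = cv2.VideoCapture(0)      # device camera
while capture.isOpened():
    ok, frame = capture.read()
    if not ok:
        break
    pose = estimate_pose(frame)
    handle_interactions(pose)
    cv2.imshow("AR preview", render_objects(frame, pose))
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
capture.release()
```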
Once I have these keypoints in my database, I need to describe them. With BRISK, the descriptor is essentially a hash-like binary key that lets you uniquely identify every keypoint present in the scene. Each of these keypoints is also called a spatial anchor, and this detection can happen every frame, because you can't predict the user's interactions; they can be sudden, so in one frame the user could be pointing here and in the next frame somewhere else. If you don't track every frame you lose a lot of data, so ideally I should track every frame, and that is computationally intensive, which is where ML and related techniques can come in.

So how does BRISK work? The first part is keypoint detection. I have a central pixel P, I hope you can see the cursor here, and the detector takes the 16 surrounding pixels, these ones, and checks whether P is brighter or darker than at least 9 of the pixels in that surrounding ring. In this case P is actually a keypoint, because all these top pixels are lighter, so it can say this is a distinctive keypoint. Once I have this keypoint in place, I need to describe it: BRISK creates a binary string of 512 bits, and that string is what you use for everything further.

In OpenCV this is already handled natively for us. Say the user is looking at this scene through their camera and I do some processing; this is OpenCV code. You can see I'm importing the library, reading a particular reference image, creating an instance of BRISK, asking it to compute the keypoints and the descriptors, and then drawing the keypoints. detectAndCompute is a native cv2 function, so all of this happens natively (I've reproduced a minimal version of the snippet a little below). The end result is this image: you can see lots of little circles marked all over it, and those are the keypoints the algorithm was able to detect in the scene. In the real world you use these keypoints, after performing some error correction, which I'll come to, to understand how the user is moving.

That's what we see, but what the machine sees is basically this: a LiDAR point cloud. You may already be familiar with this from self-driving cars. Lyft, Waymo, and other companies have released large point cloud datasets for us to perform analysis on, and those are a very good starting point.

About that error correction: these points are far too numerous, so I need to see which ones are useful and which are not. To do that, you typically remove outliers. One good method is to look at the image at different scales: if I have a point at one scale and it is not present at the three or four other scales of the same image, then I can't rely on that particular keypoint.
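The OpenCV snippet I walked through is along these lines. This is a minimal reconstruction, not the exact code from the slide; the file names are placeholders.

```python
import cv2

# Load the reference (target) image in grayscale; the path is just an example.
reference = cv2.imread("reference.jpg", cv2.IMREAD_GRAYSCALE)

# Create a BRISK detector/descriptor and compute keypoints plus their
# 512-bit binary descriptors in a single call.
brisk = cv2.BRISK_create()
keypoints, descriptors = brisk.detectAndCompute(reference, None)

# Draw the detected keypoints as small circles, as on the slide.
annotated = cv2.drawKeypoints(reference, keypoints, None, color=(0, 255, 0))
cv2.imwrite("reference_keypoints.jpg", annotated)
print(f"Detected {len(keypoints)} keypoints")
```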
Once the outliers are removed, I use the remaining keypoints to calculate the pose. So another step gets added, performing error correction, which I think is also quite an important step here.

Now, the current AR libraries. There are quite a few out there; the ones I use are ARCore, ARKit, AR Foundation, and Vuforia. On their own they are pretty sophisticated, and they expose wrappers that let us write a few lines of code and then do whatever we want with 3D objects. But one thing or another keeps coming up. The well-known problems are these: object segmentation, depth distortions, and scene abstraction, that is, how the objects in a scene relate to one another.

Despite all these problems, some libraries are working hard to enhance their AR capability. A recent ARKit release did something called People Occlusion: they segment the image and estimate its depth, so that if this is my view, I have a 3D object placed here, and my hand goes behind it or in front of it, the system knows there is a 3D object and something came in front of it, and that part of the view gets blocked. To get that right you need to know the depth of the image and you also need to segment it properly. Another company, 6d.ai, is building something called an AR cloud, which I also think is very good for the future of AR; you can look it up.

So the problems I've summarized as the most commonly seen are these: pose estimation, object segmentation, and depth estimation. For the first, the sensor data you get is basically time-series data, and you need to process it properly. The second is understanding what objects are in the scene and segmenting them. The third is understanding how far away a particular object is from my point of view. And I think this is where Python can be helpful, because all three are already handled separately in Python; a lot of research has gone into each of them.

For pose estimation, for example, I recently came across a blog post, I think from Microsoft, and this is one of them: you take your IMU data, the sensor data, which is basically a time series, and use machine learning such as LSTMs to denoise it and increase the accuracy of your tracking (as a toy illustration, I've sketched a very simple smoothing baseline at the end of this part). Object segmentation is likewise being tackled by many machine learning models that are already out there. And the third is estimating depth from a given image. This particular work was of great interest to me because they used data from the Mannequin Challenge; this is from a Google blog. They used that data to understand where people are in a scene, and compared various views to build their algorithm. In the video, as this person moves around, you can see it keeping a proper track of the depth. This sort of sophistication is what we need for an ideal AR experience.
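I don't have the learned denoiser to show, but as a toy illustration of what "denoising a time series of sensor readings" means, here is a simple exponential-moving-average baseline on synthetic gyroscope data. A trained model such as an LSTM would take the place of this filter; the signal here is made up.

```python
import numpy as np

def exponential_smooth(samples, alpha=0.2):
    """Very simple baseline: exponentially weighted moving average.

    A learned denoiser (e.g. an LSTM) would replace this in practice; this
    just shows the shape of the problem: noisy readings in, smoother out.
    """
    smoothed = np.empty_like(samples, dtype=float)
    smoothed[0] = samples[0]
    for i in range(1, len(samples)):
        smoothed[i] = alpha * samples[i] + (1 - alpha) * smoothed[i - 1]
    return smoothed

# Synthetic gyroscope reading: a slow rotation rate plus sensor noise.
t = np.linspace(0, 5, 500)
true_rate = np.sin(t)                                  # underlying angular velocity
noisy = true_rate + np.random.normal(0, 0.3, t.shape)  # what the sensor reports
denoised = exponential_smooth(noisy)
print("error std before:", np.std(noisy - true_rate).round(3),
      "after:", np.std(denoised - true_rate).round(3))
```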
So, let me do a small demo. I have this code ready: I'm trying to track this particular book, the cover is the target image for me, and I'm trying to place an object on top of it. When I run it, what happens first is that the fox, if you can see it, tries to be placed everywhere, because it isn't able to find the actual tracking point. Once I bring the book in front of the camera, it detects the planar surface well and you get the 3D object; this is actually supposed to be a textured model, but the materials are messed up, which is why it comes out pink. You can see that as I move the book it tracks pretty accurately, and once I remove it from the scene it goes bonkers again, because it no longer understands what is going on.

To go through the code: basically everything I've already mentioned is in here, and I've also left a QR code for you; it's in my repo along with the links. As I said, it detects and computes the keypoints of the reference image, and then there is the OBJ; if you look at the OBJ file it's basically a list of arrays, and the original author has written a nice parser for it. That is all in the repo, and there's also a blog post, which I'll link to, with a detailed description of what this code is doing. A condensed sketch of the matching-and-homography part of the loop is at the end of this section.

So, that's what I've done so far. What can the future of this be? Right now we have some basics in place: I look at a particular image and put an object on top of it, and that's obviously not sufficient in the long run. We need some sort of ecosystem, and that is where I think community contributions will be helpful. One thing to be done is rendering complex models in AR with all the material components. Right now it's just a fox or a cube being placed, which is very simple; if I want an experience where something rises up from the ground, or things are flying around, that requires more complex processing. There's PyMesh, a very good library that already exists in Python, which can take point cloud data and generate a detailed mesh out of it. The second thing that would be good to have is markerless AR: instead of an image target, it should detect a planar surface, understand that there is a plane here, and position the object on top of it. You also need to denoise your sensor data: all the noise coming in due to rapid movements needs to be smoothed out over time, so the environment is understood better. Lighting is another example: my object needs to be lit to match the scene, so once the lights go off, the object should also become darker; it shouldn't keep glowing, which would be a bit creepy.
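To make the demo pipeline concrete, here is a condensed, simplified sketch of the matching-and-homography step; it is not the original author's code. I'm using ORB here, though BRISK descriptors would work the same way, and the image paths, the MIN_MATCHES threshold, and the corner-outline output are my own placeholders. The actual demo goes on to build a 3D projection matrix from this homography and render the OBJ model on top.

```python
import cv2
import numpy as np

MIN_MATCHES = 15  # below this we treat the target as "not found"

# Reference image of the book cover and one camera frame; paths are placeholders.
reference = cv2.imread("book_cover.jpg", cv2.IMREAD_GRAYSCALE)
frame = cv2.imread("camera_frame.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and binary descriptors in both images.
orb = cv2.ORB_create(nfeatures=1000)
kp_ref, des_ref = orb.detectAndCompute(reference, None)
kp_frame, des_frame = orb.detectAndCompute(frame, None)

# Brute-force Hamming matching, keeping only mutually best matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des_ref, des_frame), key=lambda m: m.distance)

if len(matches) >= MIN_MATCHES:
    src = np.float32([kp_ref[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_frame[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    # Robustly estimate the homography mapping the cover into the frame;
    # the 3D projection used to render the model is later derived from this.
    homography, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    h, w = reference.shape
    corners = np.float32([[0, 0], [0, h - 1], [w - 1, h - 1], [w - 1, 0]]).reshape(-1, 1, 2)
    outline = cv2.perspectiveTransform(corners, homography)
    print("Target found; projected corners:\n", outline.reshape(-1, 2))
else:
    print("Not enough matches -- target not in view")
```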
And there are some other things we can probably think of in the future as well. The second area is extensions. What I've shown is basically a desktop application, which is pretty rudimentary, so we need to make lightweight models. Just yesterday or the day before there was the PyTorch conference where they released PyTorch Mobile, and we already have TensorFlow Lite. These libraries will help us build models and deploy them on mobile phones along with the app. Beyond that, you can probably run them on Arduino boards or Raspberry Pis, and when you're building something like a HoloLens headset, those typically have their own chips, so within that we can make these lightweight models run and speed up the processing.

The third part is federated learning, which I think is very important, because right now you're opening up the camera, asking the user to move around, and doing a lot of processing on data that could be private to the user. So federated learning, homomorphic encryption, and the other techniques coming up in that space are a very good area to look into for protecting the user's privacy while still providing a great user experience. And there are some other things we can think of sometime later.

I have some references here, and these are some of them. I'll leave a link to these slides in the GitHub repo, which is over here; you can scan the QR code and it will take you straight to the repo, no need to type any links. I'll update it today, probably by evening, and then we can go ahead from there. And of course, you can send in pull requests or issues; any thoughts are more than welcome. Thanks.

If you have any questions, you can go ahead.

Hello sir, thank you for the presentation. When you sense this data, you showed a video from Google where the program was sensing the scene in 3D. I wanted to ask: can we incorporate sizes and shapes? For example, there is an application from Apple that works like a ruler, where it scans and gives you measurements.

Okay, so you want to measure something, in 3D, with sizes. Yes, that can be done, but for that we need to combine multiple things again. Suppose I'm walking with the mobile app and I place a point over there, in AR again, and then I walk back; it needs to understand how far I have really walked back, using the sensor data from earlier, and you might also have scaled the object, which is another thing it needs to understand.

And with this scaling, can we include motion capture? Can we do motion capture through this?

Sorry, I'm not sure about that part. We can measure distances pretty accurately, but I'm not sure about motion capture.
Hello, can everyone hear this? You spoke about things like federated learning for privacy. How do you think it is going to enhance the performance of AR, given that you now have a lot more data being trained on all the time?

It's not about enhancement; I'm talking about protecting privacy alone.

So do you feel the enhancement side wouldn't really help that much, the fact that with federated learning you are training on so many devices at the same time?

We could probably work on limiting the number of inputs we give to the model, but what I'm trying to say is that we can have a lot of on-device algorithms running there and giving you the output.

One more question, about pose estimation again. How do you think the newer vision-based techniques for pose estimation will help AR compared to the older ones, or the new ones based on thermal imaging and so on, in terms of accuracy and feasibility?

What happens with pose estimation is that you have a limited number of sensors available on your devices. We all have gyroscopes and the other basic sensors, but thermal imaging is not something most phones have, and that's where this comes into the picture. What I showed was an example using IMU sensor data, which practically every phone chipset already has, so I think that is where it will help.

I have a question: you mentioned that you were using the marker-based approach. What does it take to get to a markerless environment?

Okay, markerless, let me go back to this slide. By the way, you can leave if you want, because it's lunch time; if you don't have any questions I'll be around anyway. So, if you look at this image, it's trying to understand the points at the bottom. To have a markerless experience, you need to understand that there is a planar surface present in front of you; it could be the ground, it could be a table, something on top of which you place your object. Right now, the demo just understands that there is a feature point it has already seen in the reference it was given earlier, detects those points in the image, and keeps track of them. For markerless, it would need to understand not only that it has seen this point before, but also that what it is looking at is still a planar surface.

Any other questions? This will be the last question, and after that there is lunch. Just a few announcements before anyone leaves: there is open space one here, PyDelhi, from one to one-thirty, and open space two, GUI, at the same time, and at two there will be lightning talks.

Hello, am I audible? Great presentation. Thinking about markerless AR: you said it needs to detect a plane surface, and in your earlier slides you showed that object segmentation is possible through deep learning and computer vision. So can we estimate the plane surfaces in a given frame using deep learning? It seems like a pretty feasible task, but is it possible?

I think it should be feasible; I'm not actually sure, but it should be feasible.
Yeah, because if networks can pick up very complex things, plane surfaces should be something that is easy for them.

Yes, I'd guess so. One point I might have missed: even to get this plane surface, you need to scan around a bit, so the system needs to understand the environment well. In this case you could probably take the point cloud data you're getting and check where all these points lie on the same level, something like that. Thanks.
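To make the "points on the same level" idea concrete, here is a toy sketch that fits a single plane to a batch of 3D points with a least-squares (SVD) fit on synthetic data. A real markerless pipeline would run something like RANSAC over the live point cloud and keep only the inlier points; the "table top" data below is made up.

```python
import numpy as np

def fit_plane(points):
    """Least-squares plane fit: returns (centroid, unit normal)."""
    centroid = points.mean(axis=0)
    # The plane normal is the direction of least variance, i.e. the last
    # right-singular vector of the centred point cloud.
    _, _, vt = np.linalg.svd(points - centroid)
    return centroid, vt[-1]

# Synthetic "table top": points scattered around z = 0.5 with a little noise.
rng = np.random.default_rng(0)
xy = rng.uniform(-1, 1, size=(200, 2))
z = 0.5 + rng.normal(0, 0.005, size=200)
cloud = np.column_stack([xy, z])

centroid, normal = fit_plane(cloud)
print("plane height ~", round(float(centroid[2]), 3), "normal ~", np.round(normal, 3))
```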