We will be talking about augmented reality libraries. Shreeker is currently working as an XR developer at TCS Directive Labs in Chennai, and he has two years of experience here. Over to him.

Hi, everyone. Thanks for coming. I'm Shreeker, I'm currently working as an XR developer at TCS Chennai, and I'm pretty much a beginner in this field. I wanted to share whatever I've found interesting in AR and just talk about it. As you might have guessed from the title, I'll be speaking about augmented reality and what goes on behind the scenes, and I'll highlight some possible issues and try to at least point at some solutions. I haven't actually built a demo to showcase right now, but I'm hoping to do so in the future and come back and share it. So this is just putting things out there.

This is roughly the outline of my talk: first some vision, then AR, then how these things come together, then where ML and AI come in, and finally some conclusions.

So, first, vision. Vision in this context is computer vision, which has been a very active field for quite some time. In my view, you can summarize the entire CV process into three steps: you acquire an image from a sensor, you process it, and then you analyze it to extract some details. As an example, take license plate detection. You apply Canny edge detection to find the edges, crop the rectangular part of the image that contains the plate, then detect the characters in it and run OCR on them to read the plate.
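To make those three steps concrete, here is a minimal sketch of such a pipeline in Python, assuming OpenCV and the pytesseract OCR wrapper are installed; the thresholds and the aspect-ratio check are illustrative choices, not values from the talk:

```python
# A minimal sketch of the acquire -> process -> analyze pipeline for
# license plates. Thresholds and the aspect-ratio test are illustrative.
import cv2
import pytesseract

# Acquire: load the image (a camera frame in a real application).
img = cv2.imread("car.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Process: Canny edge detection, then look for a rectangular region
# whose aspect ratio looks like a license plate.
edges = cv2.Canny(gray, 100, 200)
contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
plate = None
for c in sorted(contours, key=cv2.contourArea, reverse=True):
    x, y, w, h = cv2.boundingRect(c)
    if 2.0 < w / float(h) < 6.0:  # plausible plate aspect ratio
        plate = gray[y:y + h, x:x + w]
        break

# Analyze: run OCR on the cropped candidate region.
if plate is not None:
    text = pytesseract.image_to_string(plate)
    print("Detected plate text:", text.strip())
```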
I think CV forms the foundation of whatever we do with physical reality: if I give a computer an image today, it's mostly with real-world scenarios in mind. Taking this one step further is augmented reality. There is a well-known spectrum, proposed by Milgram: augmented reality is just a layer on top of physical reality, and all the way at the other end is virtual reality, where you take the user to an entirely new, fully virtual world.

So what is augmented reality? If you do a quick Google search, you'll get a definition like the one on the slide, and I think it's one of the good ones: you add virtual content to the real world and present a composite view of the two. You are augmenting the user's reality, in some sense.

The defining characteristics of an ideal AR scenario are these. First, you blend the real with the imaginary; the most famous example is probably Pokémon Go, where you see the character Pikachu somewhere in your real world. Second, you can interact with that content. Third, the content you place has to exhibit predictable behavior: if I place a chair in a corner of the room, the chair shouldn't start floating around after a while. But because the real world is chaotic, all of these things change unpredictably: the user could move the camera back, the image could blur from a loss of focus, image noise could creep in, or the lights could simply turn off. All of this should be handled gracefully by your application, and this is where computer vision comes in.

Summarizing AR, I think these four steps are basically the algorithm for what AR does internally. While some condition is true (the loop condition is optional, depending on your use case): first, update the tracking data, meaning the position of the user and the rotation of the device; second, update the environmental data, such as lighting conditions and scale; third, check whether anything has changed since the previous update; and fourth, if it has, update the positions of the already placed virtual objects based on what you got.
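As a sketch, that loop might look like this in pseudocode-style Python; every name here (session, update_tracking, and so on) is a placeholder, not any real library's API:

```python
# Pseudocode sketch of the four-step AR loop described above.
# Every name here is a placeholder for what an AR library provides.
def ar_main_loop(session):
    previous_pose = None
    while session.is_running():             # optional loop condition
        pose = session.update_tracking()    # step 1: device position/rotation
        env = session.update_environment()  # step 2: lighting, scale, planes
        if previous_pose is not None and pose != previous_pose:  # step 3
            for obj in session.virtual_objects:
                obj.reposition(pose, env)   # step 4: keep objects anchored
        previous_pose = pose
```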
The first two steps normally happen simultaneously; they're handled by SLAM algorithms, which most AR libraries use, so we'll look at them together for the rest of the talk. Because the user's environment is not predictable, we need to constantly keep track of the position of the user's device. This happens via positional tracking and rotational tracking, and it's known as pose estimation, in AR at least. The device and the virtual object you're placing need some form of common coordinate system to speak to each other, or at least a common language. Normally you use sensors for this: the camera, accelerometer, gyroscope, and so on all inform the decisions here.

Before the experience begins, all I need are some reference points to base the whole experience on. I scan my room, and there are fixed points for me to identify; when I turn to the side, the device should see some fixed point there and recognize that a rotation has happened. For this you use what are called key points, which are distinctive points in an image. They help us keep track of what stays constant, and they're called features in this context; you can also call them trackables. You store them in a database and use them to measure differences later. Features should have the following properties: they should be reliable, and they should be invariant to movement. Reliable in the sense that if I use my pet as a feature point, that's not reliable, because the pet will keep moving around and my experience will go mad.

All of this happens through some SLAM method, as I mentioned earlier. SIFT and SURF are quite popular, but comparatively speaking, BRISK is a bit faster; it's a derivative of an algorithm called FAST. Any such algorithm needs to do two things: it should detect sufficient key points to understand your environment, and it should describe those key points, giving each of them some form of unique fingerprint. Even if the key points number in the billions, each should have a unique fingerprint so the app can differentiate one from another. These key points are often called spatial anchors; more often than not, when you're developing, you use them to position an object and keep it stationary, or have it perform some motion. This should ideally happen every frame, unless we want otherwise.

As an example, here is the FAST detector. You take a central pixel p in the image and compare its brightness against the 16 pixels on the circle around it. For p to count as a feature point, at least nine contiguous pixels on that circle should be brighter or darker than p. In this picture, the whole top part of the circle is brighter, so p could be a feature point. Once a feature point is detected, BRISK creates a binary string by encoding brightness comparison results around it, and that string serves as the point's unique fingerprint.

Here's an example image where I applied BRISK. I just used OpenCV, which is readily available; it has a BRISK implementation that does all the processing for you. You might not be using OpenCV on a mobile device, but it's convenient here. All these circles you see are the detected key points, and many of them might not actually be needed in a real scenario.
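For reference, the OpenCV part is only a few lines; this is roughly the kind of script that produces a key point image like the one shown (the file names are placeholders):

```python
# Detecting and describing BRISK key points with OpenCV.
import cv2

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

brisk = cv2.BRISK_create()
# detectAndCompute finds the key points and builds the binary
# descriptor (the "unique fingerprint") for each one in a single call.
keypoints, descriptors = brisk.detectAndCompute(img, None)
print(f"{len(keypoints)} key points detected")

# Draw the key points as circles, like the example image in the slides.
out = cv2.drawKeypoints(img, keypoints, None)
cv2.imwrite("keypoints.jpg", out)
```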
Once you detect these key points, you build a database out of them, as I mentioned earlier, and ideally you look at the images at multiple scales. For example, I could take the same image at ten different scales and say that if a key point is not present in at least six of them, I can safely discard it, because it's not reliable. That way you reduce the number of key points you carry around. You could say that more key points are better, but they are very expensive to track, and this is where error correction comes in, again to reduce the number of key points. You could remove outliers: with some simple geometry, I could write a rule that connects two key points and discards all the key points above the line between them, keeping the ones to the right. The exact rule would vary with your scenario; it's just an example I came up with. You then use the remaining key points to recalculate the pose of the user and drive the whole experience again. So, going back to our algorithm, one more step gets added: perform error correction. This step is quite important, because it acts as a feedback loop that refines the measurements.

Other aspects of AR are the lighting of the surrounding environment and user interactions; I should be able to flick a ball, say, and have it fly to the other corner of the room, and feature points might be detected on a slanted surface and need handling there. Another part is hiding objects: if I'm behind a table, the camera should be able to identify that I'm behind the table, and when something else comes in front of me, I should be hidden behind that object.

Once you understand all this, you get to place the virtual object in the user's environment, and this is done using meshes. All virtual 3D objects are made up of meshes; if you look at the breakdown, they are ultimately just a bunch of points. You take a point cloud and generate a mesh out of it, and then, considering the data you have from the earlier steps, you adjust parameters of that object: you could add shadows, change the amount of light reflected off it, change its scale or dimensions, or change how much of the object is visible. That last part is the occlusion case I mentioned, and it's especially difficult, because you need something like depth sensor data to get it right.

So how do AR libraries handle this? These are some of the libraries I'm familiar with; I'll give examples from ARCore, because that's what I've mostly been using. On launch, an AR app first scans the surroundings and tries to match the points it sees against an existing database; if a match exists, it can use the pre-downloaded key points, and if nothing exists, it initializes a new map. In either case, as the user starts moving, the map keeps getting bigger, and this data is ultimately what creates the AR experience. But bigger maps mean more computation to manage, because you have to triangulate positions and so on, and this is made harder by the fact that a phone has a limited amount of resources. In the example image, the top part shows the feature points that were detected, which could form a map, and the bottom shows depth sensor data, which most libraries currently don't have.

Now, some major issues. The big ones, I would say, are improper occlusion and performance drops. The other two, depth distortion and inaccurate tracking data, are basically variants of occlusion handling: if you handle occlusion well, distortion and tracking problems could probably be corrected automatically. I could hard-code around all of these, but that's a pretty difficult task, especially in real scenarios. Here's an example of the occlusion problem: the dragon should ideally be behind the chair, but it renders in front of it, because the camera doesn't know there's a chair there.

To solve occlusion properly, you want depth cameras: you capture depth data and aggregate it over a bunch of frames until you can say, this region has consistent depth, so there must be a 3D object present there. A Tango phone, for example, does these depth calculations pretty well. But most phones today have a single camera and no depth sensor (in one example I've seen, I think they used five cameras to get stereo data, though I'm not actually sure of that), so your task becomes generating a 3D reconstruction from a 2D image. If you do have depth data, you back-project it into a point cloud, merge it with the point cloud you built earlier, combine that over multiple frames, and end up generating a mesh. This is still a hard task, with further performance drops, because there is only a limited amount of resources we can use.
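Assuming you do have a depth image and the camera intrinsics, the back-projection step might look like this in NumPy; fx, fy, cx, cy are assumed intrinsics, and the meshing itself is left out:

```python
# Back-projecting a depth image into a 3D point cloud with the pinhole
# camera model. fx, fy, cx, cy are assumed intrinsics; a real app would
# read them from the device. Meshing the cloud is a separate,
# more expensive step.
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth                      # depth per pixel, e.g. in metres
    x = (u - cx) * z / fx          # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop pixels with no depth reading

# Aggregating clouds from several frames (after transforming each one
# by that frame's device pose) gives the denser cloud you mesh from.
```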
One good solution, I think, is bringing some form of machine learning into augmented reality. It's just one of the possible solutions, but it's the one I'll continue with. The goal is simple: detect depth in a given image by some method, and then use that data to generate a mesh by combining it with the existing point cloud data. One way to do this is with neural networks, along with some other tools. Here's an example from a company called Selerio: they create a mesh of the scene in real time and place objects on top of it, so when the camera moves down, you see the balls there get hidden behind real geometry.

Before we look at what's happening there, a small detour. This is the portrait mode on Pixel phones; you can see the depth changing across the image, and I think this is the example where, at least during the training phase, they used five phones to calculate ground-truth depth. The second example is face tracking from ARCore's Augmented Faces, which was released recently, a month or a few months ago: a green mesh forms over the face and is tracked as the user moves around.

The common denominator between these two is TensorFlow, specifically TensorFlow Lite: that's what Google used to create these experiences. They made the processing faster and introduced GPU support to get a speed-up. To put things in context for the Augmented Faces example: they took a model and deployed it onto the phone, where inference can run on either the CPU or the GPU.
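As a sketch of what on-device inference looks like with the TensorFlow Lite interpreter (in Python here for brevity; "depth_model.tflite" is a hypothetical file name, not the actual Augmented Faces model):

```python
# Running a hypothetical depth-estimation model with the TensorFlow
# Lite interpreter. The model file name is a placeholder.
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="depth_model.tflite")
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# One camera frame, resized to whatever shape the model expects
# (a random stand-in here).
frame = np.random.rand(*inp["shape"]).astype(np.float32)

interpreter.set_tensor(inp["index"], frame)
interpreter.invoke()
depth_map = interpreter.get_tensor(out["index"])
print("Predicted depth map shape:", depth_map.shape)
```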
On the CPU it takes around 30 milliseconds per frame to get the data out, but on the GPU that drops quite drastically, to around 10 milliseconds or so. There are similar results for the full mesh and the light mesh variants. This isn't the only solution, of course. There's MegaDepth, from Cornell I think, which uses CNNs to extract depth information from a single image. Another one I found personally interesting is PointNet: if you remember, the point cloud had to be turned into a 3D mesh before it could be occluded; PointNet goes from the point cloud directly to understanding what the scene looks like, without having to generate a 3D model out of it first. So we could develop more things like these, port them to TensorFlow Lite, and build on top of that to enhance overall AR performance without compromises.

Everything we've seen so far is probably around 1% of what AR actually is. It's still very much an emerging field, with great potential for exploration. Even from a business perspective, you can see on the Gartner hype cycle that augmented reality is around here, moving towards the plateau of productivity, and according to Gartner that will take around five to ten years. But since we're all in the open source world, we could take up these problems and speed things up, maybe get there in less than two years. As I said, I'm still a beginner, and I'm hoping to come back at some future summit and share the results of this process. Here are some references I found interesting; I'll leave them up on my GitHub or somewhere so you can take a look. Thank you. Any questions?

[Audience question about cross-platform support.] Vuforia, I think, is cross-platform. In our team we use the Unity engine; inside Unity you just add these libraries and then make whatever build you want. With Vuforia I've built for both Android and iOS. ARCore is basically for Android, though you do have some extensions; about ARKit on iOS I'm not so sure. Does that answer your question?

[Audience question about AR Foundation.] AR Foundation is relatively new, right? ARCore has been around a bit longer, and when you compare features, some things are still lacking in AR Foundation. Last time I checked, one of them was image detection, where the library detects an image in the environment and places an object on top of it as it's detected; I don't think that's present in AR Foundation yet, though they might have added it, I'm not sure. I also find ARCore more appealing because Google is doing a lot of other research around TensorFlow and so on, and you can fairly easily write a native Android app that combines ARCore with TensorFlow and build the experience from there.

[Audience question about running on-device.] Yes, that's where TensorFlow Lite comes in. TensorFlow Lite applies a number of optimizations; I'm not sure how they work internally, I'm still exploring that part, but TensorFlow Lite will be of great help in this case.
[Audience question about well-known games or apps using this.] I'm not sure about the big games out there, but from what I've seen, many companies are using it internally. With Vuforia, for example, some companies use it to guide field agents through various tasks. It depends on the use case, and I'm not sure of exact names I can give, but that's one of the good applications you could have. Any other questions? I guess we could wrap up.