We represent the data science group at United Airlines. Today we will walk through one of the use cases we worked on in the video analytics space: it is, in fact, calculating deplaning times using deep learning. To give a quick introduction of the speakers: my name is Vikas Krover, and I am a senior manager on the data science team at United Airlines. With me is my colleague Jayit, who works on the same team as a data scientist. A quick disclaimer: this is not an official position of United Airlines, so please don't quote it anywhere.

Quickly on the agenda: we will start with a short introduction to our group. Then we will define the problem statement, giving the full context for why this problem matters to United Airlines. Then we will talk about the computer vision and deep learning concepts we used to solve the problem. At the end, we will do a deep dive into how we approached the solution, along with a quick demonstration video, and we want to leave 10 minutes for a Q&A session. I am hoping to get a lot of questions.

At United, we are in the business of uniting the world, and we are continuously transforming United by growing our fleet and our network. Even today, we operate the world's most comprehensive network, serving 354 destinations across 48 countries. We continuously invest in our product and our people, and our focus is to provide the best experience to our customers while running the most efficient operation.

About the data science group: since United Airlines is one of the biggest carriers in the world, we deal with big data, in both the commercial space and the operations space, and we work with cloud and edge computing as well. On use cases, the data science group is involved on both the commercial and the operations sides. We work on optimizing the digital journey of a customer, and we have specific use cases in the operations world, where we do predictive maintenance and make use of natural language processing and social media data. Today we will talk about one specific computer vision use case.

So let us define the problem statement. As the title of the presentation suggests, we want to calculate deplaning time. But why is this a problem in the first place? Let us look at the flip side of deplaning, which is enplanement, or in simple words, boarding. Enplanement time is very easy to calculate. Why is that? Every customer has a boarding pass, and at the time of boarding, the gate agent scans it. The total time from when the first passenger boards until the last passenger boards is called the enplanement time. It is very easy: once a boarding pass is scanned, the event is in the system, and from the system you get an exact view of enplanement time.

But what are our options for deplaning time? One option is manual timekeeping: an agent starts a timer as soon as the first passenger steps out of the aircraft and stops it when the last passenger is out of the plane. But as you can imagine, that is not going to be efficient, and it is not going to be accurate all the time. The other option is to scan the boarding pass once more, similar to what we do when calculating boarding time, but that would not be a great customer experience.
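Either way, the quantity we are after is a first-event-to-last-event time difference. As a minimal sketch (with made-up timestamps, not United's actual scan records), here is how that measurement falls straight out of a list of scan or detection timestamps:

```python
from datetime import datetime

# Hypothetical event log: one timestamp per boarding-pass scan
# (enplanement) or per first/last passenger detection (deplaning).
event_times = [
    datetime(2019, 7, 1, 9, 0, 12),
    datetime(2019, 7, 1, 9, 1, 3),
    datetime(2019, 7, 1, 9, 24, 47),
]

# First event to last event is the time we want to measure.
elapsed = max(event_times) - min(event_times)
print(f"Elapsed time: {elapsed}")  # 0:24:35
```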
Moreover, since we are a customer-facing industry, we do not want to do anything that affects the customer experience negatively. That is why we, as the data science group, came up with this elegant solution where accurate measurement is possible. So this is the problem statement: we use the video feed at the gate to accurately measure deplaning time. Even now, we have cameras installed right at the gate for all the aircraft, so we can use those video feeds to calculate deplaning time.

We have defined the problem statement, but so far I have not explained why this is so important at United Airlines. To understand that, let us look at a concept specific to the airline industry called aircraft turnaround. Some of you might already know this, but for the benefit of people less familiar with the airline industry, I will explain it briefly. United Airlines operates on a hub-and-spoke model, which is different from the point-to-point model that Indian carriers usually operate. At the scale of the United States, we cannot operate point to point, and you can see why with the help of this diagram. On the left we have the hub-and-spoke model: for connecting nine points, hub and spoke requires only eight routes, whereas connecting the same cities point to point requires 28 routes (there is a quick sanity check on these counts just below). In India, a point-to-point model is fine, since there are not too many destinations to serve. But United Airlines is in the business of connecting many remote destinations, and we want to give our customers the opportunity to go from point A to point B, no matter where those points are in United States geography. So hub and spoke is the most efficient way to plan these routes.

Now that we have established that United Airlines works on a hub-and-spoke model: at a hub, you can imagine there are a lot of aircraft coming in and a lot of aircraft going out. Very often, the same aircraft that arrives at the hub is required to be ready for the next departing flight, since we have many more flights than aircraft in the fleet. The entire time it takes for an aircraft to be ready for its next departure, from the moment it arrives at a gate at a hub, is called the turnaround, and the turnaround is made up of a sequence of events that all have to be performed. Deplaning is one of the first of those events. If we do it quickly, we get a head start on the rest of the events, because the later events can only begin once deplaning is finished.

What happens if deplaning is slow, or takes a lot of time? As we now understand, it affects the turnaround time of the equipment, so the next departing flight might be delayed, and we do not want that to happen. From a customer perspective, the impact is negative: if deplaning takes a long time, you will be annoyed as a customer, and even if the rest of the flight was great, you would not care. It can also impact the next departing flight. In a hub-and-spoke model, people have to change aircraft to reach their final destination. If the connection time is really tight, say you arrive at the hub at 9 am and your next flight is at 9:45, and you are not able to reach the gate in time, your next flight is missed.
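Coming back to the route counts above for a moment, here is a tiny sanity check. One assumption is needed to match the talk's numbers: the 28 comes out if the point-to-point mesh connects the eight spoke cities to each other (the hub-and-spoke layout adds the hub as the ninth point); a full mesh over all nine points would be 36 routes.

```python
def hub_and_spoke_routes(n_cities: int) -> int:
    """One spoke from the hub to each city."""
    return n_cities

def full_mesh_routes(n_cities: int) -> int:
    """Every pair of cities connected directly: n choose 2."""
    return n_cities * (n_cities - 1) // 2

# Eight spoke cities plus one hub = nine points in total.
print(hub_and_spoke_routes(8))  # 8 routes
print(full_mesh_routes(8))      # 28 routes among the same eight cities
```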
Then United Airlines has to make alternate arrangements for the customer, which is an increased cost for us. So I think you can now appreciate that doing deplaning in the most optimized fashion is very important to United Airlines. In this presentation, though, we are talking about the first step: how do we get an accurate view of deplaning time? Once we have an accurate view, we can go to the next step, where we look to optimize deplaning time and figure out where the opportunities are. There are numerous downstream use cases, and the impact can be very large. So in this presentation, we will talk about how we approach getting this accurate measurement of deplaning time, making use of computer vision and deep learning concepts. Jayit will be walking us through this journey. Over to you, Jayit.

Thanks, Vikas, for building up the context. So let us talk about computer vision. Computer vision is all about making sense of visual data: giving the computer the ability to visualize, perceive, and gain a high-level understanding of the videos and images it sees. Before moving on, let us quickly review the history of computer vision over the years. As humans, we have always tried to solve two fundamental problems here. The first is building a good mechanical eye, that is, the camera. Historically, the first mechanical vision devices were inspired by something called the pinhole camera, and fast-forward to where we are now, we have cameras that easily fit in our pockets and record video. The second fundamental problem is: given the visual data, how do we make sense of it, and what are the possible applications? Historically, applications ranged from something as simple as handwritten digit recognition for zip codes, pioneered by Professor Yann LeCun, and fast-forward to where we are now, there is a multitude of computer vision applications, some of which we will see on the next slide.

Then there was the ImageNet movement in the computer vision community. Stanford came up with a very large dataset of around 14 million images across roughly 22,000 categories, and it was a defining moment in the computer vision world. Many different research groups came up with different algorithms, and ImageNet hosted a competition called ILSVRC, the ImageNet Large Scale Visual Recognition Challenge. If you look at the results, the error rate declined only modestly until 2012, when there was a dramatic drop. That winning solution was inspired by what we call a deep neural network architecture; 2012 was the genesis of deep architectures in the challenge. And it was in 2015 that we first surpassed the human-level error rate of about 5.1%. So since 2015, we have surpassed human-level accuracy on various image analytics tasks, be it image classification, image captioning, and so on. So why are we even talking about computer vision now? There are two factors.
It is possible because of the data we have and the powerful computing we have. Most of us present here are carrying a smartphone that has one, two, or even three cameras in it, so it is a good approximation to say there are more cameras in this conference room than people, right? Apart from that, by the time I count to five, another 50 GB of data will have been generated on YouTube. Think of the enormity of the data if we include all the other sources as well, be it YouTube, Netflix, and so on. And given the data, we need powerful computing: image and video data is very large and complex, and making sense of it takes a lot of compute. Thanks to the hardware community, we have a multitude of options here: a CPU for lighter inference workloads, a GPU where the inference cost is higher, and FPGAs and TPUs for, say, edge inference and similar scenarios.

Now let us quickly look at some key deep learning concepts for video analytics. First, what is a video? A video is nothing but a continuous sequence of images. Anything you can apply to a single image, you can easily extrapolate to a video: dump the frames out of the video, run your inference pipeline on each frame, and you have extended the idea from images to video.

There is one very fundamental visual recognition problem: image classification. Let us get a high-level idea of what it is. Say we are given input images labeled as cats and input images labeled as dogs; some human has marked that this image belongs to the cat class and that image belongs to the dog class. We pass these images into this fancy-looking network, a deep neural network. An image flows through the network, and the network learns, "OK, I've got it, it's a cat," and it is indeed a cat, so the weights are fine for this example. Then comes the second iteration. We give the network a second image; it flows through the convolutional layers of the deep architecture, and the network says, "it's a cat." But it isn't, it's a dog, right? So what happens next is called backpropagation: the error is propagated backwards through the network, the weights are updated, and the network refines its internal representations so that it learns this image belongs to the dog class. These computations, forward passes followed by backpropagated weight updates, repeat over and over, and the network learns. The high-level idea is: if at first you don't succeed, try another billion times.
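To make the cat-versus-dog loop concrete, here is a minimal PyTorch-style sketch of one training step. It is a toy model, not the architecture from the talk, and the tensor shapes are purely illustrative:

```python
import torch
import torch.nn as nn

# Toy classifier: 64x64 RGB image -> 2 classes (cat=0, dog=1).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(8, 2),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

image = torch.randn(1, 3, 64, 64)   # stand-in for one labeled image
label = torch.tensor([1])           # ground truth: dog

logits = model(image)               # forward pass: the network's guess
loss = loss_fn(logits, label)       # how wrong was the guess?
loss.backward()                     # backpropagation: gradients flow backwards
optimizer.step()                    # weights move toward the right answer
optimizer.zero_grad()
```

Run this step over the whole labeled dataset for many epochs and the "try another billion times" joke becomes literal.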
That is how image classifiers learn to work. Now, there are other visual recognition problems, and we will try to cast them in the airline context and see what each can do.

There is something called object detection. In object detection, we take an input image with multiple objects in it and localize each object. Say there is a person standing on the ramp near the aircraft: there is a person, there is an aircraft, there is a baggage loader. Detection uniquely identifies that this particular bounding box belongs to this particular class: a person class, an airplane class, and so on.

Second is semantic segmentation. In semantic segmentation, we classify each and every pixel of the input image into one of the predefined classes. Say we have three classes: a person class, a background class, and an aircraft class. We map every pixel to its class: these particular pixels belong to the person class, those particular pixels belong to the aircraft class.

Now let us look at action classification. Action classification is related to object detection, but we also infer what the detected object is doing in the image. As you can see here, the network says there is a person, and the person is walking.

Another problem is image captioning. In image captioning, given an input image, the model digests it, understands the context, and gives you a story about the image: a caption the model has inferred from that particular image.

The last one is instance segmentation. It is closely related to semantic segmentation; the only difference is that for multiple objects of the same class, it identifies each instance uniquely. Say there are three people, all belonging to the one person class: instance segmentation separates them, so the person on the left is person one, in the middle is person two, and on the right is person three. In our model, we are using object detection: we localize all the objects present in each frame.

Moving on to the deep learning frameworks and the considerations we had. As you know, the world is always full of choices, and we too had a lot of alternatives to evaluate when deciding which convolutional network to go forward with. There are several considerations. You need to think about the trade-off between detection speed and accuracy: you could come up with a very complex architecture and just throw it into inference, but it would consume a lot of resources, right? You need to take those factors into account. Another consideration is resource availability.
You need to take into account time and money, because when you go for a very complex model, you pay the associated costs, namely inference costs, right? Another consideration is that there are a lot of pretrained models and frameworks available now: we have Caffe2, we have PyTorch, we have MXNet, and others, and there are many different architectures available in the market, so you need to drill down and select one. We moved forward with Caffe2. Apart from that, you need to think about scalability as well: for a company like United Airlines with such a wide presence, we could come up with a simple solution, but we have to consider how it would scale across different hubs, different airports, and so on.

Some of the options we drilled down to were R-CNN, Fast R-CNN, Faster R-CNN, YOLO, and SSD. There is a nice chart showing the relationship between GPU time and overall mAP for these algorithms; mAP here is mean average precision. We wanted a model with a low GPU time that, at the same time, has a reasonably high overall mAP. We drilled down and settled on SSD: it has low GPU time, and at the same time a considerable mAP. I would not say the highest, but it is considerable, and it fits our constraints.

Now let us look at the pipeline and the evaluation metrics we are using. To reiterate what Vikas said: the deplaning time is the time difference between when the first person is detected in the frame and when the last person is detected in the frame. We also had to decide which gate to start with. We have thousands of gates, and we needed to drill down to one particular gate where we could run the solution and see how it does. We took the homogeneity of the fleet into account as well: multiple fleets and aircraft types come into a gate, one with a capacity of maybe 180 passengers, another with 250, and so on. So we selected a gate that serves a homogeneous fleet, so that an apples-to-apples comparison is possible. Apart from that, we had to consider the available infrastructure. If you go to the business and say, "I want a new high-resolution camera," nobody is going to take you seriously; you need to figure out something that uses the available infrastructure well. So we drilled down to a gate that had the highest-resolution camera, where the position of the camera was also right, so there was no issue of manually re-mounting and setting things up. The application uses a video source to grab frames, and DNNs are used to process this data; OpenCV is used to stitch the different DNNs into the single pipeline we built.
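As a rough illustration of that stitching, here is a minimal sketch using OpenCV's dnn module to grab frames from a video source and run a face detector on each one. The model file names, stream URL, and input size are placeholders, not the exact SqueezeNet-SSD weights from the talk:

```python
import cv2

# Placeholder model files; the talk's pipeline uses an SSD-style face
# detector, but any Caffe SSD face model works the same way here.
net = cv2.dnn.readNetFromCaffe("face_deploy.prototxt", "face_ssd.caffemodel")

cap = cv2.VideoCapture("gate_camera_feed.mp4")  # or an RTSP stream URL
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # SSD-style preprocessing: resize and mean-subtract into a 4D blob.
    blob = cv2.dnn.blobFromImage(frame, 1.0, (300, 300), (104, 177, 123))
    net.setInput(blob)
    detections = net.forward()  # shape: [1, 1, N, 7]
    faces = [d for d in detections[0, 0] if d[2] > 0.5]  # confidence filter
    print(f"faces in frame: {len(faces)}")
cap.release()
```

In the real pipeline, the per-frame face count would be handed to the next module instead of printed.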
Right now our model uses face detection and head-pose estimation, and it takes care of duplicate counts when a person comes out of the plane. There is the aircraft door, and as a person comes out of the aircraft, our solution detects that there is a face there and logs it to the database. But when a person walks into the aircraft, no inference event is produced, since the face is not visible to the camera. That matters because gate agents, security personnel, cleaning crews, and so on go into the aircraft, and we do not want to count them; we only want to consider passengers. Since people walking in have their backs to the camera, no face is detected for them, they are kept out of inference, and the duplicate-count issue goes away.

Here is a very high-level overview of what our pipeline looks like. Step one is frame extraction. As I said, a video is a continuous sequence of images, so we extract the frames from the video, and we use the OpenCV library for that. Once we have the frames, we pass them to the second module, the inference-request module, which runs the deep learning algorithms to infer the results. For this we use Caffe2 as the deep learning framework; the detector is SSD-based, using a SqueezeNet backbone, which is where the model weight files come in, and for head-pose estimation we use a CNN architecture. In step three, once we have the inference results, we store them in the database: everything is dumped in real time into InfluxDB, and we use a visualization tool, Grafana, to build the dashboard you will be seeing. Both are open source, so you can try them for your own applications as well; they are very nice solutions to work with.

Now let us quickly look at the evaluation metric, which is a bit different here. There is something called IoU, Intersection over Union: the area of overlap divided by the area of union. Say the green box is the ground truth, where the object actually is in the image, and the red box is what you detected. The common, overlapping area between them goes in the numerator, and the total area covered by both boxes together, the union, goes in the denominator. A poor IoU is when your box barely overlaps the actual box. A good IoU is when you roughly encapsulate the object, not perfectly, but somewhere in between: not so good, not so bad. An excellent IoU is when you encapsulate the object almost exactly. Depending on an IoU threshold, we can define TP, FP, and FN, that is, true positives, false positives, and false negatives. The threshold is something you can play with; it depends on the application you have.
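Here is a small sketch of the IoU computation and the precision/recall that follow from the TP/FP/FN counts. The (x1, y1, x2, y2) box convention is an assumption for illustration, not anything stated in the talk:

```python
def iou(box_a, box_b):
    """Intersection over Union for two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)   # area of overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter                 # area of union
    return inter / union if union else 0.0

# A detection counts as a true positive if IoU >= threshold, e.g. 0.5.
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # partial overlap -> ~0.14

def precision_recall(tp, fp, fn):
    return tp / (tp + fp), tp / (tp + fn)
```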
And we can change that threshold according to our application. Once we have the TP, FP, and FN counts, we can easily calculate precision and recall, right? We tried the solution we built on a publicly available dataset, WIDER FACE, and it gives around 83% there; that dataset contains a huge number of faces in all sorts of conditions. In our real-time setting, when we tested it, the results were considerably better than that, because only the faces clearly visible in the video frames we work with need to be detected.

Now let us quickly look at the demonstration. Before jumping in, let me explain what you will be seeing. These are frames extracted from the video. At frame one, the total number of people detected in the frame is zero. It stays that way until, and I am making these numbers up hypothetically, so do not take them too seriously, let us say frame 110, where we get an event: at that timestamp, one person is detected in the frame. That is marked as the deplaning start event, right? Detections keep on happening, say until frame 55,000, where at that timestamp a single person is still detected. Then, say at frame 60,000, zero persons are detected in the frame. But we cannot immediately mark that as the deplaning end event, right? There can be time gaps while passengers are still coming out. So we wait for a threshold amount of time before marking the deplaning end event (see the sketch after the dashboard walkthrough). This threshold is defined based on several factors: the fleet type you are serving, plus the business knowledge we have. At the deplaning end event, the bridge is completely empty.

So this is the video showing our solution in action. There is the aircraft door where the camera is installed, and people are coming out. As you can see, the faces are masked out, for the obvious reason that they are personally identifiable information. A bounding box is drawn on each face, and there is some text there reading "Pax Captured," marking that the passenger has been correctly captured and stored in the database. At the same time, while inference is going on, the results are being dumped into the database, and you can use the database for different visualizations, right?

So we have the deplaning events and the results; now let us look at what the dashboard looks like. This is what we built for our internal consumption. On the top left is the total number of unique passengers detected, currently 72. We also show the current gate, since this solution might be running on multiple gates; it tells you which gate the solution is running on right now. Next is the number of passengers detected at the current timestamp; this is a real-time statistic, and right now one person is detected. Then there is the deplaning status: as you can see, the deplaning event is still going on, and the green spike tells you that.
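Before moving on, the start/end logic described a moment ago can be sketched as a small state machine over per-frame person counts. The 30-second empty-bridge hold is a made-up illustrative value; the talk only says the threshold depends on fleet type and business knowledge:

```python
from datetime import datetime, timedelta

END_HOLD = timedelta(seconds=30)  # assumed value; tuned per fleet and gate

class DeplaningTimer:
    def __init__(self):
        self.start = None        # timestamp of the first person detected
        self.last_seen = None    # timestamp of the most recent detection

    def update(self, ts: datetime, people_in_frame: int):
        """Feed one (timestamp, person count) sample per processed frame."""
        if people_in_frame > 0:
            if self.start is None:
                self.start = ts            # deplaning start event
            self.last_seen = ts
        elif self.start is not None:
            # Only declare the end after the bridge stays empty long enough.
            if ts - self.last_seen >= END_HOLD:
                return self.last_seen - self.start  # deplaning time
        return None
```

Each processed frame's face count feeds update(); the returned timedelta, once the empty-bridge hold expires, is the deplaning time written to the database.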
Apart from that, there is a chart from which we can even figure out whether there is, let us say, a crowd on the bridge. If there is a crowd on the bridge, there will be a peak here, and we can easily spot it and act accordingly, for example by sending agents there. This panel shows the real-time data we are dumping into the database: at each timestamp, how many people were detected. And this one is the historical deplaning time: for each flight at this particular gate, the deplaning time that was stored. So these are the results, and this is the dashboard we use for our internal consumption.

So that is what we had for you; thanks for coming to the session. Before we open the floor for questions, you can reach out to us on LinkedIn; both of our LinkedIn contacts are here, and we are actively looking for people to join our group. Let us start the question and answer session.

Yeah, so that is again manual. The question is: it is good that we have started using video analytics and deep learning in the airline industry, but for this particular use case, isn't it overkill to use the video feed? Can't we just use the gate agent's entry of when the aircraft door is opened and closed? Yes, you are right, and that is how it is being done currently. But again, it is a manual process and it is susceptible to errors, and we have seen that we cannot rely on that information if we are to go about optimizing our turnaround events. That is the main idea. And since we already have the video feeds, why not make use of them? It is not as if we had to put up new infrastructure altogether to come up with this solution.

[Another suggestion from the audience, partly inaudible.] Yes, we could do that as well. The only problem with that is that customers would have to give their buy-in. One thing to point out here is that we are not storing any customer information anywhere; we only use the frame-level information to calculate the deplaning time, that is, when the first person is detected and when the last person is detected. Are there other questions? Thanks.

Yeah, good question. The question is: did we create value out of this solution or not? So this is the first step. We have reached out to stakeholders, to hub managers, and they liked the solution. The next step will be to implement it at various hubs and see it in action. So this is the first step, where we have a really good approach to calculating deplaning time; next we will go about finding where we can optimize it, and essentially, we first have to figure out where the opportunity is.

Did you look at tracking people from frame to frame to further reduce your double counting? Yeah, we use a sort of sampling threshold. Say there are five frames in one second, and I am just making that number up. We are not processing each and every frame; we capture a frame only after every threshold interval of time. Were you assigning a unique identifier to those people as they go? No. The key point is that we care about the flow of the passengers, right? So even if we are not exactly measuring the total number of persons detected...
...and there are some duplicate-count issues as well, at the end of the day we only care about the flow, right? If between t and t plus one second we see the flow, and then there is a significant drop in the flow at some particular time, then we can easily conclude that this is the deplaning end event, right? That is how we are approaching it right now.

Did you install specific cameras for this task, or do you use existing CCTV? Yeah, this solution is actually pretty robust; it does not require the installation of new cameras. Whatever cameras we have, CCTV at most gates, and there are hubs where we have sophisticated cameras as well, it can utilize them. Since the objective is not to count exactly how many people come out, but to calculate the deplaning time, you can see, if you think about it a bit more deeply, that we can live with some error rate. That is why we do not have to install sophisticated cameras. Apart from that, there is a storage consideration as well, right? The video is typically saved at a fairly low FPS, because saving video at a very high FPS takes a significant amount of storage. Yeah, that is the beauty of the solution: we use the feeds we already have.

So you are saying it is ethical to use it? Yeah, I think that is a good question. As I said, this is a POC; it is not implemented yet. I am sure if there are considerations like that, we will get into them. But I do not think that should be a problem, since we are masking everything, and we only use it to make our operations efficient.

Yeah, one question here: are you differentiating between passengers and the crew? There could be a time lag between the passengers coming out and then the crew slowly coming out a little later. Do you take care of that? Yeah, right. All our crews actually wear a vest, so we can easily differentiate between crew and passengers. We have not done it here, but that is very easy to do: we have enough videos, so we will be differentiating between passengers and crew, and that is a very easy problem.

Yeah, in continuation of the earlier question: there could be various categories of passengers, like wheelchair-bound passengers, women with kids, and so on. Is your basic assumption that all passengers take the same amount of time to come out, or how does your distribution work? No, that is not the assumption at all. If you look at the dashboard here, although there is no crowd right now, we also want to track whether there is a crowd on the bridge, because it can happen that for some reason people are not moving out quickly, and we want to take care of that too. The idea here is to calculate the entire deplaning time irrespective of the distribution, whether people move at the same pace or at very different paces; either way, it will be fine. We just want to detect the first person and keep detecting until there is no person left in the frame; then we say the deplaning time has ended. Thank you.